问题描述
我在地理数据框中有一个点列表(经度和纬度),以及它们关联的点几何。所有的点都应该能够细分为单独的多边形,因为这些点通常聚集在几个区域中。我想做的是有某种算法来遍历点并检查前一个点和当前点之间的距离。如果距离足够小,它会将这些点组合在一起。这个过程会一直持续到当前点太远。它将用这些接近的点制作一个多边形,然后用下一组点继续这个过程。
gdf
longitude latitude geometry
0 -76.575249 21.157229 POINT (-76.57525 21.15723)
1 -76.575035 21.157453 POINT (-76.57503 21.15745)
2 -76.575255 21.157678 POINT (-76.57526 21.15768)
3 -76.575470 21.157454 POINT (-76.57547 21.15745)
5 -112.973177 31.317333 POINT (-112.97318 31.31733)
... ... ... ...
2222 -113.492501 47.645914 POINT (-113.49250 47.64591)
2223 -113.492996 47.643609 POINT (-113.49300 47.64361)
2225 -113.492379 47.643557 POINT (-113.49238 47.64356)
2227 -113.487443 47.643142 POINT (-113.48744 47.64314)
2230 -105.022627 48.585669 POINT (-105.02263 48.58567)
因此在上面的数据中,前 4 个点将组合在一起并变成一个多边形。然后,它将移动到下一组,依此类推。每组点不是均匀分布的,即下一组可能是 7 对点,接下来可能是 3。理想情况下,最终输出将是另一个地理数据框,它只是一堆多边形。
解决方法
您可以尝试 DBSCAN 聚类,因为它会自动找到最佳聚类数,并且您可以指定点之间的最大距离 ( ε )。
使用您的示例,该算法识别出两个集群。
import pandas as pd
from sklearn.cluster import DBSCAN
df = pd.DataFrame(
[
[-76.575249,21.157229,(-76.,21.15723)],[-76.575035,21.157453,(-76.57503,21.15745)],[-76.575255,21.157678,(-76.57526,21.15768)],[-76.575470,21.157454,(-76.57547,[-112.973177,31.317333,(-112.97318,31.31733)],[-113.492501,47.645914,(-113.49250,47.64591)],[-113.492996,47.643609,(-113.49300,47.64361)],[-113.492379,47.643557,(-113.49238,47.64356)],[-113.487443,47.643142,(-113.48744,47.64314)],[-105.022627,48.585669,(-105.02263,48.58567)]
],columns=["longitude","latitude","geometry"])
clustering = DBSCAN(eps=0.3,min_samples=4).fit(df[['longitude','latitude']].values)
gdf = pd.concat([df,pd.Series(clustering.labels_,name='label')],axis=1)
print(gdf)
gdf.plot.scatter(x='longitude',y='latitude',c='label')
longitude latitude geometry label
0 -76.575249 21.157229 (-76.0,21.15723) 0
1 -76.575035 21.157453 (-76.57503,21.15745) 0
2 -76.575255 21.157678 (-76.57526,21.15768) 0
3 -76.575470 21.157454 (-76.57547,21.15745) 0
4 -112.973177 31.317333 (-112.97318,31.31733) -1 # not in cluster
5 -113.492501 47.645914 (-113.4925,47.64591) 1
6 -113.492996 47.643609 (-113.493,47.64361) 1
7 -113.492379 47.643557 (-113.49238,47.64356) 1
8 -113.487443 47.643142 (-113.48744,47.64314) 1
9 -105.022627 48.585669 (-105.02263,48.58567) -1 # not in cluster
如果我们向您的数据集中添加随机数据,运行聚类算法,并过滤掉那些不在聚类中的数据点,您就会更清楚地了解它的工作原理。
import numpy as np
rng = np.random.default_rng(seed=42)
arr2 = pd.DataFrame(rng.random((3000,2)) * 100,columns=['latitude','longitude'])
randdf = pd.concat([df[['latitude','longitude']],arr2]).reset_index()
clustering = DBSCAN(eps=1,min_samples=4).fit(randdf[['longitude','latitude']].values)
labels = pd.Series(clustering.labels_,name='label')
gdf = pd.concat([randdf[['latitude',labels],axis=1)
subgdf = gdf[gdf['label']> -1]
subgdf.plot.scatter(x='longitude',c='label',colormap='viridis',figsize=(20,10))
print(gdf['label'].value_counts())
-1 2527
16 10
3 8
10 8
50 8
...
57 4
64 4
61 4
17 4
0 4
Name: label,Length: 99,dtype: int64
从该数据框中获取聚类点相对简单。像这样:
subgdf['point'] = subgdf.apply(lambda x: (x['latitude'],x['longitude']),axis=1)
subgdf.groupby(['label'])['point'].apply(list)
label
0 [(21.157229,-76.575249),(21.157453,-76.5750...
1 [(47.645914,-113.492501),(47.643609,-113.49...
2 [(46.67210037270342,4.380376578722878),(46.5...
3 [(85.34030732681661,23.393948586534073),(86....
4 [(81.40203846660347,16.697291990770392),(82....
...
93 [(61.419880354359925,23.25522624430636),(61....
94 [(50.893415175135424,90.70863269095085),(52....
95 [(88.80586950148697,81.17523712192651),(88.6...
96 [(34.23624333000541,40.8156668231013),(35.86...
97 [(16.10456828199399,67.41443008931344),(15.9...
Name: point,Length: 98,dtype: object
尽管您可能需要进行某种排序以确保在绘制多边形时连接最近的点。
,Haversine Formula in Python (Bearing and Distance between two GPS points)
https://gis.stackexchange.com/questions/121256/creating-a-circle-with-radius-in-metres
您或许可以使用半正弦公式对一定距离内的点进行分组。使用公式为每个点(下面的函数)创建多边形,然后从主点列表中过滤内部点并重复,直到没有更多点。
#import modules
import numpy as np
import pandas as pd
import geopandas as gpd
from geopandas import GeoDataFrame,GeoSeries
from shapely import geometry
from shapely.geometry import Polygon,Point
from functools import partial
import pyproj
from shapely.ops import transform
#function to create polygons on radius
def polycir(lat,lon,radius):
local_azimuthal_projection = """+proj=aeqd +R=6371000 +units=m +lat_0={} +lon_0=
{}""".format(lat,lon)
wgs84_to_aeqd = partial(
pyproj.transform,pyproj.Proj("+proj=longlat +datum=WGS84 +no_defs"),pyproj.Proj(local_azimuthal_projection),)
aeqd_to_wgs84 = partial(
pyproj.transform,)
center = Point(float(lon),float(lat))
point_transformed = transform(wgs84_to_aeqd,center)
buffer = point_transformed.buffer(radius)
# Get the polygon with lat lon coordinates
circle_poly = transform(aeqd_to_wgs84,buffer)
return circle_poly
#Convert df to gdf
gdf = gpd.GeoDataFrame(df,geometry=gpd.points_from_xy(df.longitude,df.latitude))
#Create circle polygons col
gdf['polycir'] = [polycir(x,y,<'Radius in Meters'>) for x,y in zip(gdf.latitude,gdf.longitude)]
gdf.set_geometry('polycir',inplace=True)
#You should be able to loop through the polygons and find the geometries that overlap with
# gdf_filtered = gdf[gdf.polycir.within(gdf.iloc[0,4])]
,
看起来像 k-means clustering 的工作。
您可能需要注意如何定义距离(“穿过”地球的实际距离,或最短路径?)
将每个簇变成多边形取决于您想要做什么...只需将它们链接起来或寻找它们的凸包络...