How can I efficiently filter out rows of a geopandas df that are not within a shapely polygon?

Problem description

I have a regular Pandas dataframe, which I can convert to a geopandas GeoDataFrame in one go like this:

from shapely.geometry import Polygon, Point
import geopandas
geo_df = geopandas.GeoDataFrame(input_df, geometry=geopandas.points_from_xy(input_df.Longitude, input_df.Latitude))

I also have a list of coordinates, which I can convert to a Shapely polygon like so:

grid_polygon = Polygon(shape_coordinates)

I then want to filter out all rows of geo_df that are not within the shapely polygon grid_polygon.

My current approach is:

geo_df['withinpolygon'] = ""
withinQlist = []
for lon, lat in zip(geo_df['Longitude'], geo_df['Latitude']):
    pt = Point(lon,lat)
    withinQ = pt.within(grid_polygon)
    withinQlist.append(withinQ)
geo_df['withinpolygon'] = withinQlist
geo_df = geo_df[geo_df.withinpolygon==True]

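But this is quite inefficient. I think there is a way to do this without iterating over every row, but most of the solutions I could find do not filter with a shapely polygon. Any ideas?

Thanks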

Solution

As a first step, as you already mentioned in the comments, your code can be simplified like this:

import geopandas
geo_df = geopandas.GeoDataFrame(input_df, geometry=geopandas.points_from_xy(input_df.Longitude, input_df.Latitude))

geo_df_filtered = geo_df.loc[geo_df.within(grid_polygon)]
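For reference, here is a small self-contained sketch of the vectorized filter. The contents of input_df and shape_coordinates are made up for illustration (a unit square and four points); only the column names Longitude and Latitude come from the question.

```python
import geopandas
import pandas as pd
from shapely.geometry import Polygon

# Hypothetical sample data standing in for input_df and shape_coordinates.
input_df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "Longitude": [0.5, 1.5, 0.2, 3.0],
    "Latitude": [0.5, 0.5, 0.9, 3.0],
})
shape_coordinates = [(0, 0), (1, 0), (1, 1), (0, 1)]  # unit square

grid_polygon = Polygon(shape_coordinates)
geo_df = geopandas.GeoDataFrame(
    input_df,
    geometry=geopandas.points_from_xy(input_df.Longitude, input_df.Latitude),
)

# within() returns a boolean Series aligned with geo_df's index,
# so it can be used directly as a row mask.
mask = geo_df.within(grid_polygon)
geo_df_filtered = geo_df.loc[mask]
kept_names = geo_df_filtered["name"].tolist()
```

Note that within is strict about the boundary: a point lying exactly on the polygon's edge is not considered within it.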

But there are some techniques that can speed things up, depending on the kind of data you have and your usage pattern:

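Using prepared geometries

If your polygon is very complex, creating a prepared geometry will speed up the containment checks. It precomputes various data structures up front, which makes subsequent operations faster. (More details are in the Shapely documentation on prepared geometry operations.)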

from shapely.prepared import prep

grid_polygon_prep = prep(grid_polygon)
geo_df_filtered = geo_df.loc[geo_df.geometry.apply(lambda p: grid_polygon_prep.contains(p))]

(You can't simply do geo_df.loc[geo_df.within(grid_polygon_prep)] as above, because geopandas does not support prepared geometries.)
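A minimal runnable sketch of the prepared-geometry approach, assuming a toy unit square and three made-up points. On such a simple polygon there is no speed benefit to expect; the sketch only shows how the pieces fit together.

```python
import geopandas
from shapely.geometry import Point, Polygon
from shapely.prepared import prep

# Hypothetical data: a unit square and three points.
grid_polygon = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
geo_df = geopandas.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[Point(0.5, 0.5), Point(2.0, 2.0), Point(0.1, 0.9)],
)

# Prepare the polygon once, then reuse it for every point.
grid_polygon_prep = prep(grid_polygon)

# The prepared geometry checks one geometry at a time,
# so its contains method is applied element-wise over the geometry column.
mask = geo_df.geometry.apply(grid_polygon_prep.contains)
geo_df_filtered = geo_df.loc[mask]
kept_names = geo_df_filtered["name"].tolist()
```

The element-wise apply is a Python-level loop, so whether this beats the plain geo_df.within(grid_polygon) depends on how complex the polygon is; it is worth timing both on your own data.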

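Using a spatial index

If you need to run containment checks for a given set of points against many grid_polygons rather than just one, it makes sense to use a spatial index on the points. It will speed things up significantly, especially when there are many points.

Geopandas provides GeoDataFrame.sindex.query for this: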

match_indices = geo_df.sindex.query(grid_polygon,predicate="contains")
# note that using `iloc` instead of `loc` is important here
geo_df_filtered = geo_df.iloc[match_indices]
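As a rough sketch of the many-polygons case, the example below queries the same index once per polygon. The points, the two rectangular cells, and the names grid_polygons and points_per_cell are hypothetical and not part of the original question.

```python
import geopandas
from shapely.geometry import Point, box

# Hypothetical data: five points and two rectangular grid cells.
geo_df = geopandas.GeoDataFrame(
    {"name": ["a", "b", "c", "d", "e"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(0.2, 0.8),
              Point(1.9, 0.1), Point(5.0, 5.0)],
)
grid_polygons = {
    "left": box(0, 0, 1, 1),
    "right": box(1, 0, 2, 1),
}

# Accessing .sindex builds the index; the same object is reused for each query.
sindex = geo_df.sindex

points_per_cell = {}
for cell_name, cell in grid_polygons.items():
    # Integer positions of the points that `cell` contains.
    match_indices = sindex.query(cell, predicate="contains")
    # These are positions, not index labels, hence iloc.
    # Sorting keeps the rows in their original order.
    matched = geo_df.iloc[sorted(match_indices)]
    points_per_cell[cell_name] = matched["name"].tolist()
```

The cost of building the index is paid once, so this pays off mainly when the number of polygons or points is large.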

A nice blog post with more explanation: https://geoffboeing.com/2016/10/r-tree-spatial-index-python/
