pandas: Finding the number of entities (e.g. restaurants) in one DataFrame around points in a different DataFrame (e.g. hotels) (coordinate counting problem)

Posted by 5t7ly7z5 on 2023-04-28 in Other


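For a project, we are trying to count (and name) the points in one DataFrame that lie within a given radius of the points in another DataFrame. We have tried a lot, but when validating our solution by manually counting points in Tableau, we did not arrive at a satisfactory result. We are very close, though. We have two DataFrames. One has about 70k rows and 50 columns, with a unique hotel ID, latitude, longitude, name, and various information about the hotel. The other has about 25k rows and 9 columns, with a unique establishment ID, latitude, longitude, name, amenity type (e.g. "restaurant" vs. "bar"), and other information such as cuisine and vegan_available.
Because of the size of the datasets, a nested loop that computes the distance from every hotel to every restaurant seems out of the question. For computational reasons, using hexagons instead of true circles around the hotels also seems like a good idea.
Input: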

radius_in_m = 200

df_hotels:

   id  lat        lon         name
0  1   50.600840  -1.194608   Downtown Hotel
1  2   50.602031  -10.193503  Hotel 2
2  3   50.599579  -10.196028  Hotel 3

df_poi:

   id      lat        lon        name                    amenity
0  451152  51.600840  -0.194608  King of Prussia         restaurant
1  451153  51.602031  -0.193503  Central Restaurant      restaurant
2  451154  51.599579  -0.196028  The Catcher in the Rye  bar

Expected result, df_hotels_new:

   id  lat        lon         name            num_restaurants  restaurants_list  num_bar  bars_list
0  1   50.600840  -1.194608   Downtown Hotel  2                [451152, 451153]  0        []
1  2   50.602031  -10.193503  Hotel 2         0                []                1        [451154]
2  3   50.599579  -10.196028  Hotel 3         0                []                0        []

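In this example, the first two restaurants would lie within 200 m of the first hotel. That count is added to a new column, and a list with the IDs of the two counted restaurants is added to another column. The third POI is a bar, so it is not counted as a restaurant. Note that the latitudes/longitudes in this example are entirely made up and are not actually within a 200 m radius.
Our most successful attempt so far is the one below, but it mostly overestimates the number of restaurants. It also does not list the restaurants/bars/etc. in another column, but we already have that part working. With it, we could see that the radius appears to be "slightly" (about 1.5 times) larger than specified, and perhaps a bit offset as well. Could this be a rounding or map projection error?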

import geopandas as gpd
from shapely.geometry import Point
from shapely.ops import transform
from functools import partial
import pyproj
import math

# Define the conversion factor from meters to degrees based on the latitude
def meters_to_degrees(meters, latitude):
    proj_meters = pyproj.CRS("EPSG:3857")  # meters
    proj_latlon = pyproj.CRS("EPSG:4326")  # degrees
    transformer = pyproj.Transformer.from_crs(proj_meters, proj_latlon, always_xy=True)
    lon, lat = transformer.transform(meters, 0)
    lat_dist_per_deg = 111132.954 - 559.822 * math.cos(2 * math.radians(latitude)) + 1.175 * math.cos(4 * math.radians(latitude))
    lon_dist_per_deg = 111412.84 * math.cos(math.radians(latitude))
    lat_degrees = meters / lat_dist_per_deg
    lon_degrees = meters / lon_dist_per_deg
    return lat_degrees, lon_degrees



# Convert the hotels DataFrame to a GeoDataFrame with a Point geometry column
hotels_geo = gpd.GeoDataFrame(df_hotels, geometry=gpd.points_from_xy(df_hotels["longitude"], df_hotels["latitude"]))

# Convert the poi/restaurant DataFrame to a GeoDataFrame with a Point geometry column
poi_geo = gpd.GeoDataFrame(df_poi, geometry=gpd.points_from_xy(df_poi["longitude"], df_poi["latitude"]))

# Create an R-tree spatial index for the df_poi GeoDataFrame
df_poi_sindex = poi_geo.sindex

# Define the radius of the search in meters
radius_meters = 200

# Loop through each row in hotels_geo
for index, row in hotels_geo.iterrows():
    # Convert the radius from meters to degrees based on the latitude
    lat, lon = row["latitude"], row["longitude"]
    lat_deg, lon_deg = meters_to_degrees(radius_meters, lat)
    
    # Use the R-tree spatial index to find the df_poi rows within the search radius
    candidate_indices = list(df_poi_sindex.intersection(row.geometry.buffer(lon_deg).bounds))

    # Select the candidate poi_geo rows (bounding-box hits from the spatial index)
    candidate_rows = poi_geo.iloc[candidate_indices]

    # Group the candidate rows by amenity and count the occurrences
    counts = candidate_rows.groupby("amenity").size().to_dict()

    # Add the counts as new columns in the df_hotels DataFrame
    for amenity_type, count in counts.items():
        df_hotels.at[index, amenity_type] = count

    # Print progress
    if index % 10000 == 0:
        print(f"Processed {index} rows")

# Replace NaN values with 0
df_hotels.fillna(value=0, inplace=True)


Source: https://stackoverflow.com/questions/76064223/finding-the-number-of-entities-e-g-restaurants-in-one-dataframe-around-points

1 Answer


7eumitmz1#

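To compute this efficiently in bulk, you can try geopandas.sjoin_nearest.
Regarding precision: geopandas only computes planar distances, so with lat/long data you will always get a significant error. It sounds like you are not working at a worldwide scale, so you could reproject the data to a projected (equidistant?) coordinate system to get better precision.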
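To make the planar-distance point concrete, here is a small sketch using only the standard library. It reuses the meters-per-degree formulas from the question and assumes the latitude of the first sample hotel (50.6). The helper name `meters_per_degree` is introduced here for illustration. It suggests one plausible contributor to the oversized radius: buffering by the longitude offset in degrees reaches further in the north-south direction, because one degree of latitude covers more meters than one degree of longitude at this latitude. Using the buffer's bounding box rather than the circle itself would add the corners on top of that.

```python
import math


def meters_per_degree(latitude):
    """Approximate length in meters of one degree of latitude and of longitude."""
    phi = math.radians(latitude)
    lat_m = 111132.954 - 559.822 * math.cos(2 * phi) + 1.175 * math.cos(4 * phi)
    lon_m = 111412.84 * math.cos(phi)
    return lat_m, lon_m


radius_m = 200
lat_m, lon_m = meters_per_degree(50.6)  # assumed latitude: first sample hotel

# The question buffers the hotel point by the longitude offset in degrees ...
buffer_deg = radius_m / lon_m
# ... but the same number of degrees spans more meters going north-south.
north_south_reach_m = buffer_deg * lat_m
stretch = north_south_reach_m / radius_m

print(f"east-west reach:   {buffer_deg * lon_m:.0f} m")
print(f"north-south reach: {north_south_reach_m:.0f} m (factor {stretch:.2f})")
```

At this latitude the factor comes out at roughly 1.57, which is in the same range as the "about 1.5 times" reported in the question.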
Sample code using sjoin_nearest:

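import geopandas

# Note: geopandas.datasets was removed in geopandas 1.0; on newer versions, read your own files instead.
countries = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
cities = geopandas.read_file(geopandas.datasets.get_path("naturalearth_cities"))
radius_meters = 200

# max_distance is expressed in CRS units, so it only means meters after reprojecting to a metric CRS
cities_w_country_data = geopandas.sjoin_nearest(cities, countries, distance_col="distance", max_distance=radius_meters)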
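Note that sjoin_nearest returns only the nearest match for each row, so counting every POI inside the radius needs a slightly different join. The following is a minimal sketch of one way to do that, not the approach from the answer: reproject both frames to a metric CRS, buffer the hotels by the radius, and use geopandas.sjoin with predicate="within". The sample data, the helper `to_metric_points`, and the choice of EPSG:27700 (British National Grid) are assumptions made for illustration; the output columns are named `num_<amenity>` and `<amenity>_list`, which differs slightly from the names in the question.

```python
import geopandas as gpd
import pandas as pd

# Hypothetical sample data (not from the question): two hotels, four POIs.
df_hotels = pd.DataFrame({
    "id": [1, 2],
    "lat": [51.5000, 51.6000],
    "lon": [-0.1200, -0.2000],
    "name": ["Hotel A", "Hotel B"],
})
df_poi = pd.DataFrame({
    "id": [451152, 451153, 451154, 451155],
    "lat": [51.5005, 51.5010, 51.5003, 51.5027],  # last one is ~300 m north of Hotel A
    "lon": [-0.1200, -0.1205, -0.1195, -0.1200],
    "amenity": ["restaurant", "restaurant", "bar", "restaurant"],
})


def to_metric_points(df, crs_metric="EPSG:27700"):
    """Build point geometries from lat/lon columns and reproject to a metric CRS."""
    gdf = gpd.GeoDataFrame(
        df.copy(), geometry=gpd.points_from_xy(df["lon"], df["lat"]), crs="EPSG:4326"
    )
    return gdf.to_crs(crs_metric)


radius_in_m = 200
hotels = to_metric_points(df_hotels)
pois = to_metric_points(df_poi)

# Polygons approximating circles of radius_in_m (meters) around each hotel.
circles = gpd.GeoDataFrame(
    {"hotel_id": hotels["id"]}, geometry=hotels.geometry.buffer(radius_in_m)
)

# One row per (POI, hotel) pair where the POI lies inside the hotel's circle.
pairs = gpd.sjoin(
    pois[["id", "amenity", "geometry"]].rename(columns={"id": "poi_id"}),
    circles,
    how="inner",
    predicate="within",
)

# Per amenity type, attach the list of matching POI ids and its length to each hotel.
df_hotels_new = df_hotels.copy()
for amenity, group in pairs.groupby("amenity"):
    ids_per_hotel = group.groupby("hotel_id")["poi_id"].agg(list)
    lists = df_hotels_new["id"].map(ids_per_hotel)
    lists = lists.apply(lambda v: v if isinstance(v, list) else [])
    df_hotels_new[f"{amenity}_list"] = lists
    df_hotels_new[f"num_{amenity}"] = lists.apply(len)

print(df_hotels_new)
```

The join is vectorized and uses the spatial index internally, so no explicit loop over hotels is needed. EPSG:27700 is only appropriate for data in Great Britain; for other regions a local projected CRS would have to be chosen instead. The buffer is a polygon that approximates a circle, so points lying almost exactly on the 200 m boundary may be classified either way.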

Answered on 2023-04-28
