pandas Python - Haversine distance between 2 DataFrames, assigning the corresponding Limit to the location with the minimum distance

ehxuflar  posted on 2022-12-16  in  Python

I have 2 DataFrames:

  1. df_exposure (> 300k rows)
ID  Limit   Lat Lon
0   1   49  21.066107   121.930200
1   2   49  20.932773   121.913533
2   3   49  20.932773   121.921867
3   4   49  20.924440   121.930200
4   5   49  20.899440   121.905200

From df_exposure I extract:

lat_loc = df_exposure.loc[:, 'Lat']
lon_loc = df_exposure.loc[:, 'Lon']
  2. df (3k rows):
Lat Lon Limit
0   4.125   116.125 0.0
1   4.375   116.125 0.0
2   4.625   116.125 0.0
3   4.875   116.125 0.0
4   5.125   116.125 0.0

This is the haversine function:

def haversine(lat2, lon2, lat1, lon1):
    
    lat1_ = lat1 * np.pi / 180
    lat2_ = lat2 * np.pi / 180
    lon1_ = lon1 * np.pi / 180
    lon2_ = lon2 * np.pi / 180
     
    a = (np.sin((lat2_ - lat1_) / 2)**2) + (np.sin((lon2_ - lon1_) / 2)**2) * np.cos(lat1_) * np.cos(lat2_)
    dist = 2 * 6371 * np.arcsin(np.sqrt(a))

    return dist
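
A quick sanity check of the function, as a minimal sketch: one degree of longitude along the equator should come out at 6371 · π/180 ≈ 111.19 km, and the function also accepts arrays for the first point. The test coordinates below are illustrative and are not taken from the question's data.

```python
import numpy as np

def haversine(lat2, lon2, lat1, lon1):
    # Same formula as above; inputs in degrees, result in km (Earth radius 6371 km).
    lat1_, lat2_ = np.radians(lat1), np.radians(lat2)
    lon1_, lon2_ = np.radians(lon1), np.radians(lon2)
    a = (np.sin((lat2_ - lat1_) / 2) ** 2
         + np.sin((lon2_ - lon1_) / 2) ** 2 * np.cos(lat1_) * np.cos(lat2_))
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# One degree of longitude on the equator.
d = haversine(0.0, 1.0, 0.0, 0.0)
expected = 6371 * np.pi / 180  # arc length of 1 degree on a sphere of radius 6371 km

# Vectorized call: several candidate points against a single reference point.
lats = np.array([0.0, 0.0, 1.0])
lons = np.array([0.0, 1.0, 0.0])
dists = haversine(lats, lons, 0.0, 0.0)
```

Because the first two arguments may be arrays, a single call returns the distances from one reference point to every candidate, which is what the loop further down relies on.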

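Essentially, df is a subset of df_exposure with a coarser grid. For each location (row) in df_exposure I want to compute the distance to every location in df, find the minimum distance, and add the Limit of that df_exposure row to the df location with the minimum distance. This is repeated for every location in df_exposure until all locations have been processed.
This is the current approach, but because of the size of df_exposure (> 300k rows) it takes a very long time: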

for i in range(len(lat_loc)):

    r = haversine(df.loc[:, 'Lat'], df.loc[:, 'Lon'], lat_loc[i], lon_loc[i])
    dist = r.min() # find minimum distance
    df.loc[list(r).index(dist), 'Limit'] = df.loc[list(r).index(dist), 'Limit'] + df_exposure.loc[i, 'Limit']

I would appreciate any suggestions for improving the current code. Thanks.

3vpjnl9f #1

You can use sklearn.neighbors.DistanceMetric to compute the haversine distance (the second answer imports the same class from sklearn.metrics):

import numpy as np
from sklearn.neighbors import DistanceMetric
distance = DistanceMetric.get_metric('haversine')

lat1 = df_exposure.loc[:, 'Lat']
lon1 = df_exposure.loc[:, 'Lon']

lat2 = df.loc[:, 'Lat']
lon2 = df.loc[:, 'Lon']

(6371*distance.pairwise((np.array([lat1,lon1])* np.pi / 180).T, 
                    (np.array([lat2,lon2])* np.pi / 180).T).min(1))
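
The expression above yields only the minimum distances. To complete the task from the question, the index of the nearest df row is also needed so that each Limit can be added to it. A minimal sketch with tiny made-up arrays follows; it uses plain NumPy broadcasting in place of DistanceMetric so that it runs without scikit-learn, and the names `exp_lat`, `exp_lon`, `exp_limit`, `grid_lat`, `grid_lon` are illustrative only.

```python
import numpy as np

# Illustrative data: 4 exposure points and 2 grid points (degrees).
exp_lat = np.array([0.1, 0.2, 9.9, 10.1])
exp_lon = np.array([0.0, 0.1, 10.0, 9.8])
exp_limit = np.array([49.0, 49.0, 10.0, 5.0])
grid_lat = np.array([0.0, 10.0])
grid_lon = np.array([0.0, 10.0])

# Pairwise haversine distances, shape (n_exposure, n_grid), via broadcasting.
p1 = np.radians(exp_lat)[:, None]
l1 = np.radians(exp_lon)[:, None]
p2 = np.radians(grid_lat)[None, :]
l2 = np.radians(grid_lon)[None, :]
a = np.sin((p2 - p1) / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin((l2 - l1) / 2) ** 2
dist_km = 2 * 6371 * np.arcsin(np.sqrt(a))

# Index of the closest grid point for each exposure row.
nearest = dist_km.argmin(axis=1)

# Accumulate limits per grid point. np.add.at handles repeated indices,
# unlike `grid_limit[nearest] += exp_limit`, which keeps only one addend per index.
grid_limit = np.zeros(len(grid_lat))
np.add.at(grid_limit, nearest, exp_limit)
```

With the DataFrames from the question, and assuming df has its default integer index, the accumulated array could then be added to the column with something like `df['Limit'] += grid_limit`.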

xmd2e60i #2

Let's go step by step. I created DataFrames with the specified dimensions. Here is the runtime of your implementation:

import time

import numpy as np
import pandas as pd

EXPOSURE_SIZE = 300_000
DF_SIZE = 3000

df_exposure = pd.DataFrame({'Limit': np.random.randint(0, 1000, size=(EXPOSURE_SIZE,)),
                            'Lat': np.random.uniform(-10, 10, size=EXPOSURE_SIZE),
                            'Lon': np.random.uniform(-10, 10, size=EXPOSURE_SIZE)})

df = pd.DataFrame(
    {'Limit': np.random.randint(0, 1000, size=(DF_SIZE,)),
     'Lat': np.random.uniform(-10, 10, size=DF_SIZE),
     'Lon': np.random.uniform(-10, 10, size=DF_SIZE)})

def haversine(lat2, lon2, lat1, lon1):
    lat1_ = lat1 * np.pi / 180
    lat2_ = lat2 * np.pi / 180
    lon1_ = lon1 * np.pi / 180
    lon2_ = lon2 * np.pi / 180

    a = (np.sin((lat2_ - lat1_) / 2) ** 2) + (np.sin((lon2_ - lon1_) / 2) ** 2) * np.cos(lat1_) * np.cos(lat2_)
    dist = 2 * 6371 * np.arcsin(np.sqrt(a))

    return dist

if __name__ == '__main__':
    lat_loc = df_exposure.loc[:, 'Lat']
    lon_loc = df_exposure.loc[:, 'Lon']

    start = time.monotonic()
    for i in range(len(lat_loc)):
        r = haversine(df.loc[:, 'Lat'], df.loc[:, 'Lon'], lat_loc[i], lon_loc[i])
        dist = r.min()  # find minimum distance
        df.loc[list(r).index(dist), 'Limit'] = df.loc[list(r).index(dist), 'Limit'] + df_exposure.loc[i, 'Limit']
    print(f'with for loop and series time took: {time.monotonic() - start:.1f} s.')
Out:
     with for loop and series time took: 456.3 s.

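Note that in this example you pass lat and lon to the haversine function as pd.Series, so your function is already vectorized.
[Code and timing output for this step are missing from the source.]
Wow! The speedup is ~7x.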
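
The code for this step is not preserved in the post. As a hedged sketch of what such a change might look like, the loop below passes NumPy arrays (via `.to_numpy()`) instead of pd.Series and uses `argmin` rather than `list(r).index(...)`. The small sizes, the seed, and the `exp_*` and `limit` names are assumptions chosen so that it runs quickly; this is not the author's benchmark code.

```python
import numpy as np
import pandas as pd

def haversine(lat2, lon2, lat1, lon1):
    # Inputs in degrees, result in km.
    lat1_, lat2_ = np.radians(lat1), np.radians(lat2)
    lon1_, lon2_ = np.radians(lon1), np.radians(lon2)
    a = (np.sin((lat2_ - lat1_) / 2) ** 2
         + np.sin((lon2_ - lon1_) / 2) ** 2 * np.cos(lat1_) * np.cos(lat2_))
    return 2 * 6371 * np.arcsin(np.sqrt(a))

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
df_exposure = pd.DataFrame({'Limit': rng.integers(0, 1000, size=200),
                            'Lat': rng.uniform(-10, 10, size=200),
                            'Lon': rng.uniform(-10, 10, size=200)})
df = pd.DataFrame({'Limit': np.zeros(20),
                   'Lat': rng.uniform(-10, 10, size=20),
                   'Lon': rng.uniform(-10, 10, size=20)})

# Pull the columns out once as plain ndarrays; indexing them in the loop
# avoids the per-iteration overhead of pd.Series and .loc.
lat = df['Lat'].to_numpy()
lon = df['Lon'].to_numpy()
exp_lat = df_exposure['Lat'].to_numpy()
exp_lon = df_exposure['Lon'].to_numpy()
exp_limit = df_exposure['Limit'].to_numpy()
limit = df['Limit'].to_numpy(copy=True)  # work on a copy, write back at the end

for i in range(len(exp_lat)):
    r = haversine(lat, lon, exp_lat[i], exp_lon[i])
    limit[r.argmin()] += exp_limit[i]  # argmin replaces list(r).index(r.min())

df['Limit'] = limit
```

Since every exposure row adds its Limit to exactly one df row, the column totals of the two frames should match afterwards, which makes for a cheap correctness check.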
Let's try V.M's answer, using the DistanceMetric class from the sklearn.metrics module:

from sklearn.metrics import DistanceMetric

distance = DistanceMetric.get_metric('haversine')

lat1 = df_exposure.loc[:, 'Lat']
lon1 = df_exposure.loc[:, 'Lon']

lat2 = df.loc[:, 'Lat']
lon2 = df.loc[:, 'Lon']
start = time.monotonic()
res = (6371 * distance.pairwise((np.array([lat1, lon1]) * np.pi / 180).T,
                                     (np.array([lat2, lon2]) * np.pi / 180).T)).argmin(axis=1)
print(f'with sklearn pairwise distance time took: {time.monotonic() - start:.1f} s.')
Out: 
    with sklearn pairwise distance time took: 45.6 s.

Better! The speedup is about 10x.
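
One caveat with the full pairwise matrix: at 300,000 × 3,000 float64 entries it occupies about 7.2 GB of memory. A hedged sketch of computing the same `argmin` in row chunks with plain NumPy is given below; the function name `nearest_index_chunked` and the chunk size are illustrative choices and not part of scikit-learn.

```python
import numpy as np

def nearest_index_chunked(lat1, lon1, lat2, lon2, chunk=10_000):
    """For each point (lat1, lon1), return the index of the nearest (lat2, lon2).

    All inputs are 1-D ndarrays in degrees.
    """
    p2 = np.radians(lat2)[None, :]
    l2 = np.radians(lon2)[None, :]
    out = np.empty(len(lat1), dtype=np.intp)
    for start in range(0, len(lat1), chunk):
        stop = start + chunk
        p1 = np.radians(lat1[start:stop])[:, None]
        l1 = np.radians(lon1[start:stop])[:, None]
        # The haversine term `a` grows monotonically with distance,
        # so taking argmin over `a` is enough to find the nearest point.
        a = np.sin((p2 - p1) / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin((l2 - l1) / 2) ** 2
        out[start:stop] = a.argmin(axis=1)
    return out

rng = np.random.default_rng(1)  # arbitrary seed
lat1, lon1 = rng.uniform(-10, 10, 1000), rng.uniform(-10, 10, 1000)
lat2, lon2 = rng.uniform(-10, 10, 50), rng.uniform(-10, 10, 50)
idx_chunked = nearest_index_chunked(lat1, lon1, lat2, lon2, chunk=128)
```

Only one block of `chunk` × `len(lat2)` values is held in memory at a time, so the peak footprint is set by the chunk size rather than by the number of exposure rows.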
But what if we move the logic inside the loop into a new function and use the apply method?

def foo(row, lat, lon):
    """
    row: row of DataFrame
    lat: ndarray with latitude
    lon: ndarray with longitude
    """
    r = haversine(lat, lon, row[1], row[2])
    return r.argmin()

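# lat and lon are not defined anywhere in the surviving post; assumed here to be
# the df coordinates as NumPy arrays, matching foo's docstring.
lat = df['Lat'].to_numpy()
lon = df['Lon'].to_numpy()

start = time.monotonic()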
res = df_exposure.apply(foo, raw=True, axis=1, args=(lat, lon))
print(f'synchronous apply time took: {time.monotonic() - start:.1f} s.')
Out:
    synchronous apply time took: 32.4 s.

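Wow! It's even faster.
Can we speed up the computation further? Yes! Keep in mind that pandas runs apply on a single CPU core, so we need to parallelize the best approach so far. This is easily done with parallel-pandas: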

#pip install parallel-pandas
from parallel_pandas import ParallelPandas

#initialize parallel-pandas
ParallelPandas.initialize(disable_pr_bar=True)

#p_apply is a parallel analog of apply method
start = time.monotonic()
res = df_exposure.p_apply(foo, raw=True, axis=1, args=(lat, lon))
print(f'parallel apply time took: {time.monotonic() - start:.1f} s.')
Out: 
     parallel apply time took: 3.7 s.

This is amazing! The total speedup is 456/3.7 ~ 120x.
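
How parallel-pandas works internally is not shown in this answer. Purely as an illustration of the underlying split-apply-combine idea, the sketch below splits the exposure coordinates into chunks and maps a nearest-index function over them with the standard library's `concurrent.futures`. A thread pool is used to keep the example self-contained; whether threads give any speedup depends on how much of the work NumPy performs with the GIL released, and a process pool may fit real workloads better. All names and sizes are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def nearest_index(chunk, grid):
    """chunk: (n, 2) and grid: (m, 2) arrays of [lat, lon] in radians -> (n,) indices."""
    p1, l1 = chunk[:, 0, None], chunk[:, 1, None]
    p2, l2 = grid[None, :, 0], grid[None, :, 1]
    # Haversine term; monotonic in distance, so argmin over it finds the nearest point.
    a = np.sin((p2 - p1) / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin((l2 - l1) / 2) ** 2
    return a.argmin(axis=1)

rng = np.random.default_rng(2)  # arbitrary seed
exposure = np.radians(rng.uniform(-10, 10, size=(2000, 2)))
grid = np.radians(rng.uniform(-10, 10, size=(30, 2)))

# Split: 8 roughly equal row blocks.
chunks = np.array_split(exposure, 8)

# Apply: each block is processed independently by a worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(lambda c: nearest_index(c, grid), chunks))

# Combine: pool.map preserves input order, so concatenation restores row order.
nearest = np.concatenate(parts)
```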
