R: Finding the minimum distance between rows of two data frames

Posted by xxls0lw8 on 2023-04-27 in Other

I am using the R programming language.
Suppose I have the following two data frames:

set.seed(123)

df_1 <- data.frame(
  name_1 = c("john", "david", "alex", "kevin", "trevor"),
  lon = rnorm(5, mean = -74.0060, sd = 0.01),
  lat = rnorm(5, mean = 40.7128, sd = 0.01)
)

df_2 <- data.frame(
  name_2 = c("matthew", "tyler", "sebastian"),
  lon = rnorm(3, mean = -74.0060, sd = 0.01),
  lat = rnorm(3, mean = 40.7128, sd = 0.01)
)

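**My question:** For each person in df_1, I want to find the closest person in df_2 and record the distance between them.

I use the following function to compute the distance: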

library(geosphere)
haversine_distance <- function(lon1, lat1, lon2, lat2) {
  distHaversine(c(lon1, lat1), c(lon2, lat2))
}
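
For reference, the quantity that distHaversine() returns can be sketched in base R. This is a minimal illustration of the haversine formula, not geosphere's source code: the function name haversine_m is hypothetical, and the radius of 6378137 m is assumed to match the default r documented for distHaversine().

```r
# Hypothetical helper: great-circle distance in metres via the haversine formula.
# Inputs are in decimal degrees; r is the assumed sphere radius in metres.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad  <- pi / 180
  phi1    <- lat1 * to_rad
  phi2    <- lat2 * to_rad
  dphi    <- (lat2 - lat1) * to_rad
  dlambda <- (lon2 - lon1) * to_rad
  a <- sin(dphi / 2)^2 + cos(phi1) * cos(phi2) * sin(dlambda / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(pmin(1, sqrt(a)))
}

# One degree of longitude along the equator is r * pi / 180 metres.
haversine_m(0, 0, 1, 0)
```

Because the helper uses only vectorized arithmetic, it also accepts whole columns of coordinates at once, which the scalar wrapper above does not.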

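I am now trying to create a data frame ("final") with the same number of rows as df_1. It should contain columns for the longitude/latitude of each person in df_1, the longitude/latitude of the closest person in df_2, and the corresponding distance (in meters).
I tried to solve this with loops, iterating over all combinations and storing the minimum distance for each person. Here is my attempt: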

# Create a matrix to store results
distances <- matrix(nrow = nrow(df_1), ncol = nrow(df_2))

# calculate the distances
for (i in 1:nrow(df_1)) {
    for (j in 1:nrow(df_2)) {
        distances[i, j] <- haversine_distance(df_1$lon[i], df_1$lat[i], df_2$lon[j], df_2$lat[j])
    }
}

# Find the closest person in df_2 for each row of df_1
closest <- apply(distances, 1, which.min)

# Create final
final <- data.frame(
    name_1 = df_1$name_1,
    lon_1 = df_1$lon,
    lat_1 = df_1$lat,
    name_2 = df_2$name_2[closest],
    lon_2 = df_2$lon[closest],
    lat_2 = df_2$lat[closest],
    distance = distances[cbind(1:nrow(df_1), closest)]
)

The final result looks like this:

  name_1     lon_1    lat_1    name_2     lon_2    lat_2  distance
1   john -74.01160 40.72995 sebastian -74.00199 40.73067  814.8114
2  david -74.00830 40.71741     tyler -74.00240 40.70724 1236.4946
3   alex -73.99041 40.70015     tyler -74.00240 40.70724 1283.3369
4  kevin -74.00529 40.70593     tyler -74.00240 40.70724  284.3799
5 trevor -74.00471 40.70834     tyler -74.00240 40.70724  229.9678
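
The two steps that pick the result out of the distance matrix, which.min per row followed by indexing with a two-column matrix of (row, column) pairs, can be checked on a small example. The matrix d below is made up purely for illustration and is not the question's data.

```r
# Toy distance matrix: 2 "people" in rows, 3 candidates in columns
d <- matrix(c(5, 2, 9,
              1, 7, 3), nrow = 2, byrow = TRUE)

# Column index of the smallest value in each row
closest <- apply(d, 1, which.min)

# Indexing with cbind(row, col) extracts one element per row
picked <- d[cbind(seq_len(nrow(d)), closest)]

# The picked values should equal the row-wise minima
stopifnot(all(picked == apply(d, 1, min)))
picked
```

The same stopifnot() comparison against apply(distances, 1, min) would be one way to confirm that the distance column holds each row's minimum.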

The code seems to run without errors, but can someone tell me whether I am doing this correctly?
Thanks!

Answer #1, by gpnt7bae

Another approach is to use cross_join() from dplyr:

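library(dplyr)      # cross_join() and slice_min(by = ) need dplyr >= 1.1.0
library(geosphere)  # distHaversine()

# Option 1: keep the rows whose distance equals the per-person minimum
cross_join(df_1, df_2) %>% 
  rowwise() %>% 
  mutate(dist = distHaversine(c(lon.x, lat.x), c(lon.y, lat.y))) %>% 
  group_by(name_1) %>% 
  filter(dist == min(dist))

# Option 2: drop the rowwise grouping, then take the minimum per name_1
cross_join(df_1, df_2) %>%
  rowwise() %>% 
  mutate(dist = distHaversine(c(lon.x, lat.x), c(lon.y, lat.y))) %>% 
  ungroup() %>% 
  slice_min(order_by = dist, by = name_1)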

Output (under either option):

  name_1 lon.x lat.x name_2    lon.y lat.y  dist
  <chr>  <dbl> <dbl> <chr>     <dbl> <dbl> <dbl>
1 john   -74.0  40.7 sebastian -74.0  40.7  815.
2 david  -74.0  40.7 tyler     -74.0  40.7 1236.
3 alex   -74.0  40.7 tyler     -74.0  40.7 1283.
4 kevin  -74.0  40.7 tyler     -74.0  40.7  284.
5 trevor -74.0  40.7 tyler     -74.0  40.7  230.
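
As a further cross-check, the whole distance matrix can be computed in one call with geosphere's distm(), which avoids both the explicit loops and rowwise(). This is a sketch rather than part of the original answer; the name final_vec is hypothetical, and the data frames are rebuilt from the question so the block runs on its own.

```r
library(geosphere)

# Recreate the question's data
set.seed(123)
df_1 <- data.frame(
  name_1 = c("john", "david", "alex", "kevin", "trevor"),
  lon = rnorm(5, mean = -74.0060, sd = 0.01),
  lat = rnorm(5, mean = 40.7128, sd = 0.01)
)
df_2 <- data.frame(
  name_2 = c("matthew", "tyler", "sebastian"),
  lon = rnorm(3, mean = -74.0060, sd = 0.01),
  lat = rnorm(3, mean = 40.7128, sd = 0.01)
)

# 5 x 3 matrix of haversine distances in metres (rows: df_1, columns: df_2)
d <- distm(as.matrix(df_1[, c("lon", "lat")]),
           as.matrix(df_2[, c("lon", "lat")]),
           fun = distHaversine)

# Nearest df_2 row for each df_1 row, and the matching distance
closest <- apply(d, 1, which.min)
final_vec <- data.frame(
  name_1   = df_1$name_1,
  name_2   = df_2$name_2[closest],
  distance = d[cbind(seq_len(nrow(d)), closest)]
)
final_vec
```

If the loop in the question is correct, final_vec should list the same nearest person and distance for each name as the outputs shown above.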
