Joining data frames in R on different but consistent strings

rbl8hiat  posted on 2023-06-19 in Other

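I've been struggling with a join between two data frames, X and Y. X is a list of industries with their sub-industries and the dollars spent in each; Y is an index that matches those same industries to codes: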

library(dplyr)  # provides left_join() and join_by()

# X: industries, sub-industries, and dollars spent
IND <- c("Ag", "Ag", "Total Ag", "Min", "Min", "Min", "Total Min")
SubIND <- c("agriculture", "aquaculture", "Total", "gold", "copper", "zinc", "Total")
Dollars <- sample(1:100, 7)
# Y: industry names and their codes
INDcode <- c("A","B","C","D","E","G","H","M","R","Y","Z")
INDi <- c("Ag","Bar","Car","Don","Ec","Gl","Hu","Min","Run","Yt","Zal")

X <- data.frame(IND, SubIND, Dollars)
Y <- data.frame(INDi, INDcode)
join <- left_join(X, Y, by = join_by(IND == INDi))

       IND      SubIND Dollars INDcode
1        Ag agriculture       4       A
2        Ag aquaculture      63       A
3  Total Ag       Total      35    <NA>
4       Min        gold      68       M
5       Min      copper      14       M
6       Min        zinc      80       M
7 Total Min       Total      48    <NA>

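"Total" rows pop up throughout the data frame, and I'd like to know whether there is a way to join so that, for example, "Min" and "Total Min" both end up with INDcode "M".
My df has few enough of these that I could actually do it by hand, or compute a sum for each code and replace the Total rows entirely, but I wonder if anyone has ideas on how to do this better?
I've been looking at the fuzzyjoin package, but can't figure out how to make it work for this task!
Thanks!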

Answer #1, by 70gysomp

You can do a fuzzy join:

library(fuzzyjoin)
IND <- c("Ag", "Ag", "Total Ag", "Min", "Min", "Min", "Total Min")
SubIND <- c("agriculture", "aquaculture", "Total", "gold", "copper", "zinc", "Total")
Dollars <- sample(1:100,7)
INDcode <- c("A","B","C","D","E","G","H","M","R","Y","Z")
INDi <- c("Ag","Bar","Car","Don","Ec","Gl","Hu","Min","Run","Yt","Zal")

X <- data.frame(IND, SubIND, Dollars)
Y <- data.frame(INDi, INDcode)
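# match_fun is called as str_detect(IND, INDi): TRUE when IND contains INDi,
# so "Total Min" matches "Min" and "Total Ag" matches "Ag"
join <- fuzzy_left_join(X, Y, by = dplyr::join_by(IND == INDi), match_fun = stringr::str_detect)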
join
#>         IND      SubIND Dollars INDi INDcode
#> 1        Ag agriculture      20   Ag       A
#> 2        Ag aquaculture      99   Ag       A
#> 3  Total Ag       Total      35   Ag       A
#> 4       Min        gold      64  Min       M
#> 5       Min      copper      23  Min       M
#> 6       Min        zinc      98  Min       M
#> 7 Total Min       Total       5  Min       M

Created on 2023-06-13 with reprex v2.0.2
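One thing to keep in mind with the approach above: str_detect treats each INDi value as a regular expression and matches it anywhere in IND, so a short name could in principle match an unrelated, longer industry name. If the only inconsistency really is a leading "Total ", a possible alternative is to strip that prefix into a helper column and do an exact lookup. The sketch below uses base R only. The column name IND_key is made up for illustration, and Dollars is fixed to the values printed in the question so the result is reproducible.

```r
# Same example data as in the question, with Dollars fixed instead of sampled.
X <- data.frame(
  IND     = c("Ag", "Ag", "Total Ag", "Min", "Min", "Min", "Total Min"),
  SubIND  = c("agriculture", "aquaculture", "Total", "gold", "copper", "zinc", "Total"),
  Dollars = c(4, 63, 35, 68, 14, 80, 48)
)
Y <- data.frame(
  INDi    = c("Ag", "Bar", "Car", "Don", "Ec", "Gl", "Hu", "Min", "Run", "Yt", "Zal"),
  INDcode = c("A", "B", "C", "D", "E", "G", "H", "M", "R", "Y", "Z")
)

# Hypothetical helper column: the industry name with a leading "Total " removed.
X$IND_key <- sub("^Total ", "", X$IND)

# Exact lookup on the cleaned key; match() keeps the row order of X
# and gives NA for any industry that is missing from Y.
X$INDcode <- Y$INDcode[match(X$IND_key, Y$INDi)]
X
```

With this, "Total Ag" and "Total Min" pick up the codes "A" and "M" through their cleaned keys, and any name that still has no counterpart in Y shows up as NA rather than being matched loosely.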
