Using fuzzywuzzy to compare partial matches between two columns from different DataFrames

w46czmvw  posted on 2021-09-29  in  Java

I want to compare this DataFrame, df1:

       Product             Price
    0  Waterproof Liner    40
    1  Phone Tripod        50
    2  Waterproof Pants    0
    3  baby Kids play Mat  985
    4  Hiking BACKPACKS    34
    5  security Camera     160

with df2, shown below:

       Product                                  Id
    0  Home Security IP Camera                  508760
    1  Hiking Backpacks Spring Products         287950
    2  Waterproof Eyebrow Liner                 678897
    3  Waterproof Pants Winter Product          987340
    4  Baby Kids Water Play Mat Summer Product  111500

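I want to compare the Product columns of df1 and df2 in order to find the correct Id for each product. If the similarity is below 80, 'Remove' should be entered in the ID field. Note that the text in the Product columns of df1 and df2 does not match 100%. Can anyone help me with this, or tell me how to use fuzzywuzzy to get the correct Id?
Here is my code: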

    import pandas as pd
    from fuzzywuzzy import process

    data1 = {'Product1': ['Waterproof Liner', 'Phone Tripod', 'Waterproof Pants',
                          'baby Kids play Mat', 'Hiking BACKPACKS', 'security Camera'],
             'Price': [40, 50, 0, 985, 34, 160]}
    data2 = {'Product2': ['Home Security IP Camera', 'Hiking Backpacks – Spring Products',
                          'Waterproof Eyebrow Liner', 'Waterproof Pants – Winter Product',
                          'Baby Kids Water Play Mat – Summer Product'],
             'Id': [508760, 287950, 678897, 987340, 111500]}
    df1 = pd.DataFrame(data1)
    df2 = pd.DataFrame(data2)
    dfm = pd.DataFrame(df1["Product1"].apply(lambda x: process.extractOne(x, df2["Product2"]))
                       .tolist(), columns=['Product1', "match_comp", "Id"])

What I get:

       Product1                                 match_comp  Id
    0  Waterproof Eyebrow Liner                 86          2
    1  Waterproof Eyebrow Liner                 50          2
    2  Waterproof Pants Winter Product          90          3
    3  Baby Kids Water Play Mat Summer Product  86          4
    4  Hiking Backpacks Spring Products         90          1
    5  Home Security IP Camera                  86          0

What I expect:

       Product             Price  ID
    0  Waterproof Liner    40     678897
    1  Phone Tripod        50     Remove
    2  Waterproof Pants    0      987340
    3  baby Kids play Mat  985    111500
    4  Hiking BACKPACKS    34     287950
    5  security Camera     160    508760

lp0sw83n  #1

You can create a wrapper function:

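    def extract(s):
        # With a Series as choices, extractOne returns (best match, score, index label)
        name, score, _ = process.extractOne(s, df2["Product2"], score_cutoff=0)
        if score < 80:
            return 'Remove'
        # Look up the Id of the df2 row whose product name was matched
        return df2.set_index('Product2').loc[name, 'Id']

    df1['ID'] = df1["Product1"].apply(extract)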

Output:

       Product1            Price  ID
    0  Waterproof Liner    40     678897
    1  Phone Tripod        50     Remove
    2  Waterproof Pants    0      987340
    3  baby Kids play Mat  985    111500
    4  Hiking BACKPACKS    34     287950
    5  security Camera     160    508760

Note: the output is not exactly what you expected; you would need to explain why rows 4/5 should be removed.
