Retrieving substrings based on a pandas DataFrame

Posted by bogh5gae on 2022-12-28 in Other

I have the following pandas DataFrames:

print(df)

text_description     
ROME AND MILAN ARE AMAZING CITIES
NEW YORK AND LONDON REPRESENT GLOBAL FINANCE MARKETS
I LOVE MADRID 
BANGKOK IS AN AMAZING CITY
VAL D'ISERE IS A MAGIC PLACE

...

print(df_1)

City_List

PARIS
MILAN
ROME
NEW YORK
LONDON
MADRID
V. D'ISERE

I want to filter the text in df["text_description"] so that only the city names contained in df_1["City_List"] are kept, giving two separate columns:

print(final_df)

text_description_0     text_description_1
ROME                          MILAN
NEW YORK                     LONDON
MADRID                         na
VAL D'ISERE                    na
...

How can I create "final_df"?

6pp0gazn #1

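You won't get VAL D'ISERE, because it does not appear in the city list. The list only contains an abbreviated form (V. D'ISERE), which the program cannot recognize. You will have to find a way to account for abbreviations. The code below only handles names that appear verbatim in both columns: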

from itertools import product
from collections import defaultdict

# df1 and df2 stand for the question's df and df_1
df1, df2 = df.copy(), df_1

d = defaultdict(list)
# build the Cartesian product of the two columns and keep only
# the pairs where the city name occurs in the text description
for first, last in product(df1.text_description, df2.City_List):
    if last in first:
        d[first].append(last)

# join the cities found for each description into one comma-separated string
d = {k: ','.join(v) for k, v in d.items()}

# map the dictionary onto text_description and split the result into two columns
df1[['city1', 'city2']] = df1.text_description.map(d).str.split(',', expand=True)

df1
         text_description                               city1       city2
0   ROME AND MILAN ARE AMAZING CITIES                   MILAN       ROME
1   NEW YORK AND LONDON REPRESENT GLOBAL FINANCE M...   NEW YORK    LONDON
2   I LOVE MADRID                                       MADRID      None
3   BANGKOK IS AN AMAZING CITY                          NaN         NaN
4   VAL D'ISERE IS A MAGIC PLACE                        NaN         NaN
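
If the abbreviation also needs to match, one possible approach is a regular expression built from the city list. The sketch below is not part of the code above; it rests on the assumption that a token ending in "." (such as "V.") abbreviates a word starting with the same letters, and city_to_regex is a hypothetical helper introduced only for illustration. It also matches whole words only and returns the cities in the order they appear in the text, which is closer to the layout asked for in the question.

```python
import re
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"text_description": [
    "ROME AND MILAN ARE AMAZING CITIES",
    "NEW YORK AND LONDON REPRESENT GLOBAL FINANCE MARKETS",
    "I LOVE MADRID",
    "BANGKOK IS AN AMAZING CITY",
    "VAL D'ISERE IS A MAGIC PLACE",
]})
df_1 = pd.DataFrame({"City_List": [
    "PARIS", "MILAN", "ROME", "NEW YORK", "LONDON", "MADRID", "V. D'ISERE",
]})


def city_to_regex(city):
    """Hypothetical helper: turn one city name into a regex fragment.

    Assumption: a token ending in "." abbreviates a word that starts with
    the same letters, so "V." may match "V.", "VAL", "VALLE", and so on.
    """
    parts = []
    for token in city.split():
        if token.endswith(".") and len(token) > 1:
            # abbreviated token: same prefix, any further word characters
            parts.append(re.escape(token[:-1]) + r"\w*\.?")
        else:
            parts.append(re.escape(token))
    return r"\s+".join(parts)


# Longest names first, so a longer name is tried before a shorter one
cities = sorted(df_1["City_List"], key=len, reverse=True)
pattern = r"(?<!\w)(?:" + "|".join(city_to_regex(c) for c in cities) + r")(?!\w)"

# One list of matched cities per row, in the order they occur in the text
matches = df["text_description"].str.findall(pattern)

# Spread the lists over columns and drop rows where no city was found
final_df = (
    pd.DataFrame(matches.tolist(), index=df.index)
    .add_prefix("text_description_")
    .dropna(how="all")
)
print(final_df)
```

The abbreviation rule is deliberately loose: any word starting with "V" followed by D'ISERE would match, so for real data a hand-maintained mapping from abbreviations to full names may be safer.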
