pandas 如何使用两个 Dataframe 来获得基于关键字的最终 Dataframe ?

nfzehxib  于 2023-05-21  发布在  其他
关注(0)|答案(1)|浏览(135)

有两个 Dataframe -

df1 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry-standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.','Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
                    'ColF': ['F1', 'F1'],
                    'ColG': ['G1', 'G1'],
                    'ColH': ['H1', 'H2'],
                    'ColJ': ['J1', 'J2']})

df2 = pd.DataFrame({'ColA': ['A1', 'A2', 'A3'],
                    'ColB': ['B1', 'B2', 'B3'],
                    'ColC': ['C1', 'C2', 'C3'],
                    'ColD': ['d1,d2,d3,d4,d5,d6', 'd7,d8,d9,d10,d11,d12', 'd13,d14,d15,d16,d17,d18']})

如何从上述两个 Dataframe 中获取 Dataframe df 01,以便如果df 2 ['ColD']中的任何关键字存在于df 1 ['ColE']中,则随后将返回用逗号分隔的ColA,ColB,ColC和ColD的值,并将相同的值放入df 1中的ColA 0,ColB 0,ColC 0,ColD 1中,如-

df01 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry-standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.','Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
                    'ColF': ['F1', 'F1'],
                    'ColG': ['G1', 'G1'],
                    'ColH': ['H1', 'H2'],
                    'ColJ': ['J1', 'J2'],
                    'ColD1': ['d4', 'd10,d15'],
                    'ColA0': ['A1', 'A2,A3'],
                    'ColB0': ['B1', 'B2,B3'],
                    'ColC0': ['C1', 'C2,C3'],
                   })

以前的代码-

import pandas as pd

df1 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry\'s standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.',
                             'Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
                    'ColF': ['F1', 'F1'],
                    'ColG': ['G1', 'G1'],
                    'ColH': ['H1', 'H2'],
                    'ColJ': ['J1', 'J2']})

df2 = pd.DataFrame({'ColA': ['A1', 'A2', 'A3'],
                    'ColB': ['B1', 'B2', 'B3'],
                    'ColC': ['C1', 'C2', 'C3'],
                    'ColD': ['d1,d2,d3,d4,d5,d6', 'd7,d8,d9,d10,d11,d12', 'd13,d14,d15,d16,d17,d18']})

df2['ColD'] = df2['ColD'].str.split(',')

df1['ColD1'] = df1['ColE'].apply(lambda x: ','.join([keyword.strip() for keyword in df2['ColD'].sum() if keyword in x]))
df1['ColA0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColA']))
df1['ColB0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColB']))
df1['ColC0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColC']))

df01 = df1

这给出以下输出-
| 科尔|ColF| ColG| ColH| ColJ| ColD1| ColA0| ColB0| ColC0|
| --------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|
| Lorem Ipsum is simply dummy text of the printi..|F1| G1| H1| J1| d4| A1| B1| C1|
| 与流行的看法相反,Lorem Ipsum is not..|F1| G1|氢气|J2|第1天、第10天、第15天|A1、A2、A3| B1、B2、B3| C1、C2、C3|
预期输出为-
| 科尔|ColF| ColG| ColH| ColJ| ColD1| ColA0| ColB0| ColC0|
| --------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|
| Lorem Ipsum is simply dummy text of the printi..|F1| G1| H1| J1| d4| A1| B1| C1|
| 与流行的看法相反,Lorem Ipsum is not..|F1| G1|氢气|J2| d10、d15| A2、A3| B2、B3| C2、C3|
怎么弄成这样?

8yoxcaq7

8yoxcaq71#

你的问题是keyword in x对于d1是真的,因为d1d10d15的一部分。您需要使用正则表达式搜索来确保只匹配完整的单词。注意,为了提高效率,应该只计算一次ColD的总和,并将ColD1作为列表保存,直到完成计算:

df1 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry-standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.','Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
                    'ColF': ['F1', 'F1'],
                    'ColG': ['G1', 'G1'],
                    'ColH': ['H1', 'H2'],
                    'ColJ': ['J1', 'J2']})

df2 = pd.DataFrame({'ColA': ['A1', 'A2', 'A3'],
                    'ColB': ['B1', 'B2', 'B3'],
                    'ColC': ['C1', 'C2', 'C3'],
                    'ColD': ['d1,d2,d3,d4,d5,d6', 'd7,d8,d9,d10,d11,d12', 'd13,d14,d15,d16,d17,d18']})

all_keywords = df2['ColD'].str.split(',').sum()

df1['ColD1'] = df1['ColE'].apply(lambda x: [keyword.strip() for keyword in all_keywords if re.search(fr'\b{keyword}\b', x)])
df1['ColA0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColA']))
df1['ColB0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColB']))
df1['ColC0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x)), 'ColC']))
df1['ColD1'] = df1['ColD1'].apply(','.join)

输出:

ColE ColF ColG ColH ColJ    ColD1  ColA0  ColB0  ColC0
0  Lorem Ipsum is simply dummy text of the printi...   F1   G1   H1   J1       d4     A1     B1     C1
1  Contrary to popular belief, Lorem Ipsum is not...   F1   G1   H2   J2  d10,d15  A2,A3  B2,B3  C2,C3

相关问题