有两个 Dataframe -
df1 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry-standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.','Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
'ColF': ['F1', 'F1'],
'ColG': ['G1', 'G1'],
'ColH': ['H1', 'H2'],
'ColJ': ['J1', 'J2']})
df2 = pd.DataFrame({'ColA': ['A1', 'A2', 'A3'],
'ColB': ['B1', 'B2', 'B3'],
'ColC': ['C1', 'C2', 'C3'],
'ColD': ['d1,d2,d3,d4,d5,d6', 'd7,d8,d9,d10,d11,d12', 'd13,d14,d15,d16,d17,d18']})
如何从上述两个 Dataframe 中获取 Dataframe df 01,以便如果df 2 ['ColD']中的任何关键字存在于df 1 ['ColE']中,则随后将返回用逗号分隔的ColA,ColB,ColC和ColD的值,并将相同的值放入df 1中的ColA 0,ColB 0,ColC 0,ColD 1中,如-
df01 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry-standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.','Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
'ColF': ['F1', 'F1'],
'ColG': ['G1', 'G1'],
'ColH': ['H1', 'H2'],
'ColJ': ['J1', 'J2'],
'ColD1': ['d4', 'd10,d15'],
'ColA0': ['A1', 'A2,A3'],
'ColB0': ['B1', 'B2,B3'],
'ColC0': ['C1', 'C2,C3'],
})
以前的代码-
import pandas as pd
df1 = pd.DataFrame({'ColE': ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry\'s standard dummy text ever since the d4, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.',
'Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from d15 BC, making it over d10 years old.'],
'ColF': ['F1', 'F1'],
'ColG': ['G1', 'G1'],
'ColH': ['H1', 'H2'],
'ColJ': ['J1', 'J2']})
df2 = pd.DataFrame({'ColA': ['A1', 'A2', 'A3'],
'ColB': ['B1', 'B2', 'B3'],
'ColC': ['C1', 'C2', 'C3'],
'ColD': ['d1,d2,d3,d4,d5,d6', 'd7,d8,d9,d10,d11,d12', 'd13,d14,d15,d16,d17,d18']})
df2['ColD'] = df2['ColD'].str.split(',')
df1['ColD1'] = df1['ColE'].apply(lambda x: ','.join([keyword.strip() for keyword in df2['ColD'].sum() if keyword in x]))
df1['ColA0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColA']))
df1['ColB0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColB']))
df1['ColC0'] = df1['ColD1'].apply(lambda x: ','.join(df2.loc[df2['ColD'].apply(lambda y: any(keyword in y for keyword in x.split(','))), 'ColC']))
df01 = df1
这给出以下输出-
| 科尔|ColF| ColG| ColH| ColJ| ColD1| ColA0| ColB0| ColC0|
| --------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|
| Lorem Ipsum is simply dummy text of the printi..|F1| G1| H1| J1| d4| A1| B1| C1|
| 与流行的看法相反,Lorem Ipsum is not..|F1| G1|氢气|J2|第1天、第10天、第15天|A1、A2、A3| B1、B2、B3| C1、C2、C3|
预期输出为-
| 科尔|ColF| ColG| ColH| ColJ| ColD1| ColA0| ColB0| ColC0|
| --------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|
| Lorem Ipsum is simply dummy text of the printi..|F1| G1| H1| J1| d4| A1| B1| C1|
| 与流行的看法相反,Lorem Ipsum is not..|F1| G1|氢气|J2| d10、d15| A2、A3| B2、B3| C2、C3|
怎么弄成这样?
1条答案
按热度按时间8yoxcaq71#
你的问题是
keyword in x
对于d1
是真的,因为d1
是d10
和d15
的一部分。您需要使用正则表达式搜索来确保只匹配完整的单词。注意,为了提高效率,应该只计算一次ColD
的总和,并将ColD1
作为列表保存,直到完成计算:输出: