pandas 如何合并由分隔符分隔的字符串中具有相同唯一元素列表的行?

nhaq1z21  于 2022-12-21  发布在  其他
关注(0)|答案(2)|浏览(104)

我有一个python数据框,其中字符串中的值用分隔符分隔,对应的列中有数值。我想合并字符串中具有相同唯一元素列表的行,而不考虑它们的顺序,并将对应的列聚合到该行
数据框如下所示:

df = pd.DataFrame([['Organic Search, Direct', 171], ['Organic Search, Direct, Not Found', 3], ['Direct, Organic Search', 389], ['Direct, Organic Search, Not Found', 8]], columns=['A', 'B'])

df
我希望结果如下所示:

df = pd.DataFrame([['Direct, Organic Search',560], ['Direct, Organic Search, Not Found',11]], columns = ['A','B'])

最初,我想到使用for循环遍历 Dataframe 的每一行,并按字母顺序将元素列表存储在数组中,然后根据处理过的行组合行。
然而,我无法将其公式化为密码。

ujv3wf0j

ujv3wf0j1#

您可以轻松地做到这一点。例如:

import pandas as pd

df = pd.DataFrame([['Organic Search, Direct', 171], ['Organic Search, Direct, Not Found', 3], ['Direct, Organic Search', 389], ['Direct, Organic Search, Not Found', 8]], columns=['A', 'B'])

print(df)
Out:
      A                                 B
0   Organic Search, Direct             171
1   Organic Search, Direct, Not Found   3
2   Direct, Organic Search             389
3   Direct, Organic Search, Not Found   8
def foo(x):
    x = x.split(', ')
    x.sort()
    return tuple(x)

df.groupby(df.A.apply(foo))['B'].sum()
out:
               A
(Direct, Not Found, Organic Search)     11
(Direct, Organic Search)               560
ny6fqffe

ny6fqffe2#

我不能把这归功于我,但我不知道如何提出它以外的一个单独的答案。
我也是Python新手,使用padu's logic code,我已经根据你最初帖子的数据进行了修改。
我还用.reset_index()替换了['B']以保留原始列,因为求和之前与新分组的值合并在一个列中。我还添加了numeric_only=True,因为如果不指定要求和的列,通常会假设坚持使用数字列,但该假设现在是一个弃用的特性,根据我收到的以下消息:
DataFrameGroupBy.sum中numeric_only的默认值已弃用。在将来的版本中,numeric_only将默认为False。请指定numeric_only或仅选择对该函数有效的列。df = df.groupby(df['MCF通道分组路径_1'].apply(sort_groupings)).sum().reset_index()
您可能还想在最后一行的开头添加df =,如下所示,以保持结果为原始变量名。错过这一步导致我在学习调试时损失了大量时间。

import pandas as pd

df = pd.DataFrame([
['Direct, Organic Search',171],
['Direct, Organic Search, Not Found',4],
['Direct, Organic Search, Not Found, Referral',12],
['Direct, Organic Search, Not Found, Referral, Email',2],
['Organic Search, Direct',389],
['Organic Search, Direct, Not Found',3],
['Organic Search, Direct, Referral',76],
['Organic Search, Direct, Referral, Social Network',1]],
columns =['MCF Channel Grouping Path_1', 'Conversions'])

def sort_groupings(row_group):
    row_group = row_group.split(', ') # Converts string into list using ', ' as the separator       
    row_group.sort()                  # Sorts newly, made list in ascending order
    return tuple(row_group)           # Converts list to a tuple (required by the 'apply' function), and returns it

df = df.groupby(df['MCF Channel Grouping Path_1'].apply(sort_groupings)).sum(numeric_only=True).reset_index()

# groupby: column in which to group common values
# apply: call a function to run on values specified
# reset_index: reinstates the first column in the final df

相关问题