python-3.x: Capture all unique information per group

Posted by qacovj5a on 2022-11-26 in Python

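I want to build a dataset with one row per unique fruit. I do not know in advance all the types that may appear under each fruit (e.g. colour, store, price), and any type may also have duplicate rows. Is there a fully general way to detect all possible duplicates and capture every piece of unique information?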

     type        val    detail
0   fruit      apple
1  colour      green  greenish
2  colour     yellow
3   store    walmart       usa
4   price         10
5     NaN
6   fruit     banana
7  colour     yellow
8   fruit       pear
9   fruit  jackfruit
...

Expected output

       fruit           colour      store  price           detail  ...
0      apple  [green, yellow]  [walmart]   [10]  [greenish, usa]
1     banana         [yellow]        NaN    NaN
2       pear              NaN        NaN    NaN
3  jackfruit              NaN        NaN    NaN

I tried the following, but it gets nowhere near the expected output, and it does not show the column names either.

df.groupby("type")["val"].agg(size=len, set=lambda x: set(x))
0 fruit   {"apple",...}
1 colour  ...

fykwrbwg #1

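First, create a fruit column from val on the rows where type is fruit, replacing the non-matching values with NaN and forward-filling the gaps. Then pivot with DataFrame.pivot_table, using a custom function that returns the unique values without NaN, and finally flatten the MultiIndex: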
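To see the two building blocks in isolation, here is a minimal sketch on a made-up miniature of the data (the frame below is an assumption for illustration, not the asker's real dataset). It shows how where plus ffill labels every row with its fruit, and how dict.fromkeys removes duplicates while keeping first-seen order.

```python
import pandas as pd

# Hypothetical miniature of the question's layout (illustration only).
df = pd.DataFrame({
    "type": ["fruit", "colour", "colour", "fruit", "colour"],
    "val": ["apple", "green", "green", "banana", "yellow"],
})

# Keep val only on the rows that start a new fruit block, NaN elsewhere ...
is_fruit = df["type"].eq("fruit")
labels = df["val"].where(is_fruit)
# ... then carry each fruit name down to the rows that follow it.
df["fruit"] = labels.ffill()

# dict.fromkeys drops duplicates while preserving order of first appearance.
colour_rows = df.loc[df["type"].eq("colour"), "val"]
unique_colours = list(dict.fromkeys(colour_rows))

print(df["fruit"].tolist())
print(unique_colours)
```

The answer's complete snippet follows.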

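# True on the rows that start a new fruit block
m = df['type'].eq('fruit')

# Keep val only on those rows, then carry each fruit name down its block
df['fruit'] = df['val'].where(m).ffill()

# Pivot val and detail by type; dict.fromkeys keeps unique values in order
df1 = (df.pivot_table(index='fruit', columns='type',
                      aggfunc=lambda x: list(dict.fromkeys(x.dropna())))
         .drop('fruit', axis=1, level=1))
# Flatten the (column, type) MultiIndex into names such as val_colour
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
print(df1)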
          detail_colour detail_price detail_store       val_colour val_price  \
fruit                                                                          
apple        [greenish]           []        [usa]  [green, yellow]      [10]   
banana               []          NaN          NaN         [yellow]       NaN   
jackfruit           NaN          NaN          NaN              NaN       NaN   
pear                NaN          NaN          NaN              NaN       NaN   

           val_store  
fruit                 
apple      [walmart]  
banana           NaN  
jackfruit        NaN  
pear             NaN
