Pandas: merging DataFrames with overlapping IDs

Posted by 1szpjjfi on 2022-12-09 in Other

I have three DataFrames:

import pandas as pd

df1 = pd.DataFrame({'sample': ['Sample1', 'Sample2', 'Sample3', 'Sample4'],
                    'value': [0.00, 0.0, 0.1, 0.0]})

df2 = pd.DataFrame({'sample': ['Sample1', 'Sample2', 'Sample3', 'Sample4'],
                    'value': [0.00, 0.2, 0.5, 0.0]})

df3 = pd.DataFrame({'sample': ['Sample1', 'Sample2', 'Sample3', 'Sample4'],
                    'value': [0.00, 0.3, 0.6, 0.0]})

# Table A
Sample1     0.0
Sample2     0.0
Sample3     0.1
Sample4     0.0

# Table B
Sample1     0.0
Sample2     0.2
Sample3     0.5
Sample4     0.0

# Table C
Sample1     0.0
Sample2     0.3
Sample3     0.6
Sample4     0.0

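I want to combine all three DataFrames so that every value greater than 0.00 is kept in a new DataFrame under a distinct name. For each sample ID, a value of 0.00 should appear only once: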

# Output
Sample1     0.0
Sample2_A   0.0 (not required)
Sample2_B   0.2
Sample2_C   0.3
Sample3_A   0.1
Sample3_B   0.5
Sample3_C   0.6
Sample4     0.0

How can I create such a combined DataFrame?

arknldoa #1

You can define a custom function that appends an underscore and the DataFrame's name to the sample column wherever value is greater than 0:

def add_suffix(df, df_name):
    # Keep the sample name where value <= 0; otherwise append '_<df_name>'
    df['sample'] = df['sample'].where(df['value'].le(0), lambda x: x + f'_{df_name}')
    return df

df1 = add_suffix(df1, 'A')
df2 = add_suffix(df2, 'B')
df3 = add_suffix(df3, 'C')

output = pd.concat([df1, df2, df3])

If the sample IDs are already unique within each DataFrame, drop_duplicates is enough to remove the repeated rows whose value is zero:

output = output.drop_duplicates()

Otherwise, you may need a second concat so that duplicates are dropped only among the zero-value rows:

output = (pd.concat([output[output['value'].ne(0)], output[output['value'].eq(0)].drop_duplicates()],
                    ignore_index=True).sort_values(by='sample'))
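
For reference, the steps above can be put together in one self-contained sketch. It assumes the sample IDs are plain strings. The helper name `with_suffix` and the `frames` dict are illustrative and not part of the original answer. This variant works on copies, so the input frames are left unchanged.

```python
import pandas as pd

samples = ["Sample1", "Sample2", "Sample3", "Sample4"]
# Hypothetical container: maps each suffix letter to its DataFrame
frames = {
    "A": pd.DataFrame({"sample": samples, "value": [0.0, 0.0, 0.1, 0.0]}),
    "B": pd.DataFrame({"sample": samples, "value": [0.0, 0.2, 0.5, 0.0]}),
    "C": pd.DataFrame({"sample": samples, "value": [0.0, 0.3, 0.6, 0.0]}),
}

def with_suffix(df, name):
    """Return a copy with '_<name>' appended to sample where value > 0."""
    out = df.copy()
    positive = out["value"].gt(0)
    out.loc[positive, "sample"] = out.loc[positive, "sample"] + f"_{name}"
    return out

combined = pd.concat([with_suffix(df, name) for name, df in frames.items()])

# Keep every non-zero row, but only one zero row per sample ID
nonzero = combined[combined["value"].ne(0)]
zeros = combined[combined["value"].eq(0)].drop_duplicates()
output = (pd.concat([nonzero, zeros], ignore_index=True)
            .sort_values(by="sample")
            .reset_index(drop=True))
print(output)
```

As in the answer, Sample2 keeps a single unsuffixed 0.0 row, which the question marks as not required.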

fd3cxomn #2

One approach uses the keys parameter of pandas.concat:

import numpy as np

out = (
    pd.concat([df1, df2, df3], keys=["A", "B", "C"])
      .reset_index(names=["suffix", "index"])  # names= requires pandas >= 1.5
      .assign(sample=lambda x: np.where(x["value"].gt(0),
                                        x["sample"].add("_" + x["suffix"]),
                                        x["sample"]))
      .loc[lambda s: ~s["sample"].duplicated(keep="first"), ["sample", "value"]]
      .sort_values(by="sample")
)
# Output:
print(out)

       sample  value
0     Sample1    0.0
1     Sample2    0.0
5   Sample2_B    0.2
9   Sample2_C    0.3
2   Sample3_A    0.1
6   Sample3_B    0.5
10  Sample3_C    0.6
3     Sample4    0.0
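
To illustrate why keys is useful here: pd.concat with keys adds an outer index level recording which frame each row came from, and resetting that level turns it into an ordinary column. The following is a minimal sketch on two toy frames. The names `left`, `right`, `stacked` and `flat` are illustrative, and it uses the `names` argument of pd.concat instead of `reset_index(names=...)`.

```python
import pandas as pd

left = pd.DataFrame({"sample": ["Sample1", "Sample2"], "value": [0.0, 0.2]})
right = pd.DataFrame({"sample": ["Sample1", "Sample2"], "value": [0.0, 0.3]})

# keys labels each input frame; names labels the two resulting index levels
stacked = pd.concat([left, right], keys=["A", "B"], names=["suffix", "row"])
print(stacked.index.tolist())
# [('A', 0), ('A', 1), ('B', 0), ('B', 1)]

# Move the outer level into a regular column named 'suffix'
flat = stacked.reset_index(level="suffix")
print(flat)
```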

c0vxltue #3

Code:

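alpha = ['', 'A', 'B', 'C']
# Stack the frames and drop exact (sample, value) duplicates
df = pd.concat([df1, df2, df3]).drop_duplicates().reset_index(drop=True)
# Suffix by occurrence within each sample: first row gets none, then _A, _B, ...
df['sample'] = df['sample']+df.groupby('sample').cumcount().apply(lambda x: '_'+alpha[x] if alpha[x]!='' else alpha[x])
df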

Output:

sample         value
0   Sample1     0.0
1   Sample2     0.0
2   Sample3     0.1
3   Sample4     0.0
4   Sample2_A   0.2
5   Sample3_A   0.5
6   Sample2_B   0.3
7   Sample3_B   0.6
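
Note that `groupby(...).cumcount()` numbers the rows of each sample in order of appearance, so the letter attached in this answer reflects the occurrence count and not the source DataFrame. A small sketch of that behavior, using a made-up column of repeated IDs:

```python
import pandas as pd

# Illustrative data: repeated sample IDs in the order they appear after concat
df = pd.DataFrame({"sample": ["Sample2", "Sample3", "Sample2", "Sample3", "Sample2"]})

# cumcount() counts 0, 1, 2, ... within each group, in row order
df["occurrence"] = df.groupby("sample").cumcount()
print(df["occurrence"].tolist())
# [0, 0, 1, 1, 2]
```

This is why Sample3's value of 0.1 from the first frame stays unsuffixed in the output above, and why 0.2 from the second frame is labeled Sample2_A.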

mqkwyuun #4

Another possible solution, based on pandas.DataFrame.merge and pandas.melt:

(df1[~df1.value.eq(0)].merge(df2[~df2.value.eq(0)], on='sample', how='outer')
 .merge(df3[~df3.value.eq(0)], on='sample', how='outer')
 .set_axis(['sample'] + list('ABC'), axis='columns')
 .melt(id_vars='sample').dropna()
 .assign(sample = lambda x: x['sample'] + '_' + x['variable'])
 .merge(df1[df1.value.eq(0)], on='sample', how='outer')
 .assign(value = lambda x: x.sum(numeric_only=True, axis=1))
 .sort_values('sample').loc[:, ['sample', 'value']])

Output:

sample  value
5    Sample1    0.0
6    Sample2    0.0
2  Sample2_B    0.2
4  Sample2_C    0.3
0  Sample3_A    0.1
1  Sample3_B    0.5
3  Sample3_C    0.6
7    Sample4    0.0
