pandas: remove rows in columns that do not match the rows in another column (all rows are in one DataFrame)

h9a6wy2h  posted on 2023-10-14  in Other

My dataframe looks like this:

A         B    C    D        E   F   G  H       I      J 
FP002    12     FP001   113 406 519  85 82  FP001   6240
FP003    7610   FP002   99  552 651  49 64  FP002   12294
FP005    12,    FP003   102 131 1416 24  89 FP003   761
FP005    1250   FP004   94  739 833 122 215 FP004   400

I would like my output to look like this:

A         B    C         D   E       F       G       H       I       J
FP002    12     FP002   99  552     651      49      64   FP002   12294
FP003    7610   FP003   102 1314    1416    247      89   FP003  761
FP005    12,    
FP005    1250

So basically, it keeps only the rows that match column A.
My code is:

dfR = df1.join( df1 ,on=['A','C'], how='inner')

But it does not give me the result I want.

7fhtutme  #1

You can split the frame into blocks with groupby on axis=1, then, after de-duplicating the keys with groupby.cumcount, join the blocks with functools.reduce:

from functools import reduce

# start groups based on first row whenever a value starts with "FP"
group = df.iloc[0].astype(str).str.startswith('FP').cumsum()
# [1, 1, 2, 2, 2, 2, 2, 2, 3, 3]

out = reduce(lambda a,b: a.join(b, how='left'), 
             (d.set_index([d.iloc[:, 0].to_numpy(),
                           d.groupby(d.iloc[:, 0]).cumcount().to_numpy()])
              for k, d in df.groupby(group, axis=1))
            )

Output:

             A     B      C      D      E       F     G     H      I        J
FP002 0  FP002    12  FP002   99.0  552.0   651.0  49.0  64.0  FP002  12294.0
FP003 0  FP003  7610  FP003  102.0  131.0  1416.0  24.0  89.0  FP003    761.0
FP005 0  FP005   12,    NaN    NaN    NaN     NaN   NaN   NaN    NaN      NaN
      1  FP005  1250    NaN    NaN    NaN     NaN   NaN   NaN    NaN      NaN
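
The second index level (0, 1) in the output above comes from groupby.cumcount. A minimal sketch of that de-duplication step in isolation, using only the column A values from the question (the variable names are illustrative, not from the answer):

```python
import pandas as pd

# Column A from the question; FP005 appears twice.
keys = pd.Series(["FP002", "FP003", "FP005", "FP005"], name="A")

# cumcount numbers each occurrence of a key within its group:
# the first FP005 gets 0, the second gets 1.
occurrence = keys.groupby(keys).cumcount()

# Pairing each key with its occurrence number gives a unique two-level index,
# so the later joins match rows one-to-one instead of multiplying duplicates.
index = pd.MultiIndex.from_arrays([keys.to_numpy(), occurrence.to_numpy()])

print(occurrence.tolist())  # [0, 0, 0, 1]
print(index.is_unique)      # True
```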

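Note that groupby with axis=1 is deprecated and will be removed in a future version of pandas; the approach that avoids it is to transpose: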

from functools import reduce

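# same grouping as before: a new group starts at each column whose first value starts with "FP"
group = df.iloc[0].astype(str).str.startswith('FP').cumsum()

# group the rows of the transposed frame (the original columns), then transpose
# each chunk back; the := assignment requires Python 3.8+
out = reduce(lambda a,b: a.join(b, how='left'),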
             ((d2:=d.T).set_index([d.iloc[0].to_numpy(),
                                   d2.groupby(d.iloc[0]).cumcount().to_numpy()])
              for k, d in df.T.groupby(group))
            )

Intermediates passed to reduce to perform the left joins:

[             A     B
 FP002 0  FP002    12
 FP003 0  FP003  7610
 FP005 0  FP005   12,
       1  FP005  1250,
              C    D    E     F    G    H
 FP001 0  FP001  113  406   519   85   82
 FP002 0  FP002   99  552   651   49   64
 FP003 0  FP003  102  131  1416   24   89
 FP004 0  FP004   94  739   833  122  215,
              I      J
 FP001 0  FP001   6240
 FP002 0  FP002  12294
 FP003 0  FP003    761
 FP004 0  FP004    400]
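
For reference, the whole procedure can be sketched end to end on the sample data. This is a variation rather than the code above: it selects each block with a boolean column mask instead of groupby(axis=1) or a transpose. The helper name `keyed` is hypothetical, and storing column B as strings (because of the `12,` entry) is an assumption about the original data.

```python
from functools import reduce

import pandas as pd

# Sample data copied from the question; B is kept as strings (assumption).
df = pd.DataFrame({
    "A": ["FP002", "FP003", "FP005", "FP005"],
    "B": ["12", "7610", "12,", "1250"],
    "C": ["FP001", "FP002", "FP003", "FP004"],
    "D": [113, 99, 102, 94],
    "E": [406, 552, 131, 739],
    "F": [519, 651, 1416, 833],
    "G": [85, 49, 24, 122],
    "H": [82, 64, 89, 215],
    "I": ["FP001", "FP002", "FP003", "FP004"],
    "J": [6240, 12294, 761, 400],
})

# A new group starts at every column whose first value starts with "FP".
group = df.iloc[0].astype(str).str.startswith("FP").cumsum()
# [1, 1, 2, 2, 2, 2, 2, 2, 3, 3]


def keyed(block):
    """Index a block by (key, occurrence number of that key)."""
    key = block.iloc[:, 0]
    occurrence = block.groupby(key).cumcount()
    return block.set_index([key.to_numpy(), occurrence.to_numpy()])


# One block per group, selected with a boolean mask over the columns.
blocks = [keyed(df.loc[:, (group == k).to_numpy()]) for k in group.unique()]

# Left-join the blocks one after another, starting from the A/B block.
out = reduce(lambda a, b: a.join(b, how="left"), blocks)
print(out)
```

Because every join is a left join starting from the A/B block, keys that occur only in C or I (FP001, FP004) are dropped, and keys that occur only in A (FP005) keep their rows with NaN in the other blocks.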

Note: you can also use other logic to form the groups, for example starting a new group whenever the value is not numeric:

group = pd.to_numeric(df.iloc[0], errors='coerce').isna().cumsum()

Or even fully manually:

group = [1, 1, 2, 2, 2, 2, 2, 2, 3, 3]

Or using an explicit list of the columns that start each group:

group = df.columns.isin(['A', 'C', 'I']).cumsum()
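
As a quick check, the grouping variants above can be compared on the first row of the sample data. This is a small sketch; `first_row` is rebuilt by hand here, with B assumed to be the string "12", and the variable names are illustrative.

```python
import pandas as pd

# First row of the question's frame, indexed by column name.
first_row = pd.Series(
    ["FP002", "12", "FP001", 113, 406, 519, 85, 82, "FP001", 6240],
    index=list("ABCDEFGHIJ"),
)

# Variant 1: a group starts where the value starts with "FP".
by_prefix = first_row.astype(str).str.startswith("FP").cumsum()

# Variant 2: a group starts where the value cannot be parsed as a number.
by_non_numeric = pd.to_numeric(first_row, errors="coerce").isna().cumsum()

# Variant 3: a group starts at explicitly named columns.
by_name = first_row.index.isin(["A", "C", "I"]).cumsum()

print(by_prefix.tolist())  # [1, 1, 2, 2, 2, 2, 2, 2, 3, 3]
```

The first two variants return a Series indexed by column name, while the last one is a plain NumPy array; all three carry the same labels for this data.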
