Minimizing rows via merge/squash in a Pandas DataFrame with multiple indexes

lf3rwulv · asked 2022-11-20

Given a DataFrame like the following,

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id_1': [33,33,33,33,22,22,88,100],
    'id_2': [64,64,64,64,12,12,77,100],
    'col_1': [np.nan, 'dog', np.nan, 'kangaroo', np.nan, np.nan, np.nan, np.nan],
    'col_2': ['bike', 'car', np.nan, np.nan, 'train', np.nan, 'horse', np.nan],
    'col_3': [np.nan, np.nan, 'star', 'meteor', np.nan, 'rock', np.nan, np.nan]
})

"""
   id_1  id_2     col_1  col_2   col_3
0    33    64       NaN   bike     NaN
1    33    64       dog    car     NaN
2    33    64       NaN    NaN    star
3    33    64  kangaroo    NaN  meteor
4    22    12       NaN  train     NaN
5    22    12       NaN    NaN    rock
6    88    77       NaN  horse     NaN
7   100   100       NaN    NaN     NaN
"""

how can it be converted to the minimum number of rows, without aggregating or losing data, as below?

   id_1  id_2     col_1  col_2   col_3
0    33    64       dog   bike    star
1    33    64  kangaroo    car  meteor
3    22    12       NaN  train    rock
4    88    77       NaN  horse     NaN
5   100   100       NaN    NaN     NaN

Basically, for each group over the id_X columns, the NaN values in the col_X columns are filled with the other values of the group where applicable.
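One direct way to sketch this squashing (the `squash` helper below is illustrative, not taken from the answers) is to collect, per group, each column's non-NaN values and pad the shorter lists with NaN so all columns line up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id_1': [33, 33, 33, 33, 22, 22, 88, 100],
    'id_2': [64, 64, 64, 64, 12, 12, 77, 100],
    'col_1': [np.nan, 'dog', np.nan, 'kangaroo', np.nan, np.nan, np.nan, np.nan],
    'col_2': ['bike', 'car', np.nan, np.nan, 'train', np.nan, 'horse', np.nan],
    'col_3': [np.nan, np.nan, 'star', 'meteor', np.nan, 'rock', np.nan, np.nan],
})

def squash(group):
    # Collect the non-NaN values of each col_* column, then pad the shorter
    # lists with NaN so every column contributes the same number of rows.
    vals = {c: group[c].dropna().tolist() for c in ['col_1', 'col_2', 'col_3']}
    n = max([1] + [len(v) for v in vals.values()])  # at least one row per group
    return pd.DataFrame({c: v + [np.nan] * (n - len(v)) for c, v in vals.items()})

res = (df.groupby(['id_1', 'id_2'], sort=False)
         .apply(squash)
         .reset_index(level=2, drop=True)  # drop the per-group row counter
         .reset_index())
print(res)
```

Unlike the set-based approach further down, this preserves duplicate non-NaN values within a group, since lists rather than sets are collected.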


Answer from sbdsn5lh:

Another possible solution:

# push all non-NaN values to the top of each column within each group
df.loc[:, 'col_1':'col_3'] = df.groupby(
    ['id_1', 'id_2'], sort=False).transform(lambda x: sorted(x, key=pd.isnull))

# drop the rows that are now all-NaN duplicates within their group
df.loc[~(df.duplicated(['id_1', 'id_2']) &
         df.loc[:, 'col_1':'col_3'].isna().all(axis=1))]

Output:

   id_1  id_2     col_1  col_2   col_3
0    33    64       dog   bike    star
1    33    64  kangaroo    car  meteor
4    22    12       NaN  train    rock
6    88    77       NaN  horse     NaN
7   100   100       NaN    NaN     NaN
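The trick behind the `transform` above: `sorted` with `key=pd.isnull` sorts on a boolean key, and Python's sort is stable, so within each group the non-null values float to the top of each column while keeping their relative order. A minimal illustration:

```python
import numpy as np
import pandas as pd

vals = [np.nan, 'dog', np.nan, 'kangaroo']
# pd.isnull maps each value to False/True; False sorts before True, and
# the stable sort keeps the non-null values in their original order.
print(sorted(vals, key=pd.isnull))  # ['dog', 'kangaroo', nan, nan]
```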

Answer from 3vpjnl9f:

To avoid illegible Pandas voodoo, after the imports and the `df` instantiation above, you can do the following:

def get_max_vals_from_row_sets(row, cols):
  mn = 1
  for col in cols:
    mn = max(mn, len(row[col]))
  return mn

def add_id_row(d, row, ids, cols):
  max_vals = get_max_vals_from_row_sets(row, cols)

  for _ in range(max_vals):
    for id_ in ids:
      d[id_].append(row[id_])

    for col in cols:
      if len(row[col]) != 0:
        d[col].append(row[col].pop())
      else:
        d[col].append(np.nan)

def drop_set_nans(row, cols):
  for col in cols:
    if np.nan in row[col]:
      row[col].remove(np.nan)
  return row

def squash_out_redundant_nans(df, ids, cols):
  d = {k: [] for k in df.columns}
  for _, row in df.iterrows():  # iterate the aggregated frame passed in, not the global
    drop_set_nans(row, cols)
    add_id_row(d, row, ids, cols)

  df = pd.DataFrame(d)
  return df

ids = ['id_1', 'id_2']
cols = ['col_1', 'col_2', 'col_3']
df1 = df.groupby(ids).agg(set).reset_index()
df = squash_out_redundant_nans(df1, ids, cols)
print(df)
