Pivoting a pandas DataFrame by filled values

Posted by lymgl2op on 2023-04-28 in Other

I want to transform/pivot the DataFrame below so that it shows the direct data flow from (source) to (target).

     l0     l1     l2     l3  sum
0    IN  TOTAL   <NA>   <NA>    1
1  <NA>  TOTAL  OUT_A  OUT_B    2
2  <NA>  TOTAL   <NA>  OUT_C    3

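In the example above, a data flow is represented by, for example, l0 to l1 in row 0. Likewise, l1 to l2 and l2 to l3 are the (direct) data flows in row 1, and l1 to l3 is the one in row 2: missing levels are skipped, so each pair of consecutive non-missing values in a row is one flow.
Expected output: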

  source target  sum
0     IN  TOTAL    1
1  TOTAL  OUT_A    2
2  TOTAL  OUT_C    3
3  OUT_A  OUT_B    2

Reproducible example:

import pandas as pd
df = pd.DataFrame({
    "l0": ['IN', pd.NA, pd.NA],
    "l1": ['TOTAL','TOTAL','TOTAL'],
    "l2": [pd.NA,'OUT_A', pd.NA],
    "l3": [pd.NA,'OUT_B',"OUT_C"],
    "sum": [1,2,3]})

# expected result
pd.DataFrame({
    "source": ["IN","TOTAL","TOTAL","OUT_A"],
    "target": ["TOTAL","OUT_A","OUT_C","OUT_B"],
    "sum": [1,2,3,2]
})
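The rule described above (every pair of consecutive non-missing values in a row is one source/target flow that carries the row's sum) can be sketched in plain Python, without pandas. This is only an illustration of the intended logic: the helper name `flows` and the use of `None` for missing values are assumptions made for this sketch, not part of the question.

```python
def flows(rows):
    """Yield (source, target, total) for each consecutive non-missing pair.

    Each row is (l0, l1, l2, l3, total); None marks a missing level.
    """
    for *levels, total in rows:
        # keep only the levels that are actually filled, in column order
        path = [value for value in levels if value is not None]
        # neighbouring entries of the filtered path are the direct flows
        for source, target in zip(path, path[1:]):
            yield source, target, total


rows = [
    ("IN", "TOTAL", None, None, 1),
    (None, "TOTAL", "OUT_A", "OUT_B", 2),
    (None, "TOTAL", None, "OUT_C", 3),
]
result = list(flows(rows))
print(result)
```

The rows come out grouped by input row, so the order differs from the expected table, but the set of flows is the same.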
Answer #1 (5jvtdoz2)

Based on itertools.pairwise (applied to the non-NaN values of each row) and pd.concat:

from itertools import pairwise  # requires Python 3.10+

df_ = (pd.concat(df.set_index('sum').apply(
    lambda x: pd.DataFrame([[*p, x.name] for p in pairwise(x.dropna())]),
    axis=1).values, ignore_index=True)
       .set_axis(['source', 'target', 'sum'], axis=1))
  source target  sum
0     IN  TOTAL    1
1  TOTAL  OUT_A    2
2  OUT_A  OUT_B    2
3  TOTAL  OUT_C    3
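itertools.pairwise only exists from Python 3.10 onwards. On an older interpreter, a fallback along the following lines should behave the same way for this use case. It is a sketch built on itertools.tee, not something taken from the answer above.

```python
try:
    from itertools import pairwise  # Python 3.10+
except ImportError:
    from itertools import tee

    def pairwise(iterable):
        """Fallback: yield overlapping pairs (a, b), (b, c), ..."""
        first, second = tee(iterable)
        next(second, None)  # advance the second iterator by one item
        return zip(first, second)


path = ["TOTAL", "OUT_A", "OUT_B"]
pairs = list(pairwise(path))
print(pairs)
```

A path with fewer than two entries yields no pairs, so a row with a single filled level contributes no flow.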
Answer #2 (5ktev3wc)

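You can use DataFrame.melt, remove the missing values with DataFrame.dropna, apply a custom sliding-window function per original row, and finally build a two-column DataFrame and append the sum column: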

#https://stackoverflow.com/a/6822773/2901002
def window(seq, n=2):
    # all length-n sliding windows over seq, returned as tuples
    return [tuple(seq[i: i + n]) for i in range(len(seq) - n + 1)]


s = (df.melt('sum', ignore_index=False)
       .dropna(subset=['value'])
       .reset_index()
       .groupby(['index','sum'])['value']
       .apply(window)
       .explode()
       )

df1 = (pd.DataFrame(s.tolist(), columns=['source', 'target'])
         .assign(sum=s.index.get_level_values('sum')))
print (df1)
  source target  sum
0     IN  TOTAL    1
1  TOTAL  OUT_A    2
2  OUT_A  OUT_B    2
3  TOTAL  OUT_C    3

Another idea, using a list comprehension:

g = (df.melt('sum', ignore_index=False)
       .dropna(subset=['value'])
       .reset_index()
       .groupby(['index','sum'])['value'])

n = 2

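# the group key is (index, sum); only the sum is needed, and the window
# position i no longer shadows the row index from the group key
df1 = (pd.DataFrame([(*k.iloc[i: i + n], total)
                     for (_, total), k in g
                     for i in range(len(k) - n + 1)],
                    columns=['source', 'target','sum']))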
print (df1)
  source target  sum
0     IN  TOTAL    1
1  TOTAL  OUT_A    2
2  OUT_A  OUT_B    2
3  TOTAL  OUT_C    3
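Both variants rely on the same long-format intermediate: after melt, dropna and reset_index, every remaining row is one filled cell, and grouping by the original row label (the `index` column) and `sum` recovers each row's path of non-missing values. The sketch below only inspects that intermediate, using the DataFrame from the question. The names `long_df` and `paths` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "l0": ["IN", pd.NA, pd.NA],
    "l1": ["TOTAL", "TOTAL", "TOTAL"],
    "l2": [pd.NA, "OUT_A", pd.NA],
    "l3": [pd.NA, "OUT_B", "OUT_C"],
    "sum": [1, 2, 3],
})

# one row per filled cell; the original row label ends up in the "index" column
long_df = (df.melt("sum", ignore_index=False)
             .dropna(subset=["value"])
             .reset_index())

# collect the filled values of each original row, in column order l0..l3
paths = long_df.groupby(["index", "sum"])["value"].agg(list)
print(paths)
```

The sliding window of size 2 is then applied to each of these per-row lists.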
Answer #3 (hm2xizp9)

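Apply a function that collects the consecutive pairs of non-null values in each row, explode those pairs into separate rows, expand each pair into separate columns, and combine the result with the sum column of the original DataFrame.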

def pairs(row):
    """Return a list of the non-null consecutive pairs in a dataframe row."""
    items = row.dropna().values
    return [items[i:i+2] for i in range(len(items)-1)]

cols = ["l0", "l1", "l2", "l3"]
result = (df[cols]
          .apply(pairs, axis=1)    
          .explode()               # series of source-target pairs
          .apply(pd.Series)        # expand each pair to two columns
          .rename(columns={0: "source", 1: "target"}) # name the resulting columns
          .assign(sum=df["sum"])   # add the sum column from original df
         )
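The snippet above leaves the repeated row labels that explode produces in the index. A minimal variant, assuming the DataFrame from the question, is sketched below. It swaps `.apply(pd.Series)` for the plain DataFrame constructor and ends with a fresh 0..n-1 index. The helper name `pairs_as_tuples` is made up for this sketch and is not the answerer's code.

```python
import pandas as pd

df = pd.DataFrame({
    "l0": ["IN", pd.NA, pd.NA],
    "l1": ["TOTAL", "TOTAL", "TOTAL"],
    "l2": [pd.NA, "OUT_A", pd.NA],
    "l3": [pd.NA, "OUT_B", "OUT_C"],
    "sum": [1, 2, 3],
})


def pairs_as_tuples(row):
    """Return the consecutive non-null pairs of a row as a list of tuples."""
    items = row.dropna().tolist()
    return list(zip(items, items[1:]))


cols = ["l0", "l1", "l2", "l3"]
# one (source, target) tuple per row; the index repeats the original row label
exploded = df[cols].apply(pairs_as_tuples, axis=1).explode()

result = (pd.DataFrame(exploded.tolist(), columns=["source", "target"])
            # look up each pair's sum through its original row label
            .assign(sum=df.loc[exploded.index, "sum"].to_numpy()))
print(result)
```

The rows should match those of the other answers, in the same order as answers #1 and #2.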
Answer #4 (6l7fqoea)

Use stack to drop the NA values, then a dictionary comprehension with groupby, zip and concat to combine consecutive values:

out = (pd.concat({k: pd.DataFrame(zip(g, g.iloc[1:]), columns=['source', 'target'])
                 for k, g in df.set_index('sum').stack().groupby(level=0)},
                names=['sum'])
         .reset_index('sum')
         .sort_index(ignore_index=True, kind='stable') # optional
       )

Output:

   sum source target
0    1     IN  TOTAL
1    2  TOTAL  OUT_A
2    3  TOTAL  OUT_C
3    2  OUT_A  OUT_B
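One caveat with grouping on the sum level alone: if two rows happened to share the same sum, their values would land in the same group and zip would also pair the last value of one row with the first value of the next. The sketch below uses made-up data with a duplicated sum (hypothetical, not from the question) and groups by row label and sum together. The explicit dropna() is added so the result does not depend on whether stack drops missing values itself.

```python
import pandas as pd

# hypothetical data: both rows carry the same sum
df = pd.DataFrame({
    "l0": ["IN", pd.NA],
    "l1": ["TOTAL", "TOTAL"],
    "l2": [pd.NA, "OUT_A"],
    "sum": [5, 5],
})

# index levels after stacking: (row label, sum, column name)
stacked = df.set_index("sum", append=True).stack().dropna()

records = [
    (source, target, total)
    for (_, total), g in stacked.groupby(level=[0, 1])  # one group per original row
    for source, target in zip(g, g.iloc[1:])            # consecutive values in that row
]
out = pd.DataFrame(records, columns=["source", "target", "sum"])
print(out)
```

With unique sums, as in the question, grouping on sum alone gives the same pairs.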
