Pandas: Unpivot Excel data where category and child labels are in the same column

Posted by vom3gejh on 2023-08-01 in Other


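Basically, I think the simplest way to explain this is that I'm trying to expand a multi-index table, but the index levels are all in the same column.
My data is structured like this:
| Row Labels | Sum |
| -- | -- |
| Collection1 | 22 |
| Data 1 | 10 |
| Data 2 | 12 |
| Collection2 | 33 |
| Data 1 | 33 |
| Collection3 | 45 |
| Data 1 | 14 |
| Data 2 | 31 |
| Total | 100 |

What I want is a DataFrame like this:
| Row Labels | Data 1 | Data 2 | Sum |
| -- | -- | -- | -- |
| Collection1 | 10 | 12 | 22 |
| Collection2 | 33 | 0 | 33 |
| Collection3 | 14 | 31 | 45 |
| Total | 57 | 43 | 100 |

Is there a built-in pandas method, or a straightforward approach, for this kind of reshaping?
I have tried manually breaking the table apart and recreating it: collecting the repeated row labels, turning them into columns, and filling those columns with the data from the rows carrying that label. The tricky part is where child data is missing; in the example above, Collection2 has no Data 2. With this approach I can check, for each collection, whether Data 1 equals the collection's Sum and, if so, insert a 0 for Data 2 at that index. But it seems super ugly, and I figured there might be a more elegant way.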
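For reference, the manual approach described above could look roughly like the sketch below. It is only an illustration under assumptions: the sample frame is rebuilt from the table in the question, the names `records`, `current`, and `wide` are made up here, and it assumes every collection header row comes directly before its own child rows.

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame(
    {
        "Row Labels": ["Collection1", "Data 1", "Data 2", "Collection2", "Data 1",
                       "Collection3", "Data 1", "Data 2", "Total"],
        "Sum": [22, 10, 12, 33, 33, 45, 14, 31, 100],
    }
)

records = []    # one dict per collection (hypothetical helper name)
current = None  # the collection currently being filled
for label, value in zip(df["Row Labels"], df["Sum"]):
    if label.startswith("Collection"):
        current = {"Row Labels": label, "Sum": value}
        records.append(current)
    elif label.startswith("Data") and current is not None:
        current[label] = value
    # The "Total" row is skipped here and rebuilt below.

# Missing children (e.g. Collection2 / Data 2) become NaN, then 0.
wide = pd.DataFrame(records).fillna(0)
data_cols = sorted(c for c in wide.columns if c.startswith("Data"))
wide = wide[["Row Labels", *data_cols, "Sum"]].astype({c: int for c in data_cols})

# Rebuild the Total row from the per-collection rows.
total = {"Row Labels": "Total", **wide[[*data_cols, "Sum"]].sum().to_dict()}
wide = pd.concat([wide, pd.DataFrame([total])], ignore_index=True)
print(wide)
```

The explicit loop avoids the "does Data 1 equal the collection Sum" check entirely: a missing child simply never gets a key and is filled with 0 afterwards.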

pandas

Source: https://stackoverflow.com/questions/76783069/pandas-unpivot-excel-data-where-category-and-children-labels-are-in-same-column

4 Answers


#1 g2ieeal7

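Use pivot_table:

```python
# Identify the group rows (collection headers and the grand total)
m = df['Row Labels'].str.match(r'Collection\d+|Total')

# Reshape: forward-fill each group label onto its child rows, and relabel
# the group rows themselves as 'Sum' so they end up in their own column
out = (df
   .assign(index=df['Row Labels'].where(m).ffill(),
           col=df['Row Labels'].mask(m, 'Sum')
          )
   .pivot_table(index='index', columns='col', values='Sum', fill_value=0)
   .rename_axis(columns=None)
)

# Recompute the Total row from the collection rows
out.loc['Total'] = out.drop('Total').sum()

out = out.reset_index()
```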
Output:

```
         index  Data 1  Data 2  Sum
0  Collection1      10      12   22
1  Collection2      33       0   33
2  Collection3      14      31   45
3        Total      57      43  100
```
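To make the `where`/`mask` step easier to follow, here is a small sketch that prints the two helper columns before pivoting. The sample frame is rebuilt from the question, and the name `helper` is just for illustration.

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame(
    {
        "Row Labels": ["Collection1", "Data 1", "Data 2", "Collection2", "Data 1",
                       "Collection3", "Data 1", "Data 2", "Total"],
        "Sum": [22, 10, 12, 33, 33, 45, 14, 31, 100],
    }
)

# True for group rows (collection headers and the grand total).
m = df["Row Labels"].str.match(r"Collection\d+|Total")

helper = df.assign(
    # where(m) keeps the label on group rows and sets NaN elsewhere;
    # ffill() then copies each group label down onto its children.
    index=df["Row Labels"].where(m).ffill(),
    # mask(m, "Sum") replaces the group labels with "Sum" and keeps
    # the child labels ("Data 1", "Data 2") unchanged.
    col=df["Row Labels"].mask(m, "Sum"),
)
print(helper)
```

Each row now carries a (group, column) pair, which is exactly the long format that `pivot_table` expects.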

2023-08-01

#2 qf9go6mv

I'm not sure whether a simple pandas solution exists, but you can try this example:

```python
import pandas as pd

# remove the Total row - will recreate it as last step
# (.copy() avoids a SettingWithCopyWarning when adding "idx" below)
df = df[df["Row Labels"] != "Total"].copy()

# find the indices for pivoting: each "Collection" row starts a new group
mask = df["Row Labels"].str.startswith("Collection")
df["idx"] = mask.cumsum()

# do the actual transformation here: pivot + merge
df = (
    pd.merge(
        df[mask],
        df[~mask].pivot(index="idx", columns="Row Labels", values="Sum"),
        left_on="idx",
        right_index=True,
    )
    .drop(columns=["idx"])
    .fillna(0)
)

# add Total row back
df = pd.concat(
    [
        df,
        pd.DataFrame(
            {"Row Labels": ["Total"], **{c: [df[c].sum()] for c in df.loc[:, "Sum":]}}
        ),
    ]
)

print(df)
```
Prints:

```
    Row Labels  Sum  Data 1  Data 2
0  Collection1   22    10.0    12.0
3  Collection2   33    33.0     0.0
5  Collection3   45    14.0    31.0
0        Total  100    57.0    43.0
```
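As a quick illustration of why a cumulative sum over the header mask works as a grouping key, the sketch below prints the mask next to the resulting group id. The sample frame is rebuilt from the question and the names `body` and `group_id` are illustrative; it assumes each collection header row precedes its children.

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame(
    {
        "Row Labels": ["Collection1", "Data 1", "Data 2", "Collection2", "Data 1",
                       "Collection3", "Data 1", "Data 2", "Total"],
        "Sum": [22, 10, 12, 33, 33, 45, 14, 31, 100],
    }
)

body = df[df["Row Labels"] != "Total"]
mask = body["Row Labels"].str.startswith("Collection")

# True counts as 1 and False as 0, so the running sum increases by one
# at every header row and stays constant across that header's children.
group_id = mask.cumsum()

print(pd.DataFrame({"Row Labels": body["Row Labels"],
                    "is_header": mask,
                    "group_id": group_id}))
```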

2023-08-01

#3 yqyhoc1h

Welcome to SO. In my view, the simplest approach is the following.
Input data:

```python
import pandas as pd

df = pd.DataFrame(columns = ['Row Labels', 'Sum'],
                  data =   [['Collection1', 22],
                            ['Data 1',      10],
                            ['Data 2',      12],
                            ['Collection2', 33],
                            ['Data 1',      33],
                            ['Collection3', 45],
                            ['Data 1',      14],
                            ['Data 2',      31],
                            ['Total',      100]])
print(df)
```

```
    Row Labels  Sum
0  Collection1   22
1       Data 1   10
2       Data 2   12
3  Collection2   33
4       Data 1   33
5  Collection3   45
6       Data 1   14
7       Data 2   31
8        Total  100
```

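(1) Reshape the table so that it can be pivoted

(parse the *Collection* and *Data* information)

```python
# Extract the Collection number into a new column and forward-fill it
# onto the child rows (raw string avoids invalid-escape warnings;
# \d+ also handles collection numbers with more than one digit)
df['Collection'] = df['Row Labels'].str.extract(r"Collection\s*(\d+)", expand=False).ffill()
# Previous version, which assumed collections always appear in natural order
# (dropped following mozway's comment, thank you!):
# df['Collection'] = df['Row Labels'].str.startswith('Coll').cumsum()

# Keep only the 'Data' rows, because pivoting won't work well with redundant information.
df = df[df['Row Labels'].str.contains('Data')]
```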
Reshaped input table:

```
  Row Labels  Sum  Collection
1     Data 1   10           1
2     Data 2   12           1
4     Data 1   33           2
6     Data 1   14           3
7     Data 2   31           3
```

(2) Pivot, then complete the table with the sums in both directions:

```python
pt = pd.pivot_table(data    = df,
                    values  = 'Sum',
                    index   = 'Collection',
                    columns = 'Row Labels',
                    fill_value=0)

# Recalculate the row sums, then the column totals
pt['Sum']       = pt.sum(axis=1)
pt.loc['Total'] = pt.sum(axis=0)
```
Final output:

```
Row Labels  Data 1  Data 2  Sum
Collection                     
1               10      12   22
2               33       0   33
3               14      31   45
Total           57      43  100
```
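Because this approach discards the original Collection and Total rows and recomputes them, it can be worth checking that the recomputed sums agree with what the Excel export reported. The sketch below is one possible check, not part of the answer itself; the frame `raw` is rebuilt from the question, and the names `reported`, `recomputed`, and `mismatches` are illustrative.

```python
import pandas as pd

# Sample data mirroring the table in the question.
raw = pd.DataFrame(
    {
        "Row Labels": ["Collection1", "Data 1", "Data 2", "Collection2", "Data 1",
                       "Collection3", "Data 1", "Data 2", "Total"],
        "Sum": [22, 10, 12, 33, 33, 45, 14, 31, 100],
    }
)

is_parent = raw["Row Labels"].str.match(r"Collection\d+")
is_child = raw["Row Labels"].str.startswith("Data")
# Label every row with the collection header that precedes it.
group = raw["Row Labels"].where(is_parent).ffill()

# Sums as reported on the collection header rows.
reported = raw.loc[is_parent].set_index("Row Labels")["Sum"]
# Sums recomputed from the child rows of each collection.
recomputed = (
    raw.loc[is_child, "Sum"]
    .groupby(group[is_child])
    .sum()
    .reindex(reported.index, fill_value=0)
)

mismatches = reported[reported != recomputed]
grand_total_ok = raw.loc[raw["Row Labels"] == "Total", "Sum"].item() == reported.sum()
print(mismatches.empty, grand_total_ok)
```

If `mismatches` is not empty, some child rows are probably missing from the export, and filling them with 0 would hide that.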

2023-08-01

#4 cngwdvgl

Another possible solution:

```python
import pandas as pd

# True on the collection header rows
s = df['Row Labels'].str.startswith('Collection')

(df.assign(aux = s.cumsum())                       # group id per collection
 .pivot(index='aux', columns='Row Labels', values='Sum')
 .set_axis(df['Row Labels'].loc[s])                # use collection names as index
 .filter(like='Data')                              # keep only the Data columns
 .rename_axis(None, axis=1)
 .fillna(0)
 .pipe(lambda x: pd.concat([x, x.sum().to_frame().T.set_axis(['Total'])]))
 .assign(Sum = lambda x: x.sum(axis=1))            # recompute the row sums
 .reset_index(names = 'Row Labels'))
```
Output:

```
    Row Labels  Data 1  Data 2    Sum
0  Collection1    10.0    12.0   22.0
1  Collection2    33.0     0.0   33.0
2  Collection3    14.0    31.0   45.0
3        Total    57.0    43.0  100.0
```
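The values come out as floats because `pivot` first inserts NaN for the missing Collection2 / Data 2 cell, which forces a float column before `fillna(0)` runs. If integer output is preferred, one option is to cast after filling, as in the sketch below. The sample frame is rebuilt from the question, the name `wide` is illustrative, and the cast assumes all values really are whole numbers.

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame(
    {
        "Row Labels": ["Collection1", "Data 1", "Data 2", "Collection2", "Data 1",
                       "Collection3", "Data 1", "Data 2", "Total"],
        "Sum": [22, 10, 12, 33, 33, 45, 14, 31, 100],
    }
)

s = df["Row Labels"].str.startswith("Collection")

wide = (
    df.assign(aux=s.cumsum())
    # Keep only the child rows so the pivot has just the Data columns.
    .loc[lambda d: d["Row Labels"].str.startswith("Data")]
    .pivot(index="aux", columns="Row Labels", values="Sum")
    .fillna(0)    # the missing cell becomes 0.0 (float)
    .astype(int)  # cast back to integers once no NaN remains
)
print(wide.dtypes)
```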

2023-08-01
