Python pandas: set_index() and unstack() produce columns with underscores in Hive, but pivot_table() works

Posted by 3xiyfsfu on 2021-06-26 in Hive

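This is related to a question I asked earlier: Python DataFrame pivot only works with pivot_table(), not with set_index() and unstack().
I have been able to pivot the sample data below successfully with both approaches: set_index() with unstack(), and pivot_table() with the aggfunc='first' argument.
Sample data: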

id  responseTime    label   answers
ABC 2018-06-24  Category_1  [3]
ABC 2018-06-24  Category_2  [10]
ABC 2018-06-24  Category_3  [10]
DEF 2018-06-25  Category_1  [7]
DEF 2018-06-25  Category_8  [10]
GHI 2018-06-28  Category_3  [7]

Expected output:

id  responseTime    category_1  category_2 category_3 category_8
ABC  2018-06-24           [3]     [10]         [10]       NULL
DEF  2018-06-25           [7]     NULL         NULL       [10]
GHI  2018-06-28           NULL    NULL         [7]        NULL

Code:


# this works but having issues with reset_index so leaving it here as comment.
# df=pdDF.pivot_table(index=['items_id','responseTime'], columns='label', values='answers', aggfunc='first')
df=pdDF.set_index(['items_id','responseTime','label']).unstack('label')
# reset the index so all columns can be preserved for table creation
df.reset_index(inplace=True)
# create pyspark dataframe from pandas dataframe after pivoting is done.
psDF=spark.createDataFrame(df)
# create hive table
psDF.write.mode('overwrite').saveAsTable('default.test_table')

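When I use the second statement, with set_index() and unstack(), the resulting DataFrame prints with an additional header level, answers. This causes problems with the column names when I create a Hive table from the DataFrame.
DataFrame header before reset_index():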

answers
id  responseTime    category_1  category_2 category_3 category_8

DataFrame columns after reset_index():

('items_id', '')|('responseTime', '')|('answers', u'category_1')|('answers', u'category_2')|('answers', u'category_3')|('answers', u'category_8')

Hive column names:

_'items_id'_''_     
_'responsetime'_''_
_'answers'_u'category_1'_
_'answers'_u'category_2'_
_'answers'_u'category_3'_
_'answers'_u'category_8'_

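I believe this happens because unstack() creates hierarchical columns with multiple levels. Is there a way to make the answers level disappear, and to remove these junk underscore characters and answers references in the DataFrame itself, so that I can create normal Hive columns?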
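The hierarchy described above can be reproduced without Spark or Hive. The following is a minimal sketch, assuming a pandas DataFrame named pdDF built by hand from the sample rows (the construction is illustrative and not part of the original post). It shows that unstack() on a DataFrame keeps the value column name as an outer column level, and that reset_index() then yields tuple-valued column names.

```python
import pandas as pd

# Hand-built stand-in for the question's sample data (assumption: the real
# pdDF has the same four columns; answers are kept as one-element lists).
pdDF = pd.DataFrame({
    "items_id": ["ABC", "ABC", "ABC", "DEF", "DEF", "GHI"],
    "responseTime": ["2018-06-24", "2018-06-24", "2018-06-24",
                     "2018-06-25", "2018-06-25", "2018-06-28"],
    "label": ["category_1", "category_2", "category_3",
              "category_1", "category_8", "category_3"],
    "answers": [[3], [10], [10], [7], [10], [7]],
})

# Unstacking a DataFrame (not a Series) keeps 'answers' as the outer level
# of a two-level column MultiIndex.
wide = pdDF.set_index(["items_id", "responseTime", "label"]).unstack("label")
print(wide.columns.nlevels)
print(wide.columns.tolist())

# After reset_index(), every column name is a tuple; the former index
# columns get an empty string as their second element.
tuple_columns = wide.reset_index().columns.tolist()
print(tuple_columns)
```

These tuples are what later get rendered into the underscore-laden Hive column names.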

Hive python DataFrame pandas pivot-table

Source: https://stackoverflow.com/questions/52413585/python-pandas-set-index-and-unstack-results-in-columns-with-underscores-in-hiv

1 Answer

a0x5cqrl1#

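Answering my own question.
I was able to remove the topmost level from the DataFrame's columns with the droplevel() function.
Immediately after set_index() and unstack(), I can add the following line to drop the answers level from the DataFrame.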

df.columns = df.columns.droplevel(0)

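After this, reset_index() can be called to preserve all the columns in the DataFrame, as in the code above.
My DataFrame columns and Hive columns no longer contain the level information with underscores.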

|items_id|responseTime|category_1|category_2|category_3|category_8|
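Put together, the steps in this answer might look like the sketch below. The hand-built pdDF is an assumption standing in for the real data, and the Spark and Hive calls are omitted so the snippet runs with pandas alone. The second variant, which selects the answers column as a Series before unstacking, is not from the answer; it is an alternative that avoids creating the extra level in the first place.

```python
import pandas as pd

# Hand-built stand-in for the question's sample data (illustrative only).
pdDF = pd.DataFrame({
    "items_id": ["ABC", "ABC", "ABC", "DEF", "DEF", "GHI"],
    "responseTime": ["2018-06-24", "2018-06-24", "2018-06-24",
                     "2018-06-25", "2018-06-25", "2018-06-28"],
    "label": ["category_1", "category_2", "category_3",
              "category_1", "category_8", "category_3"],
    "answers": [[3], [10], [10], [7], [10], [7]],
})

# Approach from the answer: unstack, drop the outer 'answers' level,
# then reset the index so the key columns become ordinary columns.
df = pdDF.set_index(["items_id", "responseTime", "label"]).unstack("label")
df.columns = df.columns.droplevel(0)
df.columns.name = None  # optional: clear the leftover 'label' axis name
df = df.reset_index()
print(list(df.columns))

# Alternative (not from the answer): unstacking a Series instead of a
# DataFrame never creates the outer level.
alt = (
    pdDF.set_index(["items_id", "responseTime", "label"])["answers"]
    .unstack("label")
    .reset_index()
)
print(list(alt.columns))
```

Either result has plain string column names, so it can be passed to spark.createDataFrame() as in the question's code.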

Additional references for droplevel() are available at:
Question: Pandas: drop a level from a multi-level column index?
pandas API: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.multiindex.droplevel.html#pandas.multiindex.droplevel

Answered on 2021-06-26
