pandas: fastest way to get the unique values of DataFrame columns as a new DataFrame

Posted by ncgqoxb0 on 2022-12-25 in Other

What is the best way to transform data like this:

| col1 | col2 | ... col400
|  tes | abc  |      max
|  tes | onet |      ups

into this:

Index | col | unique
  1   | col1| tes
  2   | col2| abc
  3   | col2| onet
  ...    
  639 | col400| max
  649 | col400| ups

pandas

Source: https://stackoverflow.com/questions/51253712/fastest-way-to-get-unique-values-for-my-dataframe-columns-as-a-new-dataframe

2 answers

oxcyiej71#

I think you need to add an extra index, because otherwise each column could contribute only one row.
You are probably looking for DataFrame.unstack(..). For example:

>>> import pandas as pd
>>> df = pd.DataFrame([['tes', 'abc', 'max'], ['tes', 'onet', 'ups']], columns=["col1", "col2", "col400"])
>>> df
  col1  col2 col400
0  tes   abc    max
1  tes  onet    ups
>>> df.unstack()
col1    0     tes
        1     tes
col2    0     abc
        1    onet
col400  0     max
        1     ups
dtype: object

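This can be combined with .reset_index() to introduce an index of unique IDs plus two columns, one for the column name (level_0) and one for the original row number (level_1). For example: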

>>> df.unstack().reset_index()
  level_0  level_1     0
0    col1        0   tes
1    col1        1   tes
2    col2        0   abc
3    col2        1  onet
4  col400        0   max
5  col400        1   ups

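Or, keeping only the column name, renaming the columns, and using a 1-based index as in the question: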

df = (df.unstack()
      .reset_index(level=0)
      .rename(columns={'level_0':'col',0:'unique'})
      .reset_index(drop=True))

df.index += 1
print(df)

#      col unique
#1    col1    tes
#2    col1    tes
#3    col2    abc
#4    col2   onet
#5  col400    max
#6  col400    ups
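
Note that the output above still contains the repeated (col1, tes) row, whereas the question asks for unique values per column. A minimal sketch of one way to remove such repeats, assuming that "unique" means distinct (column, value) pairs, is to add DataFrame.drop_duplicates() to the same chain. The name long_df is only illustrative:

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame(
    [["tes", "abc", "max"], ["tes", "onet", "ups"]],
    columns=["col1", "col2", "col400"],
)

long_df = (
    df.unstack()                      # Series indexed by (column name, row number)
    .reset_index(level=0)             # move the column name into a regular column
    .rename(columns={"level_0": "col", 0: "unique"})
    .drop_duplicates()                # keep one row per distinct (col, unique) pair
    .reset_index(drop=True)
)
long_df.index += 1                    # 1-based index, as in the question
print(long_df)
```

On the sample data this leaves five rows, because the second (col1, tes) row is dropped.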


dw1jzc5e2#

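With realistically sized data, performance matters, and you may prefer melt over unstack: in the example below, melt is about 2.5x faster and the syntax is simpler.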

Suppose we have the following data:

df = pd.DataFrame({f"col{i}": range(100_000) for i in range(400)})
df.shape
# (100000, 400)

Performance of melt:

%%timeit
df.melt(var_name="col", value_name="unique")
# 857 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Performance of unstack:

%%timeit
(
    df.unstack()
    .reset_index(level=0)
    .rename(columns={"level_0": "col", 0: "unique"})
    .reset_index(drop=True)
)
# 2.15 s ± 8.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
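
Both timed snippets reshape the frame but keep repeated values, and in the benchmark data every column already holds distinct values. As a rough sketch on the small sample frame from the question (not on the benchmark data, and with no timing claim), melt can be followed by drop_duplicates to keep only distinct (column, value) pairs. The name unique_df is only illustrative, and ignore_index=True assumes pandas 1.0 or later:

```python
import pandas as pd

# Small sample frame from the question.
df = pd.DataFrame(
    [["tes", "abc", "max"], ["tes", "onet", "ups"]],
    columns=["col1", "col2", "col400"],
)

unique_df = (
    df.melt(var_name="col", value_name="unique")  # long format: one row per cell
    .drop_duplicates(ignore_index=True)           # one row per distinct (col, unique) pair
)
unique_df.index += 1                              # 1-based index, as in the question
print(unique_df)
```

How much the deduplication step costs on large frames depends on the data, so it is worth timing on your own input.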
