在Pandas中按组分配和重用基于分位数的存储桶

qco9c6ql  于 2023-03-28  发布在  其他
关注(0)|答案(3)|浏览(110)

我想做的事

我有一个长格式的pandas DataFrame,包含不同组的值。我想计算并应用基于分位数的装箱(例如,本例中的五分位数)到DataFrame的每个组。
我还需要能够保留每个组的bin边缘,并将相同的标签(通过pd.cut)应用于新的DataFrame。
例如,对于每个组,找到五分位数并将其分配给新的列value_label

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"group": "A", "val": np.random.normal(loc=10, scale=5, size=100)})
df2 = pd.DataFrame({"group": "B", "val": np.random.normal(loc=5, scale=3, size=100)})
df = pd.concat([df1, df2], ignore_index=True)

# apply qcut
labels_and_bins = df.groupby("group")["val"].apply(
    lambda x: pd.qcut(x, q=5, duplicates="drop", retbins=True)
)

# where e.g.
labels_and_bins["A"][0]  # are the applied labels to all the rows in group "A"
labels_and_bins["A"][1]  # are the bin edges to apply the same segmentation going forward

for group in df.group.unique():
    df.loc[df["group"] == group, "value_label"] = labels_and_bins[group][0]

当我尝试运行它时,在第二次迭代时,我得到了以下错误:TypeError: Cannot set a Categorical with another, without identical categories
因此,本质上我需要Pandas接受扩展属于列dtype的类别。

我考虑过的

变形

使用.transform()可能会解决在第一个DataFrame上分配标签的问题,但我不清楚如何在未来的迭代中重用已标识的bin

联合分类dtype

我尝试了两种方法:

add_categories()

labels_and_bins['A'][0].cat.add_categories(labels_and_bins['B'][0].cat.as_unordered())
结果为ValueError: Categorical categories must be unique

union_categoricals()

pd.api.types.union_categoricals(
    [labels_and_bins["A"][0].cat.as_unordered(), labels_and_bins["B"][0].cat.as_unordered()].get_inde
)

结果为InvalidIndexError: cannot handle overlapping indices; use IntervalIndex.get_indexer_non_unique

一个解决方案

通过调用不带标签的qcut来摆脱Interval对象,例如:

labels_and_bins = df.groupby("group")["val"].apply(
    lambda x: pd.qcut(x, q=5, duplicates="drop", retbins=True, labels=False)
)

然而,如果可能的话,为了更好的解释性,我会对保持间隔的方法感兴趣

总的来说,这感觉像是一个大的反模式,所以我相信我错过了这个问题的一个更基本的解决方案!

提前感谢您的输入!

vlju58qv

vlju58qv1#

你可以使用groupby(...).quantile来获取你的bins。获取标签是一个棘手的部分,如果你想拥有cutqcut返回的相同类型的标签,你可以将这个结果转换为pandas.arrays.IntervalArray,然后从那里抓取左边缘。

bins = (
    df.groupby('group')['val']
    .quantile([0, .2, .4, .6, .8, 1])

    # all of the below is to get those silly labels
    .groupby('group').apply(lambda s: 
        pd.Series(pd.arrays.IntervalArray.from_breaks(s))
    )
    .to_frame('labels')
    .assign(breaks=lambda d: d['labels'].array.left)
)

print(bins)
                                           labels     breaks
group                                                       
A     0   (1.0537779249398262, 6.012235091686201]   1.053778
      1    (6.012235091686201, 8.290140550780626]   6.012235
      2   (8.290140550780626, 10.486391253124191]   8.290141
      3  (10.486391253124191, 14.187164706356482]  10.486391
      4   (14.187164706356482, 30.41871778451676]  14.187165
B     0   (-2.223397003597773, 2.929644956582803]  -2.223397
      1    (2.929644956582803, 4.205189382479166]   2.929645
      2    (4.205189382479166, 5.512744579486252]   4.205189
      3   (5.512744579486252, 6.9496607395781504]   5.512745
      4  (6.9496607395781504, 11.247907758459675]   6.949661

# Note that the 'labels' columns are not actually a bunch of strings,
#   but is a `Series` wrapped around the underlying `IntervalArray`
#   you can access the `IntervalArray` via the `.array` property

在这里,您可以使用pd.merge_asof将您的bin与其他数据集对齐,甚至可以返回到原始的DataFrame

# all of the sorting is required for `merge_asof`
result = pd.merge_asof(
    df.sort_values('val'), bins.sort_values('breaks'),
    left_on='val', right_on='breaks', by='group'
)

print(result.sample(8, random_state=0))
    group        val                                    labels     breaks
18      B   2.323406   (-2.223397003597773, 2.929644956582803]  -2.223397
170     A  12.238075  (10.486391253124191, 14.187164706356482]  10.486391
107     A   6.680556    (6.012235091686201, 8.290140550780626]   6.012235
98      A   6.499656    (6.012235091686201, 8.290140550780626]   6.012235
177     A  13.156868  (10.486391253124191, 14.187164706356482]  10.486391
182     A  14.734745   (14.187164706356482, 30.41871778451676]  14.187165
5       B   0.227872   (-2.223397003597773, 2.929644956582803]  -2.223397
146     A   9.562343   (8.290140550780626, 10.486391253124191]   8.290141
eni9jsuy

eni9jsuy2#

如果我理解正确的话,你可以用chain来实现:

import numpy as np
import pandas as pd
from itertools import chain

df1 = pd.DataFrame({"group": "A", "val": np.random.normal(loc=10, scale=5, size=100)})
df2 = pd.DataFrame({"group": "B", "val": np.random.normal(loc=5, scale=3, size=100)})
df = pd.concat([df1, df2], ignore_index=True)

# apply qcut
labels_and_bins = df.groupby("group")["val"].apply(
    lambda x: pd.qcut(x, q=5, duplicates="drop", retbins=True)
)

df['labels'] = list(chain.from_iterable([x[0].values for x in labels_and_bins]))
gpnt7bae

gpnt7bae3#

我认为你想达到的目标是:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"group": "A", "val": np.random.normal(loc=10, scale=5, size=100)})
df2 = pd.DataFrame({"group": "B", "val": np.random.normal(loc=5, scale=3, size=100)})
df = pd.concat([df1, df2], ignore_index=True)

labels_and_bins = df.groupby("group")["val"].apply(
    lambda x: pd.cut(x, bins=5, duplicates="drop", retbins=True, labels=False)
)

bins = labels_and_bins.apply(lambda x: x[1])

for group in df.group.unique():
    df.loc[df["group"] == group, "value_label"] = pd.cut(
        df.loc[df["group"] == group, "val"], bins=bins[group], labels=False
    )

new_df = pd.DataFrame({"group": ["A", "B"], "val": [11, 3]})
for group in new_df.group.unique():
    new_df.loc[new_df["group"] == group, "value_label"] = pd.cut(
        new_df.loc[new_df["group"] == group, "val"], bins=bins[group]
    )

print(df)
print(new_df)

这给予了你原始的df

group        val  value_label
0       A  14.365754          3.0
1       A  10.620769          2.0
2       A  20.730151          4.0
3       A   0.947749          0.0
4       A   6.483682          2.0
..    ...        ...          ...
195     B  11.636778          4.0
196     B  12.288265          4.0
197     B   8.188767          3.0
198     B   7.419500          3.0
199     B   1.622162          1.0

[200 rows x 3 columns]

新的一个:

group  val       value_label
0     A   11  (10.867, 15.798]
1     B    3    (1.512, 4.243]

相关问题