pandas: computing pairwise ratio combinations in a DataFrame

Posted by ua4mk5z4 on 2023-08-01 in Other

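I am trying to compute pairwise ratios within each column of a DataFrame that has several sample columns. Each column has more than 60k values (rows), and I want the ratio for every pair of values in a column. I start with an empty DataFrame and add the ratios on each iteration.
This is the code I have so far: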

from itertools import product

import numpy as np
import pandas as pd

non_norm_data = data.values.T  # turns df into numpy array, row = sample, column = feature/value
df_for_pairwise_ratio = pd.DataFrame()
numerators_df = pd.DataFrame()
for subject in range(np.shape(non_norm_data)[0]):  # loop over subjects
    subject_values = non_norm_data[subject, :]
    for idx, (feature1, feature2) in enumerate(product(subject_values, subject_values[1:])):
        # the first one is the bigger value
        big_small_pair = max(feature1, feature2), min(feature1, feature2)
        ratio = big_small_pair[0] / big_small_pair[1]

        df_for_pairwise_ratio.loc[idx, f"Subject {subject}"] = ratio
        numerators_df.loc[idx, f"Subject {subject}"] = big_small_pair[0]

This takes a very long time and I get memory errors. Is there a way to make it more efficient? Here is a small part of my data:

0        1     2     3
40.96  50.19  30.46 33.17
118.71 55.55  43.56 142.89 
22.67  102.33 8.48  14.56


Thank you.

Answer #1 (dw1jzc5e)

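This needs to be vectorized. First build a higher-dimensional array of pairs (avoiding self-pairs), sort each pair, then do the division in one vectorized operation.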

from io import StringIO
import numpy as np
import pandas as pd

with StringIO("""
     0       1      2      3
 40.96   50.19  30.46  33.17
118.71   55.55  43.56 142.89 
 22.67  102.33   8.48  14.56""") as f:
    data = pd.read_fwf(f)
print(data)

# Make a 3x3x4x2 matrix:
# numerator row, by denominator row, by subject, by numerator-or-denominator
pair_stack = np.stack(
    np.broadcast_arrays(
        data.values[:, np.newaxis, :],
        data.values[np.newaxis, :, :],
    ),
    axis=-1,
)

# Lower-triangular index avoiding self-reference via k=-1 (remove ratio=1 outputs).
# This produces a 3x4x2 matrix.
pairs = pair_stack[np.tril_indices(pair_stack.shape[0], k=-1)]

# Put the denominator first and numerator second on the last axis.
pairs.sort(axis=-1)

# Transpose and unpack to separate denominator and numerator arrays, each 3x4.
denominator, numerator = np.transpose(pairs, (2, 0, 1))

df_for_pairwise_ratio = pd.DataFrame(
    data=numerator / denominator,
    columns=pd.Index(
        name='Subject', data=data.columns,
    ),
)
print(df_for_pairwise_ratio)

# Same as old output, but without ratio=1 rows
with StringIO("""
Subject 0  Subject 1  Subject 2  Subject 3
 2.898193   1.106794   1.430072   4.307808
 1.806793   2.038852   3.591981   2.278159
 5.236436   1.842124   5.136792   9.813874
""") as f:
    expected_df = pd.read_fwf(f)
assert np.allclose(expected_df.values, df_for_pairwise_ratio.values)
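
The code above materialises every row pair in an n x n x subjects x 2 array before selecting the lower triangle, which may itself be too large for 60k rows. As a rough sketch of a leaner variant (not part of the original answer), the pairs can be indexed directly with `np.tril_indices`, and `np.maximum` / `np.minimum` can stand in for the sort. The names `pairwise_ratios`, `iter_ratio_blocks` and `block_rows` are hypothetical helpers introduced only for this illustration, and the block size is an assumption to tune.

```python
import numpy as np
import pandas as pd


def pairwise_ratios(values):
    """Return (ratios, numerators) for every unordered row pair, per column."""
    # Row indices (i, j) with i > j; k=-1 leaves out the self-pairs.
    i, j = np.tril_indices(values.shape[0], k=-1)
    a, b = values[i], values[j]        # each has shape (n_pairs, n_subjects)
    numerators = np.maximum(a, b)      # larger value of each pair
    denominators = np.minimum(a, b)    # smaller value of each pair
    return numerators / denominators, numerators


def iter_ratio_blocks(values, block_rows=1000):
    """Yield (ratios, numerators) for a few 'i' rows at a time."""
    n = values.shape[0]
    for start in range(0, n, block_rows):
        stop = min(start + block_rows, n)
        # Local row r stands for global row i = start + r; keep columns j < i.
        r, j = np.tril_indices(stop - start, k=start - 1, m=stop)
        a, b = values[r + start], values[j]
        numerators = np.maximum(a, b)
        yield numerators / np.minimum(a, b), numerators


# The sample from the question: rows are features, columns are subjects.
data = pd.DataFrame(
    [[40.96, 50.19, 30.46, 33.17],
     [118.71, 55.55, 43.56, 142.89],
     [22.67, 102.33, 8.48, 14.56]]
)
values = data.to_numpy()

ratios, numerators = pairwise_ratios(values)
columns = [f"Subject {c}" for c in data.columns]
df_for_pairwise_ratio = pd.DataFrame(ratios, columns=columns)
numerators_df = pd.DataFrame(numerators, columns=columns)
print(df_for_pairwise_ratio)

# Block-wise: reduce or write out each block instead of keeping them all.
for block_ratios, block_numerators in iter_ratio_blocks(values, block_rows=2):
    print(block_ratios.shape)
```

For scale: 60,000 rows give 60,000 x 59,999 / 2 = 1,799,970,000 pairs per column, which is about 14.4 GB per column as float64. The full result is therefore unlikely to fit in memory however it is computed, so the block-wise generator is probably the more practical route: aggregate each block or write it to disk as it is produced.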

