scipy stats与具有列名的数据框的关联

68de4m5k  于 2022-11-10  发布在  其他
关注(0)|答案(1)|浏览(182)

当scipy.stats.spearmanr(df1,df2)用于两个数据框时,您可以将输出值解栈以获得2个numpy相关性和p值数组。如何将它们放回与原始数据框具有相同列名的数据框中?

correlation, p_value = scipy.stats.spearmanr(df1, df2)

生成2个numpy数组,但希望获得具有相关性的 Dataframe ,并且对于x列,p_value与每列相邻,如下所示:

不明显...这里有一些代码来获取随机数据(由ouroboros1建议)

import scipy.stats
import numpy as np
import pandas as pd

# Create array of 5 rows and 3 columns,

# filled with random values from 1 to 10

data = np.random.randint(1,10,size=(5,3))

# Create Dataframe with data

df1 = pd.DataFrame(data, columns=['Col_1','Col_2','Col_3'])

# Set target column in new data frame

df2 = df1['Col_3'].to_frame().copy()
df1.drop(['Col_3'], axis=1, inplace=True)

# Obtain correlation coefficient for each

# value in column   against value in every other column

correlation, p_value = scipy.stats.spearmanr(df1, df2)

(编辑)我想我找到了一条路,但必须有一条更短的路:


# Concat frames that contain features with frame that

# contains target

correlation_frame = pd.concat([df1, df2], axis=1)
col_cor_list = list(correlation_frame)

# Create new frame containing all correlation coefficients

cor = pd.DataFrame(correlation, columns=col_list)
cor
    Col_1     Col_2        Col_3
0   1.000000    -0.883883   -0.229416
1   -0.883883   1.000000    0.081111
2   -0.229416   0.081111    1.000000

# Get a dataframe for p-values

col_p_list = ['p_value' + str(x) for x in
              range(1,len(p_value)+1)]
p_frame = pd.DataFrame(p_value, columns=col_p_list)
p_frame

    p_value1        p_value2    p_value3
0   1.404265e-24    0.046662    0.710482
1   4.666188e-02    0.000000    0.896840
2   7.104817e-01    0.896840    0.000000

# Combine values by alternating column names

# so p-values are placed correctly

alter_names = (list(sum(zip(col_cor_list, col_p_list), ())))
final = cor.join(p_frame)
results = final[alter_names]
results

   Col_1            p_value1       Col_2    p_value2      Col_3 p_value3
0   1.000000    1.404265e-24    -0.883883   0.046662 -0.229416  0.710482
1   -0.883883   4.666188e-02    1.000000    0.000000    0.081111    0.896840
2   -0.229416   7.104817e-01    0.081111    0.896840    1.000000    0.000000
8xiog9wr

8xiog9wr1#

建议的重构版本:

获取相关性和p_value

import scipy.stats
import numpy as np
import pandas as pd

n = 3
data = np.random.randint(1,10,size=(5,n))

# create cols `Col_1`, `_2` etc.

cols = [f'Col_{i+1}' for i in range(n)]
df1 = pd.DataFrame(data, columns=cols)

# create `df2` based on index, and drop

df2 = df1.iloc[:,-1].to_frame().copy()
df1 = df1.iloc[:,:-1]
correlation, p_value = scipy.stats.spearmanr(df1, df2)

变成df

  • 使用np.hstack将相关性和p_value按顺序水平堆叠。
  • 使用sorted对列进行排序。Key由每个列名称末尾的连续数字组成,我们使用rsplit提取它。
res = pd.DataFrame(np.hstack((correlation,p_value)), columns=cols
                   +[f'p_value_{i+1}' for i in range(n)], index=cols)

res = res.loc[:,sorted(res.columns, 
                       key=lambda x: int(x.rsplit('_', maxsplit=1)[1]))]

print(res)

          Col_1  p_value_1     Col_2     p_value_2     Col_3  p_value_3
Col_1  1.000000   0.000000  0.410391  4.925358e-01  0.648886   0.236154
Col_2  0.410391   0.492536  1.000000  1.404265e-24 -0.263523   0.668397
Col_3  0.648886   0.236154 -0.263523  6.683968e-01  1.000000   0.000000

因此,这里唯一的变量是n。例如,对于n = 5,它将产生:

Col_1     p_value_1     Col_2  ...  p_value_4     Col_5  p_value_5
Col_1  1.000000  1.404265e-24 -0.461690  ...   0.552815  0.307794   0.614384
Col_2 -0.461690  4.337662e-01  1.000000  ...   0.334035 -0.763158   0.133339
Col_3  0.666886  2.188940e-01 -0.263158  ...   0.362245  0.500000   0.391002
Col_4  0.359092  5.528147e-01  0.552632  ...   0.000000 -0.157895   0.799801
Col_5  0.307794  6.143840e-01 -0.763158  ...   0.799801  1.000000   0.000000

相关问题