pandas数据分析根据多列中的值比较两个不同的行

nc1teljy  于 2023-08-01  发布在  其他
关注(0)|答案(2)|浏览(129)

我有一个dataframe,看起来像这样:

df
   col1  col2  col3
0     1   "A"    10
1     1   "B"    20
2     1   "C"    30
...
n     k   "A"    15
n+1   k   "B"    10
n+2   k   "C"     5

字符串
我想比较col3值与col2的特定值匹配的col1的行之间的col3值(“A”与“B”,“A”与“C”)。
假设我生成了一个结果数据分析数据框架,它看起来像这样:

da_df
       col1    col2
0    "AvsB"    33.3  #10/30*100
1    "AvsC"    50.0  #10/20*100
...
2k   "AvsB"     150  #15/10*100
2k+1 "AvsC"     300  #15/5*100


我怎么能做到没有for循环?

w1e3prcc

w1e3prcc1#

您可以pivot,然后循环combinations列:

from itertools import combinations

comb = combinations(df['col2'].unique(), r=2)

tmp = df.pivot(index='col1', columns='col2', values='col3')

out = (pd.concat({f'{x}vs{y}': tmp[x].div(tmp[y]).mul(100)
                  for x, y in comb},
                 names=['combination'])
         .reset_index('combination', name='value')
         .sort_index()
      )

字符串
输出量:

combination       value
col1                        
1           AvsB   50.000000
1           AvsC   33.333333
1           BvsC   66.666667
k           AvsB  150.000000
k           AvsC  300.000000
k           BvsC  200.000000

限制特定组合

请注意,如果您只对特定条件感兴趣,您可以手动输入:

comb = [('A', 'B'), ('A', 'C')]

tmp = df.pivot(index='col1', columns='col2', values='col3')

out = (pd.concat({f'{x}vs{y}': tmp[x].div(tmp[y]).mul(100) for x,y in comb},
                 names=['combination'])
         .reset_index('combination', name='value')
         .sort_index()
      )


输出量:

combination       value
col1                        
1           AvsB   50.000000
1           AvsC   33.333333
k           AvsB  150.000000
k           AvsC  300.000000

af7jpaap

af7jpaap2#

类似于mozway's的答案。但我认为我的代码更简单,应该是高效的。

import pandas as pd
import numpy as np

# Create Col1 by replicating 5 cases 3 times each
col1_values = np.repeat(['Case1', 'Case2', 'Case3', 'Case4', 'Case5'], 3)

# Create Col2 with 3 different observations for each value of Col1
col2_values = np.tile(['A', 'B', 'C'], 5)

# Create Col3 as numeric values
col3_values = np.random.rand(15)*100

# Create the DataFrame
df = pd.DataFrame({'col1': col1_values, 'col2': col2_values, 'col3': col3_values})
df1 = pd.pivot_table(df, values = ['col3'], index = ['col1'], columns = ['col2'], aggfunc= 'mean').reset_index()
df1.columns = df1.columns.map('_'.join).str.strip('|')
df1 = df1.assign(AvsC = lambda x: x['col3_A']/x['col3_C'])
df1 = df1.assign(BvsC = lambda x: x['col3_B']/x['col3_C'])

print(df1)

col1_     col3_A     col3_B     col3_C      AvsC      BvsC
0  Case1  39.101404  28.564307  69.350939  0.563819  0.411881
1  Case2  19.903692  50.989200  94.227183  0.211231  0.541130
2  Case3  14.830127  28.185849  14.900414  0.995283  1.891615
3  Case4  29.918815  36.282592  84.880638  0.352481  0.427454
4  Case5  74.119780  95.239943  99.377965  0.745837  0.958361

字符串

相关问题