scipy: Combining rows and columns to create a 2x2 table for Fisher's exact test

bq3bfh9z, posted 11 months ago in Other

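I need to use Python to perform a test of independence on the following crosstab ct:

[Image: the crosstab ct]
Because some of the values are less than 5, I cannot perform a chi-squared test of independence. Instead, I need to perform Fisher's exact test.
Since SciPy's implementation of Fisher's exact test only supports 2x2 tables, I implemented the following solution: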

import numpy as np
from scipy.stats import fisher_exact

# Combine rows and columns to create a 2x2 table
# (each cell is the sum of one 2x2 block of ct)
table_2x2 = np.array([[ct[1][4] + ct[2][4] + ct[1][3] + ct[2][3], ct[3][4] + ct[4][4] + ct[3][3] + ct[4][3]],
                      [ct[1][2] + ct[2][2] + ct[1][1] + ct[2][1], ct[3][2] + ct[4][2] + ct[3][1] + ct[4][1]]])

# Perform Fisher's exact test on the 2x2 table
odds_ratio, p_value = fisher_exact(table_2x2)

# Display the results
print(f'Odds Ratio: {odds_ratio}')
print(f'P-value: {p_value}')

Is this a valid solution? If not, is there any other suggestion for implementing it in Python?
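
For illustration, the same kind of merge can be written with NumPy slicing instead of cell-by-cell sums. This is only a sketch: the helper `collapse_to_2x2` and the split positions are hypothetical, the counts are the ones used in the answers below, and the row/column orientation of `ct` is an assumption.

```python
import numpy as np
from scipy.stats import fisher_exact

# Counts taken from the answers below; the orientation relative to `ct` is assumed.
table = np.asarray([[20, 49, 25, 4],
                    [35, 54, 43, 12],
                    [27, 44, 29, 8],
                    [7, 20, 16, 4]])

def collapse_to_2x2(table, row_split, col_split):
    """Hypothetical helper: sum the four blocks obtained by splitting
    the rows at `row_split` and the columns at `col_split`."""
    table = np.asarray(table)
    return np.array([
        [table[:row_split, :col_split].sum(), table[:row_split, col_split:].sum()],
        [table[row_split:, :col_split].sum(), table[row_split:, col_split:].sum()],
    ])

collapsed = collapse_to_2x2(table, row_split=2, col_split=2)
odds_ratio, p_value = fisher_exact(collapsed)
print(collapsed)
print(f'Odds Ratio: {odds_ratio}')
print(f'P-value: {p_value}')
```

Note that merging categories tests a coarser hypothesis than independence in the full table, and the result depends on where the splits are placed.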

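Answer 1 (alen0pnh)

"If not, is there any other suggestion for implementing it in Python?"
If you are open to a randomized permutation test, you can build your own test with scipy.stats.permutation_test. We will use the same test statistic as scipy.stats.chi2_contingency, but the null hypothesis will be similar to that of Fisher's exact test.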

import numpy as np
from scipy import stats

table = np.asarray([[20, 49, 25, 4], 
                    [35, 54, 43, 12], 
                    [27, 44, 29, 8], 
                    [7, 20, 16, 4]])

# perform chi-squared test as a sanity check
ref = stats.chi2_contingency(table)

def untab(table):
    # convert 2d contingency table to two (paired) samples
    # e.g. you have 20 pairs of (0, 0), 49 pairs of (0, 1)...
    x = []
    y = []
    m, n = table.shape
    for i in range(m):
        for j in range(n):
            count = table[i, j]
            x += [i]*count
            y += [j]*count
    return np.asarray(x), np.asarray(y)

x, y = untab(table)

def statistic(x):
    # Given one of the samples, compute the chi-squared statistic.
    # `permutation_test` will pass in random permutations of `x`,
    # and this will compute the statistic for each. This gives us
    # a sense of the distribution of the statistic under the null
    # hypothesis of no association.
    table = stats.contingency.crosstab(x, y).count
    return stats.chi2_contingency(table).statistic

res = stats.permutation_test((x,), statistic, alternative='greater', 
                             permutation_type='pairings')

print(res.pvalue, ref.pvalue)  # 0.6592 0.6500840391351904

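For the contingency table shown in the original post, the p-value hardly differs from that of the chi-squared test. Although some of the counts in the table are small, the null distribution of our randomized permutation test appears to be very similar to the chi-squared distribution with the appropriate degrees of freedom: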

import matplotlib.pyplot as plt
plt.hist(res.null_distribution, bins=30, density=True, label='normalized histogram')

# see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html
# for degrees of freedom
df = table.size - sum(table.shape) + table.ndim - 1
dist = stats.chi2(df)
x = np.linspace(0, 40, 300)
plt.plot(x, dist.pdf(x), label='chi2')
plt.legend()


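[Image: normalized histogram of the permutation null distribution with the chi-squared PDF overlaid]
For more on the theory (er, intuition) behind the test, see the SciPy tutorial on Resampling and Monte Carlo Methods, specifically section 2c, Correlated Sample Permutation Tests.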
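
As a complementary check, the usual rule of thumb for the chi-squared approximation is commonly stated in terms of expected frequencies rather than observed counts. The sketch below, which is not part of the original answer, uses scipy.stats.contingency.expected_freq to count how many expected frequencies fall below 5 for this table.

```python
import numpy as np
from scipy import stats

table = np.asarray([[20, 49, 25, 4],
                    [35, 54, 43, 12],
                    [27, 44, 29, 8],
                    [7, 20, 16, 4]])

# Expected counts under independence: outer(row sums, column sums) / total
expected = stats.contingency.expected_freq(table)

# Count the cells whose expected frequency is below the conventional threshold of 5
n_small = int((expected < 5).sum())

print(np.round(expected, 2))
print(f"{n_small} of {expected.size} expected counts are below 5 "
      f"(minimum {expected.min():.2f})")
```

Only a single cell falls below the threshold here, which fits the close agreement between the permutation and asymptotic p-values.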

Answer 2 (ycggw6v2)

Here is another answer that occurred to me: instead of a permutation test, we can perform a Monte Carlo test, which is conceptually simpler.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

table = np.asarray([[20, 49, 25, 4], 
                    [35, 54, 43, 12], 
                    [27, 44, 29, 8], 
                    [7, 20, 16, 4]])

# Get distribution of contingency tables under the null
# hypothesis of no association
rowsums, colsums = stats.contingency.margins(table)
dist = stats.random_table(rowsums.ravel(), colsums.ravel())

# Monte Carlo null distribution: compute statistic for contingency
# tables randomly sampled under the null hypothesis
n = 9999
null_distribution = np.empty(n, dtype=float)
for i in range(n):
    resampled = stats.chi2_contingency(dist.rvs())
    null_distribution[i] = resampled.statistic

# Compare observed statistic against Monte Carlo null distribution
observed = stats.chi2_contingency(table)
count_extreme = (null_distribution >= observed.statistic).sum()
pvalue = (count_extreme + 1)/(n + 1)  # 0.6534

# Plot Monte Carlo null distribution against asymptotic approximation
plt.hist(null_distribution, bins=30, density=True, label='Monte Carlo')
df = table.size - sum(table.shape) + table.ndim - 1
dist = stats.chi2(df)
x = np.linspace(0, 40, 300)
plt.plot(x, dist.pdf(x), label='Asymptotic')
plt.legend()
plt.title("Null Distribution of Chi2 Test")

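[Image: Monte Carlo null distribution histogram with the asymptotic chi-squared PDF overlaid, titled "Null Distribution of Chi2 Test"]
There is very good agreement between the p-values, and the null distributions of the asymptotic test and the Monte Carlo test appear to match.
Assuming you are interested in small p-values, a more useful plot is one of the survival function probabilities against one another. This tells you whether the asymptotic test is too conservative or not conservative enough near the significance threshold (for similar contingency tables):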

ecdf = stats.ecdf(null_distribution)
q = ecdf.sf.quantiles[::-1]
prob_mc = ecdf.sf.probabilities[::-1]
prob_asymp = dist.sf(q)
plt.plot(prob_mc, prob_asymp)
plt.xlabel("MC Null Distribution Survival Probability")
plt.ylabel("Asymptotic Null Distribution Survival Probability")
plt.plot([0, 1], [0, 1], '--')
plt.xlim(0, 0.1)
plt.ylim(0, 0.1)



They are very close, so for this contingency table (at least), the asymptotic chi-squared test is quite safe.
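
A possible variation on the Monte Carlo test above, sketched under a few assumptions: the tables are drawn in one call with rvs(size=...), a seeded generator is passed for reproducibility, and the chi-squared statistic is computed directly with NumPy rather than by calling chi2_contingency in a loop. The helper `chi2_statistic` and the seed value are illustrative choices, not part of the original answer.

```python
import numpy as np
from scipy import stats

table = np.asarray([[20, 49, 25, 4],
                    [35, 54, 43, 12],
                    [27, 44, 29, 8],
                    [7, 20, 16, 4]])

def chi2_statistic(tables, expected):
    # Pearson chi-squared statistic, summed over the last two axes so that
    # it works for one table or for a stack of tables.
    return ((tables - expected) ** 2 / expected).sum(axis=(-2, -1))

rowsums, colsums = stats.contingency.margins(table)
# Every sampled table has the same margins, so the expected counts are shared.
expected = stats.contingency.expected_freq(table)

rng = np.random.default_rng(12345)  # arbitrary seed, for reproducibility only
n = 9999
null_dist = stats.random_table(rowsums.ravel(), colsums.ravel(), seed=rng)
null_tables = null_dist.rvs(size=n)  # shape (n, 4, 4)

null_stats = chi2_statistic(null_tables, expected)
observed_stat = chi2_statistic(table, expected)

# Same p-value convention as above: count the observed table as one draw
pvalue = ((null_stats >= observed_stat).sum() + 1) / (n + 1)
print(pvalue)
```

Because the statistic is computed without a continuity correction, it matches chi2_contingency for tables larger than 2x2 such as this one.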
