R中的For循环对所有列运行卡方检验(_S)

gmxoilav  于 2022-12-06  发布在  其他
关注(0)|答案(3)|浏览(167)
head(Data)
                       
          Gene1 Gene2 Gene3 Gene4 Gene5 Gene6 Gene7 Gene8 
  SAMPLE1   0    1      0      0     0     1   1     1          
  SAMPLE2   0    1      0      1     0     1   0     1        
  SAMPLE3   0    0      0      1     0     0   1     1          
  SAMPLE4   1    0      0      0     1     0   0     0

我正在运行fisher检验和卡方检验,如下所示

Gene7_Gene8<- table (Data[, c ("Gene7", "Gene8")])
chisq.test( Gene7_Gene8) and 
fisher.test(Gene7_Gene8)

我怎样才能在所有的数据集中同时运行for loop的chisq检验呢?我正在寻找我的数据集中每一对可能的基因的p值。(我的原始数据有大约60列)

w51jfk4q

w51jfk4q1#

我希望这对你有用。

# This is a fake dataset similar to the one in the Opening post:
df <- data.frame(
  
  gene1 = sample(c(0,1),10,T),
  gene2 = sample(c(0,1),10,T),
  gene3 = sample(c(0,1),10,T),
  gene4 = sample(c(0,1),10,T),
  gene5 = sample(c(0,1),10,T)
)
rownames(df) <- paste0("sample",1:nrow(df))


# ldf stands for long df, and is a longer version, which will make it easier to analysie.
ldf <- df %>% pivot_longer(cols = everything(),names_to = "gene_type",
                           values_to = "true_false") %>%
  arrange(gene_type)
# ndf stands for nested df, and is the already nested-by-genes data frame we will analyse
ndf <- ldf %>% nest_by(gene_type)

# Here we create a new colum named p_value and in the map() function
# we extract the p values from a chiq.test erformed on each gene
ndf$p_value <- map(ndf$data,~chisq.test(table(.x))$p.value) %>% unlist()

# Now we have our final data frame and can further anlyse and create plots.
final_df <- ndf %>% unnest(cols = data)
final_df %>% group_by(gene_type) %>% summarise(p.value=mean(p_value)) %>% 
  mutate(is_significant=p.value<0.05)

# In my personal opinion, leaving the data frame in a long version is best suited for further anlysis.

final_df %>% group_by(gene_type) %>% summarise(p.value=mean(p_value)) %>% 
  mutate(is_significant=p.value<0.05) %>% 
  ggplot(aes(x=gene_type,y=p.value,color=is_significant))+
  geom_text(aes(label=gene_type),show.legend = F)+
  labs(title = "P values of different genes")+
  theme_minimal()
nzk0hqpo

nzk0hqpo2#

我们可以使用combn

lst1 <- combn(seq_len(ncol(Data)), 2, function(i) 
   FUN = chisq.test(table(Data[, i])), simplify = FALSE)
names(lst1) <- combn(colnames(Data), 2, FUN = paste, collapse = "_")
ve7v8dk2

ve7v8dk23#

不需要for循环,我们可以这样做:

com <- data.frame(t(combn(colnames(Data), 2)))
com
#       X1    X2
# 1  Gene1 Gene2
# 2  Gene1 Gene3
# 3  Gene1 Gene4
# 4  Gene1 Gene5
# 5  Gene1 Gene6
# 6  Gene1 Gene7
# 7  Gene1 Gene8
# 8  Gene2 Gene3
# ...

chisq <- apply(com, 1, function(i) chisq.test(table(Data[,i]))$p.value)
# Warning in chisq.test(table(Data[, i])) :
#   Chi-squared approximation may be incorrect
# Warning in chisq.test(table(Data[, i])) :
#   Chi-squared approximation may be incorrect
# Warning in chisq.test(table(Data[, i])) :
#   Chi-squared approximation may be incorrect
# ...

chisq
#  [1] 1.0000000 0.3173105 1.0000000 0.5049851 1.0000000 1.0000000 0.5049851 1.0000000 1.0000000 1.0000000 0.3173105
# [12] 1.0000000 1.0000000 1.0000000 0.3173105 1.0000000 1.0000000 0.3173105 1.0000000 1.0000000 1.0000000 1.0000000
# [23] 1.0000000 1.0000000 0.5049851 1.0000000 1.0000000 1.0000000

对于Fisher检验,我们遇到了一些问题:

fisher <- apply(com, 1, function(i) fisher.test(table(Data[,i]))$p.value)
# Error in fisher.test(table(Data[, i])) : 
#   'x' must have at least 2 rows and columns

这是因为不是所有的表都是2x2的,所以我们需要捕获这个错误并返回一些有意义的信息。

fisher <- apply(com, 1, function(i) tryCatch(fisher.test(table(Data[,i]))$p.value, error = function(e) NA_real_))
fisher
#  [1] 1.0000000        NA 1.0000000 0.2500000 1.0000000 1.0000000 0.2500000        NA 1.0000000 1.0000000 0.3333333
# [12] 1.0000000 1.0000000        NA        NA        NA        NA        NA 1.0000000 1.0000000 1.0000000 1.0000000
# [23] 1.0000000 1.0000000 0.2500000 1.0000000 1.0000000 1.0000000

从这里,我们可以轻松地将它们组合在一起:

out <- cbind(com, chisq=chisq, fisher=fisher)
out
#       X1    X2     chisq    fisher
# 1  Gene1 Gene2 1.0000000 1.0000000
# 2  Gene1 Gene3 0.3173105        NA
# 3  Gene1 Gene4 1.0000000 1.0000000
# 4  Gene1 Gene5 0.5049851 0.2500000
# 5  Gene1 Gene6 1.0000000 1.0000000
# 6  Gene1 Gene7 1.0000000 1.0000000
# 7  Gene1 Gene8 0.5049851 0.2500000
# 8  Gene2 Gene3 1.0000000        NA
# 9  Gene2 Gene4 1.0000000 1.0000000
# ...

虽然combn确实有FUN=参数,但我发现在fisher.test自动化中重用结果也更容易。它还清除了输出(在我看来),因为它清楚地显示了在每次比较中考虑哪些基因。

相关问题