Creating a data frame in R from multiple rowMeans

dpiehjr4, posted on 2022-12-06 in Other

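I don't have much experience with R. I have a data frame containing gene names and their expression in different tissues (e.g. RAM, SAM), with three replicates per tissue (RAM1, RAM2, RAM3). The values are results from the DESeq2 package. It looks like this: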

                 RAM1       RAM2      RAM3      SAM1        SAM2 .....
gene.01G000150   3.112134   0.00000   0.00000   7.5206516   1.252147
.....

So I want to compute the mean for each tissue and then build a new data frame from those means.
Desired output: a table like the following:

                 RAM(mean)   SAM(mean) ....
gene.01G000150   5.578       3.5...
...

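Do you know of an efficient and reproducible way to do this?
Update: this call looks useful: check_genes$SAM <- apply(check_genes[c("SAM1", "SAM2", "SAM3")], MARGIN = 1, FUN = function(x) mean(x[x != 0])), but you have to run it separately for each tissue's set of replicates.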
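
The call in the update can be written once and applied to every tissue by deriving the tissue name from the column names. This is only a minimal sketch: it assumes every replicate column is named as a tissue prefix followed by a trailing number, and check_genes below is made-up stand-in data, not the real DESeq2 output. Note that a gene whose replicates are all zero gets NaN, because the mean of an empty vector is NaN.

```r
# Hypothetical stand-in for the real table (values are made up)
check_genes <- data.frame(
  RAM1 = c(3, 0, 2), RAM2 = c(0, 0, 4), RAM3 = c(0, 6, 6),
  SAM1 = c(8, 1, 0), SAM2 = c(1, 1, 0), SAM3 = c(3, 1, 0),
  row.names = c("gene.A", "gene.B", "gene.C")
)

# Tissue prefix = column name with the trailing replicate number removed
tissue <- sub("[0-9]+$", "", names(check_genes))

# Mean of the non-zero replicates only, as in the update above
nonzero_mean <- function(x) mean(x[x != 0])

# One column of per-gene means for each tissue
tissue_means <- sapply(unique(tissue), function(tis) {
  apply(check_genes[, tissue == tis, drop = FALSE], 1, nonzero_mean)
})
tissue_means <- as.data.frame(tissue_means)
```

The result has one row per gene and one column per tissue, so no per-tissue copy of the apply call is needed.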

Answer 1, by fbcarpbf

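If I understand correctly (a big if!), you want the mean expression for each tissue/gene combination, right? That is, you want the mean of the three runs for each of: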

  • RAM-gene.01G000150
  • SAM-gene.01G000150
  • FEC-gene.01G000150
  • ...

If so, I would first reshape the data to make life easier. You didn't include a dataset I could use for demonstration, so I built a dummy one.

# Load the library
library(data.table)

# Dummy data
dt <- data.table(genes = letters[1:5],
                 RAM1 = runif(5),
                 RAM2 = runif(5),
                 RAM3 = runif(5),
                 SAM1 = runif(5),
                 SAM2 = runif(5),
                 SAM3 = runif(5))

#    genes      RAM1       RAM2      RAM3      SAM1       SAM2       SAM3
# 1:     a 0.7063121 0.43993347 0.6226666 0.8476453 0.05446956 0.44600060
# 2:     b 0.4897803 0.50044643 0.1632004 0.1464797 0.43376887 0.27821878
# 3:     c 0.9418203 0.67954434 0.2502699 0.5309522 0.48029960 0.82622447
# 4:     d 0.3498353 0.81831758 0.8970066 0.9042565 0.25258854 0.09807793
# 5:     e 0.2324447 0.06337947 0.7269116 0.5776730 0.37568198 0.68405615

Next, I convert the data from wide to long format:

# From wide to long format
dt_l <- melt(dt, id.vars = "genes", variable.name = "tissue_run", value.name = "expression")

#    genes tissue_run expression
# 1:      a       RAM1 0.70631211
# 2:      b       RAM1 0.48978034
# 3:      c       RAM1 0.94182026
# 4:      d       RAM1 0.34983529
# 5:      e       RAM1 0.23244474
# 6:      a       RAM2 0.43993347
# 7:      b       RAM2 0.50044643
# 8:      c       RAM2 0.67954434
# 9:      d       RAM2 0.81831758
# 10:     e       RAM2 0.06337947
# 11:     a       RAM3 0.62266657
# 12:     b       RAM3 0.16320043
# 13:     c       RAM3 0.25026990
# 14:     d       RAM3 0.89700660
# 15:     e       RAM3 0.72691159
# 16:     a       SAM1 0.84764528
# 17:     b       SAM1 0.14647973
# 18:     c       SAM1 0.53095222
# 19:     d       SAM1 0.90425646
# 20:     e       SAM1 0.57767296
# 21:     a       SAM2 0.05446956
# 22:     b       SAM2 0.43376887
# 23:     c       SAM2 0.48029960
# 24:     d       SAM2 0.25258854
# 25:     e       SAM2 0.37568198
# 26:     a       SAM3 0.44600060
# 27:     b       SAM3 0.27821878
# 28:     c       SAM3 0.82622447
# 29:     d       SAM3 0.09807793
# 30:     e       SAM3 0.68405615

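Finally, I split the tissue/run variable into its components (tissue and replicate number), group by tissue and gene, and take the mean over the replicates.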

# Split into tissue and replicate number
dt_l[, replicate := gsub(".*([0-9]+)$", "\\1", tissue_run)
     ][, tissue := gsub("^(.*)[0-9]+$", "\\1", tissue_run)
       ][, .(mean = mean(expression)), by = c("tissue", "genes")]

#     tissue genes      mean
# 1:     RAM     a 0.5896374
# 2:     RAM     b 0.3844757
# 3:     RAM     c 0.6238782
# 4:     RAM     d 0.6883865
# 5:     RAM     e 0.3409119
# 6:     SAM     a 0.4493718
# 7:     SAM     b 0.2861558
# 8:     SAM     c 0.6124921
# 9:     SAM     d 0.4183076
# 10:    SAM     e 0.5458037
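
One caveat about the split: the greedy patterns work for single-digit replicate numbers, but with ten or more replicates a pattern like "^(.*)[0-9]+$" leaves part of the number in the tissue name. The sketch below uses hypothetical sample names (RAM10 is not in the example data) to show the difference and one possible alternative pair of patterns.

```r
# Hypothetical sample names, including a two-digit replicate
tissue_run <- c("RAM1", "RAM2", "SAM3", "RAM10")

# Greedy capture: (.*) takes as much as it can, leaving only one digit
greedy_tissue <- gsub("^(.*)[0-9]+$", "\\1", tissue_run)

# Alternative: strip all trailing digits to get the tissue
tissue <- sub("[0-9]+$", "", tissue_run)

# Alternative: strip everything up to the last non-digit to get the replicate
replicate <- sub("^.*[^0-9]", "", tissue_run)
```

With the alternative patterns, RAM10 is split into tissue RAM and replicate 10, while the greedy pattern yields RAM1 as the tissue.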

Is this the desired output?
Based on the updated question, one final step is needed:

dt_l_s <- dt_l[, replicate := gsub(".*([0-9]+)$", "\\1", tissue_run)
               ][, tissue := gsub("^(.*)[0-9]+$", "\\1", tissue_run)
                 ][, .(mean = mean(expression)), by = c("tissue", "genes")]

# Back to wide format: one row per gene, one column per tissue
dcast(dt_l_s, genes ~ tissue, value.var = "mean")

#    genes       RAM       SAM
# 1:     a 0.5896374 0.4493718
# 2:     b 0.3844757 0.2861558
# 3:     c 0.6238782 0.6124921
# 4:     d 0.6883865 0.4183076
# 5:     e 0.3409119 0.5458037
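
If reshaping is not needed for anything else, the same table can probably be built directly in base R with rowMeans(), as the question title suggests. This is a sketch under assumptions: the data frame df and its values are made up, and it relies on replicate columns being named as a tissue prefix plus a trailing number.

```r
# Dummy wide table in the same shape as the example above (values made up)
set.seed(1)
df <- data.frame(genes = letters[1:5],
                 RAM1 = runif(5), RAM2 = runif(5), RAM3 = runif(5),
                 SAM1 = runif(5), SAM2 = runif(5), SAM3 = runif(5))

# Expression columns only, as a numeric matrix with genes as row names
expr <- as.matrix(df[, -1])
rownames(expr) <- df$genes

# Tissue label for each column (trailing replicate number removed)
tissue <- sub("[0-9]+$", "", colnames(expr))

# One rowMeans() call per tissue, collected into a matrix of means
means <- sapply(unique(tissue), function(tis) {
  rowMeans(expr[, tissue == tis, drop = FALSE])
})

# Back to a data frame with the gene names as a regular column
means_df <- data.frame(genes = rownames(means), means, row.names = NULL)
```

This uses the plain mean, like the data.table version. To skip zeros as in the question's update, rowMeans() would have to be replaced by a function that drops them first.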
