How do I open multiple CSV files, each with 3 columns (gene, length, and counts), into a single R data frame?

carvr3hs  posted on 2023-05-04 in Other

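I'm new to R and am trying to open multiple files with read_csv. I've tried the following approaches, but they always return empty columns for most samples, except the first one (the raw data looks fine).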

# Importing multiple csv files at once - returns empty columns for most samples
setwd("~/Desktop/cardio_metabolism_annotated_counts/asdf RNA seq/raw_data")
install.packages('data.table')
library(tidyverse)
library(dplyr)
library(data.table)
df <- 
  list.files(path = "/Users/asdf/Desktop/cardio_metabolism_annotated_counts/asdf RNA seq/raw_data/", pattern = "*.csv") %>% 
  map_df(~fread(.))

head(df)

# Using tidyverse - second method
library(tidyverse)
df <-
  list.files(path = "~/Desktop/nox4_cardio_metabolism_annotated_counts/Nox4 RNA seq/raw_data/", pattern = "*.csv") %>% 
  map_df(~read_csv(.))
df

Here is an example of what I get after running the first one:

The downloaded binary packages are in
    /var/folders/9w(...)aded_packages
> library(tidyverse)
> library(dplyr)
> library(data.table)
> df <- 
+   list.files(path = "/Users/asfd/Desktop/_cardio_metabolism_annotated_counts/ RNA seq/raw_data/", pattern = "*.csv") %>% 
+   map_df(~fread(.))
> 
> head(df)
          GeneID Length R210808_2459_Tg_counts R210809_2459_Tg_counts R210810_2460_Tg_counts
1: 4933401J01Rik   1070                      0                     NA                     NA
2:       Gm26206    110                      0                     NA                     NA
3:          Xkr4   6094                      0                     NA                     NA
4:       Gm18956    480                      0                     NA                     NA
5:       Gm37180   2819                      0                     NA                     NA
6:       Gm37363   2233                      1                     NA                     NA
   R210811_2460_Tg_counts R210812_2461_WT_counts R210813_2461_WT_counts R210814_2462_Tg_counts
1:                     NA                     NA                     NA                     NA
2:                     NA                     NA                     NA                     NA
3:                     NA                     NA                     NA                     NA
4:                     NA                     NA                     NA                     NA
5:                     NA                     NA                     NA                     NA
6:                     NA                     NA                     NA                     NA

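So my question is how to get rid of the NA values, and whether it is possible to combine the count column from each csv file into a single file.
Thanks
Sara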

gstyhher1#

If all of the files have 3 columns, you can use code along these lines:

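library(data.table)

# full.names = TRUE returns complete paths, so fread works regardless of the working directory
FileList <- list.files(path = "/Users/asdf/Desktop/cardio_metabolism_annotated_counts/asdf RNA seq/raw_data/", pattern = "\\.csv$", full.names = TRUE)
DT <- rbindlist(lapply(FileList, function(x) {
  tmpDT <- fread(input = x)
  # record the sample name, taken from the name of the third (count) column
  tmpDT[, SampleID := colnames(tmpDT)[3]]
  # give the count column the same name in every file
  setnames(x = tmpDT, old = colnames(tmpDT)[3], new = "Count")
  return(tmpDT)
}))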

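This returns a data.table in which the count column has the same name for every file. You can change the names and so on to whatever suits you best.
For a more detailed solution, you could post what one of the files looks like after loading it with fread.
Edit:
We can also look at just one table, loaded the usual way with fread. Since I don't have your files, I'll make up some data and assume yours has a similar structure (perhaps with different names).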
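As for where the NA values in the question probably come from: when tables are stacked by column name and each file calls its count column something different, every count column is filled only for the rows of its own file and padded with NA elsewhere. A minimal sketch on made-up data (the tables `a` and `b` and the sample names `S1_counts` and `S2_counts` are hypothetical, not taken from your files):

```r
library(data.table)

# Two made-up tables standing in for two of the CSV files (hypothetical values)
a <- data.table(GeneID = c("Xkr4", "Gm26206"), Length = c(6094L, 110L), S1_counts = c(0L, 1L))
b <- data.table(GeneID = c("Xkr4", "Gm26206"), Length = c(6094L, 110L), S2_counts = c(5L, 2L))

# Stacking by column name: each count column only has values for its own file's rows
stacked <- rbindlist(list(a, b), fill = TRUE)
stacked
# S1_counts is NA for the rows that came from b, S2_counts is NA for the rows from a

# Renaming the third column first gives one shared "Count" column with no padding
tidy <- rbindlist(lapply(list(a, b), function(dt) {
  dt <- copy(dt)                          # avoid modifying the inputs by reference
  dt[, SampleID := names(dt)[3]]          # remember which sample the rows belong to
  setnames(dt, names(dt)[3], "Count")     # same column name in every table
  dt
}))
tidy
```

I would expect map_df to behave like the `fill = TRUE` case here, since it also binds rows by column name, but I have not checked that against your actual files.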

library(data.table)

# make up data
tmpDT <- data.table(GeneID=LETTERS, Length=sample.int(26), GeneName1_Count=sample(x = 0:4, size = 26, replace = TRUE))

tmpDT
#    GeneID Length GeneName1_Count
# 1:      A     15               3
# 2:      B     20               0
# 3:      C      5               0
# 4:      D     24               4
# 5:      E      7               4
#---                              
#22:      V     11               3
#23:      W     13               0
#24:      X     25               1
#25:      Y     10               3
#26:      Z     17               0

# add one column with the name of the sample
# so that we keep track of it as well if needed
tmpDT[,SampleID:=colnames(tmpDT)[3],]

# change the name of the count column to "Count"
setnames(x = tmpDT, old = colnames(tmpDT)[3], new = "Count")

# return the new DT
tmpDT
#    GeneID Length Count        SampleID
# 1:      A     15     3 GeneName1_Count
# 2:      B     20     0 GeneName1_Count
# 3:      C      5     0 GeneName1_Count
# 4:      D     24     4 GeneName1_Count
# 5:      E      7     4 GeneName1_Count
#---                                    
#22:      V     11     3 GeneName1_Count
#23:      W     13     0 GeneName1_Count
#24:      X     25     1 GeneName1_Count
#25:      Y     10     3 GeneName1_Count
#26:      Z     17     0 GeneName1_Count

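So this is the part that runs after you have read the data in with fread.
The lapply part goes through the list of files and applies the function to each one: it loads the data, transforms it, and collects the results in a list. rbindlist then collapses that list into a single data.table containing all the information.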
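If you would rather end up with one count column per sample in a single file, as the question asks, the long table can be reshaped with dcast. Below is a rough end-to-end sketch that first writes two small made-up CSV files to a temporary folder so that it runs on its own. The folder name, file names, sample names and the output file `merged_counts.csv` are all assumptions for illustration only.

```r
library(data.table)

# Write two small made-up CSV files into a temporary folder (hypothetical data)
demo_dir <- file.path(tempdir(), "raw_data_demo")
dir.create(demo_dir, showWarnings = FALSE)
fwrite(data.table(GeneID = c("A", "B", "C"), Length = c(100L, 200L, 300L),
                  S1_counts = c(3L, 0L, 4L)), file.path(demo_dir, "S1.csv"))
fwrite(data.table(GeneID = c("A", "B", "C"), Length = c(100L, 200L, 300L),
                  S2_counts = c(1L, 2L, 0L)), file.path(demo_dir, "S2.csv"))

# List the files with full paths so fread can find them from any working directory
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# lapply loads and transforms each file; rbindlist stacks the results (long format)
long <- rbindlist(lapply(files, function(f) {
  dt <- fread(f)
  dt[, SampleID := names(dt)[3]]        # sample name comes from the count column's name
  setnames(dt, names(dt)[3], "Count")   # same column name in every file
  dt
}))

# Reshape to one row per gene and one column per sample (wide format)
wide <- dcast(long, GeneID + Length ~ SampleID, value.var = "Count")
wide

# Save the merged table as a single CSV file (outside the input folder)
fwrite(wide, file.path(tempdir(), "merged_counts.csv"))
```

This assumes each GeneID/Length pair appears only once per file. If a gene is repeated within a sample, dcast will aggregate the duplicates instead of keeping the raw counts, so it is worth checking for duplicates first.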
