R: Error when splitting comma-separated values into new rows

o3imoua4 posted on 2023-02-14 in Other

I have the following data frame:

temp = structure(list(pid = c("s1", "s1", "s1"), LEFT_GENE = c("PTPRO", "EPS8", "DPY19L2,AC084357.2,AC027667.1"
), RIGHT_GENE = c("", "FOx,D", "DPY19L2P2,S100A11P1")), row.names = c(1L, 2L, 3L), class = "data.frame")

  pid                     LEFT_GENE          RIGHT_GENE
1  s1                         PTPRO                    
2  s1                          EPS8               FOx,D
3  s1 DPY19L2,AC084357.2,AC027667.1 DPY19L2P2,S100A11P1

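I want to split every comma-separated item onto its own row and create all the new combinations. For example, the last row should produce 6 new rows (3 left genes × 2 right genes). However, I get the following error, which I don't understand.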

temp %>%
  separate_rows(LEFT_GENE:RIGHT_GENE, sep=",") %>%  
  data.frame ( stringsAsFactors = F)

Error in `fn()`:
! In row 3, can't recycle input of size 3 to size 2.
Run `rlang::last_error()` to see where the error occurred.

The error seems to come from row 3 only, because rows 1:2 work fine:

> temp[1:2, 
+      ] %>%
+   separate_rows(LEFT_GENE:RIGHT_GENE, sep=",") %>%  
+   data.frame ( stringsAsFactors = F)
  pid LEFT_GENE RIGHT_GENE
1  s1     PTPRO           
2  s1      EPS8        FOx
3  s1      EPS8          D

Does anyone know what the problem is?

ljsrvy3e #1

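Separate one column at a time. When several columns are passed to separate_rows() in one call, they are split in parallel, so within a row each column has to yield the same number of pieces (or a single piece that can be recycled). Row 3 yields 3 pieces on the left and 2 on the right, hence the recycling error. Splitting the columns one after the other gives every combination: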

library(tidyr)
# split one column, then the other, so every left/right pair is produced
temp %>%
   separate_rows(RIGHT_GENE) %>%
   separate_rows(LEFT_GENE)

# A tibble: 9 × 3
  pid   LEFT_GENE  RIGHT_GENE 
  <chr> <chr>      <chr>      
1 s1    PTPRO      ""         
2 s1    EPS8       "FOx"      
3 s1    EPS8       "D"        
4 s1    DPY19L2    "DPY19L2P2"
5 s1    AC084357.2 "DPY19L2P2"
6 s1    AC027667.1 "DPY19L2P2"
7 s1    DPY19L2    "S100A11P1"
8 s1    AC084357.2 "S100A11P1"
9 s1    AC027667.1 "S100A11P1"
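
To see why only row 3 fails, it can help to count the pieces per row in each column. The following is a minimal base R sketch, assuming (as the error message suggests) that separate_rows() pairs pieces by position when given several columns; `left_n`, `right_n` and `conflict_rows` are names made up for this illustration.

```r
# Same data as in the question, rebuilt so the snippet runs on its own.
temp <- data.frame(
  pid = c("s1", "s1", "s1"),
  LEFT_GENE = c("PTPRO", "EPS8", "DPY19L2,AC084357.2,AC027667.1"),
  RIGHT_GENE = c("", "FOx,D", "DPY19L2P2,S100A11P1"),
  stringsAsFactors = FALSE
)

# Number of comma-separated pieces per row in each column.
# Note: base strsplit() returns zero pieces for an empty string.
left_n  <- lengths(strsplit(temp$LEFT_GENE,  ",", fixed = TRUE))
right_n <- lengths(strsplit(temp$RIGHT_GENE, ",", fixed = TRUE))

# A single piece can be recycled; two different counts above 1 cannot.
conflict_rows <- which(left_n > 1 & right_n > 1 & left_n != right_n)
conflict_rows
```

Only the rows reported in `conflict_rows` need the one-column-at-a-time treatment; the others would also work in a single call.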

2ul0zpep #2

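If you need 6 rows in total (values paired by position, with the shorter side padded with NA, instead of every combination), one option is: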

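library(dplyr)
library(tidyr)
library(purrr)
temp %>% 
  # split each gene column into a list of character vectors
  mutate(across(ends_with("_GENE"), ~ strsplit(.x,  split = ",")), 
  # per row, the larger of the two piece counts
  cnt = pmax(lengths(LEFT_GENE), lengths(RIGHT_GENE))) %>% 
  # pad the shorter vector with NA up to cnt
  mutate(across(ends_with("_GENE"),
    ~ map2(.x, cnt, ~ `length<-`(.x, .y)))) %>%
  select(-cnt) %>%
  # expand both list columns in parallel
  unnest_longer(where(is.list))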
Output:
# A tibble: 6 × 3
  pid   LEFT_GENE  RIGHT_GENE
  <chr> <chr>      <chr>     
1 s1    PTPRO      <NA>      
2 s1    EPS8       FOx       
3 s1    <NA>       D         
4 s1    DPY19L2    DPY19L2P2 
5 s1    AC084357.2 S100A11P1 
6 s1    AC027667.1 <NA>
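
The padding step relies on the fact that assigning a larger length to a vector fills the new positions with NA. A small base R sketch of that idea on row 2 of the data; `pad_to` is a hypothetical helper name, not part of any package.

```r
# Pad a vector with NA up to length n (hypothetical helper for illustration).
pad_to <- function(x, n) {
  length(x) <- n
  x
}

left  <- strsplit("EPS8",  ",", fixed = TRUE)[[1]]
right <- strsplit("FOx,D", ",", fixed = TRUE)[[1]]
n <- max(length(left), length(right))

# Both columns now have the same length, so they can sit side by side.
padded <- data.frame(
  LEFT_GENE  = pad_to(left, n),
  RIGHT_GENE = pad_to(right, n),
  stringsAsFactors = FALSE
)
padded
```

This is the same thing `length<-`(.x, .y) does inside map2() in the pipeline, applied to one row.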

If each NA should be replaced by the previous non-NA value, add fill() at the end:

...
%>% fill(ends_with("_GENE"))
# A tibble: 6 × 3
  pid   LEFT_GENE  RIGHT_GENE
  <chr> <chr>      <chr>     
1 s1    PTPRO      <NA>      
2 s1    EPS8       FOx       
3 s1    EPS8       D         
4 s1    DPY19L2    DPY19L2P2 
5 s1    AC084357.2 S100A11P1 
6 s1    AC027667.1 S100A11P1
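
One thing to watch with an ungrouped fill() is that it carries values down across the boundaries of the original rows, so a gene from one input row could leak into the next. A possible variation, sketched here under the assumption that values should only be carried within the same original row, tags each input row before expanding; `row_id` is a helper column invented for this example.

```r
library(dplyr)
library(tidyr)
library(purrr)

temp <- data.frame(
  pid = c("s1", "s1", "s1"),
  LEFT_GENE = c("PTPRO", "EPS8", "DPY19L2,AC084357.2,AC027667.1"),
  RIGHT_GENE = c("", "FOx,D", "DPY19L2P2,S100A11P1"),
  stringsAsFactors = FALSE
)

result <- temp %>%
  # remember which input row each expanded row came from
  mutate(row_id = row_number(),
         across(ends_with("_GENE"), ~ strsplit(.x, split = ",")),
         cnt = pmax(lengths(LEFT_GENE), lengths(RIGHT_GENE))) %>%
  mutate(across(ends_with("_GENE"),
                ~ map2(.x, cnt, ~ `length<-`(.x, .y)))) %>%
  select(-cnt) %>%
  unnest_longer(where(is.list)) %>%
  # fill downwards only inside each original row
  group_by(row_id) %>%
  fill(ends_with("_GENE")) %>%
  ungroup() %>%
  select(-row_id)

result
```

For the data in the question this should give the same six rows as the ungrouped version; the grouping only matters when an NA sits at the start of a later input row.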
