R语言 使用分隔符拆分键值对列,并将键作为新列插入

q8l4jmvw  于 2023-07-31  发布在  其他
关注(0)|答案(3)|浏览(101)

我有这个数据:

data <- data.frame(INFO=c(
  "IMPUTED;FREQ=TOPMED:1,4.534e-05|dbGaP_PopFreq:0.9998,0.0001686;GENEINFO=LOC105372733:105372733;GNO;INT;R3;RS=1216947677;SSR=0;VC=INDEL;dbSNPBuildID=151",
  "IMPUTED;FREQ=GnomAD:0.9999,8.488e-05|TOPMED:1,2.267e-05|dbGaP_PopFreq:1,0;GENEINFO=LOC105372733:105372733;GNO;INT;R3;RS=1415762563;SSR=0;VC=INDEL;dbSNPBuildID=151",
  "IMPUTED;FREQ=TOPMED:1,3.778e-06|dbGaP_PopFreq:1,0;GENEINFO=LOC105372733:105372733;GNO;INT;R3;RS=1237792419;SSR=0;VC=DEL;dbSNPBuildID=155"
    )
)

字符串
我想在delimiter '上拆分INFO列;在这些字段中,有以'='分隔的键/值对,我希望键是新列的名称,值是行的元素。如果没有值(例如键“IMPUTED”),我希望值为空字符串。
就像这样

data_new <- data.frame(IMPUTED=c('', '', ''), 
FREQ=c('TOPMED:1,4.534e-05|dbGaP_PopFreq:0.9998,0.0001686', 
'GnomAD:0.9999,8.488e-05|TOPMED:1,2.267e-05|dbGaP_PopFreq:1,0', 
'TOPMED:1,3.778e-06|dbGaP_PopFreq:1,0'))


我被困在这里,我不总是知道键名,我有数百万行要做这件事,所以我宁愿不使用任何循环
我试过这个

library(stringr)
library(tidyr)
library(dplyr)

KVsep <- fixed(";")  #key-value separator
Vsep <- fixed("=")     #value separator

newDF <-  data %>%
  mutate(KVpairs = str_split(INFO, KVsep)) %>%
  unnest(KVpairs) %>%
  separate(KVpairs, into = c("key", "value"), Vsep) %>%
  spread(key, value) %>%
  select(-INFO)


here,但得到

Error in `check_not_stringr_pattern()`:
! `sep` can't use modifiers from stringr.

irlmq6kh

irlmq6kh1#

使用pivot_wider构造所需的键值宽格式的方法

library(dplyr)
library(tidyr)

KVsep <- ";"
Vsep <- "="

data %>% 
  mutate(st_sp = strsplit(INFO, KVsep), INFO = NULL) %>% 
  unnest(st_sp) %>% 
  separate(st_sp, c("key", "value"), Vsep, fill="right") %>% 
  mutate(row = row_number(), .by = key, value = replace_na(value, "")) %>% 
  pivot_wider(names_from = key, values_from = value) %>%  
  select(-row)
# A tibble: 3 × 10
  IMPUTED FREQ               GENEI…¹ GNO   INT   R3    RS    SSR   VC    dbSNP…²
  <chr>   <chr>              <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>  
1 ""      TOPMED:1,4.534e-0… LOC105… ""    ""    ""    1216… 0     INDEL 151    
2 ""      GnomAD:0.9999,8.4… LOC105… ""    ""    ""    1415… 0     INDEL 151    
3 ""      TOPMED:1,3.778e-0… LOC105… ""    ""    ""    1237… 0     DEL   155    
# … with abbreviated variable names ¹​GENEINFO, ²​dbSNPBuildID

字符串

ecfsfe2w

ecfsfe2w2#

一个比较普通的方法是:

do.call(rbind, lapply(strsplit(data$INFO, ";"), strsplit, "=") |>
  lapply(function(x) dplyr::as_tibble(t(sapply(x, function(y) {
    setNames(if(length(y) == 1) "" else y[2], y[1])})))))
#> # A tibble: 3 x 10
#>   IMPUTED FREQ         GENEINFO GNO   INT   R3    RS    SSR   VC    dbSNPBuildID
#>   <chr>   <chr>        <chr>    <chr> <chr> <chr> <chr> <chr> <chr> <chr>       
#> 1 ""      TOPMED:1,4.~ LOC1053~ ""    ""    ""    1216~ 0     INDEL 151         
#> 2 ""      GnomAD:0.99~ LOC1053~ ""    ""    ""    1415~ 0     INDEL 151         
#> 3 ""      TOPMED:1,3.~ LOC1053~ ""    ""    ""    1237~ 0     DEL   155

字符串
创建于2023-07-21使用reprex v2.0.2

lp0sw83n

lp0sw83n3#

管道的第一行试图在第二步中创建一个列表列(str_split将返回一个列表),但data.frame不会创建这样的列表列。如果你想使用tidyverse工作流,我建议作为第一步,制作一个tibble的数据框,它将按照预期接受列表列:

library('dplyr')
library('tidyr')
library('stringr')

data = as_tibble(data)

data |>
  mutate(
    KVpairs = str_split(INFO, ';'),
    .keep = 'unused'
  ) |>
  glimpse()
## Rows: 3
## Columns: 1
## $ KVpairs <list> <"IMPUTED", "FREQ=TOPMED:1,4.534e-05|dbGaP_PopFreq:0.9998,0.0…

现在我们有一个列表列,我们可以unnest。我将在这里添加一个唯一的ìd,以便我们可以跟踪原始观察结果:

data |>
  mutate(
    id = row_number(),
    KVpairs = str_split(INFO, ';'),
    .keep = 'unused'
  ) |>
  unnest(cols = KVpairs)
## # A tibble: 30 × 2
##       id KVpairs                                               
##    <int> <chr>                                                 
##  1     1 IMPUTED                                               
##  2     1 FREQ=TOPMED:1,4.534e-05|dbGaP_PopFreq:0.9998,0.0001686
##  3     1 GENEINFO=LOC105372733:105372733                       
##  4     1 GNO                                                   
##  5     1 INT                                                   
##  6     1 R3                                                    
##  7     1 RS=1216947677                                         
##  8     1 SSR=0                                                 
##  9     1 VC=INDEL                                              
## 10     1 dbSNPBuildID=151                                      
## # ℹ 20 more rows

为了到达最后一列,我们可以使用pivot_widerspread已经被这个函数取代),但是首先最好注意不正确的键值对(例如:IMPUTED)。由于所需的输出是这些列的''值,因此在没有=时追加=将完成这项工作:

data |>
  mutate(
    id = row_number(),
    KVpairs = str_split(INFO, ';'),
    .keep = 'unused'
  ) |>
  unnest(cols = KVpairs) |>
  mutate(
    KVpairs = if_else(grepl('=', KVpairs), KVpairs, paste0(KVpairs, '='))
  )
## # A tibble: 30 × 2
##       id KVpairs                                               
##    <int> <chr>                                                 
##  1     1 IMPUTED=                                              
##  2     1 FREQ=TOPMED:1,4.534e-05|dbGaP_PopFreq:0.9998,0.0001686
##  3     1 GENEINFO=LOC105372733:105372733                       
##  4     1 GNO=                                                  
##  5     1 INT=                                                  
##  6     1 R3=                                                   
##  7     1 RS=1216947677                                         
##  8     1 SSR=0                                                 
##  9     1 VC=INDEL                                              
## 10     1 dbSNPBuildID=151                                      
## # ℹ 20 more rows

我们还需要在它们各自的列中获取键和值,为此我们可以使用separate_delim_wider,它已经取代了separate

data |>
  mutate(
    id = row_number(),
    KVpairs = str_split(INFO, ';'),
    .keep = 'unused'
  ) |>
  unnest(cols = KVpairs) |>
  mutate(
    KVpairs = if_else(grepl('=', KVpairs), KVpairs, paste0(KVpairs, '='))
  ) |>
  separate_wider_delim(
    cols = KVpairs,
    delim = '=',
    names = c('key', 'value')
    )
## # A tibble: 30 × 3
##       id key          value                                              
##    <int> <chr>        <chr>                                              
##  1     1 IMPUTED      ""                                                 
##  2     1 FREQ         "TOPMED:1,4.534e-05|dbGaP_PopFreq:0.9998,0.0001686"
##  3     1 GENEINFO     "LOC105372733:105372733"                           
##  4     1 GNO          ""                                                 
##  5     1 INT          ""                                                 
##  6     1 R3           ""                                                 
##  7     1 RS           "1216947677"                                       
##  8     1 SSR          "0"                                                
##  9     1 VC           "INDEL"                                            
## 10     1 dbSNPBuildID "151"                                              
## # ℹ 20 more rows

最后pivot_wider

data |>
  mutate(
    id = row_number(),
    KVpairs = str_split(INFO, ';'),
    .keep = 'unused'
  ) |>
  unnest(cols = KVpairs) |>
  mutate(
    KVpairs = if_else(grepl('=', KVpairs), KVpairs, paste0(KVpairs, '='))
  ) |>
  separate_wider_delim(
    cols = KVpairs,
    delim = '=',
    names = c('key', 'value')
    ) |>
    pivot_wider(
      names_from = key,
      values_from = value
    )
## # A tibble: 3 × 11
##      id IMPUTED FREQ   GENEINFO GNO   INT   R3    RS    SSR   VC    dbSNPBuildID
##   <int> <chr>   <chr>  <chr>    <chr> <chr> <chr> <chr> <chr> <chr> <chr>       
## 1     1 ""      TOPME… LOC1053… ""    ""    ""    1216… 0     INDEL 151         
## 2     2 ""      GnomA… LOC1053… ""    ""    ""    1415… 0     INDEL 151         
## 3     3 ""      TOPME… LOC1053… ""    ""    ""    1237… 0     DEL   155

希望这个能帮上忙。请注意,我在这里假设IMPUTEDGNO等将始终存在并具有唯一值。如果不是这种情况,那么输出将很难预测。

相关问题