R: Splitting a list of elements into multiple columns using key-value pairs

z8dt9xmd  posted on 2023-02-14  in  Other

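I'm working with a large data frame that has a character (chr) column whose strings each contain a list of elements. I want to split the strings so that every element gets its own column, named by its key and holding its value. I tried tidyr::separate() and tidyr::unnest_wider(), but neither returns the output I want.
Here is some dummy data: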

df1 <- tibble(
    id = c('000914', '000916'),
    code = c('NN', 'SS'),
    values2 = c("{DS=15}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}" , "{DS=0}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}"         
  ) )

# A tibble: 2 x 3
  id     code  values2                                           
  <chr>  <chr> <chr>                                             
1 000914 NN    {DS=15}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}
2 000916 SS    {DS=0}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}

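I tried separate(), which runs without error, but the result isn't quite what I'm looking for: it would still need several rounds of reshaping with pivot_longer() and pivot_wider(). Is there a better, faster alternative?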

df1 %>% 
    separate(values2, into = paste("Col", 1:14)) 

# A tibble: 2 x 16
  id     code  `Col 1` `Col 2` `Col 3` `Col 4` `Col 5` `Col 6` `Col 7` `Col 8` `Col 9`
  <chr>  <chr> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
1 000914 NN    ""      DS      15      FPLUC   0       N       CELL    R       NINT1  
2 000916 SS    ""      DS      0       FPLUC   0       N       CELL    R       NINT1  
# ... with 5 more variables: Col 10 <chr>, Col 11 <chr>, Col 12 <chr>, Col 13 <chr>,
#   Col 14 <chr>

Here is the output I want:

  id     code  DS    FPLUC N     R     S     SPLUC
1 000914 NN    15    0     CELL  NINT1 true  1
2 000916 SS    0     0     CELL  NINT1 true  1

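Alternative solution:

library(tidyverse)

df1 %>% 
  tidyr::extract(values2, 
                 c("DS", "FPLUC", "N", "R", "S", "SPLUC"), 
                 # one capture group per key, in the order the keys appear
                 "\\{DS=([^}]*)\\}\\{FPLUC=([^}]*)\\}\\{N=([^}]*)\\}\\{R=([^}]*)\\}\\{S=([^}]*)\\}\\{SPLUC=([^}]*)\\}")

The regular expression matches each {key=value} block in turn and captures the text after each = sign in its own group. The c("DS", "FPLUC", "N", "R", "S", "SPLUC") argument names the new columns, one per capture group, so the number of groups must equal the number of names. This relies on every row containing the same keys in the same order; a row that does not match the pattern gets NA in all six columns.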
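extract() expects exactly one capture group per name passed to its into argument. As a minimal sketch in base R (the variable names keys, pattern and values are illustrative, not part of the original post), such a pattern can be built from the key names and checked against one of the sample strings, assuming every string lists the same keys in the same order:

```r
# Sample string from the question
x <- "{DS=15}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}"

# Hypothetical helper vector: the keys, in the order they appear in the string
keys <- c("DS", "FPLUC", "N", "R", "S", "SPLUC")

# One "{KEY=(value)}" piece per key, pasted together into a single pattern
pattern <- paste0("\\{", keys, "=([^}]*)\\}", collapse = "")

# regexec() returns the full match followed by each capture group
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

# Drop the full match and label each captured value with its key
values <- setNames(m[-1], keys)
print(values)
```

The same pattern string could be passed as the regex argument of tidyr::extract(), with keys as the into argument.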

Answer #1 (mec1mxoz)

A tidyr solution:

library(tidyr)

df1 %>%
  separate_rows(values2, sep = '(?<=\\})(?=\\{)') %>%
  extract(values2, c('name', 'value'), '\\{(.+?)=(.+?)\\}') %>%
  pivot_wider()

# # A tibble: 2 × 8
#   id     code  DS    FPLUC N     R     S     SPLUC
#   <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 000914 NN    15    0     CELL  NINT1 true  1
# 2 000916 SS    0     0     CELL  NINT1 true  1
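  • separate_rows() splits the collapsed column (values2) into multiple rows. The separator (?<=\\})(?=\\{) matches the zero-width position between } and {.
  • extract() splits a character column into multiple columns using regular-expression capture groups. The regular expression \\{(.+?)=(.+?)\\} matches the pattern {Col=Value} and extracts Col and Value into separate new columns.
  • pivot_wider() then spreads the pairs back into columns. Its defaults are names_from = name and values_from = value, which is why the two extracted columns are called name and value.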
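
To make the intermediate steps visible, the pipeline can be run in stages. The sketch below recreates the sample data from the question and stores the long name/value table before widening it; the object names long and wide are illustrative and do not appear in the original answer.

```r
library(tidyr)   # separate_rows(), extract(), pivot_wider(), and the %>% pipe
library(tibble)  # tibble()

df1 <- tibble(
  id = c("000914", "000916"),
  code = c("NN", "SS"),
  values2 = c("{DS=15}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}",
              "{DS=0}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}")
)

# Stage 1 and 2: one row per {key=value} block, then split into name and value
long <- df1 %>%
  separate_rows(values2, sep = "(?<=\\})(?=\\{)") %>%
  extract(values2, c("name", "value"), "\\{(.+?)=(.+?)\\}")
print(long)

# Stage 3: one column per key, spelling out the defaults that pivot_wider() uses
wide <- pivot_wider(long, names_from = name, values_from = value)
print(wide)
```

Because the column names come from the data rather than from a hard-coded vector, this approach should also cope with keys that appear in a different order from row to row.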

Answer #2 (3qpi33ja)

It's messy, but you could try

library(tidyverse)

nms <- str_extract_all(df1$values2[1], "(?<=\\{).+?(?=\\=)", simplify = TRUE)
nms <- c(names(df1)[-3],nms)
df1 %>%
  mutate(values2 = str_extract_all(values2, "(?<=\\=).+?(?=\\})")) %>%
  unnest_wider(values2, names_repair = ~nms) 

  id     code  DS    FPLUC N     R     S     SPLUC
  <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 000914 NN    15    0     CELL  NINT1 true  1    
2 000916 SS    0     0     CELL  NINT1 true  1

Answer #3 (hi3rlvi2)

If you're not too keen on regex, try the following:

library(dplyr, quietly=TRUE, warn.conflicts=FALSE)
#> Warning: package 'dplyr' was built under R version 4.1.3
library(tidyr)

df1 <- tibble(
  id = c('000914', '000916'),
  code = c('NN', 'SS'),
  values2 = c("{DS=15}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}" , "{DS=0}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}"         
  ) )

df1 
#> # A tibble: 2 x 3
#>   id     code  values2                                           
#>   <chr>  <chr> <chr>                                             
#> 1 000914 NN    {DS=15}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}
#> 2 000916 SS    {DS=0}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}

df1 %>% 
  mutate(values2 = stringr::str_remove_all(values2, "\\}")) %>% # remove every } from values2
  separate(values2, into = c("X","DS","FPLUC","N","R","S","SPLUC"), sep = "\\{") %>% # split values2 at each {; sep is a regex, so { is escaped
  mutate(across(.cols = c(DS, FPLUC, N, R, S, SPLUC), 
                .fns = ~stringr::str_remove(.x, "^.+="))) %>% # remove "xxx=" from each of the columns
  select(!X) # drop X, which is empty because every string starts with {
#> # A tibble: 2 x 8
#>   id     code  DS    FPLUC N     R     S     SPLUC
#>   <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 000914 NN    15    0     CELL  NINT1 true  1    
#> 2 000916 SS    0     0     CELL  NINT1 true  1
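
The three steps in the pipeline above (remove }, split at {, strip the key= prefix) can also be traced on a single string. This base R sketch uses fixed = TRUE so that { and } are treated as literal characters rather than regex syntax; the names no_close, parts, eq and kv are illustrative. Unlike the pipeline, it reads the column names from the keys instead of hard-coding them.

```r
# Sample string from the question
x <- "{DS=15}{FPLUC=0}{N=CELL}{R=NINT1}{S=true}{SPLUC=1}"

# 1. Drop every "}" as a literal character (fixed = TRUE disables regex)
no_close <- gsub("}", "", x, fixed = TRUE)

# 2. Split at each literal "{"; the first piece is empty because x starts
#    with "{", so it is discarded with [-1]
parts <- strsplit(no_close, "{", fixed = TRUE)[[1]][-1]

# 3. Locate the first "=" in each "key=value" piece and cut around it
eq <- as.integer(regexpr("=", parts, fixed = TRUE))
kv <- setNames(substring(parts, eq + 1), substring(parts, 1, eq - 1))
print(kv)
```

Applying the same steps to every row and binding the named vectors together would give the wide table, at the cost of more code than the tidyr versions.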
