How do I unnest data frame items separated by `;`?

Posted by wvmv3b1j on 2023-02-10 in Other

I have an untidy data frame stored as a tidyverse tibble named data, which looks like this:

# A tibble: 4 × 3
  `Full name` `Favorite foods`                            `Preferred colors`
  <chr>       <chr>                                       <chr>             
1 Homer       key lime pie; celery; fried rice            green; red        
2 Marge       celery; ice cream                           NA                
3 Mr. Burns   fried rice; apple; fried chicken; ice cream orange; purple    
4 Krusty      celery; key lime pie; apple                 red; blue

It can be recreated like this:

library(tibble)

data <- tribble(
    ~"Full name", ~"Favorite foods", ~"Preferred colors", 
    "Homer", "key lime pie; celery; fried rice", "green; red", 
    "Marge", "celery; ice cream", NA, 
    "Mr. Burns", "fried rice; apple; fried chicken; ice cream", "orange; purple", 
    "Krusty", "celery; key lime pie; apple", "red; blue"
)

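As you can see, it lists people's `Favorite foods` and `Preferred colors`, where each entry is a single chr whose items are separated by `;`, for example `celery; ice cream`.
Ultimately, I want to analyze and visualize people's favorite foods and colors separately, to be able to say things like "the two most favored foods are x and y" or "nobody likes black".
How can I tidy this data so that the foods and colors are reshaped into a tidy data frame? For example, can the food and color columns be converted into their own nested data frames? Or is there a better or simpler approach?
A tidyverse solution is not required but strongly preferred, since that is what I am most familiar with. Computational performance is nice to have but not an important consideration in this case.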

**EDIT:** To accommodate future needs, I may not want to split the columns in this data just yet.


Answer #1 (epfja78i)

You can use str_split to create list columns:

library(dplyr)
library(stringr)

data %>% 
  mutate(across(-`Full name`, ~ str_split(.x, pattern = "; ")))

# A tibble: 4 × 3
  `Full name` `Favorite foods` `Preferred colors`
  <chr>       <list>           <list>            
1 Homer       <chr [3]>        <chr [2]>         
2 Marge       <chr [2]>        <chr [1]>         
3 Mr. Burns   <chr [4]>        <chr [2]>         
4 Krusty      <chr [3]>        <chr [2]>
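
As a minimal sketch of how the list columns can then be used, assuming the `data` tibble from the question (recreated here so the snippet runs on its own; the names `nested` and `n_foods` are just illustrative):

```r
library(dplyr)
library(stringr)
library(tibble)

# Recreate the example data from the question
data <- tribble(
  ~"Full name", ~"Favorite foods", ~"Preferred colors",
  "Homer", "key lime pie; celery; fried rice", "green; red",
  "Marge", "celery; ice cream", NA,
  "Mr. Burns", "fried rice; apple; fried chicken; ice cream", "orange; purple",
  "Krusty", "celery; key lime pie; apple", "red; blue"
)

# Same split as above: every column except the name becomes a list column
nested <- data %>%
  mutate(across(-`Full name`, ~ str_split(.x, pattern = "; ")))

# lengths() gives the number of elements in each list entry.
# Note that a missing value is kept as a single NA element, so it counts as 1.
nested %>%
  mutate(n_foods = lengths(`Favorite foods`))

# Pull one person's items back out as a plain character vector
homer_foods <- nested %>%
  filter(`Full name` == "Homer") %>%
  pull(`Favorite foods`) %>%
  unlist()
```

This keeps one row per person, which may suit the case where the columns should not be split into rows yet.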

To make it fully tidy, I would suggest pivoting to long format:

library(tidyr)

data %>%
  mutate(across(-1, ~ str_split(.x, pattern = "; "))) %>% 
  pivot_longer(-1) %>% 
  unnest_longer("value")

# A tibble: 19 × 3
   `Full name` name             value        
   <chr>       <chr>            <chr>        
 1 Homer       Favorite foods   key lime pie 
 2 Homer       Favorite foods   celery       
 3 Homer       Favorite foods   fried rice   
 4 Homer       Preferred colors green        
 5 Homer       Preferred colors red          
 6 Marge       Favorite foods   celery       
 7 Marge       Favorite foods   ice cream    
 8 Marge       Preferred colors NA           
 9 Mr. Burns   Favorite foods   fried rice   
10 Mr. Burns   Favorite foods   apple        
11 Mr. Burns   Favorite foods   fried chicken
12 Mr. Burns   Favorite foods   ice cream    
13 Mr. Burns   Preferred colors orange       
14 Mr. Burns   Preferred colors purple       
15 Krusty      Favorite foods   celery       
16 Krusty      Favorite foods   key lime pie 
17 Krusty      Favorite foods   apple        
18 Krusty      Preferred colors red          
19 Krusty      Preferred colors blue
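
From the long format, the kind of summaries mentioned in the question can be sketched roughly as follows. This is one possible approach, not the only one; the names `long` and `food_counts` are illustrative, and the data is recreated so the snippet is self-contained:

```r
library(dplyr)
library(stringr)
library(tidyr)
library(tibble)

# Recreate the example data from the question
data <- tribble(
  ~"Full name", ~"Favorite foods", ~"Preferred colors",
  "Homer", "key lime pie; celery; fried rice", "green; red",
  "Marge", "celery; ice cream", NA,
  "Mr. Burns", "fried rice; apple; fried chicken; ice cream", "orange; purple",
  "Krusty", "celery; key lime pie; apple", "red; blue"
)

# Same pipeline as above: split, pivot, then unnest to one item per row
long <- data %>%
  mutate(across(-1, ~ str_split(.x, pattern = "; "))) %>%
  pivot_longer(-1) %>%
  unnest_longer("value")

# How often each food is named, most frequent first
food_counts <- long %>%
  filter(name == "Favorite foods") %>%
  count(value, sort = TRUE)

# Is a given color mentioned by anyone?
anyone_likes_black <- "black" %in% long$value[long$name == "Preferred colors"]
```

With this example data, celery comes out on top (named by three people), and several foods are tied for second place, so "the two most favored foods" would need a tie-breaking rule.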

Answer #2 (pbpqsu0x)

Following r2evans's comment, but using the newer separate_longer_delim function from tidyr in place of separate_rows:

library(tidyr)

data |>
    separate_longer_delim(`Favorite foods`, delim = '; ') |>
    separate_longer_delim(`Preferred colors`, delim = '; ')
# A tibble: 22 × 3
#    `Full name` `Favorite foods` `Preferred colors`
#    <chr>       <chr>            <chr>
#  1 Homer       key lime pie     green
#  2 Homer       key lime pie     red
#  3 Homer       celery           green
#  4 Homer       celery           red
#  5 Homer       fried rice       green
#  …

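Note, however, that this produces every food/color combination for each person, which implies a relationship between foods and colors that may not exist. Maël's answer provides a better structure for the more likely case that these preferences are unrelated.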
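
If the two preferences are indeed unrelated, one way to avoid the cross product while still using separate_longer_delim is to split each column into its own table. This is a sketch under that assumption (it needs tidyr 1.3.0 or later and R 4.1 or later for the native pipe; the names `foods` and `colors` are illustrative, and the data is recreated so the snippet runs on its own):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Recreate the example data from the question
data <- tribble(
  ~"Full name", ~"Favorite foods", ~"Preferred colors",
  "Homer", "key lime pie; celery; fried rice", "green; red",
  "Marge", "celery; ice cream", NA,
  "Mr. Burns", "fried rice; apple; fried chicken; ice cream", "orange; purple",
  "Krusty", "celery; key lime pie; apple", "red; blue"
)

# One row per (person, food), independent of colors
foods <- data |>
  select(`Full name`, `Favorite foods`) |>
  separate_longer_delim(`Favorite foods`, delim = "; ")

# One row per (person, color); a missing value stays as a single NA row
colors <- data |>
  select(`Full name`, `Preferred colors`) |>
  separate_longer_delim(`Preferred colors`, delim = "; ")
```

Each table can then be counted or plotted on its own, for example with `count(foods, `Favorite foods`, sort = TRUE)`.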
