Create a new column based on conditions using column headers: R

az31mfrm  posted on 2023-07-31  in  Other

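I want to create another column, category, whose value depends on which column headers have non-zero integers. Specifically, if all non-zero integers are in columns whose header contains "rural", the value of category should be "rural". If all non-zero integers are in columns whose header contains "urban", the value should be "urban". If non-zero integers appear in both "rural" and "urban" columns, the value should be "mixed".
For this purpose, NA and 0 are treated the same.
This is a snippet of my data frame. Note that there are more columns starting with "hosp", as well as columns that are irrelevant for this purpose.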

df <- data.frame(ID = c(1, 2, 3, 4),
                 hosp1_ruralSEDD = c(0, 0, 2, 2), 
                 hosp1_urbanSEDD = c(0, 0, 0, 0), 
                 hosp1_ruralSID = c(0, 0, 0, 1), 
                 hosp1_urbanSID = c(12, 1, 1, 0), 
                 hosp2_ruralSEDD = c("NA", "0", "0", "0"), 
                 hosp2_urbanSEDD = c("NA", "0", "0", "0"), 
                 hosp2_ruralSID = c("NA", "0", "0", "0"), 
                 hosp2_urbanSID = c("NA", "9", "6", "0"), 
                 hosp3_ruralSEDD = c("NA", "NA", "NA", "1"), 
                 hosp3_urbanSEDD = c("NA", "NA", "NA", "0"),
                 hosp3_ruralSID = c("NA", "NA", "NA", "0"), 
                 hosp3_urbanSID = c("NA", "NA", "NA", "0"))

This would be my ideal output:

[image: desired output table, not preserved]
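
The rule described in the question can also be sketched directly in base R, without reshaping. This is only a minimal illustration: the data frame `toy`, its values, and the helper `has_nonzero()` are made up for this example, and it assumes the hospital columns are already numeric (the quoted values in the snippet above would first need converting, for example with `type.convert()`).

```r
# Toy data in the same shape as the question (values are made up for illustration)
toy <- data.frame(
  ID = 1:4,
  hosp1_ruralSEDD = c(0, 0, 2, 0),
  hosp1_urbanSID  = c(12, NA, 1, 0),
  hosp2_ruralSID  = c(NA, 3, 0, 0),
  note = c("a", "b", "c", "d")   # unrelated column, skipped by the patterns below
)

# Pick columns by header; anchoring on "hosp" leaves unrelated columns out
rural_cols <- grep("^hosp.*rural", names(toy), value = TRUE)
urban_cols <- grep("^hosp.*urban", names(toy), value = TRUE)

# Hypothetical helper: TRUE when a row has at least one value that is
# neither 0 nor NA in the given columns
has_nonzero <- function(d, cols) {
  unname(rowSums(d[cols] != 0, na.rm = TRUE)) > 0
}

rural <- has_nonzero(toy, rural_cols)
urban <- has_nonzero(toy, urban_cols)

# Both kinds present -> "mixed"; only one kind -> that kind; none -> NA
toy$category <- ifelse(rural & urban, "mixed",
                ifelse(rural, "rural",
                ifelse(urban, "urban", NA_character_)))

toy[c("ID", "category")]
```

Rows with no non-zero value at all get `NA` here; the question does not say what should happen in that case.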

xxslljrj1#

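library(dplyr)
library(tidyr)

df %>%
  # turn the quoted values into integers; the string "NA" becomes a real NA
  type.convert(as.is = TRUE) %>%
  # stack all "rural" columns into `rural` and all "urban" columns into `urban`
  pivot_longer(-ID, names_to = ".value", names_pattern = "_([a-z]+)") %>%
  summarise(.by = ID, rural = max(rural, na.rm = TRUE),
            urban = max(urban, na.rm = TRUE),
            category = case_when(rural > 0 & urban > 0 ~ 'Mixed',
                                 rural > 0 ~ 'Rural',
                                 urban > 0 ~ 'Urban'))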

# A tibble: 4 × 4
     ID rural urban category
  <int> <int> <int> <chr>   
1     1     0    12 Urban   
2     2     0     9 Urban   
3     3     2     6 Mixed   
4     4     2     0 Rural

# To attach `category` to the original columns, drop the helper columns and join back
df %>%
  type.convert(as.is = TRUE) %>%
  pivot_longer(-ID, names_to = ".value", names_pattern = "_([a-z]+)") %>%
  summarise(.by = ID, rural = max(rural, na.rm = TRUE),
            urban = max(urban, na.rm = TRUE),
            category = case_when(rural > 0 & urban > 0 ~ 'Mixed',
                                 rural > 0 ~ 'Rural',
                                 urban > 0 ~ 'Urban')) %>%
  select(-c(rural, urban)) %>%
  right_join(df, by = "ID")

3bygqnnd2#

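Your data is in wide format. I find these kinds of tasks are handled more effectively in long format and then converted back. If there are other, unrelated columns, this code will need adjusting: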

library(tidyr)
library(dplyr)

df |>
  mutate(across(-ID, as.numeric)) |>
  pivot_longer(cols = -ID,
               names_pattern = "(hosp\\d)_(rural|urban)(.*)",
               names_to = c("hosp", ".value", "x")) |>
  mutate(category = case_when(
    any(rural != 0, na.rm = TRUE) & any(urban != 0, na.rm = TRUE) ~ "Mixed",
    any(rural != 0, na.rm = TRUE) ~ "Rural",
    any(urban != 0, na.rm = TRUE) ~ "Urban",
    TRUE ~ "Other"
  ), .by = ID) |>
  pivot_wider(id_cols = c(ID, category),
              names_from = c(hosp, x),
              names_glue = "{hosp}_{.value}_{x}",
              values_from = rural:urban)

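Note: mutate(across(...)) is needed because your sample data has quoted numbers in some columns (which makes them character) and plain numbers in others. Columns in an R data frame are *atomic*, meaning each must hold a single type, so you cannot combine character and numeric columns when pivoting to long format. Coercing these columns to numeric can produce warning messages. Your real data is probably all numeric already, in which case this line is not needed.

Output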

ID category hosp1_rural_SEDD hosp1_rural_SID hosp2_rural_SEDD hosp2_rural_SID hosp3_rural_SEDD hosp3_rural_SID hosp1_urban_SEDD hosp1_urban_SID hosp2_urban_SEDD
  <dbl> <chr>               <dbl>           <dbl>            <dbl>           <dbl>            <dbl>           <dbl>            <dbl>           <dbl>            <dbl>
1     1 Urban                   0               0               NA              NA               NA              NA                0              12               NA
2     2 Urban                   0               0                0               0               NA              NA                0               1                0
3     3 Mixed                   2               0                0               0               NA              NA                0               1                0
4     4 Rural                   2               1                0               0                1               0                0               0                0
# ℹ 3 more variables: hosp2_urban_SID <dbl>, hosp3_urban_SEDD <dbl>, hosp3_urban_SID <dbl>
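
To see why the conversion step in the note above matters, here is a small base R sketch. The vector `x` is a made-up stand-in for one of the quoted columns in the question: as a character vector its "NA" entries are ordinary strings, and they only become real missing values after conversion.

```r
# Made-up stand-in for a quoted column such as hosp2_urbanSID
x <- c("NA", "0", "9")

# As character, "NA" is just a two-letter string, not a missing value
is.na(x)

# type.convert() reads "NA" as missing by default and picks a numeric type
conv <- type.convert(x, as.is = TRUE)
class(conv)
is.na(conv)

# After conversion, a non-zero test with na.rm = TRUE skips the missing entry
any(conv != 0, na.rm = TRUE)
```

The same idea applies whether the conversion is done with `type.convert()` on the whole data frame or with `as.numeric()` per column.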

2ledvvac3#

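An approach using pivot_longer: extract the category from the column names, work out the condition, and finally use pivot_wider to return to the original format.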

df <- data.frame(Vectorize(\(x) as.numeric(x))(df)) # setting all numeric

library(dplyr)
library(tidyr)
library(stringr)

df %>% 
  pivot_longer(-ID) %>% 
  mutate(Category = gsub(".*_|[A-Z].*", "", name), 
         Category = list(unique(Category[value > 0 & !is.na(value)])), 
         Category = ifelse(lengths(Category) > 1, 
           "Mixed", str_to_title(unlist(Category)[1])), .by = ID) %>% 
  pivot_wider(names_from=name, values_from=value)
# A tibble: 4 × 14
     ID Category hosp1…¹ hosp1…² hosp1…³ hosp1…⁴ hosp2…⁵ hosp2…⁶ hosp2…⁷ hosp2…⁸
  <dbl> <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1     1 Urban          0       0       0      12      NA      NA      NA      NA
2     2 Urban          0       0       0       1       0       0       0       9
3     3 Mixed          2       0       0       1       0       0       0       6
4     4 Rural          2       0       1       0       0       0       0       0
# … with 4 more variables: hosp3_ruralSEDD <dbl>, hosp3_urbanSEDD <dbl>,
#   hosp3_ruralSID <dbl>, hosp3_urbanSID <dbl>, and abbreviated variable names
#   ¹​hosp1_ruralSEDD, ²​hosp1_urbanSEDD, ³​hosp1_ruralSID, ⁴​hosp1_urbanSID,
#   ⁵​hosp2_ruralSEDD, ⁶​hosp2_urbanSEDD, ⁷​hosp2_ruralSID, ⁸​hosp2_urbanSID
# ℹ Use `colnames()` to see all variable names
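
The name-extraction step in this answer can be checked on its own. The sketch below applies the same `gsub()` pattern to a few column names from the question; `nms` and `values` are illustrative vectors, not part of the original answer.

```r
nms <- c("hosp1_ruralSEDD", "hosp2_urbanSID", "hosp3_ruralSID")

# ".*_" strips everything up to the last underscore; "[A-Z].*" strips the
# upper-case suffix (SEDD, SID), leaving only the lower-case middle part
area <- gsub(".*_|[A-Z].*", "", nms)
area

# Illustrative values for one ID: a row counts as "Mixed" only when more than
# one distinct label has a value that is positive and not NA
values <- c(2, 0, 1)
length(unique(area[values > 0 & !is.na(values)])) > 1
```

The pattern relies on the naming scheme shown in the question (lower-case area label followed by an upper-case suffix); differently named columns would need a different pattern.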
