R - Split numeric values from units in data frame columns and append the units to the column names

Posted by zpjtge22 12 months ago in Other

I have a sample data frame as follows:

df_test <- data.frame(
  ID = c(1:5),
  Length = c("17 mm","20 mm", "32 mm", "69 mm", "100 mm"),
  Vmax = c("17 cm2","20 cm2", "32 cm2", "69 cm2", "100 cm2"),
  Age = c("17 years","20 years", "32 years", "69 years", "100 years"),
  PG = c("17 mmHg","20 mmHg", "32 mmHg", "69 mmHg", "100 mmHg"),
  Weight = c("17 kg","20 kg", "32 kg", "69 kg", "100 kg")
)

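I want to identify the unit in each column, append it to the column name, and keep only the numeric values. The result should look like this: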

ID Length_mm Vmax_cm2 Age_years PG_mmHg Weight_kg
1  1        17       17        17      17        17
2  2        20       20        20      20        20
3  3        32       32        32      32        32
4  4        69       69        69      69        69
5  5       100      100       100     100       100


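I would then like to go one step further and be able to specify which columns this rule is applied to.
With the code below, I can do this successfully for a single specified column: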

df_test$Vmax

units <- gsub(".* ", "", df_test$Vmax)
units <- units[!is.na(units)]
units <- unique(units)

df_test$Vmax <- as.numeric(gsub(" .*$", "", df_test$Vmax))

col_index <- which(colnames(df_test) == "Vmax")
col_name <- colnames(df_test)[col_index]

new_col_name <- paste(col_name, units, sep="_")
names(df_test)[col_index] <- new_col_name


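I don't know how to apply this approach to multiple columns. I tried putting it in a for loop and using mutate(across()), but without success.
Note: to clarify, the value and the unit are always separated by a space. The unit is consistent within any given column. The columns may also contain some NA values.
Any help with this would be greatly appreciated. Thanks!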


ifsvaxew1#

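We can reshape to long format, separate each measurement into a value and a unit, convert the values to numeric, and then reshape back to wide, combining the original column name and the unit into the new name.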

library(tidyr)
df_test %>%
  pivot_longer(-ID) %>%
  separate(value, c("val", "unit"), convert = TRUE) %>%
  pivot_wider(names_from = c(name, unit), values_from = val)

Result:

# A tibble: 5 x 6
     ID Length_mm Vmax_cm2 Age_years PG_mmHg Weight_kg
  <int>     <int>    <int>     <int>   <int>     <int>
1     1        17       17        17      17        17
2     2        20       20        20      20        20
3     3        32       32        32      32        32
4     4        69       69        69      69        69
5     5       100      100       100     100       100
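
The question also asks about restricting the rule to chosen columns and about NA values. A minimal sketch of how the reshaping approach above might be adapted, assuming a hypothetical data frame `df_demo` and a hypothetical vector `cols` naming the columns to process (neither is from the original post): `values_drop_na = TRUE` drops the NA cells before splitting so that no `Length_NA` style names appear, and an explicit `sep = " "` keeps decimal values such as "17.5 mm" intact.

```r
library(tidyr)

# Hypothetical sample data: only Length and Weight carry units; Note is left alone.
df_demo <- data.frame(
  ID = 1:3,
  Length = c("17 mm", NA, "32 mm"),
  Weight = c("17 kg", "20 kg", NA),
  Note = c("a", "b", "c")
)

cols <- c("Length", "Weight")  # columns to process (an assumption for this sketch)

res <- df_demo %>%
  # Reshape only the chosen columns; NA cells are dropped here
  pivot_longer(all_of(cols), values_drop_na = TRUE) %>%
  # Split "17 mm" at the space into a value and a unit
  separate(value, c("val", "unit"), sep = " ", convert = TRUE) %>%
  # Back to wide; the cells dropped above reappear as NA
  pivot_wider(names_from = c(name, unit), values_from = val)

res
```

One caveat: a row whose selected columns are all NA would disappear from the result, because none of its cells survive the `values_drop_na` step.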


jhkqcmku2#

Try this:

library(tidyverse)

df_test %>% 
  pivot_longer(-ID, names_to = "column", values_to = "value") %>% 
  separate(col = value, into = c("value", "unit")) %>% 
  
  transmute(
    ID, 
    column = str_c(column, unit, sep = "_"),
    value = as.numeric(value)
  ) %>% 
  
  pivot_wider(names_from = column, values_from = value) 

# A tibble: 5 × 6
     ID Length_mm Vmax_cm2 Age_years PG_mmHg Weight_kg
  <int>     <dbl>    <dbl>     <dbl>   <dbl>     <dbl>
1     1        17       17        17      17        17
2     2        20       20        20      20        20
3     3        32       32        32      32        32
4     4        69       69        69      69        69
5     5       100      100       100     100       100

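Also, try lower_case + camel_case for your column names; it avoids some headaches!
"There may also be some NA values in the columns."
PS: a small revision to handle the NA data.
Let's say: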

df_test <- data.frame(
  ID = c(1:5),
  Length = c("17 mm","20 mm", "32 mm", NA_character_, "100 mm"),
  Vmax = c("17 cm2","20 cm2", "32 cm2", "69 cm2", NA_character_),
  Age = c("17 years", NA_character_, NA_character_, "69 years", "100 years"),
  PG = c("17 mmHg","20 mmHg", "32 mmHg", "69 mmHg", "100 mmHg"),
  Weight = c(NA_character_,"20 kg", "32 kg", "69 kg", "100 kg")
)


As long as at least one value in each column is not NA, a fill grouped by "column" does the job:

df_test %>% 
  pivot_longer(-ID, names_to = "column", values_to = "value") %>% 
  separate(col = value, into = c("value", "unit")) %>% 
  
  with_groups(column, \(x) fill(x, unit, .direction = "downup")) %>% 
  
  transmute(
    ID, 
    column = str_c(column, unit, sep = "_"),
    value = as.numeric(value)
  ) %>% 
  
  pivot_wider(names_from = column, values_from = value) 

# A tibble: 5 × 6
     ID Length_mm Vmax_cm2 Age_years PG_mmHg Weight_kg
  <int>     <dbl>    <dbl>     <dbl>   <dbl>     <dbl>
1     1        17       17        17      17        NA
2     2        20       20        NA      20        20
3     3        32       32        NA      32        32
4     4        NA       69        69      69        69
5     5       100       NA       100     100       100
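
Since the question mentions trying mutate(across()), here is a minimal sketch of that route without any reshaping. It assumes a hypothetical data frame `df_demo` and a hypothetical vector `cols` of columns to process (both made up for illustration). The unit of each column is read from its first non-NA entry, which relies on the stated guarantee that units are consistent within a column.

```r
library(dplyr)

# Hypothetical sample data with NA values and one decimal value
df_demo <- data.frame(
  ID = 1:3,
  Length = c("17 mm", NA, "32 mm"),
  Weight = c(NA, "20 kg", "32.5 kg")
)

cols <- c("Length", "Weight")  # columns to process (an assumption for this sketch)

# Named vector of units, taken from the first non-NA entry of each column
units <- sapply(df_demo[cols], function(x) sub("^\\S+\\s+", "", na.omit(x)[1]))

res <- df_demo %>%
  # Keep only the part before the first whitespace and convert it to numeric
  mutate(across(all_of(cols), ~ as.numeric(sub("\\s.*$", "", .x)))) %>%
  # Append the matching unit to each processed column name
  rename_with(~ paste(.x, units[.x], sep = "_"), .cols = all_of(cols))

res
```

If a selected column contains only NA values, no unit can be found and its name would end in `_NA`, so such columns would need separate handling.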


iyfamqjs3#

Alternatively, try:

library(stringr)
library(purrr)

# get the units (taken from the first entry of each column)
nm <- sapply(df_test[,-1], \(x){ 
  str_extract(x,'(?<=\\s).*')[1]
})

# rename the variables with units
names(df_test)[2:6] <- paste(names(df_test)[-1], nm, sep = '_')

# remove the units from the variables and convert them to numeric
new_df <- map_df(df_test[,-1], \(x) {
  as.numeric(str_extract(x,'.*(?=\\s)'))
})

# combine the ID with other variables
cbind(ID=df_test[,1],new_df)

Created on 2023-11-09 with reprex v2.0.2

ID Length_mm Vmax_cm2 Age_years PG_mmHg Weight_kg
1  1        17       17        17      17        17
2  2        20       20        20      20        20
3  3        32       32        32      32        32
4  4        69       69        69      69        69
5  5       100      100       100     100       100
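
For completeness, the single-column code from the question can also be wrapped in a loop using base R only. This is a sketch under the question's stated assumptions (value and unit separated by a space, one unit per column); the function name `strip_units` and the sample data `df_demo` are hypothetical.

```r
# Hypothetical helper: strip units from the given columns and append them to the names
strip_units <- function(df, cols) {
  for (col in cols) {
    x <- df[[col]]
    # Unit = everything after the last space, ignoring NA entries
    unit <- unique(sub(".* ", "", x[!is.na(x)]))
    # Stop early if the column is all NA or mixes several units
    stopifnot(length(unit) == 1)
    # Value = everything before the first space
    df[[col]] <- as.numeric(sub(" .*$", "", x))
    names(df)[names(df) == col] <- paste(col, unit, sep = "_")
  }
  df
}

# Hypothetical sample data
df_demo <- data.frame(
  ID = 1:3,
  Length = c("17 mm", NA, "32 mm"),
  PG = c("17 mmHg", "20 mmHg", "32 mmHg"),
  Weight = c("17 kg", "20 kg", NA)
)

# Process only two of the three unit columns; PG is left untouched
out <- strip_units(df_demo, c("Length", "Weight"))
out
```

The loop assigns by column name with `df[[col]]`, which sidesteps the positional indexing (`2:6`) used above and lets the caller choose any subset of columns.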
