R语言 为长数据集中的缺失数据添加行

5sxhfpxr  于 2023-02-14  发布在  其他
关注(0)|答案(1)|浏览(126)

我有一个长格式的数据集。数据集中的每个对象都被观察了很多次(从1到3)。根据受试者的可用性,从周一到周五每天进行一次观察。如果病例观察少于3天,我需要为缺失的观察生成空行(s)如果一个事例在星期一和星期五被观察到,那么第三个观察应该随机出现在星期二、星期三或星期四。下面的代码运行良好,除了一个基本缺陷:当日期是随机产生的,它们可能与受试者已经被观察的日期重叠。2这不应该发生。3如果我的解决方案太复杂而无法编辑,请随意建议完全不同的代码。4谢谢!

#Toy dataset
dataset_long <- data.frame(
  id = c(1, 1, 2, 2, 2, 3, 3, 4, 5, 5),
  observation = c(1, 2, 1, 2, 3, 1, 2, 1, 1, 2),
  day_name = c("Monday", "Tuesday", "Monday", "Wednesday", "Thursday", "Tuesday", "Thursday", "Wednesday", "Monday", "Friday"),
  scores = sample(20:60, 10)
)

#   case observation       day scores
#1     1           1    Monday     32
#2     1           2   Tuesday     31
#3     2           1    Monday     29
#4     2           2 Wednesday     28
#5     2           3  Thursday     22
#6     3           1   Tuesday     45
#7     3           2  Thursday     30
#8     4           1 Wednesday     36
#9     5           1    Monday     58
#10    5           2    Friday     37

#Identify cases with fewer than 3 observations
cases_to_fill <- dataset_long %>%
  group_by(case) %>%
  summarize(days_observed = n()) %>%
  filter(days_observed < 3) %>%
  ungroup()

#Create vector with day names
all_days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

#Fill the missing observation day randomly
cases_to_fill_long <- cases_to_fill %>%
  mutate(missing_day = map(days_observed, ~sample(all_days, 3 - .x))) %>%
  unnest(missing_day) %>%
  mutate(observation = 3)

#Join the filled cases with the original dataset
dataset_long_filled <- dataset_long %>%
  full_join(cases_to_fill_long, by = c("case", "observation")) %>%
  arrange(case, observation)

#Coalesce the two columns of day into one
dataset_long_filled |> 
  mutate(day = coalesce(day, missing_day)) |> 
  select(-days_observed, -missing_day)

#   case observation       day scores
#1     1           1    Monday     32
#2     1           2   Tuesday     31
#3     1           3    Friday     NA
#4     2           1    Monday     29
#5     2           2 Wednesday     28
#6     2           3  Thursday     22
#7     3           1   Tuesday     45
#8     3           2  Thursday     30
#9     3           3   Tuesday     NA Tuesday is repeated for this subject
#10    4           1 Wednesday     36
#11    4           3 Wednesday     NA Wednesday is repeated for this subject
#12    4           3    Monday     NA
#13    5           1    Monday     58
#14    5           2    Friday     37
#15    5           3    Friday     NA Friday is repeated for this subject
uwopmtnx

uwopmtnx1#

您可以尝试:

library(tidyr)
library(dplyr)

dataset_long %>%
  group_by(id) %>%
  complete(day_name = sample(all_days[!(all_days %in% day_name)], 3-n())) %>%
  arrange(id, match(day_name, all_days)) %>%
  mutate(observation = row_number()) %>%
  ungroup()

# A tibble: 15 × 4
      id day_name  observation scores
   <dbl> <chr>           <int>  <int>
 1     1 Monday              1     51
 2     1 Tuesday             2     30
 3     1 Thursday            3     NA
 4     2 Monday              1     42
 5     2 Wednesday           2     53
 6     2 Thursday            3     35
 7     3 Tuesday             1     25
 8     3 Thursday            2     27
 9     3 Friday              3     NA
10     4 Monday              1     NA
11     4 Wednesday           2     59
12     4 Thursday            3     NA
13     5 Monday              1     50
14     5 Thursday            2     NA
15     5 Friday              3     54

相关问题