I have the following dataset:
name = c("john", "john", "john", "sarah", "sarah", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2014, 2010, 2015, 2011, 2012, 2013, 2015)
age = c(21, 22, 25, 55, 60, 61, 62, 63, 65)
gender = c("male", "male", "male", "female", "female", "male", "male", "male", "male" )
country_of_birth = c("australia", "australia", "australia", "uk", "uk", "mexico", "mexico", "mexico", "mexico")
my_data = data.frame(name, year, age, gender, country_of_birth)
   name year age gender country_of_birth
1  john 2010  21   male        australia
2  john 2011  22   male        australia
3  john 2014  25   male        australia
4 sarah 2010  55 female               uk
5 sarah 2015  60 female               uk
6 peter 2011  61   male           mexico
7 peter 2012  62   male           mexico
8 peter 2013  63   male           mexico
9 peter 2015  65   male           mexico
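As you can see, some people in this dataset have missing years. Assume that each person's first row holds their earliest year and their last row holds their latest year.
For each person in this dataset, I want to "fill in" the missing rows between those two years. In each missing row:
- I want the "age" variable to increase by 1 (e.g. in 2012 John should be 23, and in 2013 John should be 24)
- I want the "gender" variable to stay the same
- I want the "country_of_birth" variable to stay the same
Here is the R code I am using: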
library(tidyr)
library(dplyr)
my_data %>%
group_by(name) %>%
complete(year = full_seq(year, period = 1)) %>%
fill(year, age, gender, country_of_birth, .direction = "downup") %>%
mutate(real_age= age - (row_number() - 1)) %>%
ungroup
This code runs and appears to add the missing rows, but it does not fill in the age variable correctly:
# A tibble: 16 x 6
   name   year   age gender country_of_birth real_age
   <chr> <dbl> <dbl> <chr>  <chr>               <dbl>
 1 john   2010    21 male   australia              21
 2 john   2011    22 male   australia              21
 3 john   2012    22 male   australia              20
 4 john   2013    22 male   australia              19
 5 john   2014    25 male   australia              21
 6 peter  2011    61 male   mexico                 61
 7 peter  2012    62 male   mexico                 61
 8 peter  2013    63 male   mexico                 61
 9 peter  2014    63 male   mexico                 60
10 peter  2015    65 male   mexico                 61
11 sarah  2010    55 female uk                     55
12 sarah  2011    55 female uk                     54
13 sarah  2012    55 female uk                     53
14 sarah  2013    55 female uk                     52
15 sarah  2014    55 female uk                     51
16 sarah  2015    60 female uk                     55
So far I have been trying different variations of mutate(real_age = age - (row_number() - 1)) to fix this, but nothing has worked.
Can someone show me how to fix this?
Thanks!
1 Answer
s4n0splo1#
One approach is:
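The answer's code is not shown here, so the following is only a minimal sketch of one possible fix, not necessarily what the answerer wrote. It assumes age rises by exactly 1 per calendar year, which holds for every observed row in the question's data. In the question's attempt, fill() copies the last observed age into each new row, and subtracting row_number() - 1 then shifts the observed ages as well. The sketch instead fills only the constant columns and computes age from each person's earliest year. The name `filled` is introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Rebuild the data frame from the question so the example is self-contained
my_data <- data.frame(
  name = c("john", "john", "john", "sarah", "sarah",
           "peter", "peter", "peter", "peter"),
  year = c(2010, 2011, 2014, 2010, 2015, 2011, 2012, 2013, 2015),
  age = c(21, 22, 25, 55, 60, 61, 62, 63, 65),
  gender = c("male", "male", "male", "female", "female",
             "male", "male", "male", "male"),
  country_of_birth = c("australia", "australia", "australia", "uk", "uk",
                       "mexico", "mexico", "mexico", "mexico")
)

filled <- my_data %>%
  group_by(name) %>%
  # Add one row for every year between each person's first and last year
  complete(year = full_seq(year, period = 1)) %>%
  # Only the constant columns are carried into the new rows
  fill(gender, country_of_birth, .direction = "downup") %>%
  # Assumption: age grows by 1 per year, so derive it from the youngest
  # observed age and the distance from the person's earliest year
  mutate(age = min(age, na.rm = TRUE) + (year - min(year))) %>%
  ungroup()

print(filled, n = Inf)
```

With this data, the observed rows keep their original ages and only the added rows receive computed values. If ages in a real dataset do not advance by exactly one per year, the computed column would overwrite the observed values, so it may be safer to write the result to a new column and compare.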