I have a dataset of women of reproductive age and older, with nearly 500,000 individuals.
sample <-
structure(list(iD = c(1, 2), age = c(68, 40), born_1 = c(14,
1), born_2 = c(9, NA), born_3 = c(4, NA), born_4 = c(0, NA),
asfr15 = c(0, 0), asfr16 = c(7.40444, 7.40444), asfr17 = c(42.87749,
7.40444), asfr18 = c(73.05321, 7.40444), asfr19 = c(98.28869,
7.40444), asfr20 = c(144.9568, 7.40444), asfr21 = c(164.976,
180.28), asfr22 = c(180.28, 180.28), asfr23 = c(191.304,
180.28), asfr24 = c(200.199, 180.28), asfr25 = c(200.2018,
180.28), asfr26 = c(200.2007, 180.28), asfr27 = c(200.392,
180.28), asfr28 = c(197.684, 180.28), asfr29 = c(193.7632,
180.28), asfr30 = c(187.6072, 180.28), asfr31 = c(179.5616,
180.28), asfr32 = c(170.5096, 180.28), asfr33 = c(160.8376,
180.28), asfr34 = c(150.484, 180.28), asfr35 = c(139.3304,
180.28), asfr36 = c(127.796, 180.28), asfr37 = c(115.028,
180.28), asfr38 = c(100.72, 180.28), asfr39 = c(85.6256,
180.28), asfr40 = c(182.4384, 180.28), asfr41 = c(178.412,
180.28), asfr42 = c(171.928, 180.28), asfr43 = c(163.316,
180.28), asfr44 = c(152.9056, 180.28), asfr45 = c(141.0264,
180.28), asfr46 = c(128.008, 180.28), asfr47 = c(114.18,
180.28), asfr48 = c(99.872, 180.28), asfr49 = c(85.4136,
180.28)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -2L))
sample1 <-
structure(list(age = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50, 51, 52, 53), L = c(0.986675438, 0.98702345,
0.98637393, 0.98578905, 0.9852613, 0.98478356, 0.9843488, 0.98394965,
0.98357788, 0.98322392, 0.98287637, 0.98252143, 0.98214247, 0.9817192,
0.9812277, 0.98064228, 0.97994116, 0.97911745, 0.97818632, 0.97718334,
0.97615046, 0.97511987, 0.97410767, 0.97311336, 0.97212658, 0.97113462,
0.97012689, 0.9690984, 0.96804964, 0.96698351, 0.96590219, 0.96480536,
0.96368905, 0.96254573, 0.96136596, 0.96014053, 0.9588627, 0.95752879,
0.95613703, 0.954686, 0.95317327, 0.95159409, 0.94993984, 0.94819731,
0.94634926, 0.94437608, 0.94225687, 0.93996942, 0.93748997, 0.93479226,
0.93184638, 0.92861614, 0.92506109, 0.92114516)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -54L))
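Consider the first woman in my sample, with ID 1. She is 68 years old. For this woman, I need each ASFR from age 15 to 49 multiplied by the probability of dying between that age and the year of data collection (2016).
For example, for the first woman asfr15 must be multiplied by (1 - L[53]), because if a 68-year-old woman had a child at age 15, the child would be 53 years old at the time of data collection, and so on:
asfr16 * (1 - L[52])
asfr17 * (1 - L[51])
The same applies to every age.
The second woman is 40 years old. If she had a child at age 15, the child would be 25 years old at the time of the census (2016). So for her, asfr15 must be multiplied by (1 - L[25]).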
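To make the lookup rule concrete, here is a minimal sketch in R. The `weight` helper is a hypothetical name introduced for illustration; the L values are copied from `sample1` for the child ages 25, 51, 52 and 53 only.

```r
# L values copied from the question's `sample1` table, keyed by child age
L <- c("25" = 0.97113462, "51" = 0.92861614,
       "52" = 0.92506109, "53" = 0.92114516)

# Hypothetical helper: multiplier for a mother of `mother_age`
# who gave birth at `birth_age`
weight <- function(mother_age, birth_age) {
  child_age <- mother_age - birth_age      # child's age at data collection
  1 - unname(L[as.character(child_age)])   # 1 - L at that child age
}

w1_asfr15 <- 0       * weight(68, 15)  # woman 1, asfr15, child age 53
w1_asfr16 <- 7.40444 * weight(68, 16)  # woman 1, asfr16, child age 52
w2_asfr15 <- 0       * weight(40, 15)  # woman 2, asfr15, child age 25
```

The multiplier depends only on the difference between the mother's age and the age at birth, which is what makes a join or a matrix lookup on child age possible.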
2 answers
rkue9o1l #1
One option is a tidyverse approach: after reshaping the data to long format, left_join can merge in the L column based on the child's age.
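The answer's original code is not preserved here, so the following is only a sketch of what such a pipeline could look like. It assumes the dplyr and tidyr packages; `women` and `surv` are small made-up stand-ins for the question's `sample` and `sample1`, and their values are illustrative, not real data.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for `sample` and `sample1` (made-up values)
women <- tibble(iD = c(1, 2), age = c(18, 16),
                asfr15 = c(10, 20), asfr16 = c(30, 40), asfr17 = c(50, 60))
surv  <- tibble(age = c(0, 1, 2, 3), L = c(0.99, 0.98, 0.97, 0.96))

result <- women %>%
  # one row per woman and age at birth
  pivot_longer(starts_with("asfr"), names_to = "birth_age",
               names_prefix = "asfr",
               names_transform = list(birth_age = as.integer),
               values_to = "asfr") %>%
  mutate(child_age = age - birth_age) %>%
  # attach L for the child's age; no match (e.g. negative age) gives NA
  left_join(surv, by = c("child_age" = "age")) %>%
  mutate(asfr = asfr * (1 - L)) %>%
  select(-child_age, -L) %>%
  # back to one row per woman
  pivot_wider(names_from = birth_age, values_from = asfr,
              names_prefix = "asfr")
```

Note that in this sketch an age at birth above the woman's current age has no matching row in the L table, so the corresponding cell becomes NA; whether that is the desired treatment is an assumption to check against the real data.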
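However, while the tidyverse approach works for your sample data, I am afraid it will not be efficient for your huge dataset. A matrix-based approach is an alternative. Basically, I first create a matrix of L values in which each row holds the values that correspond to one mother's age and the column names match the asfr columns. The asfr columns can then be multiplied by the corresponding entries of this matrix.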
ssm49v7z #2
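I believe this works, although I am sure there are more elegant solutions. First, create a "helper" function that uses the woman's age to identify the columns that need to be modified and modifies them as you describe, asfrN * (1 - L[age]). Then split the data by ID into a list, apply the function over the whole list, and put the pieces back together with do.call.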
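The answer's original code is not preserved here. As a rough sketch of the split/apply/combine pattern it describes, the following uses a hypothetical helper named `adjust_row` and made-up toy data in place of `sample` and `sample1`.

```r
# Toy stand-ins for `sample` and `sample1` (made-up values)
women <- data.frame(iD = c(1, 2), age = c(18, 16),
                    asfr15 = c(10, 20), asfr16 = c(30, 40), asfr17 = c(50, 60))
surv  <- data.frame(age = c(0, 1, 2, 3), L = c(0.99, 0.98, 0.97, 0.96))

# Hypothetical helper: adjusts the asfr columns of a one-row data frame.
# Columns whose child age has no entry in the L table are left unchanged.
adjust_row <- function(row, surv) {
  for (birth_age in 15:49) {
    col <- paste0("asfr", birth_age)
    idx <- match(row$age - birth_age, surv$age)  # row of L for the child age
    if (col %in% names(row) && !is.na(idx)) {
      row[[col]] <- row[[col]] * (1 - surv$L[idx])
    }
  }
  row
}

# Split by ID, apply the helper to each piece, and bind the rows back together
adjusted <- do.call(rbind,
                    lapply(split(women, women$iD), adjust_row, surv = surv))
```

In this sketch, columns for ages at birth above the woman's current age keep their original values, matching the answer's remark that only the age-appropriate columns are modified. Looping over one-row data frames is simple to read but likely slow for roughly 500,000 women.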
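Although this is hard to demonstrate here because the sample data is so wide, only the age-appropriate columns of the original sample data are modified. Printing columns asfr15 to asfr18 and asfr22 to asfr28 shows that the transformation stops at the correct column for ID #2.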