在特定条件下将来自两个不同 Dataframe 的列相乘

g6ll5ycj  于 12个月前  发布在  其他
关注(0)|答案(2)|浏览(108)

我有一个育龄及以上妇女的数据集,有近50万人。

sample <-
 structure(list(iD = c(1, 2), age = c(68, 40), born_1 = c(14, 
 1), born_2 = c(9, NA), born_3 = c(4, NA), born_4 = c(0, NA), 
     asfr15 = c(0, 0), asfr16 = c(7.40444, 7.40444), asfr17 = c(42.87749, 
     7.40444), asfr18 = c(73.05321, 7.40444), asfr19 = c(98.28869, 
     7.40444), asfr20 = c(144.9568, 7.40444), asfr21 = c(164.976, 
     180.28), asfr22 = c(180.28, 180.28), asfr23 = c(191.304, 
     180.28), asfr24 = c(200.199, 180.28), asfr25 = c(200.2018, 
     180.28), asfr26 = c(200.2007, 180.28), asfr27 = c(200.392, 
     180.28), asfr28 = c(197.684, 180.28), asfr29 = c(193.7632, 
     180.28), asfr30 = c(187.6072, 180.28), asfr31 = c(179.5616, 
     180.28), asfr32 = c(170.5096, 180.28), asfr33 = c(160.8376, 
     180.28), asfr34 = c(150.484, 180.28), asfr35 = c(139.3304, 
     180.28), asfr36 = c(127.796, 180.28), asfr37 = c(115.028, 
     180.28), asfr38 = c(100.72, 180.28), asfr39 = c(85.6256, 
     180.28), asfr40 = c(182.4384, 180.28), asfr41 = c(178.412, 
     180.28), asfr42 = c(171.928, 180.28), asfr43 = c(163.316, 
     180.28), asfr44 = c(152.9056, 180.28), asfr45 = c(141.0264, 
     180.28), asfr46 = c(128.008, 180.28), asfr47 = c(114.18, 
     180.28), asfr48 = c(99.872, 180.28), asfr49 = c(85.4136, 
     180.28)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -2L))

sample1 <-
 structure(list(age = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 
 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 
 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 
 44, 45, 46, 47, 48, 49, 50, 51, 52, 53), L = c(0.986675438, 0.98702345, 
 0.98637393, 0.98578905, 0.9852613, 0.98478356, 0.9843488, 0.98394965, 
 0.98357788, 0.98322392, 0.98287637, 0.98252143, 0.98214247, 0.9817192, 
 0.9812277, 0.98064228, 0.97994116, 0.97911745, 0.97818632, 0.97718334, 
 0.97615046, 0.97511987, 0.97410767, 0.97311336, 0.97212658, 0.97113462, 
 0.97012689, 0.9690984, 0.96804964, 0.96698351, 0.96590219, 0.96480536, 
 0.96368905, 0.96254573, 0.96136596, 0.96014053, 0.9588627, 0.95752879, 
 0.95613703, 0.954686, 0.95317327, 0.95159409, 0.94993984, 0.94819731, 
 0.94634926, 0.94437608, 0.94225687, 0.93996942, 0.93748997, 0.93479226, 
 0.93184638, 0.92861614, 0.92506109, 0.92114516)), class = c("tbl_df", 
 "tbl", "data.frame"), row.names = c(NA, -54L))

字符串
考虑到第一个女人是我的样本与ID号1.她是68岁.对于这个女人,我需要asfrs从15到49乘以从该年龄到数据收集的一年(2016)死亡的概率.
例如,第一个妇女[asfr 15]必须乘以[(1- L 53)],因为如果一个68岁的妇女在15岁时有一个孩子,那么在收集数据时她的孩子必须是53岁,等等。
Asfr 16 * [(1- L 52)]
Asfr 17* [(1- L 51)]
每个年龄段都是如此。
对于第二个女人,因为她已经40岁了。我她在15岁时有一个孩子,她的孩子在人口普查时(2016年)将是25岁。因此,我们需要将asfr 15乘以 * [(1- L25)]。

rkue9o1l

rkue9o1l1#

tidyverse方法看起来可能是这样的。通过整形为long,我们可以很容易地使用left_join根据孩子的年龄合并L列。

library(tidyverse)

sample |>
  pivot_longer(starts_with("asfr")) |>
  mutate(age_child = age - readr::parse_number(name)) |>
  left_join(sample1, by = c("age_child" = "age")) |>
  mutate(value = value * (1 - L)) |>
  select(-age_child, -L) |>
  pivot_wider(
    names_from = name,
    values_from = value
  )
#> # A tibble: 2 × 41
#>      iD   age born_1 born_2 born_3 born_4 asfr15 asfr16 asfr17 asfr18 asfr19
#>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1     1    68     14      9      4      0      0  0.555  3.06   4.98   6.41 
#> 2     2    40      1     NA     NA     NA      0  0.206  0.199  0.192  0.184
#> # ℹ 30 more variables: asfr20 <dbl>, asfr21 <dbl>, asfr22 <dbl>, asfr23 <dbl>,
#> #   asfr24 <dbl>, asfr25 <dbl>, asfr26 <dbl>, asfr27 <dbl>, asfr28 <dbl>,
#> #   asfr29 <dbl>, asfr30 <dbl>, asfr31 <dbl>, asfr32 <dbl>, asfr33 <dbl>,
#> #   asfr34 <dbl>, asfr35 <dbl>, asfr36 <dbl>, asfr37 <dbl>, asfr38 <dbl>,
#> #   asfr39 <dbl>, asfr40 <dbl>, asfr41 <dbl>, asfr42 <dbl>, asfr43 <dbl>,
#> #   asfr44 <dbl>, asfr45 <dbl>, asfr46 <dbl>, asfr47 <dbl>, asfr48 <dbl>,
#> #   asfr49 <dbl>

字符串
然而,虽然这适用于您的示例数据,但恐怕对于您的庞大数据集来说效率不高。因此,这里是一个基于矩阵的方法。基本上,我首先创建一个L值的矩阵,其中每行反映了基于母亲年龄的相应值,并且名称对应于asfr列。然后,我们可以将asfr列乘以这个矩阵。

L_mat <- sapply(sample$age, \(age) {
  L <- c(rep(NA, 35), sample1$L) # 35 = 49 - 15 + 1
  names(L) <- c(-rev(seq_len(35)), sample1$age)
  L[as.character(age - 15:49)]
}) |>
  t()
colnames(L_mat) <- paste0("L", 15:49)

sample[paste0("asfr", 15:49)] <-
  sample[paste0("asfr", 15:49)] * (1 - L_mat)
sample
#> # A tibble: 2 × 41
#>      iD   age born_1 born_2 born_3 born_4 asfr15 asfr16 asfr17 asfr18 asfr19
#>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1     1    68     14      9      4      0      0  0.555  3.06   4.98   6.41 
#> 2     2    40      1     NA     NA     NA      0  0.206  0.199  0.192  0.184
#> # ℹ 30 more variables: asfr20 <dbl>, asfr21 <dbl>, asfr22 <dbl>, asfr23 <dbl>,
#> #   asfr24 <dbl>, asfr25 <dbl>, asfr26 <dbl>, asfr27 <dbl>, asfr28 <dbl>,
#> #   asfr29 <dbl>, asfr30 <dbl>, asfr31 <dbl>, asfr32 <dbl>, asfr33 <dbl>,
#> #   asfr34 <dbl>, asfr35 <dbl>, asfr36 <dbl>, asfr37 <dbl>, asfr38 <dbl>,
#> #   asfr39 <dbl>, asfr40 <dbl>, asfr41 <dbl>, asfr42 <dbl>, asfr43 <dbl>,
#> #   asfr44 <dbl>, asfr45 <dbl>, asfr46 <dbl>, asfr47 <dbl>, asfr48 <dbl>,
#> #   asfr49 <dbl>

ssm49v7z

ssm49v7z2#

我相信这是可行的,尽管我 * 肯定 * 有更优雅的解决方案-首先创建一个“helper”函数,它将根据年龄识别需要修改的列,并根据您的描述修改它们(asfrN * (1-L[age]))。然后我根据ID将split转换为列表,并在整个列表中应用该函数,将其与do.call重新组合在一起

helperFun <- function(x){
  asfr <- paste0("asfr", 15:min((x$age - 15), 49))
  x[asfr] <- lapply(x[asfr], \(y) 
                    y * (1-(sample1[sample1$age == (x$age - 15),"L"])))
  x
}

do.call(rbind, lapply(split(sample, sample$iD), helperFun))

字符串
尽管由于样本数据的宽度而难以在此演示,但在序数sample数据中仅修改了与年龄相适应的列。
下面我输出列asfr 15 -18和22-28,以显示转换在ID #2的正确列处停止:

# A tibble: 2 × 13
#      iD   age asfr15$L asfr16$L asfr17$L asfr18$L asfr22$L asfr23$L asfr24$L asfr25$L asfr26$L asfr27$L asfr28$L
#   <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
# 1     1    68        0    0.584    3.38     5.76     14.2     15.1     15.8     15.8      15.8     15.8     15.6
# 2     2    40        0    0.214    0.214    0.214     5.20     5.20     5.20     5.20    180.     180.     180.

相关问题