R: stratified sampling with a fixed total sample size rather than a fixed per-stratum sample size

Posted by 9rygscc1 on 2023-02-27 in Other

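I have a dataset that I want to downsample as evenly as possible across a given variable. Suppose the data frame has 54 observations and the downsampled set has a fixed total size of 25. Because some levels of the stratification variable have a small n, trying to select an equal number from each group fails: the smallest group has fewer observations than the intended per-group size (2 < 5 in the example below). Rather than duplicating observations with replace = TRUE, I would like to select all observations in the smaller groups and then make up the shortfall from the other strata until the specified sample size is reached. In other words, once the first group, with only 2 observations, cannot be sampled any further, the numbers taken from the remaining groups increase until 25 rows have been selected. This gives a downsample that is as even as possible across the strata, without duplicates.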

df <- data.frame(
  strat_group = c(rep("one", 2), rep("two", 10), rep("three", 5), rep("four", 25), rep("five", 12))
)

strat_group_size <- (25 / length(unique(df$strat_group)))

df |>
  dplyr::group_by(strat_group) |>
  dplyr::slice_sample(n = strat_group_size)

Error in `dplyr::slice_sample()`:
! Problem while computing indices.
ℹ The error occurred in group 3: strat_group = "one".
Caused by error in `sample.int()`:
! cannot take a sample larger than the population when 'replace = FALSE'

What I want is a method that downsamples evenly across the stratification groups until a specific total is reached (N = 25).

df <- data.frame(
  strat_group = c(rep("1", 2), rep("2", 6), rep("3", 5), rep("4", 6), rep("5", 6))
)
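
The target composition above (2, 6, 5, 6, 6) can be derived from the group sizes alone. The following is a minimal sketch, not part of the original post: `allocate_even()` is a hypothetical helper name, and it assumes the group sizes are passed as a (named) integer vector. It hands out one row per group per round, drops a group once it is exhausted, and breaks ties at random in the final round.

```r
# Hypothetical helper (not from any package): how many rows to take per group
allocate_even <- function(sizes, N) {
  stopifnot(N <= sum(sizes))
  take <- sizes * 0L                       # rows taken so far, names preserved
  while (sum(take) < N) {
    open <- which(take < sizes)            # groups that still have rows left
    need <- N - sum(take)                  # rows still required
    # final round: fewer rows needed than open groups, so pick groups at random
    if (length(open) > need) open <- open[sample.int(length(open), need)]
    take[open] <- take[open] + 1L
  }
  take
}

sizes <- c(one = 2L, two = 10L, three = 5L, four = 25L, five = 12L)
allocate_even(sizes, 25)
# one 2, two 6, three 5, four 6, five 6
```

For these sizes the result does not depend on the random tie-break, because the last round needs exactly as many rows as there are groups still open.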





N=25            # how many rows do we want?
df$sampled = 0  # set each row initially to 'unselected'

for(i in 1:N){
  # find the number taken from each group, and the number remaining in each group
  df$totalpergroup=ave(df$sampled, df$strat_group, FUN=sum)
  df$remaining=ave((1-df$sampled), df$strat_group, FUN=sum)
  # choose an unselected row from the least represented group that has at least one row left
  # use this weird way of sampling a single value because of how 'sample' works when there's only one element
  possibleRows <- which((df$totalpergroup==min(df[df$remaining>0,"totalpergroup"])) & (df$sampled==0))
  rowToAdd <- possibleRows[sample(length(possibleRows),1)]
  # select that row
  df$sampled[rowToAdd] <- 1
}

# Here's my subsampled df
df[df$sampled == 1, ]

   strat_group sampled totalpergroup remaining
1          one       1             2         0
2          one       1             2         0
3          two       1             5         5
7          two       1             5         5
8          two       1             5         5
9          two       1             5         5
10         two       1             5         5
12         two       1             5         5
13       three       1             5         0
14       three       1             5         0
15       three       1             5         0
16       three       1             5         0
17       three       1             5         0
19        four       1             6        19
21        four       1             6        19
27        four       1             6        19
28        four       1             6        19
36        four       1             6        19
42        four       1             6        19
46        five       1             6         6
47        five       1             6         6
48        five       1             6         6
51        five       1             6         6
52        five       1             6         6
53        five       1             6         6
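
The same round-robin selection can also be written without an explicit loop. This is only a sketch of an alternative, not the answer's method: it assumes base R only, and the column name `pick_order` is made up for illustration. Each row gets a random rank within its group, and the N rows with the lowest ranks are kept, so every group contributes its first row before any group contributes a second, and so on.

```r
set.seed(1)  # only to make the example reproducible

df <- data.frame(
  strat_group = c(rep("one", 2), rep("two", 10), rep("three", 5),
                  rep("four", 25), rep("five", 12))
)
N <- 25  # assumed to be no larger than nrow(df)

# Random within-group rank: 1..n_g in shuffled order for each group
df$pick_order <- ave(runif(nrow(df)), df$strat_group, FUN = rank)

# Keep the N rows with the lowest ranks; ties between groups are broken at random
keep <- order(df$pick_order, runif(nrow(df)))[seq_len(N)]
sub <- df[sort(keep), ]

table(sub$strat_group)
# five 6, four 6, one 2, three 5, two 6
```

No row can be selected twice, since `order()` returns each row index once, and small groups are used up in full before larger groups are asked for extra rows.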
