我有一个名为survey_data
的数据集,它遵循以下结构:
dput(survey_data)
structure(list(id = structure(1:20, levels = c("118471098691",
"118471042050", "118471014713", "118470994901", "118470283421",
"118470281485", "118470282930", "118470269057", "118470266821",
"118470265552", "118470171081", "118470141169", "118470145488",
"118470049005", "118470032085", "118470026348", "118470006683",
"118470014124", "118470001292", "118469994597"), class = "factor"),
gender = c("Female", "Male", "Female", "Female", "Female",
"Female", "Female", "Female", "Male", "Female", "Male", "Female",
"Female", "Female", "Male", "Female", "Male", "Female", "Female",
"Female"), age = c("Aged 18-24", "Aged 25-32", "Aged 16-17",
"Aged 16-17", "Aged 18-24", "Aged 16-17", "Aged 16-17", "Aged 16-17",
"Aged 18-24", "Aged 16-17", "Aged 16-17", "Aged 18-24", "Aged 16-17",
"Aged 18-24", "Aged 16-17", "Aged 16-17", "Aged 25-32", "Aged 16-17",
"Aged 16-17", "Aged 16-17"), location = c("North West", "Northern Ireland",
"North East", "North West", "London", "London", "South East",
"West Midlands", "London", "South West", "North East", "North East",
"South West", "London", "West Midlands", "Wales", "Northern Ireland",
"South East", "East Midlands", "West Midlands"), occupation = c("Student",
"Employed full time", "Student", "Student", "Employed full time",
"Student", "Student", "Student", "Student", "Student", "Student",
"Student", "Unemployed", "Employed part time", "Student",
"Student", "Employed full time", "Student", "Student", "Student"
), income = c("Under £10,000", "£75,001 - £100,000", "Under £10,000",
"Under £10,000", "£25,001 - £50,000", "Under £10,000",
"Under £10,000", "Under £10,000", "£75,001 - £100,000",
"Under £10,000", "Under £10,000", "£10,001 - £25,000",
"Under £10,000", "Under £10,000", "Under £10,000", "Under £10,000",
"£25,001 - £50,000", "£10,001 - £25,000", "£10,001 - £25,000",
"£10,001 - £25,000")), row.names = c(NA, -20L), spec = structure(list(
cols = list(id = structure(list(levels = NULL, ordered = FALSE,
include_na = FALSE), class = c("collector_factor", "collector"
)), gender = structure(list(), class = c("collector_character",
"collector")), age = structure(list(), class = c("collector_character",
"collector")), location = structure(list(), class = c("collector_character",
"collector")), occupation = structure(list(), class = c("collector_character",
"collector")), income = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), problems = <pointer: 0x6000001b8070>, class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
字符串
该数据集是一项在线调查的结果,包含近10,000个观察结果和150多个变量。
为了纠正一些过度抽样,并使受访者的个人资料与人口的个人资料相匹配,我需要随机拒绝其中一些记录。为此,我有另外两个表,一个是男性,另一个是女性,其中包含在每个年龄和性别组合中要删除的记录数量,如:males_reject
:
dput(males_reject)
structure(list(region = c("North East", "North West", "Yorkshire and the Humber",
"East Midlands", "West Midlands", "East of England", "London",
"South East", "South West", "Northern Ireland", "Scotland", "Wales"
), `Aged 16-17` = c(0, 8, 1, 1, 4, 0, 7, 2, 4, 0, 2, 1), `Aged 18-24` = c(20,
28, 17, 9, 32, 8, 74, 26, 10, 0, 0, 14), `Aged 25-32` = c(0,
2, 0, 0, 9, 0, 7, 2, 1, 1, 0, 0), `Aged 33-40` = c(1, 0, 1, 1,
8, 4, 3, 1, 0, 0, 2, 1), `Aged 41-50` = c(2, 5, 3, 2, 4, 4, 2,
0, 0, 0, 1, 1), `Aged 51-56` = c(0, 0, 2, 3, 0, 0, 0, 0, 1, 0,
0, 0), `Aged 57-67` = c(4, 12, 6, 2, 2, 1, 0, 6, 6, 1, 6, 1),
`Aged 68-76` = c(3, 6, 7, 4, 0, 0, 6, 2, 4, 3, 5, 0), `Aged 76+` = c(1,
3, 7, 0, 4, 0, 0, 5, 3, 0, 4, 0)), row.names = c(NA, -12L
), spec = structure(list(cols = list(region = structure(list(), class = c("collector_character",
"collector")), `Aged 16-17` = structure(list(), class = c("collector_double",
"collector")), `Aged 18-24` = structure(list(), class = c("collector_double",
"collector")), `Aged 25-32` = structure(list(), class = c("collector_double",
"collector")), `Aged 33-40` = structure(list(), class = c("collector_double",
"collector")), `Aged 41-50` = structure(list(), class = c("collector_double",
"collector")), `Aged 51-56` = structure(list(), class = c("collector_double",
"collector")), `Aged 57-67` = structure(list(), class = c("collector_double",
"collector")), `Aged 68-76` = structure(list(), class = c("collector_double",
"collector")), `Aged 76+` = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), problems = <pointer: 0x6000001a8590>, class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
型
另一个类似的女性群体:
dput(females_reject)
structure(list(region = c("North East", "North West", "Yorkshire and The Humber",
"East Midlands", "West Midlands", "East", "London", "South East",
"South West", "Northern Ireland", "Scotland", "Wales"), `Aged 16-17` = c(5,
16, 14, 13, 19, 13, 37, 11, 13, 2, 6, 9), `Aged 18-24` = c(0,
4, 0, 3, 11, 0, 21, 5, 3, 3, 0, 0), `Aged 25-32` = c(2, 8, 11,
13, 7, 11, 2, 16, 3, 0, 6, 4), `Aged 33-40` = c(9, 15, 17, 11,
13, 10, 5, 15, 17, 6, 10, 7), `Aged 41-50` = c(7, 0, 3, 2, 12,
9, 6, 0, 11, 0, 0, 11), `Aged 51-56` = c(2, 6, 1, 0, 6, 5, 0,
7, 0, 0, 5, 4), `Aged 57-67` = c(3, 1, 4, 0, 0, 0, 0, 18, 3,
0, 3, 8), `Aged 68-76` = c(0, 6, 5, 3, 2, 0, 0, 5, 0, 0, 3, 1
), `Aged 76+` = c(2, 4, 5, 6, 1, 8, 3, 11, 4, 0, 0, 1)), row.names = c(NA,
-12L), spec = structure(list(cols = list(region = structure(list(), class = c("collector_character",
"collector")), `Aged 16-17` = structure(list(), class = c("collector_double",
"collector")), `Aged 18-24` = structure(list(), class = c("collector_double",
"collector")), `Aged 25-32` = structure(list(), class = c("collector_double",
"collector")), `Aged 33-40` = structure(list(), class = c("collector_double",
"collector")), `Aged 41-50` = structure(list(), class = c("collector_double",
"collector")), `Aged 51-56` = structure(list(), class = c("collector_double",
"collector")), `Aged 57-67` = structure(list(), class = c("collector_double",
"collector")), `Aged 68-76` = structure(list(), class = c("collector_double",
"collector")), `Aged 76+` = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), problems = <pointer: 0x6000001ac390>, class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
型
假设数据集的大小,我需要一种方法来自动化这一点,并编写一段代码,获取males_reject
和females_reject
表,并使用这些数字随机拒绝survey_data
表中与年龄/性别定义匹配的观察结果。
因此,预期结果是数据集与原始survey_data
一样,但删除了20条记录(随机选择),这些记录是东北部18-24岁的男性,删除了28条记录,这些记录是西北部18-24岁的男性,依此类推(见上表)。
但我甚至不知道如何开始处理这个问题,因为males_reject
和females_reject
表不是 * 适合连接的数据 *,而是在脚本中用于定义过滤器和决定必须删除多少记录的信息。
有什么想法吗?
编辑:survey_data$age
中的值在最终的数据集中不是数值,而是数值因子。为了匹配值,reject_males
和reject_females
中的列名有空格,尽管它远非理想。
**EDIT(2):**这里提供的数据不是简化表,而是真实的数据集的一个小子集,定义表males_reject
和females_reject
是完整的。
**EDIT(3):**以下代码终于对我起作用了。它是r2 evans在回答中提供的代码,经过一些小的修改。它将初始数据分成两组,并将代码分别应用于每组。然后绑定行以组成新的数据集,并将其写入CSV文件。部分结果也被保存并写入CSV以供检查。
set.seed(42)
## Split the dataset in males and females
survey_data_males <- survey_data %>%
filter(gender == "Male")
survey_data_females <- survey_data %>%
filter(gender == "Female")
## Merge with rejection tables
males_result <- males_reject %>%
pivot_longer(cols = -region,
names_to = "age",
values_to = "drop") %>%
mutate(age = factor(age)) %>%
right_join(survey_data_males,
by = c("region", "age")) %>%
group_by(region, age) %>%
filter(is.na(first(drop)) | sample(n()) > first(drop)) %>%
ungroup()
females_result <- females_reject %>%
pivot_longer(cols = -region,
names_to = "age",
values_to = "drop") %>%
mutate(age = factor(age)) %>%
right_join(survey_data_females,
by = c("region", "age")) %>%
group_by(region, age) %>%
filter(is.na(first(drop)) | sample(n()) > first(drop)) %>%
ungroup()
completes_to_keep <- bind_rows(males_result,
females_result)
write_csv(males_result,
file = "./out/males_to_keep.csv",
col_names = TRUE)
write_csv(females_result,
file = "./out/females_to_keep.csv",
col_names = TRUE)
write_csv(completes_to_keep,
file = "./out/completes_to_keep.csv",
col_names = TRUE)
型
2条答案
按热度按时间gfttwv5a1#
步骤:
reject_males
中的行名作为列,然后旋转/整形,使其处于长格式,更适合于连接survey_data
合并location
和age
分组并删除必需的行数字符串
r6vfmomb2#
下面是一个使用辅助函数和
dplyr::group_modify
的选项。注意,这假设在你的reject Dataframe 中有一列location
。在你的帖子中没有提供列名。如果该列是行名称,那么你可以简单地用grp$location
替换df$location == grp$location
:字符串
我创建了一个小的辅助函数
remove_sample
。它使用组数据信息(性别,位置,年龄)来查找您想要删除的相应数字。使用此信息,它会随机抽取要保留的数量(例如,# total group rows - # rows you want to remove)。如果删除值不适用于该性别,位置,年龄组合或值为零,则不执行任何操作。
数据
型