R: Given a set of dimensions/factors, how can I create a dataset with a fixed number of rows?

q3qa4bjr  posted on 2023-05-11 in Other

I have a set of dimensions and want to build datasets with a fixed number of rows for testing.
Here is a simple example:

# Example dimensions
sex <- c("M", "F")
age <- 18:65
nationality <- c("AU", "AT", "ES", "FR", "MX", "IN")
eye_colour <- c("blue", "brown", "green", "hazel")

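What I need is a function that takes these vectors and returns a data.frame containing all of the columns, no duplicate rows, and a fixed number of rows.
Example output: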

> randomise_data(10, sex, age, nationality, eye_colour)
   sex age nationality eye_colour
1    M  37          AT      brown
2    F  37          AT      brown
3    M  52          AT      brown
4    F  52          AT      brown
5    M  37          FR      brown
6    F  37          FR      brown
7    M  52          FR      brown
8    F  52          FR      brown
9    M  37          AT      green
10   F  37          AT      green

For reference, this is the code I used to produce the dataset above:

vars <- list(
  sex = c("M", "F"),
  age = 18:65,
  nationality = c("AU", "AT", "ES", "FR", "MX", "IN"),
  eye_colour = c("blue", "brown", "green", "hazel")
)

sample_data <- lapply(vars, sample, 2)

expand.grid(sample_data)[1:10,]

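But I don't want to pick that number by hand every time, and it obviously fails with 3 instead of 2, because sex has only two values. In the real case the full combination of all factors is too large for me to compute and then subset to the desired number of rows.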


ct2axkht1#

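You can use replicate. If the number of possible combinations is large enough, the chance of getting duplicates is very small; otherwise you can oversample and drop the duplicates: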

as.data.frame(t(replicate(10, sapply(vars, sample, 1))))

#    sex age nationality eye_colour
# 1    F  43          AU      green
# 2    M  56          IN      hazel
# 3    M  39          AT       blue
# 4    M  34          AT      hazel
# 5    F  29          MX      brown
# 6    F  38          AU      brown
# 7    F  34          ES       blue
# 8    M  41          AU       blue
# 9    M  23          AT      hazel
# 10   M  29          IN      brown
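
The "oversample and drop duplicates" idea can be sketched as follows. This is only a minimal sketch: `sample_unique_rows` and its `oversample` argument are hypothetical names, not part of the answer above. It draws each column independently with `lapply` instead of `sapply`, since `sapply` coerces every draw to character and `age` would otherwise come back as a character column.

```r
# Hypothetical helper: draw rows independently, drop duplicates, repeat until
# at least n unique rows have been collected. Never builds the full grid.
sample_unique_rows <- function(vars, n, oversample = 2) {
  stopifnot(n <= prod(lengths(vars)))  # cannot return more rows than exist
  out <- NULL
  repeat {
    m <- ceiling(n * oversample)
    # Index with sample.int() so each column keeps its own type
    draw <- as.data.frame(
      lapply(vars, function(v) v[sample.int(length(v), m, replace = TRUE)]),
      stringsAsFactors = FALSE
    )
    out <- unique(rbind(out, draw))
    if (nrow(out) >= n) break
  }
  out <- out[seq_len(n), , drop = FALSE]
  rownames(out) <- NULL
  out
}

vars <- list(
  sex = c("M", "F"),
  age = 18:65,
  nationality = c("AU", "AT", "ES", "FR", "MX", "IN"),
  eye_colour = c("blue", "brown", "green", "hazel")
)
res <- sample_unique_rows(vars, 10)
```

When n is close to the total number of combinations the loop may need many rounds, so this fits best when the grid is much larger than n.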

Alternatively, you can use expand.grid to get every possible combination and then sample 10 of them:

expand.grid(vars)[sample(prod(lengths(vars)), 10), ]

#      sex age nationality eye_colour
# 1788   F  47          AU      hazel
# 544    F  49          IN       blue
# 598    F  28          AU      brown
# 137    M  38          AT       blue
# 1111   M  45          IN      brown
# 1142   F  60          IN      brown
# 1148   F  63          IN      brown
# 169    M  54          AT       blue
# 28     F  31          AU       blue
# 257    M  50          ES       blue
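
If the full grid is too large to build, the same row sampling can be done on row numbers alone. The sketch below assumes the total number of combinations is still exactly representable as a double (below 2^53); `sample_grid_rows` is a hypothetical name. It samples row numbers of the virtual grid and decodes each one into per-variable positions, with the first variable varying fastest as in `expand.grid`.

```r
# Hypothetical helper: sample n distinct rows of expand.grid(vars) without
# ever materialising the grid.
sample_grid_rows <- function(vars, n) {
  lens <- unname(lengths(vars))
  total <- prod(lens)
  stopifnot(n <= total)
  # Distinct 0-based row numbers in the virtual grid
  idx <- sample.int(total, n) - 1
  out <- vector("list", length(vars))
  names(out) <- names(vars)
  for (i in seq_along(vars)) {
    # Mixed-radix decoding: the remainder picks the level of variable i,
    # the quotient is carried on to the next variable
    out[[i]] <- vars[[i]][idx %% lens[i] + 1]
    idx <- idx %/% lens[i]
  }
  as.data.frame(out, stringsAsFactors = FALSE)
}

vars <- list(
  sex = c("M", "F"),
  age = 18:65,
  nationality = c("AU", "AT", "ES", "FR", "MX", "IN"),
  eye_colour = c("blue", "brown", "green", "hazel")
)
res <- sample_grid_rows(vars, 10)
```

Because the sampled row numbers are distinct and each one maps to exactly one combination, the rows are unique as long as the values within each variable are unique.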

g6ll5ycj2#

A generic brute-force procedure. It is ugly, but it automates the whole process.

set.seed(1234)
num_rows <- 10
vars <- list(
  sex = c("M", "F"),
  age = 18:65,
  nationality = c("AU", "AT", "ES", "FR", "MX", "IN"),
  eye_colour = c("blue", "brown", "green", "hazel")
)
all_combinations <- combn(unlist(vars), length(names(vars))) |> t()
idx <- lapply(1:length(names(vars)), function(x) {
  nm <- names(vars)[ x ]
  which(all_combinations[ , x ] %in% vars[[ nm ]])
})
idx <- Reduce(intersect, idx)
idx <- sample(idx, num_rows)
sample_data_df <- all_combinations[ idx, ]
sample_data_df <- data.frame(sample_data_df)
colnames(sample_data_df) <- names(vars)
sample_data_df
   sex age nationality eye_colour
1    M  59          MX      hazel
2    M  43          IN      green
3    M  56          IN      brown
4    M  34          FR      hazel
5    F  59          ES      brown
6    F  37          MX      brown
7    M  63          IN      green
8    M  29          AT      brown
9    F  62          IN      hazel
10   F  55          ES      hazel

omjgkv6w3#

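Here is an alternative using a custom function that takes the number of rows and the dimensions as arguments and produces a random dataset with the specified number of rows: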

randomise_data <- function(n, ...) {
  vars <- list(...)
  df <- expand.grid(vars)
  df <- df[sample(nrow(df), n), ]
  return(df)
}

sex <- c("M", "F")
age <- 18:65
nationality <- c("AU", "AT", "ES", "FR", "MX", "IN")
eye_colour <- c("blue", "brown", "green", "hazel")

randomise_data(10, sex, age, nationality, eye_colour)
     Var1 Var2 Var3  Var4
1983    M   49   ES hazel
6       F   20   AU  blue
172     F   55   AT  blue
1439    M   65   ES green
50      F   42   AU  blue
1272    F   29   AT green
1357    M   24   ES green
929     M   50   FR brown
365     M   56   FR  blue
993     M   34   MX brown
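
The columns above come out as Var1 to Var4 because `list(...)` drops the names of the variables passed in. One possible way to keep them is sketched below; `randomise_data_named` is a hypothetical variant, not the function from this answer, and it still builds the full grid, so it only suits small cases.

```r
# Hypothetical variant: name each column after the expression passed in.
randomise_data_named <- function(n, ...) {
  vars <- list(...)
  # Recover the argument expressions (e.g. "sex") as text
  arg_names <- vapply(
    as.list(substitute(list(...)))[-1],
    function(e) paste(deparse(e), collapse = " "),
    character(1)
  )
  arg_names <- unname(arg_names)
  # Keep explicitly supplied names, fill in the rest from the expressions
  nms <- names(vars)
  if (is.null(nms)) nms <- rep("", length(vars))
  nms[nms == ""] <- arg_names[nms == ""]
  names(vars) <- nms
  df <- expand.grid(vars, stringsAsFactors = FALSE)
  df <- df[sample(nrow(df), n), ]
  rownames(df) <- NULL
  df
}

sex <- c("M", "F")
age <- 18:65
nationality <- c("AU", "AT", "ES", "FR", "MX", "IN")
eye_colour <- c("blue", "brown", "green", "hazel")

res <- randomise_data_named(10, sex, age, nationality, eye_colour)
```

Named arguments, such as `randomise_data_named(10, gender = sex, age)`, keep the name supplied by the caller.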

siotufzp4#

Found an answer that does not build the huge grid:

set.seed(123)
vars <- list(
  sex = c("M", "F"),
  age = 18:65,
  nationality = c("AU", "AT", "ES", "FR", "MX", "IN"),
  eye_colour = c("blue", "brown", "green", "hazel")
)

desired_length = 10

var_lengths = vapply(vars, length, 1L)

# Find a number of levels per variable whose product is close to desired_length,
# using the squared difference as the penalty
solution <- optim(par = sqrt(var_lengths),
                  fn = \(x) (desired_length - Reduce("*", x))^2,
                  method = "L-BFGS-B",
                  lower = 1,
                  upper = var_lengths)

# SIMPLIFY = FALSE keeps a list even when every variable gets the same sample size
sub_vars <- mapply(\(x,y) sample(x,y), vars, ceiling(solution$par), SIMPLIFY = FALSE)

max_grid <- expand.grid(sub_vars)
result <- max_grid[sample(nrow(max_grid), desired_length),]
result
   sex age nationality eye_colour
5    M  59          FR      brown
27   M  60          FR      green
28   F  60          FR      green
9    M  54          FR      brown
29   M  54          FR      green
35   M  59          AU      green
8    F  60          FR      brown
26   F  59          FR      green
7    M  60          FR      brown
10   F  54          FR      brown
