Error when applying h2o::h2o.glrm() to mixed data

Posted by 2ic8powd on 2023-05-11 in Other

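I want to reduce the dimensionality of a mixed dataset with the h2o.glrm() function from the R package h2o. My dataset contains binary variables (nominal variables with two possible levels), nominal variables (with three or more possible levels), and ordinal variables (with three or more possible levels). I use logistic loss for the binary variables, and ordinal loss and categorical loss for the ordinal and nominal variables, respectively.
Below is a minimal, reproducible example of my problem.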

    # Load packages
    library(tibble)
    library(h2o)

    # Example data for MRE
    my_data <- tibble::tibble(
      var.1 = as.factor(rep(1, 10)),
      var.2 = as.factor(c(NA, 1, 1, -1, -1, -1, 1, 1, 1, 1)),
      var.3 = as.factor(rep(-1, 10)),
      var.4 = as.factor(c(-1, 1, 1, 1, 1, 1, -1, 1, 1, 1)),
      var.5 = as.factor(rep(-1, 10)),
      var.6 = as.factor(c(1, 2, 3, 1, 2, 2, 2, 2, 2, 3)),
      var.7 = as.factor(c(NA, 2, 3, 2, 2, 2, 2, 3, 1, 2)),
      var.8 = as.factor(c(2, 3, 2, 2, 2, 2, 3, 2, 2, 2)),
      var.9 = as.factor(c(1, 2, 3, 4, 1, 2, 3, 4, 1, 3)),
      var.10 = as.factor(c(1, 1, 1, 1, NA, 1, 1, -1, -1, 1))
    )
    my_data_types <- tibble::tibble(
      var_name = paste("var", 1:10, sep = "."),
      var_type = c(rep("binary", 5),
                   rep("ordinal", 3),
                   "nominal", "binary")
    )

    # Initialize h2o cluster
    h2o::h2o.init()
    h2o::h2o.no_progress()

    # Convert data to h2o object
    my_data_h2o <- h2o::as.h2o(my_data)

    # Define loss function for ordinal and nominal variables
    losses <- tibble::tibble(
      index = which(my_data_types$var_type %in% c("ordinal", "nominal")) - 1,
      loss = NA_character_
    )
    for (i in seq_along(losses$index)) {
      losses$loss[i] <-
        ifelse(my_data_types$var_type[losses$index[i] + 1] == "ordinal", "Ordinal",
               ifelse(my_data_types$var_type[losses$index[i] + 1] == "nominal", "Categorical", NA))
    }

    # Run GLRM
    my_glrm <- h2o::h2o.glrm(
      training_frame = my_data_h2o,
      k = 2,
      loss = "Logistic",
      loss_by_col_idx = losses$index,
      loss_by_col = losses$loss,
      regularization_x = "None",
      regularization_y = "None",
      transform = "NONE",
      max_iterations = 2000,
      seed = 12345
    )

When I run the model above, I receive the following error message:

    Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
    ERROR MESSAGE:
    Illegal argument(s) for GLRM model: GLRM_model_R_1683532209346_20. Details: ERRR on field: _loss: Logistic is not a numeric loss function

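Although I don't think this is what the error message is telling me, I also ran the model on another version of the dataset in which the binary variables are not defined as factors.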

    # Alternative example data for MRE
    my_data_2 <- tibble::tibble(
      var.1 = rep(1, 10),
      var.2 = c(NA, 1, 1, -1, -1, -1, 1, 1, 1, 1),
      var.3 = rep(-1, 10),
      var.4 = c(-1, 1, 1, 1, 1, 1, -1, 1, 1, 1),
      var.5 = rep(-1, 10),
      var.6 = as.factor(c(1, 2, 3, 1, 2, 2, 2, 2, 2, 3)),
      var.7 = as.factor(c(NA, 2, 3, 2, 2, 2, 2, 3, 1, 2)),
      var.8 = as.factor(c(2, 3, 2, 2, 2, 2, 3, 2, 2, 2)),
      var.9 = as.factor(c(1, 2, 3, 4, 1, 2, 3, 4, 1, 3)),
      var.10 = c(1, 1, 1, 1, NA, 1, 1, -1, -1, 1)
    )

    # Convert data to h2o object
    my_data_2_h2o <- h2o::as.h2o(my_data_2)

    # Run GLRM
    my_glrm_2 <- h2o::h2o.glrm(
      training_frame = my_data_2_h2o,
      k = 2,
      loss = "Logistic",
      loss_by_col_idx = losses$index,
      loss_by_col = losses$loss,
      regularization_x = "None",
      regularization_y = "None",
      transform = "NONE",
      max_iterations = 2000,
      seed = 12345
    )

When I run the model on the alternative version of the dataset, I receive the following error:

    Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
    ERROR MESSAGE:
    Illegal argument(s) for GLRM model: GLRM_model_R_1683532209346_21. Details: ERRR on field: _loss: Logistic is not a numeric loss function
    ERRR on field: _loss_by_col: Loss function Logistic cannot be applied to numeric column 0
    ERRR on field: _loss_by_col: Loss function Logistic cannot be applied to numeric column 1
    ERRR on field: _loss_by_col: Loss function Logistic cannot be applied to numeric column 6

I would appreciate it if someone could tell me what I am doing wrong.

Answer #1 (i34xakig)

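The loss-related function arguments are not formatted correctly, so h2o gets confused (and raises an error) when it ends up applying the "incorrect" loss function to a given data type.
Instead of passing loss = and loss_by_col_idx =, pass only loss_by_col =. That argument is meant to take one loss function name for each feature in training_frame, so its length needs to equal ncol(my_data).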

    # One loss name per column, in column order
    losses2 <- dplyr::case_when(
      my_data_types$var_type == 'binary' ~ 'Logistic',
      my_data_types$var_type == 'ordinal' ~ 'Ordinal',
      TRUE ~ 'Categorical')
    losses2
    # console:
    #  [1] "Logistic"    "Logistic"    "Logistic"    "Logistic"    "Logistic"    "Ordinal"
    #  [7] "Ordinal"     "Ordinal"     "Categorical" "Logistic"

    # Run GLRM
    my_glrm <- h2o::h2o.glrm(
      training_frame = my_data_h2o,
      k = 2,
      loss_by_col = losses2,
      regularization_x = "None",
      regularization_y = "None",
      transform = "NONE",
      max_iterations = 2000,
      seed = 12345
    )
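
If you want to catch a length or type mismatch before anything reaches the cluster, a small check along these lines may help. This is only a sketch in base R: build_loss_by_col() is a hypothetical helper, not part of h2o, and the type-to-loss mapping simply mirrors the one used in this thread.

```r
# Hypothetical helper (not an h2o function): map declared variable types to
# GLRM loss names and check that there is exactly one loss per column.
build_loss_by_col <- function(var_types, n_cols) {
  loss_map <- c(binary = "Logistic", ordinal = "Ordinal", nominal = "Categorical")

  # Fail early on any type that has no loss assigned
  unknown <- setdiff(unique(var_types), names(loss_map))
  if (length(unknown) > 0) {
    stop("No loss defined for type(s): ", paste(unknown, collapse = ", "))
  }

  losses <- unname(loss_map[var_types])
  stopifnot(length(losses) == n_cols)
  losses
}

# Same type layout as my_data_types$var_type in the question
var_types <- c(rep("binary", 5), rep("ordinal", 3), "nominal", "binary")
losses_checked <- build_loss_by_col(var_types, n_cols = 10)
```

In the question's setup you would pass my_data_types$var_type and ncol(my_data), then hand the result to loss_by_col =.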

Now your model is up and running, though with some of the uninformative features dropped, which is what we would want.
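
The dropped features are presumably the constant columns, since h2o.glrm() has an ignore_const_cols argument that defaults to TRUE. As a rough way to see which columns those would be, here is a base R sketch that rebuilds the first five columns of the example as a plain data.frame (example_data is a name introduced here for illustration) and flags those with at most one distinct non-missing value.

```r
# First five columns of the question's example data, as a plain data.frame
example_data <- data.frame(
  var.1 = as.factor(rep(1, 10)),
  var.2 = as.factor(c(NA, 1, 1, -1, -1, -1, 1, 1, 1, 1)),
  var.3 = as.factor(rep(-1, 10)),
  var.4 = as.factor(c(-1, 1, 1, 1, 1, 1, -1, 1, 1, 1)),
  var.5 = as.factor(rep(-1, 10))
)

# A column counts as constant if it has at most one distinct non-missing value
is_constant <- vapply(
  example_data,
  function(col) length(unique(stats::na.omit(col))) <= 1,
  logical(1)
)

constant_cols <- names(example_data)[is_constant]
constant_cols
# [1] "var.1" "var.3" "var.5"
```

Whether h2o uses exactly this rule internally is an assumption; the sketch only shows which columns carry no variation in the example.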
