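I want to reduce the dimensionality of a mixed dataset using the h2o.glrm() function from the R package h2o. My dataset contains binary variables (nominal variables with two possible levels), nominal variables (three or more possible levels), and ordinal variables (three or more possible levels). I use logistic loss for the binary variables, and ordinal and categorical loss for the ordinal and nominal variables, respectively.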
Below is a minimal reproducible example of my problem.
# Load packages
library(tibble)
library(h2o)
# Example data for MRE
my_data <- tibble::tibble(
  var.1 = as.factor(rep(1, 10)),
  var.2 = as.factor(c(NA, 1, 1, -1, -1, -1, 1, 1, 1, 1)),
  var.3 = as.factor(rep(-1, 10)),
  var.4 = as.factor(c(-1, 1, 1, 1, 1, 1, -1, 1, 1, 1)),
  var.5 = as.factor(rep(-1, 10)),
  var.6 = as.factor(c(1, 2, 3, 1, 2, 2, 2, 2, 2, 3)),
  var.7 = as.factor(c(NA, 2, 3, 2, 2, 2, 2, 3, 1, 2)),
  var.8 = as.factor(c(2, 3, 2, 2, 2, 2, 3, 2, 2, 2)),
  var.9 = as.factor(c(1, 2, 3, 4, 1, 2, 3, 4, 1, 3)),
  var.10 = as.factor(c(1, 1, 1, 1, NA, 1, 1, -1, -1, 1))
)
my_data_types <- tibble::tibble(
  var_name = paste("var", 1:10, sep = "."),
  var_type = c(rep("binary", 5),
               rep("ordinal", 3),
               "nominal", "binary")
)
# Initialize h2o cluster
h2o::h2o.init()
h2o::h2o.no_progress()
# Convert data to h2o object
my_data_h2o <- h2o::as.h2o(my_data)
# Define loss function for ordinal and nominal variables
losses <- tibble::tibble(
  index = which(my_data_types$var_type %in% c("ordinal", "nominal")) - 1,
  loss = NA_character_
)
for (i in seq_along(losses$index)) {
  losses$loss[i] <-
    ifelse(my_data_types$var_type[losses$index[i] + 1] == "ordinal", "Ordinal",
           ifelse(my_data_types$var_type[losses$index[i] + 1] == "nominal", "Categorical", NA))
}
# Run GLRM
my_glrm <- h2o::h2o.glrm(
  training_frame = my_data_h2o,
  k = 2,
  loss = "Logistic",
  loss_by_col_idx = losses$index,
  loss_by_col = losses$loss,
  regularization_x = "None",
  regularization_y = "None",
  transform = "NONE",
  max_iterations = 2000,
  seed = 12345
)
When I run the model above, I get the following error message:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
ERROR MESSAGE:
Illegal argument(s) for GLRM model: GLRM_model_R_1683532209346_20. Details: ERRR on field: _loss: Logistic is not a numeric loss function
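Although I don't think this is what the error message is telling me, I also ran the model on another version of the dataset in which the binary variables are not defined as factors.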
# Alternative example data for MRE
my_data_2 <- tibble::tibble(
  var.1 = rep(1, 10),
  var.2 = c(NA, 1, 1, -1, -1, -1, 1, 1, 1, 1),
  var.3 = rep(-1, 10),
  var.4 = c(-1, 1, 1, 1, 1, 1, -1, 1, 1, 1),
  var.5 = rep(-1, 10),
  var.6 = as.factor(c(1, 2, 3, 1, 2, 2, 2, 2, 2, 3)),
  var.7 = as.factor(c(NA, 2, 3, 2, 2, 2, 2, 3, 1, 2)),
  var.8 = as.factor(c(2, 3, 2, 2, 2, 2, 3, 2, 2, 2)),
  var.9 = as.factor(c(1, 2, 3, 4, 1, 2, 3, 4, 1, 3)),
  var.10 = c(1, 1, 1, 1, NA, 1, 1, -1, -1, 1)
)
# Convert data to h2o object
my_data_2_h2o <- h2o::as.h2o(my_data_2)
# Run GLRM
my_glrm_2 <- h2o::h2o.glrm(
  training_frame = my_data_2_h2o,
  k = 2,
  loss = "Logistic",
  loss_by_col_idx = losses$index,
  loss_by_col = losses$loss,
  regularization_x = "None",
  regularization_y = "None",
  transform = "NONE",
  max_iterations = 2000,
  seed = 12345
)
When I run the model on this alternative version of the dataset, I get the following errors:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
ERROR MESSAGE:
Illegal argument(s) for GLRM model: GLRM_model_R_1683532209346_21. Details: ERRR on field: _loss: Logistic is not a numeric loss function
ERRR on field: _loss_by_col: Loss function Logistic cannot be applied to numeric column 0
ERRR on field: _loss_by_col: Loss function Logistic cannot be applied to numeric column 1
ERRR on field: _loss_by_col: Loss function Logistic cannot be applied to numeric column 6
I would be grateful if someone could tell me what I am doing wrong.
1 Answer
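The loss-related function arguments are not formatted correctly, so h2o.glrm() gets confused (and raises errors) about applying the "wrong" loss function to a given data type.
Don't pass loss = or loss_by_col_idx =; pass only loss_by_col =. That argument is meant to take one loss function name for each feature in training_frame, so its length needs to be the same as ncol(my_data).
With that change the model is up and running, though it drops some uninformative features, which is what we would want.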
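The advice above can be sketched as follows: build one loss name per column and pass only loss_by_col. This is a minimal sketch, not the answerer's exact code. The loss_map lookup and the fit_glrm() wrapper are hypothetical names introduced here for illustration. The choice of "Categorical" for the factor-encoded binary columns is an assumption, since the answer does not say which loss it used for them.

```r
# Column types from the question, one row per column of the training frame.
my_data_types <- data.frame(
  var_name = paste("var", 1:10, sep = "."),
  var_type = c(rep("binary", 5), rep("ordinal", 3), "nominal", "binary"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: mapping from variable type to a GLRM loss name. Binary columns
# stored as factors are mapped to "Categorical" here; adjust this if your
# binary columns are encoded differently.
loss_map <- c(binary = "Categorical", ordinal = "Ordinal", nominal = "Categorical")

# One loss name per column, in column order.
loss_by_col <- unname(loss_map[my_data_types$var_type])

# Fail early if a type has no mapping or the length is off.
stopifnot(!anyNA(loss_by_col), length(loss_by_col) == nrow(my_data_types))

# Hypothetical wrapper (not part of h2o): fits the GLRM with per-column losses
# only, i.e. without the loss = and loss_by_col_idx = arguments.
fit_glrm <- function(frame, loss_by_col) {
  stopifnot(length(loss_by_col) == ncol(frame))
  h2o::h2o.glrm(
    training_frame = frame,
    k = 2,
    loss_by_col = loss_by_col,
    regularization_x = "None",
    regularization_y = "None",
    transform = "NONE",
    max_iterations = 2000,
    seed = 12345
  )
}
```

With the cluster initialized and my_data_h2o created as in the question, the call would be my_glrm <- fit_glrm(my_data_h2o, loss_by_col). Deriving the vector from my_data_types keeps its length tied to the number of columns, which is the requirement the answer points out.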