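I am trying out tidymodels on the penguins dataset. I want to build a recipe and then compare different imputation methods (knn in the example below). When I try to build the model, I get the following warning: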
Warning message:
There are new levels in a factor: NA
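I have tried different solutions (using step_novel(), step_unknown(), step_naomit()), but none of them seem to work. The only thing that works is removing or handling all missing data before creating the recipe, but that defeats the purpose of using a recipe for preprocessing, right? The full code is below.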
# penguins dataset tidymodels
# libraries
library(tidyverse)
library(tidymodels)
library(workflowsets)
library(skimr)
library(DataExplorer)
library(SmartEDA)
library(dlookr)
library(dataMaid)
library(GGally)
# import
data <- penguins
# split
set.seed(1)
data_split <- data %>% initial_split(prop = 0.75, strata = species)
data_train <- training(data_split)
data_test <- testing(data_split)
# model recipe
recipe <- recipe(species ~ ., data = data_train) %>%
  step_log(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9) %>%
  step_zv(all_numeric_predictors()) %>%
  step_nzv(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())
# recipe with knn imputing
recipe_knn_impute <- recipe %>%
  step_impute_knn(all_predictors())
# Apply processing to test and training data
baked_data_train <- recipe_knn_impute %>% prep() %>% bake(data_train)
baked_data_test <- recipe_knn_impute %>% prep() %>% bake(data_test)
1 Answer
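We suggest doing imputation first; otherwise, all other operations will be affected by the missing data. The warning probably comes from the dummy variable creation in step_dummy(), because the factor values are still missing at that point in the recipe.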
If you impute first, the warning goes away.
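The answer's original code did not survive on this page, so the following is only a sketch of what "impute first" could look like, reusing the steps from the question with step_impute_knn() moved to the front. It assumes that `penguins` is the copy shipped with the modeldata package (attached by tidymodels). The name `rec_knn_first`, prepping the recipe once, and `bake(new_data = NULL)` for the training set are choices made for this sketch, not taken from the original answer.

```r
library(tidymodels)

# Assumption: use the penguins data from modeldata (attached by tidymodels).
data(penguins, package = "modeldata")

# Same split as in the question.
set.seed(1)
data_split <- initial_split(penguins, prop = 0.75, strata = species)
data_train <- training(data_split)
data_test  <- testing(data_split)

# Impute first, so every later step (including step_dummy) sees complete data.
rec_knn_first <- recipe(species ~ ., data = data_train) %>%
  step_impute_knn(all_predictors()) %>%
  step_log(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9) %>%
  step_zv(all_numeric_predictors()) %>%
  step_nzv(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# Estimate the recipe once on the training data, then apply it to both sets.
rec_prepped <- prep(rec_knn_first, training = data_train)
baked_data_train <- bake(rec_prepped, new_data = NULL)      # processed training set
baked_data_test  <- bake(rec_prepped, new_data = data_test) # processed test set
```

To compare imputation methods, the first step could presumably be swapped for another imputation step (for example step_impute_median() for the numeric predictors plus step_impute_mode() for the nominal ones) while leaving the rest of the recipe unchanged.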
Created on 2023-03-22 with reprex v2.0.2