I ran a LASSO regression on my data.
However, the two plots (the coefficient path plot and the cross-validation plot) do not look right.
The problem with the coefficient plot: as λ changes, some coefficients first grow and then shrink. In published papers the coefficients only shrink as λ increases; they do not grow.
[coefficients plot]
The problem with the cross-validation plot: part of the red line is discontinuous with the rest.
[cross-validation plot]
My data: https://raw.githubusercontent.com/onkaparinga/default/main/train3.csv
My code:
library(readr)
library(glmnet)
train3 <- read_csv('train3.csv')
x <- as.matrix(train3[,-1])   # predictors: every column except CustomLabel
y <- train3$CustomLabel       # binary outcome
cvlasso <- cv.glmnet(x, y, alpha = 1, family = 'binomial')   # LASSO with 10-fold cross-validation
plot(cvlasso)                 # cross-validation plot
plot(cvlasso$glmnet.fit)      # coefficient path plot
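One thing worth noting (a sketch of my own, not part of the pipeline above): plot.glmnet puts the L1 norm on the x-axis by default, while papers usually plot the coefficient paths against log(λ); passing xvar = "lambda" gives that view, and a vertical line at the cross-validated λ helps read it.
# Sketch (not in the original code): coefficient paths against log(lambda)
plot(cvlasso$glmnet.fit, xvar = "lambda", label = TRUE)
abline(v = log(cvlasso$lambda.min), lty = 2)   # lambda selected by cv.glmnet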
Before the LASSO regression, I actually ran a correlation analysis to remove variables that are highly correlated (>0.9) with other variables.
# read dataset
library(readr)
library(dplyr)   # needed for the pipes, across(), filter(), select(), etc. below
train2 <- read_csv('train2.csv')
# get non-normal variables (Shapiro-Wilk p < 0.05)
non_norm_vars <- train2 %>%
  summarise(across(1:(ncol(train2)-1), ~shapiro.test(.x)$p.value)) %>%
  t() %>%
  as.data.frame() %>%
  filter(V1 < 0.05) %>%
  rownames()
# get normal variables
norm_vars <- colnames(train2)[!colnames(train2) %in% non_norm_vars] %>%
  head(-1)
# rearrange 'train2': put the normal variables in front, so they are convenient to replace
df_nonnorm <- train2[, non_norm_vars]
df_norm <- train2[, norm_vars]
train3 <- bind_cols(df_norm, df_nonnorm)
# calculate correlation coefficients: Pearson for the normal variables, Spearman for all variables
cor_norm <- cor(df_norm, method = 'pearson')
cor_all <- cor(train3, method = 'spearman')
# replace the normal-variable block with the Pearson coefficients
num_norm <- dim(cor_norm)[1]
cor_all[1:num_norm, 1:num_norm] <- cor_norm
# count how many 'high correlations' (>0.9) each variable has, and sort in descending order
var_seq <- cor_all %>%
  as_tibble() %>%
  reframe(across(everything(), ~sum(abs(.x) > 0.9))) %>%
  t() %>%
  as.data.frame() %>%
  arrange(desc(V1)) %>%
  rownames()
# slice_seq: index of var_seq in colnames(cor_all)
slice_seq <- match(var_seq, colnames(cor_all))
# build a new matrix: variables with the most 'high correlations' in front, the fewest at the end
cor_all <- cor_all %>%
  as_tibble() %>%
  select(all_of(var_seq)) %>%
  slice(slice_seq) %>%
  as.matrix()
rownames(cor_all) <- colnames(cor_all)
# keep only the lower triangle
cor_all[upper.tri(cor_all)] <- 0
diag(cor_all) <- 0
# keep variables that have no 'high correlations'
cor_vars <- cor_all %>%
  as_tibble() %>%
  summarise(across(everything(), ~any(abs(.x) > 0.9))) %>%
  t() %>%
  as.data.frame() %>%
  filter(V1 == FALSE) %>%
  rownames()
# build train3
train3 <- train2 %>%
  select(CustomLabel, all_of(cor_vars))
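As an aside, a much shorter filter is possible; this is only a sketch assuming the caret package may be used, not the code that produced train3: caret::findCorrelation() takes a correlation matrix and returns the indices of columns to drop so that no remaining pair is correlated above the cutoff.
# Sketch with caret (assumption: Spearman correlation and the same 0.9 cutoff as above)
library(caret)
cor_mat <- cor(select(train2, -CustomLabel), method = 'spearman')
drop_idx <- findCorrelation(cor_mat, cutoff = 0.9)
keep_vars <- if (length(drop_idx)) colnames(cor_mat)[-drop_idx] else colnames(cor_mat)
train3_alt <- train2 %>% select(CustomLabel, all_of(keep_vars))   # hypothetical alternative to train3
Unlike the code above, this treats every pair with Spearman correlation rather than switching between Pearson and Spearman for normal and non-normal variables.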
I hope my English doesn't confuse you.
train2: https://raw.githubusercontent.com/onkaparinga/default/main/train2.csv
1 Answer
Your dataset train3 has substantial collinearity: for example, 4244 pairwise combinations have an r² of at least 0.9. On top of that, 31 of your features each have at least 3 extreme outliers lying 4 or more standard deviations from the mean.
Both multicollinearity and outliers can seriously destabilize a regression, so it makes sense to put a dimensionality-reduction step and some outlier management at the front of the modelling chain.
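A minimal sketch of how those two checks and one possible remedy could look, assuming train3 and the setup from the question; the 0.9 and 4-SD thresholds come from the answer above, and the choice of 10 principal components is purely illustrative:
# Sketch: diagnostics and one possible remedy (illustrative, not the answerer's exact code)
library(dplyr)
X <- train3 %>% select(-CustomLabel)

r2 <- cor(X)^2
sum(r2[upper.tri(r2)] >= 0.9)            # number of feature pairs with r^2 >= 0.9

z <- scale(X)                            # column-wise z-scores
sum(colSums(abs(z) > 4) >= 3)            # features with >= 3 values beyond 4 SD of their mean

# dimensionality reduction before the LASSO, e.g. principal components
pcs <- prcomp(X, center = TRUE, scale. = TRUE)
x_pca <- pcs$x[, 1:10]                   # first 10 PCs, an arbitrary choice
cv_pca <- cv.glmnet(x_pca, train3$CustomLabel, alpha = 1, family = 'binomial')
Fitting on components trades the interpretability of the original features for stability, so whether that is acceptable depends on the goal of the model.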