R: strange LASSO regression plots. What is the problem and how do I fix it?

Posted by tp5buhyn on 2023-05-26 in Other

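I ran a LASSO regression on my data.
However, the two plots (the coefficient plot and the cross-validation plot) do not look right.
The problem with the coefficient plot: as λ changes, some coefficients increase and then decrease. In published papers, the coefficients only shrink as λ changes; they do not grow.
(figure: coefficients plot)
The problem with the cross-validation plot: part of the red line is discontinuous with the rest.
(figure: cross validation plot)
My data: https://raw.githubusercontent.com/onkaparinga/default/main/train3.csv
My code: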

library(readr)
library(glmnet)

train3 <- read_csv('train3.csv')

x <- as.matrix(train3[,-1])
y <- train3$CustomLabel
cvlasso <- cv.glmnet(x,y,alpha = 1,family = 'binomial')
plot(cvlasso)
plot(cvlasso$glmnet.fit)

Before the LASSO regression, I actually ran a correlation analysis to remove variables that were highly correlated (>0.9) with other variables.

library(dplyr)  # provides %>%, summarise(), across(), filter(), ...

# read the dataset
train2 <- read_csv('train2.csv')

# get the non-normally distributed variables (Shapiro-Wilk p < 0.05)
non_norm_vars <- train2 %>% 
  summarise(across(1:(ncol(train2)-1),~shapiro.test(.x)$p.value)) %>% 
  t() %>% 
  as.data.frame() %>% 
  filter(V1<0.05) %>% 
  rownames()

# get the normally distributed variables (head(-1) drops the label column)
norm_vars <- colnames(train2)[!colnames(train2) %in% non_norm_vars] %>% 
  head(-1)

# rearrange 'train2' so the normally distributed variables come first, which is convenient for the replacement below
df_nonnorm <- train2[,non_norm_vars]
df_norm <- train2[,norm_vars]
train3 <- bind_cols(df_norm,df_nonnorm)

# Pearson correlations for the normal variables, Spearman correlations for all variables
cor_norm <- cor(df_norm,method = 'pearson')
cor_all <- cor(train3,method = 'spearman')
# overwrite the normal-variable block with the Pearson coefficients
num_norm <- dim(cor_norm)[1]
cor_all[1:num_norm,1:num_norm] <- cor_norm

# count how many high correlations (|r| > 0.9) each variable has, and sort in descending order
var_seq <- cor_all %>% 
  as_tibble %>% 
  reframe(across(everything(),~sum(abs(.x)>0.9))) %>% 
  t() %>% 
  as.data.frame() %>% 
  arrange(desc(V1)) %>% 
  rownames()
# slice_seq: positions of var_seq in colnames(cor_all)
slice_seq <- match(var_seq,colnames(cor_all))
# reorder the matrix: variables with the most high correlations first, the fewest last
cor_all <- cor_all %>% 
  as_tibble() %>% 
  select(all_of(var_seq)) %>% 
  slice(slice_seq) %>% 
  as.matrix()
rownames(cor_all) <- colnames(cor_all)

# keep only the lower triangle (zero the upper triangle and the diagonal)
cor_all[upper.tri(cor_all)] <- 0
diag(cor_all) <- 0

# keep the variables that have no remaining high correlation
cor_vars <- cor_all %>% 
  as_tibble() %>% 
  summarise(across(everything(),~any(abs(.x)>0.9))) %>% 
  t() %>% 
  as.data.frame() %>% 
  filter(!V1) %>% 
  rownames()
# build train3: the label plus the retained variables
train3 <- train2 %>% 
  select(CustomLabel, all_of(cor_vars))

I hope my English does not confuse you.
train2: https://raw.githubusercontent.com/onkaparinga/default/main/train2.csv

k3bvogb1


Your dataset train3 has substantial collinearity: for example, 4,244 pairs of features have an r² above 0.9:

library(dplyr)  # for as_tibble() and filter()

corr_mat <- cor(train3[-1])

expand.grid(A = dimnames(corr_mat)[[1]],
            B = dimnames(corr_mat)[[2]]
            ) |>
  cbind(r2 = as.vector(corr_mat)^2) |>
  as_tibble() |>
  filter(as.vector(upper.tri(corr_mat)),
         r2 > .9,
         r2 < 1
         ) |> 
  print(n = 3)
# A tibble: 4,244 x 3
  A     B        r2
  <fct> <fct> <dbl>
1 A693  A1228 0.996
2 A693  A1597 0.988
3 A1228 A1597 0.992
# i 4,241 more rows
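
One way to act on a table like this is a greedy filter that walks through the columns and drops every later column whose r² with an already kept column exceeds a cutoff. The following is only a minimal base-R sketch on simulated data: `drop_correlated()` and the `cutoff` argument are hypothetical names introduced here, not part of glmnet or any other package, and the sketch assumes there are no constant columns.

```r
# Greedy correlation filter (illustrative sketch, base R only).
# drop_correlated() is a hypothetical helper, not a library function.
drop_correlated <- function(x, cutoff = 0.9) {
  r2 <- cor(x)^2                 # squared Pearson correlations
  diag(r2) <- 0                  # ignore self-correlation
  keep <- rep(TRUE, ncol(x))
  for (j in seq_len(ncol(x))) {
    if (!keep[j]) next           # column j was already dropped
    # drop every later, still-kept column that is too close to column j
    later <- seq_len(ncol(x)) > j
    keep[later & keep & r2[j, ] > cutoff] <- FALSE
  }
  x[, keep, drop = FALSE]
}

# Simulated data (assumption): b is almost a copy of a, c is independent.
set.seed(1)
a <- rnorm(200)
sim <- cbind(a = a, b = a + rnorm(200, sd = 0.05), c = rnorm(200))
kept <- colnames(drop_correlated(sim))
print(kept)
```

On the real data this could be called as `drop_correlated(as.matrix(train3[-1]))`. By construction no surviving pair has an r² above the cutoff, but which member of a correlated group survives depends on column order.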

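In addition, 31 of your features have more than three extreme outliers, i.e. values more than four standard deviations away from the mean: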

train3 |>
summarise(across(where(is.numeric), 
                 ~ sum(abs(mean(.x) - .x) > 4 * sd(.x))
                 )
          ) |> t() |> 
  as.data.frame() |>
  filter(V1 > 3) |>
  nrow()

## + [1] 31

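Both multicollinearity and outliers can seriously destabilize your regression. It is therefore a good idea to put some dimensionality reduction and outlier handling at the front of your modelling pipeline.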
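
As a rough illustration of such a pipeline, the sketch below clips each feature to its 1st and 99th percentiles, runs a PCA, and fits the LASSO on the component scores. Winsorizing and PCA are just one possible choice for each step; the answer does not prescribe them. The data are simulated stand-ins, and `winsorize()`, the percentile limits and the 90% variance threshold are assumptions made for the example.

```r
library(glmnet)

# Simulated stand-in for the real data (assumption): 200 rows, 30 features
# built from 3 latent factors, so many columns are strongly correlated.
set.seed(42)
n <- 200
latent <- matrix(rnorm(n * 3), n, 3)
x <- latent[, rep(1:3, each = 10)] + matrix(rnorm(n * 30, sd = 0.2), n, 30)
x[sample(length(x), 20)] <- 15          # inject a few extreme values
y <- rbinom(n, 1, plogis(latent[, 1] - latent[, 2]))

# 1. Outlier handling: clip each column to its 1st and 99th percentiles.
#    winsorize() is a hypothetical helper defined here.
winsorize <- function(v, probs = c(0.01, 0.99)) {
  q <- quantile(v, probs, na.rm = TRUE)
  pmin(pmax(v, q[1]), q[2])
}
x_clip <- apply(x, 2, winsorize)

# 2. Dimensionality reduction: PCA on centred, scaled features, keeping
#    enough components to explain 90% of the variance (at least 2,
#    because glmnet needs two or more predictor columns).
pca <- prcomp(x_clip, center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp <- max(2, which(cum_var >= 0.9)[1])
scores <- pca$x[, seq_len(n_comp), drop = FALSE]

# 3. LASSO on the mutually uncorrelated component scores.
cvfit <- cv.glmnet(scores, y, alpha = 1, family = "binomial")
plot(cvfit)
plot(cvfit$glmnet.fit, xvar = "lambda")
```

The trade-off is interpretability: the model now selects principal components rather than original features, so if the goal is feature selection, a correlation filter or grouping of features may be a better fit than PCA.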
