How can I generate multiple simulation paths when forecasting with survival analysis in R?

ar5n3qh5  posted on 2023-05-04  in  Other

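I am trying to write R code that generates multiple simulation paths for forecasting survival probabilities. In the code posted at the bottom of this post, I use the lung dataset from the survival package and create a new data frame, lung1, that represents the lung data "as if" the study had run for a maximum of 500 periods instead of the actual 1022 in the lung data. I use a parametric Weibull distribution, having run goodness-of-fit tests separately. I am trying to forecast the survival curve for periods 501-1000 through multiple simulation paths, ideally using the Weibull parameters fitted to the data for periods 1-500 as the guide for random number generation. The exercise is a forecasting "what-if": what if I only had 500 periods of lung study data? I then compare the forecasts with the actual lung data for periods 501-1000.
The shape and scale parameters I extracted from the lung1 data are 1.804891 and 306.320693, respectively.
I am having trouble generating reasonable, monotonically decreasing simulation paths for forecast periods 501-1000. Looking at the code posted at the bottom, what should I be doing differently?
The images below help illustrate:
1. The first image is a K-M plot showing the survival probabilities for the entire lung dataset.
2. The second image plots lung1 (the 500 assumed study periods), with the forecast periods 501-1000 extended as grey lines. Clearly something is not right!
3. The third image just shows a simulation I did in the past with a time-series model (ETS), which gets at what I am trying to do with survival analysis. It is not my best example; I have generated nicely monotonically decreasing, concave forecast curves using log transformations and ETS. I am now trying to understand survival analysis better, so no more ETS for now.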

Code:

library(fitdistrplus)
library(dplyr)
library(survival)
library(MASS)

# Modify lung dataset as if study had only lasted 500 periods
lung1 <- lung %>% 
  mutate(time1 = ifelse(time >= 500, 500, time)) %>% 
  mutate(status1 = ifelse(status == 2 & time >= 500, 1, status))

fit1 <- survfit(Surv(time1, status1) ~ 1, data = lung1)

# Get survival probability values at each time point
surv_prob <- summary(fit1, times = seq(0, 500, by = 1))$surv

# Create a data frame with time and survival probability values
lungValues <- data.frame(Time = seq(0, 500, by = 1), Survival_Probability = surv_prob)

# Plot the survival curve using the new data frame
plot(lungValues$Time, lungValues$Survival_Probability, xlab = "Time", ylab = "Survival Probability",
     main = "Survival Plot", type = "l", col = "blue", xlim = c(0, 1000), ylim = c(0, 1))

# Generate correlation matrix for Weibull parameters
cor_matrix <- matrix(c(1.0, 0.5, 0.5, 1.0), nrow = 2, ncol = 2)

# Generate simulation paths for forecasting
num_simulations <- 10
forecast_period <- seq(501, 1000, by = 1)
start_prob <- 0.293692
shape <- 1.5
scale <- 100

for (i in 1:num_simulations) {
  # Generate random variables for the Weibull distribution
  random_vars <- mvrnorm(length(forecast_period), c(0, 0), Sigma = cor_matrix)
  shape_values <- exp(random_vars[,1])
  scale_values <- exp(random_vars[,2]) * scale
  
  # Calculate the survival probabilities for the forecast period
  surv_prob <- numeric(length(forecast_period))
  surv_prob[1] <- start_prob
  for (j in 2:length(forecast_period)) {
    # Calculate the survival probability using the Weibull distribution
    surv_prob[j] <- pweibull(forecast_period[j] - 500, shape = shape_values[j], scale = scale_values[j], lower.tail = FALSE)
    # Ensure the survival probability follows a monotonically decreasing, concave path
    if (surv_prob[j] > surv_prob[j-1]) {
      surv_prob[j] <- surv_prob[j-1] - runif(1, 0, 0.0005)
    }
  }
  
  # Combine the survival probabilities with the forecast period and create a data frame
  df <- data.frame(Time = forecast_period, Survival_Probability = surv_prob)
  
  # Add the simulation path to the plot
  lines(df$Time, df$Survival_Probability, type = "l", col = "grey")
}

axr492tv  #1

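I found the logic of your code hard to follow, so I am not entirely sure this is what you are looking for, but it seems reasonable to me.
My starting point is your lung1 data frame.
First, fit a Weibull distribution to your data and obtain fitted estimates of shape and scale. Note that survreg and rweibull use different parameterisations.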

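library(tibble)   # tibble(), add_row(), add_column() used further down
library(ggplot2)  # ggplot(), geom_step() used further down

wFit <- survreg(Surv(time1,status1)~1, dist="w", data=lung1)

# survreg and rweibull parameterise the Weibull differently (see ?survreg):
#   rweibull scale = exp(intercept), rweibull shape = 1 / (survreg scale)
scale <- exp(wFit$coefficients[1])
shape <- 1/wFit$scale
median(rweibull(2000, shape, scale))
[1] 326.3782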

[survfit reports a median of 310.]
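
The conversion between the two parameterisations can be checked on synthetic data. The following is a minimal sketch, not part of the original answer: the names `true_shape`, `true_scale`, `est_shape` and `est_scale` and the values 1.8 and 300 are illustrative assumptions. It draws uncensored Weibull times with known parameters, fits them with survreg, and converts the fit back to rweibull's shape and scale.

```r
library(survival)

set.seed(1)

# Hypothetical "true" parameters, chosen only for illustration
true_shape <- 1.8
true_scale <- 300

# Uncensored Weibull survival times in rweibull's parameterisation
x <- rweibull(5000, shape = true_shape, scale = true_scale)

# Surv(x) with no status argument treats every observation as an event
fit <- survreg(Surv(x) ~ 1, dist = "weibull")

# Convert survreg's (intercept, scale) to rweibull's (scale, shape)
est_scale <- unname(exp(coef(fit)[1]))
est_shape <- 1 / fit$scale

print(c(shape = est_shape, scale = est_scale))
```

With a sample this large, the recovered values should land close to the ones used to generate the data, which suggests the conversion is being applied in the right direction.
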
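Now write a function that imputes survival times for subjects whose time exceeds 500 and then derives the Kaplan-Meier estimate of the survival distribution. The function takes a single argument, x. If x == 1, the return value includes the estimated survival probabilities for times < 500; otherwise it does not.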

# Generate simulation paths for forecasting
num_simulations <- 10
start_prob <- 0.293692

simulate <- function(x=1) {
  lung2 <- lung1 %>% 
    mutate(
      time2 = ifelse(
                time > 500, 
                qweibull(runif(nrow(.), min=1-start_prob), shape, scale), 
                time
              ),
      # It's possible that some simulated survival times are *just* under 500
      status2 = ifelse(time == 500, 2, status)
    )
  summary2 <- summary(survfit(Surv(time2, status2) ~ 1, data = lung2))
  result <- tibble(Simulation=x, Time=summary2$time, Survival=summary2$surv) %>% 
    add_row(Simulation=x, Time=0, Survival=1, .before=1)
  if (x == 1) {
    return(result %>% mutate(Type=Time > 500))
  } 
  return(result %>% filter(Time > 500) %>% add_column(Type=TRUE))
}
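
The imputation step inside `simulate` is a form of inverse-transform sampling from a truncated distribution: uniform draws restricted to the upper part of (0, 1) are pushed through `qweibull`. The sketch below isolates that idea under a slightly different assumption. Instead of the Kaplan-Meier based lower bound `1 - start_prob` used above, it uses the fitted Weibull CDF at `t0 = 500` as the lower bound, so every draw lies beyond period 500. The shape and scale values are the ones quoted in the question and serve only as placeholders. The names `t0`, `p0`, `u` and `draws` are hypothetical.

```r
set.seed(42)

# Placeholder parameters (the values quoted in the question)
shape <- 1.804891
scale <- 306.320693

t0 <- 500                          # end of the observed study window
p0 <- pweibull(t0, shape, scale)   # P(T <= t0) under the fitted Weibull

# Uniform draws on (p0, 1) map through the quantile function to
# draws from the Weibull conditional on T > t0
u <- runif(10000, min = p0, max = 1)
draws <- qweibull(u, shape, scale)

summary(draws)
```

Using `1 - start_prob` as the lower bound, as the function above does, ties the truncation point to the observed Kaplan-Meier value rather than to the fitted curve, which is why some simulated times can fall just under 500 there.
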

Now use the function to simulate survival times > 500 multiple times. Bind the results into a single data frame.

simulations <- lapply(1:num_simulations, simulate) %>% bind_rows()

Plot the results.

simulations %>% 
  ggplot(
    aes(x=Time, y=Survival, group=interaction(Simulation, Type), colour=Type)
  ) +
  geom_step(show.legend=FALSE) +
  scale_colour_manual(values=c("blue", "grey"))
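
As a reference line for the simulated step paths, a fitted Weibull also implies a smooth, monotonically decreasing conditional survival curve beyond period 500, namely S(t | T > 500) = S(t) / S(500), rescaled to start at `start_prob`. The sketch below is an addition under stated assumptions rather than part of the answer above: it uses the shape and scale values quoted in the question as placeholders (the survreg estimates could be substituted), and the names `t_future`, `cond_surv` and `forecast` are hypothetical.

```r
# Placeholder parameters (the values quoted in the question)
shape <- 1.804891
scale <- 306.320693
start_prob <- 0.293692   # K-M survival probability at period 500

t_future <- 501:1000

# Conditional survival beyond 500 under the Weibull: S(t) / S(500)
cond_surv <- pweibull(t_future, shape, scale, lower.tail = FALSE) /
  pweibull(500, shape, scale, lower.tail = FALSE)

# Rescale so the curve continues from the observed K-M value at 500
forecast <- start_prob * cond_surv

head(data.frame(Time = t_future, Survival = forecast))
```

Because the Weibull survival function is strictly decreasing, this curve is monotone by construction, so no after-the-fact adjustment of individual points is needed.
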

Hope this helps.
