I have a dataset like this (a small sample):
final_data=structure(list(Y = c(2282L, 2565L, 2242L, 2109L, 2704L, 2352L,
2492L, 2608L, 2667L, 1863L), is_red_ndvi_v_down = c("yes", "yes",
"yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes"), ndvi_v_down = c(0.032460447,
0.028369653, 0.017094017, 0.016972906, 0.015228979, 0.020649285,
0.028151986, 0.036528581, 0.036026201, 0.017097506), is_red_mtci_m50_85 = c("yes",
"yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes"
), mtci_m50_85 = c(0.195646208, 0.112022057, 0.229670211, 0.19607818,
0.205472798, 0.314782868, 0.238119728, 0.230381033, 0.21754644,
0.092345478), is_red_gcvi_m75_2 = c("yes", "yes", "yes", "yes",
"yes", "yes", "yes", "yes", "yes", "yes"), gcvi_m75_2 = c(5.590222802,
4.439820215, 6.659634599, 6.321884806, 5.049482031, 5.039738058,
5.336354603, 6.236330399, 6.231815273, 5.627697383), is_red_vdvi_vi_max = c("yes",
"yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes"
), vdvi_vi_max = c(0.428571429, 0.283018868, 0.307692308, 0.307692308,
0.591836735, 0.50877193, 0.393939394, 0.461538462, 0.514285714,
0.428571429), is_red_mtci_m35_50 = c("yes", "yes", "yes", "yes",
"yes", "yes", "yes", "yes", "yes", "yes"), mtci_m35_50 = c(0.080354124,
0.134743258, 0.14510097, 0.198023501, 0.278444767, 0.235650507,
0.316062043, 0.216993856, 0.235756291, 0.002585028)), row.names = c(NA,
10L), class = "data.frame")
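The data contains measured variables and, for each one, a companion variable with the same name prefixed with is_red_. For example, ndvi_v_down is a metric variable and is_red_ndvi_v_down is a categorical variable with the values yes or no. Yes means the point is marked red on the plot, which indicates that it lies close to a monotonic straight line (when one predictor or another is related to Y). All of this was simply loaded into the final dataset for visual inspection.

However, I want to update this final dataset as follows. I need the cutoff threshold to be determined automatically (rather than manually, as I did before). To do this, the threshold is determined with k-means, and that part I have done. But after running the code I need the values (yes or no) of all the categorical variables in the final dataset to be updated. This is my attempt: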
plot_and_threshold <- function(data, response_var) {
  # Plotting and thresholding for each predictor
  for (x_var in names(data)) {
    if (x_var != response_var) {
      # Variables for x and y axis
      x <- as.numeric(data[[x_var]])
      y <- as.numeric(data[[response_var]])
      # Plot with red dots
      plot(x, y, col = "red", xlim = range(x, na.rm = TRUE))
      # Linear regression to fit data
      model <- lm(y ~ x)
      # Getting trend line coefficients
      a <- coef(model)[1]
      b <- coef(model)[2]
      # Calculate the distance between points and the trend line
      distances <- abs(y - (a + b * x))
      # Histogram clustering
      kmeans_obj <- kmeans(matrix(distances), centers = 2)
      cluster_centers <- kmeans_obj$centers
      # Threshold in the area of cluster separation
      threshold <- (cluster_centers[1] + cluster_centers[2]) / 2
      # Update variables is_red_*
      data[[paste0("is_red_", x_var)]] <- ifelse(distances > threshold, "no", "yes")
      # Plot with updated data
      points(x, y, col = ifelse(data[[paste0("is_red_", x_var)]] == "no", "grey", "red"))
    }
  }
  # Returning an updated dataset
  return(data)
}
final_data_updated <- plot_and_threshold(final_data, "Y")
I get this error:
Error in plot.window(...) : final 'xlim' values needed
In addition: Warnings:
1: In plot_and_threshold(final_data, "Y") :
as a result of the transformation, NAs were created
2: In min(x) : 'min' has no non-missing arguments; return Inf
3: In max(x) : 'max' has no non-missing arguments; return -Inf
What am I doing wrong? How can I get the updated dataset correctly? Thanks for your help.
1 Answer
jfgube3f #1
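Your function works fine; you just need to restrict it to the numeric columns. The loop currently also visits the is_red_* columns, which are character ("yes"/"no"), so as.numeric() turns them into NA and plot() has no finite values to compute xlim from.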
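Since the code that originally accompanied this answer is not available here, the following is only a minimal sketch of one way to do it, not necessarily what the answerer wrote. It assumes that every predictor is stored as a numeric column, that each predictor already has (or should get) a companion is_red_<name> column, and that there are no missing values. The small data frame toy is a hypothetical stand-in built from the first six rows of two columns in the question.

```r
# Sketch: loop over numeric predictors only, so the character is_red_* columns
# are never passed to as.numeric(), lm() or plot().
plot_and_threshold <- function(data, response_var) {
  # Decide the predictor list once, before any columns are modified
  is_num <- vapply(data, is.numeric, logical(1))
  predictors <- setdiff(names(data)[is_num], response_var)

  y <- as.numeric(data[[response_var]])

  for (x_var in predictors) {
    x <- data[[x_var]]

    # Linear trend line y = a + b * x
    model <- lm(y ~ x)
    a <- coef(model)[1]
    b <- coef(model)[2]

    # Vertical distance of each point from the trend line
    distances <- abs(y - (a + b * x))

    # Split the distances into two clusters and take the midpoint
    # between the two cluster centers as the cutoff
    kmeans_obj <- kmeans(matrix(distances), centers = 2)
    threshold <- mean(kmeans_obj$centers)

    # Points far from the line become "no", the rest stay "yes"
    flag <- ifelse(distances > threshold, "no", "yes")
    data[[paste0("is_red_", x_var)]] <- flag

    # One plot per predictor, colored by the updated flag
    plot(x, y,
         col = ifelse(flag == "no", "grey", "red"),
         xlab = x_var, ylab = response_var)
    abline(model)
  }

  data
}

# Hypothetical stand-in: first six rows of two columns from the question
toy <- data.frame(
  Y = c(2282L, 2565L, 2242L, 2109L, 2704L, 2352L),
  is_red_ndvi_v_down = rep("yes", 6),
  ndvi_v_down = c(0.032460447, 0.028369653, 0.017094017,
                  0.016972906, 0.015228979, 0.020649285)
)

toy_updated <- plot_and_threshold(toy, "Y")
```

With the full data from the question, the call stays the same as before: final_data_updated <- plot_and_threshold(final_data, "Y"). Because kmeans() starts from random centers, calling set.seed() first makes the resulting threshold reproducible.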