我有这样的数据:
subject,item,region_num,region,region_count,time_spent
ID_2965,14,1,begin,1,2109
ID_2965,14,2,ambig,1,206
ID_2965,14,3,location,1,920
ID_2965,14,4,disamb,1,127
ID_2965,14,5,follow,1,228
ID_2965,14,6,end,1,2238
ID_2965,14,5,follow,2,62
ID_2965,14,4,disamb,2,71
ID_2965,14,2,ambig,2,89
ID_2965,14,1,begin,2,2192
ID_2965,14,2,ambig,3,83
ID_2965,14,3,location,2,133
ID_2965,14,4,disamb,3,625
ID_2965,14,5,follow,3,57
ID_2965,14,6,end,2,1079
ID_2965,34,1,begin,1,57
ID_2965,34,2,ambig,1,224
ID_2965,34,1,begin,2,2136
ID_2965,34,2,ambig,2,98
ID_2965,34,3,location,1,127
ID_2965,34,4,disamb,1,415
ID_2965,34,5,follow,1,477
ID_2965,34,6,end,1,2445
ID_2965,34,5,follow,2,70
ID_2965,34,3,location,2,72
ID_2965,34,1,begin,3,3160
ID_2965,34,2,ambig,3,96
ID_2965,34,4,disamb,2,85
ID_2965,34,5,follow,3,213
ID_2965,34,6,end,2,482
ID_2965,34,5,follow,4,100
ID_2965,34,6,end,3,157
我需要通过三个步骤创建一个名为“gopast”的新列:
1.对于每一行,找到所有后续行,这些行对于给定主题和项目具有比该给定行更低或相等的region_num值,直到具有比给定region_num值更高的region_num值的第一后续行。
1.对于不存在具有较低region_num值的后续行的那些行,gopast值应等于time_spent。
1.对于其中存在具有较低或相等region_num值的紧接的后续行的那些行,对给定行和所有这些后续行的time_spent求和。
- gopast变量仅与region_count==1相关。
换句话说,结果数据应该如下所示:
subject,item,region_num,region,region_count,time_spent,gopast
ID_2965,14,1,begin,1,2109,2109
ID_2965,14,2,ambig,1,206,206
ID_2965,14,3,location,1,920,920
ID_2965,14,4,disamb,1,127,127
ID_2965,14,5,follow,1,228,228
ID_2965,14,6,end,1,2238,6629
ID_2965,14,5,follow,2,62,NA
ID_2965,14,4,disamb,2,71,NA
ID_2965,14,2,ambig,2,89,NA
ID_2965,14,1,begin,2,2192,NA
ID_2965,14,2,ambig,3,83,NA
ID_2965,14,3,location,2,133,NA
ID_2965,14,4,disamb,3,625,NA
ID_2965,14,5,follow,3,57,NA
ID_2965,14,6,end,2,1079,NA
ID_2965,34,1,begin,1,57,57
ID_2965,34,2,ambig,1,224,2458
ID_2965,34,1,begin,2,2136,NA
ID_2965,34,2,ambig,2,98,NA
ID_2965,34,3,location,1,127,127
ID_2965,34,4,disamb,1,415,415
ID_2965,34,5,follow,1,477,477
ID_2965,34,6,end,1,2445,6880
ID_2965,34,5,follow,2,70,NA
ID_2965,34,3,location,2,72,NA
ID_2965,34,1,begin,3,3160,NA
ID_2965,34,2,ambig,3,96,NA
ID_2965,34,4,disamb,2,85,NA
ID_2965,34,5,follow,3,213,NA
ID_2965,34,6,end,2,482,NA
ID_2965,34,5,follow,4,100,NA
ID_2965,34,6,end,3,157,NA
知道怎么做吗我在R中尝试了很多东西,也在这里和其他地方寻找解决方案,但我没有成功。
我能想到的最好的事情是在我包含了指定行号的rowN变量之后:
add_gopast <- function(df) {
df$gopast <- 0
for (i in 1:nrow(df)) {
time_spent <- df$time_spent[i]
region_num <- df$region_num[i]
region <- df$region[i]
lower_region_rows <- which(df$rowN > df$rowN[i] & df$region_num <= region_num & df$subject == df$subject[i] & df$item == df$item[i]& df$region_count >= df$region_count[i])
if (length(lower_region_rows) > 0) {
if (df$regout[i] == 0) {
gopast_value <- time_spent
} else if (df$regout[i] == 1) {
gopast_value <- time_spent + sum(df$time_spent[lower_region_rows])
}
df$gopast[i] <- gopast_value
} else {
df$gopast[i] <- time_spent
}
}
return(df)
}
问题是,当遇到region_num值高于给定region_num值的第一个后续行时,它不会停止求和…
1条答案
按热度按时间envsm3lx1#
1.使用
group_by
按“主题”和“项目”对数据进行分组;1.对于每个组,创建一个变量,指示下一个'region_num'的差异是否为正,并
cumsum
它以创建另一个分组变量;1.如果是
region_count != 1
,则'gopast'是不相关的,应该只是'time_spent'(我假设这就是您对df$regout[i] == 0
的意思)。否则,以rev() %>% cummean() %>% rev()
的逆序获得累积均值。结果: