R: Adding a cumulative sum column with a minimum threshold to a data.table

Posted by dxxyhpgq on 2023-05-26 in Other

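I have a data.table in R with a column named "score" that contains positive and negative real numbers. I want to add a new column named "sum" that holds the cumulative sum of the "score" column, but the cumulative sum must have a minimum threshold. If the sum falls below the minimum threshold at any point, it should be set to that minimum, and the next step of the cumulative sum should start from that minimum value.
I tried computing the cumulative sum with cumsum(), but I am not sure how to incorporate the minimum-threshold logic. Here is my code so far: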

library(data.table)

dt <- data.table(score = c(10, -13, 8, -12, 6))

dt[, sum := cumsum(score)]

# this is wrong: plain cumsum() has no lower bound
print(dt)

   score sum
1:    10  10
2:   -13  -3
3:     8   5
4:   -12  -7
5:     6  -1

I also tried:

min_value <- -3
dt[, sum := {
  temp_sum <- cumsum(score)
  temp_sum[temp_sum < min_value] <- min_value
  temp_sum
}]

# this is also wrong: clamping after cumsum() does not carry the minimum forward
print(dt)
   score sum
1:    10  10
2:   -13  -3
3:     8   5
4:   -12  -3
5:     6  -1 # expected sum is 3 here

I also tried the approach from "Cumulative sum with a threshold window in R data.table" and "dplyr / R cumulative sum with reset":

sum_score <- function(thresh) {
  function(x) {
    purrr::accumulate(x, ~dplyr::if_else(.x<=thresh, .y, .x+.y))
  }  
}

dt[, sum := sum_score(-3)(score)]

print(dt)
   score sum
1:    10  10
2:   -13  -3
3:     8   8
4:   -12  -4
5:     6   6

Expected output:

   score sum
1:    10  10
2:   -13  -3
3:     8   5
4:   -12  -3
5:     6   3
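
The rule described in the question amounts to the recurrence sum[i] = max(sum[i-1] + score[i], min_value): the clamped value has to feed into the next step, which clamping the result of cumsum() afterwards cannot do. A minimal sketch with base R's Reduce() inside data.table follows. The helper name clamped_cumsum is hypothetical (it is not from the original post), and the sketch assumes the running total starts at 0 before the first row.

```r
library(data.table)

# Hypothetical helper: running sum that never drops below min_value.
# Assumption: the running total is 0 before the first element.
clamped_cumsum <- function(x, min_value) {
  if (length(x) == 0L) return(numeric(0))
  # One step of the recurrence: add the next value, then apply the floor.
  step <- function(acc, value) max(acc + value, min_value)
  # accumulate = TRUE keeps every intermediate total; the first element of
  # the result is the initial value 0, so it is dropped with [-1].
  Reduce(step, x, init = 0, accumulate = TRUE)[-1]
}

dt <- data.table(score = c(10, -13, 8, -12, 6))
min_value <- -3
dt[, sum := clamped_cumsum(score, min_value)]
print(dt)
```

Because each step only needs the previous total, the same helper can be applied per group with by = if the table has a grouping column.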

Answer #1 (bejyjqdl)

I know you would ideally like a data.table solution, but here is a simple dplyr approach that first computes the cumsum and then adjusts its values accordingly:

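library(dplyr)  # .default in case_when() needs dplyr >= 1.1.0

dt <- data.frame(score = c(10, -13, 8, -12, 6))

thresh <- -3

# Note: this only looks one row back at the raw cumsum, so it matches the
# expected output for this example but is not a general running clamp.
dt %>% mutate(sum = cumsum(score)) %>% 
  mutate(sum = case_when(
    lag(sum) < thresh ~ score + thresh,  # previous raw sum was below the floor
    sum < thresh ~ thresh,               # current raw sum is below the floor
    is.na(sum) ~ score,
    .default = sum)
    )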

Output:

  score   sum
1    10    10
2   -13    -3
3     8     5
4   -12    -3
5     6     3

Answer #2 (oknrviil)

Here is a dplyr option:

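library(dplyr)
library(purrr)
library(tidyr)
dt <- data.table::data.table(score = c(10, -13, 8, -12, 6, -20, 40))

dt |> 
  mutate(
    ID = row_number(),  # remember the original row order
    raw_score = score,
    # mark rows where the raw cumulative sum is below the threshold
    indicator = (ifelse(cumsum(score) < -3, row_number(), NA))) |> 
  fill(indicator) |>  # carry each marker down so it starts a new group
  group_split(indicator) |> 
  # in every marked group, replace the first score with the threshold
  map_dfr(~.x |> 
        mutate(score = ifelse(
          !is.na(indicator) & row_number() == 1, -3, score
        ))) |> 
  arrange(ID) |> 
  group_by(indicator) |> 
  mutate(sum = cumsum(score)) |>  # restart the cumulative sum in each group
  ungroup() |> 
  select(raw_score, sum)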

Output:

  raw_score   sum
      <dbl> <dbl>
1        10    10
2       -13    -3
3         8     5
4       -12    -3
5         6     3
6       -20    -3
7        40    37
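
As a loop-free alternative for longer inputs, the same running rule (add the next score, then never let the total drop below the minimum) can be sketched with cumsum() and cummin() in base R. The helper name clamped_cumsum_vec is hypothetical, and the sketch assumes the total starts at 0 before the first row. Since it subtracts intermediate totals, results may differ from a step-by-step computation by floating-point rounding.

```r
# Hypothetical helper: vectorised running sum with a lower bound.
# Assumption: the running total is 0 before the first element.
clamped_cumsum_vec <- function(x, min_value) {
  raw <- cumsum(x)
  # Largest amount by which the raw total has dipped below the floor so far
  # (0 if it never has); this is what the clamp has added back in total.
  deficit <- pmin(0, cummin(raw - min_value))
  raw - deficit
}

score <- c(10, -13, 8, -12, 6, -20, 40)
clamped_cumsum_vec(score, min_value = -3)
# returns 10 -3 5 -3 3 -3 37
```

The result can be assigned to a data.table column in the usual way, for example dt[, sum := clamped_cumsum_vec(score, -3)].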
