R: speeding up "group by" operations

6xfqseft  posted on 2023-05-04 in Other

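I have a simulation with a huge aggregate-and-combine step right in the middle. I prototyped this step with plyr's ddply() function, which covers most of my needs well. But I need the aggregation to be faster, because I have to run 10K simulations. I am already scaling the simulations out in parallel, but if this one step were faster I could greatly reduce the number of nodes I need.
Here is a reasonable simplification of what I am trying to do: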

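library(plyr)   # provides ddply()
library(Hmisc)  # provides wtd.mean()

# Set up some example data
year <-    sample(1970:2008, 1e6, replace = TRUE)
state <-   sample(1:50, 1e6, replace = TRUE)
group1 <-  sample(1:6, 1e6, replace = TRUE)
group2 <-  sample(1:3, 1e6, replace = TRUE)
# Only 100 values are drawn here; data.frame() recycles them to 1e6 rows
myFact <-  rnorm(100, 15, 1e6)
weights <- rnorm(1e6)
myDF <- data.frame(year, state, group1, group2, myFact, weights)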

# this is the step I want to make faster
system.time(aggregateDF <- ddply(myDF, c("year", "state", "group1", "group2"),
                     function(df) wtd.mean(df$myFact, weights=df$weights)
                                 )
           )

All tips or suggestions are appreciated!

7fhtutme #1

Instead of a normal R data frame, you can use an immutable data frame, which returns pointers to the original data when you subset it and can be much faster:

idf <- idata.frame(myDF)
system.time(aggregateDF <- ddply(idf, c("year", "state", "group1", "group2"),
   function(df) wtd.mean(df$myFact, weights=df$weights)))

#    user  system elapsed 
# 18.032   0.416  19.250

If I were to write a plyr function customised exactly to this situation, I would do something like this:

system.time({
  ids <- id(myDF[c("year", "state", "group1", "group2")], drop = TRUE)
  data <- as.matrix(myDF[c("myFact", "weights")])
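  # Written against an older plyr; newer releases appear to use
  # split_indices(group, n), in which case drop the first argument.
  indices <- plyr:::split_indices(seq_len(nrow(data)), ids, n = attr(ids, "n"))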

  fun <- function(rows) {
    weighted.mean(data[rows, 1], data[rows, 2])
  }
  values <- vapply(indices, fun, numeric(1))

  labels <- myDF[match(seq_len(attr(ids, "n")), ids), 
    c("year", "state", "group1", "group2")]
  aggregateDF <- cbind(labels, values)
})

# user  system elapsed 
# 2.04    0.29    2.33

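It is much faster because it avoids copying the data: each subset is extracted only at the moment its computation runs. Switching the data to matrix form gives a further speed boost, because matrix subsetting is much faster than data frame subsetting.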
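The same idea can be sketched with base R only, which may help if the internal plyr helper is unavailable. This is a minimal illustration on made-up toy data (the names `toy`, `g1`, `g2`, `x` and `w` are hypothetical), not the plyr implementation: group ids are computed once, only row indices are split, and each subset is taken from a numeric matrix.

```r
# Toy data: two grouping columns, a value column and positive weights.
set.seed(1)
toy <- data.frame(
  g1 = sample(1:3, 200, replace = TRUE),
  g2 = sample(1:2, 200, replace = TRUE),
  x  = rnorm(200),
  w  = runif(200)
)

# One group id per row, built once.
ids <- interaction(toy$g1, toy$g2, drop = TRUE)

# Split the row indices, not the data, so nothing is copied up front.
indices <- split(seq_len(nrow(toy)), ids)

# Numeric matrix: row subsetting is cheaper than on a data frame.
m <- as.matrix(toy[c("x", "w")])

# One weighted mean per group, extracting rows only when needed.
values <- vapply(indices,
                 function(rows) weighted.mean(m[rows, 1], m[rows, 2]),
                 numeric(1))
```

The result is a named numeric vector with one entry per observed group, in the order of the factor levels of `ids`.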

cbeh67ev #2

A further 2x speedup and more concise code:

library(data.table)
dtb <- data.table(myDF, key="year,state,group1,group2")
system.time( 
  res <- dtb[, weighted.mean(myFact, weights), by=list(year, state, group1, group2)] 
)
#   user  system elapsed 
#  0.950   0.050   1.007

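My first post, so please be nice ;)
As of data.table v1.9.2, the setDT function is exported. It converts a data.frame to a data.table by reference (in keeping with data.table parlance, all set* functions modify the object by reference). This means there is no unnecessary copying, so it is fast. You can time it, but the cost is negligible.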

require(data.table)
system.time({
  setDT(myDF)
  res <- myDF[, weighted.mean(myFact, weights), 
             by=list(year, state, group1, group2)] 
})
#   user  system elapsed 
#  0.970   0.024   1.015

This compares with 1.264 seconds for the OP's solution above, where data.table(.) is used to create dtb.
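To make "by reference" concrete, here is a small sketch on a made-up three-row data frame (the name `df` and its columns are hypothetical). It assumes the data.table package is installed. Note that the result of setDT() is not assigned: the existing object is converted in place.

```r
library(data.table)

df <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.5))

# No assignment: df itself becomes a data.table.
setDT(df)
is.data.table(df)

# By contrast, data.table(df) builds a new object from df,
# which is the extra step that setDT() avoids.
```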

h7appiyu #3

I would do it with base R:

g <- with(myDF, paste(year, state, group1, group2))
x <- with(myDF, c(tapply(weights * myFact, g, sum) / tapply(weights, g, sum)))
aggregateDF <- myDF[match(names(x), g), c("year", "state", "group1", "group2")]
aggregateDF$V1 <- x

On my machine this takes 5 seconds, compared with 67 seconds for the original code.

EDIT: I just found another speed-up using the rowsum function:

g <- with(myDF, paste(year, state, group1, group2))
X <- with(myDF, rowsum(data.frame(a=weights*myFact, b=weights), g))
x <- X$a/X$b
aggregateDF2 <- myDF[match(rownames(X), g), c("year", "state", "group1", "group2")]
aggregateDF2$V1 <- x

It takes 3 seconds!
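Both versions rely on the identity that a group's weighted mean equals sum(w * x) / sum(w) over that group, so two grouped sums replace one function call per group. A small check on made-up data (the names `g`, `x` and `w` are hypothetical, and the weights are drawn positive so no group sum of weights is zero):

```r
set.seed(42)
g <- sample(c("a", "b", "c"), 50, replace = TRUE)
x <- rnorm(50)
w <- runif(50)  # positive weights, so the group sums of w are never zero

# rowsum() adds up the rows of a matrix within each group.
sums <- rowsum(cbind(wx = w * x, w = w), g)
fast <- sums[, "wx"] / sums[, "w"]

# Reference: call weighted.mean() once per group.
slow <- vapply(split(seq_along(x), g),
               function(i) weighted.mean(x[i], w[i]),
               numeric(1))

# The two results should agree up to floating-point tolerance.
all.equal(fast, slow)
```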

9bfwbjaz #4

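Are you using the latest version of plyr (note: it has not reached all the CRAN mirrors yet)? If so, you can simply run this in parallel.
Here is the llply example, but the same should apply to ddply: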

x <- seq_len(20)
wait <- function(i) Sys.sleep(0.1)
system.time(llply(x, wait))
#  user  system elapsed 
# 0.007   0.005   2.005 

library(doMC)
registerDoMC(2) 
system.time(llply(x, wait, .parallel = TRUE))
#  user  system elapsed 
# 0.020   0.011   1.038
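Edit:

Well, the other looping approaches are worse, so this probably calls for either (a) C/C++ code or (b) a more fundamental rethink of how you are doing it. I did not even try by(), because in my experience it is very slow.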

groups <- unique(myDF[,c("year", "state", "group1", "group2")])
system.time(
aggregateDF <- do.call("rbind", lapply(1:nrow(groups), function(i) {
   df.tmp <- myDF[myDF$year==groups[i,"year"] & myDF$state==groups[i,"state"] & myDF$group1==groups[i,"group1"] & myDF$group2==groups[i,"group2"],]
   cbind(groups[i,], wtd.mean(df.tmp$myFact, weights=df.tmp$weights))
}))
)

aggregateDF <- data.frame()
system.time(
for(i in 1:nrow(groups)) {
   df.tmp <- myDF[myDF$year==groups[i,"year"] & myDF$state==groups[i,"state"] & myDF$group1==groups[i,"group1"] & myDF$group2==groups[i,"group2"],]
   aggregateDF <- rbind(aggregateDF, data.frame(cbind(groups[i,], wtd.mean(df.tmp$myFact, weights=df.tmp$weights))))
}
)

watbbzwu #5

When the function being applied takes multiple vector arguments, I usually use an index vector with tapply:

system.time(tapply(1:nrow(myDF), myDF[c('year', 'state', 'group1', 'group2')], function(s) weighted.mean(myDF$myFact[s], myDF$weights[s])))
# user  system elapsed 
# 1.36    0.08    1.44
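On a small scale the index-vector trick looks like this. The sketch uses a made-up four-row data frame (`toy`, `grp`, `x` and `w` are hypothetical names): tapply splits the row numbers by group, and the function uses those row numbers to pull matching elements from both vectors.

```r
toy <- data.frame(
  grp = c("a", "a", "b", "b"),
  x   = c(1, 3, 10, 20),
  w   = c(1, 1, 1, 3)
)

# s holds the row numbers of one group at a time.
res <- tapply(seq_len(nrow(toy)), toy["grp"],
              function(s) weighted.mean(toy$x[s], toy$w[s]))
res
# group a: (1*1 + 3*1) / 2 = 2
# group b: (10*1 + 20*3) / 4 = 17.5
```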

I use a simple wrapper that is equivalent but hides the mess:

tmapply(list(myDF$myFact, myDF$weights), myDF[c('year', 'state', 'group1', 'group2')], weighted.mean)

Edited to include tmapply, in response to the comment below:

tmapply = function(XS, INDEX, FUN, ..., simplify=T) {
  FUN = match.fun(FUN)
  if (!is.list(XS))
    XS = list(XS)
  tapply(1:length(XS[[1L]]), INDEX, function(s, ...)
    do.call(FUN, c(lapply(XS, `[`, s), list(...))), ..., simplify=simplify)
}

bnl4lu3b #6

Probably the fastest solution is to use collapse::fgroup_by. It is 8x faster than data.table:

library(collapse)
myDF %>% 
  fgroup_by(year, state, group1, group2) %>% 
  fsummarise(myFact = fmean(myFact, weights))

bm <- bench::mark(
  collapse = myDF %>% 
  fgroup_by(year, state, group1, group2) %>% 
  fsummarise(myFact = fmean(myFact, weights)),
  data.table = dtb[, weighted.mean(myFact, weights), by=list(year, state, group1, group2)],
  check = FALSE)

#> bm
#  expression      min   median itr/se…¹ mem_a…² gc/se…³ n_itr  n_gc total…⁴
#1 collapse      101ms    105ms     9.10  8.84MB    0        5     0   549ms
#2 data.table    852ms    852ms     1.17 24.22MB    2.35     1     2   852ms
