I have a large CSV file with 42 variables and 200,000 records. I want to process it with mapreduce (local backend), but I keep getting the following error:
Error: cannot allocate vector of size 15.6 Gb
In addition: Warning messages:
1: closing unused connection 3 (C:\Users\LSZL~1\AppData\Local\Temp\RtmpgJ2FXm\filea302f8a7363)
2: In paste(rep(l, length(lvs)), rep(lvs, each = length(l)), sep = sep) :
Reached total allocation of 8051Mb: see help(memory.size)
3: In paste(rep(l, length(lvs)), rep(lvs, each = length(l)), sep = sep) :
Reached total allocation of 8051Mb: see help(memory.size)
4: In paste(rep(l, length(lvs)), rep(lvs, each = length(l)), sep = sep) :
Reached total allocation of 8051Mb: see help(memory.size)
5: In paste(rep(l, length(lvs)), rep(lvs, each = length(l)), sep = sep) :
Reached total allocation of 8051Mb: see help(memory.size)
My code:
inputformat <- make.input.format("csv", sep = ",", col.names = column_names)
a <- mapreduce(
  input        = "X:/BigData/working_dir/census-income.data",
  input.format = inputformat,
  map = function(k, v) {
    key <- v
    return(keyval(key, v[1, 1]))
  },
  reduce = function(k, v) {
    key <- k[1, 1]
    val <- sum(k[, 2])
    return(keyval(key, val))
  }
)
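The allocation failure is most likely triggered by `key = v` in the map function: it turns every full 42-column record into a key, so rmr2 has to serialize and group the entire data frame as keys. A minimal corrected sketch, assuming the goal is to sum one numeric column grouped by another (the column indices 1 and 2 below are assumptions; adjust them to your data):

```r
# Sketch only: requires the rmr2 package; column indices are assumptions.
library(rmr2)
rmr.options(backend = "local")

inputformat <- make.input.format("csv", sep = ",", col.names = column_names)

a <- mapreduce(
  input        = "X:/BigData/working_dir/census-income.data",
  input.format = inputformat,
  map = function(k, v) {
    # Emit only the grouping column as the key and the value column as the
    # value, instead of using the whole data frame as the key.
    keyval(v[, 1], v[, 2])
  },
  reduce = function(k, v) {
    # v holds all values collected for one key k.
    keyval(k, sum(v))
  }
)

result <- from.dfs(a)  # pull the key/value result back into R
```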
Is it possible to avoid feeding the unnecessary columns (and their data) into map and reduce, and select only the columns whose data is actually needed?
1 Answer
I finally figured it out. I don't know whether it's efficient, but it works.
Data: http://kdd.ics.uci.edu/databases/census-income/
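The answer's original code was not preserved in this scrape, but the standard way to drop unneeded columns at read time is via `colClasses`: `make.input.format("csv", ...)` forwards its extra arguments to `read.table`, where a `"NULL"` entry in `colClasses` skips that column entirely, so it never reaches the map function. A hedged sketch, assuming only columns 1 and 6 of the 42 are needed, followed by the same mechanism demonstrated in base R:

```r
# Assumption: only columns 1 and 6 of the 42 are actually needed.
# make.input.format("csv", ...) passes extra arguments to read.table,
# so a colClasses vector with "NULL" entries drops columns at parse time:
#
#   classes <- rep("NULL", 42)
#   classes[c(1, 6)] <- NA   # NA = let read.table guess the type
#   inputformat <- make.input.format("csv", sep = ",",
#                                    col.names  = column_names,
#                                    colClasses = classes)
#
# The same mechanism in base R, runnable without rmr2/Hadoop:
txt <- "1,x,10\n2,y,20\n3,z,30"
df  <- read.csv(text = txt, header = FALSE,
                col.names  = c("id", "label", "amount"),
                colClasses = c(NA, "NULL", NA))  # drop the middle column
print(names(df))  # only "id" and "amount" survive
print(ncol(df))   # 2
```

Because the skipped columns are never parsed into R objects, this also reduces peak memory use during the map phase, which is exactly what the allocation error above calls for.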