I am trying to write some data out to HDFS using a custom R map reduce. My read process is quite fast, but the post-processing write takes quite a bit of time. I have tried (functions that can write to a file connection):
# open a writable connection to the task's stdout
output <- file("stdout", "w")
# attempt 1: write the whole table in one call
write.table(base, file = output, sep = ",", row.names = F)
# attempt 2: write the transposed matrix as comma-separated fields
writeLines(t(as.matrix(base)), con = output, sep = ",", useBytes = FALSE)
but write.table only writes part of the data (the first few and last few rows), and writeLines does not work at all. So now I am trying:
for (row in 1:nrow(base)) {
  cat(base[row, ]$field1, ",", base[row, ]$field2, ",", base[row, ]$field3, ",", base[row, ]$field4, ",",
      base[row, ]$field5, ",", base[row, ]$field6, "\n", sep = "")
}
But this writes really slowly. Here are some logs showing just how slow the write is:
2016-07-07 08:59:30,557 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406056
2016-07-07 08:59:40,567 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406422
2016-07-07 08:59:50,582 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406710
2016-07-07 09:00:00,947 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407001
2016-07-07 09:00:11,392 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407316
2016-07-07 09:00:21,832 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407683
2016-07-07 09:00:31,883 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408103
2016-07-07 09:00:41,892 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408536
2016-07-07 09:00:51,895 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408969
2016-07-07 09:01:01,903 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/409377
2016-07-07 09:01:12,187 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/409782
2016-07-07 09:01:22,198 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410161
2016-07-07 09:01:32,293 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410569
2016-07-07 09:01:42,509 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410989
2016-07-07 09:01:52,515 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/411435
2016-07-07 09:02:02,525 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/411814
2016-07-07 09:02:12,625 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/412196
2016-07-07 09:02:22,988 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/412616
2016-07-07 09:02:32,991 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413078
2016-07-07 09:02:43,104 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413508
2016-07-07 09:02:53,115 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413975
2016-07-07 09:03:03,122 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/414415
2016-07-07 09:03:13,128 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/414835
2016-07-07 09:03:23,131 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/415210
2016-07-07 09:03:33,143 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/415643
2016-07-07 09:03:43,153 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/416031
So I am wondering whether I am doing something wrong. I am using data.table.
1 Answer
Based on my experiments with the various functions that can write to a file, I found the following to be the fastest:
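Roughly, a sketch of that approach, reconstructed from the explanation below rather than copied verbatim (the field names follow the question; the exact paste-up of fields is an assumption):

# reconstruction, not verbatim: sapply over the rows, cat-ing each field
# through as.character so factor columns print their labels
x <- sapply(1:nrow(base), function(row) {
  cat(as.character(base[row, ]$field1), ",", as.character(base[row, ]$field2), ",",
      as.character(base[row, ]$field3), ",", as.character(base[row, ]$field4), ",",
      as.character(base[row, ]$field5), ",", as.character(base[row, ]$field6), "\n",
      sep = "")
})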
where x is only there to capture the empty (NULL) returns that sapply throws back, and as.character is there to keep cat from mangling factors (cat would otherwise print the internal factor codes rather than the actual values).
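For example, a minimal sketch of the factor behaviour that as.character avoids:

f <- factor(c("b", "a"))
cat(f, "\n")                # cat drops the levels and prints the codes: 2 1
cat(as.character(f), "\n")  # prints the actual labels: b a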