Fastest way to write to HDFS from R (without any packages)

yruzcnhs  asked on 2021-06-02  in Hadoop

I am trying to write some data to HDFS using a custom R map-reduce. The read step is fast, but the post-processing write takes quite a long time. I have tried (functions that can write to a file connection):

output <- file("stdout", "w")
write.table(base,file=output,sep=",",row.names=F)
writeLines(t(as.matrix(base)), con = output, sep = ",", useBytes = FALSE)

But write.table only writes part of the output (the first few and last few rows), and writeLines does not work at all. So now I am trying:

for (row in 1:nrow(base)) {
  cat(base[row,]$field1, ",", base[row,]$field2, ",", base[row,]$field3, ",", base[row,]$field4, ",",
      base[row,]$field5, ",", base[row,]$field6, "\n", sep = "")
}
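One common way to avoid a per-row `cat` loop entirely is to build every CSV line in a single vectorized `paste()` call and emit them with one `writeLines()`. A minimal sketch, using hypothetical sample data in place of the real `base` table:

```r
# Hypothetical stand-in for the real `base` table (same field1..field6 layout assumed).
base <- data.frame(field1 = c("a", "b"), field2 = 1:2, field3 = c(3.5, 4.5),
                   field4 = c("x", "y"), field5 = 5:6, field6 = c("p", "q"),
                   stringsAsFactors = FALSE)

# paste() over the list of columns builds all CSV lines at once...
lines <- do.call(paste, c(as.list(base), sep = ","))

# ...and a single writeLines() pushes them to stdout in one call.
writeLines(lines, con = stdout())
```

Because both `paste()` and `writeLines()` are vectorized, the per-row R interpreter overhead of the loop above disappears; only the column-to-character coercion is paid once per column.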

But this writes very slowly. Here are some logs showing just how slow the write is:
2016-07-07 08:59:30,557 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406056
2016-07-07 08:59:40,567 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406422
2016-07-07 08:59:50,582 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406710
2016-07-07 09:00:00,947 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407001
2016-07-07 09:00:11,392 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407316
2016-07-07 09:00:21,832 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407683
2016-07-07 09:00:31,883 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408103
2016-07-07 09:00:41,892 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408536
2016-07-07 09:00:51,895 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408969
2016-07-07 09:01:01,903 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/409377
2016-07-07 09:01:12,187 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/409782
2016-07-07 09:01:22,198 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410161
2016-07-07 09:01:32,293 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410569
2016-07-07 09:01:42,509 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410989
2016-07-07 09:01:52,515 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/411435
2016-07-07 09:02:02,525 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/411814
2016-07-07 09:02:12,625 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/412196
2016-07-07 09:02:22,988 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/412616
2016-07-07 09:02:32,991 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413078
2016-07-07 09:02:43,104 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413508
2016-07-07 09:02:53,115 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413975
2016-07-07 09:03:03,122 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/414415
2016-07-07 09:03:13,128 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/414835
2016-07-07 09:03:23,131 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/415210
2016-07-07 09:03:33,143 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/415643
2016-07-07 09:03:43,153 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/416031
So I would like to know whether I am doing something wrong. I am using data.table.


5vf7fwbs1#

Based on my experiments with the various functions that can write to a file, I found the following to be the fastest:

base <- data.table(apply(base, 2, FUN = as.character), stringsAsFactors = FALSE)
x <- sapply(1:nrow(base), FUN = function(row) {
  cat(base$field1[row], ",", base$field2[row], ",", base$field3[row], ",",
      base$field4[row], ",", base$field5[row], ",", base$field6[row], "\n", sep = "")
})
rm(x)

where x exists only to capture the empty returns that sapply produces, and as.character is there to stop cat from printing factors incorrectly (it would print the internal factor codes rather than the actual values).
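Since data.table is already loaded here, its fwrite() function (available since data.table 1.9.8) is another option worth trying: it serializes in multithreaded C code and, with file = "", writes the CSV straight to the console, i.e. stdout in a streaming job. A minimal sketch with hypothetical sample data:

```r
library(data.table)

# Hypothetical stand-in for the real `base` table.
base <- data.table(field1 = c("a", "b"), field2 = 1:2, field3 = c(3.5, 4.5))

# fwrite() formats rows in parallel C code; file = "" sends the CSV to stdout.
# col.names = FALSE suppresses the header, matching the per-row cat() output.
fwrite(base, file = "", col.names = FALSE)
```

This keeps the whole character conversion and buffering inside C, so it avoids both the per-row interpreter overhead and the manual as.character step above.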
