s3-dist-cp --groupBy equivalent on Dataproc using hadoop distcp commands

Posted by mbyulnm0 on 2021-05-29 in Hadoop

On EMR, I use s3-dist-cp --groupBy so that the file in a folder is written to S3 under the name I want:

s3-dist-cp --groupBy='.*(folder_in_hdfs).*' --src=hdfs:///user/testUser/tmp-location/folder_in_hdfs --dest=s3://testLocation/folder_in_s3

Example:

hadoop fs -ls hdfs:///user/testUser/tmp-location/folder_in_hdfs
Found 2 items
-rw-r--r--   1 hadoop hadoop          0 2019-04-05 14:54 hdfs:///user/testUser/tmp-location/folder_in_hdfs/file.csv/_SUCCESS
-rw-r--r--   1 hadoop hadoop     493077 2019-04-05 14:54 hdfs:///user/testUser/tmp-location/folder_in_hdfs/file.csv/part-00000-12db8851-31be-4b08-8a93-1887e534941d-c000.csv

After running s3-dist-cp:

aws s3 ls s3://testLocation/folder_in_s3/
s3://testLocation/folder_in_s3/file.csv

However, I would like to achieve the same thing on Dataproc using hadoop distcp commands, writing the file to the GCS location gs://testLocation/folder_in_gs/file.csv. Any help is appreciated.

hadoop google-cloud-dataproc DistCp s3distcp

Source: https://stackoverflow.com/questions/55554402/s3-dist-cp-groupby-equivalent-on-dataproc-using-hadoop-distcp-commands

1 Answer

rbl8hiat1#

DistCp on Dataproc has no such feature.
That said, you can get the same result by running a simple bash script that uses gsutil compose after DistCp finishes:

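DESTINATION=gs://bucket/path/to/destination/file
# List the objects that DistCp wrote
FILES=($(gsutil ls gs://testLocation/**folder_in_gs**))
# gsutil compose accepts at most 32 source objects per call, so merge the first 32
gsutil compose "${FILES[@]::32}" "${DESTINATION}"
# Append each remaining object to the destination, one at a time
echo "${FILES[@]:32}" | xargs -n 1 | xargs -I {} gsutil compose "${DESTINATION}" {} "${DESTINATION}"
# Remove the source objects once they have been merged
gsutil -m rm gs://testLocation/**folder_in_gs**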
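
The same idea can be sketched as a reusable function that appends the remaining objects in batches rather than one at a time. This is only a sketch under stated assumptions, not part of the original answer: the function name compose_into and the GSUTIL override variable are hypothetical, and the staging prefix folder_in_gs_tmp in the usage comment is made up for illustration. It relies on gsutil compose accepting at most 32 source objects per call and, like the script above, on the destination living in the same bucket as the parts.

```shell
#!/usr/bin/env bash
# Sketch: merge the objects DistCp wrote into a single GCS object.
# compose_into and GSUTIL are hypothetical names introduced for this example.

GSUTIL="${GSUTIL:-gsutil}"   # overridable so the batching logic can be dry-run with a stub
MAX_COMPONENTS=32            # gsutil compose takes at most 32 source objects per call

# compose_into DEST SRC...
# Concatenates the SRC objects, in the order given, into DEST.
compose_into() {
  local dest="$1"; shift
  local -a parts=("$@")
  if [ "${#parts[@]}" -eq 0 ]; then
    echo "compose_into: no source objects given" >&2
    return 1
  fi
  # First call: up to 32 parts create the destination object.
  "$GSUTIL" compose "${parts[@]:0:$MAX_COMPONENTS}" "$dest" || return 1
  # Later calls: the destination occupies one source slot, so 31 new parts fit.
  local i=$MAX_COMPONENTS
  local step=$((MAX_COMPONENTS - 1))
  while [ "$i" -lt "${#parts[@]}" ]; do
    "$GSUTIL" compose "$dest" "${parts[@]:$i:$step}" "$dest" || return 1
    i=$((i + step))
  done
}

# Example usage (not executed here; folder_in_gs_tmp is a hypothetical staging prefix):
#   hadoop distcp hdfs:///user/testUser/tmp-location/folder_in_hdfs gs://testLocation/folder_in_gs_tmp
#   mapfile -t PARTS < <(gsutil ls 'gs://testLocation/folder_in_gs_tmp/**' | grep '/part-' | sort)
#   compose_into gs://testLocation/folder_in_gs/file.csv "${PARTS[@]}"
#   gsutil -m rm 'gs://testLocation/folder_in_gs_tmp/**'
```

Sorting the listing keeps part-00000, part-00001, and so on in order, and filtering on /part- leaves the empty _SUCCESS marker out of the merged file. Writing to a staging prefix first avoids having the final object share a name with the directory-like prefix that DistCp creates.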

Answered 2021-05-29
