Dataproc does not unpack files passed as archive

Posted by 5anewei6 on 2021-05-27 in Spark


I'm trying to submit a .NET for Apache Spark job to Dataproc.
The command line looks like this:

gcloud dataproc jobs submit spark \
         --cluster=<cluster> \
         --region=<region> \
         --class=org.apache.spark.deploy.dotnet.DotnetRunner \
         --jars=gs://bucket/microsoft-spark-2.4.x-0.11.0.jar \
         --archives=gs://bucket/dotnet-build-output.zip \
         -- find

This command line should invoke find to list the files in the current directory.
I only see two files:

././microsoft-spark-2.4.x-0.11.0.jar
././microsoft-spark-2.4.x-0.11.0.jar.crc

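So GCP does not unpack the file specified with --archives. The specified file exists, and the path was copied from the GCP UI. I also tried running a specific assembly file from the archive (it does exist there), but that failed with File does not exist.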

apache-spark .net google-cloud-platform google-cloud-dataproc

Source: https://stackoverflow.com/questions/62645635/dataproc-does-not-unpack-files-passed-as-archive

2 Answers


uqdfh47h1#

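I think the problem is that your command runs inside the Spark driver, and the driver runs on the master node because Dataproc uses client mode by default. You can change this by adding --properties spark.submit.deployMode=cluster when submitting the job.
According to the usage help for the --archives flag: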

--archives=[ARCHIVE,...]
   Comma separated list of archives to be extracted into the working
   directory of each executor. Must be one of the following file formats:
   .zip, .tar, .tar.gz, or .tgz.

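The archives are extracted only on the worker nodes. I tested submitting a job with --archives=gs://my-bucket/foo.zip, where the archive contains 2 files, foo.txt and deps.txt, and I could find the extracted files on a worker node: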

my-cluster-w-0:~$ sudo ls -l /hadoop/yarn/nm-local-dir/usercache/root/filecache/40/foo.zip/

total 4
-r-x------ 1 yarn yarn 11 Jul  2 22:09 deps.txt
-r-x------ 1 yarn yarn  0 Jul  2 22:09 foo.txt
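
Putting the suggestion together with the command from the question, a submission in cluster mode might look like the sketch below. The helper name submit_dotnet_find and its three arguments (cluster, region, bucket) are hypothetical placeholders, not part of the original thread; the jar and archive names are the ones from the question. This has not been verified against a real cluster: it assumes that in cluster mode the driver runs in a YARN container on a worker, where the archive is extracted into a directory named after the archive (as in the listing above), and that the driver's output may then appear in the YARN container logs rather than in the job's console output.

```shell
# Sketch: the question's submit command with cluster deploy mode added.
# submit_dotnet_find is a hypothetical helper name; the cluster, region and
# bucket arguments are placeholders that the caller supplies.
submit_dotnet_find() {
    cluster="$1"
    region="$2"
    bucket="$3"

    # spark.submit.deployMode=cluster asks YARN to run the driver in a
    # container on a worker node instead of on the master node.
    gcloud dataproc jobs submit spark \
        --cluster="$cluster" \
        --region="$region" \
        --class=org.apache.spark.deploy.dotnet.DotnetRunner \
        --jars="gs://$bucket/microsoft-spark-2.4.x-0.11.0.jar" \
        --archives="gs://$bucket/dotnet-build-output.zip" \
        --properties=spark.submit.deployMode=cluster \
        -- find
}
```

It could be called as submit_dotnet_find my-cluster us-central1 my-bucket, where all three values are made-up examples.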


gg0vcinb2#

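As @Dagang mentioned, the --archives and --files parameters do not copy the zip file to the driver instance, so that is the wrong direction.
I used this approach instead: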

gcloud dataproc jobs submit spark \
        --cluster=<cluster> \
        --region=<region> \
        --class=org.apache.spark.deploy.dotnet.DotnetRunner \
        --jars=gs://<bucket>/microsoft-spark-2.4.x-0.11.0.jar \
        -- /bin/sh -c "gsutil cp gs://<bucket>/builds/test.zip . && unzip -n test.zip && chmod +x ./Spark.Job.Test && ./Spark.Job.Test"
