Conda env problem when running pyspark

khbbv19g · posted 2021-07-14 in Spark

I am trying to ship a Python environment with PySpark, following these guides:
https://community.cloudera.com/t5/community-articles/running-pyspark-with-conda-env/ta-p/247551
http://henning.kropponline.de/2014/07/18/virtualenv-hadoop-streaming/
https://community.cloudera.com/t5/community-articles/using-virtualenv-with-pyspark/ta-p/245932
On my last attempt I issued the following commands:

  conda create -y -n py35 python=3.5 numpy pandas scikit-learn
  conda activate py35
  cd /opt/cloudera/parcels/Anaconda/envs/
  zip -r py35.zip py35
  # Running the job:
  PYSPARK_DRIVER_PYTHON=/opt/cloudera/parcels/Anaconda/envs/py35/bin/python \
  PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/envs/py35/bin/python \
  pyspark2 \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/envs/py35/bin/python \
  --master yarn \
  --deploy-mode client \
  --archives /opt/cloudera/parcels/Anaconda/envs/py35.zip#py35 test.py
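
For comparison, the recipes in the articles above point the executor Python at the interpreter inside the unpacked archive (the path under the #py35 alias) rather than at the local parcel path. A rough sketch of that variant for the same archive and client deploy mode follows; the use of spark.executorEnv.PYSPARK_PYTHON and the relative ./py35 path are my reading of those guides, not something verified on this cluster:

  # Sketch (unverified): executors use the Python that YARN unpacks from the
  # archive under the ./py35 alias; the driver (client mode) keeps using the
  # local conda env.
  PYSPARK_DRIVER_PYTHON=/opt/cloudera/parcels/Anaconda/envs/py35/bin/python \
  PYSPARK_PYTHON=./py35/bin/python \
  pyspark2 \
  --conf spark.executorEnv.PYSPARK_PYTHON=./py35/bin/python \
  --master yarn \
  --deploy-mode client \
  --archives /opt/cloudera/parcels/Anaconda/envs/py35.zip#py35 test.py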

test.py:

  #!./py35/bin/python
  # -*- coding: utf-8 -*-
  from pyspark import SparkConf
  from pyspark import SparkContext

  conf = SparkConf()
  conf.setAppName('spark-yarn')
  sc = SparkContext(conf=conf)

  def some_function(x):
      # Packages are imported and available from your bundled environment.
      import sklearn
      import pandas
      import numpy as np
      # Use the libraries to do work
      return np.sin(x)**2 + 2

  rdd = (sc.parallelize(range(1000))
         .map(some_function)
         .take(10))
  print(rdd)
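
For what it is worth, a small script like the one below (a sketch, not something run here) would show which interpreter path and version the driver and the executors actually launch, which is useful for verifying any change to the submission above:

  #!./py35/bin/python
  # Sketch (unverified): report the interpreter path and version seen by the
  # driver and by the executors.
  import sys
  from pyspark import SparkConf, SparkContext

  sc = SparkContext(conf=SparkConf().setAppName('python-version-check'))

  driver = (sys.executable, '%d.%d' % sys.version_info[:2])
  workers = (sc.parallelize(range(8), 2)
             .map(lambda _: (sys.executable, '%d.%d' % sys.version_info[:2]))
             .distinct()
             .collect())

  print('driver :', driver)
  print('workers:', workers)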

Unexpected result: Exception: Python in worker has different version 2.7 than that in driver 3.7, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

  1. 21/04/12 18:23:17 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, server, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  2. File "/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/pyspark/worker.py", line 267, in main
  3. ("%d.%d" % sys.version_info[:2], version))
  4. Exception: Python in worker has different version 2.7 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
  5. at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
  6. at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
  7. at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
  8. at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
  9. at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
  10. at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  11. at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
  12. at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
  13. at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  14. at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
  15. at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
  16. at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
  17. at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
  18. at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
  19. at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
  20. at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
  21. at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
  22. at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
  23. at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  24. at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  25. at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  26. at org.apache.spark.scheduler.Task.run(Task.scala:121)
  27. at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  28. at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
  29. at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  30. at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  31. at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  32. at java.lang.Thread.run(Thread.java:748)
  33. 21/04/12 18:23:17 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
  34. Traceback (most recent call last):
  35. File "<stdin>", line 1, in <module>
  36. File "/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/pyspark/rdd.py", line 1055, in count
  37. return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  38. File "/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/pyspark/rdd.py", line 1046, in sum
  39. return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  40. File "/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/pyspark/rdd.py", line 917, in fold
  41. vals = self.mapPartitions(func).collect()
  42. File "/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/pyspark/rdd.py", line 816, in collect
  43. sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  44. File "/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  45. File "/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/pyspark/sql/utils.py", line 63, in deco
  46. return f(*a,**kw)
  47. File "/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
  48. py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
  49. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, server, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  50. File "/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/pyspark/worker.py", line 267, in main
  51. ("%d.%d" % sys.version_info[:2], version))
  52. Exception: Python in worker has different version 2.7 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
  53. at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
  54. at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
  55. at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
  56. at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
  57. at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
  58. at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  59. at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
  60. at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
  61. at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  62. at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
  63. at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
  64. at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
  65. at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
  66. at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
  67. at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
  68. at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
  69. at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
  70. at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
  71. at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  72. at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  73. at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  74. at org.apache.spark.scheduler.Task.run(Task.scala:121)
  75. at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  76. at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
  77. at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  78. at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  79. at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  80. at java.lang.Thread.run(Thread.java:748)
  81. Driver stacktrace:
  82. at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
  83. at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
  84. at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
  85. at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  86. at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  87. at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
  88. at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
  89. at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
  90. at scala.Option.foreach(Option.scala:257)
  91. at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
  92. at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
  93. at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
  94. at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
  95. at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  96. at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
  97. at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
  98. at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
  99. at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
  100. at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
  101. at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
  102. at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  103. at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  104. at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  105. at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
  106. at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
  107. at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
  108. at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  109. at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  110. at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  111. at java.lang.reflect.Method.invoke(Method.java:498)
  112. at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  113. at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  114. at py4j.Gateway.invoke(Gateway.java:282)
  115. at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  116. at py4j.commands.CallCommand.execute(CallCommand.java:79)
  117. at py4j.GatewayConnection.run(GatewayConnection.java:238)
  118. at java.lang.Thread.run(Thread.java:748)
  119. Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  120. File "/opt/cloudera/parcels/SPARK2-2.4.0.cloudera2-1.cdh5.13.3.p0.1041012/lib/spark2/python/pyspark/worker.py", line 267, in main
  121. ("%d.%d" % sys.version_info[:2], version))
  122. Exception: Python in worker has different version 2.7 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
  123. at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
  124. at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
  125. at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
  126. at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
  127. at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
  128. at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  129. at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
  130. at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
  131. at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  132. at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
  133. at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
  134. at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
  135. at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
  136. at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
  137. at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
  138. at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
  139. at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
  140. at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
  141. at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  142. at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
  143. at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  144. at org.apache.spark.scheduler.Task.run(Task.scala:121)
  145. at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  146. at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
  147. at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  148. at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  149. at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  150. ... 1 more

I am running Spark 2.4.0, and the base Python is 2.7.16 (GCC 7.3.0 / OL6).
CDH 5.13.3 - yes, it is old..
Any input would be appreciated. Thanks.

No answers yet.

