Pickling error when creating a new PySpark DataFrame by processing an old DataFrame with foreach

eyh26e7m · posted 2021-07-14 in Spark

Given a PySpark DataFrame given_df, I need to use it to generate a new DataFrame new_df.
I am trying to do this with the foreach() method. For simplicity, assume that both given_df and new_df consist of a single column.
I have to process each row of given_df and, based on the value present in that cell, create some new rows and add them to new_df via a union. The number of rows generated from processing one row of given_df is variable.

    # Create an empty DataFrame initially.
    new_df = spark.createDataFrame([], schema=['SampleField'])

    def func(row):
        rows_to_append = getNewRowsAfterProcessingCurrentRow(row)
        # Without this declaration, the next line raises an error, because Python
        # would treat new_df as a local variable accessed before assignment.
        global new_df
        new_df = new_df.union(spark.createDataFrame(data=rows_to_append, schema=['SampleField']))

    # given_df already contains some data. Run the function for each of its rows.
    given_df.foreach(func)

However, this results in a pickling error.
If the union line is commented out, the error does not occur.

    PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
    Traceback (most recent call last):
      File "/databricks/spark/python/pyspark/serializers.py", line 476, in dumps
        return cloudpickle.dumps(obj, pickle_protocol)
      File "/databricks/spark/python/pyspark/cloudpickle.py", line 1097, in dumps
        cp.dump(obj)
      File "/databricks/spark/python/pyspark/cloudpickle.py", line 356, in dump
        return Pickler.dump(self, obj)
      File "/databricks/python/lib/python3.7/pickle.py", line 437, in dump
        self.save(obj)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/python/lib/python3.7/pickle.py", line 789, in save_tuple
        save(element)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
        self.save_function_tuple(obj)
      File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
        save(state)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
        self._batch_setitems(obj.items())
      File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
        save(v)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
        self._batch_appends(obj)
      File "/databricks/python/lib/python3.7/pickle.py", line 843, in _batch_appends
        save(x)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
        self.save_function_tuple(obj)
      File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
        save(state)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
        self._batch_setitems(obj.items())
      File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
        save(v)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
        self._batch_appends(obj)
      File "/databricks/python/lib/python3.7/pickle.py", line 843, in _batch_appends
        save(x)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
        self.save_function_tuple(obj)
      File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
        save(state)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
        self._batch_setitems(obj.items())
      File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
        save(v)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
        self._batch_appends(obj)
      File "/databricks/python/lib/python3.7/pickle.py", line 843, in _batch_appends
        save(x)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
        self.save_function_tuple(obj)
      File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
        save(state)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
        self._batch_setitems(obj.items())
      File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
        save(v)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
        self._batch_appends(obj)
      File "/databricks/python/lib/python3.7/pickle.py", line 846, in _batch_appends
        save(tmp[0])
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
        self.save_function_tuple(obj)
      File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
        save(state)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
        self._batch_setitems(obj.items())
      File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
        save(v)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
        self._batch_appends(obj)
      File "/databricks/python/lib/python3.7/pickle.py", line 846, in _batch_appends
        save(tmp[0])
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
        self.save_function_tuple(obj)
      File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
        save(state)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
        self._batch_setitems(obj.items())
      File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
        save(v)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
        self._batch_appends(obj)
      File "/databricks/python/lib/python3.7/pickle.py", line 846, in _batch_appends
        save(tmp[0])
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/spark/python/pyspark/cloudpickle.py", line 495, in save_function
        self.save_function_tuple(obj)
      File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
        save(state)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
        self._batch_setitems(obj.items())
      File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
        save(v)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
        self._batch_setitems(obj.items())
      File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
        save(v)
      File "/databricks/python/lib/python3.7/pickle.py", line 549, in save
        self.save_reduce(obj=obj, *rv)
      File "/databricks/python/lib/python3.7/pickle.py", line 662, in save_reduce
        save(state)
      File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
        f(self, obj) # Call unbound method with explicit self
      File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
        self._batch_setitems(obj.items())
      File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
        save(v)
      File "/databricks/python/lib/python3.7/pickle.py", line 524, in save
        rv = reduce(self.proto)
      File "/databricks/spark/python/pyspark/context.py", line 356, in __getnewargs__
        "It appears that you are attempting to reference SparkContext from a broadcast "
    Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
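
The closure itself is what trips the pickler: foreach() has to serialize func to ship it to the executors, and because func references new_df, the serializer also tries to capture the SparkSession/SparkContext behind that DataFrame, which SPARK-5063 forbids. For comparison, a driver-only rework of the same loop avoids the problem entirely, at the cost of collecting every row to the driver; a minimal sketch, assuming given_df is small enough to collect() and that getNewRowsAfterProcessingCurrentRow() returns a list of one-field tuples:

    # All of this runs on the driver: collect() turns the rows into plain
    # Python objects, so no closure containing Spark objects is ever pickled.
    rows_to_append = []
    for row in given_df.collect():
        rows_to_append.extend(getNewRowsAfterProcessingCurrentRow(row))

    new_df = spark.createDataFrame(rows_to_append, schema=['SampleField'])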

To better understand what I am trying to do, here is an example of a possible use case:
Say given_df is a DataFrame of sentences, where each sentence consists of words separated by spaces.

    given_df = spark.createDataFrame([("The old brown fox",), ("jumps over",), ("the lazy log",)], schema=["SampleField"])

new_df would be a DataFrame with each word on its own row. So we process each row of given_df, split it into words, and insert each resulting word as a row into new_df:

    new_df = spark.createDataFrame([("The",), ("old",), ("brown",), ("fox",), ("jumps",), ("over",), ("the",), ("lazy",), ("log",)], schema=["SampleField"])

mi7gmzs6 1#

You are trying to use the DataFrame API on the executors, which is not allowed, hence the PicklingError:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
You should rewrite your code. For example, you could use RDD.flatMap or, if you prefer the DataFrame API, the explode() function.
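For completeness, a minimal sketch of the RDD.flatMap route; the lambda stands in for the row-processing logic of the word-splitting example:

    # Each sentence is split on the executors, yielding one tuple per word.
    # The lambda touches only plain Python values, never a DataFrame or the
    # SparkContext, so it pickles cleanly.
    new_df = given_df.rdd \
        .flatMap(lambda row: [(word,) for word in row["SampleField"].split()]) \
        .toDF(["SampleField"])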
And here is how to do it with the latter, DataFrame-based approach:

    from pyspark.sql.functions import udf, explode
    from pyspark.sql.types import ArrayType, StringType

    given_df = spark.createDataFrame([("The old brown fox",), ("jumps over",), ("the lazy log",)], schema=["SampleField"])

    @udf(returnType=ArrayType(StringType()))
    def getNewRowsAfterProcessingCurrentRow(str):
        return str.split()

    new_df = given_df \
        .select(explode(getNewRowsAfterProcessingCurrentRow("SampleField")).alias("SampleField")) \
        .unionAll(given_df)

    new_df.show()

You wrap getNewRowsAfterProcessingCurrentRow() in udf(). This merely makes your function available to the DataFrame API.
Then you apply another function called explode(). This is needed because you want to "explode" (or transpose) the split sentence into multiple rows, one word per row.
Finally, you take the resulting DataFrame and union it with the original DataFrame given_df.
Output:

    +-----------------+
    |      SampleField|
    +-----------------+
    |              The|
    |              old|
    |            brown|
    |              fox|
    |            jumps|
    |             over|
    |              the|
    |             lazy|
    |              log|
    |The old brown fox|
    |       jumps over|
    |     the lazy log|
    +-----------------+
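
As a side note, this particular example does not strictly need a Python UDF: Spark's built-in split() column function produces the same word arrays without the Python serialization round-trip. A sketch of that variant, which yields the same output as above:

    from pyspark.sql.functions import split, explode

    # split() is evaluated natively by Spark SQL; no Python UDF is involved.
    new_df = given_df \
        .select(explode(split("SampleField", " ")).alias("SampleField")) \
        .unionAll(given_df)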
