Given a PySpark DataFrame given_df, I need to use it to generate a new DataFrame new_df.
I am trying to use the foreach() method. For simplicity, assume that both given_df and new_df consist of a single column.
I have to process each row of this DataFrame and, based on the value present in that cell, create some new rows and add them to new_df by union-ing them into it one row at a time. The number of rows to generate while processing a single row of given_df is variable.
new_df = spark.createDataFrame([], schema='SampleField STRING')  # start with an empty DataFrame (the type must be explicit, since it cannot be inferred from empty data)

def func(row):
    rows_to_append = getNewRowsAfterProcessingCurrentRow(row)
    # Without the global declaration, Python treats new_df as a local
    # variable and raises an error for referencing it before assignment.
    global new_df
    new_df = new_df.union(spark.createDataFrame(data=rows_to_append, schema=['SampleField']))

given_df.foreach(func)  # given_df already contains loaded data; run func for each of its rows
However, this results in a pickling error. If the union line is commented out, no error occurs.
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 476, in dumps
return cloudpickle.dumps(obj, pickle_protocol)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 1097, in dumps
cp.dump(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 356, in dump
return Pickler.dump(self, obj)
File "/databricks/python/lib/python3.7/pickle.py", line 437, in dump
self.save(obj)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 789, in save_tuple
save(element)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
self.save_function_tuple(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
self._batch_appends(obj)
File "/databricks/python/lib/python3.7/pickle.py", line 843, in _batch_appends
save(x)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
self.save_function_tuple(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
self._batch_appends(obj)
File "/databricks/python/lib/python3.7/pickle.py", line 843, in _batch_appends
save(x)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
self.save_function_tuple(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
self._batch_appends(obj)
File "/databricks/python/lib/python3.7/pickle.py", line 843, in _batch_appends
save(x)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
self.save_function_tuple(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
self._batch_appends(obj)
File "/databricks/python/lib/python3.7/pickle.py", line 846, in _batch_appends
save(tmp[0])
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
self.save_function_tuple(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
self._batch_appends(obj)
File "/databricks/python/lib/python3.7/pickle.py", line 846, in _batch_appends
save(tmp[0])
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/spark/python/pyspark/cloudpickle.py", line 500, in save_function
self.save_function_tuple(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 819, in save_list
self._batch_appends(obj)
File "/databricks/python/lib/python3.7/pickle.py", line 846, in _batch_appends
save(tmp[0])
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/spark/python/pyspark/cloudpickle.py", line 495, in save_function
self.save_function_tuple(obj)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 729, in save_function_tuple
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 549, in save
self.save_reduce(obj=obj, *rv)
File "/databricks/python/lib/python3.7/pickle.py", line 662, in save_reduce
save(state)
File "/databricks/python/lib/python3.7/pickle.py", line 504, in save
f(self, obj) # Call unbound method with explicit self
File "/databricks/python/lib/python3.7/pickle.py", line 859, in save_dict
self._batch_setitems(obj.items())
File "/databricks/python/lib/python3.7/pickle.py", line 885, in _batch_setitems
save(v)
File "/databricks/python/lib/python3.7/pickle.py", line 524, in save
rv = reduce(self.proto)
File "/databricks/spark/python/pyspark/context.py", line 356, in __getnewargs__
"It appears that you are attempting to reference SparkContext from a broadcast "
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
To better understand what I am trying to do, here is an example illustrating a possible use case:
Let's say given_df is a DataFrame of sentences, where each sentence consists of some words separated by spaces.
given_df = spark.createDataFrame([("The old brown fox",), ("jumps over",), ("the lazy dog",)], schema=["SampleField"])
new_df would then be a DataFrame with each word on a separate row. So we would process each row of given_df, and for each word obtained by splitting that row, we would insert a new row into new_df.
new_df = spark.createDataFrame([("The",), ("old",), ("brown",), ("fox",), ("jumps",), ("over",), ("the",), ("lazy",), ("dog",)], schema=["SampleField"])
1 Answer
You are trying to use the DataFrame API on the executors, where it is not allowed, hence the PicklingError:

PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
You should rewrite your code. For example, you could use RDD.flatMap or, if you prefer the DataFrame API, the explode() function. Here is how to do it with the latter approach:
You wrap getNewRowsAfterProcessingCurrentRow() in a udf(). That merely makes your function available to the DataFrame API. Then you combine it with a function called explode(). This is needed because you want to "explode" (or transpose) the split sentences into multiple rows, one word per row. Finally, you select the exploded column from the original given_df to obtain the resulting DataFrame, as sketched below.
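
A minimal sketch of this approach, assuming getNewRowsAfterProcessingCurrentRow() splits a sentence into its words and returns them as a list (the question does not show its implementation):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# Assumed implementation: split a sentence into its words.
def getNewRowsAfterProcessingCurrentRow(sentence):
    return sentence.split()

# Wrapping the function in a UDF makes it usable in the DataFrame API.
# It returns an array of strings, one element per generated row.
process_udf = udf(getNewRowsAfterProcessingCurrentRow, ArrayType(StringType()))

given_df = spark.createDataFrame([("The old brown fox",), ("jumps over",), ("the lazy dog",)], schema=["SampleField"])

# explode() turns each element of the array into a row of its own,
# so one input row can yield a variable number of output rows.
new_df = given_df.select(explode(process_udf("SampleField")).alias("SampleField"))

new_df.show()
# +-----------+
# |SampleField|
# +-----------+
# |        The|
# |        old|
# |      brown|
# |        fox|
# |      jumps|
# |       over|
# |        the|
# |       lazy|
# |        dog|
# +-----------+

Because everything stays inside the DataFrame API, no worker code ever touches the SparkContext, which is what triggered the PicklingError in the foreach() version.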
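
The RDD.flatMap alternative mentioned above would look roughly like this, under the same assumption about the processing function. flatMap can emit zero or more output records per input row, which matches the variable number of generated rows:

# Each row of given_df is mapped to a list of one-field tuples,
# which flatMap flattens into individual records.
words_rdd = given_df.rdd.flatMap(
    lambda row: [(word,) for word in getNewRowsAfterProcessingCurrentRow(row.SampleField)]
)
new_df = spark.createDataFrame(words_rdd, schema=["SampleField"])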