I'm using PySpark inside a Streamlit app, and I can successfully create a SparkSession in the app (acting as a Spark client) that talks to the cluster. I can retrieve data from my Spark cluster with Spark SQL, e.g. listing the Hive databases:
df1 = spark.sql("show databases")
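For reference, the session is created roughly as follows (a simplified sketch; the app name here is a placeholder and the real master/deploy configuration is omitted). enableHiveSupport() is what makes the Hive query above work:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("streamlit-spark-client")  # placeholder app name
    .enableHiveSupport()                # needed for Hive metastore queries
    .getOrCreate()
)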
From within the app, I want to create a simple DataFrame, like this:
import streamlit as st
import pandas as pd  # pandas must be installed for df.toPandas()

# "spark" is the SparkSession created earlier in the app
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])
st.dataframe(df.toPandas())
The first time I run the Streamlit app, I get the expected output:
+----+----+
|Col1|Col2|
+----+----+
| a| 1|
| b| 2|
| c| 3|
+----+----+
But when I modify the code slightly, e.g. changing ("c", 3) to ("c", 4), then save and rerun the app, the following exception is raised:
PicklingError: Could not pickle object as excessively deep recursion required.
Traceback:
File "/opt/streamlit/envs/streamlit/lib/python3.7/site-packages/streamlit/script_runner.py", line 324, in _run_script
exec(code, module.__dict__)
File "/opt/streamlit/app/8550.py", line 133, in <module>
schemaPeople = spark.createDataFrame(people)
File "/usr/python/pyspark/sql/session.py", line 687, in createDataFrame
rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
File "/usr/python/pyspark/sql/session.py", line 384, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio, names=schema)
File "/usr/python/pyspark/sql/session.py", line 355, in _inferSchema
first = rdd.first()
File "/usr/python/pyspark/rdd.py", line 1376, in first
rs = self.take(1)
File "/usr/python/pyspark/rdd.py", line 1358, in take
res = self.context.runJob(self, takeUpToNumLeft, p)
File "/usr/python/pyspark/context.py", line 1001, in runJob
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
File "/usr/python/pyspark/rdd.py", line 2470, in _jrdd
self._jrdd_deserializer, profiler)
File "/usr/python/pyspark/rdd.py", line 2403, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/usr/python/pyspark/rdd.py", line 2389, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "/usr/python/pyspark/serializers.py", line 568, in dumps
return cloudpickle.dumps(obj, 2)
File "/usr/python/pyspark/cloudpickle.py", line 919, in dumps
cp.dump(obj)
File "/usr/python/pyspark/cloudpickle.py", line 239, in dump
raise pickle.PicklingError(msg)
Tracing the exception back to its root cause: the CloudPickler fails to serialize some object (possibly a Spark RDD), and the error is raised from Pickler.dump(self, obj). When the same code runs in a plain PySpark shell, with no Streamlit involved, no error occurs. I therefore suspect the cause is related to Streamlit's caching mechanism. Is there a way to work around this?
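Two things I have considered trying, sketched below (untested, not a confirmed fix). First, supplying an explicit schema to createDataFrame should skip the _inferSchema -> rdd.first() path that appears in the traceback, so the pickling step that fails at DataFrame creation time is avoided. Second, pinning the SparkSession across Streamlit reruns with st.cache_resource (available in recent Streamlit versions), so the session is not rebuilt on every script execution:

import streamlit as st
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Untested sketch: st.cache_resource keeps one SparkSession alive
# across Streamlit script reruns instead of rebuilding it each time.
@st.cache_resource
def get_spark():
    return SparkSession.builder.appName("streamlit-spark-client").getOrCreate()

spark = get_spark()

# An explicit schema avoids schema inference (_inferSchema -> rdd.first()),
# which is where the traceback shows cloudpickle failing.
schema = StructType([
    StructField("Col1", StringType(), True),
    StructField("Col2", LongType(), True),
])
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 4)], schema=schema)
st.dataframe(df.toPandas())

Note that even with the explicit schema, toPandas() still has to ship a job to the cluster, so the pickling failure may simply move later in the script; but it would at least isolate whether schema inference is the trigger.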