pyspark: Pandas API support on Spark Connect

4bbkushb asked on 2024-01-06 in Spark

I am trying to use the pandas API on Spark over Spark Connect, but I get an AssertionError:

  assert isinstance(spark_frame, SparkDataFrame)
  AssertionError

If I use the Spark DataFrame API instead, I do not get any errors. Does Spark Connect support the pandas-on-Spark API?
Below is the code I am running.

  import pyspark.pandas as pd
  from pyspark.sql import Row
  from pyspark.sql import SparkSession

  # Stop the regular Spark session before trying the Spark Connect functionality
  SparkSession.builder.master("local[*]").getOrCreate().stop()

  # Start the Spark Connect server first by running:
  # ./start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0

  # Start a Spark session by specifying the Spark Connect address (localhost)
  spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

  d = {'col1': [1, 2], 'col2': [3, 4]}
  df = pd.DataFrame(d)  # raises AssertionError here
  print(df.head())

  '''
  df = spark.createDataFrame([
      Row(a=1, b=2., c='string1'),
      Row(a=2, b=3., c='string2'),
      Row(a=4, b=5., c='string3')
  ])
  df.show()
  '''
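For context, a Spark Connect session returns `pyspark.sql.connect.dataframe.DataFrame` objects rather than the classic `pyspark.sql.DataFrame`, and the `pyspark.pandas` internals in Spark 3.4 assert on the classic class, which is presumably what raises the error above. A minimal sketch to inspect which class your session hands back (the class path is an assumption based on Spark 3.4; verify on your version):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
  df = spark.createDataFrame([(1, 3), (2, 4)], ["col1", "col2"])

  # Under Spark Connect this prints pyspark.sql.connect.dataframe.DataFrame,
  # not pyspark.sql.DataFrame, so isinstance checks against the classic
  # class fail (assumption based on the Spark 3.4 class layout)
  print(type(df))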

o2rvlv0m

Below is a corrected version of the code:

  from pyspark.sql import SparkSession
  from pyspark.sql import Row

  # Stop the regular Spark session before trying the Spark Connect functionality
  SparkSession.builder.master("local[*]").getOrCreate().stop()

  # Start the Spark Connect server first by running:
  # ./start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0

  # Start a Spark session by specifying the Spark Connect address (localhost)
  spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

  # Create a Spark DataFrame using the Spark session
  df_spark = spark.createDataFrame([
      Row(col1=1, col2=3),
      Row(col1=2, col2=4)
  ])

  # Collect the Spark DataFrame to the driver as a plain pandas DataFrame
  df_pandas = df_spark.toPandas()
  print(df_pandas.head())
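One caveat: `toPandas()` collects the data to the driver as a plain, local pandas DataFrame; it does not go through the distributed pandas-on-Spark API. To my understanding, pandas API on Spark only gained Spark Connect support in later releases (Spark 3.5+), where the original `pd.DataFrame(d)` call should work as written. A minimal sketch assuming such a version:

  import pyspark.pandas as ps
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

  # On Spark 3.5+ (assumption), pyspark.pandas works over Spark Connect directly
  psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
  print(psdf.head())

  # An existing Spark DataFrame can also be wrapped as pandas-on-Spark
  sdf = spark.createDataFrame([(1, 3), (2, 4)], ["col1", "col2"])
  print(sdf.pandas_api().head())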

Note: before attempting the remote connection, make sure your Spark cluster and Spark Connect server are properly configured and running.
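A quick way to sanity-check that the server responds before running the pandas code (a minimal sketch; 15002 is Spark Connect's default port):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
  print(spark.version)             # the Spark version the server is running
  print(spark.range(3).collect())  # a round-trip query confirms connectivity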
