PySpark: SparkContext should only be created and accessed on the driver

Posted by gblwokeq on 2023-08-02 in Spark


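I am using Azure Databricks (10.4 LTS, which includes Apache Spark 3.2.1 and Scala 2.12) with Standard_L8s cores.
When I run the code below, I get the error "SparkContext should only be created and accessed on the driver". If I use plain "import pandas" instead, it runs fine but takes more than 3 hours, and I have billions of records to process. I need to tune this UDF. Please help.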

import pyspark.pandas as pd
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def getnearest_five_min_slot(value):
  # Candidate five-minute slots, in seconds
  dataframe = pd.DataFrame([300,600,900,1200,1500,1800,2100,2400,2700,3000,3300,3600], columns = ['value'])
  # Keep the slots that are >= the input and take the smallest one
  rslt_df = dataframe.loc[dataframe['value'] >= value]
  rslt_df = rslt_df.sort_values(by=['value'], ascending=[True]).head(1)
  output = int(rslt_df.iat[0,0])
  print('\nResult dataframe :\n', output)

  return output

getnearestFiveMinSlot = udf(lambda m: getnearest_five_min_slot(m))

slotValue = [100,500,1100,400,601]
df = spark.createDataFrame(slotValue, IntegerType())
df = df.withColumn("NewValue", getnearestFiveMinSlot("value"))
display(df)
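
Given the billions of rows mentioned above, a Python UDF may not be needed at all. The slots in the code are consecutive multiples of 300, so the lookup reduces to rounding up to the next multiple of 300, which Spark's built-in column functions can express without per-row Python calls. The following is only a sketch under that assumption, not the asker's code: slot_value and slot_column are hypothetical names, and returning None (null) for inputs above 3600 is a choice made here, since the lookup above finds no matching slot in that case.

```python
SLOT_SECONDS = 300   # width of one slot, taken from the slot list in the question
MAX_SLOT = 3600      # largest slot in the question's list

def slot_value(value):
    """Plain-Python reference: smallest multiple of 300 in [300, 3600] that is >= value."""
    if value > MAX_SLOT:
        return None
    # -(-a // b) is integer ceiling division
    rounded = -(-value // SLOT_SECONDS) * SLOT_SECONDS
    return max(SLOT_SECONDS, rounded)

def slot_column(col_name="value"):
    """Build the same computation as a Spark Column expression (no UDF involved)."""
    # Imported lazily so slot_value can be used without Spark installed.
    from pyspark.sql import functions as F

    value = F.col(col_name)
    rounded = (F.ceil(value / SLOT_SECONDS) * SLOT_SECONDS).cast("int")
    # when() without otherwise() yields null for values above MAX_SLOT
    return F.when(value <= MAX_SLOT, F.greatest(F.lit(SLOT_SECONDS), rounded))
```

With the DataFrame from the question, this could be applied as df.withColumn("NewValue", slot_column("value")).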


pyspark

Source: https://stackoverflow.com/questions/73115736/sparkcontext-should-only-be-created-and-accessed-on-the-driver

2 Answers

nwwlzxa71#

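You need to actually create a SparkSession object and give it an app name before you can start using Spark in Databricks. This is a mandatory prerequisite.
SparkSession is the entry point to PySpark, and **creating a SparkSession instance is the first statement in a program that uses RDDs, DataFrames, and Datasets.** A SparkSession is created with the SparkSession.builder builder pattern.
Use the statements below at the beginning of your code to create the SparkSession.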

# Import SparkSession
from pyspark.sql import SparkSession

# Create a SparkSession object and give it an app name
spark = SparkSession.builder.appName("pysparkdf").getOrCreate()

For more information about SparkSession and how to use it, see the third-party article by NNK.

2023-08-02

slsn1g292#

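I have already added a SparkSession to my script, but the error persists. What is odd in my case is that the code runs fine in a Databricks notebook, but raises this error when I try to run it as a .py script.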
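
One likely explanation for the error persisting, offered here as an assumption rather than something confirmed in this thread: the UDF in the question builds a pyspark.pandas DataFrame, which needs a SparkContext, while a UDF body runs on the executors, where creating or accessing a SparkContext is not allowed. Creating a SparkSession on the driver does not change that. A minimal sketch that keeps the same lookup in plain Python is shown below. The names SLOTS, nearest_five_min_slot, and with_nearest_slot are hypothetical, and returning None when no slot is large enough is a choice made for this sketch.

```python
from bisect import bisect_left

# Candidate slot boundaries in seconds, copied from the question (already sorted).
SLOTS = [300, 600, 900, 1200, 1500, 1800, 2100, 2400, 2700, 3000, 3300, 3600]

def nearest_five_min_slot(value):
    """Return the smallest slot >= value, or None when there is none."""
    if value is None:
        return None
    # bisect_left gives the index of the first slot that is >= value
    index = bisect_left(SLOTS, value)
    return SLOTS[index] if index < len(SLOTS) else None

def with_nearest_slot(df, source_col="value", target_col="NewValue"):
    """Add target_col to a Spark DataFrame, using the plain-Python lookup as a UDF."""
    # Imported lazily so the lookup above stays usable without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    slot_udf = udf(nearest_five_min_slot, IntegerType())
    return df.withColumn(target_col, slot_udf(df[source_col]))
```

With the DataFrame from the question, this would be called as df = with_nearest_slot(df). Because the function body uses only the standard library, nothing inside the UDF touches Spark on the executors.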

2023-08-02