I am using Spark 1.5.
I have two DataFrames like this:
scala> libriFirstTable50Plus3DF
res1: org.apache.spark.sql.DataFrame = [basket_id: string, family_id: int]
scala> linkPersonItemLessThan500DF
res2: org.apache.spark.sql.DataFrame = [person_id: int, family_id: int]
libriFirstTable50Plus3DF has 766,151 records, while linkPersonItemLessThan500DF has 26,694,353 records. Note that I use repartition(number) on linkPersonItemLessThan500DF, since I intend to join these two later. I follow the above code with:
val userTripletRankDF = linkPersonItemLessThan500DF
.join(libriFirstTable50Plus3DF, Seq("family_id"))
.take(20)
.foreach(println(_))
and I get this output:
16/12/13 15:07:10 INFO scheduler.TaskSetManager: Finished task 172.0 in stage 3.0 (TID 473) in 520 ms on mlhdd01.mondadori.it (199/200)
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.sql.execution.joins.BroadcastHashJoin.doExecute(BroadcastHashJoin.scala:110)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at org.apache.spark.sql.execution.TungstenProject.doExecute(basicOperators.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at org.apache.spark.sql.execution.ConvertToSafe.doExecute(rowFormatConverters.scala:63)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207)
at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1386)
at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1386)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904)
at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1315)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1378)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:178)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:402)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:363)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:371)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:72)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:77)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:79)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:81)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:83)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:85)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:87)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:89)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:91)
at $iwC$$iwC$$iwC.<init>(<console>:93)
at $iwC$$iwC.<init>(<console>:95)
at $iwC.<init>(<console>:97)
at <init>(<console>:99)
at .<init>(<console>:103)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I don't understand what the problem is. Is it simply a matter of increasing the timeout? Is the join too intensive? Do I need more memory? Is the shuffle too intensive? Can anyone help?
5 Answers

Answer 1
It happens because Spark tries to do a broadcast hash join, and one of the DataFrames is very large, so sending it consumes a lot of time.

You can:

1. Set a higher spark.sql.broadcastTimeout to increase the timeout: spark.conf.set("spark.sql.broadcastTimeout", newValueForExample36000)
2. persist() both DataFrames, so Spark will use a shuffle join instead (reference from here).

PySpark

In PySpark, you can set the configuration as follows when building the Spark context:
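A minimal sketch for the Spark 1.x PySpark API (the timeout value 36000 is illustrative):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Set the broadcast timeout (in seconds) before the context is built;
# 36000 is just an example value.
conf = SparkConf().set("spark.sql.broadcastTimeout", "36000")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)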
Answer 2
Just to add some code context to the very concise answer from @T. Gawęda.

In your Spark application, Spark SQL chose a broadcast hash join for the join because "libriFirstTable50Plus3DF has 766,151 records", which happened to be below the so-called broadcast threshold (10 MB by default).

You can control the broadcast threshold using the spark.sql.autoBroadcastJoinThreshold configuration property:

spark.sql.autoBroadcastJoinThreshold configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1, broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.
You can find that particular type of join in your stack trace:
org.apache.spark.sql.execution.joins.BroadcastHashJoin.doExecute(BroadcastHashJoin.scala:110)
The BroadcastHashJoin physical operator in Spark SQL uses a broadcast variable to distribute the smaller dataset to the Spark executors (rather than shipping a copy of it with every task). If you use explain to review the physical query plan, you will notice the query uses the BroadcastExchangeExec physical operator. This is where you can see the underlying machinery for broadcasting the smaller table (and the timeout).
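For reference, the doExecuteBroadcast method of BroadcastExchangeExec in the Spark 2.x source looks roughly like this (a from-memory sketch of the Spark source, not an exact quote):

override protected[sql] def doExecuteBroadcast[T](): broadcast.Broadcast[T] = {
  // Block until the broadcast relation has been built, waiting at most `timeout`
  ThreadUtils.awaitResult(relationFuture, timeout).asInstanceOf[broadcast.Broadcast[T]]
}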
doExecuteBroadcast is part of the SparkPlan contract that every physical operator in Spark SQL follows, which allows for broadcasting when needed. Its timeout parameter is exactly what you are looking for.
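The timeout itself is defined in the same operator, roughly like this (again a sketch rather than an exact quote of the Spark source):

private[execution] val timeout: Duration = {
  val timeoutValue = sqlContext.conf.broadcastTimeout
  if (timeoutValue < 0) {
    // A negative value means wait indefinitely for the broadcast
    Duration.Inf
  } else {
    timeoutValue.seconds
  }
}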
As you can see, you can disable it completely (with a negative value), which implies waiting indefinitely for the broadcast variable to be shipped to the executors, or use sqlContext.conf.broadcastTimeout, which is exactly the spark.sql.broadcastTimeout configuration property. The default value is 5 * 60 seconds, which is what you see in the stack trace: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
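Since the question runs in a Spark 1.5 spark-shell, a minimal sketch of raising the timeout and inspecting the plan there, reusing the DataFrames from the question (the value 3600 is illustrative):

// Raise the broadcast timeout to one hour (in seconds; 3600 is an example value)
sqlContext.setConf("spark.sql.broadcastTimeout", "3600")

// Print the physical plan to confirm whether a broadcast join is planned
linkPersonItemLessThan500DF
  .join(libriFirstTable50Plus3DF, Seq("family_id"))
  .explain()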
Answer 3
In addition to increasing spark.sql.broadcastTimeout or persist()-ing both DataFrames, you can also try:

1. disabling broadcasting by setting spark.sql.autoBroadcastJoinThreshold to -1
2. increasing the Spark driver memory by setting spark.driver.memory to a higher value (both are sketched below).
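A minimal sketch of applying both suggestions (values are illustrative; note that spark.driver.memory must be set before the driver JVM starts, for example on the spark-submit command line):

// In the shell or application: disable automatic broadcast joins,
// forcing Spark to fall back to a shuffle join
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")

// When submitting the application (8g is an example value):
// spark-submit --driver-memory 8g ...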
Answer 4

In my case, it was caused by a broadcast of a large DataFrame:
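A hypothetical reconstruction of the kind of code meant here (the DataFrame names and join key are made up for illustration):

import org.apache.spark.sql.functions.broadcast

// Explicitly broadcasting a DataFrame that is too large to ship within the timeout
val result = hugeDF.join(broadcast(largeDF), Seq("some_id"))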
So, based on the previous answers, I fixed it by removing the broadcast:
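Again a hypothetical reconstruction, with the broadcast hint removed so Spark can choose a shuffle join:

// Without the broadcast hint, Spark is free to pick a shuffle join
val result = hugeDF.join(largeDF, Seq("some_id"))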
Answer 5
I got this error when I used Breeze's leastSquares function inside a loop. Spark treated it as long-running and threw the timeout exception. The solution was to move the task into its own distributed loop, as sketched below.
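A minimal sketch of that idea, assuming the least-squares fits were originally looped on the driver: run each solve inside a Spark task instead. groupedData, its key type, and the matrix shapes are assumptions made for illustration:

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.stats.regression.leastSquares

// groupedData: RDD[(String, (DenseMatrix[Double], DenseVector[Double]))] -- assumed shape
// Each task performs its own solve, so no single long-running driver-side step remains.
val coefficients = groupedData.map { case (key, (x, y)) =>
  key -> leastSquares(x, y).coefficients
}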