AWS Glue - workers running out of disk space while processing data

tquggr8v posted on 2021-07-12 in Spark

I am running a Glue job over an S3 dataset of roughly 6 million files totaling about 80 GB. The job applies a window function and writes the result to another S3 location. It runs with 50 G.2X workers and the default Spark partitioning. When I run it, I get the error listed below. Any suggestions on how to keep the executors from running out of local storage?
scheduler.TaskSetManager (Logging.scala:logWarning(66)): Lost task 401.0 in stage 4.0 (TID 505643, 172.35.178.124, executor 13): org.apache.spark.memory.SparkOutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@16007c9 : No space left on device
    at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:219)
    at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:285)
    at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:117)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:383)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:407)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:135)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at org.apache.spark.sql.execution.window.WindowExec$$anonfun$11$$anon$1.fetchNextRow(WindowExec.scala:314)
    at org.apache.spark.sql.execution.window.WindowExec$$anonfun$11$$anon$1.<init>(WindowExec.scala:323)
    at org.apache.spark.sql.execution.window.WindowExec$$anonfun$11.apply(WindowExec.scala:303)
    at org.apache.spark.sql.execution.window.WindowExec$$anonfun$11.apply(WindowExec.scala:302)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
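For reference, here is a minimal sketch of the kind of Glue script described above. The input format, column names, window key, and S3 paths are my assumptions; the post does not include the actual code. The sort that the window operator performs is the step that spills to the workers' local disks when a partition does not fit in execution memory, which is where the "No space left on device" error surfaces.

from pyspark.context import SparkContext
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from awsglue.context import GlueContext

# Standard Glue setup: the GlueContext wraps a shared SparkSession.
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read the ~6 million small objects (~80 GB total) from S3.
# Path and format are placeholders.
df = spark.read.json("s3://source-bucket/input/")

# Window function over an assumed partition key and ordering column.
# Spark sorts each partition for this operator and spills the sort
# buffers to the executors' local disks when execution memory runs out.
w = Window.partitionBy("customer_id").orderBy("event_time")
result = df.withColumn("row_num", F.row_number().over(w))

# Write the result to another S3 location.
result.write.mode("overwrite").parquet("s3://target-bucket/output/")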

No answers yet.

