I have already checked other related posts on Stack Overflow, but none of them seem directly relevant, because the Spark versions they refer to are from 5 years ago, while the version I am running is **Spark 3.3.3**.
I am running an Apache Spark cluster on YARN and using JupyterLab as the IDE. I start the Spark session with the following code:
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Kafka and Spark Integration'). \
    master('yarn'). \
    getOrCreate()
The plan is to read a stream from 24 JSON files in HDFS, each around 80 MB, and then write the stream back out, partitioning the data by three columns and storing it as Parquet in another HDFS folder.
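The read side (how file_df is built) is not shown in the question. Below is a minimal sketch of what it could look like; the schema fields, the created_at column, and the input directory are assumptions for illustration only, not taken from the original post:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from pyspark.sql.functions import year, month, dayofmonth

# Hypothetical schema: streaming JSON sources require an explicit schema up front.
file_schema = StructType([
    StructField('id', StringType()),
    StructField('created_at', TimestampType()),
    # ... remaining fields of the JSON documents
])

file_df = (
    spark.readStream
    .schema(file_schema)
    .format('json')
    .load(f'/user/{username}/file_df/source')  # hypothetical HDFS input directory
    # derive the three columns used later in partitionBy
    .withColumn('created_year', year('created_at'))
    .withColumn('created_month', month('created_at'))
    .withColumn('created_dayofmonth', dayofmonth('created_at'))
)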
This is the command I use for the write:
file_df. \
    writeStream. \
    partitionBy('created_year', 'created_month', 'created_dayofmonth'). \
    format('parquet'). \
    option("checkpointLocation", f"/user/{username}/file_df/streaming/checkpoint/file_df"). \
    option("path", f"/user/{username}/file_df/streaming/data/files_parq"). \
    trigger(once=True). \
    start()
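One thing to note: start() returns immediately with a StreamingQuery handle (the object printed in the output below), so the cell does not wait for the once trigger to finish. If the intent is to block until the batch completes, the handle can be kept and waited on; a small sketch using the same options:

query = (
    file_df.writeStream
    .partitionBy('created_year', 'created_month', 'created_dayofmonth')
    .format('parquet')
    .option("checkpointLocation", f"/user/{username}/file_df/streaming/checkpoint/file_df")
    .option("path", f"/user/{username}/file_df/streaming/data/files_parq")
    .trigger(once=True)
    .start()
)
query.awaitTermination()  # blocks until the one-shot batch completes or fails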
This is the output I get when I run it in the notebook:
23/09/13 21:34:55 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
<pyspark.sql.streaming.StreamingQuery at 0x7fd22851adc0>
[Stage 3:===============> (5 + 2) / 19]
23/09/13 21:42:56 WARN HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 120168 ms exceeds timeout 120000 ms
23/09/13 21:42:56 ERROR YarnScheduler: Lost executor 2 on sparkde.camp.300123.internal: Executor heartbeat timed out after 120168 ms
23/09/13 21:42:56 WARN TaskSetManager: Lost task 1.0 in stage 3.0 (TID 41) (sparkde.camp.300123.internal executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 120168 ms
[Stage 3:===============> (5 + 1) / 19]
23/09/13 21:42:58 WARN TransportChannelHandler: Exception in connection from /10.172.0.3:41888
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:258)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:750)
[Stage 3:========================> (8 + 2) / 19]
23/09/13 21:45:55 ERROR YarnScheduler: Lost executor 3 on sparkde.camp.300123.internal: Container from a bad node: container_1694638458719_0001_01_000004 on host: sparkde.camp.300123.internal. Exit status: 143. Diagnostics: [2023-09-13 21:45:55.150]Container killed on request. Exit code is 143
[2023-09-13 21:45:55.151]Container exited with a non-zero exit code 143.
[2023-09-13 21:45:55.151]Killed by external signal
.
23/09/13 21:45:55 WARN TaskSetManager: Lost task 1.1 in stage 3.0 (TID 47) (sparkde.camp.300123.internal executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container from a bad node: container_1694638458719_0001_01_000004 on host: sparkde.camp.300123.internal. Exit status: 143. Diagnostics: [2023-09-13 21:45:55.150]Container killed on request. Exit code is 143
[2023-09-13 21:45:55.151]Container exited with a non-zero exit code 143.
[2023-09-13 21:45:55.151]Killed by external signal
.
23/09/13 21:45:55 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 3 for reason Container from a bad node: container_1694638458719_0001_01_000004 on host: sparkde.camp.300123.internal. Exit status: 143. Diagnostics: [2023-09-13 21:45:55.150]Container killed on request. Exit code is 143
[2023-09-13 21:45:55.151]Container exited with a non-zero exit code 143.
[2023-09-13 21:45:55.151]Killed by external signal
I noticed that the stage kept making progress, but eventually it stopped and produced the errors shown here:
[Stage 3:=======================================> (13 + 1) / 19]
23/09/13 21:54:46 WARN TaskSetManager: Lost task 14.0 in stage 3.0 (TID 58) (sparkde.camp.300123.internal executor 6): TaskKilled (Stage cancelled)
23/09/13 22:03:05 WARN SparkContext: Executor 1 might already have stopped and can not request thread dump from it.
Here is a screenshot of the Spark job.
Please help.
1 Answer
It turned out I needed to increase the memory allocated to YARN, which meant updating the property values in yarn-site.xml.
So I just updated the values; before that I had been running with only 1 GB :)
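The answer does not list the exact properties that were changed; the YARN memory settings that usually need raising for this are yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb in yarn-site.xml. The values below are purely illustrative, not the ones from the original answer:

<!-- yarn-site.xml: illustrative values only -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>6144</value>  <!-- total memory YARN may allocate on this node, in MB -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>6144</value>  <!-- largest single container YARN will grant, in MB -->
</property>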
Then after a while I still got the same error, but this time it came from my virtual machine: I needed to allocate more RAM, since my setup was only 1 vCPU and 8 GB of RAM.
So my spark-submit job now looks like this: