We use Flume 1.9.0 to ingest data from Kafka into HDFS; our configuration is at the end of this post. It ran smoothly for quite a while, but it is currently unable to ingest any data and keeps throwing errors; the most important log lines are quoted at the end of the post. According to the logs, the file channel cannot start because a data file is corrupted; Flume retries relentlessly but always fails. The Flume instances are hosted in Kubernetes with 4 replicas, each with its own persistent volume for the file channel, and nearly 95% of the disk space was still free when the problem occurred.
So we have two questions:
What causes the data file corruption? This is a production application and we rely on Flume's robustness, so a corrupted data file is the last thing we want to see. Also, how can we avoid this kind of corruption?
How can we recover from this situation without losing any data in the channel? Deleting the checkpointDir and dataDirs is not acceptable.
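For reference, Flume 1.9.0 ships a File Channel Integrity Tool that, per its documentation, scans the data files and removes events it cannot parse; we are not sure whether it can bring this channel back without dropping good events, which is exactly what we want to confirm. A minimal invocation sketch, with the agent stopped and the data directory path standing in for our persistent-volume mount:

bin/flume-ng tool --conf ./conf FCINTEGRITYTOOL -l ${FILE_CHANNEL_BASEDIR}/data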
Thanks a lot.
Our Flume configuration:
flume-agent-1.sources = source1
flume-agent-1.sinks = HDFSSink1
flume-agent-1.channels = channel2HDFS1
flume-agent-1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
flume-agent-1.sources.source1.kafka.bootstrap.servers = ${KAFKA_BOOTSTRAP_SERVERS}
flume-agent-1.sources.source1.kafka.consumer.group.id = ${GROUP_ID}
flume-agent-1.sources.source1.kafka.topics.regex = ${KAFKA_TOPIC_PATTERN}
flume-agent-1.sources.source1.setTopicHeader = true
flume-agent-1.sources.source1.batchSize = ${KAFKA_BATCH_SIZE}
flume-agent-1.sources.source1.batchDurationMillis = 1000
flume-agent-1.sources.source1.channels = channel2HDFS1
flume-agent-1.sources.source1.interceptors = int-1
flume-agent-1.sources.source1.interceptors.int-1.type = com.nvidia.gpuwa.kafka2file.interceptors.MainInterceptor$Builder
flume-agent-1.channels.channel2HDFS1.type = file
flume-agent-1.channels.channel2HDFS1.checkpointDir = ${FILE_CHANNEL_BASEDIR}/checkpoint
flume-agent-1.channels.channel2HDFS1.dataDirs = ${FILE_CHANNEL_BASEDIR}/data
flume-agent-1.channels.channel2HDFS1.transactionCapacity = 10000
flume-agent-1.sinks.HDFSSink1.channel = channel2HDFS1
flume-agent-1.sinks.HDFSSink1.type = hdfs
flume-agent-1.sinks.HDFSSink1.hdfs.path = ${HADOOP_URL}/%{projectdir}
flume-agent-1.sinks.HDFSSink1.hdfs.fileType = CompressedStream
flume-agent-1.sinks.HDFSSink1.hdfs.codeC = gzip
flume-agent-1.sinks.HDFSSink1.hdfs.filePrefix = %{projectsubdir}-%Y%m%d-%[localhost]
flume-agent-1.sinks.HDFSSink1.hdfs.useLocalTimeStamp = true
flume-agent-1.sinks.HDFSSink1.hdfs.rollCount = 0
flume-agent-1.sinks.HDFSSink1.hdfs.rollSize = 134217728
flume-agent-1.sinks.HDFSSink1.hdfs.rollInterval = 3600
flume-agent-1.sinks.HDFSSink1.hdfs.batchSize = ${HDFS_BATCH_SIZE}
flume-agent-1.sinks.HDFSSink1.hdfs.threadsPoolSize = ${HDFS_THREAD_COUNT}
flume-agent-1.sinks.HDFSSink1.hdfs.timeZone = America/Los_Angeles
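Note that the channel definition above leaves the durability-related file channel settings at their defaults; to the best of our understanding of the 1.9.0 documentation, that effectively amounts to the following (values are the documented defaults, shown only for context):

flume-agent-1.channels.channel2HDFS1.capacity = 1000000
flume-agent-1.channels.channel2HDFS1.checkpointInterval = 30000
flume-agent-1.channels.channel2HDFS1.fsyncPerTransaction = true
flume-agent-1.channels.channel2HDFS1.useDualCheckpoints = false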
Here are some of the most critical log lines; the full log is available in the attachment.
org.apache.flume.channel.file.FileChannel.start(FileChannel.java:295)] Failed to start the file channel [channel=channel2HDFS1]
2020-07-29T07:15:31.640949847Z java.lang.RuntimeException: org.apache.flume.channel.file.CorruptEventException: Could not parse event from data file.
2020-07-29T07:15:31.638860323Z at org.apache.flume.channel.file.TransactionEventRecord.fromByteArray(TransactionEventRecord.java:212)
...
2020-07-29T07:15:31.64750767Z 2020-07-29 00:15:31,646 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:158)] Unable to deliver event. Exception follows.
2020-07-29T07:15:31.647539686Z java.lang.IllegalStateException: Channel closed [channel=channel2HDFS1]. Due to java.lang.RuntimeException: org.apache.flume.channel.file.CorruptEventException: Could not parse event from data file.
2020-07-29T07:15:31.647552984Z at org.apache.flume.channel.file.FileChannel.createTransaction(FileChannel.java:358)