Flume takes a long time to copy data into HDFS when rolling by file size

roejwanj · asked on 2021-06-02 in Hadoop

I have a use case where I want to copy remote files into HDFS with Flume, and I want the copied files to line up with the HDFS block size (128 MB/256 MB). The total size of the remote data is 33 GB.
I am using an Avro source and sink to copy the remote data into HDFS, and on the sink side I roll files by size (128/256 MB). But Flume takes on average 2 minutes to copy a file from the remote machine and store it in HDFS (file size 128/256 MB).
Flume configuration: Avro sink (remote machine)


### Agent1 - Spooling Directory Source and File Channel, Avro Sink  ###

# Name the components on this agent

Agent1.sources = spooldir-source  
Agent1.channels = file-channel
Agent1.sinks = avro-sink

# Describe/configure Source

Agent1.sources.spooldir-source.type = spooldir
Agent1.sources.spooldir-source.spoolDir = /home/Benchmarking_Simulation/test

# Describe the sink

Agent1.sinks.avro-sink.type = avro
# IP address of the destination machine (inline # comments are not valid in properties files)
Agent1.sinks.avro-sink.hostname = xx.xx.xx.xx
Agent1.sinks.avro-sink.port = 50000

# Use a channel which buffers events in file

Agent1.channels.file-channel.type = file
Agent1.channels.file-channel.checkpointDir = /home/Flume_CheckPoint_Dir/
Agent1.channels.file-channel.dataDirs = /home/Flume_Data_Dir/
Agent1.channels.file-channel.capacity = 10000000
Agent1.channels.file-channel.transactionCapacity = 50000

# Bind the source and sink to the channel

Agent1.sources.spooldir-source.channels = file-channel
Agent1.sinks.avro-sink.channel = file-channel

Flume configuration: Avro source (machine running HDFS)


### Agent1 - Avro Source and File Channel, HDFS Sink  ###

# Name the components on this agent

Agent1.sources = avro-source1  
Agent1.channels = file-channel1
Agent1.sinks = hdfs-sink1

# Describe/configure Source

Agent1.sources.avro-source1.type = avro
Agent1.sources.avro-source1.bind = xx.xx.xx.xx
Agent1.sources.avro-source1.port = 50000

# Describe the sink

Agent1.sinks.hdfs-sink1.type = hdfs
Agent1.sinks.hdfs-sink1.hdfs.path = /user/Benchmarking_data/multiple_agent_parallel_1
Agent1.sinks.hdfs-sink1.hdfs.rollInterval = 0
Agent1.sinks.hdfs-sink1.hdfs.rollSize = 130023424
Agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
Agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
Agent1.sinks.hdfs-sink1.hdfs.batchSize = 50000
Agent1.sinks.hdfs-sink1.hdfs.txnEventMax = 40000
Agent1.sinks.hdfs-sink1.hdfs.threadsPoolSize = 1000
Agent1.sinks.hdfs-sink1.hdfs.appendTimeout = 10000
Agent1.sinks.hdfs-sink1.hdfs.callTimeout = 200000

# Use a channel which buffers events in file

Agent1.channels.file-channel1.type = file
Agent1.channels.file-channel1.checkpointDir = /home/Flume_Check_Point_Dir
Agent1.channels.file-channel1.dataDirs = /home/Flume_Data_Dir
Agent1.channels.file-channel1.capacity = 100000000
Agent1.channels.file-channel1.transactionCapacity = 100000

# Bind the source and sink to the channel

Agent1.sources.avro-source1.channels = file-channel1
Agent1.sinks.hdfs-sink1.channel = file-channel1
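
For reference, the configured rollSize of 130023424 bytes is exactly 124 MB, 4 MB under the 128 MB (134217728-byte) HDFS block size, presumably to leave headroom so a rolled file never spills into a second block:

130023424 / 1048576 = 124 MB   (hdfs.rollSize)
134217728 / 1048576 = 128 MB   (HDFS block size)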

The network bandwidth between the two machines is 686 Mbps.
Can somebody help me identify whether something is wrong with this configuration, or whether some other setting is needed, so that the copy doesn't take so long?
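
Taken at face value, the numbers reported above put the effective throughput far below the link capacity:

128 MB / 120 s ≈ 1.07 MB/s ≈ 8.5 Mbit/s   (about 1.2% of the 686 Mbit/s link)
33 GB ÷ 1.07 MB/s ≈ 31600 s ≈ 8.8 hours for the full data set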

fquxozlt · answer #1

Both agents use a file channel, so the data is written to disk twice before it ever reaches HDFS. You could try a memory channel on each agent and see whether the performance improves.
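
A minimal sketch of that change for the HDFS-side agent, assuming the same component names as in the question; the channel name and capacity figures are illustrative, the sizing has to fit in the agent's heap, and note that a memory channel loses any buffered events if the agent process dies:

# Memory channel instead of the file channel (illustrative sizing)
Agent1.channels = memory-channel1
Agent1.channels.memory-channel1.type = memory
Agent1.channels.memory-channel1.capacity = 1000000
Agent1.channels.memory-channel1.transactionCapacity = 100000

# Re-bind the source and sink to the memory channel
Agent1.sources.avro-source1.channels = memory-channel1
Agent1.sinks.hdfs-sink1.channel = memory-channel1

The same substitution applies to the remote agent (spooldir-source / avro-sink).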
