Apache Flume taking more time than the copyFromLocal command

Posted by b1uwtaje on 2021-06-04 in Flume


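I have a 24 GB folder on my local file system, and my task is to move that folder to HDFS. I tried two approaches.
1) hdfs dfs -copyFromLocal /home/data/ /home/
This takes about 15 minutes to complete.
2) Using Flume.
Here is my agent configuration: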

spool_dir.sources = src-1
spool_dir.channels = channel-1
spool_dir.sinks = sink_to_hdfs

# source

spool_dir.sources.src-1.type = spooldir
spool_dir.sources.src-1.channels = channel-1
spool_dir.sources.src-1.spoolDir = /home/data/
spool_dir.sources.src-1.fileHeader = false

# HDFS sinks

spool_dir.sinks.sink_to_hdfs.type = hdfs
spool_dir.sinks.sink_to_hdfs.hdfs.fileType = DataStream
spool_dir.sinks.sink_to_hdfs.hdfs.path = hdfs://192.168.1.71/home/user/flumepush
spool_dir.sinks.sink_to_hdfs.hdfs.filePrefix = customevent
spool_dir.sinks.sink_to_hdfs.hdfs.fileSuffix = .log
spool_dir.sinks.sink_to_hdfs.hdfs.batchSize = 1000

# channel

spool_dir.channels.channel-1.type = file
spool_dir.channels.channel-1.checkpointDir = /home/user/spool_dir_checkpoint
spool_dir.channels.channel-1.dataDirs = /home/user/spool_dir_data

# bindings

spool_dir.sinks.sink_to_hdfs.channel = channel-1

This approach took almost an hour to push the data to HDFS.
As far as I know, Flume is distributed, so shouldn't Flume load the data faster than the copyFromLocal command?

hdfs flume bigdata flume-ng

Source: https://stackoverflow.com/questions/39871274/apache-flume-taking-more-time-than-copyfromlocal-command

2 Answers

o4hqfura #1

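If you look purely at read and write operations, Flume will be at least 2x slower with your configuration because you are using a file channel. Every file read from disk is wrapped in a Flume event (in memory) and then serialized back to disk through the file channel. The sink then reads the event back from the file channel (disk) before pushing it to HDFS.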
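You also have not set a blob deserializer on the spooldir source, so it reads your source files one line at a time, wraps each line in a Flume event, and writes it to the file channel. Combined with the HDFS sink's default roll values, you get a new HDFS file every 10 events / 30 s / 1 KB, rather than the one file per input file that copyFromLocal produces.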
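For reference, the "10 events / 30 s / 1 KB" figures correspond to three HDFS sink roll properties that the agent above leaves unset. The fragment below is only a sketch that writes out what I understand the defaults to be, using the agent and sink names from the question; check the Flume User Guide for your version before relying on the exact values.

```properties
# Defaults the HDFS sink falls back to when these are not set
# (as I understand the Flume User Guide; verify for your version).
# A file is rolled as soon as ANY one of the three triggers fires.
spool_dir.sinks.sink_to_hdfs.hdfs.rollInterval = 30
spool_dir.sinks.sink_to_hdfs.hdfs.rollSize = 1024
spool_dir.sinks.sink_to_hdfs.hdfs.rollCount = 10
```

With line-based events, the count trigger alone closes a file after every 10 input lines, which is why a 24 GB folder turns into a very large number of small HDFS files.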
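All of these factors add up to slower performance. If you want more comparable performance, use the BlobDeserializer on the spooldir source together with a memory channel (but be aware that a memory channel does not guarantee delivery of events if the JRE terminates prematurely).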
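A minimal sketch of those two changes, written as overrides to the agent in the question. The numeric values are illustrative assumptions, not tuned recommendations. The BlobDeserializer class ships with the morphline Solr sink module, so that jar must be on the agent's classpath, and because each event buffers up to maxBlobLength bytes, the heap has to be sized for the channel capacity you pick. The roll settings at the end are my own addition to approximate one HDFS file per event; they are not part of the advice above.

```properties
# Sketch only: overrides to the agent from the question; values are illustrative.

# Read each spooled file as a blob instead of one event per line.
spool_dir.sources.src-1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
# Upper bound on the bytes buffered into a single event.
spool_dir.sources.src-1.deserializer.maxBlobLength = 100000000
# Keep the source batch within the channel's transactionCapacity.
spool_dir.sources.src-1.batchSize = 10

# Memory channel: skips the extra disk write and read, but events are lost if the JVM dies.
# The file channel's checkpointDir and dataDirs settings are no longer needed.
spool_dir.channels.channel-1.type = memory
spool_dir.channels.channel-1.capacity = 100
spool_dir.channels.channel-1.transactionCapacity = 10

# Keep the sink batch within the channel's transactionCapacity as well.
spool_dir.sinks.sink_to_hdfs.hdfs.batchSize = 10

# Assumption: roll one HDFS file per event and disable time- and size-based rolling.
spool_dir.sinks.sink_to_hdfs.hdfs.rollCount = 1
spool_dir.sinks.sink_to_hdfs.hdfs.rollInterval = 0
spool_dir.sinks.sink_to_hdfs.hdfs.rollSize = 0
```

Even with these changes, every byte still passes through the agent's heap before reaching HDFS, so a plain copyFromLocal may remain faster for a one-off bulk copy.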

2021-06-05

7fyelxc5 #2

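Apache Flume is not meant for moving or copying folders from the local file system to HDFS. Flume is designed for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store (reference: Flume User Guide).
If you want to move large files or directories, you should use hdfs dfs -copyFromLocal, as you have already mentioned.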

2021-06-05
