How to copy a set of CSV files from a local directory to HDFS using Flume

kd3sttzy · posted 2021-06-04 in Hadoop

How can I copy a set of CSV files from a local directory to HDFS using Flume? I tried using a spooling directory as the source, but the copy failed. I then used the following Flume configuration instead:

agent1.sources = tail 
agent1.channels = MemoryChannel-2 
agent1.sinks = HDFS 
agent1.sources.tail.type = exec 
agent1.sources.tail.command = tail -F /home/cloudera/runs/*  
agent1.sources.tail.channels = MemoryChannel-2 
agent1.sinks.HDFS.channel = MemoryChannel-2 
agent1.sinks.HDFS.type = hdfs 
agent1.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/cloudera/runs                         
agent1.sinks.HDFS.hdfs.file.Type = DataStream 
agent1.channels.MemoryChannel-2.type = memory

My files do get copied to HDFS, but they contain special characters and are unusable for me. My local directory is /home/cloudera/runs and my HDFS target directory is /user/cloudera/runs.


rqqzpn5f1#

I used the Flume configuration below to get the job done.

# Flume Configuration Starts

# Define a file channel called fileChannel1_1 on agent_slave_1

agent_slave_1.channels.fileChannel1_1.type = file 

# on linux FS

agent_slave_1.channels.fileChannel1_1.capacity = 200000
agent_slave_1.channels.fileChannel1_1.transactionCapacity = 1000

# Define a source for agent_slave_1

agent_slave_1.sources.source1_1.type = spooldir

# on linux FS

# Spooldir in my case is /home/cloudera/runs

agent_slave_1.sources.source1_1.spoolDir = /home/cloudera/runs/
agent_slave_1.sources.source1_1.fileHeader = false
agent_slave_1.sources.source1_1.fileSuffix = .COMPLETED
agent_slave_1.sinks.hdfs-sink1_1.type = hdfs

# Sink is /user/cloudera/runs_scored under hdfs

agent_slave_1.sinks.hdfs-sink1_1.hdfs.path = hdfs://localhost.localdomain:8020/user/cloudera/runs_scored/
agent_slave_1.sinks.hdfs-sink1_1.hdfs.batchSize = 1000
agent_slave_1.sinks.hdfs-sink1_1.hdfs.rollSize = 268435456
agent_slave_1.sinks.hdfs-sink1_1.hdfs.rollInterval = 0
agent_slave_1.sinks.hdfs-sink1_1.hdfs.rollCount = 50000000
agent_slave_1.sinks.hdfs-sink1_1.hdfs.writeFormat=Text

agent_slave_1.sinks.hdfs-sink1_1.hdfs.fileType = DataStream
agent_slave_1.sources.source1_1.channels = fileChannel1_1
agent_slave_1.sinks.hdfs-sink1_1.channel = fileChannel1_1

agent_slave_1.sinks =  hdfs-sink1_1
agent_slave_1.sources = source1_1
agent_slave_1.channels = fileChannel1_1
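
As a usage note, here is a minimal sketch of how an agent with the configuration above could be started. The file name spool-to-hdfs.conf and the conf directory /etc/flume-ng/conf are assumptions, not part of the original answer; adjust them for your install.

# Assumed config file name: spool-to-hdfs.conf
flume-ng agent \
  --conf /etc/flume-ng/conf \
  --conf-file spool-to-hdfs.conf \
  --name agent_slave_1 \
  -Dflume.root.logger=INFO,console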

oalqel3c2#

In your Flume configuration you need agent1.sinks.HDFS.hdfs.fileType = DataStream instead of agent1.sinks.HDFS.hdfs.file.Type = DataStream. The rest looks fine.
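With the misspelled property, the sink falls back to the default hdfs.fileType (SequenceFile), which writes binary headers and record markers; that is the likely source of the "special characters" in the output. As an illustration, a corrected sink section based on the configuration from the question might look like this (values copied from the question, only the property name fixed):

agent1.sinks.HDFS.channel = MemoryChannel-2
agent1.sinks.HDFS.type = hdfs
agent1.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/cloudera/runs
# fileType (not file.Type) controls the output format; DataStream writes plain text
agent1.sinks.HDFS.hdfs.fileType = DataStream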
