I'm working on HDP (Hortonworks), trying to collect tweets with Flume and then load the stored data from Hive.
The problem is that
select * from tweetsavro limit 1;
works, but
select * from tweetsavro limit 2;
fails with:
Failed with exception java.io.IOException:org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
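One way to read the -40 in this message: in an Avro object container file, each data block starts with two longs (the record count and the block size in bytes), both stored as zigzag-encoded varints. The sketch below decodes such longs by hand. The helper name `read_zigzag_long` and the sample bytes are illustrative only. It is an assumption, not something verified against the actual FlumeData file, that a newline followed by a second Avro header (`Obj\x01`) sits where the reader expects the next block. Those bytes would, however, decode to exactly this value.

```python
def read_zigzag_long(buf, pos=0):
    """Decode one Avro long (zigzag-encoded varint) from buf at pos.

    Returns (value, next_pos). Hypothetical helper for illustration.
    """
    shift = 0
    raw = 0
    while True:
        b = buf[pos]
        pos += 1
        raw |= (b & 0x7F) << shift   # low 7 bits carry payload
        if not (b & 0x80):           # high bit clear: last byte of the varint
            break
        shift += 7
    # Undo zigzag: even raw values are non-negative, odd ones negative.
    return (raw >> 1) ^ -(raw & 1), pos


# Assumed byte sequence: a stray newline, then the start of another
# Avro container header ("Obj" + 0x01).
sample = b"\nObj\x01"

count, pos = read_zigzag_long(sample, 0)    # newline read as "record count"
size, pos = read_zigzag_long(sample, pos)   # 'O' read as "block size"
print(count, size)
```

If the file really does contain more than one Avro header, that would be consistent with the first block being readable (limit 1) and the reader failing when it moves on to the next block.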
Everything I did follows this answer. Namely:
twitter.conf
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = xxx
TwitterAgent.sources.Twitter.consumerSecret = xxx
TwitterAgent.sources.Twitter.accessToken = xxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxx
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://sandbox.hortonworks.com:8020/user/flume/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.serializer = Text
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
twitter.avsc was created with the following command:
java -jar avro-tools-1.7.7.jar getschema FlumeData.1503479843633 > twitter.avsc
I created a table:
CREATE TABLE tweetsavro
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs://sandbox.hortonworks.com:8020/user/flume/twitter.avsc') ;
LOAD DATA INPATH 'hdfs://sandbox.hortonworks.com:8020/user/flume/twitter_data/FlumeData.*' OVERWRITE INTO TABLE tweetsavro;
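A possible diagnostic for the files loaded above, sketched under assumptions: an Avro object container file begins with the four magic bytes `Obj\x01`, and a well-formed file has that header once, at offset 0. The function below lists every offset at which the magic appears in a byte string. The name `find_avro_headers` and the local file path in the usage comment are hypothetical, and the magic bytes can also occur by chance inside record data, so extra hits are only a hint and not proof.

```python
AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def find_avro_headers(data):
    """Return all offsets in data where the Avro magic bytes occur."""
    offsets = []
    start = 0
    while True:
        i = data.find(AVRO_MAGIC, start)
        if i == -1:
            break
        offsets.append(i)
        start = i + 1
    return offsets


# Usage sketch (assumes the file was first copied out of HDFS, for example
# with `hdfs dfs -get`, to a local path of your choosing):
#
#   with open("FlumeData.1503479843633", "rb") as f:
#       print(find_avro_headers(f.read()))
```

A single offset of 0 would suggest one clean container. Several offsets would suggest that multiple containers were written into the same file, which the Hive Avro input format is not expected to handle.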
Comments:
I tried an external table (instead of a managed table), but nothing changed.
Because I'm using Hortonworks, I don't use Cloudera's TwitterSource.