I am trying to write a Hadoop Pig script that takes two files and filters one based on the strings in the other, i.e.
words.txt
google
facebook
twitter
linkedin
tweets.json
{"created_time": "18:47:31 ", "text": "RT @Joey7Barton: ..give a facebook about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...", "user_id": 450990391, "id": 252479809098223616, "created_date": "Sun Sep 30 2012"}
Script
twitter = LOAD 'Twitter.json' USING JsonLoader('created_time:chararray, text:chararray, user_id:chararray, id:chararray, created_date:chararray');
filtered = FILTER twitter BY (text MATCHES '.*facebook.*');
extracted = FOREACH filtered GENERATE 'facebook' AS pattern,id, user_id, created_time, created_date, text;
final = GROUP extracted BY pattern;
dump final;
Output
(facebook,{(facebook,252545104890449921,291041644,23:06:59 ,Sun Sep 30 2012,RT @Joey7Barton: ..give a facebook about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...)})
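I get this output without loading the words.txt file at all, i.e. by filtering the tweets directly against a hard-coded word.
What I need is output of the form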
(facebook)(complete tweet of that facebook word contained)
i.e. the script should read words.txt and, for each word it reads, fetch all the tweets from tweets.json that contain that word.
Any help?
Mohan V
1 Answer
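You could consider running multiple statements inside a FOREACH statement. Something like this -
Please note that this is only meant to give you an idea; I have not tested it. I will try to test it once I have a Pig environment available, but it should get you started.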
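One possible way to drive the filter from words.txt is sketched below. It is an untested illustration, not the snippet the answer refers to, and it uses CROSS followed by FILTER rather than a nested FOREACH. The file names and the JSON schema are taken from the question; the relation names (words, tweets, pairs, matched, grouped) are made up for this example. It uses the built-in INDEXOF for a plain substring test instead of MATCHES, so the words are not interpreted as regular expressions.

```pig
-- Assumption: words.txt holds one word per line, as shown in the question.
words = LOAD 'words.txt' AS (word:chararray);

-- Same schema as the LOAD statement in the question's script.
tweets = LOAD 'tweets.json' USING JsonLoader('created_time:chararray, text:chararray, user_id:chararray, id:chararray, created_date:chararray');

-- Pair every word with every tweet. CROSS is expensive, so this is
-- only reasonable while the word list stays small.
pairs = CROSS words, tweets;

-- Keep only the pairs where the tweet text contains the word.
-- INDEXOF returns -1 when the substring is not found.
matched = FILTER pairs BY INDEXOF(tweets::text, words::word, 0) >= 0;

-- Same columns as in the question, with the word itself as the pattern.
extracted = FOREACH matched GENERATE
    words::word          AS pattern,
    tweets::id           AS id,
    tweets::user_id      AS user_id,
    tweets::created_time AS created_time,
    tweets::created_date AS created_date,
    tweets::text         AS text;

-- One group per word, each holding all tweets that mention it.
grouped = GROUP extracted BY pattern;
DUMP grouped;
```

With the sample data from the question, the facebook group would be expected to contain the Ryder Cup tweet, and words that appear in no tweet would produce no group at all. The match is case-sensitive as written; applying LOWER to both arguments of INDEXOF is one way to relax that.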