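I want to process data with Hadoop MapReduce. My data is in the following format: it is unstructured, has fields that span multiple lines, and contains unterminated quotes.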
2/1/2013 5:16,Edward Felton,2,8/1/2012 3:57,Working on all the digital elements for our big event in Sydney in a couple of weeks... for more visit http://www.xy.com/au/geworks/,324005862,2,18200695
12/28/2012 19:28,Laura McCullum,2,7/26/2012 18:03,"The Day You Give Them Jive <br>
<a href="http://youtu.be/qfq9LVD2Qr4" > http://youtu.be/qfq9LVD2Qr4 <br>
<br>
'Like' if you have always wanted to destroy a cube!",502114904,2,18400313
11/21/2012 13:35,Timothy Widdowson,4,8/17/2012 12:38,"Can a table really replace a laptop...
With the new Windows tablets on the horizon and the Apple / Android devices out there I have been wondering if it is possible to really work with just and tablet.
My mission:
-For one whole week I will be working with just my iPad.
Hardware:
-Apple iPad
-Apple keyboard.
-Apple to HDMI connector.
-HDMI capable monitor.
- InCase iPad stand.
:-)",105001439,1,19301609
3/15/2013 13:43,Mary Romeo,3,8/16/2012 22:23,"HOW TO SHORTEN LONG LINKS YOU'RE POSTING <br>
The attached image describes how to shorten a long url before posting it. In 4 easy steps the 3-4 line urls can become a tiny link to post.",213022329,1,19901561
11/30/2012 2:17,Lu Yin Zhong,3,8/29/2012 1:29,working on 2013 comms plan...need big ideas!!,302014449,2,20300666
3/5/2013 22:15,Tim Steigert,12,8/29/2012 15:36,"Looking up 1024 email addresses. Manually? Probably a day! Doing it with SSOget, the add-in for #["excel"]? 5 minutes! Effort saved and #["productivity"] gained? Priceless! Now go get it and enjoy it for yourself! :)<br>http://sc.xy.com/*SSOget @@@data@@@{"image":"","title":""}",100011871,11,20400713
11/1/2012 20:46,Pranay Jain,2,8/30/2012 14:26,Do people agree with the iCloud restrictions that Airwatch will put on Personal iOS devices that have email?,212065316,0,20700913
11/9/2012 18:32,Monica Sharma,5,9/7/2012 11:42,hhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh gh gh gh gh ghghghghgghhhghghghghgh hg h gh gh,502000192,5,21400516
Could you provide a code snippet showing how to process this data? Thanks in advance!
1 Answer
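Because your records span multiple lines, you cannot read the data with the plain TextInputFormat, which treats every physical line as a separate record. You need a custom InputFormat for CSV files instead. Hadoop currently has no built-in way to handle multi-line CSV files (see https://issues.apache.org/jira/browse/mapreduce-2208), but fortunately there is some code on GitHub that you can try: https://github.com/mvallebr/csvinputformat.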
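To show roughly what a multi-line CSV reader has to do internally, here is a minimal sketch in plain Java with no Hadoop dependencies. It is not the code from the linked project: the class and method names are hypothetical, and it assumes that a record is complete once it has seen an even number of double quotes. That assumption only holds when stray quotes inside a field come in pairs, so it may need adjusting for your data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop or of the linked csvinputformat project.
final class MultiLineRecordAssembler {

    // Counts the double-quote characters in one physical line.
    static int countQuotes(String line) {
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '"') {
                count++;
            }
        }
        return count;
    }

    // Joins physical lines into logical records.
    // Assumption: a record ends on the first line where the running
    // number of quotes seen since the record started is even.
    static List<String> assemble(List<String> physicalLines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int quotes = 0;
        boolean pending = false; // true while a quoted field is still open
        for (String line : physicalLines) {
            if (pending) {
                current.append('\n'); // keep the line break inside the field
            }
            current.append(line);
            quotes += countQuotes(line);
            if (quotes % 2 == 0) {
                records.add(current.toString());
                current.setLength(0);
                quotes = 0;
                pending = false;
            } else {
                pending = true;
            }
        }
        if (pending) {
            // Input ended inside an open quote: emit what we have
            // so the caller can inspect or discard it.
            records.add(current.toString());
        }
        return records;
    }
}
```

In a real job this logic would sit inside a custom RecordReader that keeps reading lines from the input split until the record is complete, which is the part that needs care at split boundaries.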
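As for the unterminated quotes, you will probably need to preprocess and clean the data first. A simple rule is to escape a quote character (") when there is no delimiter immediately before it or immediately after it.
Escaped: a"b => a\"b
Left unchanged: a;"b and a";b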
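The escaping rule above could be sketched in Java as follows. The class and method names are hypothetical, and the sketch adds one assumption that the rule does not state: the start and end of the string count as delimiters, so a quote that opens or closes the whole record is left alone. It also does not check for quotes that are already escaped, so it should be run only once over raw input.

```java
// Hypothetical helper illustrating the rule; not part of any library.
final class QuoteEscaper {

    // Escapes every double quote that has no delimiter directly
    // before it and no delimiter directly after it.
    // Assumption: the start and end of the record count as delimiters.
    static String escapeStrayQuotes(String record, char delimiter) {
        StringBuilder out = new StringBuilder(record.length() + 8);
        int last = record.length() - 1;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (c == '"') {
                boolean delimiterBefore = i == 0 || record.charAt(i - 1) == delimiter;
                boolean delimiterAfter = i == last || record.charAt(i + 1) == delimiter;
                if (!delimiterBefore && !delimiterAfter) {
                    out.append('\\'); // stray quote inside a field
                }
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

For the sample data the delimiter would be ',' rather than ';'. The rule is a heuristic: a quote inside a field that happens to sit next to a delimiter is still treated as a field boundary and left unescaped.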
Another option is to fix the application that produces the invalid CSV so that it escapes the data correctly.