-- rss.pig
REGISTER piggybank.jar;

-- Load each <item> element of rss.txt as a single chararray
items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);

-- Pull the fields out of each item with regular expressions
data = FOREACH items GENERATE
    REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
    REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
    REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
    REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS pubdate:chararray;

-- Store into the HCatalog table rss_items, then read it back to validate
STORE data INTO 'rss_items' USING org.apache.hcatalog.pig.HCatStorer();
validate = LOAD 'default.rss_items' USING org.apache.hcatalog.pig.HCatLoader();
DUMP validate;
-- Results
(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)
-- Impala query
select * from rss_items
-- Impala results
link title description pubdate
0 http://www.hannonhill.com/news/item1.html News Item 1 Description of news item 1 here. 03 Jun 2003 09:39:21
1 http://www.hannonhill.com/news/item2.html News Item 2 Description of news item 2 here. 30 May 2003 11:06:42
2 http://www.hannonhill.com/news/item3.html News Item 3 Description of news item 3 here. 20 May 2003 08:56:02
-- rss.txt data file
<rss version="2.0">
<channel>
<title>News</title>
<link>http://www.hannonhill.com</link>
<description>Hannon Hill News</description>
<language>en-us</language>
<pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
<generator>Cascade Server</generator>
<webMaster>webmaster@hannonhill.com</webMaster>
<item>
<title>News Item 1</title>
<link>http://www.hannonhill.com/news/item1.html</link>
<description>Description of news item 1 here.</description>
<pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item1.html</guid>
</item>
<item>
<title>News Item 2</title>
<link>http://www.hannonhill.com/news/item2.html</link>
<description>Description of news item 2 here.</description>
<pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item2.html</guid>
</item>
<item>
<title>News Item 3</title>
<link>http://www.hannonhill.com/news/item3.html</link>
<description>Description of news item 3 here.</description>
<pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item3.html</guid>
</item>
</channel>
</rss>
4 Answers
lxkprmvk1#
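Right now, you are not going to have much luck with Impala and XML. Impala uses the Hive metastore, but it does not support custom InputFormats and SerDes. You can see the formats it supports natively here. You could use Hive instead, and the newer versions (0.12+) should be considerably faster.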
zbdgwd5y2#
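Hive and Impala do not really have a mechanism for dealing with XML files (which is odd, considering the XML support in most databases).
That said, if I were faced with this problem, I would use Pig to import the data into HCatalog. At that point, it is fully usable by both Hive and Impala.
Here is a quick and dirty example of using Pig to get some XML data into HCatalog: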
-- rss.pig
-- Results
-- Impala query
select * from rss_items
-- Impala results
-- rss.txt data file
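To check what the four REGEX_EXTRACT patterns in rss.pig pull out of a single item, the same patterns can be tried outside Pig. The following is only a sketch in Python, not part of the original answer: it assumes that REGEX_EXTRACT behaves like a regex search that returns the requested capture group, and the names ITEM, PATTERNS and extract_fields are made up for illustration.

```python
import re

# One <item> element as XMLLoader('item') would hand it to Pig (copied from rss.txt).
ITEM = """<item>
<title>News Item 1</title>
<link>http://www.hannonhill.com/news/item1.html</link>
<description>Description of news item 1 here.</description>
<pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item1.html</guid>
</item>"""

# Same patterns as in rss.pig. Pig string literals need doubled backslashes;
# Python raw strings do not.
PATTERNS = {
    "link": r"<link>(.*)</link>",
    "title": r"<title>(.*)</title>",
    "description": r"<description>(.*)</description>",
    "pubdate": r"<pubDate>.*(\d{2}\s[a-zA-Z]{3}\s\d{4}\s\d{2}:\d{2}:\d{2}).*</pubDate>",
}


def extract_fields(item):
    """Return the first capture group of each pattern, or None if it does not match."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, item)
        fields[name] = match.group(1) if match else None
    return fields


if __name__ == "__main__":
    for name, value in extract_fields(ITEM).items():
        print(name, "=", value)
```

Note that the pubdate pattern drops the weekday and the time zone, which is why the stored value is 03 Jun 2003 09:39:21 rather than the full pubDate text.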
envsm3lx3#
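Another approach is to quickly convert the pile of XML to Avro, and use the Avro files to back a table defined in Hive or Impala.
XmlSlurper can be used to parse the records out of the XML files.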
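The record-parsing step might look roughly like the sketch below. It is not the answerer's code: it uses Python's standard xml.etree.ElementTree as a stand-in for Groovy's XmlSlurper, and it writes JSON lines as a stand-in for Avro, since producing real Avro files needs a schema and a third-party library. The names RSS, FIELDS, parse_items and to_json_lines are hypothetical, and the sample is a shortened copy of rss.txt.

```python
import json
import xml.etree.ElementTree as ET

# Shortened copy of rss.txt with two items.
RSS = """<rss version="2.0">
<channel>
<title>News</title>
<item>
<title>News Item 1</title>
<link>http://www.hannonhill.com/news/item1.html</link>
<description>Description of news item 1 here.</description>
<pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
</item>
<item>
<title>News Item 2</title>
<link>http://www.hannonhill.com/news/item2.html</link>
<description>Description of news item 2 here.</description>
<pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
</item>
</channel>
</rss>"""

# Child elements of <item> to keep in each record.
FIELDS = ("link", "title", "description", "pubDate")


def parse_items(xml_text):
    """Return one dict per <item> element, keyed by the names in FIELDS."""
    root = ET.fromstring(xml_text)
    return [
        {name: item.findtext(name) for name in FIELDS}
        for item in root.iter("item")
    ]


def to_json_lines(records):
    """Serialize the records one per line (a placeholder for an Avro writer)."""
    return "\n".join(json.dumps(record) for record in records)


if __name__ == "__main__":
    print(to_json_lines(parse_items(RSS)))
```

Unlike the regex approach, a real XML parser keeps the channel-level title and link out of the records, because only the children of each item element are read.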
oyt4ldly4#
You can try the XML SerDe for Hive here.