Loading a subset of a file using Pig

Posted by txu3uszq on 2021-06-04 in Hadoop


I'm playing with the Hortonworks sandbox to learn Hadoop and related tools.
I'm trying to load a file on my single-machine "cluster":

A = LOAD 'googlebooks-eng-all-3gram-20090715-0.csv' using PigStorage('\t')
AS (ngram:chararray, year:int, count1:int, count2:int, count3:int);
B = LIMIT A 10;
Dump B;

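Unfortunately, the file is a bit too large for the RAM on my VM.
I was wondering whether it is possible to LOAD only a subset of the .csv file.
Is something like this possible: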

LOAD 'googlebooks-eng-all-3gram-20090715-0.csv' using PigStorage('\t') LOAD ONLY FIRST 100MB?

hadoop nosql csv apache-pig

Source: https://stackoverflow.com/questions/17156123/loading-a-subset-of-a-file-using-pig

2 Answers


70gysomp #1

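This is not possible in Hadoop the way you have described it, but you can reach the same goal from the OS shell rather than the Hadoop shell. In a Linux shell, you can write a script that reads the first 100 MB of the source file, saves it to the local file system, and then use that file as the Pig input.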


# Script .sh

# Read file and save 100 MB content in file system

# Create N files of 100MB each

# write a pig_script to process your data as shown below

# Launch Pig script and pass the N files as parameter as below:

pig -f pigscript.pig -param inputparm=/user/currentuser/File1,File2,..,FileN

# pigscript.pig

A = LOAD '$inputparm' using PigStorage('\t') AS (ngram:chararray, year:int, count1:int, count2:int, count3:int); 
B = LIMIT A 10; 
Dump B;

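In general, multiple files can be passed by name in the Hadoop shell, so you can also reference the file names from there.
The key point is that Pig has no built-in way to read only the first x bytes of a file and process them. It is all or nothing, so you will probably need a workaround to reach your goal.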
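The "read the first 100 MB" step described above can be sketched in plain shell as follows. This is only an illustration under some assumptions: the helper name `take_prefix` and the output name `subset.csv` are hypothetical, `head -c` is assumed to be available (it is in GNU and BSD userlands), and the input is assumed to hold one record per line. Because a byte cut can land in the middle of a record, the sketch drops a trailing partial line so that PigStorage does not see a truncated row.

```shell
#!/bin/sh
# take_prefix SRC DEST BYTES  (hypothetical helper, not part of Pig or Hadoop)
# Copy at most the first BYTES bytes of SRC to DEST, keeping only whole lines.
take_prefix() {
    src=$1
    dest=$2
    bytes=$3
    tmp="$dest.tmp"

    # Copy the first BYTES bytes; the cut may fall in the middle of a line.
    head -c "$bytes" "$src" > "$tmp"

    # Command substitution strips a trailing newline, so the result is
    # non-empty only when the last byte is NOT a newline (a partial line).
    if [ -n "$(tail -c 1 "$tmp")" ]; then
        sed '$d' "$tmp" > "$dest"   # drop the incomplete last line
        rm -f "$tmp"
    else
        mv "$tmp" "$dest"
    fi
}

# Possible follow-up steps (not run here; paths are assumptions):
#   take_prefix googlebooks-eng-all-3gram-20090715-0.csv subset.csv $((100 * 1024 * 1024))
#   hdfs dfs -put subset.csv /user/currentuser/subset.csv
#   pig -f pigscript.pig -param inputparm=/user/currentuser/subset.csv
```

Cutting by bytes and then trimming keeps the subset close to the requested size while leaving every remaining row parseable with the same schema.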


rslzwgfq #2

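Why do you need to load the whole file into RAM? You should be able to process the whole file regardless of how much memory is available. Try adding this to the top of your script: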

--avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
set io.sort.mb 10;

Your Pig script will now read:

--avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
set io.sort.mb 10;
A = LOAD 'googlebooks-eng-all-3gram-20090715-0.csv' using PigStorage('\t')
AS (ngram:chararray, year:int, count1:int, count2:int, count3:int);
B = LIMIT A 10;
Dump B;

Assuming you are hitting an OutOfMemoryError when running the script, this should solve your problem.

