Import text files to Pig through Python UDF

Posted by qzlgjiam on 2021-06-25 in Pig


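I am trying to load a file into Pig while using a Python UDF. I tried two approaches:
• (myudf1, sample1.pig): read the file from within Python; the file is located on my client server.
• (myudf2, sample2.pig): first load the file from HDFS in the grunt shell, then pass it as an argument to the Python UDF.
myudf1.py: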

from __future__ import with_statement
def get_words(dir):
    stopwords=set()
    with open(dir) as f1:
        for line1 in f1:
            stopwords.update([line1.decode('ascii','ignore').split("\n")[0]])
    return stopwords

stopwords=get_words("/home/zhge/uwc/mappings/english_stop.txt")

@outputSchema("findit: int")
def findit(stp):
    stp=str(stp)
    if stp in stopwords:
        return 1
    else:
        return 0

sample1.pig:

REGISTER '/home/zhge/uwc/scripts/myudf1.py' USING jython as pyudf;
item_title = load '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',')  AS (title:chararray);

T = limit item_title 1;
S = FOREACH T GENERATE pyudf.findit(title);
DUMP S;

I get: IOError: (2, 'No such file or directory', '/home/zhge/uwc/mappings/english_stop.txt')
For approach 2:
myudf2.py:

def get_wordlists(wordbag):
    stopwords=set()
    for t in wordbag:
        stopwords.update(t.decode('ascii','ignore'))
    return stopwords

@outputSchema("findit: int")
def findit(stopwordbag, stp):
    stopwords=get_wordlists(stopwordbag)
    stp=str(stp)
    if stp in stopwords:
        return 1
    else:
        return 0

sample2.pig:

REGISTER '/home/zhge/uwc/scripts/myudf2.py' USING jython as pyudf;

stops = load '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);
-- this step works fine and I can see the "stops" object is loaded into Pig
item_title = load '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',')  AS (title:chararray);
T = limit item_title 1;
S = FOREACH T GENERATE pyudf.findit(stops.stop_w, title);
DUMP S;

Then I get: ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias S. Backend error : Scalar has more than one row in the output. 1st : (a), 2nd :(as)

python user-defined-functions apache-pig

Source: https://stackoverflow.com/questions/30472894/import-text-files-to-pig-through-python-udf

1 Answer


ivqmmu1c1#

Your second example should work, although you applied LIMIT to the wrong relation: it should be on the stops relation. So it should be:

stops = LOAD '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);

item_title = LOAD '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = LIMIT stops 1;
S = FOREACH item_title GENERATE pyudf.findit(T.stop_w, title);

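However, since it looks like you need to process all of the stop words first, this is not enough. You need to do a GROUP ALL and then pass the result to your get_wordlists function instead: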

stops = LOAD '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);

item_title = LOAD '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = FOREACH (GROUP stops ALL) GENERATE pyudf.get_wordlists(stops) AS ready;
S = FOREACH item_title GENERATE pyudf.findit(T.ready, title);

For this method to work, you will have to update your UDF to accept the bag, which Jython receives as a list of tuples.
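A minimal sketch of what such an updated UDF could look like, assuming the stop-word bag reaches Jython as a list of single-field tuples. The helper name to_wordset and the fallback decorator are illustrative and not from the original post. It uses set.add rather than set.update, because update called with a string adds each character separately.

```python
# Pig's Jython engine normally provides outputSchema; this fallback is an
# assumption added so the file can also be imported and tested outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


def to_wordset(stopwordbag):
    """Flatten a bag (assumed: list of 1-field tuples) into a set of words."""
    words = set()
    for row in stopwordbag or []:
        # Skip empty rows and null fields instead of failing on them.
        if row and row[0] is not None:
            # add() stores the whole word; update() would split it into characters.
            words.add(row[0].strip())
    return words


@outputSchema("findit:int")
def findit(stopwordbag, stp):
    """Return 1 if stp is a stop word, 0 if not, None for a null title."""
    if stp is None:
        return None
    return 1 if stp in to_wordset(stopwordbag) else 0
```

This version rebuilds the set on every call, which keeps the sketch short; for a large stop-word list it may be worth caching the set after the first call.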

Answered 2021-06-26
