Passing a bag as input to a UDF in Pig

mgdq6dx1  posted on 2021-06-02 in Hadoop

I am trying to pass a bag (final) as input to a UDF.

dump final;

gives:-

(4,john,john,David,Banking ,4,M,20-01-1994,78.65,345000,Arkansasdest1,Destination)
(4,john,john,David,Banking ,4,M,20-01-1994,78.65,345000,Arkanssdest2,Destination)
(4,johns,johns,David,Banking ,4,M,20-01-1994,78.65,345000,ArkansasSrc1,source)
(4,johns,johns,David,Banking ,4,M,20-01-1994,78.65,345000,ArkansaSrc2,source)

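I am going to write a UDF that processes the bag above and finds mismatches between source and destination. Before doing that, I need to check that my UDF accepts a bag, so I wrote the sample UDF below: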

package PigUDFpck;

import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class databag extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    public DataBag exec(Tuple input) throws IOException { // different return type
        DataBag result = mBagFactory.newDefaultBag(); // change here
        // The first argument passed from the Pig script is expected to be a bag
        DataBag values = (DataBag) input.get(0);
        for (Iterator<Tuple> iterator = values.iterator(); iterator.hasNext();) {
            Tuple tuple = iterator.next();

            // logic
            // Use the factory instance directly; no second getInstance() call is needed
            Tuple t = mTupleFactory.newTuple();
            t.append(tuple);
            result.add(t);
        }
        return result; // change here
    }
}

After that, I used:

REGISTER /usr/local/pig/UDF/UDFBAG.jar;
DEFINE Databag Databag(); // not sure how to define it

2017-02-16 19:07:05,875 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 2 time(s). // got this warning after the DEFINE

final1 = FOREACH final GENERATE(Databag(final));

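ERROR 1200: Pig script failed to parse: Invalid scalar projection: final : A column needs to be projected from a relation for it to be used as a scalar

Please help me define the UDF and show me how to pass a bag to it.
Thanks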

mwg9r5ms1#

Try:

final1 = FOREACH final GENERATE(Databag(*));

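Although, as far as I understand, final contains tuples rather than a bag of tuples, so you may need to group it by some key first. In that case it would be something like: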

final1 = FOREACH (group final [by key or all]) GENERATE(Databag(final));
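
Spelled out as a full script, the grouped variant might look like the sketch below. It assumes that the relation final is already defined earlier in the script, that the jar path is the one from the question, and that the class is registered under its fully qualified name PigUDFpck.databag (Pig function names are case-sensitive, and the class in the question is lowercase). The alias grouped is a name introduced here for illustration only.

```pig
REGISTER /usr/local/pig/UDF/UDFBAG.jar;

-- Map the alias Databag to the fully qualified class name from the question
-- (package PigUDFpck, class databag).
DEFINE Databag PigUDFpck.databag();

-- GROUP ... ALL collects every tuple of final into one bag.
-- The grouped relation has two fields: group and final (the bag).
grouped = GROUP final ALL;

-- Inside the FOREACH, final refers to the bag field, so the UDF
-- receives a DataBag as input.get(0).
final1 = FOREACH grouped GENERATE Databag(final);

DUMP final1;
```

GROUP ... ALL sends all rows to a single group, which is fine for a quick test. To compare source and destination rows per record, grouping by a key (for example GROUP final BY $0) would give one bag per key instead.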
