Text pattern analysis in Pig

j5fpnvbx  posted on 2021-06-24 in Pig

I'm looking for a way to capture text patterns using Pig. Suppose I have input like the following:

ae988852ed9eabe3b5298d8b4c3b652e    I Never In My Life Gave A Guy No Money For Gas Or Food besides That Simpson Guy SMH I Fault Myself Though

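I want to extract patterns of consecutive words from this data and store them in a bag. For example, {i,never} would be the first pair, {never,in} the second, and so on. I know I would start the script like this: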

myinput = LOAD '/user/hive/warehouse/twitter_raw/$date' USING PigStorage('\t') AS (id,  mess);
strings = FOREACH myinput GENERATE $0 AS id, LOWER($1) AS mess;

But what should the next step be?

ac1kyiln  #1

It may be possible to get this result using only built-in functions, but a simple UDF can also do it:

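package com.example;

import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Turns a bag of single-field tuples into a bag of adjacent pairs:
// {(a),(b),(c)} becomes {(a,b),(b,c)}.
public class SlidingTuple extends EvalFunc<DataBag> {

    private static final BagFactory bagFactory = BagFactory.getInstance();
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        try {
            DataBag inputBag = (DataBag) input.get(0);
            DataBag result = null;
            if (inputBag != null) {
                result = bagFactory.newDefaultBag();
                Iterator<Tuple> it = inputBag.iterator();
                // An empty bag has no pairs; return the empty result instead of
                // letting it.next() throw NoSuchElementException.
                if (!it.hasNext()) {
                    return result;
                }
                Tuple previous = it.next();
                while (it.hasNext()) {
                    Tuple current = it.next();
                    // Pair the first field of each tuple with that of its successor.
                    Tuple tuple = tupleFactory.newTuple(2);
                    tuple.set(0, previous.get(0));
                    tuple.set(1, current.get(0));
                    result.add(tuple);
                    previous = current;
                }
            }
            return result;
        }
        catch (Exception e) {
            throw new RuntimeException("SlidingTuple error", e);
        }
    }
}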

Then:

A = LOAD '/user/hive/warehouse/twitter_raw/$date' USING PigStorage('\t') 
      AS (id:chararray,  mess:chararray);

B = foreach A generate TOKENIZE(mess, ' ') as words;

Then apply the UDF:

C = foreach B generate com.example.SlidingTuple(words);
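
To check what the pairs should look like before running anything on the cluster, the sliding-window step can be sketched in plain Java with no Pig dependency. This is only an illustration: the class name BigramSketch and the method pairs are hypothetical, and lowercasing plus splitting on whitespace is an assumption that mirrors the LOWER call in the question and the TOKENIZE call above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-alone sketch of the sliding-pair logic; not part of Pig.
final class BigramSketch {

    // Returns every pair of adjacent words in order.
    // A message with fewer than two words yields an empty list.
    static List<List<String>> pairs(String message) {
        List<List<String>> result = new ArrayList<>();
        String normalized = message.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return result;
        }
        String[] words = normalized.split("\\s+");
        for (int i = 1; i < words.length; i++) {
            // Pair each word with the one immediately before it.
            result.add(Arrays.asList(words[i - 1], words[i]));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [[i, never], [never, in], [in, my], [my, life]]
        System.out.println(pairs("I Never In My Life"));
    }
}
```

The loop keeps the same "previous, current" pairing as the UDF, so a message of n words produces n - 1 pairs.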
