Problem loading MovieLens data into Pig

bqf10yzr  posted on 2021-06-24  in Pig

I am trying to load some data into Pig.
Records:

11::American President, The (1995)::Comedy|Drama|Romance

12::Dracula: Dead and Loving It (1995)::Comedy|Horror

Script used:

loadMoviesDs = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat' 
               USING PigStorage(':') 
               AS (Movieid:long, dummy1, Title:chararray, dummy2, Genere:chararray);

Output:

11,,American President, The (1995),,Comedy|Drama|Romance
 12,,Dracula,, Dead and Loving It (1995)

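How do I deal with the colon after "Dracula"?
Because of that colon, the second column is split into two, and since we have 3 columns in total, the last column for movieid 12, Comedy|Horror, does not get loaded.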
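To make the cause concrete, here is a small Python sketch (not Pig) that splits the two sample records on every single ':'. It is only an approximation of what PigStorage(':') does, and the helper name split_on_colon is made up for illustration. The first record yields five fields, matching the five-field schema, while the extra colon in the Dracula title yields six, which pushes the genre past the end of the schema.

```python
# Approximate PigStorage(':') by splitting on every single ':' character.
# This is an illustration in Python, not Pig's actual implementation.
records = [
    "11::American President, The (1995)::Comedy|Drama|Romance",
    "12::Dracula: Dead and Loving It (1995)::Comedy|Horror",
]


def split_on_colon(line):
    """Hypothetical helper: split a line on each ':' character."""
    return line.split(":")


for line in records:
    fields = split_on_colon(line)
    # Show how many fields each record produces and what they are.
    print(len(fields), fields)
```

Each "::" separator contributes one empty field (the dummy columns in the script), so any additional ':' inside a title shifts everything after it by one position.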

snz8szmq  #1

You can use REGEX_EXTRACT_ALL.
Here is a piece of code that does this:

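-- Load each line as one chararray (assumes the data contains no tab characters)
A = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat' 
               AS (f1:chararray); 
-- Capture the three fields separated by '::'
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(f1, '(.*)::(.*)::(.*)');
-- Unpack the tuple of captured groups into separate fields
C = FOREACH B GENERATE FLATTEN($0);
D = FOREACH C GENERATE $0 AS (MovieID:long), $1 AS (Title:chararray), $2 AS (Genre:chararray);
DUMP D;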

I got the following output (each row is a tuple). The ":" after "Dracula" is left intact.

(11,American President, The (1995),Comedy|Drama|Romance)
(12,Dracula: Dead and Loving It (1995),Comedy|Horror)
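If you want to check the pattern outside Pig, the following Python sketch applies the same regular expression to the two sample records. It assumes Python's re module and the Java regex engine used by Pig behave the same way on this simple pattern, and that the whole line has to match; parse_movie is a hypothetical helper name.

```python
import re

# Same pattern as in the Pig script: three groups separated by '::'.
PATTERN = re.compile(r"(.*)::(.*)::(.*)")

records = [
    "11::American President, The (1995)::Comedy|Drama|Romance",
    "12::Dracula: Dead and Loving It (1995)::Comedy|Horror",
]


def parse_movie(line):
    """Hypothetical helper: return (id, title, genres) or None if no match."""
    match = PATTERN.fullmatch(line)
    return match.groups() if match else None


for line in records:
    print(parse_movie(line))
```

Because only "::" is treated as a separator, the single ':' inside the Dracula title stays part of the second group.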
