Unable to dump a relation in Pig

Posted by ttygqcqt on 2021-06-02 in Hadoop

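I have been stuck on this problem for a long time, and any help is appreciated. I have a dataset file in the /home/hadoop/pig directory. I can view the file, so this is not a permission problem. The dataset has 4 columns separated by "::". I am running Pig in local mode from the /home/hadoop/pig directory.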

ratingsData = LOAD 'ratings.dat' AS (line:chararray);

ratings = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);

grouped_mid = GROUP ratings BY mid;

dump grouped_mid;

The script above fails. I can dump the relations ratingsData and ratings successfully, but not grouped_mid. The following script runs successfully.

ratingsData = LOAD 'ratings.dat' AS (line:chararray);

ratings = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);

STORE ratings INTO 'ratingInfo.txt';

X = LOAD 'ratingInfo.txt' AS (uid:int, mid:int, rating:int, timestamp:long);

grouped_mid = GROUP X BY mid;

dump grouped_mid;

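Clearly, the second script has a redundant step: I just store a relation and load it back. I would like to avoid that. Any clarification or explanation would be much appreciated.
Thanks a lot.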

Answer #1 (lx0bsm1f)

See: Pig join with java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
You can change the script to:

ratingsData = LOAD 'ratings.dat' AS (line:chararray);

ratings = FOREACH ratingsData GENERATE FLATTEN((tuple(int, int, int, long))REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);

grouped_mid = GROUP ratings BY mid;

dump grouped_mid;

Tested.
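
Why the cast helps, as far as I can tell: REGEX_EXTRACT_ALL returns a tuple whose fields are all chararray, and on some Pig versions the AS clause in the question only declares the types int and long without converting the values. Dumping ratings just prints the strings, but GROUP BY mid then treats a string as an integer and fails with the ClassCastException from the linked question. If you would rather not cast the whole tuple, an alternative is to declare the extracted fields as chararray and cast them one by one. This is an untested sketch; the relation name parsed is my own, and everything else comes from the question.

```pig
ratingsData = LOAD 'ratings.dat' AS (line:chararray);

-- Declare the extracted fields as what they really are: strings.
parsed = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:chararray, mid:chararray, rating:chararray, timestamp:chararray);

-- Convert each field explicitly so later operators see real int/long values.
ratings = FOREACH parsed GENERATE (int)uid AS uid, (int)mid AS mid, (int)rating AS rating, (long)timestamp AS timestamp;

grouped_mid = GROUP ratings BY mid;

dump grouped_mid;
```

Either way, the intermediate STORE and LOAD from the second script in the question should no longer be needed.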
