用于hadoop的hive数据集

x7rlezfr  于 2021-06-02  发布在  Hadoop
关注(0)|答案(1)|浏览(343)

我把Pig的代码放在这里。当我试图执行这段代码时,我看到了错误。我无法调试它。有人能帮我调试代码吗。请把答案和他们的输入输出结果一起贴出来。
我会要求人们用他们的输入和输出结果来回答问题。拜托。

igetnqfo

igetnqfo1#

此数据集的问题是多个字符作为分隔符“:”。在pig中,不能使用多个字符作为分隔符。要解决这个问题,你有三个选择

1. Use REGEX_EXTRACT_ALL build-in function(need to write regex for this input)
2. Write custom UDF
3. Replace the multiple character delimiter to single character delimiter(This is very simple).

我从这个网站下载了数据集http://www.grouplens.org/datasets/movielens/ 并尝试了选项3

1. Go to your input folder /home/bigdata/sample/inputs/
    2. Run this sed command
        >> sed 's/::/$/g' movies.dat > Testmovies.dat
        >> sed 's/::/$/g' ratings.dat > Testratings.dat
        >> sed 's/::/$/g' users.dat > Testusers.dat

这将把多字符分隔符“::”转换为单字符分隔符“$”。我在所有三个文件中选择“$”作为分隔符bcoz“$”不存在。

3. Now load the new input files(Testmovies.dat,Testratings.dat,Testusers.dat)  in the pig script using '$' as a delimiter

修改Pig脚本:

-- filtering action and war movies
    A = LOAD 'Testmovies.dat' USING PigStorage('$')as (MOVIEID: chararray,TITLE:chararray,GENRE: chararray);
    B = filter A by ((GENRE matches '.*Action.*') AND (GENRE matches '.*War.*'));
    -- finding action and war movie ratings
    C = LOAD 'Testratings.dat' USING PigStorage('$')as (UserID: chararray, MovieID:chararray, Rating: int, Timestamp: chararray);
    D = JOIN B by $0, C by MovieID;
    -- calculating avg
    E = group D by $0;
    F = foreach E generate group as mvId,  AVG(D.Rating) as avgRating;
    -- finding max avg-rating
    G = group F ALL;
    H = FOREACH G GENERATE MAX(F.$1) AS avgMax;
    -- finding max avg-rated movie
    I = FILTER F BY (float)avgRating == (float)H.avgMax;
    -- filtering female users age between 20-30
    J = LOAD 'Testusers.dat' USING PigStorage('$') as (UserID: chararray, Gender: chararray, Age: int, Occupation: chararray, Zip: chararray);
    K = filter J by ((Gender == 'F') AND (Age >= 20 AND Age <= 30));
    L = foreach K generate UserID;
    -- finding filtered female users rated movies
    M = JOIN L by $0, C by UserID;
    -- finding filtered female users who rated highest rated action and war movies
    N = JOIN I by $0, M by $2;
    -- finding distinct female users
    O = foreach N generate $2 as User;
    Q1 = Distinct O;
    DUMP Q1;

    Sample Output:
    (5763)
    (5785)
    (5805)
    (5808)
    (5812)
    (5825)
    (5832)
    (5852)
    (5869)
    (5878)
    (5920)
    (5955)
    (5972)
    (5974)
    (6009)
    (6036)

相关问题