1. Use REGEX_EXTRACT_ALL build-in function(need to write regex for this input)
2. Write custom UDF
3. Replace the multiple character delimiter to single character delimiter(This is very simple).
1. Go to your input folder /home/bigdata/sample/inputs/
2. Run this sed command
>> sed 's/::/$/g' movies.dat > Testmovies.dat
>> sed 's/::/$/g' ratings.dat > Testratings.dat
>> sed 's/::/$/g' users.dat > Testusers.dat
3. Now load the new input files(Testmovies.dat,Testratings.dat,Testusers.dat) in the pig script using '$' as a delimiter
修改Pig脚本:
-- filtering action and war movies
A = LOAD 'Testmovies.dat' USING PigStorage('$')as (MOVIEID: chararray,TITLE:chararray,GENRE: chararray);
B = filter A by ((GENRE matches '.*Action.*') AND (GENRE matches '.*War.*'));
-- finding action and war movie ratings
C = LOAD 'Testratings.dat' USING PigStorage('$')as (UserID: chararray, MovieID:chararray, Rating: int, Timestamp: chararray);
D = JOIN B by $0, C by MovieID;
-- calculating avg
E = group D by $0;
F = foreach E generate group as mvId, AVG(D.Rating) as avgRating;
-- finding max avg-rating
G = group F ALL;
H = FOREACH G GENERATE MAX(F.$1) AS avgMax;
-- finding max avg-rated movie
I = FILTER F BY (float)avgRating == (float)H.avgMax;
-- filtering female users age between 20-30
J = LOAD 'Testusers.dat' USING PigStorage('$') as (UserID: chararray, Gender: chararray, Age: int, Occupation: chararray, Zip: chararray);
K = filter J by ((Gender == 'F') AND (Age >= 20 AND Age <= 30));
L = foreach K generate UserID;
-- finding filtered female users rated movies
M = JOIN L by $0, C by UserID;
-- finding filtered female users who rated highest rated action and war movies
N = JOIN I by $0, M by $2;
-- finding distinct female users
O = foreach N generate $2 as User;
Q1 = Distinct O;
DUMP Q1;
Sample Output:
(5763)
(5785)
(5805)
(5808)
(5812)
(5825)
(5832)
(5852)
(5869)
(5878)
(5920)
(5955)
(5972)
(5974)
(6009)
(6036)
1条答案
按热度按时间igetnqfo1#
此数据集的问题是多个字符作为分隔符“:”。在pig中,不能使用多个字符作为分隔符。要解决这个问题,你有三个选择
我从这个网站下载了数据集http://www.grouplens.org/datasets/movielens/ 并尝试了选项3
这将把多字符分隔符“::”转换为单字符分隔符“$”。我在所有三个文件中选择“$”作为分隔符bcoz“$”不存在。
修改Pig脚本: