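-- Load_Data is assumed to be an already-loaded relation with four columns (col1..col4).
-- Keep col1 and col4, dropping rows where col4 is null.
Filter_one = FOREACH Load_Data GENERATE $0 AS col1, $3 AS col4;
Filter_one_temp = FILTER Filter_one BY ($1 IS NOT NULL);
-- Keep col1, col2 and col3 from every row.
Filter_two = FOREACH Load_Data GENERATE $0 AS col1, $1 AS col2, $2 AS col3;
-- Left join on col1 so each row picks up a non-null col4 where one exists.
Join_filter = JOIN Filter_two BY $0 LEFT, Filter_one_temp BY $0;
-- After the join, $3 is the duplicated col1 and $4 is col4.
generate_output = FOREACH Join_filter GENERATE $0 AS col1, $1 AS col2, $2 AS col3, $4 AS col4;
STORE generate_output INTO 'dfs_path' USING PigStorage(',');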
--Load input data
input_data = LOAD 'input.txt' USING PigStorage() AS (Col1:chararray, Col2:int, Col3:int, Col4:chararray);
--Perform operation on each record
input_data = FOREACH input_data GENERATE Col1, Col2, Col3, ((Col4 is null or TRIM(Col4) == '') ? 'XYZ' : Col4) as Col4;
2 Answers
nxagd54h1#
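Is col1 the same for all rows? If so, use two filter sets; otherwise you have to find the unique values between col1 and col4 and remove the nulls, using the following steps.
Filter_one captures col1 and col4 where col4 is not null.
Filter_two captures col1, col2, and col3.
Join Filter_one and Filter_two: the Filter_two columns are output in the 1st, 2nd, and 3rd positions, and the second column of Filter_one (col4) goes in the 4th position.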
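The steps above mention finding the unique col1/col4 values, so here is a minimal sketch of that variant with an explicit DISTINCT. The file name input.txt, the tab delimiter, the column types, and the alias names (pairs, non_null_pairs, unique_pairs, left_side, joined, result) are assumptions for illustration, not part of the original answer.

```pig
-- Assumption: input.txt is tab-delimited with four columns.
Load_Data = LOAD 'input.txt' USING PigStorage()
            AS (col1:chararray, col2:int, col3:int, col4:chararray);

-- Unique (col1, col4) pairs where col4 is present.
pairs          = FOREACH Load_Data GENERATE col1, col4;
non_null_pairs = FILTER pairs BY col4 IS NOT NULL;
unique_pairs   = DISTINCT non_null_pairs;

-- The first three columns of every input row.
left_side = FOREACH Load_Data GENERATE col1, col2, col3;

-- Left outer join keeps rows whose col1 has no non-null col4 at all.
joined = JOIN left_side BY col1 LEFT OUTER, unique_pairs BY col1;

-- Fields are disambiguated with the relation::field form after the join.
result = FOREACH joined GENERATE
             left_side::col1    AS col1,
             left_side::col2    AS col2,
             left_side::col3    AS col3,
             unique_pairs::col4 AS col4;

DUMP result;
```

If one col1 value maps to several different non-null col4 values, the join still produces one output row per pair, so this only yields a single row per input record when each col1 has at most one distinct col4.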
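Hope this helps too.
The Pig script is as follows (the first listing at the top of the page, from Filter_one through the final store statement):
When I store the result, the delimited output will be: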
35g0bw712#
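If your requirement is to update the Col4 value to XYZ for every record where it is null or empty, you can do that with the snippet above that starts at the --Load input data comment.
It assumes the input data has been loaded. For each record it checks whether Col4 is null or empty; if so, it replaces the value with the desired one (XYZ), otherwise it keeps the existing value.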