Remove duplicate pairs in Pig

Posted by 2ul0zpep on 2021-06-21 in Pig

I have the sample below.
Update:

OBR|1|METABOLIC PANEL
OBX|1|Glucose
OBX|2|BUN
OBX|3|CREATININE
OBR|2|RFLX TO VERIFICATION
OBX|1|EGFR
OBX|2|SODIUM
OBR|3|AMBIGUOUS DEFAULT
OBX|1|POTASSIUM

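In this sample, treat each OBR as a test. Each OBR is followed by OBX lines, which are the results of that OBR. Every OBR carries an id (1, 2 and 3 in this example), and the OBX lines belonging to a given OBR are numbered starting from 1. What I want is this: when I find an OBR, I create a unique id and attach it to that OBR and to every OBX that follows it, until I reach the next OBR (id 2), where I do the same again. My expected output is below.
Expected result: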

OBR|1|METABOLIC PANEL|OBR_filename_1
OBX|1|Glucose|OBR_filename_1
OBX|2|BUN|OBR_filename_1
OBX|3|CREATININE|OBR_filename_1
OBR|2|RFLX TO VERIFICATION|OBR_filename_2
OBX|1|EGFR|OBR_filename_2
OBX|2|SODIUM|OBR_filename_2
OBR|3|AMBIGUOUS DEFAULT|OBR_filename_3
OBX|1|POTASSIUM|OBR_filename_3
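
The rule described above is a forward fill: remember the most recent OBR and stamp its tag onto every line until the next OBR appears. As a minimal sketch of that logic only, written in plain Python rather than Pig, and assuming the tag has the form `OBR_<filename>_<OBR id>` as in the expected result (the function name `tag_segments` and the `filename` parameter are hypothetical):

```python
SAMPLE = [
    "OBR|1|METABOLIC PANEL",
    "OBX|1|Glucose",
    "OBX|2|BUN",
    "OBX|3|CREATININE",
    "OBR|2|RFLX TO VERIFICATION",
    "OBX|1|EGFR",
    "OBX|2|SODIUM",
    "OBR|3|AMBIGUOUS DEFAULT",
    "OBX|1|POTASSIUM",
]


def tag_segments(lines, filename="filename"):
    """Append the tag of the most recent OBR to every OBR/OBX line."""
    tagged = []
    current_tag = None  # no OBR seen yet
    for line in lines:
        fields = line.split("|")
        if fields[0] == "OBR":
            # A new OBR starts a new group; build the tag from its id.
            current_tag = f"OBR_{filename}_{fields[1]}"
        if current_tag is None:
            raise ValueError(f"OBX line before any OBR: {line!r}")
        tagged.append(f"{line}|{current_tag}")
    return tagged


if __name__ == "__main__":
    for row in tag_segments(SAMPLE):
        print(row)
```

The sketch depends on line order, which is why a distributed Pig job needs an explicit row number before it can reproduce the same grouping.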
#1 iyfjxgzm

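I tried this; it looks like an HL file. You can use Stitch, Over and lead and come up with something like the following. From a performance point of view there may be better solutions than this one, but I think it should work. Let me know how it goes.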

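-- Over and Stitch come from Piggybank; the jar path is an assumption, adjust it to your installation.
REGISTER piggybank.jar;
DEFINE Over org.apache.pig.piggybank.evaluation.Over('long');
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch;
DEFINE lead org.apache.pig.piggybank.evaluation.Lead;

-- Load the pipe-delimited segments. The alias is renamed from `in`,
-- which newer Pig releases treat as the IN operator keyword.
raw = LOAD 'hl_file' USING PigStorage('|') AS (id:chararray, num:int, reason:chararray);
-- RANK without fields prepends a sequential row number to each record.
temp = RANK raw;
ranked = FOREACH temp GENERATE $0 AS row_no, $1 AS id:chararray, $2 AS orig_id:int, $3 AS reason:chararray;

-- For every OBR row, look up the row number of the next OBR (9999 when there is none).
OBR_data = FILTER ranked BY id == 'OBR';
next_row_num_OBR = FOREACH (GROUP OBR_data BY id) {
    sorted = ORDER OBR_data BY row_no;
    stitched = Stitch(sorted, Over(sorted.row_no, 'lead', 0, 1, 1, (long)9999));
    GENERATE FLATTEN(group) AS (id:chararray),
             FLATTEN(stitched.(row_no, orig_id, reason, result)) AS (row_no:long, orig_id:int, reason:chararray, next_row_no:long);
}

-- Keep each OBX row whose row number lies between its OBR and the next OBR.
OBX_data = FILTER ranked BY id == 'OBX';
Crossed = CROSS next_row_num_OBR, OBX_data;
result = FILTER Crossed BY (OBX_data::row_no > next_row_num_OBR::row_no AND OBX_data::row_no < next_row_num_OBR::next_row_no);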

This should produce a result like this:

(OBR,5,2,RFLX TO VERIFICATION,8,7,OBX,2,SODIUM)

(OBR,1,1,METABOLIC PANEL,5,2,OBX,1,Glucose)

(OBR,5,2,RFLX TO VERIFICATION,8,6,OBX,1,EGFR)

(OBR,8,3,AMBIGUOUS DEFAULT,9999,9,OBX,1,POTASSIUM)

(OBR,1,1,METABOLIC PANEL,5,3,OBX,2,BUN)

(OBR,1,1,METABOLIC PANEL,5,4,OBX,3,CREATININE)

It only attaches the OBR record to its corresponding OBX rows; it does not add the filename or a constant.
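
To make the row-number reasoning in this answer easier to follow, here is a rough plain-Python sketch of the same three steps: number the rows, give each OBR the row number of the next OBR (the "lead", with 9999 as the default), then pair each OBR with the OBX rows that fall strictly between the two. It is an illustration of the logic, not a translation of the Pig script; the function name `join_obx_to_obr` is hypothetical, and it assumes every line has exactly three pipe-separated fields.

```python
SAMPLE = [
    "OBR|1|METABOLIC PANEL",
    "OBX|1|Glucose",
    "OBX|2|BUN",
    "OBX|3|CREATININE",
    "OBR|2|RFLX TO VERIFICATION",
    "OBX|1|EGFR",
    "OBX|2|SODIUM",
    "OBR|3|AMBIGUOUS DEFAULT",
    "OBX|1|POTASSIUM",
]

DEFAULT_NEXT = 9999  # same default the answer passes to the lead window


def join_obx_to_obr(lines):
    """Pair each OBR with the OBX rows located before the next OBR."""
    # Step 1 (RANK): number the rows starting from 1.
    ranked = [(row_no, *line.split("|"))
              for row_no, line in enumerate(lines, start=1)]
    obr_rows = [r for r in ranked if r[1] == "OBR"]
    obx_rows = [r for r in ranked if r[1] == "OBX"]

    # Step 2 (lead): row number of the following OBR, or the default.
    next_rows = [r[0] for r in obr_rows[1:]] + [DEFAULT_NEXT]

    # Step 3 (CROSS + FILTER): keep OBX rows strictly inside each interval.
    result = []
    for (row_no, seg, orig_id, reason), next_row_no in zip(obr_rows, next_rows):
        for x_row, x_seg, x_id, x_reason in obx_rows:
            if row_no < x_row < next_row_no:
                result.append((seg, row_no, int(orig_id), reason, next_row_no,
                               x_row, x_seg, int(x_id), x_reason))
    return result


if __name__ == "__main__":
    for row in join_obx_to_obr(SAMPLE):
        print(row)
```

The nested loop mirrors the CROSS followed by FILTER, which compares every OBR with every OBX; that quadratic pairing is the likely source of the performance concern mentioned in the answer.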

#2 r9f1avp5

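Use DISTINCT. Assuming your relation with the duplicate records is A, the statement below removes the duplicates and stores the unique records in relation B.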

B = DISTINCT A;
