How to denormalize 2 CSV files with cardinality 0,1 and mainly 1,n with Pig?

Posted by ff29svar on 2021-06-04 in Hadoop


I need some help with my Pig script. I have 2 CSV files and I want to join them on a common id.

customer.csv :
1   ; nom1   ; prenom1   
2   ; nom2   ; prenom2   
3   ; nom3   ; prenom3   

child.csv
1  ; enfant_1_1  
2  ; enfant_1_2  
3  ; enfant_1_3  
1  ; enfant_2_1  
1  ; enfant_3_1

So a customer can have many children, but a child can have only one "customer".
I want to create this file:

1   ; nom1   ; prenom1  ; enfant_1_1  ; enfant_2_1  ; enfant_3_1    
2   ; nom2   ; prenom2  ; enfant_1_2   
3   ; nom3   ; prenom3  ; enfant_1_3

Here is my approach:
First, I try to produce:

1  ; enfant_1_1  ; enfant_2_1  ; enfant_3_1
2  ; enfant_1_2
3  ; enfant_1_3

Afterwards I will join the result with customer.csv.
Tell me if you think there is a simpler way :)
Here is my script:

donnees_Enfants = LOAD '/user/cloudera/Jeux/mini_jeu2.csv' USING PigStorage(';')
AS (id_parent:int,nom_enfant:chararray);

group_enfants = GROUP donnees_Enfants BY id_parent;

enfant_uneLigne = foreach group_enfants generate group, donnees_Enfants.nom_enfant;

grunt> echantillon = LIMIT enfant_uneLigne 50;
grunt> DUMP echantillon;

DESCRIBE gives:
group_enfants: {group: int, donnees_Enfants: {(id_parent: int, nom_enfant: chararray)}}
enfant_uneLigne: {group: int, {(nom_enfant: chararray)}}
The result is:

(1,{( enfant_2_1  ),( enfant_1_1  ),( enfant_3_1  )})
(2,{( enfant_2_2  )})
(3,{( enfant_2_3  )})

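I tried to flatten the bag of children, but the result is one line per child. I have some difficulty working with tuples and bags. Can you help me?
Thanks in advance,
EDIT: I found a solution to my problem and more ^^ see below.
Angelik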

hadoop Join apache-pig denormalization Bag

Source: https://stackoverflow.com/questions/22612706/how-to-denormalized-2-csv-files-with-cardinality-0-1-and-mainly-1-n-with-pig

1 Answer


pu82cl6c1#

Finally, I found the solution, and it also works with more fields for the child: (id, name, age).
-- 1. Load the two files
donnees_enfants = LOAD '/user/cloudera/jeuxDenormalization/jeux/mini_jeu2.csv' USING PigStorage(';') AS (id:int, nom_enfant:chararray);
donnees_parents = LOAD '/user/cloudera/jeuxDenormalization/jeux/mini_jeu1.csv' USING PigStorage(';') AS (id_parent:int, nom_parent:chararray, prenom_parent:chararray);
-- 2. Join the files with a LEFT OUTER join, to keep the customers without children
denormalized = JOIN donnees_parents BY id_parent LEFT OUTER, donnees_enfants BY id;

(9, nom9   , prenom9   ,9, enfant_2_9  )
(9, nom9   , prenom9   ,9, enfant_3_9  )
(9, nom9   , prenom9   ,9, enfant_1_9  )
(10, nom10  , prenom10  ,10, enfant_3_10)
(10, nom10  , prenom10  ,10, enfant_1_10 )
(10, nom10  , prenom10  ,10, enfant_2_10 )
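
To illustrate why the join is LEFT OUTER, here is a minimal sketch that isolates the customers without children. The file names customer.csv and child.csv and the relation names parents, enfants, jointure and sans_enfant are assumptions chosen for the illustration, not names from the answer. In a left outer join, a customer with no matching child still produces one row, with the child-side fields set to null.

```pig
-- Assumed inputs: customer.csv (id;nom;prenom) and child.csv (id;nom_enfant)
parents = LOAD 'customer.csv' USING PigStorage(';')
          AS (id_parent:int, nom_parent:chararray, prenom_parent:chararray);
enfants = LOAD 'child.csv' USING PigStorage(';')
          AS (id:int, nom_enfant:chararray);

-- Every customer is kept; child fields are null when there is no match
jointure = JOIN parents BY id_parent LEFT OUTER, enfants BY id;

-- Customers without children are the rows whose child id is null
sans_enfant = FILTER jointure BY enfants::id IS NULL;

DUMP sans_enfant;
```

With an inner join these customers would be dropped before the grouping step, so they would be missing from the final file.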

-- 3. Group by customer to get only one row per customer
unparent_parligne = GROUP denormalized BY (id_parent, nom_parent, prenom_parent);

((48, nom48  , prenom48  ),{(48, nom48  , prenom48  ,48, enfant_2_48 ),(48, nom48  , prenom48  ,48, enfant_1_48 )})
((49, nom49  , prenom49  ),{(49, nom49  , prenom49  ,49, enfant_2_49 ),(49, nom49  , prenom49  ,49, enfant_1_49 )})
((50, nom50  , prenom50  ),{(50, nom50  , prenom50  ,50, enfant_2_50 ),(50, nom50  , prenom50  ,50, enfant_1_50 )})
((51, nom51  , prenom51  ),{(51, nom51  , prenom51  ,51, enfant_1_51 )})

-- 4. Flatten the rows:
ligne_finale = FOREACH unparent_parligne GENERATE FLATTEN(group), FLATTEN(BagToTuple(denormalized.(donnees_enfants::nom_enfant, donnees_enfants::age)));

(9, nom9   , prenom9   , enfant_2_9  , enfant_3_9  , enfant_1_9  )
(10, nom10  , prenom10  , enfant_3_10, enfant_1_10 , enfant_2_10 )
(11, nom11  , prenom11  , enfant_1_11 , enfant_2_11 )

Or, if there are more fields (with donnees_enfants::age):

(8, nom8   , prenom8   , enfant_3_8  , age_3_8 , enfant_2_8  , age_2_8 , enfant_1_8  , age_1_8 )
(9, nom9   , prenom9   , enfant_2_9  , age_2_9 , enfant_3_9  , age_3_9 , enfant_1_9  , age_1_9 )
(10, nom10  , prenom10  , enfant_3_10 , age_3_10, enfant_1_10 , age_1_10, enfant_2_10 , age_2_10)
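
The difference between flattening a bag and flattening a tuple is what makes this step work, and it can be sketched on the children file alone. The file name child.csv and the relation names below are assumptions for the illustration. FLATTEN on a bag produces one output row per tuple in the bag (the behaviour seen in the question), while the built-in BagToTuple (available in Pig 0.11 and later, as far as I know) first turns the bag into a single tuple, so FLATTEN then spreads its fields across the columns of one row.

```pig
-- Assumed input: child.csv with lines such as "1;enfant_1_1"
enfants = LOAD 'child.csv' USING PigStorage(';')
          AS (id_parent:int, nom_enfant:chararray);
par_parent = GROUP enfants BY id_parent;

-- FLATTEN on a bag: one output row per child
une_ligne_par_enfant = FOREACH par_parent
    GENERATE group, FLATTEN(enfants.nom_enfant);

-- BagToTuple, then FLATTEN on the resulting tuple: one row per parent.
-- The nested ORDER makes the column order deterministic, since the
-- order of tuples inside a bag is otherwise not guaranteed.
une_ligne_par_parent = FOREACH par_parent {
    tries = ORDER enfants BY nom_enfant;
    GENERATE group, FLATTEN(BagToTuple(tries.nom_enfant));
};

DUMP une_ligne_par_parent;
```

Note that the resulting rows have a variable number of fields, one per child, so the output has no fixed schema after this step.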

-- 5. Store the data in a CSV file
STORE ligne_finale INTO '/user/cloudera/jeuxdenormalisation/resultats/test4' USING org.apache.pig.piggybank.storage.PigStorageSchema(';');
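
PigStorageSchema lives in piggybank, so piggybank.jar has to be registered (REGISTER followed by the path of the jar on the local installation) before the STORE statement can resolve the class. As an alternative, here is a minimal sketch assuming Pig 0.10 or later, where the built-in PigStorage accepts a '-schema' option. The input file, the output path and the relation name lignes are hypothetical.

```pig
-- Hypothetical relation standing in for the rows to be stored
lignes = LOAD 'customer.csv' USING PigStorage(';')
         AS (id_parent:int, nom_parent:chararray, prenom_parent:chararray);

-- '-schema' writes a .pig_schema file next to the data files,
-- so no piggybank class is needed
STORE lignes INTO 'resultats/test_schema' USING PigStorage(';', '-schema');
```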
