Apache Pig GROUP BY does not give the expected output

r6vfmomb, posted on 2021-06-02 in Hadoop

I have data in CSV format. The columns are as follows:

"first_name","last_name","company_name","address","city","county","postal","phone1","phone2","email","web"

The sample data is stored in a file named User.csv, which contains the following rows:

"Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk"
"Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk"
"France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk"

When I load this file using PigStorage:
user = LOAD '/home/abhijit/Downloads/User.csv' USING PigStorage(',');

DUMP user;

The output is as follows:

("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk")
("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk")
("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk")

I want to group by city, so I wrote:

grp = group user by $4;
dump grp;

The output I got is:

( Binney St",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk")})
("8 Moor Place",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk")})

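The company name and address fields contain `','` as part of their values, for example `"14, Taylor St"` in the address or `"Elliott, John W Esq"` in the company name.
So my `$4` is treated as `"Taylor St"` instead of `"St. Stephens Ward"`. Because of the extra delimiter inside the address or company name, the rows are not split into fields correctly, and GROUP BY does not give the correct result.
How can I get the grouping to produce the following output?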

("Abbey Ward",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk")})
("East Southbourne and Tuckton W",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk")})

grp = group a by $5 ;

That is not a solution for me. I have already considered it.
wj8zmpe1 #1

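The problem is that PigStorage does not take quoting or escaping into account. It splits on every comma, so each time a value contains a comma it creates an extra column for something that should not be a separate field.
Using CSVExcelStorage will solve this, because that loader handles quoted fields and therefore produces the correct number and order of columns.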
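
As a minimal sketch of that suggestion, the load from the question could be rewritten roughly as follows. It assumes CSVExcelStorage comes from the piggybank contrib library; the jar path shown is a placeholder, not something given in the question, and needs to be adjusted to the local installation.

```pig
-- Assumption: hypothetical location of the piggybank jar; replace with the real path.
REGISTER /path/to/piggybank.jar;

-- CSVExcelStorage defaults to a comma delimiter and treats commas inside
-- double-quoted values as data, so "14, Taylor St" stays a single field.
user = LOAD '/home/abhijit/Downloads/User.csv'
       USING org.apache.pig.piggybank.storage.CSVExcelStorage();

-- With the rows split into the intended 11 fields, $4 is the city column.
grp = GROUP user BY $4;
DUMP grp;
```

With the fields parsed this way, `$4` should refer to the city for every row, so grouping on it should yield one group per city. The dumped values may not look exactly like the desired output in the question, since the loader is expected to strip the enclosing double quotes.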
