Counting the number of distinct strings with Pig

jpfvwuh4  posted on 2021-06-24  in  Pig

I'm new to Pig and am trying to count the number of distinct countries in the following dataset (you can download it from this link):

Athlete         Country        Year  Sport                      Gold  Silver  Bronze  Total
Yang Yilin      China          2008  Gymnastics                 1     0       2       3
Leisel Jones    Australia      2000  Swimming                   0     2       0       2
Go Gi-Hyeon     South Korea    2002  Short-Track Speed Skating  1     1       0       2
Chen Ruolin     China          2008  Diving                     2     0       0       2
Katie Ledecky   United States  2012  Swimming                   1     0       0       1
Ruta Meilutyte  Lithuania      2012  Swimming                   1     0       0       1

Here is what I have tried so far:

athletes = LOAD '/data/OlympicAthletes.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') AS (athlete:chararray, country:chararray, year:int, sport:chararray, gold:int, silver:int, bronze:int, total:int);
distinct_countries= distinct (foreach athletes generate country);
country_count_try1 = COUNT(distinct_countries);
country_count_try2 = FOREACH distinct_countries GENERATE COUNT(country);
country_count_try3 = FOREACH (GROUP athletes country) GENERATE count(athletes.country) as total_country;

p8h8hvxi  #1

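COUNT is an aggregate function that takes a bag, so it can only be used inside a FOREACH over grouped data. To count all the rows of a relation, you first need to group the whole relation with GROUP ... ALL.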

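-- Reduce the relation to one row per distinct country.
distinct_countries = DISTINCT (FOREACH athletes GENERATE country);
-- GROUP ... ALL collects every row into a single bag, which COUNT can then aggregate.
country_count_try4 = FOREACH (GROUP distinct_countries ALL) GENERATE COUNT(distinct_countries) AS cnt;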
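
If it helps, the same count can also be written with a nested DISTINCT inside the FOREACH. This is only a sketch: it assumes the `athletes` relation loaded in the question, and the aliases `countries`, `all_countries`, `uniq` and `country_count` are names made up for illustration.

-- Keep only the column we need before grouping.
countries = FOREACH athletes GENERATE country;
-- Put every row into one bag (the group key is the string 'all').
all_countries = GROUP countries ALL;
-- De-duplicate inside the bag, then count what is left.
country_count = FOREACH all_countries {
    uniq = DISTINCT countries.country;
    GENERATE COUNT(uniq) AS cnt;
};
-- Print the single-row result.
DUMP country_count;

Two things to keep in mind. COUNT skips tuples whose first field is null, so rows without a country are not counted; COUNT_STAR would include them. Also, built-in function names are case sensitive in Pig, so it has to be COUNT rather than count as in the third attempt, and grouping by country (GROUP athletes BY country) would give the number of rows per country, not the number of distinct countries.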
