sql—如何正确使用“左联接”,并在3列上显示匹配数据并聚合一列?有大量数据时的最佳采样策略?

djmepvbi  于 2021-07-29  发布在  Java
关注(0)|答案(0)|浏览(229)

我在他们的网站界面上使用googlebigquery的gdelt数据库。我是sql新手,我想从gdelt数据库中找到全球不同政府领导人的“主题”平均语调。
我正在创建一个关于年份、国家和政府领导人的数据集。这是必要的,因为gdelt拥有所有的公众人物 Persons 列。我用它来过滤gdelt人员的数据点,这样我就可以只获得选定领导者的数据点。此数据集如下所示:

COUNTRY,LEADER,YEAR,
United States,Obama,2016,
United States,Trump,2017,
...,...,...

我想从gdelt数据库中找到匹配的国家、年份和人名,找到主题,并为该领导人/年份/国家总结该主题的基调。到目前为止,我已经编写了以下代码:

SELECT b.year, b.leader, b.country, a.location_fullname, a.persons, theme, AVG(tone) AS tone_avg
FROM (
  SELECT 
    COUNTRY AS country,
    LEADER AS leader,
    CAST(YEAR AS string) AS year
  FROM 
    `smiling-landing-268118.country_leaders.test`
    ) b
LEFT JOIN (
  SELECT
    locations,
    REGEXP_EXTRACT(themes,r'(^.[^,]+)') AS theme,
    CAST(REGEXP_EXTRACT(V2Tone,r'^([^,]*)') as NUMERIC) AS tone,
    CAST(REGEXP_EXTRACT(locations,r'^(?:[^#]*#){0}([^#]*)') AS NUMERIC) AS location_type,
    REGEXP_EXTRACT(locations,r'^(?:[^#]*#){1}([^#]*)') AS location_fullname,
    REGEXP_EXTRACT(locations,r'^(?:[^#]*#){2}([^#]*)') AS location_countrycode,
    REGEXP_EXTRACT(locations,r'^(?:[^#]*#){3}([^#]*)') AS location_adm1code,
    REGEXP_EXTRACT(locations,r'^(?:[^#]*#){4}([^#]*)') AS location_adm2code,
    REGEXP_EXTRACT(locations,r'^(?:[^#]*#){5}([^#]*)') AS location_latitude,
    REGEXP_EXTRACT(locations,r'^(?:[^#]*#){6}([^#]*)') AS location_longitude,
    REGEXP_EXTRACT(locations,r'^(?:[^#]*#){7}([^#]*)') AS location_featureid,
    REGEXP_EXTRACT(locations,r'^(?:[^#]*#){8}([^#]*)') AS location_characteroffset,
    substr(CAST(DATE AS string) , 1, 4) AS year,
    DocumentIdentifier,
    persons
  FROM 
    `gdelt-bq.gdeltv2.gkg_partitioned`,
    UNNEST(SPLIT(V2Locations,';')) AS locations,
    UNNEST(SPLIT(V2Themes,';')) AS themes,
    UNNEST(SPLIT(Persons,';')) AS persons
  WHERE 
    DATE >= 20150000051500
    ) a
ON a.year = b.year AND a.location_fullname = b.country AND a.persons = b.leader
GROUP BY a.location_fullname, b.year, b.leader, b.country, a.persons, a.year, theme
ORDER BY b.country DESC
LIMIT 500

我甚至不能显示任何结果。我想有些问题可能是:
聚合和分组操作不正确
连接未正确完成
select语句未正确使用
这个 ON 声明正在寻找一个完全匹配的,但没有一个。如何使其更具普遍性?
到目前为止,我已经尝试简化问题,不做任何聚合,只显示每个数据点。为此,我删除 , AVG(tone) AS tone_avg 以及 GROUP BY 在底部的声明。不过,我没有得到多少结果。
理想情况下,结果应采用以下列格式:

COUNTRY, LEADER, YEAR, THEME, AVERAGE TONE
"United States", "Obama", 2015, "economy", 2.00434
"United States", "Obama", 2015, "government_leadership", -4.5342
"United States", "Obama", 2016, "economy", 1.3412

编辑:
另外,我要计算1950-2020年的基调。在gdelt中有很多数据需要处理。我可以在sql中按年份随机抽取数据样本吗?

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题