我在他们的网站界面上使用googlebigquery的gdelt数据库。我是sql新手,我想从gdelt数据库中找到全球不同政府领导人的“主题”平均语调。
我正在创建一个关于年份、国家和政府领导人的数据集。这是必要的,因为gdelt拥有所有的公众人物 Persons
列。我用它来过滤gdelt人员的数据点,这样我就可以只获得选定领导者的数据点。此数据集如下所示:
COUNTRY,LEADER,YEAR,
United States,Obama,2016,
United States,Trump,2017,
...,...,...
我想从gdelt数据库中找到匹配的国家、年份和人名,找到主题,并为该领导人/年份/国家总结该主题的基调。到目前为止,我已经编写了以下代码:
SELECT b.year, b.leader, b.country, a.location_fullname, a.persons, theme, AVG(tone) AS tone_avg
FROM (
SELECT
COUNTRY AS country,
LEADER AS leader,
CAST(YEAR AS string) AS year
FROM
`smiling-landing-268118.country_leaders.test`
) b
LEFT JOIN (
SELECT
locations,
REGEXP_EXTRACT(themes,r'(^.[^,]+)') AS theme,
CAST(REGEXP_EXTRACT(V2Tone,r'^([^,]*)') as NUMERIC) AS tone,
CAST(REGEXP_EXTRACT(locations,r'^(?:[^#]*#){0}([^#]*)') AS NUMERIC) AS location_type,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){1}([^#]*)') AS location_fullname,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){2}([^#]*)') AS location_countrycode,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){3}([^#]*)') AS location_adm1code,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){4}([^#]*)') AS location_adm2code,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){5}([^#]*)') AS location_latitude,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){6}([^#]*)') AS location_longitude,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){7}([^#]*)') AS location_featureid,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){8}([^#]*)') AS location_characteroffset,
substr(CAST(DATE AS string) , 1, 4) AS year,
DocumentIdentifier,
persons
FROM
`gdelt-bq.gdeltv2.gkg_partitioned`,
UNNEST(SPLIT(V2Locations,';')) AS locations,
UNNEST(SPLIT(V2Themes,';')) AS themes,
UNNEST(SPLIT(Persons,';')) AS persons
WHERE
DATE >= 20150000051500
) a
ON a.year = b.year AND a.location_fullname = b.country AND a.persons = b.leader
GROUP BY a.location_fullname, b.year, b.leader, b.country, a.persons, a.year, theme
ORDER BY b.country DESC
LIMIT 500
我甚至不能显示任何结果。我想有些问题可能是:
聚合和分组操作不正确
连接未正确完成
select语句未正确使用
这个 ON
声明正在寻找一个完全匹配的,但没有一个。如何使其更具普遍性?
到目前为止,我已经尝试简化问题,不做任何聚合,只显示每个数据点。为此,我删除 , AVG(tone) AS tone_avg
以及 GROUP BY
在底部的声明。不过,我没有得到多少结果。
理想情况下,结果应采用以下列格式:
COUNTRY, LEADER, YEAR, THEME, AVERAGE TONE
"United States", "Obama", 2015, "economy", 2.00434
"United States", "Obama", 2015, "government_leadership", -4.5342
"United States", "Obama", 2016, "economy", 1.3412
编辑:
另外,我要计算1950-2020年的基调。在gdelt中有很多数据需要处理。我可以在sql中按年份随机抽取数据样本吗?
暂无答案!
目前还没有任何答案,快来回答吧!