Extracting all comments from a Reddit post using the API

olmpazwi · asked 2023-03-05 · in: Other

I am using the Pushshift API for Reddit: https://github.com/pushshift/api
From the documentation, I understand how to extract every comment containing the word "covid" left during a specific time window:

https://api.pushshift.io/reddit/search/comment?q=covid&after=3h&before=2h&size=1

The output looks like this:

{"data":[{"subreddit_id":"t5_2qh6p","author_is_blocked":false,"comment_type":null,"edited":false,"author_flair_type":"richtext","total_awards_received":0,"subreddit":"Conservative","author_flair_template_id":null,"id":"j98zf27","gilded":0,"archived":false,"collapsed_reason_code":null,"no_follow":false,"author":"VamboRoolOkay","send_replies":true,"parent_id":41917615743,"score":1,"author_fullname":"t2_7uxkru5f","all_awardings":[],"body":"I will never believe that election fraud wasn't a significant factor. Go ahead - call it a conspiracy theory. But I also maintained that Covid was lab-created. Truth is the Daughter of Time.","top_awarded_type":null,"author_flair_css_class":null,"author_patreon_flair":false,"collapsed":false,"author_flair_richtext":[{"e":"text","t":"Conservative"}],"is_submitter":false,"gildings":{},"collapsed_reason":null,"associated_award":null,"stickied":false,"author_premium":false,"can_gild":true,"link_id":"t3_116l7ct","unrepliable_reason":null,"author_flair_text_color":"dark","score_hidden":true,"permalink":"/r/Conservative/comments/116l7ct/kamala_harris_plans_on_running_with_biden_in_2024/j98zf27/","subreddit_type":"public","locked":false,"author_flair_text":"Conservative","treatment_tags":[],"created_utc":1676866031,"subreddit_name_prefixed":"r/Conservative","controversiality":0,"author_flair_background_color":"","collapsed_because_crowd_control":null,"distinguished":null,"retrieved_utc":1676866047,"updated_utc":1676866048,"body_sha1":"328df3784d15f77b98a84418c4ce720822227cfe","utc_datetime_str":"2023-02-20 04:07:11"}],"error":null,"metadata":{"es":{"took":98,"timed_out":false,"_shards":{"total":828,"successful":828,"skipped":824,"failed":0},"hits":{"total":{"value":573,"relation":"eq"},"max_score":null}},"es_query":{"size":1,"query":{"bool":{"must":[{"bool":{"must":[{"simple_query_string":{"fields":["body"],"query":"covid","default_operator":"and"}},{"range":{"created_utc":{"gte":1676862433000}}},{"range":{"created_utc":{"lt":1676866033000}}}]}}]}},"aggs":{},"sort":{"created_utc":"desc"}},"es_query2":"{\"size\":1,\"query\":{\"bool\":{\"must\":[{\"bool\":{\"must\":[{\"simple_query_string\":{\"fields\":[\"body\"],\"query\":\"covid\",\"default_operator\":\"and\"}},{\"range\":{\"created_utc\":{\"gte\":1676862433000}}},{\"range\":{\"created_utc\":{\"lt\":1676866033000}}}]}}]}},\"aggs\":{},\"sort\":{\"created_utc\":\"desc\"}}","api_launch_time":1673017478.254743,"api_request_start":1676873233.6143198,"api_request_end":1676873233.7406816,"api_total_time":0.12636184692382812}}
    • My question: suppose I identify a post containing the word "covid". Now I want to retrieve every comment on that post, whether or not it contains the word "covid". Is that possible?

For example, in the output above I can see:

  • link_id: t3_116l7ct
  • parent_id: 41917615743
    • Can I use this information to write an API query that retrieves all the comments on that post?

I tried the following query, but it returned empty results: https://api.pushshift.io/reddit/comment/search/?link_id=t3_116cjib
Thanks!

    • Note 1: could this be done with a "two-stage approach"? For example, stage one: identify posts where a comment containing the word "covid" was left. Stage two: extract all comments from those posts (whether or not they contain "covid").
    • Note 2: here is the R script I am currently using:
library(jsonlite)

# Query pieces: comments containing "trump", fetched in hourly windows
part1 <- "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 <- "h&before="
part3 <- "h&size=500"

results <- list()
for (i in 1:10) {
  tryCatch({
    # e.g. i = 1 gives after=2h&before=1h (a one-hour window)
    url_i <- paste0(part1, i + 1, part2, i, part3)
    r_i   <- fromJSON(url_i)

    results[[i]] <- data.frame(r_i$data$body, r_i$data$id,
                               r_i$data$parent_id, r_i$data$link_id)
    print(i)  # progress
  }, error = function(e) {})
}

final <- do.call(rbind.data.frame, results)

Can I modify this script to get the result I'm after?
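A minimal sketch of the two-stage idea from Note 1, assuming the `link_id` filter worked as documented. As noted above, that filter currently returns empty results, so this shows the intended shape only; the endpoint URLs and the `size=500` cap are taken from the script above.

```r
library(jsonlite)

# Stage 1: find comments mentioning "covid" in a one-hour window
# and collect the posts (link_ids) they belong to.
stage1   <- fromJSON("https://api.pushshift.io/reddit/search/comment/?q=covid&after=2h&before=1h&size=500")
link_ids <- unique(stage1$data$link_id)

# Stage 2 (intended shape only): pull every comment for each of those posts.
# The link_id filter was returning empty results at the time of writing.
all_comments <- lapply(link_ids, function(lid) {
  fromJSON(paste0("https://api.pushshift.io/reddit/comment/search/?link_id=",
                  lid, "&size=500"))$data
})
```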

pjngdqdw


In theory, I believe you would need several steps to accomplish this:
1. Identify comments mentioning COVID, as you are already doing:
https://api.pushshift.io/reddit/search/comment?q=covid&after=3h&before=2h&size=1
2. Identify the submission ID associated with each comment, which is embedded in the permalink:
/r/expats/comments/11bzdu2/…/ja2sk8m/

# also keep the permalink so the submission ID can be parsed out of it
results[[i]] <- data.frame(r_i$data$body, r_i$data$id, r_i$data$parent_id,
                           r_i$data$link_id, r_i$data$permalink)
# 5th path segment of "/r/<sub>/comments/<submission_id>/<slug>/<comment_id>/"
results[[i]]$sub_id <- sapply(results[[i]]$r_i.data.permalink,
                              function(x) strsplit(x, "/")[[1]][5])
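As a quick sanity check of the split used above, applied to the permalink from the sample JSON in the question:

```r
# permalink from the sample output shown in the question
p <- "/r/Conservative/comments/116l7ct/kamala_harris_plans_on_running_with_biden_in_2024/j98zf27/"

# splitting on "/" yields: "", "r", "Conservative", "comments", "116l7ct", ...
strsplit(p, "/")[[1]][5]   # "116l7ct"
```

Note that this is the same value as `link_id` with its `t3_` prefix stripped, so either field can be used to recover the submission ID.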

3. Identify all comment IDs associated with each submission ID. The endpoint below should work, but unfortunately the API appears to be broken:
https://api.pushshift.io/reddit/submission/comment_ids/11bzdu2
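If or when that endpoint comes back, calling it from R might look like the following sketch. The helper name is mine, and it assumes the endpoint returns its IDs under the usual `data` key; the endpoint itself was returning errors at the time of writing.

```r
library(jsonlite)

# Hypothetical helper: fetch all comment IDs for one submission ID.
# NOTE: this endpoint appeared to be broken when this answer was written.
get_comment_ids <- function(sub_id) {
  url <- paste0("https://api.pushshift.io/reddit/submission/comment_ids/",
                sub_id)
  fromJSON(url)$data
}

# ids <- get_comment_ids("11bzdu2")
```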
4. Retrieve the text for each comment ID:
https://api.pushshift.io/reddit/comment/search?ids=ja2sk8m
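Step 4 can be batched in R along these lines. The helper name is mine, and it assumes the `ids` parameter accepts a comma-separated list of comment IDs, which I have not been able to verify against the current API.

```r
library(jsonlite)

# Hypothetical helper: retrieve comment records for a batch of comment IDs.
# Assumes `ids` accepts a comma-separated list (unverified assumption).
get_comments <- function(ids) {
  url <- paste0("https://api.pushshift.io/reddit/comment/search?ids=",
                paste(ids, collapse = ","))
  fromJSON(url)$data
}

# comments <- get_comments(c("ja2sk8m"))
```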
Putting these together should accomplish what you're looking for. Unfortunately, since the current API is not working as intended, and the older listing functionality also appears to be broken (copying the solution shown here does not work because link_id is not accepted), it looks like what you're after may not be possible at the moment.
