Combining JSON and regular expressions in R

k4emjkb1 · asked 2023-03-20

I am learning how to use the Reddit API, and I am trying to figure out how to extract all of the comments from a specific post.

For example, consider this post: https://www.reddit.com/r/Homebrewing/comments/11dd5r3/worst_mistake_youve_made_as_a_homebrewer/
Using this R code, I think I was able to access the comments:

library(httr)
library(jsonlite)

# Set authentication parameters
auth <- authenticate("some-key1", "some_key2")

# Set user agent
user_agent <- "my_app/0.1"

# Get access token
response <- POST("https://www.reddit.com/api/v1/access_token",
                 auth,                      # config objects are passed positionally, not as named arguments
                 user_agent(user_agent),    # wrap the string in user_agent() so httr treats it as a config
                 body = list(grant_type = "password",
                             username = "abc123",
                             password = "123abc"))

# Extract access token from response
access_token <- content(response)$access_token

# Use access token to make API request
url <- "https://oauth.reddit.com/LISTING" # Replace "LISTING" with the subreddit or endpoint you want to access

headers <- c("Authorization" = paste("Bearer", access_token))
result <- GET(url, user_agent(user_agent), add_headers(.headers = headers)) # a named vector must go through .headers

post_id <- "11dd5r3"
url <- paste0("https://oauth.reddit.com/r/Homebrewing/comments/", post_id)

# Set the user agent string 
user_agent_string <- "MyApp/1.0"

# Set the authorization header 
authorization_header <- paste("Bearer ", access_token, sep = "")

# Make the API request 
response <- GET(url, add_headers(Authorization = authorization_header, `User-Agent` = user_agent_string))

# Extract the response content and parse 
response_json <- rawToChar(response$content)
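Rather than regex-matching the HTML fragments embedded in the response, a more robust approach is to parse the JSON itself with jsonlite and read each comment's body field directly. A minimal sketch, assuming the response follows Reddit's usual two-element comments listing (the first element describes the post, the second holds the comment tree):

```r
library(jsonlite)

# Parse the raw JSON text into nested R lists
parsed <- fromJSON(response_json, simplifyVector = FALSE)

# parsed[[1]] is the post itself; parsed[[2]] contains the comments
comments <- parsed[[2]]$data$children

# Extract the plain-text body of each top-level comment
bodies <- vapply(comments, function(x) {
  b <- x$data$body
  if (is.null(b)) NA_character_ else b  # "more" stubs carry no body
}, character(1))

bodies <- bodies[!is.na(bodies)]
```

The body field is the comment's plain markdown, so this avoids HTML entities entirely. Note this only collects top-level comments; replies are nested under each child's data$replies and would need a recursive walk.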

From here, it looks like all of the comments are stored between pairs of <p> and </p> tags:

  • <p>Reminds me of a chemistry professor I had in college, he taught a class on polymers (really smart guy, Nobel prize voter level). When talking about glass transition temperature he suddenly stopped and told a story about how a week or two beforehand he had put some styrofoam into the oven to keep the food warm while he waited for his wife to get home. It melted and that was his example on glass transition temperature. Basically: no matter how smart or trained you are, you can still make a mistake.</p>
  • <p>opening the butterfly valve on the bottom of a pressurized FV with a peanut butter chocolate milk stout in it. Made the inside of my freezer look like someone diarrhea&#39;d all over the inside of the door.</p>

Using this logic, I tried to keep only the text between these tags with a regex:

final = response_json[1]
matches <- gregexpr("<p>(.*?)</p>", final)
matches_text <- regmatches(final, matches)[[1]]

I think this code partially worked - but many of the returned entries are not comments:

[212] "<p>Worst mistake was buying malt hops and yeast and letting it go stale.</p>"
[213] "<p>Posts&#32;are&#32;automatically&#32;archived&#32;after&#32;6&#32;months.</p>"

Can someone please show me a better way to do this? How can I extract only the comment text and nothing else?

Thank you!

*Note: I am not sure whether this code extracts all of the comments on the post or only some of them - and whether there is a way to change that.
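On that follow-up question: a single request returns only part of the comment tree. The comments endpoint accepts limit and depth query parameters to ask for more per request, though Reddit still truncates very large threads into "more" placeholders that need further requests. A sketch (the specific values are illustrative, not documented maximums):

```r
# Ask for more comments and deeper nesting in one call
url <- paste0("https://oauth.reddit.com/r/Homebrewing/comments/", post_id,
              "?limit=500&depth=10")
response <- GET(url, add_headers(Authorization = authorization_header,
                                 `User-Agent` = user_agent_string))
```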


aemubtdh1#

If you want to use regex, you should probably try a pattern with lookarounds such as (?<=<p>).*?(?=</p>), for example:

> s <- "<p>xxxxx</p> <p>xyyyyyyyyy</p> <p>zzzzzzzzzzzz</p>"

> regmatches(s, gregexpr("(?<=<p>).*?(?=</p>)", s, perl = TRUE))[[1]]
[1] "xxxxx"        "xyyyyyyyyy"   "zzzzzzzzzzzz"
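Note that text captured this way still contains HTML entities such as &#39; and &#32;. One way to decode them, assuming the xml2 package is available, is to re-wrap each fragment in a tag, parse it as HTML, and read the text back out:

```r
library(xml2)

s <- "diarrhea&#39;d all over"

# Parse the fragment as HTML and extract its decoded text
decoded <- xml_text(read_html(paste0("<p>", s, "</p>")))
```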
