How do I web scrape the following college ranking data from WSJ in R?

Posted by 5f0d552i on 2023-10-13 in Other

I am trying to scrape the full Wall Street Journal ranking of 400 colleges using R. I am using the rvest package in R as follows:

library(rvest)
link <- "https://www.wsj.com/rankings/college-rankings/best-colleges-2024"
wsj <- read_html(link)

After this, I really do not know what to do next. The page's HTML source is very messy and hard to sift through. I think I have made some progress by doing:

wsj %>% html_elements("section")  %>% html_element("p")

But I am a novice, so I may not be on the right track. Any pointers would be greatly appreciated.

Source: https://stackoverflow.com/questions/77262487/how-do-i-web-scrape-the-following-college-ranking-data-from-wsj-in-r

1 Answer

zd287kbt #1

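All of the data on the page is stored as JSON inside a script node. After searching through the page source, here is a quick solution.
Once the script node has been retrieved, convert its contents to text and parse the JSON into a list. You will then need to read through that list to identify where the ranking data sits in its structure.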

library(rvest)
link <- "https://www.wsj.com/rankings/college-rankings/best-colleges-2024"
wsj <- read_html(link)

#Retrieve the script node whose id=__NEXT_DATA__
#convert to text
#then parse with the jsonlite library.
webdata <- wsj %>% html_elements("script[id='__NEXT_DATA__']") %>% 
                   html_text() %>% jsonlite::fromJSON()

#looking through the output the college rank data is stored here.
collegedata <- webdata$props$pageProps$collegeRankingsData

head(collegedata)
tail(collegedata)
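
The "read through the list" step can be tried offline before touching the real page. The following is only a minimal sketch, assuming a made-up HTML document that mimics the `__NEXT_DATA__` layout described above; `toy_html`, `toy_rankings`, and the two records are hypothetical and are not WSJ data, and the real page's structure may differ or change over time.

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-in for the real page: a tiny HTML document whose
# __NEXT_DATA__ script holds made-up records (not actual WSJ data).
toy_html <- '
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"collegeRankingsData": [
  {"rank": 1, "name": "College A"},
  {"rank": 2, "name": "College B"}
]}}}
</script>
</body></html>'

page <- read_html(toy_html)

# html_element() (singular) returns the first match, so html_text()
# yields a single JSON string.
json_text <- page %>%
  html_element("script[id='__NEXT_DATA__']") %>%
  html_text()

parsed <- fromJSON(json_text)

# Walk the nested list one level at a time to find where the data lives.
names(parsed)
names(parsed$props$pageProps)
str(parsed, max.level = 3)

# fromJSON() simplifies an array of uniform records into a data frame.
toy_rankings <- parsed$props$pageProps$collegeRankingsData
is.data.frame(toy_rankings)

# Save the result for later use (written to a temporary directory here).
out_file <- file.path(tempdir(), "toy_rankings.csv")
write.csv(toy_rankings, out_file, row.names = FALSE)
```

On the real page, the same `names()` and `str()` calls applied to `webdata` should help confirm whether `props$pageProps$collegeRankingsData` is still the right path.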

Answered 2023-10-13
