Unable to web scrape a table from CME Group's website in R

Posted by toiithl6 on 2023-04-09 in Other


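I am trying to web scrape this table from CME:

https://www.cmegroup.com/market-data/cme-group-benchmark-administration/term-sofr.html

However, read_html from xml2 and rvest (and the older html function) never returns anything. Can someone guide me on how to pull this table into an R data frame?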

Source: https://stackoverflow.com/questions/75933245/unable-to-web-scrape-a-table-from-cme-groups-website-in-r

2 Answers


tzcvj98z #1

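You can download the web page and then read the saved HTML file with read_html. Inspect the page to find the table's location (its XPath), then use html_element and html_table to extract the data.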

library(rvest)

page <- read_html("Term SOFR - CME Group.html")
xpath <- '//*[@id="main-content"]/div/div[5]/div/div[3]/div/div/div[1]/div[1]/table'
page %>% 
  html_element(xpath = xpath) %>% 
  html_table()
#> # A tibble: 6 × 10
#>   Date   CME T…¹ CME T…² CME T…³ CME T…⁴ Sofr …⁵ Sofr …⁶ Sofr …⁷ Sofr …⁸ Sofr …⁹
#>   <chr>  <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
#> 1 Date   1 Month 3 Month 6 Month 12 Mon… Overni… Index   30-Day… 90-Day… 180-Da…
#> 2 04 Ap… 4.82805 4.93736 4.94064 4.7373  -       1.0723… 4.67186 4.53349 4.13336
#> 3 03 Ap… 4.81043 4.92063 4.9201  4.75019 4.84    1.0721… 4.66213 4.52753 4.12316
#> 4 31 Ma… 4.80247 4.90855 4.89968 4.73451 4.87    1.0717… 4.63004 4.50833 4.09148
#> 5 30 Ma… 4.80341 4.89012 4.86581 4.69477 4.82    1.0716… 4.62101 4.50247 4.08105
#> 6 29 Ma… 4.80702 4.89833 4.86464 4.6642  4.83    1.0714… 4.61164 4.49651 4.07056
#> # … with abbreviated variable names ¹`CME Term Sofr (%)`, ²`CME Term Sofr (%)`,
#> #   ³`CME Term Sofr (%)`, ⁴`CME Term Sofr (%)`, ⁵`Sofr *`, ⁶`Sofr *`,
#> #   ⁷`Sofr Averages *`, ⁸`Sofr Averages *`, ⁹`Sofr Averages *`
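In the output above, every column is character, the sub-header row (1 Month, 3 Month, and so on) shows up as the first data row, and a missing value appears as "-". A minimal sketch of one way to tidy that in base R follows. The toy table only mimics the shape of the scraped result: `clean_rates` is a hypothetical helper, the dates `d1` and `d2` are placeholders (the real dates are truncated in the output), and `Other` stands in for a sub-header that is also truncated there.

```r
# Toy stand-in for the scraped table; a real run would pass the html_table() result.
raw <- data.frame(
  V1 = c("Date", "d1", "d2"),               # placeholder dates
  V2 = c("1 Month", "4.82805", "4.81043"),
  V3 = c("3 Month", "4.93736", "4.92063"),
  V4 = c("Other", "-", "4.84"),             # "Other" is a placeholder sub-header
  stringsAsFactors = FALSE
)
names(raw) <- c("Date", "CME Term Sofr (%)", "CME Term Sofr (%)", "Sofr *")

# Hypothetical helper: merge the two header levels and convert rates to numeric.
clean_rates <- function(df) {
  group_header <- names(df)
  sub_header <- unlist(df[1, ], use.names = FALSE)
  # Keep "Date" as is; otherwise join group header and sub-header.
  new_names <- ifelse(group_header == sub_header,
                      sub_header,
                      paste(group_header, sub_header))
  out <- df[-1, , drop = FALSE]              # drop the sub-header row
  names(out) <- new_names
  rownames(out) <- NULL
  # Every column except the first holds rates; "-" becomes NA.
  for (j in seq_along(out)[-1]) {
    x <- out[[j]]
    out[[j]] <- as.numeric(replace(x, x == "-", NA))
  }
  out
}

cleaned <- clean_rates(raw)
str(cleaned)
```

The date column is left as character here because the date format is not fully visible in the output; it could be parsed with as.Date once the format is known.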

2023-04-09

okxuctiv #2

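If you use RSelenium, you can still scrape the table from this page automatically. I think this approach is a lot more fun than manually copying and pasting the HTML or the table text :D
Here is how: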

# load libraries
library(RSelenium)
library(rvest)
library(magrittr)

# define target url
url <- "https://www.cmegroup.com/market-data/cme-group-benchmark-administration/term-sofr.html"

# start RSelenium ------------------------------------------------------------

rD <- rsDriver(browser="firefox", port=4550L, chromever = NULL)
remDr <- rD[["client"]]

# open the remote driver-------------------------------------------------------
remDr$open()

# Navigate to webpage -----------------------------------------------------
remDr$navigate(url)

# pull the webpage html
# then read it
page_html <- remDr$getPageSource()[[1]] %>% 
  read_html()

# Find all the tables on the page
tables <- page_html %>% html_table()

# save the first table in a new variable
CME_table <- tables[[1]]

It looks like this:

> CME_table
# A tibble: 6 × 10
  Date  CME T…¹ CME T…² CME T…³ CME T…⁴ Sofr …⁵ Sofr …⁶ Sofr …⁷
  <chr> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
1 Date  1 Month 3 Month 6 Month 12 Mon… Overni… Index   30-Day…
2 04 A… 4.82805 4.93736 4.94064 4.7373  -       1.0723… 4.67186
3 03 A… 4.81043 4.92063 4.9201  4.75019 4.84    1.0721… 4.66213
4 31 M… 4.80247 4.90855 4.89968 4.73451 4.87    1.0717… 4.63004
5 30 M… 4.80341 4.89012 4.86581 4.69477 4.82    1.0716… 4.62101
6 29 M… 4.80702 4.89833 4.86464 4.6642  4.83    1.0714… 4.61164
# … with 2 more variables: `Sofr Averages *` <chr>,
#   `Sofr Averages *` <chr>, and abbreviated variable names
#   ¹`CME Term Sofr (%)`, ²`CME Term Sofr (%)`,
#   ³`CME Term Sofr (%)`, ⁴`CME Term Sofr (%)`, ⁵`Sofr *`,
#   ⁶`Sofr *`, ⁷`Sofr Averages *`
# ℹ Use `colnames()` to see all variable names
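The RSelenium code above reads the page source right after navigating. If the table is filled in by JavaScript after the page loads (an assumption here, though it would explain why a plain read_html on the URL comes back empty), the source could be captured before the table exists. A minimal sketch of a polling helper in base R follows; `wait_for` and its parameters are hypothetical names, not part of RSelenium.

```r
# Hypothetical helper: call `check` repeatedly until it returns TRUE
# or `timeout` seconds have passed. Returns TRUE on success, FALSE on timeout.
wait_for <- function(check, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(check())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Stand-alone demo: a fake check that only succeeds on the third poll.
state <- new.env()
state$polls <- 0
ready <- wait_for(function() {
  state$polls <- state$polls + 1
  state$polls >= 3
}, timeout = 5, interval = 0.01)

# Possible use with the session above (not run here, needs a live browser):
# wait_for(function() length(remDr$findElements("css selector", "table")) > 0)
# page_html <- read_html(remDr$getPageSource()[[1]])
# remDr$close(); rD$server$stop()   # close the browser and stop the server
```

Polling with a timeout tends to be more robust than a fixed Sys.sleep, since it returns as soon as the table is present and still gives up if it never appears.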

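Some caveats with this approach:

1. RSelenium can be a bit tricky to set up.
2. This page does not want to be scraped, so if you scrape a lot from their website, I would take steps to avoid being blocked, such as adding delays to your code, adding random key presses and mouse movements, and specifying a different user agent.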
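The delay suggestion can be sketched as a small helper that sleeps for a random interval between requests. This is only an illustration in base R: `polite_pause` and its default bounds are hypothetical, and it does not cover the key presses, mouse movements, or user agent mentioned above.

```r
# Hypothetical helper: sleep for a random time between min_sec and max_sec,
# so requests are not fired at a fixed, machine-like rhythm.
polite_pause <- function(min_sec = 2, max_sec = 5) {
  stopifnot(min_sec >= 0, max_sec >= min_sec)
  delay <- runif(1, min_sec, max_sec)
  Sys.sleep(delay)
  invisible(delay)   # return the delay used, without printing it
}

# Short demo with tiny bounds so it finishes quickly.
used <- polite_pause(0.01, 0.02)

# Possible use in a scraping loop (not run here; `urls` is a hypothetical vector):
# for (u in urls) {
#   remDr$navigate(u)
#   polite_pause()
# }
```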

2023-04-09
