Using the rvest package to extract the content assigned to a specific var

9rygscc1  posted on 2023-09-27  in Other

I want to extract the following information from the link https://www.betashares.com.au/fund/high-interest-cash-etf/

I wrote the following code:

library(rvest)

link <- "https://www.betashares.com.au/fund/high-interest-cash-etf"
# take the 5th <script> found inside <div> elements and keep its text
read_html(link) %>% 
  html_nodes('div') %>% 
  html_nodes('script') %>%
  .[5] %>%
  html_text() -> data

When I try something similar to the approach in "In R extract a declared variable from html", like this:

library(V8)
ctx <- v8()
ctx$eval(data)
ctx$get("navdata")

I get an error. We could split the string on ";" and clean up the \t and \n characters, but is there an elegant way to handle this?


jgwigjjp1#

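This is a fairly heavy JS block with external dependencies, which is probably why evaluating the whole script in V8 fails. Since you only need a single line, you can extract it either by a hardcoded index or by locating var navdata. From there you can evaluate that assignment expression with V8: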

library(dplyr, warn.conflicts = FALSE)
library(rvest)
library(V8)
#> Using V8 engine 9.1.269.38
library(stringr)

link <- "https://www.betashares.com.au/fund/high-interest-cash-etf"

navdata_js <- 
  read_html(link) %>% 
  html_element("#performance > div:nth-child(5) > script:nth-child(8)") %>% 
  html_text() %>% 
  # read only a single line, the 5th
  readr::read_lines(skip = 4, n_max = 1)

# start:
str_trunc(navdata_js, 80) %>% str_view()
#> [1] │ {\t\t\t\t}var navdata = [["2012-03-06",50,100],["2012-03-07",49.9998,99.9995],["201...
# end:
str_trunc(navdata_js, 80, side = "left") %>% str_view()
#> [1] │ ...32,130.2664],["2023-09-21",50.189,130.2813],["2023-09-22",50.1947,130.2962]];

ctx <- v8()
ctx$eval(navdata_js)
ctx$get("navdata") %>% 
  head()
#>      [,1]         [,2]      [,3]      
#> [1,] "2012-03-06" "50"      "100"     
#> [2,] "2012-03-07" "49.9998" "99.9995" 
#> [3,] "2012-03-08" "50.003"  "100.0061"
#> [4,] "2012-03-09" "50.0099" "100.0198"
#> [5,] "2012-03-12" "50.0235" "100.047" 
#> [6,] "2012-03-13" "50.0271" "100.0542"
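
The hardcoded line index above may stop working if the page adds or removes lines in that script. The other option mentioned, locating var navdata by pattern, could look roughly like the sketch below. It is only an illustration: script_text is a made-up stand-in for the string returned by html_text(), and the surrounding lines in it are hypothetical.

```r
library(stringr)

# Hypothetical stand-in for the full script text returned by html_text();
# the real script is much longer and has different neighbouring lines.
script_text <- paste(
  "// some unrelated code",
  "\t\t\t\tvar navdata = [[\"2012-03-06\",50,100],[\"2012-03-07\",49.9998,99.9995]];",
  "var other = 1;",
  sep = "\n"
)

# Split the script into lines and keep the first one that declares navdata.
script_lines <- str_split(script_text, "\n")[[1]]
navdata_line <- script_lines[str_detect(script_lines, "var navdata\\s*=")][1]

# Grab everything from the opening [[ to the closing ]] and parse it as JSON.
navdata_json <- str_extract(navdata_line, "\\[\\[.*\\]\\]")
navdata <- jsonlite::parse_json(navdata_json, simplifyVector = TRUE)
```

Because the match is on the variable name rather than on a line number, this does not depend on where in the script the declaration sits, only on the array being written on a single line.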

Alternatively, extract the array string by removing the leading var navdata = and the trailing ;, then parse it as JSON:

str_extract(navdata_js, "(?<=var navdata \\= )[^;]+") %>% 
  jsonlite::parse_json(simplifyVector = TRUE) %>% 
  head()
#>      [,1]         [,2]      [,3]      
#> [1,] "2012-03-06" "50"      "100"     
#> [2,] "2012-03-07" "49.9998" "99.9995" 
#> [3,] "2012-03-08" "50.003"  "100.0061"
#> [4,] "2012-03-09" "50.0099" "100.0198"
#> [5,] "2012-03-12" "50.0235" "100.047" 
#> [6,] "2012-03-13" "50.0271" "100.0542"

Created on 2023-09-22 with reprex v2.0.2
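
Both approaches return a character matrix, because each inner array mixes a date string with numbers. If typed columns are needed, a conversion along these lines might be a reasonable next step. This is a minimal sketch: the sample matrix copies the first two rows of the output above, and the column names date, value_1 and value_2 are placeholders, since the page's own labels for the two numeric series are not shown here.

```r
# Sample in the same shape as the parsed navdata matrix (first two rows only).
navdata <- matrix(
  c("2012-03-06", "50",      "100",
    "2012-03-07", "49.9998", "99.9995"),
  ncol = 3, byrow = TRUE
)

# Convert each column to a suitable type; the column names are placeholders.
nav_df <- data.frame(
  date    = as.Date(navdata[, 1]),
  value_1 = as.numeric(navdata[, 2]),
  value_2 = as.numeric(navdata[, 3])
)

str(nav_df)
```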
