Web scraping an XHR request with httr and rvest does not work despite passing the token in the headers

q8l4jmvw  posted on 2022-12-30  in Other

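I am fairly new to web scraping. After searching, I found a couple of examples (this and this).
The code I am using to try to extract the data is shown below: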

library(httr)
library(rvest)
library(dplyr)

s <- session("https://www.barchart.com/stocks/highs-lows/highs")

cookies <- s$response$cookies
token <- URLdecode(dplyr::recode("XSRF-TOKEN", 
                                 !!!setNames(cookies$value, 
                                             cookies$name)))

pg <-GET(url="https://www.barchart.com/proxies/core-api/v1/quotes/get",
         add_headers(
                     Referer="https://www.barchart.com/stocks/highs-lows/highs",
                     `Accept`="application/json",
                     `Accept-Encoding`="gzip, deflate",
                     `Connection`="keep-alive",
                     `User-Agent`="Mozilla/5.0 (X11; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0",
                     `X-XSRF-TOKEN`=token
                    ),
         query=list(
                   lists="stocks.us.new_highs_lows.highs.overall.1y",
                   fields="symbol,symbolName,lastPrice,priceChange,percentChange,volume,highHits1y,highPercent1y,lowPercent1y,tradeTime,symbolCode,symbolType,hasOptions",
                   meta="field.shortName,field.type,field.description,lists.lastUpdate",
                   hasOptions="true",
                   page="1",
                   limit="100",
                   raw="1"
                   ),
         verbose()) -> res

data <- content(res, as = "text")
print(data)

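Ideally, I should get back some text containing a JSON object that I can parse (this is what I see when I inspect the request in DevTools).
I have spent quite some time scratching my head and still have no clue what is wrong. rvest no longer exposes the request_GET function, so the only option is httr::GET, which does not really work here.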
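The recode() call in the code above is just a lookup of the XSRF-TOKEN cookie value by name. A minimal base R sketch of the same lookup is shown here on a made-up cookie data frame; the `name` and `value` columns mirror `s$response$cookies`, while the helper name `get_xsrf_token` and the sample values are hypothetical.

```r
# Hypothetical helper: look up the XSRF-TOKEN cookie and URL-decode it.
# `cookies` is assumed to be a data frame with `name` and `value` columns,
# like the one in s$response$cookies.
get_xsrf_token <- function(cookies) {
  value <- cookies$value[cookies$name == "XSRF-TOKEN"]
  if (length(value) != 1) {
    stop("Expected exactly one XSRF-TOKEN cookie, found ", length(value))
  }
  utils::URLdecode(value)
}

# Made-up example data (not real cookies)
example_cookies <- data.frame(
  name  = c("session_id", "XSRF-TOKEN"),
  value = c("xyz123", "abc%3D"),
  stringsAsFactors = FALSE
)

token <- get_xsrf_token(example_cookies)
print(token)
```

This only makes the lookup easier to read; it does not change what is sent to the server.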

tkclm6bt  #1

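I am not an experienced web scraper by any means, but I spent some time trying to figure this out.
The API seems to require that cookies be sent with your request; otherwise you are denied access.
Note that when you run your code as-is, res$status_code is 401 (Unauthorized), which means you are not allowed to access the resource.
I had to use DevTools to inspect the web page, look at the Network tab, find the file that makes the API request, and then copy/paste the cookie string into R. I also added other headers while testing.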
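Because a 401 response still has a body, it can help to check the status before trying to parse anything. The sketch below assumes httr is installed; `parse_or_stop` is a hypothetical helper name, not part of httr.

```r
library(httr)

# Hypothetical helper: fail with a readable message on HTTP errors
# (status >= 400); otherwise return the parsed response body.
parse_or_stop <- function(res) {
  if (http_error(res)) {
    stop("Request failed: ", http_status(res)$message, call. = FALSE)
  }
  content(res, as = "parsed")
}

# Usage with the `res` object returned by GET():
# parsed <- parse_or_stop(res)
```

httr also provides stop_for_status(), which does a similar check in a single call.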

library(httr)
library(rvest)
library(dplyr)

s <- session("https://www.barchart.com/stocks/highs-lows/highs")

cookies <- s$response$cookies
token <- URLdecode(dplyr::recode("XSRF-TOKEN", 
                                 !!!setNames(cookies$value, 
                                             cookies$name)))

# go in your browser dev tools/inspect
# go to the 'Network' tab and look for the file in the screenshot below
# copy your very long cookie string here
cookie <- "your-very-long-cookie-string-goes-here"


pg <- GET(url = "https://www.barchart.com/proxies/core-api/v1/quotes/get",
         add_headers(
           `accept-encoding` = "gzip, deflate, br",
           `accept-language` = "en-US,en;q=0.9",
           `cache-control` = "no-cache",
           `cookie` = cookie,
           `pragma` = "no-cache",
           `referer` = "https://www.barchart.com/stocks/highs-lows/highs",
           `sec-ch-ua` = '"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"',
           `sec-ch-ua-mobile` = "?0",
           `sec-ch-ua-platform` = "macOS",
           `sec-fetch-dest` = "empty",
           `sec-fetch-mode` = "cors",
           `sec-fetch-site` = "same-origin",
           `user-agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
           `x-xsrf-token` = "eyJpdiI6ImpMNGNJZ2Y1ZnQ5bUM5UzVJRW1vaUE9PSIsInZhbHVlIjoibDBVeXVaKzZqUXc2dmVVVTRFVHd3K0NONlcvTzYrZ2ZwM3dMZWJpQldwMzU3SDlOVSt3RDNsa1dHeGdtaTNpRHd0SDhreldYTzEyelV6SXBNT2hOWGNYakR6djNqdDczb2FoWEtmTWE1LzYxVzR3anRQcGRTMmJCRVpJS1FlUWoiLCJtYWMiOiJmYWZjYjdjMGZjODBkZWJiNGI2OGE5MDQ5MGIyZjc2ZmQ4YzM2ZDk1ZTE4ZTUzMjQzYWE1OTQzOWRiNzZkMTE5In0="
             

         ),
         query=list(
           lists="stocks.us.new_highs_lows.highs.overall.1y",
           fields="symbol,symbolName,lastPrice,priceChange,percentChange,volume,highHits1y,highPercent1y,lowPercent1y,tradeTime,symbolCode,symbolType,hasOptions",
           meta="field.shortName,field.type,field.description,lists.lastUpdate",
           hasOptions="true",
           page="1",
           limit="100",
           raw="1"
         ),
         verbose()) -> res

data <- content(res, as = "text")
#> No encoding supplied: defaulting to UTF-8.

parsed <- content(res, as = "parsed")


# parsed result 

purrr::map(parsed$data, function(el) {
  purrr::map_df(el$raw, function(data) {
    return(data)
  })
}) %>% 
  bind_rows()

#> # A tibble: 54 × 13
#>    symbol symbolName     lastP…¹ price…² percen…³ volume highH…⁴ highP…⁵ lowPe…⁶
#>    <chr>  <chr>            <dbl>   <dbl>    <dbl>  <int>   <int>   <dbl>   <dbl>
#>  1 ACBA   Ace Global Bu…   11.0   0.36    0.0337  2.6 e3      31 -0.0036  0.0941
#>  2 ADMA   Adma Biologics    3.86  0.18    0.0489  3.09e6      40 -0.0153  2.06  
#>  3 ADRA   Adara Acquisi…   10.2   0.0300  0.00296 1.5 e3      65  0       0.0484
#>  4 AGFS   Agrofresh Sol…    2.96  0.01    0.0034  2.61e5      16 -0.0067  1.03  
#>  5 AKO.B  Embotell Andn…   14.5  -0.0150 -0.00104 1.46e5       9 -0.0236  0.503 
#>  6 AKRO   Akero Therape…   53.8   4.24    0.0855  1.05e6      23 -0.0024  6.16  
#>  7 AMBC   Ambac Financi…   17.1   0.44    0.0263  7.44e5       7 -0.0029  1.37  
#>  8 ARDX   Ardelyx Inc       2.53  0.02    0.008   3.52e7      14 -0.0524  4.16  
#>  9 ARYD   Arya Sciences…   10.1  -0.01   -0.0005  2.9 e3      25 -0.001   0.0402
#> 10 AURC   Aurora Acquis…   10.1   0.05    0.0045  2.07e4      13 -0.0005  0.0302
#> # … with 44 more rows, 4 more variables: tradeTime <int>, symbolCode <chr>,
#> #   symbolType <int>, hasOptions <lgl>, and abbreviated variable names
#> #   ¹​lastPrice, ²​priceChange, ³​percentChange, ⁴​highHits1y, ⁵​highPercent1y,
#> #   ⁶​lowPercent1y
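The nested purrr::map calls above turn each element's `raw` list into one row and then stack the rows. Below is a base R sketch of the same idea on a small made-up list; the shape (`data` holding elements that each have a `raw` list) is inferred from the code above, and the values are invented. It assumes no field is NULL, which as.data.frame() would not handle.

```r
# Made-up list shaped like the parsed response (values are not real quotes)
parsed_example <- list(data = list(
  list(symbol = "AAA",
       raw = list(symbol = "AAA", lastPrice = 1.5, hasOptions = TRUE)),
  list(symbol = "BBB",
       raw = list(symbol = "BBB", lastPrice = 2.25, hasOptions = FALSE))
))

# One single-row data frame per element, built from its `raw` fields
rows <- lapply(parsed_example$data, function(el) {
  as.data.frame(el$raw, stringsAsFactors = FALSE)
})

# Stack the rows into one data frame
quotes <- do.call(rbind, rows)
print(quotes)
```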

I tried building the cookie string from the cookies data frame, but for some reason the API did not accept it.
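For reference, a Cookie header is a single string of `name=value` pairs separated by `; `. The sketch below shows how such a string could be assembled from a data frame with `name` and `value` columns; `cookie_header` is a hypothetical helper and the sample values are made up. Whether a string built this way is enough for this API is not confirmed. One possible reason it might not be (an assumption) is that the browser holds cookies that a plain session() request never receives.

```r
# Hypothetical helper: collapse a cookie data frame into one header string.
# `cookies` is assumed to have `name` and `value` columns.
cookie_header <- function(cookies) {
  pairs <- paste0(cookies$name, "=", cookies$value)
  paste(pairs, collapse = "; ")
}

# Made-up example data (not real cookies)
example_cookies <- data.frame(
  name  = c("XSRF-TOKEN", "session_id"),
  value = c("abc%3D", "xyz123"),
  stringsAsFactors = FALSE
)

cookie_string <- cookie_header(example_cookies)
print(cookie_string)
```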
Here is the file from which you can copy the cookie string:
