html_elements() returns an empty vector; scraping an HTML table with rvest

hivapdat, posted on 2023-05-26 in Other

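I'm a complete newbie and am trying to scrape the following web page:
https://ec.europa.eu/taxation_customs/dds2/taric/quota_consultation.jsp?Lang=en&Origin=&Code=090008&Critical=&Status=&Year=2023&Expand=true
The site provides information on European tariff quotas.
Specifically, I'm after the contents near the bottom (Order number, Origins, Start date, End date, Balance) and the HTML table found on the [More info] page.
Here is my code: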

library(rvest)

url <- "https://ec.europa.eu/taxation_customs/dds2/taric/quota_consultation.jsp?Lang=en&Origin=&Code=090008&Critical=&Status=&Year=2023&Expand=true"

html <- url %>% url() %>% read_html() %>% html_elements("#overlayPanel")

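Unfortunately, I've tried various selectors for different classes and ids in the HTML (e.g. #quotaMarkedUpContainer), but the code doesn't return anything useful, just an empty vector.
Any help in understanding this problem is much appreciated.
Best wishes.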


Answer 1 (vsikbqxv)

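As noted in the comments, the results table is not included in the main page, but we can make the same request by replacing quota_consultation.jsp with quota_list.jsp in the URL.
There is also something wrong with connection handling: rvest will most likely fail even though the content is received. As a quick workaround, we can make the request with httr2 and direct the response to a file; this fails too, but we can at least recover the content.
Any comments and edits pointing to a cleaner solution are more than welcome.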
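The URL swap described above can be sketched in base R. This is only an illustration: `consultation_url` and `list_url` are names introduced here, and the code below in this answer uses a trimmed query string (empty parameters dropped, `Offset=0` added), so whether the server accepts the untrimmed version is an assumption.

```r
# URL from the question: the page that does not contain the results table
consultation_url <- paste0(
  "https://ec.europa.eu/taxation_customs/dds2/taric/quota_consultation.jsp",
  "?Lang=en&Origin=&Code=090008&Critical=&Status=&Year=2023&Expand=true"
)

# Swap the page name; fixed = TRUE treats the pattern as a literal string,
# so the "." in ".jsp" is not interpreted as a regex wildcard
list_url <- sub("quota_consultation.jsp", "quota_list.jsp",
                consultation_url, fixed = TRUE)
list_url
```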

library(rvest)
library(stringr)
library(dplyr)
library(httr2)

html <- read_html("https://ec.europa.eu/taxation_customs/dds2/taric/quota_list.jsp?Lang=en&Code=090008&Year=2023&Expand=true&Offset=0")

# results table:
html %>% html_element("table#quotaTable") %>%
  html_table() %>% 
  # lazy method to make valid names
  as_tibble(.name_repair = make.names) %>% 
  # squish all strings
  mutate(across(where(is.character), str_squish))
#> # A tibble: 1 × 6
#>   Order.number Origins             Start.date End.date   Balance           X    
#>          <int> <chr>               <chr>      <chr>      <chr>             <chr>
#> 1        90008 All third countries 01-01-2023 31-12-2023 17221000 Kilogram [Mor…

# extract the [More info.. ] link and fetch its content; this is where rvest and other
# curl-based libraries and tools get confused, probably something to do with
# remote server connection handling.
# A quick and dirty workaround is to save the content to a file and let rvest use that
tmp_quota_file <- tempfile(pattern = "quotalink_", fileext = ".html")

html_element(html, "a#quotaLink") %>% 
  html_attr("href") %>% 
  # httr2 request, will fail but we still have the content in tmp_quota_file
  request() %>% req_perform(path = tmp_quota_file)
#> Error:
#> ! Failure when receiving data from the peer
#> Backtrace:
#>     ▆
#>  1. ├─... %>% req_perform(path = tmp_quota_file)
#>  2. └─httr2::req_perform(., path = tmp_quota_file)
#>  3.   └─base::tryCatch(...)
#>  4.     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  5.       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  6.         └─value[[3L]](cond)

# open file location:
# if (interactive()) browseURL(dirname(tmp_quota_file))

quota_details <- read_html(tmp_quota_file) %>% 
  html_element("#quotaDetailsMarkedUpContainer > table") %>% 
  html_table(na.strings = "") %>% 
  as_tibble(.name_repair = make.names) %>% 
  # html_table isn't exactly best tool for parsing messy tables,
  # but we can clean it up a bit
  mutate(across(where(is.character), ~ str_replace_all(.x, "[\\s-]{2,}", ";")))
quota_details
#> # A tibble: 16 × 2
#>    X1                                          X2                               
#>    <chr>                                       <chr>                            
#>  1 Order number                                090008                           
#>  2 Validity period                             01-01-2023;31-12-2023            
#>  3 Origin                                      All third countries              
#>  4 Initial amount                              17221000;Kilogram                
#>  5 Amount                                      17221000;Kilogram                
#>  6 Balance                                     17221000;Kilogram                
#>  7 Transferred Amount                          <NA>                             
#>  8 Exhaustion date                             <NA>                             
#>  9 Critical                                    No                               
#> 10 Last import date                            <NA>                             
#> 11 Last allocation date                        <NA>                             
#> 12 Total awaiting allocation;(indicative)      0                                
#> 13 Blocking period                             <NA>                             
#> 14 Suspension period                           <NA>                             
#> 15 Allocated percentage at the exhaustion date 0                                
#> 16 Associated TARIC code                       0302 31 10 00;0302 32 10 00;0302…

# TARIC codes
strsplit(quota_details[[2]][grepl("TARIC", quota_details[[1]])], ";")
#> [[1]]
#>  [1] "0302 31 10 00" "0302 32 10 00" "0302 33 10 00" "0302 34 10 00"
#>  [5] "0302 35 11 00" "0302 35 91 00" "0302 36 10 00" "0302 39 20 00"
#>  [9] "0302 49 11 00" "0302 89 21 00" "0303 41 10 00" "0303 42 20 00"
#> [13] "0303 43 10 00" "0303 44 10 00" "0303 45 12 00" "0303 45 91 00"
#> [17] "0303 46 10 00" "0303 49 20 00" "0303 59 21 00" "0303 89 21 00"
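
Since the cleanup step joins multi-part cells with ";", the same split used for the TARIC codes should also work for the dates and amounts. A minimal sketch, using two rows copied from the output above rather than the live table; `quota_sample` and `get_parts()` are hypothetical names introduced for illustration.

```r
# Two rows copied from the quota_details output shown above
quota_sample <- data.frame(
  X1 = c("Validity period", "Balance"),
  X2 = c("01-01-2023;31-12-2023", "17221000;Kilogram"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up one field by its label and split it on ";"
get_parts <- function(df, field) {
  strsplit(df$X2[df$X1 == field], ";", fixed = TRUE)[[1]]
}

# Dates on the site are day-month-year
validity <- as.Date(get_parts(quota_sample, "Validity period"),
                    format = "%d-%m-%Y")

# The first part is the quantity, the second its unit
balance        <- get_parts(quota_sample, "Balance")
balance_amount <- as.numeric(balance[1])
balance_unit   <- balance[2]
```

With the real data, `quota_details` could be passed in place of `quota_sample`, assuming its columns are still named X1 and X2.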

Created on 2023-05-24 with reprex v2.0.2
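
Outside of reprex, the error raised by `req_perform()` would stop a script before the saved file is read. One way around that might be to catch the error and check whether anything was written. This is a sketch only: `fetch_to_file()` is a hypothetical helper, and it relies on the behavior reported in this answer, namely that the content ends up in the file even though the transfer fails.

```r
library(httr2)

# Hypothetical helper: perform the request, report (but do not re-raise)
# any error, and return TRUE only if `path` ended up with some content.
fetch_to_file <- function(url, path) {
  tryCatch(
    req_perform(request(url), path = path),
    error = function(e) message("Request failed: ", conditionMessage(e))
  )
  file.exists(path) && file.size(path) > 0
}

# Possible usage, where `quota_href` would be the href taken from a#quotaLink:
# tmp <- tempfile(pattern = "quotalink_", fileext = ".html")
# if (fetch_to_file(quota_href, tmp)) details_page <- rvest::read_html(tmp)
```

A non-empty file does not guarantee a complete document, so the parsed result is still worth inspecting.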
