Using tryCatch and rvest to deal with 404 and other crawling errors

Posted by 3wabscal on 2023-06-19 in Other


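When retrieving h1 titles with rvest, I sometimes run into 404 pages. This stops the process and returns this error:
Error in open.connection(x, "rb") : HTTP error 404.
See the example below: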

Data<-data.frame(Pages=c(
"http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
"http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facdddebook.html"))

Code used to retrieve the h1:

library(rvest)
sapply(Data$Pages, function(url){
 url %>%
 as.character() %>% 
 read_html() %>% 
 html_nodes('h1') %>% 
 html_text()
 })

Is there a way to include an argument that ignores errors and lets the process continue?

Source: https://stackoverflow.com/questions/38114066/using-trycatch-and-rvest-to-deal-with-404-and-other-crawling-errors

3 Answers


vlju58qv1#

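You're looking for try or tryCatch, which are how R handles error catching.
With try, you just wrap the thing that might fail in try(), and it will return the error and keep running: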

library(rvest)

sapply(Data$Pages, function(url){
  try(
    url %>%
      as.character() %>% 
      read_html() %>% 
      html_nodes('h1') %>% 
      html_text()
  )
})

# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"                                         
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"                               
# [4] "Error in open.connection(x, \"rb\") : HTTP error 404.\n"

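However, while this gets everything, it also inserts bad data into our results. tryCatch lets you configure what happens when an error is raised, by passing it a function to run when that condition arises: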

sapply(Data$Pages, function(url){
  tryCatch(
    url %>%
      as.character() %>% 
      read_html() %>% 
      html_nodes('h1') %>% 
      html_text(), 
    error = function(e){NA}    # a function that returns NA regardless of what it's passed
  )
})

# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"                                         
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"                               
# [4] NA

There, much better.
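To compare the two approaches side by side without hitting the network, here is a minimal sketch. The function fetch_title and the example.com URLs are hypothetical stand-ins for the read_html() pipeline above, not part of rvest. It shows that a try() failure can be detected afterwards through its "try-error" class, while tryCatch() replaces the failure at the source; vapply() is used here as one way to keep the result a plain character vector.

```r
# Hypothetical stand-in for read_html() %>% html_nodes() %>% html_text(),
# so the example runs without network access.
fetch_title <- function(url) {
  if (grepl("missing", url)) stop("HTTP error 404.")
  paste("Title of", url)
}

urls <- c("http://example.com/a", "http://example.com/missing")

# try(): a failure comes back as an object of class "try-error"
with_try <- lapply(urls, function(u) try(fetch_title(u), silent = TRUE))
failed <- vapply(with_try, inherits, logical(1), what = "try-error")
failed
# [1] FALSE  TRUE

# tryCatch(): the failure is replaced by NA where it happens
with_trycatch <- vapply(
  urls,
  function(u) tryCatch(fetch_title(u), error = function(e) NA_character_),
  character(1),
  USE.NAMES = FALSE
)
is.na(with_trycatch)
# [1] FALSE  TRUE
```

Returning NA_character_ rather than a bare NA is a design choice that lets vapply() check every element is a single string.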

Update

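In the tidyverse, the purrr package provides two functions, safely and possibly, which work similarly to try and tryCatch. They are adverbs, not verbs, which means they take a function, modify it so that it handles errors, and return a new function (not a data object) that can then be called. Example: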

library(tidyverse)
library(rvest)

df <- Data %>% rowwise() %>%     # Evaluate each row (URL) separately
    mutate(Pages = as.character(Pages),    # Convert factors to character for read_html
           title = possibly(~.x %>% read_html() %>%    # Try to take a URL, read it,
                                html_nodes('h1') %>%    # select header nodes,
                                html_text(),    # and collect text inside.
                            NA)(Pages))    # If error, return NA. Call modified function on URLs.

df %>% select(title)
## Source: local data frame [4 x 1]
## Groups: <by row>
## 
## # A tibble: 4 × 1
##                                                                                        title
##                                                                                        <chr>
## 1 'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages
## 2                                          OMG, this Japanese Trump Commercial is everything
## 3                                Omar Mateen posted to Facebook during Orlando mass shooting
## 4                                                                                       <NA>
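To make the "adverb" idea concrete, here is a rough base-R sketch of functions that take a function and return an error-handling version of it. The names my_possibly, my_safely and fetch_title are hypothetical and exist only for illustration; this is not how purrr implements possibly and safely, only an imitation of their visible behavior (possibly substitutes a default value, safely returns a list with result and error components).

```r
# Hypothetical base-R imitations of purrr's adverbs, for illustration only.
my_possibly <- function(f, otherwise) {
  # Returns a new function that yields `otherwise` when f() fails
  function(...) tryCatch(f(...), error = function(e) otherwise)
}

my_safely <- function(f) {
  # Returns a new function that always yields list(result, error)
  function(...) {
    tryCatch(
      list(result = f(...), error = NULL),
      error = function(e) list(result = NULL, error = e)
    )
  }
}

# Hypothetical stand-in for the scraping pipeline (no network needed)
fetch_title <- function(url) {
  if (grepl("missing", url)) stop("HTTP error 404.")
  paste("Title of", url)
}

quiet_fetch <- my_possibly(fetch_title, otherwise = NA_character_)
quiet_fetch("http://example.com/missing")
# [1] NA

safe_fetch <- my_safely(fetch_title)
res <- safe_fetch("http://example.com/missing")
is.null(res$result)
# [1] TRUE
conditionMessage(res$error)
# [1] "HTTP error 404."
```

The safely style is useful when you want to keep the error for later inspection; the possibly style is simpler when a placeholder value is enough.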


lawou6xi2#

You can see an explanation of this in the question linked here:

library(rvest)

urls <- c(
    "http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
    "http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
    "http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html",
    "http://boingboing.net/2016/06/16/omar-mateen-posted-to-facdddebook.html")

readUrl <- function(url) {
    out <- tryCatch(
        {
            # Expression to attempt; its value is returned on success
            message("This is the 'try' part")
            url %>% as.character() %>% read_html() %>% html_nodes('h1') %>% html_text()
        },
        error = function(cond) {
            # Runs only if the expression above raises an error
            message(paste("URL does not seem to exist:", url))
            message("Here's the original error message:")
            message(cond)
            # Value that tryCatch() returns in case of an error
            return(NA)
        }
    )
    return(out)
}

# One list element per URL: the h1 text, or NA for pages that failed
y <- lapply(urls, readUrl)
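tryCatch also accepts a warning handler and a finally expression alongside the error handler. The sketch below shows all three together; fetch_title, read_url_verbose and the example.com URLs are hypothetical stand-ins so that it runs without network access, and the "slow" warning is invented purely to trigger the warning branch.

```r
# Hypothetical stand-in for the scraping pipeline (no network needed)
fetch_title <- function(url) {
  if (grepl("missing", url)) stop("HTTP error 404.")
  if (grepl("slow", url)) warning("connection was slow")
  paste("Title of", url)
}

read_url_verbose <- function(url) {
  tryCatch(
    fetch_title(url),
    warning = function(cond) {
      # Catching a warning also abandons the expression, so the title is lost
      message("Warning for ", url, ": ", conditionMessage(cond))
      NA_character_
    },
    error = function(cond) {
      message("Error for ", url, ": ", conditionMessage(cond))
      NA_character_
    },
    # Evaluated on the way out, whether or not a condition was caught
    finally = message("Processed ", url)
  )
}

titles <- vapply(
  c("http://example.com/a", "http://example.com/slow", "http://example.com/missing"),
  read_url_verbose, character(1), USE.NAMES = FALSE
)
is.na(titles)
# [1] FALSE  TRUE  TRUE
```

Note that catching warnings this way discards the result of the call; if a page that only triggers a warning should still be kept, leave the warning handler out.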


cqoc49vn3#

I'd like to add a simple solution that I found elsewhere:

tryCatch(read_html('http://tweg.com'), 
         error = function(e){'empty page'})    # just return "empty page"
#> [1] "empty page"

It works well for me.
I also use it like this:

page <- NULL
url <- 'http://tweg.com'
tryCatch(page <- read_html(url), 
         error = function(e){'empty page'})

if (!is.null(page)) {
  # block of code that uses page
}
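A possible variant of the same pattern assigns the value of tryCatch directly and lets the error handler return NULL, so that no placeholder string is produced and thrown away. In this sketch fake_read_html is a hypothetical stand-in for read_html(), used only so the example runs offline.

```r
# Hypothetical stand-in for read_html() (no network needed)
fake_read_html <- function(url) {
  if (grepl("tweg", url)) stop("could not resolve host")
  list(url = url)
}

url <- "http://tweg.com"

# page is either the parsed document or NULL if reading failed
page <- tryCatch(fake_read_html(url), error = function(e) NULL)

if (!is.null(page)) {
  # block of code that uses page
  message("Parsed ", url)
} else {
  message("Skipping ", url)
}
```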
