Error "length(url) == 1 is not TRUE" when web scraping

gojuced7, posted on 2023-09-27 in Other

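I have the following link: https://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php?estaciones=109045&CBOFiltro=202112&t_e=M&estado=DIFERIDO&cod_old=154107&cate_esta=PLU&alt=2266 and I want to scrape it with R. The goal is to replace the last two digits of "202112" in the URL with "01", "02", "03" and so on up to "12", and then download the data from each page by automatically clicking the download button at the bottom. I have the code below, but it produces the error shown after it.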

library(RSelenium)

download_senamhi_data <- function(url_list) {
  
  # RANDOM PORT NUMBER
  port <- as.integer(runif(1, min = 5000, max = 6000))
  
  # START THE GOOGLE CHROME DRIVER
  rD <- rsDriver(port = port, browser = "chrome", 
                 chromever = "101.0.4951.15")
  
  remDrv <- rD$client
  
  for (url in url_list){
  
  # NAVIGATE TO THE URL
  remDrv$navigate(url)
  
  # FIND THE DOWNLOAD BUTTON
  down_button <- remDrv$findElement(using = "id", "export2")
  down_button$clickElement()
  
  }
  
  # CLOSE THE CURRENT SESSION
  remDrv$close()
  rD$server$stop()
  rm(rD, remDrv)
  gc()

}

# RUN THE FUNCTION TO DOWNLOAD EVERY MONTH OF A YEAR

list_url <- list()

for (i in 1:9) {
  
  list_url[i] = paste("https://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php?estaciones=109045&CBOFiltro=20210",
               i, "&t_e=M&estado=DIFERIDO&cod_old=154107&cate_esta=PLU&alt=2266", sep = "")
  
 }

for (i in 10:12) {
  
  list_url[i] = paste("https://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php?estaciones=109045&CBOFiltro=2021",
               i, "&t_e=M&estado=DIFERIDO&cod_old=154107&cate_esta=PLU&alt=2266", sep = "")
  
}

download_senamhi_data(list_url)

Error in checkError(res) : 
Undefined error in httr call. httr output: length(url) == 1 is not TRUE

h43kikqp1#

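If {RSelenium} is not a strict requirement here, we can extract those tables from a list of URLs with {rvest}. The site's CSV export is no better anyway: it just converts the HTML table to CSV with client-side JavaScript.
Here I use purrr::map to iterate over the URLs instead of a for loop: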

library(rvest)
library(dplyr)
library(purrr)

# build a vector of 12 urls
urls <- paste0("https://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php",
               "?estaciones=109045&CBOFiltro=2021", sprintf("%.2d", 1:12),
               "&t_e=M&estado=DIFERIDO&cod_old=154107&cate_esta=PLU&alt=2266")

# read content of all urls,
# from each page extract dataTable,
# parse table content,
# bind list of tibbles (1 per each month) into one,
# filter out header rows (first column is not a date string),
# set column names,
# convert date strings to dates and measurements to numeric

df <- urls %>% 
  map(read_html, .progress = TRUE) %>% 
  map(html_element, "table#dataTable") %>% 
  map(html_table) %>% 
  bind_rows() %>% 
  filter(grepl("^\\d{4}-\\d{2}-\\d{2}$",X1)) %>% 
  set_names(c("date", "temp_max", "temp_min", "hum_rel", "prec")) %>% 
  mutate(date = lubridate::ymd(date)) %>% 
  mutate(across(temp_max:prec, as.numeric))

# save as csv:
readr::write_csv(df, "out.csv")

The result is a 365-row dataset covering the whole year:

df
#> # A tibble: 365 × 5
#>    date       temp_max temp_min hum_rel  prec
#>    <date>        <dbl>    <dbl>   <dbl> <dbl>
#>  1 2021-01-01       NA       NA      NA   0  
#>  2 2021-01-02       NA       NA      NA   1.2
#>  3 2021-01-03       NA       NA      NA   1.9
#>  4 2021-01-04       NA       NA      NA   0  
#>  5 2021-01-05       NA       NA      NA   6  
#>  6 2021-01-06       NA       NA      NA   1.9
#>  7 2021-01-07       NA       NA      NA   4.2
#>  8 2021-01-08       NA       NA      NA   2.5
#>  9 2021-01-09       NA       NA      NA   0  
#> 10 2021-01-10       NA       NA      NA   0  
#> # ℹ 355 more rows
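
To see what the filtering and conversion steps do without hitting the website, here is a minimal base-R sketch on a toy table. The header labels and the "S/D" placeholder for missing readings are assumptions made up for illustration, not values confirmed from the site; only the date pattern and the final column names come from the pipeline above.

```r
# Toy stand-in for one parsed month. Column names X1..X5 mimic what a
# header-less HTML table typically yields; the header row text and the
# "S/D" missing-value marker are hypothetical.
raw <- data.frame(
  X1 = c("DATE", "2021-01-01", "2021-01-02", "2021-01-03"),
  X2 = c("TEMP MAX", "S/D", "S/D", "S/D"),
  X3 = c("TEMP MIN", "S/D", "S/D", "S/D"),
  X4 = c("HUM REL", "S/D", "S/D", "S/D"),
  X5 = c("PREC", "0.0", "1.2", "1.9"),
  stringsAsFactors = FALSE
)

# Keep only rows whose first column looks like an ISO date (YYYY-MM-DD);
# this drops header rows that were parsed as data.
is_data_row <- grepl("^\\d{4}-\\d{2}-\\d{2}$", raw$X1)
clean <- raw[is_data_row, ]
rownames(clean) <- NULL

# Same column names as in the pipeline above.
names(clean) <- c("date", "temp_max", "temp_min", "hum_rel", "prec")

# Convert date strings to Date and measurements to numeric.
# Non-numeric markers become NA; the coercion warning is silenced on purpose.
clean$date <- as.Date(clean$date, format = "%Y-%m-%d")
measure_cols <- c("temp_max", "temp_min", "hum_rel", "prec")
clean[measure_cols] <- lapply(
  clean[measure_cols],
  function(x) suppressWarnings(as.numeric(x))
)

str(clean)
```

Anything that cannot be parsed as a number ends up as NA, which would be consistent with the NA columns in the output above if the station reports only precipitation.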

Created on 2023-09-22 with reprex v2.0.2
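
As a side note on the URL construction: the two loops in the question (one for months 1 to 9, one for 10 to 12) can be collapsed into a single vectorised call, because `sprintf()` can zero-pad the month. The sketch below wraps this in a helper; `build_month_urls` and its arguments are hypothetical names introduced here for illustration, and the query parameters are copied from the URL in the question. It only builds the strings and makes no claim about the cause of the RSelenium error.

```r
# Base endpoint taken from the URL in the question.
base_url <- "https://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php"

# Hypothetical helper: returns one URL per month of `year` as a plain
# character vector (not a list). Defaults mirror the question's URL.
build_month_urls <- function(year, station = "109045", cod_old = "154107") {
  # "%02d" pads the month to two digits: 1 -> "01", 12 -> "12".
  year_month <- sprintf("%d%02d", year, 1:12)
  paste0(
    base_url,
    "?estaciones=", station,
    "&CBOFiltro=", year_month,
    "&t_e=M&estado=DIFERIDO",
    "&cod_old=", cod_old,
    "&cate_esta=PLU&alt=2266"
  )
}

urls_2021 <- build_month_urls(2021)
length(urls_2021)
urls_2021[1]
```

Returning a character vector rather than a list keeps each element a single string, so it can be passed straight to `purrr::map()` or iterated with a for loop.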
