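**Problem:** I am trying to scrape several tables, but the tables I get back contain the message "It appears your browser may be outdated...".
**Attempted fix:** I tried adding a user_agent argument to read_html() to get around this, but it does not seem to change the result.
**Question:** How can I use a user_agent call to get past the outdated-browser message? Have I put the user_agent call in the wrong place in the function?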
library(dplyr)
library(tidyverse)
library(janitor)
library(rvest)
library(magrittr)
library(purrr)
library(openxlsx)
#leaderboard links
df6 <- expand.grid(
tournament_id = c("the-american-express","wm-phoenix-open","farmers-insurance-open"),
year_id = c("2004", "2005", "2006")
) %>%
mutate(
links = paste0(
'https://www.pgatour.com/tournaments/',
tournament_id,
"/past-results.",
year_id,
'.html'
)
) %>%
as_tibble()
#Scrape function
get_info <- function(link, tournament) {
link %>%
read_html(, user_agent ="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36") %>%
html_table() %>%
.[[1]] %>%
clean_names() %>%
mutate(tournament = tournament)
}
#retrieve data
test501 <- df6 %>%
mutate(tables = map2(links, tournament_id, possibly(get_info, otherwise = tibble())))
test501 <- test501 %>%
unnest(everything())
test501
1 Answer
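Use the Network tab of your browser's developer tools to check where the data actually comes from. You need a different URL structure, followed by some cleanup of the columns and rows. I have not cleaned everything up completely, but this gives plenty of examples.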
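
On the user_agent part of the question: read_html() has no user_agent argument, so the header has to be set on the HTTP request itself. The following is only a minimal sketch of that idea, not the code from this answer. The helper names fetch_page, first_table and get_info2 are hypothetical, the user-agent string is copied from the question, and the URL found in the Network tab is not reproduced here, so `link` stays a parameter. It also assumes the first table on the page is the one you want.

```r
library(httr)
library(rvest)
library(dplyr)
library(janitor)

# User-agent string copied from the question (an assumption, any current one works).
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"

# Hypothetical helper: request the page with a custom User-Agent header,
# then parse the response body as HTML.
fetch_page <- function(link, agent = ua) {
  resp <- GET(link, user_agent(agent))
  stop_for_status(resp)  # fail loudly on 4xx/5xx instead of parsing an error page
  read_html(content(resp, as = "text", encoding = "UTF-8"))
}

# Hypothetical helper: take the first table of a parsed page and tidy it.
first_table <- function(page, tournament) {
  page %>%
    html_element("table") %>%                    # first <table> only
    html_table() %>%
    clean_names() %>%                            # snake_case column names
    remove_empty(which = c("rows", "cols")) %>%  # drop all-NA rows and columns
    mutate(tournament = tournament)
}

# Drop-in replacement for get_info() from the question (no request is made
# until it is called).
get_info2 <- function(link, tournament) {
  first_table(fetch_page(link), tournament)
}

# Offline check of the cleaning step on a tiny made-up table.
demo <- minimal_html(
  "<table>
     <tr><th>Player Name</th><th>Total Score</th></tr>
     <tr><td>A. Golfer</td><td>270</td></tr>
   </table>"
)
demo_tbl <- first_table(demo, "the-american-express")
```

get_info2 can be passed to map2() with possibly() exactly as get_info is in the question. If the page still returns the outdated-browser message with the header set, the table is probably loaded by a separate request, which is why checking the Network tab for the real data URL matters more than the user agent.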