Adding user_agent to read_html inside a function

syqv5f0l  posted on 2022-12-06 in Other

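**Problem:** I am trying to scrape multiple tables, but the tables I get back contain the message "It appears your browser may be outdated...".
**Attempted fix:** I tried adding a user_agent call inside read_html() to get around this, but it does not seem to change the final result.
**Question:** How can I use a user_agent call to get around the outdated-browser message? Did I put the user_agent call in the wrong place in the function?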

library(dplyr)
library(tidyverse)
library(janitor)
library(rvest)
library(magrittr)
library(purrr)
library(openxlsx)

#leaderboard links
df6 <- expand.grid(
  tournament_id = c("the-american-express","wm-phoenix-open","farmers-insurance-open"),
  year_id = c("2004", "2005", "2006")
) %>% 
  mutate(
    links = paste0(
      'https://www.pgatour.com/tournaments/',
      tournament_id,
      "/past-results.",
      year_id,
      '.html'
    )
  ) %>% 
  as_tibble()

#Scrape function
get_info <- function(link, tournament) {
  link %>%
    read_html(, user_agent ="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36")   %>%
  html_table() %>%
    .[[1]] %>%
    clean_names() %>%
    mutate(tournament = tournament)
}

#retrieve data
test501 <- df6 %>%
  mutate(tables = map2(links, tournament_id, possibly(get_info, otherwise = tibble())))

test501 <- test501 %>% 
  unnest(everything())

test501

Answer 1 (mrwjdhj3)

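Use the Network tab of your browser's developer tools to check where the data actually comes from. You need a different URL structure, followed by some column and row cleanup. I have not cleaned everything up completely, but this gives plenty of examples.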

library(tidyverse)
library(janitor)
library(rvest)

# leaderboard links
df6 <- expand.grid(
  tournament_id = c("the-american-express", "wm-phoenix-open", "farmers-insurance-open"),
  year_id = c("2004", "2005", "2006")
) %>%
  mutate(
    links = paste0(
      "https://www.pgatour.com/tournaments/",
      tournament_id,
      "/past-results/jcr:content/mainParsys/pastresults.selectedYear.",
      year_id,
      ".html"
    )
  ) %>%
  as_tibble()

# Scrape function
get_info <- function(link, tournament) {
  link %>%
    read_html() %>%
    html_element("[data-display-rounds]") %>%
    html_table(trim = TRUE) %>%
    clean_names() %>%
    mutate(tournament = tournament)
}

# retrieve data
test501 <- df6 %>%
  mutate(tables = map2(links, tournament_id, possibly(get_info, otherwise = tibble())))

test501 <- test501 %>%
  unnest(everything())

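# Drop rows whose player column contains "PLAYER", then keep only the
# first space-separated token of each rounds column, row by row
test501 <- filter(test501, !grepl("PLAYER", player)) %>%
  mutate(across(starts_with("rounds"), ~ map_chr(str_split(trimws(.x), " "), 1)))

# Keep only the last space-separated token of the position column
test501$pos <- map_chr(str_split(test501$pos, " "), ~ tail(.x, 1))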
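
On the original question of where the user agent goes: `user_agent` is not a documented argument of `read_html()`, so one common approach is to send the request with httr and pass the response body to `read_html()`. The following is a minimal sketch, assuming the httr package is installed. The helper name `read_html_with_ua` and the default user-agent string are illustrative and not part of the original answer.

```r
library(httr)
library(rvest)

# Hypothetical helper: fetch a page with a custom User-Agent header, then parse it.
read_html_with_ua <- function(url,
                              ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)") {
  # httr::user_agent() sets the User-Agent header for this request
  resp <- httr::GET(url, httr::user_agent(ua))
  # Turn HTTP 4xx/5xx responses into an R error
  httr::stop_for_status(resp)
  # Parse the response body as HTML
  rvest::read_html(httr::content(resp, as = "text", encoding = "UTF-8"))
}
```

Inside `get_info()`, `link %>% read_html()` would then become `link %>% read_html_with_ua()`. This only changes the request header. Whether the site then returns the table is a separate matter, which is presumably why the answer switches to a different URL instead.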
