Rvest web scraping returns character(0)

Asked by yhived7q on 2023-04-09

I have done web scraping with rvest a few times, but I have never had a scrape come back as character(0) from a request. Is this a sign that the website is blocking me from scraping its data, or is the content loaded by some kind of JavaScript/JSON query?

library(rvest)
library(robotstxt)

##checking the website Rvest and Robotstxt
paths_allowed("https://www.ratemyprofessors.com/search/teachers?query=*&sid=668htm")
njit <- read_html("https://www.ratemyprofessors.com/search/teachers?query=*&sid=668htm")

##Checking file type
class(njit)

##extracting professor names
prof <- njit %>%
  xml_nodes(".cJdVEK") %>%
  html_text2()
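A quick way to see why the request comes back as character(0): rvest only receives the server's pre-JavaScript HTML, in which the professor-card class does not yet exist. A minimal sketch using a stand-in document (the real page's markup is more complex, but the effect is the same):

```r
library(rvest)

# Stand-in for what the server actually sends: an empty app shell that
# JavaScript later fills in with the search results
server_html <- minimal_html("<div id='root'></div>")

# The professor-name class only exists after the browser runs the site's
# JavaScript, so the selector matches nothing in the raw HTML
server_html %>%
  html_elements(".cJdVEK") %>%
  html_text2()
#> character(0)
```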
Answer 1 (gab6jxml):

I assume you want to pull all of the professors associated with New Jersey Institute of Technology? I scraped one page at this link: https://www.ratemyprofessors.com/search/teachers?query=*&sid=668 (the original link minus the trailing "htm").
Because the page uses JavaScript to return content, the HTML rvest sees differs from the HTML a user sees. In addition, results load dynamically as the user scrolls down. Below is one approach that uses RSelenium to automate a web browser and keep scrolling until it has found all 1,000 or so professors at the university:

# load libraries
library(RSelenium)
library(rvest)
library(magrittr)
library(readr)

# define target url
url <- "https://www.ratemyprofessors.com/search/teachers?query=*&sid=668"

# start RSelenium ------------------------------------------------------------

rD <- rsDriver(browser="firefox", port=4550L, chromever = NULL)
remDr <- rD[["client"]]

# open the remote driver-------------------------------------------------------
# If it doesn't open automatically:
remDr$open()

# Navigate to webpage -----------------------------------------------------
remDr$navigate(url)

# Close "this site uses cookies" button
remDr$findElement(using = "css",value = "button.Buttons__Button-sc-19xdot-1:nth-child(3)")$clickElement()

# Find the number of profs
# pull the webpage html
# then read it
page_html <- remDr$getPageSource()[[1]] %>% 
  read_html()

# extract the number of results
number_of_profs <- page_html %>% 
                  html_node("h1") %>% 
                  html_text() %>% 
                  parse_number()

# Define a variable for the number of results we've pulled
number_of_profs_pulled <- 0

# While the number of scraped results is less than the number of total results we keep
# scrolling and pulling the html

while(number_of_profs > number_of_profs_pulled){

# scroll down the page
# The selected element is the container that holds the search results.
# We want to scroll just to the bottom of the search results, not the
# bottom of the page, because it looks like the
# "click for more results" button doesn't appear in the html
# unless you're literally right at that part of the page
webElem <- remDr$findElement("css", ".SearchResultsPage__StyledSearchResultsPage-vhbycj-0")
#webElem$sendKeysToElement(list(key = "end"))
webElem$sendKeysToElement(list(key = "down_arrow"))

# click on the show more button ------------------------------------
remDr$findElement(using = "css",value = ".Buttons__Button-sc-19xdot-1")$clickElement()

# pull the webpage html
# then read it
page_html <- remDr$getPageSource()[[1]] %>% 
  read_html()

##extract professor names
prof_names <- page_html %>%
  html_nodes(".cJdVEK") %>%
  html_text()

# update the number of profs we pulled
# so we know if we need to keep running the loop
number_of_profs_pulled <- length(prof_names)

}

Results

> str(prof_names)
 chr [1:1250] "David Whitebook" "Donald Getzin" "Joseph Frank" "Soroosh Mohebbi" "Robert Lynch" "Don Wall" "Denis Blackmore" "Soha Abdeljaber" "Lamine Dieng" "Yehoshua Perl" "Douglas Burris" ...
>
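If you want to keep the scraped names, the vector can be dropped into a data frame and written to disk (the file name here is just an illustration, not from the original answer):

```r
library(readr)

# stand-in for the prof_names vector produced by the loop above
prof_names <- c("David Whitebook", "Donald Getzin", "Joseph Frank")

prof_df <- data.frame(professor = prof_names)
write_csv(prof_df, "njit_professors.csv")  # illustrative file name
```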

Notes:
1. This will be slow, because you have to wait for the page to reload.
2. You may want to slow it down even further to avoid the site blocking you as a bot. You can also use RSelenium to add random mouse movements and keystrokes to reduce the risk of being blocked.
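The slow-down suggestion can be as simple as a randomized pause at the top of each loop iteration; the 2-5 second bounds below are arbitrary choices, not values from the original answer:

```r
# inside the while loop, before scrolling:
# sleep a random 2-5 seconds so requests don't arrive at machine speed
Sys.sleep(runif(1, min = 2, max = 5))
```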
