How to scrape all results from Google search result pages (Python/Selenium ChromeDriver)

ufj5ltwl · posted 2023-02-12 in Go
Follow (0) | Answers (2) | Views (295)

I'm writing a Python script that uses selenium chromedriver to scrape all Google search results (link, header, text) from a specified number of result pages.
My code seems to scrape only the first result from every page after the first. I think this has to do with how the for loop in my scrape function is set up, but I haven't been able to adjust it to work the way I want. Any suggestions on how to fix this, or a better way to handle it, are appreciated.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # create instance of webdriver
    driver = webdriver.Chrome()
    url = 'https://www.google.com'
    driver.get(url)
    # set keyword
    keyword = 'cars'
    # we find the search bar using its name attribute value
    searchBar = driver.find_element_by_name('q')
    # first we send our keyword to the search bar followed by the enter key
    searchBar.send_keys(keyword)
    searchBar.send_keys('\n')

    def scrape():
        pageInfo = []
        try:
            # wait for search results to be fetched
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, "g"))
            )
        except Exception as e:
            print(e)
            driver.quit()
        # contains the search results
        searchResults = driver.find_elements_by_class_name('g')
        for result in searchResults:
            element = result.find_element_by_css_selector('a')
            link = element.get_attribute('href')
            header = result.find_element_by_css_selector('h3').text
            text = result.find_element_by_class_name('IsZvec').text
            pageInfo.append({
                'header': header, 'link': link, 'text': text
            })
            return pageInfo

    # Number of pages to scrape
    numPages = 5
    # All the scraped data
    infoAll = []
    # Scraped data from page 1
    infoAll.extend(scrape())
    for i in range(0, numPages - 1):
        nextButton = driver.find_element_by_link_text('Next')
        nextButton.click()
        infoAll.extend(scrape())
    print(infoAll)

vnzz0bqm 1#

You have an indentation problem:
You should place return pageInfo outside the for loop; otherwise the function returns after the first iteration.

    for result in searchResults:
        element = result.find_element_by_css_selector('a')
        link = element.get_attribute('href')
        header = result.find_element_by_css_selector('h3').text
        text = result.find_element_by_class_name('IsZvec').text
        pageInfo.append({
            'header': header, 'link': link, 'text': text
        })
        return pageInfo

It should look like this:

    for result in searchResults:
        element = result.find_element_by_css_selector('a')
        link = element.get_attribute('href')
        header = result.find_element_by_css_selector('h3').text
        text = result.find_element_by_class_name('IsZvec').text
        pageInfo.append({
            'header': header, 'link': link, 'text': text
        })
    return pageInfo
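
For reference, here is the whole scrape() with the return moved out of the loop; the selectors and the global driver are unchanged from the question's code:

    def scrape():
        pageInfo = []
        try:
            # wait for search results to be fetched
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, "g"))
            )
        except Exception as e:
            print(e)
            driver.quit()
        # contains the search results
        searchResults = driver.find_elements_by_class_name('g')
        for result in searchResults:
            element = result.find_element_by_css_selector('a')
            link = element.get_attribute('href')
            header = result.find_element_by_css_selector('h3').text
            text = result.find_element_by_class_name('IsZvec').text
            pageInfo.append({
                'header': header, 'link': link, 'text': text
            })
        return pageInfo  # now runs only after the loop has visited every result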

I ran your code and got results (truncated; the snippet text is in French/Spanish, and some link values are trimmed to '...' here):

    [{'header': 'Cars (film) — Wikipédia', 'link': '...', 'text': "Cars : Quatre Roues, ou Les Bagnoles au Québec (Cars), est le septième long-métrage d'animation en images de synthèse des studios Pixar.\nDurée : 116 minutes\nSociété de production : Pixar Animation Studios\nGenre : animation\nCars 2 · Michel Fortin · Flash McQueen"}, {'header': 'Cars - Wikipedia, la enciclopedia libre', 'link': '...', 'text': 'Cars es una película de animación por computadora de 2006, producida por Pixar Animation Studios y lanzada por Walt Disney Studios Pictures.\nAño: 2006\nGénero: Animación; Aventuras; Comedia; infantil ...\nHistoria: John Lasseter · Joe Ranft · Jorgen Klubien ...\nProductora: Walt Disney Pictures; Pixar Animation ...'}, {'header': '', 'link': '...', 'text': ''}, {'header': '', 'link': 'https://www.allocine.fr/film/fichefilm-55774/secrets-tournage/', 'text': ''}, {'header': '', 'link': '...', 'text': ''}, ...
Suggestion:
Use a timer to throttle your for loop, otherwise you may get banned by Google for suspicious activity.
Steps: 1. import sleep: from time import sleep  2. add the timer to the last loop:

    for i in range(0, numPages - 1):
        sleep(5)  # it'll wait 5 seconds on each iteration
        nextButton = driver.find_element_by_link_text('Next')
        nextButton.click()
        infoAll.extend(scrape())
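
If you want the delay to look less mechanical, a randomized interval is a common variant. A sketch reusing the same loop (the 3-7 second range is arbitrary):

    from random import uniform
    from time import sleep

    for i in range(0, numPages - 1):
        sleep(uniform(3, 7))  # random 3-7 s pause; a fixed interval is easier to fingerprint
        nextButton = driver.find_element_by_link_text('Next')
        nextButton.click()
        infoAll.extend(scrape())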

dbf7pr2w 2#

Google search results can be parsed with the BeautifulSoup web-scraping library without selenium, because the data is not loaded dynamically via JavaScript; it also executes much faster than selenium, since there is no need to render the page or drive a browser.
To get information from all pages, you can paginate with an infinite while loop. Try to avoid paginating with for i in range(), because that hard-codes the number of pages and is therefore unreliable: if the page count changes (say from 5 to 20), the pagination breaks.
Since the while loop is infinite, you need conditions for exiting it; two can be set:

  • One exit condition is the presence of the button that switches to the next page (it is absent on the last page); you can check for it via its CSS selector (in our case ".d6cvqb a[id=pnnext]"):

      # condition for exiting the loop in the absence of the next page button
      if soup.select_one(".d6cvqb a[id=pnnext]"):
          params["start"] += 10
      else:
          break

  • Another solution, if you don't need to extract every page, is to add a limit on the number of pages to scrape:

      # condition for exiting the loop when the page limit is reached
      if page_num == page_limit:
          break

When you request a site, it may decide the request is coming from a bot. To prevent that, send a user-agent in the request headers; the site will then assume you are a real user and show the information.
The next step could be to rotate the user-agent, for example switching between PC, mobile and tablet, and between browsers such as Chrome, Firefox, Safari and Edge. The most reliable setup uses rotating proxies, rotating user-agents and a CAPTCHA solver.
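
A minimal sketch of what per-request user-agent rotation could look like (the user-agent strings below are illustrative examples, not a curated list):

    import random
    import requests

    # a small illustrative pool of desktop and mobile user-agent strings
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
        "Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36",
    ]

    # pick a different user-agent for each request
    headers = {"User-Agent": random.choice(user_agents)}
    html = requests.get("https://www.google.com/search",
                        params={"q": "cars", "hl": "en"},
                        headers=headers, timeout=30)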
Check the full code in the online IDE.

    from bs4 import BeautifulSoup
    import requests, json, lxml

    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "q": "cars",   # query example
        "hl": "en",    # language
        "gl": "uk",    # country of the search, UK -> United Kingdom
        "start": 0,    # page offset: 0 = first page, 10 = second page, etc.
        # "num": 100   # parameter defines the maximum number of results to return
    }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
    }

    page_limit = 10  # page limit, for example
    page_num = 0
    data = []

    while True:
        page_num += 1
        print(f"page: {page_num}")

        html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
        soup = BeautifulSoup(html.text, 'lxml')

        for result in soup.select(".tF2Cxc"):
            title = result.select_one(".DKV0Md").text
            try:
                snippet = result.select_one(".lEBKkf span").text
            except:
                snippet = None
            links = result.select_one(".yuRUbf a")["href"]

            data.append({
                "title": title,
                "snippet": snippet,
                "links": links
            })

        # condition for exiting the loop when the page limit is reached
        if page_num == page_limit:
            break

        # condition for exiting the loop in the absence of the next page button
        if soup.select_one(".d6cvqb a[id=pnnext]"):
            params["start"] += 10
        else:
            break

    print(json.dumps(data, indent=2, ensure_ascii=False))

Example output:

    [
      {
        "title": "Cars (2006) - IMDb",
        "snippet": "On the way to the biggest race of his life, a hotshot rookie race car gets stranded in a rundown town, and learns that winning isn't everything in life.",
        "links": "https://www.imdb.com/title/tt0317219/"
      },
      {
        "title": "Cars (film) - Wikipedia",
        "snippet": "Cars is a 2006 American computer-animated sports comedy film produced by Pixar Animation Studios and released by Walt Disney Pictures. The film was directed ...",
        "links": "https://en.wikipedia.org/wiki/Cars_(film)"
      },
      {
        "title": "Cars - Rotten Tomatoes",
        "snippet": "Cars offers visual treats that more than compensate for its somewhat thinly written story, adding up to a satisfying diversion for younger viewers.",
        "links": "https://www.rottentomatoes.com/m/cars"
      },
      other results ...
    ]

You can also use SerpApi's Google Search Engine Results API. It's a paid API with a free plan; the difference is that it bypasses Google's blocking (including CAPTCHA), so you don't have to build a parser and maintain it.
Code example:

    from serpapi import GoogleSearch
    from urllib.parse import urlsplit, parse_qsl
    import json, os

    params = {
        "api_key": "...",    # serpapi key from https://serpapi.com/manage-api-key
        "engine": "google",  # serpapi parser engine
        "q": "cars",         # search query
        "gl": "uk",          # country of the search, UK -> United Kingdom
        "num": "100"         # number of results per page (100 per page in this case)
        # other search parameters: https://serpapi.com/search-api#api-parameters
    }

    search = GoogleSearch(params)  # where data extraction happens

    page_limit = 10
    organic_results_data = []
    page_num = 0

    while True:
        results = search.get_dict()  # JSON -> Python dictionary
        page_num += 1

        for result in results["organic_results"]:
            organic_results_data.append({
                "title": result.get("title"),
                "snippet": result.get("snippet"),
                "link": result.get("link")
            })

        if page_num == page_limit:
            break

        if "next_link" in results.get("serpapi_pagination", []):
            search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
        else:
            break

    print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

Output:

    [
      {
        "title": "Rally Cars - Page 30 - Google Books result",
        "snippet": "Some people say rally car drivers are the most skilled racers in the world. Roger Clark, a British rally legend of the 1970s, describes sliding his car down ...",
        "link": "https://books.google.co.uk/books?id=uIOlAgAAQBAJ&pg=PA30&lpg=PA30&dq=cars&source=bl&ots=9vDWFi0bHD&sig=ACfU3U1d4R-ShepjsTtWN-b9SDYkW1sTDQ&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgcEAM"
      },
      {
        "title": "Independent Sports Cars - Page 5 - Google Books result",
        "snippet": "The big three American auto makers produced sports and sports-like cars beginning with GMs Corvette and Fords Thunderbird in 1954. Folowed by the Mustang, ...",
        "link": "https://books.google.co.uk/books?id=HolUDwAAQBAJ&pg=PA5&lpg=PA5&dq=cars&source=bl&ots=yDaDtQSyW1&sig=ACfU3U11nHeRTwLFORGMHHzWjaVHnbLK3Q&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgaEAM"
      },
      other results...
    ]
