How to scrape all results from Google search result pages (Python/Selenium ChromeDriver)

ufj5ltwl · posted 2023-02-12 in Go
Follow (0) | Answers (2) | Views (295)

I'm writing a Python script that uses selenium chromedriver to scrape all Google search results (link, header, text) from a specified number of result pages.
My code seems to scrape only the first result from every page after the first. I think this has to do with how the for loop in my scrape function is set up, but I haven't been able to adjust it to work the way I want. Any suggestions on how to fix this, or a better way to handle it, are appreciated.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # create instance of webdriver
    driver = webdriver.Chrome()
    url = 'https://www.google.com'
    driver.get(url)
    # set keyword
    keyword = 'cars'
    # we find the search bar using its name attribute value
    searchBar = driver.find_element_by_name('q')
    # first we send our keyword to the search bar followed by the enter key
    searchBar.send_keys(keyword)
    searchBar.send_keys('\n')

    def scrape():
        pageInfo = []
        try:
            # wait for search results to be fetched
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, "g"))
            )
        except Exception as e:
            print(e)
            driver.quit()
        # contains the search results
        searchResults = driver.find_elements_by_class_name('g')
        for result in searchResults:
            element = result.find_element_by_css_selector('a')
            link = element.get_attribute('href')
            header = result.find_element_by_css_selector('h3').text
            text = result.find_element_by_class_name('IsZvec').text
            pageInfo.append({
                'header': header, 'link': link, 'text': text
            })
            return pageInfo

    # Number of pages to scrape
    numPages = 5
    # All the scraped data
    infoAll = []
    # Scraped data from page 1
    infoAll.extend(scrape())
    for i in range(0, numPages - 1):
        nextButton = driver.find_element_by_link_text('Next')
        nextButton.click()
        infoAll.extend(scrape())
    print(infoAll)

vnzz0bqm 1#

You have an indentation problem:
You should place return pageInfo outside the for loop; otherwise the function returns after the first iteration.

    for result in searchResults:
        element = result.find_element_by_css_selector('a')
        link = element.get_attribute('href')
        header = result.find_element_by_css_selector('h3').text
        text = result.find_element_by_class_name('IsZvec').text
        pageInfo.append({
            'header': header, 'link': link, 'text': text
        })
        return pageInfo

It should look like this:

    for result in searchResults:
        element = result.find_element_by_css_selector('a')
        link = element.get_attribute('href')
        header = result.find_element_by_css_selector('h3').text
        text = result.find_element_by_class_name('IsZvec').text
        pageInfo.append({
            'header': header, 'link': link, 'text': text
        })
    return pageInfo
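
For reference, here is the whole scrape() with the return moved out of the loop; the selectors and the global driver are unchanged from the question's code:

    def scrape():
        pageInfo = []
        try:
            # wait for search results to be fetched
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, "g"))
            )
        except Exception as e:
            print(e)
            driver.quit()
        # contains the search results
        searchResults = driver.find_elements_by_class_name('g')
        for result in searchResults:
            element = result.find_element_by_css_selector('a')
            link = element.get_attribute('href')
            header = result.find_element_by_css_selector('h3').text
            text = result.find_element_by_class_name('IsZvec').text
            pageInfo.append({
                'header': header, 'link': link, 'text': text
            })
        return pageInfo  # now runs only after the loop has visited every result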

I ran your code and got results (truncated; the snippet text is in French/Spanish, and some link values are trimmed to '...' here):

    [{'header': 'Cars (film) — Wikipédia', 'link': '...', 'text': "Cars : Quatre Roues, ou Les Bagnoles au Québec (Cars), est le septième long-métrage d'animation en images de synthèse des studios Pixar.\nDurée : 116 minutes\nSociété de production : Pixar Animation Studios\nGenre : animation\nCars 2 · Michel Fortin · Flash McQueen"}, {'header': 'Cars - Wikipedia, la enciclopedia libre', 'link': '...', 'text': 'Cars es una película de animación por computadora de 2006, producida por Pixar Animation Studios y lanzada por Walt Disney Studios Pictures.\nAño: 2006\nGénero: Animación; Aventuras; Comedia; infantil ...\nHistoria: John Lasseter · Joe Ranft · Jorgen Klubien ...\nProductora: Walt Disney Pictures; Pixar Animation ...'}, {'header': '', 'link': '...', 'text': ''}, {'header': '', 'link': 'https://www.allocine.fr/film/fichefilm-55774/secrets-tournage/', 'text': ''}, {'header': '', 'link': '...', 'text': ''}, ...
Suggestion:
Use a timer to throttle your for loop, otherwise you may get banned by Google for suspicious activity.
Steps: 1. import sleep: from time import sleep  2. add the timer to the last loop:

    for i in range(0, numPages - 1):
        sleep(5)  # it'll wait 5 seconds on each iteration
        nextButton = driver.find_element_by_link_text('Next')
        nextButton.click()
        infoAll.extend(scrape())
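
If you want the delay to look less mechanical, a randomized interval is a common variant. A sketch reusing the same loop (the 3-7 second range is arbitrary):

    from random import uniform
    from time import sleep

    for i in range(0, numPages - 1):
        sleep(uniform(3, 7))  # random 3-7 s pause; a fixed interval is easier to fingerprint
        nextButton = driver.find_element_by_link_text('Next')
        nextButton.click()
        infoAll.extend(scrape())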

dbf7pr2w 2#

Google search results can be parsed with the BeautifulSoup web-scraping library without selenium, because the data is not loaded dynamically via JavaScript; it also executes much faster than selenium, since there is no need to render the page or drive a browser.
To get information from all pages, you can paginate with an infinite while loop. Try to avoid paginating with for i in range(), because that hard-codes the number of pages and is therefore unreliable: if the page count changes (say from 5 to 20), the pagination breaks.
Since the while loop is infinite, you need conditions for exiting it; two can be set:

  • One exit condition is the presence of the button that switches to the next page (it is absent on the last page); you can check for it via its CSS selector (in our case ".d6cvqb a[id=pnnext]"):

      # condition for exiting the loop in the absence of the next page button
      if soup.select_one(".d6cvqb a[id=pnnext]"):
          params["start"] += 10
      else:
          break

  • Another solution, if you don't need to extract every page, is to add a limit on the number of pages to scrape:

      # condition for exiting the loop when the page limit is reached
      if page_num == page_limit:
          break

When you request a site, it may decide the request is coming from a bot. To prevent that, send a user-agent in the request headers; the site will then assume you are a real user and show the information.
The next step could be to rotate the user-agent, for example switching between PC, mobile and tablet, and between browsers such as Chrome, Firefox, Safari and Edge. The most reliable setup uses rotating proxies, rotating user-agents and a CAPTCHA solver.
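
A minimal sketch of what per-request user-agent rotation could look like (the user-agent strings below are illustrative examples, not a curated list):

    import random
    import requests

    # a small illustrative pool of desktop and mobile user-agent strings
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
        "Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36",
    ]

    # pick a different user-agent for each request
    headers = {"User-Agent": random.choice(user_agents)}
    html = requests.get("https://www.google.com/search",
                        params={"q": "cars", "hl": "en"},
                        headers=headers, timeout=30)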
Check the full code in the online IDE.

    from bs4 import BeautifulSoup
    import requests, json, lxml

    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "q": "cars",   # query example
        "hl": "en",    # language
        "gl": "uk",    # country of the search, UK -> United Kingdom
        "start": 0,    # page offset: 0 = first page, 10 = second page, etc.
        # "num": 100   # parameter defines the maximum number of results to return
    }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
    }

    page_limit = 10  # page limit, for example
    page_num = 0
    data = []

    while True:
        page_num += 1
        print(f"page: {page_num}")

        html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
        soup = BeautifulSoup(html.text, 'lxml')

        for result in soup.select(".tF2Cxc"):
            title = result.select_one(".DKV0Md").text
            try:
                snippet = result.select_one(".lEBKkf span").text
            except:
                snippet = None
            links = result.select_one(".yuRUbf a")["href"]

            data.append({
                "title": title,
                "snippet": snippet,
                "links": links
            })

        # condition for exiting the loop when the page limit is reached
        if page_num == page_limit:
            break

        # condition for exiting the loop in the absence of the next page button
        if soup.select_one(".d6cvqb a[id=pnnext]"):
            params["start"] += 10
        else:
            break

    print(json.dumps(data, indent=2, ensure_ascii=False))

Example output:

    [
      {
        "title": "Cars (2006) - IMDb",
        "snippet": "On the way to the biggest race of his life, a hotshot rookie race car gets stranded in a rundown town, and learns that winning isn't everything in life.",
        "links": "https://www.imdb.com/title/tt0317219/"
      },
      {
        "title": "Cars (film) - Wikipedia",
        "snippet": "Cars is a 2006 American computer-animated sports comedy film produced by Pixar Animation Studios and released by Walt Disney Pictures. The film was directed ...",
        "links": "https://en.wikipedia.org/wiki/Cars_(film)"
      },
      {
        "title": "Cars - Rotten Tomatoes",
        "snippet": "Cars offers visual treats that more than compensate for its somewhat thinly written story, adding up to a satisfying diversion for younger viewers.",
        "links": "https://www.rottentomatoes.com/m/cars"
      },
      other results ...
    ]

You can also use SerpApi's Google Search Engine Results API. It's a paid API with a free plan; the difference is that it bypasses Google's blocking (including CAPTCHA), so you don't have to build a parser and maintain it.
Code example:

    from serpapi import GoogleSearch
    from urllib.parse import urlsplit, parse_qsl
    import json, os

    params = {
        "api_key": "...",    # serpapi key from https://serpapi.com/manage-api-key
        "engine": "google",  # serpapi parser engine
        "q": "cars",         # search query
        "gl": "uk",          # country of the search, UK -> United Kingdom
        "num": "100"         # number of results per page (100 per page in this case)
        # other search parameters: https://serpapi.com/search-api#api-parameters
    }

    search = GoogleSearch(params)  # where data extraction happens

    page_limit = 10
    organic_results_data = []
    page_num = 0

    while True:
        results = search.get_dict()  # JSON -> Python dictionary
        page_num += 1

        for result in results["organic_results"]:
            organic_results_data.append({
                "title": result.get("title"),
                "snippet": result.get("snippet"),
                "link": result.get("link")
            })

        if page_num == page_limit:
            break

        if "next_link" in results.get("serpapi_pagination", []):
            search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
        else:
            break

    print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

Output:

    [
      {
        "title": "Rally Cars - Page 30 - Google Books result",
        "snippet": "Some people say rally car drivers are the most skilled racers in the world. Roger Clark, a British rally legend of the 1970s, describes sliding his car down ...",
        "link": "https://books.google.co.uk/books?id=uIOlAgAAQBAJ&pg=PA30&lpg=PA30&dq=cars&source=bl&ots=9vDWFi0bHD&sig=ACfU3U1d4R-ShepjsTtWN-b9SDYkW1sTDQ&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgcEAM"
      },
      {
        "title": "Independent Sports Cars - Page 5 - Google Books result",
        "snippet": "The big three American auto makers produced sports and sports-like cars beginning with GMs Corvette and Fords Thunderbird in 1954. Folowed by the Mustang, ...",
        "link": "https://books.google.co.uk/books?id=HolUDwAAQBAJ&pg=PA5&lpg=PA5&dq=cars&source=bl&ots=yDaDtQSyW1&sig=ACfU3U11nHeRTwLFORGMHHzWjaVHnbLK3Q&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgaEAM"
      },
      other results...
    ]
