Problem navigating to the next page with Python Selenium

fkaflof6 · posted 2021-09-08 in Java

I am new to web scraping; I am trying to pull information about water utilities from this site. I can currently loop through each region via the drop-down menu and reach the first page of results. What I cannot do yet is navigate through all of a region's remaining pages before moving on to the next region. The page navigation bar is a list with no "Next" button, so I am trying to iterate over it with range. However, when I take the len of that list I do not get the correct number of pages, and as a result I only ever visit the first page of each region. Even after searching for similar questions, I cannot work out what I am doing wrong or what I should be considering. Any help would be much appreciated.
Thanks
Here is my current code (no scraping yet, mostly just page navigation):

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException

url = 'https://database.ib-net.org/search_utilities?type=2'
browser = webdriver.Firefox()
browser.get(url)
time.sleep(3)
print("Retrieving the site...")

# All regions available

regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada)', 'Middle East and Northern Africa', 'South Asia']

for region in regions:
   # Select all options from drop down menu
   selectOption = Select(browser.find_element_by_id('MainContent_ddRegion'))

   print("Now constructing output for: " + region)

   # Select table and wait for data to populate
   selectOption.select_by_visible_text(region)

   time.sleep(4)

   list_of_table_pages = browser.find_element_by_xpath('//*[@id="MainContent_gvUtilities"]/tbody/tr[52]/td/ul')
   no_pages = len(list_of_table_pages.find_elements_by_xpath("//li"))

   print(("No of table pages to be scraped are: %d") %no_pages)

   print("Outputting data into "+ region +".csv...")

   all_table_data = []

   # starts the range count from 1 instead of 0
   for page in range(1, no_pages):
      try:

        #Navigate to the next page once done
        table_page = str(page)
        WebDriverWait(browser, 20).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="MainContent_gvUtilities"]/tbody/tr[52]/td/ul/li['+ table_page + ']/a'))).click()
        print("Navigating to next table page...")

      except (TimeoutException, WebDriverException):
        print("Last page reached, moving to the next region...")
        break

   print("No more pages to scrape under %s. Moving to the next region..." %region)

browser.close()
browser.quit()
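For context, the wrong count from len here is most likely because calling find_elements_by_xpath("//li") on an element still searches the entire document: an XPath that starts with // is evaluated from the document root, while ".//li" is evaluated relative to the element. lxml's XPath engine follows the same rule, so the effect can be sketched without a browser:

```python
from lxml import etree

# Toy document: a pager <ul> with three items plus another list with two,
# mimicking a page that contains more than one <li> list.
doc = etree.fromstring(
    '<div>'
    '<ul id="pager"><li>1</li><li>2</li><li>3</li></ul>'
    '<ul id="menu"><li>a</li><li>b</li></ul>'
    '</div>'
)
pager = doc.xpath('//ul[@id="pager"]')[0]

# "//li" ignores the context element and counts every <li> in the document:
print(len(pager.xpath('//li')))    # 5
# ".//li" is evaluated relative to the pager element:
print(len(pager.xpath('.//li')))   # 3
```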

4nkexdtk1#

The code below calculates the number of pages from the result count and the known maximum number of results per page.
It loops, clicking the href that contains the target page number. If that number is not currently visible, the exception this raises is handled by clicking the pagination ellipsis to reveal more page links.
For pages beyond the first, I print the first td of the first data tr to show the page was actually visited. I also used a hard-coded wait in addition to the wait conditions.
I used chromedriver.
This is meant to give you a framework to work from. I tested it and ran it across all region selections and pages.

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException, NoSuchElementException
import math

results_per_page = 50
url = 'https://database.ib-net.org/search_utilities?type=2'
browser = webdriver.Chrome() #FireFox()
browser.get(url)
print("Retrieving the site...")

# All regions available

regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada)', 'Middle East and Northern Africa', 'South Asia']

for region in regions:
    # Select all options from drop down menu
    selectOption = Select(browser.find_element_by_id('MainContent_ddRegion'))

    print("Now constructing output for: " + region)

    # Select table and wait for data to populate
    selectOption.select_by_visible_text(region)

    WebDriverWait(browser, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#MainContent_gvUtilities tr > td')))
    num_results = int(browser.find_element_by_id('MainContent_lblqResults').text)
    num_pages = math.ceil(num_results/results_per_page)
    print(f'pages to scrape are: {num_pages}')

    for page in range(2, num_pages + 1):
        print(f'visiting page {page}')
        try:
            browser.find_element_by_css_selector(f'.pagination > li > [href*="Page${page}"]').click()
            WebDriverWait(browser, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#MainContent_gvUtilities tr > td')))
            print(browser.find_element_by_css_selector('#MainContent_gvUtilities tr:nth-child(2) span').text)
        except NoSuchElementException:
            browser.find_element_by_css_selector('.pagination > li > a').click()
        except Exception as e:
            print(e)
            continue
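Once a page has loaded, its rows can be collected with pandas (already imported above). A minimal sketch, assuming the grid keeps its MainContent_gvUtilities id and that lxml is available for pandas.read_html; the helper name table_rows is hypothetical:

```python
from io import StringIO
import pandas as pd

def table_rows(page_source: str) -> pd.DataFrame:
    # pandas.read_html returns every matching <table>; restrict it to
    # the utilities grid via its id attribute and take the first match.
    tables = pd.read_html(StringIO(page_source),
                          attrs={'id': 'MainContent_gvUtilities'})
    return tables[0]

# Inside the page loop one would accumulate per-page frames, e.g.:
#   all_table_data.append(table_rows(browser.page_source))
# and, after the last page of a region (dropping the pager row if it
# gets parsed as data):
#   pd.concat(all_table_data).to_csv(region + '.csv', index=False)
```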
