Scraping multiple links with Selenium: how to handle expired links without stopping the scraping process

yizd12fk  posted on 2023-05-29  in Other
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException, TimeoutException
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import json

    data = []
    driver_path = "C:\\Users\\Engineer_Stephen\\Downloads\\Compressed\\chromedriver"
    service = Service(driver_path)
    chrome_options = webdriver.ChromeOptions()
    #chrome_options.add_argument("--headless")
    chrome_options.add_argument("--start-maximized")
    driver = webdriver.Chrome(service=service, options=chrome_options)
    links = ['link1','link1','link1','link1','link1','link1']
    for link in links:
        try:
            driver.get(link)
            wait = WebDriverWait(driver, 80)
            map_element = wait.until(EC.visibility_of_element_located((By.XPATH, '//span[@class="contour-HazardInfo-name montage-Text"]')))
            disaster = driver.find_element(By.XPATH, '//span[@class="contour-HazardInfo-name montage-Text"]')
            info = driver.find_element(By.XPATH, "//div[@class='contour-TabBarItem-label montage-Text']")
            types = driver.find_element(By.XPATH, "//div[@data-montage-id='type']")
            description = driver.find_element(By.XPATH, "//label/span[contains(text(), 'Description')]")
            descriptionText = driver.find_element(By.XPATH, '//div[@class="contour-HazardInfo-descriptionText montage-Text"]')
            data.append({
                'disaster': disaster.text,
                'info': info.text,
                'type': types.text,
                'description': description.text,
                'descriptionText': descriptionText.text
            })
        except NoSuchElementException:
            print(f"Data not available for link: {link}")
        except TimeoutException:
            print(f"Timeout occurred for link: {link}")
            driver.quit()
            continue
        except Exception as e:
            print(f"An error occurred for link: {link}")
            print(str(e))
            driver.quit()
            continue
    driver.quit()
    resource = json.dumps(data, indent=4)
    print(resource)

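The output looks like this:
Timeout occurred for link: https://disasteralert.pdc.org/disasteralert/?hazard_id=202739
An error occurred for link: https://disasteralert.pdc.org/disasteralert/?hazard_id=202738 HTTPConnectionPool(host='localhost', port=33660): Max retries exceeded with url: /session/99df58ee2824dbdf4720427db546798d/url (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000027446C07A00>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
It reports the timeout only for the first expired link. For every remaining link it cannot continue and raises the second exception instead.
I have provided a sample of the links I want to scrape. I found that some of them have expired. I am trying to improve this code so that it closes the current window when a link has expired and moves on to the next link. Could you please help?
I expect the code to scrape data from the links in the list. If a link has expired, it will hit a timeout exception. In that case I want Selenium to close that browser window and move on to the next link, that is, open another browser window and scrape the data.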

yqyhoc1h  #1

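A preliminary analysis of the page shows that the actual data comes from https://hpxml.pdc.org/public.xml, an endpoint the page requests periodically to refresh the information it displays (the map and so on).
Here is one way to fetch that information: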

    import pandas as pd

    df = pd.read_xml('https://hpxml.pdc.org/public.xml')
    print(df)

Result in the terminal:

    app_ID app_IDs autoexpire category_ID charter_Uri comment_Text create_Date_hst create_Date creator end_Date_hst ... start_Date_hst start_Date status type_ID update_Date_hst update_Date product_total uuid description update_User
    0 0 NaN Y EVENT NaN D2P2 auto-generated Earthquake Hazard 2023-05-28T03:11:24-10:00 2023-05-28T13:11:24-10:00 D2P2 2023-05-29T03:11:24-10:00 ... 2023-05-28T02:52:39-10:00 2023-05-28T12:52:39-10:00 A EARTHQUAKE 2023-05-28T03:11:24-10:00 2023-05-28T13:11:24-10:00 1 e3b44cd0-503c-4c2a-b49e-8aedad507cfa An earthquake with a magnitude of 5.1 at a dep... None
    1 0 NaN Y EVENT NaN 352090 2023-05-18T03:50:52-10:00 2023-05-18T13:50:52-10:00 D2P2 2023-06-01T15:20:43-10:00 ... 2023-05-18T03:39:59-10:00 2023-05-18T13:39:59-10:00 A VOLCANO 2023-05-28T03:00:52-10:00 2023-05-28T13:00:52-10:00 59 474b0157-761d-43fd-b8a5-24854683f437 Volcanic activity has been reported for Sangay... None
    2 0 NaN Y EVENT NaN FLOODIPAWS-WARNING-2023-McCook, NE 2023-05-26T06:08:49-10:00 2023-05-26T16:08:49-10:00 D2P2 2023-05-28T19:14:59-10:00 ... 2023-05-26T13:01:59-10:00 2023-05-26T23:01:59-10:00 A FLOOD 2023-05-28T02:35:24-10:00 2023-05-28T12:35:24-10:00 16 779627e6-d4e6-4d9e-9980-bda8d0905d20 The National Weather Service (NWS) has issued ... None
    3 0 NaN Y EVENT NaN D2P2 auto-generated Earthquake Hazard 2023-05-27T20:10:31-10:00 2023-05-28T06:10:31-10:00 D2P2 2023-05-28T20:15:00-10:00 ... 2023-05-27T19:49:56-10:00 2023-05-28T05:49:56-10:00 A EARTHQUAKE 2023-05-27T20:15:00-10:00 2023-05-28T06:15:00-10:00 2 a0370397-66e7-4d2e-91f6-e844904779b0 An earthquake with a magnitude of 5.2 at a dep... None
    4 0 NaN Y EVENT NaN 282080 2023-05-01T09:40:52-10:00 2023-05-01T19:40:52-10:00 D2P2 2023-06-02T01:50:44-10:00 ... 2023-05-01T09:32:59-10:00 2023-05-01T19:32:59-10:00 A VOLCANO 2023-05-28T01:50:45-10:00 2023-05-28T11:50:45-10:00 62 f1e4fe5b-7e99-4f2b-aafc-77ed09d8ae00 Volcanic activity has been reported for Aira i... None
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    143 0 NaN N EVENT NaN None 2023-04-25T21:56:45-10:00 2023-04-26T07:56:45-10:00 jmyhre 2023-05-02T21:54:18-10:00 ... 2023-04-25T21:54:18-10:00 2023-04-26T07:54:18-10:00 A BIOMEDICAL 2023-04-25T21:56:45-10:00 2023-04-26T07:56:45-10:00 1 7f27c603-47ad-4622-ad1c-0c38cd40094d A measles outbreak has been reported in parts ... jmyhre
    144 0 NaN N EVENT NaN NaN 2022-10-05T07:16:01-10:00 2022-10-05T17:16:01-10:00 amontoro 2022-10-21T07:08:35-10:00 ... 2022-10-05T07:08:35-10:00 2022-10-05T17:08:35-10:00 A BIOMEDICAL 2022-11-28T10:21:15-10:00 2022-11-28T20:21:15-10:00 32 52e8b82a-3719-4f0b-9ea8-e66e2ab0f432 Cholera outbreak has been reported in parts of... jmyhre
    145 0 NaN N EVENT NaN None 2023-03-10T15:03:32-10:00 2023-03-11T01:03:32-10:00 jmyhre 2023-03-17T15:01:46-10:00 ... 2023-03-10T15:01:46-10:00 2023-03-11T01:01:46-10:00 A DROUGHT 2023-03-10T15:03:32-10:00 2023-03-11T01:03:32-10:00 1 4a1e1e72-ce4c-4a34-88bd-63b8c498089f Drought warnings have been reported for Djibou... jmyhre
    146 0 NaN N EVENT NaN None 2023-02-24T13:36:34-10:00 2023-02-24T23:36:34-10:00 jmyhre 2023-03-03T13:33:11-10:00 ... 2023-02-24T13:33:11-10:00 2023-02-24T23:33:11-10:00 A BIOMEDICAL 2023-02-24T13:36:34-10:00 2023-02-24T23:36:34-10:00 1 6882129b-e06a-4ac7-bd53-6b874b1026f7 Cholera outbreak has been reported in parts of... jmyhre
    147 0 NaN N EVENT NaN https://www.who.int/emergencies/disease-outbre... 2022-08-25T18:12:26-10:00 2022-08-26T04:12:26-10:00 jmyhre 2022-09-01T18:10:04-10:00 ... 2022-08-25T18:10:04-10:00 2022-08-26T04:10:04-10:00 A BIOMEDICAL 2022-08-25T18:12:26-10:00 2022-08-26T04:12:26-10:00 1 4207506c-4e19-4dcb-b268-d291023b2acd PDC continues to work closely with partners th... jmyhre
    148 rows × 33 columns
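
If only a few fields are needed, or pandas is not available, the same one-record-per-row flattening can be sketched with the standard library. This is a minimal sketch, assuming the feed is a flat list of records under the root element, which is what the one-row-per-event result above suggests. The sample XML, its element names `hazards` and `hazard`, and the helper names `parse_records` and `filter_by_type` are made up for illustration. Only the field names `type_ID`, `status`, and `description` are taken from the columns printed above.

```python
import xml.etree.ElementTree as ET

# Made-up sample standing in for the real feed. The element names
# "hazards" and "hazard" are assumptions; the code does not depend on them.
SAMPLE_XML = """<hazards>
  <hazard>
    <type_ID>EARTHQUAKE</type_ID>
    <status>A</status>
    <description>Sample earthquake record</description>
  </hazard>
  <hazard>
    <type_ID>VOLCANO</type_ID>
    <status>A</status>
    <description>Sample volcano record</description>
  </hazard>
</hazards>"""


def parse_records(xml_text):
    """Turn every child of the root element into a {tag: text} dict."""
    root = ET.fromstring(xml_text)
    return [{field.tag: field.text for field in record} for record in root]


def filter_by_type(records, type_id):
    """Keep only the records whose type_ID field equals type_id."""
    return [r for r in records if r.get("type_ID") == type_id]


records = parse_records(SAMPLE_XML)
volcanoes = filter_by_type(records, "VOLCANO")
print(volcanoes)
```

To run this against the live endpoint, the bytes returned by `urllib.request.urlopen('https://hpxml.pdc.org/public.xml').read()` can be passed to `parse_records` in place of `SAMPLE_XML`. If the real feed declares XML namespaces, ElementTree will prefix each tag with `{namespace}`, so the keys would need adjusting.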

See the pandas documentation for more details.
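
If you would rather stay with Selenium, the second error in the question is most likely caused by the `driver.quit()` calls inside the `except` blocks. Once the driver has quit, chromedriver is no longer listening on its local port, so every later `driver.get()` is refused. A minimal sketch of a loop that records a failed link, opens a fresh browser, and carries on is shown here. The names `scrape_all`, `make_chrome`, and `scrape_hazard` are hypothetical. `make_chrome` assumes Selenium 4.6 or later, so that chromedriver is located automatically. The XPaths and the 80-second wait are copied from the question.

```python
def scrape_all(links, make_driver, scrape_one):
    """Scrape every link; a failing link is recorded and skipped, never fatal."""
    results, failed = [], []
    driver = make_driver()
    try:
        for link in links:
            try:
                results.append(scrape_one(driver, link))
            except Exception as exc:  # includes Selenium's TimeoutException
                failed.append((link, type(exc).__name__))
                # Close the current browser and open a fresh one, so the next
                # link talks to a live session instead of a dead one.
                driver.quit()
                driver = make_driver()
    finally:
        driver.quit()  # runs once, after the last link
    return results, failed


def make_chrome():
    """Hypothetical factory; assumes Selenium 4.6+ finds chromedriver itself."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--start-maximized")
    return webdriver.Chrome(options=options)


def scrape_hazard(driver, link, timeout=80):
    """Hypothetical per-page scraper using two XPaths from the question."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    name_xpath = '//span[@class="contour-HazardInfo-name montage-Text"]'
    text_xpath = '//div[@class="contour-HazardInfo-descriptionText montage-Text"]'

    driver.get(link)
    # Raises TimeoutException if the element never becomes visible,
    # which is what happens on an expired link.
    name = WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located((By.XPATH, name_xpath))
    )
    text = driver.find_element(By.XPATH, text_xpath)
    return {"disaster": name.text, "descriptionText": text.text}
```

Calling `results, failed = scrape_all(links, make_chrome, scrape_hazard)` returns the scraped records plus a list of `(link, exception name)` pairs for the links that were skipped. `json.dumps(results, indent=4)` then works as in the question. Restarting the browser after each failure matches the behaviour asked for. Reusing the same driver and simply continuing would also work and is faster.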
