How do I download PDFs from a list of URLs using the wget module?

Posted by m1m5dgzv on 2022-10-22 in Python

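I have a Python script that uses Selenium to scrape URLs from a website and store them in a list. Now I want to download them with the wget module.
Here is the relevant part of the code, where the script completes the partial URLs it got from the website: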

    new_links = []
    for link in list_of_links:  # trim links
        current_strings = link.split("/consultas/coleccion/window.open('")
        current_strings[1] = current_strings[1].split("');return")[0]
        new_link = current_strings[0] + current_strings[1]
        new_links.append(new_link)
    for new_link in new_links:
        wget.download(new_link)

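At this point the script does nothing. It never downloads any PDF, and it shows no error message.
What am I doing wrong in the second for loop?
As for whether new_links is empty: it is not.

    print(*new_links, sep='\n')

gives me links like these (only four shown here):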

    http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D
    http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=dsyx6l1Fbig%3D
    http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=Cb64W7EHlD8%3D
    http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=A4TKEG9x4F8%3D

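A partial URL looks like this:

    /consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D

The "base URL" is then prepended to it:

    http://digesto.asamblea.gob.ni

Here is the relevant part of the code that comes right before the code above and collects the partial URLs: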

    list_of_links = []  # will hold the scraped links
    tld = 'http://digesto.asamblea.gob.ni'
    current_url = driver.current_url  # for any links not starting with /
    table_id = driver.find_element(By.ID, 'tableDocCollection')
    rows = table_id.find_elements_by_css_selector("tbody tr")  # get all table rows
    for row in rows:
        row.find_element_by_css_selector('button').click()
        link = row.find_element_by_css_selector('li a[onclick*=pdf]').get_attribute("onclick")  # get partial link
        if link.startswith('/'):
            list_of_links.append(tld + link)  # add base to partial link
        else:
            list_of_links.append(current_url + link)
        row.find_element_by_css_selector('button').click()
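
For reference, the collect-then-trim steps could probably be collapsed into one: pull the path straight out of the onclick attribute and join it to the base URL. This is only a minimal sketch. It assumes the attribute has the form `window.open('/consultas/util/pdf.php?...');return ...`, which is what the split strings in the trimming code suggest. The names `url_from_onclick` and `BASE_URL` are made up for illustration.

```python
import re
from urllib.parse import urljoin

# Base URL taken from the question; the constant name is hypothetical.
BASE_URL = "http://digesto.asamblea.gob.ni"

# Captures the first single-quoted argument of window.open('...').
_WINDOW_OPEN = re.compile(r"window\.open\('([^']+)'")


def url_from_onclick(onclick, base_url=BASE_URL):
    """Return the absolute URL found in a window.open('...') handler, or None."""
    match = _WINDOW_OPEN.search(onclick)
    if match is None:
        return None  # attribute does not have the assumed shape
    # urljoin resolves a path starting with "/" against the host of base_url
    return urljoin(base_url, match.group(1))


# Assumed example value of the onclick attribute (not copied from the site).
sample = "window.open('/consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D');return false;"
print(url_from_onclick(sample))
```

With a helper like this, each row's onclick value could be converted as it is scraped, so no second trimming pass over the list would be needed.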

xuo3flqw #1

Your loop works.
Try upgrading wget to version 3.2 and check:

    new_links = ['http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=vPjrUnz0wbA%3D',
                 'http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=dsyx6l1Fbig%3D',
                 'http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=Cb64W7EHlD8%3D',
                 'http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=A4TKEG9x4F8%3D']
    for new_link in new_links:
        wget.download(new_link)

Output: four files were downloaded, named pdf.php, pdf(1).php, and so on.
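
Since every URL ends in pdf.php, the downloads get a .php name. If you would rather have distinct .pdf names, one option is to pass an explicit target through the `out` argument of `wget.download`. The sketch below assumes the server really returns PDF data and that the `rdd` query value is a usable identifier. The helper names `pdf_name_for` and `download_all` and the `pdfs` directory are hypothetical.

```python
import os
from urllib.parse import urlparse, parse_qs


def pdf_name_for(url, index):
    """Build a file name such as 001_<rdd>.pdf from the URL's rdd parameter."""
    params = parse_qs(urlparse(url).query)  # parse_qs also decodes %3D to "="
    token = params.get("rdd", [""])[0]
    # keep only letters and digits so the name is safe on any file system
    safe = "".join(ch for ch in token if ch.isalnum())
    parts = [f"{index:03d}"]
    if safe:
        parts.append(safe)
    return "_".join(parts) + ".pdf"


def download_all(urls, out_dir="pdfs"):
    """Download each URL into out_dir and return the saved file names."""
    import wget  # third-party package: pip install wget

    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, start=1):
        target = os.path.join(out_dir, pdf_name_for(url, index))
        saved.append(wget.download(url, out=target))
    return saved
```

Calling `download_all(new_links)` would then try to fetch each link into the `pdfs` directory. The index prefix is there so that two tokens that differ only in punctuation do not end up with the same file name.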
