Wait for the page to load before getting data with requests.get in Python 3

Posted by khbbv19g on 2022-11-19 in Python


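I have a page whose source I need to get for use with BS4, but the middle of the page takes about 1 second (maybe less) to load its content, and requests.get captures the page source before that section loads. How can I wait a second before getting the data?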

r = requests.get(URL + self.search, headers=USER_AGENT, timeout=5)
soup = BeautifulSoup(r.content, 'html.parser')
a = soup.find_all('section', 'wrapper')

The page

<section class="wrapper" id="resultado_busca">

python-3.x

Source: https://stackoverflow.com/questions/45448994/wait-page-to-load-before-getting-data-with-requests-get-in-python-3

6 Answers


huwehgph1#

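This doesn't look like a waiting problem. It looks like the element is being created by JavaScript, and requests can't handle elements that JavaScript generates dynamically. One suggestion is to use **selenium together with PhantomJS** to get the page source, which you can then parse with BeautifulSoup. The code below does exactly that: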

from bs4 import BeautifulSoup
from selenium import webdriver

url = "http://legendas.tv/busca/walking%20dead%20s03e02"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
a = soup.find('section', 'wrapper')

Also, there's no need to use .findAll if you are only looking for one element.
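To illustrate why waiting alone does not help, here is a minimal offline sketch. The RAW_HTML string is a made-up stand-in for what requests.get() might return from a page like this (it is not the real site's markup), and the standard library's html.parser is used in place of BeautifulSoup only so that the sketch runs without extra packages. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical raw response: the section is empty in the HTML itself and
# would only be filled in by the script once a browser executes it.
RAW_HTML = """
<html><body>
<section class="wrapper" id="resultado_busca"></section>
<script>
  document.getElementById("resultado_busca").textContent = "results loaded by JavaScript";
</script>
</body></html>
"""


class SectionProbe(HTMLParser):
    """Counts tags nested inside <section id="resultado_busca">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.children = 0

    def handle_starttag(self, tag, attrs):
        if self.inside:
            self.children += 1
        elif tag == "section" and dict(attrs).get("id") == "resultado_busca":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "section":
            self.inside = False


def count_section_children(html):
    """Return how many tags the parser finds inside the results section."""
    probe = SectionProbe()
    probe.feed(html)
    return probe.children


if __name__ == "__main__":
    # An HTML parser never runs the script, so the section stays empty
    # no matter how long you wait before parsing.
    print(count_section_children(RAW_HTML))  # prints 0
```

The content only appears once something actually executes the script, which is the job the browser driven by selenium does in the answer above.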


3qpi33ja2#

I had the same problem, and none of the submitted answers worked for me. But after a lot of research, I found a solution:

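from requests_html import HTMLSession

url = "http://legendas.tv/busca/walking%20dead%20s03e02"  # the URL used in the other answers
s = HTMLSession()
response = s.get(url)
response.html.render()  # executes the page's JavaScript

print(response.html.html)
# prints out the content of the fully loaded page
# response.html.html can be parsed with, for example, bs4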

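The requests_html package (docs) is published under the psf (Python Software Foundation) organization on GitHub and adds some JavaScript capabilities, such as waiting for the page's JS to finish loading. Note that render() downloads a copy of Chromium the first time it is called.
The package currently supports only Python 3.6 and above, so it may not work with other versions.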


a5g8bdjr3#

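Selenium is a good way to solve this, but the accepted answer is quite outdated. As @Seth mentioned in the comments, the headless mode of Firefox/Chrome (or possibly other browsers) should be used instead of PhantomJS.
First of all, you need to download the specific driver:
Geckodriver for Firefox
ChromeDriver for Chrome
Next, you can add the path to the downloaded driver to the system PATH variable. That is not required, though: you can also specify the location of the executable in code.
Firefox: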

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.service import Service

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
# Selenium 4 takes the driver path through Service (Selenium 3 used executable_path=...).
# Service() with no argument is enough if you updated PATH.
browser = webdriver.Firefox(options=options, service=Service('YOUR_PATH/geckodriver.exe'))
browser.get("http://legendas.tv/busca/walking%20dead%20s03e02")
html = browser.page_source
soup = BeautifulSoup(html, features="html.parser")
print(soup)
browser.quit()

The same for Chrome:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument('--headless')
# Selenium 4 takes the driver path through Service (Selenium 3 used executable_path=...).
# Service() with no argument is enough if you updated PATH.
browser = webdriver.Chrome(options=options, service=Service('YOUR_PATH/chromedriver.exe'))
browser.get("http://legendas.tv/busca/walking%20dead%20s03e02")
html = browser.page_source
soup = BeautifulSoup(html, features="html.parser")
print(soup)
browser.quit()

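It's good to remember to call browser.quit() so that no processes are left hanging after the code finishes. If you worry that your code may fail before the browser is disposed of, you can wrap it in a try...except block and put browser.quit() in the finally part to make sure it is called.
Additionally, if part of the source is still not loaded with this method, you can ask Selenium to wait until a specific element is present: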

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
# Selenium 4 takes the driver path through Service (Selenium 3 used executable_path=...).
browser = webdriver.Firefox(options=options, service=Service('YOUR_PATH/geckodriver.exe'))

try:
    browser.get("http://legendas.tv/busca/walking%20dead%20s03e02")
    timeout_in_seconds = 10
    # Block until the results section exists in the DOM, or raise TimeoutException.
    WebDriverWait(browser, timeout_in_seconds).until(ec.presence_of_element_located((By.ID, 'resultado_busca')))
    html = browser.page_source
    soup = BeautifulSoup(html, features="html.parser")
    print(soup)
except TimeoutException:
    print("I give up...")
finally:
    browser.quit()

If you are interested in drivers for browsers other than Firefox or Chrome, check the docs.


ct3nt3jp4#

I found a way!

r = requests.get('https://github.com', timeout=(3.05, 27))

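Here timeout takes two values. The first is the connect timeout, and the second, the one that matters here, is the read timeout: the maximum number of seconds the client will wait for the server to send data. It is an upper bound, not a delay, so it only helps when the server itself is slow to respond; it does not make requests wait for JavaScript to fill in the page. You can estimate how long the server takes to produce the data, set the read timeout above that, and then print the data out.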
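A small sketch of how a timeout behaves may help here. It is only an illustration under stated assumptions: a toy local server (SlowHandler, SERVER_DELAY and the function names are made up for this example) answers after half a second, and the standard library's urllib.request stands in for requests so that nothing extra needs to be installed. urllib takes a single timeout value rather than the (connect, read) pair, but the point carries over: a timeout caps how long the client waits and never makes it wait longer.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

SERVER_DELAY = 0.5  # seconds the toy server waits before answering


class SlowHandler(BaseHTTPRequestHandler):
    """Toy handler that answers every GET after SERVER_DELAY seconds."""

    def do_GET(self):
        time.sleep(SERVER_DELAY)
        body = b"<section id='resultado_busca'></section>"
        try:
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except ConnectionError:
            pass  # the client already gave up

    def log_message(self, *args):
        pass  # keep the demo quiet


def timed_get(url, timeout):
    """Return (outcome, seconds elapsed) for one GET with the given timeout."""
    # An empty ProxyHandler keeps the request local even if proxies are configured.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    start = time.monotonic()
    try:
        with opener.open(url, timeout=timeout) as response:
            response.read()
        outcome = "ok"
    except OSError:  # socket timeouts and URLError are both OSError subclasses
        outcome = "timed out"
    return outcome, time.monotonic() - start


def run_demo():
    """Fetch from the slow server with a too-short and a generous timeout."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/"
    try:
        return timed_get(url, timeout=0.1), timed_get(url, timeout=5)
    finally:
        server.shutdown()
        server.server_close()
```

Calling run_demo() should show the 0.1 second attempt failing and the 5 second attempt succeeding after roughly SERVER_DELAY seconds, not after 5 seconds. In both cases the returned HTML is whatever the server sent; no script has been executed.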


au9on6nz5#

In Python 3, using the urllib module works better in practice than the requests module when loading dynamic web pages.
i.e.

import urllib.error
import urllib.request

url = "http://legendas.tv/busca/walking%20dead%20s03e02"  # the URL used in the other answers

try:
    with urllib.request.urlopen(url) as response:
        html = response.read().decode('utf-8')  # use whatever encoding the webpage declares
except urllib.error.HTTPError as e:
    if e.code == 404:
        print(f"{url} is not found")
    elif e.code == 503:
        print(f"{url} base webservices are not available")
        # can add authentication here
    else:
        print('http error', e)


kjthegm66#

Just listing my way of doing it, maybe it can be of value to someone:

import time

import requests

max_retries = 5  # example value; use any int
retry_delay = 2  # example value, in seconds; use any int
ready = False
for attempt in range(max_retries):
    try:
        response = requests.get('https://github.com')
        if response.ok:
            ready = True
            break
    except requests.exceptions.RequestException:
        print("Website not available...")
    time.sleep(retry_delay)

if not ready:
    print("Problem")
else:
    print("All good")
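The same retry idea can be sketched as a small reusable helper that keeps fetching until the content you need is actually there. This is only a sketch: poll_until, fetch and is_ready are hypothetical names, and the fake pages below stand in for real responses. It can only help when the server itself eventually returns the content in the raw HTML; it does not run JavaScript.

```python
import time


def poll_until(fetch, is_ready, max_retries=5, retry_delay=1.0, sleep=time.sleep):
    """Call fetch() until is_ready(result) is true or the attempts run out.

    Returns the first ready result, or None if every attempt fell short.
    """
    for attempt in range(max_retries):
        result = fetch()
        if is_ready(result):
            return result
        if attempt < max_retries - 1:
            sleep(retry_delay)  # pause only between attempts, not after the last one
    return None


if __name__ == "__main__":
    # Fake responses: the first is incomplete, the second has the results.
    pages = iter([
        "<section id='resultado_busca'></section>",
        "<section id='resultado_busca'><article>hit</article></section>",
    ])
    html = poll_until(lambda: next(pages), lambda page: "<article" in page, retry_delay=0)
    print(html is not None)  # prints True
```

With requests, fetch could be something like lambda: requests.get(url, timeout=5).text, and is_ready a check for the element you are after. Passing sleep as a parameter makes the helper easy to try out without real delays.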

