Python, Selenium: save a full-page screenshot as a .pdf with no page breaks, regardless of page size

hfyxw5xn  posted on 2023-05-21  in Python

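I know Selenium can take screenshots, but they are always .png files. How can I capture the same kind of screenshot as a .pdf?
Desired style: no margins; same dimensions as the current page (like a full-page screenshot).
Printing the page does not achieve this, because of all the formatting that printing applies.
How I currently take the screenshot: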

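from selenium import webdriver

# Output path for the screenshot (placeholder value)
PNG_SAVEAS = 'screenshot.png'

# Function to find page size
S = lambda X: driver.execute_script('return document.body.parentNode.scroll'+X)

options = webdriver.FirefoxOptions()
driver = webdriver.Firefox(options=options)
driver.get('https://www.google.com')

# Full page dimensions
height = S('Height')
width = S('Width')

# Resize the window to the full page, then take the screenshot
driver.set_window_size(width, height)
driver.get_screenshot_as_file(PNG_SAVEAS)

driver.close()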

hxzsmxv2  #1

Try this:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from PIL import Image

def get_page_size(driver):
    # Use the scroll dimensions (whole document), not the visible viewport
    return driver.execute_script('return [document.documentElement.scrollWidth, document.documentElement.scrollHeight];')

def scroll_to_bottom(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

def capture_screenshot_as_png(driver, file_path):
    driver.save_screenshot(file_path)

def convert_to_pdf(input_file, output_file):
    # Convert to RGB first: PNG screenshots can carry an alpha channel,
    # which some Pillow versions cannot write to PDF
    image = Image.open(input_file).convert('RGB')
    image.save(output_file, 'PDF', resolution=100.0)

# Set up the Firefox driver with options.
# Recent Selenium 4 releases locate geckodriver themselves and no longer
# accept the executable_path / capabilities arguments.
options = Options()
options.add_argument('-headless')
options.accept_insecure_certs = True
driver = webdriver.Firefox(options=options)

# Navigate to the webpage
driver.get('https://www.google.com')

# Scroll to the bottom to load dynamic content
scroll_to_bottom(driver)

# Get the page size (after dynamic content has loaded)
page_size = get_page_size(driver)

# Set the window size to the full page
driver.set_window_size(page_size[0], page_size[1])

# Capture the full-page screenshot as PNG
png_file_path = 'full_page_screenshot.png'
capture_screenshot_as_png(driver, png_file_path)

# Convert the PNG screenshot to PDF
pdf_file_path = 'full_page_screenshot.pdf'
convert_to_pdf(png_file_path, pdf_file_path)

# Clean up and close the browser
driver.quit()

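This code captures the full-page screenshot as a PNG file and then converts it to a PDF file. Adjust the file paths (png_file_path and pdf_file_path) to wherever you want the files saved.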
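
As a rough illustration of why this conversion produces a single page with no breaks: when Pillow writes an image as a PDF, it sizes the page from the image's pixel dimensions and the `resolution` argument (pixels per inch), at 72 PDF points per inch. The sketch below only reproduces that arithmetic. The helper name `pdf_page_size_points` and the 1280 x 4000 example dimensions are made up for illustration and are not part of the answer.

```python
def pdf_page_size_points(width_px, height_px, resolution=100.0):
    """Return the (width, height) of the resulting PDF page in points.

    One PDF point is 1/72 inch; `resolution` is pixels per inch,
    matching the value passed to Image.save(..., 'PDF', resolution=...).
    """
    return (width_px * 72.0 / resolution, height_px * 72.0 / resolution)


# Hypothetical screenshot size: 1280 x 4000 pixels
width_pt, height_pt = pdf_page_size_points(1280, 4000, resolution=100.0)
print(f"{width_pt:.1f} x {height_pt:.1f} pt")
```

Because the page is derived from the image itself, a taller screenshot simply gives a taller single page. Changing `resolution` changes the physical page size, not the number of pixels stored.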


2ekbmq32  #2

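To get the desired result, I arrived at a solution that is not easy to find elsewhere.
The key is to set the PDF page width and height dynamically so that they match the content being printed. I also found that scaling the result down to just 1% of its original size speeds the process up significantly.
One caveat: when using GeckoDriver, I ran into a bug (linked in the code comment below) that made the generated PDF come out at the wrong size. Multiplying the dimensions by 2.5352112676056335 fixed it. It is still unclear to me why this particular constant works, but without it the PDF's aspect ratio is distorted (rather than the page simply being scaled down to 39% of its intended size). The distortion produces a multi-page .pdf file, which is not the desired result.
This approach was tested with GeckoDriver. If you are using Chrome, you may not need the RATIO_MULTIPLIER workaround.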
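
To make the sizing arithmetic described above concrete, here is a small sketch that isolates it from the browser. The function name `print_page_size` and the 1280 x 4000 pixel example are assumptions for illustration; the two constants are the ones this answer uses.

```python
RATIO_MULTIPLIER = 2.5352112676056335  # empirical constant from this answer
PDF_SCALER = 0.01                      # scale factor from this answer


def print_page_size(width_px, height_px,
                    scaler=PDF_SCALER, ratio=RATIO_MULTIPLIER):
    """Return (page_width, page_height) computed the way this answer does."""
    return (width_px * scaler * ratio, height_px * scaler * ratio)


# Hypothetical page measuring 1280 x 4000 pixels
page_width, page_height = print_page_size(1280, 4000)
print(f"page_width={page_width:.2f}, page_height={page_height:.2f}")

# Both sides are multiplied by the same factor, so the aspect ratio is kept
print(abs(page_width / page_height - 1280 / 4000) < 1e-12)
```

The WebDriver print command expresses page dimensions in centimeters. For comparison, the usual CSS conversion is 2.54 / 96, about 0.0265 cm per pixel, which is close to the combined factor here (0.01 x 2.535, about 0.0254). Whether that is the reason the constant works is not confirmed by the source.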

from selenium import webdriver
from selenium.webdriver.common.print_page_options import PrintOptions
import base64

# Bug in geckodriver... seems unrelated, but this won't work otherwise.
# https://github.com/SeleniumHQ/selenium/issues/12066
RATIO_MULTIPLIER = 2.5352112676056335

# Function to find page size
S = lambda X: driver.execute_script('return document.body.parentNode.scroll'+X)

# Scale factor for the PDF page size. 1 (no scaling) takes a long time
pdf_scaler = .01

# Browser options. Headless is more reliable for screenshots in my experience.
options = webdriver.FirefoxOptions()
options.add_argument('--headless')

# Launch webdriver, navigate to destination
driver = webdriver.Firefox(options=options)
driver.get('https://www.google.com')

# Find full page dimensions regardless of scroll
height = S('Height')
width = S('Width')

# Dynamic setting of PDF page dimensions
print_options = PrintOptions()
print_options.page_height = (height*pdf_scaler)*RATIO_MULTIPLIER
print_options.page_width = (width*pdf_scaler)*RATIO_MULTIPLIER
print_options.shrink_to_fit = True

# Prints to PDF (returns base64 encoded data. Must save)
pdf = driver.print_page(print_options=print_options)
driver.close()

# save the output to a file.
with open('example.pdf', 'wb') as file:
    file.write(base64.b64decode(pdf))

Versions used:

geckodriver 0.31.0
Firefox 113.0.1
selenium==4.9.1
Python 3.11.2
Windows 10
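
Since the failure mode described in this answer is a multi-page PDF, a quick sanity check on the decoded output can help when tuning the constants. The sketch below is a heuristic, not a PDF parser: it counts `/Type /Page` markers in the raw bytes, which can miscount if a PDF stores its page objects in compressed object streams. The helper names and the hand-written sample payload are assumptions for illustration; in practice the base64 string would come from `driver.print_page(...)`.

```python
import base64
import re


def count_pdf_pages(pdf_bytes):
    """Heuristically count page objects in raw PDF bytes."""
    # Match "/Type /Page" but not "/Type /Pages" (the page-tree node)
    return len(re.findall(rb"/Type\s*/Page(?![a-zA-Z])", pdf_bytes))


def decode_and_check(pdf_base64):
    """Decode a base64 PDF payload and return (bytes, estimated page count)."""
    data = base64.b64decode(pdf_base64)
    if not data.startswith(b"%PDF-"):
        raise ValueError("decoded payload does not look like a PDF")
    return data, count_pdf_pages(data)


# Hand-written stand-in for the base64 string returned by print_page()
# (only enough structure to exercise the check, not a complete PDF)
fake_pdf = (b"%PDF-1.4\n"
            b"1 0 obj << /Type /Pages /Count 1 >> endobj\n"
            b"2 0 obj << /Type /Page /Parent 1 0 R >> endobj\n")
sample = base64.b64encode(fake_pdf).decode("ascii")

data, pages = decode_and_check(sample)
print(pages)
```

If the estimate is greater than 1 for a real printout, the page dimensions probably did not match the content and the multiplier or scale factor may need adjusting.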
