Beautiful Soup / Python: downloading images from Google search at their full size

t5fffqht  posted on 2023-06-20  in  Python

I'm writing a script that downloads only the first image from a Google search. I've got it working with this code:

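import os
from io import BytesIO

import requests
from bs4 import BeautifulSoup
from PIL import Image

# Assumed save location; the original snippet did not show where this is defined
IMAGE_SAVE_PATH = "images"

def download_image(query):
    url = f"https://www.google.com/search?q={query}&tbm=isch"

    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    img_tag = soup.find("img", {"class": "yWs4tf"})

    if img_tag is not None:
        img_link = img_tag.get("src")
        print(query + ' - Downloading image: ' + img_link)

        # makedirs with exist_ok avoids FileExistsError on repeated calls
        os.makedirs(IMAGE_SAVE_PATH, exist_ok=True)

        response = requests.get(img_link)
        image_bytes = BytesIO(response.content)
        img = Image.open(image_bytes)
        img.save(os.path.join(IMAGE_SAVE_PATH, query + '.png'))
    else:
        print("Couldn't find image for: " + query)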

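My problem is that the downloaded image is only the size displayed in the Google search results, not its original size. I tried changing the HTML size attributes of these images before downloading them, but that didn't work. Do you have any other suggestions?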

dxxyhpgq1#

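a. You can try resizing the image after downloading it, via image = image.resize((X, Y)). Note that resize takes the size as a single (width, height) tuple.
b. If a doesn't work for you, you need to understand that image search results on a search engine show thumbnails of the images, not the actual images. What we see is a thumbnail; the original lives somewhere else. If you want to download the images at their original size, you need to go to the source website and download them from there.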
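
As a minimal sketch of option a, assuming Pillow is installed: Image.resize returns a new image and leaves the source untouched. The in-memory 100x50 image and the `upscale` helper are illustrative placeholders, not part of the original answer. Keep in mind that enlarging a thumbnail only interpolates pixels, so it cannot bring back detail from the original.

```python
from PIL import Image

def upscale(img, width, height):
    """Return a resized copy; the source image is left unchanged."""
    # resize() expects one (width, height) tuple, not two separate arguments
    return img.resize((width, height))

# Stand-in for a downloaded thumbnail (hypothetical 100x50 solid-colour image)
thumbnail = Image.new("RGB", (100, 50), "gray")
enlarged = upscale(thumbnail, 400, 200)
print(enlarged.size)
```

If the result looks blurry, that is expected: option b is the only way to get the real full-size file.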

vdgimpew2#

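You are extracting the URL of the thumbnail, which of course is not the full image. Unfortunately, the full-size URL is a bit harder to get, because it is not loaded when the results are displayed, only after you click on an image.
That is also hard to simulate because, as far as I know, it is loaded via JavaScript. You could rewrite the whole process to use a web driver such as Selenium to simulate user interaction with the images, so that they get loaded.
I'll include a small script that searches for 'clowns' and downloads the first 10 images it finds at full resolution, using Selenium in headless mode, and places them in a directory named 'images'.
To test it, you must first run pip install selenium Pillow requests webdriver_manager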
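
Before downloading, it can help to check what kind of src value you actually got. The following is only a rough sketch: the `classify_src` helper and its labels are hypothetical, and the assumption that Google serves thumbnails from encrypted-tbnN.gstatic.com hosts (or as inline data: URIs) reflects commonly observed behaviour that may change at any time.

```python
from urllib.parse import urlparse

def classify_src(src):
    """Roughly classify an <img> src value taken from a results page."""
    if not src:
        return "missing"
    if src.startswith("data:"):
        # Inline base64 image embedded in the page: always a thumbnail
        return "inline"
    host = urlparse(src).hostname or ""
    # Assumption: thumbnail hosts look like encrypted-tbn0.gstatic.com
    if host.startswith("encrypted-tbn") and host.endswith(".gstatic.com"):
        return "thumbnail"
    if src.startswith(("http://", "https://")):
        # Any other http(s) URL is probably the image on its source site
        return "remote"
    return "unknown"

print(classify_src("https://encrypted-tbn0.gstatic.com/images?q=tbn:abc"))
print(classify_src("https://example.com/clown.jpg"))
```

Only a value classified as "remote" is worth downloading if you want the original resolution.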

import os
from io import BytesIO
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from webdriver_manager.chrome import ChromeDriverManager
from PIL import Image
import requests

# Directory where images are to be saved
IMAGE_SAVE_PATH = os.path.join('.', 'images')

# Create the directory if it does not exist yet
os.makedirs(IMAGE_SAVE_PATH, exist_ok=True)

def setup_driver():
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

    return driver

def click_consent_if_exists(driver):
    try:
        consent_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, '//*[@id="yDmH0d"]/c-wiz/div/div/div/div[2]/div[1]/div[3]/div[1]/div[1]/form[2]/div/div/button/span'))
        )
        consent_button.click()
    except Exception:  # includes the TimeoutException raised when no consent button appears
        print("No consent form found, proceeding...")

def download_image(query, num_images):
    url = f"https://www.google.com/search?q={query}&tbm=isch"

    driver = setup_driver()

    try:
        driver.get(url)
        click_consent_if_exists(driver)

        action_chains = ActionChains(driver)

        for i in range(num_images):
            try:
                image = WebDriverWait(driver, 10).until(
                    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "img.rg_i"))
                )[i]

                action_chains.move_to_element(image).click().perform()

                img_tag = WebDriverWait(driver, 10).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, '#Sva75c > div.DyeYj > div > div.dFMRD > div.pxAole > div.tvh9oe.BIB1wf > c-wiz > div > div > div > div.n4hgof > div.MAtCL.PUxBg > a > img.r48jcc.pT0Scc.iPVvYb'))
                )

                img_link = img_tag.get_attribute("src")

                if img_link is not None and img_link.startswith('http'):
                    response = requests.get(img_link, timeout=10)
                    response.raise_for_status()  # failures are reported by the except block below
                    image_bytes = BytesIO(response.content)
                    img = Image.open(image_bytes)
                    img_filename = f"{os.path.join(IMAGE_SAVE_PATH, query.replace(' ', '_'))}_{i}.png"
                    img.save(img_filename)
                    print(f"{query} - Downloaded image: {i}")
                else:
                    print(f"Couldn't download image for: {query}. Error: Link is None or not http")

            except Exception as e:
                print(f"Error occurred: {str(e)}")

    finally:
        driver.quit()

download_image('clowns', 10)

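Note: the XPath and CSS selectors are the ones that currently work for me, but they may need some adjusting, depending on whether the page loads differently.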
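
Since the selectors are brittle, one possible way to soften that is to try several candidates in order. This is only a sketch: `first_match` is a hypothetical helper, the shorter selectors are guesses derived from the class names in the script above, and a plain dict stands in for the browser so the example runs without Selenium.

```python
def first_match(find, selectors):
    """Return (selector, element) for the first selector that yields a result.

    `find` is any callable that takes a selector string and either returns
    an element or raises an exception (or returns None) when nothing matches.
    """
    for selector in selectors:
        try:
            element = find(selector)
        except Exception:
            continue  # this candidate did not match, try the next one
        if element is not None:
            return selector, element
    return None, None

# Stand-in for a page: only the second candidate selector "exists"
fake_page = {"img.iPVvYb": "<full-size img>"}
candidates = ["img.r48jcc", "img.iPVvYb"]  # hypothetical fallbacks
print(first_match(fake_page.__getitem__, candidates))
```

With Selenium, `find` could be something like `lambda s: driver.find_element(By.CSS_SELECTOR, s)`, which raises NoSuchElementException when a selector does not match.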
