How do I scrape related searches on Google with Selenium?

ffscu2ro · asked 2022-11-24

I'm trying to scrape Google's related searches for a set of keywords and write them out to a CSV file. My problem is getting Beautiful Soup to identify the related-searches HTML tag.
Here is an example of the HTML tag from the page source:

<div data-ved="2ahUKEwitr8CPkLT3AhVRVsAKHVF-C80QmoICKAV6BAgEEBE">iphone xr</div>

Here is my webdriver setup:

from selenium import webdriver

user_agent = 'Chrome/100.0.4896.60'

webdriver_options = webdriver.ChromeOptions()
webdriver_options.add_argument('user-agent={0}'.format(user_agent))

capabilities = webdriver_options.to_capabilities()
capabilities["acceptSslCerts"] = True
capabilities["acceptInsecureCerts"] = True

And here is my code:

queries = ["iphone"]

driver = webdriver.Chrome(options=webdriver_options, desired_capabilities=capabilities, port=4444)

df2 = []

driver.get("https://google.com")
time.sleep(3)
driver.find_element(By.CSS_SELECTOR, "[aria-label='Agree to the use of cookies and other data for the purposes described']").click()

# get_current_related_searches
for query in queries:
    driver.get("https://google.com/search?q=" + query)
    time.sleep(3)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    p = soup.find_all('div data-ved')
    print(p)
    d = pd.DataFrame({'loop': 1, 'source': query, 'from': query, 'to': [s.text for s in p]})
    terms = d["to"]
    df2.append(d)
    time.sleep(3)

df = pd.concat(df2).reset_index(drop=False)

df.to_csv("related_searches.csv")

The p = soup.find_all is what's wrong; I just can't work out how to get BS to identify these specific HTML tags. Any help would be great :)

rjjhvcjd · Answer 1

@jakecohensol, as you pointed out, the selector in p = soup.find_all is wrong. The correct CSS selector is .y6Uyqe .AB4Wff (see the sketch below for why the original call matches nothing).
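
For background: find_all('div data-ved') treats the whole string as a single tag name, so it never matches anything. A minimal sketch of attribute-based matching in Beautiful Soup (data-ved values are generated per session, so match on the attribute's presence rather than its value):

from bs4 import BeautifulSoup

html = '<div data-ved="2ahUKEwitr8CPkLT3AhVRVsAKHVF-C80QmoICKAV6BAgEEBE">iphone xr</div>'
soup = BeautifulSoup(html, 'html.parser')

# match any <div> that carries a data-ved attribute, whatever its value
by_attrs = soup.find_all('div', attrs={'data-ved': True})

# the same match expressed as a CSS attribute selector
by_css = soup.select('div[data-ved]')

print([el.text for el in by_attrs])  # ['iphone xr']

On a real results page many elements carry data-ved, which is why the class-based selector above is the more precise choice.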
The Chrome/100.0.4896.60 User-Agent header is incomplete, and Google blocks requests with such agent strings; with a full User-Agent string Google returns a proper HTML response (a quick way to verify what you are actually sending is sketched below).
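
A sketch that echoes the request headers back via httpbin so you can see exactly which User-Agent string leaves your machine (httpbin.org is just one convenient echo service):

import requests

headers = {"User-Agent": "Chrome/100.0.4896.60"}

# httpbin echoes the request headers back as JSON
response = requests.get("https://httpbin.org/headers", headers=headers)
print(response.json()["headers"]["User-Agent"])  # confirm what Google would see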
Google related searches can be scraped without a browser at all, which is faster and more reliable.
Here is your fixed code snippet (link to the full code in an online IDE):

import time
import requests
from bs4 import BeautifulSoup
import pandas as pd

# a complete User-Agent string; bare "Chrome/x.y" tokens get blocked
headers = {
    "User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 14526.89.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36"
}

queries = ["iphone", "pixel", "samsung"]

df2 = []

# get_current_related_searches
for query in queries:
    params = {"q": query}
    response = requests.get("https://google.com/search", params=params, headers=headers)

    soup = BeautifulSoup(response.text, "html.parser")

    p = soup.select(".y6Uyqe .AB4Wff")  # related-searches block and its items

    d = pd.DataFrame(
        {"loop": 1, "source": query, "from": query, "to": [s.text for s in p]}
    )

    terms = d["to"]
    df2.append(d)

    time.sleep(3)  # pause between queries to reduce the chance of being blocked

df = pd.concat(df2).reset_index(drop=False)

df.to_csv("related_searches.csv")

Example output:

,index,loop,source,from,to
0,0,1,iphone,iphone,iphone 13
1,1,1,iphone,iphone,iphone 12
2,2,1,iphone,iphone,iphone x
3,3,1,iphone,iphone,iphone 8
4,4,1,iphone,iphone,iphone 7
5,5,1,iphone,iphone,iphone xr
6,6,1,iphone,iphone,find my iphone
7,0,1,pixel,pixel,pixel 6
8,1,1,pixel,pixel,google pixel
9,2,1,pixel,pixel,pixel phone
10,3,1,pixel,pixel,pixel 6 pro
11,4,1,pixel,pixel,pixel 3
12,5,1,pixel,pixel,google pixel price
13,6,1,pixel,pixel,pixel 6 release date
14,0,1,samsung,samsung,samsung galaxy
15,1,1,samsung,samsung,samsung tv
16,2,1,samsung,samsung,samsung tablet
17,3,1,samsung,samsung,samsung account
18,4,1,samsung,samsung,samsung mobile
19,5,1,samsung,samsung,samsung store
20,6,1,samsung,samsung,samsung a21s
21,7,1,samsung,samsung,samsung login
xyhw6mcr · Answer 2

Have a look at the SelectorGadget Chrome extension, which lets you grab a CSS selector by clicking the desired element in the browser.
Check what your user agent is, and collect multiple user agents for mobile, tablet, PC, or different operating systems so you can rotate them, which slightly reduces the chance of being blocked.
The ideal setup combines rotating user agents with rotating proxies (ideally residential) plus a CAPTCHA solver for the Google CAPTCHA that will eventually appear; the rotation part is sketched below.
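
A minimal sketch of rotating user agents and proxies with requests (the agent strings are realistic examples, but the proxy endpoints are hypothetical placeholders to be replaced with your provider's):

import random
import requests

# any pool of complete, realistic User-Agent strings will do
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0",
]

# hypothetical proxy endpoints; substitute the ones your proxy provider gives you
proxies = [
    {"https": "http://user:pass@proxy1.example.com:8080"},
    {"https": "http://user:pass@proxy2.example.com:8080"},
]

response = requests.get(
    "https://google.com/search",
    params={"q": "iphone"},
    headers={"User-Agent": random.choice(user_agents)},  # rotate the agent per request
    proxies=random.choice(proxies),                      # rotate the exit IP per request
    timeout=30,
)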
If you don't want to work out how to build and maintain a parser from scratch, or how to get around blocks from Google (or other search engines), an alternative is to use the Google Search Engine Results API to scrape Google search results.
Example code to integrate:

import os
import pandas as pd  # needed for the DataFrame/CSV step at the end
from serpapi import GoogleSearch

queries = [
    'banana',
    'minecraft',
    'apple stock',
    'how to create a apple pie'
]

def serpapi_scrape_related_queries():

    related_searches = []

    for query in queries:
        print(f'extracting related queries from query: {query}')

        params = {
            'api_key': os.getenv('API_KEY'),  # your serpapi api key
            'device': 'desktop',              # device to retrieve results from
            'engine': 'google',               # serpapi parsing engine
            'q': query,                       # search query
            'gl': 'us',                       # country of the search
            'hl': 'en'                        # language of the search
        }

        search = GoogleSearch(params)         # where data extracts on the backend
        results = search.get_dict()           # JSON -> dict

        for result in results['related_searches']:
            query = result['query']
            link = result['link']

            related_searches.append({
                'query': query,
                'link': link
            })

    pd.DataFrame(data=related_searches).to_csv('serpapi_related_queries.csv', index=False)

serpapi_scrape_related_queries()
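
One small robustness tweak worth considering: the results dict may not contain a related_searches key for every query, so a defensive lookup in the inner loop avoids a KeyError (a sketch using a plain dict.get):

# 'related_searches' can be absent for some queries; default to an empty list
for result in results.get('related_searches', []):
    related_searches.append({'query': result['query'], 'link': result['link']})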

Part of the DataFrame output:

query                                               link
0  banana benefits  https://www.google.com/search?gl=us&hl=en&q=Ba...
1  banana republic  https://www.google.com/search?gl=us&hl=en&q=Ba...
2      banana tree  https://www.google.com/search?gl=us&hl=en&q=Ba...
3   banana meaning  https://www.google.com/search?gl=us&hl=en&q=Ba...
4     banana plant  https://www.google.com/search?gl=us&hl=en&q=Ba...
