Scrapy: how to scrape data using the Google API

mzsu5hc0 posted on 2022-11-09 in Go
Answers (3) | Views (233)

import requests

def search(query, pages=4, rsz=8):
    url = 'https://ajax.googleapis.com/ajax/services/search/web'
    params = {
        'v': 1.0,     # Version
        'q': query,   # Query string
        'rsz': rsz,   # Result set size - max 8
    }

    for s in range(0, pages*rsz+1, rsz):
        params['start'] = s
        r = requests.get(url, params=params)
        for result in r.json()['responseData']['results']:
            yield result

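On the first two or three attempts this retrieves all the pages I need, but after that it gets no results at all: it returns None or []. Is Google blocking my IP after a few attempts? Is there any solution?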
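
One way to tell an empty result page apart from a rejected request is to inspect the whole JSON payload rather than indexing straight into it. The sketch below assumes the service sets `responseData` to null and reports the reason in `responseStatus` and `responseDetails` when it refuses a request. Those two field names, the sample payloads, and the helper `extract_results` are assumptions for illustration and do not come from the code above.

```python
def extract_results(payload):
    """Return the result list from a parsed JSON response.

    Raises RuntimeError when 'responseData' is missing or null, which is
    assumed here to mean the request was rejected (for example, throttled).
    """
    data = payload.get("responseData")
    if data is None:
        # 'responseStatus' and 'responseDetails' are assumed field names.
        status = payload.get("responseStatus")
        details = payload.get("responseDetails")
        raise RuntimeError(f"request rejected: status={status}, details={details}")
    return data.get("results", [])


# Hypothetical payloads for illustration only.
ok = {"responseData": {"results": [{"url": "https://example.com"}]}, "responseStatus": 200}
blocked = {"responseData": None, "responseDetails": "rejected", "responseStatus": 403}

print(extract_results(ok))
try:
    extract_results(blocked)
except RuntimeError as err:
    print(err)
```

Inside `search`, calling `extract_results(r.json())` in place of `r.json()['responseData']['results']` would make a refusal show up as an explicit error instead of a missing value.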

bhmjp9jg #1

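I'm not sure whether this is feasible, but the only way to avoid being blocked by sites that discourage scraping is to use proxies when retrieving the pages. Check how proxies can be used in your code.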
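
As a minimal sketch of that suggestion: requests accepts a `proxies` mapping on each call, so a scraper can rotate through several proxies. The proxy addresses below are placeholders, the helper names `proxy_rotation` and `get_via_proxy` are hypothetical, and `session` is expected to be a `requests.Session`.

```python
from itertools import cycle


def proxy_rotation(proxy_urls):
    """Yield requests-style proxies mappings, cycling through proxy_urls forever."""
    for proxy_url in cycle(proxy_urls):
        # requests picks the proxy by the scheme of the target URL.
        yield {"http": proxy_url, "https": proxy_url}


def get_via_proxy(session, url, params, proxies, timeout=30):
    """Send one GET request through the given proxies mapping.

    session is expected to behave like requests.Session.
    """
    response = session.get(url, params=params, proxies=proxies, timeout=timeout)
    response.raise_for_status()  # surface HTTP errors such as 403 or 429
    return response


# Placeholder proxy addresses; replace them with proxies you are allowed to use.
PROXY_URLS = ["http://10.10.1.10:3128", "http://10.10.1.11:3128"]
rotation = proxy_rotation(PROXY_URLS)
```

With requests installed, a call might look like `get_via_proxy(requests.Session(), url, params, next(rotation))`, drawing a fresh mapping from `rotation` for each request.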

jvidinwx #2

The problem was solved with requests and BeautifulSoup.

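import requests
from bs4 import BeautifulSoup

# Placeholder values: strToSearch, start and num were not defined in the original snippet.
strToSearch, start, num = 'example query', 0, 10

url = 'http://www.google.com/search'
payload = {'q': strToSearch, 'start': str(start), 'num': str(num)}
r = requests.get(url, params=payload, auth=('user', 'pass'))
soup = BeautifulSoup(r.text, 'html.parser')  # parse the returned HTML
text = soup.get_text(separator=' ')          # visible text of the page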

ugmeyewa #3

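Make sure you send a user-agent header, because Google may block a request that is sent without one. For example, the default requests user-agent is python-requests, so the site can tell that the request comes from a script and may block it.
Also, there is no need for auth=('user', 'pass'), because you don't have to log in anywhere to search Google.
Code and full example in the online IDE: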

import json

import requests
from bs4 import BeautifulSoup  # the "lxml" parser used below requires the lxml package

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls

params = {
    "q": "minecraft redstone ideas",  # search query
    "gl": "us",                       # country of the search
    "hl": "en"                        # language                
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.79 Safari/537.36",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

results = []

for index, result in enumerate(soup.select(".tF2Cxc"), start=1):
    title = result.select_one(".DKV0Md").text
    link = result.select_one(".yuRUbf a")["href"]
    displayed_link = result.select_one(".tjvcx").text
    try:
        snippet = result.select_one("#rso .lyLwlc").text
    except AttributeError:  # select_one() returned None: this result has no snippet
        snippet = None

    results.append({
        "position": index,
        "title": title,
        "link": link,
        "displayed_link": displayed_link,
        "snippet": snippet
    })

print(json.dumps(results, indent=2, ensure_ascii=False))

Partial output:

[
  {
    "position": 1,
    "title": "15 Awesome Minecraft Redstone Ideas - WhatIfGaming",
    "link": "https://whatifgaming.com/awesome-minecraft-redstone-ideas/",
    "displayed_link": "https://whatifgaming.com › awesome-minecraft-redstone-i...",
    "snippet": null
  },
  {
    "position": 2,
    "title": "Minecraft: 20 Insanely Useful Redstone Contraptions ...",
    "link": "https://gamerant.com/minecraft-insanely-useful-redstone-contraptions/",
    "displayed_link": "https://gamerant.com › Lists",
    "snippet": "Nov 1, 2021 — Minecraft: 20 Insanely Useful Redstone Contraptions ; 20 Bubble Elevator ; 19 Kelp Farm ; 18 Xray Machine ; 17 Armor Wardrobe ; 16 Micro-Crop Farm."
  },
  {
    "position": 3,
    "title": "10 Minecraft Redstone Tricks for Survival Mode - dummies",
    "link": "https://www.dummies.com/article/home-auto-hobbies/games/online-games/minecraft/10-minecraft-redstone-tricks-for-survival-mode-147583",
    "displayed_link": "https://www.dummies.com › ... › Minecraft",
    "snippet": "Learn how to apply redstone programming in Minecraft Survival mode, including dungeon farms, fast transportation, elevators, and more."
  },
]

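Alternatively, you can use SerpApi's Google Organic Results API to achieve the same thing.
It is a paid API with a free plan. It handles blocks from Google and other search engines and scales, so the end user can think about what data to extract instead of building a parser from scratch, maintaining it, and working out how to bypass blocks from Google or other search engines.
Code to integrate: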

from serpapi import GoogleSearch
import json

params = {
    "api_key": "serpapi_key",
    "engine": "google",
    "q": "minecraft redstone ideas",
    "google_domain": "google.com",
    "gl": "us",
    "hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()

data = []

for result in results["organic_results"]:
    data.append({
        "position": result.get("position"),
        "title": result.get("title"),
        "link": result.get("link"),
        "displayed_link": result.get("displayed_link"),
        "snippet": result.get("snippet")
    })

print(json.dumps(data, indent=2, ensure_ascii=False))

Partial output:

[
  {
    "position": 1,
    "title": "Minecraft: 20 Insanely Useful Redstone Contraptions ...",
    "link": "https://gamerant.com/minecraft-insanely-useful-redstone-contraptions/",
    "displayed_link": "https://gamerant.com › Lists",
    "snippet": "Minecraft: 20 Insanely Useful Redstone Contraptions ; 20 Bubble Elevator ; 19 Kelp Farm ; 18 Xray Machine ; 17 Armor Wardrobe ; 16 Micro-Crop Farm."
  },
  {
    "position": 2,
    "title": "Build These in Your Minecraft House! - YouTube",
    "link": "https://www.youtube.com/watch?v=a3ggfzC0rLg",
    "displayed_link": "https://www.youtube.com › watch",
    "snippet": null
  },
  {
    "position": 3,
    "title": "10 Minecraft Redstone Tricks for Survival Mode - dummies",
    "link": "https://www.dummies.com/article/home-auto-hobbies/games/online-games/minecraft/10-minecraft-redstone-tricks-for-survival-mode-147583",
    "displayed_link": "https://www.dummies.com › ... › Minecraft",
    "snippet": "Learn how to apply redstone programming in Minecraft Survival mode, including dungeon farms, fast transportation, elevators, and more."
  }, ... other results
]

Disclaimer: I work for SerpApi.
