How to easily get a list of URLs from Google search using a proxy? - python

dy1byipe posted on 2023-06-25 in Python


I usually use the googlesearch library like this:

from googlesearch import search
list(search(f"{query}", num_results))

But now I keep getting this error:

requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://www.google.com/sorry/index?continue=https://www.google.com/search%3Fq%{query}%26num%3D10%26hl%3Den%26start%3D67&hl=en&q=EhAmABcACfAIIME0fDvEUYF8GOKX1KQGIjAEGg2nloeEEAcko9umYCP9uPHRWoSo2odE3n3ZgbQ1L6lDvGfyai6798pyy3iU5vcyAXJaAUM

I built a "hacky" workaround with requests and BeautifulSoup, but it is very inefficient: it takes about 1 hour to collect 100 URLs, whereas the lines above take about 1 second:

import random

import requests
from bs4 import BeautifulSoup

# query, num_results, proxy, user_agent, proxies, user_agents,
# TIMEOUT_THRESHOLD and get_working_proxy() are defined elsewhere.
search_results = []
retry = True
while retry:
    try:
        response = requests.get(
            f"https://www.google.com/search?q={query}",
            headers={
                'User-Agent': user_agent,
                'Referer': 'https://www.google.com/',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'en-US,en;q=0.9,en-gb',
            },
            proxies={"http": proxy, "https": proxy},
            timeout=TIMEOUT_THRESHOLD * 2,
        )

        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')

            # Result links are picked out via the .yuRUbf CSS class.
            for link in soup.select('.yuRUbf a'):
                search_results.append(link['href'])

                if len(search_results) >= num_results:
                    retry = False
                    break
        else:
            # Non-200 response: switch proxy and user agent, then try again.
            proxy = get_working_proxy(proxies)
            user_agent = random.choice(user_agents)

    except Exception as e:
        proxy = get_working_proxy(proxies)
        user_agent = random.choice(user_agents)
        print(f"An error occurred in tips search: {str(e)}")

Is there a better, simpler way to get a list of Google search results for a query while still using my proxies?

python-3.x

Source: https://stackoverflow.com/questions/76537012/how-to-easily-get-list-of-urls-from-google-search-using-a-proxy-python

1 Answer


lf5gs5x21#

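Your code most likely takes longer because you are going through proxies, which need time to do their work, and possibly because your timeout threshold is too high, so you wait too long for each failing request to time out.
Things you can try:
1. If you are using free proxies, buy some. Paid proxies tend to have better availability and speed.
2. Lower the timeout threshold (I noticed you doubled it; try not doing that and see how it goes).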
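Both suggestions come down to failing fast on a bad proxy and moving on to the next one. A minimal sketch of that idea follows, assuming `session` behaves like a `requests.Session`; the function name `fetch_first_ok`, the `(3, 8)` timeout pair and the attempt limit are illustrative choices, not values from the question.

```python
import random


def fetch_first_ok(session, url, proxies, user_agents,
                   timeout=(3, 8), max_attempts=5):
    """Return the first HTTP 200 response, using a different proxy per attempt.

    `session` is assumed to behave like requests.Session. `timeout` is a
    (connect, read) pair in seconds, so an unreachable proxy is abandoned
    after about 3 seconds instead of stalling the whole run.
    """
    # Pick distinct proxies in random order, at most max_attempts of them.
    candidates = random.sample(proxies, k=min(max_attempts, len(proxies)))
    for proxy in candidates:
        try:
            response = session.get(
                url,
                headers={"User-Agent": random.choice(user_agents)},
                proxies={"http": proxy, "https": proxy},
                timeout=timeout,
            )
        except OSError:
            # requests.RequestException derives from IOError (OSError), so
            # this covers connection errors, proxy errors and timeouts.
            continue
        if response.status_code == 200:
            return response
    return None  # give up instead of retrying forever
```

Passing the session in keeps the function easy to try out with a stub object before pointing it at real proxies, and the bounded loop means a bad batch of proxies costs seconds rather than an hour.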
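The thing is, your question should probably be "How can I scrape Google search faster?" You are getting these 429 errors because you are hammering Google's servers with requests, and I would guess that scraping is against Google's terms of service. So the real answer is:
1. Use the Google Search API, or do it slowly and be patient.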
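For the "do it slowly" option, one common pattern is to wait longer after each 429 and to honour a `Retry-After` header if the server sends one. The sketch below is only an illustration: `backoff_delay` and its defaults (2 second base, 300 second cap) are assumptions, and whether Google's 429 page includes `Retry-After` is not something the question shows.

```python
import random


def backoff_delay(attempt, retry_after=None, base=2.0, cap=300.0):
    """Seconds to wait before retry number `attempt` (0-based) after a 429.

    `retry_after` is the raw Retry-After header value, if any. When it is a
    number of seconds it is used directly (clamped to [0, cap]); otherwise
    the delay is exponential in `attempt` with random jitter.
    """
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back to backoff
    ceiling = min(cap, base * (2 ** attempt))
    # Jitter spreads retries out so parallel workers do not retry in lockstep.
    return random.uniform(ceiling / 2, ceiling)
```

A caller would pass `response.headers.get("Retry-After")` and `time.sleep()` for the returned number of seconds before trying the request again.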
Disclaimer: scraping websites that ask you not to is an ethically questionable practice, and I don't condone it.
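For the API route, here is a minimal sketch against the Custom Search JSON API, which needs an API key and a Programmable Search Engine ID (the `cx` parameter). It assumes `session` behaves like a `requests.Session`; `search_urls`, `pause` and the timeout values are hypothetical names and defaults chosen for illustration. The API returns at most 10 results per call, so the function pages through results with the `start` offset.

```python
import time

API_URL = "https://www.googleapis.com/customsearch/v1"


def search_urls(session, query, api_key, engine_id, num_results=10,
                proxy=None, pause=1.0):
    """Collect up to `num_results` result URLs via the Custom Search JSON API.

    `session` is assumed to behave like requests.Session; `proxy` is an
    optional proxy URL applied to both http and https traffic.
    """
    proxies = {"http": proxy, "https": proxy} if proxy else None
    urls = []
    start = 1  # the API uses 1-based result offsets
    while len(urls) < num_results:
        page_size = min(10, num_results - len(urls))  # at most 10 per call
        response = session.get(
            API_URL,
            params={"key": api_key, "cx": engine_id, "q": query,
                    "num": page_size, "start": start},
            proxies=proxies,
            timeout=(3, 10),
        )
        response.raise_for_status()
        items = response.json().get("items", [])
        if not items:
            break  # no further results for this query
        urls.extend(item["link"] for item in items)
        start += len(items)
        if len(urls) < num_results:
            time.sleep(pause)  # pause between pages to stay under rate limits
    return urls[:num_results]
```

With the real library this would be called as `search_urls(requests.Session(), query, api_key, engine_id, num_results=100, proxy=proxy)`. As far as I know the API caps the total number of results it will serve per query, so very large values of `num_results` may come back short.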

2023-06-25
