How do I use rotating proxies in Scrapy?

yqhsw0fo  posted on 2022-11-09 in Other

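I am trying to use a rotating proxy in this script, but I don't know how to do it. I have checked previous questions on this topic and tried to implement their suggestions, but the site detects the proxy, asks for a login, and blocks me from getting the data. I developed the script below using selenium + selenium-stealth. I also tried a CrawlSpider, but got the same result.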

import time

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium_stealth import stealth


class RsSpider(scrapy.Spider):
    name = 'rs'
    allowed_domains = ['www.sahibinden.com']

    def start_requests(self):
        options = webdriver.ChromeOptions()
        options.add_argument("start-maximized")
        options.add_experimental_option("excludeSwitches", ["enable-automation"])
        options.add_experimental_option('useAutomationExtension', False)

        # Pass the options to the driver so they actually take effect.
        driver = webdriver.Chrome(
            service=Service(ChromeDriverManager().install()),
            options=options,
        )
        driver.set_window_size(1920, 1080)

        stealth(driver,
            languages=["en-US", "en"],
            vendor="Google Inc.",
            platform="Win32",
            webgl_vendor="Intel Inc.",
            renderer="Intel Iris OpenGL Engine",
            fix_hairline=True,
        )

        driver.get("https://www.sahibinden.com/satilik/istanbul-eyupsultan?pagingOffset=0")
        time.sleep(5)

        links = driver.find_elements(By.XPATH, "//td[@class='searchResultsTitleValue ']/a")
        # Collect the URLs first so the browser can be closed before requests are scheduled.
        hrefs = [link.get_attribute('href') for link in links]
        driver.quit()

        for href in hrefs:
            yield SeleniumRequest(
                url=href,
                callback=self.parse,
                # Proxy given as a full URL with an explicit scheme.
                meta={'proxy': 'http://username:password@server:2000'},
                wait_time=1,
            )

    def parse(self, response):
        yield {
            'URL': response.url,
            'City': response.xpath("normalize-space(//div[@class='classifiedInfo ']/h2/a[1]/text())").get(),
        }

wj8zmpe1 #1

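If adding the proxy to the request parameters does not work, then:
First
You can write a proxy downloader middleware and enable it in the project settings. (This is the better and safer option.)
Here is working code for the middleware: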

from w3lib.http import basic_auth_header


class ProxyMiddleware:
    def process_request(self, request, spider):
        # Use the settings already loaded for the running crawler.
        settings = spider.settings
        # Build an explicit proxy URL (scheme://host:port).
        request.meta['proxy'] = 'http://%s:%s' % (settings.get('PROXY_HOST'), settings.get('PROXY_PORT'))
        request.headers['Proxy-Authorization'] = basic_auth_header(settings.get('PROXY_USER'), settings.get('PROXY_PASSWORD'))
        spider.logger.info('Proxy : %s', request.meta['proxy'])

Settings file (enable DOWNLOADER_MIDDLEWARES):

import os
from dotenv import load_dotenv

load_dotenv()
....
....

# Proxy setup

PROXY_HOST = os.environ.get("PROXY_HOST")
PROXY_PORT = os.environ.get("PROXY_PORT")
PROXY_USER = os.environ.get("PROXY_USER")
PROXY_PASSWORD = os.environ.get("PROXY_PASSWORD")
.....
.....
.....

DOWNLOADER_MIDDLEWARES = {
   # 'project.middlewares.projectDownloaderMiddleware': 543,
    'project.proxy_middlewares.ProxyMiddleware': 350,
}

.env file:

PROXY_HOST=127.0.0.1
PROXY_PORT=6666
PROXY_USER=proxy_user
PROXY_PASSWORD=proxy_password
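
The middleware above sends every request through one fixed proxy, so on its own it does not rotate anything. A minimal sketch of a rotating variant follows, assuming a list of proxy URLs is available in a setting. The class name `RoundRobinProxyMiddleware` and the setting name `PROXY_POOL` are hypothetical and not part of Scrapy; authentication is left out and could be handled with the `Proxy-Authorization` header as in the middleware above.

```python
import itertools


class RoundRobinProxyMiddleware:
    """Hand out proxies from a fixed pool to outgoing requests, one after another."""

    def __init__(self, proxy_urls):
        if not proxy_urls:
            raise ValueError("PROXY_POOL must contain at least one proxy URL")
        # cycle() restarts from the first proxy once the pool is exhausted.
        self._proxies = itertools.cycle(proxy_urls)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_POOL is a hypothetical setting name, for example:
        # PROXY_POOL = ["http://127.0.0.1:6666", "http://127.0.0.1:6667"]
        return cls(crawler.settings.getlist("PROXY_POOL"))

    def process_request(self, request, spider):
        # Each request gets the next proxy in the pool.
        request.meta["proxy"] = next(self._proxies)
        return None  # let Scrapy continue processing the request
```

If this class lived in the same module as the middleware above, it could be enabled by listing `project.proxy_middlewares.RoundRobinProxyMiddleware` in `DOWNLOADER_MIDDLEWARES` in place of `ProxyMiddleware`. A plain round-robin does not detect banned or dead proxies.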
Second

Take a look at this middleware: scrapy-rotating-proxies
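
For reference, a settings fragment for that package might look like the sketch below. The setting and middleware names are the ones given in the package's README as far as I recall, so check them against the version you install (`pip install scrapy-rotating-proxies`). The proxy addresses are placeholders.

```python
# settings.py (fragment)

# Placeholder proxies; replace with real host:port entries.
ROTATING_PROXY_LIST = [
    "proxy1.com:8000",
    "proxy2.com:8031",
]

DOWNLOADER_MIDDLEWARES = {
    # Picks a working proxy for each request and rotates among them.
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    # Marks proxies as dead when responses look like bans.
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```

With this in place there is no need to set `meta['proxy']` by hand in the spider.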
