Scrapy: crawling with Leafproxy proxies raises ValueError: Port could not be cast to integer value

83qze16e  posted on 2022-11-09  in Other

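I have been a Scrapy enthusiast for three months. Because I really enjoy scraping, I ended up, both frustrated and excited, buying a proxy package from Leafproxy.
Unfortunately, when I loaded the proxies into my Scrapy spider, I got a ValueError.
I use scrapy-rotating-proxies to integrate the proxies. I added them not as numeric IP addresses but as hostname strings, like this: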

ROTATING_PROXY_LIST = [
    "us-retail-fast.resdleafproxies.com:5000:ksre9jXXXXXXXXI38HJg5:XXX9nh",
    "us-retail-fast.resdleafproxies.com:5000:ksre9jvXXXXXXXXk+zHtjyZRG:XXXXtf9nh",
    # ...
]

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 800,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 800,
}

Scrapy log:

draco@draco:~/docs/scraping/scrapyyy/thomas$ scrapy crawl home2 -o all_np4.csv
/home/draco/.local/lib/python3.8/site-packages/scrapy/spiderloader.py:37: UserWarning: There are several spiders with the same name:

  HomeSpider named 'home' (in thomas.spiders.home)

  HomeSpider named 'home' (in thomas.spiders.home3)

  This can cause unexpected behavior.
  warnings.warn(
2022-02-21 00:16:51 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: thomas)
2022-02-21 00:16:51 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.10 (default, Nov 26 2021, 20:14:08) - [GCC 9.3.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform Linux-5.13.0-30-generic-x86_64-with-glibc2.29
2022-02-21 00:16:51 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-02-21 00:16:51 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'thomas',
 'CLOSESPIDER_ERRORCOUNT': 10,
 'CONCURRENT_REQUESTS': 3,
 'CONCURRENT_REQUESTS_PER_DOMAIN': 3,
 'CONCURRENT_REQUESTS_PER_IP': 5,
 'COOKIES_ENABLED': False,
 'DNS_TIMEOUT': 10,
 'DOWNLOAD_DELAY': 2,
 'DOWNLOAD_TIMEOUT': 200,
 'NEWSPIDER_MODULE': 'thomas.spiders',
 'SPIDER_MODULES': ['thomas.spiders']}
2022-02-21 00:16:51 [scrapy.extensions.telnet] INFO: Telnet Password: 536c802b585074b3
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'rotating_proxies.middlewares.RotatingProxyMiddleware',
 'rotating_proxies.middlewares.BanDetectionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'thomas.middlewares.UserAgentRotatorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'thomas.middlewares.ThomasSpiderMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-21 00:16:51 [scrapy.core.engine] INFO: Spider opened
2022-02-21 00:16:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-21 00:16:51 [home2] INFO: Spider opened: home2
2022-02-21 00:16:51 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-02-21 00:16:51 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 30, reanimated: 0, mean backoff time: 0s)
INITIAL REQUEST
OPENING LIST https://www.homegate.ch/buy/apartment/canton-bern/matching-list?ah=1000www.homegate.ch/buy/apartment/canton-baselstadt/matching-list?loc=geo-canton-basel-landschaft%2Cgeo-canton-st-gallen%2Cgeo-canton-graubunden
OPENING LIST https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau
OPENING LIST https://www.homegate.ch/buy/apartment/canton-zurich/matching-list

    2022-02-21 00:16:51 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5006:XXXXXj: XXXXXXXtf9nh> is DEAD

# ....

    2022-02-21 00:17:02 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list> with another proxy (failed 2 times, max retries: 5)
  esdleafproxies.com:5005:ksre9jva95etajxxaoll9k+cw17qdyl:xxxx9nh> is DEAD
    2022-02-21 00:17:21 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau> with another proxy (failed 5 times, max retries: 5)
    2022-02-21 00:17:23 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5001:XXXXXXjxxaoll9k+ZcGvdwJf:XXXXXXXtf9nh> is DEAD
    2022-02-21 00:17:23 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list> with another proxy (failed 5 times, max retries: 5)
    2022-02-21 00:17:25 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproXXXXXXXsre9jva95etajxxaoll9k+oFx6kEXE:xxxxxxxtf9nh> is DEAD
    2022-02-21 00:17:25 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET https://www.homegate.ch/buy/apartment/canton-bern/matching-list?ah=1000www.homegate.ch/buy/apartment/canton-baselstadt/matching-list?loc=geo-canton-basel-landschaft%2Cgeo-canton-st-gallen%2Cgeo-canton-graubunden> (failed 6 times with different proxies)
    OPENING LIST https://www.homegate.ch/buy/apartment/canton-schwyz/matching-list?loc=geo-canton-obwalden%2Cgeo-canton-nidwalden%2Cgeo-canton-glarus%2Cgeo-canton-solothurn%2Cgeo-canton-schaffhausen%2Cgeo-canton-zug%2Cgeo-canton-appenzell-ausserrhoden%2Cgeo-canton-appenzell-innerrhoden&ag=2400
    2022-02-21 00:17:25 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.homegate.ch/buy/apartment/canton-bern/matching-list?ah=1000www.homegate.ch/buy/apartment/canton-baselstadt/matching-list?loc=geo-canton-basel-landschaft%2Cgeo-canton-st-gallen%2Cgeo-canton-graubunden>
    Traceback (most recent call last):
      File "/home/draco/.local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
        result = current_context.run(
      File "/home/draco/.local/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
        return g.throw(self.type, self.value, self.tb)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
        return (yield download_func(request=request, spider=spider))
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
        result = f(*args,**kw)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 75, in download_request
        return handler.download_request(request, spider)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 65, in download_request
        return agent.download_request(request)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 322, in download_request
        agent = self._get_agent(request, timeout)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 278, in _get_agent
        _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 36, in _parse
        return _parsed_url_args(parsed)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 20, in _parsed_url_args
        port = parsed.port
      File "/usr/lib/python3.8/urllib/parse.py", line 174, in port
        raise ValueError(message) from None
    ValueError: Port could not be cast to integer value as '5007:ksre9jva95etajxxaoll9k+oFx6kEXE:XXXXtf9nh'
    2022-02-21 00:17:28 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5006xxxxxxxxetajxxaoll9k+V2UowimU:XXXXXXf9nh> is DEAD
    2022-02-21 00:17:28 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau> (failed 6 times with different proxies)
    2022-02-21 00:17:28 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau>
    Traceback (most recent call last):
      File "/home/draco/.local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
        result = current_context.run(
      File "/home/draco/.local/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
        return g.throw(self.type, self.value, self.tb)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
        return (yield download_func(request=request, spider=spider))
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
        result = f(*args,**kw)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 75, in download_request
        return handler.download_request(request, spider)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 65, in download_request
        return agent.download_request(request)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 322, in download_request
        agent = self._get_agent(request, timeout)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 278, in _get_agent
        _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 36, in _parse
        return _parsed_url_args(parsed)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 20, in _parsed_url_args
        port = parsed.port
      File "/usr/lib/python3.8/urllib/parse.py", line 174, in port
        raise ValueError(message) from None
    ValueError: Port could not be cast to integer value as '5006:ksre9jva95etajxxaoll9k+XXXXXX'
    2022-02-21 00:17:30 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5004:XXXXXXX5etajxxaoll9k+fbg56Ioj:XXXXf9nh> is DEAD
    2022-02-21 00:17:30 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list> (failed 6 times with different proxies)
    2022-02-21 00:17:30 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list>
    Traceback (most recent call last):
      File "/home/draco/.local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
        result = current_context.run(
      File "/home/draco/.local/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
        return g.throw(self.type, self.value, self.tb)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
        return (yield download_func(request=request, spider=spider))
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
        result = f(*args,**kw)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 75, in download_request
        return handler.download_request(request, spider)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 65, in download_request
        return agent.download_request(request)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 322, in download_request
        agent = self._get_agent(request, timeout)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 278, in _get_agent
        _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 36, in _parse
        return _parsed_url_args(parsed)
      File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 20, in _parsed_url_args
        port = parsed.port
      File "/usr/lib/python3.8/urllib/parse.py", line 174, in port
        raise ValueError(message) from None
    ValueError: Port could not be cast to integer value as '5004:XXXXXva95etajxxaoll9k+fbg56Ioj:XXXXXtf9nh'
    2022-02-21 00:17:31 [rotating_proxies.middlewares] DEBUG: 1 proxies moved from 'dead' to 'reanimated'
    2022-02-21 00:17:33 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5000:XXXXXajxxaoll9k+zHtjyZRG:XXXX9nh> is DEAD
    2022-02-21 00:17:33 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-schwyz/matching-list?loc=geo-canton-obwalden%2Cgeo-canton-nidwalden%2Cgeo-canton-glarus%2Cgeo-canton-solothurn%2Cgeo-canton-schaffhausen%2Cgeo-canton-zug%2Cgeo-canton-appenzell-ausserrhoden%2Cgeo-canton-appenzell-innerrhoden&ag=2400> with another proxy (failed 1 times, max retries: 5)
    2022-02-21 00:17:36 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5001:XXXXXXXXXetajxxaoll9k+uSsCeYH5:lXXXXXXmtf9nh> is DEAD
    2022-02-21 00:17:36 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-schwyz/matching-list?loc=geo-canton-obwalden%2Cgeo-canton-nidwalden%2Cgeo-canton-glarus%2Cgeo-canton-solothurn%2Cgeo-canton-schaffhausen%2Cgeo-canton-zug%2Cgeo-canton-appenzell-ausserrhoden%2Cgeo-canton-appenzell-innerrhoden&ag=2400> with another proxy (failed 2 times, max retries: 5)

    ValueError: Port could not be cast to integer value as '5009:ksre9jva95etajxxaoll9k+HOggeKA3:XXXXXh'
    2022-02-21 00:17:47 [scrapy.core.engine] INFO: Closing spider (finished)
    2022-02-21 00:17:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'bans/error/builtins.ValueError': 24,
     'downloader/exception_count': 24,
     'downloader/exception_type_count/builtins.ValueError': 24,
     'downloader/request_bytes': 7158,
     'downloader/request_count': 24,
     'downloader/request_method_count/GET': 24,
     'elapsed_time_seconds': 55.895942,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2022, 2, 20, 21, 17, 47, 135433),
     'log_count/DEBUG': 50,
     'log_count/ERROR': 4,
     'log_count/INFO': 13,
     'memusage/max': 65073152,
     'memusage/startup': 65073152,
     'proxies/dead': 21,
     'proxies/mean_backoff': 196.90260209397636,
     'proxies/reanimated': 1,
     'proxies/unchecked': 9,
     'scheduler/dequeued': 24,
     'scheduler/dequeued/memory': 24,
     'scheduler/enqueued': 24,
     'scheduler/enqueued/memory': 24,
     'start_time': datetime.datetime(2022, 2, 20, 21, 16, 51, 239491)}
    2022-02-21 00:17:47 [scrapy.core.engine] INFO: Spider closed (finished)

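What could the problem be?
My Leafproxy membership is for "residential proxies". Leafproxy provides no details about them or about how to use them. As far as I know, there is no real customer support, only a Discord channel.
This is the panel Leafproxy provides. I took my proxies from the list shown in it. It records no data usage.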

unftdfkk1#

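The way you defined the proxy list is incorrect. You need to use username:password@server:port rather than server:port:username:password. With the original layout, everything after the first colon following the host is treated as the port, which is why the traceback reports a port such as '5007:ksre9j...'. Try the following definition: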

ROTATING_PROXY_LIST = [
    "https://ksre9jva95etajxxaoll9k+JI38HJg5:lnztmtf9nh@us-retail-fast.resdleafproxies.com:5000",
    "https://ksre9jva95etajxxaoll9k+zHtjyZRG:lnztmtf9nh@us-retail-fast.resdleafproxies.com:5001",
]
DOWNLOADER_MIDDLEWARES = {
    # ...
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 800,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 810,
    # ...
}
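
If the provider hands out many entries in the server:port:username:password layout, the reordering described above can be done in code rather than by hand. The following is a minimal sketch, not part of scrapy-rotating-proxies: the helper name to_proxy_url, the RAW_PROXIES list, and the proxy.example.com entries are placeholders made up for illustration. It assumes the credentials contain no characters that need percent-encoding (a '+' is fine). The scheme is a parameter; the log in the question shows the middleware prefixing http://, while the definition above uses https://.

```python
from urllib.parse import urlsplit


def to_proxy_url(entry: str, scheme: str = "http") -> str:
    """Turn 'host:port:username:password' into 'scheme://username:password@host:port'."""
    # Split at most three times so that a ':' inside the password is preserved.
    host, port, username, password = entry.split(":", 3)
    if not port.isdigit():
        raise ValueError(f"port is not numeric in proxy entry: {port!r}")
    return f"{scheme}://{username}:{password}@{host}:{port}"


# Placeholder entries in the provider's layout (not real credentials).
RAW_PROXIES = [
    "proxy.example.com:5000:user+session1:secret",
    "proxy.example.com:5001:user+session2:secret",
]

ROTATING_PROXY_LIST = [to_proxy_url(entry) for entry in RAW_PROXIES]

# The original layout leaves everything after the first ':' in the port slot,
# which is what the traceback in the question complains about.
try:
    urlsplit("http://" + RAW_PROXIES[0]).port
except ValueError as exc:
    print(exc)

# After conversion the port parses as an integer.
print(urlsplit(ROTATING_PROXY_LIST[0]).port)
```

The first print reproduces the same kind of ValueError as in the question, and the second shows that the converted URL has a port that urllib can read.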

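Note: you have exposed your credentials on the Internet, so anyone who sees this question can use your proxy service for free. Consider revoking the credentials as soon as possible.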
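
One way to avoid pasting credentials into settings.py (and from there into questions or repositories) is to read the proxy URLs from the environment. This is only a sketch under assumptions: the variable name PROXY_URLS and the helper load_proxies are hypothetical, and the URLs are assumed to be separated by whitespace.

```python
import os


def load_proxies(env_var: str = "PROXY_URLS") -> list:
    """Read whitespace-separated proxy URLs from an environment variable."""
    # An unset variable yields an empty list rather than an error.
    return os.environ.get(env_var, "").split()


# In settings.py, after e.g.
#   export PROXY_URLS="http://user:pass@host:5000 http://user:pass@host:5001"
ROTATING_PROXY_LIST = load_proxies()
```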

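The second problem you may run into is that some of the proxies may already be banned by the site you are scraping, so those requests will come back as failed responses. When you use proxies, you should therefore raise the retry limit (the log above shows "max retries: 5").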
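
As a rough illustration, the retry limits could be raised in settings.py along these lines. The values are arbitrary examples, not recommendations. ROTATING_PROXY_PAGE_RETRY_TIMES is the scrapy-rotating-proxies setting behind the "max retries: 5" message, and RETRY_TIMES is Scrapy's own retry setting.

```python
# settings.py (sketch; values are illustrative)

# How many times scrapy-rotating-proxies retries a page with a different proxy
# before giving up. The log in the question shows the default of 5.
ROTATING_PROXY_PAGE_RETRY_TIMES = 10

# Scrapy's built-in RetryMiddleware limit for failed requests (default is 2).
RETRY_TIMES = 5
```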
