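I'm a Scrapy enthusiast and have been scraping for 3 months. Because I really enjoy scraping, I ended up, both frustrated and excited, buying a proxy package from Leafpad.
Unfortunately, when I loaded the proxies into my Scrapy spider, I got a ValueError.
I use scrapy-rotating-proxies to integrate the proxies. I added them as string URLs rather than numbers, as shown below: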
ROTATING_PROXY_LIST = [
"us-retail-fast.resdleafproxies.com:5000:ksre9jXXXXXXXXI38HJg5:XXX9nh",
"us-retail-fast.resdleafproxies.com:5000:ksre9jvXXXXXXXXk+zHtjyZRG:XXXXtf9nh",
# ...
]
DOWNLOADER_MIDDLEWARES = {
'rotating_proxies.middlewares.RotatingProxyMiddleware': 800,
'rotating_proxies.middlewares.BanDetectionMiddleware': 800
}
Scrapy log:
draco@draco:~/docs/scraping/scrapyyy/thomas$ scrapy crawl home2 -o all_np4.csv
/home/draco/.local/lib/python3.8/site-packages/scrapy/spiderloader.py:37: UserWarning: There are several spiders with the same name:
HomeSpider named 'home' (in thomas.spiders.home)
HomeSpider named 'home' (in thomas.spiders.home3)
This can cause unexpected behavior.
warnings.warn(
2022-02-21 00:16:51 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: thomas)
2022-02-21 00:16:51 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.10 (default, Nov 26 2021, 20:14:08) - [GCC 9.3.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Linux-5.13.0-30-generic-x86_64-with-glibc2.29
2022-02-21 00:16:51 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-02-21 00:16:51 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'thomas',
'CLOSESPIDER_ERRORCOUNT': 10,
'CONCURRENT_REQUESTS': 3,
'CONCURRENT_REQUESTS_PER_DOMAIN': 3,
'CONCURRENT_REQUESTS_PER_IP': 5,
'COOKIES_ENABLED': False,
'DNS_TIMEOUT': 10,
'DOWNLOAD_DELAY': 2,
'DOWNLOAD_TIMEOUT': 200,
'NEWSPIDER_MODULE': 'thomas.spiders',
'SPIDER_MODULES': ['thomas.spiders']}
2022-02-21 00:16:51 [scrapy.extensions.telnet] INFO: Telnet Password: 536c802b585074b3
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.closespider.CloseSpider',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'rotating_proxies.middlewares.RotatingProxyMiddleware',
'rotating_proxies.middlewares.BanDetectionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'thomas.middlewares.UserAgentRotatorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'thomas.middlewares.ThomasSpiderMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-02-21 00:16:51 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-21 00:16:51 [scrapy.core.engine] INFO: Spider opened
2022-02-21 00:16:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-21 00:16:51 [home2] INFO: Spider opened: home2
2022-02-21 00:16:51 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-02-21 00:16:51 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 30, reanimated: 0, mean backoff time: 0s)
INITIAL REQUEST
OPENING LIST https://www.homegate.ch/buy/apartment/canton-bern/matching-list?ah=1000www.homegate.ch/buy/apartment/canton-baselstadt/matching-list?loc=geo-canton-basel-landschaft%2Cgeo-canton-st-gallen%2Cgeo-canton-graubunden
OPENING LIST https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau
OPENING LIST https://www.homegate.ch/buy/apartment/canton-zurich/matching-list
2022-02-21 00:16:51 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5006:XXXXXj: XXXXXXXtf9nh> is DEAD
# ....
2022-02-21 00:17:02 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list> with another proxy (failed 2 times, max retries: 5)
esdleafproxies.com:5005:ksre9jva95etajxxaoll9k+cw17qdyl:xxxx9nh> is DEAD
2022-02-21 00:17:21 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau> with another proxy (failed 5 times, max retries: 5)
2022-02-21 00:17:23 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5001:XXXXXXjxxaoll9k+ZcGvdwJf:XXXXXXXtf9nh> is DEAD
2022-02-21 00:17:23 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list> with another proxy (failed 5 times, max retries: 5)
2022-02-21 00:17:25 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproXXXXXXXsre9jva95etajxxaoll9k+oFx6kEXE:xxxxxxxtf9nh> is DEAD
2022-02-21 00:17:25 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET https://www.homegate.ch/buy/apartment/canton-bern/matching-list?ah=1000www.homegate.ch/buy/apartment/canton-baselstadt/matching-list?loc=geo-canton-basel-landschaft%2Cgeo-canton-st-gallen%2Cgeo-canton-graubunden> (failed 6 times with different proxies)
OPENING LIST https://www.homegate.ch/buy/apartment/canton-schwyz/matching-list?loc=geo-canton-obwalden%2Cgeo-canton-nidwalden%2Cgeo-canton-glarus%2Cgeo-canton-solothurn%2Cgeo-canton-schaffhausen%2Cgeo-canton-zug%2Cgeo-canton-appenzell-ausserrhoden%2Cgeo-canton-appenzell-innerrhoden&ag=2400
2022-02-21 00:17:25 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.homegate.ch/buy/apartment/canton-bern/matching-list?ah=1000www.homegate.ch/buy/apartment/canton-baselstadt/matching-list?loc=geo-canton-basel-landschaft%2Cgeo-canton-st-gallen%2Cgeo-canton-graubunden>
Traceback (most recent call last):
File "/home/draco/.local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
result = current_context.run(
File "/home/draco/.local/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
result = f(*args,**kw)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 75, in download_request
return handler.download_request(request, spider)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 65, in download_request
return agent.download_request(request)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 322, in download_request
agent = self._get_agent(request, timeout)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 278, in _get_agent
_, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 36, in _parse
return _parsed_url_args(parsed)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 20, in _parsed_url_args
port = parsed.port
File "/usr/lib/python3.8/urllib/parse.py", line 174, in port
raise ValueError(message) from None
ValueError: Port could not be cast to integer value as '5007:ksre9jva95etajxxaoll9k+oFx6kEXE:XXXXtf9nh'
2022-02-21 00:17:28 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5006xxxxxxxxetajxxaoll9k+V2UowimU:XXXXXXf9nh> is DEAD
2022-02-21 00:17:28 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau> (failed 6 times with different proxies)
2022-02-21 00:17:28 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.homegate.ch/buy/apartment/canton-aargau/matching-list?loc=geo-canton-thurgau>
Traceback (most recent call last):
File "/home/draco/.local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
result = current_context.run(
File "/home/draco/.local/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
result = f(*args,**kw)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 75, in download_request
return handler.download_request(request, spider)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 65, in download_request
return agent.download_request(request)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 322, in download_request
agent = self._get_agent(request, timeout)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 278, in _get_agent
_, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 36, in _parse
return _parsed_url_args(parsed)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 20, in _parsed_url_args
port = parsed.port
File "/usr/lib/python3.8/urllib/parse.py", line 174, in port
raise ValueError(message) from None
ValueError: Port could not be cast to integer value as '5006:ksre9jva95etajxxaoll9k+XXXXXX'
2022-02-21 00:17:30 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5004:XXXXXXX5etajxxaoll9k+fbg56Ioj:XXXXf9nh> is DEAD
2022-02-21 00:17:30 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list> (failed 6 times with different proxies)
2022-02-21 00:17:30 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.homegate.ch/buy/apartment/canton-zurich/matching-list>
Traceback (most recent call last):
File "/home/draco/.local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
result = current_context.run(
File "/home/draco/.local/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
result = f(*args,**kw)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 75, in download_request
return handler.download_request(request, spider)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 65, in download_request
return agent.download_request(request)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 322, in download_request
agent = self._get_agent(request, timeout)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 278, in _get_agent
_, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 36, in _parse
return _parsed_url_args(parsed)
File "/home/draco/.local/lib/python3.8/site-packages/scrapy/core/downloader/webclient.py", line 20, in _parsed_url_args
port = parsed.port
File "/usr/lib/python3.8/urllib/parse.py", line 174, in port
raise ValueError(message) from None
ValueError: Port could not be cast to integer value as '5004:XXXXXva95etajxxaoll9k+fbg56Ioj:XXXXXtf9nh'
2022-02-21 00:17:31 [rotating_proxies.middlewares] DEBUG: 1 proxies moved from 'dead' to 'reanimated'
2022-02-21 00:17:33 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5000:XXXXXajxxaoll9k+zHtjyZRG:XXXX9nh> is DEAD
2022-02-21 00:17:33 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-schwyz/matching-list?loc=geo-canton-obwalden%2Cgeo-canton-nidwalden%2Cgeo-canton-glarus%2Cgeo-canton-solothurn%2Cgeo-canton-schaffhausen%2Cgeo-canton-zug%2Cgeo-canton-appenzell-ausserrhoden%2Cgeo-canton-appenzell-innerrhoden&ag=2400> with another proxy (failed 1 times, max retries: 5)
2022-02-21 00:17:36 [rotating_proxies.expire] DEBUG: Proxy <http://us-retail-fast.resdleafproxies.com:5001:XXXXXXXXXetajxxaoll9k+uSsCeYH5:lXXXXXXmtf9nh> is DEAD
2022-02-21 00:17:36 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.homegate.ch/buy/apartment/canton-schwyz/matching-list?loc=geo-canton-obwalden%2Cgeo-canton-nidwalden%2Cgeo-canton-glarus%2Cgeo-canton-solothurn%2Cgeo-canton-schaffhausen%2Cgeo-canton-zug%2Cgeo-canton-appenzell-ausserrhoden%2Cgeo-canton-appenzell-innerrhoden&ag=2400> with another proxy (failed 2 times, max retries: 5)
ValueError: Port could not be cast to integer value as '5009:ksre9jva95etajxxaoll9k+HOggeKA3:XXXXXh'
2022-02-21 00:17:47 [scrapy.core.engine] INFO: Closing spider (finished)
2022-02-21 00:17:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'bans/error/builtins.ValueError': 24,
'downloader/exception_count': 24,
'downloader/exception_type_count/builtins.ValueError': 24,
'downloader/request_bytes': 7158,
'downloader/request_count': 24,
'downloader/request_method_count/GET': 24,
'elapsed_time_seconds': 55.895942,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 2, 20, 21, 17, 47, 135433),
'log_count/DEBUG': 50,
'log_count/ERROR': 4,
'log_count/INFO': 13,
'memusage/max': 65073152,
'memusage/startup': 65073152,
'proxies/dead': 21,
'proxies/mean_backoff': 196.90260209397636,
'proxies/reanimated': 1,
'proxies/unchecked': 9,
'scheduler/dequeued': 24,
'scheduler/dequeued/memory': 24,
'scheduler/enqueued': 24,
'scheduler/enqueued/memory': 24,
'start_time': datetime.datetime(2022, 2, 20, 21, 16, 51, 239491)}
2022-02-21 00:17:47 [scrapy.core.engine] INFO: Spider closed (finished)
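What could be the problem?
My Leafproxy membership is for "residential proxies". Leafproxy doesn't provide any details about them or about how to use them. As far as I know, there is no real customer support, only a Discord channel.
This is the panel that Leafproxy provides. I took the proxies from the list shown there. There is no record of any data usage.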
1 Answer
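The way you define your proxy list is incorrect. Each entry needs to be in the form username:password@server:port, not server:port:username:password. With the credentials placed after the port, everything following the first colon after the hostname is read as the port, which is why the traceback ends in "Port could not be cast to integer value". Redefine your list entries in the first form.
Note: you have exposed your credentials on the Internet, so anyone who sees this question can use your proxy service for free. Consider revoking them as soon as possible.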
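The format change described above can be sketched as follows. This is only an illustration: the helper name to_proxy_url, the host proxy.example.com, and the credentials are made up, and the sketch assumes the proxies speak plain HTTP and that usernames contain no ':' or '@'. It first reproduces the parsing failure with urllib.parse, then builds the list in the username:password@server:port layout.

```python
from urllib.parse import urlsplit


def to_proxy_url(entry: str, scheme: str = "http") -> str:
    """Convert 'server:port:username:password' to 'scheme://username:password@server:port'."""
    # Split on the first three colons only, so a password containing ':' stays intact.
    host, port, username, password = entry.strip().split(":", 3)
    if not port.isdigit():
        raise ValueError(f"unexpected port in proxy entry: {port!r}")
    return f"{scheme}://{username}:{password}@{host}:{port}"


# Placeholder entry in the vendor's layout (hypothetical host and credentials).
raw_entries = ["proxy.example.com:5000:user+abc:secret"]

# The original layout fails the same way as in the traceback:
# urllib treats "5000:user+abc:secret" as the port and cannot cast it to an integer.
try:
    urlsplit("http://" + raw_entries[0]).port
except ValueError as exc:
    print("old format:", exc)

# The converted entries parse cleanly and can be assigned in settings.py.
ROTATING_PROXY_LIST = [to_proxy_url(entry) for entry in raw_entries]
print(ROTATING_PROXY_LIST)
print(urlsplit(ROTATING_PROXY_LIST[0]).port)
```

Converting the vendor's list programmatically, rather than editing 30 entries by hand, also keeps the real credentials out of anything you paste into a public question.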
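A second problem you may run into is that some of the proxies may already be banned by the site you are scraping, so you will get failed responses. When you use proxies, you therefore need to increase the value of RETRIES (in Scrapy, the built-in retry setting is named RETRY_TIMES).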
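As a rough sketch of what raising the retry limits might look like in settings.py: RETRY_TIMES is Scrapy's own setting, and ROTATING_PROXY_PAGE_RETRY_TIMES is the scrapy-rotating-proxies setting behind the "max retries: 5" lines in the log above. The numbers are arbitrary examples, not tuned values.

```python
# settings.py (fragment); the values below are illustrative, not recommendations.

# Scrapy's built-in RetryMiddleware: number of retries per request,
# in addition to the first attempt (Scrapy's default is 2).
RETRY_TIMES = 5

# scrapy-rotating-proxies: how many times a page is retried with a
# different proxy before it is counted as a page failure (default is 5).
ROTATING_PROXY_PAGE_RETRY_TIMES = 10
```

Higher limits only help once the proxy entries themselves parse correctly. While every proxy raises the ValueError above, extra retries just mark more proxies as dead.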