So I'm using Scrapy to extract data from a website.
Here's what I'm doing:
I navigate from the main page to sub-pages.
From those sub-pages, I extract data and download attachments into a specific folder. The data extraction itself works.
However, as soon as the attachment-download part starts, I get this error:
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://59.207.152.10:8001/robots.txt> (failed 1 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2023-06-06 04:23:41 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://59.207.152.10:8001/robots.txt> (failed 2 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2023-06-06 04:23:56 [scrapy.extensions.logstats] INFO: Crawled 24 pages (at 24 pages/min), scraped 0 items (at 0 items/min)
2023-06-06 04:24:02 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://59.207.152.10:8001/robots.txt> (failed 3 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2023-06-06 04:24:02 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://59.207.152.10:8001/robots.txt>: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
Traceback (most recent call last):
File "C:\Users\Mudassir\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\core\downloader\middleware.py", line 54, in process_request
return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2023-06-06 04:24:24 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://59.207.152.10:8001/pdshbj/upload/files/2023/5/ab174cf3fe6d1e68dc82bf6ba46a365f.doc> (failed 1 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2023-06-06 04:24:45 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://59.207.152.10:8001/pdshbj/upload/files/2023/5/ab174cf3fe6d1e68dc82bf6ba46a365f.doc> (failed 2 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2023-06-06 04:24:56 [scrapy.extensions.logstats] INFO: Crawled 24 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-06-06 04:25:06 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://59.207.152.10:8001/pdshbj/upload/files/2023/5/ab174cf3fe6d1e68dc82bf6ba46a365f.doc> (failed 3 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2023-06-06 04:25:06 [scrapy.core.scraper] ERROR: Error downloading <GET http://59.207.152.10:8001/pdshbj/upload/files/2023/5/ab174cf3fe6d1e68dc82bf6ba46a365f.doc>
Traceback (most recent call last):
File "C:\Users\Mudassir\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\core\downloader\middleware.py", line 54, in process_request
return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
Do you know what might be causing this?
I have already tried adding DOWNLOAD_DELAY to the settings file.
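For reference, the throttling setting mentioned above would typically look like this in settings.py; the specific values here are illustrative assumptions, not taken from the original post:

```python
# settings.py -- illustrative values, not from the original post
DOWNLOAD_DELAY = 2      # seconds to wait between consecutive requests
DOWNLOAD_TIMEOUT = 60   # give a slow host more time before raising error 10060
RETRY_TIMES = 5         # retry failed downloads a few more times than the default 2
```

These are standard Scrapy settings; a delay alone will not help if the host is simply unreachable from the client's network, which is what the TCP 10060 timeout suggests.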
Below is my code (I have removed the text-scraping part):
import os

import scrapy


class SthjjSpiders(scrapy.Spider):
    name = "sthjj"
    start_urls = [
        "xyz.html"
    ]
    project_count = 0  # initialize so the increment in parse() doesn't raise AttributeError

    def parse(self, response):
        links = response.css('div.xxgk td a::attr(href)').getall()
        for link in links:
            yield response.follow(link, self.page_parse)
        next_page = response.css('span.item.operation > a::attr(href)')[2].get()
        self.project_count += 1
        if next_page and self.project_count < 1:
            yield response.follow(next_page, callback=self.parse)

    def page_parse(self, response):
        attachments = response.css('a[href*=".doc"]::attr(href)').getall()
        # page_dir is defined in the text-scraping code removed from this snippet
        attachments_folder = os.path.join(page_dir, 'Attachments')
        os.makedirs(attachments_folder, exist_ok=True)
        for link in attachments:
            yield response.follow(link, callback=self.download_attachments,
                                  meta={'attachments_folder': attachments_folder})

    def download_attachments(self, response):
        filename = response.url.split('/')[-1]
        attachment_folder = response.meta['attachments_folder']
        attachments_file = os.path.join(attachment_folder, filename)
        with open(attachments_file, 'wb') as f:
            f.write(response.body)
1 Answer
As Alexandria said, use the Files Pipeline.