TCP connection timed out error in Scrapy when the file download starts

Posted by hi3rlvi2 on 2023-06-23 in Other

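I am using Scrapy to extract data from a website.
This is what I do:
- Navigate from the main page to the sub-pages.
- From these sub-pages, extract data and download the attachments to a specific folder.
The data is being extracted correctly, but as soon as the attachment download part starts, I get this error: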

[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://59.207.152.10:8001/robots.txt> (failed 1 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2023-06-06 04:23:41 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://59.207.152.10:8001/robots.txt> (failed 2 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2023-06-06 04:23:56 [scrapy.extensions.logstats] INFO: Crawled 24 pages (at 24 pages/min), scraped 0 items (at 0 items/min)
2023-06-06 04:24:02 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://59.207.152.10:8001/robots.txt> (failed 3 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2023-06-06 04:24:02 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://59.207.152.10:8001/robots.txt>: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
Traceback (most recent call last):
  File "C:\Users\Mudassir\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\core\downloader\middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2023-06-06 04:24:24 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://59.207.152.10:8001/pdshbj/upload/files/2023/5/ab174cf3fe6d1e68dc82bf6ba46a365f.doc> (failed 1 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2023-06-06 04:24:45 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://59.207.152.10:8001/pdshbj/upload/files/2023/5/ab174cf3fe6d1e68dc82bf6ba46a365f.doc> (failed 2 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2023-06-06 04:24:56 [scrapy.extensions.logstats] INFO: Crawled 24 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-06-06 04:25:06 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://59.207.152.10:8001/pdshbj/upload/files/2023/5/ab174cf3fe6d1e68dc82bf6ba46a365f.doc> (failed 3 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2023-06-06 04:25:06 [scrapy.core.scraper] ERROR: Error downloading <GET http://59.207.152.10:8001/pdshbj/upload/files/2023/5/ab174cf3fe6d1e68dc82bf6ba46a365f.doc>
Traceback (most recent call last):
  File "C:\Users\Mudassir\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\core\downloader\middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..

Do you know what causes this?
I have already tried adding Download_delay to the settings file.
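For reference, here is a minimal sketch of the `settings.py` entries that relate to this log. Every value below is an illustrative assumption, not taken from the post. Scrapy's built-in setting is spelled `DOWNLOAD_DELAY`; it only spaces out consecutive requests, whereas `DOWNLOAD_TIMEOUT` and `RETRY_TIMES` govern how long the downloader waits and how often a failed request is retried.

```python
# settings.py (sketch; every value here is an assumed example, not from the post)

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 2

# Seconds the downloader waits before timing out (Scrapy's default is 180).
DOWNLOAD_TIMEOUT = 60

# Extra attempts after the first failure (Scrapy's default is 2, which matches
# the "failed 3 times" lines in the log above).
RETRY_TIMES = 2

# When True, Scrapy requests robots.txt from each host before crawling it,
# which is why the log shows GET .../robots.txt against the attachment host.
ROBOTSTXT_OBEY = True
```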
Here is my code (I have removed the text-scraping part):

class SthjjSpiders(scrapy.Spider):
    name="sthjj"
    start_urls=[
        "xyz.html"
    ]

    def parse(self, response):
        links=response.css('div.xxgk td a::attr(href)').getall()
        for link in links:
            yield response.follow(link, self.page_parse)
                    
        
        next_page= response.css('span.item.operation > a::attr(href)')[2].get()
        self.project_count+=1
        if next_page and self.project_count<1:
            yield response.follow(next_page, callback=self.parse)

        
    def page_parse(self, response):
       
        attachments=response.css('a[href*=".doc"]::attr(href)').getall()
        attachments_folder=os.path.join(page_dir,'Attachments')
        os.makedirs(attachments_folder, exist_ok=True)

        for link in attachments:
            yield response.follow(link, callback=self.download_attachments, meta={'attachments_folder': attachments_folder})
            
               
    def download_attachments(self, response):
        filename=response.url.split('/')[-1]
        attachment_folder=response.meta['attachments_folder']
        attachments_file=os.path.join(attachment_folder,filename)

        with open(attachments_file, 'wb') as f:
            f.write(response.body)
Answer #1 by b4wnujal

As Alexandria said, use the files pipeline.

import scrapy
import os
from scrapy import Request
from scrapy.pipelines.files import FilesPipeline
from itemadapter import ItemAdapter

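class ProcessPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        # One download request per URL listed in the item's file_urls field.
        urls = ItemAdapter(item).get(self.files_urls_field, [])
        return [Request(u) for u in urls]

    def file_path(self, request, response=None, info=None, *, item=None):
        # Store the file at the relative path (under FILES_STORE) carried by the item.
        return item['file_item']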

class DownloladItem(scrapy.Item):
    file_urls = scrapy.Field()
    file_item = scrapy.Field()
    files = scrapy.Field()

class SthjjSpiders(scrapy.Spider):
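    name = "sthjj"
    # Page counter incremented in parse(); it is not initialised in the posted
    # snippet, so it is assumed to start at 0 here.
    project_count = 0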
    start_urls = ['http://sthjj.pds.gov.cn/channels/11330_3.html']
    custom_settings = {
        'FILES_STORE': 'downloads_dir',
        # change your path here to your pipeline path.
        'ITEM_PIPELINES': {'tempbuffer.spiders.spider.ProcessPipeline': 1}
    }

    def parse(self, response):
        links = response.css('div.xxgk td a::attr(href)').getall()
        for link in links:
            yield response.follow(link, self.page_parse)

        next_page = response.css('span.item.operation > a::attr(href)')[2].get()
        self.project_count += 1
        if next_page and self.project_count < 1:
            yield response.follow(next_page, callback=self.parse)

    def page_parse(self, response):
        attachments = response.css('a[href*=".doc"]::attr(href)').getall()
        attachments_folder = 'Attachments'

        for link in attachments:
            yield response.follow(link, callback=self.download_attachments,
                                  cb_kwargs={'attachments_folder': attachments_folder})

    def download_attachments(self, response, attachments_folder):
        filename = response.url.split('/')[-1]
        attachments_file = os.path.join(attachments_folder, filename)

        item = DownloladItem()
        item['file_urls'] = [response.url]
        item['file_item'] = attachments_file

        yield item
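
One thing to watch, as far as I can tell: the spider above first requests each attachment itself (`response.follow` into `download_attachments`), and the pipeline then requests the same URL again from `file_urls`. A possible variation is to build the item straight from the link in `page_parse`, so that only the pipeline fetches the file. The sketch below uses only the standard library. The helper name `build_file_item` and its `folder` parameter are hypothetical, and it returns a plain dict, which I am assuming the `ProcessPipeline` above accepts because it reads the item through `ItemAdapter` and by key.

```python
import posixpath
from urllib.parse import unquote, urljoin, urlsplit


def build_file_item(page_url, href, folder="Attachments"):
    """Build a dict item for the files pipeline from one attachment link.

    Hypothetical helper: the name and the folder default are assumptions.
    """
    # Resolve a relative href against the page it was found on.
    file_url = urljoin(page_url, href)
    # Take the file name from the URL path only, ignoring any query string.
    filename = posixpath.basename(unquote(urlsplit(file_url).path))
    return {
        "file_urls": [file_url],
        # Relative path under FILES_STORE, consumed by file_path() above.
        "file_item": posixpath.join(folder, filename),
    }


item = build_file_item(
    "http://sthjj.pds.gov.cn/channels/11330_3.html",
    "http://59.207.152.10:8001/pdshbj/upload/files/2023/5/"
    "ab174cf3fe6d1e68dc82bf6ba46a365f.doc",
)
print(item["file_item"])  # Attachments/ab174cf3fe6d1e68dc82bf6ba46a365f.doc
```

In `page_parse`, this would mean replacing the `response.follow` call with `yield build_file_item(response.url, link)` for each link. It does not by itself help if the attachment host is unreachable; the timeouts in the log would still occur until `59.207.152.10:8001` responds.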
