I inherited a Scrapy application that scrapes 1,000 pages on a single domain and writes the final results to a JSON file. The author had been running it on a Mac and hit an operating-system limit complaining that the maximum number of open files had been reached. He worked around it by overriding the OS-level cap:
$ ulimit -n 2048
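(For reference, the same cap can also be raised from inside the Python process on macOS/Linux with the standard resource module; a minimal sketch, not applicable on Windows where resource is unavailable:)
import resource

# Read the current soft/hard limits for open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Raise the soft limit, capped at the hard limit (equivalent to `ulimit -n 2048`)
new_soft = 2048 if hard == resource.RLIM_INFINITY else min(2048, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))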
I am running this on Windows, which apparently has no such cap, yet I still run into the same problem. After Scrapy has been running for a while, it throws a bunch of errors like this and then gives up:
2023-11-17 14:30:14 [scrapy.core.scraper] ERROR: Error downloading <GET https://some_page>
Traceback (most recent call last):
File ".venv\lib\site-packages\twisted\internet\defer.py", line 1445, in _inlineCallbacks
result = current_context.run(g.send, result)
File ".venv\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
File ".venv\lib\site-packages\scrapy\downloadermiddlewares\httpcache.py", line 77, in process_request
File ".venv\lib\site-packages\scrapy\extensions\httpcache.py", line 302, in retrieve_response
File ".venv\lib\site-packages\scrapy\extensions\httpcache.py", line 354, in _read_meta
OSError: [Errno 24] Too many open files: 'path to file\\pickled_meta'
I read that this is a Python issue and tried applying this fix, which did not help:
import win32file
win32file._setmaxstdio(2048)
At the moment the cache shows that 63,744 files have been created. So I don't know whether this is an OS problem, a Python problem, a bug in Scrapy, or me misusing it. I can post some code here, but I don't know which parts are relevant: the spider, the item pipeline, the parse methods, or the settings file. Any ideas for troubleshooting this would be much appreciated. Please let me know what other details I can provide.
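One way to confirm that the process really is accumulating file handles while the crawl runs is to log its open-file count; a minimal sketch, assuming psutil is installed (num_handles() is Windows-only, hence the guard):
import psutil

proc = psutil.Process()  # the current process

# Files the process currently has open (the cache's pickled_meta files would appear here)
print(f"open files: {len(proc.open_files())}")

# Total OS handle count for the process (Windows only)
if hasattr(proc, "num_handles"):
    print(f"handles: {proc.num_handles()}")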
Here are the relevant project settings:
import os

CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1
ITEM_PIPELINES = {
    'pipelines.JsonWriterPipeline': 301
}
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
AUTOTHROTTLE_DEBUG = False
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0

def is_running_in_container():
    """Determines if we are running in a container"""
    cgroup_path = '/proc/1/cgroup'
    if not os.path.isfile(cgroup_path):
        return False
    with open(cgroup_path, 'r') as cgroup_file:
        for line in cgroup_file:
            parts = line.rstrip().split(":")
            if len(parts) < 3:
                return False
            if parts[2] != "/":
                return True
    return False

def get_httpcache_dir():
    """Returns the appropriate httpcache directory to use
    depending on environment"""
    if is_running_in_container():
        parent_dir = "/scrapyd/scrapyd/data"
        if os.path.isdir(parent_dir):
            return f'{parent_dir}/httpcache'
        return os.path.join(os.path.expanduser('~'), "scrapy_httpcache")
    # Return a relative dir
    return 'httpcache'

HTTPCACHE_DIR = get_httpcache_dir()
HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.DbmCacheStorage'
This is the output of the log after the last run:
2023-11-18 12:50:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4205141,
'downloader/request_count': 9859,
'downloader/request_method_count/GET': 9859,
'downloader/response_bytes': 418300317,
'downloader/response_count': 8188,
'downloader/response_status_count/200': 8188,
'dupefilter/filtered': 701,
'elapsed_time_seconds': 12209.725667,
'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2023, 11, 18, 17, 50, 51, 249912, tzinfo=datetime.timezone.utc),
'httpcache/firsthand': 9859,
'httpcache/miss': 9859,
'httpcache/store': 9859,
'item_scraped_count': 7763,
'log_count/DEBUG': 3,
'log_count/ERROR': 1671,
'log_count/INFO': 228,
'log_count/WARNING': 1,
'request_depth_max': 1,
'response_received_count': 8188,
'scheduler/dequeued': 9859,
'scheduler/dequeued/memory': 9859,
'scheduler/enqueued': 9859,
'scheduler/enqueued/memory': 9859,
'start_time': datetime.datetime(2023, 11, 18, 14, 27, 21, 524245, tzinfo=datetime.timezone.utc)}
Adding the pipeline code here:
import json
import logging
from typing import Final

from itemadapter import ItemAdapter

class JsonWriterPipeline:
    """Use to save items to JSON files"""

    def __init__(self):
        self.file = None
        self.logging: Final = logging.getLogger('jsonwriter')
        self.fileName: Final = 'items.json'
        # counter of number of scrapes read in so far, used to provide some helpful debug later
        self.recordsScraped = 0

    def open_spider(self, _):
        """Callback called when a spider is opened"""
        self.file = open(self.fileName, 'w')

    def close_spider(self, _):
        """Callback called when a spider is closed"""
        self.file.close()
        if self.recordsScraped > 0:
            self.logging.info(f'Wrote {self.recordsScraped} records to file: {self.fileName}')

    def process_item(self, item, _):
        """Use to save an item yielded from a spider in a JSON file"""
        line = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.file.write(line)
        self.recordsScraped += 1
        return item
1 Answer
It looks like your application is using the HTTP cache to store the responses it receives. By default, when the HTTP cache is enabled, scrapy.extensions.httpcache.FilesystemCacheStorage is applied.
You mention that the cache already shows 63,744 files: for every stored response, FilesystemCacheStorage creates 6 files and 1 folder (source). I suggest switching to the DBM-based cache storage by adding
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.DbmCacheStorage'
to the project settings (it only creates a few files; its source code is at the same link).
That said, in my own practice I have not run into problems with FilesystemCacheStorage even with that many files.
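If you want to see how much the filesystem cache is contributing, one way is to count what sits under HTTPCACHE_DIR before and after switching backends; a minimal sketch, assuming the cache lives in the default relative 'httpcache' directory (adjust cache_dir to whatever get_httpcache_dir() returns in your environment):
import os

cache_dir = 'httpcache'  # assumption: change to match your HTTPCACHE_DIR

total_files = 0
total_dirs = 0
for _root, dirs, files in os.walk(cache_dir):
    total_dirs += len(dirs)
    total_files += len(files)

print(f'{total_files} files in {total_dirs} folders under {cache_dir}')
With the DBM backend the same directory should end up holding only a few database files, which is the point of the suggestion above.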