scrapy.core.scraper ERROR: Error downloading -- OSError: [Errno 24] Too many open files

7vux5j2d · asked on 2023-11-19

I inherited a Scrapy application that crawls roughly 1,000 pages on a single domain and writes the final results to a JSON file. The original author had been running it on a Mac and kept hitting the operating system's open-file limit, which he worked around by raising the cap at the OS level:
$ ulimit -n 2048
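(The same limit can also be raised from inside Python at startup; a minimal sketch, assuming a Unix-like OS - it has no effect on Windows:)

# Sketch: raise the soft open-file limit for this process up to the hard limit.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
if soft < 2048:
    # may raise ValueError if the hard limit itself is below 2048
    resource.setrlimit(resource.RLIMIT_NOFILE, (2048, hard))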
I am running this on Windows, which apparently has no such cap, yet I still hit the same problem. After Scrapy runs for a while it throws a batch of errors like the one below and then gives up:

2023-11-17 14:30:14 [scrapy.core.scraper] ERROR: Error downloading <GET https://some_page>
Traceback (most recent call last):
File ".venv\lib\site-packages\twisted\internet\defer.py", line 1445, in _inlineCallbacks
    result = current_context.run(g.send, result)
  File ".venv\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
  File ".venv\lib\site-packages\scrapy\downloadermiddlewares\httpcache.py", line 77, in process_request
  File ".venv\lib\site-packages\scrapy\extensions\httpcache.py", line 302, in retrieve_response
  File ".venv\lib\site-packages\scrapy\extensions\httpcache.py", line 354, in _read_meta
OSError: [Errno 24] Too many open files: 'path to file\\pickled_meta'

I read that this is a Python issue and tried applying the following fix, which did not help:

import win32file
win32file._setmaxstdio(2048)
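One way to check whether the process really is accumulating open file handles is to log the count while the crawl runs; a minimal diagnostic sketch, assuming the third-party psutil package is available:

# Diagnostic sketch (assumes psutil is installed): prints how many files the
# current process has open; could be called from a spider callback or extension.
import os
import psutil

proc = psutil.Process(os.getpid())
print(len(proc.open_files()), "files currently open")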


The cache currently shows that 63,744 files have been created. So I don't know whether this is an OS problem, a Python problem, a bug in Scrapy, or a misuse of it. I can post some code here, but I am not sure what is relevant - the spider, the item pipeline, the parse methods, or the settings file. Any ideas for troubleshooting this would be appreciated. Let me know what other details I can provide.
Here are the relevant project settings:

import os

CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1
ITEM_PIPELINES = {
   'pipelines.JsonWriterPipeline': 301
}
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
AUTOTHROTTLE_DEBUG = False
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0

def is_running_in_container():
    """Determines if we are running in a container"""
    cgroup_path = '/proc/1/cgroup'
    if not os.path.isfile(cgroup_path):
        return False

    with open(cgroup_path, 'r') as cgroup_file:
        for line in cgroup_file:
            parts = line.rstrip().split(":")
            if len(parts) < 3:
                return False
            if parts[2] != "/":
                return True

    return False

def get_httpcache_dir():
    """Returns the appropriate httpcache directory to use
       depending on environment"""
    if is_running_in_container():
        parent_dir = "/scrapyd/scrapyd/data"
        if os.path.isdir(parent_dir):
            return f'{parent_dir}/httpcache'

        return os.path.join(os.path.expanduser('~'), "scrapy_httpcache")

    # Return a relative dir
    return 'httpcache'

HTTPCACHE_DIR = get_httpcache_dir()
HTTPCACHE_IGNORE_HTTP_CODES = []

# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.DbmCacheStorage'


Here is the log output from the last run (note the 1,671 errors, which match the gap between the 9,859 requests and 8,188 responses):

2023-11-18 12:50:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4205141,
 'downloader/request_count': 9859,
 'downloader/request_method_count/GET': 9859,
 'downloader/response_bytes': 418300317,
 'downloader/response_count': 8188,
 'downloader/response_status_count/200': 8188,
 'dupefilter/filtered': 701,
 'elapsed_time_seconds': 12209.725667,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2023, 11, 18, 17, 50, 51, 249912, tzinfo=datetime.timezone.utc),
 'httpcache/firsthand': 9859,
 'httpcache/miss': 9859,
 'httpcache/store': 9859,
 'item_scraped_count': 7763,
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1671,
 'log_count/INFO': 228,
 'log_count/WARNING': 1,
 'request_depth_max': 1,
 'response_received_count': 8188,
 'scheduler/dequeued': 9859,
 'scheduler/dequeued/memory': 9859,
 'scheduler/enqueued': 9859,
 'scheduler/enqueued/memory': 9859,
 'start_time': datetime.datetime(2023, 11, 18, 14, 27, 21, 524245, tzinfo=datetime.timezone.utc)}


Adding the pipeline code here:

import json
import logging
from typing import Final

from itemadapter import ItemAdapter


class JsonWriterPipeline:
    """Use to save items to JSON files"""

    def __init__(self):
        self.file = None
        self.logging: Final = logging.getLogger('jsonwriter')
        self.fileName: Final = 'items.json'

        # counter of number of scrapes read in so far, used to provide some helpful debug later
        self.recordsScraped = 0

    def open_spider(self, _):
        """Callback called when a spider is opened"""
        self.file = open(self.fileName, 'w')

    def close_spider(self, _):
        """Callback called when a spider is closed"""
        self.file.close()
        if self.recordsScraped > 0:
            self.logging.info(f'Wrote {self.recordsScraped} records to file: {self.fileName}')

    def process_item(self, item, _):
        """Use to save an item yielded from a spider in a JSON file"""
        line = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.file.write(line)
        self.recordsScraped += 1
        return item
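For reference, the same line-per-item JSON output could also be produced by Scrapy's built-in feed exports instead of a custom pipeline; a minimal settings sketch (the file name items.jl is only an example):

# Feed-export equivalent of the JsonWriterPipeline above (sketch).
FEEDS = {
    'items.jl': {'format': 'jsonlines', 'encoding': 'utf8'},
}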

w8rqjzmb (answer 1)

It looks like your application uses the httpcache to store the responses it receives.
By default, when the httpcache is enabled, scrapy.extensions.httpcache.FilesystemCacheStorage is used.
You mention that the cache already contains 63,744 files and that you are not sure whether this is an OS problem, a Python problem, a Scrapy bug, or misuse.
For each stored response, FilesystemCacheStorage creates 6 files and 1 folder (see its source). With the ~9,859 responses your stats show as stored, that works out to well over 60,000 filesystem entries, which lines up with the 63,744 files you observed.
I would suggest switching to the dbm-based cache storage by adding

HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.DbmCacheStorage"

to your project settings (it creates only a few files - its source is at the same link).
That said, in my own practice I have not run into problems with FilesystemCacheStorage even at that number of files.
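Concretely, the switch could look like this in settings.py (HTTPCACHE_DBM_MODULE is optional and shown only for completeness); the old httpcache directory left behind by FilesystemCacheStorage can then be deleted by hand, since HTTPCACHE_EXPIRATION_SECS = 0 means entries never expire on their own:

HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.DbmCacheStorage'
# Assumption: default dbm backend; the cache becomes a small number of .db
# files per spider instead of one folder plus six files per cached response.
HTTPCACHE_DBM_MODULE = 'dbm'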
