I inherited a Scrapy application that scrapes 1,000 pages on a single domain and writes the final results to a JSON file. The author had been running it on a Mac and hit an operating-system limit complaining that the maximum number of open files had been reached. He worked around it by overriding the OS-level cap:
$ ulimit -n 2048
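(For reference, the same cap can also be raised from inside the Python process on macOS/Linux with the standard resource module; a minimal sketch, not applicable on Windows where resource is unavailable:)
import resource

# Read the current soft/hard limits for open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Raise the soft limit, capped at the hard limit (equivalent to `ulimit -n 2048`)
new_soft = 2048 if hard == resource.RLIM_INFINITY else min(2048, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))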
I am running this on Windows, which apparently has no such cap, yet I still run into the same problem. After Scrapy has been running for a while, it throws a bunch of errors like this and then gives up:
2023-11-17 14:30:14 [scrapy.core.scraper] ERROR: Error downloading <GET https://some_page>
Traceback (most recent call last):
File ".venv\lib\site-packages\twisted\internet\defer.py", line 1445, in _inlineCallbacks
result = current_context.run(g.send, result)
File ".venv\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
File ".venv\lib\site-packages\scrapy\downloadermiddlewares\httpcache.py", line 77, in process_request
File ".venv\lib\site-packages\scrapy\extensions\httpcache.py", line 302, in retrieve_response
File ".venv\lib\site-packages\scrapy\extensions\httpcache.py", line 354, in _read_meta
OSError: [Errno 24] Too many open files: 'path to file\\pickled_meta'
I read that this is a Python issue and tried applying this fix, which did not help:
import win32file
win32file._setmaxstdio(2048)
At the moment the cache shows that 63,744 files have been created. So I don't know whether this is an OS problem, a Python problem, a bug in Scrapy, or me misusing it. I can post some code here, but I don't know which parts are relevant: the spider, the item pipeline, the parse methods, or the settings file. Any ideas for troubleshooting this would be much appreciated. Please let me know what other details I can provide.
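One way to confirm that the process really is accumulating file handles while the crawl runs is to log its open-file count; a minimal sketch, assuming psutil is installed (num_handles() is Windows-only, hence the guard):
import psutil

proc = psutil.Process()  # the current process

# Files the process currently has open (the cache's pickled_meta files would appear here)
print(f"open files: {len(proc.open_files())}")

# Total OS handle count for the process (Windows only)
if hasattr(proc, "num_handles"):
    print(f"handles: {proc.num_handles()}")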
Here are the relevant project settings:
import os

CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1
ITEM_PIPELINES = {
    'pipelines.JsonWriterPipeline': 301
}
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
AUTOTHROTTLE_DEBUG = False
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0

def is_running_in_container():
    """Determines if we are running in a container"""
    cgroup_path = '/proc/1/cgroup'
    if not os.path.isfile(cgroup_path):
        return False
    with open(cgroup_path, 'r') as cgroup_file:
        for line in cgroup_file:
            parts = line.rstrip().split(":")
            if len(parts) < 3:
                return False
            if parts[2] != "/":
                return True
    return False

def get_httpcache_dir():
    """Returns the appropriate httpcache directory to use
    depending on environment"""
    if is_running_in_container():
        parent_dir = "/scrapyd/scrapyd/data"
        if os.path.isdir(parent_dir):
            return f'{parent_dir}/httpcache'
        return os.path.join(os.path.expanduser('~'), "scrapy_httpcache")
    # Return a relative dir
    return 'httpcache'

HTTPCACHE_DIR = get_httpcache_dir()
HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.DbmCacheStorage'
This is the output of the log after the last run:
2023-11-18 12:50:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4205141,
'downloader/request_count': 9859,
'downloader/request_method_count/GET': 9859,
'downloader/response_bytes': 418300317,
'downloader/response_count': 8188,
'downloader/response_status_count/200': 8188,
'dupefilter/filtered': 701,
'elapsed_time_seconds': 12209.725667,
'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2023, 11, 18, 17, 50, 51, 249912, tzinfo=datetime.timezone.utc),
'httpcache/firsthand': 9859,
'httpcache/miss': 9859,
'httpcache/store': 9859,
'item_scraped_count': 7763,
'log_count/DEBUG': 3,
'log_count/ERROR': 1671,
'log_count/INFO': 228,
'log_count/WARNING': 1,
'request_depth_max': 1,
'response_received_count': 8188,
'scheduler/dequeued': 9859,
'scheduler/dequeued/memory': 9859,
'scheduler/enqueued': 9859,
'scheduler/enqueued/memory': 9859,
'start_time': datetime.datetime(2023, 11, 18, 14, 27, 21, 524245, tzinfo=datetime.timezone.utc)}
Adding the pipeline code here:
import json
import logging
from typing import Final

from itemadapter import ItemAdapter

class JsonWriterPipeline:
    """Use to save items to JSON files"""

    def __init__(self):
        self.file = None
        self.logging: Final = logging.getLogger('jsonwriter')
        self.fileName: Final = 'items.json'
        # counter of number of scrapes read in so far, used to provide some helpful debug later
        self.recordsScraped = 0

    def open_spider(self, _):
        """Callback called when a spider is opened"""
        self.file = open(self.fileName, 'w')

    def close_spider(self, _):
        """Callback called when a spider is closed"""
        self.file.close()
        if self.recordsScraped > 0:
            self.logging.info(f'Wrote {self.recordsScraped} records to file: {self.fileName}')

    def process_item(self, item, _):
        """Use to save an item yielded from a spider in a JSON file"""
        line = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.file.write(line)
        self.recordsScraped += 1
        return item
1 Answer
It looks like your application is using the HTTP cache to store the responses it receives. By default, when the HTTP cache is enabled, scrapy.extensions.httpcache.FilesystemCacheStorage is applied.
You mention that the cache already shows 63,744 files: for every stored response, FilesystemCacheStorage creates 6 files and 1 folder (source). I suggest switching to the DBM-based cache storage by adding
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.DbmCacheStorage'
to the project settings (it only creates a few files; its source code is at the same link).
That said, in my own practice I have not run into problems with FilesystemCacheStorage even with that many files.
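If you want to see how much the filesystem cache is contributing, one way is to count what sits under HTTPCACHE_DIR before and after switching backends; a minimal sketch, assuming the cache lives in the default relative 'httpcache' directory (adjust cache_dir to whatever get_httpcache_dir() returns in your environment):
import os

cache_dir = 'httpcache'  # assumption: change to match your HTTPCACHE_DIR

total_files = 0
total_dirs = 0
for _root, dirs, files in os.walk(cache_dir):
    total_dirs += len(dirs)
    total_files += len(files)

print(f'{total_files} files in {total_dirs} folders under {cache_dir}')
With the DBM backend the same directory should end up holding only a few database files, which is the point of the suggestion above.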