Whenever I run my spider with `scrapy crawl test -O test.json` in the Visual Studio Code terminal, I get output like the following:
```
2023-01-31 14:31:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.example.com/product/1>
{'price': 100,
'newprice': 90
}
2023-01-31 14:31:50 [scrapy.core.engine] INFO: Closing spider (finished)
2023-01-31 14:31:50 [scrapy.extensions.feedexport] INFO: Stored json feed (251 items) in: test.json
2023-01-31 14:31:50 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://localhost:61169/session/996866d968ab791730e4f6d87ce2a1ea {}
2023-01-31 14:31:50 [urllib3.connectionpool] DEBUG: http://localhost:61169 "DELETE /session/996866d968ab791730e4f6d87ce2a1ea HTTP/1.1" 200 14
2023-01-31 14:31:50 [selenium.webdriver.remote.remote_connection] DEBUG: Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2023-01-31 14:31:50 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2023-01-31 14:31:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 91321,
 'downloader/request_count': 267,
 'downloader/request_method_count/GET': 267,
 'downloader/response_bytes': 2730055,
 'downloader/response_count': 267,
 'downloader/response_status_count/200': 267,
 'dupefilter/filtered': 121,
 'elapsed_time_seconds': 11.580893,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 1, 31, 13, 31, 50, 495392),
 'httpcompression/response_bytes': 9718676,
 'httpcompression/response_count': 267,
 'item_scraped_count': 251,
 'log_count/DEBUG': 537,
 'log_count/INFO': 11,
 'request_depth_max': 2,
 'response_received_count': 267,
 'scheduler/dequeued': 267,
 'scheduler/dequeued/memory': 267,
 'scheduler/enqueued': 267,
 'scheduler/enqueued/memory': 267,
 'start_time': datetime.datetime(2023, 1, 31, 13, 31, 38, 914499)}
2023-01-31 14:31:52 [scrapy.core.engine] INFO: Spider closed (finished)
```
I want to log all of this, including the print('hi') lines in my spiders, but I do not want to log the scraped item output, which in this case is {'price': 100, 'newprice': 90}.
Looking at the output above, I assumed I just needed to disable downloader/response_bytes. I have been reading https://docs.scrapy.org/en/latest/topics/logging.html, but I cannot work out where or how to configure this for my exact use case. I have hundreds of spiders, and I do not want to add a configuration to each one individually; I want to apply the logging configuration to all spiders at once. Do I need to add a separate configuration file, or can I add it to an existing one such as scrapy.cfg?
**Update 1**
Below is the folder structure in which I created settings.py:
```
Scrapy\
    tt_spiders\
        myspiders\
            spider1.py
            spider2.py
            settings.py
        middlewares.py
        pipelines.py
        settings.py
    scrapy.cfg
    settings.py
```
**settings.py**
```python
if __name__ == "__main__":
    disable_list = ['scrapy.core.engine', 'scrapy.core.scraper', 'scrapy.spiders']
    for element in disable_list:
        logger = logging.getLogger(element)
        logger.disabled = True

    spider = 'example_spider'

    settings = get_project_settings()
    settings['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'

    process = CrawlerProcess(settings)
    process.crawl(spider)
    process.start()
```
This throws three errors, which makes sense, since I did not define these:
- 'logging' is not defined
- 'get_project_settings' is not defined
- 'CrawlerProcess' is not defined
More importantly, though, what I do not understand is that this code contains spider = 'example_spider', whereas I want the logic to apply to all spiders.
So I reduced it to:
```python
if __name__ == "__main__":
    disable_list = ['scrapy.core.scraper']
```
But the output is still being logged. What am I missing?
**1 Answer**
Suppose we have a spider like this:
spider.py:
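A minimal sketch of what such a spider might look like (the URL is taken from the log line further down; the spider name, CSS selector, and item field here are purely illustrative):

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    start_urls = ['https://scrapingclub.com/exercise/detail_basic/']

    def parse(self, response):
        # a print() call like the one the question wants to keep seeing
        print('hi')
        # illustrative selector; the real page structure may differ
        yield {'price': response.css('.card-price::text').get()}
```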
Its output will include the scraped item among the usual Scrapy log lines. If you want to disable logging for a specific line, simply copy the text inside its square brackets and disable that logger. For example, for the line:
```
[scrapy.core.scraper] DEBUG: Scraped from <200 https://scrapingclub.com/exercise/detail_basic/>
```
main.py:
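A sketch of such a main.py, essentially the question's own snippet with the missing imports filled in (the spider name example_spider is assumed):

```python
import logging

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == "__main__":
    # disable the logger that emits the "Scraped from <200 ...>" item lines
    logging.getLogger('scrapy.core.scraper').disabled = True

    process = CrawlerProcess(get_project_settings())
    # crawl() accepts a spider's name attribute or the spider class itself
    process.crawl('example_spider')
    process.start()
```

With this, print() output still reaches the console (it goes to stdout, not through logging), while the item lines from scrapy.core.scraper are suppressed.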
If you want to disable certain extensions, you can set them to None in settings.py. Add the following to settings.py:
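For example (a sketch; the two built-in extensions shown here are only illustrations, and mapping an extension's import path to None is what disables it):

```python
# in settings.py: disable unwanted extensions by setting them to None
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
    'scrapy.extensions.logstats.LogStats': None,
}
```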