I'm crawling 28 million pages, and my spider starts out fast and then gradually slows down. I suspected the server was blocking me, except that I can launch a second spider and it starts out fast again. It isn't hardware; this runs on a decent VPS with 24GB of RAM, and the allowed domains list contains only that one site. What could be causing the slowdown?
If I stop the job and resume it right away, it starts out fast again:
2022-11-11 12:25:39 [scrapy.core.engine] INFO: Spider opened
2022-11-11 12:25:39 [scrapy.core.scheduler] INFO: Resuming crawl (97145 requests scheduled)
2022-11-11 12:25:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-11 12:25:39 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-11-11 12:26:39 [scrapy.extensions.logstats] INFO: Crawled 1633 pages (at 1633 pages/min), scraped 1629 items (at 1629 items/min)
2022-11-11 12:27:39 [scrapy.extensions.logstats] INFO: Crawled 3242 pages (at 1609 pages/min), scraped 3238 items (at 1609 items/min)
2022-11-11 12:28:39 [scrapy.extensions.logstats] INFO: Crawled 4736 pages (at 1494 pages/min), scraped 4733 items (at 1495 items/min)
2022-11-11 12:29:40 [scrapy.extensions.logstats] INFO: Crawled 5914 pages (at 1178 pages/min), scraped 5906 items (at 1173 items/min)
2022-11-11 12:30:39 [scrapy.extensions.logstats] INFO: Crawled 7198 pages (at 1284 pages/min), scraped 7190 items (at 1284 items/min)
2022-11-11 12:31:40 [scrapy.extensions.logstats] INFO: Crawled 8417 pages (at 1219 pages/min), scraped 8408 items (at 1218 items/min)
2022-11-11 12:32:40 [scrapy.extensions.logstats] INFO: Crawled 9557 pages (at 1140 pages/min), scraped 9553 items (at 1145 items/min)
2022-11-11 12:33:40 [scrapy.extensions.logstats] INFO: Crawled 10617 pages (at 1060 pages/min), scraped 10612 items (at 1059 items/min)
2022-11-11 12:34:40 [scrapy.extensions.logstats] INFO: Crawled 11629 pages (at 1012 pages/min), scraped 11623 items (at 1011 items/min)
2022-11-11 12:35:40 [scrapy.extensions.logstats] INFO: Crawled 12592 pages (at 963 pages/min), scraped 12587 items (at 964 items/min)
2022-11-11 12:36:39 [scrapy.extensions.logstats] INFO: Crawled 13499 pages (at 907 pages/min), scraped 13493 items (at 906 items/min)
2022-11-11 12:37:40 [scrapy.extensions.logstats] INFO: Crawled 14368 pages (at 869 pages/min), scraped 14364 items (at 871 items/min)
2022-11-11 12:38:40 [scrapy.extensions.logstats] INFO: Crawled 15161 pages (at 793 pages/min), scraped 15153 items (at 789 items/min)
2022-11-11 12:39:40 [scrapy.extensions.logstats] INFO: Crawled 15884 pages (at 723 pages/min), scraped 15881 items (at 728 items/min)
2022-11-11 12:40:40 [scrapy.extensions.logstats] INFO: Crawled 16665 pages (at 781 pages/min), scraped 16657 items (at 776 items/min)
2022-11-11 12:41:40 [scrapy.extensions.logstats] INFO: Crawled 17417 pages (at 752 pages/min), scraped 17409 items (at 752 items/min)
2022-11-11 12:42:40 [scrapy.extensions.logstats] INFO: Crawled 18140 pages (at 723 pages/min), scraped 18132 items (at 723 items/min)
2022-11-11 12:43:40 [scrapy.extensions.logstats] INFO: Crawled 18844 pages (at 704 pages/min), scraped 18836 items (at 704 items/min)
2022-11-11 12:44:40 [scrapy.extensions.logstats] INFO: Crawled 19528 pages (at 684 pages/min), scraped 19516 items (at 680 items/min)
2022-11-11 12:45:40 [scrapy.extensions.logstats] INFO: Crawled 20188 pages (at 660 pages/min), scraped 20180 items (at 664 items/min)
2022-11-11 12:46:40 [scrapy.extensions.logstats] INFO: Crawled 20836 pages (at 648 pages/min), scraped 20828 items (at 648 items/min)
2022-11-11 12:47:39 [scrapy.extensions.logstats] INFO: Crawled 21460 pages (at 624 pages/min), scraped 21452 items (at 624 items/min)
2022-11-11 12:48:40 [scrapy.extensions.logstats] INFO: Crawled 22014 pages (at 554 pages/min), scraped 22006 items (at 554 items/min)
2022-11-11 12:49:40 [scrapy.extensions.logstats] INFO: Crawled 22588 pages (at 574 pages/min), scraped 22580 items (at 574 items/min)
2022-11-11 12:50:40 [scrapy.extensions.logstats] INFO: Crawled 23159 pages (at 571 pages/min), scraped 23151 items (at 571 items/min)
2022-11-11 12:51:39 [scrapy.extensions.logstats] INFO: Crawled 23731 pages (at 572 pages/min), scraped 23723 items (at 572 items/min)
2022-11-11 12:52:40 [scrapy.extensions.logstats] INFO: Crawled 24299 pages (at 568 pages/min), scraped 24291 items (at 568 items/min)
2022-11-11 12:53:40 [scrapy.extensions.logstats] INFO: Crawled 24847 pages (at 548 pages/min), scraped 24839 items (at 548 items/min)
2022-11-11 12:54:40 [scrapy.extensions.logstats] INFO: Crawled 25385 pages (at 538 pages/min), scraped 25377 items (at 538 items/min)
2022-11-11 12:55:39 [scrapy.extensions.logstats] INFO: Crawled 25917 pages (at 532 pages/min), scraped 25909 items (at 532 items/min)
2022-11-11 12:56:40 [scrapy.extensions.logstats] INFO: Crawled 26441 pages (at 524 pages/min), scraped 26433 items (at 524 items/min)
2022-11-11 12:57:40 [scrapy.extensions.logstats] INFO: Crawled 26953 pages (at 512 pages/min), scraped 26945 items (at 512 items/min)
2022-11-11 12:58:40 [scrapy.extensions.logstats] INFO: Crawled 27442 pages (at 489 pages/min), scraped 27440 items (at 495 items/min)
2022-11-11 12:59:40 [scrapy.extensions.logstats] INFO: Crawled 27882 pages (at 440 pages/min), scraped 27874 items (at 434 items/min)
2022-11-11 13:00:40 [scrapy.extensions.logstats] INFO: Crawled 28372 pages (at 490 pages/min), scraped 28364 items (at 490 items/min)
2022-11-11 13:01:40 [scrapy.extensions.logstats] INFO: Crawled 28856 pages (at 484 pages/min), scraped 28848 items (at 484 items/min)
2022-11-11 13:02:40 [scrapy.extensions.logstats] INFO: Crawled 29332 pages (at 476 pages/min), scraped 29324 items (at 476 items/min)
2022-11-11 13:03:40 [scrapy.extensions.logstats] INFO: Crawled 29800 pages (at 468 pages/min), scraped 29792 items (at 468 items/min)
2022-11-11 13:04:39 [scrapy.extensions.logstats] INFO: Crawled 30260 pages (at 460 pages/min), scraped 30252 items (at 460 items/min)
2022-11-11 13:05:40 [scrapy.extensions.logstats] INFO: Crawled 30720 pages (at 460 pages/min), scraped 30712 items (at 460 items/min)
2022-11-11 13:06:40 [scrapy.extensions.logstats] INFO: Crawled 31166 pages (at 446 pages/min), scraped 31158 items (at 446 items/min)
I've tried AutoThrottle and a different VPS, with the same result.
1 Answer
It isn't actually slowing down; it only looks that way because of the number of concurrent tasks it is managing at once. AutoThrottle can help mitigate this behavior, but it only affects one end of the scrapy workflow. The output/feed side of the spider is asynchronous too, and it typically accumulates a large backlog of concurrent jobs, and iterating over all of them takes longer and longer to clear. On top of that, if you have any custom middlewares that happen to be computationally expensive, they can cause a noticeable slowdown.
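If you want to keep experimenting with AutoThrottle, these are the relevant knobs; a minimal settings.py sketch with illustrative values, not tuned recommendations:

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0          # ceiling on the delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0  # average parallel requests per remote server
AUTOTHROTTLE_DEBUG = True              # log every throttling decision while tuning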
You can use CONCURRENT_ITEMS to tune how many pieces of output are processed simultaneously, and the CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP settings work in a similar way to AutoThrottle. Adjusting any of these settings can improve your spider's output rate.
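For example, a settings.py sketch (the values are illustrative assumptions; Scrapy's defaults are noted in the comments):

CONCURRENT_ITEMS = 200               # parallel items per response in the pipelines (default 100)
CONCURRENT_REQUESTS = 32             # global cap on in-flight requests (default 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # per-domain cap (default 8)
CONCURRENT_REQUESTS_PER_IP = 0       # if non-zero, replaces the per-domain cap

Since your allowed domains list contains a single site, CONCURRENT_REQUESTS_PER_DOMAIN is the cap that actually binds here.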
You can also use scrapy's logging and signals API to help pinpoint exactly where the slowdown is happening, as sketched below. But I should note that scrapy will always run faster at the start of a crawl: when a crawl starts or resumes, the scheduler is completely empty, so the first items processed pass through the workflow quickly.
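A minimal sketch of that signals approach, assuming a hypothetical TimingExtension registered through the EXTENSIONS setting (the module path and priority number below are placeholders): it timestamps each response as it leaves the downloader and logs how long the resulting item takes to clear every pipeline, so a growing gap points at the output side rather than the server.

import logging
import time

from scrapy import signals

logger = logging.getLogger(__name__)


class TimingExtension:
    # Enable with e.g.: EXTENSIONS = {"myproject.extensions.TimingExtension": 500}

    def __init__(self):
        self.download_times = {}

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.response_received,
                                signal=signals.response_received)
        crawler.signals.connect(ext.item_scraped,
                                signal=signals.item_scraped)
        return ext

    def response_received(self, response, request, spider):
        # Timestamp the moment the downloader hands the response back.
        self.download_times[id(request)] = time.monotonic()

    def item_scraped(self, item, response, spider):
        # item_scraped fires only after the item has passed every pipeline,
        # so this interval measures the output side of the workflow.
        # (Keyed by id(); redirected requests may not match up, which this
        # sketch simply ignores.)
        started = self.download_times.pop(id(response.request), None)
        if started is not None:
            logger.info("item cleared %.2fs after download",
                        time.monotonic() - started)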