Looping over a scrapy.Spider

Posted by fiei3ece on 2023-01-02 in Other

I want to run a scrapy.Spider in a loop, like this:

for i in range(0, 10):

    class MySpider(scrapy.Spider, ABC):

        start_urls = ["example.com"]

        def start_requests(self):
            for url in self.urls:
                if dec == i:
                    yield SplashRequest(url=url, callback=self.parse_data, args={"wait": 1.5})

        def parse_data(self, response):
            data = response.css("td.right.data").extract()
            items["Data"] = data
            yield items

    settings = get_project_settings()
    settings["FEED_URI"] = f"/../Data/data_{i}.json"

    if __name__ == "__main__":
        process = CrawlerProcess(settings)
        process.crawl(MySpider)
        process.start()

However, this raises:

twisted.internet.error.ReactorNotRestartable

With

process.start(stop_after_crawl=False)

the script runs for i=0 but hangs at i=1.

Answer #1, by fgw7neuy

You can use multiprocessing or a LoopingCall.
The Twisted reactor documentation covers this under "scheduling tasks for the future".

from scrapy.utils.log import configure_logging
from twisted.internet import task
from twisted.internet import reactor
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerRunner

# callback for when the loop is finished
def cbLoopDone(result):
    reactor.stop()

_loopCounter = 0
loopTimes = 3
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

def run_spider():
    global _loopCounter
    # add whatever you want to check before running another spider
    if _loopCounter >= loopTimes:
        loop.stop()
        return

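    _loopCounter += 1
    runner = CrawlerRunner(get_project_settings())
    # Return the crawl's Deferred so LoopingCall waits for this crawl to
    # finish before scheduling the next run (and before the reactor stops)
    return runner.crawl('exampleSpider')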

loop = task.LoopingCall(run_spider)

# Start looping every 5 seconds
loopDeferred = loop.start(5.0)

# Add callbacks for stop
loopDeferred.addCallback(cbLoopDone)

reactor.run()
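
The answer also mentions multiprocessing but does not show it. A minimal sketch of that option follows: each crawl runs in its own child process, so every run gets a fresh Twisted reactor and ReactorNotRestartable should not come up. The helper names (`feed_uri`, `crawl_once`, `crawl_sequentially`), the relative output path, the `FEEDS` setting (assuming Scrapy 2.1 or later, in place of the question's `FEED_URI`) and passing `i` as a spider argument are all illustrative assumptions, not part of the original post.

```python
import multiprocessing


def feed_uri(i):
    """Output path for run i (hypothetical layout, adapted from the question)."""
    return f"Data/data_{i}.json"


def crawl_once(spider_name, i):
    """Run a single crawl to completion; meant to execute in a child process."""
    # Imported here so Scrapy and Twisted are only loaded inside the child,
    # which then starts (and stops) its own reactor exactly once.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    # Assumption: Scrapy >= 2.1, where FEEDS configures feed exports.
    settings.set("FEEDS", {feed_uri(i): {"format": "json"}})

    process = CrawlerProcess(settings)
    # Keyword arguments are passed to the spider, so it can read self.i
    # instead of relying on a loop variable from the enclosing module.
    process.crawl(spider_name, i=i)
    process.start()  # blocks until this crawl has finished


def crawl_sequentially(spider_name, runs):
    """Start one fresh process per run and wait for it before the next."""
    for i in runs:
        child = multiprocessing.Process(target=crawl_once, args=(spider_name, i))
        child.start()
        child.join()
```

To use it, call `crawl_sequentially("exampleSpider", range(10))` from under an `if __name__ == "__main__":` guard inside the Scrapy project, so that platforms which spawn rather than fork child processes do not re-run the loop on import. The spider class itself would be defined once at module level rather than inside the `for` loop.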
