Scrapy - ReactorNotRestartable [duplicate]

rslzwgfq asked on 2022-11-09

This question already has answers here:

ReactorNotRestartable error in while loop with scrapy (10 answers)
Closed 3 years ago.
With:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess

I have always run this process successfully:

process = CrawlerProcess(get_project_settings())
process.crawl(*args)

# the script will block here until the crawling is finished

process.start()

But since I have moved this code into a web_crawler(self) method, like so:

def web_crawler(self):
    # set up a crawler
    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start() 

    # (...)

    return (result1, result2)

and started calling the method via class instantiation, like:

def __call__(self):
    results1 = test.web_crawler()[1]
    results2 = test.web_crawler()[0]

and running:

test()

I get the following error:

Traceback (most recent call last):
  File "test.py", line 573, in <module>
    print (test())
  File "test.py", line 530, in __call__
    artists = test.web_crawler()
  File "test.py", line 438, in web_crawler
    process.start() 
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1194, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1174, in startRunning
    ReactorBase.startRunning(self)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

What is wrong?


wwodge7n1#

You cannot restart the reactor, but you should be able to run it multiple times by forking a separate process:

import scrapy
import scrapy.crawler as crawler
from scrapy.utils.log import configure_logging
from multiprocessing import Process, Queue
from twisted.internet import reactor

# your spider

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())

# the wrapper to make it run more times

def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result

Run it twice:

configure_logging()

print('first run:')
run_spider(QuotesSpider)

print('\nsecond run:')
run_spider(QuotesSpider)

Result:

first run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...

second run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...
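If you also need the scraped items back in the parent process, the same Queue can carry them. A minimal sketch, assuming the items are picklable (the collect-and-put pattern and the run_spider_collect name are mine, not part of the original answer; the item_scraped signal is standard Scrapy API):

import scrapy.crawler as crawler
from scrapy import signals
from multiprocessing import Process, Queue
from twisted.internet import reactor

def run_spider_collect(spider):
    def f(q):
        try:
            items = []
            runner = crawler.CrawlerRunner()
            # create_crawler() returns a Crawler object we can attach signals to
            crawler_obj = runner.create_crawler(spider)
            crawler_obj.signals.connect(
                lambda item, response, spider: items.append(item),
                signal=signals.item_scraped,
            )
            deferred = runner.crawl(crawler_obj)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(items)  # items must be picklable to cross the process boundary
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if isinstance(result, Exception):
        raise result
    return result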

ryoqjall2#

Here is what helped me beat the ReactorNotRestartable error: last answer from the author of the question

0) pip install crochet
1) add from crochet import setup
2) call setup() at the top of the file
3) remove 2 lines:
a) d.addBoth(lambda _: reactor.stop())
b) reactor.run()

I had the same problem with this error, and spent 4+ hours solving it, reading all the related questions here. Finally found that one - and am sharing it. That is how I solved this. The only meaningful lines left from the Scrapy docs are the 2 last lines in this code of mine:


# some more imports
from importlib import import_module

from crochet import setup
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

setup()

def run_spider(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)   # dynamic import of the selected spider module
    spiderObj = scrapy_var.mySpider()         # get the mySpider object from the spider module
    crawler = CrawlerRunner(get_project_settings())   # from Scrapy docs
    crawler.crawl(spiderObj)                          # from Scrapy docs

This code lets me pick which spider to run just by passing its name to the run_spider function, and once scraping finishes, pick another spider and run it again.
Hope this helps somebody, as it helped me :)
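One caveat: the snippet above only schedules the crawl and returns immediately. If you need the call to block until scraping is done, crochet's wait_for decorator can wait on the Deferred that crawl() returns. A sketch under the same assumptions (project first_scrapy, spider class mySpider; the run_spider_blocking name is mine):

from importlib import import_module

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

setup()  # must be called once, at the top of the file

@wait_for(timeout=3600)  # block the caller until the crawl finishes (or times out)
def run_spider_blocking(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)
    crawler = CrawlerRunner(get_project_settings())
    # returning the Deferred is what lets @wait_for wait on it
    return crawler.crawl(scrapy_var.mySpider)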


nukf8bse3#

Per the Scrapy documentation, the start() method of the CrawlerProcess class does the following:
"[...] starts a Twisted reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE."
The error you are receiving is being raised by Twisted, because a Twisted reactor cannot be restarted. It uses a ton of globals, and even if you do jimmy-rig some sort of code to restart it (I've seen it done), there's no guarantee it will work.
Honestly, if you think you need to restart the reactor, you're likely doing something wrong.
Depending on what you want to do, I would also review the Running Scrapy from a Script portion of the documentation.
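For reference, the documented way to chain several crawls inside a single reactor run uses CrawlerRunner with inlineCallbacks (adapted from the "Running multiple spiders in the same process" example in the Scrapy docs; MySpider1 and MySpider2 are placeholder spiders):

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import defer, reactor

class MySpider1(scrapy.Spider):
    name = 'spider1'
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        pass

class MySpider2(scrapy.Spider):
    name = 'spider2'
    start_urls = ['http://quotes.toscrape.com/tag/books/']

    def parse(self, response):
        pass

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # each yield waits for the previous crawl to finish before starting the next
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script blocks here until the last crawl is finished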


qlvxas9a4#

As some people have already pointed out: you shouldn't need to restart the reactor.
Ideally, if you want to chain your processes (crawl1, then crawl2, then crawl3), you simply add callbacks.
For example, I've been using this loop spider that follows this pattern:

1. Crawl A
2. Sleep N
3. goto 1

And this is how it looks in Scrapy:

import time

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.body)

def sleep(_, duration=5):
    print(f'sleeping for: {duration}')
    time.sleep(duration)  # block here

def crawl(runner):
    d = runner.crawl(HttpbinSpider)
    d.addBoth(sleep)
    d.addBoth(lambda _: crawl(runner))
    return d

def loop_crawl():
    runner = CrawlerRunner(get_project_settings())
    crawl(runner)
    reactor.run()

if __name__ == '__main__':
    loop_crawl()

To explain the process in more detail: the crawl function schedules a crawl and adds two extra callbacks that are called when crawling is over: a blocking sleep and a recursive call to itself (scheduling another crawl).

$ python endless_crawl.py 
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
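One caveat with the pattern above: time.sleep blocks the reactor thread, so nothing else can run while it waits. A non-blocking variant of the sleep callback, a sketch using Twisted's task.deferLater, would be:

from twisted.internet import reactor, task

def sleep(_, duration=5):
    print(f'sleeping for: {duration}')
    # deferLater returns a Deferred, so the next callback in the chain
    # only fires after the delay, without blocking the reactor
    return task.deferLater(reactor, duration, lambda: None)

The rest of the crawl/loop_crawl code stays the same.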

xxslljrj5#

The mistake is in this code:

def __call__(self):
    result1 = test.web_crawler()[1]
    result2 = test.web_crawler()[0] # here

web_crawler() returns two results, and to get them your code tries to start the process twice, restarting the reactor, as pointed out by @Rejected.
Running one single process and storing both results in a tuple is the way to go here:

def __call__(self):
    result1, result2 = test.web_crawler()

jdgnovmf6#

This solved my problem. Put the code below after reactor.run() or process.start():

import os
import sys
import time

time.sleep(0.5)
os.execl(sys.executable, sys.executable, *sys.argv)

os.execl replaces the running process with a fresh Python interpreter, so the script starts over with a brand-new, startable reactor.
