Scrapy - Reactor not Restartable [duplicate]

Posted by rslzwgfq on 2022-11-09 in React


This question already has answers here:

ReactorNotRestartable error in while loop with scrapy (10 answers)
Closed 3 years ago.
With:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings  # used by the snippets below

I have always run this process successfully:

process = CrawlerProcess(get_project_settings())
process.crawl(*args)

# the script will block here until the crawling is finished

process.start()

But since I moved this code into a web_crawler(self) function, like so:

def web_crawler(self):
    # set up a crawler
    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start() 

    # (...)

    return (result1, result2)

and started calling the method through a class instance, like this:

def __call__(self):
    results1 = test.web_crawler()[1]
    results2 = test.web_crawler()[0]

and running:

test()

I get the following error:

Traceback (most recent call last):
  File "test.py", line 573, in <module>
    print (test())
  File "test.py", line 530, in __call__
    artists = test.web_crawler()
  File "test.py", line 438, in web_crawler
    process.start() 
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1194, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1174, in startRunning
    ReactorBase.startRunning(self)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

What is wrong?

scrapy

Source: https://stackoverflow.com/questions/41495052/scrapy-reactor-not-restartable

6 answers

wwodge7n1#

You cannot restart the reactor, but you should be able to run it more than once by forking a separate process:

import scrapy
import scrapy.crawler as crawler
from scrapy.utils.log import configure_logging
from multiprocessing import Process, Queue
from twisted.internet import reactor

# your spider

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())

# the wrapper to make it run more times

def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result

Run it twice:

configure_logging()

print('first run:')
run_spider(QuotesSpider)

print('\nsecond run:')
run_spider(QuotesSpider)

Result:

first run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...

second run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...
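
A related way to give every run its own reactor, sketched here under assumptions, is to launch each run in a brand-new Python interpreter with the standard subprocess module. The helper name run_in_fresh_interpreter and the inline snippets are hypothetical stand-ins; in a real project the snippet would be replaced by whatever command starts your crawl script.

```python
import subprocess
import sys


def run_in_fresh_interpreter(code):
    """Run a Python snippet in a new interpreter and return its stdout.

    Hypothetical helper: every call gets a new process, so any
    process-global state (such as a Twisted reactor) starts fresh.
    """
    completed = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child exits non-zero
    )
    return completed.stdout


if __name__ == "__main__":
    # Stand-ins for two crawls; each snippet runs in its own process.
    print(run_in_fresh_interpreter("print('first run')"), end="")
    print(run_in_fresh_interpreter("print('second run')"), end="")
```

Unlike the nested-function approach, nothing has to be pickled here, so this sketch does not depend on which multiprocessing start method the platform uses.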

2022-11-09

ryoqjall2#

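This is what helped me win the battle against the ReactorNotRestartable error: the last answer from the author of the question.
0) pip install crochet
1) import it: from crochet import setup
2) call setup() at the top of the file
3) remove 2 lines:
a) d.addBoth(lambda _: reactor.stop())
b) reactor.run()
I had the same problem with this error and spent more than 4 hours on it, reading all the questions about it here. I finally found that answer and am sharing it. This is how I solved it. The only meaningful lines left over from the Scrapy docs are the last 2 lines of my code: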


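from importlib import import_module

from crochet import setup
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

setup()  # crochet starts the Twisted reactor once, in a background thread

def run_spider(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)   # dynamic import of the selected spider module
    spider_cls = scrapy_var.mySpider          # pass the class: recent Scrapy versions reject a spider instance
    crawler = CrawlerRunner(get_project_settings())   # from Scrapy docs
    crawler.crawl(spider_cls)                         # from Scrapy docs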

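This code lets me choose which spider to run just by passing its name to the run_spider function and, once scraping has finished, choose another spider and run it again.
Hope this helps somebody, as it helped me :)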

2022-11-09

nukf8bse3#

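According to the Scrapy documentation, the start() method of the CrawlerProcess class does the following:
"[...] starts a Twisted reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE."
The error you are getting is raised by Twisted, because a Twisted reactor cannot be restarted. It uses a lot of global state, and even if you jimmy-rig some code to restart it (I have seen it done), there is no guarantee it will work.
Honestly, if you think you need to restart the reactor, you are probably doing something wrong.
Depending on what you want to do, I would also review the "Running Scrapy from a script" section of the documentation.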
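
To make the "cannot be restarted" behaviour concrete, here is a toy model in plain Python. It is only an illustration: OneShotLoop, schedule and the RuntimeError are hypothetical stand-ins, not Twisted's actual classes or code. It shows the general pattern of queueing all the work first and then calling run() exactly once.

```python
class OneShotLoop:
    """Toy stand-in for a reactor-like object that may only run once."""

    def __init__(self):
        self._started_before = False
        self.jobs = []

    def schedule(self, job):
        # Queue work before run(), much like calling crawl() before start().
        self.jobs.append(job)

    def run(self):
        if self._started_before:
            # Assumption for illustration: a plain RuntimeError plays the
            # role that ReactorNotRestartable plays in Twisted.
            raise RuntimeError("loop is not restartable")
        self._started_before = True
        return [job() for job in self.jobs]


loop = OneShotLoop()
loop.schedule(lambda: "crawl 1")
loop.schedule(lambda: "crawl 2")
print(loop.run())  # both jobs run inside the single allowed run()

try:
    loop.run()  # a second run() is refused
except RuntimeError as exc:
    print("second run failed:", exc)
```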

2022-11-09

qlvxas9a4#

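As some people have already pointed out: you should not need to restart the reactor.
Ideally, if you want to chain your processes (crawl1, then crawl2, then crawl3), you simply add callbacks.
For example, I have been using this loop spider, which follows this pattern: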

1. Crawl A
2. Sleep N
3. goto 1

Here is what it looks like in Scrapy:

import time

import scrapy  # needed for scrapy.Spider below
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.body)

def sleep(_, duration=5):
    print(f'sleeping for: {duration}')
    time.sleep(duration)  # block here

def crawl(runner):
    d = runner.crawl(HttpbinSpider)
    d.addBoth(sleep)
    d.addBoth(lambda _: crawl(runner))
    return d

def loop_crawl():
    runner = CrawlerRunner(get_project_settings())
    crawl(runner)
    reactor.run()

if __name__ == '__main__':
    loop_crawl()

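To explain the process in more detail: the crawl function schedules a crawl and adds two extra callbacks that are called when the crawl is over: a blocking sleep and a recursive call to itself (which schedules another crawl).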

$ python endless_crawl.py 
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5

2022-11-09

xxslljrj5#

The mistake is in this code:

def __call__(self):
    result1 = test.web_crawler()[1]
    result2 = test.web_crawler()[0] # here

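web_crawler() returns two results, and to get them the code calls it twice, so it tries to start the process twice and restart the reactor, as pointed out by @Rejected.
Getting the results from one single run and unpacking both from the returned tuple is the way to go here: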

def __call__(self):
    result1, result2 = test.web_crawler()
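
The difference can be checked with a small self-contained sketch. The Test class, its crawl_count counter and the placeholder return values are hypothetical; the real web_crawler() would start the crawl where the counter is incremented.

```python
class Test:
    """Hypothetical stand-in for the asker's class."""

    def __init__(self):
        self.crawl_count = 0

    def web_crawler(self):
        # Placeholder for the real crawl; the counter shows how many
        # times process.start() would have been reached.
        self.crawl_count += 1
        return ("result1", "result2")

    def __call__(self):
        # One call, both values unpacked from the returned tuple.
        result1, result2 = self.web_crawler()
        return result1, result2


test = Test()
print(test())            # both results from a single crawl
print(test.crawl_count)  # number of times web_crawler() ran
```

Indexing test.web_crawler()[1] and then test.web_crawler()[0] would bump the counter twice, which in the original code means a second process.start().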

2022-11-09

jdgnovmf6#

This solved my problem: put the code below after reactor.run() or process.start():

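import os, sys, time  # needed by the two calls below

time.sleep(0.5)

os.execl(sys.executable, sys.executable, *sys.argv)  # replace this process with a fresh run of the same script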

2022-11-09
