scrapy — how can I run different spiders at the same time when they have different CrawlerRunner settings?

xvw2m8pv asked on 2022-11-09

The default usage is:

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run()  # the script will block here until all crawling jobs are finished

My code:

import scrapy
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor

runner1 = CrawlerRunner(settings={
    "FEEDS": {
        r"file:///C:\\Users\Messi\\1.json": {"format": "json", "overwrite": True}
    },
})

runner2 = CrawlerRunner(settings={
    "FEEDS": {
        r"file:///C:\\Users\Messi\\2.json": {"format": "json", "overwrite": True}
    },
})

runner3 = CrawlerRunner(settings={
    "FEEDS": {
        r"file:///C:\\Users\Messi\\3.json": {"format": "json", "overwrite": True}
    },
})

h = runner1.crawl(Live1)
h.addBoth(lambda _: reactor.stop())
a = runner2.crawl(Live2)
a.addBoth(lambda _: reactor.stop())
t = runner3.crawl(Live3)
t.addBoth(lambda _: reactor.stop())

reactor.run()

The code above does not work! How can I run different spiders at the same time when each of them needs different CrawlerRunner settings? The settings differ per spider, which is why I created separate runners (runner1, runner2, runner3, ...). What would be the correct usage? Please help me with this. Many thanks.

dluptydi:

As I said in the comments, I think using custom_settings is better (a sketch of that approach is at the end of this answer).
Anyway, this works for me:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from tempbuffer.spiders.spider1 import ExampleSpider1
from tempbuffer.spiders.spider import ExampleSpider2

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

runner = CrawlerRunner(settings={
    "FEEDS": {
        r"1.json": {"format": "json", "overwrite": True}
    }})
runner.crawl(ExampleSpider1)

# note: the name `runner` is rebound here, so the join() below only waits on
# the crawl scheduled through this second CrawlerRunner
runner = CrawlerRunner(settings={
    "FEEDS": {
        r"2.json": {"format": "json", "overwrite": True}
    }})
runner.crawl(ExampleSpider2)

d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run()
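
If you want the reactor to stop only after both runners have finished, one option is to collect the crawl deferreds explicitly and wait on all of them. This is a sketch based on Twisted's DeferredList rather than something from the original answer; it reuses the same example spiders and feed paths:

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from tempbuffer.spiders.spider1 import ExampleSpider1
from tempbuffer.spiders.spider import ExampleSpider2

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

runner1 = CrawlerRunner(settings={
    "FEEDS": {r"1.json": {"format": "json", "overwrite": True}}})
runner2 = CrawlerRunner(settings={
    "FEEDS": {r"2.json": {"format": "json", "overwrite": True}}})

# crawl() returns a Deferred that fires when that crawl finishes;
# DeferredList fires once every deferred in the list has fired
d1 = runner1.crawl(ExampleSpider1)
d2 = runner2.crawl(ExampleSpider2)
defer.DeferredList([d1, d2]).addBoth(lambda _: reactor.stop())

reactor.run()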

Another way:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from tempbuffer.spiders.spider1 import ExampleSpider1
from tempbuffer.spiders.spider import ExampleSpider2

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner1 = CrawlerRunner(settings={
    "FEEDS": {
        r"1.json": {"format": "json", "overwrite": True}
    }})
runner2 = CrawlerRunner(settings={
    "FEEDS": {
        r"2.json": {"format": "json", "overwrite": True}
    }})

# the reactor stops when runner1's crawl (ExampleSpider2) finishes;
# runner2's crawl (ExampleSpider1) simply runs alongside it in the meantime
d = runner1.crawl(ExampleSpider2)
runner2.crawl(ExampleSpider1)
d.addBoth(lambda _: reactor.stop())

reactor.run()

1.json:

[
{"title": "Short Dress", "price": "$24.99"},
{"title": "Patterned Slacks", "price": "$29.99"},
{"title": "Short Chiffon Dress", "price": "$49.99"},
{"title": "Off-the-shoulder Dress", "price": "$59.99"},
{"title": "V-neck Top", "price": "$24.99"},
{"title": "Short Chiffon Dress", "price": "$49.99"},
{"title": "V-neck Top", "price": "$24.99"},
{"title": "V-neck Top", "price": "$24.99"},
{"title": "Short Lace Dress", "price": "$59.99"}
]

2.json:

[
{"title": "Long-sleeved Jersey Top", "price": "$12.99"}
]

I somewhat guessed my way to this answer, and I'm not sure which way is better. I'd be glad if someone could correct/explain/clarify it in the comments.
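
For reference, the custom_settings approach mentioned at the top could look roughly like this. It is only a minimal sketch (the spider bodies are placeholders and the names reuse the ones from the question); each spider carries its own FEEDS setting, so a single CrawlerRunner with one join() is enough:

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class Live1(scrapy.Spider):
    name = "live1"
    # per-spider settings: this spider writes its items to 1.json
    custom_settings = {
        "FEEDS": {"1.json": {"format": "json", "overwrite": True}},
    }
    # Your first spider definition
    ...

class Live2(scrapy.Spider):
    name = "live2"
    # this spider writes its items to 2.json
    custom_settings = {
        "FEEDS": {"2.json": {"format": "json", "overwrite": True}},
    }
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(Live1)
runner.crawl(Live2)
d = runner.join()  # fires once all scheduled crawls are done
d.addBoth(lambda _: reactor.stop())

reactor.run()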
