Scrapy: how can I run different spiders at the same time when each has different CrawlerRunner settings?

Posted by xvw2m8pv on 2022-11-09 in Other


The default usage is:

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run()  # the script blocks here until all crawling jobs are finished

My code:

import scrapy
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor

runner1 = CrawlerRunner(settings = {
        "FEEDS": {
        r"file:///C:\\Users\Messi\\1.json": {"format": "json", "overwrite": True}
        },
        })

runner2 = CrawlerRunner(settings = {
        "FEEDS": {
        r"file:///C:\\Users\Messi\\2.json": {"format": "json", "overwrite": True}
        },
        })

runner3 = CrawlerRunner(settings = {
        "FEEDS": {
        r"file:///C:\\Users\Messi\\3.json": {"format": "json", "overwrite": True}
        },
        })

h = runner1.crawl(Live1)
h.addBoth(lambda _: reactor.stop())
a = runner2.crawl(Live2)
a.addBoth(lambda _: reactor.stop())
t = runner3.crawl(Live3)
t.addBoth(lambda _: reactor.stop())

reactor.run()

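The code above does not work! How can I run different spiders at the same time when they have different CrawlerRunner settings? The settings differ, so I used a separate variable for each of them: runner1, runner2, runner3... What is the correct usage? Please help me with this. Thank you very much.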

scrapy

Source: https://stackoverflow.com/questions/74067893/how-can-i-run-different-spiders-at-the-same-time-that-they-have-different-crawle

1 Answer


dluptydi1#

As I said in the comments, I think using custom_settings is better.
Anyway, this works for me:

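from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from tempbuffer.spiders.spider1 import ExampleSpider1
from tempbuffer.spiders.spider import ExampleSpider2

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

# First runner: items from ExampleSpider1 are exported to 1.json
runner = CrawlerRunner(settings={
    "FEEDS": {
        r"1.json": {"format": "json", "overwrite": True}
    }})
d1 = runner.crawl(ExampleSpider1)

# Second runner: items from ExampleSpider2 are exported to 2.json
runner = CrawlerRunner(settings={
    "FEEDS": {
        r"2.json": {"format": "json", "overwrite": True}
    }})
d2 = runner.crawl(ExampleSpider2)

# runner.join() would only cover the crawl started by the second runner,
# so wait on both crawl Deferreds before stopping the reactor
d = defer.DeferredList([d1, d2])
d.addBoth(lambda _: reactor.stop())

reactor.run()  # blocks until both crawls have finished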

Another way:

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from tempbuffer.spiders.spider1 import ExampleSpider1
from tempbuffer.spiders.spider import ExampleSpider2

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner1 = CrawlerRunner(settings={
    "FEEDS": {
        r"1.json": {"format": "json", "overwrite": True}
    }})
runner2 = CrawlerRunner(settings={
    "FEEDS": {
        r"2.json": {"format": "json", "overwrite": True}
    }})

d1 = runner1.crawl(ExampleSpider2)
d2 = runner2.crawl(ExampleSpider1)

# Stop the reactor only after both crawls have finished;
# waiting on d1 alone could cut the other crawl short
d = defer.DeferredList([d1, d2])
d.addBoth(lambda _: reactor.stop())

reactor.run()

1.json:

[
{"title": "Short Dress", "price": "$24.99"},
{"title": "Patterned Slacks", "price": "$29.99"},
{"title": "Short Chiffon Dress", "price": "$49.99"},
{"title": "Off-the-shoulder Dress", "price": "$59.99"},
{"title": "V-neck Top", "price": "$24.99"},
{"title": "Short Chiffon Dress", "price": "$49.99"},
{"title": "V-neck Top", "price": "$24.99"},
{"title": "V-neck Top", "price": "$24.99"},
{"title": "Short Lace Dress", "price": "$59.99"}
]

2.json:

[
{"title": "Long-sleeved Jersey Top", "price": "$12.99"}
]

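I partly guessed at this answer, and I'm not sure which approach is better. If anyone wants to correct, explain, or clarify in the comments, I'd be glad.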
