Get scraped data into a variable with Scrapy instead of a file/database

Asked by 3wabscal on 2022-11-09

I am trying to run Scrapy from a Python script and want to process the scraped data in the script rather than store it in a file or database.

import scrapy
import scrapy.crawler as crawler
from scrapy.utils.log import configure_logging
from multiprocessing import Process, Queue
from twisted.internet import reactor

# spider

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        yield {"html_data": response.text}

# wrapper so the spider can be run more than once per script

def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result

configure_logging()

x = run_spider(QuotesSpider)  # x is always None; the yielded items are never captured

I want to process the data right where the spider is invoked. How can I achieve this?

ve7v8dk2 · 1#

As far as I understand, you want to scrape data with Scrapy and, instead of storing it in a file, have it returned to your script. Using your approach, keep in mind that the crawl runs in a separate process, so the collected data has to be sent back to the parent through the queue; a global variable set inside the child process is never visible to the parent:

import scrapy
import scrapy.crawler as crawler
from scrapy.utils.log import configure_logging
from multiprocessing import Process, Queue
from twisted.internet import reactor

# spider

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']
    html_data = []  # class-level list that parse() appends to

    def parse(self, response):
        self.html_data.append(response.text)

# wrapper so the spider can be run more than once per script

def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            # the crawl happened in this child process; a global set here
            # would never reach the parent, so send the data back on the queue
            q.put(spider.html_data)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if isinstance(result, Exception):
        raise result
    return result

configure_logging()

data = run_spider(QuotesSpider)  # this is the data
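If the crawl only needs to run once per process, the multiprocessing wrapper can be dropped entirely: Scrapy fires an item_scraped signal for every item a spider yields, so the items can be collected into a plain list. A minimal sketch, assuming a spider that yields items as in the original question (the collect_items helper is a name of my choosing, not Scrapy API):

from scrapy import signals
from scrapy.crawler import CrawlerProcess

def collect_items(spider_cls):
    items = []

    def on_item_scraped(item, response, spider):
        items.append(item)  # runs once per item the spider yields

    process = CrawlerProcess()
    crawler_obj = process.create_crawler(spider_cls)
    crawler_obj.signals.connect(on_item_scraped, signal=signals.item_scraped)
    process.crawl(crawler_obj)
    process.start()  # blocks until the crawl finishes
    return items

scraped = collect_items(QuotesSpider)

The caveat is the one that motivated the multiprocessing wrapper in the first place: CrawlerProcess starts the Twisted reactor, which cannot be restarted, so this only works for a single crawl per Python process.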

But if you only want to fetch pages and work with the HTML response, I think you would be better off using another library such as Requests together with Scrapy's HtmlResponse, like this:

import requests
from scrapy.http import HtmlResponse

def parse(url):
    # HtmlResponse never downloads anything on its own, so fetch the
    # page with requests first and wrap the body in an HtmlResponse
    r = requests.get(url)
    response = HtmlResponse(url=url, body=r.content)
    return response.text

data = parse('http://example.com')
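The point of wrapping the body in HtmlResponse rather than using r.text directly is that it gives you Scrapy's selector API without running a crawl. A short usage sketch (the CSS query is an assumption about the quotes.toscrape.com markup from the question, not part of the original answer):

import requests
from scrapy.http import HtmlResponse

r = requests.get('http://quotes.toscrape.com/tag/humor/')
response = HtmlResponse(url=r.url, body=r.content)
# Scrapy's .css()/.xpath() selectors now work on the fetched page
quotes = response.css('span.text::text').getall()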
