Scrapy: How do I pass data from a Flask API to a web scraper?

Posted by eivnm1vs on 2022-11-09 in Other

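I am working on an application that lets users get web search results after entering a set of keywords, which are sent to Ask. For this I built an API with Flask and Scrapy, inspired by the article below. However, the API does not work, because I cannot pass the data used as keywords from my API to my scraper. Here is my Flask API file: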

import crochet
crochet.setup()

from flask import Flask , render_template, jsonify, request, redirect, url_for
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.signalmanager import dispatcher
import time
import os

# Import the spider class from scrap/askScraping.py

from scrap.askScraping import AskScrapingSpider

# Creating Flask App Variable

app = Flask(__name__)

output_data = []
crawl_runner = CrawlerRunner()

# By default, Flask will come into this when we run the file

@app.route('/')
def index():
    return render_template("index.html") # Returns index.html file in templates folder.

# After clicking the Submit Button FLASK will come into this

@app.route('/', methods=['POST'])
def submit():
    if request.method == 'POST':
        s = request.form['url']  # Read the submitted keywords from the form field named 'url'
        global baseURL
        baseURL = s
        # This will remove any existing file with the same name so that the scrapy will not append the data to any previous file.
        if os.path.exists("<path_to_outputfile.json>"): 
            os.remove("<path_to_outputfile.json>")

        return redirect(url_for('scrape')) # Passing to the Scrape function

@app.route("/scrape")
def scrape():

    scrape_with_crochet(baseURL="https://www.ask.com/web?q={baseURL}") # Passing that URL to our Scraping Function

    time.sleep(20) # Pause the function while the scrapy spider is running

    return jsonify(output_data)  # Return whatever was scraped during the 20-second wait.

@crochet.run_in_reactor
def scrape_with_crochet(baseURL):
    # This will connect to the dispatcher that will kind of loop the code between these two functions.
    dispatcher.connect(_crawler_result, signal=signals.item_scraped)

    # Start AskScrapingSpider; each item it yields fires item_scraped, which calls _crawler_result.
    eventual = crawl_runner.crawl(AskScrapingSpider, category = baseURL)
    return eventual

# This will append the data to the output data list.

def _crawler_result(item, response, spider):
    output_data.append(dict(item))

if __name__== "__main__":
    app.run(debug=True)

And here is my scraper:

import scrapy
import datetime

class AskScrapingSpider(scrapy.Spider):

    name = 'ask_scraping'
    def start_requests(self):
        myBaseUrl = ''
        start_urls = []

        def __init__(self, category='',**kwargs): # The category variable will have the input URL.
            self.myBaseUrl = category
            self.start_urls.append(self.myBaseUrl)
            super().__init__(**kwargs)

            custom_settings = {'FEED_URI': 'scrap/outputfile.json', 'CLOSESPIDER_TIMEOUT' : 15} # This will tell scrapy to store the scraped data to outputfile.json and for how long the spider should run.

            yield scrapy.Request(start_urls, callback=self.parse, meta={'pos': 0})

    def parse(self, response):
         print('url:', response.url)

         start_pos = response.meta['pos']
         print('start pos:', start_pos)

         dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')    

         items = response.css('div.PartialSearchResults-item')

         for pos, result in enumerate(items, start_pos+1):
            yield {
                'title':    result.css('a.PartialSearchResults-item-title-link.result-link::text').get().strip(), 
                'snippet':  result.css('p.PartialSearchResults-item-abstract::text').get().strip(), 
                'link':     result.css('a.PartialSearchResults-item-title-link.result-link').attrib.get('href'), 
                'position': pos, 
                'date':     dt,
            }

        # --- after loop ---

         next_page = response.css('.PartialWebPagination-next a')

         if next_page:
            url = next_page.attrib.get('href')
            print('next_page:', url)  # relative URL
            # use `follow()` to add `https://www.ask.com/` to URL and create absolute URL
            yield response.follow(url, callback=self.parse, meta={'pos': pos+1})

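When I run it, there are absolutely no errors. After reading a user's answer to my question, I changed my scraper code as shown below, but without success: after passing the data to the scraper, the URL localhost:5000/scrape shows empty brackets [] in my browser, whereas the brackets should normally contain the data returned by my scraper: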

import scrapy
import datetime

class AskScrapingSpider(scrapy.Spider):

    name = 'ask_scraping'
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, meta={'pos': 0})

    custom_settings = {'FEED_URI': 'scrap/outputfile.json', 'CLOSESPIDER_TIMEOUT' : 15}
    def __init__(self, category='',**kwargs):
        self.myBaseUrl = category
        self.start_urls.append(self.myBaseUrl)
        super().__init__(**kwargs)

    def parse(self, response):
         print('url:', response.url)

         start_pos = response.meta['pos']
         print('start pos:', start_pos)

         dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')    

         items = response.css('div.PartialSearchResults-item')

         for pos, result in enumerate(items, start_pos+1):
            yield {
                'title':    result.css('a.PartialSearchResults-item-title-link.result-link::text').get().strip(), 
                'snippet':  result.css('p.PartialSearchResults-item-abstract::text').get().strip(), 
                'link':     result.css('a.PartialSearchResults-item-title-link.result-link').attrib.get('href'), 
                'position': pos, 
                'date':     dt,
            }

        # --- after loop ---

         next_page = response.css('.PartialWebPagination-next a')

         if next_page:
            url = next_page.attrib.get('href')
            print('next_page:', url)  # relative URL
            # use `follow()` to add `https://www.ask.com/` to URL and create absolute URL
            yield response.follow(url, callback=self.parse, meta={'pos': pos+1})

I also replaced crawl_runner = CrawlerRunner() in my main.py file with

project_settings = get_project_settings()
crawl_runner = CrawlerProcess(settings = project_settings)

and added the following imports

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

But when I reload the Flask server, I get the following errors:

2022-06-21 11:44:55 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-06-21 11:44:57 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.3 3 May 2022), cryptography 37.0.2, Platform Windows-10-10.0.19044-SP0
2022-06-21 11:44:57 [werkzeug] WARNING:  * Debugger is active!
2022-06-21 11:44:57 [werkzeug] INFO:  * Debugger PIN: 107-226-838
2022-06-21 11:44:57 [scrapy.crawler] INFO: Overridden settings:
{'CLOSESPIDER_TIMEOUT': 15}
2022-06-21 11:44:57 [werkzeug] INFO: 127.0.0.1 - - [21/Jun/2022 11:44:57] "GET / HTTP/1.1" 200 -
2022-06-21 11:44:58 [twisted] CRITICAL: Unhandled error in EventualResult
Traceback (most recent call last):
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 1315, in run
    self.mainLoop()
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 1325, in mainLoop
    reactorBaseSelf.runUntilCurrent()
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 964, in runUntilCurrent
    f(*a,**kw)
  File "C:\Python310\lib\site-packages\crochet\_eventloop.py", line 420, in runs_in_reactor
    d = maybeDeferred(wrapped, *args,**kwargs)
--- <exception caught here> ---
  File "C:\Python310\lib\site-packages\twisted\internet\defer.py", line 190, in maybeDeferred
    result = f(*args,**kwargs)
  File "C:\Users\user\Documents\AAprojects\Whelpsgroups1\API\main.py", line 62, in scrape_with_crochet
    eventual = crawl_runner.crawl(AskScrapingSpider, category = baseURL)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 205, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 238, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 313, in _create_crawler
    return Crawler(spidercls, self.settings, init_reactor=True)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 82, in __init__
    default.install()
  File "C:\Python310\lib\site-packages\twisted\internet\selectreactor.py", line 194, in install
    installReactor(reactor)
  File "C:\Python310\lib\site-packages\twisted\internet\main.py", line 32, in installReactor
    raise error.ReactorAlreadyInstalledError("reactor already installed")
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

2022-06-21 11:45:54 [werkzeug] INFO: 127.0.0.1 - - [21/Jun/2022 11:45:54] "POST / HTTP/1.1" 302 -
2022-06-21 11:45:54 [scrapy.crawler] INFO: Overridden settings:
{'CLOSESPIDER_TIMEOUT': 15}
2022-06-21 11:45:54 [twisted] CRITICAL: Unhandled error in EventualResult
Traceback (most recent call last):
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 1315, in run
    self.mainLoop()
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 1325, in mainLoop
    reactorBaseSelf.runUntilCurrent()
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 964, in runUntilCurrent
    f(*a,**kw)
  File "C:\Python310\lib\site-packages\crochet\_eventloop.py", line 420, in runs_in_reactor
    d = maybeDeferred(wrapped, *args,**kwargs)
--- <exception caught here> ---
  File "C:\Python310\lib\site-packages\twisted\internet\defer.py", line 190, in maybeDeferred
    result = f(*args,**kwargs)
  File "C:\Users\user\Documents\AAprojects\Whelpsgroups1\API\main.py", line 62, in scrape_with_crochet
    eventual = crawl_runner.crawl(AskScrapingSpider, category = baseURL)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 205, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 238, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 313, in _create_crawler
    return Crawler(spidercls, self.settings, init_reactor=True)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 82, in __init__
    default.install()
  File "C:\Python310\lib\site-packages\twisted\internet\selectreactor.py", line 194, in install
    installReactor(reactor)
  File "C:\Python310\lib\site-packages\twisted\internet\main.py", line 32, in installReactor
    raise error.ReactorAlreadyInstalledError("reactor already installed")
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

I also looked at this Stack Overflow question, but without success.

oxcyiej7 (answer #1)

You should not yield scrapy.Request inside the __init__ method.
Remove this line:

yield scrapy.Request(start_urls, callback=self.parse, meta={'pos': 0})

and change your __init__ method to:

custom_settings = {'FEED_URI': 'scrap/outputfile.json', 'CLOSESPIDER_TIMEOUT': 15}

def __init__(self, category='', **kwargs):
    super().__init__(**kwargs)    # let scrapy.Spider set up its own attributes first
    self.myBaseUrl = category
    self.start_urls = [category]  # per-instance list, assigned after the base initialiser has run

This may work.
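
Independently of the spider changes, it may be worth checking whether the keywords reach Ask at all. In the question's Flask file, scrape() passes the plain string "https://www.ask.com/web?q={baseURL}" rather than an f-string, so the text {baseURL} would be sent as the query instead of the submitted keywords. A minimal sketch of building the search URL, assuming the form value is ordinary text; build_ask_url is a hypothetical helper name, not something from the original code:

```python
from urllib.parse import urlencode

ASK_SEARCH_URL = "https://www.ask.com/web"  # base URL taken from the question


def build_ask_url(keywords: str) -> str:
    """Return the Ask search URL for the given keywords (hypothetical helper)."""
    # urlencode percent-encodes special characters and turns spaces into '+'.
    return f"{ASK_SEARCH_URL}?{urlencode({'q': keywords.strip()})}"


literal = "https://www.ask.com/web?q={baseURL}"  # what the question's code sends
fixed = build_ask_url("flask scrapy tutorial")

print(literal)  # the placeholder is still in the string
print(fixed)
```

In scrape(), the call would then presumably become scrape_with_crochet(baseURL=build_ask_url(baseURL)).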

Update:

If you want to pass parameters in your requests, then after changing those lines you can override the start_requests() method like this:

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse, meta={'pos': 0})
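
The pos value seeded through meta above is read back by the question's parse() method, which has two fragile spots that could cut a crawl short. First, .get() returns None when a selector matches nothing, so .get().strip() raises AttributeError. Second, pos is never assigned when a page has no results, so pos+1 fails. Passing pos+1 to the next page also appears to skip one position per page, because enumerate already starts at start_pos+1. A minimal sketch of that bookkeeping in plain Python, without Scrapy; clean and number_results are hypothetical helper names:

```python
from typing import Iterable, List, Optional, Tuple


def clean(text: Optional[str]) -> str:
    """Strip whitespace, treating a missing value (None) as an empty string."""
    return text.strip() if text else ""


def number_results(raw_titles: Iterable[Optional[str]],
                   start_pos: int) -> Tuple[List[dict], int]:
    """Number one page of results and return (rows, last_pos).

    last_pos is the value to hand to the next page as meta['pos'].
    """
    rows = []
    last_pos = start_pos  # stays valid even if the page has no results
    for last_pos, raw in enumerate(raw_titles, start_pos + 1):
        rows.append({"position": last_pos, "title": clean(raw)})
    return rows, last_pos


page1, pos = number_results(["  First hit ", None], start_pos=0)
page2, pos = number_results(["Third hit"], start_pos=pos)
empty, pos_after_empty = number_results([], start_pos=pos)

print(page1, page2, pos_after_empty)
```

In the spider, the same pattern would mean wrapping each field as clean(result.css(...).get()) and following the next page with meta={'pos': last_pos}.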

**Update 2:** If your Scrapy spider runs in the background of your Flask app, try this. Write these lines:

project_settings = get_project_settings()
crawl_runner = CrawlerProcess(settings = project_settings)

instead of:

crawl_runner = CrawlerRunner()

Of course, you need to import CrawlerProcess and get_project_settings, like this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

**Update 3:** I have written a similar project and it works fine; you can check this repo.
