Scrapy/Python - trying to stagger spiders

szqfcxe2 posted on 2022-11-09 in Python

Trying to stagger two spiders:
spider1 crawls and builds a list of URLs in a .csv file
spider2 crawls from the .csv, then extracts specific data
I keep getting this error: with open('urls.csv') as file: FileNotFoundError: [Errno 2] No such file or directory: 'urls.csv'
It looks like spider1 never gets to run first, and/or Python evaluates the urls.csv read up front because of where it sits in the code, and errors out because the file doesn't exist yet.
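A minimal plain-Python sketch of what I think is happening - statements in a class body execute as soon as the class is defined, not when an instance is created:

class Demo:
    print('class body runs at definition time')  # printed the moment this module loads
    with open('urls.csv') as f:                  # so this raises FileNotFoundError here,
        lines = f.read()                         # before anything else gets to run

So the open() in spider2 fires while the module is being loaded, before spider1 has ever written the file.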
This is the part that staggers the crawls - it's something I grabbed from GitHub a while back, but the link no longer seems to exist (the same pattern appears in Scrapy's "Common Practices" docs for running spiders sequentially in one process). I've tried putting it in different places, and even duplicating or splitting it up.

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(spider1)
    yield runner.crawl(spider2)
    reactor.stop()

crawl()
reactor.run()

I'm fine with using urls.csv to hand off the URLs, but ideally I'd store the URLs in a list variable instead [though I haven't figured out the syntax for doing that yet].
Below is the full code I'm using. Any input would be greatly appreciated. Thanks!

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)

class spider1(scrapy.Spider):
    name = 'spider1'
    start_urls = [
        'https://tsd-careers.hii.com/en-US/search?keywords=alion&location='
    ]

    custom_settings = {'FEEDS': {r'urls.csv': {'format': 'csv', 'item_export_kwargs': {'include_headers_line': False,}, 'overwrite': True,}}}

    def parse(self, response):
        for job in response.xpath('//@href').getall():
            yield {'url': response.urljoin(job),}

        next_page = response.xpath('//a[@class="next-page-caret"]/@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

class spider2(scrapy.Spider):
    name = 'spider2'
    # this open() runs at class-definition time, before spider1 has created the file
    with open('urls.csv') as file:
        start_urls = [line.strip() for line in file]

    custom_settings = {'FEEDS': {r'data_tsdfront.xml': {'format': 'xml', 'overwrite': True}}}

    def parse(self, response):
        reqid = response.xpath('//li[6]/div/div[@class="secondary-text-color"]/text()').getall()
        yield {
            'reqid': reqid,
        }

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(spider1)
    yield runner.crawl(spider2)
    reactor.stop()
crawl()
reactor.run()

gpfsuwkq1#

I've since learned that using variables would take a fair amount of refactoring.
After some more research and experimentation I made a few changes. The key one is moving the file read out of the class body and into start_requests(), which only runs after spider1 has finished writing urls.csv. It may be ugly, but everything now works as desired, and I can improve and restructure it as my knowledge and experience grow.

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
import pandas as pd

configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)

class spider1(scrapy.Spider):   
    name = 'spider1'    
    start_urls = [      
            'https://tsd-careers.hii.com/en-US/search?keywords=alion&location=' 
    ]   

    custom_settings = {'FEEDS': {r'urls.csv': {'format': 'csv', 'overwrite': True,}}}   

    def parse(self, response):
        for job in response.xpath('//@href').getall():
            yield {'url': response.urljoin(job),}       

        next_page = response.xpath('//a[@class="next-page-caret"]/@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

def read_csv():
    # helper that loads the url column spider1 wrote to urls.csv
    df = pd.read_csv('urls.csv')
    return df['url'].values.tolist()

class spider2(scrapy.Spider):
    name = 'spider2'

    custom_settings = {'FEEDS': {r'data_tsdfront.xml': {'format': 'xml', 'overwrite': True}}}

    def start_requests(self):
        # urls.csv is only read here, after spider1 has finished writing it
        for url in read_csv():
            yield scrapy.Request(url)

    def parse(self, response):
        data = response.css('*').getall()
        yield {
            'data': data,
        }

@defer.inlineCallbacks
def crawl():    
    yield runner.crawl(spider1)
    yield runner.crawl(spider2)
    reactor.stop()
crawl()
reactor.run()
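As for skipping the CSV hop entirely: here is a minimal sketch of one way to do it, assuming a module-level list (the name `collected`, and the Spider1/Spider2 class names, are mine, not from the code above). spider1 appends to the list, and since runner.crawl() forwards extra keyword arguments to the spider's constructor (scrapy.Spider stores them as instance attributes), the finished list can be handed to spider2 as its start_urls:

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

collected = []  # hypothetical module-level list shared between the two crawls

class Spider1(scrapy.Spider):
    name = 'spider1'
    start_urls = ['https://tsd-careers.hii.com/en-US/search?keywords=alion&location=']

    def parse(self, response):
        for job in response.xpath('//@href').getall():
            collected.append(response.urljoin(job))  # keep URLs in memory
        next_page = response.xpath('//a[@class="next-page-caret"]/@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

class Spider2(scrapy.Spider):
    name = 'spider2'

    def parse(self, response):
        yield {'data': response.css('*').getall()}

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1)
    # Spider1 has finished by this point, so the list is fully populated;
    # the keyword argument becomes the Spider2 instance's start_urls attribute
    yield runner.crawl(Spider2, start_urls=collected)
    reactor.stop()

crawl()
reactor.run()

This drops the intermediate file at the cost of holding all the URLs in memory, which should be fine for a job board of this size.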
