I'm trying to stagger two spiders: spider1 crawls and builds a list of URLs into a .csv file, and spider2 then scrapes from that .csv and extracts specific data from each URL.
I keep getting this error:

```
with open('urls.csv') as file:
FileNotFoundError: [Errno 2] No such file or directory: 'urls.csv'
```
It looks like spider1 never gets to fire first, and/or Python evaluates the open('urls.csv') line as soon as it defines spider2 because of where that code sits, and errors out because the file doesn't exist yet.
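To illustrate what I mean, here is a toy example (not my project code) showing that everything in a class body executes the moment Python reaches the class statement:

```python
# Toy example: a class body runs at definition time, not when the class is used.
class Demo:
    print("this runs immediately, as the class is being defined")
    with open('urls.csv') as f:                    # so this open() also runs immediately,
        start_urls = [line.strip() for line in f]  # long before any crawl is scheduled
```

Running that without urls.csv on disk reproduces the exact FileNotFoundError above.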
This is the part that staggers the crawls -- it's something I grabbed from GitHub a while back, but the link no longer seems to exist. I've tried putting it in different places, and even duplicating or splitting it.
```python
@defer.inlineCallbacks
def crawl():
    yield runner.crawl(spider1)   # run spider1 to completion first
    yield runner.crawl(spider2)   # then run spider2
    reactor.stop()

crawl()
reactor.run()   # blocks until reactor.stop() is called
```
I like using urls.csv to handle the URLs, but ideally I'd store them in a list variable instead [though I haven't figured out the syntax to make that work -- my rough, untested guess is below].
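This is roughly the shape I have in mind for the list approach -- an untested sketch pieced together from examples, so it may well be wrong: spider1 appends into a module-level list, and spider2 reads that list in start_requests so the lookup happens at crawl time rather than at class-definition time.

```python
import scrapy

# Untested sketch: share the URLs through a module-level list instead of a CSV.
collected_urls = []

class ListSpider1(scrapy.Spider):
    name = 'spider1'
    start_urls = ['https://tsd-careers.hii.com/en-US/search?keywords=alion&location=']

    def parse(self, response):
        for job in response.xpath('//@href').getall():
            collected_urls.append(response.urljoin(job))   # collect instead of export

class ListSpider2(scrapy.Spider):
    name = 'spider2'

    def start_requests(self):
        # Evaluated only when spider2 actually starts, i.e. after spider1 finished.
        for url in collected_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {'url': response.url}   # placeholder extraction
```

Since both spiders run in the same Python process under CrawlerRunner, a module-level list like this should be visible to both.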
Below is the full code I'm working with. Any input would be greatly appreciated. Thanks!
```python
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)


class spider1(scrapy.Spider):
    name = 'spider1'
    start_urls = [
        'https://tsd-careers.hii.com/en-US/search?keywords=alion&location='
    ]
    custom_settings = {'FEEDS': {r'urls.csv': {'format': 'csv', 'item_export_kwargs': {'include_headers_line': False,}, 'overwrite': True,}}}

    def parse(self, response):
        for job in response.xpath('//@href').getall():
            yield {'url': response.urljoin(job),}
        next_page = response.xpath('//a[@class="next-page-caret"]/@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)


class spider2(scrapy.Spider):
    name = 'spider2'
    # this open() runs at class-definition time, before spider1 has written urls.csv
    with open('urls.csv') as file:
        start_urls = [line.strip() for line in file]
    custom_settings = {'FEEDS': {r'data_tsdfront.xml': {'format': 'xml', 'overwrite': True}}}

    def parse(self, response):
        reqid = response.xpath('//li[6]/div/div[@class="secondary-text-color"]/text()').getall()
        yield {
            'reqid': reqid,
        }


@defer.inlineCallbacks
def crawl():
    yield runner.crawl(spider1)
    yield runner.crawl(spider2)
    reactor.stop()


crawl()
reactor.run()
```
1 Answer
What I've come to learn is that using variables requires a lot of refactoring. After some more research and experimentation I made changes that may be ugly, but I have everything working the way I want. I can improve and restructure it as my knowledge and experience grow.
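The answer as posted doesn't show the final code. One common fix for this exact error -- a sketch, and not necessarily what was done here -- is to defer the file read from the class body into start_requests, so urls.csv is only opened after spider1 has finished writing it:

```python
import scrapy

class Spider2(scrapy.Spider):
    name = 'spider2'
    custom_settings = {'FEEDS': {r'data_tsdfront.xml': {'format': 'xml', 'overwrite': True}}}

    def start_requests(self):
        # Runs at crawl time, not at class-definition time, so by now
        # spider1 has already written urls.csv.
        with open('urls.csv') as file:
            for line in file:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        reqid = response.xpath('//li[6]/div/div[@class="secondary-text-color"]/text()').getall()
        yield {'reqid': reqid}
```

Everything else (the CrawlerRunner setup and the @defer.inlineCallbacks crawl() chain) can stay exactly as in the question.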