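I'm still very new to Python. Following DataCamp and YouTube tutorials, I'm trying to run a spider that crawls a website and extracts metadata from its most recent (several thousand) videos.
So far, my spider looks like this: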
class NaughtySpider(scrapy.Spider):
    name = "naughtyspider"
    allowed_domains = ["example.com"]
    start_url = ("https://www.example.com/video?o=cm")

    # start_requests method
    def start_requests(self):
        yield scrapy.Request(url = start_url,
                             callback = self.parse_video)

    # First parsing method
    def parse_video(self, response):
        self.log('F i n i s h e d s c r a p i n g ' + response.url)
        video_links = response.css('ul#videoCategory').css('li.videoBox').css('div.thumbnail-info-wrapper').css('span.title > a').css('::attr(href)') #Correct path, chooses 32 videos from page ignoring the links coming from ads
        links_to_follow = video_links.extract()
        for url in links_to_follow:
            yield response.follow(url = url,
                                  callback = self.parse_metadata)
        #Continue through pagination
        next_page_url = response.css('li.page_next > a.orangeButton::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse_video)

    # Second parsing method
    def parse_metadata(self, response):
        # Create a SelectorList of the course titles text
        video_title = response.css('div.title-container > h1.title > span.inlineFree::text')
        # Extract the text and strip it clean
        video_title_ext = video_title.extract_first().strip()
        # Extract views
        video_views = response.css('span.count::text').extract_first()
        # Extract tags
        video_tags = response.css('div.tagsWrapper a::text').extract()
        del video_tags[-1] #Eliminate '+' tag, which is for suggestions
        # Extract Categories
        video_categories = response.css('div.categoriesWrapper a::text').extract()
        del video_categories[-1] #Same as tags
        # Fill in the dictionary
        yield {
            'title': video_title_ext,
            'views': video_views,
            'tags': video_tags,
            'categories': video_categories,
        }
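I export the collected data using this seemingly simple approach described in the documentation:
scrapy crawl quotes -o quotes.json
But when I run the equivalent command:
scrapy crawl naughtyspider -o data.csv
I get the following error log: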
2019-08-17 22:24:54 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "C:\Users\bla\Anaconda3\lib\site-packages\scrapy\core\engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "C:\Users\bla\naughty\naughty\spiders\NaughtySpider.py", line 11, in start_requests
    yield scrapy.Request(url = start_url,
NameError: name 'start_url' is not defined
2019-08-17 22:24:54 [scrapy.core.engine] INFO: Closing spider (finished)
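What is especially frustrating is that start_url is defined just a few lines earlier. I have seen similar situations in other questions, but none of them seems to match the code I'm using exactly.
Thanks in advance, and apologies if there are major mistakes in the code. The resources out there don't seem beginner-friendly at all (for example, they don't say which terminal/shell they use, and they mostly assume a Mac).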
1 Answer
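If you want to reference a class variable from inside a method, you need to access it through self., as in self.start_url. The class body is not an enclosing scope for its methods, so a bare start_url is looked up only as a local, global, or built-in name, and that lookup fails with the NameError shown in the log.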
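The scoping rule can be reproduced without Scrapy. The following is a minimal sketch, not the asker's actual spider: BrokenSpider, FixedSpider, and first_request are hypothetical names invented for illustration, and the methods yield the raw URL string instead of a scrapy.Request so the example runs on plain Python.

```python
class BrokenSpider:
    """Hypothetical stand-in for the spider in the question (no Scrapy needed)."""
    start_url = "https://www.example.com/video?o=cm"

    def start_requests(self):
        # Bare name: Python searches local, global, then built-in scope.
        # The class body is skipped, so this raises NameError when executed.
        yield start_url


class FixedSpider:
    """Same class, but the attribute is reached through the instance."""
    start_url = "https://www.example.com/video?o=cm"

    def start_requests(self):
        # self.start_url finds the attribute on the instance or its class.
        yield self.start_url


def first_request(spider):
    """Return the first yielded value, or a description of the NameError."""
    try:
        return next(spider.start_requests())
    except NameError as exc:
        return f"NameError: {exc}"


broken_result = first_request(BrokenSpider())
fixed_result = first_request(FixedSpider())
print(broken_result)
print(fixed_result)
```

In the spider from the question, the equivalent change would be scrapy.Request(url=self.start_url, callback=self.parse_video) inside start_requests.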