I'm using the Scrapy CLI, running it on an Ubuntu 18 server. I'm trying to avoid hard-coding a bunch of URLs in the start_urls attribute, and instead call yield scrapy.Request() at the bottom of my parser. The website I'm scraping is fairly basic and has a separate page for each year from 2014 to 2030. At the bottom of my code I have an if statement that checks the current year and moves the scraper on to the next year's page. I'm new to Scrapy, so I'm not sure I'm calling the scrapy.Request() method correctly. Here is my code:
import scrapy
from .. import items

class EventSpider(scrapy.Spider):
    name = "event_spider"
    start_urls = [
        "http://www.seasky.org/astronomy/astronomy-calendar-2014.html",
    ]
    user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
    start_year = 2014

    # response is the website
    def parse(self, response):
        CONTENT_SELECTOR = 'div#right-column-content ul li'
        for astro_event in response.css(CONTENT_SELECTOR):
            NAME_SELECTOR = "p span.title-text ::text"
            DATE_SELECTOR = "p span.date-text ::text"
            DESCRIPTION_SELECTOR = "p ::text"
            item = items.AstroEventsItem()
            item["title"] = astro_event.css(NAME_SELECTOR).extract_first()
            item["date"] = astro_event.css(DATE_SELECTOR).extract_first()
            item["description"] = astro_event.css(DESCRIPTION_SELECTOR)[-1].extract()
            yield item

        # Next page code:
        # Goes through years 2014 to 2030
        if self.start_year < 2030:
            self.start_year = self.start_year + 1
            new_url = "http://www.seasky.org/astronomy/astronomy-calendar-" + str(self.start_year) + ".html"
            print(new_url)
            yield scrapy.Request(new_url, callback=self.parse)
Here is the error I receive after it successfully scrapes the first page:
2020-11-10 05:25:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.seasky.org/astronomy/astronomy-calendar-2015.html> (referer: http://www.seasky.org/astronomy/astronomy-calendar-2014.html)
2020-11-10 05:25:50 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.seasky.org/astronomy/astronomy-calendar-2015.html> (referer: http://www.seasky.org/astronomy/astronomy-calendar-2014.html)
Traceback (most recent call last):
  File "/home/jcmq6b/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: <200 http://www.seasky.org/astronomy/astronomy-calendar-2015.html>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
    result = f(*args, **kw)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/core/spidermw.py", line 58, in process_spider_input
    return scrape_func(response, request, spider)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/core/scraper.py", line 149, in call_spider
    warn_on_generator_with_return_value(spider, callback)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/misc.py", line 245, in warn_on_generator_with_return_value
    if is_generator_with_return_value(callable):
  File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/misc.py", line 230, in is_generator_with_return_value
    tree = ast.parse(dedent(inspect.getsource(callable)))
  File "/usr/lib/python3.6/ast.py", line 35, in parse
    return compile(source, filename, mode, PyCF_ONLY_AST)
  File "<unknown>", line 1
    def parse(self, response):
    ^
IndentationError: unexpected indent
I think I may not be passing the right arguments to the parse callback, but I'm not sure. Any help is much appreciated! Let me know if I should post more information.
2 Answers
This error is raised by the line of code

    tree = ast.parse(dedent(inspect.getsource(callable)))

at line 230 of /usr/local/lib/python3.6/dist-packages/scrapy/utils/misc.py, in is_generator_with_return_value. That line was removed by Scrapy pull request 4935, related to scrapy/issue/4477 mentioned earlier in the comments.
To prevent this, the recommendation is to update Scrapy to a newer version, at least 2.5.0.
This is mentioned in the Scrapy 2.5.0 release notes: https://docs.scrapy.org/en/2.11/news.html?highlight=indentation#id40
Update:
If (for some reason) you choose not to update your Scrapy version, you can "disable" the warn_on_generator_with_return_value(spider, callback) check by monkey-patching warn_on_generator_with_return_value itself, adding something like the sketch below to your spider code.
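A minimal sketch of such a monkey patch (the original code block did not survive the page scrape, so this is an assumption based on the workaround those answers describe). It relies on scrapy.core.scraper having already imported warn_on_generator_with_return_value from scrapy.utils.misc, as the traceback above shows, so both references are replaced with a no-op stub:

import scrapy.utils.misc
import scrapy.core.scraper

def warn_on_generator_with_return_value_stub(spider, callable):
    # No-op stand-in: skips the source inspection that raises the
    # spurious IndentationError on affected Scrapy versions.
    pass

# Replace both the original and the copy already imported by the scraper.
scrapy.utils.misc.warn_on_generator_with_return_value = warn_on_generator_with_return_value_stub
scrapy.core.scraper.warn_on_generator_with_return_value = warn_on_generator_with_return_value_stub

Placed at the top of the spider module, the patch runs at import time, before the crawl starts.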
As mentioned in this answer and, later, in the other answer.
For anyone who runs into this: I never found the cause of the indentation error, but I did find a workaround by splitting the code into two different parse methods:
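The answer's code block was lost in the page scrape; what follows is a reconstruction sketch based on the description below, reusing the question's spider and selectors. The href filter on "astronomy-calendar-" and the empty-list guard are assumptions:

import scrapy
from .. import items

class EventSpider(scrapy.Spider):
    name = "event_spider"
    start_urls = [
        "http://www.seasky.org/astronomy/astronomy-calendar-2014.html",
    ]

    def parse(self, response):
        # First pass: collect the year-page links from the site and
        # hand each one off to parse_contents.
        for href in response.css("a::attr(href)").getall():
            if "astronomy-calendar-" in href:
                yield response.follow(href, callback=self.parse_contents)

    def parse_contents(self, response):
        # Second pass: turn each listed event into an item, which an
        # item pipeline can then write to MongoDB.
        for astro_event in response.css("div#right-column-content ul li"):
            item = items.AstroEventsItem()
            item["title"] = astro_event.css("p span.title-text ::text").extract_first()
            item["date"] = astro_event.css("p span.date-text ::text").extract_first()
            texts = astro_event.css("p ::text").extract()
            item["description"] = texts[-1] if texts else None
            yield item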
The first parse method grabs the URLs from the hrefs listed on the site. It then calls the second parse method, parse_contents, for each href, and processes the information scraped from each page into items for MongoDB. Hope this helps anyone with a similar issue.