I am building a SitemapSpider. I am trying to filter the sitemap entries to exclude any whose link contains the substring '/p/':
<url>
<loc>https://example.co.za/product-name/p/product-id</loc>
<lastmod>2019-08-27</lastmod>
<changefreq>daily</changefreq>
</url>
According to the Scrapy docs, we can define a sitemap_filter function:
from datetime import datetime

def sitemap_filter(self, entries):
    for entry in entries:
        date_time = datetime.strptime(entry['lastmod'], '%Y-%m-%d')
        if date_time.year >= 2005:
            yield entry
In my case, I am filtering on entry['loc'] rather than entry['lastmod'].
Unfortunately, I have not found any examples of using sitemap_filter other than the one above.
from scrapy.spiders import SitemapSpider

class mySpider(SitemapSpider):
    name = 'spiderName'
    sitemap_urls = ['https://example.co.za/medias/sitemap']
    # sitemap_rules = [('donut/c', 'parse')]

    def sitemap_filter(self, entries):
        for entry in entries:
            if '/p/' not in entry['loc']:
                print(entry)
                yield entry

    def parse(self, response):
        ...
The code runs fine without the sitemap_filter function, but defining sitemap_rules for everything is not feasible.
When I run the code above, it prints the correct sitemap entries, but it never seems to reach the parse function. The log file shows no errors:
2022-05-10 17:02:00 [scrapy.core.engine] INFO: Spider opened
2022-05-10 17:02:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-05-10 17:02:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-05-10 17:02:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.co.za/robots.txt> (referer: None)
2022-05-10 17:02:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.co.za/medias/sitemap.xml> (referer: None)
2022-05-10 17:02:05 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2022-05-10 17:02:06 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
2022-05-10 17:02:09 [scrapy.core.engine] INFO: Closing spider (shutdown)
2022-05-10 17:02:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
I am looking for a way to send the entries produced by sitemap_filter to the parse function, or alternatively to filter the sitemap entries before Scrapy opens the links.
1 Answer
Thanks everyone for the suggestions. Following @Georgiy's comment and an old answer, replacing entry['loc'] with entry.get('loc') works.
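
For reference, a minimal sketch of the spider after that change (the spider name and sitemap URL are the placeholders from the question; the None guard is my addition, since .get() returns None for entries without a 'loc' field):

from scrapy.spiders import SitemapSpider

class mySpider(SitemapSpider):
    name = 'spiderName'
    sitemap_urls = ['https://example.co.za/medias/sitemap']

    def sitemap_filter(self, entries):
        for entry in entries:
            # .get() returns None instead of raising KeyError when an
            # entry has no 'loc' field, so the generator is not cut short
            loc = entry.get('loc')
            if loc and '/p/' not in loc:
                yield entry

    def parse(self, response):
        ...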