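I am trying to scrape the text of a website together with all the links on each page. Right now my code produces duplicates, a lot of them, which I want to avoid. Could you please help me and tell me where I made a mistake?
Here is my spider: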
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SuperSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'
    rules = [Rule(LinkExtractor(allow='/'),
                  callback='parse', follow=True)]

    def parse(self, response):
        url_list = []
        for quote in response.css('div'):
            name = quote.xpath('.//a/@href').get()
            if name in url_list:
                continue
            url_list.append(name)
            yield {
                'Link_without_base_url': quote.xpath('.//a/@href').get(),
                'Text': response.css("::text").extract(),
            }
Example of the JSON I get:
{"Link_without_base_url": "/", "Text": ["\n", "\n\t", "\n\t", "Quotes to Scrape", "\n ", "\n ", "\n", "\n", "\n ", "\n ", "\n ", "\n ", "\n ", "Quotes to Scrape", "\n ", "\n ", "\n ", "\n ", "\n \n ", "Login", "\n \n ", "\n ", "\n ", "\n \n\n", "Viewing tag: ", "better-life-empathy", "\n\n", "\n ", "\n\n ", "\n ", "\u201cYou never really understand a person until you consider things from his point of view... Until you climb inside of his skin and walk around in it.\u201d", "\n ", "by ", "Harper Lee", "\n ", "(about)", "\n ", "\n ", "\n Tags:\n ", " \n \n ", "better-life-empathy", "\n \n ", "\n ", "\n\n ", "\n ", "\n \n \n ", "\n ", "\n ", "\n ", "\n \n ", "Top Ten tags", "\n \n ", "\n ", "love", "\n ", "\n \n ", "\n ", "inspirational", "\n ", "\n \n ", "\n ", "life", "\n ", "\n \n ", "\n ", "humor", "\n ", "\n \n ", "\n ", "books", "\n ", "\n \n ", "\n ", "reading", "\n ", "\n \n ", "\n ", "friendship", "\n ", "\n \n ", "\n ", "friends", "\n ", "\n \n ", "\n ", "truth", "\n ", "\n \n ", "\n ", "simile", "\n ", "\n \n \n ", "\n", "\n\n ", "\n ", "\n ", "\n ", "\n Quotes by: ", "GoodReads.com", "\n ", "\n ", "\n Made with ", "\u2764", " by ", "Scrapinghub", "\n ", "\n ", "\n ", "\n", "\n"]},
{"Link_without_base_url": "/", "Text": ["\n", "\n\t", "\n\t", "Quotes to Scrape", "\n ", "\n ", "\n", "\n", "\n ", "\n ", "\n ", "\n ", "\n ", "Quotes to Scrape", "\n ", "\n ", "\n ", "\n ", "\n \n ", "Login", "\n \n ", "\n ", "\n ", "\n \n\n", "Viewing tag: ", "better-life-empathy", "\n\n", "\n ", "\n\n ", "\n ", "\u201cYou never really understand a person until you consider things from his point of view... Until you climb inside of his skin and walk around in it.\u201d", "\n ", "by ", "Harper Lee", "\n ", "(about)", "\n ", "\n ", "\n Tags:\n ", " \n \n ", "better-life-empathy", "\n \n ", "\n ", "\n\n ", "\n ", "\n \n \n ", "\n ", "\n ", "\n ", "\n \n ", "Top Ten tags", "\n \n ", "\n ", "love", "\n ", "\n \n ", "\n ", "inspirational", "\n ", "\n \n ", "\n ", "life", "\n ", "\n \n ", "\n ", "humor", "\n ", "\n \n ", "\n ", "books", "\n ", "\n \n ", "\n ", "reading", "\n ", "\n \n ", "\n ", "friendship", "\n ", "\n \n ", "\n ", "friends", "\n ", "\n \n ", "\n ", "truth", "\n ", "\n \n ", "\n ", "simile", "\n ", "\n \n \n ", "\n", "\n\n ", "\n ", "\n ", "\n ", "\n Quotes by: ", "GoodReads.com", "\n ", "\n ", "\n Made with ", "\u2764", " by ", "Scrapinghub", "\n ", "\n ", "\n ", "\n", "\n"]},
{"Link_without_base_url": "/", "Text": ["\n", "\n\t", "\n\t", "Quotes to Scrape", "\n ", "\n ", "\n", "\n", "\n ", "\n ", "\n ", "\n ", "\n ", "Quotes to Scrape", "\n ", "\n ", "\n ", "\n ", "\n \n ", "Login", "\n \n ", "\n ", "\n ", "\n \n\n", "Viewing tag: ", "better-life-empathy", "\n\n", "\n ", "\n\n ", "\n ", "\u201cYou never really understand a person until you consider things from his point of view... Until you climb inside of his skin and walk around in it.\u201d", "\n ", "by ", "Harper Lee", "\n ", "(about)", "\n ", "\n ", "\n Tags:\n ", " \n \n ", "better-life-empathy", "\n \n ", "\n ", "\n\n ", "\n ", "\n \n \n ", "\n ", "\n ", "\n ", "\n \n ", "Top Ten tags", "\n \n ", "\n ", "love", "\n ", "\n \n ", "\n ", "inspirational", "\n ", "\n \n ", "\n ", "life", "\n ", "\n \n ", "\n ", "humor", "\n ", "\n \n ", "\n ", "books", "\n ", "\n \n ", "\n ", "reading", "\n ", "\n \n ", "\n ", "friendship", "\n ", "\n \n ", "\n ", "friends", "\n ", "\n \n ", "\n ", "truth", "\n ", "\n \n ", "\n ", "simile", "\n ", "\n \n \n ", "\n", "\n\n ", "\n ", "\n ", "\n ", "\n Quotes by: ", "GoodReads.com", "\n ", "\n ", "\n Made with ", "\u2764", " by ", "Scrapinghub", "\n ", "\n ", "\n ", "\n", "\n"]},
Thank you all for your support.
1 Answer
Answer #1, by nmpmafwu
Simply put, you can select all the list items and iterate over them, then select the link and the text of each item as follows:
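The answer's own code is not reproduced here, so what follows is only a minimal sketch of the approach it describes, not the answerer's implementation. It assumes that "items" means the `<a>` elements of each page, and it takes the link and the text from the same element instead of attaching the text of the whole page (`response.css("::text")`) to every item. The helper `unique_links` and the callback name `parse_page` are hypothetical names introduced for illustration; the callback is renamed because the Scrapy documentation advises against using `parse` as a rule callback in a `CrawlSpider`, which uses that method internally. The field names are carried over from the question.

```python
try:
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
except ImportError:  # lets the helper below be tried without Scrapy installed
    CrawlSpider = None


def unique_links(pairs):
    """Yield one item per distinct href, in first-seen order.

    `pairs` is an iterable of (href, text_fragments) tuples.
    Hypothetical helper, not part of Scrapy.
    """
    seen = set()
    for href, fragments in pairs:
        if not href or href in seen:
            continue  # skip missing hrefs and links already emitted
        seen.add(href)
        # Drop whitespace-only fragments such as "\n    " and join the rest.
        text = " ".join(f.strip() for f in fragments if f.strip())
        yield {"Link_without_base_url": href, "Text": text}


if CrawlSpider is not None:

    class SuperSpider(CrawlSpider):
        name = "spider"
        allowed_domains = ["quotes.toscrape.com"]
        start_urls = ["http://quotes.toscrape.com/"]
        # Assumption: follow every in-domain link, as the question does.
        rules = [Rule(LinkExtractor(), callback="parse_page", follow=True)]

        def parse_page(self, response):
            # Pair each link with the text found inside that same <a> element.
            anchors = response.css("a[href]")
            pairs = (
                (a.attrib["href"], a.css("::text").getall()) for a in anchors
            )
            yield from unique_links(pairs)
```

Deduplication in this sketch is per page: a link such as `/` that appears on every page is emitted once for each page crawled. To emit each link only once for the whole crawl, the `seen` set would have to be stored on the spider instance instead of being created on every call.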
Output: