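I am a beginner at web scraping, and I am currently scraping the Quotes to Scrape website using Scrapy.
My problem is that when I scrape the text inside the div boxes, I use the code text = div.css('.text::text').extract() to extract the paragraph. However, when I store the text in a .csv file, the double quotes are treated as special characters, misinterpreted, and changed into other characters.
How can I set up an if condition so that those double quotes are not stored during extraction?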
import scrapy

# Assumed import path: QuotetutorialItem is the class defined in the project's items.py
from ..items import QuotetutorialItem


class QuoteSpider(scrapy.Spider):
    # scrapy.Spider, the class we inherit from, expects the attributes
    # `name` and `start_urls` to be defined under exactly these names.
    name = 'quotes'  # spider name
    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):
        # `response` holds the source of the web page being scraped
        items = QuotetutorialItem()  # instance of the class defined in items.py
        allDiv = response.css('.quote')
        for div in allDiv:
            text = div.css('.text::text').extract()  # text inside the .text element
            authors = div.css('.author::text').extract()  # text inside the .author element
            # href of the <a> tag inside the span of each .quote div, read via XPath
            aboutAuthors = div.css('.quote span a').xpath('@href').extract()
            tags = div.css('.tags .tag::text').extract()
            # the keys used here must match the field names declared in items.py
            items['storeText'] = text
            items['storeAuthors'] = authors
            items['storeAboutAuthors'] = aboutAuthors
            items['storeTags'] = tags
            yield items
1 Answer
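Since each quote starts with a “ character and ends with a ” character, you can remove those two characters from the extracted text, so that the stored quote comes out without the “ and ” characters. One way to do this is to replace the “ and ” characters after you have extracted the quote.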
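The answer's original code did not survive on this page, so what follows is only a minimal sketch of the idea, not the answerer's exact code. It uses plain Python string methods on a sample string; the helper names `remove_curly_quotes` and `strip_curly_quotes` and the sample text are hypothetical. Because `.extract()` in the question returns a list of strings, the helper is applied to each element of the list.

```python
# Sketch only: helper names and sample text are illustrative assumptions.
LEFT_QUOTE = "\u201c"   # the “ character
RIGHT_QUOTE = "\u201d"  # the ” character


def remove_curly_quotes(text: str) -> str:
    """Remove every “ and ” character, wherever it appears in text."""
    return text.replace(LEFT_QUOTE, "").replace(RIGHT_QUOTE, "")


def strip_curly_quotes(text: str) -> str:
    """Remove “ and ” only from the two ends, keeping any inside the text."""
    return text.strip(LEFT_QUOTE + RIGHT_QUOTE)


# Stand-in for the list that div.css('.text::text').extract() would return.
extracted = ["\u201cThe world as we have created it is a process of our thinking.\u201d"]

# Clean each extracted string before storing it in the item.
cleaned_list = [remove_curly_quotes(t) for t in extracted]
stripped_list = [strip_curly_quotes(t) for t in extracted]

print(cleaned_list[0])
```

In the spider from the question, the same list comprehension could be applied to `text` before assigning it to `items['storeText']`. `strip_curly_quotes` is the more conservative choice if a quote may itself contain nested “ ” characters that should be kept.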