csv: Special characters being extracted using Scrapy

Posted by cgvd09ve on 2023-01-15 in Other

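I am a beginner at data scraping, and I am currently scraping the quotes to scrape website using Scrapy.
My problem is that when I scrape the text in the div box, I use the code text = div.css('.text::text').extract() to extract the paragraph. However, when I store the text in a .csv file, the double quotes are treated as special characters, misinterpreted, and changed into other characters.
How can I set up an if condition so that these double quotes are not stored during extraction?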

import scrapy

# Assumption: the spider lives in the project's spiders/ package, next to items.py
from ..items import QuotetutorialItem


class QuoteSpider(scrapy.Spider):
    # Scrapy expects these two attributes to be named exactly `name` and `start_urls`
    name = 'quotes'   # spider name
    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):      # response holds the source of the page being scraped
        allDiv = response.css('.quote')
        for div in allDiv:
            items = QuotetutorialItem()   # new instance of the class defined in items.py, one per quote
            text = div.css('.text::text').extract()    # text inside the .text element
            authors = div.css('.author::text').extract()   # text inside the .author element
            # inside the .quote div, take the <a> tags within each span and read their href with XPath
            aboutAuthors = div.css('.quote span a').xpath('@href').extract()
            tags = div.css('.tags .tag::text').extract()

            # the keys must match the field names declared in items.py
            items['storeText'] = text
            items['storeAuthors'] = authors
            items['storeAboutAuthors'] = aboutAuthors
            items['storeTags'] = tags

            yield items

csv

Source: https://stackoverflow.com/questions/75085053/special-character-being-extracted-using-scrapy

1 Answer

c0vxltue1#

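Since each quote begins with a “ character and ends with a ” character, you could consider the following approaches:

Remove the first and last characters from the string.

Example: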

# Sample quote:
quote_sample = "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"

# Modify the string by taking all the characters after the first and before the last character:
quote_sample = quote_sample[1:-1]

# Print the modified quote (already sliced above, so do not slice it again):
print(quote_sample)

Result: the quote without the “ and ” characters:

A woman is like a tea bag; you never know how strong it is until it's in hot water.

Alternatively, once you have the quote, you can replace the “ and ” characters.
Code:

quote_sample = quote_sample.replace("“", "").replace("”", "")
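
Because .extract() returns a list of strings, the same idea can be applied to every element before it is stored. The following is only a minimal sketch: the helper name strip_curly_quotes and the sample list are made up for illustration and are not part of the original spider. It uses str.strip, which removes the “ and ” characters only at the two ends of the string and leaves any such characters inside the quote untouched.

```python
# Hypothetical helper (not from the original post): removes curly quotes
# from the start and end of a string only.
CURLY_QUOTES = "\u201c\u201d"  # the “ and ” characters


def strip_curly_quotes(text: str) -> str:
    """Return text without leading/trailing curly double quotes."""
    return text.strip(CURLY_QUOTES)


# Stand-in for the list that div.css('.text::text').extract() would return.
extracted = [
    "\u201cA woman is like a tea bag; you never know how strong it is "
    "until it's in hot water.\u201d"
]

# Clean every extracted string before assigning it to the item field.
cleaned = [strip_curly_quotes(t) for t in extracted]
print(cleaned[0])
```

In the spider from the question, the list comprehension could be applied to text before it is assigned to items['storeText']. Unlike slicing with [1:-1], this leaves strings that have no surrounding curly quotes unchanged.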

2023-01-15
