Python: strip \n, \t and \r in Scrapy

Posted by klr1opcd on 2022-10-30 in Python


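I am trying to strip \r, \n and \t characters with a Scrapy spider and then write the result to a JSON file.
I have a "description" object that is full of newlines, and it does not do what I want: match each description to its title.
I tried map(unicode.strip()), but it does not really work. Being new to Scrapy, I don't know whether there is a simpler way, or how map with unicode.strip actually works.
Here is my code: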

def parse(self, response):
    for sel in response.xpath('//div[@class="d-grid-main"]'):
        item = xItem()
        item['TITLE'] = sel.xpath('xpath').extract()
        item['DESCRIPTION'] = map(unicode.strip, sel.xpath('//p[@class="class-name"]/text()').extract())

I also tried:

item['DESCRIPTION'] = str(sel.xpath('//p[@class="class-name"]/text()').extract()).strip()

But it raises an error. What is the best way to do this?

python

Source: https://stackoverflow.com/questions/35288184/strip-n-t-r-in-scrapy

8 Answers


vuv7lop31#

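unicode.strip only deals with whitespace characters at the beginning and end of a string:
"Return a copy of the string with the leading and trailing characters removed."
It does not touch \n, \r or \t in the middle of the string.
You can either use a custom method to remove those characters inside the string (using the regular expression module), or use XPath's normalize-space():
"Returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space."
Example Python shell session: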

>>> text='''<html>
... <body>
... <div class="d-grid-main">
... <p class="class-name">
... 
...  This is some text,
...  with some newlines \r
...  and some \t tabs \t too;
... 
... <a href="http://example.com"> and a link too
...  </a>
... 
... I think we're done here
... 
... </p>
... </div>
... </body>
... </html>'''
>>> response = scrapy.Selector(text=text)
>>> response.xpath('//div[@class="d-grid-main"]')
[<Selector xpath='//div[@class="d-grid-main"]' data=u'<div class="d-grid-main">\n<p class="clas'>]
>>> div = response.xpath('//div[@class="d-grid-main"]')[0]
>>> 
>>> # you'll want to use relative XPath expressions, starting with "./"
>>> div.xpath('.//p[@class="class-name"]/text()').extract()
[u'\n\n This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n',
 u"\n\nI think we're done here\n\n"]
>>> 
>>> # only leading and trailing whitespace is removed by strip()
>>> map(unicode.strip, div.xpath('.//p[@class="class-name"]/text()').extract())
[u'This is some text,\n with some newlines \r\n and some \t tabs \t too;', u"I think we're done here"]
>>> 
>>> # normalize-space() will get you a single string on the whole element
>>> div.xpath('normalize-space(.//p[@class="class-name"])').extract()
[u"This is some text, with some newlines and some tabs too; and a link too I think we're done here"]
>>>
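
The answer mentions a custom method built on the regular expression module without showing one. A minimal Python 3 sketch of that idea follows. The helper names collapse_whitespace and remove_control_whitespace are hypothetical, and the sample string only imitates the first text node from the session above. The first helper roughly mimics normalize-space(); the second deletes the three characters and leaves ordinary spaces alone.

```python
import re

# The four characters XPath treats as whitespace: space, tab, CR, LF.
_WS_RUN = re.compile(r"[ \t\r\n]+")

def collapse_whitespace(text):
    # Collapse every whitespace run to one space, then trim the ends.
    return _WS_RUN.sub(" ", text).strip(" ")

def remove_control_whitespace(text):
    # Delete CR, LF and TAB everywhere; ordinary spaces are kept as they are.
    return re.sub(r"[\r\n\t]", "", text)

sample = "\n\n This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n"
print(repr(collapse_whitespace(sample)))
print(repr(remove_control_whitespace(sample)))
```

Which helper fits depends on whether runs of spaces in the source text carry meaning; the second one preserves them, including any leading or trailing spaces.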


i2byvkas2#

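I'm new to Python and Scrapy. I ran into a similar problem today and solved it with the following module/function: w3lib.html.replace_escape_chars. I created a default input processor for my item loader and it works without any issues. You can also bind it to a specific scrapy.Field(), and it works with CSS selectors and CSV feed exports as well: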

from itemloaders.processors import MapCompose  # older Scrapy versions: scrapy.loader.processors
from w3lib.html import replace_escape_chars
yourloader.default_input_processor = MapCompose(replace_escape_chars)
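
For readers without w3lib at hand, the effect of this processor on each extracted value can be approximated in plain Python. This is only a sketch, assuming the default behaviour of replace_escape_chars is to delete \n, \t and \r. The names drop_escape_chars and apply_to_each are hypothetical stand-ins, not the w3lib or Scrapy implementations, and the input list is made up.

```python
def drop_escape_chars(text, which_ones=("\n", "\t", "\r"), replace_by=""):
    # Hypothetical stand-in: replace each listed character with replace_by.
    for ch in which_ones:
        text = text.replace(ch, replace_by)
    return text

def apply_to_each(values, func):
    # Hypothetical stand-in for a processor applied to every extracted value.
    return [func(v) for v in values]

raw = ["\n\tFirst line\nsecond line\r\n", "plain"]
cleaned = apply_to_each(raw, drop_escape_chars)
print(cleaned)
```

Because the characters are deleted rather than replaced, words separated only by a newline end up glued together. In this sketch, passing replace_by=" " avoids that.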


juud5qan3#

As paul trmbrth suggested,

div.xpath('normalize-space(.//p[@class="class-name"])').extract()

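is probably what you want. However, normalize-space also condenses whitespace inside the string into a single space. If you only want to remove \r, \n and \t without disturbing other whitespace, you can use translate() to remove the characters.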

trans_table = {ord(c): None for c in u'\r\n\t'}
item['DESCRIPTION'] = ' '.join(s.translate(trans_table) for s in sel.xpath('//p[@class="class-name"]/text()').extract())

This still leaves leading and trailing whitespace that is not in the set \r, \n, \t. If you want to get rid of that as well, just insert a call to strip():

item['DESCRIPTION'] = ' '.join(s.strip().translate(trans_table) for s in sel.xpath('//p[@class="class-name"]/text()').extract())
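
To see the difference between the two variants without running a spider, here is a small Python 3 sketch. The parts list is made up and merely stands in for the result of extract(); the other names are illustrative.

```python
# Map each unwanted code point to None so str.translate() deletes it.
trans_table = {ord(c): None for c in "\r\n\t"}

# Made-up text nodes standing in for the extract() result.
parts = ["\n  Hello,\tworld \r\n", "  second\npart  "]

# Variant 1: delete the three characters, keep every ordinary space.
translated_only = " ".join(s.translate(trans_table) for s in parts)

# Variant 2: trim each piece first, then delete the characters inside it.
stripped_too = " ".join(s.strip().translate(trans_table) for s in parts)

print(repr(translated_only))
print(repr(stripped_too))
```

Note that translate() deletes the characters outright, so words separated only by a tab or newline are joined together.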


g52tjvyc4#

The simplest example, extracting a price from alibris.com, is

response.xpath('normalize-space(//td[@class="price"]//p)').get()


wmvff8tz5#

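I ran into the same problem when scraping web pages with Scrapy, and I have two ways to solve it. The first is to replace the characters. response.xpath returns a list, but replacement functions operate only on strings, so I use a for loop to take each item in the list as a string, remove the '\n' and '\t' in each item (here with re.sub), and append the result to a new list.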

import re
test_string =["\n\t\t", "\n\t\t\n\t\t\n\t\t\t\t\t", "\n", "\n", "\n", "\n", "Do you like shopping?", "\n", "Yes, I\u2019m a shopaholic.", "\n", "What do you usually shop for?", "\n", "I usually shop for clothes. I\u2019m a big fashion fan.", "\n", "Where do you go shopping?", "\n", "At some fashion boutiques in my neighborhood.", "\n", "Are there many shops in your neighborhood?", "\n", "Yes. My area is the city center, so I have many choices of where to shop.", "\n", "Do you spend much money on shopping?", "\n", "Yes and I\u2019m usually broke at the end of the month.", "\n", "\n\n\n", "\n", "\t\t\t\t", "\n\t\t\t\n\t\t\t", "\n\n\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t"]
print(test_string)
# remove runs of \t and \n
a = re.compile(r'(\t)+')
b = re.compile(r'(\n)+')
text = []
for n in test_string:
    n = a.sub('',n)
    n = b.sub('',n)
    text.append(n)
print(text)
# remove all '' entries
while '' in text:
    text.remove('')
print(text)

The second way uses map() and strip. map() works on the list directly and keeps its original shape. Python 2 uses 'unicode', which becomes 'str' in Python 3, as follows:

text = list(map(str.strip, test_string))
print(text)

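The strip function only removes \n, \t and \r from the beginning and end of a string, not from the middle. In that respect it differs from the removal approach in the first method.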
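
The two methods can also be combined so that each item is stripped and the empty leftovers are dropped in a single pass. A minimal Python 3 sketch, using a shortened made-up version of the list above (items and cleaned are illustrative names):

```python
items = ["\n\t\t", "\n", "Do you like shopping?", "\n",
         "Yes, I\u2019m a shopaholic.", "\n\n\t\t\t"]

# strip() trims leading and trailing whitespace; whitespace-only items
# become '' and are filtered out by the truthiness test.
cleaned = [s.strip() for s in items if s.strip()]
print(cleaned)
```

This avoids the separate while loop that removes '' entries one at a time.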


dz6r00yl6#

If you want to keep a list of all the strings rather than a single result, there is no need for extra steps; simply call getall() instead of get():

response.xpath('normalize-space(.//td[@class="price"]/text())').getall()

Also, text() should be added at the end.
Hope this helps someone!


0h4hbjxa7#

You can try combining css with get().strip(); it worked for me.


oipij1gg8#

str(i.css("p::text")[1].extract()).strip()
