Remove all inline styles using BeautifulSoup

Posted by nhn9ugyo on 2023-02-06 in Other

I'm doing some HTML cleaning with BeautifulSoup. I'm new to both Python and BeautifulSoup. Based on an answer I found elsewhere on Stack Overflow, I'm already removing tags correctly like this:

[s.extract() for s in soup('script')]

But how do I remove inline styles? For example:

<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>
<img class="some_image" href="somewhere.com">

should become:

<p>Text</p>
<img href="somewhere.com">

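How do I remove the inline class, id, name and style attributes from all elements?
Answers to similar questions that I could find all suggest using a CSS parser for this rather than BeautifulSoup. But since the task is simply to remove attributes rather than manipulate them, and the rule applies to every tag, I was hoping to find a way to do it all within BeautifulSoup.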

Source: https://stackoverflow.com/questions/12959308/remove-all-inline-styles-using-beautifulsoup

7 Answers

pnwntuvh1#

If you just want to remove all of it, you don't need to parse any CSS. BeautifulSoup provides a way to delete entire attributes, like so:

for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]

Also, if you just want to delete entire tags (and their contents), you don't need extract(), which returns the tag. All you need is decompose():

[tag.decompose() for tag in soup("script")]

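It's not a big difference, just something else I found while looking through the docs. You can find more details about the API, along with many examples, in the BeautifulSoup documentation.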
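As a minimal end-to-end sketch, the attribute removal above can be applied to the markup from the question. This assumes BeautifulSoup 4 (the `bs4` package) and the built-in `html.parser`. The `UNWANTED` name is only illustrative, and `tag.attrs.pop(name, None)` is used here in place of `del tag[name]` so that tags lacking an attribute are skipped quietly.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>'
    '<img class="some_image" href="somewhere.com">'
)

# "html.parser" ships with Python; other parsers may wrap the fragment
# in <html>/<body> tags.
soup = BeautifulSoup(html, "html.parser")

# Illustrative name: the attributes the question wants to strip.
UNWANTED = ("class", "id", "name", "style")

for tag in soup.find_all(True):  # True matches every tag
    for attribute in UNWANTED:
        # tag.attrs is a plain dict in bs4; pop() with a default ignores missing keys.
        tag.attrs.pop(attribute, None)

print(soup)
```

Attributes outside the list, such as `href` on the `img` tag, are left untouched.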

cczfrluj2#

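I wouldn't do this in BeautifulSoup. You'll spend a lot of time trying, testing and handling edge cases.
Bleach does exactly what you need: http://pypi.python.org/pypi/bleach
If you do go with BeautifulSoup, I'd suggest taking the same "whitelist" approach that Bleach uses: decide which tags may carry which attributes, and strip every tag or attribute that doesn't match.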
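The whitelist idea can be sketched in BeautifulSoup roughly as follows. This is only an illustration under assumptions: it requires BeautifulSoup 4, and the `ALLOWED` and `DROPPED` policies and the `sanitize` name are hypothetical, not part of Bleach or BeautifulSoup. It is also not a vetted sanitizer for untrusted input.

```python
from bs4 import BeautifulSoup

# Hypothetical policy: tag name -> attributes that tag may keep.
ALLOWED = {"p": set(), "a": {"href", "title"}, "img": {"src", "alt"}}
# Hypothetical list of tags removed together with their contents.
DROPPED = ["script", "style"]


def sanitize(html):
    soup = BeautifulSoup(html, "html.parser")

    # First pass: delete dangerous tags and everything inside them.
    for tag in soup.find_all(DROPPED):
        tag.decompose()

    # Second pass: find_all() returns a list, so editing the tree while looping is safe.
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED:
            tag.unwrap()  # drop the tag itself but keep its children
            continue
        allowed_attrs = ALLOWED[tag.name]
        for name in list(tag.attrs):  # copy the keys before deleting
            if name not in allowed_attrs:
                del tag.attrs[name]

    return str(soup)


result = sanitize(
    '<div class="x"><p id="a">Hi <a href="/u" onclick="evil()">link</a></p>'
    "<script>bad()</script></div>"
)
print(result)
```

Unwrapping unknown tags keeps their text, while decomposing `script` and `style` discards it. Which tags belong in which group is a policy decision.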

i5desfxk3#

Here is my solution for Python 3 and BeautifulSoup 4:

def remove_attrs(soup, whitelist=tuple()):
    for tag in soup.findAll(True):
        for attr in [attr for attr in tag.attrs if attr not in whitelist]:
            del tag[attr]
    return soup

It supports a whitelist of attributes that should be kept. :) If no whitelist is given, all attributes are removed.

nnt7mjpx4#

How about lxml's Cleaner?

from lxml.html.clean import Cleaner

content_without_styles = Cleaner(style=True).clean_html(content)

xzv2uavs5#

Based on jmk's function, I use this function to remove attributes based on a whitelist.
It works in Python 2 with BeautifulSoup 3:

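def clean(tag, whitelist=()):
    tag.attrs = None  # clear the attributes of the root tag itself
    for e in tag.findAll(True):
        # In BeautifulSoup 3, attrs is a list of (name, value) tuples.
        # Iterate over a copy, because deleting while iterating skips entries.
        for attribute in list(e.attrs):
            if attribute[0] not in whitelist:
                del e[attribute[0]]
        # e.attrs = None  # would delete all attributes
    return tag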

#example to keep only title and href
clean(soup,["title","href"])

wpcxdonn6#

Not perfect, but short:

' '.join([el.text for tag in soup for el in tag.findAllNext(whitelist)]);

8tntrjer7#

I did this using the re module and a regular expression.

import re

def removeStyle(html):
  # Raw string avoids invalid escape sequences; matches ` style="..."`.
  style = re.compile(r' style=.*?".*?"')
  html = re.sub(style, '', html)

  return html

html = '<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>'

removeStyle(html)

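Output: <p class="author" id="author_id" name="author_name">Text</p>
You can use this function to strip any inline attribute by replacing "style" in the regular expression with the attribute's name.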
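One way to generalize this is to pass the attribute name as a parameter instead of editing the pattern by hand. The sketch below is an assumption-laden variant, not the answer's original code: the `remove_attribute` name is hypothetical, and the pattern is slightly stricter, accepting single- or double-quoted values only. Like any regex applied to HTML, it can also match look-alike text outside of tags.

```python
import re


def remove_attribute(html, name):
    """Strip every quoted `name="..."` or `name='...'` attribute from an HTML string."""
    pattern = re.compile(
        # Leading whitespace, the escaped attribute name, '=', then a quoted value.
        r"\s+" + re.escape(name) + r"""\s*=\s*(?:"[^"]*"|'[^']*')""",
        re.IGNORECASE,
    )
    return pattern.sub("", html)


html = '<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>'

cleaned = html
for attr in ("class", "id", "name", "style"):
    cleaned = remove_attribute(cleaned, attr)

print(cleaned)  # <p>Text</p>
```

Requiring whitespace before the name keeps attributes such as `data-style` from being matched when removing `style`.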
