Split text on markup in Python

Posted by mfuanj7w on 2023-09-29 in Python

I have the following line of text:

<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>

Using Python, I want to split it at the markup entities to get the following list:

['<code>', 'stuff', '</code>', ' and stuff and $\\LaTeX$ ', '<pre class="mermaid">', 'stuff', '</pre>']

So far I have used:

import re

markup = re.compile(r"(<(?P<tag>[a-z]+).*>)(.*?)(<\/(?P=tag)>)")
# Raw string, so that \L is not treated as an (invalid) escape sequence
text = r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'
words = re.split(markup, text)

But it produces:

['<code>', 'code', 'stuff', '</code>', ' and stuff and $\\LaTeX$ ', '<pre class="mermaid">', 'pre', 'stuff', '</pre>']

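The problem is that the tag group referenced by (?P=tag) is added to the list because it is captured. I capture it only to find the closest matching closing tag.
Assuming the code only ever processes one line at a time, how can I get rid of it in the resulting list?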
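
The behavior described above follows from how re.split works: every capturing group in the pattern is returned alongside the split pieces. A minimal sketch, assuming the asker's pattern is kept unchanged, is to drop the tag name by its position afterwards. The helper name split_markup is hypothetical and not part of the original question.

```python
import re

# The asker's pattern, unchanged: group 2 (named "tag") exists only so the
# backreference (?P=tag) can find the matching closing tag.
markup = re.compile(r"(<(?P<tag>[a-z]+).*>)(.*?)(<\/(?P=tag)>)")

def split_markup(text):
    """Hypothetical helper: split on the pattern, then drop the tag names."""
    parts = markup.split(text)
    # re.split interleaves: text, g1, g2, g3, g4, text, g1, ...
    # so the tag name (g2) always sits at index 2 within each run of 5.
    stride = markup.groups + 1
    # "and p" also discards empty strings, e.g. the leading and trailing ones.
    return [p for i, p in enumerate(parts) if i % stride != 2 and p]

text = r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'
print(split_markup(text))
```

Filtering by position keeps the backreference intact. The trade-off is that the index arithmetic has to be revisited whenever the number of groups in the pattern changes.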

python

Source: https://stackoverflow.com/questions/76868183/split-text-on-markup-in-python

3 Answers

qyzbxkaa1#

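You can use xml.etree.ElementTree from the standard library. It is a module designed for XML files, so it also parses HTML-like fragments as long as they are well-formed XML (every tag closed, attribute values quoted).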

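import xml.etree.ElementTree as ET

text = r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'
# Wrap the fragment in a dummy root so it parses as a single XML document.
root = ET.fromstring(f'<root>{text}</root>')
result = []
if root.text:  # text before the first element
    result.append(root.text)
for element in root:  # top-level elements only; nested tags are not expanded
    # Rebuild the opening tag, including its attributes
    attrs = ''.join(f' {key}="{value}"' for key, value in element.attrib.items())
    result.append(f'<{element.tag}{attrs}>')
    if element.text:
        result.append(element.text)
    result.append(f'</{element.tag}>')
    if element.tail:  # text between this element and the next one
        result.append(element.tail)
print(result)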


a11xaf1n2#

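Regular expressions are not suited to parsing HTML. They are, however, usually good enough for tokenizing it. With re.finditer, tokenization becomes a one-liner: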

list(map(lambda x: x.group(0), re.finditer(r"(?:<(?:.*?>)?)|[^<]+", s)))

Explanation:

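Only non-capturing groups (?:...) are used; we don't need any specific captures here.
Match either a "tag" <(?:.*?>)? (possibly invalid, such as a lone < sign; it is recognized only by its opening < and runs up to the next >) or plain text [^<]+.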
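
To make the two alternatives concrete, here is a small sketch of the same pattern wrapped in a function, run on a well-formed input and on two inputs containing a stray <. The names TOKEN and tokenize are illustrative only and do not appear in the original answer.

```python
import re

# Same pattern as the one-liner, minus the redundant outer group:
# either "<" optionally followed by everything up to the next ">",
# or a run of characters that contains no "<".
TOKEN = re.compile(r"<(?:.*?>)?|[^<]+")

def tokenize(s):
    """Return the tokens of s in order of appearance."""
    return [m.group(0) for m in TOKEN.finditer(s)]

print(tokenize("<b>x</b>"))        # ['<b>', 'x', '</b>']
print(tokenize("1 < 2"))           # ['1 ', '<', ' 2']
print(tokenize("a < b <i>c</i>"))  # ['a ', '< b <i>', 'c', '</i>']
```

The last call shows the limit of recognizing a tag only by its opening <: a stray < swallows everything up to the next >, including the start of a real tag.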

The one-liner handles your test case

s = r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'

correctly, producing

['<code>', 'stuff', '</code>', ' and stuff and $\\LaTeX$ and ', '<pre class="mermaid">', 'stuff', '</pre>']

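Note, however, that a full-fledged HTML tokenizer needs a more complex regular grammar to handle, for example, attributes such as onclick = "console.log(1 < 2)". It is better to use an off-the-shelf library to do the markup parsing (or even just the tokenization) for you.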
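
As one possible reading of "an off-the-shelf library", the sketch below uses html.parser.HTMLParser from the standard library purely as a tokenizer. The class name TokenCollector and the function html_tokens are hypothetical names chosen for this example; the original answer does not name a specific library.

```python
from html.parser import HTMLParser

class TokenCollector(HTMLParser):
    """Hypothetical helper: records tags and text runs in document order."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as written in the input
        self.tokens.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser only reports the lowercased name, so the tag is rebuilt
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        self.tokens.append(data)

def html_tokens(text):
    parser = TokenCollector()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return parser.tokens

s = r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'
print(html_tokens(s))
```

Two caveats of this sketch: end tags are reconstructed rather than copied from the input, and with the parser's default settings character references such as &amp; are decoded in the text tokens.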


von4xj4u3#

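s = r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'
tokens = [""]  # start with an empty token so text before the first tag works
for ch in s:
    if ch == "<":
        tokens.append(ch)  # "<" opens a new tag token
    elif ch == ">":
        tokens[-1] += ch   # ">" closes the current tag token
        tokens.append("")  # whatever follows starts a new text token
    else:
        tokens[-1] += ch

# Drop the empty tokens left between adjacent tags and at either end
tokens = [t for t in tokens if t]
print(tokens)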

Output: ['<code>', 'stuff', '</code>', ' and stuff and $\\LaTeX$ and ', '<pre class="mermaid">', 'stuff', '</pre>']
