How to add markup to XML text using Python

Posted by unhi4e5o on 2023-05-21 in Python

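I have marked-up text in XML format. I need to add markup to it, that is, wrap certain words in tags whenever they occur in the text.
This is what I have tried: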

import xml.etree.ElementTree as ET
doc = '''<root><par>An <fr>example</fr> text with key words one and two</par></root>'''

profs=['one','two']
tag='<key>'
tag_cl='</key>'

root = ET.fromstring(doc)
for child in root:
    for word in profs:
        if word in child.text:
            child.text=child.text.replace(word, f'{tag}{word}{tag_cl}')
    print(child.text)

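This works as long as the text contains no nested tags. If there is a nested tag (`fr` in this example), child.text only covers the text before that first tag. Surely there is a simple way to do what I described. Could you give me a hint?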
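To make the symptom concrete, here is a small sketch that prints where ElementTree keeps each piece of text in the sample document. The variable names `par_text` and `fr_tail` are only illustrative.

```python
import xml.etree.ElementTree as ET

doc = '<root><par>An <fr>example</fr> text with key words one and two</par></root>'
root = ET.fromstring(doc)

par = root.find('par')
fr = par.find('fr')

# Text before the first child belongs to the parent's .text
par_text = par.text
# Text after a child's closing tag is stored on that child as .tail
fr_tail = fr.tail

print(repr(par_text))   # 'An '
print(repr(fr.text))    # 'example'
print(repr(fr_tail))    # ' text with key words one and two'
```

So the key words in this example live in `fr.tail`, which is why a loop that only checks `child.text` never sees them.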

Answer #1 (lstz6jyr)

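What you are looking for is the element's tail: the text that follows `fr` is stored in elem.tail, not in child.text. If needed, repeat the same if check for elem.text: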

import xml.etree.ElementTree as ET

doc = '''<root><par>An <fr>example</fr> text with key words one and two</par></root>'''

profs = ['one', 'two']
tag = ET.Element('key')

root = ET.fromstring(doc)

for elem in root.iter():
    # The text that follows an element's closing tag is stored in elem.tail
    if elem.tail is None:
        continue
    for word in profs:
        if word in elem.tail:
            tag.text = word
            # Note: this pastes the markup into the tail as a plain string
            elem.tail = elem.tail.replace(word, ET.tostring(tag, encoding='unicode'))
    print(elem.tail)

Output:

text with key words <key>one</key> and <key>two</key>
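One caveat worth checking before relying on this: the printed tail looks right, but the `<key>` markup is only a string inside the tail. The sketch below (same sample document, only one key word for brevity) shows what happens when the tree is serialized afterwards. The names `serialized`, `has_escaped_markup` and `key_elements` are illustrative.

```python
import xml.etree.ElementTree as ET

doc = '<root><par>An <fr>example</fr> text with key words one and two</par></root>'
root = ET.fromstring(doc)

fr = root.find('par/fr')
# Same idea as the loop above: paste markup into the tail as a plain string
fr.tail = fr.tail.replace('one', '<key>one</key>')

serialized = ET.tostring(root, encoding='unicode')
print(serialized)

# The pasted markup is escaped on output, and no <key> element exists in the tree
has_escaped_markup = '&lt;key&gt;one&lt;/key&gt;' in serialized
key_elements = root.findall('.//key')
```

If the goal is only to print the marked-up text, this is fine. If the tree is going to be written back out as XML, real elements are needed.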

**Option 2:** If you want real ElementTree element objects rather than markup pasted in as a string, you can do this:

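import xml.etree.ElementTree as ET

doc = '''<root><par>An <fr>example</fr> text with key words one and two</par></root>'''

profs = ['one', 'two']
tag = ET.Element('key')

root = ET.fromstring(doc)

# Snapshot the direct children so root can be modified while looping
for index, elem in enumerate(list(root)):
    if elem.tag != 'par':
        continue
    # Serialize the element, wrap the key words, then parse it back
    text_str = ET.tostring(elem, encoding='unicode')
    for word in profs:
        tag.text = word
        # Caveat: plain replace() also matches inside tag names and longer words
        text_str = text_str.replace(word, ET.tostring(tag, encoding='unicode'))
    # Swap the re-parsed element in at its original position
    root[index] = ET.fromstring(text_str)

ET.dump(root)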

Output:

<root><par>An <fr>example</fr> text with key words <key>one</key> and <key>two</key></par></root>
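As an alternative to serializing and re-parsing, the `<key>` elements can be built directly from the `.text` and `.tail` strings. The following is a minimal sketch under a few assumptions: the helper names `wrap_keywords` and `_split` are hypothetical, and matching whole words only (via `\b`) is a design choice that the original question does not specify.

```python
import re
import xml.etree.ElementTree as ET

def wrap_keywords(root, words, tag='key'):
    """Wrap whole-word matches found in .text and .tail in new <tag> elements."""
    if not words:
        return root
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b')

    # Snapshot the elements first, because new children are inserted while looping
    for elem in list(root.iter()):
        # 1) Tail of each child, walking backwards so insert positions stay valid
        for index in range(len(elem) - 1, -1, -1):
            child = elem[index]
            new_tail, new_nodes = _split(child.tail, pattern, tag)
            child.tail = new_tail
            for offset, node in enumerate(new_nodes, start=1):
                elem.insert(index + offset, node)
        # 2) Leading text of the element itself
        new_text, new_nodes = _split(elem.text, pattern, tag)
        elem.text = new_text
        for offset, node in enumerate(new_nodes):
            elem.insert(offset, node)
    return root

def _split(text, pattern, tag):
    """Return (leading text, [new elements]) for one text or tail string."""
    if not text:
        return text, []
    nodes, leading, last = [], None, 0
    for match in pattern.finditer(text):
        before = text[last:match.start()]
        if nodes:
            nodes[-1].tail = before   # text between two matches
        else:
            leading = before          # text before the first match
        node = ET.Element(tag)
        node.text = match.group(0)
        nodes.append(node)
        last = match.end()
    if not nodes:
        return text, []
    nodes[-1].tail = text[last:]      # text after the last match
    return leading, nodes

doc = '<root><par>An <fr>example</fr> text with key words one and two</par></root>'
root = wrap_keywords(ET.fromstring(doc), ['one', 'two'])
result = ET.tostring(root, encoding='unicode')
print(result)
```

Because the key words are never substituted into serialized markup, tag names and attribute values are left alone, and words such as "someone" are not split.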
Answer #2 (cx6n0qe3)

Here is an XSLT 2.0 implementation of this task.

Input XML

<?xml version="1.0"?>
<root>
    <par>An <fr>example</fr> text with key words one and two</par>
</root>

XSLT 2.0

<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes" encoding="utf-8"
                omit-xml-declaration="no"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="text()">
        <xsl:call-template name="OneTwoSequence"/>
    </xsl:template>

    <xsl:template name="OneTwoSequence">
        <xsl:param name="string" select="string(.)"/>
        <xsl:analyze-string select="$string" regex="one|two">
            <xsl:matching-substring>
                <key>
                    <xsl:value-of select="."/>
                </key>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>
</xsl:stylesheet>

Output

<?xml version='1.0' encoding='utf-8' ?>
<root>
  <par>An 
    <fr>example</fr> text with key words 
    <key>one</key> and 
    <key>two</key>
  </par>
</root>
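For readers staying in Python, the way `xsl:analyze-string` partitions a text node into matching and non-matching substrings can be approximated with `re.split` and a capturing group. This is only a rough analogue on a single string, not a port of the stylesheet. The names `pieces` and `marked` are illustrative.

```python
import re

text = ' text with key words one and two'

# A capturing group makes re.split() keep the matched key words in the result,
# so non-matching and matching pieces alternate
pieces = re.split(r'(one|two)', text)
print(pieces)  # [' text with key words ', 'one', ' and ', 'two', '']

# Odd positions hold the matches; wrap those and leave the rest untouched
marked = ''.join(
    f'<key>{piece}</key>' if index % 2 else piece
    for index, piece in enumerate(pieces)
)
print(marked)
```

As in the stylesheet, the regex `one|two` has no word boundaries, so it would also match inside longer words.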
Answer #3 (qxgroojn)

You're close, but you need lxml rather than ElementTree to do this:

from lxml import html as lh

doc = '''<root><par>An <fr>example</fr> text with key words one and two</par></root>'''
root = lh.fromstring(doc)

# locate the relevant element
target = root.xpath('//fr')[0]

# convert the relevant element to a string and copy it to a new string;
# that is a necessary step because we're going to have to delete the
# original element (tostring() includes the tail text by default)
target_str = lh.tostring(target).decode()

# make the necessary changes to the string
profs = ['one', 'two']
for word in profs:
    if word in target_str:
        target_str = target_str.replace(word, f'<key>{word}</key>')

# locate the destination for the new element
destination = root.xpath('//par')[0]
# remove the original target
destination.remove(target)
# insert the new string, converted into a new element
destination.insert(0, lh.fromstring(target_str))
print(lh.tostring(root))

The output should be what you expect.
