python-3.x: Parsing XML attributes that have a namespace

Posted by l0oc07j2 on 2023-03-04 in Python

Given the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <id>1</id>
  <title>Example XML</title>
  <published>2021-12-15T00:00:00Z</published>
  <updated>2022-01-06T12:44:47Z</updated>
  <content type="application/xml">
    <articleDoc xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" schemaVersion="1.8" xml:lang="en">
      <articleDocHead>
        <itemInfo/>
      </articleDocHead>
    </articleDoc>
  </content>
</entry>

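How can I get the value of the xml:lang attribute on the entry/content/articleDoc element? I checked the Python documentation, but unfortunately it does not cover attributes that have a namespace. The only solution I found is to write the namespace by hand in front of the attribute name and use that as the dictionary key, which seems wrong. I am using Python 3.9.9.
Here is my code: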

import xml.etree.ElementTree as tree  # xml.etree.cElementTree was removed in Python 3.9

xml = """<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <id>1</id>
  <title>Example XML</title>
  <published>2021-12-15T00:00:00Z</published>
  <updated>2022-01-06T12:44:47Z</updated>
  <content type="application/xml">
    <articleDoc xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" schemaVersion="1.8" xml:lang="en">
      <articleDocHead>
        <itemInfo/>
      </articleDocHead>
    </articleDoc>
  </content>
</entry>"""
ns = {'nitf': 'http://iptc.org/std/NITF/2006-10-18/',
      'w3': 'http://www.w3.org/2005/Atom',
      'xml': 'http://www.w3.org/XML/1998/namespace'}
root = tree.fromstring(xml)
id = root.find("w3:id", ns).text # works
print(id)
type_attribute = root.find("w3:content", ns).attrib['type'] # works
print(type_attribute)

#language = root.find("w3:content/articleDoc/articleDocHeader[xml:lang']", ns) # doesn't work
language = root.find("w3:content/articleDoc", ns).attrib['{http://www.w3.org/XML/1998/namespace}lang'] # works, but seems wrong
print(language)

Any help is appreciated. Thank you very much!

Answer #1, by mf98qq94

Here is a quick guide to navigating an XML document with lxml.etree:

In [2]: import lxml.etree as etree

In [3]: xml = """
   ...:     <entry xmlns="http://www.w3.org/2005/Atom" xmlns:demo="http://www.whatever.com">
   ...:       <id>1</id>
   ...:       <demo:demo_child>some namespace entry</demo:demo_child>
   ...:       <title>Example XML</title>
   ...:       <published>2021-12-15T00:00:00Z</published>
   ...:       <updated>2022-01-06T12:44:47Z</updated>
   ...:       <content type="application/xml">
   ...:         <articleDoc xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" schemaVersion="1.8" xml:lang="en">
   ...:           <articleDocHead>
   ...:             <itemInfo/>
   ...:           </articleDocHead>
   ...:         </articleDoc>
   ...:       </content>
   ...:     </entry>"""

In [4]: tree = etree.fromstring(xml)

In [5]: tree
Out[5]: <Element {http://www.w3.org/2005/Atom}entry at 0x7d010c153800>

In [6]: list(tree.iterchildren())  # get children of current element
Out[6]: 
[<Element {http://www.w3.org/2005/Atom}id at 0x7d010c1b06c0>,
 <Element {http://www.whatever.com}demo_child at 0x7d010c9c54c0>,
 <Element {http://www.w3.org/2005/Atom}title at 0x7d010c9c5180>,
 <Element {http://www.w3.org/2005/Atom}published at 0x7d01233d6cc0>,
 <Element {http://www.w3.org/2005/Atom}updated at 0x7d010c0d4580>,
 <Element {http://www.w3.org/2005/Atom}content at 0x7d010c0d46c0>]

In [7]: print([el.tag for el in tree.iterchildren()])    # get children of current element (human readable)
['{http://www.w3.org/2005/Atom}id', '{http://www.whatever.com}demo_child', '{http://www.w3.org/2005/Atom}title', '{http://www.w3.org/2005/Atom}published', '{http://www.w3.org/2005/Atom}updated', '{http://www.w3.org/2005/Atom}content']

In [8]: print(tree[0] == next(tree.iterchildren()))  # children can also be accessed by index: tree[index]
True

In [9]: tree.find('id')  # FAILS: did not consider default namespace

In [10]: tree.find('{http://www.w3.org/2005/Atom}id')  # now it works
Out[10]: <Element {http://www.w3.org/2005/Atom}id at 0x7d010c1b06c0>

In [11]: tree.find('{http://www.w3.org/2005/Atom}demo_child')  # FAILS: element with non-default namespace

In [12]: tree.find('{http://www.whatever.com}demo_child')  # take proper namespace
Out[12]: <Element {http://www.whatever.com}demo_child at 0x7d010c9c54c0>

In [13]: tree.find(f'{{{tree.nsmap["demo"]}}}demo_child')  # do not spell out full namespace
Out[13]: <Element {http://www.whatever.com}demo_child at 0x7d010c9c54c0>

In [14]: tree.find('{http://www.w3.org/2005/Atom}content').find('articleDoc')  # follow path of elements
Out[14]: <Element articleDoc at 0x7d010c13d9c0>

In [15]: tree.xpath('//tmp_ns:id', namespaces={'tmp_ns': tree.nsmap[None]})  # use xpath, handling default namespace is tedious here
Out[15]: [<Element {http://www.w3.org/2005/Atom}id at 0x7d010c1b06c0>]

In [16]: tree.xpath('//articleDoc')  # find elements not being a direct child
Out[16]: [<Element articleDoc at 0x7d010c13d9c0>]

In [17]: tree.xpath('//@type')  # search for attribute
Out[17]: ['application/xml']

In [18]: tree.xpath('//@xml:lang')  # search for other attribute
Out[18]: ['en']
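
Applied to the original question: the `{namespace-uri}local` key (Clark notation) is how ElementTree stores the names of namespaced attributes, so the lookup that "seems wrong" is the expected form in the standard library. Below is a minimal sketch that uses only the standard-library `xml.etree.ElementTree`. The helper `qname` and the prefix `atom` are made up for illustration and are not part of any library; the XML is a trimmed copy of the document from the question.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map. The prefix names are arbitrary; only the URIs matter.
NAMESPACES = {
    "atom": "http://www.w3.org/2005/Atom",
    "xml": "http://www.w3.org/XML/1998/namespace",
}

DOC = """<entry xmlns="http://www.w3.org/2005/Atom">
  <content type="application/xml">
    <articleDoc xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" schemaVersion="1.8" xml:lang="en">
      <articleDocHead>
        <itemInfo/>
      </articleDocHead>
    </articleDoc>
  </content>
</entry>"""


def qname(prefixed_name, namespaces):
    """Hypothetical helper: turn 'prefix:local' into '{uri}local'."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep:
        # No prefix: attribute names without a prefix have no namespace.
        return prefixed_name
    return f"{{{namespaces[prefix]}}}{local}"


root = ET.fromstring(DOC)

# articleDoc resets the default namespace (xmlns=""), so it needs no prefix.
article_doc = root.find("atom:content/articleDoc", NAMESPACES)

language = article_doc.get(qname("xml:lang", NAMESPACES))
schema_version = article_doc.get(qname("schemaVersion", NAMESPACES))
print(language, schema_version)
```

The helper only hides the string formatting; the key passed to `get` is still the full `{http://www.w3.org/XML/1998/namespace}lang` name that the question already uses.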
