pandas 解析多个XML属性值并附加到 Dataframe /列表

jaql4c8m  于 2023-04-28  发布在  其他
关注(0)|答案(1)|浏览(122)

我目前正在尝试用Python解析一些XML(在问题末尾添加的XML)来获取多个属性的值。最终目标是能够标记,首先是是否有属性,然后是该属性的值。然后我将其与列标题结合起来,以识别该列是否有属性(并非所有列都有)。
进口情况如下:

#---import libraries
import xlwings as xw #---lib to do most of the excel steps
import pandas as pd #---lib to read csv file
import openpyxl as xl #---lib to convert csv to xls
import os #---lib to extract filename for input and folder movement
from datetime import datetime
import xml.etree.ElementTree as Etree

使用下面的代码,我可以看到哪些字段具有属性,然后将文本“custom”添加到头部。

for elem in tree.iter():
    if bool(elem.attrib):
        ea = str(elem.attrib)
        B.update({elem.tag + " (Custom)": ea})
        A.append(B) # appending B to list

    else:
        B.update({elem.tag: elem.attrib})
        A.append(B) # appending B to list

打印时的结果如下。

company department employee name job      salary (Custom)
0      {}         {}       {}   {}  {}  {'datatype': 'int'}

现在,当只有一个属性时,这是可行的,但是当有多个属性时,我需要能够实现类似的结果,我正在处理的示例有两个(“custom”和“datatype”)。所以我需要获得这两个属性,然后将它们添加到dataframe / list中的最终列。
一个示例可能如下所示:
| 作业[string]|Salary(custom)[int]|
| --------------|--------------|
| 1号牢房|二号牢房|
标题下面的数据对我来说并不重要,因为从这里开始,我将只将标题拉入一个新的主Excel文档,以创建最终将成为一种报告定义/数据字典的东西,在那里我可以解析整个组织中使用的所有报告的XML文件。
我只是无法定位我需要的属性。有问题的属性是'Custom',其中值可能是'Yes'和'datatype',其中值可能是'int','string'等。理想情况下,我希望将自定义属性的值添加到列标题中,并使用普通括号()和方括号[]中的数据类型值
XML文件:

<?xml version="1.0" encoding="UTF-8"?>
<company>
    <department>
        <employee>
            <name>
            </name>
            <job>
            </job>
            <salary  custom = 'Yes'>
            </salary>
            <salary  datatype = 'int'>
            </salary>
        </employee>
        <employee>
            <name>
            </name>
            <job>
            </job>
            <salary  custom = 'Yes'>
            </salary>
            <salary  datatype = 'int'>
            </salary>
        </employee>
        <employee>
            <name>
            </name>
            <job>
            </job>
            <salary  custom = 'Yes'>
            </salary>
            <salary  datatype = 'int'>
            </salary>
        </employee>
        <employee>
            <name>
            </name>
            <job>
            </job>
            <salary  custom = 'Yes'>
            </salary>
            <salary  datatype = 'int'>
            </salary>
        </employee>
    </department>
    <department>
        <employee>
            <name>
            </name>
            <job>
            </job>
            <salary  custom = 'Yes'>
            </salary>
            <salary  datatype = 'int'>
            </salary>
        </employee>
        <employee>
            <name>
            </name>
            <job>
            </job>
            <salary  custom = 'Yes'>
            </salary>
            <salary  datatype = 'int'>
            </salary>
        </employee>
        <employee>
            <name>
            </name>
            <job>
            </job>
            <salary  custom = 'Yes'>
            </salary>
            <salary  datatype = 'int'>
            </salary>
        </employee>
    </department>
</company>
2mbi3lxu

2mbi3lxu1#

XSLT为例,这是一种专门用于转换XML文件的语言,pandas.read_xml()使用默认的lxml解析器和stylesheet参数支持这种语言。
下面的XSLT将把内容扁平化到employee级别,绘制祖先元素,并根据属性动态处理salary元素。示例数据包括一个完整的XML,它在pandasxml StackOverflow标签中捕获最高收入者,并使用航海等级。
请注意:括号和圆括号不包括在XML节点名称中,因为它们是不允许的符号。在Python中解析后根据需要进行更改。

XML格式

<?xml version="1.0" encoding="UTF-8"?>
<company name="stackoverflow">
    <department name="python">
        <employee>
            <name>Jezrael</name>
            <job>Captain</job>
            <salary custom="Yes">30463</salary>
            <salary datatype="int">806160</salary>
        </employee>
        <employee>
            <name>Ed Chum</name>
            <job>Chief Mate</job>
            <salary custom="Yes">3058</salary>
            <salary datatype="int">369802</salary>
        </employee>
        <employee>
            <name>Andy Hayden</name>
            <job>Second Mate</job>
            <salary custom="Yes">1959</salary>
            <salary datatype="int">353578</salary>
        </employee>
        <employee>
            <name>cs95</name>
            <job>Third Mate</job>
            <salary custom="Yes">5635</salary>
            <salary datatype="int">369908</salary>
        </employee>
    </department>
    <department name="xml">
        <employee>
            <name>Dimitre Novatchev</name>
            <job>Captain</job>
            <salary custom="Yes">5142</salary>
            <salary datatype="int">239598</salary>
        </employee>
        <employee>
            <name>Jon Skeet</name>
            <job>Chief Mate</job>
            <salary custom="Yes">35587</salary>
            <salary datatype="int">1398836</salary>
        </employee>
        <employee>
            <name>kjhughes</name>
            <job>Second Mate</job>
            <salary custom="Yes">2954</salary>
            <salary datatype="int">104539</salary>
        </employee>
        <employee>
            <name>Michael Kay</name>
            <job>Third Mate</job>
            <salary custom="Yes">8226</salary>
            <salary datatype="int">154024</salary>
        </employee>
    </department>
</company>

XSLT*(保存为.xsl脚本,特殊的.xml文件)*

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- PARSE ONLY employee LEVEL ELEMENTS -->
    <xsl:template match="/company">
     <employees>
       <xsl:apply-templates select="descendant::employee"/>
     </employees>
    </xsl:template>

    <!-- RESTYLE employee ELEMENTS, DRAW DOWN ANCESTORS -->
    <xsl:template match="employee">
     <xsl:copy>
       <company><xsl:value-of select="ancestor::company/@name"/></company>
       <department><xsl:value-of select="ancestor::department/@name"/></department>
       <xsl:copy-of select="name|job"/>
       <xsl:apply-templates select="salary"/>
     </xsl:copy>
    </xsl:template>
    
    <!-- DYNAMICALLY HANDLE salary ELEMENTS -->
    <xsl:template match="salary">
      <xsl:element name="{concat(name(), '_', name(@*), '_', @*)}">
        <xsl:value-of select="text()"/>
      </xsl:element>
    </xsl:template>
</xsl:stylesheet>

Online Demo

Python*(stylesheet需要第三方lxml包)*

import pandas as pd

employees_df = pd.read_xml("myInput.xml", stylesheet="myScript.xsl")

print(employees_df)
         company department               name          job  salary_custom_Yes  salary_datatype_int
0  stackoverflow     python            Jezrael      Captain              30463               806160
1  stackoverflow     python            Ed Chum   Chief Mate               3058               369802
2  stackoverflow     python        Andy Hayden  Second Mate               1959               353578
3  stackoverflow     python               cs95   Third Mate               5635               369908
4  stackoverflow        xml  Dimitre Novatchev      Captain               5142               239598
5  stackoverflow        xml          Jon Skeet   Chief Mate              35587              1398836
6  stackoverflow        xml           kjhughes  Second Mate               2954               104539
7  stackoverflow        xml        Michael Kay   Third Mate               8226               154024

相关问题