pandas / Python: XML to DataFrame with tags nested inside tags

Posted by 8ehkhllq on 2023-06-20 in Python

I have the following XML file:

<pdv_liste>
<pdv id="10" latitude="46" longitude="52" cp="01000" pop="R">
  <city>LA</city>
  <price name="diesel" id="1" maj="2017-01-02T09:37:03" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-03T09:54:58" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-06T12:33:57" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-09T08:59:53" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-10T10:38:39" value="1258"/>
</pdv>
<pdv id="2" latitude="46" longitude="53" cp="01000" pop="R">
  <city>NY</city>
  <price name="diesel" id="1" maj="2017-01-03T09:38:59" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-06T11:19:09" value="1258"/>
</pdv>
</pdv_liste>

I would like to get the following DataFrame:

id latitude longitude city name   maj                 value
10 46       52        LA   diesel 2017-01-02T09:37:03 1258
10 46       52        LA   diesel 2017-01-03T09:54:58 1258
10 46       52        LA   diesel 2017-01-06T12:33:57 1258
10 46       52        LA   diesel 2017-01-09T08:59:53 1258
10 46       52        LA   diesel 2017-01-10T10:38:39 1258
2  46       53        NY   diesel 2017-01-03T09:38:59 1258
2  46       53        NY   diesel 2017-01-06T11:19:09 1258

I tried the following code:

import pandas as pd

df = pd.read_xml("myfile.xml", xpath="//price")

But this gives me the following DataFrame, which is not what I want:

name   id maj                 value
diesel 1  2017-01-02T09:37:03 1258
diesel 1  2017-01-03T09:54:58 1258
diesel 1  2017-01-06T12:33:57 1258
diesel 1  2017-01-09T08:59:53 1258
diesel 1  2017-01-10T10:38:39 1258
diesel 1  2017-01-03T09:38:59 1258
diesel 1  2017-01-06T11:19:09 1258

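There are multiple pdv elements, and the pdv-level columns (id, latitude, longitude, city) are missing. How should I change my code?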

Answer #1 (djp7away)

Here is an option using lxml:

import pandas as pd
from lxml import etree

tree = etree.parse("myfile.xml")

df = pd.DataFrame(
    [
        {
            "id": pdv.attrib["id"],
            "latitude": pdv.attrib["latitude"],
            "longitude": pdv.attrib["longitude"],
            "city": pdv.findtext("city"),
            "name": price.attrib["name"],
            "maj": price.attrib["maj"],
            "value": price.attrib["value"]
        }
        for pdv in tree.getroot().findall(".//pdv")
        for price in pdv.findall("price")
    ]
)

Output:

print(df)

   id latitude longitude city    name                  maj value
0  10       46        52   LA  diesel  2017-01-02T09:37:03  1258
1  10       46        52   LA  diesel  2017-01-03T09:54:58  1258
2  10       46        52   LA  diesel  2017-01-06T12:33:57  1258
3  10       46        52   LA  diesel  2017-01-09T08:59:53  1258
4  10       46        52   LA  diesel  2017-01-10T10:38:39  1258
5   2       46        53   NY  diesel  2017-01-03T09:38:59  1258
6   2       46        53   NY  diesel  2017-01-06T11:19:09  1258
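Note that XML attribute values are read as strings, so every column in the frame above has object dtype. Below is a minimal sketch of the same row-building idea followed by a dtype conversion. It assumes the standard library's xml.etree.ElementTree in place of lxml and a shortened inline copy of the XML, so that it runs on its own. The names xml_text and numeric_cols are illustrative and do not appear in the original answer.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Shortened inline copy of the question's XML (assumption: same structure as myfile.xml)
xml_text = """<pdv_liste>
<pdv id="10" latitude="46" longitude="52" cp="01000" pop="R">
  <city>LA</city>
  <price name="diesel" id="1" maj="2017-01-02T09:37:03" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-03T09:54:58" value="1258"/>
</pdv>
<pdv id="2" latitude="46" longitude="53" cp="01000" pop="R">
  <city>NY</city>
  <price name="diesel" id="1" maj="2017-01-03T09:38:59" value="1258"/>
</pdv>
</pdv_liste>"""

root = ET.fromstring(xml_text)

# One row per <price>, carrying the attributes of its parent <pdv>
rows = [
    {
        "id": pdv.get("id"),
        "latitude": pdv.get("latitude"),
        "longitude": pdv.get("longitude"),
        "city": pdv.findtext("city"),
        "name": price.get("name"),
        "maj": price.get("maj"),
        "value": price.get("value"),
    }
    for pdv in root.iter("pdv")
    for price in pdv.findall("price")
]
df = pd.DataFrame(rows)

# Attribute values arrive as strings; convert them explicitly
numeric_cols = ["id", "latitude", "longitude", "value"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)
df["maj"] = pd.to_datetime(df["maj"])

print(df.dtypes)
```

Converting after the frame is built keeps the parsing step simple. If the real file contains missing or malformed values, pd.to_numeric and pd.to_datetime both accept errors="coerce" to turn them into NaN/NaT instead of raising.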

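If you would rather let Excel parse the XML and save it as a spreadsheet that you can then load with read_excel:
Open a new workbook, choose Open/Browse/myfile.xml, and click OK.

Then confirm with another OK so that Excel tries to infer the schema.

Finally, a new spreadsheet/table is created.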

Answer #2 (q1qsirdb)

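From the pandas documentation:
"For more complex XML documents, stylesheet allows you to temporarily redesign the original document with XSLT (a special-purpose language) for a flatter version for migration to DataFrame."
So one solution is to denormalize the XML with an XSLT stylesheet. Here is how to use it with pandas: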

import pandas as pd

x_s = '''<?xml version="1.0" encoding="UTF-8"?>
<pdv_liste>
<pdv id="10" latitude="46" longitude="52" cp="01000" pop="R">
  <city>LA</city>
  <price name="diesel" id="1" maj="2017-01-02T09:37:03" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-03T09:54:58" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-06T12:33:57" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-09T08:59:53" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-10T10:38:39" value="1258"/>
</pdv>
<pdv id="2" latitude="46" longitude="53" cp="01000" pop="R">
  <city>NY</city>
  <price name="diesel" id="1" maj="2017-01-03T09:38:59" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-06T11:19:09" value="1258"/>
</pdv>
</pdv_liste>'''

df_style = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
    <xsl:template match="/pdv_liste">
        <xsl:variable name="var1_initial" select="."/>
        <result>
            <xsl:for-each select="pdv">
                <xsl:variable name="var2_level" select="."/>
                <xsl:for-each select="price">
                    <xsl:variable name="var3_level" select="."/>
                    <ROW>
                        <id>
                            <xsl:value-of select="$var2_level/@id"/>
                        </id>
                        <latitude>
                            <xsl:value-of select="$var2_level/@latitude"/>
                        </latitude>
                        <longitude>
                            <xsl:value-of select="$var2_level/@longitude"/>
                        </longitude>
                        <name>
                            <xsl:value-of select="$var3_level/@name"/>
                        </name>
                        <maj>
                            <xsl:value-of select="$var3_level/@maj"/>
                        </maj>
                        <value>
                            <xsl:value-of select="$var3_level/@value"/>
                        </value>
                        <city>
                            <xsl:value-of select="$var2_level/city"/>
                        </city>
                    </ROW>
                </xsl:for-each>
            </xsl:for-each>
        </result>
    </xsl:template>
</xsl:stylesheet>'''

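# pandas 2.1+ deprecates passing literal XML strings, so wrap them in StringIO
from io import StringIO

df = pd.read_xml(StringIO(x_s), stylesheet=StringIO(df_style))
df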

This gives:

   id  latitude  longitude    name                  maj  value city
0  10        46         52  diesel  2017-01-02T09:37:03   1258   LA
1  10        46         52  diesel  2017-01-03T09:54:58   1258   LA
2  10        46         52  diesel  2017-01-06T12:33:57   1258   LA
3  10        46         52  diesel  2017-01-09T08:59:53   1258   LA
4  10        46         52  diesel  2017-01-10T10:38:39   1258   LA
5   2        46         53  diesel  2017-01-03T09:38:59   1258   NY
6   2        46         53  diesel  2017-01-06T11:19:09   1258   NY
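To see what the stylesheet does before pandas reads the result, the transformation can be run directly with lxml and the flattened XML inspected. The sketch below assumes a shortened copy of the XML and a more compact stylesheet that loops over pdv/price and reaches the parent with `..` instead of using variables. The names xml_doc, xslt_doc, transform and flat are illustrative only.

```python
from lxml import etree

# Shortened copy of the question's XML (assumption: same structure as the full file)
xml_doc = etree.fromstring(b"""<pdv_liste>
<pdv id="10" latitude="46" longitude="52">
  <city>LA</city>
  <price name="diesel" maj="2017-01-02T09:37:03" value="1258"/>
  <price name="diesel" maj="2017-01-03T09:54:58" value="1258"/>
</pdv>
<pdv id="2" latitude="46" longitude="53">
  <city>NY</city>
  <price name="diesel" maj="2017-01-03T09:38:59" value="1258"/>
</pdv>
</pdv_liste>""")

# Compact stylesheet: one <ROW> per <price>, with parent <pdv> data copied in
xslt_doc = etree.fromstring(b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/pdv_liste">
    <result>
      <xsl:for-each select="pdv/price">
        <ROW>
          <id><xsl:value-of select="../@id"/></id>
          <city><xsl:value-of select="../city"/></city>
          <maj><xsl:value-of select="@maj"/></maj>
          <value><xsl:value-of select="@value"/></value>
        </ROW>
      </xsl:for-each>
    </result>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(xslt_doc)   # compile the stylesheet
flat = transform(xml_doc)          # apply it; the result is a new, flat tree

# Each <ROW> has only simple child elements, which is the shape read_xml expects
rows = flat.getroot().findall("ROW")
print(etree.tostring(flat, pretty_print=True).decode())
```

Each ROW in the printed document contains only leaf elements, so pd.read_xml can map every ROW to one DataFrame row without needing to look at ancestors.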
