python XML文件到Pandas数据框

c0vxltue  于 2022-12-25  发布在  Python
关注(0)|答案(4)|浏览(162)

我在将XML转换为DataFrame时遇到了一个问题。我有以下示例XML:

<Fruits>
    <Fruit ReferenceDate="2022-09-22"
                FruitName="Apple">
        <Identifier FruitIdentifier="111"
                    FruitBrand="GoldenApple"/>
        <FruitInformation Country="Turkey"
                          Colour="Green"/>
        <CompanyInformation CompanyName="GlobalFruits"
                            Location="USA"/>
        <Languages>
            <LanguageDependent CountryId="GB"
                               LanguageId="EN">
                <FreeText1>Sample sentence 1.</FreeText1>
                <FreeText2>Sample sentence 2.</FreeText2>
            </LanguageDependent>
        </Languages>
    </Fruit>
    <Fruit ReferenceDate="2022-09-22"
                FruitName="Orange">
        <Identifier FruitIdentifier="222"
                    FruitBrand="BestOrange"/>
        <FruitInformation Country="Egypt"
                          Colour="Orange"/>
        <CompanyInformation CompanyName="FreshFood"
                            Location="UK"/>
        <Languages>
            <LanguageDependent CountryId="GB"
                               LanguageId="EN">
                <FreeText1>Sample sentence 3.</FreeText1>
                <FreeText2>Sample sentence 4.</FreeText2>
            </LanguageDependent>
        </Languages>
    </Fruit>
</Fruits>

我想把它转换成DataFrame。最终的表格应该看起来像下图中的表格:

如果这是一个重复的问题,我先表示歉意,但我没有找到适合我的答案。
到目前为止,我有以下代码:

import pandas as pd 
import xml.etree.ElementTree as et 

xtree = et.parse("fruits.xml")
xroot = xtree.getroot() 

df_cols = ["ReferenceDate", "FruitName", "FruitIdentifier",
           "FruitBrand", "Country", "Colour", "CompanyName",
           "Location", "CountryId", "LanguageId"]
rows = []

for node in xroot.iter(): 
    ReferenceDate = node.attrib.get("ReferenceDate")
    FruitName = node.attrib.get("FruitName")
    FruitIdentifier = node.attrib.get("FruitIdentifier")
    FruitBrand = node.attrib.get("FruitBrand")
    Country = node.attrib.get("Country")
    Colour = node.attrib.get("Colour")
    CompanyName = node.attrib.get("CompanyName")
    Location = node.attrib.get("Location")
    CountryId = node.attrib.get("CountryId")
    LanguageId = node.attrib.get("LanguageId")
    
    rows.append({"ReferenceDate": ReferenceDate, "FruitName": FruitName, 
                  "FruitIdentifier": FruitIdentifier, "FruitBrand": FruitBrand,
                  "Country": Country, "Colour": Colour, "CompanyName": CompanyName, "Location": Location,
                  "CountryId": CountryId, "LanguageId": LanguageId})

out_df = pd.DataFrame(rows, columns = df_cols)

我有两个主要问题:
1.无法获取文本(自由文本1和自由文本2);
1.子查询中的每组属性都有自己的行。

ifsvaxew

ifsvaxew1#

下面的作品

import pandas as pd
import xml.etree.ElementTree as ET

xml = '''<Fruits>
    <Fruit ReferenceDate="2022-09-22"
                FruitName="Apple">
        <Identifier FruitIdentifier="111"
                    FruitBrand="GoldenApple"/>
        <FruitInformation Country="Turkey"
                          Colour="Green"/>
        <CompanyInformation CompanyName="GlobalFruits"
                            Location="USA"/>
        <Languages>
            <LanguageDependent CountryId="GB"
                               LanguageId="EN">
                <FreeText1>Sample sentence 1.</FreeText1>
                <FreeText2>Sample sentence 2.</FreeText2>
            </LanguageDependent>
        </Languages>
    </Fruit>
    <Fruit ReferenceDate="2022-09-22"
                FruitName="Orange">
        <Identifier FruitIdentifier="222"
                    FruitBrand="BestOrange"/>
        <FruitInformation Country="Egypt"
                          Colour="Orange"/>
        <CompanyInformation CompanyName="FreshFood"
                            Location="UK"/>
        <Languages>
            <LanguageDependent CountryId="GB"
                               LanguageId="EN">
                <FreeText1>Sample sentence 3.</FreeText1>
                <FreeText2>Sample sentence 4.</FreeText2>
            </LanguageDependent>
        </Languages>
    </Fruit>
</Fruits>'''

root = ET.fromstring(xml)
rows = []

for f in root.findall('.//Fruit'):
    ReferenceDate = f.attrib.get("ReferenceDate")
    FruitName = f.attrib.get("FruitName")
    FruitIdentifier = f.find("Identifier").attrib.get("FruitIdentifier")
    FruitBrand = f.find("Identifier").attrib.get("FruitBrand")
    Country = f.find("FruitInformation").attrib.get("Country")
    Colour = f.find("FruitInformation").attrib.get("Colour")
    CompanyName = f.find("CompanyInformation").attrib.get("CompanyName")
    Location = f.find("CompanyInformation").attrib.get("Location")
    CountryId = f.find(".//LanguageDependent").attrib.get("CountryId")
    LanguageId = f.find(".//LanguageDependent").attrib.get("LanguageId")
    FreeText1 = f.find(".//LanguageDependent/FreeText1").text
    FreeText2 = f.find(".//LanguageDependent/FreeText2").text

    rows.append({"ReferenceDate": ReferenceDate, "FruitName": FruitName,
                 "FruitIdentifier": FruitIdentifier, "FruitBrand": FruitBrand,
                 "Country": Country, "Colour": Colour, "CompanyName": CompanyName, "Location": Location,
                 "CountryId": CountryId, "LanguageId": LanguageId, "FreeText1": FreeText1,
                 "FreeText2": FreeText2})

df = pd.DataFrame(rows)
print(df)

输出

ReferenceDate FruitName  ...           FreeText1           FreeText2
0    2022-09-22     Apple  ...  Sample sentence 1.  Sample sentence 2.
1    2022-09-22    Orange  ...  Sample sentence 3.  Sample sentence 4.

[2 rows x 12 columns]
jvidinwx

jvidinwx2#

虽然XML对于单个pandas.read_xml来说不够浅,但是对于可以水平合并的多个调用来说,所需的数据是足够一致的:

fruits_df = (
    pd.concat([
        pd.read_xml("fruits.xml", xpath=".//Fruit", attrs_only = True, parser="etree"),
        pd.read_xml("fruits.xml", xpath=".//Identifier", attrs_only = True, parser="etree"),
        pd.read_xml("fruits.xml", xpath=".//FruitInformation", attrs_only = True, parser="etree"),
        pd.read_xml("fruits.xml", xpath=".//CompanyInformation", attrs_only = True, parser="etree"),
        pd.read_xml("fruits.xml", xpath=".//LanguageDependent", parser="etree")
       ], 
       axis=1
    )
)

或者,通过列表解析:

fruits_df = (
    pd.concat(
        [
            pd.read_xml("fruits.xml", xpath=f".//{elem}", parser="etree")
            for elem in
            ["Fruit", "Identifier", "FruitInformation", "CompanyInformation", "LanguageDependent"]
       ], 
       axis=1
    ).dropna(axis="columns")
)

产出

ReferenceDate FruitName  FruitIdentifier   FruitBrand Country  Colour   CompanyName Location CountryId LanguageId           FreeText1           FreeText2
0    2022-09-22     Apple              111  GoldenApple  Turkey   Green  GlobalFruits      USA        GB         EN  Sample sentence 1.  Sample sentence 2.
1    2022-09-22    Orange              222   BestOrange   Egypt  Orange     FreshFood       UK        GB         EN  Sample sentence 3.  Sample sentence 4.
0s0u357o

0s0u357o3#

更短、更通用的实现:

import xml.etree.ElementTree as ET
import pandas as pd

root = ET.parse("fruits.xml")

rows = []
for fruit in root:  # <Fruit> is child node of root node <Fruits>
    row = {}
    for node in fruit.iter():
        row.update(node.attrib)
        if node.text and not node.text.isspace():
            row[node.tag] = node.text
    rows.append(row)
            
df = pd.DataFrame(rows)

此实现使用下一个逻辑:

  • 将每个子节点的属性添加到row字典中;
  • 如果当前节点有一些非空白文本,我们使用标签名称作为键将此文本写入row
gc0ot86w

gc0ot86w4#

尝试以下powershell脚本:

using assembly System 
using assembly System.Xml.Linq 
using assembly System.Data

$dt = New-Object System.Data.DataTable("Fruits")
$dt.Columns.Add("ReferenceDate", [DateTime]) | Out-Null
$dt.Columns.Add("FruitName", [string]) | Out-Null
$dt.Columns.Add("FruitIdentifier", [string]) | Out-Null
$dt.Columns.Add("FruitBrand", [string]) | Out-Null
$dt.Columns.Add("Country", [string]) | Out-Null
$dt.Columns.Add("Colour", [string]) | Out-Null
$dt.Columns.Add("CompanyName", [string]) | Out-Null
$dt.Columns.Add("Location", [string]) | Out-Null
$dt.Columns.Add("CountryId", [string]) | Out-Null
$dt.Columns.Add("LanguageId", [string]) | Out-Null
$dt.Columns.Add("FreeText1", [string]) | Out-Null
$dt.Columns.Add("FreeText2", [string]) | Out-Null

$Filename = "c:\temp\test.xml"
$xDoc = [System.Xml.Linq.XDocument]::Load($Filename)
$fruits = $xDoc.Descendants("Fruit")

foreach($fruit in $fruits)
{
   $row = $dt.NewRow()
   $dt.Rows.Add($row)

   $date = [DateTime]$fruit.Attribute("ReferenceDate")
   $row.ReferenceDate = $date

   $name = $fruit.Attribute("FruitName").Value
   $row.FruitName = $name

   $identifier = $fruit.Element("Identifier")
   $fruitIdentifier = $identifier.Attribute("FruitIdentifier").Value
   $row.FruitIdentifier = $fruitIdentifier
   $fruitBrand = $identifier.Attribute("FruitBrand").Value
   $row.FruitBrand = $fruitBrand
   
   $fruitInformation = $fruit.Element("FruitInformation")
   $country = $fruitInformation.Attribute("Country").Value
   $row.country = $country
   $colour = $fruitInformation.Attribute("Colour").Value
   $row.Colour = $colour
   
   $companyInformation = $fruit.Element("CompanyInformation")
   $companyName = $companyInformation.Attribute("CompanyName").Value
   $row.CompanyName = $companyName
   $location = $companyInformation.Attribute("Location").Value
   $row.Location = $location

   $languageDependent = $fruit.Descendants("LanguageDependent")
   $countryId = $languageDependent.Attribute("CountryId").Value
   $row.CountryId = $countryId
   $languageId = $languageDependent.Attribute("LanguageId").Value
   $row.languageId = $languageId
   $freeText1 = $languageDependent.Element("FreeText1").Value
   $row.FreeText1 = $freeText1
   $freeText2 = $languageDependent.Element("FreeText2").Value
   $row.FreeText2 = $freeText2

 
}
# convert the datatable to an array of strings
$table = $dt | Format-Table | Out-String -Stream

foreach ($row in $table) {
   Write-Host $row
}

相关问题