I ran into a problem converting XML to a DataFrame. I have the following sample XML:
<Fruits>
  <Fruit ReferenceDate="2022-09-22"
         FruitName="Apple">
    <Identifier FruitIdentifier="111"
                FruitBrand="GoldenApple"/>
    <FruitInformation Country="Turkey"
                      Colour="Green"/>
    <CompanyInformation CompanyName="GlobalFruits"
                        Location="USA"/>
    <Languages>
      <LanguageDependent CountryId="GB"
                         LanguageId="EN">
        <FreeText1>Sample sentence 1.</FreeText1>
        <FreeText2>Sample sentence 2.</FreeText2>
      </LanguageDependent>
    </Languages>
  </Fruit>
  <Fruit ReferenceDate="2022-09-22"
         FruitName="Orange">
    <Identifier FruitIdentifier="222"
                FruitBrand="BestOrange"/>
    <FruitInformation Country="Egypt"
                      Colour="Orange"/>
    <CompanyInformation CompanyName="FreshFood"
                        Location="UK"/>
    <Languages>
      <LanguageDependent CountryId="GB"
                         LanguageId="EN">
        <FreeText1>Sample sentence 3.</FreeText1>
        <FreeText2>Sample sentence 4.</FreeText2>
      </LanguageDependent>
    </Languages>
  </Fruit>
</Fruits>
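I want to convert it into a DataFrame. The final table should look like the table in the image below:
Apologies in advance if this is a duplicate question, but I could not find an answer that works for me.
So far I have the following code: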
import pandas as pd
import xml.etree.ElementTree as et
xtree = et.parse("fruits.xml")
xroot = xtree.getroot()
df_cols = ["ReferenceDate", "FruitName", "FruitIdentifier",
           "FruitBrand", "Country", "Colour", "CompanyName",
           "Location", "CountryId", "LanguageId"]
rows = []
for node in xroot.iter():
    ReferenceDate = node.attrib.get("ReferenceDate")
    FruitName = node.attrib.get("FruitName")
    FruitIdentifier = node.attrib.get("FruitIdentifier")
    FruitBrand = node.attrib.get("FruitBrand")
    Country = node.attrib.get("Country")
    Colour = node.attrib.get("Colour")
    CompanyName = node.attrib.get("CompanyName")
    Location = node.attrib.get("Location")
    CountryId = node.attrib.get("CountryId")
    LanguageId = node.attrib.get("LanguageId")
    rows.append({"ReferenceDate": ReferenceDate, "FruitName": FruitName,
                 "FruitIdentifier": FruitIdentifier, "FruitBrand": FruitBrand,
                 "Country": Country, "Colour": Colour,
                 "CompanyName": CompanyName, "Location": Location,
                 "CountryId": CountryId, "LanguageId": LanguageId})
out_df = pd.DataFrame(rows, columns=df_cols)
I have two main problems:
1. I cannot get the text (FreeText1 and FreeText2);
2. Each set of attributes from a child element ends up in its own row.
4 answers
ifsvaxew #1
The following works:
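A minimal sketch of this kind of approach, assuming the XML from the question is available as a string. The names `XML` and `fruit_rows` are illustrative and not taken from the answer. It walks each `<Fruit>` element once, merges the attributes of its children into a single dict, and reads the two text elements with `findtext`, which addresses both problems in the question:

```python
import xml.etree.ElementTree as ET

# Compact copy of the question's XML so the sketch runs on its own.
XML = """<Fruits>
  <Fruit ReferenceDate="2022-09-22" FruitName="Apple">
    <Identifier FruitIdentifier="111" FruitBrand="GoldenApple"/>
    <FruitInformation Country="Turkey" Colour="Green"/>
    <CompanyInformation CompanyName="GlobalFruits" Location="USA"/>
    <Languages>
      <LanguageDependent CountryId="GB" LanguageId="EN">
        <FreeText1>Sample sentence 1.</FreeText1>
        <FreeText2>Sample sentence 2.</FreeText2>
      </LanguageDependent>
    </Languages>
  </Fruit>
  <Fruit ReferenceDate="2022-09-22" FruitName="Orange">
    <Identifier FruitIdentifier="222" FruitBrand="BestOrange"/>
    <FruitInformation Country="Egypt" Colour="Orange"/>
    <CompanyInformation CompanyName="FreshFood" Location="UK"/>
    <Languages>
      <LanguageDependent CountryId="GB" LanguageId="EN">
        <FreeText1>Sample sentence 3.</FreeText1>
        <FreeText2>Sample sentence 4.</FreeText2>
      </LanguageDependent>
    </Languages>
  </Fruit>
</Fruits>"""


def fruit_rows(xml_text):
    """Return one dict per <Fruit>, merging its attributes with its children's."""
    rows = []
    for fruit in ET.fromstring(xml_text).findall("Fruit"):
        row = dict(fruit.attrib)  # ReferenceDate, FruitName
        for tag in ("Identifier", "FruitInformation", "CompanyInformation"):
            child = fruit.find(tag)
            if child is not None:
                row.update(child.attrib)
        lang = fruit.find("Languages/LanguageDependent")
        if lang is not None:
            row.update(lang.attrib)  # CountryId, LanguageId
            row["FreeText1"] = lang.findtext("FreeText1")
            row["FreeText2"] = lang.findtext("FreeText2")
        rows.append(row)
    return rows


rows = fruit_rows(XML)
```

Passing the result to `pd.DataFrame(rows)` should then give one row per fruit. The sketch assumes each `<Fruit>` has at most one `<LanguageDependent>` element; with several, only the first would be read.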
Output:
jvidinwx #2
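Although the XML is not shallow enough for a single pandas.read_xml call, the data you need is regular enough to be read with several calls whose results are then merged horizontally. Alternatively, with a list comprehension: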
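A rough sketch of the multiple-call idea, assuming pandas 1.3 or later (where `read_xml` is available) and assuming every `<Fruit>` contains exactly one of each child element, so the frames line up by row position. The helper `read` and the variable names are illustrative; `parser="etree"` is chosen here only so that `lxml` is not required:

```python
from io import StringIO

import pandas as pd

# Compact copy of the question's XML so the sketch runs on its own.
XML = """<Fruits>
  <Fruit ReferenceDate="2022-09-22" FruitName="Apple">
    <Identifier FruitIdentifier="111" FruitBrand="GoldenApple"/>
    <FruitInformation Country="Turkey" Colour="Green"/>
    <CompanyInformation CompanyName="GlobalFruits" Location="USA"/>
    <Languages>
      <LanguageDependent CountryId="GB" LanguageId="EN">
        <FreeText1>Sample sentence 1.</FreeText1>
        <FreeText2>Sample sentence 2.</FreeText2>
      </LanguageDependent>
    </Languages>
  </Fruit>
  <Fruit ReferenceDate="2022-09-22" FruitName="Orange">
    <Identifier FruitIdentifier="222" FruitBrand="BestOrange"/>
    <FruitInformation Country="Egypt" Colour="Orange"/>
    <CompanyInformation CompanyName="FreshFood" Location="UK"/>
    <Languages>
      <LanguageDependent CountryId="GB" LanguageId="EN">
        <FreeText1>Sample sentence 3.</FreeText1>
        <FreeText2>Sample sentence 4.</FreeText2>
      </LanguageDependent>
    </Languages>
  </Fruit>
</Fruits>"""


def read(xpath, **kwargs):
    """Read the nodes matched by `xpath` into a DataFrame (hypothetical helper)."""
    return pd.read_xml(StringIO(XML), xpath=xpath, parser="etree", **kwargs)


# <Fruit> is read with attrs_only=True so its child elements do not become columns.
frames = [read(".//Fruit", attrs_only=True)]
# The remaining levels are read with a list comprehension, one call per element name.
frames += [
    read(f".//{tag}")
    for tag in ("Identifier", "FruitInformation", "CompanyInformation", "LanguageDependent")
]

# Merge the per-level frames horizontally; rows are aligned by position.
df = pd.concat(frames, axis=1)
```

Because the frames are joined purely by row order, this only holds while the document stays regular; a `<Fruit>` with a missing or repeated child would shift the rows out of alignment.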
Output:
0s0u357o #3
A shorter, more general implementation:
This implementation uses the following logic: the values found for each record are collected into a `row` dictionary.
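One possible reading of a generic, `row`-dictionary-based approach is sketched here. It is not necessarily the answerer's code: the function name `flatten` is hypothetical, and it assumes that attribute and tag names are unique within one `<Fruit>`, since later values overwrite earlier ones with the same key:

```python
import xml.etree.ElementTree as ET

# Compact copy of the question's XML so the sketch runs on its own.
XML = """<Fruits>
  <Fruit ReferenceDate="2022-09-22" FruitName="Apple">
    <Identifier FruitIdentifier="111" FruitBrand="GoldenApple"/>
    <FruitInformation Country="Turkey" Colour="Green"/>
    <CompanyInformation CompanyName="GlobalFruits" Location="USA"/>
    <Languages>
      <LanguageDependent CountryId="GB" LanguageId="EN">
        <FreeText1>Sample sentence 1.</FreeText1>
        <FreeText2>Sample sentence 2.</FreeText2>
      </LanguageDependent>
    </Languages>
  </Fruit>
  <Fruit ReferenceDate="2022-09-22" FruitName="Orange">
    <Identifier FruitIdentifier="222" FruitBrand="BestOrange"/>
    <FruitInformation Country="Egypt" Colour="Orange"/>
    <CompanyInformation CompanyName="FreshFood" Location="UK"/>
    <Languages>
      <LanguageDependent CountryId="GB" LanguageId="EN">
        <FreeText1>Sample sentence 3.</FreeText1>
        <FreeText2>Sample sentence 4.</FreeText2>
      </LanguageDependent>
    </Languages>
  </Fruit>
</Fruits>"""


def flatten(node, row):
    """Recursively copy attributes and leaf text of `node` and its descendants into `row`."""
    row.update(node.attrib)
    # Only leaf elements with non-blank text contribute a tag -> text entry.
    if len(node) == 0 and node.text and node.text.strip():
        row[node.tag] = node.text.strip()
    for child in node:
        flatten(child, row)
    return row


# One fresh `row` dictionary per <Fruit> element.
rows = [flatten(fruit, {}) for fruit in ET.fromstring(XML).iter("Fruit")]
```

The resulting list of dicts can be handed to `pd.DataFrame(rows)`. The trade-off against an explicit per-tag version is that nothing here names the columns in advance, so new attributes show up automatically, but name collisions are silently resolved in favour of the last value seen.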
gc0ot86w #4
Try the following PowerShell script: