pandas 我已将销售数据从电子商务站点导出到XML,我需要帮助使用python提取订单中嵌套的特定信息

chhkpiq4  于 2023-03-11  发布在  Python
关注(0)|答案(2)|浏览(78)

我想先说我没有XML文件的工作经验。我只是在过去的几天里开始研究这些数据,并有一些代码,有点工作,它只是需要一些调整。我的代码在这里:

import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET

#parses the orders.xml file and stores in the "tree" variable
tree = ET.parse("orders (1).xml")

#gets the root of the parsed tree
root = tree.getroot()

#creating lists to store xml data in
OrderID = []
CustomerID = []
SKU = []
Price = []
Quantity = []

#goes through the xml file looking for the specified tag and stores the value into the list created above
for orderid in root.iter('OrderId'):
    OrderID.append(orderid.text)

for customerid in root.iter('CustomerId'):
   CustomerID.append(customerid.text)

for sku in root.iter('Sku'):
   SKU.append(sku.text)

for price in root.iter('PriceInclTax'):
   Price.append(price.text)

for quantity in root.iter('Quantity'):
    Quantity.append(quantity.text)




# writes the data stored in the lists above to a pandas dataframe
df = pd.DataFrame(
    list(zip(OrderID, CustomerID, SKU, Price, Quantity)), columns=['Order ID', 'Customer ID', 'SKU', 'Price', 'Quantity']
)

# prints the dataframe
print(df)

结果如下:

Order ID Customer ID                SKU          Price Quantity
0       518    18321022    2BF207HT750-150        83.2300        8
1       517    18265518    4BF209HT750-175       107.5200        2
2       516    18344682       HUCP211-35KX       560.0000        2
3       515    18344503        6206-2RC4KX        58.8300        1
4       514    18352425        6206-ZZC4KX        47.8900       13
5       513    18265596       HUCF206-19KX       242.0000        8
6       512    18352395    4BF209HT750-175       352.7800       10
7       511    18266176  4BF209SHT750-175P       825.7200       10
8       510    18351938         SUCF208-24  28251299.4606        4
9       509    18265518             125855       567.4500        1
10      508    18351852        6003-2RC4KX        73.4686        1
11      506    18350111    4BF212HT750-250       152.4800        4
12      505    18327149          6200-2RSH         4.4600        6
13      504    18351172        6205-2RC4KX        68.0700        1
14      503    18266179        6203-2RC4KX        62.3800        1
15      502    18266179    4BF210HT750-200       162.2300        6
16      501    18265517        6204-ZZC4KX        34.8400        1
17      500    18265548        CG3-7000-NE         0.0000        1
18      499    18348139         CG3-7000-E         0.0000        1
19      498    18321909         CG3-7000-E      8920.0000        1
20      497    18337101        CG3-7000-NE     10770.0000        1
21      495    18348878    4BF207HT750-150       281.6700       24
22      494    18348911          UCF211-35       185.4930        6
23      493    18321909        HUC207-23KX       276.3900        2
24      492    18265594  4BF316SHT100-400P      1801.3900       20
25      491    18265533  2PB316SHT750-400P      1651.4400        1
26      490    18347438         SUCF205-16       134.8800        2

我遇到的问题是其中一些订单上有多个项目,这并没有反映出这一点。每个订单的项目数据存储在标记中,然后。我的代码没有说明每个订单可以有多个。下面是订单ID 515的XML文档的源代码

<Order>
        <OrderId>515</OrderId>
        <OrderGuid></OrderGuid>
        <StoreId>1</StoreId>
        <CustomerId></CustomerId>
        <OrderStatusId></OrderStatusId>
        <PaymentStatusId></PaymentStatusId>
        <ShippingStatusId></ShippingStatusId>
        <CustomerLanguageId></CustomerLanguageId>
        <CustomerTaxDisplayTypeId></CustomerTaxDisplayTypeId>
        <CustomerIp></CustomerIp>
        <OrderSubtotalInclTax>681.4000</OrderSubtotalInclTax>
        <OrderSubtotalExclTax>681.4000</OrderSubtotalExclTax>
        <OrderSubTotalDiscountInclTax>0.0000</OrderSubTotalDiscountInclTax>
        <OrderSubTotalDiscountExclTax>0.0000</OrderSubTotalDiscountExclTax>
        <OrderShippingInclTax>0.0000</OrderShippingInclTax>
        <OrderShippingExclTax>0.0000</OrderShippingExclTax>
        <PaymentMethodAdditionalFeeInclTax>0.0000</PaymentMethodAdditionalFeeInclTax>
        <PaymentMethodAdditionalFeeExclTax>0.0000</PaymentMethodAdditionalFeeExclTax>
        <TaxRates>0.00:0.00;   </TaxRates>
        <OrderTax>0.0000</OrderTax>
        <OrderTotal>681.4000</OrderTotal>
        <RefundedAmount>0.0000</RefundedAmount>
        <OrderDiscount>0.0000</OrderDiscount>
        <CurrencyRate>1.0000</CurrencyRate>
        <CustomerCurrencyCode>USD</CustomerCurrencyCode>
        <AffiliateId>0</AffiliateId>
        <AllowStoringCreditCardNumber>False</AllowStoringCreditCardNumber>
        <CardType />
        <CardName />
        <CardNumber />
        <MaskedCreditCardNumber />
        <CardCvv2 />
        <CardExpirationMonth />
        <CardExpirationYear />
        <PaymentMethodSystemName>Payments.PurchaseOrder</PaymentMethodSystemName>
        <AuthorizationTransactionId />
        <AuthorizationTransactionCode />
        <AuthorizationTransactionResult />
        <CaptureTransactionId />
        <CaptureTransactionResult />
        <SubscriptionTransactionId />
        <PaidDateUtc>03/08/2023 21:29:31</PaidDateUtc>
        <ShippingMethod>UPS Next Day Air®</ShippingMethod>
    <ShippingRateComputationMethodSystemName>Shipping.Director</ShippingRateComputationMethodSystemName>
        <CustomValuesXml></CustomValuesXml>
        <VatNumber />
        <Deleted>False</Deleted>
        <CreatedOnUtc>3/8/2023 9:29:31 PM</CreatedOnUtc>
        <OrderItems>
            <OrderItem>
                <Id>633</Id>
                <OrderItemGuid>1cb035ca-73c8-4124-94d9-c14c54b669d8</OrderItemGuid>
                <Name>30mm 6206 High Temperature 220 °C Sealed C4 Bearing</Name>
                <Sku>6206-2RC4KX</Sku>
                <PriceExclTax>58.8300</PriceExclTax>
                <PriceInclTax>58.8300</PriceInclTax>
                <Quantity>1</Quantity>
                <DiscountExclTax>0.0000</DiscountExclTax>
                <DiscountInclTax>0.0000</DiscountInclTax>
                <TotalExclTax>58.8300</TotalExclTax>
                <TotalInclTax>58.8300</TotalInclTax>
            </OrderItem>
            <OrderItem>
                <Id>634</Id>
                <OrderItemGuid>1a4a005f-e669-4420-9379-611e5b3f637c</OrderItemGuid>
                <Name>30mm 6206 High Temperature 250 °C Shielded ZZ C4 Bearing</Name>
                <Sku>6206-ZZC4KX</Sku>
                <PriceExclTax>47.8900</PriceExclTax>
                <PriceInclTax>47.8900</PriceInclTax>
                <Quantity>13</Quantity>
                <DiscountExclTax>0.0000</DiscountExclTax>
                <DiscountInclTax>0.0000</DiscountInclTax>
                <TotalExclTax>622.5700</TotalExclTax>
                <TotalInclTax>622.5700</TotalInclTax>
            </OrderItem>
        </OrderItems>
        <Shipments>
            <Shipment>
                <ShipmentId>391</ShipmentId>
                <TrackingNumber>1Z4054470191053867</TrackingNumber>
                <TotalWeight>6.1600</TotalWeight>
                <ShippedDateUtc>3/8/2023 9:31:38 PM</ShippedDateUtc>
                <DeliveryDateUtc />
                <CreatedOnUtc>03/08/2023 21:31:38</CreatedOnUtc>
            </Shipment>
        </Shipments>
  </Order>

我需要帮助的是拥有包含多个行项目的订单,并能够将其写入panda Dataframe 。OrderID 515应该是索引3和索引4中的SKU。
我原以为有多个项目的订单会有多个行,而不是订单515中的项目转到下一个订单ID为516的下一行。
下面是两个订单的XML:

<Order>
    <OrderId>516</OrderId>
    <OrderGuid></OrderGuid>
    <StoreId></StoreId>
    <CustomerId></CustomerId>
    <OrderStatusId>30</OrderStatusId>
    <PaymentStatusId>30</PaymentStatusId>
    <ShippingStatusId>30</ShippingStatusId>
    <CustomerLanguageId>1</CustomerLanguageId>
    <CustomerTaxDisplayTypeId>10</CustomerTaxDisplayTypeId>
    <CustomerIp></CustomerIp>
    <OrderSubtotalInclTax>1120.0000</OrderSubtotalInclTax>
    <OrderSubtotalExclTax>1120.0000</OrderSubtotalExclTax>
    <OrderSubTotalDiscountInclTax>0.0000</OrderSubTotalDiscountInclTax>
    <OrderSubTotalDiscountExclTax>0.0000</OrderSubTotalDiscountExclTax>
    <OrderShippingInclTax>0.0000</OrderShippingInclTax>
    <OrderShippingExclTax>0.0000</OrderShippingExclTax>
    <PaymentMethodAdditionalFeeInclTax>0.0000</PaymentMethodAdditionalFeeInclTax>
    <PaymentMethodAdditionalFeeExclTax>0.0000</PaymentMethodAdditionalFeeExclTax>
    <TaxRates>0.00:0.00;</TaxRates>
    <OrderTax>0.0000</OrderTax>
    <OrderTotal>1120.0000</OrderTotal>
    <RefundedAmount>0.0000</RefundedAmount>
    <OrderDiscount>0.0000</OrderDiscount>
    <CurrencyRate>1.0000</CurrencyRate>
    <CustomerCurrencyCode>USD</CustomerCurrencyCode>
    <AffiliateId>0</AffiliateId>
    <AllowStoringCreditCardNumber>False</AllowStoringCreditCardNumber>
    <CardType />
    <CardName />
    <CardNumber />
    <MaskedCreditCardNumber></MaskedCreditCardNumber>
    <CardCvv2 />
    <CardExpirationMonth />
    <CardExpirationYear />
    <PaymentMethodSystemName></PaymentMethodSystemName>
    <AuthorizationTransactionId></AuthorizationTransactionId>
    <AuthorizationTransactionCode></AuthorizationTransactionCode>
    <AuthorizationTransactionResult>Approved (1: This transaction has been approved.)</AuthorizationTransactionResult>
    <CaptureTransactionId></CaptureTransactionId>
    <CaptureTransactionResult>Approved (1: This transaction has been approved.)</CaptureTransactionResult>
    <SubscriptionTransactionId />
    <PaidDateUtc>03/09/2023 14:43:09</PaidDateUtc>
    <ShippingMethod>Other</ShippingMethod>
    <ShippingRateComputationMethodSystemName>Shipping.Director</ShippingRateComputationMethodSystemName>
    <CustomValuesXml />
    <VatNumber />
    <Deleted>False</Deleted>
    <CreatedOnUtc>3/9/2023 2:39:57 PM</CreatedOnUtc>
    <OrderItems>
        <OrderItem>
            <Id>635</Id>
            <OrderItemGuid>9dac1add-310b-41f7-a5af-36ff680b46de</OrderItemGuid>
            <Name>2-3/16" High Temperature 500 °F Mounted Pillow Block Bearing</Name>
            <Sku>HUCP211-35KX</Sku>
            <PriceExclTax>560.0000</PriceExclTax>
            <PriceInclTax>560.0000</PriceInclTax>
            <Quantity>2</Quantity>
            <DiscountExclTax>0.0000</DiscountExclTax>
            <DiscountInclTax>0.0000</DiscountInclTax>
            <TotalExclTax>1120.0000</TotalExclTax>
            <TotalInclTax>1120.0000</TotalInclTax>
        </OrderItem>
    </OrderItems>
    <Shipments>
        <Shipment>
            <ShipmentId>392</ShipmentId>
            <TrackingNumber>1ZB3B6620328442937</TrackingNumber>
            <TotalWeight>17.0000</TotalWeight>
            <ShippedDateUtc>3/9/2023 2:40:56 PM</ShippedDateUtc>
            <DeliveryDateUtc />
            <CreatedOnUtc>03/09/2023 14:40:55</CreatedOnUtc>
        </Shipment>
    </Shipments>
</Order>
<Order>
    <OrderId>515</OrderId>
    <OrderGuid></OrderGuid>
    <StoreId>1</StoreId>
    <CustomerId></CustomerId>
    <OrderStatusId>30</OrderStatusId>
    <PaymentStatusId>10</PaymentStatusId>
    <ShippingStatusId>30</ShippingStatusId>
    <CustomerLanguageId>1</CustomerLanguageId>
    <CustomerTaxDisplayTypeId>10</CustomerTaxDisplayTypeId>
    <CustomerIp></CustomerIp>
    <OrderSubtotalInclTax>681.4000</OrderSubtotalInclTax>
    <OrderSubtotalExclTax>681.4000</OrderSubtotalExclTax>
    <OrderSubTotalDiscountInclTax>0.0000</OrderSubTotalDiscountInclTax>
    <OrderSubTotalDiscountExclTax>0.0000</OrderSubTotalDiscountExclTax>
    <OrderShippingInclTax>0.0000</OrderShippingInclTax>
    <OrderShippingExclTax>0.0000</OrderShippingExclTax>
    <PaymentMethodAdditionalFeeInclTax>0.0000</PaymentMethodAdditionalFeeInclTax>
    <PaymentMethodAdditionalFeeExclTax>0.0000</PaymentMethodAdditionalFeeExclTax>
    <TaxRates>0.00:0.00;   </TaxRates>
    <OrderTax>0.0000</OrderTax>
    <OrderTotal>681.4000</OrderTotal>
    <RefundedAmount>0.0000</RefundedAmount>
    <OrderDiscount>0.0000</OrderDiscount>
    <CurrencyRate>1.0000</CurrencyRate>
    <CustomerCurrencyCode>USD</CustomerCurrencyCode>
    <AffiliateId>0</AffiliateId>
    <AllowStoringCreditCardNumber>False</AllowStoringCreditCardNumber>
    <CardType />
    <CardName />
    <CardNumber />
    <MaskedCreditCardNumber />
    <CardCvv2 />
    <CardExpirationMonth />
    <CardExpirationYear />
    <PaymentMethodSystemName>Payments.PurchaseOrder</PaymentMethodSystemName>
    <AuthorizationTransactionId />
    <AuthorizationTransactionCode />
    <AuthorizationTransactionResult />
    <CaptureTransactionId />
    <CaptureTransactionResult />
    <SubscriptionTransactionId />
    <PaidDateUtc>03/08/2023 21:29:31</PaidDateUtc>
    <ShippingMethod>UPS Next Day Air®</ShippingMethod>
    <ShippingRateComputationMethodSystemName>Shipping.Director</ShippingRateComputationMethodSystemName>
    <CustomValuesXml></CustomValuesXml>
    <VatNumber />
    <Deleted>False</Deleted>
    <CreatedOnUtc>3/8/2023 9:29:31 PM</CreatedOnUtc>
    <OrderItems>
        <OrderItem>
            <Id>633</Id>
            <OrderItemGuid>1cb035ca-73c8-4124-94d9-c14c54b669d8</OrderItemGuid>
            <Name>30mm 6206 High Temperature 220 °C Sealed C4 Bearing</Name>
            <Sku>6206-2RC4KX</Sku>
            <PriceExclTax>58.8300</PriceExclTax>
            <PriceInclTax>58.8300</PriceInclTax>
            <Quantity>1</Quantity>
            <DiscountExclTax>0.0000</DiscountExclTax>
            <DiscountInclTax>0.0000</DiscountInclTax>
            <TotalExclTax>58.8300</TotalExclTax>
            <TotalInclTax>58.8300</TotalInclTax>
        </OrderItem>
        <OrderItem>
            <Id>634</Id>
            <OrderItemGuid>1a4a005f-e669-4420-9379-611e5b3f637c</OrderItemGuid>
            <Name>30mm 6206 High Temperature 250 °C Shielded ZZ C4 Bearing</Name>
            <Sku>6206-ZZC4KX</Sku>
            <PriceExclTax>47.8900</PriceExclTax>
            <PriceInclTax>47.8900</PriceInclTax>
            <Quantity>13</Quantity>
            <DiscountExclTax>0.0000</DiscountExclTax>
            <DiscountInclTax>0.0000</DiscountInclTax>
            <TotalExclTax>622.5700</TotalExclTax>
            <TotalInclTax>622.5700</TotalInclTax>
        </OrderItem>
    </OrderItems>
    <Shipments>
        <Shipment>
            <ShipmentId>391</ShipmentId>
            <TrackingNumber>1Z4054470191053867</TrackingNumber>
            <TotalWeight>6.1600</TotalWeight>
            <ShippedDateUtc>3/8/2023 9:31:38 PM</ShippedDateUtc>
            <DeliveryDateUtc />
            <CreatedOnUtc>03/08/2023 21:31:38</CreatedOnUtc>
        </Shipment>
    </Shipments>
</Order>
pgccezyw

pgccezyw1#

考虑将迭代与嵌套循环联系起来,这可以通过解析实现:

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse("Input.xml")

data = [
    {
     **{
         "OrderId": o.findtext("OrderId"),
         "CustomerId": o.findtext("CustomerId") 
     },
     **{ l.tag: l.text for l in ol.findall("*") }
    }
    for o in tree.findall(".//Order")
    for ol in o.findall(".//OrderItem")
]

orders_df = pd.DataFrame(data)[
    ["OrderId", "CustomerId", "Sku", "PriceInclTax", "Quantity"]
]

orders_df
#   OrderId CustomerId           Sku PriceInclTax Quantity
# 0     516             HUCP211-35KX     560.0000        2
# 1     515              6206-2RC4KX      58.8300        1
# 2     515              6206-ZZC4KX      47.8900       13
jdgnovmf

jdgnovmf2#

尝试使用pandas.read_xml

import pandas as pd

# read the xml file with xpath
# we only care about the Order and OrderItem nodes so we pull
# everything from there and select the columns we want to keep
df = pd.read_xml('orders (1).xml', xpath='.//Order|.//OrderItem')[['OrderId', 'CustomerId', 'Sku', 'PriceExclTax', 'Quantity']]
# forward fill the OrderId
df['OrderId'] = df['OrderId'].ffill()
# only keep the rows that have a Sku
df = df[df['Sku'].notnull()].copy()

   OrderId  CustomerId           Sku  PriceExclTax  Quantity
1    516.0         NaN  HUCP211-35KX        560.00       2.0
3    515.0         NaN   6206-2RC4KX         58.83       1.0
4    515.0         NaN   6206-ZZC4KX         47.89      13.0

对于xpath,.选择当前节点,//从当前节点中选择文档中与所选内容匹配的节点,无论这些节点位于何处。|是or运算符。

相关问题