Pandas read_xml and SEPA (CAMT 053) XML

jyztefdp  posted on 2023-11-15  in Other

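Recently I wanted to try the newly added read_xml function in pandas, and I decided to test it with SEPA camt-format XML. Because I am unfamiliar with the lxml logic, I got stuck on the function parameters. I tried pointing at the transaction entries as rows (the "Ntry" tag), expecting that read_xml would loop over those rows and build the DataFrame. Leaving xpath at its default returns an empty DataFrame with the columns "GrpHdr" and "Rpt", but the relevant data sits one level below "Rpt". Setting xpath='//*' creates a huge nested DataFrame in which every tag is a column and the values appear in seemingly random order. If anyone is familiar with using pandas read_xml on nested XML, I would appreciate any hints. The XML file looks like this (fake values):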

<Document>
  <BkToCstmrAcctRpt>
    <GrpHdr>
      <MsgId>Account</MsgId>
      <CreDtTm>2021-08-05T14:20:23.077+02:00</CreDtTm>
      <MsgRcpt>
        <Nm> Name</Nm>
      </MsgRcpt>
    </GrpHdr>
    <Rpt>
      <Id>Account ID</Id>
      <CreDtTm>2021-08-05T14:20:23.077+02:00</CreDtTm>
      <Acct>
        <Id>
          <IBAN>DEXXXXX</IBAN>
        </Id>
      </Acct>
      <Bal>
        <Tp>
          <CdOrPrtry>
          </CdOrPrtry>
        </Tp>
        <Amt Ccy="EUR">161651651651</Amt>
        <CdtDbtInd>CRDT</CdtDbtInd>
        <Dt>
          <DtTm>2021-08-05T14:20:23.077+02:00</DtTm>
        </Dt>
      </Bal>
      <Ntry>
        <Amt Ccy="EUR">11465165</Amt>
        <CdtDbtInd>CRDT</CdtDbtInd>
        <Sts>BOOK</Sts>
        <BookgDt>
          <Dt>2021-08-02</Dt>
        </BookgDt>
        <ValDt>
          <Dt>2021-08-02</Dt>
        </ValDt>
        <BkTxCd>
          <Domn>
            <Cd>PMNT</Cd>
            <Fmly>
              <Cd>RCDT</Cd>
              <SubFmlyCd>ESCT</SubFmlyCd>
            </Fmly>
          </Domn>
          <Prtry>
            <Cd>NTRF+65454</Cd>
            <Issr>DFE</Issr>
          </Prtry>
        </BkTxCd>
        <NtryDtls>
          <TxDtls>
            <Amt Ccy="EUR">4945141.0</Amt>
            <CdtDbtInd>CRDT</CdtDbtInd>
            <BkTxCd>
              <Domn>
                <Cd>PMNT</Cd>
                <Fmly>
                  <Cd>RCDT</Cd>
                  <SubFmlyCd>ESCT</SubFmlyCd>
                </Fmly>
              </Domn>
              <Prtry>
                <Cd>NTRF+55155</Cd>
                <Issr>DFEsds</Issr>
              </Prtry>
            </BkTxCd>
            <RltdPties>
              <Dbtr>
                <Nm>Name</Nm>
              </Dbtr>
              <Cdtr>
                <Nm>Name</Nm>
              </Cdtr>
            </RltdPties>
            <RmtInf>
              <Ustrd>Referenz NOTPROVIDED</Ustrd>
              <Ustrd> Buchug</Ustrd>
            </RmtInf>
          </TxDtls>
        </NtryDtls>
      </Ntry>
    </Rpt>
  </BkToCstmrAcctRpt>
</Document>


lmyy7pcs  #1

A bank statement is not a shallow XML document, so it is not a good fit for pandas.read_xml (as noted in the documentation).
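To illustrate what "shallow" means here, the sketch below uses only Python's standard library on a trimmed-down copy of the XML from the question. The `SAMPLE` string and the `direct_text_children` helper are made up for this illustration and are not part of pandas. As far as I understand read_xml, it takes its rows from the nodes matched by xpath and its columns from their immediate children and attributes, so the helper mimics that by looking one level down only.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, hypothetical copy of the question's XML (fake values)
SAMPLE = """
<Document>
  <BkToCstmrAcctRpt>
    <GrpHdr><MsgId>Account</MsgId></GrpHdr>
    <Rpt>
      <Ntry>
        <Amt Ccy="EUR">11465165</Amt>
        <CdtDbtInd>CRDT</CdtDbtInd>
        <Sts>BOOK</Sts>
        <BookgDt><Dt>2021-08-02</Dt></BookgDt>
      </Ntry>
    </Rpt>
  </BkToCstmrAcctRpt>
</Document>
"""


def direct_text_children(node):
    """Map each immediate child tag to its own text (None if it only wraps other tags)."""
    return {child.tag: (child.text or "").strip() or None for child in node}


root = ET.fromstring(SAMPLE)

# './*' (the read_xml default) matches only the single child of <Document>,
# whose children GrpHdr and Rpt carry no text of their own
default_rows = [direct_text_children(n) for n in root.findall("./*")]

# './/Ntry' matches the entries, but nested values such as BookgDt/Dt stay hidden
ntry_rows = [direct_text_children(n) for n in root.findall(".//Ntry")]

print(default_rows)
print(ntry_rows)
```

This lines up with what the question describes: the default xpath gives one row with empty "GrpHdr" and "Rpt" columns, and even an xpath that targets "Ntry" only reaches the first level below each entry.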
I recommend using the sepa library instead.

from sepa import parser
import re
import pandas as pd

# Utility function to remove the default namespace declaration from the XML
def strip_namespace(xml):
    return re.sub(' xmlns="[^"]+"', '', xml, count=1)

# Read file
with open('example.xml', 'r', encoding='utf-8') as f:
    input_data = f.read()

# Parse the bank statement XML to dictionary
camt_dict = parser.parse_string(parser.bank_to_customer_statement, bytes(strip_namespace(input_data), 'utf8'))

# Collect one DataFrame of entries per statement
all_entries = []
for statement in camt_dict['statements']:
    if 'entries' in statement:
        dd = pd.DataFrame.from_records(statement['entries'])
        df = pd.DataFrame()
        # value_date and amount are dicts; .str[key] pulls one field out of each
        df['Date'] = pd.to_datetime(dd['value_date'].str['date']).dt.strftime('%Y-%m-%d')
        df['IBAN'] = statement['account']['id']['iban']
        df['Currency'] = dd['amount'].str['currency']
        all_entries.append(df)

# Combine the per-statement frames into one table with a fresh 0..n-1 index
df_entries = pd.concat(all_entries, ignore_index=True)
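If adding the sepa dependency is not an option, a rough alternative is to flatten each "Ntry" element yourself with the standard library and hand the resulting rows to pandas. The following is only a sketch under a few assumptions: the helper names (`local_name`, `flatten`, `entries_to_rows`) are hypothetical, the namespace URI in `SAMPLE` is an assumption about what a real file declares, nested tags become dotted column names, attributes become `Tag@Attr` columns, and repeated tags such as "Ustrd" are joined with " | ".

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the xmlns value is an assumption about a real file
SAMPLE = """<Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.052.001.02">
  <BkToCstmrAcctRpt><Rpt>
    <Ntry>
      <Amt Ccy="EUR">11465165</Amt>
      <BookgDt><Dt>2021-08-02</Dt></BookgDt>
      <NtryDtls><TxDtls><RmtInf>
        <Ustrd>Referenz NOTPROVIDED</Ustrd>
        <Ustrd> Buchug</Ustrd>
      </RmtInf></TxDtls></NtryDtls>
    </Ntry>
  </Rpt></BkToCstmrAcctRpt>
</Document>"""


def local_name(tag):
    """Drop a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def flatten(node, prefix=""):
    """Flatten a nested element into {'Path.To.Leaf': text} pairs."""
    row = {}

    def put(key, value):
        # Repeated tags (e.g. several Ustrd lines) are joined, not overwritten
        row[key] = f"{row[key]} | {value}" if key in row else value

    for child in node:
        path = prefix + local_name(child.tag)
        # Attributes such as Ccy get their own column, e.g. 'Amt@Ccy'
        for name, value in child.attrib.items():
            put(f"{path}@{name}", value)
        if len(child):
            # The element wraps other elements: recurse one level deeper
            for key, value in flatten(child, path + ".").items():
                put(key, value)
        else:
            put(path, (child.text or "").strip())
    return row


def entries_to_rows(xml_text):
    """Return one flat dict per Ntry element, wherever it sits in the tree."""
    root = ET.fromstring(xml_text)
    return [flatten(el) for el in root.iter() if local_name(el.tag) == "Ntry"]


rows = entries_to_rows(SAMPLE)
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` should then give one row per entry. Note that an entry with several "TxDtls" blocks would have their values joined into the same columns, so a real statement may need one row per "TxDtls" instead.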

