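I am trying to parse a nested XML file that contains multiple row tags using Spark/Scala. After parsing, I have to load the data into a table. However, I have not been able to convert the multiple row tags into a proper tabular format. I am using the spark-xml library on an Azure Databricks cluster. Can anyone help?
Below are a sample of the source file and the target schema. The original file is about 20 MB.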
<?xml version="1.0" encoding="UTF-8"?>
<on xmlns:xsi="http://xxxxxxxx.yyyyyyy" xsi:xxxxxLocation="on_sources_2.0.xsd" schv="2.0">
<header>
<cnt>On(r) TV: Sources</cnt>
<ctd>2021-04-11</ctd>
<cgr>cgr 2021 Gracenote. All rights reserved.</cgr>
<st>2021-04-11T00:00:00</st>
<pd>xx</pd>
</header>
<sources>
<prgSvcs>
<prgSvc sid="00000" pid="0000">
<nm>FXX ind</nm>
<address>
<ct>mum</ct>
<state>mh</state>
<pcd>111x2</pcd>
<cty>ind</cty>
</address>
<type>Satellite</type>
<rshps>
<rshp type="HD Version of">0000</rshp>
</rshps>
<attrbs>
<attrb>test</attrb>
<attrb>test2</attrb>
</attrbs>
<tmzn>IST Observing</tmzn>
<clsgn>XXC</clsgn>
<edlags>
<edlag>en</edlag>
</edlags>
<bcaslags>
<bcaslag>en</bcaslag>
</bcaslags>
<URL>www.xxxyyyy.com/</URL>
<images>
<image type="image/png" wdt="00" hgt="22" prmy="true" ctrgy="Logo">
<URI>i0/xxxxx/00000/s00000_h4_ba.png</URI>
</image>
<image type="image/png" wdt="180" hgt="000" prmy="true" ctrgy="Logo">
<URI>h5/xxxxx/00000/s00000_h5_aa.png</URI>
</image>
<image type="image/png" wdt="360" hgt="000" prmy="true" ctrgy="Logo">
<URI>h3/xxxxx/00000/s00000_h3_aa.png</URI>
</image>
<image type="image/png" wdt="90" hgt="00" prmy="true" ctrgy="Logo">
<URI>h4/xxxxx/00000/s00000_h4_aa.png</URI>
</image>
<image type="image/png" wdt="360" hgt="003" prmy="true" ctrgy="Logo">
<URI>h3/xxxxx/00000/s00000_h3_ba.png</URI>
</image>
<image type="image/png" wdt="180" hgt="002" prmy="true" ctrgy="Logo">
<URI>h5/xxxxx/00000/s00000_h5_ba.png</URI>
</image>
</images>
</prgSvc>
</prgSvcs>
</sources>
</on>
SCHEMA:
schv
cnt
ctd
cgr
st
pd
sid
pid
nm
ct
state
pcd
cty
pty
rshp
rshp_type
attrb
tmzn
clsgn
edlag
bcaslag
num
mjrnum
mirnum
affil
afffil_pid
url
mktid
mktid_type
imgtyp
wdt
hgt
prmy
ctrgy
uri
ctdtline
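For reference, here is a minimal sketch of one way the sample above could be flattened with spark-xml. It is not a tested solution. It assumes spark-xml's default naming (attributes prefixed with `_`, element text stored as `_VALUE`), that `image` and `attrb` repeat somewhere in the full file so they are inferred as arrays, and that the file path and table name (both hypothetical) are replaced with real ones. The root attribute `schv` and the schema columns that do not appear in the sample are not covered.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode_outer}

// On Databricks this returns the notebook's existing session.
val spark = SparkSession.builder().getOrCreate()

// Hypothetical location of the source file.
val path = "/mnt/data/on_sources.xml"

// First pass: the single <header> element becomes a one-row DataFrame.
val header = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "header")
  .load(path)
  .select("cnt", "ctd", "cgr", "st", "pd")

// Second pass: one row per <prgSvc>; nested elements become structs/arrays.
val svcs = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "prgSvc")
  .load(path)

// Explode the repeating children. Exploding two independent arrays yields
// one row per (image, attrb) combination for each service.
val flat = svcs
  .withColumn("image", explode_outer(col("images.image")))
  .withColumn("attrb", explode_outer(col("attrbs.attrb")))
  .select(
    col("_sid").as("sid"),
    col("_pid").as("pid"),
    col("nm"),
    col("address.ct").as("ct"),
    col("address.state").as("state"),
    col("address.pcd").as("pcd"),
    col("address.cty").as("cty"),
    col("type"),
    col("rshps.rshp._VALUE").as("rshp"),
    col("rshps.rshp._type").as("rshp_type"),
    col("attrb"),
    col("tmzn"),
    col("clsgn"),
    col("edlags.edlag").as("edlag"),
    col("bcaslags.bcaslag").as("bcaslag"),
    col("URL").as("url"),
    col("image._type").as("imgtyp"),
    col("image._wdt").as("wdt"),
    col("image._hgt").as("hgt"),
    col("image._prmy").as("prmy"),
    col("image._ctrgy").as("ctrgy"),
    col("image.URI").as("uri")
  )

// Attach the header values to every row (header has exactly one row).
val result = flat.crossJoin(header)

// Hypothetical table name.
result.write.mode("overwrite").saveAsTable("on_sources")
```

Two things to watch for. With schema inference, a child that never repeats in the file is inferred as a struct or scalar rather than an array, so `explode_outer` on it would fail; calling `svcs.printSchema()` first shows what was actually inferred. Inference may also turn values such as `sid="00000"` into numbers and drop the leading zeros; setting spark-xml's `inferSchema` option to `false`, or passing an explicit schema, should keep them as strings.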