通过遍历循环来解析xml,并使用循环创建字符串数组
<USR_ORD><OrderResponse>
<OrderCount1>3</OrderCount1>
<OrderResult><orders>
<order>
<name>A</name><address>AAA</address><number>A1</number><status></status>
</order>
<order>
<name>B</name><number>B1</number>
</order>
<order>
<name>C</name><address>CCC</address><number>C1</number><status></status>
</order>
</orders></OrderResult>
</OrderResponse></USR_ORD>
我的代码如下
//creating list
val myList=List((100,1,"<USR_ORD><OrderResponse><OrderCount1>3</OrderCount1><OrderResult><orders><order><name>A</name><address>AAA</address><number>A1</number><status></status></order><order><name>B</name><number>B1</number></order><order><name>C</name><address>CCC</address><number>C1</number><status></status></order></orders></OrderResult></OrderResponse></USR_ORD>"))
//creating dataframe and temp table
val rdd = spark.sparkContext.parallelize(myList);
val DF1 = rdd.toDF("customer_id","response_id","response_output")
DF1.createOrReplaceTempView("ord_tbl");
spark.sql("""select * from ord_tbl""").show(10,false)
+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|customer_id|response_id|response_output |
+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 |1 |<USR_ORD><OrderResponse><OrderCount1>3</OrderCount1><OrderResult><orders><order><name>A</name><address>AAA</address><number>A1</number><status></status></order><order><name>B</name><number>B1</number></order><order><name>C</name><address>CCC</address><number>C1</number><status></status></order></orders></OrderResult></OrderResponse></USR_ORD>|
+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
使用xpath标准函数
spark.sql("""select xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/orders/order/name/text()') as name
,xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/orders/order/address/text()') as address
, xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/orders/order/status/text()') as status from ord_tbl""").show(10,false)
+---------+----------+------+
|name |address |status|
+---------+----------+------+
|[A, B, C]|[AAA, CCC]|[] |
+---------+----------+------+
但预期的Dataframe应该如下所示
+---------+----------+------+
|name |address |status|
+---------+----------+------+
|[A, B, C]|[AAA,,CCC]|[,,] |
+---------+----------+------+
我尝试使用loop作为参考的东西,我肯定是错的,没有编译
def (inorders:Int,inOrderCount:int,partxpathstring1:String,partxpathstring2:String,)
val orders=inorders
val OrderCount=inOrderCount
var i=0
var j=0
for (i <- 1 to orders){
for(j <- 1 to OrderCount){
fullxpath=xpath_string(response_output,'$partxpathstring1+[i]+$partxpathstring2+[j]')
fullxpath+=fullxpath
)
j+=1
}
i+1
}
1条答案
按热度按时间5anewei61#
Spark
xpath
函数似乎过滤xml节点中的空值。您可能需要使用自定义项来处理此问题。下面是一个使用scala.xml.XML
分析列的步骤response_output
: