scala - Irrelevant corrupt record in an XML file while reading it as a Spark DF

llmtgqce  posted on 2021-05-27  in Spark

I am trying to read an XML file as a DataFrame in Spark.
XML file:

<cool>
<incollection mdate="2002-01-03" key="books/acm/kim95/Blakeley95">
<author>Jos&eacute; A. Blakeley</author>
<title>OQL[C++]: Extending C++ with an Object Query Capability.</title>
<pages>69-88</pages>
<booktitle>Modern Database Systems</booktitle>
<url>db/books/collections/kim95.html#Blakeley95</url>
<year>1995</year>
</incollection>
</cool>

Code:

val corrupt_records_handled_DF=spark.read.format("xml").option("rootTag","cool").option("rowTag","incollection").load("/usr/local/inputs/temp.xml")

The whole row comes back as a corrupt record.
Spark version: 2.4.6. Package: com.databricks:spark-xml_2.11:0.9.0
Output:

scala> val corrupt_records_handled_DF=spark.read.format("xml").option("rootTag","cool").option("rowTag","incollection").load("/usr/local/inputs/temp.xml")
corrupt_records_handled_DF: org.apache.spark.sql.DataFrame = [_corrupt_record: string]

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_corrupt_record                                                                                                                                                                                                                                                                                                                                |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|<incollection mdate="2002-01-03" key="books/acm/kim95/Blakeley95">
<author>Jos&eacute; A. Blakeley</author>
<title>OQL[C++]: Extending C++ with an Object Query Capability.</title>
<pages>69-88</pages>
<booktitle>Modern Database Systems</booktitle>
<url>db/books/collections/kim95.html#Blakeley95</url>
<year>1995</year>
</incollection>|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Can you help me with this?

6psbrbz9  #1

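This is caused by the & in the author tag (Jos&eacute;). I used the sed command to replace & with and.
sed -e 's/&/and/g' ./temp.xml > /temp1.xml   # replace & with and
sed -e 's/&/and/g' ./temp.xml > /temp2.xml   # replace and with a space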
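The sed approach above gets rid of the corrupt record, but it also turns Jos&eacute; into Josandeacute;. XML itself only predefines five entities (amp, lt, gt, apos, quot), so a named entity like &eacute; with no DTD declaring it is most likely what the parser rejects. If you want to keep the accented character, a rough sketch in plain Scala could look like the following. The object name EntityCleaner, the function replaceNamedEntities and the small entity table are my own placeholders, not part of spark-xml; extend the table with whatever entities your file actually contains.

```scala
object EntityCleaner {
  // Hypothetical, deliberately small table: named entity -> literal character.
  // Add the entities that really occur in your input.
  val namedEntities: Map[String, String] = Map(
    "eacute" -> "\u00e9",
    "aacute" -> "\u00e1",
    "ouml"   -> "\u00f6",
    "uuml"   -> "\u00fc"
  )

  // The five entities every XML parser understands; leave these untouched.
  val xmlPredefined: Set[String] = Set("amp", "lt", "gt", "apos", "quot")

  // Matches named references such as &eacute; (not numeric ones like &#233;).
  private val entityPattern = "&([A-Za-z][A-Za-z0-9]*);".r

  // Replaces known named entities with their characters.
  // Predefined and unknown entities are returned unchanged.
  def replaceNamedEntities(xml: String): String =
    entityPattern.replaceAllIn(xml, m => {
      val name = m.group(1)
      val replacement =
        if (xmlPredefined.contains(name)) m.matched
        else namedEntities.getOrElse(name, m.matched)
      // Escape '$' and '\' so the result is inserted literally.
      scala.util.matching.Regex.quoteReplacement(replacement)
    })
}
```

You would run the file contents through replaceNamedEntities, write the result to a new file, and then load that file with the same spark.read.format("xml") call as in the question. Entities that are missing from the table are left as they are, so those rows would presumably still show up under _corrupt_record.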
