I am trying to read an XML file into a DataFrame in Spark.
XML file:
<cool>
<incollection mdate="2002-01-03" key="books/acm/kim95/Blakeley95">
<author>José A. Blakeley</author>
<title>OQL[C++]: Extending C++ with an Object Query Capability.</title>
<pages>69-88</pages>
<booktitle>Modern Database Systems</booktitle>
<url>db/books/collections/kim95.html#Blakeley95</url>
<year>1995</year>
</incollection>
</cool>
Code:
val corrupt_records_handled_DF=spark.read.format("xml").option("rootTag","cool").option("rowTag","incollection").load("/usr/local/inputs/temp.xml")
The whole row comes back as a corrupt record.
Spark version: 2.4.6. Package: com.databricks:spark-xml_2.11:0.9.0
Output:
scala> val corrupt_records_handled_DF=spark.read.format("xml").option("rootTag","cool").option("rowTag","incollection").load("/usr/local/inputs/temp.xml")
corrupt_records_handled_DF: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_corrupt_record |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|<incollection mdate="2002-01-03" key="books/acm/kim95/Blakeley95">
<author>José A. Blakeley</author>
<title>OQL[C++]: Extending C++ with an Object Query Capability.</title>
<pages>69-88</pages>
<booktitle>Modern Database Systems</booktitle>
<url>db/books/collections/kim95.html#Blakeley95</url>
<year>1995</year>
</incollection>|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Can you help me?
1 Answer
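This is caused by the & in the author tag. An unescaped & (or a reference to an entity that is not declared) makes the XML not well-formed, so the row cannot be parsed and is returned in _corrupt_record. I used sed to replace & with and.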
sed -e 's/&/and/g' ./temp.xml > /temp1.xml   # replace & with and
sed -e 's/&/ /g' ./temp.xml > /temp2.xml     # or replace & with a space