'Unsupported encoding: DELTA_BYTE_ARRAY' when writing Parquet data to CSV with pyspark

y1aodyip · published 2023-01-20 in Spark

I want to convert Parquet files in binary format to CSV files. I am using the following commands in Spark.

sqlContext.setConf("spark.sql.parquet.binaryAsString","true")

val source =  sqlContext.read.parquet("path to parquet file")

source.coalesce(1).write.format("com.databricks.spark.csv").option("header","true").save("path to csv")

This works when I start Spark on the HDFS server and run these commands. When I copy the same Parquet file to my local system, start pyspark, and run the same commands, I get an error.
I am able to set the binary-as-string property to true and to read the Parquet files in my local pyspark, but when I execute the command to write to CSV, it gives the following error.
2018-10-01 14:45:11 WARN ZlibFactory:51 - Failed to load/initialize native-zlib library
2018-10-01 14:45:12 ERROR Utils:91 - Aborting task
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:577)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(VectorizedColumnReader.java:627)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$100(VectorizedColumnReader.java:47)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:550)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:536)
    at org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:141)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:536)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:164)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:263)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:161)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:186)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
What should be done to resolve this error on my local machine, given that the same commands work on HDFS? Any ideas would be of great help. Thank you.
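
For reference, a rough PySpark equivalent of the Scala commands above (a sketch only; the paths are placeholders and the com.databricks.spark.csv package is assumed to be available, as in the original):

sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

source = sqlContext.read.parquet("path to parquet file")

source.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("path to csv")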

41ik7eoe1#

You can try disabling the vectorized reader.

spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

This is not a fix, but a workaround. For the consequences of disabling it, see https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html
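
A minimal end-to-end sketch of this workaround in PySpark (the paths are placeholders, and Spark 2.x's built-in CSV writer is used here instead of the spark-csv package):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-csv").getOrCreate()

# Fall back to the non-vectorized Parquet reader, which can decode DELTA_BYTE_ARRAY pages
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

df = spark.read.parquet("path to parquet file")  # placeholder path
df.coalesce(1).write.option("header", "true").csv("path to csv")  # placeholder path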

lqfhib0f2#

**Problem:** An exception occurs when reading Parquet files in Spark 2.x where some columns are DELTA_BYTE_ARRAY encoded.
**Exception:** java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
**Solution:** If you turn off the vectorized reader property, these files can be read normally.

spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

**Explanation:** These files were written with the Parquet V2 writer, as DELTA_BYTE_ARRAY encoding is a Parquet V2 feature. The Spark 2.x vectorized reader does not appear to support that format. An issue has already been created on Apache's JIRA; until it is resolved, this particular workaround applies.

**Drawback of using this solution.** Vectorized query execution can dramatically improve the performance of SQL engines such as Hive, Drill, and Presto: instead of processing one row at a time, it streamlines operations by processing a batch of rows at once. But Spark 2.x does not support this feature for Parquet version 2, so we have to rely on this workaround until a later release adds support.
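
To check whether a file really carries Parquet V2 encodings, you can inspect the column chunk metadata, for example with pyarrow (not part of the original answer; assumes pyarrow is installed and the path is a placeholder):

import pyarrow.parquet as pq

meta = pq.ParquetFile("path to parquet file").metadata  # placeholder path
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        # A DELTA_BYTE_ARRAY entry here marks a Parquet V2-encoded column chunk
        print(chunk.path_in_schema, chunk.encodings)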

s4chpxco3#

Adding these two flags helped me get past this error.

parquet.split.files false
spark.sql.parquet.enableVectorizedReader false
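
In PySpark, both settings can be applied at runtime before the read, e.g. (a sketch; spark.sql.parquet.enableVectorizedReader is a documented Spark SQL conf, while parquet.split.files is passed through as-is and its effect may depend on your Spark version):

spark.conf.set("parquet.split.files", "false")
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

df = spark.read.parquet("path to parquet file")  # placeholder path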
