Getting a Spark Column from a Spark Row

Asked by jtjikinw on 2021-07-14, in Spark

I'm still new to Scala and Spark, so I'm struggling to write a map function. The map function on a DataFrame operates on a Row (org.apache.spark.sql.Row), and I have been loosely following this article.

    val rddWithExceptionHandling = filterValueDF.rdd.map { row: Row =>
      val parsed = Try(from_avro(???, currentValueSchema.value, fromAvroOptions)) match {
        case Success(parsedValue) => List(parsedValue, null)
        case Failure(ex) => List(null, ex.toString)
      }
      Row.fromSeq(row.toSeq.toList ++ parsed)
    }

The from_avro function wants to take a Column (org.apache.spark.sql.Column), but I can't find anything in the documentation on how to get a Column from a Row.
I'm completely open to the idea that I may be approaching this entirely the wrong way. Ultimately, my goal is to parse bytes coming in from Structured Streaming. Parsed records get written to Delta table A, and records that fail to parse are written to another Delta table B.
For context, the source table looks like this:

EDIT: from_avro returning null for "bad records"
There were some comments suggesting that from_avro returns null if it cannot parse a "bad record". By default, from_avro uses the FAILFAST mode, which throws an exception when parsing fails. If the mode is set to PERMISSIVE instead, an object in the shape of the schema is returned, but with every attribute set to null (which is not particularly useful either...). Link: Apache Avro Data Source Guide (Spark 3.1.1 documentation).
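For reference, a minimal sketch of what passing that mode option might look like, reusing filterValueDF, currentValueSchema, and the fixedValue column from below; the fromAvroOptions definition here is only an illustration:

    import org.apache.spark.sql.avro.functions.from_avro
    import scala.collection.JavaConverters._

    // Sketch only: from_avro takes its options as a java.util.Map[String, String].
    // Setting "mode" to "PERMISSIVE" turns malformed records into an all-null
    // struct instead of aborting the job (FAILFAST is the default).
    val fromAvroOptions = Map("mode" -> "PERMISSIVE").asJava

    val permissiveDf = filterValueDF.select(
      from_avro($"fixedValue", currentValueSchema.value, fromAvroOptions).as("parsedValue"))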
This was my original command:

    val parsedDf = filterValueDF.select($"topic",
      $"partition",
      $"offset",
      $"timestamp",
      $"timestampType",
      $"valueSchemaId",
      from_avro($"fixedValue", currentValueSchema.value, fromAvroOptions).as('parsedValue))

If there are any bad rows, the job is aborted with org.apache.spark.SparkException: Job aborted. A snippet of the exception log:

    Caused by: org.apache.spark.SparkException: Malformed records are detected in record parsing. Current parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
      at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:111)
      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
      at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:732)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$2(FileFormatWriter.scala:291)
      at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1615)
      at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:300)
      ... 10 more
      Suppressed: java.lang.NullPointerException
        at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsOutputStream.write(NativeAzureFileSystem.java:1099)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.write(HadoopPositionOutputStream.java:50)
        at shaded.parquet.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
        at shaded.parquet.org.apache.thrift.transport.TTransport.write(TTransport.java:107)
        at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:482)
        at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:489)
        at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBeginInternal(TCompactProtocol.java:252)
        at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBegin(TCompactProtocol.java:234)
        at org.apache.parquet.format.InterningProtocol.writeFieldBegin(InterningProtocol.java:74)
        at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1184)
        at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1051)
        at org.apache.parquet.format.FileMetaData.write(FileMetaData.java:949)
        at org.apache.parquet.format.Util.write(Util.java:222)
        at org.apache.parquet.format.Util.writeFileMetaData(Util.java:69)
        at org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:757)
        at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:750)
        at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:135)
        at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
        at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:58)
        at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.abort(FileFormatDataWriter.scala:84)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$3(FileFormatWriter.scala:297)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1626)
        ... 11 more
    Caused by: java.lang.ArithmeticException: Unscaled value too large for precision
      at org.apache.spark.sql.types.Decimal.set(Decimal.scala:83)
      at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
      at org.apache.spark.sql.avro.AvroDeserializer.createDecimal(AvroDeserializer.scala:308)
      at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$16(AvroDeserializer.scala:177)
      at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$16$adapted(AvroDeserializer.scala:174)
      at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:336)
      at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1$adapted(AvroDeserializer.scala:332)
      at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2(AvroDeserializer.scala:354)
      at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2$adapted(AvroDeserializer.scala:351)
      at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$converter$3(AvroDeserializer.scala:75)
      at org.apache.spark.sql.avro.AvroDeserializer.deserialize(AvroDeserializer.scala:89)
      at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:101)
      ... 16 more

zazmityj (Answer 1)

To get a specific column from a Row object, you can use row.get(i), or use the column name with row.getAs[T]("columnName"). You can check the Row class documentation for the details.
Your code would then look like this:

    val rddWithExceptionHandling = filterValueDF.rdd.map { row: Row =>
      val binaryFixedValue = row.getSeq[Byte](6) // or row.getAs[Seq[Byte]]("fixedValue")
      val parsed = Try(from_avro(binaryFixedValue, currentValueSchema.value, fromAvroOptions)) match {
        case Success(parsedValue) => List(parsedValue, null)
        case Failure(ex) => List(null, ex.toString)
      }
      Row.fromSeq(row.toSeq.toList ++ parsed)
    }

That said, in your case you don't really need to go into a map function at all, because there you would have to work with plain Scala values, while from_avro works with the DataFrame API. That is also why you can't call from_avro directly from map: the Column class can only be used in combination with the DataFrame API, i.e. df.select($"c1"), where c1 is an instance of Column. To use from_avro as you originally intended, simply write:

    filterValueDF.select(from_avro($"fixedValue", currentValueSchema))

As @mike already mentioned, if from_avro fails to parse the Avro content, it will return null. Finally, if you want to separate the successful rows from the failed ones, you could do something like this:

    val includingFailuresDf = filterValueDF.select(
        from_avro($"fixedValue", currentValueSchema) as "avro_res")
      .withColumn("failed", $"avro_res".isNull)

    val successDf = includingFailuresDf.where($"failed" === false)
    val failedDf = includingFailuresDf.where($"failed" === true)

Please note that this code has not been tested.
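To tie this back to the goal stated in the question (parsed records to one Delta table, failures to another), an equally untested sketch with placeholder table paths could then be:

    // Hypothetical Delta table paths; adjust to your environment.
    // In a streaming job these writes would typically live inside a foreachBatch sink.
    successDf.drop("failed")
      .write.format("delta").mode("append").save("/delta/parsed_records") // table A
    failedDf.drop("failed")
      .write.format("delta").mode("append").save("/delta/failed_records") // table B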


5fjcxozz (Answer 2)

As far as I understand, you just need to fetch a single column from a row. You can do that by using row.get() at a specific index to get the column value.
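A minimal illustration of that, assuming (as in the question) that fixedValue is a binary column at index 6:

    import org.apache.spark.sql.Row

    // Sketch only: reuses filterValueDF from the question.
    val extracted = filterValueDF.rdd.map { row: Row =>
      val byIndex = row.get(6)                           // untyped access by position (returns Any)
      val byName  = row.getAs[Array[Byte]]("fixedValue") // typed access by column name
      (byIndex, byName)
    }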
