Not serializable result: org.apache.hadoop.io.IntWritable when reading a sequence file with Spark/Scala

Posted by ozxc1zmp on 2021-06-01 in Hadoop

I am reading a sequence file whose records are, logically, an Int key and a String value.
If I do this:

import org.apache.hadoop.io.{IntWritable, Text}

val sequence_data = sc.sequenceFile("/seq_01/seq-directory/*", classOf[IntWritable], classOf[Text])
                  .map{case (x, y) => (x.toString(), y.toString().split("/")(0), y.toString().split("/")(1))}
                  .collect

This works, because the IntWritable is converted to a String.
If I do this instead, keeping the key as an IntWritable:

val sequence_data = sc.sequenceFile("/seq_01/seq-directory/*", classOf[IntWritable], classOf[Text])
                  .map{case (x, y) => (x, y.toString().split("/")(0), y.toString().split("/")(1))}
                  .collect

then I immediately get this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 5.0 in stage 42.0 (TID 692) had a not serializable result: org.apache.hadoop.io.IntWritable

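The root cause is not really clear to me. It is obviously a serialization problem, but why is this so hard? This is yet another flavor of serialization issue I have run into, and it only shows up at runtime.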
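For context, org.apache.hadoop.io.IntWritable implements Hadoop's Writable interface but not java.io.Serializable, so with Spark's default Java serializer it cannot be part of a task result that collect ships back to the driver. The first snippet avoids this only because toString turns the key into a plain String inside the task. A minimal sketch of the same idea that keeps the key numeric is shown below. It assumes a job launched through spark-submit; the object name ReadSequenceFile and the helper toPlain are hypothetical names introduced for illustration, and the input path is the one from the question.

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, named here only for the example.
object ReadSequenceFile {

  // Convert the Hadoop Writables into plain Scala values inside the task.
  // IntWritable.get returns an Int, which Spark can serialize without trouble.
  def toPlain(key: IntWritable, value: Text): (Int, String, String) = {
    val parts = value.toString.split("/") // split once instead of twice
    (key.get, parts(0), parts(1))
  }

  def main(args: Array[String]): Unit = {
    // Assumes the master is supplied by spark-submit.
    val spark = SparkSession.builder().appName("read-seq").getOrCreate()
    val sc = spark.sparkContext

    val sequenceData = sc
      .sequenceFile("/seq_01/seq-directory/*", classOf[IntWritable], classOf[Text])
      .map { case (k, v) => toPlain(k, v) } // no Writable leaves the executor
      .collect()

    sequenceData.take(5).foreach(println)
    spark.stop()
  }
}
```

Converting inside map is also the safer habit for another reason: the Hadoop record reader reuses the same Writable instance for every record, so holding on to the raw objects (for example when caching) can give surprising results even where serialization is not involved.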

hadoop apache-spark sequencefile serialization

Source: https://stackoverflow.com/questions/53239761/not-serializable-result-org-apache-hadoop-io-intwritable-when-reading-sequence