Convert a text file to sequence format in Spark Java

Posted by g6ll5ycj on 2021-05-30 in Hadoop

In Spark Java, how do I convert a text file to a sequence file? Here is my code:

SparkConf sparkConf = new SparkConf().setAppName("txt2seq");
sparkConf.setMaster("local").set("spark.executor.memory", "1g");
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);

JavaPairRDD<String, String> infile = ctx.wholeTextFiles("input_txt");
infile.saveAsNewAPIHadoopFile("outfile.seq", String.class, String.class, SequenceFileOutputFormat.class);

I get the error below.

14/12/07 23:43:33 ERROR Executor: Exception in task ID 0
java.io.IOException: Could not find a serializer for the Key class: 'java.lang.String'. Please ensure that the configuration 'io.serializations' is properly configured, if you're usingcustom serialization.
    at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:1176)
    at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:1091)

Does anyone have any idea? Thank you!

hadoop apache-spark sequencefile hadoop2

Source: https://stackoverflow.com/questions/27353462/convert-a-text-file-to-sequence-format-in-spark-java

1 answer

kokeuurv #1

Change this:

JavaPairRDD<String, String> infile = ctx.wholeTextFiles("input_txt");
infile.saveAsNewAPIHadoopFile("outfile.seq", String.class, String.class, SequenceFileOutputFormat.class);

to

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import scala.Tuple2;
JavaPairRDD<String, String> infile = ctx.wholeTextFiles("input_txt");
JavaPairRDD<Text, Text> resultRDD = infile.mapToPair(f -> new Tuple2<>(new Text(f._1()), new Text(f._2())));
resultRDD.saveAsNewAPIHadoopFile("outfile.seq", Text.class, Text.class, SequenceFileOutputFormat.class);
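
As for why this works: the stack trace shows SequenceFile.Writer failing to find a serializer for java.lang.String. With the default 'io.serializations' setting, Hadoop serializes Writable types such as Text, and String is not a Writable, so wrapping both the key and the value in Text gives the writer something it can handle. The sketch below is a toy model of that lookup in plain Java, not Hadoop's actual code: ToyWritable, ToyText, ToySerialization and canWrite are hypothetical names made up for illustration, and the real lookup may involve more serializations than the single one modeled here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class SerializerLookupSketch {
    /** Hypothetical stand-in for Hadoop's Writable interface (not the real one). */
    public interface ToyWritable {}

    /** Hypothetical stand-in for org.apache.hadoop.io.Text: a string wrapper that is "writable". */
    public static final class ToyText implements ToyWritable {
        final String value;
        public ToyText(String value) { this.value = value; }
    }

    /** Hypothetical stand-in for one configured serialization: it reports which classes it accepts. */
    public interface ToySerialization {
        boolean accept(Class<?> c);
    }

    /** Assumption for this sketch: the only configured serialization accepts ToyWritable types. */
    static final ToySerialization WRITABLE_ONLY = c -> ToyWritable.class.isAssignableFrom(c);

    /** Returns the first configured serialization that accepts the class, if any. */
    static Optional<ToySerialization> findSerialization(List<ToySerialization> configured, Class<?> c) {
        return configured.stream().filter(s -> s.accept(c)).findFirst();
    }

    /** True when some configured serialization can handle the given key or value class. */
    public static boolean canWrite(Class<?> keyOrValueClass) {
        return findSerialization(Arrays.asList(WRITABLE_ONLY), keyOrValueClass).isPresent();
    }

    public static void main(String[] args) {
        // String does not implement ToyWritable, so no serialization accepts it.
        System.out.println("String accepted: " + canWrite(String.class));
        // ToyText implements ToyWritable, so the lookup succeeds.
        System.out.println("ToyText accepted: " + canWrite(ToyText.class));
    }
}
```

Running main prints "String accepted: false" and "ToyText accepted: true", which mirrors why the String-typed RDD fails at write time and the Text-typed one is saved.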

2021-05-30
