Spark Scala: reading sequence files with multiple key types?

Posted by mwngjboj on 2021-06-01 in Hadoop


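I have sequence files whose keys are either LongWritable or Text. The values all share the same format (JSON). I want to process all of them at once in a single Spark job, but I can't figure out how to write the code so that it accepts both Text and LongWritable keys. I don't even care about the sequence record keys in my job; I never use them.
Here is what works for LongWritable keys. How can I enhance it so that it works with both LongWritable and Text keys? Is there a way to load only the sequence file record values and ignore the keys?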

val rdd = sparkCtx.sequenceFile[Long, String](srcDir)

// put into Json records, don't care about seq key
val jsonRecs = rdd.map((record: (Long, String)) => new String(record._2))

hadoop scala apache-spark

Source: https://stackoverflow.com/questions/49246123/spark-scala-reading-sequence-files-with-multiple-key-types

1 Answer


b91juud31#

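My solution was to use NullWritable as the key type, which works with both Text and LongWritable sequence file keys.
During local testing I read a local text file; when running on the cluster I read from HDFS.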

val rdd = if (inputFileType.equalsIgnoreCase(InputFileType_Text)) {
  // Read local text file
  // Tried using a NullWritable here for local testing, but it throws
  // a 'Not Serializable' error.  Using null instead, typed as NullWritable
  // so that both branches produce the same element type.
  sparkCtx.textFile(srcDir).map(line => {
    val tokens = line.split("\t")
    (null: NullWritable, tokens(1))
  })
} else {
  // Default to assuming sequence files are input
  // Read HDFS directory of seq files.
  log.debug("SEQUENCE files, srcDir={}", srcDir)
  sparkCtx.sequenceFile[NullWritable, String](srcDir)
}
log.debug("LOADED: rdd<NullWritable,String>")

// Json records; the key is never read
val jsonRecs = rdd.map((record: (NullWritable, String)) => record._2)
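
As a possible variation on the above, here is a minimal sketch that drops the keys entirely and returns only the JSON strings. It uses the `sequenceFile(path, keyClass, valueClass)` overload of `SparkContext` with the key typed as the generic `Writable` interface. This assumes the values are stored as `Text` and that the record reader takes the real key class from each file's header (which is how Hadoop's sequence file reader behaves as far as I know), so I would verify it against your own files. The names `SeqFileValues`, `valueField`, `loadJson` and `loadLocalText` are hypothetical and are not part of the original answer.

```scala
import org.apache.hadoop.io.{Text, Writable}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SeqFileValues {
  // Hypothetical helper for the local-test path: return everything after the
  // first tab, or None when the line has no tab (and so no value column).
  def valueField(line: String): Option[String] = {
    val tokens = line.split("\t", 2)
    if (tokens.length == 2) Some(tokens(1)) else None
  }

  // Sequence files: the key is typed as the generic Writable interface
  // (assumption: the actual key class comes from the file header), and the
  // key is never read.
  def loadJson(sc: SparkContext, srcDir: String): RDD[String] =
    sc.sequenceFile(srcDir, classOf[Writable], classOf[Text])
      // Hadoop reuses Writable instances between records, so copy each value
      // into an immutable String immediately.
      .map { case (_, value) => value.toString }

  // Local tab-separated text files: lines without a value column are skipped.
  def loadLocalText(sc: SparkContext, srcDir: String): RDD[String] =
    sc.textFile(srcDir).flatMap(line => valueField(line).toList)
}
```

Both loaders return a plain `RDD[String]`, so the rest of the job no longer mentions a key type. Note that `valueField` splits with a limit of 2 and keeps everything after the first tab, which differs from `tokens(1)` above when the JSON itself contains a tab character.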

Answered 2021-06-01
