import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.io.NullWritable
val path = "/tmp/path"
val rdd = sc.parallelize(List("foo"))
val bytesRdd = rdd.map { str => (NullWritable.get, new BytesWritable(str.getBytes)) }
bytesRdd.saveAsSequenceFile(path)
// copyBytes() returns only the valid bytes, not the whole backing buffer
val recovered = sc.sequenceFile[NullWritable, BytesWritable](path).map(_._2.copyBytes())
val recoveredAsString = recovered.map(new String(_))
recoveredAsString.collect()
// result is: Array[String] = Array(foo)
2 answers
drnojrws1#
Here is a snippet with all the required imports that you can run from spark-shell, as requested by @choix.
zpf6vheq2#
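A common problem seems to be a weird "cannot cast" exception from BytesWritable to NullWritable. The other common problem is that BytesWritable's getBytes does not return just your bytes. It hands back the backing buffer, which is your bytes followed by padding (often a ton of zeros on the end); only the first getLength bytes are valid. You have to use copyBytes instead.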
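To make the getBytes versus copyBytes difference concrete, here is a minimal sketch you could paste into spark-shell (or any Scala REPL with Hadoop on the classpath). The sample strings and variable names are made up for illustration. It reuses one BytesWritable for a shorter payload, which is roughly what happens when Hadoop record readers recycle their objects.

```scala
import java.util.Arrays
import org.apache.hadoop.io.BytesWritable

// Start with an 11-byte payload, then reuse the same object for a 3-byte one.
// (Sample data is hypothetical; any two payloads of different length will do.)
val bw = new BytesWritable("hello world".getBytes("UTF-8"))
bw.set("foo".getBytes("UTF-8"), 0, 3)

// getBytes returns the backing buffer, which may be longer than the data.
val raw = bw.getBytes

// copyBytes returns a fresh array holding only the valid bytes.
val exact = bw.copyBytes()

// Equivalent by hand: trim the backing buffer to getLength.
val viaLength = Arrays.copyOf(bw.getBytes, bw.getLength)

println(s"backing buffer length: ${raw.length}, valid length: ${bw.getLength}")
println(s"copyBytes as string: ${new String(exact, "UTF-8")}")
```

If copyBytes is not available in your Hadoop version, trimming getBytes to getLength as in viaLength should give the same result.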
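```
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.rdd.RDD

val rdd: RDD[Array[Byte]] = ???

// To write (codecOpt is an Option of a compression codec class; None writes uncompressed)
rdd.map(bytesArray => (NullWritable.get(), new BytesWritable(bytesArray)))
  .saveAsSequenceFile("/output/path", codecOpt)

// To read (assumes you are reading back the path written above)
val recovered: RDD[Array[Byte]] = sc.sequenceFile[NullWritable, BytesWritable]("/output/path")
  .map(_._2.copyBytes())
```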