我在scala spark中做了一个函数,看起来像这样。
def prepareSequences(data: RDD[String], splitChar: Char = '\t') = {
val x = data.map(line => {
val Array(id, se, offset, hour) = line.split(splitChar)
(id + "-" + se,
Step(offset = if (offset == "NULL") {
-5
} else {
offset.toInt
},
hour = hour.toInt))
})
val y = x.groupBy(_._1)}
我需要你的帮助 groupBy
但我一加上它,就得到一个错误。错误要求 Lzocodec
.
Exception in thread "main" java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:112)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:78)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:188)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.Partitioner$$anonfun$defaultPartitioner$2.apply(Partitioner.scala:66)
at org.apache.spark.Partitioner$$anonfun$defaultPartitioner$2.apply(Partitioner.scala:66)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:66)
at org.apache.spark.rdd.RDD$$anonfun$groupBy$1.apply(RDD.scala:687)
at org.apache.spark.rdd.RDD$$anonfun$groupBy$1.apply(RDD.scala:687)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.groupBy(RDD.scala:686)
at com.savagebeast.mypackage.DataPreprocessing$.prepareSequences(DataPreprocessing.scala:42)
at com.savagebeast.mypackage.activity_mapper$.main(activity_mapper.scala:31)
at com.savagebeast.mypackage.activity_mapper.main(activity_mapper.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
... 44 more
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:139)
at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:180)
at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
... 49 more
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:132)
... 51 more
我安装了 lzo
在CDH5上找不到spark所需的com.hadoop.compression.lzo.lzocodec类之后的其他内容?
我错过什么了吗?
更新:找到解决方案。
像这样划分rdd解决了这个问题。
val y = x.groupByKey(50)
50是我想要的rdd分区数。它可以是任何数字。
然而,我不知道为什么这样做。如果有人能解释我会很感激的。
更新2:下面的工作更合理,到目前为止是稳定的。
我抄的 hadoop-lzo-0.4.21-SNAPSHOT.jar
从 /Users/<username>/hadoop-lzo/target
至 /usr/local/Cellar/apache-spark/2.1.0/libexec/jars
. 实际上是将jar复制到spark的类路径。
1条答案
按热度按时间4c8rllxm1#
不,这不是
groupBy
. 如果您查看一下堆栈跟踪(非常感谢发布它),您将看到它在输入格式的某个地方失败:这表明您的输入是压缩的。当你打电话的时候它失败了
groupBy
,因为这一点spark必须决定分区的数量,并触摸输入。实际上-是的,似乎你需要
lzo
执行作业的编解码器。