I am trying to pull data from MongoDB into a DataFrame using Java Spark.
Here is the sample code:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import com.mongodb.spark.MongoSpark;

SparkConf conf = new SparkConf()
    .setAppName("MongoSparkConnectorTour")
    .setMaster("local")
    .set("spark.app.id", "MongoSparkConnectorTour")
    .set("spark.mongodb.input.uri", uri)
    // the connector only picks this up under the input namespace;
    // a bare "sampleSize" key is ignored
    .set("spark.mongodb.input.sampleSize", args[2])
    .set("spark.mongodb.output.uri", uri)
    .set("spark.mongodb.input.partitioner", "MongoPaginateByCountPartitioner")
    .set("spark.mongodb.input.partitionerOptions.numberOfPartitions", "64");
JavaSparkContext jsc = new JavaSparkContext(conf);

// Spark 2.x Java API: toDF() returns Dataset<Row> (there is no DataFrame class in Java)
Dataset<Row> df = MongoSpark.load(jsc).toDF();
System.out.println("DF Count - " + df.count());
df.printSchema();
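As an aside, the sample size can also be supplied per read through a ReadConfig rather than the global SparkConf. A minimal sketch, assuming mongo-spark-connector 2.x; the collection name and sample value below are placeholders:

import java.util.HashMap;
import java.util.Map;
import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.config.ReadConfig;

Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("sampleSize", "100000");   // documents sampled for schema inference (placeholder value)
readOverrides.put("collection", "myColl");   // hypothetical collection name
ReadConfig readConfig = ReadConfig.create(jsc).withOptions(readOverrides);
Dataset<Row> df2 = MongoSpark.load(jsc, readConfig).toDF();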
There are two collections. From one of them I can fetch data without any problem, but for the other I get the following error:
20/07/15 14:17:31 ERROR Executor: Exception in task 1.0 in stage 3.0 (TID 4)
com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a NullType (value: BsonString{value='4492148'})
at com.mongodb.spark.sql.MapFunctions$.com$mongodb$spark$sql$MapFunctions$$convertToDataType(MapFunctions.scala:80)
at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:38)
at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:36)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at com.mongodb.spark.sql.MapFunctions$.documentToRow(MapFunctions.scala:36)
at com.mongodb.spark.sql.MapFunctions$.castToStructType(MapFunctions.scala:109)
at com.mongodb.spark.sql.MapFunctions$.com$mongodb$spark$sql$MapFunctions$$convertToDataType(MapFunctions.scala:75)
at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:38)
at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:36)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at com.mongodb.spark.sql.MapFunctions$.documentToRow(MapFunctions.scala:36)
at com.mongodb.spark.sql.MapFunctions$.castToStructType(MapFunctions.scala:109)
at com.mongodb.spark.sql.MapFunctions$.com$mongodb$spark$sql$MapFunctions$$convertToDataType(MapFunctions.scala:75)
at com.mongodb.spark.sql.MapFunctions$$anonfun$3.apply(MapFunctions.scala:38)
The only solution I found on Google is to increase the sample size, but it still does not work; the related Stack Overflow post on this cast failure likewise only suggests increasing the sample size.
The second collection is somewhat larger in volume, and I tried bigger sample sizes, but it still fails.
Any other suggestions or ideas for resolving this would be helpful.
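For context on why increasing the sample size may never help: the connector infers the schema by sampling documents, and if every sampled document has a field missing or null, that field is inferred as NullType; as soon as a real string such as BsonString{value='4492148'} shows up at read time, the cast fails. One way to sidestep inference entirely is to pass an explicit schema. A minimal sketch, assuming Spark 2.x; the field names and types below are hypothetical and must be replaced with the collection's real ones:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder()
    .appName("MongoSparkConnectorTour")
    .config("spark.mongodb.input.uri", uri)
    .getOrCreate();

// Declare the problem field as StringType up front so nothing is inferred as NullType.
StructType schema = new StructType()
    .add("someStringField", DataTypes.StringType)   // hypothetical field name
    .add("someLongField", DataTypes.LongType);      // declare every column you need

Dataset<Row> dfWithSchema = spark.read()
    .format("com.mongodb.spark.sql.DefaultSource")  // the MongoDB Spark data source
    .schema(schema)                                 // skips sampling-based inference
    .load();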