Why do I get the error "Size exceeds Integer.MAX_VALUE" when using Spark + Cassandra?

Posted by hs1ihplo on 2022-11-05 in Cassandra


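I have 7 Cassandra nodes (5 nodes with 32 cores and 32G memory, and 4 nodes with 4 cores and 64G memory), and I deployed Spark workers on this cluster, with the Spark master on the 8th node. I use spark-cassandra-connector to connect them. My Cassandra cluster now holds nearly 1 billion records with 30 fields each. I wrote Scala code that includes the following snippet: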

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.cassandra.CassandraSQLContext

def startOneCache(): DataFrame = {
  val conf = new SparkConf(true)
    .set("spark.cassandra.connection.host", "192.168.0.184")
    .set("spark.cassandra.auth.username", "username")
    .set("spark.cassandra.auth.password", "password")
    .set("spark.driver.maxResultSize", "4G")
    .set("spark.executor.memory", "12G")
    .set("spark.cassandra.input.split.size_in_mb", "64")

  val sc = new SparkContext("spark://192.168.0.131:7077", "statistics", conf)
  val cc = new CassandraSQLContext(sc)
  // Select the columns of interest and cap the result at roughly 100 million rows.
  val rdd: DataFrame = cc.sql(
    "select user_id,col1,col2,col3,col4,col5,col6,col7,col8 from user_center.users"
  ).limit(100000192)
  val rdd_cache: DataFrame = rdd.cache()

  // count() forces evaluation, which materializes the cache.
  rdd_cache.count()
  rdd_cache
}

On the Spark master I use spark-submit to run the code above. While executing the statement rdd_cache.count(), I get an ERROR on one worker node, 192.168.0.185:

16/03/08 15:38:57 INFO ShuffleBlockFetcherIterator: Started 4 remote fetches in 221 ms
16/03/08 15:43:49 WARN MemoryStore: Not enough space to cache rdd_6_0 in memory! (computed 4.6 GB so far)
16/03/08 15:43:49 INFO MemoryStore: Memory use = 61.9 KB (blocks) + 4.6 GB (scratch space shared across 1 tasks(s)) = 4.6 GB. Storage limit = 6.2 GB.
16/03/08 15:43:49 WARN CacheManager: Persisting partition rdd_6_0 to disk instead.
16/03/08 16:13:11 ERROR Executor: Managed memory leak detected; size = 4194304 bytes, TID = 24002
16/03/08 16:13:11 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 24002)
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

I assume the final error, Size exceeds Integer.MAX_VALUE, is caused by the earlier warning: 16/03/08 15:43:49 WARN MemoryStore: Not enough space to cache rdd_6_0 in memory! (computed 4.6 GB so far), but I don't know why. Should I set a value larger than .set("spark.executor.memory", "12G")? What should I do to fix this?

cassandra

Source: https://stackoverflow.com/questions/35863441/why-i-got-the-error-size-exceed-integer-max-value-when-using-sparkcassandra

1 Answer


mf98qq941#

No Spark shuffle block can be greater than 2 GB.
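Spark uses ByteBuffer as the abstraction for storing blocks, and a ByteBuffer's size is limited by Integer.MAX_VALUE bytes (about 2 billion, i.e. roughly 2 GB).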
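To make the limit concrete, here is a minimal sketch in plain Scala (no Spark needed) that compares a block size against Integer.MAX_VALUE. The object name BlockLimit and the helper exceedsLimit are hypothetical names for illustration only; the 4.6 GB figure is taken from the MemoryStore warning in the question.

```scala
object BlockLimit {
  // Upper bound on a single block, in bytes (Integer.MAX_VALUE).
  val maxBlockBytes: Long = Int.MaxValue.toLong

  // Hypothetical helper: true when a block of the given size cannot fit.
  def exceedsLimit(bytes: Long): Boolean = bytes > maxBlockBytes

  def main(args: Array[String]): Unit = {
    // "computed 4.6 GB so far" from the log in the question.
    val partitionBytes = (4.6 * 1024 * 1024 * 1024).toLong
    println(s"limit = $maxBlockBytes bytes, partition = $partitionBytes bytes")
    println(s"exceeds limit: ${exceedsLimit(partitionBytes)}")
  }
}
```

A partition that has already reached 4.6 GB is more than twice the limit, which is consistent with the exception in the question.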
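A low number of partitions leads to large shuffle blocks. To fix this, try increasing the number of partitions, for example with rdd.repartition(), or with rdd.coalesce() and shuffle = true (without a shuffle, coalesce can only reduce the partition count).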
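The log in the question shows a single cached partition (rdd_6_0) growing past 4.6 GB, possibly because limit() gathers its rows into one partition. The following is a minimal sketch of repartitioning before caching, not a definitive fix. It assumes Spark is on the classpath; RepartitionSketch, partitionsFor, repartitionAndCache and the 128 MB target size are hypothetical choices made for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

object RepartitionSketch {
  // Hypothetical helper: how many partitions are needed so that each one
  // holds at most targetBytes (ceiling division, at least 1).
  def partitionsFor(totalBytes: Long, targetBytes: Long): Int = {
    require(targetBytes > 0, "targetBytes must be positive")
    math.max(1L, (totalBytes + targetBytes - 1) / targetBytes).toInt
  }

  // Spread the rows over numPartitions partitions, then cache the result.
  def repartitionAndCache(df: DataFrame, numPartitions: Int): DataFrame = {
    val spread = df.repartition(numPartitions)
    // Allow spilling to disk when a partition does not fit in memory.
    spread.persist(StorageLevel.MEMORY_AND_DISK)
    spread.count() // forces evaluation so the cache is populated
    spread
  }
}
```

In the question's code this could be applied as RepartitionSketch.repartitionAndCache(rdd, n) in place of rdd.cache(), with n chosen so that every partition stays well below 2 GB. The real total size is larger than the 4.6 GB seen in the log, since that figure was only what had been computed so far.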
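If that does not help, at least one partition is still too large, and you may need a more sophisticated approach to shrink it, for example using randomness to even out the distribution of RDD data across partitions.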
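One common way to use randomness for this is to add a random "salt" to the partitioning key. The sketch below simulates the effect in plain Scala without Spark, so it only illustrates the idea; SaltingSketch and its functions are hypothetical names, and the hash-modulo partitioning is an assumption meant to mimic a hash partitioner.

```scala
import scala.util.Random

object SaltingSketch {
  // Mimics hash partitioning: non-negative remainder of the hash.
  def partitionOf(hash: Long, numPartitions: Int): Int =
    java.lang.Math.floorMod(hash, numPartitions.toLong).toInt

  // Rows per partition when partitioning by the key alone.
  def sizesByKey(keys: Seq[String], numPartitions: Int): Map[Int, Int] =
    keys
      .groupBy(k => partitionOf(k.hashCode.toLong, numPartitions))
      .map { case (p, ks) => p -> ks.size }

  // Rows per partition when a random salt in [0, numPartitions) is added
  // to the key's hash before choosing the partition.
  def sizesBySaltedKey(keys: Seq[String], numPartitions: Int, rng: Random): Map[Int, Int] =
    keys
      .groupBy(k => partitionOf(k.hashCode.toLong + rng.nextInt(numPartitions), numPartitions))
      .map { case (p, ks) => p -> ks.size }

  def main(args: Array[String]): Unit = {
    val keys = Seq.fill(10000)("hot_user") // every row shares one key
    println(s"by key:        ${sizesByKey(keys, 8)}")
    println(s"by salted key: ${sizesBySaltedKey(keys, 8, new Random(42))}")
  }
}
```

Without the salt, all rows with the same key land in one partition; with it, they are spread over several. In a real job the salt would have to be carried alongside the key and stripped or aggregated away afterwards.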
