Strange issue with Spark caching

Posted by jgovgodb on 2021-05-29 in Hadoop


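We are using Spark 2.2.0. We have 1.5 TB of data in a Hive table, on an 80-node cluster where each node has about 512 GB of RAM and 40 cores.
I am accessing this data with Spark SQL. A simple command in plain Spark SQL (without caching), such as getting the distinct count of a particular column's values, takes about 13 seconds. But when I run the same command after caching the table, it takes more than 10 minutes. I'm not sure what the problem is.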

export SPARK_MAJOR_VERSION=2
spark-shell --master yarn --num-executors 40 --driver-memory 5g --executor-memory 100g --executor-cores 5
spark.conf.set("spark.sql.shuffle.partitions", 10)
val df = spark.sql("select * from analyticalprofiles.customer_v2")
df.createOrReplaceTempView("tmp")
spark.time(spark.sql("select count(distinct(household_number)) from tmp").show())
>> Time taken: 13927 ms

import org.apache.spark.storage.StorageLevel
val df2 = df.persist(StorageLevel.MEMORY_ONLY)
df2.createOrReplaceTempView("tmp2")
spark.time(spark.sql("select count(distinct(household_number)) from tmp2").show())
>> 1037482 ms ==> FIRST TIME - okay if this is more
spark.time(spark.sql("select count(distinct(household_number)) from tmp2").show())
>> 834740 ms  ==> SECOND TIME - Was expecting much faster execution ???

I tried the same thing with spark.catalog.cacheTable("tmp"), but the cached query still takes longer. I don't know why. Can anyone help?

df2.storageLevel.useMemory
res6: Boolean = true

sc.getPersistentRDDs
res8: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(12 -> In-memory table tmp MapPartitionsRDD[12] at cacheTable at <console>:24)

spark.conf.get("spark.sql.inMemoryColumnarStorage.compressed")
res11: String = true

spark.conf.get("spark.sql.inMemoryColumnarStorage.batchSize")
res12: String = 10000

spark.catalog.isCached("tmp")
res13: Boolean = true

hadoop DataFrame apache-spark Caching

Source: https://stackoverflow.com/questions/51599231/strange-issue-with-spark-caching

1 Answer


sczxawaw1#

You could try the following.
Use the formulas below to increase the number of executors and reduce the memory per executor:

SPARK_EXECUTOR_CORES (--executor-cores): 5

Number of executors (--num-executors): (number of nodes) * (cores per node) / (executor cores) - 1 (for the Application Master) = (80 * 40) / 5 - 1 = 640 - 1 = 639

SPARK_EXECUTOR_MEMORY (--executor-memory): (memory per node) / (number of executors / number of nodes) = 512 / (639 / 80) ≈ 64 GB
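
The arithmetic above can be sketched in Scala as follows. This is only an illustration of the answer's formula, not an official sizing rule: the object name `ExecutorSizing`, the case class `Sizing`, and the function `size` are hypothetical names introduced here. It takes the formula at face value, so it reserves nothing for the operating system or for YARN's per-executor memory overhead; in practice a somewhat lower memory value may be needed.

```scala
object ExecutorSizing {
  // Result of the sizing formula: executor count and memory per executor in GB.
  final case class Sizing(numExecutors: Int, executorMemoryGb: Int)

  // Applies the formula from the answer:
  //   executors = nodes * coresPerNode / executorCores - 1 (one slot left for the Application Master)
  //   memory    = ramPerNodeGb / (executors / nodes), rounded to the nearest GB
  def size(nodes: Int, coresPerNode: Int, ramPerNodeGb: Int, executorCores: Int): Sizing = {
    val totalSlots = nodes * coresPerNode / executorCores
    val numExecutors = totalSlots - 1
    val executorsPerNode = numExecutors.toDouble / nodes
    val memoryGb = math.round(ramPerNodeGb / executorsPerNode).toInt
    Sizing(numExecutors, memoryGb)
  }

  def main(args: Array[String]): Unit = {
    // Cluster figures from the question: 80 nodes, 40 cores and about 512 GB RAM each.
    val s = size(nodes = 80, coresPerNode = 40, ramPerNodeGb = 512, executorCores = 5)
    println(s"--num-executors ${s.numExecutors} --executor-cores 5 --executor-memory ${s.executorMemoryGb}g")
  }
}
```

With the question's cluster this prints `--num-executors 639 --executor-cores 5 --executor-memory 64g`, which matches the figures above.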

If you want to persist the DataFrame, use StorageLevel.MEMORY_AND_DISK_SER. If memory (RAM) fills up, the partitions that do not fit are stored on disk.
Hope this helps.
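
A minimal sketch of that suggestion, assuming a local SparkSession and a tiny made-up DataFrame in place of the Hive table from the question. The object `PersistSketch` and its helper functions are hypothetical names used for illustration; only `persist`, `StorageLevel.MEMORY_AND_DISK_SER`, `countDistinct`, `storageLevel`, and `unpersist` are actual Spark APIs.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Mark the DataFrame for serialized caching that spills to disk when memory is full.
  def persistSerialized(df: DataFrame): DataFrame =
    df.persist(StorageLevel.MEMORY_AND_DISK_SER)

  // Distinct count of one column. Persisting is lazy, so the first action
  // on a persisted DataFrame is also the one that populates the cache.
  def distinctCount(df: DataFrame, column: String): Long =
    df.agg(countDistinct(column)).first().getLong(0)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode and sample rows, purely for demonstration.
    val spark = SparkSession.builder().appName("persist-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (2, "c")).toDF("household_number", "name")
    val cached = persistSerialized(df)

    println(distinctCount(cached, "household_number")) // first run fills the cache
    println(distinctCount(cached, "household_number")) // later runs read from the cache
    println(cached.storageLevel)

    cached.unpersist()
    spark.stop()
  }
}
```

In spark-shell the `spark` session already exists, so only the `persist` call and the query are needed there.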

Answered 2021-05-29
