Number of partitions in a Spark RDD


I am creating an RDD from a text file (Spark 1.6) and specifying the number of partitions, but the RDD ends up with a different number of partitions than the one I specify.

Case 1

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 1)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[50] at textFile at <console>:27

scala> people.getNumPartitions
res36: Int = 1

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 2)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[52] at textFile at <console>:27

scala> people.getNumPartitions
res37: Int = 2

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 3)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[54] at textFile at <console>:27

scala> people.getNumPartitions
res38: Int = 3

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 4)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at <console>:27

scala> people.getNumPartitions
res39: Int = 4

Case 2

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 0)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[72] at textFile at <console>:27

scala> people.getNumPartitions
res47: Int = 1

Case 3

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 5)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[58] at textFile at <console>:27

scala> people.getNumPartitions
res40: Int = 6

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 6)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[60] at textFile at <console>:27

scala> people.getNumPartitions
res41: Int = 7

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 7)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[62] at textFile at <console>:27

scala> people.getNumPartitions
res42: Int = 8

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 8)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[64] at textFile at <console>:27

scala> people.getNumPartitions
res43: Int = 9

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 10)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[68] at textFile at <console>:27

scala> people.getNumPartitions
res45: Int = 11

Case 4

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 9)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[66] at textFile at <console>:27

scala> people.getNumPartitions
res44: Int = 11

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 11)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[70] at textFile at <console>:27

scala> people.getNumPartitions
res46: Int = 13

The contents of the file /home/pvikash/data/test.txt are:

This is a test file. Will be used for rdd partition
Based on the cases above, I have a few questions:

1. In Case 2 the explicitly specified number of partitions is 0, yet the actual number of partitions is 1 (even though the default minimum number of partitions is 2). Why is the actual number of partitions 1?
2. In Case 3, why is the actual number of partitions one more than the number specified?
3. In Case 4, why is the actual number of partitions two more than the number specified?
4. Why does Spark behave differently in Case 1, Case 2, Case 3, and Case 4?
5. If the input data is small (and could easily fit into a single partition), why does Spark create empty partitions?

Any explanation would be appreciated.

Answer 1

Not a complete answer, but it may bring you closer to one.

The number you are passing is called minPartitions (minSplits in releases before Spark 1.0). It only sets a lower bound on the number of partitions, nothing more.

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

The actual number of splits is computed by the underlying Hadoop InputFormat's getSplits method (see the docs).

This post should answer question 5.
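
To make the arithmetic concrete, below is a minimal sketch (an approximation, not Spark's actual source) of the split computation in Hadoop's FileInputFormat.getSplits (old mapred API), which textFile delegates to. It assumes the test file is 52 bytes (the single line shown above plus a trailing newline) and that the block size is far larger than the file, so it never caps the split size; the names goalSize, splitSize, and SPLIT_SLOP mirror the Hadoop code.

// A sketch of the split arithmetic in Hadoop's FileInputFormat.getSplits.
// Assumption: the 52-byte test file is much smaller than a block, so
// blockSize never caps splitSize and is omitted here.
object SplitSketch {
  val SPLIT_SLOP = 1.1 // a chunk may exceed splitSize by up to 10% before another split is cut

  def numSplits(totalSize: Long, minPartitions: Int): Int = {
    // getSplits treats a requested split count of 0 as 1 (Case 2)
    val goalSize  = totalSize / math.max(minPartitions, 1) // integer division
    val splitSize = math.max(1L, goalSize)
    var remaining = totalSize
    var count = 0
    while (remaining.toDouble / splitSize > SPLIT_SLOP) {
      count += 1
      remaining -= splitSize
    }
    if (remaining != 0) count += 1 // leftover bytes become one final split
    count
  }

  def main(args: Array[String]): Unit =
    Seq(0, 1, 2, 5, 9, 11).foreach { n =>
      println(s"minPartitions=$n -> ${numSplits(52L, n)} partitions") // 1, 1, 2, 6, 11, 13
    }
}

Run against a 52-byte file, this reproduces every REPL result above. Two effects combine: integer division of the file size by minPartitions can make splitSize small enough to yield extra full splits, and any leftover bytes beyond 10% of splitSize form one more split. Together these produce the +1 in Case 3 and the +2 in Case 4, while the numSplits-of-0 special case explains Case 2.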
