Spark Kafka direct DStream: how many executors and RDD partitions are there in yarn-cluster mode when num-executors is set?

Posted by dsf9zpds on 2021-06-08 in Kafka


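I am trying out the Spark Kafka direct stream approach. It simplifies parallelism by creating as many RDD partitions as there are Kafka topic partitions, as the referenced article describes. As I understand it, Spark will create one executor per RDD partition to do the computation.
So, when I submit the application in yarn-cluster mode and set the --num-executors option to a value different from the number of partitions, how many executors will there be?
For example, I have a Kafka topic with 2 partitions and I set --num-executors to 4: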

export YARN_CONF_DIR=$HADOOP_HOME/client_conf

./bin/spark-submit \
--class playground.MainClass \
--master yarn-cluster \
--num-executors 4 \
../spark_applications/uber-spark-streaming-0.0.1-SNAPSHOT.jar \
127.0.0.1:9093,127.0.0.1:9094,127.0.0.1:9095 topic_1

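I tried this and found that there are 4 executors, and every one of them reads and processes data from Kafka. Why? The Kafka topic has only 2 partitions, so how can 4 executors read from a topic with only 2 partitions?
The details of the Spark application and the logs follow.
My Spark application, which prints the messages received from Kafka in each executor (inside the flatMap function):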

...
    String brokers = args[0];
    HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(args[1].split(",")));
    kafkaParams.put("metadata.broker.list", brokers);

    JavaPairInputDStream<String, String> messages =
        KafkaUtils.createDirectStream(jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
            kafkaParams, topicsSet);

    JavaPairDStream<String, Integer> wordCounts =
        messages.flatMap(new FlatMapFunction<Tuple2<String, String>, String>()
        {
            public Iterable<String> call(Tuple2<String, String> tuple) throws Exception
            {
                System.out.println(String.format("[received from kafka] tuple_1 is %s, tuple_2 is %s", tuple._1(),
                    tuple._2())); // print the kafka message received  in executor
                return Arrays.asList(SPACE.split(tuple._2()));
            }

        }).mapToPair(new PairFunction<String, String, Integer>()
        {
            public Tuple2<String, Integer> call(String word) throws Exception
            {
                System.out.println(String.format("[word]: %s", word));
                return new Tuple2<String, Integer>(word, 1);
            }

        }).reduceByKey(new Function2<Integer, Integer, Integer>()
        {
            public Integer call(Integer v1, Integer v2) throws Exception
            {
                return v1 + v2;
            }

        });

    wordCounts.print();

    Runtime.getRuntime().addShutdownHook(new Thread(){
        @Override
        public void run(){
            System.out.println("gracefully shutdown Spark!");
            jssc.stop(true, true);
        }
    });
    jssc.start();
    jssc.awaitTermination();

My Kafka topic, which has 2 partitions. The strings "hello hello world 1", "hello hello world 2", "hello hello world 3", ... are sent to the topic.

Topic: topic_2  PartitionCount:2    ReplicationFactor:2 Configs:
Topic: topic_2  Partition: 0    Leader: 3   Replicas: 3,1   Isr: 3,1
Topic: topic_2  Partition: 1    Leader: 1   Replicas: 1,2   Isr: 1,2

Web console:

Console output of executor 1:

...
[received from kafka] tuple_1 is null, tuple_2 is hello hello world 12
[word]: hello
[word]: hello
[word]: world
[word]: 12
...

Console output of executor 2:

...
[received from kafka] tuple_1 is null, tuple_2 is hello hello world 2
[word]: hello
[word]: hello
[word]: world
[word]: 2
...

Console output of executor 3:

...
[received from kafka] tuple_1 is null, tuple_2 is hello hello world 3
[word]: hello
[word]: hello
[word]: world
[word]: 3
...

apache-kafka apache-spark spark-streaming

Source: https://stackoverflow.com/questions/31351732/spark-kafka-direct-dstream-how-many-executors-and-rdd-partitions-in-yarn-clust

1 Answer


f5emj3cl1#

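Each partition is operated on by one executor at a time (assuming speculative execution is not enabled).
If you have more executors than partitions, not all of them will work on any given RDD. But as you noted, since a DStream is a sequence of RDDs, each executor will end up doing some work over time.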
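
To make that concrete, here is a small toy model in plain Java. It is not Spark's scheduler and uses no Spark API: it assumes one task slot per executor and picks an idle executor at random for each partition, whereas the real scheduler also weighs data locality and core counts. The class and method names (`ExecutorAssignmentSketch`, `simulate`, `countBusy`) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy model only (hypothetical names, not Spark code): every batch yields one RDD,
// and every partition of that RDD becomes one task placed on a random idle executor.
final class ExecutorAssignmentSketch {

    /** Returns how many tasks each executor ran, summed over all batches. */
    static int[] simulate(int numExecutors, int numPartitions, int numBatches, long seed) {
        Random random = new Random(seed);
        int[] tasksPerExecutor = new int[numExecutors];
        for (int batch = 0; batch < numBatches; batch++) {
            // Assumption: one task slot per executor, so a partition runs on exactly one executor.
            List<Integer> idle = new ArrayList<>();
            for (int partition = 0; partition < numPartitions; partition++) {
                if (idle.isEmpty()) {
                    // Start of the batch, or more partitions than executors:
                    // all executors are available for the next wave of tasks.
                    for (int e = 0; e < numExecutors; e++) {
                        idle.add(e);
                    }
                }
                int executor = idle.remove(random.nextInt(idle.size()));
                tasksPerExecutor[executor]++;
            }
        }
        return tasksPerExecutor;
    }

    /** Counts the executors that ran at least one task. */
    static int countBusy(int[] tasksPerExecutor) {
        int busy = 0;
        for (int count : tasksPerExecutor) {
            if (count > 0) {
                busy++;
            }
        }
        return busy;
    }

    public static void main(String[] args) {
        // 4 executors and 2 partitions, as in the question.
        int[] oneBatch = simulate(4, 2, 1, 42L);
        int[] manyBatches = simulate(4, 2, 1000, 42L);
        System.out.println("executors used in a single batch: " + countBusy(oneBatch));
        System.out.println("executors used across 1000 batches: " + countBusy(manyBatches));
    }
}
```

Within a single batch, at most as many executors as there are partitions get a task. Across many batches the tasks land on different executors, which matches the observation that all 4 executors eventually print Kafka messages.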

Answered on 2021-06-08
