How to optimize Spark context/SQL

7gyucuyw posted on 2021-05-18 in Spark

I have many JSON files that I query through Spark SQL. The files are up to 150 MB each. I query them from a Java program. A sample looks like this:

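// Spark classes used by this snippet
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Intended to turn off logging for everything under org.apache
Logger.getLogger("org.apache").setLevel(Level.OFF);
SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark SQL data source JSON example")
        .master("local[2]")   // local mode with 2 worker threads
        .getOrCreate();
// ......

// Load the JSON file and register it as a temporary view for SQL queries
Dataset<Row> state = spark.read().json("jsonPath_File.json");
state.createOrReplaceTempView("state");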

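When I run this program, it takes a very long time. The log output is shown below. How can I improve performance? I am running on a single standalone machine.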

2020-11-03 12:37:24.641  INFO 30444 --- [er-event-loop-4] o.apache.spark.scheduler.TaskSetManager  : Starting task 128.0 in stage 3.0 (TID 135, localhost, executor driver, partition 128, ANY, 4726 bytes)
2020-11-03 12:37:24.641  INFO 30444 --- [er for task 135] org.apache.spark.executor.Executor       : Running task 128.0 in stage 3.0 (TID 135)
2020-11-03 12:37:24.641  INFO 30444 --- [er-event-loop-4] o.apache.spark.scheduler.TaskSetManager  : Starting task 129.0 in stage 3.0 (TID 136, localhost, executor driver, partition 129, ANY, 4726 bytes)
2020-11-03 12:37:24.641  INFO 30444 --- [result-getter-2] o.apache.spark.scheduler.TaskSetManager  : Finished task 126.0 in stage 3.0 (TID 133) in 4 ms on localhost (executor driver) (128/181)
2020-11-03 12:37:24.641  INFO 30444 --- [er for task 136] org.apache.spark.executor.Executor       : Running task 129.0 in stage 3.0 (TID 136)
2020-11-03 12:37:24.641  INFO 30444 --- [result-getter-0] o.apache.spark.scheduler.TaskSetManager  : Finished task 127.0 in stage 3.0 (TID 134) in 4 ms on localhost (executor driver) (129/181)
2020-11-03 12:37:24.642  INFO 30444 --- [er for task 135] o.a.s.s.ShuffleBlockFetcherIterator      : Getting 1 non-empty blocks out of 2 blocks
2020-11-03 12:37:24.642  INFO 30444 --- [er for task 135] o.a.s.s.ShuffleBlockFetcherIterator      : Started 0 remote fetches in 0 ms
2020-11-03 12:37:24.642  INFO 30444 --- [er for task 136] o.a.s.s.ShuffleBlockFetcherIterator      : Getting 1 non-empty blocks out of 2 blocks
2020-11-03 12:37:24.642  INFO 30444 --- [er for task 136] o.a.s.s.ShuffleBlockFetcherIterator      : Started 0 remote fetches in 0 ms
2020-11-03 12:37:24.644  INFO 30444 --- [er for task 135] org.apache.spark.executor.Executor       : Finished task 128.0 in stage 3.0 (TID 135). 3324 bytes result sent to driver
2020-11-03 12:37:24.644  INFO 30444 --- [er for task 136] org.apache.spark.executor.Executor       : Finished task 129.0 in stage 3.0 (TID 136). 3351 bytes result sent to driver
2020-11-03 12:37:24.644  INFO 30444 --- [er-event-loop-5] o.apache.spark.scheduler.TaskSetManager  : Starting task 130.0 in stage 3.0 (TID 137, localhost, executor driver, partition 130, ANY, 4726 bytes)
2020-11-03 12:37:24.644  INFO 30444 --- [er for task 137] org.apache.spark.executor.Executor       : Running task 130.0 in stage 3.0 (TID 137)
2020-11-03 12:37:24.644  INFO 30444 --- [er-event-loop-5] o.apache.spark.scheduler.TaskSetManager  : Starting task 131.0 in stage 3.0 (TID 138, localhost, executor driver, partition 131, ANY, 4726 bytes)
2020-11-03 12:37:24.644  INFO 30444 --- [er for task 138] org.apache.spark.executor.Executor       : Running task 131.0 in stage 3.0 (TID 138)
2020-11-03 12:37:24.644  INFO 30444 --- [result-getter-3] o.apache.spark.scheduler.TaskSetManager  : Finished task 128.0 in stage 3.0 (TID 135) in 4 ms on localhost (executor driver) (130/181)
2020-11-03 12:37:24.644  INFO 30444 --- [result-getter-1] o.apache.spark.scheduler.TaskSetManager  : Finished task 129.0 in stage 3.0 (TID 136) in 3 ms on localhost (executor driver) (131/181)
2020-11-03 12:37:24.646  INFO 30444 --- [er for task 138] o.a.s.s.ShuffleBlockFetcherIterator      : Getting 1 non-empty blocks out of 2 blocks
2020-11-03 12:37:24.646  INFO 30444 --- [er for task 137] o.a.s.s.ShuffleBlockFetcherIterator      : Getting 1 non-empty blocks out of 2 blocks
2020-11-03 12:37:24.646  INFO 30444 --- [er for task 138] o.a.s.s.ShuffleBlockFetcherIterator      : Started 0 remote fetches in 0 ms
2020-11-03 12:37:24.646  INFO 30444 --- [er for task 137] o.a.s.s.ShuffleBlockFetcherIterator      : Started 0 remote fetches in 0 ms
2020-11-03 12:37:24.647  INFO 30444 --- [er for task 137] org.apache.spark.executor.Executor       : Finished task 130.0 in stage 3.0 (TID 137). 3228 bytes result sent to driver
2020-11-03 12:37:24.647  INFO 30444 --- [er for task 138] org.apache.spark.executor.Executor       : Finished task 131.0 in stage 3.0 (TID 138). 3217 bytes result sent to driver
2020-11-03 12:37:24.648  INFO 30444 --- [er-event-loop-4] o.apache.spark.scheduler.TaskSetManager  : Starting task 132.0 in stage 3.0 (TID 139, localhost, executor driver, partition 132, ANY, 4726 bytes)
2020-11-03 12:37:24.648  INFO 30444 --- [er for task 139] org.apache.spark.executor.Executor       : Running task 132.0 in stage 3.0 (TID 139)
2020-11-03 12:37:24.648  INFO 30444 --- [er-event-loop-4] o.apache.spark.scheduler.TaskSetManager  : Starting task 133.0 in stage 3.0 (TID 140, localhost, executor driver, partition 133, ANY, 4726 bytes)
2020-11-03 12:37:24.648  INFO 30444 --- [er for task 140] org.apache.spark.executor.Executor       : Running task 133.0 in stage 3.0 (TID 140)
2020-11-03 12:37:24.648  INFO 30444 --- [result-getter-2] o.apache.spark.scheduler.TaskSetManager  : Finished task 130.0 in stage 3.0 (TID 137) in 4 ms on localhost (executor driver) (132/181)
2020-11-03 12:37:24.648  INFO 30444 --- [result-getter-0] o.apache.spark.scheduler.TaskSetManager  : Finished task 131.0 in stage 3.0 (TID 138) in 4 ms on localhost (executor driver) (133/181)
2020-11-03 12:37:24.649  INFO 30444 --- [er for task 140] o.a.s.s.ShuffleBlockFetcherIterator      : Getting 1 non-empty blocks out of 2 blocks
2020-11-03 12:37:24.649  INFO 30444 --- [er for task 139] o.a.s.s.ShuffleBlockFetcherIterator      : Getting 1 non-empty blocks out of 2 blocks
2020-11-03 12:37:24.649  INFO 30444 --- [er for task 140] o.a.s.s.ShuffleBlockFetcherIterator      : Started 0 remote fetches in 0 ms
2020-11-03 12:37:24.649  INFO 30444 --- [er for task 139] o.a.s.s.ShuffleBlockFetcherIterator      : Started 0 remote fetches in 0 ms
2020-11-03 12:37:24.651  INFO 30444 --- [er for task 139] org.apache.spark.executor.Executor       : Finished task 132.0 in stage 3.0 (TID 139). 3107 bytes result sent to driver
2020-11-03 12:37:24.651  INFO 30444 --- [er for task 140] org.apache.spark.executor.Executor       : Finished task 133.0 in stage 3.0 (TID 140). 3186 bytes result sent to driver
2020-11-03 12:37:24.651  INFO 30444 --- [er-event-loop-5] o.apache.spark.scheduler.TaskSetManager  : 
 and this goes on many times.
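
To put numbers on what the log excerpt shows, here is a minimal sketch in plain Java (no Spark dependency) that scans the TaskSetManager "Finished task" lines and reports how many tasks have completed and how long each took. The class name TaskLogSummary and the method summarize are hypothetical names chosen for illustration, and the regular expression assumes the exact message format shown in the log above. If the tasks really are only a few milliseconds each, as in the excerpt, one possible reading is that per-task scheduling overhead rather than the actual work dominates the stage. In that case, a setting such as spark.sql.shuffle.partitions (default 200) may be worth experimenting with, though the excerpt alone does not confirm this is the cause.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TaskLogSummary {
    // Matches TaskSetManager completion messages such as:
    // "Finished task 126.0 in stage 3.0 (TID 133) in 4 ms on localhost (executor driver) (128/181)"
    // Groups: 1 = task duration in ms, 2 = tasks completed so far, 3 = tasks in the stage.
    private static final Pattern FINISHED = Pattern.compile(
            "Finished task \\S+ in stage \\S+ \\(TID \\d+\\) in (\\d+) ms on .* \\((\\d+)/(\\d+)\\)");

    /** Returns {matched lines, summed task ms, highest completed count, tasks in stage}. */
    static long[] summarize(List<String> lines) {
        long matched = 0, totalMs = 0, done = 0, total = 0;
        for (String line : lines) {
            Matcher m = FINISHED.matcher(line);
            if (m.find()) {
                matched++;
                totalMs += Long.parseLong(m.group(1));
                done = Math.max(done, Long.parseLong(m.group(2)));
                total = Long.parseLong(m.group(3));
            }
        }
        return new long[] {matched, totalMs, done, total};
    }

    public static void main(String[] args) {
        // Message parts copied from the log excerpt above (timestamps omitted).
        List<String> sample = List.of(
                "Finished task 126.0 in stage 3.0 (TID 133) in 4 ms on localhost (executor driver) (128/181)",
                "Finished task 127.0 in stage 3.0 (TID 134) in 4 ms on localhost (executor driver) (129/181)",
                "Finished task 129.0 in stage 3.0 (TID 136) in 3 ms on localhost (executor driver) (131/181)",
                // Executor lines use a different format and are skipped by the pattern.
                "Finished task 128.0 in stage 3.0 (TID 135). 3324 bytes result sent to driver");
        long[] s = summarize(sample);
        double meanMs = s[0] == 0 ? 0.0 : (double) s[1] / s[0];
        System.out.println("completion lines: " + s[0]);
        System.out.println("mean task time (ms): " + meanMs);
        System.out.println("progress: " + s[2] + "/" + s[3]);
    }
}
```

Running the same method over the full application log, instead of the short sample in main, would show whether every stage is made up of such short tasks or only this one.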
