Small-file performance issues in Hive

nwsw7zdq  asked 2021-05-29  in Hadoop

I was reading an article about how small files degrade Hive query performance: https://community.hitachivantara.com/community/products-and-solutions/pentaho/blog/2017/11/07/working-with-small-files-in-hadoop-part-1
I understand the first part about overloading the NameNode.
However, what the author describes does not seem to happen, with either MapReduce or Tez. He writes:

  When a MapReduce job launches, it schedules one map task per block of data being processed.

I do not see a map task created for each file. The reason may be that he is referring to MapReduce version 1, and a lot has changed since then.
Hive version: Hive 1.2.1000.2.6.4.0-91
My table:

  create table temp.emp_orc_small_files (id int, name string, salary int)
  stored as orcfile;

Data: the loop below creates 100 small files, each containing only a few KB of data (a quick way to verify this on HDFS follows the loop).

  for i in {1..100}; do hive -e "insert into temp.emp_orc_small_files values(${i}, 'test_${i}', `shuf -i 1000-5000 -n 1`);"; done
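To double-check that the loop really produced ~100 tiny ORC files (rather than Hive merging them on write), the table directory on HDFS can be listed. The warehouse path below is an assumption based on the default HDP layout, not something stated in the original post:

  # Sanity check: how many files does the table directory hold, and how big are they?
  # NOTE: the warehouse path is assumed from the default HDP layout.
  hdfs dfs -count /apps/hive/warehouse/temp.db/emp_orc_small_files
  hdfs dfs -ls -h /apps/hive/warehouse/temp.db/emp_orc_small_files

The first command reports the directory count, file count, and total size; with the loop above one would expect roughly 100 files of a few KB each.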

However, I see only one mapper and one reducer task created for the following query:

  [root@sandbox-hdp ~]# hive -e "select max(salary) from temp.emp_orc_small_files"
  log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
  Logging initialized using configuration in file:/etc/hive/2.6.4.0-91/0/hive-log4j.properties
  Query ID = root_20180911200039_9e1361cb-0a5d-45a3-9c98-4aead46905ac
  Total jobs = 1
  Launching Job 1 out of 1
  Status: Running (Executing on YARN cluster with App id application_1536258296893_0257)
  --------------------------------------------------------------------------------
  VERTICES          STATUS     TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
  --------------------------------------------------------------------------------
  Map 1 ..........  SUCCEEDED      1          1        0        0       0       0
  Reducer 2 ......  SUCCEEDED      1          1        0        0       0       0
  --------------------------------------------------------------------------------
  VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 7.36 s
  --------------------------------------------------------------------------------
  OK
  4989
  Time taken: 13.643 seconds, Fetched: 1 row(s)
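On Tez the mapper count is also shaped by split grouping, so a single Map vertex over 100 tiny files is expected behavior rather than a bug. As an experiment, the grouping thresholds can be shrunk so that splits are grouped far less aggressively; tez.grouping.min-size and tez.grouping.max-size are standard Tez settings (values in bytes), but the resulting task counts depend on the exact Hive/Tez versions, so treat this as a sketch:

  # Experiment: make Tez split grouping much less aggressive and re-run the query,
  # then compare the TOTAL column of the Map vertex with the single-task run above.
  hive -e "set hive.execution.engine=tez;
           set tez.grouping.min-size=1024;
           set tez.grouping.max-size=4096;
           select max(salary) from temp.emp_orc_small_files;"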

The results with MapReduce are the same:

  hive> set hive.execution.engine=mr;
  hive> select max(salary) from temp.emp_orc_small_files;
  Query ID = root_20180911200545_c4f63cc6-0ab8-4bed-80fe-b4cb545018f2
  Total jobs = 1
  Launching Job 1 out of 1
  Number of reduce tasks determined at compile time: 1
  In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
  In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
  In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
  Starting Job = job_1536258296893_0259, Tracking URL = http://sandbox-hdp.hortonworks.com:8088/proxy/application_1536258296893_0259/
  Kill Command = /usr/hdp/2.6.4.0-91/hadoop/bin/hadoop job -kill job_1536258296893_0259
  Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
  2018-09-11 20:05:57,213 Stage-1 map = 0%, reduce = 0%
  2018-09-11 20:06:04,727 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.37 sec
  2018-09-11 20:06:12,189 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.36 sec
  MapReduce Total cumulative CPU time: 7 seconds 360 msec
  Ended Job = job_1536258296893_0259
  MapReduce Jobs Launched:
  Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 7.36 sec  HDFS Read: 66478  HDFS Write: 5  SUCCESS
  Total MapReduce CPU Time Spent: 7 seconds 360 msec
  OK
  4989

ijxebb2r1#

This is because the following configuration is taking effect:

  hive.hadoop.supports.splittable.combineinputformat

From the documentation:

  Whether to combine small input files so that fewer mappers are spawned.

So essentially, Hive can infer that the input is a set of small files smaller than the block size and combine them to reduce the number of mappers required.
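One way to check that it is the split combining, and not MapReduce v2 itself, that collapses the work into a single mapper is to switch Hive to the non-combining input format and re-run the query. This is only a sketch: CombineHiveInputFormat is the usual default, and the observed mapper count will still depend on the engine and version in use:

  # Experiment: disable split combining so that each small file can become its own
  # split (and therefore its own mapper), then compare the reported mapper count.
  hive -e "set hive.execution.engine=mr;
           set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
           select max(salary) from temp.emp_orc_small_files;"

If the job now reports far more mappers, the behavior the article describes is still there underneath; Hive simply hides it by grouping small files into fewer splits.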
