count(*)在配置单元中返回多行,而不是一行

d6kp6zgx  于 2021-06-25  发布在  Hive
关注(0)|答案(1)|浏览(391)

为什么count(*)在配置单元中返回2个计数据观察,两个返回值的总和等于表中记录的实际计数。请注意,我在查询中没有使用任何where,group by子句。下面是我的创建查询

CREATE TABLE `employee`(                    
   `file_id` string,                                
   `record_id` bigint,                              
   `emp_id` bigint,                            
   `dept` string,                                                 
   `salary` bigint                           
 )                       
 PARTITIONED BY (                                   
   `load_date` string)                              
 CLUSTERED BY (                                     
   `emp_id`)                                     
 INTO 7 BUCKETS                                     
 ROW FORMAT SERDE                                   
   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'      
 STORED AS INPUTFORMAT                              
   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'  
 OUTPUTFORMAT                                       
   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' 
 LOCATION                                           
   'hdfs://data/schema/emp/employee' 
 TBLPROPERTIES (                                    
   'bucketing_version'='2',                         
   'compactor.mapreduce.map.memory.mb'='2048',      
   'compactorthreshold.hive.compactor.delta.num.threshold'='4',  
   'compactorthreshold.hive.compactor.delta.pct.threshold'='0.5',  
   'discover.partitions'='true',                    
   'last_modified_by'='pately',                     
   'last_modified_time'='1579080979',               
   'orc.compress'='ZLIB',                           
   'transactional'='true',                          
   'transactional_properties'='default',            
   'transient_lastDdlTime'='1579080979');

我正在尝试选择下面的计数

select count(*) from `employee`;

reduce tasks已设置为1

set mapred.reduce.tasks;
+-------------------------+
|           set           |
+-------------------------+
| mapred.reduce.tasks=-1  |
+-------------------------+

结果:

INFO  : Compiling command(queryId=dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428): select count(*) from employee
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428); Time taken: 0.176 seconds
INFO  : Executing command(queryId=dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428): select count(*) from employee
INFO  : Query ID = dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Subscribed to counters: [] for queryId: dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428
INFO  : Session is already open
INFO  : Dag name: select count(*) from employee (Stage-1)
INFO  : Tez session was closed. Reopening...
INFO  : Session re-established.
INFO  : Session re-established.
INFO  : Status: Running (Executing on YARN cluster with App id application_1586778720046_0798)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      2          2        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 6.14 s
----------------------------------------------------------------------------------------------
INFO  : Status: DAG finished successfully in 6.12 seconds
INFO  :
INFO  : Query Execution Summary
INFO  : ----------------------------------------------------------------------------------------------
INFO  : OPERATION                            DURATION
INFO  : ----------------------------------------------------------------------------------------------
INFO  : Compile Query                           0.18s
INFO  : Prepare Plan                            0.06s
INFO  : Get Query Coordinator (AM)              0.00s
INFO  : Submit Plan                             5.75s
INFO  : Start DAG                               0.89s
INFO  : Run DAG                                 6.12s
INFO  : ----------------------------------------------------------------------------------------------
INFO  :
INFO  : Task Execution Summary
INFO  : ----------------------------------------------------------------------------------------------
INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
INFO  : ----------------------------------------------------------------------------------------------
INFO  :      Map 1           2587.00         15,520            147          12,506                0
INFO  : ----------------------------------------------------------------------------------------------
INFO  :
INFO  : org.apache.tez.common.counters.DAGCounter:
INFO  :    NUM_SUCCEEDED_TASKS: 2
INFO  :    TOTAL_LAUNCHED_TASKS: 2
INFO  :    RACK_LOCAL_TASKS: 2
INFO  :    AM_CPU_MILLISECONDS: 4030
INFO  :    AM_GC_TIME_MILLIS: 0
INFO  : File System Counters:
INFO  :    HDFS_BYTES_READ: 3956004
INFO  :    HDFS_BYTES_WRITTEN: 232
INFO  :    HDFS_READ_OPS: 363
INFO  :    HDFS_WRITE_OPS: 4
INFO  :    HDFS_OP_CREATE: 2
INFO  :    HDFS_OP_GET_FILE_STATUS: 125
INFO  :    HDFS_OP_OPEN: 238
INFO  :    HDFS_OP_RENAME: 2
INFO  : org.apache.tez.common.counters.TaskCounter:
INFO  :    GC_TIME_MILLIS: 147
INFO  :    TASK_DURATION_MILLIS: 4661
INFO  :    CPU_MILLISECONDS: 15520
INFO  :    PHYSICAL_MEMORY_BYTES: 4294967296
INFO  :    VIRTUAL_MEMORY_BYTES: 14757040128
INFO  :    COMMITTED_HEAP_BYTES: 4294967296
INFO  :    INPUT_RECORDS_PROCESSED: 119
INFO  :    INPUT_SPLIT_LENGTH_BYTES: 3884610
INFO  :    OUTPUT_RECORDS: 0
INFO  : HIVE:
INFO  :    CREATED_FILES: 2
INFO  :    DESERIALIZE_ERRORS: 0
INFO  :    RECORDS_IN_Map_1: 12506
INFO  :    RECORDS_OUT_0: 2
INFO  :    RECORDS_OUT_INTERMEDIATE_Map_1: 0
INFO  :    RECORDS_OUT_OPERATOR_FS_11: 2
INFO  :    RECORDS_OUT_OPERATOR_GBY_10: 2
INFO  :    RECORDS_OUT_OPERATOR_MAP_0: 0
INFO  :    RECORDS_OUT_OPERATOR_SEL_9: 12506
INFO  :    RECORDS_OUT_OPERATOR_TS_0: 12506
INFO  : TaskCounter_Map_1_INPUT_employee:
INFO  :    INPUT_RECORDS_PROCESSED: 119
INFO  :    INPUT_SPLIT_LENGTH_BYTES: 3884610
INFO  : TaskCounter_Map_1_OUTPUT_out_Map_1:
INFO  :    OUTPUT_RECORDS: 0
INFO  : org.apache.hadoop.hive.ql.exec.tez.HiveInputCounters:
INFO  :    GROUPED_INPUT_SPLITS_Map_1: 2
INFO  :    INPUT_DIRECTORIES_Map_1: 33
INFO  :    INPUT_FILES_Map_1: 119
INFO  :    RAW_INPUT_SPLITS_Map_1: 119
INFO  : savedToCache: true
INFO  : Completed executing command(queryId=dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428); Time taken: 12.819 seconds
INFO  : OK
+-------+
|  _c0  |
+-------+
| 6949  |
| 5557  |
+-------+
2 rows selected (13.073 seconds)

有人能解释一下为什么聚合fxn-like count会给出多行输出吗

q1qsirdb

q1qsirdb1#

是 hive 里的虫子。当未完成轻微压缩时,它将返回多个值。
解决方案是添加到查询中

select count(*) from table limit 1

相关问题