Do I need a daily MSCK REPAIR in Hive if no new partition is added?

ckx4rj1h asked on 2021-06-24 in Hive

I have a Hive table that already contains data and is partitioned on a year-based column. Data is now loaded into this table every day. Running MSCK REPAIR after every daily load is not an option for me. Since the partitions are yearly, most daily loads land in an existing partition, so: do I need to run MSCK REPAIR after a daily load if no new partition was added? Here is what I tried:

val data = Seq(Row("1","2020-05-11 15:17:57.188","2020"))
val schemaOrig = List( StructField("key",StringType,true)
                      ,StructField("txn_ts",StringType,true)
                      ,StructField("txn_dt",StringType,true))

val sourceDf =  spark.createDataFrame(spark.sparkContext.parallelize(data),StructType(schemaOrig))
sourceDf.write.mode("overwrite").partitionBy("txn_dt").avro("/test_a")

Hive external table:

create external table test_a(
key    string,
txn_ts string
)
partitioned by (txn_dt string)
stored as avro
location '/test_a';

msck repair table test_a;
select * from test_a;
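
If the yearly partition value is known at load time, the partition can also be registered explicitly instead of running a full repair; MSCK REPAIR is then only needed for partitions the metastore has never seen. A minimal sketch, assuming the layout written above and a spark-shell session:

// Registers the single yearly partition; IF NOT EXISTS makes it safe to re-run after every load
spark.sql("ALTER TABLE test_a ADD IF NOT EXISTS PARTITION (txn_dt='2020') " +
  "LOCATION '/test_a/txn_dt=2020'")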
nle07wnf answered:

Noticed that if no new partition is added, MSCK REPAIR is not needed: the metastore already knows the partition's directory, so files appended to an existing partition are picked up at query time without any repair.

    msck repair table test_a;
    select * from test_a;

        +----------------+--------------------------+------------------------+--+
        | test_a.rowkey  |      test_a.txn_ts       | test_a.order_entry_dt  |
        +----------------+--------------------------+------------------------+--+
        | 1              | 2020-05-11 15:17:57.188  | 2020                   |
        +----------------+--------------------------+------------------------+--+

    Now added 1 more row with the same partition value (2020) 

        val data = Seq(Row("2","2021-05-11 15:17:57.188","2020"))
        val schemaOrig = List( StructField("rowkey",StringType,true)
        ,StructField("txn_ts",StringType,true)
        ,StructField("order_entry_dt",StringType,true))
        val sourceDf =  spark.createDataFrame(spark.sparkContext.parallelize(data),StructType(schemaOrig))
        sourceDf.write.mode("append").partitionBy("order_entry_dt").avro("/test_a")

  **Hive query returned 2 rows**
        select * from test_a;

    +----------------+--------------------------+------------------------+--+
    | test_a.rowkey  |      test_a.txn_ts       | test_a.order_entry_dt  |
    +----------------+--------------------------+------------------------+--+
    | 1              | 2020-05-11 15:17:57.188  | 2020                   |
    | 2              | 2021-05-11 15:17:57.188  | 2020                   |
    +----------------+--------------------------+------------------------+--+

        -- Now tried adding a NEW PARTITION (2021) to see if the select query will return it without MSCK REPAIR
        val data = Seq(Row("3","2021-05-11 15:17:57.188","2021"))
        val schemaOrig = List( StructField("rowkey",StringType,true)
        ,StructField("txn_ts",StringType,true)
        ,StructField("order_entry_dt",StringType,true))
        val sourceDf =  spark.createDataFrame(spark.sparkContext.parallelize(data),StructType(schemaOrig))
        sourceDf.write.mode("append").partitionBy("order_entry_dt").avro("/test_a")

    Without MSCK REPAIR, the query again returned only 2 rows instead of 3, so a repair (or an explicit ALTER TABLE ... ADD PARTITION) is needed only when a load creates a new partition directory.
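
In this setup, a repair step is therefore only required when a daily load writes a partition value the metastore does not know yet. A hedged sketch of that check (the helper name and structure are illustrative, not from the answer; it assumes a single string partition column and a spark-shell session):

    // Illustrative helper (not from the answer): register only partitions that are
    // new in today's load; appends into already-registered partitions need nothing.
    import org.apache.spark.sql.DataFrame

    def addNewPartitions(df: DataFrame, table: String, partCol: String, basePath: String): Unit = {
      // Partition values present in the data being loaded
      val loaded = df.select(partCol).distinct.collect.map(_.getString(0)).toSet
      // Partition values the metastore already knows, e.g. "order_entry_dt=2020"
      val known = spark.sql(s"SHOW PARTITIONS $table").collect
        .map(_.getString(0).split("=").last).toSet
      // Register only the partitions that are missing from the metastore
      (loaded -- known).foreach { v =>
        spark.sql(s"ALTER TABLE $table ADD IF NOT EXISTS PARTITION ($partCol='$v') " +
          s"LOCATION '$basePath/$partCol=$v'")
      }
    }

    // e.g. after the daily write above:
    addNewPartitions(sourceDf, "test_a", "order_entry_dt", "/test_a")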
