How can these Hive queries be better optimized for speed?

lsmepo6l · posted 2021-05-27 in Spark

What I'm trying to do is better optimize these Hive queries, run through PySpark with a Spark context as shown below, for speed. Right now, comparing two large Hive tables takes an enormous amount of time (some runs take 221 minutes of wall-clock time). My question is: looking at the script below, are there any suggestions for improving its performance and speed? Thanks.
Update:
Of particular note is the following script:

    import sys

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    # Initializing passed parameters
    FIRST_TBL = sys.argv[1]
    SECOND_TBL = sys.argv[2]

    # Defining Spark context and Hive context
    GLBL_CONF = SparkConf().setAppName("No TouchPoint Refine Load Wrapper")
    SC = SparkContext(conf=GLBL_CONF)
    HC = HiveContext(SC)

    # Read the first table's column names and types from the Hive metastore.
    CLMS_DF = HC.sql("describe {0}".format(FIRST_TBL))
    CLMS_LIST = [c.col_name for c in CLMS_DF.select("col_name").collect()]
    TYPES_LIST = [t.data_type for t in CLMS_DF.select("data_type").collect()]
    CLMS_STR = ""

    # Build the projection: skip audit columns (and EXCLUDED_CLMS_LIST, which
    # is defined elsewhere in the full script), and wrap date/timestamp
    # columns in a CASE expression that maps a few known-divergent dates to a
    # fixed value so the two tables compare equal on them.
    AUDIT_PATTERNS = ("add_user_nm", "add_usr_nm", "add_user", "add_usr",
                      "add_tms", "updt_user", "updt_usr", "updt_user_nm",
                      "updt_usr_nm", "updt_tms")
    for i, clms in enumerate(CLMS_LIST):
        clms_lower = clms.lower()
        if (any(p in clms_lower for p in AUDIT_PATTERNS)
                or clms_lower in EXCLUDED_CLMS_LIST):
            LOGGER.info("{0}".format(clms_lower))
            continue
        if "partition information" in clms_lower:
            # "describe" emits this marker row before the partition columns; stop here.
            break
        if TYPES_LIST[i].lower() in ('date', 'timestamp'):
            # Note: two full-table aggregations per date/timestamp column.
            max1 = HC.sql("select to_date(max({0})) as max_dt from {1} "
                          "where to_date({0}) <> '2099-12-31'"
                          .format(clms_lower, FIRST_TBL)).first()
            max2 = HC.sql("select to_date(max({0})) as max_dt from {1} "
                          "where to_date({0}) <> '2099-12-31'"
                          .format(clms_lower, SECOND_TBL)).first()
            CLMS_STR += ("case when to_date({0}) = '{1}' or to_date({0}) = '{2}' "
                         "or to_date({0}) = '2019-05-20' or to_date({0}) = '2019-05-21' "
                         "or to_date({0}) = '2019-05-22' "
                         "then cast('2020-06-06' as {3}) else to_date({0}) end as {0},"
                         .format(clms_lower, max1['max_dt'], max2['max_dt'], TYPES_LIST[i]))
        else:
            CLMS_STR += clms + ","
    CLMS_STR = CLMS_STR[:-1]  # drop the trailing comma

    # Same projection and optional WHERE clause against both tables
    # (WHERE_CLAUSE, LOG_PATH, DATE, TIME, and LOGGER are set earlier in the
    # full script).
    df1 = HC.sql("select {0} from {1} {2}".format(CLMS_STR, FIRST_TBL, WHERE_CLAUSE))
    df2 = HC.sql("select {0} from {1} {2}".format(CLMS_STR, SECOND_TBL, WHERE_CLAUSE))

    # Row-level differences in both directions, plus row counts.
    df1_minus_df2 = df1.subtract(df2)
    df1_minus_df2_count = df1_minus_df2.count()
    df2_minus_df1 = df2.subtract(df1)
    df2_minus_df1_count = df2_minus_df1.count()
    df1_count = df1.count()
    df2_count = df2.count()

    # Save up to 10,000 differing rows per direction as CSV.
    df1_minus_df2_log = LOG_PATH + 'First_minus_second_' + str(DATE) + '_' + str(TIME)
    df2_minus_df1_log = LOG_PATH + 'Second_minus_first_' + str(DATE) + '_' + str(TIME)
    df1_minus_df2.limit(10000).write.format("com.databricks.spark.csv").save(df1_minus_df2_log)
    df2_minus_df1.limit(10000).write.format("com.databricks.spark.csv").save(df2_minus_df1_log)
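
One place where a large share of the wall-clock time likely goes: the metadata loop above runs two separate full-table aggregation queries for every date/timestamp column, so a table with N such columns is scanned roughly 2N times before the comparison even starts. As a hedged sketch (the names date_clms, projections, and max_row1/max_row2 are illustrative, not from the original script, and the same audit-column filtering as in the loop would still apply), all the maxima can be gathered in a single scan per table:

    # Sketch, assuming CLMS_LIST and TYPES_LIST are built as above.
    # One aggregation query per table instead of two per date/timestamp
    # column; the conditional max reproduces the original
    # "<> '2099-12-31'" filter, since max() ignores NULLs.
    date_clms = [c.lower() for c, t in zip(CLMS_LIST, TYPES_LIST)
                 if t.lower() in ('date', 'timestamp')]
    if date_clms:
        projections = ", ".join(
            "max(case when to_date({0}) <> '2099-12-31' "
            "then to_date({0}) end) as {0}".format(c)
            for c in date_clms)
        max_row1 = HC.sql("select {0} from {1}".format(projections, FIRST_TBL)).first()
        max_row2 = HC.sql("select {0} from {1}".format(projections, SECOND_TBL)).first()
        # max_row1[c] / max_row2[c] then replace the per-column
        # max1['max_dt'] / max2['max_dt'] lookups inside the loop.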

My goal is to take the two large tables, basically compute their counts, and then find the differences between them.
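
On the comparison itself: subtract is lazy, but each count and each CSV write is a separate Spark action, so as written both Hive tables are re-read several times. A minimal sketch of caching the two inputs first, assuming the same df1/df2 as above and the standard pyspark StorageLevel API:

    from pyspark import StorageLevel

    # Cache both inputs so the later actions (the diff counts, the plain
    # counts, and the CSV writes) reuse one read of each table.
    df1.persist(StorageLevel.MEMORY_AND_DISK)
    df2.persist(StorageLevel.MEMORY_AND_DISK)
    df1_count = df1.count()   # first action per table materializes the cache
    df2_count = df2.count()
    df1_minus_df2 = df1.subtract(df2)
    df2_minus_df1 = df2.subtract(df1)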

No answers yet.
