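I need to convert HQL to Spark SQL. I am using the approach below, but I don't see any change in performance. If anyone has a better suggestion, please let me know.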
Hive-
create table temp1 as select * from Table1 T1 join (select id , min(activity_date) as dt from Table1 group by id) T2 on T1.id=T2.id and T1.activity_date=T2.dt ;
create table temp2 as select * from temp1 join diff_table
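I have about 70 such internal Hive temp tables. The source table Table1 holds about 1.8 billion rows, is not partitioned, and is stored as 200 HDFS files.
Spark code, run with 20 executors, 5 cores per executor, 10g executor memory, YARN client mode, and a 4g driver: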
import org.apache.spark.sql.{Row,SaveMode,SparkSession}
val spark=SparkSession.builder().appName("test").config("spark.sql.warehouse.dir","/usr/hive/warehouse").enableHiveSupport().getOrCreate()
import spark.implicits._
import spark.sql
val id_df=sql("select id , min(activity_date) as dt from Table1 group by id")
val all_id_df=sql("select * from Table1")
id_df.createOrReplaceTempView("min_id_table")
all_id_df.createOrReplaceTempView("all_id_table")
val temp1_df=sql("select * from all_id_table T1 join min_id_table T2 on T1.id=T2.id and T1.activity_date=T2.dt")
temp1_df.createOrReplaceTempView("temp2")
sql("create or replace table temp as select * from temp2")
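For comparison, here is a minimal sketch of one alternative that may be worth measuring: computing the per-id minimum with a window function, so that Table1 is read once instead of being scanned for both sides of a self-join. The helper name `earliestActivity` is hypothetical, and the sketch assumes the input has `id` and `activity_date` columns as in the queries above. The output carries Table1's columns plus `dt`, rather than the duplicated `id` column that `select *` over the join produces. Whether this is actually faster on 1.8 billion rows is not guaranteed, so it is worth comparing the plans with `explain()`.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min}

// Hypothetical helper: for every id, keep the rows that fall on
// that id's earliest activity_date (ties are all kept, as in the join).
def earliestActivity(df: DataFrame): DataFrame = {
  // One window per id; min(...) over it attaches the earliest date to each row.
  val byId = Window.partitionBy(col("id"))
  df.withColumn("dt", min(col("activity_date")).over(byId))
    // Rows with a null activity_date drop out here, matching the inner join.
    .filter(col("activity_date") === col("dt"))
}
```

With the `spark` session from the code above, this could be used as `earliestActivity(spark.table("Table1")).write.mode("overwrite").saveAsTable("temp1")`, which writes the result directly instead of going through temp views.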