Spark DataFrame optimization techniques

tktrz96b  posted on 2021-05-27  in  Spark

I am trying to implement the logic below.

1. Take some records from one table.
2. Loop over the resulting data.
3. Inside the loop, read data from other tables into two different DataFrames.
4. Join these two DataFrames and load the result into a third table.

    // Step 1: fetch the ids and iterate over them on the driver
    val id_chck1 = "select distinct id, id1, id2 from table WHERE type = 'N'"
    val id_chck = hive.executeQuery(id_chck1)
    for (data <- id_chck.collect) {
      val id = data(0)
      val id1 = data(1)
      val id2 = data(2)

      // Steps 2-3: for each id, fetch the bill rows and loop over them
      val values_1 = "select distinct bill, bil_num, id_num, bill_date, process_date from table l WHERE id2 = '222'"
      val values_1_data = hive.executeQuery(values_1)
      for (row <- values_1_data.collect) {
        // Split the row once rather than once per field
        val fields = row.mkString(",").split(",")
        val bill = fields(0)
        val bil_num = fields(1)
        val id_num = fields(2)
        val bill_date = fields(3)

        val df1 = "select column name from tablename where id=222"
        val df1_data = hive.executeQuery(df1)
        val df2 = "select column name from tablename2 where id=222"
        val df2_data = hive.executeQuery(df2)

        // Step 4: join the two DataFrames (join condition not given in the question)
        val df3 = df1_data.join(df2_data)
        df3.write.format("orc").mode("Append").save("hdfslocation")
      }
      // Load the files written above into the target table
      val load1 = "load data inpath 'hdfslocation' into table tablename"
      val load1_data = hive.executeUpdate(load1)
    }

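But this process takes more than 6 hours. Is there another way to do the same thing so that it finishes in less time, for example using RDDs or setting some Spark/Hive properties to improve performance? I have 500,000 records in the test1 table.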

pexxcrt2  #1

Can you add the input and the expected output as an example? It is hard to see what exactly you are trying to achieve.
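
Until the input and expected output are known, here is only a minimal sketch of the usual alternative to nested driver-side loops: express the per-row lookups as joins, so that Spark runs one distributed job and writes the result once instead of once per row. Everything specific in it is an assumption for illustration and not taken from the question: the column names (`id`, `col_a`, `col_b`), the join key, the helper name `joinAll`, and the use of plain `SparkSession` DataFrames in local mode instead of `hive.executeQuery`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SetBasedJoinSketch {
  // Joins the filtered key rows to both lookup tables in a single pass.
  // Assumption: all three inputs share a join column named "id".
  def joinAll(keys: DataFrame, left: DataFrame, right: DataFrame): DataFrame = {
    val distinctKeys = keys.select("id").distinct()
    left
      .join(distinctKeys, Seq("id")) // keep only ids present in the key table
      .join(right, Seq("id"))        // attach the columns of the second table
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("set-based-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Tiny in-memory stand-ins for the three Hive tables (hypothetical data).
    val keys  = Seq((1, "N"), (2, "N"), (2, "N"), (4, "Y")).toDF("id", "type")
    val left  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "col_a")
    val right = Seq((1, "x"), (2, "y")).toDF("id", "col_b")

    val result = joinAll(keys.filter($"type" === "N"), left, right)
    result.show()

    // A single write replaces the per-row writes; the path is a placeholder.
    // result.write.format("orc").mode("append").save("hdfslocation")

    spark.stop()
  }
}
```

Whether this matches the intended result depends on the real join keys and on how `id`, `id1` and `id2` are meant to filter the inner queries, which the question does not show.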
