How to use withColumn together with a while loop on a Spark DataFrame in Scala

Posted by sshcrbum on 2021-06-01 in Hadoop

This is my function for applying the rules. The columns mdp_codcat, mdp_idregl, and usedRef should change according to the data in the array bRef.

def withMdpCodcat(bRef: Broadcast[Array[RefRglSDC]])(dataFrame: DataFrame): DataFrame = {
    var matchRule = false
    var i = 0
    while (i < bRef.value.size && !matchRule) {
      if ((bRef.value(i).sensop.isEmpty || bRef.value(i).sensop.equals(col("signe")))
        && (bRef.value(i).cdopcz.isEmpty || Lib.matchCdopcz(strTail(col("cdopcz")).toString(), bRef.value(i).cdopcz))
        && (bRef.value(i).libope.isEmpty || Lib.matchRule(col("lib_ope").toString(), bRef.value(i).libope))
        && (bRef.value(i).qualib.isEmpty || Lib.matchRule(col("qualif_lib_ope").toString(), bRef.value(i).qualib))) {
        matchRule = true
        dataFrame.withColumn("mdp_codcat", lit(bRef.value(i).codcat))
        dataFrame.withColumn("mdp_idregl", lit(bRef.value(i).idregl))
        dataFrame.withColumn("usedRef", lit("SDC"))
      }else{
        dataFrame.withColumn("mdp_codcat", lit("NOT_CATEGORIZED"))
        dataFrame.withColumn("mdp_idregl", lit("-1"))
        dataFrame.withColumn("usedRef", lit(""))
      }
      i += 1
    }

    dataFrame
  }

DataFrame: "cdenjp", "cdguic", "numcpt", "mdp_codcat", "mdp_idregl", "mdp_codcat", "mdp_idregl", "usedRef". If a rule matches, add mdp_idregl, mdp_idregl, mdp_idregl with the values from bRef.
Example: my DataFrame:

val DF = Seq(("tt", "aa", "bb", "cc"), ("tt1", "aa1", "bb2", "cc3")).toDF("t", "a", "b", "c")
+---+---+---+---+
|  t|  a|  b|  c|
+---+---+---+---+
| tt| aa| bb| cc|
|tt1|aa1|bb2|cc3|
+---+---+---+---+

Contents of file.text:

,aa,bb,cc
 ,aa1,bb2,cc3
tt4,aa4,bb4,cc4
tt1,aa1,,cc6

case class TOTO(a: String, b:String, c: String, d:String)

 val text = sc.textFile("file:///home/X176616/file")
 val bRef = text.map(row => row.split(",", -1))
      .map(c => TOTO(c(0), c(1), c(2), c(3))).collect().sortBy(_.a)

def withMdpCodcat(bRef: Broadcast[Array[RefRglSDC]])(dataframe: DataFrame): DataFrame = {
    // First initialize to NOT_FOUND; change it inside the while loop if a rule matches
    dataframe.withColumn("mdp_codcat_new", lit("NOT_FOUND"))

    var matchRule = false
    var i = 0

    while (i < bRef.value.size && !matchRule) {
      if ((bRef.value(i).a.isEmpty || bRef.value(i).a.equals(signe))
        && (bRef.value(i).b.isEmpty || Lib.matchCdopcz(col(b), bRef.value(i).b))
        && (bRef.value(i).c.isEmpty || Lib.matchRule(col(c), bRef.value(i).c))) {
        matchRule = true
        dataframe.withColumn("mdp_codcat_new", lit(bRef.value(i).d))
        dataframe.withColumn("mdp_mdp_idregl_new", lit(bRef.value(i).e))
      }
      i += 1
    }

    dataframe
  }

Expected final result when this condition is true:

(bRef.value(i).a.isEmpty || bRef.value(i).a.equals(signe))
            && (bRef.value(i).b.isEmpty || Lib.matchCdopcz(b.substring(1).toInt.toString, bRef.value(i).b))
            && (bRef.value(i).c.isEmpty || Lib.matchRule(c, bRef.value(i).c))

+---+---+---+---+-----------+----------+
|  t|  a|  b|  c|mdp_codcat |mdp_idregl|
+---+---+---+---+-----------+----------+
| tt| aa| bb| cc|cc         | other    |
| ab|aa1|bb2|cc3|cc4        | toto     |  (from bRef if the while condition is true)
| cd|aa1|bb2|cc3|cc4        | titi     |
|  b|a1 |b2 |c3 |NO_FOUND   |NO_FOUND  |  (NO_FOUND if the condition is false)
+---+---+---+---+-----------+----------+
Answer 1 (qfe3c7zg)

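You cannot build a DataFrame schema based on runtime values. I would try to keep it simpler. First, I'd create the three columns with default values: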

// withColumn returns a new DataFrame, so chain the calls and keep the result
val dataframe = dataFrame
  .withColumn("mdp_codcat", lit(""))
  .withColumn("mdp_idregl", lit(""))
  .withColumn("usedRef", lit(""))

Then you can use a UDF together with the broadcast value:

def mdp_codcat(bRef: Broadcast[Array[RefRglSDC]]) = udf { (field: String) =>
{
      // Your while and if stuff
      // return your update data
}}
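
The "while and if" part of the UDF body could be written as a first-match lookup over the rule array. The sketch below is only an illustration: `Rule`, `RuleMatcher`, and the field names are hypothetical, and plain string equality stands in for `Lib.matchCdopcz` and `Lib.matchRule`, whose behavior is not shown in the question.

```scala
// Hypothetical rule type: an empty pattern field acts as a wildcard.
final case class Rule(sign: String, code: String, label: String, category: String, ruleId: String)

object RuleMatcher {
  // Fallback pair (category, ruleId) used when no rule matches.
  val NotCategorized: (String, String) = ("NOT_CATEGORIZED", "-1")

  // Returns (category, ruleId) of the first rule whose non-empty fields all match,
  // stopping at the first hit just like the `!matchRule` guard in the while loop.
  // Equality is an assumption replacing the question's Lib.* matchers.
  def firstMatch(rules: Seq[Rule], sign: String, code: String, label: String): (String, String) =
    rules.collectFirst {
      case r if (r.sign.isEmpty || r.sign == sign) &&
                (r.code.isEmpty || r.code == code) &&
                (r.label.isEmpty || r.label == label) =>
        (r.category, r.ruleId)
    }.getOrElse(NotCategorized)
}
```

Because the function works on plain values rather than on `Column` objects, it can be unit-tested without a Spark session and then called from inside the UDF.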

And apply each UDF to its field:

dataframe.withColumn("mdp_codcat_new", mdp_codcat(bRef)(col("mdp_codcat")))

Maybe this helps.
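
For reference, here is a minimal end-to-end sketch of the broadcast-plus-UDF approach described in this answer. It makes several assumptions: `SimpleRule`, `BroadcastUdfSketch`, and the sample data are hypothetical, the rule is reduced to a single pattern compared by equality, and a local Spark session is used only for demonstration.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical, simplified rule: `pattern` is compared to one input column,
// and an empty pattern matches anything.
case class SimpleRule(pattern: String, codcat: String)

object BroadcastUdfSketch {
  // Pure lookup: category of the first matching rule, or "NOT_FOUND".
  def lookup(rules: Array[SimpleRule], field: String): String =
    rules.find(r => r.pattern.isEmpty || r.pattern == field)
      .map(_.codcat)
      .getOrElse("NOT_FOUND")

  // The UDF reads the broadcast array on the executors and is evaluated per row.
  def mdpCodcat(bRef: Broadcast[Array[SimpleRule]]): UserDefinedFunction =
    udf((field: String) => lookup(bRef.value, field))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("tt", "aa"), ("tt1", "aa1"), ("b", "a1")).toDF("t", "a")
    val bRef = spark.sparkContext.broadcast(
      Array(SimpleRule("aa", "cc"), SimpleRule("aa1", "cc4")))

    // withColumn returns a new DataFrame; keep the result instead of discarding it.
    val result = df.withColumn("mdp_codcat", mdpCodcat(bRef)(col("a")))
    result.show()

    spark.stop()
  }
}
```

The key difference from the code in the question is that the rule matching runs per row inside the UDF, while `withColumn` is called once and its return value is kept.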
