Dynamic string masking with regex in Scala

r7s23pms  posted on 2021-05-29  in Spark

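Is there a simple way to do data masking in Scala? Please explain. I want to dynamically replace text that matches given patterns with X's, keeping the same length as the matched keyword.
Example: patterns to mask: Narendra\s*Modi Trump JUN-\d\d
Input: Narendra Modi pm of india 2020-JUN-03 Donald Trump president of USA
Output: XXXXXXXX XXXX pm of india 2020-XXX-XX Donald XXXXX president of USA
Note: only characters should be masked; I want to keep the spaces and hyphens inside the matched patterns in the output.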

d6kp6zgx1#

scala>  val pattern = Seq("Narendra\\s*Modi", "Trump", "JUN-\\d\\d", "Trump", "JUN")
pattern: Seq[String] = List(Narendra\s*Modi, Trump, JUN-\d\d, Trump, JUN)

scala> print(mask(pattern,str))
XXXXXXXXXXXXXXX pm of india 2020-XXXXXXXX Donald XXXXX president of USA

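Yes, that should work; try the above. `mask` and `str` are not defined here; the output is consistent with the definitions in kr98yfug's answer, where each match is replaced by one X per character of the pattern text (15 for Narendra\s*Modi), not per character of the matched text.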

l5tcr1uw2#

Please find the regex and the explanation of the code in the inline comments.

import org.apache.spark.sql.functions._

object RegExMasking {

  def main(args: Array[String]): Unit = {

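    // Constant.getSparkSess is the answerer's own helper (not shown); it is expected to return a SparkSession
    val spark = Constant.getSparkSess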

    import spark.implicits._

    //Regex to fetch a word surrounded by whitespace
    val regEx : String = """(\s+[A-Za-z]+\s)"""

    //load your Dataframe
    val df = List("Narendra Modi pm of india 2020-JUN-03",
      "Donald Trump president of USA ").toDF("sentence")

    df.withColumn("valueToReplace",
      //Fetch the 1st word from the regex parse expression
          regexp_extract(col("sentence"),regEx,0)
        )
        .map(row => {
          val sentence = row.getString(0)

          //Trim for extra spaces
          val valueToReplace : String = row.getString(1).trim

          //Create masked string of equal length
          val replaceWith  = List.fill(valueToReplace.length)("X").mkString

          // Return sentence , masked sentence 
          (sentence,sentence.replace(valueToReplace,replaceWith))
        }).toDF("sentence","maskedSentence")
      .show()
  }

}
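
The per-row step of the job above can be tried without a Spark session. The following is only a minimal sketch under that assumption: `FirstWordMask` and `maskFirstMatch` are hypothetical names, not part of the answer, and the regex is the answer's with the `|` dropped from the character class. It shows that this approach masks only the first whitespace-delimited word the regex finds in each sentence.

```scala
object FirstWordMask {
  // Same idea as regEx in the answer: a run of letters with whitespace on both sides.
  private val wordRegex = """\s+[A-Za-z]+\s""".r

  // Find the first match, trim it, and replace every occurrence of that word
  // with an equally long run of X's. Sentences without a match are returned unchanged.
  def maskFirstMatch(sentence: String): String =
    wordRegex.findFirstIn(sentence).map(_.trim) match {
      case Some(word) if word.nonEmpty => sentence.replace(word, "X" * word.length)
      case _                           => sentence
    }
}
```

For the two sample rows this masks "Modi" and "Trump" and leaves "Narendra" and the date untouched, so it does not by itself cover the patterns listed in the question.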

zd287kbt3#

So you have an input String:

val input =
  "Narendra Modi of India, 2020-JUN-03, Donald Trump of USA."

Masking a given target of a given length is trivial.

input.replace("abc", "XXX")  // literal replacement; replaceAllLiterally is deprecated in recent Scala 2.13 releases

If you have many targets of different lengths, it gets more interesting.

"India|USA".r.replaceAllIn(input, "X" * _.matched.length)
//res0: String = Narendra Modi of XXXXX, 2020-JUN-03, Donald Trump of XXX.

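If a target mixes masked and kept characters, you can still group multiple targets together, but they must all have the same number of sub-groups and the same mask/keep arrangement.
In this case the arrangement is (mask)(keep)(mask).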

raw"(Narendra)(\s+)(Modi)|(Donald)(\s+)(Trump)|(JUN)([-/])(\d+)".r
  .replaceAllIn(input,{m =>
      val List(a,b,c) = m.subgroups.flatMap(Option(_))
      "X"*a.length + b + "X"*c.length
  })
//res1: String = XXXXXXXX XXXX of India, 2020-XXX-XX, XXXXXX XXXXX of USA.
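
When the targets do not share one group layout, an alternative is to decide per character instead of per group. The sketch below is an illustration under that assumption, not the answer's method: `MaskKeepSeparators` and its `targets` regex are hypothetical, built from the patterns in the question. Letters and digits inside a match become X, and everything else (spaces, hyphens) is kept.

```scala
import scala.util.matching.Regex

object MaskKeepSeparators {
  // One alternation of the question's patterns; no capture groups are needed.
  val targets: Regex = """Narendra\s*Modi|Trump|JUN-\d\d""".r

  // Inside each match, replace letters and digits with 'X' and keep other characters.
  // quoteReplacement protects any '$' or '\' that a match might contain.
  def mask(input: String): String =
    targets.replaceAllIn(input, m =>
      Regex.quoteReplacement(m.matched.map(c => if (c.isLetterOrDigit) 'X' else c))
    )
}
```

Calling `MaskKeepSeparators.mask` on the question's input gives the output requested there, `XXXXXXXX XXXX pm of india 2020-XXX-XX Donald XXXXX president of USA`.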

kr98yfug4#

Something like this?

val pattern = Seq("Modi", "Trump", "JUN")
val str = "Narendra Modi pm of india 2020-JUN-03 Donald Trump president of USA"

// Replace every occurrence of each pattern with X's, one X per character of the pattern text
def mask(pattern: Seq[String], str: String): String = {
  var s = str
  for (elem <- pattern) {
    s = s.replaceAll(elem, "X" * elem.length)
  }
  s
}

print(mask(pattern, str))

Output:

Narendra XXXX pm of india 2020-XXX-03 Donald XXXXX president of USA
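
The function above sizes the replacement from the pattern text, which only equals the match length for literal patterns such as "Modi". As a rough sketch for regex patterns (the object name `MaskByMatchLength` is hypothetical), the replacement can be sized from each actual match instead:

```scala
object MaskByMatchLength {
  // Apply each regex pattern in turn; every match is replaced by
  // as many X's as the matched text has characters.
  def mask(patterns: Seq[String], str: String): String =
    patterns.foldLeft(str) { (acc, p) =>
      p.r.replaceAllIn(acc, m => "X" * m.matched.length)
    }
}
```

With the patterns `Narendra\s*Modi`, `Trump` and `JUN-\d\d` this turns the sample string into `XXXXXXXXXXXXX pm of india 2020-XXXXXX Donald XXXXX president of USA`. Note that spaces and hyphens inside a match are masked too, so keeping them needs an extra per-character check.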
