I am using the Stanford NLP library to stem and lemmatize a sentence, for example: "A car is an easy way for commuting. But there are too many cars on the road these days."
So the expected output is:
car be easy way commute car road day
But I get:
ArrayBuffer(car, easy, way, for, commute, but, there, too, many, car, road, these, day)
Here is the code:
import java.util.Properties
import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer
import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

val stopWords = sc.broadcast(
  scala.io.Source.fromFile("src/main/common-english-words.txt").getLines().toSet).value

def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = stringRDD.map(plainTextToLemmas(_, stopWords))
lemmatized.foreach(println)
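One detail worth noting about the expected output: the guard `lemma.length > 2` in the code above unconditionally drops any lemma of two characters or fewer, so "be" (the lemma of "is") can never appear in the result regardless of the stop-word list. A minimal sketch of that effect, using a hypothetical sample of lemmas:

```scala
// Sketch: the `length > 2` filter drops two-letter lemmas such as "be",
// which is why "is" seems never to be converted even though the
// lemmatizer did produce "be".
object LengthFilterDemo {
  def main(args: Array[String]): Unit = {
    val lemmas = Seq("car", "be", "easy", "way")   // assumed sample lemmas
    val filtered = lemmas.filter(_.length > 2)
    println(filtered)                               // "be" is gone
  }
}
```

If "be" should survive, that length check needs to be relaxed or removed.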
I took this from the book Advanced Analytics with Spark, but it seems the stop words are not being removed, and "is" is not converted to "be". Can we add or remove rules in these libraries?

The stop-word list I am using is:
http://www.textfixer.com/resources/common-english-words.txt
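A likely cause of the stop words not being removed: the textfixer list is a single line of comma-separated words, not one word per line. `getLines().toSet` therefore produces a set containing one giant comma-joined string, so `stopWords.contains("for")` is always false. A minimal sketch of the difference, assuming the file's comma-separated format:

```scala
// Sketch: the textfixer file is one comma-separated line, e.g. "a,able,for,but,..."
// getLines().toSet then yields a single-element set, and contains() never matches.
object StopWordsDemo {
  def main(args: Array[String]): Unit = {
    val line = "a,able,about,for,but,is"       // assumed sample of the file's contents
    val broken = Set(line)                      // what getLines().toSet effectively builds
    val fixed = line.split(",").map(_.trim).toSet // split on commas instead
    println(broken.contains("for"))             // false
    println(fixed.contains("for"))              // true
  }
}
```

Splitting the file contents on commas (e.g. `Source.fromFile(...).mkString.split(",").map(_.trim).toSet`) should make the stop-word check work. It may also help to lowercase the lemma before the `contains` check rather than after, so capitalized tokens like "But" match the lowercase list.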