I am using the Stanford NLP library to stem and lemmatize a sentence, for example: "A car is an easy way for commuting. But there are too many cars on the road these days."
So the expected output is:
car be easy way commute car road day
But I get:
ArrayBuffer(car, easy, way, for, commute, but, there, too, many, car, road, these, day)
Here is the code:
import java.util.Properties
import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer
import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

val stopWords = sc.broadcast(
  scala.io.Source.fromFile("src/main/common-english-words.txt").getLines().toSet).value

def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = stringRDD.map(plainTextToLemmas(_, stopWords))
lemmatized.foreach(println)
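One detail worth noting about the expected output: the guard `lemma.length > 2` in the code above unconditionally drops any lemma of two characters or fewer, so "be" (the lemma of "is") can never appear in the result regardless of the stop-word list. A minimal sketch of that effect, using a hypothetical sample of lemmas:

```scala
// Sketch: the `length > 2` filter drops two-letter lemmas such as "be",
// which is why "is" seems never to be converted even though the
// lemmatizer did produce "be".
object LengthFilterDemo {
  def main(args: Array[String]): Unit = {
    val lemmas = Seq("car", "be", "easy", "way")   // assumed sample lemmas
    val filtered = lemmas.filter(_.length > 2)
    println(filtered)                               // "be" is gone
  }
}
```

If "be" should survive, that length check needs to be relaxed or removed.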
I took this from the book Advanced Analytics with Spark, but it seems the stop words are not being removed, and "is" is not converted to "be". Can we add or remove rules in these libraries?

The stop-word list I am using is:
http://www.textfixer.com/resources/common-english-words.txt
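A likely cause of the stop words not being removed: the textfixer list is a single line of comma-separated words, not one word per line. `getLines().toSet` therefore produces a set containing one giant comma-joined string, so `stopWords.contains("for")` is always false. A minimal sketch of the difference, assuming the file's comma-separated format:

```scala
// Sketch: the textfixer file is one comma-separated line, e.g. "a,able,for,but,..."
// getLines().toSet then yields a single-element set, and contains() never matches.
object StopWordsDemo {
  def main(args: Array[String]): Unit = {
    val line = "a,able,about,for,but,is"       // assumed sample of the file's contents
    val broken = Set(line)                      // what getLines().toSet effectively builds
    val fixed = line.split(",").map(_.trim).toSet // split on commas instead
    println(broken.contains("for"))             // false
    println(fixed.contains("for"))              // true
  }
}
```

Splitting the file contents on commas (e.g. `Source.fromFile(...).mkString.split(",").map(_.trim).toSet`) should make the stop-word check work. It may also help to lowercase the lemma before the `contains` check rather than after, so capitalized tokens like "But" match the lowercase list.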