Sentence detection with OpenNLP on Hadoop MapReduce

Posted by mqkwyuun on 2021-06-03 in Hadoop

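I want to do sentence detection with OpenNLP on Hadoop. I have already implemented it successfully in plain Java and would like to do the same on the MapReduce platform. Can anyone help me?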

hadoop mapreduce opennlp detection sentence

Source: https://stackoverflow.com/questions/21292785/sentence-detection-using-opennlp-on-hadoop

1 answer

lmyy7pcs #1

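I have done this two different ways. One way is to push the sentence detection model out to a standard directory on every node (e.g. /opt/opennlpmodels/), read the serialized model at the class level in your mapper class, and then use it as needed in your map or reduce function.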
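The other way is to put the model in a database or the distributed cache (as a blob or similar); I have stored document categorization models in Accumulo this way before. You then open the database connection at the class level and fetch the model as a ByteArrayInputStream.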
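A minimal sketch of the blob approach, under some assumptions: `BlobBackedModel`, `ModelParser`, and the `Supplier<byte[]>` are hypothetical names, and the supplier stands in for whatever database or cache lookup returns the serialized model bytes (the Accumulo query itself is not shown). With OpenNLP on the classpath, the parser argument could be `SentenceModel::new`, since `SentenceModel` has a constructor that takes an `InputStream`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

/**
 * Lazily builds a model from a byte[] blob and keeps it for later calls.
 * All names here are hypothetical and exist only for this sketch.
 */
public class BlobBackedModel<M> {

  /** Parses a serialized model from a stream; with OpenNLP this could be SentenceModel::new. */
  @FunctionalInterface
  public interface ModelParser<M> {
    M parse(InputStream in) throws IOException;
  }

  private final Supplier<byte[]> blobSource; // stand-in for the database or cache lookup
  private final ModelParser<M> parser;
  private M model;                           // cached after the first successful load

  public BlobBackedModel(Supplier<byte[]> blobSource, ModelParser<M> parser) {
    this.blobSource = blobSource;
    this.parser = parser;
  }

  /** Returns the cached model, fetching and parsing the blob on the first call only. */
  public synchronized M get() {
    if (model == null) {
      try (InputStream in = new ByteArrayInputStream(blobSource.get())) {
        model = parser.parse(in);
      } catch (IOException e) {
        throw new UncheckedIOException("could not parse model blob", e);
      }
    }
    return model;
  }
}
```

Holding one such object as a field of the mapper means the blob is fetched once per mapper instance rather than once per record.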
I used Puppet to push out the models, but use whatever tool you normally rely on to keep files up to date across your cluster.
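Depending on your Hadoop version, you may be able to sneak the model in as a property during job setup, so that only the master (or wherever you launch jobs from) needs the actual model file. I have never tried this.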
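One way the property idea could work, sketched with the JDK only: Base64-encode the model bytes into a string property where the job is launched, then decode them inside the task. `java.util.Properties` is a stand-in here for Hadoop's `Configuration`, which also stores string values via `set(name, value)` and `get(name)`; the class name and the property key are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Base64;
import java.util.Properties;

/** Round-trips model bytes through a string property (hypothetical helper for this sketch). */
public class ModelAsProperty {

  // Hypothetical property key; any name that is unique within the job would do.
  static final String MODEL_KEY = "example.opennlp.sentence.model.base64";

  /** Launch side: encode the raw model bytes so they fit in a string-valued property. */
  public static void putModel(Properties conf, byte[] modelBytes) {
    conf.setProperty(MODEL_KEY, Base64.getEncoder().encodeToString(modelBytes));
  }

  /** Task side: decode the property back into the original model bytes. */
  public static byte[] decodeModel(Properties conf) {
    String encoded = conf.getProperty(MODEL_KEY);
    if (encoded == null) {
      throw new IllegalStateException("model property " + MODEL_KEY + " is not set");
    }
    return Base64.getDecoder().decode(encoded);
  }

  /** Task side: expose the decoded bytes as a stream that a model constructor can read. */
  public static InputStream openModel(Properties conf) {
    return new ByteArrayInputStream(decodeModel(conf));
  }
}
```

Base64 grows the payload by about a third, and the job configuration is shipped to every task, so this probably only suits small models.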
If you need to know how to actually use the OpenNLP sentence detector, let me know and I will post an example. HTH.

import java.io.File;
import java.io.FileInputStream;
import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

public class SentenceDetection {

  SentenceDetector sd;

  public Span[] getSentences(String docTextFromMapFunction) throws Exception {

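    if (sd == null) {
      // Load the model once and close the stream when done; later calls reuse the detector.
      try (FileInputStream modelIn = new FileInputStream(new File("/standardized-on-each-node/path/to/en-sent.zip"))) {
        sd = new SentenceDetectorME(new SentenceModel(modelIn));
      }
    }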
    /**
     * this gives you the actual sentences as a string array
     */
    // String[] sentences = sd.sentDetect(docTextFromMapFunction);
    /**
     * this gives you the spans (the charindexes to the start and end of each
     * sentence in the doc)
     *
     */
    Span[] sentenceSpans = sd.sentPosDetect(docTextFromMapFunction);
    /**
     * you can do this as well to get the actual sentence strings based on the spans
     */
    // String[] spansToStrings = Span.spansToStrings(sentenceSpans, docTextFromMapFunction);
    return sentenceSpans;
  }
}

Well... just make sure the file is in place. There are more elegant ways of doing this, but this works and it is simple.
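The spans returned by sentPosDetect hold character offsets into the document text, so turning them back into sentences is plain substring slicing. The sketch below shows that with a hypothetical `CharSpan` class standing in for `opennlp.tools.util.Span` (start inclusive, end exclusive); `SpanSlicing` and `slice` are names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Shows how character-offset spans map back to sentence strings (illustrative only). */
public class SpanSlicing {

  /** Hypothetical stand-in for opennlp.tools.util.Span: start inclusive, end exclusive. */
  public static final class CharSpan {
    final int start;
    final int end;

    public CharSpan(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  /** Cuts one substring per span out of the document text, in span order. */
  public static List<String> slice(String docText, List<CharSpan> spans) {
    List<String> sentences = new ArrayList<>(spans.size());
    for (CharSpan span : spans) {
      sentences.add(docText.substring(span.start, span.end));
    }
    return sentences;
  }
}
```

Emitting offsets from the mapper instead of strings keeps the output small when the original text is available downstream.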

2021-06-03
