StanfordCoreNLP with MapReduce (error: GC overhead limit exceeded)

Posted by z5btuh9x on 2021-06-02 in Hadoop


I have a text file containing a set of document IDs and document contents, separated by "::". Here is an example:

139::This is a sentence in document 139. This is another sentence.
140::This is a sentence in document 140. This is another sentence.
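
As a minimal sketch of how such a line can be split safely, the helper below cuts at the first "::" only, so a "::" that happens to occur inside the document text stays part of the content, and it rejects lines without a delimiter. The class name `DocLine` and its fields are hypothetical and not part of the original program; only the Java standard library is used.

```java
// Hypothetical helper: parses one "<documentID>::<document content>" line.
final class DocLine {
    final String id;
    final String content;

    private DocLine(String id, String content) {
        this.id = id;
        this.content = content;
    }

    /**
     * Splits at the first "::" only. Returns null when the line has no
     * delimiter or when either side is empty after trimming.
     */
    static DocLine parse(String line) {
        if (line == null) {
            return null;
        }
        String[] parts = line.split("::", 2);
        if (parts.length < 2) {
            return null;
        }
        String id = parts[0].trim();
        String content = parts[1].trim();
        if (id.isEmpty() || content.isEmpty()) {
            return null;
        }
        return new DocLine(id, content);
    }

    public static void main(String[] args) {
        DocLine doc = parse("139::This is a sentence in document 139. This is another sentence.");
        System.out.println(doc.id + " -> " + doc.content);
    }
}
```

Inside a mapper, a `null` result would simply mean skipping the record instead of failing on `input[1]`.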

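I want to run named entity recognition on these sentences with StanfordCoreNLP. This works fine in a conventional Java program. Now I want to do the same thing with MapReduce. I tried loading the StanfordCoreNLP classifiers in the mapper's setup() method, with the map() method doing the named entity tagging, as follows: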

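import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class NerMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Built once per map task in setup() and reused for every record.
    private StanfordCoreNLP pipeline;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref, relation");
        pipeline = new StanfordCoreNLP(props);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line has the form <documentID>::<document content>.
        String[] input = value.toString().split("::", 2);
        if (input.length < 2) {
            return; // skip lines without the "::" delimiter
        }
        // DataTuple is my own class (not shown here).
        List<DataTuple> dataTuples = new ArrayList<DataTuple>();
        Annotation annotation = new Annotation(input[1]);
        pipeline.annotate(annotation);
        List<CoreMap> sentences = annotation.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // extract named entities
            // write <documentID>::<the named entity itself>::<the named entity tag>
        }
    }
}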

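When I run the job, it fails with the error "GC overhead limit exceeded". I tried different heap sizes with export HADOOP_OPTS="-Xmx892m" before running the job, and I use the -libjars option of the hadoop jar command. The input documents usually contain only 4-5 sentences of normal size. I know the problem is with the initialization of the classifiers in the setup() method, but I haven't been able to figure out what exactly is going wrong. I'd really appreciate any help here!
I am using Hadoop 2.6.0, Stanford CoreNLP 3.4.1 and Java 1.7.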
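
One thing worth checking, offered as a hedged sketch rather than a confirmed diagnosis: HADOOP_OPTS configures the JVM started by the hadoop command on the client side, while in Hadoop 2.x the heap of each map task JVM comes from mapreduce.map.java.opts (inside a container sized by mapreduce.map.memory.mb). The helper below uses only the Java standard library, and the class name `TaskHeapReport` is hypothetical. Calling `describe()` at the start of setup() and writing the result to the task log would show how much heap the mapper actually received.

```java
// Hypothetical diagnostic helper: reports the heap available to the current JVM.
final class TaskHeapReport {
    private static final long MB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    /** Heap currently in use (allocated minus free), in megabytes. */
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MB;
    }

    /** One-line summary suitable for a task log. */
    static String describe() {
        return String.format("heap max=%d MB, used=%d MB", maxHeapMb(), usedHeapMb());
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the reported maximum is far below what the models need, the task heap could be raised through generic options placed next to -libjars, for example -D mapreduce.map.java.opts=-Xmx3g -D mapreduce.map.memory.mb=4096. The sizes are illustrative assumptions, and this only takes effect when the driver parses generic options (ToolRunner / GenericOptionsParser). Since only named entities are written out, trimming the annotator list to "tokenize, ssplit, pos, lemma, ner" might also reduce the memory taken by the models loaded in setup(); that is an untested suggestion.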

Java hadoop mapreduce stanford-nlp

Source: https://stackoverflow.com/questions/31946966/stanfordcorenlp-with-mapreduceerror-gc-overhead-limit-exceeded
