Usage and code examples of the zemberek.core.collections.Histogram.loadFromUtf8File() method

x33g5p2x, reposted on 2022-01-20 in Other

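This article collects code examples for the Java method zemberek.core.collections.Histogram.loadFromUtf8File() and shows how Histogram.loadFromUtf8File() is used in practice. The examples come mainly from platforms such as GitHub, Stack Overflow, and Maven, and were extracted from selected projects to serve as a practical reference. Details of the Histogram.loadFromUtf8File() method are as follows:
Package path: zemberek.core.collections.Histogram
Class name: Histogram
Method name: loadFromUtf8File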

Introduction to Histogram.loadFromUtf8File

Loads a String Histogram from a file. Each line is expected to hold a key and its count, separated by the delimiter character. Format: [key][delimiter][count]
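
To make the `[key][delimiter][count]` format concrete, here is a minimal sketch of a parser for such lines using only the Java standard library. It is a hypothetical stand-in, not Zemberek's implementation: the class name `HistogramFormatSketch`, the method `parseLines`, and the choice to split on the last occurrence of the delimiter are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration; this is NOT Zemberek's Histogram class.
class HistogramFormatSketch {

  // Parses lines of the form [key][delimiter][count] into a key -> count map.
  // Assumption: the count follows the LAST occurrence of the delimiter,
  // so a key that itself contains the delimiter still parses.
  static Map<String, Integer> parseLines(List<String> lines, char delimiter) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String line : lines) {
      int split = line.lastIndexOf(delimiter);
      if (split <= 0) {
        throw new IllegalArgumentException("Bad histogram line: " + line);
      }
      String key = line.substring(0, split);
      int count = Integer.parseInt(line.substring(split + 1).trim());
      // Sum the counts if a key appears on more than one line.
      counts.merge(key, count, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String> lines = List.of("merhaba 12", "selam 3");
    System.out.println(parseLines(lines, ' '));
  }
}
```

To read from an actual file, the lines could be obtained with `Files.readAllLines(path, StandardCharsets.UTF_8)` before being passed to `parseLines`.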

Code examples

Code example source: ahmetaa/zemberek-nlp

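static void multipleLetterRepetitionWords(Path in, Path out) throws IOException {
 // Each input line is expected to be "[word] [count]".
 Histogram<String> noisyWords = Histogram.loadFromUtf8File(in, ' ');
 Histogram<String> repetitionWords = new Histogram<>();
 for (String w : noisyWords) {
  if (w.length() == 1) {
   continue;
  }
  // Track the longest run of identical consecutive characters in w.
  int maxRepetitionCount = 1;
  int repetitionCount = 1;
  char lastChar = w.charAt(0);
  for (int i = 1; i < w.length(); i++) {
   char c = w.charAt(i);
   if (c == lastChar) {
    repetitionCount++;
   } else {
    if (repetitionCount > maxRepetitionCount) {
     maxRepetitionCount = repetitionCount;
    }
    // The current character starts a new run of length 1.
    repetitionCount = 1;
   }
   lastChar = c;
  }
  // Account for a run that ends at the last character.
  if (repetitionCount > maxRepetitionCount) {
   maxRepetitionCount = repetitionCount;
  }
  // Keep only words that contain a repeated letter, with their original counts.
  if (maxRepetitionCount > 1) {
   repetitionWords.set(w, noisyWords.getCount(w));
  }
 }
 repetitionWords.saveSortedByCounts(out, " ");
}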
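
The filter in this example appears intended to keep words whose longest run of identical consecutive characters exceeds a threshold. The run-length computation can be isolated as a small standalone helper; the sketch below uses only the standard library, and the names `RunLengthSketch` and `maxRunLength` are hypothetical rather than part of Zemberek.

```java
// Hypothetical helper for illustration; not part of Zemberek.
class RunLengthSketch {

  // Returns the length of the longest run of identical consecutive chars,
  // or 0 for an empty string.
  static int maxRunLength(String w) {
    if (w.isEmpty()) {
      return 0;
    }
    int max = 1;
    int run = 1;
    for (int i = 1; i < w.length(); i++) {
      if (w.charAt(i) == w.charAt(i - 1)) {
        run++;
        max = Math.max(max, run);
      } else {
        run = 1; // the current char starts a new run of length 1
      }
    }
    return max;
  }

  public static void main(String[] args) {
    for (String w : new String[] {"merhaba", "cooook", "evett"}) {
      System.out.println(w + " -> " + maxRunLength(w));
    }
  }
}
```

Updating the maximum inside the loop means a run that ends at the last character is counted without a separate check after the loop. The caller chooses the threshold; the example above uses `> 1`.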

Code example source: ahmetaa/zemberek-nlp

NormalizationVocabulary(
  Path correct,
  Path incorrect,
  Path maybeIncorrect,
  int correctMinCount,
  int incorrectMinCount,
  int maybeIncorrectMinCount) throws IOException {
 Histogram<String> correctWords = Histogram.loadFromUtf8File(correct, ' ');
 Histogram<String> noisyWords = Histogram.loadFromUtf8File(incorrect, ' ');
 Histogram<String> maybeIncorrectWords = new Histogram<>();
 if (maybeIncorrect != null) {
  maybeIncorrectWords = Histogram.loadFromUtf8File(maybeIncorrect, ' ');
 }
 correctWords.removeSmaller(correctMinCount);
 noisyWords.removeSmaller(incorrectMinCount);
 maybeIncorrectWords.removeSmaller(maybeIncorrectMinCount);
 this.noisyWordStart = correctWords.size();
 this.words = new ArrayList<>(correctWords.getSortedList());
 words.addAll(noisyWords.getSortedList());
 this.maybeIncorrectWordStart = words.size();
 words.addAll(maybeIncorrectWords.getSortedList());
 int i = 0;
 for (String word : words) {
  indexes.put(word, i);
  i++;
 }
}
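
The constructor above lays the three word groups out in one list and records where each group starts, so a word's index alone reveals which group it belongs to. A minimal sketch of that layout with plain lists follows; the class `VocabularyLayoutSketch` and its `isCorrect`, `isNoisy`, and `isMaybeIncorrect` helpers are hypothetical and only illustrate the indexing idea, not Zemberek's actual class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the index layout; not Zemberek's NormalizationVocabulary.
class VocabularyLayoutSketch {
  final List<String> words = new ArrayList<>();
  final Map<String, Integer> indexes = new HashMap<>();
  final int noisyWordStart;
  final int maybeIncorrectWordStart;

  VocabularyLayoutSketch(
      List<String> correct, List<String> noisy, List<String> maybeIncorrect) {
    // Layout: [correct words][noisy words][maybe-incorrect words]
    words.addAll(correct);
    noisyWordStart = words.size();
    words.addAll(noisy);
    maybeIncorrectWordStart = words.size();
    words.addAll(maybeIncorrect);
    // If a word occurs in more than one group, the later index wins.
    for (int i = 0; i < words.size(); i++) {
      indexes.put(words.get(i), i);
    }
  }

  boolean isCorrect(int index) {
    return index >= 0 && index < noisyWordStart;
  }

  boolean isNoisy(int index) {
    return index >= noisyWordStart && index < maybeIncorrectWordStart;
  }

  boolean isMaybeIncorrect(int index) {
    return index >= maybeIncorrectWordStart && index < words.size();
  }
}
```

Storing two boundary offsets instead of a per-word label keeps the group test to a pair of integer comparisons.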

Code example source: ahmetaa/zemberek-nlp

.loadFromUtf8File(cleanRoot.resolve("correct"), ' ');
Histogram<String> incorrectFromNoisy = Histogram
  .loadFromUtf8File(noisyRoot.resolve("incorrect"), ' ');
incorrectFromNoisy.removeSmaller(2);
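
The `removeSmaller(2)` call prunes rare entries after loading. As a rough sketch of that kind of pruning on a plain map, assuming entries with a count strictly below the threshold are dropped (the exact boundary behavior of Zemberek's method is an assumption here, and `PruneSketch` is a hypothetical name):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not Zemberek's Histogram.removeSmaller.
class PruneSketch {

  // Removes every key whose count is below minCount (assumed strict '<')
  // and returns the number of removed keys.
  static int removeSmaller(Map<String, Integer> counts, int minCount) {
    int before = counts.size();
    counts.values().removeIf(count -> count < minCount);
    return before - counts.size();
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new HashMap<>();
    counts.put("evv", 1);
    counts.put("okl", 2);
    counts.put("gelcem", 5);
    int removed = removeSmaller(counts, 2);
    System.out.println(removed + " removed, remaining: " + counts.keySet());
  }
}
```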

Code example source: ahmetaa/zemberek-nlp

Log.info("Language model = %s", lm.info());
Histogram<String> wordFreq = Histogram.loadFromUtf8File(noisyVocab.resolve("incorrect"), ' ');
wordFreq.add(Histogram.loadFromUtf8File(cleanVocab.resolve("incorrect"), ' '));
Log.info("%d words loaded.", wordFreq.size());
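
The last example loads two histograms and folds one into the other with `add`. A minimal sketch of that merge on plain maps is shown below, assuming that adding one histogram to another sums the counts of shared keys; `MergeSketch` and `addAll` are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not Zemberek's Histogram.add.
class MergeSketch {

  // Adds every count from 'other' into 'target', summing counts for shared keys.
  static void addAll(Map<String, Integer> target, Map<String, Integer> other) {
    other.forEach((key, count) -> target.merge(key, count, Integer::sum));
  }

  public static void main(String[] args) {
    Map<String, Integer> noisy = new HashMap<>(Map.of("evv", 3, "okl", 1));
    Map<String, Integer> clean = Map.of("evv", 2, "gelcem", 4);
    addAll(noisy, clean);
    System.out.println(noisy.size() + " words after merge");
  }
}
```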
