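This article collects code examples for the Java method zemberek.core.collections.Histogram.loadFromUtf8File() and shows how Histogram.loadFromUtf8File() is used in practice. The examples come mainly from platforms such as GitHub, Stack Overflow and Maven, and were extracted from selected projects to serve as practical reference. Details of the Histogram.loadFromUtf8File() method are as follows: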
Fully qualified class name: zemberek.core.collections.Histogram
Class name: Histogram
Method name: loadFromUtf8File
Description: Loads a String Histogram from a file. Each line is expected to hold a key and its count, separated by the delimiter character. Format: [key][delimiter][count]
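To make the [key][delimiter][count] layout concrete, here is a minimal sketch of a parser for such lines that uses only the JDK. It is not zemberek's implementation: the class name `KeyCountLines`, splitting at the first delimiter, summing repeated keys, and rejecting malformed lines are all assumptions of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of zemberek.
final class KeyCountLines {

  // Parses lines shaped like [key][delimiter][count] into a key -> count map.
  static Map<String, Integer> parse(List<String> lines, char delimiter) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String line : lines) {
      // Assumption: the first delimiter separates the key from the count.
      int split = line.indexOf(delimiter);
      if (split <= 0) {
        // Assumption: a line without a key or delimiter is treated as an error.
        throw new IllegalArgumentException("Bad line: " + line);
      }
      String key = line.substring(0, split);
      int count = Integer.parseInt(line.substring(split + 1).trim());
      // Assumption: repeated keys have their counts summed.
      counts.merge(key, count, Integer::sum);
    }
    return counts;
  }
}
```

With a space delimiter, the lines "merhaba 12" and "selam 3" become the entries merhaba -> 12 and selam -> 3. To read a real file, the lines could first be obtained with `Files.readAllLines(path, StandardCharsets.UTF_8)`.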
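Code example source: ahmetaa/zemberek-nlp
// Required imports: java.io.IOException, java.nio.file.Path, zemberek.core.collections.Histogram
static void multipleLetterRepetitionWords(Path in, Path out) throws IOException {
  // Load "word count" pairs; word and count are separated by a space.
  Histogram<String> noisyWords = Histogram.loadFromUtf8File(in, ' ');
  Histogram<String> repetitionWords = new Histogram<>();
  for (String w : noisyWords) {
    if (w.length() == 1) {
      continue;
    }
    // Find the longest run of identical consecutive characters in w.
    int maxRepetitionCount = 1;
    int repetitionCount = 1;
    char lastChar = w.charAt(0);
    for (int i = 1; i < w.length(); i++) {
      char c = w.charAt(i);
      if (c == lastChar) {
        repetitionCount++;
      } else {
        if (repetitionCount > maxRepetitionCount) {
          maxRepetitionCount = repetitionCount;
        }
        // A new run starts at c, so its length is 1.
        repetitionCount = 1;
      }
      lastChar = c;
    }
    // Account for a run that reaches the end of the word.
    if (repetitionCount > maxRepetitionCount) {
      maxRepetitionCount = repetitionCount;
    }
    // Keep only words that contain a repeated letter, with their original counts.
    if (maxRepetitionCount > 1) {
      repetitionWords.set(w, noisyWords.getCount(w));
    }
  }
  repetitionWords.saveSortedByCounts(out, " ");
}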
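The run-length scan in the example above can be isolated as a small helper, which makes it easier to check on its own. This is a sketch under stated assumptions: the class name `LetterRuns` and method `longestRun` are hypothetical and do not come from zemberek. The counter restarts at 1 and the maximum is updated inside the loop, so a run at the very end of the word is counted as well.

```java
// Hypothetical helper for illustration; not part of zemberek.
final class LetterRuns {

  // Returns the length of the longest run of identical consecutive
  // characters in w, or 0 for an empty string.
  static int longestRun(String w) {
    if (w.isEmpty()) {
      return 0;
    }
    int longest = 1;
    int current = 1;
    for (int i = 1; i < w.length(); i++) {
      if (w.charAt(i) == w.charAt(i - 1)) {
        current++;
        if (current > longest) {
          longest = current;
        }
      } else {
        // A new run starts at position i, so its length is 1.
        current = 1;
      }
    }
    return longest;
  }
}
```

A word would then count as a letter-repetition word when `LetterRuns.longestRun(w) > 1`, matching the `maxRepetitionCount > 1` condition in the example.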
Code example source: ahmetaa/zemberek-nlp
// Constructor excerpt; words, indexes, noisyWordStart and maybeIncorrectWordStart
// are fields of the enclosing class.
NormalizationVocabulary(
    Path correct,
    Path incorrect,
    Path maybeIncorrect,
    int correctMinCount,
    int incorrectMinCount,
    int maybeIncorrectMinCount) throws IOException {
  Histogram<String> correctWords = Histogram.loadFromUtf8File(correct, ' ');
  Histogram<String> noisyWords = Histogram.loadFromUtf8File(incorrect, ' ');
  // The third file is optional; an empty histogram is used when it is absent.
  Histogram<String> maybeIncorrectWords = new Histogram<>();
  if (maybeIncorrect != null) {
    maybeIncorrectWords = Histogram.loadFromUtf8File(maybeIncorrect, ' ');
  }
  // Apply the per-list minimum counts.
  correctWords.removeSmaller(correctMinCount);
  noisyWords.removeSmaller(incorrectMinCount);
  maybeIncorrectWords.removeSmaller(maybeIncorrectMinCount);
  // Concatenate the three lists and remember where each segment starts.
  this.noisyWordStart = correctWords.size();
  this.words = new ArrayList<>(correctWords.getSortedList());
  words.addAll(noisyWords.getSortedList());
  this.maybeIncorrectWordStart = words.size();
  words.addAll(maybeIncorrectWords.getSortedList());
  // Map every word to its position in the combined list.
  int i = 0;
  for (String word : words) {
    indexes.put(word, i);
    i++;
  }
}
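The constructor above builds one combined word list made of three consecutive segments and records where each segment starts. The following sketch reproduces only that index layout with plain JDK collections. The class `SegmentedVocabulary` and the method `segmentOf` are hypothetical names used for illustration; the field names mirror those in the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the index layout; not zemberek code.
final class SegmentedVocabulary {
  final List<String> words = new ArrayList<>();
  final Map<String, Integer> indexes = new HashMap<>();
  final int noisyWordStart;
  final int maybeIncorrectWordStart;

  SegmentedVocabulary(List<String> correct, List<String> noisy, List<String> maybe) {
    words.addAll(correct);
    noisyWordStart = words.size();
    words.addAll(noisy);
    maybeIncorrectWordStart = words.size();
    words.addAll(maybe);
    for (int i = 0; i < words.size(); i++) {
      // If a word occurs in more than one list, the later index wins.
      indexes.put(words.get(i), i);
    }
  }

  // Classifies an index by the segment it falls into.
  String segmentOf(int index) {
    if (index < noisyWordStart) {
      return "correct";
    }
    if (index < maybeIncorrectWordStart) {
      return "noisy";
    }
    return "maybeIncorrect";
  }
}
```

Because the segments are contiguous, two integer boundaries are enough to tell which list a word index came from, without storing a label per word.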
Code example source: ahmetaa/zemberek-nlp
    .loadFromUtf8File(cleanRoot.resolve("correct"), ' ');
Histogram<String> incorrectFromNoisy = Histogram
    .loadFromUtf8File(noisyRoot.resolve("incorrect"), ' ');
incorrectFromNoisy.removeSmaller(2);
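The call removeSmaller(2) in the excerpt above prunes low-count entries after loading. As a rough sketch of that kind of filtering on a plain map, assuming that entries with a count strictly below the threshold are dropped (the exact boundary behaviour of zemberek's method is not stated in the source), one could write the following. `CountFilter` is a hypothetical name.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of zemberek.
final class CountFilter {

  // Removes every entry whose count is below minCount (assumed semantics)
  // and returns how many entries were removed.
  static int removeSmaller(Map<String, Integer> counts, int minCount) {
    int before = counts.size();
    counts.entrySet().removeIf(e -> e.getValue() < minCount);
    return before - counts.size();
  }
}
```

Under this assumption, a threshold of 2 discards words that were seen only once, which is a common way to drop one-off typos from a noisy vocabulary.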
Code example source: ahmetaa/zemberek-nlp
Log.info("Language model = %s", lm.info());
Histogram<String> wordFreq = Histogram.loadFromUtf8File(noisyVocab.resolve("incorrect"), ' ');
wordFreq.add(Histogram.loadFromUtf8File(cleanVocab.resolve("incorrect"), ' '));
Log.info("%d words loaded.", wordFreq.size());
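The excerpt above loads two files and combines them with wordFreq.add(...). Assuming that adding one histogram to another sums the counts key by key (an assumption, since the source does not spell this out), the same idea on plain maps looks like this. `CountMerge` and `addAll` are hypothetical names.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of zemberek.
final class CountMerge {

  // Adds every count from other into target, summing counts of shared keys.
  static void addAll(Map<String, Integer> target, Map<String, Integer> other) {
    other.forEach((key, count) -> target.merge(key, count, Integer::sum));
  }
}
```

After such a merge, the size of the target map is the number of distinct words across both inputs, which is the kind of figure the final log line in the excerpt reports.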