Java: how do I find word frequencies in a text file?

kdfy810k  posted on 2023-03-21  in Java

My task is to get the word frequencies for this file:

test_words_file-1.txt

The quick brown fox
Hopefully245this---is   a quick13947
task&&#%*for you to complete.
But maybe the tASk 098234 will be less
..quicK.
the the the the the the the the the the

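I have been trying to strip the symbols and digits from this file and get the frequency of each word in alphabetical order, and the result is:

I can see that the digits have been removed, but they are still being counted. Can you explain why, and how to fix it?
Also, how can I split "Hopefully245this---is" and store the three useful words "hopefully", "this", "is"?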

public class WordFreq2 {
    public static void main(String[] args) throws FileNotFoundException {

        File file = new File("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt");
        Scanner scanner = new Scanner(file); 
        int maxWordLen = 0; 
        String maxWord = null;

        HashMap<String, Integer> map = new HashMap<>();
        while(scanner.hasNext()) {
            String word = scanner.next();
            word = word.toLowerCase();
            // text cleaning 
            word = word.replaceAll("[^a-zA-Z]+", "");

            if(map.containsKey(word)) {
                //if the word already exists
                int count = map.get(word)+1;
                map.put(word,count);
            }
            else {
                // The word is new 
                int count = 1;
                map.put(word, count);

                // Find the max length of Word
                if (word.length() > maxWordLen) {
                    maxWordLen = word.length();
                    maxWord = word;
                }
            }   
        }

        scanner.close();

        //HashMap unsorted, sort 
        TreeMap<String, Integer> sorted = new TreeMap<>();
        sorted.putAll(map);

        for (Map.Entry<String, Integer> entry: sorted.entrySet()) {
            System.out.println(entry);
        }

        System.out.println(maxWordLen+" ("+maxWord+")");
    }

}

ubbxdtey1#

Code first. The explanation follows the code below.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordFreq2 {

    public static void main(String[] args) {
        Path path = Paths.get("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt");
        try {
            String text = Files.readString(path); // throws java.io.IOException
            text = text.toLowerCase();
            Pattern pttrn = Pattern.compile("[a-z]+");
            Matcher mtchr = pttrn.matcher(text);
            TreeMap<String, Integer> freq = new TreeMap<>();
            int longest = 0;
            while (mtchr.find()) {
                String word = mtchr.group();
                int letters = word.length();
                if (letters > longest) {
                    longest = letters;
                }
                // merge() stores 1 for a new word, otherwise adds 1 to the existing count
                freq.merge(word, 1, Integer::sum);
            }
            String format = "%-" + longest + "s = %2d%n";
            freq.forEach((k, v) -> System.out.printf(format, k, v));
            System.out.println("Longest = " + longest);
        }
        catch (IOException xIo) {
            xIo.printStackTrace();
        }
    }
}

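Since your sample file is small, I load the entire file contents into a String.
Then I convert the whole String to lower case, since a word is defined here as a series of consecutive alphabetic characters, regardless of case.
The regular expression, [a-z]+, searches for one or more consecutive lower-case alphabetic characters (remember that the whole String is now lower case).
Each successive call to find() locates the next word in the String (per the definition above, a run of consecutive lower-case letters).
To count the word frequencies, I use a TreeMap whose key is the word and whose value is the number of times that word occurs in the String. Note that Map keys and values cannot be primitives, so the value is an Integer rather than an int.
If the word just found is already in the Map, its count is incremented.
If the word just found is not yet in the Map, it is added with a count of 1 (one).
Along with adding words to the Map, I also count the letters of each word found, in order to find the length of the longest word.
After the whole String has been processed, I print the contents of the Map, one entry per line, and finally the number of letters in the longest word found. Note that TreeMap sorts its keys, so the word list is printed in alphabetical order.
Here is the output: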

a         =  1
be        =  1
brown     =  1
but       =  1
complete  =  1
for       =  1
fox       =  1
hopefully =  1
is        =  1
less      =  1
maybe     =  1
quick     =  3
task      =  2
the       = 12
this      =  1
to        =  1
will      =  1
you       =  1
Longest = 9
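
As for why the digits were still being counted in the question's code, here is a minimal sketch of what seems to be happening. It assumes the cause is that Scanner.next() splits on whitespace only: a token such as 098234 becomes the empty string after replaceAll, and that empty string is then stored in the map like any other key. The class name EmptyTokenDemo and the skipEmpty flag are made up for this illustration.

```java
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

class EmptyTokenDemo {

    // Counts whitespace-separated tokens after stripping non-letters, as the
    // question's loop does. With skipEmpty = true, tokens that end up empty
    // (e.g. "098234") are ignored instead of being counted under the key "".
    static Map<String, Integer> count(String text, boolean skipEmpty) {
        Map<String, Integer> counts = new TreeMap<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                String word = scanner.next().toLowerCase().replaceAll("[^a-z]+", "");
                if (skipEmpty && word.isEmpty()) {
                    continue;
                }
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String line = "But maybe the tASk 098234 will be less";
        System.out.println(count(line, false)); // includes an entry for ""
        System.out.println(count(line, true));  // no entry for ""
        System.out.println(count("Hopefully245this---is", true)); // one merged key
    }
}
```

Skipping empty tokens removes the stray entry, but Hopefully245this---is still collapses into the single key hopefullythisis, which is why matching runs of letters (as in the code above) is the more complete fix.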

ki0zmccv2#

How can I split "Hopefully245this---is" and store the three useful words "hopefully", "this", "is"?
Use the regex API for requirements like this.

Demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str = "Hopefully245this---is";
        Pattern pattern = Pattern.compile("[A-Za-z]+");
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}

Output:

Hopefully
this
is

See the following links to learn more about Java regular expressions:

  1. https://docs.oracle.com/javase/tutorial/essential/regex/index.html
  2. https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Pattern.html
  3. https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Matcher.html
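
An alternative to the Matcher loop, sketched here only as an assumption about what might suit the question, is String.split on runs of non-letters. One caveat: when the input starts with a non-letter, split returns a leading empty string, so it has to be filtered out. The class name SplitWords is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SplitWords {

    // Splits on runs of non-letters and lower-cases each piece.
    // Empty pieces (from leading separators such as "..") are dropped.
    static List<String> words(String s) {
        List<String> out = new ArrayList<>();
        for (String part : s.split("[^A-Za-z]+")) {
            if (!part.isEmpty()) {
                out.add(part.toLowerCase());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hopefully245this---is")); // [hopefully, this, is]
        System.out.println(words("..quicK."));              // [quick]
    }
}
```

The Matcher version above avoids the empty-string check entirely, since find() only ever returns non-empty matches for [A-Za-z]+.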

jw5wzhpr3#

On Java 9 or later, Matcher#results can be used in a stream-based solution, as follows:

// Needs imports from java.io, java.nio.file, java.util, java.util.regex and java.util.stream
Pattern pattern = Pattern.compile("[a-zA-Z]+");
try (BufferedReader br = Files.newBufferedReader(Paths.get("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt"))) {
    br.lines()
            .map(pattern::matcher)
            .flatMap(Matcher::results)
            .map(matchResult -> matchResult.group(0))
            .collect(Collectors.groupingBy(String::toLowerCase, TreeMap::new, Collectors.counting()))
            .forEach((word, count) -> System.out.printf("%s=%s%n", word, count));
} catch (IOException e) {
    System.err.format("IOException: %s%n", e);
}

Output:

a=1
be=1
brown=1
but=1
complete=1
for=1
fox=1
hopefully=1
is=1
less=1
maybe=1
quick=3
task=2
the=12
this=1
to=1
will=1
you=1
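
For experimenting without a file, the same pipeline can be applied to an in-memory String. This is only a sketch under that assumption; the class name StreamWordCount and the sample text are illustrative. It also makes visible that Collectors.counting() yields Long values, so the resulting map is a Map<String, Long> rather than a Map<String, Integer>.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StreamWordCount {

    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    // Groups every run of letters by its lower-case form and counts the group sizes.
    static Map<String, Long> count(String text) {
        return WORD.matcher(text)
                .results()                 // Stream<MatchResult>, available since Java 9
                .map(MatchResult::group)
                .collect(Collectors.groupingBy(
                        String::toLowerCase, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        String text = "The quick brown fox\n..quicK.\nthe the";
        count(text).forEach((word, n) -> System.out.printf("%s=%d%n", word, n));
    }
}
```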

juzqafwq4#

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
 
public class test
{
  public static void main(String[] args) throws FileNotFoundException
  {
    File f = new File("C:\\Users\\Nandini\\Downloads\\CountFreq.txt");
    Map<String, Integer> counts = new HashMap<String, Integer>();
    // try-with-resources closes the Scanner (and the underlying file) when done
    try (Scanner s = new Scanner(f))
    {
      while (s.hasNext())
      {
        // Tokens are split on whitespace only; punctuation and digits are kept
        String word = s.next().toLowerCase();
        if (!counts.containsKey(word))
          counts.put(word, 1);
        else
          counts.put(word, counts.get(word) + 1);
      }
    }
    System.out.println(counts);
  }
}

Output: {the=1, this=3, have=1, is=2, word=1}
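
Note that this version splits on whitespace only, so with the file from the question a token such as quick13947 would be counted as its own word. One possible adaptation, offered as a sketch rather than as part of this answer, is to tell the Scanner to treat every run of non-letters as a delimiter via useDelimiter. The class name DelimiterWordCount is hypothetical, and the sketch reads from a String; for a file, the Scanner would be built from a File as in the code above.

```java
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

class DelimiterWordCount {

    // Counts words, treating any run of non-letter characters as a separator.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        try (Scanner s = new Scanner(text)) {
            s.useDelimiter("[^A-Za-z]+");
            while (s.hasNext()) {
                String word = s.next().toLowerCase();
                if (word.isEmpty()) {
                    continue; // defensive guard against empty tokens
                }
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "Hopefully245this---is   a quick13947\n..quicK.";
        System.out.println(count(text));
    }
}
```

Using a TreeMap instead of a HashMap here also gives the alphabetical ordering the question asks for.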
