python-3.x: How to filter out stopwords listed in a text file without using nltk?

Posted by z9smfwbn on 2023-01-27 in Python


import re

input_file = open('documents.txt', 'r')
stopwords = open('stopwords.txt', 'r')  # opened, but not used yet

word_count = {}
for line in input_file.readlines():
    # Extract every run of word characters from the line
    words = re.findall(r'\w+', line)
    for word in words:
        word = word.lower()
        if word not in word_count:
            word_count[word] = 1
        else:
            word_count[word] = word_count[word] + 1

# Print each word with its count, in alphabetical order
word_index = sorted(word_count.keys())
for word in word_index:
    print(word, word_count[word])

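Currently, this code prints how often each word occurs in the input_file text document.
However, I need to leave out the stopwords listed in stopwords.txt, and I can't use nltk to do it.
What is the most efficient way to do this?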

#For each line you read in input_file.readlines()
  #if a word in input_file is in stopwords
    #append it
  #else

python-3.x

Source: https://stackoverflow.com/questions/75241860/how-to-append-stopwords-from-being-in-a-text-file-without-using-nltk

1 answer


5jdjgkvh1#

You can use a set, whose membership test runs in O(1) time on average:

stop_words = set(["in", "to", "this", ...])
if word in stop_words:
    print("discarded")
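
As a rough sketch of how such a set might be built from the lines of a stopword file: the helper name `load_stopwords` and the sample words are made up for illustration, and it assumes one stopword per line with case-insensitive matching.

```python
def load_stopwords(lines):
    """Hypothetical helper: build a lowercase stopword set from an iterable of lines."""
    # Strip whitespace and newlines, skip blank lines, and lowercase each entry
    return {line.strip().lower() for line in lines if line.strip()}

# Sample lines standing in for the contents of stopwords.txt (assumed format)
stop_words = load_stopwords(["in\n", "to\n", "This\n", "\n"])

# Keep only the words whose lowercase form is not a stopword
kept = [w for w in ["Back", "to", "this", "house"] if w.lower() not in stop_words]
print(kept)
```

A file object opened with `open(...)` can be passed as `lines` directly, since iterating over it yields one line at a time.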

import re

stopwords_file = 'stopwords.txt'

# Read one stopword per line into a set for fast membership tests;
# entries are lowercased so they match the lowercased words below
with open(stopwords_file) as f:
    stopwords_set = {line.strip().lower() for line in f if line.strip()}

word_count = {}
with open('documents.txt', 'r') as input_file:
    for line in input_file:
        words = re.findall(r'\w+', line)
        for word in words:
            word = word.lower()
            # Skip stopwords before counting
            if word in stopwords_set:
                continue
            if word not in word_count:
                word_count[word] = 1
            else:
                word_count[word] = word_count[word] + 1

# Print each word with its count, in alphabetical order
word_index = sorted(word_count.keys())
for word in word_index:
    print(word, word_count[word])
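
The counting loop could also be written more compactly with `collections.Counter` from the standard library. This is only a sketch of one possible variant: the function name `count_words` and the sample sentences are assumptions for illustration, not part of the original code.

```python
import re
from collections import Counter

def count_words(lines, stop_words):
    """Hypothetical helper: count lowercase words in lines, skipping stop_words."""
    counts = Counter()
    for line in lines:
        # Lowercase the whole line once, then pull out runs of word characters
        for word in re.findall(r"\w+", line.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts

# Sample data standing in for documents.txt and stopwords.txt
sample = ["The cat sat in the hat.", "The hat is red."]
counts = count_words(sample, {"the", "in", "is"})

# Print each word with its count, in alphabetical order
for word in sorted(counts):
    print(word, counts[word])
```

With real files, `lines` would be the open `documents.txt` file object and `stop_words` the set read from `stopwords.txt`.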

Answered 2023-01-27
