python-3.x: How do I add stop words from a text file without using nltk?

Posted by z9smfwbn on 2023-01-27 in Python

import re 

input_file = open('documents.txt', 'r')
stopwords = open('stopwords.txt', 'r')

word_count = {}
for line in input_file.readlines():
    words = re.findall(r'\w+', line)
    for word in words: 
      word = word.lower()
      if not word in word_count: 
        word_count[word] = 1
      else: 
        word_count[word] = word_count[word] + 1

word_index = sorted(word_count.keys())
for word in word_index:
  print (word, word_count[word])

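Currently, this code prints how often each word occurs in the input_file text document.
However, I need to omit the stop words listed in stopwords.txt, and I cannot use nltk to do this.
What is the most efficient way to do it?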

#For each line you read in input_file.readlines()
  #if a word in input_file is in stopwords
    #append it
  #else

5jdjgkvh1#

You can use a set, a data structure whose membership tests take O(1) time on average:

stop_words = set(["in", "to", "this", ...])
if word in stop_words:
    print("discarded")
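
To make the O(1) point concrete, here is a small sketch that runs the same membership test against a list and a set. The word list and the names `stop_list`, `stop_set`, and `is_stop_word` are made up for illustration and are not part of the question's code. Timings depend on the machine, so none are quoted here.

```python
import timeit

# Hypothetical stop words for illustration; in practice they would come from stopwords.txt.
stop_list = ["word%d" % i for i in range(1_000)] + ["the"]
stop_set = set(stop_list)


def is_stop_word(word, stop_words):
    """Return True if word, compared case-insensitively, is in stop_words."""
    return word.lower() in stop_words


# Both containers give the same answer for the same word.
print(is_stop_word("The", stop_list), is_stop_word("The", stop_set))

# The list is scanned element by element, while the set does a hash lookup.
list_time = timeit.timeit(lambda: "the" in stop_list, number=1_000)
set_time = timeit.timeit(lambda: "the" in stop_set, number=1_000)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

The gap between the two grows with the number of stop words, which is why converting the list to a set once, before the counting loop, is usually worth it.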

Applied to the code from the question:

import re

stopwords_file = 'stopwords.txt'

# Load the stop words (one per line) into a set for fast membership tests.
with open(stopwords_file) as f:
    stopwords_set = {line.strip() for line in f}

word_count = {}
with open('documents.txt') as input_file:
    for line in input_file:
        for word in re.findall(r'\w+', line):
            word = word.lower()
            # Skip stop words before counting.
            if word in stopwords_set:
                continue
            word_count[word] = word_count.get(word, 0) + 1

# Print the counts in alphabetical order.
for word in sorted(word_count):
    print(word, word_count[word])
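
As a possible variation, the same filtering can be written more compactly with `collections.Counter` from the standard library. This is only a sketch under a few assumptions: the function name `count_words` is hypothetical, the stop-word file is assumed to hold one word per line, and both files are assumed to be UTF-8 encoded.

```python
import re
from collections import Counter


def count_words(text_path, stopwords_path):
    """Count words in text_path, skipping words listed one per line in stopwords_path."""
    # Lowercase the stop words so the comparison below is case-insensitive.
    with open(stopwords_path, encoding="utf-8") as f:
        stop_words = {line.strip().lower() for line in f if line.strip()}

    counts = Counter()
    with open(text_path, encoding="utf-8") as f:
        for line in f:
            words = (w.lower() for w in re.findall(r"\w+", line))
            # Counter.update adds one to the count of every word it is given.
            counts.update(w for w in words if w not in stop_words)
    return counts
```

Calling `count_words('documents.txt', 'stopwords.txt')` returns a `Counter`; iterating over `sorted(result.items())` prints the words alphabetically, as the loop above does, and `result.most_common(10)` would give the ten most frequent words instead.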
