Python POS tagger is very slow

8wigbo56, posted 2024-01-05 in Python

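I am using nltk to generate n-grams from sentences, after first removing the given stop words. However, nltk.pos_tag() is very slow: it takes about 0.6 seconds per sentence on my CPU (Intel i7).
Output: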

    ['The first time I went, and was completely taken by the live jazz band and atmosphere, I ordered the Lobster Cobb Salad.']
    0.620481014252
    ["It's simply the best meal in NYC."]
    0.640982151031
    ['You cannot go wrong at the Red Eye Grill.']
    0.644664049149

Code:

    import time
    import nltk
    from nltk import word_tokenize
    from nltk.util import ngrams

    # source, stop_words and n are defined elsewhere in the script
    for sentence in source:
        nltk_ngrams = None
        if stop_words is not None:
            start = time.time()
            sentence_pos = nltk.pos_tag(word_tokenize(sentence))
            print(time.time() - start)  # time spent tagging this sentence
            filtered_words = [word for (word, pos) in sentence_pos if pos not in stop_words]
        else:
            filtered_words = ngrams(sentence.split(), n)


Is it really this slow, or am I doing something wrong?

frebpwbc #1

Use pos_tag_sents to tag multiple sentences:

    >>> import time
    >>> from nltk.corpus import brown
    >>> from nltk import pos_tag
    >>> from nltk import pos_tag_sents
    >>> sents = brown.sents()[:10]
    >>> start = time.time(); pos_tag(sents[0]); print(time.time() - start)
    0.934092998505
    >>> start = time.time(); [pos_tag(s) for s in sents]; print(time.time() - start)
    9.5061340332
    >>> start = time.time(); pos_tag_sents(sents); print(time.time() - start)
    0.939551115036
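
Applied to the loop in the question, batching could look roughly like the sketch below. The helper name `filter_by_pos` and the stand-in tokenizer and tagger are assumptions made for illustration, so that the example runs without NLTK or its model data; they are not part of NLTK.

```python
def filter_by_pos(sentences, stop_tags, tokenize, tag_sents):
    """Tokenize every sentence, tag the whole batch in one call,
    then drop the tokens whose POS tag is in stop_tags."""
    tokenized = [tokenize(s) for s in sentences]
    tagged = tag_sents(tokenized)  # a single tagging call for all sentences
    return [[word for word, pos in sent if pos not in stop_tags]
            for sent in tagged]


# Hypothetical stand-ins (not NLTK): a whitespace tokenizer and a toy tagger
# that labels "the" as DT and everything else as NN.
def fake_tokenize(sentence):
    return sentence.split()


def fake_tag_sents(token_lists):
    return [[(tok, "DT" if tok.lower() == "the" else "NN") for tok in toks]
            for toks in token_lists]


result = filter_by_pos(["The best meal", "the Red Eye Grill"],
                       {"DT"}, fake_tokenize, fake_tag_sents)
print(result)
```

With NLTK installed and its tagger model downloaded, passing `nltk.word_tokenize` and `nltk.pos_tag_sents` in place of the stand-ins should give the batched behaviour timed above.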


zkure5ic #2

NLTK's pos_tag is defined as:

    from nltk.tag.perceptron import PerceptronTagger

    def pos_tag(tokens, tagset=None):
        tagger = PerceptronTagger()
        return _pos_tag(tokens, tagset, tagger)

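So every call to pos_tag instantiates a new PerceptronTagger, which takes a lot of computation time. You can save that time by creating the tagger once and calling tagger.tag directly: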

    from nltk import word_tokenize
    from nltk.tag.perceptron import PerceptronTagger

    tagger = PerceptronTagger()  # load the model once, outside any loop
    sentence_pos = tagger.tag(word_tokenize(sentence))  # reuse it for each sentence
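
For the n-gram use case in the question, reusing one tagger across all sentences might be sketched as follows. `filtered_ngrams` and `FakeTagger` are hypothetical names introduced here; the fake class only mimics the `tag(tokens)` method so the example runs without NLTK, and it counts constructions to show the tagger is built a single time.

```python
def filtered_ngrams(token_lists, tagger, stop_tags, n):
    """Tag each token list with one shared tagger, drop tokens whose tag
    is in stop_tags, and build n-grams from the remaining words."""
    results = []
    for tokens in token_lists:
        kept = [word for word, pos in tagger.tag(tokens) if pos not in stop_tags]
        # Zipping n shifted copies of the list yields consecutive n-grams.
        results.append(list(zip(*(kept[i:] for i in range(n)))))
    return results


class FakeTagger:
    """Hypothetical stand-in for a tagger object; counts constructions."""
    instances = 0

    def __init__(self):
        FakeTagger.instances += 1

    def tag(self, tokens):
        return [(t, "DT" if t.lower() == "the" else "NN") for t in tokens]


tagger = FakeTagger()  # built once, before the loop
bigrams = filtered_ngrams(
    [["the", "best", "meal", "in", "NYC"], ["the", "Red", "Eye"]],
    tagger, {"DT"}, 2)
print(bigrams)
print(FakeTagger.instances)
```

In real use, the `tagger` argument would be a `PerceptronTagger()` instance created once, and each token list would come from `word_tokenize`.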

km0tfn4u #3

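If you are looking for another POS tagger with fast performance in Python, you might want to try RDRPOSTagger. For example, on English POS tagging, the single-threaded Python implementation tags about 8K words per second on a Core 2 Duo 2.4 GHz machine. You can get faster tagging simply by using the multi-threaded mode. RDRPOSTagger achieves very competitive accuracy compared with state-of-the-art taggers and now supports pre-trained models for 40 languages. See the experimental results in this paper.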

3lxsmp7m #4

If you are working with nested lists, you should flatten them into a single list; this will help you improve the speed of StanfordPOSTagger.
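
A minimal sketch of that idea: flatten the token lists, tag them in one call, then split the result back into sentences by their original lengths. `tag_flattened` and `fake_tag` are hypothetical names used for illustration; `tag` stands for any callable that maps a list of tokens to a list of (token, tag) pairs. Tagging across sentence boundaries may change some tags with context-sensitive taggers, so treat this as a speed trade-off rather than a drop-in replacement.

```python
from itertools import chain, islice


def tag_flattened(token_lists, tag):
    """Flatten nested token lists, tag them in a single call, and regroup
    the tagged tokens into one list per original sentence."""
    lengths = [len(tokens) for tokens in token_lists]
    flat = list(chain.from_iterable(token_lists))
    tagged = iter(tag(flat))  # one tagging call for all tokens
    # Consume the tagged stream in chunks matching the original lengths.
    return [list(islice(tagged, length)) for length in lengths]


# Hypothetical stand-in tagger: labels every token "X".
def fake_tag(tokens):
    return [(tok, "X") for tok in tokens]


regrouped = tag_flattened([["You", "cannot"], ["go", "wrong", "here"]], fake_tag)
print(regrouped)
```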
