This question already has answers here:
Speed up millions of regex replacements in Python 3 (9 answers)
Closed yesterday.
I have a large corpus from which I want to remove certain words. This is similar to removing stopwords from text, except that now I want to remove bigrams from the corpus. I have my list of bigrams, but obviously the simple list-comprehension approach used for stopwords won't cut it. I was thinking of using a regex: compile a pattern from the word list and then substitute the words away. Here is some example code:
txt = 'He was the type of guy who liked Christmas lights on his house in the middle of July. He picked up trash in his spare time to dump in his neighbors yard. If eating three-egg omelets causes weight-gain, budgie eggs are a good substitute. We should play with legos at camp. She cried diamonds. She had some amazing news to share but nobody to share it with. He decided water-skiing on a frozen lake wasn’t a good idea. His eyes met mine on the street. When he asked her favorite number, she answered without hesitation that it was diamonds. She is never happy until she finds something to be unhappy about; then, she is overjoyed.'
--
import re

words_to_remove = ['this is', 'We should', 'Christmas lights']
# Escape each phrase and join with '|'. (Joining with r' | ' would make
# the surrounding spaces part of every alternative.)
pattrn = re.compile('|'.join(map(re.escape, words_to_remove)))
pattrn.sub(' ', txt)
%timeit pattrn.sub(' ', txt)
--
timeit 1: 9.18 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Is there a faster way for me to remove these bigrams? The actual corpus is 556,694,135 characters long and the bigram list has 3,205,182 entries, so this is extremely slow on the real dataset.
1 Answer
You can rewrite the regex so that it has the structure of a trie: instead of word|worse|wild, use w(or(d|se)|ild). Or, better yet, drop the regex altogether and use the Aho–Corasick algorithm. Of course, you can use a library for that, e.g. flashtext (a stripped-down Aho–Corasick specialized for finding and replacing whole words, which is exactly your case). The author of flashtext claims: »the regex ran for 5 days, so I built a tool that did it in 15 minutes«.
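The trie idea can be sketched in pure Python. `build_trie_regex` below is a hypothetical helper (not part of the answer or any library) that compiles a phrase list into a trie-shaped pattern sharing common prefixes:

```python
import re

def build_trie_regex(phrases):
    # Build a nested-dict trie from the phrases.
    trie = {}
    for phrase in phrases:
        node = trie
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[''] = True  # '' marks the end of a phrase

    # Recursively emit a pattern that shares common prefixes,
    # e.g. ['word', 'worse', 'wild'] -> w(?:or(?:d|se)|ild)
    def to_pattern(node):
        branches = []
        can_end = False
        for ch, child in node.items():
            if ch == '':
                can_end = True
            else:
                branches.append(re.escape(ch) + to_pattern(child))
        if not branches:
            return ''
        if len(branches) == 1 and not can_end:
            return branches[0]
        pattern = '(?:' + '|'.join(branches) + ')'
        return pattern + '?' if can_end else pattern

    return to_pattern(trie)

pattrn = re.compile(build_trie_regex(['this is', 'We should', 'Christmas lights']))
```

Because alternatives sharing a prefix are merged into one branch, the engine no longer retries every phrase from scratch at each position; with millions of phrases this can make the compiled pattern dramatically faster (and smaller) than a flat `a|b|c|…` alternation.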
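For the Aho–Corasick route without an external dependency, here is a minimal sketch (the function name `aho_corasick_replace` is my own; note that flashtext additionally restricts matches to whole-word boundaries, which this bare-bones version does not):

```python
from collections import deque

def aho_corasick_replace(text, phrases, replacement=' '):
    # Goto trie: goto[state][char] -> next state; out[state] = length of
    # the longest phrase that ends at this state (0 = no phrase ends here).
    goto, out = [{}], [0]
    for phrase in phrases:
        state = 0
        for ch in phrase:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                out.append(0)
            state = nxt
        out[state] = len(phrase)

    # Breadth-first pass computing failure links (longest proper suffix
    # that is also a path in the trie), propagating match lengths.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = max(out[t], out[fail[t]])

    # Single pass over the text, marking every matched span for removal.
    removed = bytearray(len(text))
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if out[state]:
            for j in range(i + 1 - out[state], i + 1):
                removed[j] = 1

    # Rebuild, collapsing each matched run into one replacement string.
    parts, i = [], 0
    while i < len(text):
        if removed[i]:
            parts.append(replacement)
            while i < len(text) and removed[i]:
                i += 1
        else:
            parts.append(text[i])
            i += 1
    return ''.join(parts)
```

The key property is that the scan is a single pass over the text regardless of how many phrases there are, so the runtime is roughly O(len(text) + total phrase length + matches) instead of growing with the number of alternatives, which is what matters at 3M bigrams.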