When running data through the en_core_web_trf model in parallel, I get different results between runs.
I couldn't find this behavior explained in the documentation or in other GitHub issues.
The following code reproduces the behavior; if I don't run the data through the pipeline in parallel (e.g. by setting max_workers=1), the results are consistent every time.
import spacy
from concurrent.futures import ThreadPoolExecutor

nlp = spacy.load("en_core_web_trf")

def extract_entities(sentences):
    with ThreadPoolExecutor(max_workers=4) as e:
        submitted = [e.submit(call_spacy, sent) for sent in sentences]
        resolved = [item.result() for item in submitted]
    return resolved

def call_spacy(sent):
    result = nlp(sent)
    return result.ents

input = [
    "CoCo Town also known as the Collective Commerce District or more simply as the Coco District was a dilapidated industrial area of the planet Coruscant.",
    "It was also the site of Dexs Diner a local eatery owned by Dexter Jettster during the Republic Era.",
    "Hard working laborers visited CoCo Town to congregate at the diner.",
    "During the Galactic Civil War the Galactic Empire and the New Republic fought for control of the region.",
    "Many orphans from the area formed the Anklebiter Brigade and fought alongside the rebels sabotaging the Empire wherever possible."
]

for i in range(10):
    result = extract_entities(input)
    print(sum([len(x) for x in result]))
Your Environment
- Operating System:
  Amazon Linux 2
  Kernel: Linux 4.14.294-220.533.amzn2.x86_64
- Python version used:
  Python 3.7.10
- spaCy version used:
  3.1.3
- Environment information:
  en-core-web-trf==3.1.0
1 Answer
I can reproduce this, but it is probably related to torch rather than to spaCy directly, and I'm not sure what might be going on in torch to cause it. We'll take a look! The first alternative we'd suggest trying is the built-in multiprocessing with nlp.pipe.

Caveats: call torch.set_num_threads(1) to avoid deadlocks related to torch multiprocessing (more details in BERT Model (German) does not work in multiprocessing mode #4667). Whether or not you use nlp.pipe(n_process=) for multiprocessing, you should use nlp.pipe to process texts in batches for speed.
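The suggestion above can be sketched as follows. This is a minimal illustration, not a tested fix: it assumes spaCy 3.x with en_core_web_trf and torch installed, and the helper name extract_entities is mine, not from the spaCy API.

```python
def extract_entities(sentences, n_process=1):
    """Batch sentences through spaCy with nlp.pipe instead of a thread pool.

    Illustrative sketch: assumes en_core_web_trf and torch are installed.
    """
    # Third-party imports kept inside the function so the sketch can be
    # imported without the heavy dependencies present.
    import spacy
    import torch

    # Limit torch to one intra-op thread to avoid the multiprocessing
    # deadlocks discussed in spaCy issue #4667.
    torch.set_num_threads(1)

    nlp = spacy.load("en_core_web_trf")
    # nlp.pipe batches the texts internally; n_process > 1 distributes
    # the batches across worker processes instead of threads.
    return [doc.ents for doc in nlp.pipe(sentences, n_process=n_process)]

if __name__ == "__main__":
    sentences = [
        "CoCo Town was a dilapidated industrial area of the planet Coruscant.",
        "It was also the site of Dexs Diner, owned by Dexter Jettster.",
    ]
    for ents in extract_entities(sentences, n_process=2):
        print([(e.text, e.label_) for e in ents])
```

Because nlp.pipe owns the batching and process management, each text is tokenized and tagged in a single, deterministic pass per worker, which avoids sharing one model instance across concurrent threads as the ThreadPoolExecutor version does.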