When running data through the en_core_web_trf model in parallel, I get different results between runs.
I couldn't find this behavior explained in the documentation or in other GitHub issues.
The following code reproduces the behavior; if I don't run the data through the pipeline in parallel (e.g. by setting max_workers=1), the results are consistent every time.
import spacy
from concurrent.futures import ThreadPoolExecutor

nlp = spacy.load("en_core_web_trf")

def extract_entities(sentences):
    with ThreadPoolExecutor(max_workers=4) as e:
        submitted = [e.submit(call_spacy, sent) for sent in sentences]
        resolved = [item.result() for item in submitted]
    return resolved

def call_spacy(sent):
    result = nlp(sent)
    return result.ents

input = [
    "CoCo Town also known as the Collective Commerce District or more simply as the Coco District was a dilapidated industrial area of the planet Coruscant.",
    "It was also the site of Dexs Diner a local eatery owned by Dexter Jettster during the Republic Era.",
    "Hard working laborers visited CoCo Town to congregate at the diner.",
    "During the Galactic Civil War the Galactic Empire and the New Republic fought for control of the region.",
    "Many orphans from the area formed the Anklebiter Brigade and fought alongside the rebels sabotaging the Empire wherever possible."
]

for i in range(10):
    result = extract_entities(input)
    print(sum([len(x) for x in result]))
Your Environment
- Operating System:
  Amazon Linux 2
  Kernel: Linux 4.14.294-220.533.amzn2.x86_64
- Python version used:
  Python 3.7.10
- spaCy version used:
  3.1.3
- Environment information:
  en-core-web-trf==3.1.0
1 Answer
I can reproduce this, but it is probably related to torch rather than to spaCy directly, and I'm not sure what might be going on in torch to cause it. We'll take a look! The first alternative we'd suggest trying is the built-in multiprocessing with nlp.pipe.

Caveats: call torch.set_num_threads(1) to avoid deadlocks related to torch multiprocessing (more details in BERT Model (German) does not work in multiprocessing mode #4667). Whether or not you use nlp.pipe(n_process=) for multiprocessing, you should use nlp.pipe to process texts in batches for speed.
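The suggestion above can be sketched as follows. This is a minimal illustration, not a tested fix: it assumes spaCy 3.x with en_core_web_trf and torch installed, and the helper name extract_entities is mine, not from the spaCy API.

```python
def extract_entities(sentences, n_process=1):
    """Batch sentences through spaCy with nlp.pipe instead of a thread pool.

    Illustrative sketch: assumes en_core_web_trf and torch are installed.
    """
    # Third-party imports kept inside the function so the sketch can be
    # imported without the heavy dependencies present.
    import spacy
    import torch

    # Limit torch to one intra-op thread to avoid the multiprocessing
    # deadlocks discussed in spaCy issue #4667.
    torch.set_num_threads(1)

    nlp = spacy.load("en_core_web_trf")
    # nlp.pipe batches the texts internally; n_process > 1 distributes
    # the batches across worker processes instead of threads.
    return [doc.ents for doc in nlp.pipe(sentences, n_process=n_process)]

if __name__ == "__main__":
    sentences = [
        "CoCo Town was a dilapidated industrial area of the planet Coruscant.",
        "It was also the site of Dexs Diner, owned by Dexter Jettster.",
    ]
    for ents in extract_entities(sentences, n_process=2):
        print([(e.text, e.label_) for e in ents])
```

Because nlp.pipe owns the batching and process management, each text is tokenized and tagged in a single, deterministic pass per worker, which avoids sharing one model instance across concurrent threads as the ThreadPoolExecutor version does.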