spaCy NER predictions are inconsistent for identical input when using ThreadPoolExecutor

import spacy
from concurrent.futures import ThreadPoolExecutor

nlp = spacy.load("en_core_web_trf")

def extract_entities(sentences):
    with ThreadPoolExecutor(max_workers=4) as e:
        submitted = [e.submit(call_spacy, sent) for sent in sentences]
        resolved = [item.result() for item in submitted]

        return resolved

def call_spacy(sent):
    result = nlp(sent)
    return result.ents

input = [
    "CoCo Town also known as the Collective Commerce District or more simply as the Coco District was a dilapidated industrial area of the planet Coruscant.",
    "It was also the site of Dexs Diner a local eatery owned by Dexter Jettster during the Republic Era.",
    "Hard working laborers visited CoCo Town to congregate at the diner.",
    "During the Galactic Civil War the Galactic Empire and the New Republic fought for control of the region.",
    "Many orphans from the area formed the Anklebiter Brigade and fought alongside the rebels sabotaging the Empire wherever possible.",
]

for i in range(10):
    result = extract_entities(input)
    print(sum([len(x) for x in result]))


  • Operating System: Amazon Linux 2, kernel: Linux 4.14.294-220.533.amzn2.x86_64
  • Python Version Used: 3.7.10
  • spaCy Version Used: 3.1.3
  • Environment Information: en-core-web-trf==3.1.0


I can reproduce this, but it may be related to torch rather than directly to spaCy, and I'm not sure what could be happening in torch to cause it. We'll take a look!
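While this is being investigated, it may help to see which sentences differ rather than only comparing total entity counts. Below is a minimal sketch of such a check. The helper names `run_threaded` and `find_mismatches` are hypothetical, and `fake_entities` is a deterministic stand-in for the model call so the snippet runs without spaCy installed.

```python
from concurrent.futures import ThreadPoolExecutor


def run_threaded(fn, items, max_workers=4):
    """Apply fn to each item on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, items))


def find_mismatches(fn, items, runs=10, max_workers=4):
    """Compare repeated threaded runs against one sequential baseline.

    Returns a list of (run, index, expected, got) tuples, one for every
    threaded result that differs from the sequential result.
    """
    baseline = [fn(item) for item in items]
    mismatches = []
    for run in range(runs):
        threaded = run_threaded(fn, items, max_workers)
        for index, (expected, got) in enumerate(zip(baseline, threaded)):
            if expected != got:
                mismatches.append((run, index, expected, got))
    return mismatches


def fake_entities(sentence):
    """Hypothetical stand-in for the model: returns the capitalized words."""
    return [word for word in sentence.split() if word[0].isupper()]


demo = ["CoCo Town was on Coruscant.", "laborers visited the diner."]
print(find_mismatches(fake_entities, demo))  # prints []
```

With the real pipeline, one could pass a function such as `lambda s: [(e.text, e.label_) for e in nlp(s).ents]` in place of `fake_entities`, so that plain tuples are compared instead of span objects. Any entries in the returned list would point at the specific sentences and runs where the threaded output diverges.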

The first alternative we'd suggest trying is the built-in multiprocessing in nlp.pipe:

import spacy
import torch


nlp = spacy.load("en_core_web_trf")

input = [
    "CoCo Town also known as the Collective Commerce District or more simply as the Coco District was a dilapidated industrial area of the planet Coruscant.",
    "It was also the site of Dexs Diner a local eatery owned by Dexter Jettster during the Republic Era.",
    "Hard working laborers visited CoCo Town to congregate at the diner.",
    "During the Galactic Civil War the Galactic Empire and the New Republic fought for control of the region.",
    "Many orphans from the area formed the Anklebiter Brigade and fought alongside the rebels sabotaging the Empire wherever possible.",
]

for i in range(10):
    print(sum(len(doc.ents) for doc in nlp.pipe(input, n_process=4)))
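
If the calling application is already threaded and cannot easily switch to nlp.pipe, a conservative stopgap is to let only one thread call the model at a time. The sketch below assumes the inconsistency comes from concurrent access to shared model state, which has not been confirmed in this thread. `SerializedCallable` and `stand_in_model` are hypothetical names, and the stand-in replaces the real `nlp` object so the snippet runs on its own.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SerializedCallable:
    """Wrap a callable so that only one thread executes it at a time."""

    def __init__(self, fn):
        self._fn = fn
        self._lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        # Other threads block here until the current call has finished.
        with self._lock:
            return self._fn(*args, **kwargs)


def stand_in_model(sentence):
    """Hypothetical stand-in for nlp(sentence): returns the word count."""
    return len(sentence.split())


safe_model = SerializedCallable(stand_in_model)

with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(safe_model, ["a b c", "d e", "f"]))

print(counts)  # prints [3, 2, 1]
```

Serializing the calls gives up any parallel speedup for the model itself, so this only makes sense when the threads also do other work, such as I/O. For throughput, the nlp.pipe approach above is the better fit.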

