python-3.x: spaCy custom component function is never called

kg7wmglp  posted on 2023-10-21  in Python

I am adding a custom component to spaCy, but it is never called:

@Language.component("custom_sentence_boundaries")
def custom_sentence_boundaries(doc):
    print(".")
    for token in doc[:-1]:
        if token.text == "\n":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("de_core_web_sm")
nlp.add_pipe("custom_sentence_boundaries", after="parser")
nlp.analyze_pipes(pretty=True)
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]

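I do get a result in sentences, and the pipeline analysis does list my component, but the custom component seems to have no effect: I never see the printed dot.
Any ideas?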

okxuctiv #1

In the code you pasted, you are running:

nlp = spacy.load("de_core_web_sm")

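However, it should be the following (for German text, note that spaCy's small German pipeline is named de_core_news_sm, not de_core_web_sm):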

nlp = spacy.load("en_core_web_sm")
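If the named pipeline package is not installed, spacy.load raises an error and nothing after it runs, which would explain a component that never prints. Below is a minimal sketch for checking this up front. It assumes the pipeline was installed as a regular Python package (as python -m spacy download does) rather than loaded from a directory path; the helper name is_pipeline_installed is made up for illustration.

```python
from importlib.util import find_spec


def is_pipeline_installed(name: str) -> bool:
    """Return True if a top-level package called `name` can be imported."""
    # Hypothetical helper: find_spec returns None when no such package exists.
    return find_spec(name) is not None


# Report which of the candidate pipeline packages are importable here.
for name in ("de_core_web_sm", "de_core_news_sm", "en_core_web_sm"):
    print(name, is_pipeline_installed(name))
```

Any name that prints False here needs to be installed (or corrected) before spacy.load can succeed.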

I tried to reproduce your code, and this is what I got:

import spacy
from spacy.language import Language

@Language.component("custom_sentence_boundaries")
def custom_sentence_boundaries(doc):
    print("...$...")  # print "...$..." so the call is easy to spot in the output
    for token in doc[:-1]:
        if token.text == "\n":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("custom_sentence_boundaries", after="parser")
nlp.analyze_pipes(pretty=True)
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]

Output

(Note that ...$... is printed at the bottom, and that custom_sentence_boundaries is listed right after parser because we passed after="parser" as a keyword argument.)

============================= Pipeline Overview =============================

#   Component                    Assigns               Requires   Scores             Retokenizes
-   --------------------------   -------------------   --------   ----------------   -----------
0   tok2vec                      doc.tensor                                          False      
                                                                                                
1   tagger                       token.tag                        tag_acc            False      
                                                                                                
2   parser                       token.dep                        dep_uas            False      
                                 token.head                       dep_las                       
                                 token.is_sent_start              dep_las_per_type              
                                 doc.sents                        sents_p                       
                                                                  sents_r                       
                                                                  sents_f                       
                                                                                                
3   custom_sentence_boundaries                                                       False      
                                                                                                
4   attribute_ruler                                                                  False      
                                                                                                
5   lemmatizer                   token.lemma                      lemma_acc          False      
                                                                                                
6   ner                          doc.ents                         ents_f             False      
                                 token.ent_iob                    ents_p                        
                                 token.ent_type                   ents_r                        
                                                                  ents_per_type                 

✔ No problems found.
...$...
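
To test the component on its own, without downloading any trained pipeline, something like the sketch below may help. It assumes spaCy v3 is installed and uses spacy.blank("en"), which provides only a tokenizer; the component name newline_sentence_boundaries and the calls list are made up for illustration. The sample text contains a newline so that the rule actually fires.

```python
import spacy
from spacy.language import Language

calls = []  # records each invocation so we can confirm the component ran


@Language.component("newline_sentence_boundaries")  # hypothetical name
def newline_sentence_boundaries(doc):
    calls.append(len(doc))
    # Mark the token after every newline token as the start of a sentence.
    for token in doc[:-1]:
        if token.text == "\n":
            doc[token.i + 1].is_sent_start = True
    return doc


nlp = spacy.blank("en")  # tokenizer only, no parser or other trained parts
nlp.add_pipe("newline_sentence_boundaries")

doc = nlp("First line.\nSecond line.")
sentences = [sent.text for sent in doc.sents]
print(nlp.pipe_names, len(calls), len(sentences))
```

If calls stays empty after nlp(...) has run, the component is not in the pipeline that processed the text.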
