python-3.x: spaCy tokenizer that uses only a "whitespace" rule

Asked by ijxebb2r on 2023-05-19 in Python

I would like to know whether the spaCy tokenizer can tokenize words using only a "whitespace" rule. For example:

sentence = "(c/o Oxford University )"

Normally, with the following spaCy setup:

nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
for token in doc:
    print(token)

the result will be:

(
c
/
o
Oxford
University
)

Instead, I would like the output to look like this (still using spaCy):

(c/o 
Oxford 
University
)

Is it possible to get this result with spaCy?

r1wp621o1#

Let's change nlp.tokenizer to a custom Tokenizer that uses a token_match regex:

import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en_core_web_sm')
text = "This is it's"
print("Before:", [tok for tok in nlp(text)])

nlp.tokenizer = Tokenizer(nlp.vocab, token_match=re.compile(r'\S+').match)
print("After :", [tok for tok in nlp(text)])
Before: [This, is, it, 's]
After : [This, is, it's]

You can tune the Tokenizer further by adding custom suffix, prefix and infix rules.
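For example, here is a minimal sketch (the rule sets below are made up for illustration; they are not spaCy's defaults) that keeps "c/o" intact but still splits off the surrounding parentheses:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex

nlp = spacy.load('en_core_web_sm')

# Hypothetical rule sets: treat only "(" as a prefix and ")" as a suffix,
# and define no infix rules at all, so "c/o" is never split internally.
prefix_re = compile_prefix_regex([r"\("])
suffix_re = compile_suffix_regex([r"\)"])

nlp.tokenizer = Tokenizer(
    nlp.vocab,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
)

print([t.text for t in nlp("(c/o Oxford University )")])
# expected: ['(', 'c/o', 'Oxford', 'University', ')']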
Another, more fine-grained approach is to find out why the it's token gets split in the first place, using nlp.tokenizer.explain():

import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en_core_web_sm')
text = "This is it's. I'm fine"
nlp.tokenizer.explain(text)

You will see that the split is caused by the SPECIAL rules:

[('TOKEN', 'This'),
 ('TOKEN', 'is'),
 ('SPECIAL-1', 'it'),
 ('SPECIAL-2', "'s"),
 ('SUFFIX', '.'),
 ('SPECIAL-1', 'I'),
 ('SPECIAL-2', "'m"),
 ('TOKEN', 'fine')]

These exceptions can be updated to remove "it's", for example:

exceptions = nlp.Defaults.tokenizer_exceptions
filtered_exceptions = {k:v for k,v in exceptions.items() if k!="it's"}
nlp.tokenizer = Tokenizer(nlp.vocab, rules = filtered_exceptions)
[tok for tok in nlp(text)]
[This, is, it's., I, 'm, fine]

Or to remove splitting on apostrophes entirely:

filtered_exceptions = {k:v for k,v in exceptions.items() if "'" not in k}
nlp.tokenizer = Tokenizer(nlp.vocab, rules = filtered_exceptions)
[tok for tok in nlp(text)]
[This, is, it's., I'm, fine]

Note the dot that remains attached to the token; this is because no suffix rules were specified.
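If you want that trailing dot split off while keeping the apostrophe tokens together, a sketch could look like this (it reuses nlp, text and filtered_exceptions from the snippet above, with a made-up minimal suffix set rather than spaCy's defaults):

from spacy.util import compile_suffix_regex

# Hypothetical suffix rules: only split off sentence-final punctuation.
suffix_re = compile_suffix_regex([r"\.", r"!", r"\?", r","])
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=filtered_exceptions,
    suffix_search=suffix_re.search,
)
print([tok for tok in nlp(text)])
# expected: [This, is, it's, ., I'm, fine]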

qv7cva1a2#

You can find a solution for this in the spaCy documentation: https://spacy.io/usage/linguistic-features#custom-tokenizer-example. Simply put, you create a function that takes a string text and returns a Doc object, and then assign that callable to nlp.tokenizer:

import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # All tokens 'own' a subsequent space character in this tokenizer
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought. It wasn't a dream.")
print([t.text for t in doc])
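
Applied to the sentence from the question, this whitespace tokenizer should give exactly the desired split (a quick check, reusing the nlp object configured above):

doc = nlp("(c/o Oxford University )")
print([t.text for t in doc])
# expected: ['(c/o', 'Oxford', 'University', ')']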

lf5gs5x23#

According to the documentation (https://spacy.io/usage/spacy-101#annotations-token), splitting on whitespace is the tokenizer's base behaviour, so this simple solution should work:

import spacy    
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("en")
tokenizer = Tokenizer(nlp.vocab)

There is one small caveat: you did not specify how multiple consecutive spaces should be handled. spaCy keeps the extra whitespace as separate tokens so that the exact original text can be recovered from the tokens. "hello  world" (with two spaces) will be tokenized as "hello", " ", "world". (With a single space it is, of course, just "hello", "world".)
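A short sketch illustrating both points, assuming a blank English pipeline:

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("en")
nlp.tokenizer = Tokenizer(nlp.vocab)  # whitespace-only tokenizer, no extra rules

print([t.text for t in nlp("(c/o Oxford University )")])
# expected: ['(c/o', 'Oxford', 'University', ')']

print([t.text for t in nlp("hello  world")])  # note the two spaces
# expected: ['hello', ' ', 'world']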
