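The documentation for pre_tokenizers.Split says: pattern (str or Regex): A pattern used to split the string. Usually a string or a regex built with tokenizers.Regex. However, this is not correct. In tokenizers 0.19.1, passing a str does not work. The following example illustrates this: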
from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Split
from tokenizers import Regex
# similar issue to https://github.com/huggingface/tokenizers/pull/1264
# but the documentation is still incorrect as a string does not work
GPT2_SPLIT_PATTERN = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
tokenizer_bad = Tokenizer.from_pretrained("bert-base-uncased")
# the documentation says pattern can be a string or `tokenizers.Regex` object
# but the string doesn't work
# have similar issues for other `behavior` values
tokenizer_bad.pre_tokenizer = Split(pattern=GPT2_SPLIT_PATTERN, behavior="isolated")
pretokenized_output_bad = tokenizer_bad.pre_tokenizer.pre_tokenize_str("This is a -- - 0 00 0 0 13##$2#6klwt gtek jhrthr testing script")
print("incorrect:")
print(pretokenized_output_bad)
print()
# incorrect:
# [('This is a -- - 0 00 0 0 13##$2#6klwt gtek jhrthr testing script', (0, 63))]
# Regex seems to be undocumented
re = Regex(GPT2_SPLIT_PATTERN)
tokenizer_good = Tokenizer.from_pretrained("bert-base-uncased")
tokenizer_good.pre_tokenizer = Split(pattern=re, behavior="isolated")
pretokenized_output_good = tokenizer_good.pre_tokenizer.pre_tokenize_str("This is a -- - 0 00 0 0 13##$2#6klwt gtek jhrthr testing script")
print("correct:")
print(pretokenized_output_good)
# correct:
# [('This', (0, 4)), (' is', (4, 7)), (' a', (7, 9)), (' --', (9, 12)), (' -', (12, 14)), (' 0', (14, 16)), (' 00', (16, 19)), (' 0', (19, 21)), (' 0', (21, 23)), (' 13', (23, 26)), ('##$', (26, 29)), ('2', (29, 30)), ('#', (30, 31)), ('6', (31, 32)), ('klwt', (32, 36)), (' gtek', (36, 41)), (' jhrthr', (41, 48)), (' testing', (48, 56)), (' script', (56, 63))]
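One possible explanation for the "incorrect" output is that Split searches for a str pattern as literal text rather than compiling it as a regular expression. That is an assumption on my part, not something the documentation states. The sketch below uses only the standard library to mimic such literal matching; literal_split_isolated is a hypothetical helper, not part of tokenizers. Since the pattern text never occurs verbatim in the input, the whole string comes back as a single piece, which matches the output shown above.

```python
import re

# Same pattern and input as in the report above.
GPT2_SPLIT_PATTERN = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
TEXT = "This is a -- - 0 00 0 0 13##$2#6klwt gtek jhrthr testing script"


def literal_split_isolated(text, pattern):
    """Hypothetical helper: split around *literal* occurrences of `pattern`.

    Each occurrence is kept as its own piece, loosely mimicking
    behavior="isolated". This is NOT the tokenizers implementation.
    """
    pieces, pos = [], 0
    # re.escape makes every character of the pattern match itself literally.
    for m in re.finditer(re.escape(pattern), text):
        if m.start() == m.end():
            continue  # ignore empty matches (only possible for an empty pattern)
        if m.start() > pos:
            pieces.append((text[pos:m.start()], (pos, m.start())))
        pieces.append((m.group(), m.span()))
        pos = m.end()
    if pos < len(text):
        pieces.append((text[pos:], (pos, len(text))))
    return pieces


# The regex source text does not appear verbatim in TEXT, so nothing is split.
whole = literal_split_isolated(TEXT, GPT2_SPLIT_PATTERN)
print(whole)

# A pattern that does occur literally is split as expected.
on_dashes = literal_split_isolated("a--b", "--")
print(on_dashes)
```

If that assumption holds, a str pattern is still usable for plain delimiters such as a space or a dash, and only regex syntax needs to be wrapped in Regex.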
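Please update the documentation to state that pattern must be a tokenizers.Regex object, and include an example like the one above showing how to use it. It would also be good to document tokenizers.Regex somewhere. An alternative is to update the code so that it works with str values.
Note that although PR 1264 improved the documentation, it is still incorrect and confusing.
Also, while updating the documentation, the examples for behavior that are in the Rust docs could be added as well.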
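To make the behavior values concrete, here is a pure-Python sketch of how I understand the five options, using the "the-final--countdown" example that, as far as I recall, appears in the Rust docs for SplitDelimiterBehavior. split_with_behavior is a hypothetical helper built on the standard re module. It is not the library's implementation and may differ from it in edge cases.

```python
import re

BEHAVIORS = ("removed", "isolated", "merged_with_previous",
             "merged_with_next", "contiguous")


def split_with_behavior(text, pattern, behavior):
    """Hypothetical helper: split `text` on regex `pattern` per `behavior`."""
    if behavior not in BEHAVIORS:
        raise ValueError(f"unknown behavior: {behavior!r}")

    # Break the text into ordered (piece, is_delimiter) pairs.
    parts, pos = [], 0
    for m in re.finditer(pattern, text):
        if m.start() == m.end():
            continue  # ignore empty matches
        if m.start() > pos:
            parts.append((text[pos:m.start()], False))
        parts.append((m.group(), True))
        pos = m.end()
    if pos < len(text):
        parts.append((text[pos:], False))

    if behavior == "removed":
        return [piece for piece, is_delim in parts if not is_delim]
    if behavior == "isolated":
        return [piece for piece, _ in parts]

    # Remaining behaviors: decide whether each piece is glued onto the
    # previously emitted piece.
    out, prev_is_delim = [], None
    for piece, is_delim in parts:
        if behavior == "contiguous":
            glue = is_delim and prev_is_delim is True
        elif behavior == "merged_with_previous":
            glue = is_delim and prev_is_delim is False
        else:  # merged_with_next
            glue = (not is_delim) and prev_is_delim is True
        if glue:
            out[-1] += piece
        else:
            out.append(piece)
        prev_is_delim = is_delim
    return out


results = {b: split_with_behavior("the-final--countdown", "-", b)
           for b in BEHAVIORS}
for name, pieces in results.items():
    print(f"{name}: {pieces}")
# removed: ['the', 'final', 'countdown']
# isolated: ['the', '-', 'final', '-', '-', 'countdown']
# merged_with_previous: ['the-', 'final-', '-', 'countdown']
# merged_with_next: ['the', '-final', '-', '-countdown']
# contiguous: ['the', '-', 'final', '--', 'countdown']
```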
1 answer
hwamh0ep #1
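Hey! Regarding the behavior documentation, feel free to open a PR and I will review it!
As for Split, I think we should make it work with a plain regex string (wouldn't that be better for users?). I can also open a PR for this fix, unless you want to tackle it? 🤗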