The documentation's description of the pattern parameter of pre_tokenizers.Split is incorrect

Posted by 6qftjkof 6 months ago in Other


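The documentation for pre_tokenizers.Split states:
pattern (str or Regex): A pattern used to split the string. Usually a string or a regex built with tokenizers.Regex.
However, this is incorrect. In tokenizers 0.19.1, passing a str does not work. The following example demonstrates this: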

from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Split
from tokenizers import Regex

# similar issue to https://github.com/huggingface/tokenizers/pull/1264
# but the documentation is still incorrect as a string does not work

GPT2_SPLIT_PATTERN = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

tokenizer_bad = Tokenizer.from_pretrained("bert-base-uncased")
# the documentation says pattern can be a string or `tokenizers.Regex` object
# but the string doesn't work
# have similar issues for other `behavior` values
tokenizer_bad.pre_tokenizer = Split(pattern=GPT2_SPLIT_PATTERN, behavior="isolated")  

pretokenized_output_bad = tokenizer_bad.pre_tokenizer.pre_tokenize_str("This is a -- - 0 00 0 0 13##$2#6klwt gtek jhrthr testing script")

print("incorrect:")
print(pretokenized_output_bad)
print()

# incorrect:
# [('This is a -- - 0 00 0 0 13##$2#6klwt gtek jhrthr testing script', (0, 63))]

# Regex seems to be undocumented
# (named `regex` rather than `re` to avoid shadowing the stdlib module name)
regex = Regex(GPT2_SPLIT_PATTERN)

tokenizer_good = Tokenizer.from_pretrained("bert-base-uncased")
tokenizer_good.pre_tokenizer = Split(pattern=regex, behavior="isolated")

pretokenized_output_good = tokenizer_good.pre_tokenizer.pre_tokenize_str("This is a -- - 0 00 0 0 13##$2#6klwt gtek jhrthr testing script")

print("correct:")
print(pretokenized_output_good)

# correct:
# [('This', (0, 4)), (' is', (4, 7)), (' a', (7, 9)), (' --', (9, 12)), (' -', (12, 14)), (' 0', (14, 16)), (' 00', (16, 19)), (' 0', (19, 21)), (' 0', (21, 23)), (' 13', (23, 26)), ('##$', (26, 29)), ('2', (29, 30)), ('#', (30, 31)), ('6', (31, 32)), ('klwt', (32, 36)), (' gtek', (36, 41)), (' jhrthr', (41, 48)), (' testing', (48, 56)), (' script', (56, 63))]

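Please update the documentation to state that pattern must be a tokenizers.Regex object, and provide an example like the one above showing how to do this. It would also be good to document tokenizers.Regex somewhere. An alternative is to change the code so that it works with str values.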
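A minimal sketch of the difference, assuming (as the "incorrect" output suggests) that a plain str pattern is matched literally rather than compiled as a regular expression. The sample text and variable names are illustrative only, and Split is applied directly without loading a tokenizer:

```python
from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

text = "a1b22c"

# Plain str: assumed to be treated as the literal characters `\d+`,
# which do not occur in the text, so nothing is split.
literal_split = Split(pattern=r"\d+", behavior="isolated")

# Same pattern wrapped in Regex: interpreted as a regular expression.
regex_split = Split(pattern=Regex(r"\d+"), behavior="isolated")

# pre_tokenize_str returns (piece, (start, end)) tuples; keep only the pieces.
literal_pieces = [piece for piece, _ in literal_split.pre_tokenize_str(text)]
regex_pieces = [piece for piece, _ in regex_split.pre_tokenize_str(text)]

print("str pattern:  ", literal_pieces)
print("Regex pattern:", regex_pieces)
```

If literal matching is the intended meaning of a str pattern, then wrapping any regular expression in Regex is the workaround, and the documentation should say so explicitly.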

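Note that although PR 1264 improved the documentation, it is still wrong and confusing.
In addition, when updating the documentation, the examples for behavior in the Rust docs could be included.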
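As a rough sketch of what such behavior examples could look like on the Python side, the snippet below runs Split with each behavior value on the sample input used in the Rust documentation. The five string values are the ones assumed to be accepted by the Python binding. Results are printed rather than stated here, so check them against your installed version:

```python
from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

text = "the-final--countdown"

# Assumed set of accepted `behavior` strings in the Python binding.
behaviors = [
    "removed",
    "isolated",
    "merged_with_previous",
    "merged_with_next",
    "contiguous",
]

results = {}
for behavior in behaviors:
    splitter = Split(pattern=Regex("-"), behavior=behavior)
    # Keep only the text of each piece and drop the offsets.
    results[behavior] = [piece for piece, _ in splitter.pre_tokenize_str(text)]
    print(f"{behavior:>20}: {results[behavior]}")
```

Every behavior except "removed" keeps the delimiter somewhere in the output, so joining the pieces should reproduce the original text.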


Source: https://github.com/huggingface/tokenizers/issues/1565