The documentation for the pattern parameter of pre_tokenizers.Split is incorrect

Posted by 6qftjkof 6 months ago in Other

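The documentation for pre_tokenizers.Split says:
pattern (str or Regex): A pattern used to split the string. Usually a string or a regex built with tokenizers.Regex.
However, this is incorrect. In tokenizers 0.19.1, a plain str does not work as a regex pattern. The following example illustrates this: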

from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Split
from tokenizers import Regex

# similar issue to https://github.com/huggingface/tokenizers/pull/1264
# but the documentation is still incorrect as a string does not work

GPT2_SPLIT_PATTERN = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

tokenizer_bad = Tokenizer.from_pretrained("bert-base-uncased")
# the documentation says pattern can be a string or `tokenizers.Regex` object
# but the string doesn't work
# have similar issues for other `behavior` values
tokenizer_bad.pre_tokenizer = Split(pattern=GPT2_SPLIT_PATTERN, behavior="isolated")  

pretokenized_output_bad = tokenizer_bad.pre_tokenizer.pre_tokenize_str("This is a -- - 0 00 0 0 13##$2#6klwt gtek jhrthr testing script")

print("incorrect:")
print(pretokenized_output_bad)
print()

# incorrect:
# [('This is a -- - 0 00 0 0 13##$2#6klwt gtek jhrthr testing script', (0, 63))]

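# Regex seems to be undocumented
# (named gpt2_regex rather than `re` to avoid shadowing the standard-library module name)
gpt2_regex = Regex(GPT2_SPLIT_PATTERN)

tokenizer_good = Tokenizer.from_pretrained("bert-base-uncased")
tokenizer_good.pre_tokenizer = Split(pattern=gpt2_regex, behavior="isolated")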

pretokenized_output_good = tokenizer_good.pre_tokenizer.pre_tokenize_str("This is a -- - 0 00 0 0 13##$2#6klwt gtek jhrthr testing script")

print("correct:")
print(pretokenized_output_good)

# correct:
# [('This', (0, 4)), (' is', (4, 7)), (' a', (7, 9)), (' --', (9, 12)), (' -', (12, 14)), (' 0', (14, 16)), (' 00', (16, 19)), (' 0', (19, 21)), (' 0', (21, 23)), (' 13', (23, 26)), ('##$', (26, 29)), ('2', (29, 30)), ('#', (30, 31)), ('6', (31, 32)), ('klwt', (32, 36)), (' gtek', (36, 41)), (' jhrthr', (41, 48)), (' testing', (48, 56)), (' script', (56, 63))]

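Please update the documentation to state that the pattern must be a tokenizers.Regex object, and include an example like the one above showing how to do this. It would also be good to document tokenizers.Regex somewhere. An alternative is to update the code so that it works with str values.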
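
A documentation example along the lines requested could look like the sketch below. It skips `Tokenizer.from_pretrained` so it runs without downloading a model, and it uses a deliberately simple whitespace pattern instead of the GPT-2 one. The sample text and variable names are illustrative and not taken from the official docs. It assumes that a plain `str` pattern is matched literally, which is consistent with the "incorrect" output shown above.

```python
from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

text = "hello   world"

# Wrapped in Regex: r"\s+" is compiled as a regular expression,
# so the run of spaces is matched and removed from the output.
regex_split = Split(pattern=Regex(r"\s+"), behavior="removed")
regex_pieces = regex_split.pre_tokenize_str(text)

# Passed as a plain str: the same characters are (by assumption) searched for
# literally, and the text contains no backslash-s-plus sequence, so nothing splits.
literal_split = Split(pattern=r"\s+", behavior="removed")
literal_pieces = literal_split.pre_tokenize_str(text)

print(regex_pieces)
print(literal_pieces)
```

If the literal-matching assumption holds, a `str` pattern is still useful for fixed delimiters such as `" "` or `"-"`, while anything that relies on regex syntax needs the `Regex` wrapper.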

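Note that although PR 1264 improved the documentation, it is still wrong and confusing.
In addition, when updating the documentation, examples for behavior could be added to the Rust docs.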

hwamh0ep #1

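Hey! Regarding the behavior documentation, feel free to open a PR and I will review it!
For Split, I think we should make it work with a simple regex string (wouldn't that be better for users?). I can also open a PR for this fix, unless you want to tackle it? 🤗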
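
To make the choice in this reply concrete, here is a small standard-library sketch of the two ways a `str` pattern could be interpreted: literally, which matches the behaviour reported in the question, or as a regular expression, which is what the proposed fix would do. It uses Python's `re` module rather than the tokenizers internals, and the function name `split_isolated` and its `literal` flag are hypothetical.

```python
import re

def split_isolated(text, pattern, literal):
    """Split text so that every match becomes its own piece,
    roughly like behavior="isolated"."""
    # literal=True escapes the pattern so it only matches itself verbatim.
    regex = re.compile(re.escape(pattern) if literal else pattern)
    pieces, last = [], 0
    for m in regex.finditer(text):
        if m.start() > last:
            pieces.append(text[last:m.start()])  # text before the match
        if m.end() > m.start():
            pieces.append(m.group())             # the match itself
        last = m.end()
    if last < len(text):
        pieces.append(text[last:])               # trailing text
    return pieces

text = "a1b22c"
as_literal = split_isolated(text, r"\d+", literal=True)
as_regex = split_isolated(text, r"\d+", literal=False)
print(as_literal)
print(as_regex)
```

One design consideration with switching `str` to regex semantics is that existing callers who pass delimiters containing regex metacharacters (for example `"."` or `"+"`) would see different splits, so such a change would probably need to be called out in release notes.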
