tokenizers: PreTrainedTokenizerFast char_to_token and token_to_chars not working as expected

Posted by rkue9o1l 6 months ago in Other

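System Info

transformers version: 4.44.0
Platform: macOS-13.6.9-arm64-arm-64bit
Python version: 3.11.4
Huggingface_hub version: 0.23.4
Safetensors version: 0.4.3
Accelerate version: 0.32.1
Accelerate config: not found
PyTorch version (GPU?): 2.4.0 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?: No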

Who can help?

@ArthurZucker

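Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, etc.)
My own task or dataset (give details below)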

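Reproduction

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
text = "the quick brown fox jumps over the lazy dog"
out = tokenizer(text)
out.char_to_token(0)  # token index covering character 0

This returns None for any non-zero character index.
In addition, token_to_chars does not return the expected result:
out.token_to_chars(4) returns
CharSpan(start=15, end=15)
instead of CharSpan(start=15, end=19)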

Expected behavior

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
text = "the quick brown fox jumps over the lazy dog"
out = tokenizer(text)
out.char_to_token(0)

should return 1
out.token_to_chars(4)
should return CharSpan(start=15, end=19)
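
To make the expected mapping concrete, here is a minimal pure-Python sketch that does not call the tokenizer at all. The offsets list is hand-written under an assumption: one leading special token with an empty span, then one token per word with the preceding space attached. The helper name lookup_char_to_token is hypothetical and only illustrates the lookup semantics; it is not the library's implementation. Only the (15, 15) span for token 4 comes from the report above.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def lookup_char_to_token(offsets: List[Span], char_index: int) -> Optional[int]:
    """Return the index of the first token whose span contains char_index."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None  # no token covers this character


text = "the quick brown fox jumps over the lazy dog"

# Assumed offsets (hand-written, not tokenizer output): index 0 is a special
# token with an empty span, followed by one token per word.
expected_offsets: List[Span] = [
    (0, 0), (0, 3), (3, 9), (9, 15), (15, 19),
    (19, 25), (25, 30), (30, 34), (34, 39), (39, 43),
]

# Expected behavior: character 0 belongs to token 1, token 4 covers " fox".
first_token = lookup_char_to_token(expected_offsets, 0)
fox_span = expected_offsets[4]
fox_text = text[fox_span[0]:fox_span[1]]

# Reported behavior for token 4: a zero-width span (15, 15). A zero-width
# span contains no character, so a containment lookup for 15 finds nothing.
reported_offsets = list(expected_offsets)
reported_offsets[4] = (15, 15)
lookup_in_reported = lookup_char_to_token(reported_offsets, 15)

print(first_token, fox_span, repr(fox_text), lookup_in_reported)
```

Under these assumptions, a span whose start equals its end cannot be matched by any character index, which is consistent with both symptoms in the report: a collapsed CharSpan from token_to_chars and None from char_to_token.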

Source: https://github.com/huggingface/tokenizers/issues/1620