PyTorch: mapping tokens back to words in the Hugging Face tokenizer decode step?

Asked by qlfbtfca on 2023-11-19

Is there a way to know the mapping from the tokens back to the original words in the tokenizer.decode() function?
For example:

from transformers.tokenization_roberta import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)

str = "This is a tokenization example"
tokenized = tokenizer.tokenize(str) 
## ['this', 'Ġis', 'Ġa', 'Ġtoken', 'ization', 'Ġexample']

encoded = tokenizer.encode_plus(str) 
## encoded['input_ids']=[0, 42, 16, 10, 19233, 1938, 1246, 2]

decoded = tokenizer.decode(encoded['input_ids']) 
## '<s> this is a tokenization example</s>'

The goal is to have a function that maps each token in the decode step to the correct input word; here it would be:
desired_output = [[1],[2],[3],[4,5],[6]]
since this corresponds to id 42, while tokenization corresponds to ids [19233, 1938], which sit at indices 4 and 5 of the input_ids array.
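For reference, those positions can be checked directly against the encoding from the snippet above:

# indices into encoded['input_ids'] from the example above
print(encoded['input_ids'][1])    # 42            -> 'this'
print(encoded['input_ids'][4:6])  # [19233, 1938] -> 'token' + 'ization'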

Answer 1 (bakd9h0s)

For transformers version >= 2.9.0:

Fast tokenizers return a BatchEncoding object that you can use:

from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-large')

example = "This is a tokenization example"

enc = tokenizer(example, add_special_tokens=False)

desired_output = []

# BatchEncoding.word_ids() returns, for each token, the index of the word it came from
for w_idx in sorted(set(enc.word_ids())):
    # BatchEncoding.word_to_tokens tells us which (and how many) tokens make up that word
    start, end = enc.word_to_tokens(w_idx)
    # +1 because the desired indices start at 1, not 0 (position 0 is <s> in the full encoding)
    start += 1
    end += 1
    desired_output.append(list(range(start, end)))

print(desired_output)

Output:

[[1], [2], [3], [4, 5], [6]]
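The snippet above deliberately passes add_special_tokens=False. If you keep the special tokens (the default when calling the tokenizer directly), word_ids() contains None for <s> and </s>, and the token positions already start at 1, so the +1 shift is no longer needed. A sketch of that variant, under the same assumptions:

from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-large')

enc = tokenizer("This is a tokenization example")  # special tokens included

desired_output = []
for w_idx in sorted(set(i for i in enc.word_ids() if i is not None)):
    start, end = enc.word_to_tokens(w_idx)  # positions in the full sequence; <s> is position 0
    desired_output.append(list(range(start, end)))

print(desired_output)  # [[1], [2], [3], [4, 5], [6]]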

For transformers version < 2.9.0:

As far as I know there is no built-in method for this, but you can create one yourself:

from transformers.tokenization_roberta import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)

example = "This is a tokenization example"

print({x : tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()})


Output:

{'This': [42], 'is': [16], 'a': [10], 'tokenization': [19233, 1938], 'example': [1246]}


To get your desired output, encode each word separately and keep a running token index:

# start index is 1 because the number of leading special tokens (<s>) is fixed for each model
# (but be aware of the difference between single-sentence and sentence-pair input)
idx = 1

enc = [tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()]

desired_output = []

for word_piece_ids in enc:
    tokenoutput = []
    for _ in word_piece_ids:
        tokenoutput.append(idx)
        idx += 1
    desired_output.append(tokenoutput)

print(desired_output)


Output:

[[1], [2], [3], [4, 5], [6]]
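Note that add_prefix_space=True matters here: RoBERTa's byte-level BPE treats a leading space as part of the token (that is what the Ġ in the question's tokenized output marks), so encoding a word in isolation without it usually yields different ids. A quick check, reusing the tokenizer from the snippet above:

# Encoding with and without the assumed leading space generally gives different ids
print(tokenizer.encode("example", add_special_tokens=False, add_prefix_space=True))
print(tokenizer.encode("example", add_special_tokens=False))  # default: no prefix space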

Answer 2 (ctehm74n)

If you use the fast tokenizers, i.e. the Rust-backed versions from the tokenizers library, the encoding contains a word_ids method that can be used to map sub-words back to their original words. What counts as a word vs. a sub-word depends on the tokenizer: words are produced by the pre-tokenization stage (e.g. splitting on whitespace), while sub-words are produced by the actual model (BPE or Unigram, for example).
The code below should work in general, even if the pre-tokenization performs additional splits. For example, I created my own custom pre-tokenization step that splits on PascalCase; the words here are Pascal and Case. The accepted answer does not work in that case, since it assumes words are whitespace-separated.
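As a quick illustration, this is what word_ids() returns for the running example (a sketch, assuming the same roberta-large tokenizer used below; None marks the special tokens <s> and </s>):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large')
print(tokenizer("This is a tokenization example").word_ids())
# expected: [None, 0, 1, 2, 3, 3, 4, None] -- 'tokenization' is split into two sub-tokens

With that, the mapping can be built as follows: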

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large', do_lower_case=True)

example = "This is a tokenization example"

encoded = tokenizer(example)

desired_output = []
for word_id in encoded.word_ids():
    if word_id is not None:                  # skip special tokens (<s>, </s>)
        start, end = encoded.word_to_tokens(word_id)
        if start == end - 1:                 # word consists of a single token
            tokens = [start]
        else:                                # multi-token word: keep first and last token index
            tokens = [start, end - 1]
        if len(desired_output) == 0 or desired_output[-1] != tokens:
            desired_output.append(tokens)    # avoid appending the same word twice
print(desired_output)
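With the running example this should print [[1], [2], [3], [4, 5], [6]] as well. Note that for a word split into more than two sub-tokens the inner list holds only the first and last token index (a span), not every position in between.

If you instead need to map tokens back to character positions in the original string, which also sidesteps any assumptions about pre-tokenization, fast tokenizers can return offset mappings as well. A minimal sketch of that variant:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large')
example = "This is a tokenization example"

enc = tokenizer(example, return_offsets_mapping=True)
# offset_mapping holds a (char_start, char_end) pair per token; (0, 0) marks special tokens
for token_id, (start, end) in zip(enc['input_ids'], enc['offset_mapping']):
    print(token_id, repr(example[start:end]))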

