unilm LayoutLM: Why is pad_token_label_id used?

Posted by w8f9ii69 5 months ago in Other

I am using the LayoutLM model. The function convert_examples_to_features contains this code fragment:

    # Use the real label id for the first token of the word, and padding ids for the remaining tokens
    label_ids.extend(
        [label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1)
    )

Why not give the real label id to every token of a word after tokenization?
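
For readers without the repository at hand, here is a minimal self-contained sketch of the kind of per-word loop the quoted fragment sits in. It is not the actual unilm implementation: toy_tokenize, TOY_SPLITS, align_labels, and the label names are hypothetical stand-ins, and the value -100 for pad_token_label_id is an assumption (it matches the default ignore_index of PyTorch's CrossEntropyLoss, which would make the loss skip those positions).

```python
# Hypothetical toy stand-in for a WordPiece tokenizer (not the real LayoutLM tokenizer).
TOY_SPLITS = {"kenobi": ["ke", "##no", "##bi"]}


def toy_tokenize(word):
    """Split a word into sub-word tokens; unknown words stay whole."""
    return TOY_SPLITS.get(word, [word])


def align_labels(words, labels, label_map, pad_token_label_id=-100):
    """Give each word's first token the real label id and pad the rest.

    pad_token_label_id=-100 is an assumption, chosen to match the default
    ignore_index of PyTorch's CrossEntropyLoss.
    """
    tokens, label_ids = [], []
    for word, label in zip(words, labels):
        word_tokens = toy_tokenize(word)
        tokens.extend(word_tokens)
        # Same expression as the fragment quoted in the question.
        label_ids.extend(
            [label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1)
        )
    return tokens, label_ids


# Illustrative labels only; they are not taken from any dataset.
words = ["hello", "there", "general", "kenobi", "hello"]
labels = ["O", "O", "B-PER", "I-PER", "O"]
label_map = {"O": 0, "B-PER": 1, "I-PER": 2}

tokens, label_ids = align_labels(words, labels, label_map)
print(tokens)     # ['hello', 'there', 'general', 'ke', '##no', '##bi', 'hello']
print(label_ids)  # [0, 0, 1, 2, -100, -100, 0]
```

The two lists always have the same length, so every token position has either a real label id or the pad id.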

unilm

Source: https://github.com/microsoft/unilm/issues/303

3 answers

u91tlkcl #1

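This is just one way of storing the labels. Suppose the sentence "hello there general kenobi hello" becomes "hello there general ke ##no ##bi hello" after tokenization, and the label is "position in the sentence". The original labels are 0 1 2 3 4, and the tokenized labels become 0 1 2 3 pt pt 4 (pt = pad token). You could write 0 1 2 3 3 3 4 instead, but that adds no information: you can convert 0 1 2 3 pt pt 4 to 0 1 2 3 3 3 4 and back again without losing anything.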
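
The round trip described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not code from the repository: PAD = -100 is an assumed pad value, and fill_forward and first_subtoken_only are hypothetical helper names. Note that the sketch recovers the padded form by looking at the "##" continuation prefix of the tokens, since the repeated labels alone cannot show where one word ends when neighboring words share a label.

```python
PAD = -100  # assumed pad label id (hypothetical choice for this sketch)


def fill_forward(label_ids, pad=PAD):
    """Replace each pad id with the most recent real label id."""
    filled, last = [], None
    for label_id in label_ids:
        if label_id != pad:
            last = label_id
        filled.append(last)
    return filled


def first_subtoken_only(label_ids, tokens, pad=PAD):
    """Keep the label on a word's first token and pad its '##' continuations."""
    return [
        pad if token.startswith("##") else label_id
        for token, label_id in zip(tokens, label_ids)
    ]


tokens = ["hello", "there", "general", "ke", "##no", "##bi", "hello"]
padded = [0, 1, 2, 3, PAD, PAD, 4]

repeated = fill_forward(padded)
restored = first_subtoken_only(repeated, tokens)

print(repeated)            # [0, 1, 2, 3, 3, 3, 4]
print(restored == padded)  # True
```

One practical difference between the two forms, assuming the pad id equals the loss function's ignore index: with padding, each word contributes to the loss once, whereas with repeated labels a word split into three tokens would count three times.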