How to predict the probability of an empty string using BERT

Posted by u3r8eeie 5 months ago in Other

Suppose we have a template sentence like this:

"The ____ house is our meeting place."

and a list of adjectives that could fill the blank, for example:

"yellow"
"large"
""

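Note that one of them is an empty string.
The goal is to compare probabilities and pick the word most likely to describe "house" given the context of the sentence. If having no adjective at all is more likely, that should be taken into account as well.
We can predict the probability of each word filling the blank, but how do we predict the probability of an empty string filling it, i.e. the probability that no adjective describes "house"?
To predict the probability of a word: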

from transformers import BertTokenizer, BertForMaskedLM
import torch
from torch.nn import functional as F

# Load BERT tokenizer and pre-trained model
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model = BertForMaskedLM.from_pretrained('bert-large-uncased', return_dict=True)

targets = ["yellow", "large"]
sentence = "The [MASK] house is our meeting place."

# Run BERT on the sentence; the output holds logits over the entire vocabulary
inputs = tokenizer(sentence, return_tensors="pt")
mask_index = torch.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0]
with torch.no_grad():
    output = model(**inputs)

# Run softmax over the logits to get the probabilities
softmax = F.softmax(output.logits[0], dim=-1)

# Find the words' probabilities in this probability distribution
target_probabilities = {t: softmax[mask_index, tokenizer.vocab[t]].numpy()[0] for t in targets}
target_probabilities

This outputs a dictionary mapping each word to its probability:

{'yellow': 0.0061520976, 'large': 0.00071377633}
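
Both values are small because they are probabilities over BERT's entire vocabulary, not over the candidate list. If the only goal is to compare the candidates with each other, one option is to renormalize over just those candidates. This is a minimal sketch under that assumption; `renormalize` is a hypothetical helper, not part of transformers, and it says nothing yet about the empty-string case.

```python
# Probabilities reported above, taken over BERT's full vocabulary.
target_probabilities = {"yellow": 0.0061520976, "large": 0.00071377633}


def renormalize(probs):
    """Rescale probabilities so they sum to 1 over the given candidates only."""
    total = sum(probs.values())
    return {word: p / total for word, p in probs.items()}


relative = renormalize(target_probabilities)
for word, share in relative.items():
    print(f"{word}: {share:.3f}")
```

With the numbers above, "yellow" takes roughly 0.90 of the candidates' combined mass and "large" roughly 0.10.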

If I try to add the empty string to the list, I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-62-6f726220a108> in <module>
     18 
     19 # Find the words' probabilities in this probability distribution
---> 20 target_probabilities = {t: softmax[mask_index, tokenizer.vocab[t]].numpy()[0] for t in targets}
     21 target_probabilities

<ipython-input-62-6f726220a108> in <dictcomp>(.0)
     18 
     19 # Find the words' probabilities in this probability distribution
---> 20 target_probabilities = {t: softmax[mask_index, tokenizer.vocab[t]].numpy()[0] for t in targets}
     21 target_probabilities

KeyError: ''

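This happens because BERT's vocabulary contains no empty string, so we cannot look up the probability of something that does not exist in the model.
How should we get the probability of no word filling the blank? Is this possible with the model at all? Does it make sense to use the padding token [PAD] in place of the empty string? (I have only seen [PAD] used at the end of sentences, to make a batch of sentences the same length.)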
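
The failed lookup can be reproduced without loading the model. The sketch below uses a toy dictionary as a stand-in for `tokenizer.vocab`; the token ids are made up for illustration, and `split_targets` is a hypothetical helper. It only separates the targets that have a vocabulary entry from those that do not, which avoids the `KeyError` but does not by itself answer how to score the empty string.

```python
# Toy stand-in for tokenizer.vocab (token -> id); the ids are made up.
toy_vocab = {"[PAD]": 0, "[MASK]": 1, "yellow": 2, "large": 3}

targets = ["yellow", "large", ""]


def split_targets(targets, vocab):
    """Separate targets that can be looked up from those that cannot."""
    known = {t: vocab[t] for t in targets if t in vocab}
    missing = [t for t in targets if t not in vocab]
    return known, missing


known, missing = split_targets(targets, toy_vocab)
print(known)    # {'yellow': 2, 'large': 3}
print(missing)  # ['']
```

Note that "[PAD]" is an ordinary key in this toy vocabulary while "" is not, which mirrors the situation described above: [PAD] can be looked up, but whether its probability at the masked position means "no word here" is exactly the open question.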

Source: https://github.com/google-research/bert/issues/1286