tensorflow: What is the UNK token in vector representations of words?

mwg9r5ms posted on 2023-04-21 in Other


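import collections

# Step 2: Build the dictionary and replace rare words with UNK token.
# `vocabulary` (the list of input words) is assumed to be defined in an
# earlier step that is not shown here.
vocabulary_size = 50000

def build_dataset(words, n_words):
  """Process raw inputs into a dataset."""
  # Index 0 is reserved for UNK; -1 is a placeholder count, filled in below.
  count = [['UNK', -1]]
  # Keep only the (n_words - 1) most frequent words.
  count.extend(collections.Counter(words).most_common(n_words - 1))
  # Map each kept word to an integer index, ordered by frequency.
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  # Encode the corpus; any word outside the dictionary becomes index 0 (UNK).
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count += 1
    data.append(index)
  # Replace the placeholder with the real number of unknown words.
  count[0][1] = unk_count
  reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reversed_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                            vocabulary_size)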

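I am studying the basic example of vector representations of words using Tensorflow.
Step 2 is titled "Build the dictionary and replace rare words with UNK token", but nothing earlier in the example defines what "UNK" refers to.
To be specific:
0) What does UNK generally refer to in NLP?
1) What does count = [['UNK', -1]] mean? I know that brackets [] denote a list in Python, but why is it paired with -1?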

tensorflow

Source: https://stackoverflow.com/questions/45735357/what-is-unk-token-in-vector-representation-of-words

1 answer


llew8vvj1#

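As already mentioned in the comments, in tokenization and NLP, when you see the UNK token, it most likely stands for an unknown word.
For example, suppose you want to predict a missing word in a sentence. How would you feed your data to the model? You definitely need a token that marks where the missing word is. So if "house" is the missing word, after tokenization the sentence will look like this:
'my house is big' -> ['my', 'UNK', 'is', 'big']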
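
The replacement above can be sketched in a few lines of Python. This is only an illustration, not code from the tutorial: the helper name `replace_unknown` and the tiny vocabulary are hypothetical, chosen so that "house" is the one word the vocabulary does not contain.

```python
def replace_unknown(tokens, vocab, unk_token="UNK"):
    """Return tokens with every out-of-vocabulary word replaced by unk_token."""
    return [tok if tok in vocab else unk_token for tok in tokens]

# Hypothetical vocabulary that does not contain "house".
vocab = {"my", "is", "big"}

tokens = "my house is big".split()
result = replace_unknown(tokens, vocab)
print(result)  # ['my', 'UNK', 'is', 'big']
```

In the question's code the same idea is applied to indices instead of strings: any word missing from `dictionary` is encoded as index 0, the slot reserved for 'UNK'.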
PS: count = [['UNK', -1]] initializes count, which, as Ivan Aksamentov already said, has the form [['word', number_of_occurrences]].
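
To make the role of the -1 concrete, here is a small sketch that follows the same steps as the question's `build_dataset` on a toy corpus. The corpus and the value of `n_words` are made up for illustration. The -1 appears to act only as a placeholder: it is overwritten once the unknown words have been counted, which also explains why the first entry is a list (mutable) while the entries returned by `most_common` are tuples.

```python
import collections

# Toy corpus (hypothetical); "the" occurs 3 times, "cat" twice, the rest once.
words = "the cat sat on the mat the cat".split()
n_words = 3  # assumed vocabulary size: UNK plus the 2 most common words

# First entry is a mutable [word, count] list with a placeholder count.
count = [["UNK", -1]]
count.extend(collections.Counter(words).most_common(n_words - 1))

# Assign indices in order, so 'UNK' gets index 0.
dictionary = {word: index for index, (word, _) in enumerate(count)}

# Count the words that fell outside the vocabulary and fill in the placeholder.
unk_count = sum(1 for word in words if word not in dictionary)
count[0][1] = unk_count

print(count)       # [['UNK', 3], ('the', 3), ('cat', 2)]
print(dictionary)  # {'UNK': 0, 'the': 1, 'cat': 2}
```

Here "sat", "on" and "mat" are outside the vocabulary, so the placeholder -1 becomes 3.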

2023-04-21
