Inverting a Keras TextVectorization layer?

Asked by xjreopfe on 2023-01-26 in Other

The tf.keras.layers.TextVectorization layer maps text features to integer sequences. Since it can be added as a layer of a Keras model, the model can easily be deployed as a single artifact that takes strings as input and handles the preprocessing itself. But I also need to perform the reverse operation and cannot find any way to do it. I am using an LSTM model that predicts the next word from the preceding words. For example, the model should accept the string "I love" and output likely next words such as "cats", "dogs", etc. I can manually map strings to integers, and integers back to strings, with tf.keras.preprocessing.text.Tokenizer, like this:

import tensorflow as tf

text = "I love cats"
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000, oov_token='<oov>')
tokenizer.fit_on_texts([text])

seqs = tokenizer.texts_to_sequences([text])
prediction = model.predict(seqs)  # model: the trained LSTM; output is the predicted integer index
actual_prediction = tokenizer.sequences_to_texts(prediction)  # back to the desired string

How can I implement the equivalent of the TextVectorization layer at the model's output, so that instead of predicted indices I get the strings those indices represent in the TextVectorization layer's vocabulary?
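For reference, here is a minimal sketch of the forward direction that already works as a single model taking raw strings (the corpus, vocabulary size, and layer sizes below are only illustrative):

import tensorflow as tf

# Illustrative corpus and sizes only
corpus = ["I love cats", "I love dogs"]
vectorizer = tf.keras.layers.TextVectorization(max_tokens=10000, output_sequence_length=4)
vectorizer.adapt(corpus)

# The TextVectorization layer sits inside the model, so it accepts raw strings
inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
x = vectorizer(inputs)
x = tf.keras.layers.Embedding(input_dim=10000, output_dim=16)(x)
x = tf.keras.layers.LSTM(32)(x)
outputs = tf.keras.layers.Dense(10000, activation="softmax")(x)  # scores over next-word indices
model = tf.keras.Model(inputs, outputs)

print(model.predict(tf.constant([["I love"]])).shape)  # (1, 10000): indices, not strings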


uz75evzq1#

It is simple to do, but you need to keep the string-to-sequence mapping as a task separate from the model, and then relate the two.

[Sample 1]: As a sequence of characters

import tensorflow as tf

# (Carried over from the question; the tokenizer is not used below)
text = "I love cats"
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000, oov_token='<oov>')
tokenizer.fit_on_texts([text])

# Input: a character-level vocabulary and the sample text split into characters
vocab = [ "a", "b", "c", "d", "e", "f", "g", "h", "I", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "_" ]
data = tf.constant([["_", "_", "_", "I"], ["l", "o", "v", "e"], ["c", "a", "t", "s"]])

# Encode: map each character to its integer index
layer = tf.keras.layers.StringLookup(vocabulary=vocab)
sequences_mapping_string = layer(data)
sequences_mapping_string = tf.reshape(sequences_mapping_string, (1, 12))

# Decode: an inverted StringLookup maps the indices back to characters
decoder = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode="int", invert=True)
result = decoder(sequences_mapping_string)
print( "encode: " + str( sequences_mapping_string ) )
print( "decode: " + str( result ) )

# Multiplying each index by its reciprocal gives a vector of ones,
# illustrating that the forward and reverse mappings cancel out
mapping_vocab = [ "_", "I", "l", "o", "v", "e", "c", "a", "t", "s" ]
string_matching = [ 27, 9, 12, 15, 22, 5, 3, 1, 20, 19 ]
string_matching_reverse = [ 1/27, 1/9, 1/12, 1/15, 1/22, 1/5, 1/3, 1/1, 1/20, 1/19 ]

print( tf.math.multiply( tf.constant(string_matching, dtype=tf.float32), tf.constant(string_matching_reverse, dtype=tf.float32 ) ) )

[Output]:

# encode: tf.Tensor([[27 27 27  9 12 15 22  5  3  1 20 19]], shape=(1, 12), dtype=int64)
# decode: tf.Tensor([[b'_' b'_' b'_' b'I' b'l' b'o' b'v' b'e' b'c' b'a' b't' b's']], shape=(1, 12), dtype=string)
# text: I love cats
# seqs: [[2, 3, 4]]
# prediction: [[2.004947  0.        0.        1.4835927 3.3234084 3.586834  0.  0.6012034 0.       ]]
# tf.Tensor([1. 1. 1. 1. 1. 1. 1. 1. 1. 1.], shape=(10,), dtype=float32)
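To connect this back to the question: once the model has produced scores over the vocabulary, an inverted StringLookup decodes the argmax index into a string. A minimal sketch (the score tensor here is made up; a real model would produce it):

import tensorflow as tf

vocab = ["I", "love", "cats", "dogs"]
decoder = tf.keras.layers.StringLookup(vocabulary=vocab, invert=True)

# Made-up next-word scores; index 0 is the OOV slot added by StringLookup
fake_scores = tf.constant([[0.1, 0.2, 0.1, 3.0, 2.5]])  # shape (1, vocab_size + 1)
predicted_index = tf.argmax(fake_scores, axis=-1)        # -> [3]
print(decoder(predicted_index).numpy())                  # -> [b'cats']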

[Sample 2]: As word sequences, to meet the model's input requirements

# batched_features / batched_labels: integer word sequences and their
# next-word labels, prepared with the tokenizer as above
dataset = tf.data.Dataset.from_tensor_slices((batched_features, batched_labels))
dataset = dataset.batch(10)
batched_features = dataset
predictions = model.predict(input_array)  # input_array: the encoded query sequence(s)
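A self-contained version of the same idea, with placeholder arrays standing in for the tokenized corpus and a tiny stand-in model:

import numpy as np
import tensorflow as tf

# Placeholder data: 20 samples of 4-token sequences and their next-word labels
batched_features = np.random.randint(1, 100, size=(20, 4))
batched_labels = np.random.randint(1, 100, size=(20, 1))

dataset = tf.data.Dataset.from_tensor_slices((batched_features, batched_labels))
dataset = dataset.batch(10)

# Tiny stand-in model with a matching input shape
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=100, output_dim=8),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(100, activation="softmax"),
])

input_array = batched_features[:2]        # two encoded query sequences
predictions = model.predict(input_array)  # shape (2, 100): next-word scores
print(predictions.shape)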


lb3vh1jj2#

Just do it like this:

import numpy as np

# text_vectorizer: the TextVectorization layer used on the model's input
# prediction_sequence: the predicted token indices (e.g. argmax of the model output)
vocabulary = text_vectorizer.get_vocabulary()
vocab_arr = np.asarray(vocabulary)
" ".join(vocab_arr[prediction_sequence])
