I have a column named description
in my dataset that contains a description of each food. The dataset is incomplete, however, so I want to generate some synthetic text data. I'm looking for a way to process the data in batches rather than in a single block (which kills the kernel).
This is what my dataset looks like:
   branded_food_category                 description               napcs
0  Ice Cream & Frozen Yogurt             mochi ice cream bonbons   3
1  Ketchup, Mustard, BBQ & Cheese Sauce  chipotle barbecue sauce   0
2  Ketchup, Mustard, BBQ & Cheese Sauce  hot spicy barbecue sauce  0
3  Ketchup, Mustard, BBQ & Cheese Sauce  barbecue sauce            0
4  Ketchup, Mustard, BBQ & Cheese Sauce  barbecue sauce            0
When I take just the set of my description column, I get:
print(len(set(description)))
> 152398
Here I split the data into sequences of 51 words:
length_sentence = 50 + 1
lines = []
for i in range(length_sentence, len(description)):
    seq = description[i-length_sentence:i]
    lines.append(seq)
    if i > 200000:  # limit our dataset to 200000 words
        break
print(len(lines))
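The loop above can also be written as a generator, so the windows are produced one at a time instead of all being held in a list. A minimal sketch (the chunked_sequences name and the toy token list are mine, for illustration only):

```python
def chunked_sequences(tokens, seq_len=51, limit=200000):
    # Yield sliding windows of seq_len tokens one at a time,
    # instead of materialising every window in a list.
    for i in range(seq_len, min(len(tokens), limit + 1)):
        yield tokens[i - seq_len:i]

# Toy example: 60 fake tokens, window length 5.
tokens = [f"word{i}" for i in range(60)]
windows = list(chunked_sequences(tokens, seq_len=5))
print(len(windows))  # 55 windows of 5 tokens each
```

A downstream consumer (e.g. a tokenizer fitted incrementally) can then pull windows on demand without the whole list ever existing in memory.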
Now I prepare the data for the LSTM model:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
# texts_to_sequences() transforms each text in texts to a sequence of integers.
sequences = tokenizer.texts_to_sequences(lines)
sequences = np.array(sequences)
# Split each line so that the first 50 words go into x and the last word into y.
x, y = sequences[:, :-1], sequences[:, -1]
x[0]
> array([128280, 128278, 128276, 43, 43, 1483, 7803, 1968,
247, 128273, 128271, 128269, 1967, 345, 420, 51,
23, 3690, 128265, 1175, 128263, 128262, 128261, 128259,
16737, 16736, 128257, 128255, 128254, 558, 195, 454,
3689, 128250, 128248, 1964, 128247, 128245, 890, 128244,
673, 890, 673, 128241, 7801, 64, 1, 557,
557, 128239])
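The x/y slicing can be checked on a toy array (the numbers and shapes here are illustrative only, not from my data):

```python
import numpy as np

# Toy sequences of length 6: the first 5 tokens go to x, the last one to y.
sequences = np.array([[1, 2, 3, 4, 5, 6],
                      [7, 8, 9, 10, 11, 12]])
x, y = sequences[:, :-1], sequences[:, -1]
print(x.shape, y.shape)  # (2, 5) (2,)
```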
This is what throws the error: when I try to convert the data to categorical and assign it to y.
from tensorflow.keras.utils import to_categorical

# tokenizer.word_index maps each unique word to its integer index;
# its length (plus 1 for the reserved index 0) gives vocab_size.
vocab_size = len(tokenizer.word_index) + 1
y = to_categorical(y, num_classes=vocab_size)
seq_length = x.shape[1]
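One way around the memory blow-up is to one-hot encode y one batch at a time instead of allocating the full (len(y), vocab_size) matrix at once. A minimal NumPy sketch (the batch_one_hot helper is hypothetical, not a Keras API); with Keras you could alternatively keep y as plain integers and train with the sparse_categorical_crossentropy loss, which needs no one-hot encoding at all:

```python
import numpy as np

def batch_one_hot(y, vocab_size, batch_size=128):
    # Yield one-hot label batches of shape (batch, vocab_size)
    # so the full (len(y), vocab_size) array is never allocated.
    for start in range(0, len(y), batch_size):
        batch = np.asarray(y[start:start + batch_size])
        out = np.zeros((len(batch), vocab_size), dtype=np.float32)
        out[np.arange(len(batch)), batch] = 1.0
        yield out

# Toy labels over a 10-word vocabulary, encoded 4 at a time.
y_toy = [3, 1, 7, 0, 9, 2]
batches = list(batch_one_hot(y_toy, vocab_size=10, batch_size=4))
print([b.shape for b in batches])  # [(4, 10), (2, 10)]
```

Such a generator can feed model.fit via a keras.utils.Sequence or a generator wrapper, so only one batch of one-hot labels exists in memory at any time.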