How to batch-process millions of rows of text data

rqqzpn5f · asked on 2021-09-29 · tagged Java

I have a column called description in my dataset, containing a description of each food item. However, the dataset is incomplete, so I want to generate some synthetic text data. I'm looking for a way to process this data in batches rather than as one single block (a single block kills the kernel).
This is what my dataset looks like:

       branded_food_category                 description               napcs
    0  Ice Cream & Frozen Yogurt             mochi ice cream bonbons   3
    1  Ketchup, Mustard, BBQ & Cheese Sauce  chipotle barbecue sauce   0
    2  Ketchup, Mustard, BBQ & Cheese Sauce  hot spicy barbecue sauce  0
    3  Ketchup, Mustard, BBQ & Cheese Sauce  barbecue sauce            0
    4  Ketchup, Mustard, BBQ & Cheese Sauce  barbecue sauce            0

When I take just the set of my description column, this is what I get:

    print(len(set(description)))
    > 152398

Here I split the data into chunks of 51 words:

    length_sentence = 50 + 1
    lines = []
    for i in range(length_sentence, len(description)):
        seq = description[i-length_sentence:i]
        line = seq
        lines.append(line)
        if i > 200000:  # limit our dataset to 200000 words
            break
    print(len(lines))
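
If building the full lines list is itself part of the memory pressure, the windows can also be produced lazily with a generator instead of all at once. A minimal sketch (the name window_generator and the assumption that description is an indexable sequence of words are mine, not from the original post):

    def window_generator(tokens, window=51, limit=200000):
        # Yield overlapping windows of `window` words one at a time,
        # instead of materializing every window in a list.
        for i in range(window, min(len(tokens), limit + 1)):
            yield tokens[i - window:i]

    # Example: consume the windows in batches of 1000.
    batch = []
    for seq in window_generator(description):
        batch.append(seq)
        if len(batch) == 1000:
            # tokenize/process this batch, then discard it
            batch = []

Downstream steps then only ever hold one batch of windows in memory at a time.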

Now I'm preparing the data for the LSTM model:

    import numpy as np
    from tensorflow.keras.preprocessing.text import Tokenizer

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    # texts_to_sequences() transforms each text in lines into a sequence of integers.
    sequences = tokenizer.texts_to_sequences(lines)
    sequences = np.array(sequences)
    # Split each line so that the first 50 words are in x and the last word is in y.
    x, y = sequences[:, :-1], sequences[:, -1]
    x[0]
    > array([128280, 128278, 128276,     43,     43,   1483,   7803,   1968,
                247, 128273, 128271, 128269,   1967,    345,    420,     51,
                 23,   3690, 128265,   1175, 128263, 128262, 128261, 128259,
              16737,  16736, 128257, 128255, 128254,    558,    195,    454,
               3689, 128250, 128248,   1964, 128247, 128245,    890, 128244,
                673,    890,    673, 128241,   7801,     64,      1,    557,
                557, 128239])
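
One thing worth checking at this step: texts_to_sequences can return lists of different lengths (the tokenizer drops punctuation and unseen words), and calling np.array on ragged lists yields an object array on which the [:, :-1] slice fails. A small sketch, assuming the TensorFlow Keras preprocessing API, that forces a rectangular array first:

    from tensorflow.keras.preprocessing.sequence import pad_sequences

    sequences = tokenizer.texts_to_sequences(lines)
    # Pad/truncate every sequence to exactly 51 integers so the
    # resulting array has a fixed shape of (num_lines, 51).
    sequences = pad_sequences(sequences, maxlen=51)
    x, y = sequences[:, :-1], sequences[:, -1]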

This is what throws the error: when I try to convert the data to categorical and assign it to y.

    from tensorflow.keras.utils import to_categorical

    # tokenizer.word_index maps each unique word to its integer index,
    # so len(tokenizer.word_index) gives the vocabulary size.
    vocab_size = len(tokenizer.word_index) + 1
    y = to_categorical(y, num_classes=vocab_size)
    seq_length = x.shape[1]
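
This last step is the most likely culprit for the kernel dying: with vocab_size around 152,399 and roughly 200,000 windows, to_categorical tries to allocate a dense float32 matrix of about 200,000 × 152,399 × 4 bytes, on the order of 120 GB. A common workaround is to skip the one-hot conversion entirely, keep y as integer labels, and let sparse_categorical_crossentropy do the work. A sketch (the model layers and sizes here are illustrative, not from the original post):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense

    seq_length = x.shape[1]

    model = Sequential([
        Embedding(vocab_size, 50),
        LSTM(100),
        Dense(vocab_size, activation='softmax'),
    ])
    # sparse_categorical_crossentropy accepts integer targets directly,
    # so the giant one-hot matrix from to_categorical is never built.
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    model.fit(x, y, batch_size=128, epochs=1)  # y stays a 1-D integer array

If one-hot targets are genuinely required, the same memory saving can be had by converting only one batch at a time inside a generator passed to model.fit.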

