Whenever I run the code below, I get the following error:
Unable to allocate 221. GiB for an array with shape (3324240, 17877) and data type float32.
The general logic of the code is as follows: I use pandas to load punctuation-free text from a CSV file. The "important" words are those that occur frequently enough in the text. The encoder and decoder inputs are for a Keras seq2seq model; both are numpy arrays built from the CSV text so the model can train on them.
from tensorflow.python.keras.models import Model
from tensorflow.keras.layers import Dense, Embedding, Input, LSTM
from keras.utils import to_categorical
from keras.utils import pad_sequences
import pandas as pd
import numpy as np
# Count how often each word appears in the punctuation-free text
word2count = {}
for _, line in df.iterrows():
    for word in line['no_punc'].split():
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
# Keep only words that occur more than `thresh` times and give each an id
important_word = {}
thresh = 5
important_word['<PAD>'] = 0
word_id = 1
for key, value in word2count.items():
    if value > thresh:
        important_word[key] = word_id
        word_id += 1
# Pair up question/answer lines and wrap the answers in start/end tokens
questions = {}
answers = {}
for q, ans in convo_line.items():
    questions[q] = df.loc[q, 'no_punc']
    answers[ans] = df.loc[ans, 'no_punc']
for key, value in answers.items():
    answers[key] = '<SOS> ' + value + ' <EOS>'
# Add the special tokens to the vocabulary
tokens = ['<EOS>', '<OUT>', '<SOS>']
x = len(important_word)
for t in tokens:
    important_word[t] = x
    x += 1
# Convert each question to a list of word ids (<OUT> for unknown words)
encoder_inp = []
for id, line in questions.items():
    lst = []
    for word in line.split():
        if word not in important_word:
            lst.append(important_word['<OUT>'])
        else:
            lst.append(important_word[word])
    encoder_inp.append(lst)
# Convert each answer to a list of word ids
decoder_inp = []
for id, line in answers.items():
    lst = []
    for word in line.split():
        if word not in important_word:
            lst.append(important_word['<OUT>'])
        else:
            lst.append(important_word[word])
    decoder_inp.append(lst)
encoder_inp = pad_sequences(encoder_inp, 15, padding='post', truncating='post')
decoder_inp = pad_sequences(decoder_inp, 15, padding='post', truncating='post')
# Shift the decoder targets left by one time step and re-pad
decoder_inp_final = []
for i in decoder_inp:
    decoder_inp_final.append(i[1:])
decoder_inp_final = pad_sequences(decoder_inp_final, 15, padding='post', truncating='post')
de = decoder_inp_final[:17876]
decoder_inp_final = to_categorical(decoder_inp_final)
Error message:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
Cell In[183], line 2
1 de=decoder_inp_final[:17876]
----> 2 decoder_inp_final=to_categorical(decoder_inp_final)
File d:\Desktop AI\env\Lib\site-packages\keras\utils\np_utils.py:73, in to_categorical(y, num_classes, dtype)
71 num_classes = np.max(y) + 1
72 n = y.shape[0]
---> 73 categorical = np.zeros((n, num_classes), dtype=dtype)
74 categorical[np.arange(n), y] = 1
75 output_shape = input_shape + (num_classes,)
MemoryError: Unable to allocate 221. GiB for an array with shape (3324240, 17877) and data type float32
1 Answer
You are loading a very large amount of data. Your decoder_inp_final ends up as roughly a 3M x 18K matrix, i.e. about 60 billion float32 values, which is far more than your machine's RAM can hold.
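A quick back-of-the-envelope check, using only the numbers from the error message, confirms the size (the 221,616 x 15 breakdown is my inference, not stated in the post):

rows = 3_324_240       # decoder tokens flattened by to_categorical (presumably 221,616 answers x 15 steps)
classes = 17_877       # vocabulary size inferred by to_categorical as np.max(y) + 1
print(rows * classes * 4 / 2**30)   # float32 = 4 bytes each -> ~221.4 GiB, matching the error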
In practice you are very unlikely to need all 3 million rows in memory at once; rewrite your code to work in random chunks: load a chunk, run an update, load the next chunk, and so on.
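As one way to do that, here is a minimal sketch (not from the original post) of a Keras Sequence that one-hot encodes only one batch of decoder targets at a time, so the full 221 GiB matrix is never materialized. The class name, batch size, and the way encoder_inp / decoder_inp / decoder_inp_final are reused here are illustrative assumptions:

import numpy as np
from tensorflow.keras.utils import Sequence, to_categorical

class Seq2SeqBatches(Sequence):
    """Serves one small chunk of the training data per step."""
    def __init__(self, enc, dec, targets, vocab_size, batch_size=64):
        self.enc, self.dec, self.targets = enc, dec, targets
        self.vocab_size = vocab_size
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.enc) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        # One-hot encode only this chunk: shape (batch, 15, vocab) instead of the full matrix
        y = to_categorical(self.targets[sl], num_classes=self.vocab_size)
        return [self.enc[sl], self.dec[sl]], y

# Hypothetical usage with the integer-encoded arrays from the question:
# gen = Seq2SeqBatches(encoder_inp, decoder_inp, decoder_inp_final,
#                      vocab_size=len(important_word), batch_size=64)
# model.fit(gen, epochs=10)

Only one (batch, 15, vocab) block exists in memory at any given moment, which is what working in chunks means here.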