TensorFlow Keras to_categorical() error: Unable to allocate 221. GiB for an array with shape (3324240, 17877) and data type float32

Asked by yruzcnhs on 2023-06-23

Whenever I run the code below, I get the following error:

Unable to allocate 221. GiB for an array with shape (3324240, 17877) and data type float32.

The general logic of the code: I use pandas to load punctuation-free text from a CSV file. The "important" words are those that appear frequently enough in the text. The encoder and decoder inputs are for a Keras seq2seq model; both are NumPy arrays built from the CSV so the model can train on them.
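For reference, the snippet below is a minimal, hypothetical sketch of how df and convo_line are assumed to be set up; the question does not show the actual CSV layout or column names, so treat the filename and the mapping as placeholders.

import pandas as pd

# Hypothetical setup: 'no_punc' is assumed to hold one punctuation-free line of text per row.
df = pd.read_csv('conversations.csv')   # placeholder filename

# convo_line is assumed to map a question row index to its answer row index.
convo_line = {i: i + 1 for i in range(0, len(df) - 1, 2)}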

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Embedding, Input, LSTM
from keras.utils import to_categorical
from keras.utils import pad_sequences
import pandas as pd
import numpy as np

# Count how often each word appears across all lines
word2count = {}
for _, line in df.iterrows():
    for word in line['no_punc'].split():
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1

# Keep only words that appear more than `thresh` times and give each one an id
important_word = {}
thresh = 5
important_word['<PAD>'] = 0
word_id = 1
for key, value in word2count.items():
    if value > thresh:
        important_word[key] = word_id
        word_id += 1

# Pair each question line with its answer line via convo_line
questions = {}
answers = {}

for q, ans in convo_line.items():
    questions[q] = df.loc[q, 'no_punc']
    answers[ans] = df.loc[ans, 'no_punc']

# Wrap answers with start/end tokens; the spaces ensure split() treats them as separate tokens
for key, value in answers.items():
    answers[key] = '<SOS> ' + value + ' <EOS>'

# Add the special tokens to the vocabulary with their own ids
tokens = ['<EOS>', '<OUT>', '<SOS>']

x = len(important_word)

for t in tokens:
    important_word[t] = x
    x += 1

# Encode the questions as id sequences; unknown words map to <OUT>
encoder_inp = []
for id, line in questions.items():
    lst = []
    for word in line.split():
        if word not in important_word:
            lst.append(important_word['<OUT>'])
        else:
            lst.append(important_word[word])
    encoder_inp.append(lst)

# Encode the answers the same way for the decoder input
decoder_inp = []

for id, line in answers.items():
    lst = []
    for word in line.split():
        if word not in important_word:
            lst.append(important_word['<OUT>'])
        else:
            lst.append(important_word[word])
    decoder_inp.append(lst)

# Pad/truncate every sequence to a fixed length of 15
encoder_inp = pad_sequences(encoder_inp, 15, padding='post', truncating='post')
decoder_inp = pad_sequences(decoder_inp, 15, padding='post', truncating='post')

# The decoder target is the decoder input shifted left by one timestep (drop <SOS>)
decoder_inp_final = []

for i in decoder_inp:
    decoder_inp_final.append(i[1:])

decoder_inp_final = pad_sequences(decoder_inp_final, 15, padding='post', truncating='post')

de = decoder_inp_final[:17876]
# One-hot encode the targets over the whole vocabulary -- this is the line that fails
decoder_inp_final = to_categorical(decoder_inp_final)

Error message:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Cell In[183], line 2
      1 de=decoder_inp_final[:17876]
----> 2 decoder_inp_final=to_categorical(decoder_inp_final)

File d:\Desktop AI\env\Lib\site-packages\keras\utils\np_utils.py:73, in to_categorical(y, num_classes, dtype)
     71     num_classes = np.max(y) + 1
     72 n = y.shape[0]
---> 73 categorical = np.zeros((n, num_classes), dtype=dtype)
     74 categorical[np.arange(n), y] = 1
     75 output_shape = input_shape + (num_classes,)

MemoryError: Unable to allocate 221. GiB for an array with shape (3324240, 17877) and data type float32

Answer 1 (h22fl7wq):

You are loading an enormous amount of data. Your decoder_inp_final is roughly a 3M x 18K matrix, i.e. about 60 billion float32 values; at 4 bytes each that is the ~221 GiB the error reports, far more than your machine's memory.
In practice you almost certainly do not need all 3 million rows one-hot encoded in memory at once. Rewrite your code to work in (random) chunks: one-hot encode just one batch, run an update, load the next batch, and so on.
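One way to follow this advice is sketched below: one-hot encode the targets one batch at a time inside a generator, so only batch_size * 15 * VOCAB_SIZE floats are ever in memory instead of the full (3324240, 17877) array. Here decoder_target means the integer-encoded, padded targets (i.e. skip the final to_categorical call in the question); VOCAB_SIZE, BATCH_SIZE and the model.fit call are placeholder assumptions, not code from the question.

import numpy as np
from tensorflow.keras.utils import to_categorical

VOCAB_SIZE = len(important_word)   # number of target classes (~18K here)
BATCH_SIZE = 64                    # placeholder value

def batch_generator(encoder_inp, decoder_inp, decoder_target, batch_size=BATCH_SIZE):
    # Yields one batch at a time; only the current batch is ever one-hot encoded.
    n = len(encoder_inp)
    while True:
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            y = to_categorical(decoder_target[batch], num_classes=VOCAB_SIZE)
            yield (encoder_inp[batch], decoder_inp[batch]), y

# Hypothetical usage with a two-input seq2seq model:
# model.fit(batch_generator(encoder_inp, decoder_inp, decoder_target),
#           steps_per_epoch=len(encoder_inp) // BATCH_SIZE, epochs=10)

Alternatively, compile the model with loss='sparse_categorical_crossentropy' and pass the integer targets directly, which avoids one-hot encoding altogether.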
