Python RuntimeError when extracting text features from a BERT model and then classifying them with KNN

kgqe7b3p  posted on 2023-08-02  in  Python

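I am trying to use the CamemBERT model to extract text features. I then want to feed the resulting feature vectors to a KNN classifier as input.
Here is the code I wrote: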

import torch
from transformers import AutoTokenizer, CamembertModel
from sklearn.neighbors import KNeighborsClassifier

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base")

data = df.to_dict(orient='split')
data = dict(zip(data['index'], data['data']))

# Collect all the input texts into a list of strings
input_texts = [str(text) for text in data.values()]

# Tokenize all the input texts together
inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True)

# Get the model outputs for all the input texts
with torch.no_grad():
    outputs = model(**inputs)

# Extract the last hidden states and convert them to a numpy array
last_hidden_states = outputs.last_hidden_state
input_features = last_hidden_states[:, 0, :].numpy()

# Extract the labels from the data dictionary
input_labels = list(data.keys())

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(input_features, input_labels)

However, I get this error:

RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 19209424896 bytes.


The data I put in the dictionary has the following form:

{
    'index': [row_index_1, row_index_2, ...],
    'columns': [column_name_1, column_name_2, ...],
    'data': [
        [cell_value_row_1_col_1, cell_value_row_1_col_2, ...],
        [cell_value_row_2_col_1, cell_value_row_2_col_2, ...],
        ...
    ]
}


dw1jzc5e  #1

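It looks like you are feeding all of your data into the model in a single call, and you do not have enough memory for that. You can call the model sentence by sentence, or in small batches of sentences, so that the memory required stays within the available system resources.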
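The batching idea could look roughly like the sketch below. It is a minimal illustration, not a tested fix for your exact data: the helper names `iter_batches` and `extract_cls_features` are made up for this example, and `batch_size=16` and `max_length=128` are assumed values that you would tune to your machine and your texts.

```python
from typing import Iterator, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def extract_cls_features(texts, tokenizer, model, batch_size=16, max_length=128):
    """Run the model on one small batch at a time and stack the first-token vectors.

    `batch_size` and `max_length` are illustrative assumptions, not values
    taken from the question.
    """
    # Imported here so that iter_batches stays usable without PyTorch installed.
    import numpy as np
    import torch

    model.eval()
    chunks = []
    for batch in iter_batches(texts, batch_size):
        inputs = tokenizer(
            list(batch),
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,
        )
        with torch.no_grad():
            outputs = model(**inputs)
        # Keep only the first-token vector of each text, as in the question's code,
        # so the full hidden states of a batch can be freed before the next one.
        chunks.append(outputs.last_hidden_state[:, 0, :].numpy())
    # One row per input text, in the original order.
    return np.concatenate(chunks, axis=0)
```

With the `tokenizer`, `model`, `input_texts` and `input_labels` from the question, you would replace the single `model(**inputs)` call with `input_features = extract_cls_features(input_texts, tokenizer, model)` and then call `neigh.fit(input_features, input_labels)` as before. Only one batch of activations is held in memory at a time, and lowering `batch_size` or `max_length` reduces the peak further.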
