Python RuntimeError when extracting text features from a BERT model and then classifying with KNN

Posted by kgqe7b3p on 2023-08-02 in Python


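I am trying to use the CamemBERT model to extract text features. I then want to pass the resulting feature vectors to a KNN classifier as inputs.
Here is the code I wrote: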

import torch
from transformers import AutoTokenizer, CamembertModel
from sklearn.neighbors import KNeighborsClassifier

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base")

data = df.to_dict(orient='split')
data = dict(zip(data['index'], data['data']))

# Collect all the input texts into a list of strings
input_texts = [str(text) for text in data.values()]

# Tokenize all the input texts together
inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True)

# Get the model outputs for all the input texts
with torch.no_grad():
    outputs = model(**inputs)

# Extract the last hidden states and convert them to a numpy array
last_hidden_states = outputs.last_hidden_state
input_features = last_hidden_states[:, 0, :].numpy()

# Extract the labels from the data dictionary
input_labels = list(data.keys())

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(input_features, input_labels)

However, I get this error:

RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 19209424896 bytes.
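For a sense of scale, the requested size can be related to tensor shapes. The sketch below is plain arithmetic, assuming float32 activations (4 bytes per element) and the 3072-wide feed-forward layer of camembert-base. The batch size of 11087 and padded length of 141 are only one factorization that happens to match the number in the message; they are not values confirmed by the post.

```python
# Rough size of one dense activation tensor of shape (batch, seq_len, width).
# Assumes float32 (4 bytes per element); the shapes used below are guesses.
BYTES_PER_FLOAT32 = 4


def activation_bytes(batch_size: int, seq_len: int, width: int) -> int:
    """Bytes needed for a float32 tensor of shape (batch_size, seq_len, width)."""
    return batch_size * seq_len * width * BYTES_PER_FLOAT32


requested = 19209424896  # number of bytes reported in the error message
print(f"requested: {requested / 1024**3:.1f} GiB")

# Hypothetical shape: 11087 texts padded to 141 tokens, feed-forward width 3072.
whole_dataset = activation_bytes(11087, 141, 3072)
print("matches the error message:", whole_dataset == requested)

# The same layer when only 32 texts are encoded at a time.
small_batch = activation_bytes(32, 141, 3072)
print(f"batch of 32: {small_batch / 1024**2:.1f} MiB")
```

Under these assumptions, a single intermediate tensor for the whole dataset needs roughly 18 GiB, while the same tensor for a batch of 32 texts needs tens of MiB, which suggests the size of the single tokenizer/model call is what drives the allocation.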

The data I put into the dictionary has the following form:

{
    'index': [row_index_1, row_index_2, ...],
    'columns': [column_name_1, column_name_2, ...],
    'data': [
        [cell_value_row_1_col_1, cell_value_row_1_col_2, ...],
        [cell_value_row_2_col_1, cell_value_row_2_col_2, ...],
        ...
    ]
}
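Given data in that shape, one common way to keep memory bounded is to run the model over small batches and concatenate the per-batch first-token vectors, rather than encoding every row in a single call. The following is a minimal sketch under assumptions, not a confirmed fix: the helper names `iter_batches` and `extract_cls_features`, the `batch_size=32` and `max_length=128` defaults, and the sample values are all hypothetical. `torch` and `numpy` are imported inside the function so the batching helper can be tried without them.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def extract_cls_features(texts, tokenizer, model, batch_size=32, max_length=128):
    """Encode `texts` batch by batch and return an (n_texts, hidden_size) array.

    `tokenizer` and `model` are expected to be Hugging Face objects such as
    the ones loaded from "camembert-base" in the question.
    """
    import numpy as np
    import torch

    model.eval()  # disable dropout so the features are deterministic
    chunks = []
    for batch in iter_batches(texts, batch_size):
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=max_length)
        with torch.no_grad():
            outputs = model(**inputs)
        # Keep only the first-token vector of each text; the full hidden
        # states of this batch are no longer referenced after this line.
        chunks.append(outputs.last_hidden_state[:, 0, :].cpu().numpy())
    return np.concatenate(chunks, axis=0)


# Small stand-in for the 'split' dictionary described above (made-up values).
split = {"index": ["a", "b", "c"], "columns": ["text"],
         "data": [["bonjour"], ["merci"], ["au revoir"]]}
# Joining the cells of a row into one string is an assumption of this sketch.
texts = [" ".join(str(cell) for cell in row) for row in split["data"]]
labels = list(split["index"])
batches = list(iter_batches(texts, batch_size=2))
print(batches)
```

With the real objects loaded, `extract_cls_features(texts, tokenizer, model)` would return the array to pass to `KNeighborsClassifier.fit` together with `labels`. Lowering `batch_size` or `max_length` trades speed and context for a smaller peak allocation.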

Source: https://stackoverflow.com/questions/76802096/runtimeerror-when-trying-to-extract-text-features-from-a-bert-model-then-using-k