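I am trying to use the CamemBERT model to extract features from text. I then pass the resulting feature vectors as input to a KNN classifier.
Here is the code I wrote: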
import torch
from transformers import AutoTokenizer, CamembertModel
from sklearn.neighbors import KNeighborsClassifier
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base")
data = df.to_dict(orient='split')
data = dict(zip(data['index'], data['data']))
# Collect all the input texts into a list of strings
input_texts = [str(text) for text in data.values()]
# Tokenize all the input texts together
inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True)
# Get the model outputs for all the input texts
with torch.no_grad():
    outputs = model(**inputs)
# Extract the last hidden states and convert them to a numpy array
last_hidden_states = outputs.last_hidden_state
input_features = last_hidden_states[:, 0, :].numpy()
# Extract the labels from the data dictionary
input_labels = list(data.keys())
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(input_features, input_labels)
However, I get this error:
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 19209424896 bytes.
The data I put into the dictionary has the following form:
{
    'index': [row_index_1, row_index_2, ...],
    'columns': [column_name_1, column_name_2, ...],
    'data': [
        [cell_value_row_1_col_1, cell_value_row_1_col_2, ...],
        [cell_value_row_2_col_1, cell_value_row_2_col_2, ...],
        ...
    ]
}
1 Answer
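It looks like you are feeding all of your data to the model in a single call, and you do not have enough memory for that. You can call the model one sentence at a time, or on small batches of sentences, so that the memory required stays within the available system resources.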
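A minimal sketch of the batching idea, not a tuned implementation. The helper names `iter_batches`, `extract_features`, and `make_camembert_encoder`, as well as the default `batch_size=16` and `max_length=512`, are assumptions made for illustration; a smaller batch size or shorter `max_length` may be needed depending on how much RAM is available.

```python
import numpy as np


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def extract_features(texts, encode_batch, batch_size=16):
    """Encode `texts` batch by batch and stack the results into one array.

    `encode_batch` is any callable that maps a list of strings to a 2-D
    array of shape (len(batch), hidden_size). Only one batch is passed
    through the model at a time, which bounds the peak memory use.
    """
    chunks = [np.asarray(encode_batch(batch))
              for batch in iter_batches(texts, batch_size)]
    return np.concatenate(chunks, axis=0)


def make_camembert_encoder(model_name="camembert-base", max_length=512):
    """Build an `encode_batch` callable backed by CamemBERT.

    Hypothetical helper: it loads the tokenizer and model once
    (downloading the weights if they are not cached).
    """
    import torch
    from transformers import AutoTokenizer, CamembertModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = CamembertModel.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout

    def encode_batch(batch):
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=max_length)
        with torch.no_grad():
            outputs = model(**inputs)
        # Keep only the first-token vector, as in the question's code.
        return outputs.last_hidden_state[:, 0, :].numpy()

    return encode_batch
```

With the question's variables, this could be used as `input_features = extract_features(input_texts, make_camembert_encoder(), batch_size=16)` followed by `neigh.fit(input_features, input_labels)`. Because each batch is padded only to its own longest text, the per-call tensors are also smaller than when every text is padded to the longest one in the whole dataset.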