How do I convert a list of strings into a tensor in PyTorch?

Posted by xzlaal3s on 2022-11-09 in Other

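I am working on a classification problem where my class labels are a list of strings, and I want to convert them into a tensor. So far I have tried converting the list of strings into a NumPy array with np.array:

    truth = torch.from_numpy(np.array(truths))

But I get the following error:

    RuntimeError: can't convert a given np.ndarray to a tensor - it has an invalid type. The only supported types are: double, float, int64, int32, and uint8.

Can anyone suggest an alternative approach? Thanks.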


yvt65v4c  #1

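Unfortunately, you can't do that right now. I also don't think it would be a good idea, since it would make PyTorch clumsy. A popular workaround is to convert the labels to a numeric type with sklearn.
Here is a short example: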

    from sklearn import preprocessing
    import torch

    labels = ['cat', 'dog', 'mouse', 'elephant', 'pandas']
    le = preprocessing.LabelEncoder()
    # LabelEncoder sorts the classes alphabetically before numbering them
    targets = le.fit_transform(labels)
    # targets: array([0, 1, 3, 2, 4])
    targets = torch.as_tensor(targets)
    # targets: tensor([0, 1, 3, 2, 4])

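Since you will probably need to convert between the true labels and the encoded labels, it is a good idea to keep the variable le around.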
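To illustrate why keeping le is useful, here is a minimal sketch of the round trip from encoded tensor back to the original strings. It assumes the same labels as above; the name decoded is introduced only for this illustration.

```python
import torch
from sklearn import preprocessing

labels = ['cat', 'dog', 'mouse', 'elephant', 'pandas']
le = preprocessing.LabelEncoder()
targets = torch.as_tensor(le.fit_transform(labels))

# le.classes_ holds the sorted unique labels; integer i stands for le.classes_[i]
# inverse_transform maps the integers back to the original strings
decoded = le.inverse_transform(targets.numpy()).tolist()
print(decoded)
```

The same le object can also be reused with le.transform to encode new labels consistently, provided they were seen during fitting.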


hof1towb  #2

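The trick is to first find the maximum encoded length of the words in the list, and then, in a second loop, copy each word into a zero-padded tensor. Note that UTF-8 is a variable-width encoding: ASCII characters take one byte each, while the Hebrew characters in the example below take two bytes each.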

    import torch

    words = ['שלום', 'beautiful', 'world']
    max_l = 0
    ts_list = []
    # First pass: encode each word as UTF-8 bytes and track the longest one
    for w in words:
        ts_list.append(torch.ByteTensor(list(bytes(w, 'utf8'))))
        max_l = max(ts_list[-1].size()[0], max_l)

    # Second pass: copy each word into a row of a zero-padded uint8 tensor
    w_t = torch.zeros((len(ts_list), max_l), dtype=torch.uint8)
    for i, ts in enumerate(ts_list):
        w_t[i, 0:ts.size()[0]] = ts
    print(w_t)

    # Output:
    # tensor([[215, 169, 215, 156, 215, 149, 215, 157,   0],
    #         [ 98, 101,  97, 117, 116, 105, 102, 117, 108],
    #         [119, 111, 114, 108, 100,   0,   0,   0,   0]], dtype=torch.uint8)
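
The byte tensor is only useful if the strings can be recovered from it, so here is a hedged sketch of the reverse direction. It assumes the original strings contain no NUL bytes, so trailing zeros can be treated as padding; the names encoded and decoded are introduced only for this illustration.

```python
import torch

words = ['שלום', 'beautiful', 'world']

# Build the zero-padded uint8 tensor, one row per UTF-8 encoded word
encoded = [w.encode('utf-8') for w in words]
max_l = max(len(b) for b in encoded)
w_t = torch.zeros((len(encoded), max_l), dtype=torch.uint8)
for i, b in enumerate(encoded):
    w_t[i, :len(b)] = torch.tensor(list(b), dtype=torch.uint8)

# Reverse direction: strip the zero padding and decode each row as UTF-8
decoded = [bytes(row.tolist()).rstrip(b'\x00').decode('utf-8') for row in w_t]
print(decoded)
```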

l3zydbqr  #3

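If you don't want to use sklearn, another solution is to keep the original list and create an additional list of indices that you can use to refer back to the original values. I needed this in particular when I had to keep track of my original strings while batching the tokenized strings.
Here is an example: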

    import numpy as np
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    labels = ['cat', 'dog', 'mouse']
    sentence_idx = np.linspace(0, len(labels), len(labels), False)
    # array([0., 1., 2.])
    torch_idx = torch.tensor(sentence_idx)
    # do whatever you like with it in torch, e.g. pass it to a DataLoader
    dataset = TensorDataset(torch_idx)
    loader = DataLoader(dataset, batch_size=1, shuffle=True)
    for batch in loader:
        print(batch[0])
        print(labels[int(batch[0].item())])
    # example output (the order varies between runs because shuffle=True):
    # tensor([0.], dtype=torch.float64)
    # cat
    # tensor([1.], dtype=torch.float64)
    # dog
    # tensor([2.], dtype=torch.float64)
    # mouse
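
As a possible variant of the example above (not the answer's own code), the indices can be created as integers with torch.arange, which avoids the float64 tensor and the int() cast. This is a minimal sketch; the batch size of 2 and the name recovered are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

labels = ['cat', 'dog', 'mouse']
# Integer indices 0..len(labels)-1, dtype torch.int64
torch_idx = torch.arange(len(labels))

dataset = TensorDataset(torch_idx)
loader = DataLoader(dataset, batch_size=2, shuffle=True)

# Look up the original strings for every index in every batch
recovered = []
for (idx_batch,) in loader:
    recovered.extend(labels[i] for i in idx_batch.tolist())
print(recovered)
```

Because the loader shuffles, the order of recovered differs between runs, but each original string appears exactly once.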

For my specific use case, the code looks like this:

    # tokenize_sentences, tokenizer, model, sentences and max_length are defined elsewhere
    input_ids, attention_masks, labels = tokenize_sentences(tokenizer, sentences, labels, max_length)
    # create an index tensor to keep track of the original sentence index
    sentence_idx = np.linspace(0, len(sentences), len(sentences), False)
    torch_idx = torch.tensor(sentence_idx)
    dataset = TensorDataset(input_ids, attention_masks, labels, torch_idx)
    loader = DataLoader(dataset, batch_size=1, shuffle=True)
    for batch in loader:
        _, logit = model(batch[0],
                         token_type_ids=None,
                         attention_mask=batch[1],
                         labels=batch[2])
        pred_flat = np.argmax(logit.detach(), axis=1).flatten()
        print(pred_flat)
        print(batch[2])
        # batch_size=1, so this compares a single prediction with a single label
        if pred_flat == batch[2]:
            print("\nThe following sentence was predicted correctly:")
            print(sentences[int(batch[3].item())])

7cwmlq89  #4

    import numpy as np
    import torch

    # works only if every string in truths is a numeric literal such as '1.5'
    truth = [float(x) for x in truths]
    truth = np.asarray(truth)
    truth = torch.from_numpy(truth)
