NLP embedding does not provide the expected correlation [pytorch]

wh6knrhe · posted 2021-07-13 in Java

I am trying to train a word embedding on a list of repeated sentences in which only the subject changes. I expected the subjects to end up strongly related in the learned embedding. However, the similarity between the subject vectors is not always greater than the similarity between a subject and a random word (the comparison uses cosine similarity; see the note after the example sentences).

  1. Man is going to write a very long novel that no one can read.
  2. Woman is going to write a very long novel that no one can read.
  3. Boy is going to write a very long novel that no one can read.
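
For reference, the relatedness measure computed at the end of the code below is cosine similarity: the dot product of two embedding vectors divided by the product of their norms. A minimal standalone version (the helper name `cosine` is mine, not from the original code) would look like:

    import numpy as np

    def cosine(a, b):
        # dot product normalized by both vector lengths; 1.0 means parallel vectors
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))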

The code is based on the PyTorch tutorial:

    import torch
    from torch import nn
    import torch.nn.functional as F
    import numpy as np

    # Predict the next word from the two preceding words; the trained
    # nn.Embedding layer is the word embedding we are after.
    class EmbedTrainer(nn.Module):
        def __init__(self, d_vocab, d_embed, d_context):
            super(EmbedTrainer, self).__init__()
            self.embed = nn.Embedding(d_vocab, d_embed)
            self.fc_1 = nn.Linear(d_embed * d_context, 128)
            self.fc_2 = nn.Linear(128, d_vocab)

        def forward(self, x):
            x = self.embed(x).view((1, -1)) # flatten after embedding
            x = self.fc_2(F.relu(self.fc_1(x)))
            x = F.log_softmax(x, dim=1)
            return x

    # Three sentences that differ only in the subject
    text = " ".join(["{} is going to write a very long novel that no one can read.".format(x) for x in ["Man", "Woman", "Boy"]])
    text_split = text.split()
    # ([two context words], target word) pairs
    trigrams = [([text_split[i], text_split[i+1]], text_split[i+2]) for i in range(len(text_split)-2)]
    dic = list(set(text.split()))
    tok_to_ids = {w: i for i, w in enumerate(dic)}
    tokens_text = text.split(" ")
    d_vocab, d_embed, d_context = len(dic), 10, 2

    """ Train """
    loss_func = nn.NLLLoss()
    model = EmbedTrainer(d_vocab, d_embed, d_context)
    print(model)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
    losses = []
    epochs = 10
    for epoch in range(epochs):
        total_loss = 0
        for context, target in trigrams:
            tok_ids = torch.tensor([tok_to_ids[tok] for tok in context], dtype=torch.long)
            target_id = torch.tensor([tok_to_ids[target]], dtype=torch.long)
            model.zero_grad()
            log_prob = model(tok_ids)
            #if total_loss == 0: print("train ", log_prob, target_id)
            loss = loss_func(log_prob, target_id)
            total_loss += loss.item()
            loss.backward()
            optimizer.step()
        print(total_loss)
        losses.append(total_loss)

    # Inspect the learned embeddings for the three subjects and one random word
    embed_map = {}
    for word in ["Man", "Woman", "Boy", "novel"]:
        embed_map[word] = model.embed.weight[tok_to_ids[word]]
        print(word, embed_map[word])

    # Despite the name, this returns the cosine of the angle (cosine similarity)
    def angle(a, b):
        from numpy.linalg import norm
        a, b = a.detach().numpy(), b.detach().numpy()
        return np.dot(a, b) / norm(a) / norm(b)

    print("man.woman", angle(embed_map["Man"], embed_map["Woman"]))
    print("man.novel", angle(embed_map["Man"], embed_map["novel"]))
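
To make the comparison easier to check, here is a minimal sketch that prints the cosine similarity for every pair of probe words. It assumes the training code above has already run, so `model`, `tok_to_ids` and `angle` are in scope; the probe list is an illustrative choice, not part of the original code:

    from itertools import combinations

    probes = ["Man", "Woman", "Boy", "novel", "long"]
    for w1, w2 in combinations(probes, 2):
        v1 = model.embed.weight[tok_to_ids[w1]]
        v2 = model.embed.weight[tok_to_ids[w2]]
        print("{:>6} . {:<6} cos = {:+.3f}".format(w1, w2, angle(v1, v2)))

With only 10 epochs, a 10-dimensional embedding and three nearly identical sentences, the vectors likely stay close to their random initialization, so the Man/Woman/Boy pairs will not reliably score higher than the subject-random pairs.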

No answers yet.
