Cosine similarity between two columns of a Python DataFrame

I have a DataFrame with 2 columns, and I am trying to get a cosine similarity score for each pair of sentences.
DataFrame (df)

A                   B
0    Lorem ipsum ta      lorem ipsum
1    Excepteur sint      occaecat excepteur
2    Duis aute irure     aute irure

Some of the code snippets I have tried are:

1. df["cosine_sim"] = df[["A","B"]].apply(lambda x1,x2:cosine_sim(x1,x2))

2. from spicy.spatial.distance import cosine
df["cosine_sim"] = df.apply(lambda row: 1 - cosine(row['A'], row['B']), axis = 1)

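The code above did not work. I am still trying different approaches, but in the meantime I would appreciate any guidance. Thank you in advance!
Expected output: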

A                   B                     cosine_sim
0    Lorem ipsum ta      lorem ipsum                 0.8
1    Excepteur sint      occaecat excepteur          0.5
2    Duis aute irure     aute irure                  0.4

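You first need to convert your sentences into vectors, a process called *text vectorization*. There are many ways to vectorize text, depending on the level of sophistication you need, what your corpus looks like, and the intended application. The simplest is "Bag of Words" (BoW), which I have implemented below. Once you understand what it means to represent a sentence as a vector, you can move on to other, more complex methods of representing *lexical* similarity. For example: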

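tf-idf, which weights words according to how frequently they occur across many documents (or sentences), so that words common to most documents count for less. You can think of it as a weighted BoW approach.
BM25, which fixes a shortcoming of tf-idf whereby a single mention of a word in a short document produces a high "relevance" score. It does this by taking the length of the document into account.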
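As a rough illustration of the tf-idf variant, here is a minimal sketch that scores each A-B pair from the question's DataFrame. It assumes scikit-learn's `TfidfVectorizer` with its default settings (lowercasing, tokens of two or more characters, L2-normalised rows); the variable names `vec_a` and `vec_b` are just illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Example data taken from the question
df = pd.DataFrame({
    "A": ["Lorem ipsum ta", "Excepteur sint", "Duis aute irure"],
    "B": ["lorem ipsum", "occaecat excepteur", "aute irure"],
})

# Fit one vocabulary (and one set of idf weights) on both columns together
vectorizer = TfidfVectorizer()
vectorizer.fit(pd.concat([df["A"], df["B"]], ignore_index=True))

# Transform each column separately; row i of vec_a pairs with row i of vec_b
vec_a = vectorizer.transform(df["A"])
vec_b = vectorizer.transform(df["B"])

# Rows are L2-normalised by default (norm="l2"), so the row-wise dot
# product of the two sparse matrices is already the cosine similarity
df["cosine_sim"] = np.asarray(vec_a.multiply(vec_b).sum(axis=1)).ravel()
print(df)
```

Because the rows are already unit length, no explicit division by the norms is needed here. If you pass `norm=None` to the vectorizer, you would have to normalise the vectors yourself.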

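Moving on to measures of *semantic* similarity, you can use methods such as Doc2Vec [1], which start to use "embedding spaces" to represent the semantics of text. More recent methods such as SentenceBERT [2] and Dense Passage Retrieval [3] use techniques based on the Transformer (encoder-decoder) architecture [4] to allow "context-aware" representations to be formed.
Solution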

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from numpy.linalg import norm

df = pd.DataFrame({
    "A": [
    "I'm not a party animal, but I do like animal parties.",
    "That must be the tenth time I've been arrested for selling deep-fried cigars.",
    "He played the game as if his life depended on it and the truth was that it did."
    ],
    "B": [
    "The mysterious diary records the voice.",
    "She had the gift of being able to paint songs.",
    "The external scars tell only part of the story."
    ]
    })

# Combine all to make single corpus of text (i.e. list of sentences)
corpus = pd.concat([df["A"], df["B"]], axis=0, ignore_index=True).to_list()
# print(corpus)  # Display list of sentences

# Vectorization using basic Bag of Words (BoW) approach
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# print(vectorizer.get_feature_names_out())  # Display features
vect_sents = X.toarray()

# The first half of vect_sents holds column A, the second half column B
half = len(vect_sents) // 2
cosine_sim_scores = []
# Iterate over each vectorised sentence in the A-B pairs from the original dataframe
for A_vect, B_vect in zip(vect_sents[:half], vect_sents[half:]):
    # Calculate cosine similarity and store result
    cosine_sim_scores.append(np.dot(A_vect, B_vect)/(norm(A_vect)*norm(B_vect)))
# Append results to original dataframe
df.insert(2, 'cosine_sim', cosine_sim_scores)
print(df)

Output

A                                         B  cosine_sim
0  I'm not a party animal, but...          The mysterious diary records ...    0.000000
1  That must be the tenth time...   She had the gift of being able to pa...    0.084515
2  He played the game as if hi...  The external scars tell only part of ...    0.257130
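For a small frame like the one in the question, the same bag-of-words cosine can also be computed row by row using only the standard library. The sketch below assumes a very simple tokenizer (lowercase, split on whitespace), which is not what `CountVectorizer` does, and `bow_cosine` is a hypothetical helper name rather than a library function.

```python
import math
from collections import Counter


def bow_cosine(text_a, text_b):
    """Cosine similarity between two texts using raw word counts."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Dot product only needs the words that both texts share
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # Guard against empty strings, which would otherwise divide by zero
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Sentence pairs taken from the question
pairs = [
    ("Lorem ipsum ta", "lorem ipsum"),
    ("Excepteur sint", "occaecat excepteur"),
    ("Duis aute irure", "aute irure"),
]
scores = [bow_cosine(a, b) for a, b in pairs]
print(scores)
```

With a DataFrame, the helper can be applied per row, for example `df["cosine_sim"] = df.apply(lambda row: bow_cosine(row["A"], row["B"]), axis=1)`. This passes one row at a time to the function, which is what the snippets in the question were attempting.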

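References

[1] Le, Q. and Mikolov, T., 2014, June. Distributed representations of sentences and documents. In International Conference on Machine Learning (pp. 1188-1196). PMLR.
[2] Reimers, N. and Gurevych, I., 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
[3] Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D. and Yih, W.T., 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.
[4] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.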
