BERTopic: same topic, but some documents become outliers while others are assigned to the topic they belong to

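I am using BERTopic 0.16.2 and trying to understand why roughly a third of my documents are classified as outliers. len(docs) is 7578, and the document counts for my largest topics are: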

0       -1   2454                          
1        0    207                  
2        1    203
3        2    152  
4        3    130

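When I looked at some of the outlier documents, I immediately realized that many of them clearly belong to one of the "real" topics. For example, I have a "means of transport" topic (an interpretation based on the top words in that topic being words such as car, bicycle, train, and bus), and there are outliers about cycling and taking the bus.
However, the clearest indication that something is wrong is this outlier: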

Document ID: sp-1-3.3
Text: Godmorgon. Godmorgon.

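(Yes, the whole document consists of these two words.)
Given that I have these two topics:

| Topic 35 | Topic 60 |
| ------------ | ------------ |
| (topic contents not preserved) | (topic contents not preserved) |

godmorgon = "good morning" in Swedish.
I do not quite understand why there are two separate good-morning topics, one where "god morgon" is mostly transcribed as two words and one where it is transcribed as "godmorron", but that is probably not a bug but rather a matter of fine-tuning (or of merging topics manually), so let's not worry about it for now.
Here is the list of all documents in topic 35: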

Document ID: sp-1-6.4
Text: Godmorgon. 

Document ID: sp-14-9.3
Text: Godmorgon. 

Document ID: sp-16-30.0
Text:  Godmorgon.

Document ID: sp-16-32.8
Text: Godmorgon. 

Document ID: sp-16-47.8
Text: Godmorgon.

Document ID: sp-16-96.5
Text: Godmorgon.

Document ID: sp-16-100.1
Text:  Godmorgon.

Document ID: sp-19-66.7
Text: Godmorgon godmorgon.

Document ID: sp-19-131.0
Text: Godmorgon.

Document ID: sp-19-146.0
Text: Godmorgon. 

Document ID: sp-19-180.3
Text: Godmorgon. 

Document ID: sp-19-182.2
Text: Godmorgon. 

Document ID: sp-19-183.2
Text: Godmorgon.

Document ID: sp-21-23.0
Text: Godmorgon. 

Document ID: sp-30-230.9
Text: Godmorgon.

Document ID: sp-30-232.3
Text: Godmorgon. 

Document ID: sp-31-236.5
Text: Godmorgon.

Document ID: sp-33-20.8
Text: Godmorgon.

Document ID: sp-33-22.7
Text: Godmorgon.

Document ID: sp-34-2.9
Text: Godmorgon. 

Document ID: sp-34-4.5
Text: Godmorgon. 

Document ID: sp-38-85.2
Text: Godmorgon. 

Document ID: sp-38-93.0
Text: Godmorgon godmorgon. 

Document ID: sp-38-94.6
Text: Godmorgon godmorgon.

Document ID: sp-38-96.8
Text: Godmorgon. 

Document ID: sp-38-100.6
Text: Godmorgon. 

Document ID: sp-40-48.5
Text:  Godmorgon.

Document ID: sp-40-79.1
Text: Godmorgon. 

Document ID: sp-40-80.5
Text: Godmorgon godmorgon 

Document ID: sp-40-82.6
Text: godmorgon. 

Document ID: sp-40-91.1
Text: Godmorgon.

Document ID: sp-41-68.3
Text:  Godmorgon. 

Document ID: sp-41-84.3
Text: Godmorgon.

Document ID: sp-41-85.8
Text: Godmorgon.

Document ID: sp-41-87.1
Text: Godmorgon.

Document ID: sp-41-103.7
Text: Godmorgon.

Document ID: sp-41-105.6
Text: Godmorgon.

Document ID: sp-42-17.9
Text: Godmorgon. 

Document ID: sp-43-56.9
Text: Godmorgon.

Document ID: sp-43-58.9
Text: Godmorgon.

Document ID: sp-43-108.8
Text: Godmorgon.

Document ID: sp-43-111.1
Text: Godmorgon.

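I realize that no document is exactly identical to the outlier, but some are very close:
Outlier: "Godmorgon. Godmorgon."
Two documents in topic 35: "Godmorgon godmorgon."
Cosine similarity: 0.9904226 (I am using KBLab/sentence-bert-swedish-cased as the sentence transformer, because I could not get any meaningful topics with the default multilingual embedding model)
If we take "Godmorgon." as the reference instead, the cosine similarity drops to 0.9797847, but I still find it puzzling that BERTopic classifies the document as an outlier rather than adding it to topic 35.
Another possibly relevant observation is that no matter what threshold I use in

new_topics = topic_model.reduce_outliers(docs, topics, probabilities=probs, strategy="probabilities", threshold=0.5)
new_topic_info = topic_model.get_topic_info()
print(new_topic_info)

the number of outliers always stays the same. I am new to BERTopic, so I may well have misconfigured something, and I would appreciate any hints about where things might be going wrong. Please see my code below. (That said, it would be great if BERTopic could handle this automatically, because it is very hard to figure out what is going on when you do not have super-similar documents and topics as simple as mine.)
Here is my code: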

import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from bertopic.representation import KeyBERTInspired
from bertopic.representation import MaximalMarginalRelevance
#from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer

ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Initialize BERTopic
#representation_model = KeyBERTInspired()
representation_model = MaximalMarginalRelevance(diversity=0.5)
topic_model = BERTopic(ctfidf_model=ctfidf_model, representation_model=representation_model, calculate_probabilities=True, language="swedish")

# Fit the model with documents
topics, probs = topic_model.fit_transform(docs, embeddings)

# Get topic information
topic_info = topic_model.get_topic_info()

# Map each document to its topic and add document IDs to topic_info
doc_topic_map = pd.DataFrame({
    "doc_id": doc_ids,
    "text": docs,
    "topic": topics,
    "meeting": meetings
})

# Create a dictionary to collect document metadata for each topic
topic_docs_metadata = doc_topic_map.groupby("topic").apply(lambda x: x.to_dict(orient='records'), include_groups=False).to_dict()

# Add a new column with document metadata to the topic_info DataFrame
topic_info["documents"] = topic_info["Topic"].map(topic_docs_metadata)

# Create a dictionary to collect document IDs for each topic
#topic_doc_ids = doc_topic_map.groupby("topic")["doc_id"].apply(list).to_dict()

# Adjust display settings for Pandas DataFrame
pd.set_option('display.max_rows', None)  # Display all rows
pd.set_option('display.max_columns', None)  # Display all columns
pd.set_option('display.max_colwidth', None)  # Display full content of each column

# Print modified topic information with document IDs
print(topic_info.head)

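> I am using BERTopic 0.16.2 and trying to understand why roughly a third of my documents are classified as outliers.

Are you familiar with BERTopic's underlying algorithms? If not, I strongly recommend reading up on the clustering algorithm (HDBSCAN), which is what actually assigns data points to clusters (that is, topics).
If you are, let me go a bit deeper into some of the things you shared:

> I do not quite understand why there are two separate good-morning topics, one where "god morgon" is mostly transcribed as two words and one where it is transcribed as "godmorron", but that is probably not a bug but rather a matter of fine-tuning (or of merging topics manually), so let's not worry about it for now.

Just note that you can also merge topics automatically if you wish, although the preferred approach is to use min_topic_size.

> Cosine similarity: 0.9904226 (I am using KBLab/sentence-bert-swedish-cased as the sentence transformer, because I could not get any meaningful topics with the default multilingual embedding model)
> If we take "Godmorgon." as the reference instead, the cosine similarity drops to 0.9797847, but I still find it puzzling that BERTopic classifies the document as an outlier rather than adding it to topic 35.

Absolute cosine similarities are somewhat hard to interpret because they tell you little about the distribution of similarities. For example, it may be that this particular embedding model produces very high similarities by default, which makes separating documents more difficult.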
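To put a value such as 0.99 in context, one option is to check where it falls among the pairwise similarities of a random sample of documents. The following is a minimal sketch and not part of BERTopic: the helper names `cosine`, `sampled_similarities`, and `fraction_at_least` are made up for illustration, `embeddings` is assumed to be the same array that was passed to `fit_transform`, and no embedding is assumed to be all zeros.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (lists or NumPy rows)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sampled_similarities(embeddings, sample_size=200, seed=0):
    """Pairwise cosine similarities within a random sample of documents."""
    rng = random.Random(seed)
    n = len(embeddings)
    idx = rng.sample(range(n), min(sample_size, n))
    return [cosine(embeddings[i], embeddings[j]) for i, j in combinations(idx, 2)]


def fraction_at_least(similarities, value):
    """Share of sampled pairs whose similarity is at least `value`."""
    return sum(s >= value for s in similarities) / len(similarities)
```

Calling `fraction_at_least(sampled_similarities(embeddings), 0.9)` gives a rough idea of how common high similarities are; if a large share of unrelated pairs already scores that high, a value of 0.99 says less about topic membership than it appears to.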
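> Another possibly relevant observation is that no matter what threshold I use, the number of outliers always stays the same.

Have you tried setting the threshold to 0? Also, using a different strategy (for example "embeddings") might be helpful, since your problem seems to be related to the embedding model you are using.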
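One more thing that may be worth checking: as far as I know, `reduce_outliers` returns a new list of topic assignments and does not by itself change the fitted model, so `get_topic_info()` keeps reporting the original counts until `update_topics` is called with the new assignments. Below is a minimal sketch for comparing the counts directly on the returned labels. `count_outliers` and `reduce_and_count` are hypothetical helper names, and `topic_model`, `docs`, and `topics` are assumed to be the objects from the question's code.

```python
from collections import Counter


def count_outliers(topics):
    """Number of documents assigned to the outlier topic (-1)."""
    return Counter(topics)[-1]


def reduce_and_count(topic_model, docs, topics, **kwargs):
    """Run reduce_outliers and report outlier counts before and after.

    Extra keyword arguments (strategy, threshold, probabilities,
    embeddings) are passed through to topic_model.reduce_outliers.
    """
    new_topics = topic_model.reduce_outliers(docs, topics, **kwargs)
    # Count on the returned labels rather than on get_topic_info(), which
    # reflects the fitted model and not the list returned here.
    before, after = count_outliers(topics), count_outliers(new_topics)
    print(f"outliers before: {before}, after: {after}")
    return new_topics
```

For example, `new_topics = reduce_and_count(topic_model, docs, topics, strategy="embeddings", embeddings=embeddings)` tries the embedding-based strategy. Calling `topic_model.update_topics(docs, topics=new_topics)` afterwards should make `get_topic_info()` reflect the new assignments.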

# Initialize BERTopic
#representation_model = KeyBERTInspired()
representation_model = MaximalMarginalRelevance(diversity=0.5)
topic_model = BERTopic(ctfidf_model=ctfidf_model, representation_model=representation_model, calculate_probabilities=True, language="swedish")

# Fit the model with documents
topics, probs = topic_model.fit_transform(docs, embeddings)

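You should also set embedding_model=sentence_model. The reason is that both KeyBERTInspired and MaximalMarginalRelevance use the underlying embedding model in addition to the precomputed embeddings.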
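A minimal sketch of what that could look like, assuming the embeddings were computed with `SentenceTransformer("KBLab/sentence-bert-swedish-cased")` as the question states. The function name `build_topic_model` is made up here, and the remaining arguments simply mirror the question's code; `language="swedish"` is left out because, according to the BERTopic docstring, `language` is not used once `embedding_model` is given.

```python
def build_topic_model(model_name="KBLab/sentence-bert-swedish-cased"):
    """Return (sentence_model, topic_model) that share one embedding model."""
    # Imports live inside the function so that defining it does not load
    # the heavy libraries.
    from bertopic import BERTopic
    from bertopic.representation import MaximalMarginalRelevance
    from bertopic.vectorizers import ClassTfidfTransformer
    from sentence_transformers import SentenceTransformer

    sentence_model = SentenceTransformer(model_name)
    topic_model = BERTopic(
        # The same model that produced the document embeddings is now also
        # available to the representation model.
        embedding_model=sentence_model,
        ctfidf_model=ClassTfidfTransformer(reduce_frequent_words=True),
        representation_model=MaximalMarginalRelevance(diversity=0.5),
        calculate_probabilities=True,
    )
    return sentence_model, topic_model
```

Usage would then be along the lines of `sentence_model, topic_model = build_topic_model()`, followed by `embeddings = sentence_model.encode(docs)` and `topics, probs = topic_model.fit_transform(docs, embeddings)`.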
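> Cosine similarity: 0.9904226 (I am using KBLab/sentence-bert-swedish-cased as the sentence transformer, because I could not get any meaningful topics with the default multilingual embedding model)

As mentioned above, note that you are currently using the default multilingual embedding model to create the embeddings for KeyBERTInspired and MMR.

> I am new to BERTopic, so I may well have misconfigured something, and I would appreciate any hints about where things might be going wrong. Please see my code below. (That said, it would be great if BERTopic could handle this automatically, because it is very hard to figure out what is going on when you do not have super-similar documents and topics as simple as mine.)

Based on what you shared, it is hard to say what exactly the "problem" is, since this may simply be a result of how the underlying clustering algorithm, HDBSCAN, handles these kinds of input embeddings. For example, it has been noted before that HDBSCAN is quite conservative when assigning embeddings to clusters, which produces many outliers.
Instead, I would suggest trying out some HDBSCAN hyperparameters to see whether anything changes (for example min_cluster_size). In addition, tuning UMAP's n_neighbors might also help, since it affects how much of the data UMAP "sees".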
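A sketch of how such a sweep could be set up, assuming the `umap-learn` and `hdbscan` packages are installed. `candidate_settings` and `fit_with_settings` are illustrative names, and the values in the grid are arbitrary starting points rather than recommendations. `min_samples` is included as an assumption of mine: as far as I know, lowering it tends to make HDBSCAN less conservative about outliers.

```python
from itertools import product


def candidate_settings(n_neighbors=(5, 15, 30), min_cluster_size=(5, 10), min_samples=(1, 5)):
    """All combinations of the given UMAP / HDBSCAN values, as dicts."""
    keys = ("n_neighbors", "min_cluster_size", "min_samples")
    return [dict(zip(keys, combo))
            for combo in product(n_neighbors, min_cluster_size, min_samples)]


def fit_with_settings(docs, embeddings, settings, embedding_model=None):
    """Fit BERTopic with one settings dict; return (model, outlier count)."""
    # Heavy imports are deferred until a model is actually fitted.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    umap_model = UMAP(n_neighbors=settings["n_neighbors"], n_components=5,
                      min_dist=0.0, metric="cosine", random_state=42)
    hdbscan_model = HDBSCAN(min_cluster_size=settings["min_cluster_size"],
                            min_samples=settings["min_samples"],
                            metric="euclidean", cluster_selection_method="eom",
                            prediction_data=True)
    topic_model = BERTopic(embedding_model=embedding_model, umap_model=umap_model,
                           hdbscan_model=hdbscan_model)
    topics, _ = topic_model.fit_transform(docs, embeddings)
    return topic_model, sum(t == -1 for t in topics)
```

Looping over `candidate_settings()` and recording the outlier count returned by `fit_with_settings(docs, embeddings, settings)` shows which settings, if any, pull documents such as "Godmorgon. Godmorgon." into a topic. Fixing `random_state` keeps the runs comparable.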
