BERTopic: Some topics have topic words that start or end with the same letters

lyr7nygr posted 2 months ago in Other


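Hi MaartenGr,

Thank you for developing BERTopic; it has played an important role in our project. I really appreciate the active, constructive discussions and solutions you provide.

I am working with 2 million documents, and this is the main code: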

import numpy as np
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# docs: list of about 2 million document strings, loaded elsewhere
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)

# Dimensionality reduction and tokenization settings
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric='cosine', random_state=42, low_memory=True)
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=100)

model = BERTopic(vectorizer_model=vectorizer_model, umap_model=umap_model,
                 calculate_probabilities=False, nr_topics=500)
topics, probabilities = model.fit_transform(docs)

# Reassign outlier documents (topic -1), then refresh the topic representations
red_topics = model.reduce_outliers(docs, topics, strategy='c-tf-idf')
model.update_topics(docs, topics=red_topics, vectorizer_model=vectorizer_model)

I found that some topics consist only of topic words that start with the same letters, for example:
'xabvd, xser, xwesd, xrfde'
'jfrsd, jresa, jliok, joiun'
'dau' 'dan' 'daud'
or topic words that end with the same letters, for example:
'calcium' 'valium' 'xxxxium'
Does this come from CountVectorizer? Is there any way to solve it?

Thanks a lot!
Best regards,
Ji Hyun

Source: https://github.com/MaartenGr/BERTopic/issues/1385

3 answers

ldioqlga1#

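I think this is a result of the embedding model rather than the vectorizer. Most likely, there are clusters of documents that are semantically hard to group, and the embedding model found those documents similar based on character-level n-grams. It is worth exploring the documents in these topics to check whether that is indeed the case. If so, it may be worthwhile to try a more accurate embedding technique. For example, the MTEB Leaderboard is a good starting point for exploring embedding techniques for clustering purposes.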
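One way to act on this suggestion is to flag topics whose top words share a prefix or suffix, then read the documents assigned to them. The sketch below is only a heuristic under stated assumptions: `shared_affix`, `flag_suspicious_topics`, and `docs_by_topic` are hypothetical helpers, not part of BERTopic, and the word lists reuse the examples from the question. With a fitted model, the word lists could be built from `model.get_topics()`, which maps each topic ID to (word, score) pairs.

```python
import os
from collections import defaultdict


def shared_affix(words, min_len=1):
    """Return the (prefix, suffix) shared by all words, or "" where none exists."""
    words = [w.lower() for w in words if w]
    if len(words) < 2:
        return "", ""
    prefix = os.path.commonprefix(words)
    # Reverse each word so the common prefix of the reversed words is the suffix.
    suffix = os.path.commonprefix([w[::-1] for w in words])[::-1]
    return (prefix if len(prefix) >= min_len else "",
            suffix if len(suffix) >= min_len else "")


def flag_suspicious_topics(topic_words, top_n=4):
    """Map topic_id -> (prefix, suffix) for topics whose top words share an affix."""
    flagged = {}
    for topic_id, words in topic_words.items():
        prefix, suffix = shared_affix(words[:top_n])
        if prefix or suffix:
            flagged[topic_id] = (prefix, suffix)
    return flagged


def docs_by_topic(docs, topics):
    """Group documents by their assigned topic so flagged topics can be read."""
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        grouped[topic].append(doc)
    return grouped


# Illustrative word lists taken from the question. With a fitted model, assume:
# topic_words = {t: [w for w, _ in ws] for t, ws in model.get_topics().items()}
topic_words = {
    0: ["xabvd", "xser", "xwesd", "xrfde"],
    1: ["calcium", "valium"],
    2: ["patient", "dose", "surgery"],
}
flagged = flag_suspicious_topics(topic_words)
print(flagged)
```

If the documents in a flagged topic turn out to be unrelated apart from their spelling, that supports the explanation that the embedding model is grouping them on surface form.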

2 months ago

huus2vyu2#

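Thank you so much! If I understand correctly, to do natural language processing (NLP) on medical records I should use an embedding model suited to that data rather than all-MiniLM-L6-v2?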
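Swapping the embedding model can be sketched as follows. This is a minimal sketch, not a recommendation: `fit_with_embedding_model` is a hypothetical helper, and the default `all-mpnet-base-v2` is only an illustrative general-purpose sentence-transformers model, not one chosen for medical text. The imports are placed inside the function so it can be defined even where the libraries are not installed.

```python
def fit_with_embedding_model(docs, model_name="all-mpnet-base-v2"):
    """Fit BERTopic using embeddings from the named sentence-transformers model."""
    # Local imports: both libraries must be installed before the function is called.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    sentence_model = SentenceTransformer(model_name)
    # Precompute embeddings once so they can be reused across experiments.
    embeddings = sentence_model.encode(docs, show_progress_bar=False)

    topic_model = BERTopic(embedding_model=sentence_model)
    topics, _ = topic_model.fit_transform(docs, embeddings)
    return topic_model, topics
```

Calling it with two different model names on the same sample of documents gives a direct way to compare whether the spelling-based topics disappear.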
I also have another question:
I called info = model.get_topic_info(), and the topics are not ordered by number of documents. Is this because I used reduce_outliers?

2 months ago

pokxtpni3#

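It depends on the documents themselves, but using the model that best fits the data you have is preferred. As for your question about whether the topics in model.get_topic_info() are no longer ordered by document count because you used reduce_outliers: yes, that can indeed happen when additional updates are made.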
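The effect on ordering can be illustrated with a toy example. This is a sketch with made-up assignments: `topic_sizes` is a hypothetical helper, and the two lists stand in for `topics` and `red_topics` from the question. Topic IDs are fixed when the model is first fitted, so once outliers (topic -1) are reassigned, a topic with a higher ID can end up larger than one with a lower ID.

```python
from collections import Counter


def topic_sizes(topics):
    """Return (topic_id, n_docs) pairs sorted by n_docs descending, then by ID."""
    counts = Counter(topics)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up assignments: before and after reassigning the outlier documents.
topics_before = [0, 0, 0, 1, 1, -1, -1, -1, -1]
topics_after = [0, 0, 0, 1, 1, 1, 1, 1, 0]

print(topic_sizes(topics_before))
print(topic_sizes(topics_after))

# With the DataFrame returned by model.get_topic_info(), the same ordering
# can be restored with pandas, assuming its "Count" column:
# info = model.get_topic_info().sort_values("Count", ascending=False)
```

In the second list topic 1 holds more documents than topic 0, even though its ID still reflects the original ranking.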

2 months ago