BERTopic clustering question

ep6jt1vc posted 2 months ago in Other


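Hello,
I am using BERTopic for topic modeling, but I sometimes find that it clusters topics more broadly than I would like.
Question 1: Is there a general approach or parameter I can tweak to control the granularity of the topic clustering?
For example, if I have two sentences, "I like eating apples" and "I like eating bananas", I would prefer them to be assigned to two different topics based on the specific food I like. Currently, they may be grouped into one topic. How can I adjust the model or algorithm to get this level of granularity?
Question 2: I find that topic -1 (the outlier topic) sometimes contains too many sentences. Is there a way to reduce the noise? I suspect some of those sentences are wrongly classified as noise.
Thanks!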

BERTopic

Source: https://github.com/MaartenGr/BERTopic/issues/1607

3 answers


jobtbby31#

Question 1: Is there a general approach or parameter I can tweak to control the granularity of the topic clustering?
Generally, the granularity of the topic clustering is controlled, to an extent, by the size of a cluster. The larger a cluster, the broader it tends to be. If you increase the number of micro-clusters that are generated, you are likely to get more fine-grained topics. To do so, you can either decrease min_topic_size or control the parameters of HDBSCAN directly.
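As a rough illustration of the first option, here is a minimal sketch that lowers `min_topic_size` below BERTopic's default of 10. The helper names `topic_model_kwargs` and `build_fine_grained_model` are hypothetical, and the value 5 is only a starting point to experiment with, not a recommendation from this thread. It assumes `bertopic` is installed.

```python
def topic_model_kwargs(min_topic_size=5):
    """Keyword arguments for BERTopic (hypothetical helper).

    A smaller min_topic_size lets smaller clusters survive as topics,
    which usually yields more, finer-grained topics.
    """
    return {"min_topic_size": min_topic_size}


def build_fine_grained_model(min_topic_size=5):
    """Create a BERTopic model that allows small topics."""
    # Imported lazily because bertopic is a heavy third-party dependency.
    from bertopic import BERTopic

    return BERTopic(**topic_model_kwargs(min_topic_size))


# Usage (docs is assumed to be a list of strings, ideally a few hundred or more):
# topic_model = build_fine_grained_model(min_topic_size=5)
# topics, probs = topic_model.fit_transform(docs)
```

Keep in mind that this setting only changes how small a topic may be. Whether two very similar sentences end up in different topics also depends on how far apart their embeddings are.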
Question 2: I find that topic -1 (the outlier topic) sometimes contains too many sentences. Is there any way to reduce the noise? I feel some of those sentences are actually misclassified as noise.
For this, you can apply outlier reduction.
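A minimal sketch of what outlier reduction can look like, assuming a fitted BERTopic model: `reduce_outliers` and `update_topics` are BERTopic methods, while the helpers `outlier_share` and `reduce_and_update` are hypothetical names introduced here. The `"c-tf-idf"` strategy is just one of the available strategies and is picked for illustration.

```python
def outlier_share(topics):
    """Fraction of documents assigned to the outlier topic -1."""
    if not topics:
        return 0.0
    return sum(1 for t in topics if t == -1) / len(topics)


def reduce_and_update(topic_model, docs, topics, strategy="c-tf-idf", threshold=0.0):
    """Reassign outlier documents, then refresh the topic representations.

    A threshold above 0 leaves documents in -1 when no topic is similar enough.
    """
    new_topics = topic_model.reduce_outliers(
        docs, topics, strategy=strategy, threshold=threshold
    )
    # Without this step the topic keywords still reflect the old assignments.
    topic_model.update_topics(docs, topics=new_topics)
    return new_topics


# Usage, assuming `topic_model` is a fitted BERTopic instance:
# topics, probs = topic_model.fit_transform(docs)
# print("outliers before:", outlier_share(topics))
# new_topics = reduce_and_update(topic_model, docs, topics)
# print("outliers after:", outlier_share(new_topics))
```

Comparing the outlier share before and after makes it easier to judge whether a given strategy and threshold reassign too aggressively.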

2 months ago

bf1o4zei2#

Could you elaborate on how to control the parameters of HDBSCAN directly? It has many parameters, for example min_cluster_size, min_samples, metric, cluster_selection_method, and cluster_selection_epsilon.

2 months ago

wsxa1bj13#

I would highly advise reading the documentation of HDBSCAN itself, as it describes these parameters in much more detail.
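For orientation while reading those docs, here is a hedged sketch of passing a custom HDBSCAN instance to BERTopic through its `hdbscan_model` argument. The helper names `hdbscan_kwargs` and `build_model_with_hdbscan` are hypothetical, the values are illustrative starting points rather than tuned settings, and the comments paraphrase the HDBSCAN documentation. It assumes `bertopic` and `hdbscan` are installed.

```python
def hdbscan_kwargs(min_cluster_size=5, min_samples=1, epsilon=0.0):
    """Illustrative HDBSCAN settings that lean toward small, fine-grained clusters."""
    return {
        # Smallest group of points that is still treated as a cluster.
        "min_cluster_size": min_cluster_size,
        # Larger values are more conservative and declare more points as noise.
        "min_samples": min_samples,
        "metric": "euclidean",
        # "leaf" picks the leaves of the cluster tree (finer clusters);
        # the default "eom" tends to produce fewer, larger clusters.
        "cluster_selection_method": "leaf",
        # Clusters closer together than this distance are merged.
        "cluster_selection_epsilon": epsilon,
        # Lets BERTopic assign topics to unseen documents later.
        "prediction_data": True,
    }


def build_model_with_hdbscan(**overrides):
    """Create a BERTopic model that uses a custom HDBSCAN instance."""
    # Imported lazily because both are third-party dependencies.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN

    return BERTopic(hdbscan_model=HDBSCAN(**hdbscan_kwargs(**overrides)))


# Usage (docs is assumed to be a list of strings):
# topic_model = build_model_with_hdbscan(min_cluster_size=5, min_samples=1)
# topics, probs = topic_model.fit_transform(docs)
```

When a custom `hdbscan_model` is supplied, the cluster size is governed by its `min_cluster_size` rather than by BERTopic's `min_topic_size`. Lowering `min_samples` is also one way to end up with fewer documents in topic -1.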

2 months ago