BERTopic: how to get the top N topics for a single document

Posted by jfewjypa 2 months ago in Other

I have a few simple questions. After fitting and transforming the documents, reducing the outliers, and updating the topic model like this:

embedding = ...
topics, probs = model.fit_transform(docs, embeddings = emb)
new_topics = model.reduce_outliers(docs, topics)
model.update_topics(new_docs, topics=new_topics)

I would then like to predict the topic of a new document like this:
prediction = model.transform(documents=new_document, embeddings=embedding)
The result is:
([-1], array([[1.16350449e-02, 2.01218509e-01, 1.11319454e-02, 7.07114037e-04, 3.97837834e-03, 2.08542765e-04, 3.01921767e-03, 7.32074922e-02, 4.58492993e-01, 9.87979027e-03, 9.41436941e-03, 1.09079184e-02, 1.02977729e-02, 9.64706333e-03, 1.12269956e-02, 1.04989969e-02, 3.24051104e-03, 1.23429396e-02, 1.00992529e-02, 9.99152714e-03, 1.06927620e-02, 1.20518505e-02, 1.06281160e-02, 4.32279154e-03]]))
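The tuple contains:

the topic prediction, [-1]
an array of probabilities for the remaining topics

Is my understanding correct?
Why is the predicted topic still [-1] after I explicitly reduced the outliers and updated the model?
If I simply call model.topic_labels_, I no longer get a [-1] topic; the labels start at [0], as expected. So why does my model still predict topic [-1]?
Also, I assume the probability array is ordered by ascending topic ID (0 to 23 in my case). Is that right?
If so, I would expect the value at index 8 (the 9th value when counting from one) to correspond to the most relevant topic. Is that correct?
I am asking because I would like an easy way to predict the top N topics for a single new document.
I have gone through the BERTopic documentation and explored the available methods, but I could not find a direct way to get the probabilities of all topics, beyond the top-1 prediction, for a single new document.
Could you help me with this? I would greatly appreciate any help or suggested workaround.
Thank you!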

Source: https://github.com/MaartenGr/BERTopic/issues/1398

7 answers

6tdlim6h1#

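"Why is the predicted topic still [-1] after I explicitly reduced the outliers and updated the model? If I simply call model.topic_labels_, I no longer get a [-1] topic; the labels start at [0], as expected. So why does my model still predict topic [-1]?"
That indeed should not happen, assuming new_topics contains no -1. Could you check that? Could you also share your full code? There may be some subtlety in it that explains the behavior. Finally, which version of BERTopic are you using?
"Also, I assume the probability array is ordered by ascending topic ID (0 to 23 in my case). Is that right? If so, I would expect the value at index 8, 4.58492993e-01 (the 9th value when counting from one), to be the most relevant topic. Is that true?"
If you used calculate_probabilities=True and did not load a previously trained model, the returned probabilities are ordered by ascending topic ID over the non-outlier topics.
"I have gone through the BERTopic documentation and explored the available methods, but I could not find a direct way to get the probabilities of all topics, beyond the top-1 prediction, for a single new document."
There are several ways to do this. First, you can set calculate_probabilities=True and then access the document-topic distribution through topic_model.probabilities_. Second, you can generate the distributions after training the initial model by applying .approximate_distribution.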
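
As a rough illustration of the second route, the sketch below wraps .approximate_distribution in a small helper that returns the top N topics for one document. The helper name top_topics_from_distribution and its parameters are made up for this example. It assumes topic_model is an already fitted BERTopic model and that column i of the returned distribution corresponds to topic i, with the outlier topic -1 excluded.

```python
def top_topics_from_distribution(topic_model, document, n=3):
    """Return the n highest-scoring (topic_id, score) pairs for one document.

    Hypothetical helper: `topic_model` is assumed to be a fitted BERTopic model.
    """
    # approximate_distribution returns a pair:
    # (topic_distributions, topic_token_distributions); only the first is needed.
    topic_distr, _ = topic_model.approximate_distribution([document])
    row = [float(score) for score in topic_distr[0]]
    # Assumption: index i of the row holds the score of topic i.
    ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    return [(topic_id, row[topic_id]) for topic_id in ranked[:n]]
```

With a fitted model this could be called as top_topics_from_distribution(model, new_doc[0], n=5). These scores are computed from c-TF-IDF similarities over token windows, so they will generally differ from the HDBSCAN probabilities returned by .transform.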

6xfqseft2#

Hello,

BERTopic version: 0.15.0

Code that produced these results:

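import nltk
import numpy as np
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from umap import UMAP

# `value` is a placeholder for hyperparameter values that are not shown here
umap_args = {
    'n_neighbors': int(value),
    'n_components': int(value),
    'min_dist': value,
    'metric': 'cosine',  # UMAP expects the metric name as a string
    'random_state': 42
}
hdbscan_args = {
    'min_cluster_size': int(value),
    'min_samples': int(value),
    'cluster_selection_epsilon': float(value),  # must be cast to float
    'prediction_data': True  # needed when making inference
}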
ctfidf_model = ClassTfidfTransformer()
reduce_frequent_words = value 
umap_model = UMAP(**umap_args)
hdbscan_model = HDBSCAN(**hdbscan_args)

model = BERTopic(
        calculate_probabilities=True,
        umap_model=umap_model,             
        hdbscan_model=hdbscan_model,        
        vectorizer_model=TfidfVectorizer(stop_words=nltk.corpus.stopwords.words('italian')),  
        ctfidf_model=ctfidf_model,          
        nr_topics=None,
        language=None
)

topics, probs = model.fit_transform(docs, embeddings = emb)
model_topics_info = model.get_topic_info()
print('Old Topics')
print(model_topics_info)
new_topics = model.reduce_outliers(docs, topics)
model_topics_info = model.get_topic_info()
print('Topics After Reduction')
print(model_topics_info)
print("New Topics:", new_topics)
#update topics
model.update_topics(docs, topics=new_topics)
model_topics_info = model.get_topic_info()
print('Topics After Updating')
print(model_topics_info)

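There are 25 old topics in total:

After .reduce_outliers and .update_topics, there are 25 new topics in total:

For the embedding of a single new document (I am using an HF model with a 16k-token context, and the embedding has shape (1, 768)):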

from . import Embedder
new_doc = """string containing many words in italian"""
embedder = Embedder(model)
emb = embedder.generate_embeddings(document=new_doc)  # keep the result; emb is used below
# add a batch dimension and wrap the single document in a list
emb = np.expand_dims(np.array(emb), axis = 0)
new_doc = [new_doc]

prediction = model.transform(documents=new_doc, embeddings=emb)

prediction = ([-1],
 array([[1.16350449e-02, 2.01218509e-01, 1.11319454e-02, 7.07114037e-04,
         3.97837834e-03, 2.08542765e-04, 3.01921767e-03, 7.32074922e-02,
         4.58492993e-01, 9.87979027e-03, 9.41436941e-03, 1.09079184e-02,
         1.02977729e-02, 9.64706333e-03, 1.12269956e-02, 1.04989969e-02,
         3.24051104e-03, 1.23429396e-02, 1.00992529e-02, 9.99152714e-03,
         1.06927620e-02, 1.20518505e-02, 1.06281160e-02, 4.32279154e-03]]))

Unfortunately, the probabilities of the non-outlier topics are not in ascending order :(

values = prediction[1][0]  # Extract the values from the array
indices = np.argsort(-values)  # Sort indices in decreasing order
sorted_values = values[indices]
print(sorted_values)
print(indices)

[4.58492993e-01 2.01218509e-01 7.32074922e-02 1.23429396e-02
 1.20518505e-02 1.16350449e-02 1.12269956e-02 1.11319454e-02
 1.09079184e-02 1.06927620e-02 1.06281160e-02 1.04989969e-02
 1.02977729e-02 1.00992529e-02 9.99152714e-03 9.87979027e-03
 9.64706333e-03 9.41436941e-03 4.32279154e-03 3.97837834e-03
 3.24051104e-03 3.01921767e-03 7.07114037e-04 2.08542765e-04]
[ 8  1  7 17 21  0 14  2 11 20 22 15 12 18 19  9 13 10 23  4 16  6  3  5]

Thanks for your support! :)

bq8i3lrv3#

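"Unfortunately, the probabilities of the non-outlier topics are not in ascending order :("
They are indeed not sorted in ascending order of value, but they are in ascending order of topic ID. In other words, topic 0 should be at index 0, topic 1 at index 1, and so on.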
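
Given that ordering, the top N topics of a single document can be read straight off the probability row. Here is a minimal sketch in plain Python, using the probabilities posted earlier in this thread. The function name top_n_topics is made up for illustration, and it assumes index i corresponds to topic i as described above.

```python
def top_n_topics(prob_row, n=3):
    """Return the n most probable (topic_id, probability) pairs.

    Assumes prob_row[i] is the probability of topic i (outlier topic -1 excluded).
    """
    row = [float(p) for p in prob_row]
    # Sort topic IDs by probability, highest first.
    ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    return [(topic_id, row[topic_id]) for topic_id in ranked[:n]]


# Probability row copied from the prediction shown in this thread.
probs = [
    1.16350449e-02, 2.01218509e-01, 1.11319454e-02, 7.07114037e-04,
    3.97837834e-03, 2.08542765e-04, 3.01921767e-03, 7.32074922e-02,
    4.58492993e-01, 9.87979027e-03, 9.41436941e-03, 1.09079184e-02,
    1.02977729e-02, 9.64706333e-03, 1.12269956e-02, 1.04989969e-02,
    3.24051104e-03, 1.23429396e-02, 1.00992529e-02, 9.99152714e-03,
    1.06927620e-02, 1.20518505e-02, 1.06281160e-02, 4.32279154e-03,
]

for topic_id, probability in top_n_topics(probs, n=3):
    print(topic_id, probability)
```

With the output of model.transform, the row to pass in would be prediction[1][0]. For the values above, the three leading topics are 8, 1, and 7, which matches the argsort output posted earlier.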

q5iwbnjs4#

OK, that is good news. However, there is still a problem when making predictions.
How can I get rid of the [-1] prediction?

goucqfw65#

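Whenever you get a [-1] prediction, simply find the index of the highest value in the corresponding probability vector. That will be your non-outlier topic.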
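
That fallback can be sketched as follows. The helper name resolve_outliers and the small example data are hypothetical. The code assumes, as discussed earlier in the thread, that index i of each probability row corresponds to topic i.

```python
def resolve_outliers(topics, prob_rows):
    """Replace every -1 prediction with the most probable non-outlier topic.

    Assumes prob_rows[d][i] is the probability that document d belongs to topic i.
    """
    resolved = []
    for topic, row in zip(topics, prob_rows):
        if topic == -1:
            row = [float(p) for p in row]
            # Index of the largest probability, i.e. the best non-outlier topic.
            topic = max(range(len(row)), key=lambda i: row[i])
        resolved.append(int(topic))
    return resolved


# Hypothetical predictions for two documents: one outlier, one regular topic.
example_topics = [-1, 3]
example_probs = [[0.1, 0.6, 0.3, 0.0], [0.2, 0.2, 0.1, 0.5]]
print(resolve_outliers(example_topics, example_probs))
```

Predictions that are not -1 are left untouched, so the helper can be applied to the whole output of model.transform, for example resolve_outliers(prediction[0], prediction[1]).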

new9mtju6#

Yes, I was thinking the same thing, so it is good to know we agree! By the way, I noticed that the probability values do not always sum to 1.
[4.58492993e-01 2.01218509e-01 7.32074922e-02 1.23429396e-02 1.20518505e-02 1.16350449e-02 1.12269956e-02 1.11319454e-02 1.09079184e-02 1.06927620e-02 1.06281160e-02 1.04989969e-02 1.02977729e-02 1.00992529e-02 9.99152714e-03 9.87979027e-03 9.64706333e-03 9.41436941e-03 4.32279154e-03 3.97837834e-03 3.24051104e-03 3.01921767e-03 7.07114037e-04 2.08542765e-04]
Here they sum to 0.9088418947404864. Maybe that is why the [-1] topic still lingers: it could account for the remaining 0.09115810525951362. Even so, that is still small relative to the top topics (0.458 and 0.201).

j13ufse27#

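The -1 probability is indeed 1 minus the sum of the probabilities. However, these probabilities are generally only an approximation of the assignment and are not central to it. This is behavior specific to HDBSCAN, and you can find more information on how it handles soft clustering.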
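
The arithmetic can be checked against the values posted above. This is a small sketch using only the standard library, with illustrative variable names. The inputs are the rounded values as printed in the thread, so the result should agree with the quoted figures only up to rounding.

```python
import math

# Sorted probabilities as posted in the previous answer.
sorted_probs = [
    4.58492993e-01, 2.01218509e-01, 7.32074922e-02, 1.23429396e-02,
    1.20518505e-02, 1.16350449e-02, 1.12269956e-02, 1.11319454e-02,
    1.09079184e-02, 1.06927620e-02, 1.06281160e-02, 1.04989969e-02,
    1.02977729e-02, 1.00992529e-02, 9.99152714e-03, 9.87979027e-03,
    9.64706333e-03, 9.41436941e-03, 4.32279154e-03, 3.97837834e-03,
    3.24051104e-03, 3.01921767e-03, 7.07114037e-04, 2.08542765e-04,
]

# Total probability assigned to the non-outlier topics.
topic_mass = math.fsum(sorted_probs)
# The remainder is the mass attributed to the outlier topic -1.
outlier_mass = 1.0 - topic_mass

print("non-outlier mass:", topic_mass)
print("outlier (-1) mass:", outlier_mass)
```

Comparing outlier_mass with the largest topic probability gives a quick sense of how close a document is to being treated as an outlier.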
