BERTopic: problem with model.transform() after outlier reduction

Posted by zd287kbt 5 months ago in Other


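Hello,

I am trying to extract topics from a set of texts. Because my data may be of limited quality and half of the texts were classified as outliers, I ran an outlier reduction step after extracting the topics.
Here is the code: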

self.model = BERTopic(
    language="italian",
    top_n_words=10,
    n_gram_range=(1, 1),
    min_topic_size=40,
    nr_topics="auto",
    embedding_model=embedding_model,
    seed_topic_list=seed_topic_list,
    calculate_probabilities=False,
    # verbose=self.verbose
)
# Fit the model to the documents
topics, _ = self.model.fit_transform(self.history_tickets)

# Use this method to reduce the number of outliers taken, and get the new topics
new_topics = self.model.reduce_outliers(self.history_tickets, topics, strategy="c-tf-idf")
# Then, update the topics to the ones that considered the new data
self.model.update_topics(self.history_tickets, topics=new_topics)

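This works well. However, when I then extract the topic of a single text (one that is already contained in self.history_tickets):
prediction = self.model.transform([text])[0][0]
the prediction is -1 most of the time.
What is going wrong? Do I also need to run outlier reduction after each single prediction?
Thanks in advance!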
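
To put a number on "most of the time", one option is to compare the share of -1 labels in the reassigned training topics with the share returned by transform on the same documents. The helper below is a minimal sketch; the name outlier_share is hypothetical and not part of BERTopic.

```python
def outlier_share(topics):
    """Return the fraction of topic labels equal to -1 (the outlier class)."""
    topics = list(topics)
    if not topics:
        return 0.0
    return sum(1 for topic in topics if topic == -1) / len(topics)


# Intended use with the objects from the question (not executed here):
#   outlier_share(new_topics)                      # after reduce_outliers
#   outlier_share(model.transform(sample_docs)[0]) # labels from transform
```

A large gap between the two values would suggest that transform is not reusing the reassigned labels, even for documents seen during fitting.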

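Answer:
Based on your description, the problem may be that the model does not have enough training data to learn useful topic representations. You can try the following:

Add training data: make sure your training set is large enough and covers the range of topics and text types you expect. This helps the model learn better topic representations.
Tune the model parameters: try adjusting parameters such as the number of topics (nr_topics) or the minimum topic size (min_topic_size) to find settings that suit your data better.
Try a different model: if possible, compare against a classical topic model such as LDA or LSA. These are fitted on your own corpus rather than pre-trained, and they do not have an outlier class.
Reduce outliers after a single prediction: you can try this, but it may not change the result much, because a single text gives the model little information to work with.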

BERTopic

Source: https://github.com/MaartenGr/BERTopic/issues/1507

3 Answers


uyhoqukh1#

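The underlying clustering model, HDBSCAN, tends to assign unseen documents to the outlier class when its internal .predict-like functionality is used. You can either use .reduce_outliers, or you can save the model and load it again afterwards. Doing so removes the underlying clustering model and changes how predictions are assigned.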
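
The save-and-reload route could look roughly like the sketch below. It assumes a BERTopic version that supports serialization="safetensors", and that the embedding model is supplied again at load time. The function name reload_without_clusterer and the load_fn parameter are hypothetical and exist only to keep the example self-contained.

```python
def reload_without_clusterer(topic_model, path, embedding_model=None, load_fn=None):
    """Save a fitted topic model with safetensors and load it back for inference.

    Assumption: the reloaded model no longer carries the fitted HDBSCAN
    clusterer, so transform assigns topics differently than before.
    """
    # save_ctfidf=True keeps the c-TF-IDF matrix alongside the topic data.
    # save_embedding_model=False because the embedding model is passed at load time.
    topic_model.save(
        path,
        serialization="safetensors",
        save_ctfidf=True,
        save_embedding_model=False,
    )
    if load_fn is None:
        # Imported lazily so the helper itself has no hard dependency.
        from bertopic import BERTopic
        load_fn = BERTopic.load
    return load_fn(path, embedding_model=embedding_model)
```

With the objects from the question, this would be called as self.model = reload_without_clusterer(self.model, "topic_model_dir", embedding_model=embedding_model), where the directory name is a placeholder.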

5 months ago

u3r8eeie2#

Thank you for the response.
I am trying to use .reduce_outliers after the .transform call:

def get_message_topic(self, message, preprocess=True):
        """
        Returns the topic prediction related to the input message (used when a new (message,feedback) arrives).
        :param preprocess: to choose whether to preprocess the input or not (bool)
        :param message: message (str)
        :return: prediction of the input message (DataFrame)
        """
        if preprocess:
            message = preprocess_text(message)

        # predict new text's topics with BERTopic
        prediction = self.model.transform([message])

        prediction = self.model.reduce_outliers([message], [prediction], strategy="distributions")

        if self.verbose:
            logger.info(f"Topic for the message {message} is: {prediction[0][0]}")

            return prediction[0][0]

but when this method is called, the following error is raised:
File "C:\Users\mcitterio\PycharmProjects\generative-ai-model-control\codega\drift\topic_modeling\topic_modeling.py", line 176, in get_message_topic
prediction = self.model.reduce_outliers([message], [prediction], strategy="distributions")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mcitterio\PycharmProjects\generative-ai-model-control\venv\Lib\site-packages\bertopic\_bertopic.py", line 2109, in reduce_outliers
topic_distr, _ = self.approximate_distribution(outlier_docs, min_similarity=threshold, **distributions_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mcitterio\PycharmProjects\generative-ai-model-control\venv\Lib\site-packages\bertopic\_bertopic.py", line 1241, in approximate_distribution
topic_distributions = np.vstack(topic_distributions)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<array_function internals>", line 180, in vstack
File "C:\Users\mcitterio\PycharmProjects\generative-ai-model-control\venv\Lib\site-packages\numpy\core\shape_base.py", line 282, in vstack
return _nx.concatenate(arrs, 0)
^^^^^^^^^^^^^^^^^^^^^^^^
File "<array_function internals>", line 180, in concatenate
ValueError: need at least one array to concatenate
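
One plausible reading of this traceback: transform returns a (topics, probabilities) tuple, so [prediction] is a list holding one tuple rather than a list of topic ids. No element of that list equals -1, the set of outlier documents ends up empty, and np.vstack is called on an empty list. The sketch below unpacks the topics first and calls reduce_outliers only when the prediction is actually -1. The function name predict_topic and the default strategy are illustrative assumptions, not code from the thread; model is assumed to be a fitted BERTopic instance.

```python
def predict_topic(model, message, strategy="c-tf-idf"):
    """Predict the topic of one message, reassigning it if it is an outlier."""
    # transform returns (topics, probabilities); only the topic ids are needed.
    topics, _ = model.transform([message])
    topics = [int(topic) for topic in topics]

    # Only call reduce_outliers when there is an outlier to reassign;
    # otherwise there are no outlier documents to process.
    if topics[0] == -1:
        topics = model.reduce_outliers([message], topics, strategy=strategy)

    return int(topics[0])
```

The result can still be -1 if the chosen strategy finds no sufficiently similar topic, so callers may want to handle that case explicitly.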