BERTopic: requesting representative documents can fail after multiple rounds of partial_fit in online topic modeling

Posted by dsf9zpds 9 months ago in Other

Hello,
I am using River for online topic modeling, invoked as follows (for example, for a list of 200 documents):

first = docs[:100]
second = docs[100:]
self.topic_model.partial_fit(first)
self.topic_model._save_representative_docs(docs)
self.topic_model.partial_fit(second)
self.topic_model._save_representative_docs(docs)

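After the second call, I want to check whether River has produced any new clusters and, if so, retrieve the representative documents for them. BERTopic does not do this after partial_fit, so I am running it manually. However, the second call to the save method raises an exception: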

if ensure_min_samples > 0:
            n_samples = _num_samples(array)
            if n_samples < ensure_min_samples:
>               raise ValueError(
                    "Found array with %d sample(s) (shape=%s) while a"
                    " minimum of %d is required%s."
                    % (n_samples, array.shape, ensure_min_samples, context)
                )
E               ValueError: Found array with 0 sample(s) (shape=(0, 1018)) while a minimum of 1 is required by the normalize function.
.venv/lib/python3.11/site-packages/sklearn/utils/validation.py:967: ValueError

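This appears to happen because the first cluster, created during the first call, has zero samples in the second batch of documents. In that case, it would be great if the library skipped empty clusters and emitted only the representative documents it actually has.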
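The requested "skip empty clusters" behavior can be sketched outside the library. The following is a minimal, hypothetical helper, not BERTopic's implementation: the function name, its parameters, and the caller-supplied `score` function are all assumptions made for illustration. It only shows the guard that avoids handing a zero-row input to a scoring step.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def representative_docs_skipping_empty(
    docs: Sequence[str],
    labels: Sequence[int],
    known_topics: Sequence[int],
    score: Callable[[str, int], float],
    n_docs: int = 3,
) -> Dict[int, List[str]]:
    """Return up to n_docs top-scoring documents per topic, omitting empty topics.

    Hypothetical helper: `score(doc, topic)` stands in for whatever
    similarity measure the caller wants to use.
    """
    if len(docs) != len(labels):
        raise ValueError("docs and labels must have the same length")

    # Group the documents by their assigned topic id.
    by_topic: Dict[int, List[str]] = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_topic[label].append(doc)

    result: Dict[int, List[str]] = {}
    for topic in known_topics:
        members = by_topic.get(topic, [])
        if not members:
            # This topic has no documents here: skip it instead of scoring nothing.
            continue
        ranked = sorted(members, key=lambda d: score(d, topic), reverse=True)
        result[topic] = ranked[:n_docs]
    return result


# Topic 0 is known from an earlier batch but has no documents in this one.
reps = representative_docs_skipping_empty(
    docs=["cats purr", "dogs bark loudly", "dogs bark"],
    labels=[1, 2, 2],
    known_topics=[0, 1, 2],
    score=lambda doc, topic: len(doc),  # toy score: longer documents rank higher
    n_docs=1,
)
print(reps)
```

With the toy score above, topic 0 is simply absent from the result, while topics 1 and 2 each return their single best document.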

BERTopic

Source: https://github.com/MaartenGr/BERTopic/issues/1620

1 Answer

uyto3xhc #1

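Extracting representative documents has never actually been supported in online topic modeling, and in .partial_fit in particular. The reason is that the notion of a representative document changes over time as new clusters are discovered and old ones are updated. By default, the user does not have access to all documents at all times, so recalculating the representative documents is not straightforward.
In addition, using private functions is not recommended and will not be supported in future development. Private functions may change over time and are not part of BERTopic's public API. Any use of private functions is at the user's own risk.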
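One way to stay on the public side of the API is to keep the documents and their topic assignments yourself. The sketch below is a hypothetical `BatchTracker` class, not part of BERTopic. It assumes only that the model has a `partial_fit(batch)` method and that, afterwards, its `topics_` attribute holds one topic id per document of the batch just processed, which is how BERTopic's online topic modeling documentation describes it. Whether that holds for a given version should be checked before relying on it.

```python
from typing import Any, List, Sequence, Set


class BatchTracker:
    """Accumulates documents and topic ids across partial_fit calls.

    Hypothetical helper for illustration; it never touches private methods.
    """

    def __init__(self) -> None:
        self.docs: List[str] = []
        self.topics: List[int] = []
        self.seen_topics: Set[int] = set()

    def fit_batch(self, model: Any, batch: Sequence[str]) -> Set[int]:
        """Fit one batch and return the topic ids not seen in earlier batches."""
        model.partial_fit(batch)
        # Assumption: topics_ now describes exactly the batch just passed in.
        batch_topics = [int(t) for t in model.topics_]
        if len(batch_topics) != len(batch):
            raise ValueError("topics_ does not line up with the batch")

        self.docs.extend(batch)
        self.topics.extend(batch_topics)

        new_topics = set(batch_topics) - self.seen_topics
        self.seen_topics |= new_topics
        return new_topics

    def docs_for(self, topic: int) -> List[str]:
        """Return every stored document assigned to the given topic."""
        return [d for d, t in zip(self.docs, self.topics) if t == topic]
```

Under those assumptions, calling `tracker.fit_batch(topic_model, first)` and then `tracker.fit_batch(topic_model, second)` would return the topic ids that first appeared in each batch, and `tracker.docs_for(topic)` would give the candidate documents for a new cluster. The trade-off is that all documents stay in memory, which is exactly the cost the answer points out that online topic modeling normally avoids.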

Answered 9 months ago
