BERTopic: getting representative documents after online fitting

but5z9lq posted 5 months ago in Other


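To process a large dataset, partial_fit is typically called many times. In that case, the representative_docs_ attribute does not seem to be populated. Is there a simple way to get the representative documents in this situation?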

BERTopic

Source: https://github.com/MaartenGr/BERTopic/issues/1679

6 answers


guz6ccqo1#

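You have to use internal functions to extract them. I believe there are a few open issues related to this, so I'd suggest searching for those.
You could also use .merge_models to get partial_fit-like behavior, but I don't think that approach saves the representative documents either.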
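
As background on what "representative documents" means here: roughly speaking, they are the documents in a topic whose term weights are most similar to the topic's own term weights. The sketch below is a simplified, pure-Python illustration of that idea and is not BERTopic's implementation. The helper names (term_vector, cosine, representative_docs) are hypothetical, and raw term counts stand in for the c-TF-IDF weights that BERTopic works with.

```python
import math
from collections import Counter, defaultdict


def term_vector(text):
    """Lowercased bag-of-words counts for one piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def representative_docs(docs, topics, nr_repr_docs=2):
    """Map each topic to its nr_repr_docs most topic-like documents."""
    by_topic = defaultdict(list)
    for doc, topic in zip(docs, topics):
        by_topic[topic].append(doc)

    result = {}
    for topic, members in by_topic.items():
        # Topic vector: term counts summed over every document in the topic.
        topic_vec = Counter()
        for doc in members:
            topic_vec.update(term_vector(doc))
        # Rank the topic's documents by similarity to the topic vector.
        ranked = sorted(
            members,
            key=lambda d: cosine(term_vector(d), topic_vec),
            reverse=True,
        )
        result[topic] = ranked[:nr_repr_docs]
    return result


# Toy data (assumed): five documents with already-assigned topic ids.
docs = [
    "cats purr and cats nap",
    "cats chase mice",
    "stocks fell sharply",
    "stocks and bonds fell",
    "cats nap",
]
topics = [0, 0, 1, 1, 0]
repr_docs = representative_docs(docs, topics, nr_repr_docs=1)
```

The result has the same shape as representative_docs_: a dictionary that maps each topic id to a short list of documents.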

5 months ago

9rnv2umw2#

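@MaartenGr Thanks! Following your suggestion, here is a snippet built on the internal functions _create_topic_vectors and _save_representative_docs.
Assume docs holds the documents, embeds their embeddings, topic_model the model fitted online, and train_idxs the indices in shuffled order (if applicable). We first populate the topic representations, and then we can populate the representative documents: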

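import pandas as pd

# Assumption: topic_model.topics_ covers every document in docs.loc[train_idxs];
# after partial_fit it may hold only the last batch unless you accumulate it.
# Combine topics and documents in the layout the internal functions require.
doc_topic = pd.DataFrame({
    'Topic': topic_model.topics_,
    'ID': range(len(topic_model.topics_)),
    'Document': docs.loc[train_idxs],
})
# Populate the topic embeddings.
topic_model._create_topic_vectors(doc_topic, embeds[train_idxs])
# Alternative, left disabled as in the original snippet:
# topic_model._save_representative_docs(doc_topic)
# Extract the representative documents with explicit sample sizes.
repr_docs, _, _, _ = topic_model._extract_representative_docs(
    topic_model.c_tf_idf_,
    doc_topic,
    topic_model.topic_representations_,
    nr_samples=1000,
    nr_repr_docs=5,
)
topic_model.representative_docs_ = repr_docs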

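I tested this on more than 1 million documents; here is an example (image.png): https://github.com/MaartenGr/BERTopic/assets/31315784/93dadba8-5bfd-4f71-a3da-2a670a91ba9a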

5 months ago

sd2nnvve3#

Awesome, thanks for sharing! Other users will surely benefit from the snippet provided here.

5 months ago

g0czyy6m4#

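Great, thank you for sharing! Other users will surely benefit from having this snippet here.
@MaartenGr If you don't mind, I'd be happy to open a PR that slightly extends the online tutorial example to show how these internal functions can be used on News20.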

5 months ago

aor9mmx15#

[0] Lin, Xule reacted to your message: ...

From: Maciej Skorski ***@***.***>
Sent: Thursday, January 11, 2024, 1:12:02 AM
To: MaartenGr/BERTopic ***@***.***>
Cc: Subscribed ***@***.***>
Subject: Re: [MaartenGr/BERTopic] Getting representative documents after online fitting (Issue #1679)