8 Answers
wz1wpwve1#
I would suggest converting the documents into sentences or paragraphs before passing them to BERTopic. Since such large documents most likely contain multiple topics, splitting them up will definitely help.
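A minimal sketch of that splitting step, assuming a naive regex-based sentence splitter (in practice a library such as nltk or spaCy is more robust); the `docs` here are made-up examples, and BERTopic would then be fit on `sentences` while `doc_ids` remembers which document each sentence came from:

```python
import re

def split_into_sentences(doc: str) -> list[str]:
    # Naive splitter: break on ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]

docs = [
    "Cats are great pets. Quantum computers use qubits.",
    "Football season starts soon. Stadium tickets sold out!",
]

# Flatten: every sentence becomes its own "document" for BERTopic,
# while doc_ids records the index of its original document.
sentences, doc_ids = [], []
for i, doc in enumerate(docs):
    for sent in split_into_sentences(doc):
        sentences.append(sent)
        doc_ids.append(i)
```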
sauutmhj2#
假设我们想要从整个文档中提取主题。因此,我们执行以下步骤:
I'm interested in how best to perform step 3. I think with this approach we would end up with paragraph-level topic weights, but it's unclear how to combine those paragraph-level results into an overall document-level result, i.e. what are the most common topics in a document? I'm sure I could come up with a method, but I'd like to know whether this is the recommended way to solve this (presumably fairly common) problem, or whether you can point me to any examples.
wvmv3b1j3#
You could aggregate the distributions based on the lengths of the texts. The topic distribution would then be the percentage of text that was classified as each topic.
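One way to read that suggestion, as a sketch: weight each sentence's assigned topic by the sentence's length in characters, then normalise so the shares sum to 1. The sentences and topic ids below are made up for illustration:

```python
from collections import defaultdict

def length_weighted_distribution(sentences, topics):
    """Per-document topic distribution, weighted by sentence length."""
    weights = defaultdict(float)
    total = 0
    for sent, topic in zip(sentences, topics):
        weights[topic] += len(sent)
        total += len(sent)
    # Normalise so the topic shares sum to 1.
    return {t: w / total for t, w in weights.items()}

# One document split into three sentences with assigned topic ids:
dist = length_weighted_distribution(
    ["A long sentence about topic zero.", "Short one.", "Short two."],
    [0, 1, 1],
)
```

Here topic 0 gets a larger share than a plain sentence count would give it, because its sentence is longer.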
knpiaxh14#
MaartenGr: Is there a code example for this aggregation step?
44u64gxh5#
@clstaudt There isn't but it should be relatively straightforward. You could save the results in a dataframe which would have sentences with their assigned topics and the ID of their document. Then, simply count how often a topic appears in each document based on the collection of sentences. Other than that, you could look at using `.approximate_distribution`
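The dataframe-counting idea could look like this; the `doc_id` and `topic` values below are hypothetical stand-ins for what a fitted topic model would assign to each sentence:

```python
import pandas as pd

# Hypothetical per-sentence results: one topic id per sentence,
# plus the id of the document each sentence came from.
df = pd.DataFrame({
    "doc_id": [0, 0, 0, 1, 1],
    "topic":  [3, 3, 7, 7, 7],
})

# How often each topic appears in each document...
counts = df.groupby("doc_id")["topic"].value_counts()

# ...and the single most frequent topic per document.
dominant = df.groupby("doc_id")["topic"].agg(lambda s: s.mode()[0])
```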
sq1bmfud6#
Take a look at both of these; they helped me a ton: https://medium.com/@armandj.olivares/using-bert-for-classifying-documents-with-long-texts-5c3e7b04573d
https://arxiv.org/abs/1910.10781
alen0pnh7#
I'm very interested in this, but I have a question: it seems to me there might be another good approach. How could this be implemented with `bertopic`? Does that make sense? Thanks!
ao218c7q8#
Splitting long documents before passing them to BERTopic is generally recommended. But be careful when merging embeddings: if the sentences cover clearly different topics, the averaged embedding can become muddled.
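A toy illustration of that muddling effect, using made-up 2-D vectors in place of real sentence embeddings (which would come from e.g. sentence-transformers):

```python
import numpy as np

# Toy embeddings for two sentences from the same document but
# about very different topics.
emb_sports  = np.array([1.0, 0.0])
emb_physics = np.array([0.0, 1.0])

# The averaged "document embedding" lands between the two clusters
# and represents neither topic well.
doc_embedding = np.mean([emb_sports, emb_physics], axis=0)
```

Here `doc_embedding` is equally far from both topic directions, which is why per-sentence topic assignment plus counting (as suggested above in this thread) is often preferable to averaging first.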