BERTopic Topic modeling regression in 0.14.0 with nr_topics

r7knjye2  posted 6 months ago in Other

I have noticed that bertopic 0.14.0 produces lower-quality topics when the nr_topics parameter is specified. Here is my test script:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

from bertopic.vectorizers import ClassTfidfTransformer

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
topic_model = BERTopic(nr_topics=len(newsgroups_train['target_names']),
                        ctfidf_model=ctfidf_model,
                        calculate_probabilities=True)
topic_model.fit(newsgroups_train['data'])

print(topic_model.get_topic_info())

With bertopic==0.13.0:

    Topic  Count                                           Name
0      -1   4688  -1_maxaxaxaxaxaxaxaxaxaxaxaxaxaxax_for_on_you
1       0    700                             0_car_bike_cars_my
2       1    638                       1_drive_scsi_drives_disk
3       2    575                    2_gun_guns_militia_firearms
4       3    547                  3_key_encryption_clipper_chip
5       4    539                         4_team_hockey_550_game
6       5    527                 5_patients_msg_medical_disease
7       6    483                    6_year_baseball_pitching_he
8       7    405                       7_card_monitor_video_vga
9       8    375                  8_israel_turkish_jews_israeli
10      9    317                          9_ditto_ites_hello_hi
11     10    199                           10_god_jesus_hell_he
12     11    182               11_window_widget_colormap_server
13     12    173                    12_morality_truth_god_moral
14     13    172                    13_fbi_koresh_compound_batf
15     14    171                   14_amp_condition_scope_offer
16     15    141               15_atheists_atheism_god_universe
17     16    131                    16_printer_fonts_font_print
18     17    118                     17_ted_post_challenges_you
19     18    118                      18_windows_dos_cview_swap
20     19    115     19_xfree86_libxmulibxmuso_symbol_undefined

With bertopic==0.14.0:

    Topic  Count                                               Name
0      -1   3334                                   -1_you_it_for_is
1       0   4402                                   0_for_with_on_be
2       1    620                       1_god_stephanopoulos_that_mr
3       2    559                      2_patients_medical_msg_health
4       3    437                          3_space_launch_nasa_lunar
5       4    436                     4_israel_were_turkish_armenian
6       5    376                                5_car_bike_cars_dog
7       6    296                        6_gun_guns_firearms_militia
8       7    230                     7_morality_objective_gay_moral
9       8    139               8_symbol_xterm_libxmulibxmuso_server
10      9    119                             9_printer_ink_print_hp
11     10     94                      10_requests_send_address_list
12     11     88                  11_radar_detector_detectors_radio
13     12     42                      12_church_pope_schism_mormons
14     13     40              13_ground_battery_grounding_conductor
15     14     36                        14_tax_taxes_deficit_income
16     15     24            15_marriage_married_ceremony_commitment
17     16     20  16_maxaxaxaxaxaxaxaxaxaxaxaxaxaxax_mg9vg9vg9vg...
18     17     12                              17_ditto_hello_hi_too
19     18     10                    18_professors_tas_phds_teaching
cu6pst1q  #1

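That is indeed quite a big difference! I updated the underlying algorithm of nr_topics to prevent any topics from being merged into the outliers, and I was quite happy with the results, but this seems to show something entirely different. I will test it in more detail to see whether the same thing happens with other datasets. If it does, it might be a bug, or I might simply revert to the old algorithm.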
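
To compare runs across versions or datasets, it may help to reduce each get_topic_info() table to a few numbers. The following is only a minimal sketch: the helper name summarize_topic_sizes is hypothetical (it is not part of BERTopic), and the counts are copied by hand from the two tables in the question. It assumes that topic -1 marks the outliers, as in those tables.

```python
def summarize_topic_sizes(topic_counts):
    """Summarize how documents are spread over topics.

    topic_counts: iterable of (topic_id, document_count) pairs,
    where topic_id == -1 is assumed to be the outlier topic.
    """
    topic_counts = list(topic_counts)
    total = sum(count for _, count in topic_counts)
    outliers = sum(count for topic, count in topic_counts if topic == -1)
    sizes = [count for topic, count in topic_counts if topic != -1]
    return {
        "n_topics": len(sizes),                    # topics excluding outliers
        "outlier_share": outliers / total,         # fraction of docs in topic -1
        "largest_topic_share": max(sizes) / total  # fraction in the biggest topic
    }


# Counts copied from the tables above; the first entry is topic -1.
counts_0_13 = [4688, 700, 638, 575, 547, 539, 527, 483, 405, 375, 317,
               199, 182, 173, 172, 171, 141, 131, 118, 118, 115]
counts_0_14 = [3334, 4402, 620, 559, 437, 436, 376, 296, 230, 139, 119,
               94, 88, 42, 40, 36, 24, 20, 12, 10]

# Topic ids run -1, 0, 1, ... in the same order as the table rows.
summary_0_13 = summarize_topic_sizes(zip(range(-1, len(counts_0_13) - 1), counts_0_13))
summary_0_14 = summarize_topic_sizes(zip(range(-1, len(counts_0_14) - 1), counts_0_14))

print("0.13.0:", summary_0_13)
print("0.14.0:", summary_0_14)
```

With a fitted model, the same pairs could be taken from the Topic and Count columns shown above, for example info = topic_model.get_topic_info() followed by zip(info["Topic"], info["Count"]). On the numbers in this thread, the summary shows that 0.14.0 returns 19 non-outlier topics instead of 20 and that its topic 0 absorbs a much larger share of the documents than any single topic did in 0.13.0.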
