I've noticed that bertopic 0.14.0 produces lower-quality topics than 0.13.0 when the nr_topics parameter is specified. Here is my test script:
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.datasets import fetch_20newsgroups

# Load the 20 Newsgroups training set without headers, footers, and quotes
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# Down-weight very frequent words in the c-TF-IDF representation
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Request as many topics as there are newsgroups (20)
topic_model = BERTopic(nr_topics=len(newsgroups_train['target_names']),
                       ctfidf_model=ctfidf_model,
                       calculate_probabilities=True)
topic_model.fit(newsgroups_train['data'])
print(topic_model.get_topic_info())
With bertopic==0.13.0:
    Topic  Count  Name
0      -1   4688  -1_maxaxaxaxaxaxaxaxaxaxaxaxaxaxax_for_on_you
1       0    700  0_car_bike_cars_my
2       1    638  1_drive_scsi_drives_disk
3       2    575  2_gun_guns_militia_firearms
4       3    547  3_key_encryption_clipper_chip
5       4    539  4_team_hockey_550_game
6       5    527  5_patients_msg_medical_disease
7       6    483  6_year_baseball_pitching_he
8       7    405  7_card_monitor_video_vga
9       8    375  8_israel_turkish_jews_israeli
10      9    317  9_ditto_ites_hello_hi
11     10    199  10_god_jesus_hell_he
12     11    182  11_window_widget_colormap_server
13     12    173  12_morality_truth_god_moral
14     13    172  13_fbi_koresh_compound_batf
15     14    171  14_amp_condition_scope_offer
16     15    141  15_atheists_atheism_god_universe
17     16    131  16_printer_fonts_font_print
18     17    118  17_ted_post_challenges_you
19     18    118  18_windows_dos_cview_swap
20     19    115  19_xfree86_libxmulibxmuso_symbol_undefined
With bertopic==0.14.0:
    Topic  Count  Name
0      -1   3334  -1_you_it_for_is
1       0   4402  0_for_with_on_be
2       1    620  1_god_stephanopoulos_that_mr
3       2    559  2_patients_medical_msg_health
4       3    437  3_space_launch_nasa_lunar
5       4    436  4_israel_were_turkish_armenian
6       5    376  5_car_bike_cars_dog
7       6    296  6_gun_guns_firearms_militia
8       7    230  7_morality_objective_gay_moral
9       8    139  8_symbol_xterm_libxmulibxmuso_server
10      9    119  9_printer_ink_print_hp
11     10     94  10_requests_send_address_list
12     11     88  11_radar_detector_detectors_radio
13     12     42  12_church_pope_schism_mormons
14     13     40  13_ground_battery_grounding_conductor
15     14     36  14_tax_taxes_deficit_income
16     15     24  15_marriage_married_ceremony_commitment
17     16     20  16_maxaxaxaxaxaxaxaxaxaxaxaxaxaxax_mg9vg9vg9vg...
18     17     12  17_ditto_hello_hi_too
19     18     10  18_professors_tas_phds_teaching
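One rough way to put a number on the difference between the two runs is to look at how the documents are distributed over topics. The sketch below uses only the Count values copied from the two outputs above. The `summarize` helper and the two metrics it returns (share of outliers, and share of the non-outlier documents that land in the single largest topic) are illustrative choices for this comparison, not part of bertopic.

```python
# Topic sizes copied from the two get_topic_info() outputs above.
# Index 0 is the outlier topic (-1); the rest are the regular topics in order.
counts_0_13 = [4688, 700, 638, 575, 547, 539, 527, 483, 405, 375, 317,
               199, 182, 173, 172, 171, 141, 131, 118, 118, 115]
counts_0_14 = [3334, 4402, 620, 559, 437, 436, 376, 296, 230, 139, 119,
               94, 88, 42, 40, 36, 24, 20, 12, 10]


def summarize(counts):
    """Hypothetical helper: return (outlier share, largest-topic share).

    The largest-topic share is measured among documents that were
    assigned to a regular topic, i.e. excluding the outlier topic.
    """
    total = sum(counts)
    outliers, topics = counts[0], counts[1:]
    assigned = total - outliers
    return outliers / total, max(topics) / assigned


for version, counts in [("0.13.0", counts_0_13), ("0.14.0", counts_0_14)]:
    outlier_share, largest_share = summarize(counts)
    print(f"{version}: outliers={outlier_share:.1%}, "
          f"largest topic={largest_share:.1%} of assigned documents")
```

With 0.14.0, more than half of the non-outlier documents end up in topic 0, whose top words are all stop words, whereas with 0.13.0 the largest topic holds a much smaller share. On a fitted model the same counts could presumably be read from the Count column of `get_topic_info()` instead of being typed in by hand.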
1 answer
cu6pst1q #1
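That is indeed a big difference! I updated the underlying algorithm of nr_topics to prevent any topics from being merged into the outliers, and I was quite happy with the results, but this seems to show something entirely different. I'll test it in more detail to see whether the same thing happens with other datasets. If it does, this may be a bug, or I may simply revert to the old algorithm.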