BERTopic Topic modeling regression in 0.14.0 with nr_topics

r7knjye2  posted 6 months ago  in Other

I have noticed that version 0.14.0 of bertopic produces lower-quality topics when the nr_topics parameter is specified. Here is my test script:

    from bertopic import BERTopic
    from bertopic.vectorizers import ClassTfidfTransformer
    from sklearn.datasets import fetch_20newsgroups

    # Load the 20 Newsgroups training split without headers, footers and quotes
    newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

    # c-TF-IDF model that down-weights frequent words
    ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

    # Ask for as many topics as there are newsgroup categories
    topic_model = BERTopic(nr_topics=len(newsgroups_train['target_names']),
                           ctfidf_model=ctfidf_model,
                           calculate_probabilities=True)
    topic_model.fit(newsgroups_train['data'])
    print(topic_model.get_topic_info())

With bertopic==0.13.0:

        Topic  Count  Name
     0     -1   4688  -1_maxaxaxaxaxaxaxaxaxaxaxaxaxaxax_for_on_you
     1      0    700  0_car_bike_cars_my
     2      1    638  1_drive_scsi_drives_disk
     3      2    575  2_gun_guns_militia_firearms
     4      3    547  3_key_encryption_clipper_chip
     5      4    539  4_team_hockey_550_game
     6      5    527  5_patients_msg_medical_disease
     7      6    483  6_year_baseball_pitching_he
     8      7    405  7_card_monitor_video_vga
     9      8    375  8_israel_turkish_jews_israeli
    10      9    317  9_ditto_ites_hello_hi
    11     10    199  10_god_jesus_hell_he
    12     11    182  11_window_widget_colormap_server
    13     12    173  12_morality_truth_god_moral
    14     13    172  13_fbi_koresh_compound_batf
    15     14    171  14_amp_condition_scope_offer
    16     15    141  15_atheists_atheism_god_universe
    17     16    131  16_printer_fonts_font_print
    18     17    118  17_ted_post_challenges_you
    19     18    118  18_windows_dos_cview_swap
    20     19    115  19_xfree86_libxmulibxmuso_symbol_undefined

With bertopic==0.14.0:

        Topic  Count  Name
     0     -1   3334  -1_you_it_for_is
     1      0   4402  0_for_with_on_be
     2      1    620  1_god_stephanopoulos_that_mr
     3      2    559  2_patients_medical_msg_health
     4      3    437  3_space_launch_nasa_lunar
     5      4    436  4_israel_were_turkish_armenian
     6      5    376  5_car_bike_cars_dog
     7      6    296  6_gun_guns_firearms_militia
     8      7    230  7_morality_objective_gay_moral
     9      8    139  8_symbol_xterm_libxmulibxmuso_server
    10      9    119  9_printer_ink_print_hp
    11     10     94  10_requests_send_address_list
    12     11     88  11_radar_detector_detectors_radio
    13     12     42  12_church_pope_schism_mormons
    14     13     40  13_ground_battery_grounding_conductor
    15     14     36  14_tax_taxes_deficit_income
    16     15     24  15_marriage_married_ceremony_commitment
    17     16     20  16_maxaxaxaxaxaxaxaxaxaxaxaxaxaxax_mg9vg9vg9vg...
    18     17     12  17_ditto_hello_hi_too
    19     18     10  18_professors_tas_phds_teaching
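
To put a number on the difference between the two runs, the Count columns above can be compared directly. The following is only a rough sketch: the counts are copied by hand from the two tables, and the helper name `summarize` and the statistics it reports are illustrative choices, not part of BERTopic.

```python
# Counts copied from the two get_topic_info() tables above.
# The first entry of each list is the outlier topic (-1).
counts_0_13_0 = [4688, 700, 638, 575, 547, 539, 527, 483, 405, 375, 317,
                 199, 182, 173, 172, 171, 141, 131, 118, 118, 115]
counts_0_14_0 = [3334, 4402, 620, 559, 437, 436, 376, 296, 230, 139, 119,
                 94, 88, 42, 40, 36, 24, 20, 12, 10]


def summarize(counts):
    """Hypothetical helper: basic size statistics for one run.

    counts[0] is taken to be the outlier topic (-1); the rest are real topics.
    """
    outliers, topics = counts[0], counts[1:]
    assigned = sum(topics)
    total = outliers + assigned
    return {
        "total_docs": total,
        "n_topics": len(topics),
        # Share of all documents left in the outlier topic
        "outlier_share": outliers / total,
        # Share of non-outlier documents that fall in the biggest topic
        "largest_topic_share": max(topics) / assigned,
    }


for version, counts in (("0.13.0", counts_0_13_0), ("0.14.0", counts_0_14_0)):
    print(version, summarize(counts))
```

Under these assumptions, a largest-topic share far above the other topics' shares indicates that one generic topic has absorbed most of the documents, which is the symptom visible in topic 0 of the 0.14.0 table.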

cu6pst1q  #1

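That is indeed a big difference! I updated the underlying algorithm for nr_topics to prevent any topics from being merged into the outliers, and I was quite happy with the results, but this seems to show something else entirely. I will test it in more detail to see whether the same thing happens with other datasets. If it does, then it may be a bug, or I might simply revert to the old algorithm.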
