Saving BERTopic model when using Parametric UMAP

pkmbmrz7  posted 2 months ago in Other
Follow (0) | Answers (3) | Views (39)

Hello,
Thank you so much for all your help. I have built a model that meets all my needs, and so far the results are as expected. I need to save the model every month, load it, and transform new data. For dimensionality reduction I use Parametric UMAP instead of the original UMAP, because the parametric version gives deterministic results and is not affected by batching, and I am very happy with the results. The problem, however, is that the model cannot be saved: every way I have tried to save it has failed. I am wondering whether I could save the dimensionality-reduction model (the umap_model component in BERTopic) separately and, once the trained clustering model is loaded, swap it back in without affecting the rest of the model. Do you have any suggestions? This is the last step of my project, and if I cannot save the model all of this effort will be wasted. I would really appreciate any ideas for solving this.
P.S. When I try the safetensors or pytorch methods, I get this error during loading:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_331/1509421400.py in <module>
      1 from bertopic import BERTopic
----> 2 loaded_model = BERTopic.load('/home/mmotall/complaints_subcat/model_training/models_developped/bert/model-save-test/topics_model')

~/venv/lib/python3.7/site-packages/bertopic/_bertopic.py in load(cls, path, embedding_model)
   3006         else:
   3007             raise ValueError("Make sure to either pass a valid directory or HF model.")
-> 3008         topic_model = _create_model_from_files(topics, params, tensors, ctfidf_tensors, ctfidf_config, images)
   3009 
   3010         # Replace embedding model if one is specifically chosen

~/venv/lib/python3.7/site-packages/bertopic/_bertopic.py in _create_model_from_files(topics, params, tensors, ctfidf_tensors, ctfidf_config, images)
   4022 
   4023         # CountVectorizer
-> 4024         topic_model.vectorizer_model = CountVectorizer(**ctfidf_config["vectorizer_model"]["params"])
   4025         topic_model.vectorizer_model.vocabulary_ = ctfidf_config["vectorizer_model"]["vectorizer_model"]["vocab"]
   4026 

~/venv/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

TypeError: __init__() got an unexpected keyword argument 'norm'

When I save the model as a pickle, everything is saved correctly except the dimensionality-reduction model (the Parametric UMAP). So I am wondering whether I can save the Parametric UMAP independently and then attach it to the loaded BERTopic model. Is that possible?


yebdmbv41#

Could you share the full code you use for training, saving, and loading the BERTopic model? That would make it easier to debug what is happening here. I typically recommend safetensors, but you mention that it does not work either. Likewise, could you share the full code for both examples, including the error logs?


bvn4nwqk2#

Thank you for the quick reply. The reason I cannot use safetensors or PyTorch is that those methods do not preserve the model components, including the UMAP. The nature of my work requires loading the whole model and transforming new embeddings: I train on historical data, and then every month I need to load the model and run transform on new embeddings. That means I need to keep the dimensionality-reduction model I am using, which is why I need the pickle-based saving approach.

I realize that pickle works fine with the original UMAP, but I use Parametric UMAP because it is more stable when transforming new data and does not depend on batching. Parametric UMAP contains TensorFlow components (the neural-network encoder) that are not picklable. So I decided to save the whole BERTopic model as a pickle and, when loading it in another notebook, replace topic_model.umap_model with the trained Parametric UMAP that I saved separately.

Let me first share my whole model and show you how I load it again in another notebook.
Here is the code for training BERTopic:

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from umap import UMAP
from umap.parametric_umap import ParametricUMAP
import tensorflow as tf
import numpy as np

import random
random_state = 42
np.random.seed(random_state)
tf.random.set_seed(random_state)
random.seed(random_state)

# Define an encoder for Parametric UMAP. Parametric UMAP is almost the same as UMAP but uses a neural network to optimize and preserve the structure of the constructed graph.

def encoder_func_nd(dim = 50):
    init = tf.initializers.HeNormal()
    alpha_val = 0.001
    dims = (768,)
    encoder = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape = dims),
        tf.keras.layers.Dense(units=650, kernel_initializer=init),
        tf.keras.layers.LeakyReLU(alpha=alpha_val),
        # ... other layers ...
        tf.keras.layers.Dense(units=dim),
    ])
    return encoder

n_neighbors_in = 15
n_components_in = 50

# I use this 3D reducer for plotting purposes:

Reducer_3D = ParametricUMAP(encoder = encoder_func_nd(dim = 3),
                            n_components=3,
                            n_neighbors = n_neighbors_in,
                            min_dist = 0,
                            metric='cosine',
                            spread = 0.5,
                            unique = False,
                            transform_queue_size = 50,
                            negative_sample_rate = 50,
                            n_jobs = 50,
                            angular_rp_forest=True,
                            random_state = random_state,
                            transform_seed = random_state,
                              )

embeddings_reduced3D = Reducer_3D.fit_transform(embeddings)

Reducer_nD = ParametricUMAP(encoder = encoder_func_nd(dim = 50),
                            n_components=50,
                            n_neighbors = n_neighbors_in,
                            min_dist = 0,
                            metric='cosine',
                            spread = 0.5,
                            unique = False,
                            transform_queue_size = 50,
                            negative_sample_rate = 50,
                            n_jobs = 50,
                            angular_rp_forest=True,
                            random_state = random_state,
                            transform_seed = random_state,
                              )

 
clusterer_model = HDBSCAN(min_cluster_size = 14,
                          min_samples = 1,
                          cluster_selection_epsilon = 0,
                          cluster_selection_method = "eom",
                          prediction_data=True,
                          approx_min_span_tree = False)
 
topic_model = BERTopic(embedding_model = sentence_model,
                       verbose = True,
                       top_n_words = 20,
                       n_gram_range = (1, 2),
                       ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True, bm25_weighting=True),
                       vectorizer_model= TfidfVectorizer(stop_words=SWV, ngram_range=(1, 2), vocabulary = vocabulary, min_df=5),
                       umap_model = Reducer_nD,
                       hdbscan_model = clusterer_model,
                       calculate_probabilities = True,
                       # representation_model = [MaximalMarginalRelevance(diversity=0.1), KeyBERTInspired()]
                      )

topics, probs = topic_model.fit_transform(docs_t, embeddings)

n_outlier = topic_model.get_topic_info()[topic_model.get_topic_info()["Topic"] == -1]["Count"][0]

print(f"Number of Outliers: {n_outlier}")

fig = topic_model.visualize_documents(docs_t,
                                topics = topic_model.topics_,
                                embeddings = embeddings,
                                reduced_embeddings =  embeddings_reduced3D,
                                sample = 1,
                                hide_annotations = True,
                                hide_document_hover = False,
                                custom_labels = False,
                                title= "<b>Documents and Topics</b>",
                                width= 1500,
                                height= 750)

After I have trained the model, I want to save it like this:

import os
import pickle

os.chdir('/directory/to/save/')
pkl_name = 'model.pkl'
with open(pkl_name, 'wb') as file:
    pickle.dump(topic_model, file)
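For completeness, here is a rough sketch of a variation of this step that I also considered (purely illustrative; the detach-and-restore trick and the file names are my own, not anything from the BERTopic docs): take the ParametricUMAP off the model before pickling so that no Keras objects end up inside the pickle, and save it separately with its built-in save method.

# Illustrative only: detach the fitted ParametricUMAP before pickling so the
# pickle contains no Keras objects, then save the reducer on its own.
umap_backup = topic_model.umap_model            # the fitted Reducer_nD
topic_model.umap_model = None                   # temporary placeholder

with open('model_without_umap.pkl', 'wb') as file:
    pickle.dump(topic_model, file)

umap_backup.save('/directory/to/save/parametric_umapnd')  # Keras-aware save
topic_model.umap_model = umap_backup            # restore for use in this session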

Now I have to save the Parametric UMAP separately and, after loading the pickled model in another notebook, use it to replace topic_model.umap_model.
This is from the Parametric UMAP documentation ( https://umap-learn.readthedocs.io/en/latest/parametric_umap.html ):

Saving and loading your model

Unlike non-parametric UMAP, Parametric UMAP cannot be saved simply by pickling the UMAP object because of the Keras networks it contains. To save Parametric UMAP, there is a built-in function:

embedder.save('/your/path/here')

The Parametric UMAP can then be loaded elsewhere:

from umap.parametric_umap import load_ParametricUMAP
embedder = load_ParametricUMAP('/your/path/here')

This loads the UMAP object along with the parametric networks it contains. That is why I used:

Reducer_3D.save('/directory/to/save/')

to save both the Parametric UMAP used for plotting and the umap_model that was already fitted inside topic_model.
My plan is to load the models as follows and attach the saved Parametric UMAPs to the pickled model:

from umap.parametric_umap import load_ParametricUMAP
Reducer_3D = load_ParametricUMAP('/saved/parametric_umap3d')
Reducer_nD = load_ParametricUMAP('/saved/parametric_umapnd')

and then attach them to the loaded pickle:

import os
import pickle

os.chdir('/directory/with/saved/bertopic/model')
pkl_name = 'model.pkl'
with open(pkl_name, 'rb') as file:
    topic_model = pickle.load(file)

topic_model.umap_model = Reducer_nD
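If that reattachment works, the monthly step would then look roughly like this (a sketch only; new_docs and new_embeddings are placeholders for the incoming month's data, and it assumes the same sentence_model is available in that notebook):

# Illustrative monthly step: embed the new documents and assign them to the
# existing topics with the reloaded model.
new_embeddings = sentence_model.encode(new_docs, show_progress_bar=True)
new_topics, new_probs = topic_model.transform(new_docs, new_embeddings)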

The problem now is that something strange happens when I save a ParametricUMAP instance. After saving one instance, I can no longer save another one, or even re-save the one I saved before. For example, when I save the first instance with Reducer_3D.save('/directory/to/save/'), the model is saved with these warnings and messages:

WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
INFO:tensorflow:Assets written to: /home/reducernd/encoder/assets
Keras encoder model saved to /home/reducernd/encoder
WARNING:absl:Found untraced functions such as _update_step_xla while saving (showing 1 of 1). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: /home/reducernd/parametric_model/assets
Keras full model saved to /home/reducernd/parametric_model
Keras weights file (<HDF5 file "variables.h5" (mode r+)>) saving:
Keras model archive saving:
File Name                                             Modified             Size
metadata.json                                  2023-09-26 20:47:55           64
config.json                                    2023-09-26 20:47:55         6339
variables.h5                                   2023-09-26 20:47:55      6184520
Keras model archive loading:
File Name                                             Modified             Size
metadata.json                                  2023-09-26 20:47:54           64
config.json                                    2023-09-26 20:47:54         6339
variables.h5                                   2023-09-26 20:47:54      6184520
Keras weights file (<HDF5 file "variables.h5" (mode r)>) loading:
...layers..
Pickle of ParametricUMAP model saved to /home/reducernd/model.pkl

Now, if I try to save any other ParametricUMAP instance, I get this error:

TypeError: Cannot serialize object <tensorflow.python.eager.polymorphic_function.polymorphic_function.Function object at 0x7f141071bc90> of type <class 'tensorflow.python.eager.polymorphic_function.polymorphic_function.Function'>. To be serializable, a class must implement the `get_config()` method.

This is very strange behavior: saving one ParametricUMAP object instance changes a function so that it is no longer serializable the second time. I am stuck here. I do not know how to save a complete model so that all of its components can be loaded back and I can run the transform function on new data! I would appreciate any suggestions you can offer. I have tried my best but keep failing.


qq24tv8q3#

I train on historical data, and then every month I need to load the model and run transform on new embeddings.

The assignment of documents to topics is also done through the embeddings, and those can be used instead of the UMAP/HDBSCAN combination for the assignment if you use safetensors, so that would circumvent the issue. I think it is worth trying that out.

This is very strange behavior: saving one ParametricUMAP object instance changes a function so that it is no longer serializable the second time. I am stuck here. I do not know how to save a complete model so that all of its components can be loaded back and I can run the transform function on new data! I would appreciate any suggestions you can offer. I have tried my best but keep failing.

Not sure what is happening there. It seems related to UMAP itself, so it would be best to also post this issue in the UMAP repository. I think they will be better able to help you resolve it!
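Roughly, the safetensors route mentioned above would look something like the following (a sketch only; the path, the embedding model name, and new_docs/new_embeddings are placeholders):

from bertopic import BERTopic

# Save with safetensors after training; the embedding model is stored as a pointer.
topic_model.save(
    "path/to/topics_model",
    serialization="safetensors",
    save_ctfidf=True,
    save_embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)

# Each month: load the model and assign new documents through the similarity of
# their embeddings to the topic embeddings (no UMAP/HDBSCAN needed at transform time).
loaded_model = BERTopic.load("path/to/topics_model")
new_topics, new_probs = loaded_model.transform(new_docs, new_embeddings)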
