BERTopic raises an error when transforming probabilities

iszxjhcz · asked 23 days ago

I seem to run into the following error from time to time:

Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/opt/homebrew/lib/python3.11/site-packages/bertopic/_bertopic.py", line 550, in transform
    probabilities = self._map_probabilities(probabilities, original_topics=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/bertopic/_bertopic.py", line 4124, in _map_probabilities
    mapped_probabilities[:, to_topic] += probabilities[:, from_topic]
                                         ~~~~~~~~~~~~~^^^^^^^^^^^^^^^
IndexError: index 14 is out of bounds for axis 1 with size 14

I'm not sure how to help debug this, since it only shows up in some runs and not others. In every case there is a BERTopic model of the form BERTopic(embedding_model=embedding_model, umap_model=umap_model, hdbscan_model=hdbscan_model, representation_model=representation_model, calculate_probabilities=True) that I have successfully fitted with fit_transform, and I then call transform to compute topics and probabilities for a new sample. In every case I also supply both the documents and the embeddings. The code runs over a collection of document sets as follows:

for key in topic_models:
    topics[key], _ = topic_models[key].fit_transform(datasets[key], embeddings[key])

I know the models were fitted successfully, because I can retrieve topics from them and there do not appear to be any errors. The error only shows up, intermittently, when transform is called. Its seemingly random occurrence suggests it has something to do with the fitted topics, but I have no idea how to go about debugging it.
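To be concrete, the failing step looks roughly like this (a simplified sketch; new_documents and new_embeddings are placeholders for the new sample and its precomputed embeddings):

# Simplified sketch of the step that intermittently fails: a fitted model is
# asked to transform a new sample, with precomputed embeddings passed in.
# new_documents / new_embeddings are placeholders for the new sample.
for key in topic_models:
    new_topics, new_probs = topic_models[key].transform(
        new_documents[key], embeddings=new_embeddings[key]
    )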
In this piece of code:

# Map array of probabilities (probability for assigned topic per document)
if probabilities is not None:
    if len(probabilities.shape) == 2:
        mapped_probabilities = np.zeros((probabilities.shape[0],
                                         len(set(mappings.values())) - self._outliers))
        for from_topic, to_topic in mappings.items():
            if to_topic != -1 and from_topic != -1:
                mapped_probabilities[:, to_topic] += probabilities[:, from_topic]

        return mapped_probabilities

return probabilities

Is to_topic guaranteed to be sequential? Could there be gaps between the indices? I don't know the codebase well enough, but could len(set(mappings.values())) be the problem? Perhaps something like:

if probabilities is not None:
    if len(probabilities.shape) == 2:
        # Find the maximum 'to_topic' index, ensuring the array is large enough
        max_to_topic = max(mappings.values())
        
        # Initialize 'mapped_probabilities' with a size based on the maximum index found
        mapped_probabilities = np.zeros((probabilities.shape[0], max_to_topic + 1 - self._outliers))
        
        for from_topic, to_topic in mappings.items():
            if to_topic != -1 and from_topic != -1:
                # Safely add probabilities, knowing 'mapped_probabilities' has enough columns
                mapped_probabilities[:, to_topic] += probabilities[:, from_topic]
                
        # If necessary, additional steps to handle outliers or resize the array can be added here

        return mapped_probabilities

In this version, non-sequential indices are handled naturally. However, I don't know whether non-sequential indices would themselves be a sign of a deeper problem. Good luck.
I should point out that I'm not clear on exactly what self._outliers does, so I kept it. Perhaps this should simply be max_to_topic + 1? That is what I would have done without self._outliers, but I kept it because I don't understand (and haven't had time to dig into) what it is.
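
For debugging, one small check that might help is to compare the fitted topic mapping with the probability matrix right before the mapping step. This is only a sketch: it relies on the internal topic_mapper_.get_mappings() attribute, which may differ between BERTopic versions.

# Diagnostic sketch (relies on BERTopic internals; adjust to your version):
# compare the topic mapping used by _map_probabilities with the width of the
# probability matrix computed for the new documents.
def check_mapping(topic_model, probabilities):
    mappings = topic_model.topic_mapper_.get_mappings(original_topics=True)
    to_topics = sorted(set(mappings.values()) - {-1})
    from_topics = sorted(t for t in mappings if t != -1)

    print("probability matrix width:", probabilities.shape[1])
    print("number of target topics :", len(to_topics))

    # Gaps would show up here as non-contiguous target indices
    if to_topics != list(range(len(to_topics))):
        print("non-contiguous 'to_topic' indices:", to_topics)

    # The IndexError in the traceback means a 'from_topic' column is missing
    if from_topics and max(from_topics) >= probabilities.shape[1]:
        print("'from_topic'", max(from_topics),
              "exceeds probability columns", probabilities.shape[1])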

iih3973s 1#

Hmm, it's hard to say what exactly is going on here. Could you share the full code you use to instantiate the model? It might be related to the BERTopic variant you are using, or to any other changes made to the model.

pkmbmrz7 2#

I can't share the original code, since the variable names and so on are tied to material I'm not able to share. I tried to create a minimal reproducible example with those elements removed and replaced (the data in particular), but I couldn't reproduce the error. Sorry, I know that is crucial for debugging, but whenever I try to build a minimal reproducible example I end up with a fairly generic version that works.

wfauudbj 3#

Hmm, that makes this quite difficult. Without a way to reproduce the issue, I'm not sure I can figure out what exactly is going wrong. It's like looking for a needle in a haystack without knowing what the needle actually is.
Let's approach it from a different angle. Could you share what is inside BERTopic(embedding_model=embedding_model, umap_model=umap_model, hdbscan_model=hdbscan_model, representation_model=representation_model, calculate_probabilities=True)? Those variables might give us some clues.

i2byvkas 4#

Here is the code with variable names and the like removed, so there is a non-zero chance that I introduced an error while changing it. In short, two models are run per df: one on column1 and another on column2. The objects of interest are the topic probabilities these models produce when given the two columns concatenated. That way, the model used to assign probabilities to column1 and column2 is the one estimated on column1, and likewise for column2. So with two columns and two datasets I end up with four probability matrices. The embeddings are precomputed and saved; they are also stacked vertically, mirroring the concatenation of the input.

# Import necessary libraries
import ast
import openai
import numba
import numpy as np
from openai import OpenAI
import pandas as pd
import umap.umap_ as umap
import sys
import os

from hdbscan import HDBSCAN
from bertopic import BERTopic
from bertopic.backend import OpenAIBackend
from bertopic.representation import OpenAI as OAI

openai.api_key = ""
OAI_client = OpenAI(
    api_key="",
)

# Step 1 - Extract embeddings using an OpenAI embedding model (text-embedding-3-large)
# Changing embedding_model does not make a difference AFAIK
embedding_model = OpenAIBackend(
    embedding_model="text-embedding-3-large", delay_in_seconds=1, batch_size=1024
)

# Step 2 - Reduce dimensionality using UMAP
# UMAP parameters are chosen based on dataset characteristics and desired dimensionality reduction
umap_model = umap.UMAP(
    n_neighbors=2500, n_components=72, min_dist=0.01, metric="cosine"
)

# Step 3 - Cluster reduced embeddings using HDBSCAN
# The 'leaf' method is used for cluster selection for potentially better-defined clusters
hdbscan_model = HDBSCAN(
    cluster_selection_method="leaf", min_cluster_size=125, prediction_data=True
)

prompt_text = "Identify the primary topic in the reviews represented by the following documents and keywords: [DOCUMENTS] [KEYWORDS]. Provide only the topic label."

# Step 4 - Determine Topic representations using GPT-4 from OpenAI
# Changing model does not make a difference AFAIK
representation_model = OAI(
    client=OAI_client,
    model="gpt-4-turbo-preview",
    chat=True,
    exponential_backoff=True,
    nr_docs=12,
    prompt=prompt_text,
)

# Dictionary to hold BERTopic models
topic_models = {
    "a1": BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        calculate_probabilities=True,
    ),
    "a2": BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        calculate_probabilities=True,
    ),
    "b1": BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        calculate_probabilities=True,
    ),
    "b2": BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        representation_model=representation_model,
        calculate_probabilities=True,
    ),
    ### Dictionary has more models
}

# Dictionary to hold datasets
datasets = {
    "a1": some_documents,
    "a2": some_documents,
    "b1": some_documents,
    "b2": some_documents,
    ### More data
}

def load_embeddings(base_path, file_name):
    # Load the DataFrame from a pickle file
    df = pd.read_pickle(f"{base_path}/embedding_{file_name}.pkl")
    # Assuming the embeddings are already lists in the first column, directly convert to a NumPy array
    numpy_array = np.array([row for row in df.iloc[:, 0]])
    return numpy_array

# Load embeddings and process with UMAP
base_path = ""
embedding_names = [
    "a1",
    "a2",
    "b1",
    "b2",
]  # More names
embeddings = {name: load_embeddings(base_path, name) for name in embedding_names}

# Fit and transform the BERTopic models
topics = {}
original_probabilities = {}
for key in topic_models:
    topics[key], original_probabilities[key] = topic_models[key].fit_transform(
        datasets[key], embeddings[key]
    )

# Datasets and embeddings
datasets = {"big_a": ("a1", "a2", df_a), "big_b": ("b1", "b2", df_b)}

# Process each dataset
for name, (key1, key2, df) in datasets.items():
    # Concatenate 'column1' and 'column2' columns
    combined_df = pd.concat(
        [df["column1"].to_frame(name="data"), df["column2"].to_frame(name="data")],
        axis=0,
    )
    setattr(sys.modules[__name__], f"combined_{name}", combined_df)

    # Concatenate embeddings
    combined_embedding = np.vstack([embeddings[key1], embeddings[key2]])
    setattr(sys.modules[__name__], f"combined_{name}_embedding", combined_embedding)

# Initialize dictionaries to store probabilities
probabilities_dict = {}

# Performing predictions using the models from the dictionary
for dataset in ["big_a", "big_b"]:
    for model_key in ["1", "2"]:
        key = f"{dataset[0]}_{model_key}"
        _, probabilities_dict[key] = topic_models[key].transform(
            documents=getattr(sys.modules[__name__], f"combined_{dataset}"),
            embeddings=getattr(sys.modules[__name__], f"combined_{dataset}_embedding"),
        )

kyvafyod 5#

Honestly, I don't see anything in your code that would explain the issue. It should just work, and I'm surprised it doesn't. There might be a workaround, though.
If you save the model with safetensors or pytorch and then load it back in, the way predictions are made changes, so that might prevent the problem from occurring.
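
Roughly something like this (only a sketch with a placeholder path and variable names; the exact save arguments depend on your setup and BERTopic version):

# Sketch of the suggested workaround (placeholder path and variable names):
# serialize a fitted model and reload it, so that transform() on the reloaded
# model no longer goes through the HDBSCAN prediction step.
topic_model.save("my_model_dir", serialization="safetensors", save_ctfidf=True)

# The OpenAI embedding backend is not stored this way, but passing the
# precomputed embeddings to transform (as you already do) sidesteps that.
loaded_model = BERTopic.load("my_model_dir")
new_topics, new_probs = loaded_model.transform(new_documents, embeddings=new_embeddings)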

bz4sfanl 6#

OK, let me give that a try. I think it has to do with the objects returned by UMAP/HDBSCAN, because when I change their parameters it seems to fail, or at least fail more often. Thanks for looking into this.

e1xvtsh3 7#

No problem, let me know if it works!

nzkunb0c 8#

I wasn't able to resolve this. It has something to do with what HDBSCAN returns; I'm fairly confident everything is fine up to the clustering step. From there, the failure happens unpredictably when the corresponding transform method is called in HDBSCAN and the probabilities are mapped.

I think the problem may lie with HDBSCAN. My guess is that HDBSCAN sometimes does not assign any new documents to a topic, making the returned probability array smaller (columns that would be all zeros are possibly dropped). I say this because the parameter most predictive of the problem is min_cluster_size: when it is large (corresponding to clusters that cover many documents, so new documents are very likely to fall into them), I see the error less often. When it is small, the error becomes much more frequent.
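
One way to probe that hypothesis (just a sketch, assuming access to the fitted sub-models and the standard hdbscan prediction API; adjust names to your setup):

import hdbscan

# Diagnostic sketch: does the soft-clustering output for new documents have as
# many columns as there were clusters at fit time? Assumes the fitted sub-models
# are reachable as topic_model.umap_model / topic_model.hdbscan_model.
def check_membership_width(topic_model, new_embeddings):
    clusterer = topic_model.hdbscan_model
    n_fitted = len(set(clusterer.labels_)) - (1 if -1 in clusterer.labels_ else 0)

    # Reduce the new embeddings with the same UMAP model used at fit time
    reduced = topic_model.umap_model.transform(new_embeddings)
    memberships = hdbscan.membership_vector(clusterer, reduced)

    print("clusters at fit time:", n_fitted)
    print("membership columns  :", memberships.shape[1])
    if memberships.shape[1] != n_fitted:
        print("mismatch: probability columns will not line up with the topics")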

Since I can't find a way to fix this, I'll leave the bug open in the hope that a better programmer can track down a solution.

kq0g1dla 9#

Thanks for sharing this! Hopefully someone else can track down the problem by creating a reproducible example. Indeed, let's keep this open and see whether others can help.
