BERTopic: out of memory in image_to_text when running multimodal topic modeling on 400k images

Posted by q0qdq0h2 2 months ago in Other


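Hey, first of all: thank you very much for this great package, and especially for the multimodal update, which I'm using here.
I have a dataset consisting only of images, and I'm trying to run a topic model on it. I precomputed the embeddings on a GPU and am now trying to build my full model. Dimensionality reduction and clustering seem to work, but I run out of memory in the representation stage, where the algorithm tries to convert images to text. Is there any workaround?
Currently I'm running the code on a GPU (T4) machine with 30 GB of RAM, because as far as I can tell HDBSCAN / UMAP are faster on the GPU with cuML than on the CPU. On CPU I can get up to about 512 GB of RAM, but 1.4 TB still seems too high? Especially given that I'm only working with a dataset of 440k small images...
Do you have any thoughts on this?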

BERTopic

Source: https://github.com/MaartenGr/BERTopic/issues/1464

9 answers

nnsrf1az #1

Could you share your full code? That makes it easier to understand what exactly is happening. Also, how large is your training data?


lokaqttq #2

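Hey @MaartenGr, thanks for the quick reply!
Sure, here is the full code.
The error message above was produced with min_samples set to 10; after changing it to 100, it only requests 405 GB of RAM. Running this on CPU would be a lot slower, right?
As for the training data: I'm not predicting on new data, I'm still trying to fit on my original dataset of 652609 images (in this case each image is between 50 and 100 KB; and it's not 400k, I just checked again, sorry).
However, I have two more datasets that I'd like to process the same way: one is medium-sized and the other has 2.5 million images. Computing the embeddings worked for that one too, but I expect to hit the same image_to_text OOM problem. Is there a way to process this in batches or otherwise reduce the RAM requirement?
Thanks a lot!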

from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

from bertopic.representation import KeyBERTInspired, VisualRepresentation
from bertopic.backend import MultiModalBackend
from bertopic import BERTopic
import numpy as np

import pickle

hdbscan_model = HDBSCAN(min_samples=100, gen_min_span_tree=True, prediction_data=False)
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, low_memory=True)

# Image to text representation model
representation_model = {
    "Visual_Aspect": VisualRepresentation(image_to_text_model="nlpconnect/vit-gpt2-image-captioning")
}

# Image embedding model
embedding_model = MultiModalBackend('clip-ViT-B-32', batch_size=32)

embeddingsLoaded = pickle.load(open("embeddings_quer.pkl", 'rb'))  # Load the precomputed image embeddings

embeddings_joint_imageslist = pickle.load(open("embeddings_quer_imageslist.pkl", 'rb'))  # Load the matching list of images

# Train our model with images only
topic_model = BERTopic(low_memory=True,verbose=True,calculate_probabilities=False,embedding_model=embedding_model,representation_model=representation_model,umap_model=umap_model,hdbscan_model=hdbscan_model)

# Note: fit() returns the fitted BERTopic model itself, so `topics` here is the model
topics = topic_model.fit(documents=None, embeddings=embeddingsLoaded, images=embeddings_joint_imageslist)

pickle.dump(topics, open("embeddings_quer_fullmodel.pkl", 'wb'))  # Save the fitted model


ijxebb2r #3

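I believe the following is happening. Once your images have been clustered, a subset of images is selected from each cluster to be converted to text, i.e. captioned. To do that, maximal marginal relevance (MMR) is used to pick the top 9 images that are close to the centroid but still somewhat diverse. However, to do so it computes the cosine similarity between the images themselves, which yields a 438365 x 438365 matrix. The value 438365 is the number of documents in your cluster.

If you change min_samples, the size of the largest cluster may shrink. But with your large datasets you may still end up with large clusters, and therefore with these large similarity matrices.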

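To prevent this, I think a fix is needed. For example, instead of running maximal marginal relevance on all images in a cluster, doing so on only about 10,000 images should suffice, creating at most a 10000 x 10000 matrix. This makes the resulting image representation slightly less accurate, but I think that with a subset of 10,000 images the effect will hardly be noticeable.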
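
As a rough sanity check on those matrix sizes, here is a small back-of-the-envelope calculation. It assumes a dense matrix of 8-byte (float64) values, which is an assumption and not something confirmed in this thread; under that assumption the full-cluster matrix comes out at about 1.4 TiB, in line with the figure reported in the question, while the capped one stays below 1 GiB. The helper name `dense_matrix_bytes` is made up for illustration.

```python
def dense_matrix_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory needed for a dense n_rows x n_cols matrix (float64 assumed)."""
    return n_rows * n_cols * bytes_per_value


TIB = 2 ** 40
GIB = 2 ** 30

# Similarity matrix over every image in the largest cluster
full_cluster = dense_matrix_bytes(438_365, 438_365)
# Similarity matrix when the candidates are capped at 10,000 images
capped = dense_matrix_bytes(10_000, 10_000)

print(f"438365 x 438365: {full_cluster / TIB:.2f} TiB")
print(f"10000 x 10000:   {capped / GIB:.2f} GiB")
```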

In other words, could you change the following line:
BERTopic/bertopic/representation/_visual.py, line 167 (commit 37064e2)
indices = mmr(topic_embedding.reshape(1, -1), embeddings[indices], indices, top_n=top_n, diversity=0.1)
to:

indices = mmr(topic_embedding.reshape(1, -1), embeddings[indices[:10_000]], indices[:10_000], top_n=top_n, diversity=0.1)

and then try again?
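
To make the idea concrete, here is a minimal, self-contained sketch of MMR with a cap on the number of candidates. It is not BERTopic's own `mmr` function; `capped_mmr`, `cosine_sim` and `max_candidates` are hypothetical names, and the scoring follows the common textbook form of MMR. The point is only that the pairwise similarity matrix is built over at most `max_candidates` rows.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def capped_mmr(topic_embedding, embeddings, indices, top_n=9,
               diversity=0.1, max_candidates=10_000):
    """Select top_n indices by MMR, looking only at the first max_candidates."""
    indices = np.asarray(indices)[:max_candidates]
    if len(indices) == 0:
        return []
    candidates = embeddings[indices]

    # Relevance: similarity of every candidate to the topic embedding
    sim_to_topic = cosine_sim(candidates, topic_embedding.reshape(1, -1)).ravel()
    # Redundancy: candidate-to-candidate similarity, at most
    # max_candidates x max_candidates instead of cluster_size x cluster_size
    sim_between = cosine_sim(candidates, candidates)

    selected = [int(np.argmax(sim_to_topic))]
    remaining = [i for i in range(len(indices)) if i != selected[0]]

    for _ in range(min(top_n, len(indices)) - 1):
        # Highest similarity of each remaining candidate to anything selected
        redundancy = sim_between[np.ix_(remaining, selected)].max(axis=1)
        score = (1 - diversity) * sim_to_topic[remaining] - diversity * redundancy
        best = remaining[int(np.argmax(score))]
        selected.append(best)
        remaining.remove(best)

    # Map positions in the candidate subset back to the original indices
    return [int(indices[i]) for i in selected]
```

Slicing with `[:max_candidates]` mirrors the `indices[:10_000]` change suggested above. Whether the first 10,000 images are representative of the cluster depends on how `indices` is ordered, so a random subsample may be worth considering.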


cl25kdpy #4

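Thanks, that not only sounds very reasonable, the quick code change also worked.
Would you mind keeping the ticket open for a few more days? I may run into problems with the larger datasets.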


mnowg1ta #5

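So this is exactly what happened... I ran my code on the small and medium datasets yesterday and everything went fine. I then started on the large one (a pkl file of nearly 10 GB, or 2.5 million image files), but nothing happens :(
The output starts out the same as for the other two sets, although it doesn't even print that the dimensionality reduction has finished. So I get the same warning messages as with the others (which weren't a problem then), but after that there is no further output, and the job has now been running for 15.5 hours.
The only differences between this set and the other two are its size and the fact that, when computing the embeddings for the large set, I split the images into large batches, since I would otherwise hit OOM. That is why I have to concatenate the loaded pkl files first. No error is raised there, though.
Am I being too impatient? I'm now running this on a CPU machine with 240 GB of RAM, without OOM. I tried a smaller machine before and hit OOM right away, so this should be fine now? The large set is roughly 4 times the size of the medium one, so it obviously takes a while before output starts, but more like 30 minutes, not 15.5 hours?
Do you have any ideas or suggestions? Thanks a lot for your help!
Full output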

/users/groesch/.local/lib/python3.10/site-packages/umap/distances.py:1063: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/users/groesch/.local/lib/python3.10/site-packages/umap/distances.py:1071: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/users/groesch/.local/lib/python3.10/site-packages/umap/distances.py:1086: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/users/groesch/.local/lib/python3.10/site-packages/umap/umap_.py:660: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
ftfy or spacy is not installed using BERT BasicTokenizer instead of ftfy.

Full code

from hdbscan import HDBSCAN
from umap import UMAP

from bertopic.representation import KeyBERTInspired, VisualRepresentation
from bertopic.backend import MultiModalBackend
from bertopic import BERTopic
import numpy as np

import pickle

hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, low_memory=True)

# Image to text representation model
representation_model = {
    "Visual_Aspect": VisualRepresentation(image_to_text_model="nlpconnect/vit-gpt2-image-captioning")
}

# Image embedding model
embedding_model = MultiModalBackend('clip-ViT-B-32', batch_size=32)

embeddingsLoaded = pickle.load(open("embeddings_joint.pkl", 'rb'))  # Load the precomputed embedding batches

embeddings_joint_imageslist = pickle.load(open("embeddings_joint_imageslist.pkl", 'rb'))  # Load the matching list of images

# Join all the embedding batches 
embeddingsConc = np.concatenate(embeddingsLoaded, axis=0)

# Train our model with images only
topic_model = BERTopic(low_memory=True,verbose=True,calculate_probabilities=False,embedding_model=embedding_model,representation_model=representation_model,umap_model=umap_model,hdbscan_model=hdbscan_model)

# fit_transform() returns (topics, probabilities), so unpack the tuple
topics, probs = topic_model.fit_transform(documents=None, embeddings=embeddingsConc, images=embeddings_joint_imageslist)

pickle.dump(topics, open("embeddings_joint_fullmodel.pkl", 'wb'))  # Save the topic assignments

embedding_model_name = "clip-ViT-B-32"
# save() is a method of the BERTopic model, not of the returned topics
topic_model.save("embeddings_joint_fullmodel_safetensors", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model_name)
topic_model.save("embeddings_joint_fullmodel_pytorch", serialization="pytorch", save_ctfidf=True, save_embedding_model=embedding_model_name)


68de4m5k #6

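You're definitely not being impatient! But it looks like you've switched from cuML to the original UMAP implementation. If you want to run dimensionality reduction on large datasets, cuML and a powerful GPU are essential. It's also worth noting that not everything scales linearly with dataset size, UMAP included, so it wouldn't be surprising if the computation time doesn't scale linearly either.
In short, I'd advise using cuML. Also, make sure to set min_cluster_size together with min_samples to relatively large values (a few hundred) in HDBSCAN. Otherwise you'll most likely end up with tens of thousands of topics. Finally, check whether cuML's UMAP offers a cosine metric. If it doesn't, normalizing the embeddings may be necessary.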


au9on6nz #7

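Yes, I switched to the original UMAP and HDBSCAN because I was running that on CPU; without a GPU available, cuML fails.

I think min_samples is capped somewhere above 100; I remember getting an error when I tried setting it to 1000.

I'll now try it on a GPU with cuML and 128 GB of RAM, with min_samples=100 and min_cluster_size=500.

According to the docs, cosine doesn't seem to be possible in cuML, so I integrated your code snippet. Would you mind elaborating a bit on why this makes sense? That would help me a lot! I'll let you know whether this setup works.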

from cuml.preprocessing import normalize
embeddings = normalize(embeddings)


xn1cxnb4 #8

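I'm now seeing very strange behavior that I can't explain to myself. I tried it on the large example, but it seems to get stuck before the dimensionality reduction even starts. So I measured how long loading the pkl takes: only a few seconds. I then changed the code to show a progress bar while loading the data for the BERTopic part and added some print output.

So as long as I call quit() before the BERTopic part, it works fine, the logs are printed, and so on. But with quit() removed, the script keeps running without any error message yet gets stuck after embeddings_joint_imageslist = test_tqdm_reader("embeddings_joint_imageslist.pkl"). "Loaded Embeddings" and the progress bar for embeddings_joint_imageslist.pkl are printed, but after that: nothing, no error, the job keeps running but does nothing?!
Any ideas? What I can't explain is that this seems to be affected by code that only runs afterwards. With quit(), everything seems to be processed up to that point; without it, it stops even earlier.
I also get the NumbaDeprecationWarning from the import, but never the output of print("Loaded Images List") or print("Normalized")...
Edit: I've narrowed it down; the problem occurs whenever quit() is not called before the fit function.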
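
One thing that might be worth ruling out here, although nothing in this thread confirms it, is output buffering: when stdout is not a terminal (for example in a batch job writing to a log file), Python can hold printed text in a buffer until the buffer fills or the process exits. That would make prints appear when the script ends early with quit() and seem to vanish when a long-running fit follows. A minimal sketch of a flushing log helper (the name `log` is made up for illustration):

```python
import sys
import time


def log(message, stream=sys.stdout):
    """Print a timestamped message and flush immediately, so it shows up
    in the job log even if the process later blocks for hours."""
    stamped = f"[{time.strftime('%H:%M:%S')}] {message}"
    print(stamped, file=stream, flush=True)
    return stamped


log("Loaded Images List")
```

Running the script with `python -u` or setting `PYTHONUNBUFFERED=1` has a similar effect without changing the code.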


toe95027 #9

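"Yes, I switched to the original UMAP and HDBSCAN because I was running that on CPU; without a GPU available, cuML fails."
If a good GPU is available, cuML's UMAP is to be preferred. In general, using a GPU is advised, since a bunch of different LLMs are used throughout the text/image analysis.

"According to the docs, cosine doesn't seem to be possible in cuML, so I integrated your code snippet. Would you mind elaborating a bit on why this makes sense? That would help me a lot! I'll let you know whether this setup works."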

You can find a good explanation at this link.
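
For reference, the usual argument can be checked numerically. For unit-length vectors, the squared Euclidean distance equals 2 - 2 * cos(a, b), so after L2-normalizing the embeddings, ranking neighbors by Euclidean distance gives the same order as ranking them by cosine similarity. The sketch below uses plain Python and two made-up example vectors; it only illustrates the identity and is not taken from the linked explanation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two arbitrary example vectors (illustrative values only)
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

cos_ab = cosine_similarity(a, b)
dist_sq = squared_euclidean(l2_normalize(a), l2_normalize(b))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(cos_ab, dist_sq, 2 - 2 * cos_ab)
```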

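"Any ideas? What I can't explain is that this seems to be affected by code that only runs afterwards. With quit(), everything seems to be processed up to that point; without it, it stops even earlier."

I'm not sure what is happening. But then again, this seems unrelated to BERTopic, since you created a custom tqdm-related reader. These issues are therefore best addressed in the respective repository.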
