llama_index [OOM]: Out of memory when creating an index for large documents

c0vxltue · posted 2 months ago in Other

Error description

How do I create an index for large documents with llamaindex?
I noticed that when I call index.insert, the index variable is held in RAM, so as I keep adding new document chunks the RAM usage keeps growing and eventually runs out of memory. Is there some way to reload the index and keep only the vectors in memory, while storing the rest of the payload/metadata on disk?

Version

0.10.57

Steps to reproduce

# llama-index==0.10.57
# qdrant-client==1.10.1

from qdrant_client import QdrantClient, models
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import os
from pathlib import Path
from qdrant_client import QdrantClient
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex, StorageContext
from typing import List

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")
embed_size = 1024 # Change as per your embed_model

# Local
client_path = os.path.join(Path.home(), ".cache", "aliparse", "qdrant_vector_store_del")
qdrant_client = QdrantClient(path=client_path)

# Remote
# url = "http://0.0.0.0:6333"
# qdrant_client = QdrantClient(url=url, timeout=20)

def create_index(input_files: List[str] = None, input_dir: str = None, collection_name: str = "test"):

    qdrant_client.create_collection(
        collection_name=collection_name,
        on_disk_payload=True,  # TODO: seems not to take effect
        vectors_config=models.VectorParams(
            size=embed_size,
            distance=models.Distance.COSINE,
            on_disk=True
        ),
        optimizers_config=models.OptimizersConfigDiff(memmap_threshold=10000),
        hnsw_config=models.HnswConfigDiff(on_disk=False), # False to avoid Request Timeout Error, https://github.com/qdrant/qdrant-client/issues/632#issuecomment-2126729153
    )

    qdrant_vector_store = QdrantVectorStore(
        client=qdrant_client, 
        collection_name=collection_name, 
        enable_hybrid=False, #  whether to enable hybrid search using dense and sparse vectors
    )

    storage_context = StorageContext.from_defaults(vector_store=qdrant_vector_store)

    if input_files is not None:
        reader = SimpleDirectoryReader(input_files=input_files, raise_on_error=True)
    elif input_dir is not None:
        # Careful: Qdrant gives weird errors if you pass filename_as_id as True
        reader = SimpleDirectoryReader(input_dir=input_dir, raise_on_error=True)
    else:
        raise ValueError("Please either pass 'input_files' (a list of file paths) or 'input_dir' (directory containing test files) as an argument to load_index")

    qdrant_index = VectorStoreIndex.from_documents([], embed_model=embed_model, storage_context=storage_context)
    for docs in reader.iter_data():
        for d in docs:
            qdrant_index.insert(d)
        if len(docs) > 0:
            try:
                print(f"Successfully inserted index for {docs[-1].metadata['file_name']}")
            except KeyError:
                pass
    
    return qdrant_index
        
qdrant_index = create_index(input_dir="/test-directory", collection_name="test")

Relevant logs/traceback

  • No response*

epfja78i1#

Hey @bibekyess! I'm here to help you with any bugs, questions, or contributions. Let's solve this together.
To manage memory more effectively when creating an index for large documents with llamaindex, you can keep only the vectors in memory and leave the rest of the payload/metadata on disk. Here is an example of how to achieve this with QdrantVectorStore:

  1. Initialize the Qdrant client and create the collection: make sure the collection is configured to store payloads on disk.
  2. Configure the QdrantVectorStore: set up the vector store to use the Qdrant client.
  3. Create and use the index: insert documents into the index with VectorStoreIndex.

Here is the modified code to accomplish this:
from qdrant_client import QdrantClient, models
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import os
from pathlib import Path
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext
from typing import List

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")
embed_size = 1024  # Change as per your embed_model

# Local
client_path = os.path.join(Path.home(), ".cache", "aliparse", "qdrant_vector_store_del")
qdrant_client = QdrantClient(path=client_path)

def create_index(input_files: List[str] = None, input_dir: str = None, collection_name: str = "test"):
    qdrant_client.create_collection(
        collection_name=collection_name,
        on_disk_payload=True,  # Ensure payloads are stored on disk
        vectors_config=models.VectorParams(
            size=embed_size,
            distance=models.Distance.COSINE,
            on_disk=True  # Store vectors on disk
        ),
        optimizers_config=models.OptimizersConfigDiff(memmap_threshold=10000),
        hnsw_config=models.HnswConfigDiff(on_disk=False),  # False to avoid Request Timeout Error
    )

    qdrant_vector_store = QdrantVectorStore(
        client=qdrant_client,
        collection_name=collection_name,
        enable_hybrid=False,  # Disable hybrid search to save memory
    )

    storage_context = StorageContext.from_defaults(vector_store=qdrant_vector_store)

    if input_files is not None:
        reader = SimpleDirectoryReader(input_files=input_files, raise_on_error=True)
    elif input_dir is not None:
        reader = SimpleDirectoryReader(input_dir=input_dir, raise_on_error=True)
    else:
        raise ValueError("Please either pass 'input_files' (a list of file paths) or 'input_dir' (directory containing test files) as an argument to load_index")

    qdrant_index = VectorStoreIndex.from_documents([], embed_model=embed_model, storage_context=storage_context)
    for docs in reader.iter_data():
        for d in docs:
            qdrant_index.insert(d)
        if len(docs) > 0:
            try:
                print(f"Successfully inserted index for {docs[-1].metadata['file_name']}")
            except KeyError:
                pass

    return qdrant_index

qdrant_index = create_index(input_dir="/test-directory", collection_name="test")

This setup ensures that the vectors are stored on disk, reducing the memory footprint during indexing. You can also tune the batch_size and parallel parameters of QdrantVectorStore to further optimize memory usage 1(https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/vector_stores/qdrant.py) 2(https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/vector_stores/llama-index-vector-stores-qdrant/llama_index/vector_stores/qdrant/base.py).

  • To continue the conversation with Dosu, mention @dosu.*

t1qtbnec2#

@dosu It did not solve the problem.


xlpyo6sf3#

To address the out-of-memory (OOM) issue when creating an index for large documents with llamaindex and QdrantVectorStore, you can adjust the batch_size parameter to control the number of points uploaded in a single request. This helps manage memory usage more effectively. In addition, you can make sure that only the vectors are kept in memory while the rest of the payload/metadata stays on disk.
Here is an example of initializing QdrantVectorStore with a smaller batch_size to reduce memory usage:

from qdrant_client import QdrantClient, models
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import os
from pathlib import Path
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext
from typing import List

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")
embed_size = 1024  # Change as per your embed_model

# Local
client_path = os.path.join(Path.home(), ".cache", "aliparse", "qdrant_vector_store_del")
qdrant_client = QdrantClient(path=client_path)

def create_index(input_files: List[str] = None, input_dir: str = None, collection_name: str = "test"):
    qdrant_client.create_collection(
        collection_name=collection_name,
        on_disk_payload=True,  # Ensure payloads are stored on disk
        vectors_config=models.VectorParams(
            size=embed_size,
            distance=models.Distance.COSINE,
            on_disk=True  # Store vectors on disk
        ),
        optimizers_config=models.OptimizersConfigDiff(memmap_threshold=10000),
        hnsw_config=models.HnswConfigDiff(on_disk=False),  # False to avoid Request Timeout Error
    )

    qdrant_vector_store = QdrantVectorStore(
        client=qdrant_client,
        collection_name=collection_name,
        batch_size=16,  # Adjust batch_size to a smaller value to reduce memory usage
        parallel=1,
        max_retries=3,
        enable_hybrid=False,  # Disable hybrid search to save memory
    )

    storage_context = StorageContext.from_defaults(vector_store=qdrant_vector_store)

    if input_files is not None:
        reader = SimpleDirectoryReader(input_files=input_files, raise_on_error=True)
    elif input_dir is not None:
        reader = SimpleDirectoryReader(input_dir=input_dir, raise_on_error=True)
    else:
        raise ValueError("Please either pass 'input_files' (a list of file paths) or 'input_dir' (directory containing test files) as an argument to load_index")

    qdrant_index = VectorStoreIndex.from_documents([], embed_model=embed_model, storage_context=storage_context)
    for docs in reader.iter_data():
        for d in docs:
            qdrant_index.insert(d)
        if len(docs) > 0:
            try:
                print(f"Successfully inserted index for {docs[-1].metadata['file_name']}")
            except KeyError:
                pass

    return qdrant_index

qdrant_index = create_index(input_dir="/test-directory", collection_name="test")

By reducing batch_size you can control the memory footprint during upload, which should help mitigate the OOM issue.

  • To continue the conversation with Dosu, mention @dosu.*

qacovj5a4#

There may be some Qdrant-side settings involved. I usually run Qdrant as a Docker server.

Also, keep in mind that bge-m3 is a fairly large model. I'm not sure whether you mean RAM or VRAM, but you may want to lower the batch size.
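If the pressure is on memory during embedding, the batch-size suggestion above can be applied through the embed_batch_size parameter of the llama_index embedding classes. A minimal sketch (the value 4 is purely illustrative, not a recommendation from this thread):

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Sketch: embed fewer chunks per forward pass to lower peak (V)RAM use.
# embed_batch_size defaults to 10 in llama_index; tune for your hardware.
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-m3",
    embed_batch_size=4,  # illustrative value
)
```

This only changes how many text chunks are embedded at once; the total number of vectors written to Qdrant is unaffected.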


nuypyhwy5#

Thank you for the reply.

  1. Yes, the reproduction code I posted already takes Qdrant's on-disk storage configuration into account, but strangely the memory still keeps growing.
  2. I mean RAM. I think embed_batch_size affects VRAM, and VRAM usage is not a problem for me, since it drops again after the index is created. But with the code above, RAM usage does not go down even after all the indexes have been created; I have to kill the whole process to free the memory. I have attached images: the one on the left shows the Qdrant Docker stats, the one on the right shows the RAM usage.
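One thing worth trying for the RAM that is only freed when the process dies: the reproduction code uses the embedded client (QdrantClient(path=...)), which keeps the collection inside the Python process. A sketch of an alternative, assuming a server-mode Qdrant (e.g. the Docker server mentioned above at a hypothetical http://localhost:6333) and an already-populated collection, is to re-attach with VectorStoreIndex.from_vector_store instead of keeping the insert-time index object alive:

```python
from qdrant_client import QdrantClient
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Sketch: connect to a server-mode Qdrant so vectors and payloads live
# outside this Python process.
qdrant_client = QdrantClient(url="http://localhost:6333")  # assumed address
vector_store = QdrantVectorStore(client=qdrant_client, collection_name="test")

# Wrap the existing collection without re-inserting any documents; the
# embed model only needs to match the one used at indexing time.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    embed_model=embed_model,
)
```

With this pattern the indexing script can exit (releasing its RAM) and a separate query process can rebuild the index handle cheaply from the server-side collection.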
