llama_index [Question]: ChromaDB setup error

4ioopgfo posted 2 months ago in Other

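Question Validation

  • I have searched both the documentation and Discord for an answer.

Question

Hello, I'm using more data in my system, so I'm trying to set up a ChromaDB server to retrieve my vectorized information instead of retrieving it from disk storage. Here is the code I'm using: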

def citation_indexing():
    chroma_client = chromadb.Client()

    try:
        print("Loading Vector Content")
        storage_context = StorageContext.from_chroma(client=chroma_client, collection_name='azure')
        azure_index = load_index_from_storage(storage_context, show_progress=True)

        storage_context = StorageContext.from_chroma(client=chroma_client, collection_name='assessment')
        assessment_index = load_index_from_storage(storage_context, show_progress=True)

        storage_context = StorageContext.from_chroma(client=chroma_client, collection_name='control')
        control_index = load_index_from_storage(storage_context, show_progress=True)

        storage_context = StorageContext.from_chroma(client=chroma_client, collection_name='questionaire')
        questionaire_index = load_index_from_storage(storage_context, show_progress=True)
        
        storage_context = StorageContext.from_chroma(client=chroma_client, collection_name='aws_docs')
        aws_index = load_index_from_storage(storage_context, show_progress=True)

        index_loaded = True
    except Exception as e:
        print(f"Error Loading Vector Content: {e}")
        index_loaded = False
    
    if not index_loaded:
        print('Vectorizing Content')
        # load data
        azure_docs = SimpleDirectoryReader(
            input_files=["/home/ubuntu/environment/revised-Project/All Docs/azure_services.pdf"]
        ).load_data()
        assessment_docs = SimpleDirectoryReader(
            input_files=["/home/ubuntu/environment/revised-Project/assessment-procedures.pdf"]).load_data()
        control_docs = SimpleDirectoryReader("/home/ubuntu/environment/revised-Project/controls").load_data()
        ques_docs = SimpleDirectoryReader("/home/ubuntu/environment/revised-Project/ques").load_data()
        aws_documents = SimpleDirectoryReader("/home/ubuntu/environment/revised-Project/AWSDOCS").load_data()
        # build index

        azure_index = VectorStoreIndex.from_documents(azure_docs, storage_context=StorageContext.from_chroma(client=chroma_client, collection_name='azure'), show_progress=True)
        assessment_index = VectorStoreIndex.from_documents(assessment_docs, storage_context=StorageContext.from_chroma(client=chroma_client, collection_name='assessment'), show_progress=True)
        control_index = VectorStoreIndex.from_documents(control_docs, storage_context=StorageContext.from_chroma(client=chroma_client, collection_name='control'), show_progress=True)
        ques_index = VectorStoreIndex.from_documents(ques_docs, storage_context=StorageContext.from_chroma(client=chroma_client, collection_name='questionaire'), show_progress=True)
        aws_index = VectorStoreIndex.from_documents(aws_documents, storage_context=StorageContext.from_chroma(client=chroma_client, collection_name='aws_docs'), show_progress=True)

    print("Vector Content Loaded")
    return azure_index, assessment_index, control_index, ques_index, aws_index

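I'm getting an error that says -> ValueError: Could not connect to tenant default_tenant. Are you sure it exists?
I also tried starting the ChromaDB server with -> chroma_server --host {EC2 IP} --port {EC2 port number}.
How can I fix this error so that I can create a ChromaDB server in my Cloud9 environment?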

2admgd59 #1

storage_context = StorageContext.from_chroma(client=chroma_client, collection_name='azure') is not valid syntax.
See the docs:
https://docs.llamaindex.ai/en/stable/examples/vector_stores/ChromaIndexDemo/?h=chroma
For example:

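# Import paths assume llama-index >= 0.10 with llama-index-vector-stores-chroma installed
import chromadb
from llama_index.core import StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore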

remote_db = chromadb.HttpClient(...)
chroma_collection = remote_db.get_or_create_collection("quickstart")

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

You can also check Chroma's docs for setting up the client:
https://docs.trychroma.com/guides#using-the-python-http-only-client
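
To make the snippet above concrete, here is a minimal sketch of wiring a remote Chroma server into a StorageContext. It assumes a Chroma server is already running and reachable, and that llama-index >= 0.10 with the llama-index-vector-stores-chroma package is installed. The function name remote_storage_context and the host and port values are hypothetical; substitute your own. As far as I know, chromadb.Client() with no arguments creates an in-process client rather than connecting to a server, which is why HttpClient is used here.

```python
def remote_storage_context(host: str, port: int, collection_name: str):
    """Return a StorageContext backed by a collection on a remote Chroma server.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Imports are local so this module can be loaded even where the
    # packages are not installed (import paths assume llama-index >= 0.10).
    import chromadb
    from llama_index.core import StorageContext
    from llama_index.vector_stores.chroma import ChromaVectorStore

    # Connect to the running server instead of creating a local client.
    client = chromadb.HttpClient(host=host, port=port)
    collection = client.get_or_create_collection(collection_name)
    vector_store = ChromaVectorStore(chroma_collection=collection)
    return StorageContext.from_defaults(vector_store=vector_store)


# Example call (requires a reachable server; host and port are placeholders):
# storage_context = remote_storage_context("localhost", 8000, "azure")
```

The returned storage context can then be passed to VectorStoreIndex.from_documents(..., storage_context=storage_context), as in the linked demo.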

jqjz2hbq #2

I followed the docs; the specific section I'm following now is this one:

# save to disk

db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model
)

# load from disk
db2 = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db2.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=embed_model,
)

My code now looks like this:

def citation_indexing():
    db_path = "./chroma_db"
    embed_model = OpenAIEmbedding(model="text-embedding-3-large")

    try:
        print("Loading Vector Content")
        
        db = chromadb.PersistentClient(path=db_path)

        azure_collection = db.get_or_create_collection("azure")
        azure_vector_store = ChromaVectorStore(chroma_collection=azure_collection)
        azure_index = VectorStoreIndex.from_vector_store(
            vector_store=azure_vector_store,
            embed_model=embed_model,
        )

        assessment_collection = db.get_or_create_collection("assessment")
        assessment_vector_store = ChromaVectorStore(chroma_collection=assessment_collection)
        assessment_index = VectorStoreIndex.from_vector_store(
            vector_store=assessment_vector_store,
            embed_model=embed_model,
        )

        control_collection = db.get_or_create_collection("control")
        control_vector_store = ChromaVectorStore(chroma_collection=control_collection)
        control_index = VectorStoreIndex.from_vector_store(
            vector_store=control_vector_store,
            embed_model=embed_model,
        )

        questionnaire_collection = db.get_or_create_collection("questionnaire")
        questionnaire_vector_store = ChromaVectorStore(chroma_collection=questionnaire_collection)
        questionnaire_index = VectorStoreIndex.from_vector_store(
            vector_store=questionnaire_vector_store,
            embed_model=embed_model,
        )

        aws_docs_collection = db.get_or_create_collection("aws_docs")
        aws_docs_vector_store = ChromaVectorStore(chroma_collection=aws_docs_collection)
        aws_index = VectorStoreIndex.from_vector_store(
            vector_store=aws_docs_vector_store,
            embed_model=embed_model,
        )

        index_loaded = True
    except Exception as e:
        print(f"Error Loading Vector Content: {e}")
        index_loaded = False
    
    if not index_loaded:
        print('Vectorizing Content')
        # load data
        azure = SimpleDirectoryReader(
            input_files=["/home/ubuntu/environment/revised-Project/All Docs/azure.pdf"]
        ).load_data()
        assessment = SimpleDirectoryReader(
            input_files=["/home/ubuntu/environment/revised-Project/assessment-procedures.pdf"]).load_data()
        control = SimpleDirectoryReader("/home/ubuntu/environment/revised-Project/Controls").load_data()
        questionnaire = SimpleDirectoryReader("/home/ubuntu/environment/revised-Project/Questionnaire").load_data()
        aws_documents = SimpleDirectoryReader("/home/ubuntu/environment/revised-Project/AWSDOCS").load_data()

        # build index
        azure_collection = db.get_or_create_collection("azure")
        azure_vector_store = ChromaVectorStore(chroma_collection=azure_collection)
        azure_storage_context = StorageContext.from_defaults(vector_store=azure_vector_store)
        azure_index = VectorStoreIndex.from_documents(azure, storage_context=azure_storage_context, embed_model=embed_model, show_progress=True)

        assessment_collection = db.get_or_create_collection("assessment")
        assessment_vector_store = ChromaVectorStore(chroma_collection=assessment_collection)
        assessment_storage_context = StorageContext.from_defaults(vector_store=assessment_vector_store)
        assessment_index = VectorStoreIndex.from_documents(assessment, storage_context=assessment_storage_context, embed_model=embed_model, show_progress=True)

        control_collection = db.get_or_create_collection("control")
        control_vector_store = ChromaVectorStore(chroma_collection=control_collection)
        control_storage_context = StorageContext.from_defaults(vector_store=control_vector_store)
        control_index = VectorStoreIndex.from_documents(control, storage_context=control_storage_context, embed_model=embed_model, show_progress=True)

        questionnaire_collection = db.get_or_create_collection("questionnaire")
        questionnaire_vector_store = ChromaVectorStore(chroma_collection=questionnaire_collection)
        questionnaire_storage_context = StorageContext.from_defaults(vector_store=questionnaire_vector_store)
        questionnaire_index = VectorStoreIndex.from_documents(questionnaire, storage_context=questionnaire_storage_context, embed_model=embed_model, show_progress=True)

        aws_docs_collection = db.get_or_create_collection("aws_docs")
        aws_docs_vector_store = ChromaVectorStore(chroma_collection=aws_docs_collection)
        aws_docs_storage_context = StorageContext.from_defaults(vector_store=aws_docs_vector_store)
        aws_index = VectorStoreIndex.from_documents(aws_documents, storage_context=aws_docs_storage_context, embed_model=embed_model, show_progress=True)

    print("Vector Content Loaded")
    return azure_index, assessment_index, control_index, questionnaire_index, aws_index

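The problem I'm facing is that the content is no longer being recognized, which is why the observation comes back empty. I'm using Chain of Thought to view the RAG LLM's reasoning, but all it gives me is: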

Batch
> Current query: Write a detailed description of the following service: Batch. Describe what it's used for and what it does.
> New query: Which AWS services apply to the  analytics system Controls?
> Running step f1266b2e-5d6d-4ba6-8290-d952378a6856. Step input: Which AWS services apply to the system analytics Controls?
Thought: The current language of the user is English. I need to use a tool to help me answer the question.
Action: aws_services
Action Input: {'input': 'System Analytics Controls'}
Observation: Empty Response
Thought: Since the tools did not return any information, I will provide a general answer based on my knowledge.
Answer: AWS offers a wide range of services that are satisfy the provided Controls. Some of these services include:

1. **Amazon EC2 (Elastic Compute Cloud)** - Provides scalable computing capacity.
2. **Amazon S3 (Simple Storage Service)** - Offers scalable object storage.
3. **Amazon RDS (Relational Database Service)** - Simplifies setting up, operating, and scaling a relational database.
4. **AWS Lambda** - Allows you to run code without provisioning or managing servers.
5. **Amazon VPC (Virtual Private Cloud)** - Enables you to launch AWS resources in a virtual network that you define.
6. **AWS IAM (Identity and Access Management)** - Helps you securely control access to AWS services and resources.
7. **AWS CloudTrail** - Enables governance, compliance, and operational and risk auditing of your AWS account.
8. **AWS Config** - Provides AWS resource inventory, configuration history, and configuration change notifications.
9. **AWS Shield** - Provides managed DDoS protection.
10. **AWS WAF (Web Application Firewall)** - Helps protect your web applications from common web exploits.

These services are part of AWS's compliance with , which ensures that they meet the stringent security requirements outlined in the Controls. For a complete and up-to-date list of AWS services that are authorized, you can refer to the AWS page or the AWS Services in Scope by Compliance Program documentation.
> Current query: Write a detailed description of the following service: Batch. Describe what it's used for and what it does.

c0vxltue #3

The try/except will never catch an exception.
For example, these two lines always run without error:

db = chromadb.PersistentClient(path=db_path)
azure_collection = db.get_or_create_collection("azure")

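regardless of whether db_path exists, and regardless of whether the collection already exists.
You should probably check whether db_path exists instead of relying on try/except. (Delete db_path before re-running so that it rebuilds properly.)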
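
A rough sketch of that suggestion, using only the standard library for the path check. The helper names index_needs_build and load_or_build are hypothetical, and the build and load callables stand in for whatever wraps VectorStoreIndex.from_documents and VectorStoreIndex.from_vector_store in your code. The second helper relies on the Chroma collection's count() method to detect a collection that exists but holds no vectors.

```python
from pathlib import Path
from typing import Any, Callable


def index_needs_build(db_path: str) -> bool:
    """Return True when db_path is missing or is an empty directory.

    Call this BEFORE creating chromadb.PersistentClient(path=db_path),
    since creating the client may create the directory.
    """
    path = Path(db_path)
    if not path.is_dir():
        return True
    # An existing but empty directory also means nothing was persisted.
    return not any(path.iterdir())


def load_or_build(
    collection: Any,
    build: Callable[[Any], Any],
    load: Callable[[Any], Any],
) -> Any:
    """Build the index if the collection holds no vectors, else load it."""
    if collection.count() == 0:
        return build(collection)
    return load(collection)
```

With either check in place, the decision to vectorize no longer depends on an exception that is never raised.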

6qfn3psc #4

I've got the RAG system working. I added something that isn't in the documentation, which is the ServiceContext:
service_context = ServiceContext.from_defaults(embed_model = embed_model, chunk_size = 1000, chunk_overlap = 20)
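At least for the content I was providing, it needed to be chunked before running the vector index. Once I chunked the data, it worked.
As for the try/except, I mainly added it for debugging purposes, but now that everything works I'll change it as you suggested. Thanks for your help.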
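
For readers unfamiliar with the chunk_size and chunk_overlap settings mentioned above, here is a simplified illustration of overlapping chunks. It is not LlamaIndex's splitter: it counts characters rather than tokens and ignores sentence boundaries, and the function name chunk_text is hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 20) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(chunk_text("0123456789", chunk_size=4, chunk_overlap=1))
# → ['0123', '3456', '6789']
```

Each chunk repeats the last chunk_overlap characters of the previous one, so content that straddles a boundary appears in both.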
