llama_index [Bug]: mmr_threshold is not supported by ChromaVectorStore

Posted by 68de4m5k 2 months ago in Other

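Bug Description

I previously implemented a VectorIndexRetriever on top of LlamaIndex's built-in vector index, without any vector database, with MMR mode and mmr_threshold enabled. It worked fine.
Then I added ChromaDB and found that MMR mode works only as long as mmr_threshold is not included. If you set mmr_threshold in the vector_store_kwargs parameter, an error is raised.

Version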

llama-index-0.10.1

Steps to Reproduce

This is the relevant part of my code. If I comment out the line marked "# ERROR", the code works.

import os

from llama_index.core.indices.vector_store.base import VectorStoreIndex
from llama_index.core.storage import StorageContext
from llama_index.core.indices.vector_store.retrievers.retriever import (
    VectorIndexRetriever,
)
from llama_index.core.query_engine.retriever_query_engine import (
    RetrieverQueryEngine
)
from llama_index.core import Settings
from llama_index.vector_stores.chroma.base import ChromaVectorStore
import chromadb

def run_query(
    question: str,
    vectorstore: str,
    top_k: int,
    mmr_threshold: float,
) -> RetrieverQueryEngine | None:
    '''
    Return an LLM response to an input query after doing a vector search.

    Args:
        question (str): The query to the LLM.
        vectorstore (str): Folder name of vector database.
        top_k (int): Number of retrievals or citations to retrieve via
            vector search.
        mmr_threshold (float): A float between 0 and 1, for MMR search mode.
            Closer to 0 gives you more diversity in the retrievals.
            Closer to 1 gives you more relevance in the retrievals.

    Returns:
        RetrieverQueryEngine | None: If vectorstore location exists, return a
            Response object from RetrieverQueryEngine, else return nothing.
    '''

    if not os.path.exists(vectorstore):
        print('Error: Vectorstore', vectorstore, 'not found!')
        return
    else:
        # Instantiate a Chroma client, setting storage folder location:
        client = chromadb.PersistentClient(path=vectorstore)

        # Instantiate a Chroma collection based on the client:
        chroma_collection = client.get_or_create_collection(vectorstore)

        # Instantiate a ChromaVectorStore based on the Chroma collection:
        vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

        # Instantiate a storage context based on the ChromaVectorStore:
        storage_context = StorageContext.from_defaults(
            vector_store=vector_store
        )

        # Instantiate LLM and embedding model:
        llm, embedding = create_azure_models()

        # Add these 2 models to the LlamaIndex Settings:
        Settings.llm = llm
        Settings.embed_model = embedding

        index = VectorStoreIndex.from_vector_store(
            vector_store,
            storage_context=storage_context
        )

        # Instantiate and configure a VectorIndexRetriever:
        # Note about parameters:
        # similarity_top_k sets the number of retrievals (citations).
        # mmr_threshold is a value between 0 and 1.
        # Closer to 0 gives you more diversity.
        # Closer to 1 gives you more relevance.
        # If the data contains duplicated entries, set it lower (e.g. 0.2)
        # so that the retriever will skip over search results that are
        # identical or very similar and go for greater diversity.
        retriever = VectorIndexRetriever(
            index=index,
            similarity_top_k=top_k,
            vector_store_query_mode='mmr',
            vector_store_kwargs={'mmr_threshold': mmr_threshold}  # ERROR
        )

        # Instantiate RetrieverQueryEngine and pass in the VectorIndexRetriever:
        query_engine = RetrieverQueryEngine(
            retriever=retriever
        )

        # Query the index:
        response = query_engine.query(question)

        return response

6 Answers

w41d8nur #1

🤖

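Hey, @jfkoh! Good to see you again. I hope all is well!
The error you encountered, "Collection.query() got an unexpected keyword argument 'mmr_threshold'", occurs because in LlamaIndex 0.10.1 the 'mmr_threshold' parameter is not recognized by the 'query' method of the 'ChromaVectorStore' class.
The 'query' method of 'ChromaVectorStore' accepts two arguments: 'query' and '**kwargs'. 'query' is an instance of 'VectorStoreQuery', and '**kwargs' is a dictionary of additional arguments. 'mmr_threshold' is not an argument this code path recognizes, and since the error message names Collection.query(), the extra keyword appears to be passed through to Chroma's own query call, which raises the TypeError.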
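The mechanism described above can be reproduced in isolation. The sketch below uses two hypothetical stand-in classes, FakeCollection and FakeVectorStore (neither is the real Chroma or LlamaIndex class), and assumes only that the wrapper forwards its extra keyword arguments to a backend method with a fixed signature.

```python
class FakeCollection:
    """Hypothetical stand-in for a backend collection (not the real Chroma class)."""

    def query(self, query_embeddings, n_results=10):
        # Only these two keywords are accepted; any other keyword raises TypeError.
        return {"ids": [["doc-1", "doc-2"][:n_results]]}


class FakeVectorStore:
    """Hypothetical stand-in for a vector store wrapper."""

    def __init__(self, collection):
        self._collection = collection

    def query(self, query_embedding, top_k, **kwargs):
        # Extra keyword arguments are passed straight through to the backend.
        return self._collection.query(
            query_embeddings=[query_embedding], n_results=top_k, **kwargs
        )


store = FakeVectorStore(FakeCollection())

# Without the extra keyword, the call succeeds.
ok_result = store.query([0.1, 0.2], top_k=2)

# With it, the backend method rejects the unknown keyword.
try:
    store.query([0.1, 0.2], top_k=2, mmr_threshold=0.2)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The wrapper itself never inspects mmr_threshold; the failure only surfaces one level down, which would explain why the traceback names the backend's query() rather than the wrapper's.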
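To resolve this, remove the 'mmr_threshold' argument from the 'query' method call. If your use case needs it, you may have to implement a custom solution that handles this parameter, or check whether a newer version of LlamaIndex supports it.
The relevant code is in the file llama_index/vector_stores/chroma.py.
As for 'mmr_threshold' itself: it controls the trade-off between relevance and diversity in Maximal Marginal Relevance (MMR) retrieval. MMR iteratively picks documents that are dissimilar to the results already selected. The 'mmr_threshold' parameter (written λ in the MMR algorithm) sets the balance between relevance (a document's similarity to the query) and diversity (its dissimilarity to the documents already selected).
A 'mmr_threshold' close to 1 puts more weight on relevance, while a value close to 0 puts more weight on diversity. As the threshold approaches 1, similarity to previously selected nodes is ignored and the ranking reduces to plain similarity, which turns off the effect of MMR. Lowering the threshold makes the algorithm favor more diverse documents.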
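To make the trade-off concrete, here is a minimal sketch of greedy MMR selection in plain Python. It assumes the common formulation score = λ · sim(query, doc) − (1 − λ) · max sim(doc, selected) with cosine similarity; the function names and the toy vectors are illustrative, and LlamaIndex's own implementation may differ in detail.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mmr_select(query, docs, top_k, mmr_threshold):
    """Greedy MMR: return indices of the chosen docs in selection order."""
    selected = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < top_k:
        best, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine(query, docs[i])
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(docs[i], docs[j]) for j in selected), default=0.0
            )
            score = mmr_threshold * relevance - (1 - mmr_threshold) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
docs = [
    [1.0, 0.10],  # very relevant
    [1.0, 0.12],  # near-duplicate of the first document
    [0.6, 0.80],  # less relevant, but different
]

relevance_only = mmr_select(query, docs, top_k=2, mmr_threshold=1.0)
diverse = mmr_select(query, docs, top_k=2, mmr_threshold=0.2)
print(relevance_only, diverse)
```

With the threshold at 1.0 the near-duplicate is returned as the second hit; at 0.2 it is skipped in favor of the less similar document. A function along these lines could also serve as a starting point for a custom solution that re-ranks candidates fetched from a store without native MMR support.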
In LlamaIndex, 'mmr_threshold' is set through 'vector_store_kwargs'. The example in docs/examples/vector_stores/SimpleIndexDemoMMR.ipynb sets it to 0.2, which makes MMR retrieval put more weight on diversity.
I hope this helps! If you have any other questions, feel free to ask.

This response is meant to save you time; it is not a precise solution, only a starting point for your own research.

2 months ago

k0pti3hp #2

This isn't really a llama-index bug, is it? Chroma doesn't seem to have an mmr_threshold kwarg.
In fact, MMR isn't mentioned anywhere in their docs, so I don't think they support it.

2 months ago

klh5stk1 #3

Thanks for the comment, Logan :)
I see that Langchain implements something similar to mmr_threshold for Chroma, but they call it lambda_mult or score_threshold. See this page:
https://api.python.langchain.com/en/v0.0.342/vectorstores/langchain.vectorstores.chroma.Chroma.html
I tried both names in LlamaIndex, but neither works.

2 months ago

rmbxnbpk #4

Hmm, I guess we just haven't implemented it yet :) PRs are very welcome.

2 months ago

n6lpvg4x #5

@logan-markewich Has nobody addressed this? It seems like a pretty important thing for proper RAG retrieval.

2 months ago

cwdobuhd #6

@cmosguy The MMR threshold isn't widely used. Nobody has requested it since this issue was opened. Feel free to submit a PR.

2 months ago