llama_index: How to reuse embeddings from a vector store created with Elasticsearch (Query id not found when using Dense-x)

jpfvwuh4  posted 5 months ago in ElasticSearch

Question Validation

  • I have searched both the documentation and Discord for an answer.

Question

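I ran into a problem when trying to reuse embeddings from Elasticsearch with Dense-x. The problem appears when I try to get a response using embeddings that were created earlier. On the first run, the embeddings are created successfully without any issues. On subsequent runs, however, when the code tries to reuse those embeddings, I get the following error:
ValueError: Query id [id] not found in either retriever_dict or query_engine_dict.
Here is the code snippet used to create and reuse the embeddings: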

self.documents = SimpleDirectoryReader(input_dir=os.environ.get('DOC_PATH'), required_exts=[".docx", ".doc", ".pdf", ".txt"]).load_data()

nodes = self.text_splitter.get_nodes_from_documents(self.documents)
sub_nodes = await self._gen_propositions(nodes)
all_nodes = nodes + sub_nodes
all_nodes_dict = {n.node_id: n for n in all_nodes}

vector_store = ElasticsearchStore(
    index_name=os.environ.get('INDEX_NAME'),
    es_url=os.environ.get('ES_URL'),
)

service_context = ServiceContext.from_defaults(
    llm=self._proposition_llm,
    embed_model=Settings.embed_model,
    num_output=self._proposition_llm.metadata.num_output,
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

if os.path.exists("./elastic_db"):
    print("From Elasticsearch")
    self.vector_index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)
else:
    self.vector_index = VectorStoreIndex(
        all_nodes, service_context=service_context, show_progress=True, storage_context=storage_context
    )
    os.makedirs("./elastic_db")

self.retriever = RecursiveRetriever(
    "vector",
    retriever_dict={
        "vector": self.vector_index.as_retriever(
            similarity_top_k=self.similarity_top_k
        )
    },
    node_dict=all_nodes_dict,
)

self.query_engine = RetrieverQueryEngine.from_args(
    self.retriever, streaming=self.streaming
)

response = self.query_engine.query("Query in String")
print(response)

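Expected behavior:
I expect to be able to reuse the embeddings stored in Elasticsearch and get a response without hitting the ValueError.
Steps to reproduce:

  • Run the code to create the embeddings for the first time and store them in Elasticsearch.
  • Run the code again to get a response using the previously created embeddings.
  • Observe the ValueError.

Additional information:

  • I am using Dense-x with Elasticsearch as the vector store.
  • The code checks whether the ./elastic_db directory exists to decide whether to load from Elasticsearch or create new embeddings.

Help needed:
How can I modify the code to successfully reuse the embeddings from the Elasticsearch vector store and avoid the ValueError?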

pdsfdshx  #1

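To reuse embeddings from the Elasticsearch vector store and avoid the error "ValueError: Query id [id] not found in either retriever_dict or query_engine_dict", make sure that retriever_dict and query_engine_dict are set up correctly and that the query IDs are referenced correctly. The following example shows one way to modify the code: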

import os

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.schema import TextNode
from llama_index.core.vector_stores import FilterOperator, MetadataFilter, MetadataFilters
from llama_index.vector_stores.elasticsearch import ElasticsearchStore, AsyncDenseVectorStrategy
from llama_index.core.retrievers import VectorIndexAutoRetriever, RecursiveRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.openai import OpenAI

# Define the Elasticsearch store
dense_vector_store = ElasticsearchStore(
    es_url=os.environ.get('ES_URL'),
    index_name=os.environ.get('INDEX_NAME'),
    retrieval_strategy=AsyncDenseVectorStrategy(),
)

# Define the helper function to print results
def print_results(results):
    for rank, result in enumerate(results, 1):
        print(
            f"{rank}. title={result.metadata['title']} score={result.get_score()} text={result.get_text()}"
        )

# Define the search function
def search(vector_store: ElasticsearchStore, nodes: list[TextNode], query: str):
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex(nodes, storage_context=storage_context)

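    # Setup auto retriever
    # NOTE: VectorIndexAutoRetriever also expects a vector_store_info argument, and
    # IndexAutoRetriever is not imported above; it has to be defined separately
    # (for example as a BaseRetriever subclass that wraps results in IndexNode objects).
    auto_retriever = VectorIndexAutoRetriever(index)
    index_auto_retriever = IndexAutoRetriever(retriever=auto_retriever)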

    # Define per-document retriever
    retriever_dict = {}
    for node in nodes:
        index_id = node.metadata["index_id"]
        filters = MetadataFilters(
            filters=[
                MetadataFilter(
                    key="index_id", operator=FilterOperator.EQ, value=index_id
                ),
            ]
        )
        retriever = index.as_retriever(filters=filters)
        retriever_dict[index_id] = retriever

    # Setup recursive retriever
    recursive_retriever = RecursiveRetriever(
        "vector",
        retriever_dict={"vector": index_auto_retriever, **retriever_dict},
        verbose=True,
    )

    # Plug into query engine
    llm = OpenAI(model="gpt-3.5-turbo")
    query_engine = RetrieverQueryEngine.from_args(
        recursive_retriever, llm=llm
    )

    print(">>> Documents:")
    results = recursive_retriever.retrieve(query)
    print_results(results)

    print("\n>>> Answer:")
    response = query_engine.query(query)
    print(response)

# Example data
from llama_index.core.schema import TextNode

movies = [
    TextNode(
        text="The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.",
        metadata={"title": "Pulp Fiction", "index_id": "1"},
    ),
    TextNode(
        text="When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.",
        metadata={"title": "The Dark Knight", "index_id": "2"},
    ),
    TextNode(
        text="An insomniac office worker and a devil-may-care soapmaker form an underground fight club that evolves into something much, much more.",
        metadata={"title": "Fight Club", "index_id": "3"},
    ),
    TextNode(
        text="A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O.",
        metadata={"title": "Inception", "index_id": "4"},
    ),
    TextNode(
        text="A computer hacker learns from mysterious rebels about the true nature of his reality and his role in the war against its controllers.",
        metadata={"title": "The Matrix", "index_id": "5"},
    ),
    TextNode(
        text="Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly sins as his motives.",
        metadata={"title": "Se7en", "index_id": "6"},
    ),
    TextNode(
        text="An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.",
        metadata={"title": "The Godfather", "theme": "Mafia", "index_id": "7"},
    ),
]

# Perform the search
search(dense_vector_store, movies, "which movie involves dreaming?")

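The search function sets up retriever_dict and query_engine_dict correctly by creating the retriever and query_engine from the VectorStoreIndex and making sure the query IDs are referenced correctly. This should help avoid the ValueError caused by missing query IDs. In addition, the ElasticsearchStore class and its methods, such as _to_elasticsearch_filter and _mode_must_match_retrieval_strategy, ensure that the retrieval strategy is compatible with the query mode, which is essential for avoiding errors during querying.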
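
One plausible cause, stated here as an assumption rather than something confirmed in this thread: the code in the question rebuilds all_nodes_dict on every run, and llama_index assigns random UUIDs as default node IDs. If so, the IDs stored in Elasticsearch on the first run would not appear in the node_dict passed to RecursiveRetriever on later runs. The sketch below uses only the Python standard library to illustrate keeping IDs stable across runs. The helpers stable_node_id, build_node_dict, save_node_dict and load_node_dict are hypothetical names chosen for illustration and are not llama_index APIs.

```python
import hashlib
import json
import uuid
from pathlib import Path


def stable_node_id(source: str, text: str) -> str:
    """Derive a deterministic ID from a chunk's source name and text."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return digest[:32]


def build_node_dict(chunks: list[tuple[str, str]]) -> dict[str, str]:
    """Map stable IDs to chunk text; chunks are (source, text) pairs."""
    return {stable_node_id(source, text): text for source, text in chunks}


def save_node_dict(node_dict: dict[str, str], path: Path) -> None:
    """Persist the ID-to-text mapping so a later run can reload it."""
    path.write_text(json.dumps(node_dict, ensure_ascii=False), encoding="utf-8")


def load_node_dict(path: Path) -> dict[str, str]:
    """Reload the mapping written by save_node_dict."""
    return json.loads(path.read_text(encoding="utf-8"))


chunks = [("a.txt", "first chunk"), ("a.txt", "second chunk")]

# Random IDs, as produced by uuid4-based defaults, differ between runs.
run_1_random = {str(uuid.uuid4()) for _ in chunks}
run_2_random = {str(uuid.uuid4()) for _ in chunks}

# Content-derived IDs are identical on every run.
run_1_stable = build_node_dict(chunks)
run_2_stable = build_node_dict(chunks)
```

Applied to the question's code, the same idea would mean either assigning a content-derived ID to each node before indexing, or saving all_nodes_dict on the first run and reloading it when ./elastic_db already exists, so that the IDs held by the retriever match those stored in Elasticsearch.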
