llama_index [Question]: Is there a way to initialize my index data from an Elasticsearch database?

Posted by yqkkidmi 2 months ago in Other


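Question Validation

I have searched both the documentation and Discord for an answer.

Question

How can I initialize an index that has a docstore through VectorStoreIndex.from_vector_store? When I create the index with VectorStoreIndex.from_vector_store, its docstore is actually empty, which means nothing is loaded into memory: no vectors and no nodes. Is this a bug?
Or is there another way to initialize index data from an Elasticsearch database?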

llama_index

Source: https://github.com/run-llama/llama_index/issues/13593

3 Answers


8yparm6h1#

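VectorStoreIndex.from_vector_store is designed to initialize a VectorStoreIndex instance without pre-populating it with data. It sets up the index structure for later operations but does not load existing data from the vector store. That is why the docstore is empty after you call this method. This is expected behavior, not a bug.
To initialize the index with data from an Elasticsearch database, you need to load the data into the index explicitly. This may involve querying the existing data in Elasticsearch with something like ElasticsearchReader.load_data and then adding those nodes to the index.
If you are working with data that is already in the vector store and the empty docstore is causing problems, consider interacting with the vector store API directly, or using the index's retrieval methods to fetch and add the nodes. For example, you can retrieve the nodes stored in the vector store (up to similarity_top_k of them) like this and then add them to your index: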

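# `index` is the VectorStoreIndex returned by VectorStoreIndex.from_vector_store.
# A very large top-k plus a placeholder query pulls back as many nodes as possible.
retriever = index.as_retriever(similarity_top_k=10000)
source_nodes = retriever.retrieve("fake")
# Each result wraps a node together with its similarity score; keep only the node.
nodes = [x.node for x in source_nodes]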

This approach lets you populate the index with existing data from a vector store such as an Elasticsearch database.
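
The retrieval trick above can be wrapped in a small helper. This is only a sketch: `collect_nodes` is a hypothetical name, it assumes `index` exposes `as_retriever` and that each result has a `.node` with a `node_id` (as llama_index nodes do), and it returns at most `top_k` nodes, so it is not guaranteed to cover a store that holds more than that.

```python
from typing import Any, List


def collect_nodes(index: Any, top_k: int = 10000, query: str = "fake") -> List[Any]:
    """Return the distinct nodes the index's retriever yields for a placeholder query."""
    retriever = index.as_retriever(similarity_top_k=top_k)
    seen = set()
    nodes = []
    for scored in retriever.retrieve(query):
        node = scored.node
        # Skip a node that was already returned under the same id.
        if node.node_id not in seen:
            seen.add(node.node_id)
            nodes.append(node)
    return nodes


# Possible usage (assumes `vector_store` is your already-configured Elasticsearch store):
# from llama_index.core import VectorStoreIndex
# index = VectorStoreIndex.from_vector_store(vector_store)
# nodes = collect_nodes(index)
```

Because the retriever embeds the query string before searching, an embedding model still has to be configured even though the query text itself is irrelevant here.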



hgc7kmma2#

@mglslg with all vector db integrations, the docstore is always empty, because all the nodes are stored in the vector db. This is done to simplify storage.
If you need a docstore, you can either manually maintain one outside of the index using the class itself, or set store_nodes_override=True when creating your index. Both options mean you need to save and load the docstore somewhere.
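
A rough sketch of the store_nodes_override route, assuming llama_index 0.10+ import paths. The function names and the docstore_path default are hypothetical, and `vector_store` stands for whatever Elasticsearch vector store object you already use. Treat it as a starting point rather than the library's prescribed pattern.

```python
def build_index_with_docstore(documents, vector_store, docstore_path="./storage/docstore.json"):
    """Index documents into the vector store while also keeping nodes in the docstore."""
    # Imported lazily so this sketch can be loaded without llama_index installed.
    from llama_index.core import StorageContext, VectorStoreIndex

    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        store_nodes_override=True,  # keep nodes in the docstore as well as the vector db
    )
    # The default docstore lives in memory, so it has to be saved explicitly.
    index.docstore.persist(persist_path=docstore_path)
    return index


def load_index_with_docstore(vector_store, docstore_path="./storage/docstore.json"):
    """Re-attach a previously saved docstore to an index over the same vector store."""
    from llama_index.core import StorageContext, VectorStoreIndex
    from llama_index.core.storage.docstore import SimpleDocumentStore

    docstore = SimpleDocumentStore.from_persist_path(docstore_path)
    storage_context = StorageContext.from_defaults(
        vector_store=vector_store, docstore=docstore
    )
    # No new nodes are passed in; the existing vectors stay in the vector store.
    return VectorStoreIndex(
        nodes=[], storage_context=storage_context, store_nodes_override=True
    )
```

The docstore file is then a second piece of state that has to be kept in sync with the vector store, which is the trade-off this answer points out.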


8gsdolmq3#

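Thanks for your answer!
I'm a bit confused about this docstore object.
It seems to act like a cache: when I call refresh_ref_docs, I see code such as self.docstore.set_document_hash(document.get_doc_id(), document.hash), and later, when deciding whether to update, it compares against the hash stored in the docstore.
Could you explain the design concept behind it? I couldn't find an explanation in the official documentation. I initially assumed that refresh_ref_docs would automatically read the data from Elasticsearch and compare hashes, but it turns out it only compares against the hashes cached in the docstore. In the end I had to check the hashes in Elasticsearch manually. My code ended up as follows: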

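import json
from typing import List

from llama_index.core import Document  # import path for llama_index >= 0.10


def get_changed_docs(es_index_name: str, doc_list: List[Document]) -> List[Document]:
    # get_es_client() is the poster's own helper (not shown); it returns an Elasticsearch client.
    es_client = get_es_client()
    changed_doc_list = []
    for doc in doc_list:
        # Look up the nodes stored in Elasticsearch for this document.
        query = {
            "query": {
                "match": {
                    "metadata.doc_id": f"{doc.get_doc_id()}"
                }
            }
        }
        result = es_client.search(index=es_index_name, body=query)
        hits = result['hits']['hits']

        # Nothing stored yet, so the document is new.
        if not hits:
            changed_doc_list.append(doc)
            continue

        for hit in hits:
            node_content = hit['_source']['metadata']['_node_content']
            node_obj = json.loads(node_content)
            # Relationship key '1' is the node's source document (NodeRelationship.SOURCE).
            if node_obj['relationships']['1']['hash'] != doc.hash:
                changed_doc_list.append(doc)
                break  # one stale node is enough; avoid adding the same doc twice

    return changed_doc_list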

need_refresh_docs = get_changed_docs(es_index_name, mongo_documents)

index.refresh_ref_docs(need_refresh_docs)

Is there a better way to implement this within the LlamaIndex framework?
@dosubot
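
One possible refinement of the manual check above, offered as a sketch rather than a framework feature: separate the hash comparison from the Elasticsearch calls so it can be run over hits gathered from any query (or a single batched one) and tested without a cluster. The helper names are hypothetical; the hit layout (`_source.metadata.doc_id`, `_node_content`, relationship key `'1'`) is assumed to match what the code above reads.

```python
import json
from typing import Any, Dict, Iterable, List, Set


def stored_hashes(hits: Iterable[Dict[str, Any]]) -> Dict[str, Set[Any]]:
    """Map doc_id -> set of source-document hashes recorded on its stored nodes."""
    by_doc: Dict[str, Set[Any]] = {}
    for hit in hits:
        metadata = hit["_source"]["metadata"]
        node_obj = json.loads(metadata["_node_content"])
        # Relationship key "1" points at the node's source document.
        source = node_obj.get("relationships", {}).get("1")
        doc_hash = source.get("hash") if source else None
        by_doc.setdefault(metadata["doc_id"], set()).add(doc_hash)
    return by_doc


def changed_doc_ids(current_hashes: Dict[str, str], hits: Iterable[Dict[str, Any]]) -> List[str]:
    """Return ids whose document is missing from the store or stored with a different hash."""
    by_doc = stored_hashes(hits)
    changed = []
    for doc_id, doc_hash in current_hashes.items():
        # Unchanged only if every stored node carries exactly the current hash.
        if by_doc.get(doc_id) != {doc_hash}:
            changed.append(doc_id)
    return changed
```

With this split, `current_hashes` could be built as `{doc.get_doc_id(): doc.hash for doc in mongo_documents}`, and only the documents whose ids come back would be passed to index.refresh_ref_docs.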

