llama_index [Question]: Storing an index on persistent storage and loading it

oknwwptz posted 5 months ago in Other

Question Validation

  • I have searched the documentation and Discord for an answer.

Question

Hi, I'm having some trouble loading my indices from persistent storage.
The following script saves my vector and graph indices:

# Imports reconstructed from the usage below; package layout assumes the
# llama_index >= 0.10 namespaces that appear in the logs
import base64
import json
import logging
import os

from llama_index.core import (
    Document,
    KnowledgeGraphIndex,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.schema import TextNode
from llama_index.graph_stores.nebula import NebulaGraphStore

logging.basicConfig(level=logging.DEBUG)  # a second basicConfig call has no effect; DEBUG already includes INFO

...

os.environ["NEBULA_USER"] = "root"
os.environ["NEBULA_PASSWORD"] = "nebula"
os.environ["NEBULA_ADDRESS"] = "127.0.0.1:9669"

space_name = "test10"
edge_types, rel_prop_names = ["relationship"], ["relationship"]
tags = ["entity"]

def encode_string(s):
    return base64.urlsafe_b64encode(s.encode()).decode()

def decode_string(s):
    return base64.urlsafe_b64decode(s.encode()).decode()

def sanitize_and_encode(data):
    sanitized_data = {}
    for key, value in data.items():
        if isinstance(value, str):
            sanitized_data[key] = encode_string(value)
        else:
            sanitized_data[key] = value
    return sanitized_data

def decode_metadata(metadata):
    decoded_metadata = {}
    for key, value in metadata.items():
        if isinstance(value, str):
            decoded_metadata[key] = decode_string(value)
        else:
            decoded_metadata[key] = value
    return decoded_metadata

def load_json_nodes(json_directory):
    nodes = []
    for filename in os.listdir(json_directory):
        if filename.endswith('.json'):
            with open(os.path.join(json_directory, filename), 'r') as file:
                data = json.load(file)
                for node_data in data:
                    sanitized_metadata = sanitize_and_encode(node_data['metadata'])
                    node = TextNode(
                        text=encode_string(node_data['text']),
                        id_=node_data['id_'],
                        embedding=node_data['embedding'],
                        metadata=sanitized_metadata
                    )
                    nodes.append(node)
                    logging.debug(f"Loaded node ID: {node.id_}, text: {node_data['text']}, metadata: {node_data['metadata']}")
                    
    return nodes

def create_index():
    graph_store = NebulaGraphStore(
        space_name=space_name,
        edge_types=[etype.lower() for etype in edge_types], 
        rel_prop_names=[rprop.lower() for rprop in rel_prop_names],  
        tags=[tag.lower() for tag in tags] 
    )

    storage_context = StorageContext.from_defaults(graph_store=graph_store)
    
    json_nodes = load_json_nodes("JSON_nodes_999_large_syll_small")
    documents = [
        Document(
            text=decode_string(node.text),
            id_=node.id_,
            metadata=decode_metadata(node.metadata),
            embedding=node.embedding
        ) for node in json_nodes
    ]
    
    kg_index = KnowledgeGraphIndex.from_documents(
        documents,
        storage_context=storage_context,
        max_triplets_per_chunk=10,
        space_name=space_name,
        edge_types=edge_types,
        rel_prop_names=rel_prop_names,
        tags=tags,
        max_knowledge_sequence=15,
        include_embeddings=True
    )
    
    # Set the index_id for KnowledgeGraphIndex
    kg_index.set_index_id("kg_index")
    
    kg_index.storage_context.persist(persist_dir='./storage_graph_syllabus_test_small')
    logging.debug(f"KG Index created with {len(documents)} documents")

    # Create VectorStoreIndex
    vector_index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
    
    # Set the index_id for VectorStoreIndex
    vector_index.set_index_id("vector_index")
    
    # Persist the storage context
    storage_context.persist(persist_dir='./storage_graph_syllabus_test_small')
    logging.debug(f"Vector Index created with {len(documents)} documents")
    return kg_index, vector_index, storage_context

print("Creating Index...")
kg_index, vector_index, storage_context = create_index()
print("Index Created...")

Then, the following function in my query script tries to load these indices, but for some reason the KG index always returns an empty response:

persist_dir = './storage_graph_syllabus_test_small'

def initialize_indices():
    global vector_index, kg_index, vector_retriever, kg_retriever
    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)

    start_time = time.time() 

    if os.path.exists(persist_dir):
        vector_index = load_index_from_storage(storage_context, index_id="vector_index")
        kg_index = load_index_from_storage(storage_context, index_id="kg_index")
        
    else:
        ...
    vector_retriever = VectorIndexRetriever(index=vector_index, top_k=7)
    kg_retriever = KnowledgeGraphRAGRetriever(storage_context=storage_context, verbose=True, top_k=7)

Additionally, I have debug logging enabled to give me more information when I run the query script; the debug output is attached here as well:

WARNING:llama_index.core.graph_stores.simple:No existing llama_index.core.graph_stores.simple found at ./storage_graph_syllabus_test_small\graph_store.json. Initializing a new graph_store from scratch. 
INFO:llama_index.core.indices.loading:Loading indices with ids: ['vector_index']
INFO:llama_index.core.indices.loading:Loading indices with ids: ['kg_index']
WARNING:llama_index.core.indices.knowledge_graph.base:Upgrading previously saved KG index to new storage format.
 * Serving Flask app 'main_complete-emb-flask'
 * Debug mode: on
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:3000
 * Running on http://192.168.2.204:3000
INFO:werkzeug:Press CTRL+C to quit

Does anyone know why this is happening?
I have also checked the files in the persisted index, and they appear to be populated correctly.
Thanks for your time!


svmlkihl1#

Regarding your issue where the knowledge graph (KG) index returns an empty response when loaded from persistent storage, even though the persisted index files appear to be populated correctly: this may be caused by the "text" field being null in the index_struct.
In the provided JSON structure, the "text" field is set to null:

{
  "index_struct": {
    "text": null,
    "doc_id": "446680de-660e-421a-8fdc-ed244bd466ce",
    "embedding": null,
    "extra_info": null,
    "table": {
      ...
    }
  }
}

The text field is critical for the knowledge graph to return meaningful responses. If this field is null, the index may lack the data needed to generate a response, resulting in empty results.
Also, make sure the storage_context provided to the KnowledgeGraphQueryEngine contains a valid graph_store. The graph_store is essential for querying and retrieving data from the knowledge graph. If the graph_store is missing or not set up correctly, the query engine cannot retrieve any data and will return empty responses.
To resolve this, make sure that:

  1. The text field is populated with the relevant data when the index is saved to persistent storage.
  2. The storage_context contains a valid graph_store when the KnowledgeGraphQueryEngine is initialized.

Here is an example of how the KnowledgeGraphQueryEngine should be initialized:
storage_context = StorageContext.from_defaults(graph_store=your_graph_store)
query_engine = KnowledgeGraphQueryEngine(storage_context=storage_context)

Make sure your_graph_store is correctly configured and contains the necessary schema and data.
Additionally, if you are using the KnowledgeGraphIndex class from the LlamaIndex framework, make sure the graph_store is properly populated. The class includes a mechanism that upgrades a previously saved KG index to the new storage format when necessary:

if (
    len(self.index_struct.table) > 0
    and isinstance(self.graph_store, SimpleGraphStore)
    and len(self.graph_store._data.graph_dict) == 0
):
    logger.warning("Upgrading previously saved KG index to new storage format.")
    self.graph_store._data.graph_dict = self.index_struct.rel_map

This ensures that the graph_store is populated with the relationship map from the index_struct. If this upgrade step is not executed, the graph_store may remain empty, resulting in empty responses.
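
A quick way to sanity-check this after loading is to look at whether the in-memory graph store actually received the relationship map. The following is an illustrative sketch, not documented API: it assumes the persisted index uses a SimpleGraphStore and reaches into the private _data attribute shown in the snippet above, which may change between versions.

from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core.graph_stores import SimpleGraphStore

storage_context = StorageContext.from_defaults(persist_dir="./storage_graph_syllabus_test_small")
kg_index = load_index_from_storage(storage_context, index_id="kg_index")

graph_store = kg_index.graph_store
if isinstance(graph_store, SimpleGraphStore):
    # Number of subjects that ended up with stored triplets after the upgrade
    print(len(graph_store._data.graph_dict))
    # If the graph dict is empty but the index struct has a rel_map, mirror the
    # upgrade assignment from the snippet above manually
    if not graph_store._data.graph_dict and kg_index.index_struct.rel_map:
        graph_store._data.graph_dict = kg_index.index_struct.rel_map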
You can also refer to the test_load_graph_from_storage_simple function in the LlamaIndex tests to see how a graph is loaded correctly from storage. Here is a simplified version of the test:

def test_load_graph_from_storage_simple(
    documents: List[Document],
    tmp_path: Path,
    mock_service_context: ServiceContext,
) -> None:
    storage_context = StorageContext.from_defaults()

    vector_index_1 = VectorStoreIndex.from_documents(
        documents=documents,
        storage_context=storage_context,
        service_context=mock_service_context,
    )

    vector_index_2 = VectorStoreIndex.from_documents(
        documents=documents,
        storage_context=storage_context,
        service_context=mock_service_context,
    )

    summary_index = SummaryIndex.from_documents(
        documents=documents,
        storage_context=storage_context,
        service_context=mock_service_context,
    )

    graph = ComposableGraph.from_indices(
        SummaryIndex,
        children_indices=[vector_index_1, vector_index_2, summary_index],
        index_summaries=["vector index 1", "vector index 2", "summary index"],
        storage_context=storage_context,
        service_context=mock_service_context,
    )

    query_engine = graph.as_query_engine()
    response = query_engine.query("test query")

    storage_context.persist(str(tmp_path))

    new_storage_context = StorageContext.from_defaults(persist_dir=str(tmp_path))

    new_graph = load_graph_from_storage(
        new_storage_context, root_id=graph.root_id, service_context=mock_service_context
    )

    new_query_engine = new_graph.as_query_engine()
    new_response = new_query_engine.query("test query")

    assert str(response) == str(new_response)

This test ensures that the graph is loaded correctly from storage and that query responses remain consistent: 1 (https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/index_structs/knowledge_graph/index_kg.json), 2 (https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/query_engine/knowledge_graph_query_engine.py), 3 (https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/indices/knowledge_graph/base.py), 4 (https://github.com/run-llama/llama_index/blob/main/llama-index-core/tests/indices/test_loading_graph.py).

  • To continue the conversation with Dosu, mention @dosu.

envsm3lx2#

I'm still confused. How do I update my script to make sure that my rel_map is populated when I create and persist my storage?

@dosu


yx2lnoni3#

To make sure rel_map is populated when you create and persist your storage, you need to call the get_rel_map method after inserting or updating data. Here is an example of how to update your script to include this step:

class SimpleGraphStore(GraphStore):
    """Simple Graph Store.

    In this graph store, triplets are stored within a simple, in-memory dictionary.

    Args:
        simple_graph_store_data_dict (Optional[dict]): data dict
            containing the triplets. See SimpleGraphStoreData
            for more details.
    """

    def __init__(
        self,
        data: Optional[SimpleGraphStoreData] = None,
        fs: Optional[fsspec.AbstractFileSystem] = None,
        **kwargs: Any,
    ) -> None:
        """Initialize params."""
        self._data = data or SimpleGraphStoreData()
        self._fs = fs or fsspec.filesystem("file")

    def upsert_triplet(self, subj: str, rel: str, obj: str) -> None:
        """Add triplet and update rel_map."""
        if subj not in self._data.graph_dict:
            self._data.graph_dict[subj] = []
        if [rel, obj] not in self._data.graph_dict[subj]:
            self._data.graph_dict[subj].append([rel, obj])

        # Update rel_map after upserting the triplet
        self.rel_map = self.get_rel_map(subjs=[subj])

    def get_rel_map(
        self, subjs: Optional[List[str]] = None, depth: int = 2, limit: int = 30
    ) -> Dict[str, List[List[str]]]:
        """Get depth-aware rel map."""
        return self._data.get_rel_map(subjs=subjs, depth=depth, limit=limit)

# Example usage
graph_store = SimpleGraphStore()
graph_store.upsert_triplet("Software", "Mention in", "Footnotes")
print(graph_store.rel_map)

In this example, the get_rel_map method is called after a triplet is upserted, refreshing rel_map with the new relationship. This ensures that rel_map is always populated with the latest data: 1 (https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/graph_stores/llama-index-graph-stores-tidb/llama_index/graph_stores/tidb/graph.py), 2 (https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/graph_stores/kuzu.py), 3 (https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/graph_stores/llama-index-graph-stores-neo4j/llama_index/graph_stores/neo4j/base.py), 4 (https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/graph_stores/neo4j.py), 5 (https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/graph_stores/simple.py).
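
If the goal is for those relationships to also end up on disk in graph_store.json, the store can be persisted and reloaded explicitly. A minimal sketch using the stock SimpleGraphStore from llama_index.core.graph_stores rather than the modified class above; the file path is illustrative:

from llama_index.core.graph_stores import SimpleGraphStore

graph_store = SimpleGraphStore()
graph_store.upsert_triplet("Software", "Mention in", "Footnotes")

# Write the triplets to disk (illustrative path)
graph_store.persist(persist_path="./storage_graph_syllabus_test_small/graph_store.json")

# Later, rebuild the store from that file without any external graph database
restored = SimpleGraphStore.from_persist_path("./storage_graph_syllabus_test_small/graph_store.json")
print(restored.get_rel_map(subjs=["Software"]))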

  • To continue the conversation with Dosu, mention @dosu.

7eumitmz4#

Hey @logan-markewich, I've been trying to understand the above from dosubot, but I'm having a hard time implementing it correctly. I just want to be able to load the vector and graph indices I created from the persist directory, without having to connect to my NebulaGraph server. Is that possible?

I want rel_map to be populated because it is empty in index_store.json (the graph store). The reason is that I think my relationships should end up there when I persist these indices to the persist directory.

Do you have any suggestions? My end goal is to deploy this application without requiring a constant connection to NebulaGraph.
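
For reference, a purely file-based load along these lines is only possible if the graph data itself was written to the persist directory (for example via a SimpleGraphStore) rather than living only in NebulaGraph. A hedged sketch of what such a load could look like, reusing the persist directory and index ids from the scripts above:

from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core.graph_stores import SimpleGraphStore

persist_dir = "./storage_graph_syllabus_test_small"

# Rebuild the storage context purely from disk; the graph store is read from
# graph_store.json if one was written at persist time
storage_context = StorageContext.from_defaults(
    graph_store=SimpleGraphStore.from_persist_dir(persist_dir),
    persist_dir=persist_dir,
)

vector_index = load_index_from_storage(storage_context, index_id="vector_index")
kg_index = load_index_from_storage(storage_context, index_id="kg_index")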
