llama_index [Bug]: IngestionPipeline with a docstore outputs no nodes when loading from the local cache with a new docstore

Posted by q5iwbnjs 23 days ago in Other


Bug description

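After running an IngestionPipeline with a docstore and persisting its cache locally with pipeline.persist(), running another pipeline on the same documents after calling pipeline.load() (which loads the stored cache) produces no output: no nodes are returned. Note that the second pipeline returns no documents even when no docstore is attached to it. If a docstore is attached to the second pipeline, no documents are loaded into that docstore.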

I suspect that the docstore stored in the cache is being used instead of the new docstore attached to the second pipeline.

Version

0.10.38

Steps to reproduce

See #14068 (comment)

3 answers


kognpnkq1#

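The problem you are seeing when running an IngestionPipeline after loading from the local cache is caused by the caching mechanism. The cache retrieves previously transformed nodes by hash. If the hash matches, the cached nodes are used instead of running the transformation again. This behavior is controlled by the run_transformations and arun_transformations functions: 1 (https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/ingestion/pipeline.py), 2 (https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/ingestion/pipeline.py).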
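In the relevant code, if the hash of the nodes and the transformation matches a cache entry, the cached nodes are used ( nodes = cached_nodes ). If the documents or transformations have changed but the hash computation does not accurately reflect those changes, the pipeline will incorrectly use the cached nodes, resulting in no new output: 2 (https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/ingestion/pipeline.py).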
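In addition, IngestionPipeline uses a document store ( docstore ) to manage deduplication. If a docstore is enabled, it prevents documents that have already been ingested from being reprocessed: 3 (https://github.com/run-llama/llama_index/blob/main/llama-index-core/tests/ingestion/test_pipeline.py).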
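To resolve this, you may need to make sure the hash computation accurately reflects any changes to the documents or transformations. This could involve modifying the get_transformation_hash function, or making sure the cache is cleared or updated appropriately when changes occur.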
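To make the hash-keyed caching described above concrete, here is a minimal sketch in plain Python. It is not LlamaIndex's implementation: transformation_hash, run_with_cache, and split_words are hypothetical names invented for illustration, and the cache is an ordinary dict. The key is assumed to combine the input texts with the transformation's name and parameters.

```python
import hashlib
import json


def transformation_hash(texts, transform_name, transform_params):
    """Hypothetical key: hash of the inputs plus the transformation config."""
    payload = json.dumps(
        {"texts": texts, "name": transform_name, "params": transform_params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_with_cache(texts, transform, transform_name, transform_params, cache):
    """Run `transform` unless the same (inputs, config) pair is already cached."""
    key = transformation_hash(texts, transform_name, transform_params)
    if key in cache:
        return cache[key], True  # cache hit: the transformation is skipped
    result = transform(texts, **transform_params)
    cache[key] = result
    return result, False


def split_words(texts, chunk_size):
    """Toy stand-in for a splitter: groups words into chunks of `chunk_size`."""
    chunks = []
    for text in texts:
        words = text.split()
        for i in range(0, len(words), chunk_size):
            chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks


cache = {}
docs = ["hello world again"]
first, hit1 = run_with_cache(docs, split_words, "split_words", {"chunk_size": 2}, cache)
second, hit2 = run_with_cache(docs, split_words, "split_words", {"chunk_size": 2}, cache)
third, hit3 = run_with_cache(docs, split_words, "split_words", {"chunk_size": 1}, cache)
```

In this sketch the second call is a cache hit and still returns the same chunks as the first, while changing chunk_size produces a different key and a fresh run. Under these assumptions a cache hit by itself returns nodes rather than an empty list, so empty output would have to come from somewhere other than the cache lookup.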

To continue talking to Dosu, mention @dosu.

23 days ago

q3qa4bjr2#

You need to provide more reproducible code. This code runs fine for me and returns a node every time.

23 days ago

0ve6wy6x3#

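Thank you very much for investigating. Yes, I should not have been in such a hurry and should have provided more specific, reproducible code in the issue. After more experimentation, I found that pipeline.persist() and pipeline.load() alone are not enough to trigger this bug. Instead, I think it has to do with how the cache interacts with the docstore.
The following code shows how having both a cache and a docstore triggers the bug: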

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Document
from llama_index.core.storage.docstore import SimpleDocumentStore

docstore = SimpleDocumentStore()
pipeline = IngestionPipeline(transformations=[SentenceSplitter()], docstore=docstore)

nodes = pipeline.run(documents=[Document(text="hello")])
print(f"Got {len(nodes)} nodes")  # should be 1
nodes = pipeline.run(documents=[Document(text="hello")])
print(f"Got {len(nodes)} nodes")  # should be 0
print(f"Docstore has {len(docstore.docs)} nodes")  # should be 1

pipeline.persist()

docstore = SimpleDocumentStore()
pipeline = IngestionPipeline(transformations=[SentenceSplitter()], docstore=docstore)
pipeline.load()
nodes = pipeline.run(documents=[Document(text="hello")])
print(f"Got {len(nodes)} nodes")  # should be 1
nodes = pipeline.run(documents=[Document(text="hello"), Document(text="hello world")])
print(f"Got {len(nodes)} nodes")  # should be 1
print(f"Docstore has {len(docstore.docs)} nodes")  # should be 2

The output on my machine is:

Docstore strategy set to upserts, but no vector store. Switching to duplicates_only strategy.
Got 1 nodes
Got 0 nodes
Docstore has 1 nodes
Docstore strategy set to upserts, but no vector store. Switching to duplicates_only strategy.
Got 0 nodes
Got 1 nodes
Docstore has 0 nodes

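I think the third run() is the problematic one: it outputs 0 instead of 1. According to the documentation, attaching a docstore makes the ingestion pipeline skip duplicate documents. However, since I am using a new docstore, it should not treat adding "hello" again as a duplicate. The problem does not occur if pipeline.load() is not called or if no docstore is attached. In addition, the second docstore should contain 2 nodes rather than 0.
Furthermore, using the first pipeline's docstore instead of the second pipeline's still results in the third run() processing no nodes. As docstore.json shows, I believe the docstore is also stored in the cache. Even so, I think reusing the previous pipeline's docstore can be problematic when I want to clear my vector database and reload all the data: in that case every previously loaded node is treated as "already loaded" and skipped, whereas I think the correct behavior is to output them, using the cached versions where applicable.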
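The suspicion above can be illustrated with a toy model in plain Python. This is only a sketch of the hypothesis, not LlamaIndex code: ToyDocstore and ToyPipeline are hypothetical classes, and the key assumption (marked in the comments) is that load() replaces the pipeline's docstore with one rebuilt from the persisted data, so the docstore object passed to the constructor is no longer the one being consulted.

```python
import hashlib


class ToyDocstore:
    """Hypothetical stand-in for a docstore: remembers hashes of seen texts."""

    def __init__(self):
        self.hashes = set()


class ToyPipeline:
    """Hypothetical pipeline that skips texts already recorded in its docstore."""

    def __init__(self, docstore):
        self.docstore = docstore

    def run(self, texts):
        new_texts = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest not in self.docstore.hashes:
                self.docstore.hashes.add(digest)
                new_texts.append(text)
        return new_texts

    def persist(self):
        # Snapshot of the docstore contents, standing in for docstore.json.
        return set(self.docstore.hashes)

    def load(self, snapshot):
        # ASSUMPTION: load() swaps in a docstore rebuilt from the snapshot,
        # discarding the one that was passed to the constructor.
        restored = ToyDocstore()
        restored.hashes = set(snapshot)
        self.docstore = restored


first_store = ToyDocstore()
first_pipeline = ToyPipeline(first_store)
run1 = first_pipeline.run(["hello"])
run2 = first_pipeline.run(["hello"])
snapshot = first_pipeline.persist()

fresh_store = ToyDocstore()
second_pipeline = ToyPipeline(fresh_store)
second_pipeline.load(snapshot)
run3 = second_pipeline.run(["hello"])
run4 = second_pipeline.run(["hello", "hello world"])

print(len(run1), len(run2), len(first_store.hashes))  # 1 0 1
print(len(run3), len(run4), len(fresh_store.hashes))  # 0 1 0
```

Under that assumption the toy model gives the same counts as the reported output: the third run returns nothing because the restored docstore already knows "hello", and the fresh docstore held by the caller stays empty because the pipeline no longer writes to it. Whether the real load() behaves this way would need to be checked against the library source.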

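Edit: added the exact output and the docstore sizes. Originally I also provided an example that used both a docstore and a vector store, but I removed it because the presence of a vector store does not affect this bug.
Also, the more serious case I mentioned, where pipeline.run() returns no nodes even when new documents are added, appears to be related to how document loading is handled in my project rather than to LlamaIndex. I will therefore edit this issue to remove that part.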

23 days ago
