llama_index [Bug]: WeaviateVectorStore never deletes old documents after pipeline.run

Posted by p8ekf7hl 6 months ago in Other


Bug description

TL;DR

When I run pipeline.run several times with DocstoreStrategy.UPSERTS_AND_DELETE, the number of objects in Weaviate keeps growing. I am using IngestionPipeline with WeaviateVectorStore. pipeline.run calls the _handle_upserts method, which eventually calls docstore.add_documents. Inside add_documents, _prepare_kv_pairs builds a tuple named metadata_kv_pairs whose first element is node.node_id rather than node.ref_doc_id.
The next time I call pipeline.run, _handle_upserts runs the following code to get a set of node_ids:

existing_doc_ids_before = set(
    self.docstore.get_all_document_hashes().values()
)

It then treats each node_id as a ref_doc_id and passes it to vector_store.delete(ref_doc_id). WeaviateVectorStore.delete builds a filter with the following code, which cannot match any data:

where_filter = wvc.query.Filter.by_property("ref_doc_id").equal(ref_doc_id)
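
The mismatch described above can be reproduced without Weaviate. The following is only a toy model, assuming a store whose delete filters on a ref_doc_id property as in the line above; `Node` and `ToyVectorStore` are hypothetical names made up for illustration and are not llama_index or Weaviate classes.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str      # unique per chunk
    ref_doc_id: str   # id of the source document the chunk came from
    text: str


class ToyVectorStore:
    """Hypothetical stand-in for a vector store that deletes by ref_doc_id."""

    def __init__(self) -> None:
        self.objects: list[Node] = []

    def add(self, nodes: list[Node]) -> None:
        self.objects.extend(nodes)

    def delete(self, ref_doc_id: str) -> int:
        # Same idea as the filter in the report: match on the ref_doc_id property.
        before = len(self.objects)
        self.objects = [n for n in self.objects if n.ref_doc_id != ref_doc_id]
        return before - len(self.objects)


store = ToyVectorStore()
store.add([
    Node("node-1", "doc-A", "first chunk"),
    Node("node-2", "doc-A", "second chunk"),
])

# What the report describes: a node_id arrives where a ref_doc_id is expected.
deleted_with_node_id = store.delete("node-1")

# What the filter needs in order to match: the real ref_doc_id.
deleted_with_ref_doc_id = store.delete("doc-A")

print(deleted_with_node_id, deleted_with_ref_doc_id)  # 0 2
```

In this model the first call removes nothing because no object has ref_doc_id equal to "node-1", which is consistent with the object count in Weaviate only ever growing.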

Version

0.10.58

Steps to reproduce

init

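import streamlit as st  # assumed: st.secrets looks like Streamlit's secrets API
import weaviate
from llama_index.core import Settings
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter, SimpleFileNodeParser
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.vector_stores.weaviate import WeaviateVectorStore

# `settings` is the reporter's own config object; its definition is not shown.
vector_index_name = "LA"
weaviate_client = weaviate.connect_to_wcs(
    cluster_url=settings.get("WEAVIATE_CLOUD_REST_URL"),
    auth_credentials=weaviate.auth.AuthApiKey(st.secrets.WEAVIATE_API_KEY),
)

vector_store = WeaviateVectorStore(
    weaviate_client=weaviate_client,
    index_name=vector_index_name,
)

docstore = SimpleDocumentStore()

# First pipeline: parse, split, and embed the documents.
pipelineForTransforms = IngestionPipeline(
    transformations=[
        SimpleFileNodeParser(),
        SentenceSplitter(chunk_size=500, chunk_overlap=100),
        Settings.embed_model,
    ]
)

# Second pipeline: storage only, so no transformations are needed.
pipelineForStorage = IngestionPipeline(
    transformations=[],
    docstore=docstore,
    vector_store=vector_store,
    docstore_strategy=DocstoreStrategy.UPSERTS_AND_DELETE,
)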

Load the data a few times and you will see the number of objects keep increasing.

import os
from pathlib import Path

from llama_index.readers.file import FlatReader

reader = FlatReader()
docs = reader.load_data(Path("./data/xxx.md"))

# Restore previously persisted pipeline state, if there is any.
if len(os.listdir("./storage")) != 0:
    pipelineForStorage.load("./storage")

nodes = pipelineForTransforms.run(documents=docs)

# save to storage
nodes = pipelineForStorage.run(nodes=nodes)  # <-- it will succeed, but Weaviate won't delete the old data

3 answers


4sup72z8 #1

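To fix WeaviateVectorStore failing to delete old documents because node_id and ref_doc_id do not match, make sure ref_doc_id is set and used correctly during deletion. The delete method of the WeaviateVectorStore class uses ref_doc_id to identify and delete documents.

Here is the relevant part of the delete method:

Make sure the ref_doc_id passed to delete matches the ref_doc_id stored in the Weaviate database. If they do not match, no document is found and nothing is deleted.

If node_id and ref_doc_id are supposed to be identical or related, make sure they are set correctly when nodes are added to the index. The add method should set ref_doc_id correctly for each node.

Here is the add method for reference:

Make sure the add_node function sets ref_doc_id correctly for each node. If necessary, modify add_node so that it does.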

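Also, if you use a vector_store together with an IngestionPipeline and a docstore, make sure docstore_strategy is set correctly. The DocstoreStrategy.UPSERTS_AND_DELETE strategy handles upserts and deletes documents that no longer exist from both the doc store and the vector store.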
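
As a rough illustration of what an upsert-and-delete strategy has to decide on each run, here is a minimal sketch. It is not the library's implementation; `plan_upserts_and_deletes` is a hypothetical helper, and it assumes both sides are described by a mapping from document ID to content hash.

```python
def plan_upserts_and_deletes(stored_hashes, incoming_hashes):
    """Return (ids_to_upsert, ids_to_delete) for two {doc_id: content_hash} mappings."""
    # Upsert anything that is new or whose content hash has changed.
    to_upsert = sorted(
        doc_id
        for doc_id, content_hash in incoming_hashes.items()
        if stored_hashes.get(doc_id) != content_hash
    )
    # Delete anything that was stored before but is absent from this run.
    to_delete = sorted(set(stored_hashes) - set(incoming_hashes))
    return to_upsert, to_delete


stored = {"a.md": "h1", "b.md": "h2", "c.md": "h3"}
incoming = {"a.md": "h1", "b.md": "h2-changed", "d.md": "h4"}

to_upsert, to_delete = plan_upserts_and_deletes(stored, incoming)
print(to_upsert)  # ['b.md', 'd.md']
print(to_delete)  # ['c.md']
```

The delete list is only useful if the IDs in it are the same kind of ID that the vector store filters on, which is exactly the point the original report raises.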

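When using IngestionPipeline, make sure document deletion is handled correctly:

By setting and using ref_doc_id correctly and choosing the appropriate docstore_strategy, you should be able to resolve the problem of old documents not being deleted.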

6 months ago

h5qlskok #2

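If the IDs of the incoming documents are not consistent, the pipeline cannot compare their hashes with those of the existing documents. Make sure the input documents keep the same document ID across runs (the file name is a good choice, for example).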
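
One way to get IDs that stay the same across runs is to derive them from the file path. The snippet below is a sketch under that assumption; `stable_doc_id` is a hypothetical helper, and assigning the result to each document (for example through the `id_` field of a llama_index `Document`) is left out so that the snippet runs without the library.

```python
from pathlib import Path


def stable_doc_id(path: Path, root: Path) -> str:
    """Use the file's path relative to `root` as a run-independent document ID."""
    return path.resolve().relative_to(root.resolve()).as_posix()


# The same file yields the same ID on every run, so hashes can be compared.
run1_id = stable_doc_id(Path("data/xxx.md"), Path("data"))
run2_id = stable_doc_id(Path("data/xxx.md"), Path("data"))

print(run1_id)             # xxx.md
print(run1_id == run2_id)  # True
```

Using the relative path rather than the bare file name avoids collisions between files with the same name in different subdirectories.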

6 months ago

a11xaf1n #3

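I replaced the old document with a new one, so the ID is no longer the same. However, I think DocstoreStrategy.UPSERTS_AND_DELETE should handle this case by deleting the old document once it no longer exists. In fact, _handle_upserts does delete the old document from the docstore, because delete_ref_doc in the docstore actually takes the ref_doc_id as a doc_id. But the vector store (Weaviate, in our case) does not do the same.
@logan-markewich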

6 months ago
