langchain: Why does the TileDB vector store implementation have no "from_documents" method when the instructions say it does?

Posted by jv4diomz 6 months ago in Other


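Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar issue and didn't find one.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).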

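Example Code

Using the standard TileDB code located in tiledb.py

Error Message and Stack Trace (if applicable)

No specific error message was provided

Description

Although the online instructions say that the TileDB source code has a "from_documents" method, it actually does not.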

System Info

Windows 10

langchain

Source: https://github.com/langchain-ai/langchain/issues/22964

3 Answers


wvmv3b1j #1

TileDB is a subclass of VectorStore in Core. VectorStore has a class method named from_documents, which TileDB inherits.
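
A minimal sketch of why a method can be absent from tiledb.py and still be callable on the class. `Doc`, `BaseStore`, and `ToyStore` are hypothetical stand-ins, not LangChain classes; the sketch assumes only that the base class implements from_documents by delegating to the subclass's from_texts.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class BaseStore:
    """Hypothetical stand-in for the VectorStore base class."""

    @classmethod
    def from_documents(cls, documents, embedding, **kwargs):
        # Defined once on the base class; delegates to the subclass's from_texts.
        texts = [d.page_content for d in documents]
        metadatas = [d.metadata for d in documents]
        return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)


class ToyStore(BaseStore):
    """Hypothetical subclass that only implements from_texts."""

    def __init__(self, texts, metadatas):
        self.texts = texts
        self.metadatas = metadatas

    @classmethod
    def from_texts(cls, texts, embedding, metadatas=None, **kwargs):
        # A real store would embed the texts here; this toy just keeps them.
        return cls(texts, metadatas)


store = ToyStore.from_documents([Doc("hello", {"page": 1})], embedding=None)
print(type(store).__name__, store.texts, store.metadatas)
# The subclass's own namespace does not contain the method, yet the call works.
print("from_documents" in vars(ToyStore), hasattr(ToyStore, "from_documents"))
```

Searching the subclass's source file for the method name finds nothing in this situation, which matches what the question describes.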

6 months ago

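qqrboqgw #2

Solution:

Investigation and resolution steps:

Verify the documentation:
Confirm whether the TileDB vector store documentation really states that a from_documents method should exist.
If it does, there may be a discrepancy between the documentation and the actual implementation.
Code review:
Review the TileDB vector store implementation in tiledb.py to confirm that it does not define a from_documents method.
Compare it with the base class VectorStore in Core to see whether the method should be inherited or needs an explicit implementation.
Possible fixes:
If from_documents should be inherited, make sure TileDB inherits and exposes it correctly.
If TileDB needs an explicit implementation, add the method based on the base class definition and behavior.
Update the documentation:
If the method should not exist, correct the documentation to avoid future confusion.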
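
The code review step above can be partly automated. The helper below is a sketch, not a LangChain utility: `defining_class` is a hypothetical name, and `Base` and `Store` are stand-ins for the base class and the concrete store. It walks the method resolution order and reports which class actually defines a given attribute.

```python
def defining_class(cls, name):
    """Return the first class in cls's MRO whose own namespace defines `name`, else None."""
    for klass in cls.__mro__:
        if name in vars(klass):
            return klass
    return None


# Hypothetical stand-ins for the base class and the concrete store.
class Base:
    @classmethod
    def from_documents(cls, documents):
        return cls()


class Store(Base):
    pass


print(defining_class(Store, "from_documents").__name__)  # the base class, not Store
print(defining_class(Store, "missing_method"))           # None: defined nowhere
```

Run against the real classes (assuming `from langchain_community.vectorstores import TileDB` is importable in the environment), the same helper would be expected to name the base class rather than TileDB if the method is inherited.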
Suggested code fix:

If the from_documents method should be defined explicitly, add it to the TileDB vector store implementation.

# File path: tiledb.py
# Illustrative skeleton only; store_embeddings is a placeholder.

from langchain_core.vectorstores import VectorStore


class TileDB(VectorStore):
    @classmethod
    def from_documents(cls, documents, embedding_function):
        """
        Creates a TileDB vector store from a list of documents.

        Args:
            documents (list): List of documents to be stored.
            embedding_function (callable): Function to convert documents to embeddings.

        Returns:
            TileDB: An instance of the TileDB vector store.
        """
        # Convert documents to embeddings
        embeddings = [embedding_function(doc) for doc in documents]
        # Create an instance of TileDB and store embeddings.
        # cls() only works once the abstract methods of VectorStore
        # (such as from_texts) are implemented on this class.
        instance = cls()
        instance.store_embeddings(embeddings)
        return instance

    def store_embeddings(self, embeddings):
        """
        Store embeddings in the TileDB storage.

        Args:
            embeddings (list): List of embeddings to be stored.
        """
        # Implement the logic to store embeddings in TileDB
        pass

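Follow-up actions:

If the method should not exist, notify the documentation team so the docs can be updated.
Push the code changes and open a pull request for review.
a. Implement the suggested code fix for the TileDB vector store implementation.
b. Update the LangChain documentation to accurately reflect the methods available on the TileDB vector store.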

6 months ago

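v1l68za4 #3

When I try to use the from_documents method, it keeps giving me various errors and, worse, sometimes exits early without reporting any error at all. For example:
AttributeError: 'int' object has no attribute '_length_function'
TypeError: 'str_iterator' object does not support the context manager protocol
TypeError: 'dict' object is not callable
TypeError: unsupported operand type(s) for +: 'int' and 'str'
After about two days of painstaking troubleshooting, I found a workaround by modifying the sentence-transformers library itself, specifically the _text_length method in SentenceTransformer.py. The change lets _text_length accept a "list of strings" in addition to the "single string" it currently handles.
Here is the modified code: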

from typing import List, Union

def _text_length(self, text: Union[str, List[str], List[int], List[List[int]]]):
    """
    Help function to get the length for the input text. Text can be either
    a list of ints (which means a single text as input), or a tuple of list of ints
    (representing several text inputs to the model).
    """
    if isinstance(text, str):
        return len(text)
    elif isinstance(text, dict):  # {key: value} case
        return len(next(iter(text.values())))
    elif not hasattr(text, "__len__"):  # Object has no len() method
        return 1
    elif len(text) == 0 or isinstance(text[0], int):  # Empty string or list of ints
        return len(text)
    else:
        return sum([len(t) for t in text])  # Sum of length of individual strings
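
To see what each branch returns, the same logic can be exercised in isolation. The sketch below copies it into a standalone function; the name `text_length` and the sample inputs are illustrative and not part of sentence-transformers.

```python
def text_length(text):
    """Standalone copy of the modified _text_length logic (no self)."""
    if isinstance(text, str):
        return len(text)
    elif isinstance(text, dict):  # {key: value} case
        return len(next(iter(text.values())))
    elif not hasattr(text, "__len__"):  # Object has no len() method
        return 1
    elif len(text) == 0 or isinstance(text[0], int):  # Empty input or list of ints
        return len(text)
    else:
        return sum(len(t) for t in text)  # Sum of lengths of individual strings


print(text_length("abc"))                  # 3: single string
print(text_length(["ab", "cde"]))          # 5: sum over a list of strings
print(text_length([1, 2, 3]))              # 3: list of token ids
print(text_length({"input_ids": [1, 2]}))  # 2: length of the first dict value
print(text_length(7))                      # 1: object without __len__
```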

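Obviously this change is not ideal, since it requires patching the sentence-transformers source code and could destabilize other functionality in the library...
Another solution, suggested by an AI, is to use TileDB's from_texts method. However, that would require splitting all the "Document objects" into two lists...

A list of "chunks" made up of strings;
A list of dictionaries holding all the metadata previously contained in the Document objects.
I have not tried this solution yet and filed this issue instead.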
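
The split described above could look roughly like the following sketch. `Doc` is a hypothetical stand-in for the Document class so that the example runs without LangChain installed, and `split_documents` is an illustrative helper name; it assumes each document exposes `page_content` and `metadata`.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_documents(documents):
    """Split Document-like objects into parallel lists of texts and metadata dicts."""
    texts = [d.page_content for d in documents]
    metadatas = [d.metadata for d in documents]
    return texts, metadatas


docs = [
    Doc("first chunk", {"source": "a.pdf"}),
    Doc("second chunk", {"source": "b.pdf"}),
]
texts, metadatas = split_documents(docs)
print(texts)
print(metadatas)
```

The two lists would then presumably be passed as `texts=` and `metadatas=` to TileDB.from_texts, together with the same embedding and index arguments used for from_documents; that call is not tested here.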
Here is how I use TileDB. Note that in my project the "texts" variable is a list of "Document objects", and the "embedding" parameter uses HuggingFaceEmbeddings, HuggingFaceInstructEmbeddings, or HuggingFaceBgeEmbeddings:

db = TileDB.from_documents(
    documents=texts,
    embedding=embeddings,
    index_uri=str(self.PERSIST_DIRECTORY),
    allow_dangerous_deserialization=True,
    metric="euclidean",
    index_type="FLAT",
)

6 months ago
