langchain: when Chroma is used as a retriever, it returns the same documents multiple times

ubby3x7f · posted 4 months ago · in: Other
Followed (0) | Answers (5) | Views (89)

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import bs4
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
blog_docs = loader.load()

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300, 
    chunk_overlap=50)

splits = text_splitter.split_documents(blog_docs)

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
docs = retriever.get_relevant_documents("What is Task Decomposition?")

print(f"number of documents - {len(docs)}")
for doc in docs:
  print(f"document content - {doc.__dict__}")

The printed values are:
document content - {'page_content': 'Fig. 1. Overview of a LLM-powered autonomous agent system.
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.
1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.', 'metadata': {'source': ' https://lilianweng.github.io/posts/2023-06-23-agent/'} , 'type': 'Document'}
document content - {'page_content': 'Fig. 1. Overview of a LLM-powered autonomous agent system.
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.
1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.', 'metadata': {'source': ' https://lilianweng.github.io/posts/2023-06-23-agent/'} , 'type': 'Document'}
document content - {'page_content': 'Fig. 1. Overview of a LLM-powered autonomous agent system.
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.
1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.', 'metadata': {'source': ' https://lilianweng.github.io/posts/2023-06-23-agent/'} , 'type': 'Document'}
document content - {'page_content': 'Resources:

  1. Internet access for searches and information gathering.
  2. Long Term memory management.
  3. GPT-3.5 powered Agents for delegation of simple tasks.
  4. File output.

Performance Evaluation:

  1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities.
  2. Constructively self-criticize your big-picture behavior constantly.
  3. Reflect on past decisions and strategies to refine your approach.
  4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps.', 'metadata': {'source': ' https://lilianweng.github.io/posts/2023-06-23-agent/'} , 'type': 'Document'}
As you can see, 3 of the documents are identical. I checked the splits and they contain 52 documents, but
res = vectorstore.get()
res.keys()

len(res['documents'])

evaluates to 156, so I think each document is stored 3 times instead of once.
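The arithmetic is consistent with a triple insertion. A minimal stand-alone sketch of the same duplicate check (toy strings in place of the real chunks; `collections.Counter` over what `vectorstore.get()['documents']` would hold):

```python
from collections import Counter

# Toy stand-in for res['documents']: 52 unique chunks, each inserted
# 3 times, mirroring the 156-entry collection observed above.
chunks = [f"chunk-{i}" for i in range(52)]
documents = chunks * 3  # what the collection ends up holding

counts = Counter(documents)
print(len(documents))        # 156
print(set(counts.values()))  # {3}: every chunk appears exactly three times
```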

Error Message and Stack Trace (if applicable)

  • No response

Description

I tried to use Chroma as a retriever in a toy example, and in addition to getting distinct documents when applying "get_relevant_documents", I also got the same document 3 times.

System Info

langchain==0.2.1
langchain-community==0.2.1
langchain-core==0.2.3
langchain-openai==0.1.8
langchain-text-splitters==0.2.0
langchainhub==0.1.17
Linux
Python 3.10.12
I am running on Colab.


jm2pwxwz1#

Hi @amirhagai.
In the code above, you ran the following snippet 3 times:
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
This means you added the documents to the vectorstore 3 times.
I don't know whether there is any mechanism that prevents saving duplicates, but if you want to reset the vectorstore's state, run: vectorstore.delete_collection()
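The accumulation can be modelled with a toy factory (all names here are hypothetical, not LangChain's code): like `Chroma.from_documents` without a `collection_name`, every call writes into one shared default collection.

```python
# Toy model of the behavior described above: each from_documents call
# creates a fresh wrapper object, but all wrappers share one default
# collection, so repeated calls append duplicate copies of the documents.
class ToyStore:
    _collections = {}  # shared state, like Chroma's default collection

    def __init__(self, docs):
        self.docs = docs

    @classmethod
    def from_documents(cls, documents, collection_name="langchain"):
        coll = cls._collections.setdefault(collection_name, [])
        coll.extend(documents)  # append, never reset
        return cls(coll)

splits = ["doc-a", "doc-b"]
store_1 = ToyStore.from_documents(splits)
store_2 = ToyStore.from_documents(splits)  # new wrapper, same collection
print(len(store_1.docs), len(store_2.docs))  # 4 4: both see the duplicates
```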


zpqajqem2#

Thanks for the clarification!
Actually, it appears twice because I mistakenly copy-pasted the same cell twice :)
But I do see that when I run this cell multiple times, the database adds items to itself instead of creating a new instance. Is this the expected behavior?
To illustrate: every time I run this cell -

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=OpenAIEmbeddings())

Chroma adds the items and the retrieval results change, even though the retriever was not redefined.
The documentation says the function will "Create a Chroma vectorstore from a list of documents":
https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.chroma.Chroma.html
Thanks again :)


sxissh063#

I ran into the same odd behavior while learning LangChain. I don't know whether it is intended, but I find it strange and don't think it should work this way. It is confusing because, as @amirhagai mentioned, they are two different `Chroma` wrapper instances, yet internally they refer to the same Chroma collection: the default one, _LANGCHAIN_DEFAULT_COLLECTION_NAME (defined as "langchain").
Also (as @amirhagai mentioned), the docstring and the documentation say "Create a Chroma vectorstore", which reinforces the idea of a "new and clean collection".
I think that, at least for the classmethods from_texts and from_documents, which are expected to be instance-returning factories, delete_collection should be called automatically.
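The suggested semantics can be sketched with the same kind of toy factory (names hypothetical; this models the proposal, not LangChain's actual behavior): the classmethod first drops any existing collection of the same name before writing.

```python
# Sketch of the proposed semantics: from_documents resets the target
# collection before writing, so each call yields a "new and clean" store.
class ResettingToyStore:
    _collections = {}

    def __init__(self, docs):
        self.docs = docs

    @classmethod
    def from_documents(cls, documents, collection_name="langchain"):
        cls._collections.pop(collection_name, None)  # the proposed auto-reset
        cls._collections[collection_name] = list(documents)
        return cls(cls._collections[collection_name])

splits = ["doc-a", "doc-b"]
store_1 = ResettingToyStore.from_documents(splits)
store_2 = ResettingToyStore.from_documents(splits)  # no accumulation now
print(len(store_2.docs))  # 2
```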


fiei3ece4#

@amirhagai @mariano22

  1. Use from langchain_chroma.vectorstores import Chroma instead of from langchain_community.vectorstores import Chroma, because langchain_chroma.vectorstores is maintained regularly.
  2. The behavior of vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings()) is correct and expected. It creates a collection named _LANGCHAIN_DEFAULT_COLLECTION_NAME.
    If you want to create a separate collection, specify the name of the collection, as it is a necessary parameter when defining the Chroma client. Example:
vectorstore_1 = Chroma.from_documents(documents=splits, embedding = embedding, collection_name="col")
vectorstore_2 = Chroma.from_documents(documents=splits, embedding = embedding, collection_name="sol")

You can also store them in your local directory since these are currently in your RAM and will be lost as soon as you stop the kernel. Example:

vectorstore = Chroma.from_documents(documents=splits, embedding = embeddings, collection_name="sol", persist_directory="my_dir")

However, I was not able to find a classmethod that returns a vectorstore from a persisted collection without inserting new documents. @eyurtsev Please correct me if I am wrong on this.
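The isolation that `collection_name` buys can likewise be sketched with a toy in-memory registry (toy code, not the Chroma API): distinct names map to distinct collections, so the "col" and "sol" stores above no longer share state.

```python
# Toy registry of named collections: writes to "col" do not affect "sol",
# mirroring the collection_name behavior described in this answer.
collections = {}

def toy_from_documents(documents, collection_name):
    coll = collections.setdefault(collection_name, [])
    coll.extend(documents)
    return coll

splits = ["doc-a", "doc-b"]
col = toy_from_documents(splits, "col")
sol = toy_from_documents(splits, "sol")
toy_from_documents(splits, "col")  # re-run only the "col" cell
print(len(col), len(sol))  # 4 2: only "col" accumulated
```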


3phpmpom5#

Thanks @spike-spiegel-21 for the clarification.
The confusion stems from the classmethod advertising a way to create a vectorstore, while it is unclear that, if the collection already exists, it loads it rather than overwriting it.
I think it would be more intuitive if from_documents performed a delete_collection when the collection name already exists.
But if you disagree, I would at least suggest clarifying this in the tutorials and especially in the documentation: https://api.python.langchain.com/en/latest/vectorstores/langchain_chroma.vectorstores.Chroma.html#langchain_chroma.vectorstores.Chroma.from_documents
(I had to read the code to understand the behavior, or make assumptions based on what I observed.)
