langchain SemanticChunker: list index out of range

w51jfk4q posted 4 months ago in Other

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

In text_splitter.py (SemanticChunker):

def _calculate_sentence_distances(
        self, single_sentences_list: List[str]
    ) -> Tuple[List[float], List[dict]]:
        """Split text into multiple components."""

        _sentences = [
            {"sentence": x, "index": i} for i, x in enumerate(single_sentences_list)
        ]
        sentences = combine_sentences(_sentences, self.buffer_size)
        embeddings = self.embeddings.embed_documents(
            [x["combined_sentence"] for x in sentences]
        )
        for i, sentence in enumerate(sentences):
            sentence["combined_sentence_embedding"] = embeddings[i] << Failed here since embeddings size is less than i at a later point

        return calculate_cosine_distances(sentences)
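
For reference, the IndexError can only happen if embed_documents returns fewer vectors than there are combined sentences. A minimal, hypothetical local check (not part of langchain_experimental) that could be patched into this method to surface the mismatch before the loop:

# Hypothetical local sanity check: confirm the embedding backend returned
# exactly one vector per combined sentence before indexing into `embeddings`.
assert len(embeddings) == len(sentences), (
    f"expected {len(sentences)} embeddings, got {len(embeddings)}"
)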

Error Message and Stack Trace (if applicable)

Traceback (most recent call last):
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/A72281951/telly/telly-backend/ingestion/main.py", line 132, in start
    store.load_data_to_db(configured_spaces)
  File "/Users/A72281951/telly/telly-backend/ingestion/common/utils.py", line 70, in wrapper
    value = func(*args, **kwargs)
  File "/Users/A72281951/telly/telly-backend/ingestion/agent/store/db.py", line 86, in load_data_to_db
    for docs in self.ingest_data(spaces):
  File "/Users/A72281951/telly/telly-backend/ingestion/agent/store/db.py", line 77, in ingest_data
    documents.extend(self.chunker.split_documents(docs))
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/langchain_experimental/text_splitter.py", line 258, in split_documents
    return self.create_documents(texts, metadatas=metadatas)
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/langchain_experimental/text_splitter.py", line 243, in create_documents
    for chunk in self.split_text(text):
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/langchain_experimental/text_splitter.py", line 201, in split_text
    distances, sentences = self._calculate_sentence_distances(single_sentences_list)
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/langchain_experimental/text_splitter.py", line 186, in _calculate_sentence_distances
    sentence["combined_sentence_embedding"] = embeddings[i]
IndexError: list index out of range

Description

  • I am trying to chunk a list of documents and it fails
  • I am using SemanticChunker from langchain-experimental~=0.0.61
  • breakpoint_threshold_type = percentile, breakpoint_threshold_amount = 95.0

System Info

langchain==0.2.5
langchain-community==0.2.5
langchain-core==0.2.9
langchain-experimental==0.0.61
langchain-google-vertexai==1.0.5
langchain-postgres==0.0.8
langchain-text-splitters==0.2.1
Mac M3
Python 3.10.14

6uxekuva 1#

Hi @amitjoy, could you please share your MVE? I cannot reproduce it with Cohere (instead of OpenAI) and with Greg Kamradt's sample essay referenced in the notebook. I am on Python 3.10.11 but with the same packages. The following code works without any errors (percentile and 95 are the defaults, so I did not change them).

import os
from langchain.embeddings import OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
from langchain.docstore.document import Document
from langchain_community.embeddings import CohereEmbeddings

if __name__ == '__main__':
    os.environ["OPENAI_API_KEY"] = "<your_key>"
    os.environ["COHERE_API_KEY"] = "<your_key>"

    with open(r'./data/mit.txt') as file:
        essay = file.read()
        doc = Document(page_content=essay)

    # embeddings = OpenAIEmbeddings()
    embeddings = CohereEmbeddings(model="embed-english-light-v3.0")
    chunker = SemanticChunker(embeddings)
    docs = chunker.transform_documents([doc, ])
    print(f"{len(docs)}")

qvtsj1bj 2#

I am currently using VertexAI Gemini to ingest data from Confluence:

self.chunker = SemanticChunker(
    embeddings=vector_db.embedding,  # VertexAIEmbeddings
    breakpoint_threshold_type=self.settings.db.vector_db.chunking.semantic.breakpoint_threshold.type,  # percentile
    breakpoint_threshold_amount=self.settings.db.vector_db.chunking.semantic.breakpoint_threshold.amount)  # 95.0

def ingest_data(self, spaces: List[str]):
    for space in spaces:
        click.echo(f"⇢ Loading data from space '{space}'")
        confluence_loader = self.loader(space)

        documents: List[Document] = []
        if self.chunker is not None:
            docs: List[Document] = confluence_loader.load()
            documents.extend(self.chunker.split_documents(docs))
        elif self.splitter is not None:
            documents.extend(confluence_loader.load_and_split(self.splitter))

        # adding space ID to the existing metadata
        for doc in documents:
            doc.metadata["space_key"] = space
            # the following metadata is required for ragas
            doc.metadata['filename'] = space
        yield documents

pexxcrt2 3#

Hi @amitjoy, this is not an MVE (minimal verifiable example); for example, the elif branch does not even use the chunker.

My best guess is that you are not getting any embeddings back. The stack trace is quite clear about this, so, for example, try printing out the length of embeddings before the for loop.
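
For example, a quick check of that kind can be run outside SemanticChunker entirely; this sketch assumes embedding is the same VertexAIEmbeddings instance passed to the chunker, and the texts are just placeholders:

# Compare input and output counts of the embedding backend directly.
texts = [f"placeholder sentence number {i} for the length check" for i in range(200)]
vectors = embedding.embed_documents(texts)
print(len(texts), len(vectors))  # if these differ, the embeddings class is the culprit, not the chunker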

I suggest reducing your code to a single document on which split_documents (or transform_documents, which is just a wrapper around it) is called. Also, try to take confluence_loader out of the picture, since it should not influence the end result.

An example of an MVE: take the code I provided (including the document mentioned there) and just replace CohereEmbeddings with VertexAIEmbeddings. If that fails, it is a VertexAIEmbeddings issue. If it does not fail, then use one of your own documents. If it still does not fail, it is a confluence_loader issue; otherwise it is a document issue.
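
A sketch of that first step, under the assumption that the reporter's model is the one mentioned later in this thread (text-embedding-004) and that ./data/mit.txt is the sample essay from the snippet above; only the embedding model is swapped:

from langchain.docstore.document import Document
from langchain_experimental.text_splitter import SemanticChunker
from langchain_google_vertexai import VertexAIEmbeddings

with open("./data/mit.txt") as file:
    doc = Document(page_content=file.read())

# Same flow as the earlier snippet, with CohereEmbeddings replaced.
embeddings = VertexAIEmbeddings(model_name="text-embedding-004")
chunker = SemanticChunker(embeddings)
docs = chunker.transform_documents([doc])
print(len(docs))

If this already fails, the Confluence loader and the generator around split_documents can be ruled out.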

nbnkbykc 4#

Hi @tibor-reiss, @amitjoy,
I ran into a similar issue. It can be reproduced with the following snippet:

import itertools
import lorem
from google.cloud import aiplatform
# from langchain.embeddings import VertexAIEmbeddings  # this one works
from langchain_google_vertexai import VertexAIEmbeddings # this one fails
from langchain_experimental.text_splitter import SemanticChunker

aiplatform.init(project=PROJECT_ID, location=LOCATION)

embedding_model = VertexAIEmbeddings("text-embedding-004")

text_splitter = SemanticChunker(embedding_model)

document_chunks = text_splitter.split_text(" ".join(itertools.islice(lorem.sentence(word_range=(8, 16)), 200)))

Note that the problem does not occur when using from langchain.embeddings import VertexAIEmbeddings, but that import triggers a deprecation warning.
The problem seems to come from the batch size computation in langchain_google_vertexai/embeddings.py, which produces arbitrarily low values for the batch size even though the total number of texts is high.
In text_splitter.py, the length of embeddings then differs from the length of sentences.

embeddings = self.embeddings.embed_documents(  # <<< does not return the correct number of embeddings
    [x["combined_sentence"] for x in sentences]
)
for i, sentence in enumerate(sentences):
    sentence["combined_sentence_embedding"] = embeddings[i]  # >>>> IndexError raised here
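
Until the upstream fix lands, one possible stopgap (a sketch, not an official LangChain API) is to wrap whatever Embeddings implementation is passed to SemanticChunker in a small guard that fails fast with a clearer message than the IndexError deep inside the splitter; CountCheckedEmbeddings is a hypothetical name:

from typing import List

from langchain_core.embeddings import Embeddings


class CountCheckedEmbeddings(Embeddings):
    """Hypothetical wrapper: delegates to another Embeddings instance and raises
    a descriptive error if the number of returned vectors does not match the input."""

    def __init__(self, inner: Embeddings):
        self.inner = inner

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        vectors = self.inner.embed_documents(texts)
        if len(vectors) != len(texts):
            raise RuntimeError(
                f"embedding backend returned {len(vectors)} vectors for {len(texts)} texts"
            )
        return vectors

    def embed_query(self, text: str) -> List[float]:
        return self.inner.embed_query(text)


# usage: SemanticChunker(CountCheckedEmbeddings(embedding_model))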

r9f1avp5 5#

Hi jsconan, thanks for looking into this. As we suspected, this is an issue (or a feature) of the new VertexAIEmbeddings, not of SemanticChunker. I can see in the source code that there have indeed been some major changes. I suggest you either change the title of this issue or, even better, open a new issue at https://github.com/langchain-ai/langchain-google.

svdrlsy4 6#

Thank you @tibor-reiss. I have created an issue as you suggested: langchain-ai/langchain-google#353

bzzcjhmw 7#

@amitjoy Please note that the issue has been fixed in an unreleased version: langchain-ai/langchain-google#353 (comment)
