llama_index [Bug]: LlamaIndex-DSPy integration issue when using HuggingFace embeddings

az31mfrm  posted 23 days ago in Other

Bug Description

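I am following the cookbook tutorial ( https://github.com/stanfordnlp/dspy/blob/main/examples/llamaindex/dspy_llamaindex_rag.ipynb ), but with the LLM and embedding model swapped for non-OpenAI models. When I try to compile the DSPy training pipeline with the HuggingFaceEmbedding class, I get an error; I do not hit this problem with any other embedding model. This is the GitHub issue I opened on DSPy: stanfordnlp/dspy#1209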

Version

10.50

Steps to Reproduce

I am following the cookbook tutorial ( https://github.com/stanfordnlp/dspy/blob/main/examples/llamaindex/dspy_llamaindex_rag.ipynb ), but changed the following variable:

turbo = dspy.OllamaLocal(model="llama3")

I also set the embedding model to a HuggingFace model. Without this setting, LlamaIndex uses OpenAI embeddings as its default embedding model.

from llama_index.core import (
    SimpleDirectoryReader, 
    VectorStoreIndex,
    Settings
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding()

docs = SimpleDirectoryReader(
    input_files = ["paul_graham_essay.txt"]
).load_data()

index = VectorStoreIndex.from_documents(docs)

retriever = index.as_retriever(similarity_top_k=2)

Running the following cell produces an error message.

from dspy.teleprompt import BootstrapFewShot
from llama_index.core.evaluation import SemanticSimilarityEvaluator

evaluator = SemanticSimilarityEvaluator(similarity_threshold=0.5)

def validate_context_and_answer(example, pred, trace=None):
    """We check that the predicted answer is correct, and that
    the retrieved context does contain the answer."""
    result = evaluator.evaluate(response = pred.answer,
                                reference = example.answer)
    return result.passing

teleprompter = BootstrapFewShot(
    max_labeled_demos=0,
    metric=validate_context_and_answer
)

#compile
compiled_dspy_qp = teleprompter.compile(dspy_qp, trainset=train_examples)

Relevant Logs/Tracebacks

Failed to run or to evaluate example Example({'query_str': 'What did the author do growing up?', 'answer': 'The author wrote short stories and also worked on programming.'}) (input_keys={'query_str'}) with <function validate_context_and_answer at 0x7feac38a1f80> due to 'HuggingFaceEmbedding' object has no attribute '_model'. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211

bogh5gae  #1

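To resolve the "'HuggingFaceEmbedding' object has no attribute '_model'" error when using the HuggingFaceEmbedding class in a DSPy training pipeline, make sure the _model attribute is correctly initialized in the HuggingFaceEmbedding class. The relevant information on how _model is initialized in the __init__ method follows: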

Make sure the AutoModel initialization is placed correctly in the __init__ method and that the _model attribute is set properly. This should resolve the attribute error [1] (https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/embeddings/llama-index-embeddings-huggingface/llama_index/embeddings/huggingface/base.py) [2] (https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/embeddings/huggingface.py).


syqv5f0l  #2

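@tituslhy Hi, I am also trying to use HuggingFaceEmbedding with DSPy. Unlike the cookbook, I simply used the vector retriever inside a DSPy module. Here is a more minimal example that demonstrates the issue: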

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import dspy
from dspy.teleprompt import BootstrapFewShot

class Rag(dspy.Module):
    def __init__(self):
        super().__init__()
        reader = SimpleDirectoryReader(input_files=["paul_graham_essay.txt"])
        docs = reader.load_data()
        index = VectorStoreIndex.from_documents(docs)
        self.retriever = index.as_retriever()

    def forward(self, question):
        return dspy.Prediction(answer=str(self.retriever.retrieve(question)))

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5", trust_remote_code=True
)
Settings.llm = None

teleprompter = BootstrapFewShot()
train_examples = [
    dspy.Example(
        question="What did the author do growing up?",
        answer="The author wrote short stories and also worked on programming.",
    ).with_inputs("question"),
    dspy.Example(
        question="What did the author do during his time at YC?",
        answer="organizing a Summer Founders Program, funding startups, writing essays, working on a new version of Arc, creating Hacker News, and developing internal software for YC",
    ).with_inputs("question"),
]
teleprompter.compile(Rag(), trainset=train_examples)

Just download the dataset:

wget https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt -O paul_graham_essay.txt

You should be able to run this example even without an LLM.
The output on my side is:

2024-07-02T08:47:05.960695Z [error    ] Failed to run or to evaluate example Example({'question': 'What did the author do growing up?', 'answer': 'The author wrote short stories and also worked on programming.'}) (input_keys={'question'}) with None due to 'HuggingFaceEmbedding' object has no attribute '_model'. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211
2024-07-02T08:47:05.961326Z [error    ] Failed to run or to evaluate example Example({'question': 'What did the author do during his time at YC?', 'answer': 'organizing a Summer Founders Program, funding startups, writing essays, working on a new version of Arc, creating Hacker News, and developing internal software for YC'}) (input_keys={'question'}) with None due to 'HuggingFaceEmbedding' object has no attribute '_model'. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211
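
The log lines above report only the exception message, not where the attribute access failed. A small helper like the one below can wrap a direct call and capture the full traceback. This is only a sketch: call_with_traceback, Dummy, and broken_forward are hypothetical names used for illustration and are not part of DSPy or LlamaIndex.

```python
import traceback


def call_with_traceback(fn, *args, **kwargs):
    """Call fn and return (result, None), or (None, traceback_text) on failure."""
    try:
        return fn(*args, **kwargs), None
    except Exception:
        return None, traceback.format_exc()


class Dummy:
    """Hypothetical object that lacks the private attribute being read."""


def broken_forward(question):
    # Stand-in for a module's forward(); it reads an attribute that was never set.
    return Dummy()._model


result, tb = call_with_traceback(broken_forward, "What did the author do growing up?")
print(tb)  # full traceback text, including the failing line
```

With the reproduction above, the same helper could presumably wrap a direct call to a Rag() instance with one of the training questions, to compare against what happens inside teleprompter.compile().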

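I have not investigated further, but I think this may be related to an issue I opened earlier, #13956. In a related issue, a developer mentioned that you cannot use multiprocessing with local embedding models. So I suspect something in teleprompter.compile() is related to this (since DSPy does not use multiprocessing).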
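
One way to narrow this down is to check whether a private attribute survives a deep copy. This rests on an assumption not confirmed in this thread: that the compile step works on a copy of the module rather than on the original. The FakeEmbedding class below is a hypothetical stand-in, not the real HuggingFaceEmbedding; it only mimics an object whose copy is rebuilt from public fields.

```python
import copy


class FakeEmbedding:
    """Hypothetical stand-in for an embedding wrapper (not the real class)."""

    def __init__(self, model_name):
        self.model_name = model_name
        self._model = object()  # pretend this is the loaded model

    def __deepcopy__(self, memo):
        # Simulates a copy that rebuilds the object from public fields only,
        # so the private _model attribute is never set on the copy.
        clone = self.__class__.__new__(self.__class__)
        clone.model_name = copy.deepcopy(self.model_name, memo)
        return clone


def survives_deepcopy(obj, attr):
    """Return (has_attr_before, has_attr_after_deepcopy)."""
    return hasattr(obj, attr), hasattr(copy.deepcopy(obj), attr)


original = FakeEmbedding("BAAI/bge-small-en-v1.5")
print(survives_deepcopy(original, "_model"))  # (True, False)
```

Running survives_deepcopy(Settings.embed_model, "_model") against the real object would show whether copying is what drops the attribute. If it returns (True, True), the cause lies elsewhere.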
