llama_index [Bug]: BatchEvalRunner does not catch exceptions when a metric computation fails

bybem2ql · posted 2 months ago in: Other

Bug Description

When running BatchEvalRunner over a large number of test cases and metrics, if one of them fails (which I have seen happen often with GuidelineEvaluator due to its unreliable JSON parsing), the exception is not caught and you lose all the other results, wasting both time and money.
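The failure mode described here comes from how plain asyncio.gather behaves: the first exception propagates to the caller and the sibling results are discarded. A minimal sketch (hypothetical evaluate coroutine, not llama_index code) of that behavior, and of the return_exceptions=True alternative that keeps partial results:

```python
import asyncio


async def evaluate(i: int) -> str:
    """Stand-in for one evaluation job; job 2 simulates a JSON parse failure."""
    if i == 2:
        raise ValueError("bad JSON from evaluator")
    return f"result-{i}"


async def main() -> None:
    # Default gather: the first exception propagates and the caller never
    # sees the results of the jobs that succeeded.
    try:
        await asyncio.gather(*(evaluate(i) for i in range(4)))
    except ValueError as e:
        print(f"all results lost: {e}")

    # With return_exceptions=True, failures are returned in place alongside
    # the successful results instead of being raised.
    results = await asyncio.gather(
        *(evaluate(i) for i in range(4)), return_exceptions=True
    )
    print(results)  # successes plus a ValueError object in slot 2


asyncio.run(main())
```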

Version

0.10.4

Steps to Reproduce

It does not always fail, so you may need to run it a few times.

from llama_index.core.evaluation import GuidelineEvaluator
from llama_index.core.evaluation import BatchEvalRunner
from llama_index.llms.openai import OpenAI

llm = OpenAI("gpt-4")

runner = BatchEvalRunner(
    {
     "my_guideline": GuidelineEvaluator(llm=llm, guidelines="The response should fully answer the query.")
     },
    workers=2,
    show_progress=True,
)

eval_results = await runner.aevaluate_response_strs(
    queries=["Limite de credito\n?\nOi"],
    response_strs=['Olá! Para ajustar o limite de crédito disponível no seu cartão, você precisa seguir os seguintes passos:\n\n1. Clique na opção: **Cartão de Crédito**, na tela inicial do aplicativo;\n2. Selecione: **Meus Limites**;\n3. Clique para digitar o valor ou mova o marcador roxo até o valor desejado, dentro do limite total do cartão.\n\nLembrando que **não realizamos análise e liberação de limite de crédito nos canais de atendimento**.\n\nA soma do seu **Limite Disponível** mais o **Valor Antecipado** indicará o limite total que você possui no momento. Conforme você realizar novas compras, esse limite será consumido até acabar. Após isso, as compras seguintes consumirão de seu limite normal. \n\nEntre o período das 20h até as 6h, o limite de pagamento é de R$1.000,00 de acordo com a resolução número 142 do Bacen.\n\nCaso sua dúvida seja sobre antecipação de parcelas de financiamentos, você pode acessar o tópico “” no "Me Ajuda".\n\nVocê gostaria de ser transferido para um agente agora?'],
)

Relevant Logs/Traceback

{
	"name": "ValidationError",
	"message": "1 validation error for EvaluationData
__root__
Expecting ',' delimiter: line 1 column 349 (char 348) (type=value_error.jsondecode; msg=Expecting ',' delimiter; doc={\"passing\": false, \"feedback\": \"The response is detailed and provides a step-by-step guide on how to adjust the credit limit, which is helpful. However, the response fails to fully answer the query as it does not clarify what 'Limite de credito' means. The response also includes a placeholder “” in the sentence 'você pode acessar o tópico “” no \"Me Ajuda\"', which should be replaced with relevant information. Lastly, the offer to transfer to an agent seems unnecessary as the query was not a request for a live agent.\"}; pos=348; lineno=1; colno=349)",
	"stack": "---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/pydantic/main.py:539, in pydantic.main.BaseModel.parse_raw()

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/pydantic/parse.py:37, in pydantic.parse.load_str_bytes()

File ~/miniforge3/envs/project-evaluation/lib/python3.9/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
343 if (cls is None and object_hook is None and
344         parse_int is None and parse_float is None and
345         parse_constant is None and object_pairs_hook is None and not kw):
--> 346     return _default_decoder.decode(s)
347 if cls is None:

File ~/miniforge3/envs/project-evaluation/lib/python3.9/json/decoder.py:337, in JSONDecoder.decode(self, s, _w)
333 \"\"\"Return the Python representation of ``s`` (a ``str`` instance
334 containing a JSON document).
335 
336 \"\"\"
--> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
338 end = _w(s, end).end()

File ~/miniforge3/envs/project-evaluation/lib/python3.9/json/decoder.py:353, in JSONDecoder.raw_decode(self, s, idx)
352 try:
--> 353     obj, end = self.scan_once(s, idx)
354 except StopIteration as err:

JSONDecodeError: Expecting ',' delimiter: line 1 column 349 (char 348)

During handling of the above exception, another exception occurred:

ValidationError                           Traceback (most recent call last)
Cell In[8], line 15
5 llm = OpenAI(\"gpt-4\")
7 runner = BatchEvalRunner(
8     {
9      \"my_guideline\": GuidelineEvaluator(llm=llm, guidelines=\"The response should fully answer the query.\")
(...)
12     show_progress=True,
13 )
---> 15 eval_results = await runner.aevaluate_response_strs(
16     queries=[\"Limite de credito\
?\
Oi\"],
17     response_strs=['Olá! Para ajustar o limite de crédito disponível no seu cartão, você precisa seguir os seguintes passos:\
\
1. Clique na opção: **Cartão de Crédito**, na tela inicial do aplicativo;\
2. Selecione: **Meus Limites**;\
3. Clique para digitar o valor ou mova o marcador roxo até o valor desejado, dentro do limite total do cartão.\
\
Lembrando que **não realizamos análise e liberação de limite de crédito nos canais de atendimento**.\
\
A soma do seu **Limite Disponível** mais o **Valor Antecipado** indicará o limite total que você possui no momento. Conforme você realizar novas compras, esse limite será consumido até acabar. Após isso, as compras seguintes consumirão de seu limite normal. \
\
Entre o período das 20h até as 6h, o limite de pagamento é de R$1.000,00 de acordo com a resolução número 142 do Bacen.\
\
Caso sua dúvida seja sobre antecipação de parcelas de financiamentos, você pode acessar o tópico “” no \"Me Ajuda\".\
\
Você gostaria de ser transferido para um agente agora?'],
18 )

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/llama_index/core/evaluation/batch_runner.py:188, in BatchEvalRunner.aevaluate_response_strs(self, queries, response_strs, contexts_list, **eval_kwargs_lists)
176     for name, evaluator in self.evaluators.items():
177         eval_jobs.append(
178             eval_worker(
179                 self.semaphore,
(...)
186             )
187         )
--> 188 results = await self.asyncio_mod.gather(*eval_jobs)
190 # Format results
191 return self._format_results(results)

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/tqdm/asyncio.py:79, in tqdm_asyncio.gather(cls, loop, timeout, total, *fs, **tqdm_kwargs)
76     return i, await f
78 ifs = [wrap_awaitable(i, f) for i, f in enumerate(fs)]
---> 79 res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
80                                          total=total, **tqdm_kwargs)]
81 return [i for _, i in sorted(res)]

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/tqdm/asyncio.py:79, in <listcomp>(.0)
76     return i, await f
78 ifs = [wrap_awaitable(i, f) for i, f in enumerate(fs)]
---> 79 res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
80                                          total=total, **tqdm_kwargs)]
81 return [i for _, i in sorted(res)]

File ~/miniforge3/envs/project-evaluation/lib/python3.9/asyncio/tasks.py:611, in as_completed.<locals>._wait_for_one()
608 if f is None:
609     # Dummy value from _on_timeout().
610     raise exceptions.TimeoutError
--> 611 return f.result()

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/tqdm/asyncio.py:76, in tqdm_asyncio.gather.<locals>.wrap_awaitable(i, f)
75 async def wrap_awaitable(i, f):
---> 76     return i, await f

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/llama_index/core/evaluation/batch_runner.py:43, in eval_worker(semaphore, evaluator, evaluator_name, query, response_str, contexts, eval_kwargs)
39 eval_kwargs = eval_kwargs or {}
40 async with semaphore:
41     return (
42         evaluator_name,
---> 43         await evaluator.aevaluate(
44             query=query, response=response_str, contexts=contexts, **eval_kwargs
45         ),
46     )

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/llama_index/core/evaluation/guideline.py:115, in GuidelineEvaluator.aevaluate(***failed resolving arguments***)
107 await asyncio.sleep(sleep_time_in_seconds)
109 eval_response = await self._llm.apredict(
110     self._eval_template,
111     query=query,
112     response=response,
113     guidelines=self._guidelines,
114 )
--> 115 eval_data = self._output_parser.parse(eval_response)
116 eval_data = cast(EvaluationData, eval_data)
118 return EvaluationResult(
119     query=query,
120     response=response,
(...)
123     feedback=eval_data.feedback,
124 )

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/llama_index/core/output_parsers/pydantic.py:62, in PydanticOutputParser.parse(self, text)
60 \"\"\"Parse, validate, and correct errors programmatically.\"\"\"
61 json_str = extract_json_str(text)
---> 62 return self._output_cls.parse_raw(json_str)

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/pydantic/main.py:548, in pydantic.main.BaseModel.parse_raw()

ValidationError: 1 validation error for EvaluationData
__root__
Expecting ',' delimiter: line 1 column 349 (char 348) (type=value_error.jsondecode; msg=Expecting ',' delimiter; doc={\"passing\": false, \"feedback\": \"The response is detailed and provides a step-by-step guide on how to adjust the credit limit, which is helpful. However, the response fails to fully answer the query as it does not clarify what 'Limite de credito' means. The response also includes a placeholder “” in the sentence 'você pode acessar o tópico “” no \"Me Ajuda\"', which should be replaced with relevant information. Lastly, the offer to transfer to an agent seems unnecessary as the query was not a request for a live agent.\"}; pos=348; lineno=1; colno=349)"
}
twh00eeo1#


The problem you are encountering seems to be that BatchEvalRunner does not properly handle exceptions raised during evaluation. You could address this by adding a try/except block around the evaluation calls in the aevaluate_response_strs, aevaluate_responses, and aevaluate_queries methods.
Here is the updated code for the aevaluate_response_strs method:

async def aevaluate_response_strs(
    self,
    queries: Optional[List[str]] = None,
    response_strs: Optional[List[str]] = None,
    contexts_list: Optional[List[List[str]]] = None,
    **eval_kwargs_lists: List,
) -> Dict[str, List[EvaluationResult]]:
    """Evaluate query, response pairs.

    This evaluates queries, responses, contexts as string inputs.
    Can supply additional kwargs to the evaluator in eval_kwargs_lists.

    Args:
        queries (Optional[List[str]]): List of query strings. Defaults to None.
        response_strs (Optional[List[str]]): List of response strings.
            Defaults to None.
        contexts_list (Optional[List[List[str]]]): List of context lists.
            Defaults to None.
        **eval_kwargs_lists (Dict[str, Any]): Dict of lists of kwargs to
            pass to evaluator. Defaults to None.
    """
    queries, response_strs, contexts_list = self._validate_and_clean_inputs(
        queries, response_strs, contexts_list
    )
    for k in eval_kwargs_lists:
        v = eval_kwargs_lists[k]
        if not isinstance(v, list):
            raise ValueError(
                f"Each value in eval_kwargs must be a list. Got {k}: {v}"
            )
        eval_kwargs_lists[k] = self._validate_and_clean_inputs(v)[0]

    # run evaluations
    eval_jobs = []
    for idx, query in enumerate(cast(List[str], queries)):
        response_str = cast(List, response_strs)[idx]
        contexts = cast(List, contexts_list)[idx]
        eval_kwargs = self._get_eval_kwargs(eval_kwargs_lists, idx)
        for name, evaluator in self.evaluators.items():
            eval_jobs.append(
                eval_worker(
                    self.semaphore,
                    evaluator,
                    name,
                    query=query,
                    response_str=response_str,
                    contexts=contexts,
                    eval_kwargs=eval_kwargs,
                )
            )
    try:
        results = await self.asyncio_mod.gather(*eval_jobs)
    except Exception as e:
        print(f"An error occurred during evaluation: {e}")
        results = []

    # Format results
    return self._format_results(results)

You would need to add similar try/except blocks to the aevaluate_responses and aevaluate_queries methods as well. This ensures that any exception raised during evaluation is caught and handled, preventing the loss of all results.
You can find the original implementation of the BatchEvalRunner class in the LlamaIndex repository.
Let me know whether this solution helps, or if you need further assistance.

Sources

About Dosu

This response is meant to be helpful and save you time. It is not a precise solution, but a starting point for your own research.
Help me be more useful! Leave a 👍 if this was helpful and a 👎 if it was irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.

cyvaqqii2#

With your approach, all results would still be lost, because you are catching the exception around the entire run rather than at each individual step.
Instead, what you need to do is catch the exception inside each evaluator class's aevaluate method, as is done here.
An even better solution would be to catch it directly in the BaseEvaluator class, so that it applies to every evaluator with no extra work.
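The per-step guard this comment describes can be sketched without any llama_index code: wrap each evaluation coroutine so that a failure is converted into a value instead of propagating through gather. safe_job below is a hypothetical helper, not part of the library's API:

```python
import asyncio
from typing import Any, Awaitable, Tuple


async def safe_job(name: str, job: Awaitable[Any]) -> Tuple[str, Any]:
    """Run one evaluation job; on failure, return the exception as a value."""
    try:
        return name, await job
    except Exception as e:
        # Keep the error alongside the successful results instead of raising.
        return name, e


async def ok() -> str:
    return "passed"


async def broken() -> str:
    raise RuntimeError("metric computation failed")


async def main() -> None:
    # gather never sees an exception, so no sibling result is lost.
    results = await asyncio.gather(safe_job("a", ok()), safe_job("b", broken()))
    for name, value in results:
        print(name, value)


asyncio.run(main())
```

The same idea is what return_exceptions=True gives you for free, but wrapping each job yourself lets you attach the evaluator name to the failure.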

bq8i3lrv3#


Thank you for bringing this to our attention. To avoid losing all results when a metric computation fails, adding exception handling to the aevaluate method of the BaseEvaluator class seems like a good idea. Here is how you could modify the aevaluate method in the BaseEvaluator class:

from abc import ABC, abstractmethod
from typing import Any, Optional, Sequence
from llama_index.core.evaluation import EvaluationResult

class BaseEvaluator(ABC):
    @abstractmethod
    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        """Run evaluation with query string, retrieved contexts,
        and generated response string.

        Subclasses can override this method to provide custom evaluation logic and
        take in additional arguments.
        """
        try:
            # Your evaluation logic here
            raise NotImplementedError
        except Exception as e:
            return EvaluationResult(
                invalid_result=True,
                invalid_reason=str(e),
            )

In this code, if an exception occurs during evaluation, it is caught and an EvaluationResult object is returned with invalid_result set to True and invalid_reason set to the exception message. This way, you avoid losing all results when a metric computation fails.
Note that you should replace the placeholder comment with your actual evaluation logic; the raise NotImplementedError is only a placeholder and should be removed.
This change should be made in the BaseEvaluator class of your local copy of the LlamaIndex library. After making it, you should be able to catch exceptions in each evaluator class's aevaluate method, preventing the loss of all results when a metric computation fails.
Let me know whether this solution helps, or if you need further assistance.


koaltpgm4#

You are on the right track! The problem with your implementation, though, is that classes inheriting from BaseEvaluator override the abstract aevaluate method, so the try-except logic is lost. You can find an example of this in GuidelineEvaluator.
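The pitfall pointed out here is easy to demonstrate in isolation. In this minimal sketch (hypothetical Base/Child classes, not llama_index code), a try/except written in the body of an abstract method vanishes as soon as a subclass overrides that method:

```python
from abc import ABC, abstractmethod


class Base(ABC):
    @abstractmethod
    def run(self) -> str:
        # This try/except only exists in Base's body; it is NOT inherited
        # as a safety net by subclasses that override run().
        try:
            raise NotImplementedError
        except Exception as e:
            return f"caught: {e}"


class Child(Base):
    def run(self) -> str:
        # This body completely replaces Base.run, including its try/except.
        raise ValueError("evaluator failed")


try:
    Child().run()
    outcome = "exception was caught somewhere"
except ValueError:
    outcome = "exception escaped to the caller"
print(outcome)  # prints "exception escaped to the caller"
```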

5lwkijsr5#

Thanks for the detailed information. It sounds like you ran into an issue where an exception raised in a subclass's aevaluate method causes all results to be lost. This is indeed a problem if exceptions are not handled properly.
To address it, you can modify the BaseEvaluator class to include a wrapper method around aevaluate that contains the try-except logic. The wrapper calls the aevaluate method, which can then be overridden in subclasses without affecting the try-except logic.
Here is how you could implement it:

class BaseEvaluator(PromptMixin):
    """Base Evaluator class."""

    # ...

    async def aevaluate_wrapper(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        """Wrapper for aevaluate method with try-except logic."""
        try:
            return await self.aevaluate(
                query=query,
                response=response,
                contexts=contexts,
                **kwargs,
            )
        except Exception as e:
            return EvaluationResult(
                invalid_result=True,
                invalid_reason=str(e),
            )

    @abstractmethod
    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        """Run evaluation with query string, retrieved contexts,
        and generated response string.

        Subclasses can override this method to provide custom evaluation logic and
        take in additional arguments.
        """
        raise NotImplementedError

    # ...

    def evaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        """Run evaluation with query string, retrieved contexts,
        and generated response string.

        Subclasses can override this method to provide custom evaluation logic and
        take in additional arguments.
        """
        return asyncio.run(
            self.aevaluate_wrapper(
                query=query,
                response=response,
                contexts=contexts,
                **kwargs,
            )
        )

In this code, the aevaluate_wrapper method contains a try-except block that catches any exception thrown by the aevaluate method. If an exception is caught, it returns an EvaluationResult with invalid_result set to True and invalid_reason set to the string representation of the exception. The evaluate method now calls aevaluate_wrapper instead of aevaluate, so the try-except logic is preserved even when aevaluate is overridden in a subclass.
This change should help prevent the loss of all results caused by a failed metric computation. Let me know if you have any questions or need further help.
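This wrapper (template-method) pattern can be exercised end to end with toy classes. SafeEvaluator, FlakyEvaluator, and SimpleResult below are hypothetical stand-ins, not the real llama_index classes:

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SimpleResult:
    """Stand-in for EvaluationResult with the two fields used here."""
    invalid_result: bool = False
    invalid_reason: Optional[str] = None


class SafeEvaluator(ABC):
    async def aevaluate_wrapper(self, **kwargs: Any) -> SimpleResult:
        # The public entry point owns the try/except, so an overridden
        # aevaluate cannot bypass it.
        try:
            return await self.aevaluate(**kwargs)
        except Exception as e:
            return SimpleResult(invalid_result=True, invalid_reason=str(e))

    @abstractmethod
    async def aevaluate(self, **kwargs: Any) -> SimpleResult:
        raise NotImplementedError


class FlakyEvaluator(SafeEvaluator):
    async def aevaluate(self, **kwargs: Any) -> SimpleResult:
        # Simulates the GuidelineEvaluator JSON parse failure.
        raise ValueError("Expecting ',' delimiter")


result = asyncio.run(FlakyEvaluator().aevaluate_wrapper())
print(result.invalid_result, result.invalid_reason)  # True Expecting ',' delimiter
```

Because the caller invokes aevaluate_wrapper rather than aevaluate, the failure is reported as a result object and the batch run continues.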


vmpqdwk36#

I ran into this issue as well. Hoping for a fix.

5gfr0r5j7#

Did you ever manage to find a way around this?
