Bug description
When running BatchEvalRunner over a large number of test cases and metrics, if any single evaluation fails (which I have found happens frequently with GuidelineEvaluator because of its unreliable JSON parsing), the exception is not caught and all of the other results are lost, wasting both time and money.
Version
0.10.4
Steps to reproduce
The failure is intermittent, so you may need to run the snippet several times.
from llama_index.core.evaluation import GuidelineEvaluator
from llama_index.core.evaluation import BatchEvalRunner
from llama_index.llms.openai import OpenAI
llm = OpenAI("gpt-4")
runner = BatchEvalRunner(
    {
        "my_guideline": GuidelineEvaluator(llm=llm, guidelines="The response should fully answer the query.")
    },
    workers=2,
    show_progress=True,
)
eval_results = await runner.aevaluate_response_strs(
    queries=["Limite de credito\n?\nOi"],
    response_strs=['Olá! Para ajustar o limite de crédito disponível no seu cartão, você precisa seguir os seguintes passos:\n\n1. Clique na opção: **Cartão de Crédito**, na tela inicial do aplicativo;\n2. Selecione: **Meus Limites**;\n3. Clique para digitar o valor ou mova o marcador roxo até o valor desejado, dentro do limite total do cartão.\n\nLembrando que **não realizamos análise e liberação de limite de crédito nos canais de atendimento**.\n\nA soma do seu **Limite Disponível** mais o **Valor Antecipado** indicará o limite total que você possui no momento. Conforme você realizar novas compras, esse limite será consumido até acabar. Após isso, as compras seguintes consumirão de seu limite normal. \n\nEntre o período das 20h até as 6h, o limite de pagamento é de R$1.000,00 de acordo com a resolução número 142 do Bacen.\n\nCaso sua dúvida seja sobre antecipação de parcelas de financiamentos, você pode acessar o tópico “” no "Me Ajuda".\n\nVocê gostaria de ser transferido para um agente agora?'],
)
Relevant log / traceback
{
"name": "ValidationError",
"message": "1 validation error for EvaluationData
__root__
Expecting ',' delimiter: line 1 column 349 (char 348) (type=value_error.jsondecode; msg=Expecting ',' delimiter; doc={\"passing\": false, \"feedback\": \"The response is detailed and provides a step-by-step guide on how to adjust the credit limit, which is helpful. However, the response fails to fully answer the query as it does not clarify what 'Limite de credito' means. The response also includes a placeholder “” in the sentence 'você pode acessar o tópico “” no \"Me Ajuda\"', which should be replaced with relevant information. Lastly, the offer to transfer to an agent seems unnecessary as the query was not a request for a live agent.\"}; pos=348; lineno=1; colno=349)",
"stack": "---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/pydantic/main.py:539, in pydantic.main.BaseModel.parse_raw()
File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/pydantic/parse.py:37, in pydantic.parse.load_str_bytes()
File ~/miniforge3/envs/project-evaluation/lib/python3.9/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
343 if (cls is None and object_hook is None and
344 parse_int is None and parse_float is None and
345 parse_constant is None and object_pairs_hook is None and not kw):
--> 346 return _default_decoder.decode(s)
347 if cls is None:
File ~/miniforge3/envs/project-evaluation/lib/python3.9/json/decoder.py:337, in JSONDecoder.decode(self, s, _w)
333 \"\"\"Return the Python representation of ``s`` (a ``str`` instance
334 containing a JSON document).
335
336 \"\"\"
--> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
338 end = _w(s, end).end()
File ~/miniforge3/envs/project-evaluation/lib/python3.9/json/decoder.py:353, in JSONDecoder.raw_decode(self, s, idx)
352 try:
--> 353 obj, end = self.scan_once(s, idx)
354 except StopIteration as err:
JSONDecodeError: Expecting ',' delimiter: line 1 column 349 (char 348)
During handling of the above exception, another exception occurred:
ValidationError Traceback (most recent call last)
Cell In[8], line 15
5 llm = OpenAI(\"gpt-4\")
7 runner = BatchEvalRunner(
8 {
9 \"my_guideline\": GuidelineEvaluator(llm=llm, guidelines=\"The response should fully answer the query.\")
(...)
12 show_progress=True,
13 )
---> 15 eval_results = await runner.aevaluate_response_strs(
16 queries=[\"Limite de credito\
?\
Oi\"],
17 response_strs=['Olá! Para ajustar o limite de crédito disponível no seu cartão, você precisa seguir os seguintes passos:\
\
1. Clique na opção: **Cartão de Crédito**, na tela inicial do aplicativo;\
2. Selecione: **Meus Limites**;\
3. Clique para digitar o valor ou mova o marcador roxo até o valor desejado, dentro do limite total do cartão.\
\
Lembrando que **não realizamos análise e liberação de limite de crédito nos canais de atendimento**.\
\
A soma do seu **Limite Disponível** mais o **Valor Antecipado** indicará o limite total que você possui no momento. Conforme você realizar novas compras, esse limite será consumido até acabar. Após isso, as compras seguintes consumirão de seu limite normal. \
\
Entre o período das 20h até as 6h, o limite de pagamento é de R$1.000,00 de acordo com a resolução número 142 do Bacen.\
\
Caso sua dúvida seja sobre antecipação de parcelas de financiamentos, você pode acessar o tópico “” no \"Me Ajuda\".\
\
Você gostaria de ser transferido para um agente agora?'],
18 )
File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/llama_index/core/evaluation/batch_runner.py:188, in BatchEvalRunner.aevaluate_response_strs(self, queries, response_strs, contexts_list, **eval_kwargs_lists)
176 for name, evaluator in self.evaluators.items():
177 eval_jobs.append(
178 eval_worker(
179 self.semaphore,
(...)
186 )
187 )
--> 188 results = await self.asyncio_mod.gather(*eval_jobs)
190 # Format results
191 return self._format_results(results)
File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/tqdm/asyncio.py:79, in tqdm_asyncio.gather(cls, loop, timeout, total, *fs, **tqdm_kwargs)
76 return i, await f
78 ifs = [wrap_awaitable(i, f) for i, f in enumerate(fs)]
---> 79 res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
80 total=total, **tqdm_kwargs)]
81 return [i for _, i in sorted(res)]
File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/tqdm/asyncio.py:79, in <listcomp>(.0)
76 return i, await f
78 ifs = [wrap_awaitable(i, f) for i, f in enumerate(fs)]
---> 79 res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
80 total=total, **tqdm_kwargs)]
81 return [i for _, i in sorted(res)]
File ~/miniforge3/envs/project-evaluation/lib/python3.9/asyncio/tasks.py:611, in as_completed.<locals>._wait_for_one()
608 if f is None:
609 # Dummy value from _on_timeout().
610 raise exceptions.TimeoutError
--> 611 return f.result()
File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/tqdm/asyncio.py:76, in tqdm_asyncio.gather.<locals>.wrap_awaitable(i, f)
75 async def wrap_awaitable(i, f):
---> 76 return i, await f
File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/llama_index/core/evaluation/batch_runner.py:43, in eval_worker(semaphore, evaluator, evaluator_name, query, response_str, contexts, eval_kwargs)
39 eval_kwargs = eval_kwargs or {}
40 async with semaphore:
41 return (
42 evaluator_name,
---> 43 await evaluator.aevaluate(
44 query=query, response=response_str, contexts=contexts, **eval_kwargs
45 ),
46 )
File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/llama_index/core/evaluation/guideline.py:115, in GuidelineEvaluator.aevaluate(***failed resolving arguments***)
107 await asyncio.sleep(sleep_time_in_seconds)
109 eval_response = await self._llm.apredict(
110 self._eval_template,
111 query=query,
112 response=response,
113 guidelines=self._guidelines,
114 )
--> 115 eval_data = self._output_parser.parse(eval_response)
116 eval_data = cast(EvaluationData, eval_data)
118 return EvaluationResult(
119 query=query,
120 response=response,
(...)
123 feedback=eval_data.feedback,
124 )
File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/llama_index/core/output_parsers/pydantic.py:62, in PydanticOutputParser.parse(self, text)
60 \"\"\"Parse, validate, and correct errors programmatically.\"\"\"
61 json_str = extract_json_str(text)
---> 62 return self._output_cls.parse_raw(json_str)
File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/pydantic/main.py:548, in pydantic.main.BaseModel.parse_raw()
ValidationError: 1 validation error for EvaluationData
__root__
Expecting ',' delimiter: line 1 column 349 (char 348) (type=value_error.jsondecode; msg=Expecting ',' delimiter; doc={\"passing\": false, \"feedback\": \"The response is detailed and provides a step-by-step guide on how to adjust the credit limit, which is helpful. However, the response fails to fully answer the query as it does not clarify what 'Limite de credito' means. The response also includes a placeholder “” in the sentence 'você pode acessar o tópico “” no \"Me Ajuda\"', which should be replaced with relevant information. Lastly, the offer to transfer to an agent seems unnecessary as the query was not a request for a live agent.\"}; pos=348; lineno=1; colno=349)"
}
7 Answers
twh00eeo1#
The issue you are seeing appears to be that BatchEvalRunner does not handle exceptions that occur during evaluation. You can fix this by adding try/except blocks around the evaluation calls in the aevaluate_response_strs, aevaluate_responses, and aevaluate_queries methods. Here is the updated code for the aevaluate_response_strs method: [snippet not preserved in this copy of the thread]. You will need to add similar try/except blocks in the aevaluate_responses and aevaluate_queries methods as well. This will ensure that any exception raised during evaluation is caught and handled, preventing the loss of all results. You can find the original implementation of the BatchEvalRunner class in the LlamaIndex repository. Let me know if this solution helps, or if you need further assistance.
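Since the snippet was lost, here is a hedged guess at its shape: the suggestion amounts to wrapping the awaited evaluation jobs in try/except. All names below are simplified stand-ins for illustration, not the real llama_index API.

```python
import asyncio

# Hypothetical reconstruction -- the original snippet was not preserved in
# this copy of the thread. run_batch mirrors the
# `await self.asyncio_mod.gather(*eval_jobs)` call in batch_runner.py.
async def run_batch(eval_jobs):
    try:
        return await asyncio.gather(*eval_jobs)
    except Exception as e:
        print(f"Batch evaluation failed: {e}")
        return []

async def good_job():
    return ("my_guideline", "passing")

async def bad_job():
    # Stands in for GuidelineEvaluator's flaky JSON parsing.
    raise ValueError("Expecting ',' delimiter")

results = asyncio.run(run_batch([good_job(), bad_job()]))
# Note: `results` ends up empty -- the good result is discarded along with
# the bad one, which is the weakness the next reply points out.
```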
Sources
About Dosu
This response is meant to be helpful and save you time. It is not a precise solution, but a starting point for your own research.
Help me be more useful! Please leave a 👍 if this was helpful and a 👎 if it was irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
cyvaqqii2#
With your approach, all results are still lost, because you are catching the exception around the entire run rather than around each individual step. What you need to do instead is catch the exception inside each evaluator's aevaluate method, as done here. An even better solution would be to catch it directly in the BaseEvaluator class, so it applies to every evaluator with no extra work.
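For reference, the actual failure in the traceback above is a JSON parse error on the model's output. Catching it at the evaluator level, as this reply suggests, turns one bad parse into one invalid result instead of a batch-wide crash. A minimal sketch with plain json rather than the real PydanticOutputParser:

```python
import json

# The model output from the traceback contains unescaped double quotes
# ("Me Ajuda") inside a JSON string value, which breaks strict JSON parsing.
doc = '{"passing": false, "feedback": "acesse o tópico no "Me Ajuda""}'

def parse_eval(text):
    # Sketch: catch the parse error where it happens, so the caller can
    # record an invalid result for this one case and keep going.
    try:
        return json.loads(text), None
    except json.JSONDecodeError as e:
        return None, str(e)

data, err = parse_eval(doc)
# err reports "Expecting ',' delimiter", just like the traceback above.
```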
bq8i3lrv3#
Thank you for bringing this to our attention. To avoid losing all results when a metric fails to compute, adding exception handling to the aevaluate method of the BaseEvaluator class seems like a good idea. Here is how the aevaluate method in the BaseEvaluator class could be modified: [snippet not preserved in this copy of the thread]. In that code, if an exception occurs during evaluation, it is caught and an EvaluationResult object is returned with invalid_result set to True and invalid_reason set to the exception message. That way, you no longer lose all results when a metric fails to compute. Note that you should replace # Your evaluation logic here with your actual evaluation logic; the raise NotImplementedError is just a placeholder and should be removed. This change should be made to the BaseEvaluator class in your local copy of the LlamaIndex library. Once made, exceptions raised in each evaluator class's aevaluate method will be caught, preventing the loss of all results when a metric fails to compute. Let me know if this solution helps, or if you need further assistance.
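A hedged reconstruction of what that modification might look like. EvaluationResult is mocked here with only the fields this reply mentions (the real llama_index class has more), so the sketch is self-contained:

```python
import asyncio
from dataclasses import dataclass
from typing import Optional

# Simplified stand-in for llama_index's EvaluationResult, limited to the
# fields discussed in this reply.
@dataclass
class EvaluationResult:
    passing: Optional[bool] = None
    feedback: Optional[str] = None
    invalid_result: bool = False
    invalid_reason: Optional[str] = None

class BaseEvaluator:
    async def aevaluate(self, query=None, response=None, contexts=None, **kwargs):
        try:
            # Your evaluation logic here -- the raise below is only a
            # placeholder and should be removed in a real subclass.
            raise NotImplementedError
        except Exception as e:
            # A failed metric becomes an invalid result, not a lost batch.
            return EvaluationResult(invalid_result=True, invalid_reason=str(e))

result = asyncio.run(BaseEvaluator().aevaluate(query="Limite de credito?"))
# result.invalid_result is True
```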
koaltpgm4#
You're on the right track! The problem with your implementation, though, is that classes inheriting from BaseEvaluator override the abstract aevaluate method, so the try-except logic is lost. You can find an example of this in GuidelineEvaluator.
5lwkijsr5#
Thank you for the detailed information. It looks like you are hitting an issue where an exception in a subclass's aevaluate method causes all results to be lost, which is indeed a problem when exceptions are not handled properly. To solve it, you can modify the BaseEvaluator class to include a wrapper method around aevaluate that contains the try-except logic. The wrapper calls the aevaluate method, which can then be overridden in subclasses without affecting the try-except logic. Here is how it could be implemented:
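[The original snippet was not preserved in this copy of the thread. The following is a sketch of the wrapper pattern described; EvaluationResult and the failing evaluator are simplified stand-ins, not the real llama_index classes.]

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvaluationResult:
    passing: Optional[bool] = None
    invalid_result: bool = False
    invalid_reason: Optional[str] = None

class BaseEvaluator(ABC):
    async def aevaluate_wrapper(self, **kwargs) -> EvaluationResult:
        # The try-except lives here, so subclasses that override aevaluate
        # keep the protection automatically.
        try:
            return await self.aevaluate(**kwargs)
        except Exception as e:
            return EvaluationResult(invalid_result=True, invalid_reason=str(e))

    @abstractmethod
    async def aevaluate(self, **kwargs) -> EvaluationResult:
        ...

class FlakyEvaluator(BaseEvaluator):
    """Fails the way GuidelineEvaluator's JSON parsing does."""
    async def aevaluate(self, **kwargs) -> EvaluationResult:
        raise ValueError("Expecting ',' delimiter: line 1 column 349")

result = asyncio.run(FlakyEvaluator().aevaluate_wrapper(query="Oi"))
# result.invalid_result is True, and the batch can keep running.
```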
In this code, the aevaluate_wrapper method contains a try-except block that catches any exception thrown by the aevaluate method. If an exception is caught, it returns an EvaluationResult with invalid_result set to True and invalid_reason set to the string representation of the exception. The evaluate method now calls aevaluate_wrapper instead of aevaluate, so the try-except logic is preserved even when aevaluate is overridden in a subclass. This change should help prevent losing all results because a metric could not be computed. Let me know if you have any questions or need further assistance.
vmpqdwk36#
I'm running into this issue as well. Hoping it gets fixed.
5gfr0r5j7#
Did you ever find a way to work around this?