langchain ChatHuggingFace + HuggingFaceEndpoint does not implement max_new_tokens correctly

pw136qt2 · posted 3 months ago

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from transformers import AutoTokenizer
from langchain_huggingface import ChatHuggingFace
from langchain_huggingface import HuggingFaceEndpoint

import requests

sample = requests.get(
    "https://raw.githubusercontent.com/huggingface/blog/main/langchain.md"
).text

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")

def n_tokens(text):
    return len(tokenizer(text)["input_ids"])

print(f"The number of tokens in the sample is {n_tokens(sample)}")

llm_10 = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-70B-Instruct",
    max_new_tokens=10,
    cache=False,
    seed=123,
)
llm_4096 = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-70B-Instruct",
    max_new_tokens=4096,
    cache=False,
    seed=123,
)

messages = [
    (
        "system",
        "You are a smart AI that has to describe a given text in to at least 1000 characters.",
    ),
    ("user", f"Summarize the following text:\n\n{sample}\n"),
]

# native endpoint
response_10_native = llm_10.invoke(messages)
print(f"Native response 10: {n_tokens(response_10_native)} tokens")
response_4096_native = llm_4096.invoke(messages)
print(f"Native response 4096: {n_tokens(response_4096_native)} tokens")

# make sure the native responses are different lengths
assert len(response_10_native) < len(
    response_4096_native
), f"Native response 10 should be shorter than native response 4096, 10 `max_new_tokens`: {n_tokens(response_10_native)}, 4096 `max_new_tokens`: {n_tokens(response_4096_native)}"

# chat implementation from langchain_huggingface
chat_model_10 = ChatHuggingFace(llm=llm_10)
chat_model_4096 = ChatHuggingFace(llm=llm_4096)

# chat implementation for 10 tokens
response_10 = chat_model_10.invoke(messages)
print(f"Response 10: {n_tokens(response_10.content)} tokens")
actual_response_tokens_10 = response_10.response_metadata.get(
    "token_usage"
).completion_tokens

print(
    f"Actual response 10: {actual_response_tokens_10} tokens (always 100 for some reason!)"
)

# chat implementation for 4096 tokens
response_4096 = chat_model_4096.invoke(messages)
print(f"Response 4096: {n_tokens(response_4096.content)} tokens")
actual_response_tokens_4096 = response_4096.response_metadata.get(
    "token_usage"
).completion_tokens

print(
    f"Actual response 4096: {actual_response_tokens_4096} tokens (always 100 for some reason!)"
)

# assert that the responses are different lengths, which fails because the token usage is always 100
print("-" * 20)
print(f"Output for 10 tokens: {response_10.content}")
print("-" * 20)
print(f"Output for 4096 tokens: {response_4096.content}")
print("-" * 20)
assert len(response_10.content) < len(
    response_4096.content
), f"Response 10 should be shorter than response 4096, 10 `max_new_tokens`: {n_tokens(response_10.content)}, 4096 `max_new_tokens`: {n_tokens(response_4096.content)}"

This is the output of the script:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The number of tokens in the sample is 1809
Native response 10: 11 tokens
Native response 4096: 445 tokens
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Response 10: 101 tokens
Actual response 10: 100 tokens (always 100 for some reason!)
Response 4096: 101 tokens
Actual response 4096: 100 tokens (always 100 for some reason!)

--------------------
Output for 10 tokens: The text announces the launch of a new partner package called `langchain_huggingface` in LangChain, jointly maintained by Hugging Face and LangChain. This package aims to bring the power of Hugging Face's latest developments into LangChain and keep it up-to-date. The package was created by the community, and by becoming a partner package, the time it takes to bring new features from Hugging Face's ecosystem to LangChain's users will be reduced.

The package integrates seamlessly with Lang
--------------------
Output for 4096 tokens: The text announces the launch of a new partner package called `langchain_huggingface` in LangChain, jointly maintained by Hugging Face and LangChain. This package aims to bring the power of Hugging Face's latest developments into LangChain and keep it up-to-date. The package was created by the community, and by becoming a partner package, the time it takes to bring new features from Hugging Face's ecosystem to LangChain's users will be reduced.

The package integrates seamlessly with Lang
--------------------

Error Message and Stack Trace (if applicable)

AssertionError: Response 10 should be shorter than response 4096, 10 `max_new_tokens`: 101, 4096 `max_new_tokens`: 101

Description

There seems to be a problem when the langchain_huggingface.llms.huggingface_endpoint.HuggingFaceEndpoint implementation is used together with langchain_huggingface.chat_models.huggingface.ChatHuggingFace.
When HuggingFaceEndpoint is used on its own, the max_new_tokens parameter is honored correctly, but it stops working once the endpoint is wrapped in ChatHuggingFace(llm=...). The latter implementation always returns a response of 100 tokens, and after searching the documentation and the source code I still could not get it to behave otherwise.
I have created a reproducible example using meta-llama/Meta-Llama-3-70B-Instruct (since that model is also available serverless).
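
A possible workaround worth trying (a sketch only, reusing llm_4096, messages and n_tokens from the script above; it assumes ChatHuggingFace forwards extra keyword arguments through to the underlying InferenceClient.chat_completion call, which accepts a max_tokens parameter):

# Sketch: pass max_tokens explicitly at call time instead of relying on the
# wrapped HuggingFaceEndpoint's max_new_tokens being forwarded.
chat_model = ChatHuggingFace(llm=llm_4096)
response = chat_model.invoke(messages, max_tokens=4096)
print(f"Workaround response: {n_tokens(response.content)} tokens")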

System Info

System Information

OS: Darwin
OS Version: Darwin Kernel Version 23.5.0: Wed May 1 20:19:05 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T8112
Python Version: 3.12.3 (main, Apr 9 2024, 08:09:14) [Clang 15.0.0 (clang-1500.3.9.4)]

Package Information

langchain_core: 0.2.10
langchain: 0.2.6
langchain_community: 0.2.5
langsmith: 0.1.82
langchain_anthropic: 0.1.15
langchain_aws: 0.1.7
langchain_huggingface: 0.0.3
langchain_openai: 0.1.9
langchain_text_splitters: 0.2.2
langchainhub: 0.1.20

Packages not installed (Not Necessarily a Problem)

The following packages were not found:
langgraph
langserve

xxhby3vn #1

OK, you are comparing two different things. The Hugging Face inference client returns an object that has a usage attribute of type ChatCompletionOutputUsage.
ChatCompletionOutputUsage reports three kinds of token usage:

  1. completion_tokens: the number of tokens needed to complete the prompt. In your case this is always fixed because you call the completion with the same prompt. Try something else and it should change.
  2. prompt_tokens: the number of tokens in the prompt.
  3. total_tokens: the sum of completion_tokens and prompt_tokens.

So, through your n_tokens function you are implicitly comparing total_tokens with completion_tokens, which is not correct. You should compare the total_tokens attribute for a correct comparison.
P.S. I double-checked the LangChain code and made sure that ChatHuggingFace returns the correct ChatCompletionOutputUsage without any modification.
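
For reference, the three fields can be read directly from the chat response's metadata in the reporter's script (a sketch, assuming the token_usage entry holds the ChatCompletionOutputUsage object described above):

usage = response_10.response_metadata.get("token_usage")
print(usage.prompt_tokens)      # tokens in the prompt
print(usage.completion_tokens)  # tokens generated for the completion
print(usage.total_tokens)       # prompt_tokens + completion_tokens
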
igetnqfo #2

(quoting the answer above)
I think you misread the example code: the n_tokens() function is called on the content output of the chain, so completion_tokens == n_tokens(output) - 1. The 1 that is subtracted is the special end-of-sequence token (which is why the output says 101 tokens rather than 100). The problem is that ChatCompletionOutputUsage.completion_tokens should always be less than or equal to max_new_tokens, yet it is 100 tokens no matter which max_new_tokens is supplied.
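
A quick way to check that relation against the script above (a sketch; the extra token is the special token the tokenizer adds, which the reported usage does not include):

usage_10 = response_10.response_metadata.get("token_usage")
# tokenizer count = reported completion tokens + 1 special token
assert n_tokens(response_10.content) == usage_10.completion_tokens + 1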
qmb5sa22 #3

I'm running into the same problem... did you find a solution?

oymdgrw7 #4

I'm running into the same problem... did you find a solution?
No, and this issue makes me think the whole Hugging Face x LangChain integration is out of date. I have been trying to work around it by using an OpenAI-compatible web server through LlamaCpp/Ollama instead.
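
For anyone going the same route, a minimal sketch of that kind of setup with langchain_openai pointed at a local Ollama server (the URL, model name and API key below are placeholder assumptions, not part of the original report):

from langchain_openai import ChatOpenAI

# OpenAI-compatible local server (here: Ollama's /v1 endpoint); max_tokens is
# respected on this code path.
chat_model = ChatOpenAI(
    base_url="http://localhost:11434/v1",  # assumed local Ollama endpoint
    api_key="ollama",                      # Ollama accepts any non-empty key
    model="llama3",                        # assumes the model is pulled locally
    max_tokens=4096,
)
response = chat_model.invoke(messages)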
