python-3.x LangChain: Querying a document and getting structured output using Pydantic with ChatGPT not working well

Posted by tmb3ates on 2023-08-08 in Python


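I am trying to get a LangChain application to query a document that contains different types of information. To make things easier for my application, I want the response in a specific format, so I am using Pydantic to structure the data as I need, but I am running into an issue.
Sometimes ChatGPT doesn't respect the format of my Pydantic structure, so an exception is raised and my program stops. Of course I can handle the exception, but I would much prefer that ChatGPT respect the format, and I wonder whether I am doing something wrong.
More specifically:

The date is not formatted correctly by ChatGPT: it returns dates exactly as they are found in the document, not in a form that parses as datetime.date.
The Enum field from Pydantic also doesn't work well: sometimes the document has Lastname rather than Surname, and ChatGPT returns Lastname instead of converting it to Surname.
Lastly, I don't know whether I am using the chains correctly, because I keep getting confused by all the different examples in the LangChain documentation.
After loading all the necessary packages, this is my code: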

FILE_PATH = 'foo.pdf'

class NameEnum(Enum):
    Name = 'Name'
    Surname = 'Surname'

class DocumentSchema(BaseModel):
    date: datetime.date = Field(..., description='The date of the doc')
    name: NameEnum = Field(..., description='Is it name or surname?')

def main():
    loader = PyPDFLoader(FILE_PATH)
    data = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=10)
    all_splits = text_splitter.split_documents(data)
    vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
    llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)
    question = """What is the date on the document?
        Is it about a name or surname?
    """

    doc_prompt = PromptTemplate(
        template="Content: {page_content}\nSource: {source}",
        input_variables=["page_content", "source"],
    )
    prompt_messages = [
        SystemMessage(
            content=(
                "You are a world class algorithm for extracting information in structured formats."
            )
        ),
        HumanMessage(content="Answer the questions using the following context"),
        HumanMessagePromptTemplate.from_template("{context}"),
        HumanMessagePromptTemplate.from_template("Question: {question}"),
        HumanMessage(
            content="Tips: Make sure to answer in the correct format"
        ),
    ]

    chain_prompt = ChatPromptTemplate(messages=prompt_messages)

    chain = create_structured_output_chain(output_schema=DocumentSchema, llm=llm, prompt=chain_prompt)
    final_qa_chain_pydantic = StuffDocumentsChain(
        llm_chain=chain,
        document_variable_name="context",
        document_prompt=doc_prompt,
    )
    retrieval_qa_pydantic = RetrievalQA(
        retriever=vectorstore.as_retriever(), combine_documents_chain=final_qa_chain_pydantic
    )
    data = retrieval_qa_pydantic.run(question)

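Depending on the file I am checking, running the script raises an error because the output from ChatGPT doesn't respect the Pydantic format.
What am I missing?
Thank you!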


Source: https://stackoverflow.com/questions/76822673/langchain-querying-a-document-and-getting-structured-output-using-pydantic-with

1 Answer

zfycwa2u1#

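I managed to solve my issues, and this is how I solved them.

try/except block

First, I added a try/except block around the chain execution code so that those pesky errors are caught without stopping execution.

Cleaning the vectorstore

I also noticed that the vectorstore variable was not being "cleaned" between runs. I process different documents within the same execution, so old data was being carried over into each new document. I realized that I needed to clean the vectorstore after every run: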

try:
    # Retrieve the data
    data = retrieval_qa_pydantic.run(question)
    # Delete the embeddings for the next run
    vectorstore.delete()
except error_wrappers.ValidationError as e:
    log.error(f'Error parsing file: {e}')
else:
    return data
return None
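
One caveat with the snippet above: if the chain raises, `vectorstore.delete()` is never reached, so the old embeddings would survive into the next run. A minimal sketch of one way around that is to move the cleanup into a `finally` clause. `FakeStore` and `run_chain` are hypothetical stand-ins so the example runs without LangChain, and it catches `ValueError`, which Pydantic's `ValidationError` subclasses.

```python
import logging

log = logging.getLogger(__name__)


class FakeStore:
    """Hypothetical stand-in for the vectorstore; it only tracks cleanup."""

    def __init__(self):
        self.cleaned = False

    def delete(self):
        self.cleaned = True


def extract(run_chain, store):
    """Run the chain and always clean the store afterwards.

    run_chain is a hypothetical zero-argument callable that would wrap
    retrieval_qa_pydantic.run(question) in the real application.
    """
    try:
        return run_chain()
    except ValueError as e:  # pydantic's ValidationError is a ValueError
        log.error(f"Error parsing file: {e}")
        return None
    finally:
        # Runs on success and on failure alike
        store.delete()


def failing_chain():
    raise ValueError("date: invalid date format")


ok_store, bad_store = FakeStore(), FakeStore()
good = extract(lambda: {"name": "Surname"}, ok_store)
bad = extract(failing_chain, bad_store)
```

With this shape, both the successful and the failing call leave the store cleaned, and the failing call returns `None` instead of stopping the program.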


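Format prompt

Then I noticed that I needed to be more explicit about the data format. I modified the instructions to fit my requirements and gave the model some extra help, like this: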

HumanMessage(
    content="Tips: Make sure to answer in the correct format. Dates should be in the format YYYY-MM-DD."
),

The key is the Tips part of the message. From that moment on, I had no more formatting errors with dates.
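
The prompt hint is a soft constraint, so the model may still occasionally return a date the way it appears in the document. A minimal sketch of a fallback normalizer that could run on the raw value before validation is shown here. The function name `parse_doc_date` and the list of accepted formats are assumptions, not part of the original code.

```python
import datetime

# Assumed formats; extend the tuple to match how dates appear in your documents.
# Note that "%d/%m/%Y" treats 03/04/2023 as 3 April, not March 4.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%B %d, %Y")


def parse_doc_date(raw):
    """Try each known format in turn; return a datetime.date, or None."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue  # this format did not match, try the next one
    return None


iso = parse_doc_date("2023-08-08")
slashed = parse_doc_date("08/08/2023")
unknown = parse_doc_date("sometime last summer")
```

Returning `None` for unrecognized input keeps the decision about what to do with a bad date (log it, skip the file, retry) in the calling code.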

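`None` enum

To solve the Enum issue, I modified the class to allow a None value, which means that when the LLM can't find the information I need, it sets the variable to None. This is how I fixed it: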

class NameEnum(Enum):
    Name = 'Name'
    Surname = 'Surname'
    NON = None
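
The `NON` member covers missing values, but not the synonym case from the question, where the document says Lastname rather than Surname. A minimal sketch, assuming you want to map known synonyms yourself, uses the standard library's `Enum._missing_` hook, which is called when a lookup by value fails. The `_ALIASES` table is a hypothetical addition, and whether the validator goes through this hook depends on the Pydantic version in use.

```python
from enum import Enum

# Hypothetical synonym table; keys are lower-cased labels seen in documents
_ALIASES = {"lastname": "Surname", "last name": "Surname", "first name": "Name"}


class NameEnum(Enum):
    Name = 'Name'
    Surname = 'Surname'
    NON = None

    @classmethod
    def _missing_(cls, value):
        # Called only when NameEnum(value) matches no member directly
        if isinstance(value, str):
            canonical = _ALIASES.get(value.strip().lower())
            if canonical is not None:
                return cls(canonical)
        # Anything unrecognized falls back to the "not found" member
        return cls.NON


resolved = NameEnum("Lastname")
fallback = NameEnum("something else")
```

Falling back to `NON` for every unknown value trades strictness for robustness. If silent fallbacks would hide real extraction errors, returning `None` from `_missing_` for unknown values keeps the usual `ValueError`.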

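Last but not least, I noticed that I was getting a lot of wrong information from my document, so I had to tweak a few more things:

Bigger splits and `gpt-4`

I increased the chunk size from 200 to 500 and, to improve accuracy on the task, I switched the model from gpt-3.5-turbo to gpt-4. With the larger chunks and gpt-4, the inconsistencies went away and data extraction now works almost perfectly.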

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=10)
all_splits = text_splitter.split_documents(data)
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
llm = ChatOpenAI(model_name='gpt-4', temperature=0)
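
To get a feel for what the `chunk_size` change does, here is a minimal sketch using a plain fixed-width character window with overlap. It is an illustration only, not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and newlines), and the `window_split` helper is hypothetical.

```python
def window_split(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that overlap by chunk_overlap chars."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


text = "x" * 1000  # stand-in for 1,000 characters of PDF text

small = window_split(text, chunk_size=200, chunk_overlap=10)
large = window_split(text, chunk_size=500, chunk_overlap=10)

print(len(small), len(large))  # prints "6 3"
```

Fewer, larger chunks mean each retrieved chunk carries more surrounding context, which is a plausible reason the date and the name became easier for the model to find.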

I hope these tips help someone in the future.

Answered 2023-08-08
