llama_index [docs]: Document the pipeline used by NLSQLTableQueryEngine (and other query engines)

m528fe3b · posted 2 months ago in Other
Follow (0) | Answers (5) | Views (45)

Documentation issue description

The original question in the discussion was never answered, so I'm not sure whether this is an obvious question or a genuine gap in the documentation that needs fixing.
I have gone through the docs, the examples, and the source code, but it is still unclear to me what exactly happens when you call query() on an NLSQLTableQueryEngine. As far as I can tell, a SQL retriever builds a prompt from a predefined template with the schema as context, sends it to the LLM, gets back a SQL query, fetches the results from the SQL server, and appends them to a final prompt that is sent to the LLM. However, the only mention of the schema I could find appears to produce a huge, verbose schema that would not fit in the context window.
The whole process should be documented; it shouldn't take more than a few lines.
Thanks a lot!

Documentation link

https://docs.llamaindex.ai/en/stable/api_reference/query_engine/NL_SQL_table/?h=nlsqltablequeryengine#llama_index.core.query_engine.NLSQLTableQueryEngine

2ledvvac 1#

Hey, @szhydkov! I'm here to help you with any bugs, questions, or contributions. Let's work through this together!
The NLSQLTableQueryEngine in the Llama Index API uses a pipeline that turns a natural-language query into a SQL query and executes it. Here is what happens in detail when the query() method is called:

  • Initialization:
    • NLSQLTableQueryEngine is initialized with various parameters, including the SQL database, prompt templates, context query kwargs, and flags for response synthesis and verbosity.
    • An instance of NLSQLRetriever is created and assigned to _sql_retriever.
  • Query execution:
    • When the query() method is called, it internally invokes the _query() method of the BaseSQLTableQueryEngine class.
    • The _query() method first retrieves the SQL query and metadata by calling retrieve_with_metadata() on sql_retriever.
  • SQL query retrieval:
    • The retrieve_with_metadata() method of NLSQLRetriever processes the natural-language query to generate a SQL query.
    • The SQL query is then executed against the SQL database and the results are retrieved.
  • Response synthesis:
    • If synthesize_response is set to True, the _query() method synthesizes a response using response_synthesis_prompt.
    • A response synthesizer is created via the get_response_synthesizer() function; it formats the SQL query and its results into a coherent response.
    • The synthesized response is then returned as the final output.
  • Returning the response:
    • If synthesize_response is set to False, the raw SQL query results are returned directly.
    • The final response, whether synthesized or raw, is wrapped in a Response object and returned.

Here are the relevant code snippets illustrating this process:

class NLSQLTableQueryEngine(BaseSQLTableQueryEngine):
    def __init__(
        self,
        sql_database: SQLDatabase,
        text_to_sql_prompt: Optional[BasePromptTemplate] = None,
        context_query_kwargs: Optional[dict] = None,
        synthesize_response: bool = True,
        response_synthesis_prompt: Optional[BasePromptTemplate] = None,
        tables: Optional[Union[List[str], List[Table]]] = None,
        service_context: Optional[ServiceContext] = None,
        context_str_prefix: Optional[str] = None,
        sql_only: bool = False,
        verbose: bool = False,
        **kwargs: Any,
    ) -> None:
        # The retriever does the natural-language-to-SQL work.
        self._sql_retriever = NLSQLRetriever(
            sql_database, text_to_sql_prompt=text_to_sql_prompt,
            context_query_kwargs=context_query_kwargs, tables=tables,
            context_str_prefix=context_str_prefix, service_context=service_context,
            sql_only=sql_only, verbose=verbose,
        )
        super().__init__(
            synthesize_response=synthesize_response,
            response_synthesis_prompt=response_synthesis_prompt,
            service_context=service_context, verbose=verbose, **kwargs,
        )

    @property
    def sql_retriever(self) -> NLSQLRetriever:
        return self._sql_retriever

class BaseSQLTableQueryEngine(BaseQueryEngine):
    def _query(self, query_bundle: QueryBundle) -> Response:
        # Retrieve result nodes plus metadata (including the generated SQL).
        retrieved_nodes, metadata = self.sql_retriever.retrieve_with_metadata(query_bundle)
        sql_query_str = metadata["sql_query"]
        if self._synthesize_response:
            partial_synthesis_prompt = self._response_synthesis_prompt.partial_format(
                sql_query=sql_query_str
            )
            response_synthesizer = get_response_synthesizer(
                llm=self._llm, callback_manager=self.callback_manager,
                text_qa_template=partial_synthesis_prompt,
                refine_template=self._refine_synthesis_prompt,
                verbose=self._verbose, streaming=self._streaming,
            )
            response = response_synthesizer.synthesize(
                query=query_bundle.query_str, nodes=retrieved_nodes
            )
            cast(Dict, response.metadata).update(metadata)
            return cast(Response, response)
        else:
            # No synthesis: return the raw SQL results joined as text.
            response_str = "\n".join([node.node.text for node in retrieved_nodes])
            return Response(response=response_str, metadata=metadata)

In addition, the Llama Index documentation provides an advanced guide on setting up a text-to-SQL pipeline, which covers:

  1. Query-time table retrieval: dynamically retrieve the relevant tables into the text-to-SQL prompt.
  2. Query-time sample row retrieval: embed/index each row and dynamically retrieve sample rows for each table into the text-to-SQL prompt.
    The guide demonstrates how to load and ingest data, define the core modules (object index, retriever, SQLDatabase, text-to-SQL prompt, response synthesis prompt, LLM), and set up an end-to-end text-to-SQL pipeline with table retrieval and row retrieval. It also includes code samples for indexing tables, defining the query pipeline, and running queries.
    For more details, see the Llama Index documentation: 1 (https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/indices/struct_store/sql_query.py), 2 (https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/indices/struct_store/sql_query.py), 3 (https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/pipeline/query_pipeline_sql.ipynb).
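The table-retrieval setup from that guide can be sketched roughly as follows. This is a sketch based on the guide, not a verbatim excerpt; it assumes `engine` is an existing SQLAlchemy engine and that the `llama_index.core` package is installed.

```python
from llama_index.core import SQLDatabase, VectorStoreIndex
from llama_index.core.objects import (
    ObjectIndex,
    SQLTableNodeMapping,
    SQLTableSchema,
)
from llama_index.core.query_engine import SQLTableRetrieverQueryEngine

sql_database = SQLDatabase(engine)  # engine: an existing SQLAlchemy engine

# One schema object per table, indexed so that only the most relevant
# tables are pulled into the text-to-SQL prompt at query time.
table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = [
    SQLTableSchema(table_name=name)
    for name in sql_database.get_usable_table_names()
]
obj_index = ObjectIndex.from_objects(
    table_schema_objs, table_node_mapping, VectorStoreIndex
)

# Retrieve only the top-3 matching tables per query instead of the full schema.
query_engine = SQLTableRetrieverQueryEngine(
    sql_database, obj_index.as_retriever(similarity_top_k=3)
)
```

With this setup, only the retrieved tables' schemas appear in the prompt, which is the guide's answer to schemas that exceed the context window.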
  • To continue the conversation with Dosu, mention @dosu.
hgncfbus 2#

Thanks Dosu, but that still doesn't fully answer my question. Consider code like this:

# engine is a mysql connection
sql_database = SQLDatabase(engine)
# llm is a groq llama3-70b model
query_engine = NLSQLTableQueryEngine(sql_database=sql_database, llm=llm) 
response = query_engine.query("How many entries are there in the data?")

How do I see the exact prompt that is being sent to generate the SQL query? With logging set to INFO, the schema gets printed, but it is 50k tokens, so I don't understand how this is being sent to the model provider (groq).

yb3bgrhw 3#

To see the exact prompt sent to generate the SQL query when using NLSQLTableQueryEngine, you can enable verbose logging. This will log the details of the prompt, including the SQL query generated from the natural-language input. For example:
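The code snippet appears to have been lost from this answer; a minimal sketch, assuming `sql_database` and `llm` are already defined as in the question above:

```python
# Sketch: sql_database and llm are defined as in the question above.
query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    llm=llm,
    verbose=True,  # print intermediate steps, including the generated SQL
)
response = query_engine.query("How many entries are there in the data?")
```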

With verbose=True set, NLSQLTableQueryEngine will log the details of the prompt, including the SQL query generated from the natural-language input 1 (https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/indices/struct_store/sql_query.py).

Regarding large schemas: NLSQLTableQueryEngine handles them by letting you specify a subset of tables via the tables parameter. This helps truncate or summarize the schema before it is sent to the LLM. For example:
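The example appears to have been lost from this answer; a minimal sketch, assuming `sql_database` and `llm` exist as above (the table names are hypothetical):

```python
# Only the named tables' schemas are included in the text-to-SQL prompt
# (table names here are hypothetical).
query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    llm=llm,
    tables=["orders", "customers"],
)
```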

That way you can manage the schema size and make sure the prompt fits within the model provider's context window 2 (https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/indices/struct_store/sql_query.py) 1 (https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/indices/struct_store/sql_query.py).

  • To continue the conversation with Dosu, mention @dosu.
kd3sttzy 4#

I couldn't get verbose=True to work, but setting the log level to DEBUG did the trick: it logs the requests sent to the provider API. That revealed that the full schema takes about 5k tokens, so it does fit within the 8k-token window. If I also wanted to reduce the prompt size by using a more concise representation, how would I go about that?

oxcyiej7 5#

To reduce the prompt size with a more concise schema representation, you can focus on the key parts of the schema and drop redundant or less important information. Here is a more concise representation of the schema:

from typing import Any, Dict, List, Optional
from enum import Enum, auto
from hashlib import sha256
import uuid
import json

class BaseComponent:
    @classmethod
    def class_name(cls) -> str:
        return "base_component"

    def to_dict(self, **kwargs: Any) -> Dict[str, Any]:
        data = self.dict(**kwargs)
        data["class_name"] = self.class_name()
        return data

    def to_json(self, **kwargs: Any) -> str:
        return json.dumps(self.to_dict(**kwargs))

    @classmethod
    def from_dict(cls, data: Dict[str, Any], **kwargs: Any) -> "BaseComponent":
        data.pop("class_name", None)
        return cls(**data)

    @classmethod
    def from_json(cls, data_str: str, **kwargs: Any) -> "BaseComponent":
        return cls.from_dict(json.loads(data_str), **kwargs)

class ObjectType(str, Enum):
    TEXT = auto()
    IMAGE = auto()
    INDEX = auto()
    DOCUMENT = auto()

class MetadataMode(str, Enum):
    ALL = "all"
    EMBED = "embed"
    LLM = "llm"
    NONE = "none"

class BaseNode(BaseComponent):
    id_: str = str(uuid.uuid4())
    embedding: Optional[List[float]] = None
    metadata: Dict[str, Any] = {}
    relationships: Dict[str, Any] = {}

    @classmethod
    def get_type(cls) -> str:
        return "base_node"

    def get_content(self, metadata_mode: MetadataMode = MetadataMode.ALL) -> str:
        return self.metadata.get("content", "")

    def get_metadata_str(self, mode: MetadataMode = MetadataMode.ALL) -> str:
        return json.dumps(self.metadata)

    @property
    def hash(self) -> str:
        return sha256((self.get_content() + str(self.metadata)).encode()).hexdigest()

class TextNode(BaseNode):
    text: str = ""

    @classmethod
    def get_type(cls) -> str:
        return ObjectType.TEXT

    def get_content(self, metadata_mode: MetadataMode = MetadataMode.NONE) -> str:
        return self.text

class Document(TextNode):
    id_: str = str(uuid.uuid4())

    @classmethod
    def get_type(cls) -> str:
        return ObjectType.DOCUMENT

    @classmethod
    def example(cls) -> "Document":
        return Document(text="Sample text", metadata={"filename": "README.md", "category": "codebase"})

    @classmethod
    def class_name(cls) -> str:
        return "Document"

This concise representation preserves the core functionality and structure of the schema while omitting less critical details. You can also simplify the schema representation further with prompt templates, as shown in the LlamaIndex examples:

from llama_index.core.prompts.base import PromptTemplate
from llama_index.core.prompts.prompt_type import PromptType

DEFAULT_SUMMARY_PROMPT_TMPL = (
    "Write a summary of the following. Try to use only the "
    "information provided. "
    "Try to include as many key details as possible.\n"
    "\n"
    "\n"
    "{context_str}\n"
    "\n"
    "\n"
    'SUMMARY:"""\n'
)

DEFAULT_SUMMARY_PROMPT = PromptTemplate(
    DEFAULT_SUMMARY_PROMPT_TMPL, prompt_type=PromptType.SUMMARY
)

DEFAULT_INSERT_PROMPT_TMPL = (
    "Context information is below. It is provided in a numbered list "
    "(1 to {num_chunks}), "
    "where each item in the list corresponds to a summary.\n"
    "---------------------\n"
    "{context_list}"
    "---------------------\n"
    "Given the context information, here is a new piece of "
    "information: {new_chunk_text}\n"
    "Answer with the number corresponding to the summary that should be updated. "
    "The answer should be the number corresponding to the "
    "summary that is most relevant to the question.\n"
)
DEFAULT_INSERT_PROMPT = PromptTemplate(
    DEFAULT_INSERT_PROMPT_TMPL, prompt_type=PromptType.TREE_INSERT
)

DEFAULT_QUERY_PROMPT_TMPL = (
    "Some choices are given below. It is provided in a numbered list "
    "(1 to {num_chunks}), "
    "where each item in the list corresponds to a summary.\n"
    "---------------------\n"
    "{context_list}"
    "\n---------------------\n"
    "Using only the choices above and not prior knowledge, return "
    "the choice that is most relevant to the question: '{query_str}'\n"
    "Provide choice in the following format: 'ANSWER: <number>' and explain why "
    "this summary was selected in relation to the question.\n"
)
DEFAULT_QUERY_PROMPT = PromptTemplate(
    DEFAULT_QUERY_PROMPT_TMPL, prompt_type=PromptType.TREE_SELECT
)

DEFAULT_REFINE_PROMPT_TMPL = (
    "The original query is as follows: {query_str}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "We have the opportunity to refine the existing answer "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{context_msg}\n"
    "------------\n"
    "Given the new context, refine the original answer to better "
    "answer the query. "
    "If the context isn't useful, return the original answer.\n"
    "Refined Answer: "
)
DEFAULT_REFINE_PROMPT = PromptTemplate(
    DEFAULT_REFINE_PROMPT_TMPL, prompt_type=PromptType.REFINE
)

DEFAULT_TEXT_QA_PROMPT_TMPL = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
DEFAULT_TEXT_QA_PROMPT = PromptTemplate(
    DEFAULT_TEXT_QA_PROMPT_TMPL, prompt_type=PromptType.QUESTION_ANSWER
)

DEFAULT_TEXT_TO_SQL_PGVECTOR_TMPL = """\
Given an input question, first create a syntactically correct {dialect} \
query to run, then look at the results of the query and return the answer. \
You can order the results by a relevant column to return the most \
interesting examples in the database.

Pay attention to use only the column names that you can see in the schema \
description. Be careful to not query for columns that do not exist. \
Pay attention to which column is in which table. Also, qualify column names \
with the table name when needed.

IMPORTANT NOTE: you can use specialized pgvector syntax (`<->`) to do nearest \
neighbors/semantic search to a given vector from an embeddings column in the table. \
The embeddings value for a given row typically represents the semantic meaning of that row. \
The vector represents an embedding representation \
of the question, given below. Do NOT fill in the vector values directly, but rather specify a \
`[query_vector]` placeholder. For instance, some select statement examples below \
(the name of the embeddings column is `embedding`):
SELECT * FROM items ORDER BY embedding <-> '[query_vector]' LIMIT 5;
SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;
SELECT * FROM items WHERE embedding <-> '[query_vector]' < 5;

You are required to use the following format, \
each taking one line:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use tables listed below.
{schema}

Question: {query_str}
SQLQuery: \
"""

DEFAULT_TEXT_TO_SQL_PGVECTOR_PROMPT = PromptTemplate(
    DEFAULT_TEXT_TO_SQL_PGVECTOR_TMPL,
    prompt_type=PromptType.TEXT_TO_SQL,
)

DEFAULT_TABLE_CONTEXT_TMPL = (
    "We have provided a table schema below. "
    "---------------------\n"
    "{schema}\n"
    "---------------------\n"
    "We have also provided context information below. "
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and the table schema, "
    "give a response to the following task: {query_str}"
)

DEFAULT_TABLE_CONTEXT_PROMPT = PromptTemplate(
    DEFAULT_TABLE_CONTEXT_TMPL, prompt_type=PromptType.TABLE_CONTEXT
)

DEFAULT_REFINE_TABLE_CONTEXT_TMPL = (
    "We have provided a table schema below. "
    "---------------------\n"
    "{schema}\n"
    "---------------------\n"
    "We have also provided some context information below. "
    "{context_msg}\n"
    "---------------------\n"
    "Given the context information and the table schema, "
    "give a response to the following task: {query_str}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "Given the new context, refine the original answer to better "
    "answer the question. "
    "If the context isn't useful, return the original answer."
)
DEFAULT_REFINE_TABLE_CONTEXT_PROMPT = PromptTemplate(
    DEFAULT_REFINE_TABLE_CONTEXT_TMPL, prompt_type=PromptType.TABLE_CONTEXT
)

DEFAULT_KG_TRIPLET_EXTRACT_TMPL = (
    "Some text is provided below. Given the text, extract up to "
    "{max_knowledge_triplets} "
    "knowledge triplets in the form of (subject, predicate, object). Avoid stopwords.\n"
    "---------------------\n"
    "Example:"
    "Text: Alice is Bob's mother."
    "Triplets:\n(Alice, is mother of, Bob)\n"
    "Text: Philz is a coffee shop founded in Berkeley in 1982.\n"
    "Triplets:\n"
    "(Philz, is, coffee shop)\n"
    "(Philz, founded in, Berkeley)\n"
    "(Philz, founded in, 1982)\n"
    "---------------------\n"
    "Text: {text}\n"
    "Triplets:\n"
)
DEFAULT_KG_TRIPLET_EXTRACT_PROMPT = PromptTemplate(
    DEFAULT_KG_TRIPLET_EXTRACT_TMPL,
    prompt_type=PromptType.KNOWLEDGE_TRIPLET_EXTRACT,
)

DEFAULT_HYDE_PROMPT = PromptTemplate(
    "Please write a passage to answer the question\n"
    "Try to include as many key details as possible.\n"
    "\n"
    "\n"
    "{context_str}\n"
    "\n"
    "\n"
    'Passage:"""\n', prompt_type=PromptType.SUMMARY
)

DEFAULT_SIMPLE_INPUT_PROMPT = PromptTemplate(
    "{query_str}", prompt_type=PromptType.SIMPLE_INPUT
)

DEFAULT_JSON_PATH_PROMPT = PromptTemplate(
    "We have provided a JSON schema below:\n"
    "{schema}\n"
    "Given a task, respond with a JSON Path query that "
    "can retrieve data from a JSON value that matches the schema.\n"
    "Provide the JSON Path query in the following format: 'JSONPath: <JSONPath>'\n"
    "You must include the value 'JSONPath:' before the provided JSON Path query."
    "Example Format:\n"
    "Task: What is John's age?\n"
    "Response: JSONPath: $.John.age\n"
    "Let's try this now: \n\n"
    "Task: {query_str}\n"
    "Response: ", prompt_type=PromptType.JSON_PATH
)

DEFAULT_CHOICE_SELECT_PROMPT = PromptTemplate(
    "A list of documents is shown below. Each document has a number next to it along "
    "with a summary of the document. A question is also provided. \n"
    "Respond with the numbers of the documents "
    "you should consult to answer the question, in order of relevance, as well \n"
    "as the relevance score. The relevance score is a number from 1-10 based on "
    "how relevant you think the document is to the question.\n"
    "Do not include any documents that are not relevant to the question. \n"
    "Example format: \n"
    "Document 1:\n<summary of document 1>\n\n"
    "Document 2:\n<summary of document 2>\n\n"
    "...\n\n"
    "Document 10:\n<summary of document 10>\n\n"
    "Question: <question>\n"
    "Answer:\n"
    "Doc: 9, Relevance: 7\n"
    "Doc: 3, Relevance: 4\n"
    "Doc: 7, Relevance: 3\n\n"
    "Let's try this now: \n\n"
    "{context_str}\n"
    "Question: {query_str}\n"
    "Answer:\n", prompt_type=PromptType.CHOICE_SELECT
)

RANKGPT_RERANK_PROMPT = PromptTemplate(
    "Search Query: {query}. \nRank the {num} passages above "
    "based on their relevance to the search query. The passages "
    "should be listed in descending order using identifiers. "
    "The most relevant passages should be listed first. "
    "The output format should be [] > [], e.g., [1] > [2]. "
    "Only response the ranking results, "
    "do not say any word or explain.", prompt_type=PromptType.RANKGPT_RERANK
)

DEFAULT_JSONALYZE_PROMPT = PromptTemplate(
    "You are given a table named: '{table_name}' with schema, "
    "generate SQLite SQL query to answer the given question.\n"
    "Table schema:\n"
    "{table_schema}\n"
    "Question: {question}\n\n"
    "SQLQuery: ", prompt_type=PromptType.TEXT_TO_SQL
)

Using prompt templates helps keep the schema representation concise and organized.
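In the same spirit, a shorter custom text-to-SQL prompt can be passed to the engine via the text_to_sql_prompt parameter shown in the constructor signature earlier in this thread. This is a sketch, not the library's default prompt; it keeps the template variables the engine fills in ({dialect}, {schema}, {query_str}):

```python
from llama_index.core.prompts import PromptTemplate

# A deliberately compact text-to-SQL prompt (a sketch; adjust as needed).
COMPACT_TEXT_TO_SQL_TMPL = (
    "Given an input question, write a syntactically correct {dialect} query.\n"
    "Only use tables and columns listed in the schema below.\n"
    "{schema}\n"
    "Question: {query_str}\n"
    "SQLQuery: "
)
compact_prompt = PromptTemplate(COMPACT_TEXT_TO_SQL_TMPL)

# Pass it to the engine (sql_database and llm assumed to exist):
# query_engine = NLSQLTableQueryEngine(
#     sql_database=sql_database, llm=llm, text_to_sql_prompt=compact_prompt
# )
```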

  • To continue the conversation with Dosu, mention @dosu.
