llama_index [Question]: Multi-document RAG with a Weaviate vector store

flvlnr44 posted 23 days ago in Other

Question Validation

  • I have searched both the documentation and Discord for an answer.

Question

Hello,
How can I do multi-document RAG using Weaviate as the vector store?


dbf7pr2w1#

To perform multi-document retrieval-augmented generation (RAG) with Weaviate as the vector store, follow these steps:

  1. Install the required packages:
%pip install llama-index-readers-github
%pip install llama-index-vector-stores-weaviate
%pip install llama-index-llms-openai
  2. Set up and download the data:
import nest_asyncio
nest_asyncio.apply()

import os
os.environ["GITHUB_TOKEN"] = "ghp_..."
os.environ["OPENAI_API_KEY"] = "sk-..."

from llama_index.readers.github import (
    GitHubRepositoryIssuesReader,
    GitHubIssuesClient,
)

github_client = GitHubIssuesClient()
loader = GitHubRepositoryIssuesReader(
    github_client,
    owner="run-llama",
    repo="llama_index",
    verbose=True,
)

orig_docs = loader.load_data()

# Tag each issue with its issue number (id_) so its chunks can be
# filtered per document later, and cap the dataset at `limit` docs.
limit = 100
docs = []
for idx, doc in enumerate(orig_docs):
    doc.metadata["index_id"] = int(doc.id_)
    if idx >= limit:
        break
    docs.append(doc)
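
As a quick sanity check (not part of the original answer), you can confirm that each document carries the metadata fields the later steps rely on, such as created_at, state, and labels:

# Hypothetical check: these keys are assumed by the IndexNode-building step below.
print(docs[0].metadata.keys())
print(docs[0].metadata["created_at"], docs[0].metadata["state"])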
  3. Set up the vector store and index:
import weaviate

# Note: this snippet uses the v3 `weaviate-client` API (`weaviate.Client`).
auth_config = weaviate.AuthApiKey(api_key="XRa15cDIkYRT7AkrpqT6jLfE4wropK1c1TGk")
client = weaviate.Client(
    "https://llama-index-test-v0oggsoz.weaviate.network",
    auth_client_secret=auth_config,
)

class_name = "LlamaIndex_docs"
client.schema.delete_class(class_name)  # optional: delete schema

from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core import VectorStoreIndex, StorageContext

vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name=class_name
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

doc_index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)
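
Each document's chunks are now stored in Weaviate tagged with their index_id. As an illustration (not part of the original answer), a per-document retriever can be built by filtering on that field; step 4 wraps exactly this pattern into IndexNodes:

from llama_index.core.vector_stores import (
    FilterOperator,
    MetadataFilter,
    MetadataFilters,
)

# Hypothetical check: retrieve only the chunks belonging to the first issue.
filters = MetadataFilters(
    filters=[
        MetadataFilter(
            key="index_id", operator=FilterOperator.EQ, value=int(docs[0].id_)
        )
    ]
)
single_doc_retriever = doc_index.as_retriever(filters=filters)
print(len(single_doc_retriever.retrieve("What is this issue about?")))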
  4. Create IndexNodes for retrieval and filtering:
from llama_index.core import SummaryIndex
from llama_index.core.async_utils import run_jobs
from llama_index.llms.openai import OpenAI
from llama_index.core.schema import IndexNode
from llama_index.core.vector_stores import (
    FilterOperator,
    MetadataFilter,
    MetadataFilters,
)

async def aprocess_doc(doc, include_summary: bool = True):
    metadata = doc.metadata
    date_tokens = metadata["created_at"].split("T")[0].split("-")
    year = int(date_tokens[0])
    month = int(date_tokens[1])
    day = int(date_tokens[2])
    assignee = doc.metadata.get("assignee", "")
    size = ""
    if len(doc.metadata["labels"]) > 0:
        size_arr = [l for l in doc.metadata["labels"] if "size:" in l]
        size = size_arr[0].split(":")[1] if len(size_arr) > 0 else ""
    new_metadata = {
        "state": metadata["state"],
        "year": year,
        "month": month,
        "day": day,
        "assignee": assignee,
        "size": size,
    }

    summary_index = SummaryIndex.from_documents([doc])
    query_str = "Give a one-sentence concise summary of this issue."
    query_engine = summary_index.as_query_engine(llm=OpenAI(model="gpt-3.5-turbo"))
    summary_txt = await query_engine.aquery(query_str)
    summary_txt = str(summary_txt)

    index_id = doc.metadata["index_id"]
    filters = MetadataFilters(
        filters=[
            MetadataFilter(key="index_id", operator=FilterOperator.EQ, value=int(index_id)),
        ]
    )

    index_node = IndexNode(
        text=summary_txt,
        metadata=new_metadata,
        obj=doc_index.as_retriever(filters=filters),
        index_id=doc.id_,
    )

    return index_node

async def aprocess_docs(docs):
    # Build one summarization task per document and run them
    # concurrently with 3 workers.
    tasks = [aprocess_doc(doc) for doc in docs]
    index_nodes = await run_jobs(tasks, show_progress=True, workers=3)
    return index_nodes

index_nodes = await aprocess_docs(docs)
  5. Create the top-level auto-retriever:
auth_config = weaviate.AuthApiKey(api_key="XRa15cDIkYRT7AkrpqT6jLfE4wropK1c1TGk")
client = weaviate.Client(
    "https://llama-index-test-v0oggsoz.weaviate.network",
    auth_client_secret=auth_config,
)

class_name = "LlamaIndex_auto"
client.schema.delete_class(class_name)  # optional: delete schema

vector_store_auto = WeaviateVectorStore(
    weaviate_client=client, index_name=class_name
)
storage_context_auto = StorageContext.from_defaults(vector_store=vector_store_auto)

index = VectorStoreIndex(objects=index_nodes, storage_context=storage_context_auto)
  6. Set up the composable auto-retriever:
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo

vector_store_info = VectorStoreInfo(
    content_info="Github Issues",
    metadata_info=[
        MetadataInfo(name="state", description="Whether the issue is `open` or `closed`", type="string"),
        MetadataInfo(name="year", description="The year issue was created", type="integer"),
        MetadataInfo(name="month", description="The month issue was created", type="integer"),
        MetadataInfo(name="day", description="The day issue was created", type="integer"),
        MetadataInfo(name="assignee", description="The assignee of the ticket", type="string"),
        MetadataInfo(name="size", description="How big the issue is (XS, S, M, L, XL, XXL)", type="string"),
    ],
)

from llama_index.core.retrievers import VectorIndexAutoRetriever

retriever = VectorIndexAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    similarity_top_k=2,
    empty_query_top_k=10,  # top-k to return when only metadata filters are inferred
    verbose=True,
)
  7. Try it out:
from llama_index.core import QueryBundle

nodes = retriever.retrieve(QueryBundle("Tell me about some issues on 01/11"))
print(f"Number of source nodes: {len(nodes)}")
print(nodes[0].node.metadata)

from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")
query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm)

response = query_engine.query("Tell me about some issues on 01/11")
print(str(response))

response = query_engine.query("Tell me about some open issues related to agents")
print(str(response))

By following these steps, you can set up a multi-document RAG system with Weaviate as the vector store. Alternatively, the MultiDocAutoRetrieverPack class from the LlamaIndex library can streamline this process: it indexes the metadata (summary) nodes and the document chunks separately, and enables recursive retrieval for more efficient querying. Sources: [1](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/query_engine/multi_doc_auto_retrieval/multi_doc_auto_retrieval.ipynb), [2](https://github.com/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-multidoc-autoretrieval/llama_index/packs/multidoc_autoretrieval/base.py), [3](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/vector_stores/weaviate.py), [4](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/vector_stores/llama-index-vector-stores-weaviate/llama_index/vector_stores/weaviate/base.py).
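
For reference, a minimal sketch of using that pack is shown below. The constructor arguments (index names, argument order) are assumptions based on the pack's source linked above, not a verified signature; check base.py before relying on it:

from llama_index.core.llama_pack import download_llama_pack

# Download the pack module locally.
MultiDocAutoRetrieverPack = download_llama_pack(
    "MultiDocAutoRetrieverPack", "./multidoc_autoretrieval_pack"
)

# Hypothetical wiring: argument names/order are assumptions -- verify in base.py.
pack = MultiDocAutoRetrieverPack(
    client,             # Weaviate client from step 3
    "LlamaIndex_auto",  # index name for the metadata/summary nodes
    "LlamaIndex_docs",  # index name for the document chunks
    index_nodes,        # summary IndexNodes from step 4
    docs,               # the source documents
    vector_store_info=vector_store_info,
)
response = pack.run("Tell me about some open issues related to agents")
print(str(response))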

  • To continue talking to Dosu, mention @dosu.
