llama_index [Question]: How to load multiple HTML files to get both unstructured text and structured tables from an HTML page?

twh00eeo · asked 3 months ago · in Other
Follow (0) | Answers (7) | Views (90)

Question Validation

  • I have searched both the documentation and Discord for an answer.

Question

Hi everyone,
I looked at this Tesla 10-Q analysis example, and I think it is what I need for retrieving information from a collection of HTML files:
https://docs.llamaindex.ai/en/stable/examples/query_engine/sec_tables/tesla_10q_table/
My files look like this:
https://www.cisco.com/c/en/us/td/docs/iosxr/cisco8000/segment-routing/24xx/configuration/guide/b-segment-routing-cg-cisco8000-24xx/configuring-segment-routing-for-ospf-protocol.html
or
https://www.cisco.com/c/en/us/td/docs/iosxr/cisco8000/segment-routing/24xx/configuration/guide/b-segment-routing-cg-cisco8000-24xx/configuring-segment-routing-over-ipv6-srv6-micro-sids.html
As you can see, these are typical configuration guides containing a mix of tables and text. My question is how to load a batch of such HTML files automatically, perhaps with something like SimpleDirectoryReader?
Could you also suggest a way to parse this kind of HTML file for the best retrieval and query results?
Best

gorkyyrv 1#

To load multiple HTML files and parse them for both unstructured text and structured tables, you can use the HTMLTagReader and UnstructuredElementNodeParser classes from the LlamaIndex framework. Here is how to achieve this:

  1. Load the HTML files: use HTMLTagReader to read the HTML files and extract text from specific tags.
  2. Parse the content: use UnstructuredElementNodeParser to handle mixed text and tables effectively.

Here is a complete example:
from pathlib import Path
from llama_index.readers.file.html.base import HTMLTagReader
from llama_index.core.node_parser.relational.unstructured_element import UnstructuredElementNodeParser
from llama_index.core.schema import Document

def load_html_directory(directory_path: str, tag: str = "section", ignore_no_id: bool = False):
    html_reader = HTMLTagReader(tag=tag, ignore_no_id=ignore_no_id)
    node_parser = UnstructuredElementNodeParser()
    documents = []

    directory = Path(directory_path)
    for html_file in directory.glob("*.html"):
        docs = html_reader.load_data(html_file)
        for doc in docs:
            document = Document(text=doc)
            nodes = node_parser.get_nodes_from_documents([document])
            documents.extend(nodes)

    return documents

# Example usage
directory_path = "/path/to/html/files"
documents = load_html_directory(directory_path, tag="section", ignore_no_id=False)

# Now `documents` contains the parsed content from all HTML files in the directory

This method loads and parses all HTML files in the specified directory, extracting both unstructured text and structured tables for optimal retrieval and query results: [1](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/readers/file/html_reader.py), [2](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/readers/llama-index-readers-file/llama_index/readers/file/html/base.py), [3](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/tests/node_parser/test_unstructured.py), [4](https://github.com/run-llama/llama_index/blob/main/llama-index-core/tests/node_parser/test_unstructured.py).
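
If you then want to make the parsed nodes queryable, a minimal follow-up sketch (assuming an OpenAI API key is configured for the default embedding model and LLM; the query string is only an illustration) is:

from llama_index.core import VectorStoreIndex

# Build a vector index over the nodes returned by load_html_directory
index = VectorStoreIndex(nodes=documents)

# Ask a question against the indexed HTML content
query_engine = index.as_query_engine()
response = query_engine.query("How is segment routing configured for OSPF?")
print(response)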

  • To continue talking to Dosu, mention @dosu.
cigdeys3 2#

I'm getting some errors:

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Cell In[82], line 23
     21 # Example usage
     22 directory_path = "./ html/"
---> 23 documents = load_html_directory(directory_path, tag="section", ignore_no_id=False)

Cell In[82], line 15, in load_html_directory(directory_path, tag, ignore_no_id)
     13 docs = html_reader.load_data(html_file)
     14 for doc in docs:
---> 15     document = Document(text=doc)
     16     nodes = node_parser.get_nodes_from_documents([document])
     17     documents.extend(nodes)

File ~/miniforge3/envs/new2/lib/python3.11/site-packages/pydantic/v1/main.py:341, in BaseModel.__init__(__pydantic_self__, **data)
    339 values, fields_set, validation_error = validate_model(__pydantic_self__.__class__, data)
    340 if validation_error:
--> 341     raise validation_error
    342 try:
    343     object_setattr(__pydantic_self__, '__dict__', values)

ValidationError: 1 validation error for Document
text
  str type expected (type=type_error.str)
velaa5lx 3#

The error occurs because the Document class expects its text attribute to be a string, but it received a Document object. You need to pass the text attribute of the Document object to the Document constructor. Here is the corrected code:

from pathlib import Path
from llama_index.readers.file.html.base import HTMLTagReader
from llama_index.core.node_parser.relational.unstructured_element import UnstructuredElementNodeParser
from llama_index.core.schema import Document

def load_html_directory(directory_path: str, tag: str = "section", ignore_no_id: bool = False):
    html_reader = HTMLTagReader(tag=tag, ignore_no_id=ignore_no_id)
    node_parser = UnstructuredElementNodeParser()
    documents = []

    directory = Path(directory_path)
    for html_file in directory.glob("*.html"):
        docs = html_reader.load_data(html_file)
        for doc in docs:
            document = Document(text=doc.text, metadata=doc.metadata)  # Pass the text and metadata attributes
            nodes = node_parser.get_nodes_from_documents([document])
            documents.extend(nodes)

    return documents

# Example usage
directory_path = "/path/to/html/files"
documents = load_html_directory(directory_path, tag="section", ignore_no_id=False)

# Now `documents` contains the parsed content from all HTML files in the directory

This change ensures that the Document class receives a string for its text attribute, which should resolve the ValidationError: [1](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/discover_llamaindex/document_management/discord_dumps/help_channel_dump_05_25_23.json).
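
Note that HTMLTagReader.load_data already returns Document objects, so a slightly simpler variant of the loop (just a sketch; the version above works equally well) is to pass them straight to the node parser:

    for html_file in directory.glob("*.html"):
        docs = html_reader.load_data(html_file)
        # docs are already Document objects, so no re-wrapping is needed
        nodes = node_parser.get_nodes_from_documents(docs)
        documents.extend(nodes)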

  • To continue talking to Dosu, mention @dosu.
o2gm4chl 4#

OK, this code works very well. Could you give me some code suggestions for the actual RAG part, i.e., the query configuration?


igetnqfo 5#

To set up the query configuration for a Retrieval-Augmented Generation (RAG) system using the LlamaIndex framework, you work through several stages: loading, indexing, storing, querying, and evaluation. Below is a code example demonstrating how to configure the querying stage, which is crucial for a RAG system.

Code example for query configuration in RAG

from llama_index.core import Document, VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.retrievers import VectorIndexRetriever, BaseRetriever
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.response_synthesizers import BaseSynthesizer
from llama_index.core import get_response_synthesizer
from pathlib import Path

# Step 1: Load your data into Documents
loader = SimpleDirectoryReader(input_dir="./data")
documents = loader.load_data()

# Step 2: Create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
    ]
)
nodes = pipeline.run(documents=documents, show_progress=True)

# Step 3: Index your data
embed_model = OpenAIEmbedding()
index = VectorStoreIndex(nodes=nodes, embed_model=embed_model)

# Step 4: Configure the retriever
retriever = VectorIndexRetriever(index=index, similarity_top_k=10)

# Step 5: Define a Custom Query Engine
class RAGQueryEngine(CustomQueryEngine):
    """RAG Query Engine."""

    retriever: BaseRetriever
    response_synthesizer: BaseSynthesizer

    def custom_query(self, query_str: str):
        nodes = self.retriever.retrieve(query_str)
        response_obj = self.response_synthesizer.synthesize(query_str, nodes)
        return response_obj

# Step 6: Configure the Query Engine
synthesizer = get_response_synthesizer(response_mode="compact")
query_engine = RAGQueryEngine(
    retriever=retriever, response_synthesizer=synthesizer
)

# Step 7: Query the index
query = "How tall is the smallest penguin?"
response = query_engine.query(query)

print(response)

Explanation

  1. Load: use SimpleDirectoryReader to load your data into Document objects.
  2. Pipeline: create an ingestion pipeline with transformations such as SentenceSplitter.
  3. Index: use VectorStoreIndex to create an index and add the documents to it.
  4. Query:
  • Retriever: set up a VectorIndexRetriever to fetch relevant context from the index.
  • Custom query engine: define a RAGQueryEngine that performs the retrieval and synthesis steps.
  • Query execution: run the query and get the response.

This code provides a basic setup for querying in a RAG system with the LlamaIndex framework. You can customize each component (retriever, query engine) to your specific needs.
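
For example, if you do not need a custom engine class, the same retriever and synthesizer settings can be expressed through the high-level shortcut (a sketch of an equivalent configuration; the keyword arguments are forwarded to the underlying retriever and response synthesizer):

# Equivalent high-level configuration without a custom query engine class
query_engine = index.as_query_engine(
    similarity_top_k=10,
    response_mode="compact",
)
response = query_engine.query("How tall is the smallest penguin?")
print(response)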

  • To continue talking to Dosu, mention @dosu.
szqfcxe2 6#

How is this different from the earlier example that loads HTML documents? Does this configuration also parse HTML documents with tables and unstructured data, as before?

kyks70gy 7#

The query configuration for a Retrieval-Augmented Generation (RAG) system built with the LlamaIndex framework is not inherently different from the earlier HTML-loading example in its ability to parse HTML documents with tables and unstructured data. Both configurations can handle unstructured documents, including HTML, and can use LlamaParse, LlamaIndex's proprietary document parser, to parse complex documents containing tables and unstructured data.

Supporting evidence from the documentation:

  • "LlamaIndex can ingest unstructured text, PDFs, Notion and Slack documents, and more, and index the data within them."
  • "LlamaIndex offers advanced indexing/retrieval, integrated with LlamaParse, our proprietary document parser."

In addition, the newer documentation provides an example of using the Unstructured library to parse both table and non-table elements out of HTML documents, and then using LlamaIndex's recursive retrieval to index and retrieve those elements on demand. This further confirms that the RAG system configuration can parse HTML documents with tables and unstructured data just like the earlier example.

Supporting evidence from the newer documentation:

  • "We use Unstructured to parse the tables, and use LlamaIndex recursive retrieval to index/retrieve the tables as needed, given the user question."
  • "In these sections, we use Unstructured to parse out the table and non-table elements."
  • "Now that we have extracted the tables and their summaries, we can set up a recursive retriever in LlamaIndex to query these tables."

Therefore, the RAG system configuration parses HTML documents with tables and unstructured data in the same way as the earlier example.
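
To make this concrete, here is a sketch along the lines of the Tesla 10-Q example linked in the question. It assumes docs holds the HTML Document objects loaded earlier, and the query string is only an illustration:

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import UnstructuredElementNodeParser
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

# Parse the mixed text/table HTML content into nodes
node_parser = UnstructuredElementNodeParser()
raw_nodes = node_parser.get_nodes_from_documents(docs)

# Split into base nodes plus a mapping from index nodes to the underlying tables
base_nodes, node_mappings = node_parser.get_base_nodes_and_mappings(raw_nodes)

# Index the base nodes; recursive retrieval follows references into the tables
vector_index = VectorStoreIndex(base_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=2)
recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    node_dict=node_mappings,
)

query_engine = RetrieverQueryEngine.from_args(recursive_retriever)
response = query_engine.query("Which commands enable segment routing for OSPF?")
print(response)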
