llama_index [Question]: Sentence splitter with hyperlinks

Posted by voase2hg 8 months ago in Other


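Question Validation

I have searched both the documentation and Discord for an answer.

Question

I am implementing a sentence splitter for text. If the text contains hyperlinks, running the splitter inside an IngestionPipeline behaves differently from running it outside one, where I call it like this:
file_splitted = splitter.split_text_metadata_aware(text, node.get_metadata_str())
The hyperlinks come out differently while the rest of the text is unchanged, and I am confused by this behavior.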

llama_index

Source: https://github.com/run-llama/llama_index/issues/14805

4 answers


1u4esq0p1#

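Hey, @truski22! 👋 I'm here to help you with any bugs, questions, or contributions while you wait for a human maintainer. Let's squash those bugs together!
To make sure hyperlinks are handled consistently inside and outside the IngestionPipeline when using the splitter.split_text_metadata_aware method, you need to customize the tokenizer or the split function so that it recognizes and preserves hyperlinks.
Here is an example of how to modify the SentenceSplitter class to handle hyperlinks consistently:

Custom tokenizer: modify the tokenizer to treat hyperlinks as single tokens.
Custom split function: ensure the split function does not break hyperlinks.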

import re
from typing import Any, Callable, List, Optional

from llama_index.core.node_parser import SentenceSplitter

# Try a whole URL first, then fall back to words and single punctuation marks.
TOKEN_REGEX = re.compile(r"https?://\S+|www\.\S+|\w+|[^\w\s]")

def custom_tokenizer(text: str) -> List[str]:
    """Custom tokenizer that treats hyperlinks as single tokens."""
    return TOKEN_REGEX.findall(text)

class CustomSentenceSplitter(SentenceSplitter):
    """SentenceSplitter that measures chunk sizes with custom_tokenizer."""

    def __init__(
        self,
        tokenizer: Optional[Callable[[str], List[str]]] = custom_tokenizer,
        **kwargs: Any,
    ) -> None:
        # chunk_size, chunk_overlap, separator and the other options keep the
        # library defaults unless they are passed explicitly.
        super().__init__(tokenizer=tokenizer, **kwargs)

    def _split(self, text: str, chunk_size: int):
        """Hook for hyperlink-aware splitting; defers to the parent for now."""
        # The parent already splits recursively and sizes every piece with
        # self._tokenizer, so a URL counts as one token. Its return value is
        # passed through unchanged because the parent's merge step consumes it.
        return super()._split(text, chunk_size)

# Usage (placeholder inputs)
text = "See [Buscar](https://es.wikipedia.org/wiki/Wikipedia:Portada) for details."
metadata_str = "file_name: example.md"
splitter = CustomSentenceSplitter()
chunks = splitter.split_text_metadata_aware(text, metadata_str)

In this example, the custom_tokenizer function uses a regular expression to recognize hyperlinks and treat each one as a single token. The CustomSentenceSplitter class then uses this tokenizer so that hyperlinks are not broken during splitting. This approach is intended to give consistent hyperlink handling inside and outside the IngestionPipeline [1](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/node_parser/text/token.py) [2](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/node_parser/text/token.py) [3](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/node_parser/text/sentence.py) [4](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/node_parser/text/sentence.py).
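
To make "a hyperlink as a single token" concrete, here is a minimal stdlib-only sketch. The name tokenize_keep_urls and the sample sentence are hypothetical and not part of llama_index; the regular expression is one possible choice, not the library's tokenizer.

```python
import re
from typing import List

# Alternatives are tried left to right: a whole URL first, then a word,
# then a single punctuation character. Whitespace is never matched.
_TOKEN = re.compile(r"https?://\S+|www\.\S+|\w+|[^\w\s]")


def tokenize_keep_urls(text: str) -> List[str]:
    """Return the tokens of `text`, keeping every URL as one token."""
    return _TOKEN.findall(text)


sample = "Read https://example.com/a?b=1 now."
tokens = tokenize_keep_urls(sample)
print(tokens)  # ['Read', 'https://example.com/a?b=1', 'now', '.']
```

With this pattern, punctuation that directly follows a URL (for example the closing parenthesis of a Markdown link) is absorbed into the URL token, which may or may not be acceptable for token counting.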

To continue talking to Dosu, mention @dosu.

8 months ago

ct2axkht2#

Why does the problem not appear when I use the IngestionPipeline? It appears when I move the splitter out of the IngestionPipeline.

8 months ago

wmomyfyw3#

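The hyperlink differences you see outside the IngestionPipeline may be caused by the transformations applied inside the IngestionPipeline. Specifically, the SentenceSplitter and HuggingFaceEmbedding transformations may change the text in ways that affect how hyperlinks are handled.
To get consistent hyperlink handling inside and outside the IngestionPipeline, you need to make sure that the transformations applied inside the IngestionPipeline do not alter hyperlinks. You can do this by modifying a transformation or adding a custom one that preserves hyperlinks.
Here is an example of how to modify the IngestionPipeline to include a custom transformation that preserves hyperlinks: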

from typing import List

from llama_index.core.ingestion.pipeline import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.schema import Document

class HyperlinkPreservingSplitter(SentenceSplitter):
    # Same entry point that the question calls directly on the splitter.
    def split_text_metadata_aware(self, text: str, metadata_str: str) -> List[str]:
        # Custom logic to split text while preserving hyperlinks would go here.
        # This placeholder defers to the stock implementation.
        return super().split_text_metadata_aware(text, metadata_str)

# Example usage
documents = [
    Document(text="This is a test document with a hyperlink: http://example.com", doc_id="1"),
]

pipeline = IngestionPipeline(
    transformations=[
        HyperlinkPreservingSplitter(),
        HuggingFaceEmbedding(model_name="BAAI/bge-m3"),
    ],
    docstore=SimpleDocumentStore(),
)

nodes = pipeline.run(documents=documents)

By using the custom HyperlinkPreservingSplitter, you can make sure hyperlinks are preserved during ingestion, so that the behavior stays consistent inside and outside the IngestionPipeline [1](https://github.com/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-rag-cli-local/llama_index/packs/rag_cli_local/base.py) [2](https://github.com/run-llama/llama_index/blob/main/llama-index-core/tests/ingestion/test_pipeline.py).
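
Whether a given splitter actually preserves hyperlinks can be checked independently of llama_index. The sketch below is stdlib-only; missing_urls is a hypothetical helper, and the two chunk lists are hand-written stand-ins for splitter output, not real results.

```python
import re
from typing import List

# A simple URL pattern that stops at whitespace or a closing parenthesis.
_URL = re.compile(r"https?://[^\s)]+")


def missing_urls(original: str, chunks: List[str]) -> List[str]:
    """Return the URLs in `original` that no chunk contains verbatim."""
    return [
        url
        for url in _URL.findall(original)
        if not any(url in chunk for chunk in chunks)
    ]


original = "This is a test document with a hyperlink: http://example.com"
intact = ["This is a test document", "with a hyperlink: http://example.com"]
broken = ["This is a test document with a hyperlink: http://exam", "ple.com"]

print(missing_urls(original, intact))  # []
print(missing_urls(original, broken))  # ['http://example.com']
```

Running the same check on chunks produced inside and outside the pipeline would show whether either path cuts or rewrites a URL.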

To continue talking to Dosu, mention @dosu.

8 months ago

mu0hgdu04#

This does not solve my problem. For example, the original text is:

[Buscar](https://es.wikipedia.org/wiki/Wikipedia:Portada) •[Crear una cuenta](https://es.wikipedia.org/w/index.php?title=Especial:Crear_una_cuenta&returnto=Copa+Mundial+de+F%C3%BAtbol)•[Acceder](https://es.wikipedia.org/w/index.php?title=Especial:Crear_una_cuenta&returnto=Copa+Mundial+de+F%C3%BAtbol)Herra[mientas](https://es.wikipedia.org/w/index.php?title=Especial:Entrar&returnto=Copa+Mundial+de+F%C3%BAtbol) personales  |

After it goes through the splitter with these two lines:

text = node.byte_content.decode(encoding=node.encoding) if isinstance(node,RawDocument) else node.text

file_splitted = splitter.split_text_metadata_aware(text, node.get_metadata_str())

the text is transformed into this:

[Buscar](https://es.wikipedia.org/wiki/Wikipedia:Portada) •[Crear un](https://es.wikipedia.org/w/index.php?title=Especial:Entrar&returnto=Copa+Mundial+de+F%C3%BAtbol)[a cuenta](https://es.wikipedia.org/w/index.php?title=Especial:Crear_una_cuenta&returnto=Copa+Mundial+de+F%C3%BAtbol)•[Acceder](https://es.wikipedia.org/w/index.php?title=Especial:Entrar&returnto=Copa+Mundial+de+F%C3%BAtbol)Herra[mientas](https://es.wikipedia.org/w/index.php?title=Especial:Entrar&returnto=Copa+Mundial+de+F%C3%BAtbol) personales  |

There are some differences, and I do not understand why they appear. Another example:
Original text:

• [Artículo](https://es.wikipedia.org/wiki/Copa_Mundial_de_F%C3%BAtbol)

After the splitter:

• [Artículo](https://es.wikipedia.org/wiki/Discusi%C3%B3n:Copa_Mundial_de_F%C3%BAtbol)
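
Differences like these are easier to pin down when the Markdown links are extracted from both texts and compared pairwise. The following is a minimal stdlib-only sketch, assuming inline links of the form [label](target); extract_links and changed_links are hypothetical helper names. It only locates the changed links and does not explain where the change comes from.

```python
import re
from typing import List, Tuple

# Inline Markdown link: [label](target), with no whitespace in the target.
_LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")

Link = Tuple[str, str]  # (label, target)


def extract_links(text: str) -> List[Link]:
    """Return every (label, target) pair found in `text`, in order."""
    return _LINK.findall(text)


def changed_links(before: str, after: str) -> List[Tuple[Link, Link]]:
    """Pair links by position and keep the pairs that differ.

    zip() stops at the shorter list, so links added or dropped at the end
    are not reported here; compare the list lengths separately.
    """
    pairs = zip(extract_links(before), extract_links(after))
    return [(old, new) for old, new in pairs if old != new]


before = "• [Artículo](https://es.wikipedia.org/wiki/Copa_Mundial_de_F%C3%BAtbol)"
after = "• [Artículo](https://es.wikipedia.org/wiki/Discusi%C3%B3n:Copa_Mundial_de_F%C3%BAtbol)"

for old, new in changed_links(before, after):
    print(old[1], "->", new[1])
```

Applying the same comparison to the text before it reaches the splitter (for example, right after decoding) and to the splitter output can help tell whether the link targets change during splitting or earlier in loading.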

8 months ago
