unstructured feat/Decouple Partitioning User API and Implementation

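Description

The user API is tightly coupled to the internal implementation of the partitioners. As a result, any change to keyword arguments or function behavior immediately affects the user API. Users have direct access to the implementation functions, as the current documentation shows. This tight coupling limits our ability to refactor or enhance the partitioners without introducing breaking changes.
The proposed change is intended to increase development velocity and, eventually, to make it possible to introduce new user-facing APIs.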

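Proposed solution

Introduce an abstraction layer between the user API and the partitioner implementations. This layer exposes only the necessary functionality and delegates to the appropriate partitioner. The decoupling lets us make more internal changes without affecting the user API. The work needs to happen in stages, with this issue as the first of the following steps:

  1. Move the body of each partitioner's partition function into a private function _partition_{type}. Although this seems trivial, it lets us keep developing the partitioners without affecting a user API that may already be in production use. It also lets us simplify and consolidate functionality across partitioners, so that we can eventually offer users a simpler interface. The email partitioner, for example, would become: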
@process_metadata()
@add_metadata_with_filetype(FileType.EML)
@add_chunking_strategy()
def partition_email(
    filename: Optional[str] = None,
    file: Optional[Union[IO[bytes], SpooledTemporaryFile]] = None,
    text: Optional[str] = None,
    content_source: str = "text/html",
    encoding: Optional[str] = None,
    include_headers: bool = False,
    max_partition: Optional[int] = 1500,
    include_metadata: bool = True,
    metadata_filename: Optional[str] = None,
    metadata_last_modified: Optional[str] = None,
    process_attachments: bool = False,
    attachment_partitioner: Optional[Callable] = None,
    min_partition: Optional[int] = 0,
    chunking_strategy: Optional[str] = None,
    **kwargs,
) -> List[Element]:
    """Wrap _partition_email to separate the user-facing API from the internal implementation."""
    return _partition_email(
        filename=filename,
        file=file,
        text=text,
        content_source=content_source,
        encoding=encoding,
        include_headers=include_headers,
        max_partition=max_partition,
        include_metadata=include_metadata,
        metadata_filename=metadata_filename,
        metadata_last_modified=metadata_last_modified,
        process_attachments=process_attachments,
        attachment_partitioner=attachment_partitioner,
        min_partition=min_partition,
        chunking_strategy=chunking_strategy,
        **kwargs,
    )

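Follow-up stages/issues once stage 1 is complete:
2. Extract file reading into its own function, which passes a stream to the private partition function.
3. Move metadata handling out of decorators and into function calls made before/after partitioning. This lets us support preprocessing steps that handle user input (e.g. language) and postprocessing steps that merge processed element data (e.g. hierarchy, chunking) in a decoupled interface that is easier to unit test.
4. Move chunking out of a decorator and into a function call made after partitioning.
5. Write a new user-facing API for accessing the simpler, decoupled components.
All of this can be done incrementally, one partitioner at a time. The end goal is to reduce the complexity of the keyword arguments in the user-facing API and to reduce coupling among downstream components so that feature development can continue. This applies in particular to document preprocessing/postprocessing: metadata (such as language and hierarchy), chunking, and document-level metadata (such as encoding and content source).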

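Alternatives

Do nothing: leave things as they are, keep working with a tightly coupled system, and accept the maintenance and extensibility costs.

Additional context

Unstructured already has an abstraction layer of sorts: compared with the auto partitioner, the document-specific partitioners can improve speed, reduce dependencies, and provide additional functionality.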

neskvpey1#

A couple of very nice concrete benefits of this that I see:

  1. Ease unit-testing. Decorators complicate unit testing because the "inner" function cannot be isolated from the decorator function. So the only alternative is to test the entire "glued-together" composition. Having a facade API function allows the core partitioner to be tested in isolation from the decorated API function.
  2. Allow flexible composition of implementation from distinct units. An API function naturally tends to become "fat" with keyword options as functionality is added. This is good for the user (if designed carefully), providing them a lot of flexibility. However, not all options are necessarily used by all parts of the implementation. We can see this already where some kwargs are used by decorators only. A distinct API function allows the implementation to be flexibly composed and recomposed from a growing set of distinct "step" or "aspect" implementor functions, each taking only the arguments it requires. This improves our ability to respond to newly added functionality without struggling with a rigid implementation.
  3. Provide code-tree shape flexibility. As partitioner function modules begin to approach 1000 lines, it becomes time to think about transitioning each into a subpackage (directory). The distinct and smallish API function can be placed in the __init__.py module and the supporting implementation distributed to submodules in that directory.
    I think other benefits are going to occur to me, but that's a start :)
