Is there any way to read a .docx file, including auto-numbering, using python-docx?

Posted by y3bcpkx1 on 2022-12-21 in Python


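Problem statement: extract the sections from a .docx file, including the auto-numbering.
I tried python-docx to extract the text from the .docx file, but the result does not include the auto-numbering.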

from docx import Document

document = Document("wadali.docx")

# Style-name prefixes of the paragraphs to extract
STYLE_PREFIXES = ('Agt', 'TOC', 'Heading', 'Title', 'Table Normal', 'List')

def iter_items(paragraphs):
    for paragraph in paragraphs:
        # str.startswith accepts a tuple, so each paragraph is yielded at most once
        if paragraph.style.name.startswith(STYLE_PREFIXES):
            yield paragraph

for item in iter_items(document.paragraphs):
    print(item.text)


Source: https://stackoverflow.com/questions/52094242/is-there-any-way-to-read-docx-file-include-auto-numbering-using-python-docx

2 Answers


jecbmhm31#

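It seems that python-docx v0.8 does not fully support numbering at the moment, so you need to do some hacking.
First, to iterate over all the paragraphs in the document, you need to write your own iterator. Here is a function that does it: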

import docx.document
import docx.oxml.table
import docx.oxml.text.paragraph
import docx.table
import docx.text.paragraph

def iter_paragraphs(parent, recursive=True):
    """
    Yield each paragraph within *parent*, in document order, descending into
    tables when *recursive* is true. Each returned value is an instance of
    Paragraph. *parent* would most commonly be a reference to a main Document
    object, but also works for a _Cell object, which itself can contain
    paragraphs and tables.
    """
    if isinstance(parent, docx.document.Document):
        parent_elm = parent.element.body
    elif isinstance(parent, docx.table._Cell):
        parent_elm = parent._tc
    else:
        raise TypeError(repr(type(parent)))

    for child in parent_elm.iterchildren():
        if isinstance(child, docx.oxml.text.paragraph.CT_P):
            yield docx.text.paragraph.Paragraph(child, parent)
        elif isinstance(child, docx.oxml.table.CT_Tbl):
            if recursive:
                table = docx.table.Table(child, parent)
                for row in table.rows:
                    for cell in row.cells:
                        for child_paragraph in iter_paragraphs(cell):
                            yield child_paragraph

You can use it to find all the paragraphs in the document, including those inside table cells.
For example:

import docx

document = docx.Document("sample.docx")
for paragraph in iter_paragraphs(document):
    print(paragraph.text)

To access the numbering properties, you need to look into the "protected" member paragraph._p.pPr.numPr (a docx.oxml.numbering.CT_NumPr object):

for paragraph in iter_paragraphs(document):
    p_pr = paragraph._p.pPr  # None when the paragraph has no properties element
    num_pr = p_pr.numPr if p_pr is not None else None
    if num_pr is not None:
        print(num_pr)  # type: docx.oxml.numbering.CT_NumPr
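
python-docx only gives you the numbering reference, not the rendered label, so the counters have to be rebuilt by hand. The sketch below is one way to do that, under loud assumptions: every level is treated as decimal, starting at 1 and joined with dots, and the real format, start value and level text stored in numbering.xml are ignored. The helper names numbering_key and decimal_labels are hypothetical, not part of python-docx. Numbering inherited from a paragraph style is not detected either.

```python
from collections import defaultdict


def numbering_key(paragraph):
    """Return (numId, ilvl) for a python-docx Paragraph, or None if it
    carries no direct numbering reference."""
    p_pr = paragraph._p.pPr  # None when the paragraph has no properties
    if p_pr is None or p_pr.numPr is None:
        return None
    num_pr = p_pr.numPr
    # numId 0 is the Word convention for "numbering switched off"
    if num_pr.numId is None or num_pr.numId.val == 0:
        return None
    ilvl = num_pr.ilvl.val if num_pr.ilvl is not None else 0
    return num_pr.numId.val, ilvl


def decimal_labels(keys):
    """Yield a label such as "1.2" for each (numId, ilvl) key, "" for None.

    Assumption: all levels are decimal, start at 1 and are joined by dots.
    """
    counters = defaultdict(list)  # numId -> running counter per level
    for key in keys:
        if key is None:
            yield ""
            continue
        num_id, ilvl = key
        levels = counters[num_id]
        # Levels that were skipped are shown with the assumed start value 1
        while len(levels) < ilvl:
            levels.append(1)
        if len(levels) == ilvl:
            levels.append(0)
        levels[ilvl] += 1
        # Deeper levels restart whenever a shallower level advances
        del levels[ilvl + 1:]
        yield ".".join(str(n) for n in levels)
```

With the iterator from above, something like `keys = [numbering_key(p) for p in iter_paragraphs(document)]` followed by `zip(decimal_labels(keys), ...)` pairs each paragraph with its reconstructed label.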

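Note that the numbering definitions this object refers to are stored in the numbering.xml file (inside the docx), if that file exists.
To access it, you need to read the docx file as a package. For example: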

import docx.package
import docx.parts.document
import docx.parts.numbering

package = docx.package.Package.open("sample.docx")

main_document_part = package.main_document_part
assert isinstance(main_document_part, docx.parts.document.DocumentPart)

numbering_part = main_document_part.numbering_part
assert isinstance(numbering_part, docx.parts.numbering.NumberingPart)

ct_numbering = numbering_part._element
print(ct_numbering)  # CT_Numbering
for num in ct_numbering.num_lst:
    print(num)  # CT_Num
    print(num.abstractNumId)  # CT_DecimalNumber

More information is available in the Office Open XML documentation.


6l7fqoea2#

There is a package, docx2python, which does this in a much simpler way: pypi.org/project/docx2python/
The following code:

from docx2python import docx2python
document = docx2python("C:/input/MyDoc.docx")
print(document.body)

produces a list with the document contents, including bulleted lists, in a form that is easy to parse.
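
Since the result is a nested list, a small recursive helper can turn it into a flat stream of lines. This is a sketch that assumes only that the leaves are strings; it works for any nesting depth, and iter_text is a hypothetical name, not part of docx2python.

```python
def iter_text(nested):
    """Yield every non-blank string found in an arbitrarily nested list."""
    for item in nested:
        if isinstance(item, str):
            if item.strip():  # skip empty paragraphs
                yield item
        else:
            yield from iter_text(item)
```

For example, `for line in iter_text(document.body): print(line)` would print the paragraphs one per line.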
