nltk CategorizedMarkdownCorpusReader.sections() does not return the last section of a markdown file

Posted by vbopmzt1 8 months ago in Go

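If you load a markdown corpus with CategorizedMarkdownCorpusReader, the resulting object has a sections() method that returns a list of MarkdownSection objects. For every document I have tested, this method does not return the last section.
Here is a sample markdown file named test.md: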

# Section One
This is a test section. The heading level (number of #s) does not impact this bug
# Section Two
This is a test section. The heading level (number of #s) does not impact this bug

Here is sample code (it needs a real directory containing the markdown file above). It returns only the first section.

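from nltk.corpus.reader.markdown import CategorizedMarkdownCorpusReader

directory = "/some/path/here"  # must contain test.md
# \w+ so the file pattern covers multi-character names such as test.md
reader = CategorizedMarkdownCorpusReader(directory, r"\w+\.md")
sections = reader.sections("test.md")
# Only "Section One" is printed; "Section Two" is missing
print(*[s.heading for s in sections], sep="\n")
# Number of sections returned (the file has two)
print(len(sections))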
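
For comparison, here is a minimal sketch of the behavior I would expect, using only the standard library. It is not NLTK's implementation: `split_sections` and `HEADING_RE` are hypothetical names made up for this illustration, and the sketch only recognizes `#`-style headings (it ignores underlined headings and does not skip `#` lines inside code blocks). The point it shows is that a section is normally stored when the next heading is seen, so the final section has to be stored explicitly after the loop ends.

```python
import re

# "#"-style heading: up to 3 leading spaces, 1 to 6 '#', then whitespace or end of line.
HEADING_RE = re.compile(r"^ {0,3}(#{1,6})(?:[ \t]+(.*?))?[ \t]*$")


def split_sections(text):
    """Split markdown text into (level, heading, body) tuples, one per heading.

    Hypothetical helper for illustration only; not part of NLTK.
    """
    sections = []
    current = None  # [level, heading, body_lines] of the section being collected
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # A new heading closes the section collected so far.
            if current is not None:
                sections.append(current)
            current = [len(match.group(1)), (match.group(2) or "").strip(), []]
        elif current is not None:
            current[2].append(line)
    # Store the final section too: no later heading exists to close it.
    if current is not None:
        sections.append(current)
    return [(level, heading, "\n".join(body).strip())
            for level, heading, body in sections]


SAMPLE = (
    "# Section One\n"
    "This is a test section.\n"
    "# Section Two\n"
    "This is a test section.\n"
)

for level, heading, body in split_sections(SAMPLE):
    print(level, heading)
```

With the two-section sample above, both headings are listed, which is the result I expected from sections() as well.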