C++: PDF parsing --> extract text --> podofo-0.10.3

Posted by o8x7eapl on 2024-01-09 in Other

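I have successfully compiled PoDoFo 0.10.3 in Visual Studio 2022. Now I want to use the library to extract text from PDF documents, but I am struggling with the API, and I cannot find any example that shows how to do it.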

void parseOneFile(const string_view& filename)
{
    PdfMemDocument document;
    document.Load(filename);
    
    // iterate over all pages of the whole pdf document
    for (int pn = 0; pn < document.GetPageCount(); ++pn) 
    {
        PoDoFo::PdfPage* page = document.GetPage(pn);
        // TODO: extract the text from the page
    }
}

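Unfortunately, the code above does not compile: class PoDoFo::PdfMemDocument has no member GetPageCount.
Does anyone know how to do this? I just want to extract the text and store it in a container such as std::vector<std::string> for further processing.
Thank you.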

c++

Source: https://stackoverflow.com/questions/77764777/c-pdf-parsing-extract-text-podofo-0-10-3

1 Answer

z2acfund1#

After reading through the API, I was able to write the following lines of code:

PdfMemDocument document;
document.Load(filename);
PoDoFo::PdfPageCollection& pagetree = document.GetPages();
for (int pn = 0; pn < pagetree.GetCount(); ++pn)
{
    PdfPage& curPdfPage = pagetree.GetPageAt(pn);
    
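    PdfContents* pdfContent = curPdfPage.GetContents();
    if (pdfContent == nullptr)
        continue; // a page without contents has nothing to inspect
    PdfObject& oneObject = pdfContent->GetObject(); // reference, to avoid copying the object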
    if (oneObject.IsArray())
    {
        PdfArray& array = oneObject.GetArray();
        for (auto& element : array)
        {
            std::cout << element.ToString() << std::endl;
        }
    }
    else if (oneObject.HasStream())
    {
        PdfObjectStream* stream = oneObject.GetStream();
    }
    else if (oneObject.IsDictionary())
    {
        PdfDictionary& dict = oneObject.GetDictionary();
    }
}

But I am not sure whether I am on the right track. I still do not have the data/text (as std::string).
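
One possible direction, offered as a minimal sketch rather than a confirmed solution: PoDoFo 0.10 appears to provide PdfPage::ExtractTextTo, which fills a std::vector<PdfTextEntry> with decoded text chunks (the library's own podofotxtextract tool seems to use it this way). The names extractPageTexts and joinChunks below are hypothetical helpers, not part of PoDoFo, and the exact ExtractTextTo signature and the Text member should be checked against the headers of the installed version.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of PoDoFo): join the text chunks of one
// page into a single string, one chunk per line.
std::string joinChunks(const std::vector<std::string>& chunks)
{
    std::string joined;
    for (std::size_t i = 0; i < chunks.size(); ++i)
    {
        if (i != 0)
            joined += '\n';
        joined += chunks[i];
    }
    return joined;
}

// The guard only keeps this file compilable where PoDoFo is not installed.
#if __has_include(<podofo/podofo.h>)
#include <podofo/podofo.h>

// Hypothetical helper: returns one string per page of the PDF at `filename`.
std::vector<std::string> extractPageTexts(const std::string_view& filename)
{
    PoDoFo::PdfMemDocument document;
    document.Load(filename);

    std::vector<std::string> pageTexts;
    PoDoFo::PdfPageCollection& pages = document.GetPages();
    for (unsigned pn = 0; pn < pages.GetCount(); ++pn)
    {
        PoDoFo::PdfPage& page = pages.GetPageAt(pn);

        // Assumption: ExtractTextTo appends the page's decoded text chunks
        // to `entries`, each exposing its string in the `Text` member.
        std::vector<PoDoFo::PdfTextEntry> entries;
        page.ExtractTextTo(entries);

        std::vector<std::string> chunks;
        for (const PoDoFo::PdfTextEntry& entry : entries)
            chunks.push_back(entry.Text);
        pageTexts.push_back(joinChunks(chunks));
    }
    return pageTexts;
}
#endif
```

Calling extractPageTexts(filename) would then give the std::vector<std::string> asked for in the question, with one element per page. Storing each chunk as its own element instead of joining them per page is an equally valid choice, depending on the further processing.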

Answered on 2024-01-09
