Java: iterative downloads using HtmlUnit

emeijp43, posted 2023-01-11 in Java

My problem:

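I am trying to use Java/HtmlUnit to download all of the files located here (https://mft.rrc.texas.gov/link/1bf41875-3edd-4660-8ec5-b4cd15880563). (Each "download" is returned as a .zip containing the selected files.)
I have been able to download multiple files without a problem by checking the individual boxes on the left and then clicking "Download" at the bottom of the page.
However, I cannot download the files from the first page *and then* click the "Next Page" button at the bottom, which I need to do in order to download the files that are not shown on the initial page.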

What I have tried:

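I have tried several different approaches, and none of them can download anything from the second page. Any time I download a file, the next navigation step takes me back to the first page.
In other words, I can step through any number of pages one after another, but if I attempt a download at any point, the next page I click through to is the same as the first page.
Some sample code follows.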

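**Note(1):** I am calling navigateTable(...), which in turn calls downloadAll(...). I have also tried doing this in a single block of code, with the same result.
**Note(2):** "Page" may not be the most appropriate term, because the "Next" anchor only loads new content into the web table at the same URL.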

Code:

public static synchronized void downloadAll(WebClient wc, HtmlPage page, String downloadpath, String downloadname) throws Exception {

        HtmlPage queue = (HtmlPage)wc.getCurrentWindow().getEnclosedPage();

        wc.waitForBackgroundJavaScript(10000);
        
        //find and click checkbox indicating download all files on page
        HtmlCheckBoxInput queuebox = (HtmlCheckBoxInput)queue.getByXPath("//*[@id=\"fileTable:j_id_1s\"]/div/div[1]/input").get(0);
        queue = (HtmlPage)queuebox.setChecked(true);
        wc.waitForBackgroundJavaScript(10000);
        
        //print element showing how many rows are selected
        String numselected = queue.getElementById("totalRows").asNormalizedText();
        System.out.println(numselected);
        
        //find and click download button (should download all rows checked as a single zip file)
        System.out.println("Downloading file...");
        HtmlButton downloadbutton = (HtmlButton)queue.getHtmlElementById("j_id_3c:j_id_3c");
        System.out.println(downloadbutton.asNormalizedText());
        System.out.println("Clicking Download...");
        downloadbutton.click();
        System.out.println("Waiting for js...");
        wc.waitForBackgroundJavaScript(10000);
        Page downloadpage = wc.getCurrentWindow().getEnclosedPage(); 
        
        //save download to disk
        File destfile = new File(downloadpath, downloadname);                                                            
        try (InputStream contentAsStream = downloadpage.getWebResponse().getContentAsStream()) {                   
            try (OutputStream out = new FileOutputStream(destfile)) {                                              
                IOUtils.copy(contentAsStream, out);                                                                
            }                                                                                                      
        }
        System.out.println("File downloaded to:" + destfile.getAbsolutePath());
    }
    
    
    public static synchronized void navigateTable(String url, int npages, String downloadpath, String downloadname, String downloadext) throws Exception {
        //suppress logs
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
        
        //create webclient
        WebClient wc = new WebClient(); //BrowserVersion.CHROME
        wc.getOptions().setJavaScriptEnabled(true);
        wc.getOptions().setCssEnabled(true);
        wc.setAjaxController(new NicelyResynchronizingAjaxController());
        wc.getCookieManager().setCookiesEnabled(true);
        
        //get initial page
        HtmlPage queue = wc.getPage(url);
        
        //do n times
        for (int i = 1; i < npages+1; i++) {
            System.out.println(String.format("Current Iteration: %d", i));
            
            //print 'showing' status from page
            HtmlElement showing = (HtmlElement)queue.getByXPath("//*[@id=\"fileTable_paginator_bottom\"]/span[1]").get(0);
            System.out.println(showing.asNormalizedText());
            
            //call download
            downloadAll(wc, queue, downloadpath, String.format("%s_%d.%s", downloadname, i, downloadext));
            
            //get 'next page' button and click it, resetting current page to be next page
            System.out.println("Navigating to next queue page...");
            HtmlAnchor nextpageanchor = (HtmlAnchor)queue.getByXPath("//*[@id=\"fileTable_paginator_bottom\"]/a[3]").get(0);
            queue = (HtmlPage)nextpageanchor.click();
            
            //wait a sec for page to load
            wc.waitForBackgroundJavaScript(5000);
            System.out.println("Arrived at next queue page...");
        }
        
        //close webclient
        wc.close();
    }

Console output:

Current Iteration: 1
Showing 1 - 250 of 15544
250 Rows Selected
Downloading file...
Download
Clicking Download...
Waiting for js...
File downloaded to:C:\Drilling Permits Pending Approval_2023-01-04_204350_1.zip
Navigating to next queue page...
Arrived at next queue page...
Current Iteration: 2
Showing 1 - 250 of 15544
250 Rows Selected
Downloading file...
Download
Clicking Download...
Waiting for js...
File downloaded to:C:\Drilling Permits Pending Approval_2023-01-04_204350_2.zip
Navigating to next queue page...
Arrived at next queue page...
Current Iteration: 3
Showing 1 - 250 of 15544
250 Rows Selected
Downloading file...
Download
Clicking Download...
Waiting for js...
File downloaded to:C:\Drilling Permits Pending Approval_2023-01-04_204350_3.zip
Navigating to next queue page...
Arrived at next queue page...

Console output with the downloadAll(...) line commented out:

Current Iteration: 1
Showing 1 - 250 of 15544
Navigating to next queue page...
Arrived at next queue page...
Current Iteration: 2
Showing 251 - 500 of 15544
Navigating to next queue page...
Arrived at next queue page...
Current Iteration: 3
Showing 501 - 750 of 15544
Navigating to next queue page...
Arrived at next queue page...

Final thoughts:

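Again, my problem is not the navigation, nor the file download specifically, but how to do both together.
Note that without the download part, the table updates as expected (as shown by the Showing ### - ### of 15544 lines in the console output).
What is going on here?
Thank you very much for your help.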

Answer #1 (xzlaal3s):

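Just a guess...
HtmlUnit handles downloads a bit differently from what you might expect from a browser. In your case, implementing your own AttachmentHandler (https://htmlunit.sourceforge.io/filedownload-howto.html) might help.
I think you have to implement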

boolean handleAttachment(final WebResponse response)

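in such a way that the method saves the attachment and returns true. This prevents the page in the current window from being replaced by the downloaded content. In the end, you still have the table in your window, and further navigation should work.
If that does not help, please open an issue on GitHub and I will try to take a closer look.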
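
The "saves the attachment" part of such a handler can be sketched with the standard library alone. This is only a minimal sketch under assumptions: the class name AttachmentSaver and the method save are hypothetical, and the wiring into HtmlUnit is not shown. The assumption is that, inside handleAttachment(final WebResponse response), one would pass response.getContentAsStream() to a helper like this and then return true; which handleAttachment signature is available may depend on the HtmlUnit version in use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: not part of HtmlUnit, only the "save to disk" step.
public class AttachmentSaver {

    /**
     * Copies the whole stream to targetDir/fileName, overwriting any
     * existing file, and returns the path that was written.
     * The stream is closed when copying finishes.
     */
    public static Path save(InputStream content, Path targetDir, String fileName)
            throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(fileName);
        try (InputStream in = content) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a downloaded response body (assumption: in real use
        // this would be the stream obtained from the WebResponse).
        byte[] fakeZip = {0x50, 0x4B, 0x03, 0x04};
        Path dir = Files.createTempDirectory("attachments");
        Path saved = save(new ByteArrayInputStream(fakeZip), dir, "page_1.zip");
        System.out.println(saved.getFileName() + ": " + Files.size(saved) + " bytes");
    }
}
```

Keeping the file-writing code separate from the handler means the per-page file name (for example the String.format("%s_%d.%s", ...) pattern from the question) can be chosen by the caller, while the window that holds the table is left untouched.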
