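My problem:
I am trying to use Java/HtmlUnit to download all of the files located here (https://mft.rrc.texas.gov/link/1bf41875-3edd-4660-8ec5-b4cd15880563). (Each "download" comes back as a .zip containing the selected files.)
I have been able to download multiple files without any problem by checking the individual boxes on the left and then clicking "Download" at the bottom of the page.
However, I cannot download files from the first page *and then* click the "Next" button at the bottom, so that I can download the files that are not shown on the initial page.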
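What I have tried:
I have tried several different approaches, and none of them lets me download anything from the second page. Whenever I download a file, the next navigation step takes me back to the first page.
In other words, I can step through any number of pages one after another, but if I attempt a download at any point, the next page I click through to is identical to the first page.
Here is some sample code.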
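**Note(1):** I am calling navigateTable(...), which in turn calls downloadAll(...). I have also tried doing this in a single block of code, with the same result.
**Note(2):** "Page" may not be the most appropriate term, because the "Next" anchor only loads new content into a web table at the same URL.
Code: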
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.io.IOUtils;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlButton;
import com.gargoylesoftware.htmlunit.html.HtmlCheckBoxInput;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public static synchronized void downloadAll(WebClient wc, HtmlPage page, String downloadpath, String downloadname) throws Exception {
    // work on whatever page is currently enclosed by the window
    HtmlPage queue = (HtmlPage) wc.getCurrentWindow().getEnclosedPage();
    wc.waitForBackgroundJavaScript(10000);
    // find and click checkbox indicating download all files on page
    HtmlCheckBoxInput queuebox = (HtmlCheckBoxInput) queue.getByXPath("//*[@id=\"fileTable:j_id_1s\"]/div/div[1]/input").get(0);
    queue = (HtmlPage) queuebox.setChecked(true);
    wc.waitForBackgroundJavaScript(10000);
    // print element showing how many rows are selected
    String numselected = queue.getElementById("totalRows").asNormalizedText();
    System.out.println(numselected);
    // find and click download button (should download all rows checked as a single zip file)
    System.out.println("Downloading file...");
    HtmlButton downloadbutton = (HtmlButton) queue.getHtmlElementById("j_id_3c:j_id_3c");
    System.out.println(downloadbutton.asNormalizedText());
    System.out.println("Clicking Download...");
    downloadbutton.click();
    System.out.println("Waiting for js...");
    wc.waitForBackgroundJavaScript(10000);
    // after the click, the window's enclosed page is the download response
    Page downloadpage = wc.getCurrentWindow().getEnclosedPage();
    // save download to disk
    File destfile = new File(downloadpath, downloadname);
    try (InputStream contentAsStream = downloadpage.getWebResponse().getContentAsStream()) {
        try (OutputStream out = new FileOutputStream(destfile)) {
            IOUtils.copy(contentAsStream, out);
        }
    }
    System.out.println("File downloaded to:" + destfile.getAbsolutePath());
}
public static synchronized void navigateTable(String url, int npages, String downloadpath, String downloadname, String downloadext) throws Exception {
    // suppress logs
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
    // create webclient
    WebClient wc = new WebClient(); // BrowserVersion.CHROME
    wc.getOptions().setJavaScriptEnabled(true);
    wc.getOptions().setCssEnabled(true);
    wc.setAjaxController(new NicelyResynchronizingAjaxController());
    wc.getCookieManager().setCookiesEnabled(true);
    // get initial page
    HtmlPage queue = wc.getPage(url);
    // do n times
    for (int i = 1; i < npages + 1; i++) {
        System.out.println(String.format("Current Iteration: %d", i));
        // print 'showing' status from page
        HtmlElement showing = (HtmlElement) queue.getByXPath("//*[@id=\"fileTable_paginator_bottom\"]/span[1]").get(0);
        System.out.println(showing.asNormalizedText());
        // call download
        downloadAll(wc, queue, downloadpath, String.format("%s_%d.%s", downloadname, i, downloadext));
        // get 'next page' button and click it, resetting current page to be next page
        System.out.println("Navigating to next queue page...");
        HtmlAnchor nextpageanchor = (HtmlAnchor) queue.getByXPath("//*[@id=\"fileTable_paginator_bottom\"]/a[3]").get(0);
        queue = (HtmlPage) nextpageanchor.click();
        // wait a sec for page to load
        wc.waitForBackgroundJavaScript(5000);
        System.out.println("Arrived at next queue page...");
    }
    // close webclient
    wc.close();
}
Console output:
Current Iteration: 1
Showing 1 - 250 of 15544
250 Rows Selected
Downloading file...
Download
Clicking Download...
Waiting for js...
File downloaded to:C:\Drilling Permits Pending Approval_2023-01-04_204350_1.zip
Navigating to next queue page...
Arrived at next queue page...
Current Iteration: 2
Showing 1 - 250 of 15544
250 Rows Selected
Downloading file...
Download
Clicking Download...
Waiting for js...
File downloaded to:C:\Drilling Permits Pending Approval_2023-01-04_204350_2.zip
Navigating to next queue page...
Arrived at next queue page...
Current Iteration: 3
Showing 1 - 250 of 15544
250 Rows Selected
Downloading file...
Download
Clicking Download...
Waiting for js...
File downloaded to:C:\Drilling Permits Pending Approval_2023-01-04_204350_3.zip
Navigating to next queue page...
Arrived at next queue page...
Console output with the downloadAll(...) line commented out:
Current Iteration: 1
Showing 1 - 250 of 15544
Navigating to next queue page...
Arrived at next queue page...
Current Iteration: 2
Showing 251 - 500 of 15544
Navigating to next queue page...
Arrived at next queue page...
Current Iteration: 3
Showing 501 - 750 of 15544
Navigating to next queue page...
Arrived at next queue page...
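The "Showing" lines in the two logs above can also be compared programmatically, to confirm whether a click on "Next" actually advanced the table. The following is a minimal sketch using only the JDK. `PagerStatus`, `parse`, `follows`, and `pageCount` are hypothetical names introduced for illustration, and the regular expression assumes the status text always has the form seen in the logs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses paginator status text such as "Showing 251 - 500 of 15544". */
final class PagerStatus {
    // Assumption: the paginator text always matches this shape.
    private static final Pattern STATUS =
            Pattern.compile("Showing\\s+(\\d+)\\s*-\\s*(\\d+)\\s+of\\s+(\\d+)");

    final int first;
    final int last;
    final int total;

    private PagerStatus(int first, int last, int total) {
        this.first = first;
        this.last = last;
        this.total = total;
    }

    static PagerStatus parse(String text) {
        Matcher m = STATUS.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected paginator text: " + text);
        }
        return new PagerStatus(
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)));
    }

    /** True if this status starts on the row right after the previous one ended. */
    boolean follows(PagerStatus previous) {
        return first == previous.last + 1;
    }

    /** Page count at the current page size; only meaningful when read from a full page. */
    int pageCount() {
        int pageSize = last - first + 1;
        return (total + pageSize - 1) / pageSize; // round up for a partial last page
    }

    public static void main(String[] args) {
        PagerStatus p1 = parse("Showing 1 - 250 of 15544");
        PagerStatus p2 = parse("Showing 251 - 500 of 15544");
        System.out.println(p2.follows(p1)); // true: the table advanced
        System.out.println(p1.follows(p1)); // false: same rows again, as in the first log
        System.out.println(p1.pageCount()); // 63
    }
}
```

In the loop from navigateTable(...), the text returned by showing.asNormalizedText() could be passed to `parse` on each iteration, and the run stopped as soon as `follows` returns false.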
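Final thoughts:
Again, my problem is not the navigation, and it is not the file download as such. It is how to do both together.
Note that without the download part, the table updates as expected (as shown by the "Showing ### - ### of 15544" lines in the console output).
What is going on here?
Thank you very much for your help.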
1 Answer
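Just a guess...
HtmlUnit handles downloads somewhat differently from what you would expect of a browser. In your case, implementing your own AttachmentHandler (https://htmlunit.sourceforge.io/filedownload-howto.html) might help.
I think you have to implement the handler method in such a way that it saves the attachment and returns true. This avoids the page in the current window being replaced by the downloaded content. In the end, you still have the table in your window, and further navigation should work.
If not, please open an issue on GitHub and I will try to take a closer look.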
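The save step that such a handler would perform can be sketched using only the JDK. `AttachmentSaver` and `save` are hypothetical names and not part of HtmlUnit. The assumption here is that the handler receives the download's WebResponse and passes `response.getContentAsStream()` (the same call the question's code already uses) to this helper before returning true. How the handler is registered on the WebClient is described in the how-to linked above and is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: writes a downloaded stream to disk without touching the browser window. */
final class AttachmentSaver {

    /**
     * Copies the whole stream to the target file, overwriting an existing file,
     * closes the stream, and returns the number of bytes written.
     */
    static long save(InputStream content, Path target) {
        try (InputStream in = content) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not save attachment to " + target, e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real download: three bytes instead of a zip body.
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "attachment-demo.zip");
        long written = save(new ByteArrayInputStream(new byte[] {1, 2, 3}), target);
        System.out.println(written + " bytes written to " + target);
        target.toFile().delete(); // clean up the demo file
    }
}
```

Because the helper only consumes a stream, the table page held in the `queue` variable is never read from or replaced, which is the property the answer is relying on.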