HtmlUnit (Java) - Quick Start - A Headless Browser

Reposted by x33g5p2x on 2022-02-12 under HTML5

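Note: HtmlUnit still cannot fetch results from pages such as Baidu Translate, Baidu Search, or Tencent Translate, and it is largely ineffective against encrypted JS files. For pages driven by complex or encrypted JS, Selenium is the recommended tool.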

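1. Overview

Official documentation: https://htmlunit.sourceforge.io/

A guide with concrete demos (best read alongside the official documentation): https://www.scrapingbee.com/java-webscraping-book/

Purpose: a "GUI-less browser for Java programs". It models HTML documents and provides an API that lets you invoke pages, fill out forms, click links, and so on, just as you would in a "normal" browser.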

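2. Notes

2.0 JS parsing issues

The official documentation lists the following JS libraries as supported: htmx, jQuery, MochiKit, GWT, Sarissa, MooTools, Prototype, Ext, Dojo, and YUI. Encrypted JS files and other libraries are therefore likely to fail to parse. That is why simulated scraping of Baidu Translate, Tencent Translate, and Youdao Translate, all of which rely on encrypted JS, does not work.

For those sites, use Selenium (Java) instead. It is a heavier tool, but it works very well: you scrape the rendered page directly and never need to analyze the browser's requests.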

2.1 Turning off HtmlUnit logging

    // Requires: import java.util.logging.Level;
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
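One pitfall with the one-liner above: java.util.logging holds loggers through weak references, so a logger that nothing else references can be garbage-collected and its level setting lost. The sketch below keeps the logger in a static field and shows that child loggers inherit the OFF level. It uses only the JDK; the class and method names are made up for illustration, and it assumes HtmlUnit's log output actually ends up in java.util.logging in your setup.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietHtmlUnitLogging {

    // Strong reference: without it the logger (and its OFF level)
    // may be garbage-collected, because LogManager only keeps weak references.
    private static final Logger HTMLUNIT_ROOT = Logger.getLogger("com.gargoylesoftware");

    /** Turns off every logger in the com.gargoylesoftware namespace. */
    public static void silence() {
        HTMLUNIT_ROOT.setLevel(Level.OFF);
    }

    /** Reports whether the named logger would still emit SEVERE records. */
    public static boolean wouldLogSevere(String loggerName) {
        return Logger.getLogger(loggerName).isLoggable(Level.SEVERE);
    }

    public static void main(String[] args) {
        silence();
        // A child logger has no level of its own, so it inherits OFF from its parent.
        System.out.println(wouldLogSevere("com.gargoylesoftware.htmlunit.WebClient"));
    }
}
```

Call silence() once before creating the first WebClient, for example in a @BeforeClass method.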

3. Usage

Dependency: https://search.maven.org/artifact/net.sourceforge.htmlunit/htmlunit

    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.58.0</version>
    </dependency>

3.1 Scraping the ITHome (IT之家) weekly ranking - single page

Scrape the contents of the ITHome weekly ranking.

    /**
     * ITHome (IT之家)
     */
    @Test
    @SneakyThrows
    public void test10() {
        // Browser settings
        WebClient webClient = new WebClient();
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setActiveXNative(false);
        // Open the page
        HtmlPage page = webClient.getPage("https://www.ithome.com/");
        // Hover the mouse over the weekly ranking tab
        DomElement inputEle = page.getFirstByXPath("//div[@id='rank']//li[@data-id='2']");
        page = (HtmlPage) inputEle.mouseOver();
        DomElement ulElement = page.getFirstByXPath("//div[@id='rank']//ul[@id='d-2']");
        // Weekly ranking entries
        System.out.println(ulElement.asNormalizedText());
        // Release the browser's resources
        webClient.close();
    }

The scrape succeeds.
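To see what the two XPath expressions above select, here is a small sketch that evaluates them with the JDK's built-in javax.xml.xpath instead of HtmlUnit. The markup in SAMPLE is invented for illustration and is not ITHome's real page structure; the class and method names are likewise hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Hypothetical, well-formed fragment that mimics a tabbed ranking widget.
    public static final String SAMPLE =
          "<html><body><div id='rank'>"
        + "<ul><li data-id='1'>Daily</li><li data-id='2'>Weekly</li></ul>"
        + "<ul id='d-2'><li><a href='/a'>First</a></li><li><a href='/b'>Second</a></li></ul>"
        + "</div></body></html>";

    /** Returns the text of the first node matching the expression, or null if nothing matches. */
    public static String firstText(String xml, String expression) {
        try {
            Node node = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(xml)), XPathConstants.NODE);
            return node == null ? null : node.getTextContent();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath or XML: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // The tab that the test hovers over
        System.out.println(firstText(SAMPLE, "//div[@id='rank']//li[@data-id='2']"));
        // The first link inside the list that the hover reveals
        System.out.println(firstText(SAMPLE, "//div[@id='rank']//ul[@id='d-2']//a"));
    }
}
```

Like getFirstByXPath, this helper returns null when nothing matches, so a null check before calling mouseOver() or asNormalizedText() guards against layout changes on the target site.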

3.2 Scraping the ninth article in the ITHome weekly ranking - two pages

    /**
     * Ninth article in the ITHome weekly ranking
     */
    @Test
    @SneakyThrows
    public void test11() {
        WebClient webClient = new WebClient();
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setActiveXNative(false);
        HtmlPage page = webClient.getPage("https://www.ithome.com/");
        // Hover the mouse over the weekly ranking tab
        DomElement inputEle = page.getFirstByXPath("//div[@id='rank']//li[@data-id='2']");
        page = (HtmlPage) inputEle.mouseOver();
        // Collect the article links
        List<DomElement> articleLinkElems = page.getByXPath("//div[@id='rank']//ul[@id='d-2']//a");
        // Make sure a ninth link exists before indexing into the list
        if (articleLinkElems != null && articleLinkElems.size() >= 9) {
            // Ninth article: clicking the link loads the second page
            page = articleLinkElems.get(8).click();
            DomElement articleDivElem = page.getFirstByXPath("//div[@id='dt']//div[@class='fl content']");
            System.out.println(articleDivElem.asNormalizedText());
        }
        // Release the browser's resources
        webClient.close();
    }

The scrape succeeds.

3.3 Simulating user actions (in my experience this feature is of very limited use: it only copes with very simple JS, whereas actions on most websites trigger a chain of complex JS operations, so for scraping I still recommend Selenium)

Sample page

    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>HtmlUnit Test</title>
    </head>
    <body>
        <form id="form" onclick="return false;">
            <div class="container">
                <input type="hidden" placeholder="Enter Username" name="mark" id="mark" required value="manual ajax submit">
                <label for="uname"><b>Username</b></label>
                <input type="text" placeholder="Enter Username" name="uname" id="uname" required>
                <label for="psw"><b>Password</b></label>
                <input type="password" placeholder="Enter Password" name="psw" id="psw" required>
                <button id="loginBtn" type="button">Log in</button>
            </div>
        </form>
        <form id="form2" method="post" action="http://127.0.0.1:8080/login">
            <div class="container">
                <input type="hidden" placeholder="Enter Username" name="mark" id="mark2" required value="form submit">
                <label for="uname2"><b>Username 2</b></label>
                <input type="text" placeholder="Enter Username" name="uname" id="uname2" required>
                <label for="psw2"><b>Password 2</b></label>
                <input type="password" placeholder="Enter Password" name="psw" id="psw2" required>
                <button id="loginBtn2" type="submit">Log in 2</button>
            </div>
        </form>
        <script src="file:///G:/VsCode/开源/jquery-3.5.1/jquery-3.5.1.min.js"></script>
        <script>
            $(function () {
                // Log in via ajax and append the JSON response to the page
                function loginOperation() {
                    $.post("http://127.0.0.1:8080/login", $("#form").serialize(), responseData => {
                        $("body").append(`<h1>${JSON.stringify(responseData)}</h1>`)
                        $("form").hide();
                    }, "json")
                    return false;
                }
                $("#loginBtn").click(loginOperation);
            })
        </script>
    </body>
    </html>

Login endpoint code (Spring Boot). Note that the code below belongs to two separate files.

    // File 1: SystemConfig.java
    @Configuration
    public class SystemConfig {
        // Allow cross-origin requests
        @Bean
        public CorsFilter corsFilter() {
            CorsConfiguration corsConfiguration = new CorsConfiguration();
            corsConfiguration.addAllowedOriginPattern("*");
            corsConfiguration.setAllowCredentials(true);
            corsConfiguration.addAllowedMethod("*");
            corsConfiguration.addAllowedHeader("*");
            UrlBasedCorsConfigurationSource configSource = new UrlBasedCorsConfigurationSource();
            configSource.registerCorsConfiguration("/**", corsConfiguration);
            return new CorsFilter(configSource);
        }
    }

    // File 2: LoginController.java
    @Controller
    @RequestMapping
    @ResponseBody
    public class LoginController {
        @PostMapping("login")
        public Map<String, Object> login(HttpServletRequest request) {
            // Echo the submitted parameters back, plus one extra entry
            Map<String, Object> parameterMap = new HashMap<>(request.getParameterMap());
            parameterMap.put("name", "嗯嗯*");
            return parameterMap;
        }
    }

Simulating the user's form actions

    /**
     * Simulate user input
     */
    @Test
    @SneakyThrows
    public void test12() {
        WebClient webClient = new WebClient();
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setActiveXNative(false);
        // Form 1: request submitted manually via ajax
        HtmlPage page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");
        DomElement loginNameElem = page.getElementById("uname");
        loginNameElem.setAttribute("value", "root");
        DomElement passwordElem = page.getElementById("psw");
        passwordElem.setAttribute("value", "pswroot");
        // Submit form 1 by clicking its button
        DomElement startLoginBtnElem = page.getElementById("loginBtn");
        page = startLoginBtnElem.click();
        // The ajax callback appends the JSON response as an <h1>
        DomElement userInfoDivElem = page.getFirstByXPath("//h1");
        System.out.println(userInfoDivElem.asNormalizedText());
        //==================================================
        // Form 2: regular form submission. The response is a JSON document,
        // not an HTML page, so the result is received as an UnexpectedPage.
        page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");
        HtmlInput inputloginNameElem = (HtmlInput) page.getElementById("uname2");
        inputloginNameElem.setAttribute("value", "root2");
        HtmlInput inputpasswordElem = (HtmlInput) page.getElementById("psw2");
        inputpasswordElem.setAttribute("value", "pswroot2");
        // Submit form 2 by sending the request the form would produce
        HtmlForm enclosingForm = inputloginNameElem.getEnclosingForm();
        UnexpectedPage page2 = webClient.getPage(enclosingForm.getWebRequest(null));
        // Read the response body (UTF_8 is a static import from java.nio.charset.StandardCharsets)
        System.out.println(page2.getWebResponse().getContentAsString(UTF_8));
        // Release the browser's resources
        webClient.close();
    }
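For orientation, the second form has no enctype attribute, so by the HTML default its fields are sent as an application/x-www-form-urlencoded request body. The sketch below builds such a body with the JDK only, to show what the login endpoint receives. The class name FormBodySketch and the sample values are illustrative assumptions; this is not HtmlUnit's internal implementation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FormBodySketch {

    /** Joins name=value pairs with '&', percent-encoding both sides as UTF-8. */
    public static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(enc(field.getKey()) + "=" + enc(field.getValue()));
        }
        return body.toString();
    }

    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same field names as form2 in the sample page; LinkedHashMap keeps document order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("mark", "form submit");
        fields.put("uname", "root2");
        fields.put("psw", "pswroot2");
        System.out.println(encode(fields));
    }
}
```

On the server side, request.getParameterMap() decodes this body back into the same three entries.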

3.4 File download

Sample page

    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>HtmlUnit Test</title>
    </head>
    <body>
        <form id="form" onclick="return false;">
            <div class="container">
                <input type="hidden" placeholder="Enter Username" name="mark" id="mark" required value="manual ajax submit">
                <label for="uname"><b>Username</b></label>
                <input type="text" placeholder="Enter Username" name="uname" id="uname" required>
                <label for="psw"><b>Password</b></label>
                <input type="password" placeholder="Enter Password" name="psw" id="psw" required>
                <button id="loginBtn" type="button">Log in</button>
            </div>
        </form>
        <form id="form2" method="post" action="http://127.0.0.1:8080/login">
            <div class="container">
                <input type="hidden" placeholder="Enter Username" name="mark" id="mark2" required value="form submit">
                <label for="uname2"><b>Username 2</b></label>
                <input type="text" placeholder="Enter Username" name="uname" id="uname2" required>
                <label for="psw2"><b>Password 2</b></label>
                <input type="password" placeholder="Enter Password" name="psw" id="psw2" required>
                <button id="loginBtn2" type="submit">Log in 2</button>
            </div>
        </form>
        <a href="http://127.0.0.1:8080/download" id="downloadBtn">Download button (current page)</a>
        <br/>
        <a href="http://127.0.0.1:8080/download" id="downloadBtn2" target="_blank">Download button 2 (new page)</a>
        <script src="file:///G:/VsCode/开源/jquery-3.5.1/jquery-3.5.1.min.js"></script>
        <script>
            $(function() {
                // Log in via ajax and append the JSON response to the page
                function loginOperation() {
                    $.post("http://127.0.0.1:8080/login", $("#form").serialize(), responseData => {
                        $("body").append(`<h1>${JSON.stringify(responseData)}</h1>`)
                        $("form").hide();
                    }, "json")
                    return false;
                }
                $("#loginBtn").click(loginOperation);
            })
        </script>
    </body>
    </html>

File download endpoint

    package work.linruchang.qq.htmlunitweb.controller;

    import cn.hutool.core.util.StrUtil;
    import lombok.SneakyThrows;
    import org.springframework.core.io.FileSystemResource;
    import org.springframework.http.HttpHeaders;
    import org.springframework.http.MediaType;
    import org.springframework.http.ResponseEntity;
    import org.springframework.stereotype.Controller;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.ResponseBody;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import java.net.URLEncoder;

    /**
     * Purpose: endpoints used by the HtmlUnit tests
     *
     * @author LinRuChang
     * @version 1.0
     * @date 2022/02/09
     * @since 1.8
     **/
    @Controller
    @RequestMapping
    @ResponseBody
    public class HtmlUnitController {
        /**
         * File download test
         * http://127.0.0.1:8080/download
         *
         * @param request             the incoming request
         * @param httpServletResponse the servlet response (unused)
         * @return the file as an attachment
         */
        @GetMapping("download")
        @SneakyThrows
        public ResponseEntity<FileSystemResource> download(HttpServletRequest request, HttpServletResponse httpServletResponse) {
            System.out.println(request.getSession().getId() + " download started");
            FileSystemResource fileSystemResource = new FileSystemResource("E:\\微信\\文件\\WeChat Files\\wxid_n7xzf77wr3wv22\\FileStorage\\File\\2022-02\\房东符金瑞名下楼栋需要批量处理.xlsx");
            HttpHeaders headers = new HttpHeaders();
            headers.add("Cache-Control", "no-cache, no-store, must-revalidate");
            // URL-encode the file name so that non-ASCII characters survive the header
            headers.add("Content-Disposition", StrUtil.format("attachment; filename={}", URLEncoder.encode(fileSystemResource.getFilename(), "UTF-8")));
            headers.add("Pragma", "no-cache");
            headers.add("Expires", "0");
            return ResponseEntity.ok()
                    .headers(headers)
                    .contentLength(fileSystemResource.contentLength())
                    .contentType(MediaType.parseMediaType("application/octet-stream"))
                    .body(fileSystemResource);
        }
    }

Testing HtmlUnit's download support

    package work.linruchang.qq;

    import cn.hutool.core.collection.CollUtil;
    import cn.hutool.core.io.IoUtil;
    import cn.hutool.core.lang.Console;
    import com.gargoylesoftware.htmlunit.*;
    import com.gargoylesoftware.htmlunit.html.*;
    import com.gargoylesoftware.htmlunit.javascript.host.event.KeyboardEvent;
    import lombok.SneakyThrows;
    import org.junit.Test;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.net.URLDecoder;
    import java.util.List;
    import java.util.logging.Level;
    import static java.nio.charset.StandardCharsets.UTF_8;

    /**
     * Purpose: HtmlUnit tests
     *
     * @author LinRuChang
     * @version 1.0
     * @date 2022/02/08
     * @since 1.8
     **/
    public class HtmlUnitTest {
        @Test
        @SneakyThrows
        public void test13() {
            WebClient webClient = new WebClient();
            webClient.setAjaxController(new NicelyResynchronizingAjaxController());
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setCssEnabled(true);
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setActiveXNative(false);
            HtmlPage page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");
            //DomElement downloadBtn = page.getElementById("downloadBtn");
            DomElement downloadBtn = page.getElementById("downloadBtn2");
            // Trigger the download link
            Page clickPage = downloadBtn.click();
            // The two statements below are equivalent here
            //Page enclosedPage = webClient.getWebWindows().get(webClient.getWebWindows().size() - 1).getEnclosedPage();
            Page enclosedPage = clickPage.getEnclosingWindow().getEnclosedPage();
            // Extract the file name from the Content-Disposition header
            String responseHeaderValue = enclosedPage.getWebResponse().getResponseHeaderValue(HttpHeader.CONTENT_DISPOSITION);
            String documentName = responseHeaderValue.split(";")[1].split("=")[1].trim();
            documentName = URLDecoder.decode(documentName, UTF_8.name());
            // Save the file to disk; both streams are closed automatically
            try (InputStream contentAsStream = enclosedPage.getWebResponse().getContentAsStream();
                 FileOutputStream out = new FileOutputStream("C:\\Users\\Administrator\\Desktop\\图片\\" + documentName)) {
                IoUtil.copy(contentAsStream, out);
            }
            Console.log("File downloaded successfully: {}", documentName);
            // Release the browser's resources
            webClient.close();
        }
    }
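The file name extraction in the test above, split(";")[1].split("=")[1], assumes the header always has the exact shape "attachment; filename=...". A slightly more defensive version might look like the sketch below. It relies only on the JDK; the class name is hypothetical, and it deliberately ignores the RFC 5987 "filename*=" form, which the download endpoint above does not emit.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class ContentDispositionSketch {

    private static final String KEY = "filename=";

    /** Returns the URL-decoded file name, or null if the header carries none. */
    public static String fileName(String header) {
        if (header == null) {
            return null;
        }
        for (String part : header.split(";")) {
            String param = part.trim();
            // Case-insensitive prefix match on "filename="
            if (param.regionMatches(true, 0, KEY, 0, KEY.length())) {
                String value = param.substring(KEY.length()).trim();
                // Strip optional surrounding double quotes
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return decode(value);
            }
        }
        return null;
    }

    private static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fileName("attachment; filename=report%202022.xlsx"));
    }
}
```

A null result lets the caller fall back to a default file name instead of failing with an ArrayIndexOutOfBoundsException.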

3.5 Handling dialogs

Sample page

    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>HtmlUnit Test</title>
    </head>
    <body>
        <form id="form" onclick="return false;">
            <div class="container">
                <input type="hidden" placeholder="Enter Username" name="mark" id="mark" required value="manual ajax submit">
                <label for="uname"><b>Username</b></label>
                <input type="text" placeholder="Enter Username" name="uname" id="uname" required>
                <label for="psw"><b>Password</b></label>
                <input type="password" placeholder="Enter Password" name="psw" id="psw" required>
                <button id="loginBtn" type="button">Log in</button>
            </div>
        </form>
        <form id="form2" method="post" action="http://127.0.0.1:8080/login">
            <div class="container">
                <input type="hidden" placeholder="Enter Username" name="mark" id="mark2" required value="form submit">
                <label for="uname2"><b>Username 2</b></label>
                <input type="text" placeholder="Enter Username" name="uname" id="uname2" required>
                <label for="psw2"><b>Password 2</b></label>
                <input type="password" placeholder="Enter Password" name="psw" id="psw2" required>
                <button id="loginBtn2" type="submit">Log in 2</button>
            </div>
        </form>
        <a href="http://127.0.0.1:8080/download" id="downloadBtn">Download button (current page)</a>
        <br/>
        <a href="http://127.0.0.1:8080/download" id="downloadBtn2" target="_blank">Download button 2 (new page)</a>
        <br/>
        <button id="alertBtn">Show alert</button>
        <br/>
        <button id="promptBtn">Show prompt</button>
        <br/>
        <button id="confirmBtn">Show confirm</button>
        <script src="file:///G:/VsCode/开源/jquery-3.5.1/jquery-3.5.1.min.js"></script>
        <script>
            $(function() {
                // Each button counts its own clicks and shows the count in its dialog
                var i = 0;
                $("#alertBtn").click(function() {
                    alert("Alert triggered by click: #" + ++i)
                })
                var j = 0;
                $("#promptBtn").click(function() {
                    prompt("Prompt triggered by click: #" + ++j, "default value 1111")
                })
                var k = 0;
                $("#confirmBtn").click(function() {
                    confirm("Confirm triggered by click: #" + ++k)
                })
                // Log in via ajax and append the JSON response to the page
                function loginOperation() {
                    $.post("http://127.0.0.1:8080/login", $("#form").serialize(), responseData => {
                        $("body").append(`<h1>${JSON.stringify(responseData)}</h1>`)
                        $("form").hide();
                    }, "json")
                    return false;
                }
                $("#loginBtn").click(loginOperation);
            })
        </script>
    </body>
    </html>

Using HtmlUnit to simulate a user triggering the dialogs

    // Additionally requires: import java.util.ArrayList; import cn.hutool.core.util.StrUtil;
    @Test
    @SneakyThrows
    public void test15() {
        WebClient webClient = new WebClient();
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setActiveXNative(false);
        // Alert handling: collect every alert message into a list
        List<String> alertInfos = new ArrayList<>();
        webClient.setAlertHandler(new CollectingAlertHandler(alertInfos));
        // Prompt handling
        final List<String> promptInfos = new ArrayList<>();
        webClient.setPromptHandler(new PromptHandler() {
            @Override
            public String handlePrompt(Page page, String message, String defaultValue) {
                Console.log("Prompt info: {}, {}", message, defaultValue);
                promptInfos.add(message);
                // The returned string is what the "user" typed into the prompt:
                // here the message itself, or the default value if the message is blank
                return StrUtil.blankToDefault(message, defaultValue);
            }
        });
        // Confirm handling
        final List<String> confirmInfos = new ArrayList<>();
        webClient.setConfirmHandler(new ConfirmHandler() {
            @Override
            public boolean handleConfirm(Page page, String message) {
                confirmInfos.add(message);
                // true = OK, false = Cancel
                return true;
            }
        });
        HtmlPage page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");
        // One alert, two prompts, three confirms
        DomElement alertBtn = page.getElementById("alertBtn");
        page = alertBtn.click();
        DomElement promptBtn = page.getElementById("promptBtn");
        page = promptBtn.click();
        page = promptBtn.click();
        DomElement confirmBtn = page.getElementById("confirmBtn");
        page = confirmBtn.click();
        page = confirmBtn.click();
        page = confirmBtn.click();
        Console.log("Alert messages: {}", alertInfos);
        Console.log("Prompt messages: {}", promptInfos);
        Console.log("Confirm messages: {}", confirmInfos);
        // Release the browser's resources
        webClient.close();
    }
