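Note: HtmlUnit still cannot scrape results from pages such as Baidu Translate, Baidu Search, and Tencent Translate, and it is largely ineffective on encrypted JS files. For pages with complex or encrypted JS, Selenium is the recommended tool.
Official documentation: https://htmlunit.sourceforge.io/
A guide with concrete demos (works best alongside the official documentation): https://www.scrapingbee.com/java-webscraping-book/
Purpose: a "GUI-less browser for Java programs". It models HTML documents and provides an API that lets you invoke pages, fill out forms, click links, and so on, just as you would in a "normal" browser.
According to the official documentation, the JS libraries known to work with HtmlUnit are htmx, jQuery, MochiKit, GWT, Sarissa, MooTools, Prototype, Ext, Dojo, and YUI, so encrypted JS files and other libraries are likely to fail. That is why the encrypted JS on Baidu Translate, Tencent Translate, and Youdao Translate cannot be scraped with it. Use Selenium (Java) for those pages instead: it is a heavier tool, but very convenient, because it drives a real browser and you can scrape directly without analyzing the browser's requests.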
// Turn off HtmlUnit's log output
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
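The one-liner above can silently stop working: java.util.logging keeps only weak references to loggers, so a logger that is obtained and configured in a single expression may be garbage-collected and later recreated with default settings. Below is a minimal sketch that holds a strong reference instead. The names `QuietHtmlUnitLogs` and `silence()` are made up for illustration, and the sketch assumes HtmlUnit's log output ends up in java.util.logging (HtmlUnit logs through Apache Commons Logging, which may route elsewhere if another logging backend is on the classpath).

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class QuietHtmlUnitLogs {
    // Strong reference: the LogManager itself only keeps a weak one,
    // so without this field the OFF level could be lost after a GC.
    private static final Logger HTMLUNIT_LOGGER = Logger.getLogger("com.gargoylesoftware");

    /** Turns off every logger under the com.gargoylesoftware package. */
    public static void silence() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silence();
        // Child loggers inherit the level of their nearest configured parent.
        Logger child = Logger.getLogger("com.gargoylesoftware.htmlunit.WebClient");
        System.out.println(child.isLoggable(Level.SEVERE)); // prints false
    }
}
```

Calling `silence()` once before the first `WebClient` is created should be enough for the test methods in this article.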
Dependency: https://search.maven.org/artifact/net.sourceforge.htmlunit/htmlunit
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.58.0</version>
</dependency>
Scraping the ITHome (IT之家) weekly ranking
/**
 * ITHome
 */
@Test
@SneakyThrows
public void test10() {
    // Browser settings; try-with-resources closes the client when the test ends
    try (WebClient webClient = new WebClient()) {
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setActiveXNative(false);
        // Open the page
        HtmlPage page = webClient.getPage("https://www.ithome.com/");
        // Hover the mouse over the weekly-ranking tab
        DomElement inputEle = page.getFirstByXPath("//div[@id='rank']//li[@data-id='2']");
        page = (HtmlPage) inputEle.mouseOver();
        DomElement ulElement = page.getFirstByXPath("//div[@id='rank']//ul[@id='d-2']");
        // Weekly ranking content
        System.out.println(ulElement.asNormalizedText());
    }
}
Scrape succeeded
/**
 * Ninth article in the ITHome weekly ranking
 */
@Test
@SneakyThrows
public void test11() {
    try (WebClient webClient = new WebClient()) {
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setActiveXNative(false);
        HtmlPage page = webClient.getPage("https://www.ithome.com/");
        // Hover the mouse over the weekly-ranking tab
        DomElement inputEle = page.getFirstByXPath("//div[@id='rank']//li[@data-id='2']");
        page = (HtmlPage) inputEle.mouseOver();
        // Collect the article links
        List<DomElement> articleLinkElems = page.getByXPath("//div[@id='rank']//ul[@id='d-2']//a");
        // get(8) needs at least nine links, so check the size rather than mere non-emptiness
        if (articleLinkElems.size() >= 9) {
            // Ninth article
            page = articleLinkElems.get(8).click();
            DomElement articleDivElem = page.getFirstByXPath("//div[@id='dt']//div[@class='fl content']");
            System.out.println(articleDivElem.asNormalizedText());
        }
    }
}
Scrape succeeded
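To see what the XPath expressions in these tests select without depending on the live site, the sketch below evaluates the same expression with the JDK's built-in XPath engine. The markup in `FRAGMENT` is a hypothetical, simplified stand-in for the ranking widget (the real ithome.com markup is not reproduced here and may change at any time), and `RankXPathSketch` and `hrefs` are illustrative names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RankXPathSketch {
    // Hypothetical, well-formed stand-in for the rank widget; the real page differs.
    public static final String FRAGMENT =
          "<div id='rank'>"
        + "<ul><li data-id='1'>Daily</li><li data-id='2'>Weekly</li></ul>"
        + "<ul id='d-1'><li><a href='/a/1.htm'>Daily story</a></li></ul>"
        + "<ul id='d-2'>"
        + "<li><a href='/a/2.htm'>Weekly story A</a></li>"
        + "<li><a href='/a/3.htm'>Weekly story B</a></li>"
        + "</ul>"
        + "</div>";

    /** Returns the href attribute of every element matched by the XPath expression. */
    public static List<String> hrefs(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(((Element) nodes.item(i)).getAttribute("href"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // Only the links inside the weekly list (id d-2) are selected.
        System.out.println(hrefs(FRAGMENT, "//div[@id='rank']//ul[@id='d-2']//a"));
        // prints [/a/2.htm, /a/3.htm]
    }
}
```

HtmlUnit's `getByXPath` works on its own DOM rather than on strict XML, so this only illustrates the selection logic, not HtmlUnit's parsing behavior.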
Sample page
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>HtmlUnit Test</title>
</head>
<body>
    <form id="form" onclick="return false;">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark" required value="manual ajax submit">
            <label for="uname"><b>Account</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname" required>
            <label for="psw"><b>Password</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw" required>
            <button id="loginBtn" type="button">Log in</button>
        </div>
    </form>
    <form id="form2" method="post" action="http://127.0.0.1:8080/login">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark2" required value="form submit">
            <label for="uname2"><b>Account 2</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname2" required>
            <label for="psw2"><b>Password 2</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw2" required>
            <button id="loginBtn2" type="submit">Log in 2</button>
        </div>
    </form>
</body>
<script src="file:///G:/VsCode/开源/jquery-3.5.1/jquery-3.5.1.min.js"></script>
<script>
    $(function () {
        // Log in
        function loginOperation() {
            $.post("http://127.0.0.1:8080/login", $("#form").serialize(), responseData => {
                $("body").append(`<h1>${JSON.stringify(responseData)}</h1>`)
                $("form").hide();
            }, "json")
            return false;
        }
        $("#loginBtn").click(loginOperation);
    })
</script>
</html>
Login endpoint code (Spring Boot). Note that the code below belongs to two separate files.
@Configuration
public class SystemConfig {
    // Allow cross-origin requests
    @Bean
    public CorsFilter corsFilter() {
        CorsConfiguration corsConfiguration = new CorsConfiguration();
        corsConfiguration.addAllowedOriginPattern("*");
        corsConfiguration.setAllowCredentials(true);
        corsConfiguration.addAllowedMethod("*");
        corsConfiguration.addAllowedHeader("*");
        UrlBasedCorsConfigurationSource configSource = new UrlBasedCorsConfigurationSource();
        configSource.registerCorsConfiguration("/**", corsConfiguration);
        return new CorsFilter(configSource);
    }
}
@Controller
@RequestMapping
@ResponseBody
public class LoginController {
    // Echoes the submitted parameters back as JSON, plus an extra "name" entry
    @PostMapping("login")
    public Map<String, Object> login(HttpServletRequest request) {
        Map<String, Object> parameterMap = new HashMap<>(request.getParameterMap());
        parameterMap.put("name", "嗯嗯*");
        return parameterMap;
    }
}
Simulating user form operations
/**
 * Simulate user input
 */
@Test
@SneakyThrows
public void test12() {
    try (WebClient webClient = new WebClient()) {
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setActiveXNative(false);
        // Form 1: the request is submitted manually via ajax
        HtmlPage page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");
        DomElement loginNameElem = page.getElementById("uname");
        loginNameElem.setAttribute("value", "root");
        DomElement passwordElem = page.getElementById("psw");
        passwordElem.setAttribute("value", "pswroot");
        // Click the button that submits form 1
        DomElement startLoginBtnElem = page.getElementById("loginBtn");
        page = startLoginBtnElem.click();
        DomElement userInfoDivElem = page.getFirstByXPath("//h1");
        System.out.println(userInfoDivElem.asNormalizedText());
        //==================================================
        // Form 2: regular form submission. The response is a JSON document rather than
        // an HTML page, so it has to be received as an UnexpectedPage instead of an HtmlPage
        page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");
        HtmlInput inputloginNameElem = (HtmlInput) page.getElementById("uname2");
        inputloginNameElem.setAttribute("value", "root2");
        HtmlInput inputpasswordElem = (HtmlInput) page.getElementById("psw2");
        inputpasswordElem.setAttribute("value", "pswroot2");
        // Submit form 2
        HtmlForm enclosingForm = inputloginNameElem.getEnclosingForm();
        UnexpectedPage page2 = webClient.getPage(enclosingForm.getWebRequest(null));
        // Read the response body
        System.out.println(page2.getWebResponse().getContentAsString(UTF_8));
    }
}
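Both submission paths in test12 end up sending the form fields to /login as a URL-encoded body: jQuery's `serialize()` builds it for form 1, and the default form encoding (`application/x-www-form-urlencoded`) is used for form 2. As a rough sketch of what travels over the wire, using only the JDK; `FormBodySketch` is an illustrative name and the field values are examples, not output captured from HtmlUnit:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class FormBodySketch {
    /** Encodes the fields as an application/x-www-form-urlencoded body, in insertion order. */
    public static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(enc(field.getKey()) + "=" + enc(field.getValue()));
        }
        return body.toString();
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // Field names follow form2 of the sample page; values are illustrative.
        Map<String, String> form2 = new LinkedHashMap<>();
        form2.put("mark", "form submit");
        form2.put("uname", "root2");
        form2.put("psw", "pswroot2");
        System.out.println(encode(form2));
        // prints mark=form+submit&uname=root2&psw=pswroot2
    }
}
```

On the server side, `request.getParameterMap()` in `LoginController` decodes this body back into the individual parameters.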
Sample page (download links added)
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>HtmlUnit Test</title>
</head>
<body>
    <form id="form" onclick="return false;">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark" required value="manual ajax submit">
            <label for="uname"><b>Account</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname" required>
            <label for="psw"><b>Password</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw" required>
            <button id="loginBtn" type="button">Log in</button>
        </div>
    </form>
    <form id="form2" method="post" action="http://127.0.0.1:8080/login">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark2" required value="form submit">
            <label for="uname2"><b>Account 2</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname2" required>
            <label for="psw2"><b>Password 2</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw2" required>
            <button id="loginBtn2" type="submit">Log in 2</button>
        </div>
    </form>
    <a href="http://127.0.0.1:8080/download" id="downloadBtn">Download (current page)</a>
    <br/>
    <a href="http://127.0.0.1:8080/download" id="downloadBtn2" target="_blank">Download 2 (new page)</a>
</body>
<script src="file:///G:/VsCode/开源/jquery-3.5.1/jquery-3.5.1.min.js"></script>
<script>
    $(function() {
        // Log in
        function loginOperation() {
            $.post("http://127.0.0.1:8080/login", $("#form").serialize(), responseData => {
                $("body").append(`<h1>${JSON.stringify(responseData)}</h1>`)
                $("form").hide();
            }, "json")
            return false;
        }
        $("#loginBtn").click(loginOperation);
    })
</script>
</html>
File download endpoint
package work.linruchang.qq.htmlunitweb.controller;

import cn.hutool.core.util.StrUtil;
import lombok.SneakyThrows;
import org.springframework.core.io.FileSystemResource;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.net.URLEncoder;

/**
 * Purpose: file download endpoint used by the HtmlUnit tests
 *
 * @author LinRuChang
 * @version 1.0
 * @date 2022/02/09
 * @since 1.8
 **/
@Controller
@RequestMapping
@ResponseBody
public class HtmlUnitController {
    /**
     * File download test
     * http://127.0.0.1:8080/download
     * @param request the current request, used only to log the session id
     * @param httpServletResponse the current response (not used here)
     * @return the file as an attachment
     */
    @GetMapping("download")
    @SneakyThrows
    public ResponseEntity<FileSystemResource> download(HttpServletRequest request, HttpServletResponse httpServletResponse) {
        System.out.println(request.getSession().getId() + " download started");
        FileSystemResource fileSystemResource = new FileSystemResource("E:\\微信\\文件\\WeChat Files\\wxid_n7xzf77wr3wv22\\FileStorage\\File\\2022-02\\房东符金瑞名下楼栋需要批量处理.xlsx");
        HttpHeaders headers = new HttpHeaders();
        headers.add("Cache-Control", "no-cache, no-store, must-revalidate");
        // URL-encode the file name with an explicit charset; the single-argument encode() is deprecated
        headers.add("Content-Disposition", StrUtil.format("attachment; filename={}", URLEncoder.encode(fileSystemResource.getFilename(), "UTF-8")));
        headers.add("Pragma", "no-cache");
        headers.add("Expires", "0");
        return ResponseEntity.ok()
                .headers(headers)
                .contentLength(fileSystemResource.contentLength())
                .contentType(MediaType.parseMediaType("application/octet-stream"))
                .body(fileSystemResource);
    }
}
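The controller URL-encodes the file name before placing it in the Content-Disposition header, since header values are not a safe place for raw non-ASCII text. A small JDK-only sketch of that step follows; the helper name `contentDisposition` and the file names are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class AttachmentHeaderSketch {
    /** Builds "attachment; filename=<url-encoded name>", mirroring the controller's header. */
    public static String contentDisposition(String fileName) {
        try {
            return "attachment; filename=" + URLEncoder.encode(fileName, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(contentDisposition("monthly report.xlsx"));
        // prints attachment; filename=monthly+report.xlsx

        // Two CJK characters, written as escapes to keep this source file ASCII-only.
        System.out.println(contentDisposition("\u62a5\u544a.xlsx"));
        // prints attachment; filename=%E6%8A%A5%E5%91%8A.xlsx
    }
}
```

Note that `URLEncoder` turns spaces into `+`. The HtmlUnit test in this article reverses that with `URLDecoder`, so the round trip works here, but a regular browser would not necessarily decode the name the same way.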
Testing HtmlUnit's download feature
package work.linruchang.qq;

import cn.hutool.core.collection.CollUtil;
import cn.hutool.core.io.IoUtil;
import cn.hutool.core.lang.Console;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import com.gargoylesoftware.htmlunit.javascript.host.event.KeyboardEvent;
import lombok.SneakyThrows;
import org.junit.Test;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URLDecoder;
import java.util.List;
import java.util.logging.Level;
import static java.nio.charset.StandardCharsets.UTF_8;

/**
 * Purpose: HtmlUnit tests
 *
 * @author LinRuChang
 * @version 1.0
 * @date 2022/02/08
 * @since 1.8
 **/
public class HtmlUnitTest {
    @Test
    @SneakyThrows
    public void test13() {
        try (WebClient webClient = new WebClient()) {
            webClient.setAjaxController(new NicelyResynchronizingAjaxController());
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setCssEnabled(true);
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setActiveXNative(false);
            HtmlPage page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");
            //DomElement downloadBtn = page.getElementById("downloadBtn");
            DomElement downloadBtn = page.getElementById("downloadBtn2");
            // Click the download link
            Page clickPage = downloadBtn.click();
            // The next two statements are equivalent
            //Page enclosedPage = webClient.getWebWindows().get(webClient.getWebWindows().size() - 1).getEnclosedPage();
            Page enclosedPage = clickPage.getEnclosingWindow().getEnclosedPage();
            // Extract the file name from the Content-Disposition header
            String responseHeaderValue = enclosedPage.getWebResponse().getResponseHeaderValue(HttpHeader.CONTENT_DISPOSITION);
            String documentName = responseHeaderValue.split(";")[1].split("=")[1].trim();
            // Decode with an explicit charset; the single-argument decode() is deprecated
            documentName = URLDecoder.decode(documentName, UTF_8.name());
            // Save to the local disk; try-with-resources closes both streams
            try (InputStream contentAsStream = enclosedPage.getWebResponse().getContentAsStream();
                 FileOutputStream out = new FileOutputStream("C:\\Users\\Administrator\\Desktop\\图片\\" + documentName)) {
                IoUtil.copy(contentAsStream, out);
            }
            Console.log("File downloaded successfully: {}", documentName);
        }
    }
}
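The `split(";")[1].split("=")[1]` chain in test13 assumes the header always has exactly the shape produced by the download endpoint, and it throws if the `filename` parameter is missing. Here is a slightly more defensive sketch, assuming the file name was URL-encoded by the server as in this article. `DispositionFileName` and `fileName` are illustrative names, and RFC 5987 style `filename*=` parameters are deliberately not handled.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DispositionFileName {
    // Matches filename=value, with or without surrounding double quotes.
    private static final Pattern FILENAME =
            Pattern.compile("filename\\s*=\\s*\"?([^\";]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Extracts and URL-decodes the filename parameter, if the header has one. */
    public static Optional<String> fileName(String contentDisposition) {
        if (contentDisposition == null) {
            return Optional.empty();
        }
        Matcher m = FILENAME.matcher(contentDisposition);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(URLDecoder.decode(m.group(1).trim(), "UTF-8"));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(fileName("attachment; filename=monthly+report.xlsx").orElse("<none>"));
        // prints monthly report.xlsx
        System.out.println(fileName("inline").orElse("<none>"));
        // prints <none>
    }
}
```

In the test, the result could replace the `documentName` computation, with the `Optional` forcing an explicit decision about what to do when the server sends no file name.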
Sample page
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>HtmlUnit Test</title>
</head>
<body>
    <form id="form" onclick="return false;">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark" required value="manual ajax submit">
            <label for="uname"><b>Account</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname" required>
            <label for="psw"><b>Password</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw" required>
            <button id="loginBtn" type="button">Log in</button>
        </div>
    </form>
    <form id="form2" method="post" action="http://127.0.0.1:8080/login">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark2" required value="form submit">
            <label for="uname2"><b>Account 2</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname2" required>
            <label for="psw2"><b>Password 2</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw2" required>
            <button id="loginBtn2" type="submit">Log in 2</button>
        </div>
    </form>
    <a href="http://127.0.0.1:8080/download" id="downloadBtn">Download (current page)</a>
    <br/>
    <a href="http://127.0.0.1:8080/download" id="downloadBtn2" target="_blank">Download 2 (new page)</a>
    <br/>
    <button id="alertBtn">Show alert</button>
    <br/>
    <button id="promptBtn">Show prompt</button>
    <br/>
    <button id="confirmBtn">Show confirm</button>
</body>
<script src="file:///G:/VsCode/开源/jquery-3.5.1/jquery-3.5.1.min.js"></script>
<script>
    $(function() {
        var i = 0;
        $("#alertBtn").click(function() {
            alert("Alert triggered by click: #" + ++i)
        })
        var j = 0;
        $("#promptBtn").click(function() {
            prompt("Prompt triggered by click: #" + ++j, "default value 1111")
        })
        var k = 0;
        $("#confirmBtn").click(function() {
            confirm("Confirm triggered by click: #" + ++k)
        })
        // Log in
        function loginOperation() {
            $.post("http://127.0.0.1:8080/login", $("#form").serialize(), responseData => {
                $("body").append(`<h1>${JSON.stringify(responseData)}</h1>`)
                $("form").hide();
            }, "json")
            return false;
        }
        $("#loginBtn").click(loginOperation);
    })
</script>
</html>
Simulating user-triggered dialogs with HtmlUnit
@Test
@SneakyThrows
public void test15() {
    try (WebClient webClient = new WebClient()) {
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setActiveXNative(false);
        // Alert handling: collect every alert message into the list
        List<String> alertInfos = new ArrayList<>();
        webClient.setAlertHandler(new CollectingAlertHandler(alertInfos));
        // Prompt handling
        final List<String> promptInfos = new ArrayList<>();
        webClient.setPromptHandler(new PromptHandler() {
            @Override
            public String handlePrompt(Page page, String message, String defaultValue) {
                Console.log("Prompt: {}, {}", message, defaultValue);
                promptInfos.add(message);
                // The return value is what the simulated user typed into the prompt
                return StrUtil.blankToDefault(message, defaultValue);
            }
        });
        // Confirm handling
        final List<String> confirmInfos = new ArrayList<>();
        webClient.setConfirmHandler(new ConfirmHandler() {
            @Override
            public boolean handleConfirm(Page page, String message) {
                confirmInfos.add(message);
                // true = OK, false = Cancel
                return true;
            }
        });
        HtmlPage page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");
        DomElement alertBtn = page.getElementById("alertBtn");
        page = alertBtn.click();
        DomElement promptBtn = page.getElementById("promptBtn");
        page = promptBtn.click();
        page = promptBtn.click();
        DomElement confirmBtn = page.getElementById("confirmBtn");
        page = confirmBtn.click();
        page = confirmBtn.click();
        page = confirmBtn.click();
        Console.log("Alert messages: {}", alertInfos);
        Console.log("Prompt messages: {}", promptInfos);
        Console.log("Confirm messages: {}", confirmInfos);
    }
}
Copyright notice: this article is a repost; copyright belongs to the original author.
Original link: https://blog.csdn.net/weixin_39651356/article/details/122871212