Usage and code examples of the org.jsoup.parser.Parser.htmlParser() method

x33g5p2x, reposted on 2022-01-26 under Other

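This article collects code examples for the Java method org.jsoup.parser.Parser.htmlParser() and shows how Parser.htmlParser() is used in practice. The examples come mainly from GitHub, Stack Overflow, and Maven, were extracted from selected projects, and should serve as a useful reference. The details of Parser.htmlParser() are as follows:
Package path: org.jsoup.parser.Parser
Class name: Parser
Method name: htmlParser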

Introduction to Parser.htmlParser

Creates a new HTML parser. This parser treats input as HTML5 and enforces the creation of a normalised document, based on knowledge of the semantics of the incoming tags.
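As a minimal sketch of the normalisation described above, the following parses an incomplete fragment and checks that jsoup built a full html/head/body skeleton around it. The class name `HtmlParserIntro` and the sample markup are illustrative assumptions, not taken from the projects quoted in this article.

```java
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class HtmlParserIntro {
    // Parses a string with a fresh HTML parser.
    // The empty base URI means relative links are left unresolved.
    static Document parse(String html) {
        return Parser.htmlParser().parseInput(html, "");
    }

    public static void main(String[] args) {
        // The input has no <html>, <head> or <body>, and the <p> is never closed.
        Document doc = parse("<p>Hello");
        System.out.println("html elements: " + doc.select("html").size());
        System.out.println("head elements: " + doc.select("head").size());
        System.out.println("body text: " + doc.body().text());
    }
}
```

Each call to `Parser.htmlParser()` returns a new parser instance, so a sketch like this does not share state between parses.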

Code examples

Code example source: org.jsoup/jsoup

    /**
     * Loads a file to a Document.
     * @param in file to load
     * @param charsetName character set of input
     * @param baseUri base URI of document, to resolve relative links against
     * @return Document
     * @throws IOException on IO error
     */
    public static Document load(File in, String charsetName, String baseUri) throws IOException {
        return parseInputStream(new FileInputStream(in), charsetName, baseUri, Parser.htmlParser());
    }

Code example source: org.jsoup/jsoup

    /**
     * Parses a Document from an input stream.
     * @param in input stream to parse. You will need to close it.
     * @param charsetName character set of input
     * @param baseUri base URI of document, to resolve relative links against
     * @return Document
     * @throws IOException on IO error
     */
    public static Document load(InputStream in, String charsetName, String baseUri) throws IOException {
        return parseInputStream(in, charsetName, baseUri, Parser.htmlParser());
    }
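From application code, the same stream, charset, base URI and parser combination is reachable through the public `Jsoup.parse(InputStream, String, String, Parser)` overload. The sketch below is only an illustration under assumed names (`StreamParseSketch`, `parseBytes`, `firstLink` and the example.com URL are hypothetical); it shows the caller closing the stream and the base URI being used to resolve a relative link.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class StreamParseSketch {
    // Parses UTF-8 bytes with the HTML parser; try-with-resources closes the stream,
    // since the parse call leaves that to the caller.
    static Document parseBytes(byte[] bytes, String baseUri) {
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            return Jsoup.parse(in, "UTF-8", baseUri, Parser.htmlParser());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the absolute URL of the first link, resolved against the base URI.
    static String firstLink(String html, String baseUri) {
        Document doc = parseBytes(html.getBytes(StandardCharsets.UTF_8), baseUri);
        return doc.select("a").first().absUrl("href");
    }

    public static void main(String[] args) {
        System.out.println(firstLink("<a href='/docs'>Docs</a>", "https://example.com/"));
    }
}
```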

Code example source: org.jsoup/jsoup

    Request() {
        timeoutMilliseconds = 30000; // 30 seconds
        maxBodySizeBytes = 1024 * 1024; // 1MB
        followRedirects = true;
        data = new ArrayList<>();
        method = Method.GET;
        addHeader("Accept-Encoding", "gzip");
        addHeader(USER_AGENT, DEFAULT_UA);
        parser = Parser.htmlParser();
    }
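The constructor above makes the HTML parser the default for a request. A caller who wants a different parser can replace it through `Connection.parser(Parser)`. The following is a small sketch under assumed names (`ConnectionParserSketch`, `withParser`, `usesParser` and the feed URL are hypothetical); it only builds the connection object and never sends a request.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.parser.Parser;

public class ConnectionParserSketch {
    // Creates a connection for the URL and replaces its default HTML parser.
    // Nothing is fetched until execute(), get() or post() is called.
    static Connection withParser(String url, Parser parser) {
        return Jsoup.connect(url).parser(parser);
    }

    // Checks that the request now holds exactly the parser instance that was supplied.
    static boolean usesParser(String url, Parser parser) {
        return withParser(url, parser).request().parser() == parser;
    }

    public static void main(String[] args) {
        System.out.println(usesParser("https://example.com/feed.xml", Parser.xmlParser()));
    }
}
```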

Code example source: com.vaadin/vaadin-server

    /**
     * Parses the given input stream into a jsoup document
     *
     * @param html
     *            the stream containing the design
     * @return the parsed jsoup document
     * @throws DesignException
     *             if the stream cannot be read or parsed
     */
    private static Document parse(InputStream html) {
        try {
            Document doc = Jsoup.parse(html, UTF_8.name(), "",
                    Parser.htmlParser());
            return doc;
        } catch (IOException e) {
            throw new DesignException("The html document cannot be parsed.");
        }
    }

Code example source: rakam-io/rakam

    Document parse = Jsoup.parse(content, "", Parser.htmlParser());

Code example source: fivesmallq/web-data-extractor

    /**
     * Change the parser to htmlParser.
     *
     * @return this extractor, for chaining
     */
    public SelectorExtractor htmlParser() {
        this.parser = Parser.htmlParser();
        return this;
    }

Code example source: com.norconex.collectors/norconex-importer

    /**
     * Gets the JSoup parser associated with the string representation.
     * The string "xml" (case insensitive) will return the XML parser.
     * Anything else will return the HTML parser.
     * @param parser "html" or "xml"
     * @return JSoup parser
     * @since 2.8.0
     */
    public static Parser toJSoupParser(String parser) {
        if ("xml".equalsIgnoreCase(parser)) {
            return Parser.xmlParser();
        }
        return Parser.htmlParser();
    }
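To make the practical difference between the two parsers concrete, the sketch below feeds the same markup to each and counts the `body` elements in the result. The HTML parser wraps the input in an html/head/body skeleton, while the XML parser keeps the tree as written. The class and method names (`ParserChoiceSketch`, `toParser`, `bodyCount`) and the sample markup are assumptions for illustration.

```java
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class ParserChoiceSketch {
    // Same selection rule as the helper above: "xml" (any case) picks the XML parser.
    static Parser toParser(String name) {
        return "xml".equalsIgnoreCase(name) ? Parser.xmlParser() : Parser.htmlParser();
    }

    // Counts <body> elements in the parsed tree.
    static int bodyCount(String parserName, String markup) {
        Document doc = toParser(parserName).parseInput(markup, "");
        return doc.select("body").size();
    }

    public static void main(String[] args) {
        String markup = "<item><name>x</name></item>";
        System.out.println("html parser: " + bodyCount("html", markup));
        System.out.println("xml parser:  " + bodyCount("XML", markup));
    }
}
```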

Code example source: abola/CrawlerPack

    /**
     * Converts HTML into a Jsoup Document object.
     *
     * HTML content is handled by Jsoup's native HTML parser.
     *
     * @param html Html document
     * @return org.jsoup.nodes.Document
     */
    public org.jsoup.nodes.Document htmlToJsoupDoc(String html){
        // Convert html (html/html5) into a jsoup Document object.
        // Note: in Jsoup.parse(String, String, Parser) the second argument is the
        // base URI, so "UTF-8" here is treated as a base URI, not as a charset.
        Document jsoupDoc = Jsoup.parse(html, "UTF-8", Parser.htmlParser() );
        jsoupDoc.charset(StandardCharsets.UTF_8);
        return jsoupDoc;
    }
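When the input is already a Java `String`, no decoding happens during parsing, so a charset only matters for the document's output settings and its `meta` charset element. A minimal sketch of that step, assuming an empty base URI and the hypothetical names `OutputCharsetSketch` and `parseAsUtf8`:

```java
import java.nio.charset.StandardCharsets;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class OutputCharsetSketch {
    // Parses an already-decoded string, then sets the charset used when the
    // document is serialised; jsoup also adds or updates <meta charset> in the head.
    static Document parseAsUtf8(String html) {
        Document doc = Jsoup.parse(html, "", Parser.htmlParser());
        doc.charset(StandardCharsets.UTF_8);
        return doc;
    }

    public static void main(String[] args) {
        Document doc = parseAsUtf8("<p>x</p>");
        System.out.println(doc.charset());
        System.out.println(doc.head().html());
    }
}
```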

Code example source: addthis/hydra

    Parser parser = Parser.htmlParser().setTrackErrors(0);
    @Nonnull Document doc = parser.parseInput(html, "");
    @Nonnull Elements tags = doc.select(tagName);

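The `setTrackErrors(0)` call above turns parse-error tracking off. For contrast, the sketch below parses the same malformed input with tracking disabled and with up to 10 errors recorded, then reads the result through `Parser.getErrors()`. The class name `TrackErrorsSketch`, the method `errorCount` and the sample markup are assumptions for illustration.

```java
import java.util.List;
import org.jsoup.parser.ParseError;
import org.jsoup.parser.Parser;

public class TrackErrorsSketch {
    // Parses the input and returns how many parse errors were recorded.
    // maxErrors == 0 disables tracking, so the list stays empty.
    static int errorCount(String html, int maxErrors) {
        Parser parser = Parser.htmlParser().setTrackErrors(maxErrors);
        parser.parseInput(html, "");
        List<ParseError> errors = parser.getErrors();
        return errors.size();
    }

    public static void main(String[] args) {
        String malformed = "<div></span>"; // stray end tag with no matching start tag
        System.out.println("tracking off: " + errorCount(malformed, 0));
        System.out.println("tracking on:  " + errorCount(malformed, 10));
    }
}
```

Parsing still succeeds in both cases; tracking only controls whether the problems are reported.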
Code example source: org.apache.any23/apache-any23-core

    return Jsoup.parse(input, encoding, documentIRI, Parser.htmlParser());

Code example source: DigitalPebble/storm-crawler

    /**
     * Attempt to find a META tag in the HTML that hints at the character set
     * used to write the document.
     */
    private static String getCharsetFromMeta(byte[] buffer, int maxlength) {
        // convert to UTF-8 String -- which hopefully will not mess up the
        // characters we're interested in...
        int len = buffer.length;
        if (maxlength > 0 && maxlength < len) {
            len = maxlength;
        }
        String html = new String(buffer, 0, len, DEFAULT_CHARSET);
        Document doc = Parser.htmlParser().parseInput(html, "dummy");
        // look for <meta http-equiv="Content-Type"
        // content="text/html;charset=gb2312"> or HTML5 <meta charset="gb2312">
        Elements metaElements = doc
                .select("meta[http-equiv=content-type], meta[charset]");
        String foundCharset = null;
        for (Element meta : metaElements) {
            if (meta.hasAttr("http-equiv"))
                foundCharset = getCharsetFromContentType(meta.attr("content"));
            if (foundCharset == null && meta.hasAttr("charset"))
                foundCharset = meta.attr("charset");
            if (foundCharset != null)
                return foundCharset;
        }
        return foundCharset;
    }
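The method above depends on a `getCharsetFromContentType` helper that is not shown in the excerpt. The following self-contained sketch follows the same meta-tag lookup, with a regex-based stand-in for that helper. The stand-in, the class name `MetaCharsetSketch` and the method names are assumptions for illustration, not the project's actual implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class MetaCharsetSketch {
    // Hypothetical stand-in: pulls the charset value out of a Content-Type string.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    static String charsetFromContentType(String contentType) {
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Returns the charset hinted at by the first matching <meta> element, or null.
    static String charsetFromMeta(String html) {
        Document doc = Parser.htmlParser().parseInput(html, "");
        for (Element meta : doc.select("meta[http-equiv=content-type], meta[charset]")) {
            String found = null;
            if (meta.hasAttr("http-equiv")) {
                found = charsetFromContentType(meta.attr("content"));
            }
            if (found == null && meta.hasAttr("charset")) {
                found = meta.attr("charset");
            }
            if (found != null && !found.isEmpty()) {
                return found;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromMeta("<meta charset=\"gb2312\">"));
        System.out.println(charsetFromMeta(
                "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\">"));
    }
}
```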

Code example source: DigitalPebble/storm-crawler

            .decode(ByteBuffer.wrap(content)).toString();
    jsoupDoc = Parser.htmlParser().parseInput(html, url);

Code example source: DigitalPebble/storm-crawler

    @Test
    public void testExclusionCase() throws IOException {
        Config conf = new Config();
        conf.put(TextExtractor.EXCLUDE_PARAM_NAME, "style");
        TextExtractor extractor = new TextExtractor(conf);
        String content = "<html>the<STYLE>main</STYLE>content of the page</html>";
        Document jsoupDoc = Parser.htmlParser().parseInput(content,
                "http://stormcrawler.net");
        String text = extractor.text(jsoupDoc.body());
        assertEquals("the content of the page", text);
    }
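The test above mixes an upper-case `<STYLE>` tag with a lower-case exclusion name. With its default settings, the HTML parser lower-cases tag names, which the sketch below checks directly on the parsed tree. It does not use storm-crawler's `TextExtractor`; the class name `TagCaseSketch` and the method `firstStyle` are assumptions for illustration.

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class TagCaseSketch {
    // Parses the markup and returns the first <style> element, whatever case it was written in.
    static Element firstStyle(String html) {
        Document doc = Parser.htmlParser().parseInput(html, "");
        return doc.select("style").first();
    }

    public static void main(String[] args) {
        Element style = firstStyle("<html>the<STYLE>main</STYLE>content of the page</html>");
        System.out.println(style.tagName()); // tag name as stored in the tree
        System.out.println(style.data());    // style content is kept as data, not as text
    }
}
```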

Code example source: DigitalPebble/storm-crawler

    @Test
    public void testMainContent() throws IOException {
        Config conf = new Config();
        conf.put(TextExtractor.INCLUDE_PARAM_NAME, "DIV[id=\"maincontent\"]");
        TextExtractor extractor = new TextExtractor(conf);
        String content = "<html>the<div id='maincontent'>main<div>content</div></div>of the page</html>";
        Document jsoupDoc = Parser.htmlParser().parseInput(content,
                "http://stormcrawler.net");
        String text = extractor.text(jsoupDoc.body());
        assertEquals("main content", text);
    }
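A similar restriction to one element can be sketched with plain jsoup selectors, without storm-crawler's `TextExtractor`. The example below selects the `maincontent` div from the same markup and reads its text. The class name `MainContentSketch` and the method `textOf` are assumptions for illustration, and jsoup's own `text()` whitespace handling may differ from the extractor's in other inputs.

```java
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class MainContentSketch {
    // Parses the markup and returns the combined text of all elements matching the CSS query.
    static String textOf(String html, String cssQuery) {
        Document doc = Parser.htmlParser().parseInput(html, "http://stormcrawler.net");
        return doc.select(cssQuery).text();
    }

    public static void main(String[] args) {
        String content = "<html>the<div id='maincontent'>main<div>content</div></div>of the page</html>";
        System.out.println(textOf(content, "div#maincontent"));
    }
}
```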

Code example source: DigitalPebble/storm-crawler

    @Test
    public void testExclusion() throws IOException {
        Config conf = new Config();
        conf.put(TextExtractor.EXCLUDE_PARAM_NAME, "STYLE");
        TextExtractor extractor = new TextExtractor(conf);
        String content = "<html>the<style>main</style>content of the page</html>";
        Document jsoupDoc = Parser.htmlParser().parseInput(content,
                "http://stormcrawler.net");
        String text = extractor.text(jsoupDoc.body());
        assertEquals("the content of the page", text);
    }
