Usage of the org.htmlparser.Parser.parse() Method, with Code Examples


This article collects code examples of the Java method org.htmlparser.Parser.parse() and shows how Parser.parse() is used in practice. The examples were extracted from selected projects on GitHub, Stack Overflow, Maven and similar platforms, so they are reasonably representative and should be useful as references. Details of the Parser.parse() method are as follows:
Package: org.htmlparser
Class name: Parser
Method name: parse

Introduction to Parser.parse

Parse the given resource, using the filter provided. This can be used to extract information from specific nodes. When used with a null filter it returns an entire page, which can then be modified and converted back to HTML. (Note: the synthesis use case is not handled very well; the parser is more often used to extract information from a web page.)

For example, to replace the entire contents of the HEAD with a single TITLE tag you could do this:

    NodeList nl = parser.parse (null); // the entire page as a node list
    NodeList heads = nl.extractAllNodesThatMatch (new TagNameFilter ("HEAD"));
    if (heads.size () > 0) // there may not be a HEAD tag
    {
        HeadTag head = (HeadTag)heads.elementAt (0); // there should be only one
        head.getChildren ().removeAll (); // clean out the contents
        Tag title = new TitleTag ();
        title.setTagName ("title");
        title.setChildren (new NodeList (new TextNode ("The New Title")));
        Tag title_end = new TitleTag ();
        title_end.setTagName ("/title");
        title.setEndTag (title_end);
        head.getChildren ().add (title);
    }
    System.out.println (nl.toHtml ()); // output the modified HTML
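
For the more common extraction use case, parse() is called with a non-null filter so that only matching nodes are returned. The following is a minimal, self-contained sketch of that usage (the LinkExtractor class name, the example.com URL, and the choice of anchor tags are illustrative assumptions, not taken from the javadoc above):

    import org.htmlparser.Parser;
    import org.htmlparser.filters.TagNameFilter;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;

    public class LinkExtractor {
        public static void main(String[] args) throws ParserException {
            // Placeholder resource; Parser(String) also accepts a raw HTML string.
            Parser parser = new Parser("http://example.com");
            // parse(filter) keeps only the nodes accepted by the filter.
            NodeList links = parser.parse(new TagNameFilter("a"));
            for (int i = 0; i < links.size(); i++) {
                LinkTag link = (LinkTag) links.elementAt(i);
                System.out.println(link.getLink() + " -> " + link.getLinkText());
            }
        }
    }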


Code examples

Code example from: com.rogiel.httpchannel/httpchannel-util

    private HTMLPage(Parser parser) throws ParserException {
        this.nodes = parser.parse(null);
    }

Code example from: org.fitnesse/fitnesse

    private NodeList parseHtml(String possibleTable) {
        try {
            Parser parser = new Parser(possibleTable);
            return parser.parse(null);
        } catch (ParserException | StringIndexOutOfBoundsException e) {
            return null;
        }
    }

Code example from: com.github.tcnh/fitnesse

    private NodeList parseHtml(String possibleTable) {
        try {
            Parser parser = new Parser(possibleTable);
            return parser.parse(null);
        } catch (ParserException e) {
            return null;
        }
    }

Code example from: riotfamily/riot

    public void parse() throws ParserException {
        Parser parser = new Parser();
        parser.setInputHTML(html);
        nodes = parser.parse(null);
    }

Code example from: org.apache.uima/ruta-ep-ide-ui

    private void fillMap(String documentationFile) throws IOException {
        InputStream resourceAsStream = getClass().getResourceAsStream(documentationFile);
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(resourceAsStream));
            StringBuilder sb = new StringBuilder();
            while (true) {
                String line;
                line = reader.readLine();
                if (line == null) {
                    break;
                }
                sb.append(line + "\n");
            }
            String document = sb.toString();
            Parser parser = new Parser(document);
            NodeList list = parser.parse(null);
            HtmlDocumentationVisitor visitor = new HtmlDocumentationVisitor(document);
            list.visitAllNodesWith(visitor);
            map.putAll(visitor.getMap());
        } catch (Exception e) {
            RutaIdeUIPlugin.error(e);
        }
    }

Code example from: xuyisheng/TextViewForFullHtml

    public static String parseFontHTML(String content) {
        hasData = false;
        Parser parser = Parser.createParser(content, "UTF-8");
        StringBuilder sb = null;
        try {
            NodeList list = parser.parse(null);
            if (hasFont(list)) {
                sb = getNewHtml(list);
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        if (sb == null) {
            return content;
        }
        return sb.toString().replace("</FONT></FONT></FONT>", "</FONT>").replace("</FONT></FONT>", "</FONT>");
    }

Code example from: org.htmlparser/htmlparser

    /**
     * Apply each of the filters.
     * The first filter is applied to the output of the parser.
     * Subsequent filters are applied to the output of the prior filter.
     * @return A list of nodes passed through all filters.
     * If there are no filters, returns the entire page.
     * @throws ParserException If an encoding change occurs
     * or there is some other problem.
     */
    protected NodeList applyFilters ()
        throws
            ParserException
    {
        NodeFilter[] filters;
        NodeList ret;
        ret = mParser.parse (null);
        filters = getFilters ();
        if (null != filters)
            for (int i = 0; i < filters.length; i++)
                ret = ret.extractAllNodesThatMatch (filters[i], mRecursive);
        return (ret);
    }
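
The chaining performed by applyFilters() can also be reproduced directly against Parser.parse(). The stand-alone sketch below is an illustration only; the HTML string and the two filters are assumptions, not part of the project above:

    import org.htmlparser.NodeFilter;
    import org.htmlparser.Parser;
    import org.htmlparser.filters.HasAttributeFilter;
    import org.htmlparser.filters.TagNameFilter;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;

    public class FilterChainDemo {
        public static void main(String[] args) throws ParserException {
            // Illustrative input; Parser(String) also accepts a URL.
            String html = "<div class=\"item\"><a href=\"/a\">A</a></div><div><a href=\"/b\">B</a></div>";
            Parser parser = new Parser(html);

            // The filter chain, applied in order as in applyFilters() above.
            NodeFilter[] filters = {
                new TagNameFilter("div"),
                new HasAttributeFilter("class", "item")
            };

            NodeList result = parser.parse(null);      // start from the whole page
            for (NodeFilter filter : filters) {
                result = result.extractAllNodesThatMatch(filter, true); // recursive match
            }
            System.out.println(result.toHtml());
        }
    }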

Code example from: CloudSlang/cs-actions

    private void processHTMLBodyWithBASE64Images(MimeMultipart multipart) throws ParserException,
            MessagingException, NoSuchAlgorithmException, SMIMEException, java.security.NoSuchProviderException {
        if (null != body && body.contains("base64")) {
            Parser parser = new Parser(body);
            NodeList nodeList = parser.parse(null);
            HtmlImageNodeVisitor htmlImageNodeVisitor = new HtmlImageNodeVisitor();
            nodeList.visitAllNodesWith(htmlImageNodeVisitor);
            body = nodeList.toHtml();
            addAllBase64ImagesToMimeMultipart(multipart, htmlImageNodeVisitor.getBase64Images());
        }
    }
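
Several of these snippets walk the parsed tree with a NodeVisitor and then serialize it back with toHtml(). HtmlImageNodeVisitor above is a project-specific class; the following rough, self-contained sketch shows the same visit-then-serialize pattern (the ImageSrcLogger class and the input HTML are made up for illustration):

    import org.htmlparser.Parser;
    import org.htmlparser.Tag;
    import org.htmlparser.tags.ImageTag;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;
    import org.htmlparser.visitors.NodeVisitor;

    public class ImageSrcLogger extends NodeVisitor {
        @Override
        public void visitTag(Tag tag) {
            // Report the source URL of every image tag encountered.
            if (tag instanceof ImageTag) {
                System.out.println(((ImageTag) tag).getImageURL());
            }
        }

        public static void main(String[] args) throws ParserException {
            String html = "<p><img src=\"a.png\"><img src=\"b.png\"></p>"; // illustrative input
            NodeList nodes = new Parser(html).parse(null);
            nodes.visitAllNodesWith(new ImageSrcLogger());
            System.out.println(nodes.toHtml()); // the tree can still be serialized back to HTML
        }
    }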

Code example from: com.github.tcnh/fitnesse

    public HtmlTableScanner(String page) {
        if (page == null || page.equals(""))
            page = "<i>This page intentionally left blank.</i>";
        NodeList htmlTree;
        try {
            Parser parser = new Parser(new Lexer(new Page(page)));
            htmlTree = parser.parse(null);
        } catch (ParserException e) {
            throw new SlimError(e);
        }
        scanForTables(htmlTree);
    }

Code example from: org.fitnesse/fitnesse

    public HtmlTableScanner(String page) {
        if (page == null || page.equals(""))
            page = "<i>This page intentionally left blank.</i>";
        NodeList htmlTree;
        try {
            Parser parser = new Parser(new Lexer(new Page(page)));
            htmlTree = parser.parse(null);
        } catch (ParserException e) {
            throw new SlimError(e);
        }
        scanForTables(htmlTree);
    }

Code example from: ScienJus/pixiv-crawler

    /**
     * Extract multiple images.
     * @param pageHtml
     * @return
     */
    public List<String> parseManga(String pageHtml) {
        try {
            List<String> result = new ArrayList<String>();
            Parser parser = new Parser(pageHtml);
            NodeFilter filter = new AndFilter(new TagNameFilter("div"), new HasAttributeFilter("class", "item-container"));
            NodeList list = parser.parse(filter);
            for (int i = 0; i < list.size(); i++) {
                Node item = list.elementAt(i);
                result.add(((ImageTag) item.getChildren().elementAt(2)).getAttribute("data-src"));
            }
            return result;
        } catch (ParserException e) {
            logger.error(e.getMessage());
        }
        return null;
    }

Code example from: org.apache.uima/ruta-core

    @Override
    public void process(JCas jcas) throws AnalysisEngineProcessException {
        String documentText = jcas.getDocumentText();
        List<AnnotationFS> annotations = new ArrayList<AnnotationFS>();
        List<AnnotationFS> annotationStack = new ArrayList<AnnotationFS>();
        try {
            Parser parser = new Parser(documentText);
            NodeList list = parser.parse(null);
            HtmlVisitor visitor = new HtmlVisitor(jcas, onlyContent);
            list.visitAllNodesWith(visitor);
            annotations = visitor.getAnnotations();
            annotationStack = visitor.getAnnotationStack();
        } catch (ParserException e) {
            throw new AnalysisEngineProcessException(e);
        }
        for (AnnotationFS each : annotations) {
            if (each.getBegin() < each.getEnd()) {
                jcas.addFsToIndexes(each);
            }
        }
        for (AnnotationFS each : annotationStack) {
            if (each.getBegin() < each.getEnd()) {
                jcas.addFsToIndexes(each);
            }
        }
    }

Code example from: org.fitnesse/fitnesse

    private NodeList getMatchingTags(NodeFilter filter) throws Exception {
        String html = examiner.html();
        Parser parser = new Parser(new Lexer(new Page(html)));
        NodeList list = parser.parse(null);
        NodeList matches = list.extractAllNodesThatMatch(filter, true);
        return matches;
    }

Code example from: org.fitnesse/fitnesse

    private NodeList makeNodeList(TestPage pageToTest) {
        String html = pageToTest.getHtml();
        Parser parser = new Parser(new Lexer(new Page(html)));
        try {
            return parser.parse(null);
        } catch (ParserException e) {
            throw new SlimError(e);
        }
    }

Code example from: com.bbossgroups/bboss-htmlparser

    /**
     * Apply each of the filters.
     * The first filter is applied to the parser.
     * Subsequent filters are applied to the output of the prior filter.
     * @return A list of nodes passed through all filters.
     * @throws ParserException If an encoding change occurs
     * or there is some other problem.
     */
    protected NodeList applyFilters ()
        throws
            ParserException
    {
        NodeList ret;
        ret = new NodeList ();
        if (null != getFilters ())
            for (int i = 0; i < getFilters ().length; i++)
                if (0 == i)
                    ret = mParser.parse (getFilters ()[0]);
                else
                    ret = ret.extractAllNodesThatMatch (getFilters ()[i]);
        return (ret);
    }

Code example from: org.apache.uima/textmarker-core

    @Override
    public void process(JCas jcas) throws AnalysisEngineProcessException {
        String documentText = jcas.getDocumentText();
        List<AnnotationFS> annotations = new ArrayList<AnnotationFS>();
        List<AnnotationFS> annotationStack = new ArrayList<AnnotationFS>();
        try {
            Parser parser = new Parser(documentText);
            NodeList list = parser.parse(null);
            HtmlVisitor visitor = new HtmlVisitor(jcas, onlyContent);
            list.visitAllNodesWith(visitor);
            annotations = visitor.getAnnotations();
            annotationStack = visitor.getAnnotationStack();
        } catch (ParserException e) {
            throw new AnalysisEngineProcessException(e);
        }
        for (AnnotationFS each : annotations) {
            if (each.getBegin() < each.getEnd()) {
                jcas.addFsToIndexes(each);
            }
        }
        for (AnnotationFS each : annotationStack) {
            if (each.getBegin() < each.getEnd()) {
                jcas.addFsToIndexes(each);
            }
        }
    }

Code example from: com.github.tcnh/fitnesse

    private NodeList makeNodeList(TestPage pageToTest) {
        String html = pageToTest.getHtml();
        Parser parser = new Parser(new Lexer(new Page(html)));
        try {
            return parser.parse(null);
        } catch (ParserException e) {
            throw new SlimError(e);
        }
    }

Code example from: com.github.tcnh/fitnesse

    private NodeList getMatchingTags(NodeFilter filter) throws Exception {
        String html = examiner.html();
        Parser parser = new Parser(new Lexer(new Page(html)));
        NodeList list = parser.parse(null);
        NodeList matches = list.extractAllNodesThatMatch(filter, true);
        return matches;
    }

Code example from: ScienJus/pixiv-crawler

    /**
     * Extract a single image.
     * @param pageHtml
     * @return
     */
    public String parseMedium(String pageHtml) {
        try {
            Parser parser = new Parser(pageHtml);
            NodeFilter filter = new AndFilter(new TagNameFilter("img"), new HasAttributeFilter("class", "original-image"));
            NodeList list = parser.parse(filter);
            if (list.size() > 0) {
                return ((ImageTag) list.elementAt(0)).getAttribute("data-src");
            }
        } catch (ParserException e) {
            logger.error(e.getMessage());
        }
        return null;
    }

Code example from: ScienJus/pixiv-crawler

    /**
     * Find the URL of the next page in the search results list.
     * @param pageHtml
     * @return
     */
    public String parseNextPage(String pageHtml) {
        try {
            Parser parser = new Parser(pageHtml);
            NodeFilter filter = new AndFilter(new TagNameFilter("a"), new HasAttributeFilter("rel", "next"));
            NodeList list = parser.parse(filter);
            if (list.size() > 0) {
                return ((LinkTag) list.elementAt(0)).getLink();
            }
        } catch (ParserException e) {
            logger.error(e.getMessage());
        }
        return null;
    }
