Usage and code examples for the org.jsoup.parser.Parser class

Reposted by x33g5p2x on 2022-01-26, category: Other

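This article collects code examples for the Java class org.jsoup.parser.Parser and shows how the class is used in practice. The examples come mainly from GitHub, Stack Overflow, and Maven, and were extracted from selected projects, so they should serve as useful references. Details of the Parser class:
Package path: org.jsoup.parser.Parser
Class name: Parser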

About Parser

Parses HTML into an org.jsoup.nodes.Document. It is generally best to use one of the more convenient parse methods in org.jsoup.Jsoup.
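To make the description above concrete, here is a minimal sketch that contrasts the convenience method on Jsoup with calling the Parser directly, and shows the XML parser for comparison. The class name ParserBasics and its helper method names are made up for illustration; only the jsoup calls themselves come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class ParserBasics {
    // Convenience method on Jsoup: uses the HTML parser by default.
    static Document viaJsoup(String html) {
        return Jsoup.parse(html);
    }

    // Calling the Parser directly; the second argument is the base URI.
    static Document viaParser(String html) {
        return Parser.htmlParser().parseInput(html, "");
    }

    // XML parsing: no html/head/body scaffolding is added around the input.
    static Document asXml(String xml) {
        return Jsoup.parse(xml, "", Parser.xmlParser());
    }

    public static void main(String[] args) {
        String markup = "<p>Hello</p>";
        System.out.println(viaJsoup(markup).select("p").text());
        System.out.println(viaParser(markup).select("p").text());
        System.out.println(asXml("<item><name>a</name></item>").select("name").text());
    }
}
```

The HTML parser normalises the input into a full document with html, head, and body elements, whereas the XML parser keeps the tree as written, which is why most of the examples below pass Parser.xmlParser() when reading feeds or patent XML.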

Code examples

Code example source: deeplearning4j/dl4j-examples

    Document document = Jsoup.parse(str, "", Parser.xmlParser());
    String descr;
    Elements patent = document.select("us-patent-grant");
    if (patent.size() > 0) {
        Elements mainClassification = e2.select("main-classification");
        if (mainClassification == null || mainClassification.size() == 0) {
            log.warn("Skipping patent {} in document - no main classification");
            return null;
        String main = e2.select("main-classification").outerHtml().replaceAll("\n", "")
                .replaceAll("<main-classification>", "").replaceAll("</main-classification>", "")
                .replaceFirst(" ", ""); //Replace first space - not significant, always present. But SECOND space is important
        descr = patent.select("description").text();
    } else {
        patent = document.select("PATDOC");
        if (patent.size() > 0) {
            title = patent.select("B540").first().text();
            abstr = patent.select("SDOAB").text();
            claims = patent.select("SDOCL").text();

Code example source: org.jsoup/jsoup

    /**
     * Loads a file to a Document.
     * @param in file to load
     * @param charsetName character set of input
     * @param baseUri base URI of document, to resolve relative links against
     * @return Document
     * @throws IOException on IO error
     */
    public static Document load(File in, String charsetName, String baseUri) throws IOException {
        return parseInputStream(new FileInputStream(in), charsetName, baseUri, Parser.htmlParser());
    }

Code example source: org.jsoup/jsoup

    static Document parseInputStream(InputStream input, String charsetName, String baseUri, Parser parser) throws IOException {
        if (input == null) // empty body
            return new Document(baseUri);
        input = ConstrainableInputStream.wrap(input, bufferSize, 0);
        doc = parser.parseInput(docData, baseUri);
        Elements metaElements = doc.select("meta[http-equiv=content-type], meta[charset]");
        if (meta.hasAttr("http-equiv"))
            foundCharset = getCharsetFromContentType(meta.attr("content"));
        if (foundCharset == null && meta.hasAttr("charset"))
            foundCharset = meta.attr("charset");
        if (foundCharset != null)
        if (foundCharset == null && doc.childNodeSize() > 0 && doc.childNode(0) instanceof XmlDeclaration) {
            XmlDeclaration prolog = (XmlDeclaration) doc.childNode(0);
            if (prolog.name().equals("xml"))
        reader.skip(1);
        try {
            doc = parser.parseInput(reader, baseUri);
        } catch (UncheckedIOException e) {

Code example source: USPTO/PatentPublicData

    rawText = rawText.replaceAll("", "</q>");
    Document jsoupDoc = Jsoup.parse("<body>" + rawText + "</body>", "", Parser.xmlParser());
    jsoupDoc.outputSettings().prettyPrint(false).syntax(Syntax.xml).charset(StandardCharsets.UTF_16);
    jsoupDoc.select("bold").tagName("b");
    Elements figRefEls = jsoupDoc.select("FGREF");
    for (int i = 1; i <= figRefEls.size(); i++) {
        Element element = figRefEls.get(i - 1);
        element.attr("id", "FR-" + Strings.padStart(String.valueOf(i), 4, '0'));
        element.attr("idref", ReferenceTagger.createFigId(element.select("PDAT").text()));
        element.tagName("a");
        element.addClass("figref");
    jsoupDoc = Jsoup.parse("<body>" + fieldTextCleaned + "</body>", "", Parser.xmlParser());
    jsoupDoc.outputSettings().prettyPrint(false).syntax(OutputSettings.Syntax.xml).charset(StandardCharsets.UTF_16);

Code example source: opacapp/opacclient

    static List<LentItem> parse_medialist(Document doc) {
        List<LentItem> media = new ArrayList<>();
        Elements copytrs = doc.select(".data tr");
        LentItem item = new LentItem();
        if (tr.text().contains("keine Daten")) {
            return null;
        item.setTitle(tr.select(".account-display-title").select("b, strong")
                .text().trim());
        try {
            item.setRenewable(false);
            if (tr.select("a").size() > 0) {
                for (Element link : tr.select("a")) {
                    String href = link.attr("abs:href");
        if (lines.length == 4 || lines.length == 5) {
            item.setAuthor(Jsoup.parse(lines[1]).text().trim());
            item.setBarcode(Jsoup.parse(lines[2]).text().trim());
            if (lines.length == 5) {
                item.setBarcode(Parser.unescapeEntities(lines[1].trim(), false));
                item.setStatus(Parser.unescapeEntities(lines[2].trim(), false));
        } else if (lines.length == 2) {
            item.setAuthor(Parser.unescapeEntities(lines[1].trim(), false));
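The excerpt above relies on Parser.unescapeEntities to turn entity-encoded text into plain characters. As a small standalone sketch (the class name UnescapeDemo and the method decode are hypothetical and not part of opacclient):

```java
import org.jsoup.parser.Parser;

public class UnescapeDemo {
    // Decodes HTML entities in a text fragment.
    // The second argument says whether the string is an attribute value;
    // false treats it as ordinary text content.
    static String decode(String raw) {
        return Parser.unescapeEntities(raw.trim(), false);
    }

    public static void main(String[] args) {
        System.out.println(decode("  Tom &amp; Jerry &lt;1&gt;  "));
    }
}
```

Unlike Jsoup.parse(...).text(), which builds a whole document first, this call only decodes entities and leaves any markup in the string untouched.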

Code example source: USPTO/PatentPublicData

    Document jsoupDoc = Jsoup.parse("<body>" + rawText + "</body>", "", Parser.xmlParser());
    jsoupDoc.outputSettings().prettyPrint(false).charset(StandardCharsets.UTF_16);
    Elements figEls = jsoupDoc.select("a.figref");
    for (int i = 1; i <= figEls.size(); i++) {
        Element element = figEls.get(i - 1);
        element.attr("id", "FR-" + Strings.padStart(String.valueOf(i), 4, '0'));
    Elements headerEls = jsoupDoc.select("PAC");
    for (int i = 1; i <= headerEls.size(); i++) {
        Element element = headerEls.get(i - 1);
        element.attr("id", "H-" + Strings.padStart(String.valueOf(i), 4, '0'));
        element.tagName("h2");

Code example source: samczsun/Skype4J

    @Override
    public void handle(SkypeImpl skype, JsonObject resource) throws ConnectionException, ChatNotFoundException, IOException {
        String content = Utils.getString(resource, "content");
        String chatId = Utils.getString(resource, "conversationLink");
        String author = getAuthor(resource);
        Validate.notNull(content, "Null content");
        Validate.notNull(chatId, "Null chat");
        Validate.notNull(author, "Null author");
        String username = getUsername(author);
        Validate.notNull(username, "Null username");
        Chat chat = getChat(chatId, skype);
        Validate.notNull(chat, "Null chatobj");
        Participant initiator = chat.getParticipant(username);
        Validate.notNull(initiator, "Null initiator");
        Document doc = Parser.xmlParser().parseInput(content, "");
        List<ReceivedFile> receivedFiles = doc
                .getElementsByTag("file")
                .stream()
                .map(fe -> new ReceivedFileImpl(fe.text(), Long.parseLong(fe.attr("size")),
                        Long.parseLong(fe.attr("tid"))))
                .collect(Collectors.toList());
        FileReceivedEvent event = new FileReceivedEvent(chat, initiator, receivedFiles);
        skype.getEventDispatcher().callEvent(event);
    }
    },

Code example source: DigitalPebble/storm-crawler

    .decode(ByteBuffer.wrap(content)).toString();
    jsoupDoc = Parser.htmlParser().parseInput(html, url);
    .selectFirst("meta[name~=(?i)robots][content]");
    if (robotelement != null) {
        robotsTags.extractMetaTags(robotelement.attr("content"));
        slinks = new HashMap<>(0);
    } else {
        Elements links = jsoupDoc.select("a[href]");
        slinks = new HashMap<>(links.size());
        for (Element link : links) {
            String targetURL = link.attr("abs:href");
            .attr("rel"));
    Element body = jsoupDoc.body();
    if (body != null) {
        text = textExtractor.text(body);
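The crawler excerpt above passes the page URL as the base URI to parseInput and later reads link.attr("abs:href"). The following sketch shows how those two pieces interact; the class name AbsoluteLinks and the example.com URLs are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class AbsoluteLinks {
    // Parses the HTML against baseUri and returns every link as an absolute URL.
    static List<String> absoluteHrefs(String html, String baseUri) {
        Document doc = Parser.htmlParser().parseInput(html, baseUri);
        List<String> out = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            // The "abs:" prefix asks jsoup to resolve the value against the base URI.
            out.add(link.attr("abs:href"));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<a href='/a'>A</a><a href='b.html'>B</a>";
        System.out.println(absoluteHrefs(html, "https://example.com/dir/page.html"));
    }
}
```

When the base URI is an empty string, as in several other examples in this article, relative links cannot be resolved and abs:href yields an empty string, so the base URI matters whenever links are followed.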

Code example source: de.unistuttgart.ims/de.unistuttgart.ims.drama.io.core

    public static void getNext(JCas jcas, InputStream file, Drama drama, boolean strict)
            throws IOException, CollectionException {
        Document doc = Jsoup.parse(file, "UTF-8", "", Parser.xmlParser());
        drama.setDocumentTitle(doc.select("titleStmt > title").first().text());
        if (!doc.select("idno[type=\"TextGridUri\"]").isEmpty())
            drama.setDocumentId(doc.select("idno[type=\"TextGridUri\"]").first().text().substring(9));
        Element authorElement = authorElements.get(i);
        Author author = new Author(jcas);
        author.setName(authorElement.text());
        if (authorElement.hasAttr("key")) {
            author.setPnd(authorElement.attr("key").replace("pnd:", "http://d-nb.info/gnd/"));

Code example source: USPTO/PatentPublicData

    @Override
    public String getPlainText(String rawText, FreetextConfig textConfig) {
        Document jsoupDoc = Jsoup.parse(rawText, "", Parser.xmlParser());
        for (Element paragraph : jsoupDoc.select("PARA")) {
            // attr() returns "" rather than null for a missing attribute, so test with hasAttr()
            int level = paragraph.hasAttr("LVL") ? Integer.parseInt(paragraph.attr("LVL")) : 0;
            StringBuilder stb = new StringBuilder();
            for (int i = 0; i <= level; i++) {
                stb.append("&nbsp;");
            }
            paragraph.prepend(stb.toString());
        }
        String simpleHtml = getSimpleHtml(jsoupDoc.outerHtml());
        Document simpleDoc = Jsoup.parse(simpleHtml, "", Parser.xmlParser());
        HtmlToPlainText htmlConvert = new HtmlToPlainText(textConfig);
        return htmlConvert.getPlainText(simpleDoc);
    }

Code example source: starlightknight/swagger-confluence

    private static String reformatXHtml(final String inputXhtml, final Map<String, ConfluenceLink> confluenceLinkMap) {
        // Note: the second argument of Jsoup.parse(String, String, Parser) is the base URI, not a charset
        final Document document = Jsoup.parse(inputXhtml, "utf-8", Parser.xmlParser());
        document.outputSettings().prettyPrint(false);
        document.outputSettings().escapeMode(xhtml);
        document.outputSettings().charset("UTF-8");
        final Elements linkElements = document.select("a");
        final String originalHref = linkElement.attr("href");
        final ConfluenceLink confluenceLink = confluenceLinkMap.get(originalHref);
        linkElement.before(confluenceLinkMarkup);
        linkElement.html("");
        linkElement.unwrap();

Code example source: USPTO/PatentPublicData

    @Override
    public List<String> getParagraphText(String rawText) {
        String textWithPMarks = getSimpleHtml(rawText);
        Document jsoupDoc = Jsoup.parse(textWithPMarks, "", Parser.xmlParser());
        List<String> paragraphs = new ArrayList<String>();
        for (Element element : jsoupDoc.select("p")) {
            paragraphs.add(element.html());
        }
        return paragraphs;
    }
    }

Code example source: DigitalPebble/storm-crawler

    /**
     * Attempt to find a META tag in the HTML that hints at the character set
     * used to write the document.
     */
    private static String getCharsetFromMeta(byte buffer[], int maxlength) {
        // convert to UTF-8 String -- which hopefully will not mess up the
        // characters we're interested in...
        int len = buffer.length;
        if (maxlength > 0 && maxlength < len) {
            len = maxlength;
        }
        String html = new String(buffer, 0, len, DEFAULT_CHARSET);
        Document doc = Parser.htmlParser().parseInput(html, "dummy");
        // look for <meta http-equiv="Content-Type"
        // content="text/html;charset=gb2312"> or HTML5 <meta charset="gb2312">
        Elements metaElements = doc
                .select("meta[http-equiv=content-type], meta[charset]");
        String foundCharset = null;
        for (Element meta : metaElements) {
            if (meta.hasAttr("http-equiv"))
                foundCharset = getCharsetFromContentType(meta.attr("content"));
            if (foundCharset == null && meta.hasAttr("charset"))
                foundCharset = meta.attr("charset");
            if (foundCharset != null)
                return foundCharset;
        }
        return foundCharset;
    }
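The method above calls getCharsetFromContentType, whose body is not shown in the excerpt. As a rough idea of what such a helper might do, the sketch below pulls the charset parameter out of a Content-Type value using only the Java standard library. The class name CharsetSniffer, the method name fromContentType, and the regular expression are assumptions made for illustration, not the storm-crawler or jsoup implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    // Matches "charset=" followed by an optionally quoted name (case-insensitive).
    private static final Pattern CHARSET =
            Pattern.compile("(?i)\\bcharset=\\s*[\"']?([^\\s,;\"']*)");

    // Returns the charset name found in a Content-Type value, or null if there is none.
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (m.find()) {
            String name = m.group(1).trim();
            if (!name.isEmpty()) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=gb2312"));
        System.out.println(fromContentType("text/html"));
    }
}
```

This sketch does not check whether the JVM actually supports the returned name; a caller could add Charset.isSupported before using it.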

Code example source: crazyhitty/Munch

    @Override
    protected String doInBackground(String... strings) {
        Document opmlDocument = null;
        try {
            if (mUrl != null) {
                opmlDocument = Jsoup.connect(mUrl).parser(Parser.xmlParser()).get();
            } else {
                opmlDocument = Jsoup.parse(mFile, "UTF-8");
            }
        } catch (IOException e) {
            e.printStackTrace();
            return e.getMessage();
        }
        if (opmlDocument != null) {
            mOpmlItems = opmlDocument.select("outline");
        }
        return "success";
    }

Code example source: TeamNewPipe/NewPipeExtractor

    private List<SubscriptionItem> getItemsFromOPML(InputStream contentInputStream) throws ExtractionException {
        final List<SubscriptionItem> result = new ArrayList<>();
        final String contentString = readFromInputStream(contentInputStream);
        Document document = Jsoup.parse(contentString, "", org.jsoup.parser.Parser.xmlParser());
        if (document.select("opml").isEmpty()) {
            throw new InvalidSourceException("document does not have OPML tag");
        }
        if (document.select("outline").isEmpty()) {
            throw new InvalidSourceException("document does not have at least one outline tag");
        }
        for (Element outline : document.select("outline[type=rss]")) {
            String title = outline.attr("title");
            String xmlUrl = outline.attr("abs:xmlUrl");
            if (title.isEmpty() || xmlUrl.isEmpty()) {
                throw new InvalidSourceException("document has invalid entries");
            }
            try {
                // The unqualified Parser here appears to be NewPipe's own utility class,
                // not org.jsoup.parser.Parser, which is why the jsoup parser above is fully qualified
                String id = Parser.matchGroup1(ID_PATTERN, xmlUrl);
                result.add(new SubscriptionItem(service.getServiceId(), BASE_CHANNEL_URL + id, title));
            } catch (Parser.RegexException e) {
                throw new InvalidSourceException("document has invalid entries", e);
            }
        }
        return result;
    }

Code example source: addthis/hydra

    Parser parser = Parser.htmlParser().setTrackErrors(0);
    @Nonnull Document doc = parser.parseInput(html, "");
    @Nonnull Elements tags = doc.select(tagName);
    @Nonnull String attrValue = tag.attr(tagAttr).toLowerCase();
    for (String matchValue : values) {
        if (attrValue.contains(matchValue)) {
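The excerpt above calls setTrackErrors(0), which leaves parse-error tracking switched off. For contrast, here is a hedged sketch of what enabling it looks like; the class name ErrorTrackingDemo, the helper parseAndCollect, and the malformed sample markup are illustrative assumptions, and the exact errors reported can differ between jsoup versions.

```java
import java.util.List;
import org.jsoup.parser.ParseError;
import org.jsoup.parser.Parser;

public class ErrorTrackingDemo {
    // Parses html and returns the errors recorded by the parser.
    // maxErrors caps how many errors are kept; 0 disables tracking.
    static List<ParseError> parseAndCollect(String html, int maxErrors) {
        Parser parser = Parser.htmlParser().setTrackErrors(maxErrors);
        parser.parseInput(html, "");
        return parser.getErrors();
    }

    public static void main(String[] args) {
        // Deliberately malformed: attributes on an end tag, a late doctype, an unknown entity.
        String html = "<p>One</p href='no'>\n<!DOCTYPE html>\n&arrgh;<font /><br /><foo";
        for (ParseError error : parseAndCollect(html, 10)) {
            System.out.println(error.getPosition() + ": " + error.getErrorMessage());
        }
    }
}
```

Tracking is mainly useful for diagnostics; the parser still produces a Document for malformed input either way.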

Code example source: org.tinymediamanager.plugins/scraper-anidb

    trackConnections();
    doc = Jsoup.parse(cachedUrl.getInputStream(), "UTF-8", "", Parser.xmlParser());
    if (doc == null || doc.children().size() == 0) {
        return md;
    Element anime = doc.child(0);
    for (Element e : anime.children()) {
        if ("startdate".equalsIgnoreCase(e.tagName())) {
            try {
                Date date = StrgUtils.parseDate(e.text());
                md.setReleaseDate(date);

Code example source: abc9070410/JComicDownloader

    org.jsoup.nodes.Document doc = org.jsoup.Jsoup.connect(urlString.replaceFirst("[.]com[/]manhua-", ".com/rss-")).cookie("Cookie", "isAdult=1").parser(org.jsoup.parser.Parser.xmlParser()).get();
    this.title = Common.getStringRemovedIllegalChar(NewEncoding.StoT(doc.getElementsByTag("title").get(0).text()));
    for (org.jsoup.nodes.Element e : doc.getElementsByTag("item")){
        volumeList.add( getVolumeWithFormatNumber( Common.getStringRemovedIllegalChar(
                NewEncoding.StoT(e.getElementsByTag("title").get(0).text().trim()))));
        urlList.add( e.getElementsByTag("link").get(0).text());

Code example source: de.unistuttgart.ims/uimautil

    public JCas read(JCas jcas, InputStream xmlStream) throws IOException {
        doc = Jsoup.parse(xmlStream, "UTF-8", "", Parser.xmlParser());
        root = doc;
        else
        root = doc.select(textRootSelector).first();
        root.traverse(vis);
        parsingDescription.setEncoding(doc.charset().name());
        Node rootNode = doc.root();
        List<String> declarations = new LinkedList<String>();
        for (Node topNode : rootNode.childNodes()) {

Code example source: org.apache.any23/apache-any23-core

    if (length >= 20 && bytes[length - 2] == '?') {
        String decl = "<" + new String(bytes, 2, length - 4) + ">";
        org.jsoup.nodes.Document doc = org.jsoup.Jsoup.parse(decl, documentIRI, Parser.xmlParser());
        for (org.jsoup.nodes.Element el : doc.children()) {
            if ("xml".equalsIgnoreCase(el.tagName())) {
                String enc = el.attr("encoding");
                if (enc != null && !enc.isEmpty()) {
                    encoding = enc;
    return Jsoup.parse(input, encoding, documentIRI, Parser.htmlParser());
