Usage and code examples for the edu.uci.ics.crawler4j.url.WebURL.<init>() method

Reposted by x33g5p2x on 2022-02-03 under Other

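This article collects code examples for the Java method edu.uci.ics.crawler4j.url.WebURL.<init>(), that is, the no-argument constructor invoked as new WebURL(), and shows how it is used in practice. The examples are drawn mainly from GitHub, Stack Overflow, and Maven and were extracted from selected projects, so they should serve as a useful reference. Details of WebURL.<init>() are as follows:
Package path: edu.uci.ics.crawler4j.url.WebURL
Class name: WebURL
Method name: <init>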

About WebURL.<init>

No description available.

Code examples

Example source: yasserg/crawler4j

    private Set<WebURL> parseOutgoingUrls(WebURL referringPage) throws UnsupportedEncodingException {
        Set<String> extractedUrls = extractUrlInCssText(this.getTextContent());
        final String pagePath = referringPage.getPath();
        final String pageUrl = referringPage.getURL();
        Set<WebURL> outgoingUrls = new HashSet<>();
        for (String url : extractedUrls) {
            // Resolve each extracted link against the referring page.
            String relative = getLinkRelativeTo(pagePath, url);
            String absolute = getAbsoluteUrlFrom(URLCanonicalizer.getCanonicalURL(pageUrl), relative);
            // Create an empty WebURL, then populate it through its setter.
            WebURL webURL = new WebURL();
            webURL.setURL(absolute);
            outgoingUrls.add(webURL);
        }
        return outgoingUrls;
    }

Example source: yasserg/crawler4j

    String url = URLCanonicalizer.getCanonicalURL(href, contextURL, hrefCharset);
    if (url != null) {
        WebURL webURL = new WebURL();
        webURL.setTldList(tldList);
        webURL.setURL(url);
        // (the source excerpt ends here)

Example source: yasserg/crawler4j

    WebURL webUrl = new WebURL();
    webUrl.setTldList(tldList);
    webUrl.setURL(canonicalUrl);

Example source: yasserg/crawler4j

    @Override
    public WebURL entryToObject(TupleInput input) {
        // Fields are read back in a fixed order.
        WebURL webURL = new WebURL();
        webURL.setURL(input.readString());
        webURL.setDocid(input.readInt());
        webURL.setParentDocid(input.readInt());
        webURL.setParentUrl(input.readString());
        webURL.setDepth(input.readShort());
        webURL.setPriority(input.readByte());
        webURL.setAnchor(input.readString());
        return webURL;
    }
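
The entryToObject method above fills a freshly constructed WebURL by reading fields in a fixed order, which only works if the writing side emits them in the same order. The following is a minimal sketch of that idea using plain java.io data streams instead of the tuple classes used in the snippet. UrlRecord and UrlRecordCodec are hypothetical stand-ins carrying a subset of the fields; they are not part of crawler4j.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class UrlRecordCodec {
    /** Hypothetical stand-in for a few WebURL fields (not the crawler4j class). */
    public static final class UrlRecord {
        public String url;
        public int docid;
        public short depth;
        public byte priority;
    }

    /** Writes the fields in a fixed order: url, docid, depth, priority. */
    public static byte[] write(UrlRecord r) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(r.url);
            out.writeInt(r.docid);
            out.writeShort(r.depth);
            out.writeByte(r.priority);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the fields back in exactly the order used by write(). */
    public static UrlRecord read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            UrlRecord r = new UrlRecord(); // empty object, populated field by field
            r.url = in.readUTF();
            r.docid = in.readInt();
            r.depth = in.readShort();
            r.priority = in.readByte();
            return r;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Swapping any two reads of different widths in read() would silently corrupt the record, which is why the read and write sequences need to be kept in step.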

Example source: yasserg/crawler4j

    WebURL webURL = new WebURL();
    webURL.setTldList(myController.getTldList());
    webURL.setURL(movedToUrl);

Example source: edu.uci.ics/crawler4j

    public static Set<WebURL> extractUrls(String input) {
        Set<WebURL> extractedUrls = new HashSet<>();
        if (input != null) {
            Matcher matcher = pattern.matcher(input);
            while (matcher.find()) {
                WebURL webURL = new WebURL();
                String urlStr = matcher.group();
                // Prepend a scheme when the match does not start with "http".
                if (!urlStr.startsWith("http")) {
                    urlStr = "http://" + urlStr;
                }
                webURL.setURL(urlStr);
                extractedUrls.add(webURL);
            }
        }
        return extractedUrls;
    }
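
The pattern field used by extractUrls is not shown in the snippet above. As a rough sketch of the same extract-then-normalize flow, the version below assumes a simple, hypothetical regular expression and collects plain strings rather than WebURL objects, so it runs without crawler4j on the classpath. It also deliberately differs from the snippet in one respect: it checks for a full "http://" or "https://" prefix, because a bare startsWith("http") test would also treat a scheme-less host such as httpbin.org as already having a scheme.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlExtractSketch {
    // Hypothetical pattern (an assumption, not crawler4j's): optional http(s)
    // scheme, a dotted host name, and an optional path without whitespace.
    private static final Pattern URL_PATTERN = Pattern.compile(
        "(?:https?://)?[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+(?:/[^\\s]*)?");

    /** Returns every URL-like match in the input, each with an explicit scheme. */
    public static Set<String> extractUrls(String input) {
        Set<String> extracted = new LinkedHashSet<>();
        if (input == null) {
            return extracted;
        }
        Matcher matcher = URL_PATTERN.matcher(input);
        while (matcher.find()) {
            String url = matcher.group();
            // Stricter than startsWith("http"): require the full scheme prefix.
            if (!url.startsWith("http://") && !url.startsWith("https://")) {
                url = "http://" + url;
            }
            extracted.add(url);
        }
        return extracted;
    }
}
```

In the original method each resulting string would then be passed to setURL on a newly constructed WebURL.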

Example source: tim232385/WebVideoBot

    @Override
    protected WebURL handleUrlBeforeProcess(WebURL webURL) {
        return getViewkey(webURL)
            .map(key -> "https://www.pornhub.com/embed/" + key)
            .map(url -> {
                WebURL newUrl = new WebURL();
                newUrl.setURL(url);
                return newUrl;
            }).orElse(super.handleUrlBeforeProcess(webURL));
    }

Example source: tim232385/WebVideoBot

    public void download(CrawlConfig config, String url, File file) throws InterruptedException, IOException {
        PageFetcher pageFetcher = new PageFetcher(config);
        WebURL curURL = new WebURL();
        curURL.setURL(url);
        PageFetchResult fetchResult = null;
        try {
            fetchResult = pageFetcher.fetchPage(curURL);
            if (fetchResult.getStatusCode() == HttpStatus.SC_OK) {
                logger.info("Start download filePath:[{}]", file);
                FileUtils.copyInputStreamToFile(fetchResult.getEntity().getContent(), file);
                logger.info("Download Finish filePath:[{}].", file);
            } else {
                logger.info("Skip download url:[{}], HttpStatus:[{}]", url, fetchResult.getStatusCode());
            }
        } catch (PageBiggerThanMaxSizeException e) {
            logger.debug("PageBiggerThanMaxSizeException", e);
            logger.info("Skip download url:[{}], Out of MaxDownloadSize", url);
        } finally {
            // Always release the response, whether or not it was consumed.
            if (fetchResult != null) {
                fetchResult.discardContentIfNotConsumed();
            }
        }
    }

Example source: stackoverflow.com

    public void addSeed(String pageUrl, int docId) {
        String canonicalUrl = URLCanonicalizer.getCanonicalURL(pageUrl);
        if (canonicalUrl == null) {
            logger.error("Invalid seed URL: " + pageUrl);
            return;
        }
        if (docId < 0) {
            docId = docIdServer.getDocId(canonicalUrl);
            if (docId > 0) {
                // This URL is already seen.
                return;
            }
            docId = docIdServer.getNewDocID(canonicalUrl);
        } else {
            try {
                docIdServer.addUrlAndDocId(canonicalUrl, docId);
            } catch (Exception e) {
                logger.error("Could not add seed: " + e.getMessage());
            }
        }
        WebURL webUrl = new WebURL();
        webUrl.setURL(canonicalUrl);
        webUrl.setDocid(docId);
        webUrl.setDepth((short) 0);
        if (!robotstxtServer.allows(webUrl)) {
            logger.info("Robots.txt does not allow this seed: " + pageUrl);
        } else {
            frontier.schedule(webUrl); // adds the URL to the frontier at run time
        }
    }

Example source: edu.uci.ics/crawler4j

    String url = URLCanonicalizer.getCanonicalURL(href, contextURL, hrefCharset);
    if (url != null) {
        WebURL webURL = new WebURL();
        webURL.setURL(url);
        webURL.setTag(urlAnchorPair.getTag());
        // (the source excerpt ends here)

Example source: edu.uci.ics/crawler4j

    WebURL webUrl = new WebURL();
    webUrl.setURL(canonicalUrl);
    webUrl.setDocid(docId);

Example source: biezhi/java-library-examples

    private Page download(String url) {
        WebURL curURL = new WebURL();
        curURL.setURL(url);
        PageFetchResult fetchResult = null;
        try {
            fetchResult = pageFetcher.fetchPage(curURL);
            if (fetchResult.getStatusCode() == HttpStatus.SC_OK) {
                Page page = new Page(curURL);
                fetchResult.fetchContent(page, pageFetcher.getConfig().getMaxDownloadSize());
                parser.parse(page, curURL.getURL());
                return page;
            }
        } catch (Exception e) {
            logger.error("Error occurred while fetching url: " + curURL.getURL(), e);
        } finally {
            if (fetchResult != null) {
                fetchResult.discardContentIfNotConsumed();
            }
        }
        // Reached on a non-200 status or after a logged exception.
        return null;
    }

Example source: tjake/stormscraper

    WebURL curURL = new WebURL();
    curURL.setURL(URLCanonicalizer.getCanonicalURL(currentUrl));
    WebURL baseURL = new WebURL();
    baseURL.setURL(URLCanonicalizer.getCanonicalURL(startUrl));

Example source: edu.uci.ics/crawler4j

    @Override
    public WebURL entryToObject(TupleInput input) {
        WebURL webURL = new WebURL();
        webURL.setURL(input.readString());
        webURL.setDocid(input.readInt());
        webURL.setParentDocid(input.readInt());
        webURL.setParentUrl(input.readString());
        webURL.setDepth(input.readShort());
        webURL.setPriority(input.readByte());
        webURL.setAnchor(input.readString());
        return webURL;
    }

Example source: edu.uci.ics/crawler4j

    WebURL webURL = new WebURL();
    webURL.setURL(movedToUrl);
    webURL.setParentDocid(curURL.getParentDocid());

Example source: biezhi/java-library-examples

    WebURL url = new WebURL();
