Usage and code examples for the edu.uci.ics.crawler4j.url.WebURL.setDocid() method

x33g5p2x, reposted on 2022-02-03 under Other

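This article collects code examples for the Java method edu.uci.ics.crawler4j.url.WebURL.setDocid() and shows how WebURL.setDocid() is used in practice. The examples come mainly from GitHub, Stack Overflow, and Maven and were extracted from selected projects, so they should serve as a useful reference. Details of WebURL.setDocid() are as follows:
Package: edu.uci.ics.crawler4j.url
Class name: WebURL
Method name: setDocid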

About WebURL.setDocid

No description is available.

Code examples

Code example source: yasserg/crawler4j

    webUrl.setTldList(tldList);
    webUrl.setURL(canonicalUrl);
    webUrl.setDocid(docId);
    webUrl.setDepth((short) 0); // seeds start at depth 0
    if (robotstxtServer.allows(webUrl)) {
        // ... (the excerpt ends here in the source)
    }

Code example source: yasserg/crawler4j

    // Rebuilds a WebURL from its serialized tuple.
    // Fields must be read in the same order in which they were written.
    @Override
    public WebURL entryToObject(TupleInput input) {
        WebURL webURL = new WebURL();
        webURL.setURL(input.readString());
        webURL.setDocid(input.readInt());
        webURL.setParentDocid(input.readInt());
        webURL.setParentUrl(input.readString());
        webURL.setDepth(input.readShort());
        webURL.setPriority(input.readByte());
        webURL.setAnchor(input.readString());
        return webURL;
    }
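The read order in entryToObject only works if it mirrors the order used when the entry was written. The following is a minimal sketch of that round trip using only the JDK. UrlRecord and UrlRecordCodec are hypothetical stand-ins, and DataOutputStream/DataInputStream take the place of the tuple classes used above; this is not crawler4j's own serialization code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for WebURL that keeps only a few of its fields.
class UrlRecord {
    String url;
    int docid;
    int parentDocid;
    short depth;
}

// Hypothetical codec: writes and reads the fields in one fixed order.
class UrlRecordCodec {
    static byte[] toBytes(UrlRecord r) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(r.url);
            out.writeInt(r.docid);
            out.writeInt(r.parentDocid);
            out.writeShort(r.depth);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    static UrlRecord fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            UrlRecord r = new UrlRecord();
            // Same order as toBytes: url, docid, parentDocid, depth.
            r.url = in.readUTF();
            r.docid = in.readInt();
            r.parentDocid = in.readInt();
            r.depth = in.readShort();
            return r;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        UrlRecord original = new UrlRecord();
        original.url = "http://example.com/";
        original.docid = 7;
        original.parentDocid = 3;
        original.depth = 2;
        UrlRecord copy = fromBytes(toBytes(original));
        System.out.println(copy.url + " docid=" + copy.docid + " depth=" + copy.depth);
    }
}
```

Swapping two reads of the same width, such as docid and parentDocid, would not raise an error; it would silently exchange the values, which is why the order matters.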

Code example source: yasserg/crawler4j

    webURL.setParentUrl(curURL.getParentUrl());
    webURL.setDepth(curURL.getDepth());
    webURL.setDocid(-1); // placeholder until the URL is accepted
    webURL.setAnchor(curURL.getAnchor());
    if (shouldVisit(page, webURL)) {
        if (!shouldFollowLinksIn(webURL) || robotstxtServer.allows(webURL)) {
            webURL.setDocid(docIdServer.getNewDocID(movedToUrl));
            frontier.schedule(webURL);
        } else {
    // ... (lines omitted in the source excerpt)
    curURL.setDocid(docIdServer.getNewDocID(fetchResult.getFetchedUrl()));
    // ... (lines omitted in the source excerpt)
        webURL.setDocid(newdocid);
    } else {
        webURL.setDocid(-1);
        webURL.setDepth((short) (curURL.getDepth() + 1));
        if ((maxCrawlDepth == -1) || (curURL.getDepth() < maxCrawlDepth)) {
            if (shouldVisit(page, webURL)) {
                if (robotstxtServer.allows(webURL)) {
                    webURL.setDocid(docIdServer.getNewDocID(webURL.getURL()));
                    toSchedule.add(webURL);
                } else {
    // ... (the excerpt ends here in the source)
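The last part of the excerpt above follows a recognizable pattern: a link that is already known reuses its ID, while a new link starts with the placeholder -1 and receives a real ID only if the depth limit allows it to be scheduled. The sketch below reproduces just that pattern. Link and LinkScheduler are hypothetical classes, the in-memory map stands in for docIdServer, and the shouldVisit and robots.txt checks are left out on purpose.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for WebURL.
class Link {
    final String url;
    int docid = -1;
    short depth;

    Link(String url) {
        this.url = url;
    }
}

// Hypothetical scheduler; the map plays the role of docIdServer.
class LinkScheduler {
    private final Map<String, Integer> seen = new HashMap<>();
    private final int maxCrawlDepth; // -1 means no depth limit
    private int lastId = 0;

    LinkScheduler(int maxCrawlDepth) {
        this.maxCrawlDepth = maxCrawlDepth;
    }

    // Returns the links that should be scheduled for crawling.
    List<Link> selectNew(short parentDepth, List<Link> outgoing) {
        List<Link> toSchedule = new ArrayList<>();
        for (Link link : outgoing) {
            Integer known = seen.get(link.url);
            if (known != null) {
                link.docid = known; // already seen: reuse the ID, do not reschedule
                continue;
            }
            link.docid = -1; // placeholder until the link is accepted
            link.depth = (short) (parentDepth + 1);
            if (maxCrawlDepth == -1 || parentDepth < maxCrawlDepth) {
                lastId++;
                seen.put(link.url, lastId);
                link.docid = lastId;
                toSchedule.add(link);
            }
        }
        return toSchedule;
    }
}
```

A link that is rejected by the depth check keeps docid -1, so callers can tell it apart from links that were actually registered.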

Code example source: stackoverflow.com

    public void addSeed(String pageUrl, int docId) {
        String canonicalUrl = URLCanonicalizer.getCanonicalURL(pageUrl);
        if (canonicalUrl == null) {
            logger.error("Invalid seed URL: " + pageUrl);
            return;
        }
        if (docId < 0) {
            // No ID supplied: look the URL up, and assign a new ID if it is unseen.
            docId = docIdServer.getDocId(canonicalUrl);
            if (docId > 0) {
                // This URL is already seen.
                return;
            }
            docId = docIdServer.getNewDocID(canonicalUrl);
        } else {
            // The caller supplied an ID: register it for this URL.
            try {
                docIdServer.addUrlAndDocId(canonicalUrl, docId);
            } catch (Exception e) {
                logger.error("Could not add seed: " + e.getMessage());
            }
        }
        WebURL webUrl = new WebURL();
        webUrl.setURL(canonicalUrl);
        webUrl.setDocid(docId);
        webUrl.setDepth((short) 0);
        if (!robotstxtServer.allows(webUrl)) {
            logger.info("Robots.txt does not allow this seed: " + pageUrl);
        } else {
            frontier.schedule(webUrl); // method that adds the URL to the frontier at run time
        }
    }
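The ID decision inside addSeed can be isolated from the crawler and exercised on its own. The sketch below does that under stated assumptions: InMemoryDocIdServer and SeedResolver are hypothetical, the method names mirror the calls in the example above, and the behaviour (sequential IDs starting at 1, -1 for an unseen URL, rejecting a conflicting ID) is an assumption for illustration rather than a description of crawler4j's implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for docIdServer.
class InMemoryDocIdServer {
    private final Map<String, Integer> ids = new HashMap<>();
    private int lastId = 0;

    // Returns the ID assigned to the URL, or -1 if the URL is unseen.
    int getDocId(String url) {
        return ids.getOrDefault(url, -1);
    }

    // Returns the existing ID, or assigns the next sequential one.
    int getNewDocID(String url) {
        Integer existing = ids.get(url);
        if (existing != null) {
            return existing;
        }
        lastId++;
        ids.put(url, lastId);
        return lastId;
    }

    // Registers a caller-chosen ID; assumed to reject a conflicting mapping.
    void addUrlAndDocId(String url, int docId) {
        Integer existing = ids.get(url);
        if (existing != null && existing != docId) {
            throw new IllegalArgumentException("URL already has doc ID " + existing);
        }
        ids.put(url, docId);
        lastId = Math.max(lastId, docId);
    }
}

class SeedResolver {
    static final int ALREADY_SEEN = -1;

    // Returns the ID to pass to setDocid, or ALREADY_SEEN if the seed should be skipped.
    static int resolve(InMemoryDocIdServer server, String url, int requestedDocId) {
        if (requestedDocId < 0) {
            if (server.getDocId(url) > 0) {
                return ALREADY_SEEN;
            }
            return server.getNewDocID(url);
        }
        server.addUrlAndDocId(url, requestedDocId);
        return requestedDocId;
    }
}
```

Keeping the decision in a separate method makes the two paths (auto-assigned ID versus caller-supplied ID) easy to test without a running crawler.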
