"String index out of range" error while web crawling

cunj1qz1, posted on 2021-07-09 in Java

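After crawling the first two URLs, my program keeps failing with: Exception in thread "AWT-EventQueue-0" java.lang.StringIndexOutOfBoundsException: String index out of range: 0. The crawl of the first two URLs does what I want, and I use a method from another class to get the text from them. That other class may be the problem, but I can't tell. Please take a look at my code and see what is going wrong.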

package WebCrawler;

import java.util.Scanner;
import java.util.ArrayList;

import static TextAnalyser.Textanalyser.analyse;

public class Crawler {

    public static void main(String[] args) {
        //   java.util.Scanner input = new java.util.Scanner(System.in);
        //  System.out.print("Enter a URL: ");
        //  String url = input.nextLine();
        crawler("http://www.port.ac.uk/"); // Traverse the web from the starting URL
    }

    public static void crawler(String startingURL) {
        ArrayList<String> listOfPendingURLs = new ArrayList<String>();
        ArrayList<String> listOfTraversedURLs = new ArrayList<String>();

        listOfPendingURLs.add(startingURL);
        while (!listOfPendingURLs.isEmpty() && listOfTraversedURLs.size() <= 100) {
            String urlString = listOfPendingURLs.remove(0);

            if (!listOfTraversedURLs.contains(urlString)) {
                listOfTraversedURLs.add(urlString);
                String text = urlString;
                text = ReadTextfromURL.gettext(text);
                text = analyse(text);
                System.out.println("text : " + text);
                System.out.println("Craw " + urlString);

                for (String s: getSubURLs(urlString)) {
                    if (!listOfTraversedURLs.contains(s)) {
                        listOfPendingURLs.add(s);
                    }
                }
            }
        }
    }

    public static ArrayList<String> getSubURLs(String urlString) {
        ArrayList <String> list = new ArrayList<String>();

        try {
            java.net.URL url = new java.net.URL(urlString);
            Scanner input = new Scanner(url.openStream());
            int current = 0;
            while (input.hasNext()) {
                String line = input.nextLine();
                current = line.indexOf("http:", current);
                while (current > 0) {
                    int endIndex = line.indexOf("\"", current);
                    if (endIndex > 0) { // Ensure that a correct URL is found 
                        list.add(line.substring(current, endIndex));
                        current = line.indexOf("http:", endIndex);
                    } else {
                        current = -1;
                    }
                }
            }
        } catch (Exception ex) {
            System.out.println("Error: " + ex.getMessage());
        }

        return list;
    }
}
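
For reference, "String index out of range: 0" is what Java reports when index 0 is requested from an empty string, for example via charAt(0). The Crawler code above does not call charAt itself, so one plausible source (an assumption, since ReadTextfromURL and Textanalyser are not shown) is that a page yields empty text and the other class indexes into it without checking. A minimal sketch of the failure and a guard follows; the class EmptyStringGuard and the method firstCharOrDefault are hypothetical names used only for illustration:

```java
public class EmptyStringGuard {

    // Hypothetical helper (not from the original post): returns the first
    // character of s, or the fallback when s is null or empty.
    static char firstCharOrDefault(String s, char fallback) {
        if (s == null || s.isEmpty()) {
            return fallback;
        }
        return s.charAt(0);
    }

    public static void main(String[] args) {
        // Reproduce the exception: index 0 does not exist in an empty string.
        try {
            "".charAt(0);
        } catch (StringIndexOutOfBoundsException ex) {
            System.out.println("Caught: " + ex.getClass().getSimpleName());
        }

        // The guarded version never indexes into an empty string.
        System.out.println(firstCharOrDefault("", '?'));
        System.out.println(firstCharOrDefault("port", '?'));
    }
}
```

If this is the cause, checking text.isEmpty() in crawler before calling analyse(text), or adding an equivalent check inside the analyser, would avoid the crash. The full stack trace shows the exact class and line that threw.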
