Parsing the Original Page with JSoup to Obtain Form Data and Query String Parameters

Reposted by x33g5p2x on 2021-12-28 in Other

This project uses jsoup and fastjson, so first add the following to pom.xml:

<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.11.3</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.49</version>
    </dependency>
</dependencies>

Project requirement: scrape a student's teaching execution plan and display it in an app, as shown below:

Obtaining the cookies for the intranet and the academic affairs system

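Log in to the intranet first:

After a successful login:

This yields the intranet cookie: cookies_innet.
Next, we simulate logging in to the new academic affairs system, starting from its login page:

A successful login yields the cookie: cookies. For the detailed process, see the article "JSoup模拟登录新版正方教务系统(内网-教务系统)爬取信息过程详解" (a detailed walkthrough of simulating login to the new Zhengfang academic affairs system with JSoup, intranet first and then the academic affairs system).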

Scraping the teaching execution plan

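This is what the query page looks like:

Press F12, select Computer Science and Technology, and then click the course requirements link:

Open the first request in the Network tab:

Four parameters are required. The last one, su, is the student number; gnmkdm is fixed; and the _ parameter is the current time. I initially assumed the first one, jxzxjhxx_id, was fixed as well, but it is not: when someone else logged in with their own student number and password, the query still returned the teaching execution plan for my major. So jxzxjhxx_id has to be determined first. How do we find it?
Based on the article "Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=422, URL=", I guessed that the id might be in the original page, so I opened the Elements tab and searched for jxzxjhxx_id:

Sure enough, it is there. So the first step is to fetch and parse the original page: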

String suburl = url + "/jwglxt/jxzxjhgl/jxzxjhck_cxJxzxjhckIndex.html?gnmkdm=N153540&layout=default&su=" + stuNum;
connection = Jsoup.connect(suburl);
connection.header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36");
connection.header("Connection", "keep-alive");
response = connection.cookies(cookies_innet).cookies(cookies).method(Connection.Method.GET).execute();

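However, printing response.body() shows that the id value I want is not there, so I pressed Ctrl+U to view the page source:

The id I want belongs inside the table tag, which is currently empty. The Query button above it explains why: the query probably has to be submitted first.

Open the first request in the Network tab to see which form data has to be submitted:

The last five fields control how the query results are displayed, such as the maximum number of rows per page, the current page number, and the sort order; these are straightforward. The first field, jg_id, is clearly the college ID, and the second, njdm_id, is the grade year. Because the app will be used by students from different grades and colleges, the grade and college ID are not known in advance, so they also have to be found in the original page:

These values exist in the page; the value attribute is the college ID. Next, look at the selected year:

So we find the ID of the selected college and the selected year: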

String suburl = url + "/jwglxt/jxzxjhgl/jxzxjhck_cxJxzxjhckIndex.html?gnmkdm=N153540&layout=default&su=" + stuNum;
connection = Jsoup.connect(suburl);
connection.header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36");
connection.header("Connection", "keep-alive");
response = connection.cookies(cookies_innet).cookies(cookies).method(Connection.Method.GET).execute();
String jg_id = "";
String njdm_id = "";
Document doc11 = Jsoup.parse(response.body());
//System.out.println(doc11);
// Find the selected college and grade year.
// hasAttr also matches a bare "selected" attribute, not only selected="selected".
Elements lis = doc11.getElementsByAttributeValue("id", "jg_id").select("option");
for(Element element : lis) {
    if(element.hasAttr("selected")) {
        jg_id = element.attr("value");
        System.out.println(jg_id);
    }
}

Elements lis1 = doc11.getElementsByAttributeValue("id", "nj_cx").select("option");
for(Element element : lis1) {
    if(element.hasAttr("selected")) {
        njdm_id = element.attr("value");
        System.out.println(njdm_id);
    }
}

First get the collection of option elements under the select tag:

Elements lis = doc11.getElementsByAttributeValue("id", "jg_id").select("option");

Then iterate over them to see which option is selected. This yields the jg_id and njdm_id parameters we need, so we can now simulate submitting the query:

suburl = url + "/jwglxt/jxzxjhgl/jxzxjhck_cxJxzxjhckIndex.html?doType=query&gnmkdm=N153540&su=" + stuNum;
connection = Jsoup.connect(suburl);
connection.header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36");
connection.header("Content-Type","application/x-www-form-urlencoded;charset=utf-8");
connection.header("Connection", "keep-alive");
connection.data("jg_id", jg_id);
connection.data("njdm_id", njdm_id);
connection.data("dlbs", "");
connection.data("zyh_id", "");
connection.data("_search", "false");
connection.data("nd", String.valueOf(new Date().getTime()));
connection.data("queryModel.showCount", "15");
connection.data("queryModel.currentPage", "1");
connection.data("queryModel.sortName", "");
connection.data("queryModel.sortOrder", "asc");
connection.data("time", "1");
response = connection.cookies(cookies_innet).cookies(cookies).ignoreContentType(true).method(Connection.Method.GET).execute();
System.out.println(response.body());

Print the response and then parse it as JSON:

System.out.println(response.body());
JSONObject jsonObject = JSON.parseObject(response.body());
JSONArray table = JSON.parseArray(jsonObject.getString("items"));

Printing table:

The jxzxjhxx_id we want is indeed in there. Next, pick out the id that matches the major:

for (Iterator iterator = table.iterator(); iterator.hasNext();) {
    JSONObject lesson = (JSONObject) iterator.next();
    if(lesson.getString("zymc").equals(major)) {
        final_id = lesson.getString("jxzxjhxx_id");
    }
    System.out.println(lesson.getString("zymc") + " " +
            lesson.getString("jxzxjhxx_id"));
}
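
The loop above leaves final_id unset when no row matches the major. As a minimal sketch of a stricter lookup, the helper below stops at the first match and reports a missing match explicitly. It assumes the rows from "items" have already been converted to string maps; the class name PlanIdLookup and the method findPlanId are hypothetical and not part of the original project.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of the original project.
final class PlanIdLookup {
    private PlanIdLookup() {}

    /**
     * Returns the jxzxjhxx_id of the first row whose "zymc" (major name)
     * equals the given major, or an empty Optional if no row matches.
     * Assumption: each row is an "items" entry converted to a string map.
     */
    static Optional<String> findPlanId(List<Map<String, String>> rows, String major) {
        for (Map<String, String> row : rows) {
            // Calling equals on major avoids a NullPointerException
            // when a row has no "zymc" key.
            if (major.equals(row.get("zymc"))) {
                return Optional.ofNullable(row.get("jxzxjhxx_id"));
            }
        }
        return Optional.empty();
    }
}
```

A caller could then fail early, for example with findPlanId(rows, major).orElseThrow(...), instead of sending a request with an empty id.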

final_id is the jxzxjhxx_id we were looking for.

With the id in hand, go back to the beginning. This is the page we want to open:

String time = String.valueOf(new Date().getTime());
connection = Jsoup.connect(url + "/jwglxt/jxzxjhgl/jxzxjhck_cxJxzxjhxdyqIndex.html?jxzxjhxx_id=" + final_id + "&_=" + time + "&gnmkdm=N153540&su=" + stuNum);
connection.header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0");
response = connection.cookies(cookies_innet).cookies(cookies).method(Connection.Method.GET).ignoreContentType(true).execute();
String doc = response.body();
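
The URL above is assembled by string concatenation from the four query parameters described earlier. A small sketch of building the same query string from a map, with each value URL-encoded, might look like this. The class QueryUrlBuilder and its methods are hypothetical names introduced for illustration; the path and the fixed gnmkdm value are copied from the code above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of the original project.
final class QueryUrlBuilder {
    private QueryUrlBuilder() {}

    /** Appends the parameters to base as a URL-encoded query string. */
    static String build(String base, Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joiner.add(encode(e.getKey()) + "=" + encode(e.getValue()));
        }
        return base + "?" + joiner;
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(ex);
        }
    }

    /** Builds the course requirements URL from the four parameters. */
    static String requirementUrl(String host, String planId, String stuNum, long nowMillis) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("jxzxjhxx_id", planId);
        params.put("_", String.valueOf(nowMillis)); // current time in milliseconds
        params.put("gnmkdm", "N153540");            // fixed value from the article
        params.put("su", stuNum);                   // student number
        return build(host + "/jwglxt/jxzxjhgl/jxzxjhck_cxJxzxjhxdyqIndex.html", params);
    }
}
```

The result of requirementUrl(url, final_id, stuNum, new Date().getTime()) could be passed to Jsoup.connect in place of the concatenated string.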

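Printing doc shows that the page only contains information such as the minimum required credits and the credits already earned:

So first extract the minimum required credits and the total course credits by parsing doc; that step is not described here.
Clicking Course Details produces a new request:

Open it:

The Query String Parameters are fixed and already part of the URL, so I won't go into them. The important part is the xfyqjd_id value in the Form Data: the required, major elective, and practical course categories each have a different id, and as before, these have to be found in the page: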

System.out.println(doc);
int index11 = doc.indexOf("必修课&nbsp;最低要求学分");
String sub1 = doc.substring(index11 - 200, index11 - 100);
int index12 = sub1.indexOf("xfyqjd_id");
String zhu = sub1.substring(index12 + 11, index12 + 43);
System.out.println(sub1.substring(index12 + 11, index12 + 43));

int index21 = doc.indexOf("专选课&nbsp;最低要求学分");
String sub2 = doc.substring(index21 - 200, index21 - 100);
int index22 = sub2.indexOf("xfyqjd_id");
String zhuan = sub2.substring(index22 + 11, index22 + 43);
System.out.println(sub2.substring(index22 + 11, index22 + 43));

int index31 = doc.indexOf("实践课&nbsp;最低要求学分");
String sub3 = doc.substring(index31 - 200, index31 - 100);
int index32 = sub3.indexOf("xfyqjd_id");
String shi = sub3.substring(index32 + 11, index32 + 43);
System.out.println(sub3.substring(index32 + 11, index32 + 43));

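The page is not parsed with selectors here because this part is loaded dynamically: the id values do not show up in the parsed document, but they are present in the raw response.body() string, so the code searches the string directly.

The next step is to fetch all the courses: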
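
The fixed offsets used above (200 and 100 characters before the label, then 11 and 43 after the attribute name) break as soon as the markup shifts by a few characters. A more tolerant sketch uses a regular expression to take the nearest xfyqjd_id attribute that precedes a given label. It assumes the id appears as a quoted attribute value of the form xfyqjd_id="..." somewhere before the label text, which is inferred from the offsets and not confirmed by the page itself; the class RequirementIdExtractor is a hypothetical name.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original project.
final class RequirementIdExtractor {
    private RequirementIdExtractor() {}

    // Assumption: the id is written as xfyqjd_id="..." or xfyqjd_id='...'.
    private static final Pattern ID_ATTR =
            Pattern.compile("xfyqjd_id\\s*=\\s*[\"']([^\"']+)[\"']");

    /**
     * Returns the value of the last xfyqjd_id attribute that occurs
     * before the first occurrence of label, or empty if either is missing.
     */
    static Optional<String> idBefore(String html, String label) {
        int labelIndex = html.indexOf(label);
        if (labelIndex < 0) {
            return Optional.empty();
        }
        Matcher matcher = ID_ATTR.matcher(html.substring(0, labelIndex));
        String last = null;
        while (matcher.find()) {
            last = matcher.group(1); // keep the match closest to the label
        }
        return Optional.ofNullable(last);
    }
}
```

With this helper, zhu could be obtained as idBefore(doc, "必修课&nbsp;最低要求学分"), and likewise for zhuan and shi. Unlike the original, it does not restrict the search to a 100-character window, so it should be checked against the real page before use.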

List<Plan> data = new ArrayList<>();
connection = Jsoup.connect(url + "/jwglxt/jxzxjhgl/jxzxjhxfyq_cxJxzxjhxfyqKcxx.html?gnmkdm=N153540&su=" +  stuNum);
connection.header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0");
connection.data("xfyqjd_id", zhu);
connection.data("jdkcsx", "1");
response = connection.cookies(cookies_innet).cookies(cookies).method(Connection.Method.POST).ignoreContentType(true).execute();
JSONArray major1 = JSON.parseArray(response.body());

Plan plan1 = new Plan();
plan1.setTag("必修课");
plan1.setMinCredit(credits.get(0));
plan1.setCurrentCredit(credits.get(3));
List<SubPlan> subPlans1 = new ArrayList<>();
for (Iterator iterator = major1.iterator(); iterator.hasNext();) {
    JSONObject lesson = (JSONObject) iterator.next();
    SubPlan subPlan = new SubPlan();
    subPlan.setCourse_num(lesson.getString("KCH"));
    subPlan.setCourse_name(lesson.getString("KCMC"));
    subPlan.setCourse_nature(lesson.getString("KCXZMC"));
    subPlan.setCredit(lesson.getString("XF"));
    subPlan.setYear(lesson.getString("JYXDXNM"));
    subPlan.setSemester(lesson.getString("JYXDXQM"));
    subPlans1.add(subPlan);
    System.out.println(lesson.getString("KCH") + " " +
                    lesson.getString("KCMC") + " " +
                    lesson.getString("KCXZMC") + " " +
                    lesson.getString("XF") + " " +
                    lesson.getString("JYXDXNM") + " " +
                    lesson.getString("JYXDXQM"));
}
plan1.setPlans(subPlans1);
data.add(plan1);

connection = Jsoup.connect(url + "/jwglxt/jxzxjhgl/jxzxjhxfyq_cxJxzxjhxfyqKcxx.html?gnmkdm=N153540&su=" +  stuNum);
connection.header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0");
connection.data("xfyqjd_id", zhuan);
connection.data("jdkcsx", "1");
response = connection.cookies(cookies_innet).cookies(cookies).method(Connection.Method.POST).ignoreContentType(true).execute();
        //Document document = response.parse();
JSONArray major2 = JSON.parseArray(response.body());

Plan plan2 = new Plan();
plan2.setTag("专选课");
plan2.setMinCredit(credits.get(1));
plan2.setCurrentCredit(credits.get(4));
List<SubPlan> subPlans2 = new ArrayList<>();
for (Iterator iterator = major2.iterator(); iterator.hasNext();) {
    JSONObject lesson = (JSONObject) iterator.next();
    SubPlan subPlan = new SubPlan();
    subPlan.setCourse_num(lesson.getString("KCH"));
    subPlan.setCourse_name(lesson.getString("KCMC"));
    subPlan.setCourse_nature(lesson.getString("KCXZMC"));
    subPlan.setCredit(lesson.getString("XF"));
    subPlan.setYear(lesson.getString("JYXDXNM"));
    subPlan.setSemester(lesson.getString("JYXDXQM"));
    subPlans2.add(subPlan);
}
plan2.setPlans(subPlans2);
data.add(plan2);

connection = Jsoup.connect(url + "/jwglxt/jxzxjhgl/jxzxjhxfyq_cxJxzxjhxfyqKcxx.html?gnmkdm=N153540&su=" +  stuNum);
connection.header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0");
connection.data("xfyqjd_id", shi);
connection.data("jdkcsx", "1");
response = connection.cookies(cookies_innet).cookies(cookies).method(Connection.Method.POST).ignoreContentType(true).execute();
        //Document document = response.parse();
JSONArray major3 = JSON.parseArray(response.body());

Plan plan3 = new Plan();
plan3.setTag("实践课");
plan3.setMinCredit(credits.get(2));
plan3.setCurrentCredit(credits.get(5));
List<SubPlan> subPlans3 = new ArrayList<>();
for (Iterator iterator = major3.iterator(); iterator.hasNext();) {
    JSONObject lesson = (JSONObject) iterator.next();
    SubPlan subPlan = new SubPlan();
    subPlan.setCourse_num(lesson.getString("KCH"));
    subPlan.setCourse_name(lesson.getString("KCMC"));
    subPlan.setCourse_nature(lesson.getString("KCXZMC"));
    subPlan.setCredit(lesson.getString("XF"));
    subPlan.setYear(lesson.getString("JYXDXNM"));
    subPlan.setSemester(lesson.getString("JYXDXQM"));
    subPlans3.add(subPlan);
}
plan3.setPlans(subPlans3);
data.add(plan3);
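
The Plan and SubPlan classes used above are not shown in the article. The following is only a guess at their shape, reconstructed from the setter names in the code; the field types (String for every credit value) and the getters are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Assumed shape of one course row; setter names are taken from the code above.
class SubPlan {
    private String course_num;
    private String course_name;
    private String course_nature;
    private String credit;
    private String year;
    private String semester;

    public String getCourse_num() { return course_num; }
    public void setCourse_num(String course_num) { this.course_num = course_num; }
    public String getCourse_name() { return course_name; }
    public void setCourse_name(String course_name) { this.course_name = course_name; }
    public String getCourse_nature() { return course_nature; }
    public void setCourse_nature(String course_nature) { this.course_nature = course_nature; }
    public String getCredit() { return credit; }
    public void setCredit(String credit) { this.credit = credit; }
    public String getYear() { return year; }
    public void setYear(String year) { this.year = year; }
    public String getSemester() { return semester; }
    public void setSemester(String semester) { this.semester = semester; }
}

// Assumed shape of one course category (required, major elective, practical).
class Plan {
    private String tag;
    private String minCredit;     // assumption: credits are kept as strings
    private String currentCredit;
    private List<SubPlan> plans = new ArrayList<>();

    public String getTag() { return tag; }
    public void setTag(String tag) { this.tag = tag; }
    public String getMinCredit() { return minCredit; }
    public void setMinCredit(String minCredit) { this.minCredit = minCredit; }
    public String getCurrentCredit() { return currentCredit; }
    public void setCurrentCredit(String currentCredit) { this.currentCredit = currentCredit; }
    public List<SubPlan> getPlans() { return plans; }
    public void setPlans(List<SubPlan> plans) { this.plans = plans; }
}
```

If credits in the real project is a list of numbers, the two credit fields would need a numeric type instead.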

Final result:
