Scraping a Question Bank from an Education Platform with Python 3 and Saving It as a Word Document

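I have been playing with a Raspberry Pi lately, so I took the chance to brush up on Python. As it happened, a friend asked me to print the exam questions from an education platform for him (he is enrolled and has an account and password). The last time he asked me to print them, I paid a print shop to type up and format everything by hand before printing; looking back, I could kick myself for how dumb that was. This time I spent most of a day writing this script, partly to help my friend and partly to give myself some practice.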

^_^ Tested and working! The Cookie used in the code has been removed; this post only records the process.

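Before writing any code you need a tool for capturing traffic: Fiddler. Alternatively, install the Chrome extension https://github.com/welefen/Fiddler. That GitHub project has been discontinued, but the extension can still be downloaded from http://www.cnplugins.com/devtool/fiddler/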

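First, open the site's login page (I used the Fiddler extension here), enter the account and password, and go to "My Question Bank"; the site's request data then shows up in Fiddler. Many simulated-login scripts start from the login page, submit the account and password, and then pick up the Cookie. Since this is a one-off script, I skipped all that and passed the Cookie directly in the request headers to simulate working through the questions (a simulated login has since been added; see the full code).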
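As a rough illustration of passing a captured Cookie straight into the request headers, here is a minimal sketch using the standard library's urllib. The cookie value, the User-Agent string and the helper names `build_headers` and `build_request` are placeholders assumed for this example; they are not taken from the original script.

```python
from urllib import request

# Placeholder: paste the Cookie header value captured in Fiddler here.
COOKIE = "PHPSESSID=your-session-id"

URL = ("http://i.sxmaps.com/index.php/Lessontiku/questionsmore_manage"
       "/subjectid/111/sectionid/5014/p/3/majorid_sx/38/classid_sx/24")


def build_headers(cookie):
    """Return request headers that carry the logged-in session cookie."""
    return {
        "User-Agent": "Mozilla/5.0",  # assumed value; any browser-like string
        "Cookie": cookie,
    }


def build_request(url, cookie):
    """Create a Request object; nothing is sent until an opener opens it."""
    return request.Request(url, headers=build_headers(cookie))


req = build_request(URL, COOKIE)
opener = request.build_opener()
# Calling opener.open(req) would send the request with the cookie attached.
```

Keeping the headers in one helper means the same dictionary can be reused for every question page that is requested later.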

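On the page above, the course 09235《设计原理》 (Design Principles) has five sets of questions. Right-clicking and choosing "View page source" shows the following information: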

(the course name is wrapped in here)
......
Question set name
......
2/total number of questions
......
Exam name
......
2/total number of questions
Continue
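Each exam entry on the course page links to its question page, and the exam ID sits in that link between `sectionid/` and `/subjectid`. As an alternative to slicing the page by hand, the IDs could be collected with a regular expression. This is only a sketch: the sample HTML below is made up for illustration, and `find_section_ids` is a hypothetical helper name.

```python
import re

# Matches the exam ID embedded in each exam link.
SECTION_LINK = re.compile(
    r"index\.php/Lessontiku/questionsmore_manage/sectionid/(\d+)/subjectid"
)

# Made-up markup standing in for the real course page.
SAMPLE_HTML = """
<a href="/index.php/Lessontiku/questionsmore_manage/sectionid/5014/subjectid/111/p/2">Continue</a>
<a href="/index.php/Lessontiku/questionsmore_manage/sectionid/5015/subjectid/111/p/2">Continue</a>
<a href="/index.php/Lessontiku/questionsmore_manage/sectionid/5014/subjectid/111/p/2">Continue</a>
"""


def find_section_ids(html):
    """Return the exam IDs found in html, in page order, without duplicates."""
    seen = []
    for section_id in SECTION_LINK.findall(html):
        if section_id not in seen:
            seen.append(section_id)
    return seen


section_ids = find_section_ids(SAMPLE_HTML)
```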

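From the pseudo page source above, all we need is the course name, the exam name, the total number of questions and the exam ID. We create a folder named after the course, then use the exam ID to work out the URL of each exam: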

    import os

    def analyse(html, start_s, end_s):
        '''Return the substring of html between start_s and end_s.'''
        start = html.find(start_s) + len(start_s)
        tmp_html = html[start:]
        end = tmp_html.find(end_s)
        return tmp_html[:end].strip()

    def analyse_lesson(opener, headers, html):
        '''Parse the course list.'''
        # Get the course name
        tmp_folder = analyse(html, "", "")
        folder = analyse(tmp_folder, "", "")
        # Create the folder and make it the current working directory
        print("Creating folder (%s)..." % folder)
        if not os.path.exists(folder):
            os.mkdir(folder)
        os.chdir(folder)
        # Loop over every exam in the course
        lesson_html = analyse(html, "", "")
        while True:
            tmp_html = analyse(lesson_html, "", "")
            lesson_html = analyse(lesson_html, tmp_html, "")
            sectionid = analyse(tmp_html, "index.php/Lessontiku/questionsmore_manage/sectionid/", "/subjectid")
            # Parse each exam
            analyse_exam(opener, headers, tmp_html, sectionid)
            if not tmp_html or not lesson_html:
                break

Before parsing the exams, let's look at what the question page looks like. The URL of one exam is:

http://i.sxmaps.com/index.php/Lessontiku/questionsmore_manage/subjectid/111/sectionid/5014/p/3/majorid_sx/38/classid_sx/24

When we click "Next question", the URL changes to:

http://i.sxmaps.com/index.php/Lessontiku/questionsmore_manage/subjectid/111/sectionid/5014/p/4/majorid_sx/38/classid_sx/24

The number after p/ has changed from 3 to 4, so this number is the page number. Next, let's switch to a different exam:

http://i.sxmaps.com/index.php/Lessontiku/questionsmore_manage/sectionid/5015/subjectid/111/p/2/classid_sx/24/majorid_sx/38

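This time the number after sectionid/ has changed from 5014 to 5015, so this number is the exam ID. With that, it becomes clear how to download all the questions: use two nested loops, the outer one stepping through the exam IDs and the inner one through the question pages. The request URL can be written like this: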

result_url = 'http://i.sxmaps.com/index.php/Lessontiku/questionsmore_manage/subjectid/111/sectionid/%s/p/%d/majorid_sx/38/classid_sx/24' % (sectionid, index)
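The two-loop idea can be sketched as a small generator that yields every question URL, together with the reverse operation of reading the exam ID and page number back out of a URL. The function names and the question counts in the usage line are assumptions made for this example.

```python
import re

BASE = "http://i.sxmaps.com/index.php/Lessontiku/questionsmore_manage"


def build_url(sectionid, page):
    """Build the URL of one question page from the exam ID and page number."""
    return "%s/subjectid/111/sectionid/%s/p/%d/majorid_sx/38/classid_sx/24" % (
        BASE, sectionid, page)


def iter_question_urls(exams):
    """Yield every question URL.

    exams maps an exam ID to its total number of questions.
    """
    for sectionid, total in exams.items():      # outer loop: exam IDs
        for page in range(1, total + 1):        # inner loop: question pages
            yield build_url(sectionid, page)


def parse_url(url):
    """Return (exam ID, page number) read back from a question URL."""
    section = re.search(r"/sectionid/(\d+)", url)
    page = re.search(r"/p/(\d+)", url)
    return section.group(1), int(page.group(1))


# Made-up question counts, just to show the iteration order.
urls = list(iter_question_urls({"5014": 2, "5015": 1}))
```

Because `parse_url` searches for each path segment on its own, it works for both segment orders seen in the URLs above.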

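Right-click and choose "View page source" again to see where the question type, the question text, the options (only single-choice and multiple-choice questions have them) and the correct answer sit in the source. The pseudo page source looks like this: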

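Question type
Question text
......
# Options appear only for single-choice and multiple-choice questions
Option 1
Option 2
Option 3
Option 4
Correct answer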
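To make the layout concrete, here is a minimal sketch that pulls those four pieces out of a page with plain string searches. The tag names, class names and the sample question are invented for illustration, since the real markup of the site is not reproduced here; `parse_question` is a hypothetical helper.

```python
def analyse(html, start_s, end_s):
    """Return the text between start_s and end_s, or "" if either is missing."""
    start = html.find(start_s)
    if start == -1:
        return ""
    start += len(start_s)
    end = html.find(end_s, start)
    if end == -1:
        return ""
    return html[start:end].strip()


# Invented markup and question, standing in for a real question page.
SAMPLE = """
<div class="q-type">[单选题]</div>
<div class="q-title">Sample question text</div>
<ul class="q-options"><li>Option 1</li><li>Option 2</li><li>Option 3</li><li>Option 4</li></ul>
<div class="q-answer">B</div>
"""


def parse_question(html):
    """Extract type, title, options and answer from one question page."""
    q_type = analyse(html, '<div class="q-type">', "</div>")
    title = analyse(html, '<div class="q-title">', "</div>")
    options = []
    # Only single-choice ([单选题]) and multiple-choice ([多选题]) have options.
    if q_type in ("[单选题]", "[多选题]"):
        block = analyse(html, '<ul class="q-options">', "</ul>")
        options = [part.split("</li>")[0].strip()
                   for part in block.split("<li>")[1:]]
    answer = analyse(html, '<div class="q-answer">', "</div>")
    return {"type": q_type, "title": title, "options": options, "answer": answer}


question = parse_question(SAMPLE)
```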

At this point we can write the code that parses all the questions:

    from urllib import request, error

    def analyse_exam(opener, headers, html, sectionid):
        '''Parse an exam's title and total number of questions.'''
        # Get the title
        title = analyse(html, "", "")
        # Get the total number of questions (the part after "/")
        total_size = analyse(html, "", "")
        start = total_size.find("/") + 1
        total_size = total_size[start:]
        print("Downloading (%s), total questions: %s" % (title, total_size))
        # exam_doc and answers_doc are expected to exist before this loop (not shown here)
        # Loop over the questions
        for index in range(1, int(total_size) + 1):
            result_url = 'http://i.sxmaps.com/index.php/Lessontiku/questionsmore_manage/subjectid/111/sectionid/%s/p/%d/majorid_sx/38/classid_sx/24' % (sectionid, index)
            item_request = request.Request(result_url, headers=headers)
            try:
                response = opener.open(item_request)
                html = response.read().decode('utf-8')
                exam_doc = analyse_item(index, html, exam_doc)
                answers_doc = analyse_answers(index, html, answers_doc)
            except error.URLError as e:
                if hasattr(e, 'code'):
                    print("HTTPError:%d" % e.code)
                elif hasattr(e, 'reason'):
                    print("URLError:%s" % e.reason)

    def analyse_item(index, html, exam_doc):
        '''Parse the details of a single question.'''
        # Question type
        type_s = ""
        start = html.find(type_s) + len(type_s)
        tmp_html = html[start:]
        end = tmp_html.find("")
        start = end - 5
        exam_type = tmp_html[start:end].strip()
        # Title
        title = analyse(tmp_html, "", "")
        paragraph = "%d.%s %s" % (index, exam_type, title)
        print("Title: %s" % paragraph)
        # '[单选题]' is single-choice, '[多选题]' is multiple-choice
        if exam_type in ('[单选题]', '[多选题]'):
            # Options
            options = []
            while True:
                option_s = ""
                end_s = "确定"  # the "Confirm" text on the page
                end_div_s = ""
                if tmp_html.find(option_s)
