
[Python Crawler Project] Scraping articles from the 考研帮 app (detailed, good beginner practice, source code included)

Scraping articles from the 考研帮 app

1. Preparation

Libraries used in this article: requests, json, threading, lxml, pandas. Tools used: PyCharm, Fiddler.
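json and threading ship with the Python standard library; the third-party packages can be installed with pip. One assumption on my part: the to_excel call at the end of this post also needs an Excel engine such as openpyxl, which the original doesn't list.

```
pip install requests lxml pandas openpyxl
```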

2. Packet capture

Search for the content you want to scrape in the app and capture the traffic with Fiddler.

3. Analysis

Inspecting the captured packet, we can see that the JSON response contains the data we want.

Copy the JSON and pretty-print it: under data there are 15 objects. Let's open one and take a look.
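As a quick check, you can replay the captured request in Python and confirm this structure yourself. A minimal sketch, using the URL, headers, and result/data path that appear in the full code later in this post:

```python
import requests

# Replay the request captured in Fiddler (URL taken from the code below)
url = 'http://api.qz100.com/api-cpp/news/search?search_content=%E7%AC%94%E8%AE%B0&page=1'
headers = {'Accept': 'application/json', 'User-Agent': 'okhttp/3.12.6'}

resp = requests.get(url, headers=headers).json()
data = resp['result']['data']

print(len(data))               # 15 objects, per the capture above
print(sorted(data[0].keys()))  # field names of the first article
```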

From data we extract the information we want: abstract (the summary), title, uuid (the article id), and name (the author).

Now look at the request URL: /api-cpp/news/search?search_content=%E7%AC%94%E8%AE%B0&page=1. Seeing page=1, we can reasonably guess that this is the first page of results, so changing the 1 to a 2 should give the second page.

```python
# Build the URLs for the first 10 pages
def url_craet():
    url = 'http://api.qz100.com/api-cpp/news/search?search_content=%E7%AC%94%E8%AE%B0&page={}'
    url_list = [url.format(i) for i in range(1, 11)]
    return url_list
```
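To test the pagination guess before writing the full crawler, a small sketch (my addition, reusing the headers from the check above) that prints the first title on page 1 and page 2; if the page parameter works as guessed, the two titles differ:

```python
import requests

headers = {'Accept': 'application/json', 'User-Agent': 'okhttp/3.12.6'}

for page_url in url_craet()[:2]:
    data = requests.get(page_url, headers=headers).json()['result']['data']
    print(page_url[-6:], data[0]['title'])
```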

Capturing the request for the first article, we find its URL is /a/?id=dfb029db9b304ea5ac8e32a298a16de9, i.e. the article id is dfb029db9b304ea5ac8e32a298a16de9, which is exactly the uuid of the first article in the JSON. We then send a request for the article page, locate its content element with the browser developer tools, and extract the article text with XPath.
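Put together for a single article, the fetch-and-extract step looks roughly like this. The article URL prefix and the XPath //*[@class='content']//text() are taken from the full code below; the trimmed-down User-Agent is my simplification:

```python
import requests
from lxml import etree

article_url = 'http://harbor.qz100.com/a/?id=dfb029db9b304ea5ac8e32a298a16de9'
headers = {'User-Agent': 'Mozilla/5.0'}  # minimal UA; the full code sends the app's real headers

html = requests.get(article_url, headers=headers).text
tree = etree.HTML(html)

# Join every text node under the element with class="content"
content = ''.join(tree.xpath("//*[@class='content']//text()"))
print(content)
```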

4. Code

Compared with the original post, the code below shares one module-level lock across threads (creating a new RLock inside each call synchronizes nothing), joins every page thread before the article threads start, and holds the lock only around shared-state updates so the requests actually run concurrently.

```python
import requests
import json
import threading
from lxml import etree
import pandas as pd

# Shared list collecting one dict per article
count = list()
# One module-level lock shared by all threads
rlock = threading.RLock()

# Build the URLs for the first 10 pages
def url_craet():
    # Base url
    url = 'http://api.qz100.com/api-cpp/news/search?search_content=%E7%AC%94%E8%AE%B0&page={}'
    url_list = [url.format(i) for i in range(1, 11)]
    return url_list

# Parse one listing page
def url_parse(url):
    headers = {
        'Accept': 'application/json',
        'Accept-Encoding': 'gzip',
        'User-Agent': 'okhttp/3.12.6',
    }
    response = requests.get(url=url, headers=headers).text
    response = json.loads(response)
    data = response['result']['data']
    # Lock while appending to the shared list
    rlock.acquire()
    for i in data:
        # Title
        title = i['title']
        # Author
        author = i['tenant']['name']
        # Abstract
        abstract = i['abstract']
        # Article id
        article_id = i['uuid']
        dic = {
            '标题': title,
            '作者': author,
            '摘要': abstract,
            '文章id': 'http://harbor.qz100.com/a/?id=' + article_id,
        }
        print(dic)
        # Append to the shared list
        count.append(dic)
    # Unlock
    rlock.release()

# Scrape one article page
def content_parse(i, url):
    headers = {
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 7.1.2; VOG-AL10 Build/HUAWEIVOG-AL10; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/75.0.3770.143 Mobile Safari/537.36 KaoYanClub-Android/4.0.4',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
        'Cookie': 'sajssdk_2015_cross_new_user=1; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2217ad1d711d44e-032036cf000b1a-6373260-2073600-17ad1d711d51e2%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%7D%2C%22%24device_id%22%3A%2217ad1d711d44e-032036cf000b1a-6373260-2073600-17ad1d711d51e2%22%7D',
        'X-Requested-With': 'com.tal.kaoyan',
    }
    contents = requests.get(url=url, headers=headers).text
    tree = etree.HTML(contents)
    a = tree.xpath("//*[@class='content']//text()")
    a = ''.join(a)
    print(a)
    # Lock only while mutating the shared dict
    rlock.acquire()
    i['内容'] = a
    rlock.release()

# Main function
def run():
    links = url_craet()
    # Start one thread per listing page, then wait for all of them,
    # so count is complete before the article threads start
    page_threads = [threading.Thread(target=url_parse, args=(url,)) for url in links]
    for t in page_threads:
        t.start()
    for t in page_threads:
        t.join()
    # One thread per article, same start-all-then-join-all pattern
    article_threads = [threading.Thread(target=content_parse, args=(i, i['文章id'])) for i in count]
    for t in article_threads:
        t.start()
    for t in article_threads:
        t.join()
    data = pd.DataFrame(count)
    data.to_excel("考研帮.xlsx", index=False)

if __name__ == '__main__':
    run()
```

5. Results

It takes less than 5 seconds to finish scraping the first 10 pages of articles.
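A design note: spawning one thread per URL works at this scale, but the standard library's thread pool gives the same fan-out-then-join pattern with less bookkeeping. A sketch of how run() could be restructured with concurrent.futures, reusing url_parse, content_parse, and count from above (my alternative, not the original post's code):

```python
from concurrent.futures import ThreadPoolExecutor

def run_pooled():
    with ThreadPoolExecutor(max_workers=10) as pool:
        # Fetch the 10 listing pages concurrently; map() blocks until all finish
        list(pool.map(url_parse, url_craet()))
        # Then fetch every article page concurrently
        list(pool.map(lambda i: content_parse(i, i['文章id']), count))
    pd.DataFrame(count).to_excel("考研帮.xlsx", index=False)
```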
