How to use requests

x33g5p2x, reposted on 2021-09-19 under Other

Fetching data directly

    import requests

requests is a third-party Python library for making network requests over HTTP.

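Sending requests

1. requests.get(url, *, headers, params, proxies): send a GET request

2. requests.post(url, *, headers, params, proxies): send a POST request

Parameters (everything after url is passed as a keyword argument):

url: the request address (a website URL, an API endpoint, an image URL, and so on)

headers: the request headers (used to set the cookie and User-Agent)

params: the query parameters

proxies: the proxy settings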

For a GET request, the parameters can be appended directly to the URL:

    response = requests.get('http://api.tianapi.com/auto/index?key=c9d408fefd8ed4081a9079d0d6165d43&num=10')
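Instead of concatenating the query string by hand, the same parameters can be passed as a dictionary through params. The sketch below only prepares the request without sending it, so that the resulting URL can be inspected; YOUR_KEY is a placeholder, not a working key.

```python
import requests

# YOUR_KEY is a placeholder; the endpoint is the one from the example above.
request = requests.Request(
    'GET',
    'http://api.tianapi.com/auto/index',
    params={'key': 'YOUR_KEY', 'num': 10},
)
prepared = request.prepare()   # builds the request without sending it
print(prepared.url)            # the query string has been appended to the URL
```

Calling requests.get('http://api.tianapi.com/auto/index', params={...}) builds the URL in the same way and then sends the request.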

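For a POST request, the parameters are set through params. Note that requests still places params in the URL query string; form fields that belong in the request body are passed through data instead.

    params = {
        'key': 'c9d408fefd8ed4081a9079d0d6165d43',
        'num': 10
    }
    response = requests.post('http://api.tianapi.com/auto/index', params=params)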
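To make the difference between params and data concrete, the sketch below prepares two POST requests without sending them and prints where the fields end up. YOUR_KEY is a placeholder, and whether this particular API expects its fields in the query string or in the body is not something the original notes state.

```python
import requests

url = 'http://api.tianapi.com/auto/index'
fields = {'key': 'YOUR_KEY', 'num': 10}   # YOUR_KEY is a placeholder

# params=... goes into the URL, even for a POST request
via_params = requests.Request('POST', url, params=fields).prepare()
# data=... is form-encoded into the request body
via_data = requests.Request('POST', url, data=fields).prepare()

print(via_params.url)    # URL with the query string appended
print(via_params.body)   # no body was set
print(via_data.url)      # URL unchanged
print(via_data.body)     # form-encoded fields
```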
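
    response = requests.get('http://www.yingjiesheng.com/')

Set the encoding. This is only needed when the text comes back garbled; the page's encoding is declared in its <head> tag, which can be inspected in the browser's developer tools.

    response.encoding = 'GBK'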
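Looking up the charset in the <head> tag can also be done in code. The helper below is a minimal sketch using only the standard library; the name detect_meta_charset and the sample page are made up for illustration.

```python
import re


def detect_meta_charset(html_bytes, default='utf-8'):
    """Return the charset declared in a <meta> tag, or default if none is found."""
    # Only the first 2 KB are scanned, since the declaration sits in <head>.
    head = html_bytes[:2048].decode('ascii', errors='ignore')
    match = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', head, re.IGNORECASE)
    return match.group(1).lower() if match else default


# A made-up GBK-encoded page standing in for response.content
sample = '<html><head><meta charset="GBK"><title>招聘</title></head></html>'.encode('gbk')
encoding = detect_meta_charset(sample)
print(encoding)
print(sample.decode(encoding))
```

With a real response this could be used as response.encoding = detect_meta_charset(response.content). requests also exposes response.apparent_encoding, which guesses the encoding from the bytes themselves.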

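Getting the response headers

    print(response.headers)

Getting the response body

    # a. text: the decoded page source (for requests that return a web page)
    print(response.text)
    # b. json(): the parsed JSON result (for data APIs that return JSON)
    print(response.json())
    # c. content: the raw bytes (for downloading images, video and audio)
    print(response.content)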
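The three accessors are three views of the same bytes. The sketch below imitates that relationship with the standard library only; the payload is invented, and the comments describe roughly what requests does rather than its exact implementation.

```python
import json

# Stand-in for response.content: the raw bytes a server might send back.
content = '{"code": 200, "msg": "成功"}'.encode('utf-8')

text = content.decode('utf-8')   # roughly what response.text gives once the encoding is known
data = json.loads(text)          # roughly what response.json() returns

print(type(content).__name__, type(text).__name__, type(data).__name__)
print(data['msg'])
```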

Adding request headers

Adding a User-Agent

    headers = {
        # Present the request as coming from a browser
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4542.2 Safari/537.36'
    }

Fetching the page data

    response = requests.get('https://www.51job.com/', headers=headers)
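When many requests need the same headers, one option is to set them once on a requests.Session. The sketch below prepares a request through the session without sending it, to show that the header is attached; the shortened User-Agent string is just for illustration.

```python
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # shortened for the sketch
})

# Build the request the session would send, without sending it.
prepared = session.prepare_request(requests.Request('GET', 'https://www.51job.com/'))
print(prepared.headers['User-Agent'])
```

After this setup, session.get(url) sends the header on every request without passing headers each time.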

Adding a cookie

Add a cookie for pages that can only be reached after logging in.

    headers = {
        # Present the request as coming from a browser
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4542.2 Safari/537.36',
        # Skip the login step by sending the cookie of a logged-in browser session
        'cookie': '_zap=1667ed8e-6095-4318-89ec-330c6ef0b8ee; d_c0="AMCve_TnZRCPTh3LhiwHGSuGFq7l4XAr6UE=|1574496707"; _ga=GA1.2.1669928704.1583753947; _9755xjdesxxd_=32; YD00517437729195%3AWM_TID=4dmOpmKKlNhAEUAEBUYqyFtCZ219gaGq; __snaker__id=iRbYMmY5zvFTkQUm; _xsrf=W3e22M3aL6nZ9ZTPAfuxTjelvXUcstnM; YD00517437729195%3AWM_NI=OdDI%2FzdmzhFqywo4cAVWWYPWnNiJrU6p%2FZ6OQ%2BjdzwAOU6Hhd3ew%2Fym8NqBmEq2q%2BwGmAfs605pBNi%2FWHBKGmL9J9OsMdf%2FEaRsTp9tJRBthnh%2Fi3b5l6HTOzQca8GmWSng%3D; YD00517437729195%3AWM_NIKE=9ca17ae2e6ffcda170e2e6eea6d880e98ba7afcd68f79a8bb3c44b979f9a85f573aceebe97ca6598b9ba91fb2af0fea7c3b92a8dbc8caaae72a8b3b689c57bbba9a1d5e24b968999bad35dae92a7abf15ff6b0e1bbb54593f5a685f97d83919bd7cb64a8edab98ae73afe88292c16787affe92aa7d85b38f87f23ef4ebc0a3cd63ed8988a3ca4daf94f7abbc45b193ac87ae4fa38ebdb5f333b8969ea3f764aeb8e18ab26af7b4faa9d745ba8fad82bb4daf899d8fe637e2a3; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1628255651,1628596803,1628599663,1628664693; SESSIONID=TTagVSrOcand6lnQK3Tx9P3QfIUJQzCSuwiM8mBdAez; JOID=UFgVB07VQ9GbCjEQQ9PWTZvaJpBQgxe0_VtgYgixIL_7PVp8EqM0x_wPMxRG_xEMMhp1TrjTjeBgzXIl0sn6nZM=; osd=U1sVBEjWQNGYDDITQ9DQTpjaJZZTgBe3-1hjYgu3I7z7Plx_EaM3wf8MMxdA_BIMMRx2TbjQi-NjzXEj0cr6npU=; gdxidpyhxdE=Os7BNOHW1sAPAdCRxVjImRb%5CwvHiXuqyeKWE%2BDdcs5%2Fx4LeRYQ56kPLuEZUvnGyGU0vPHAv35sZ3GWqnI2bOR8Udxg5iLXq6fNMuqkowIZ2Z2%5CZOCRh3PvZu3NluLjDKO9H0HSCg0iYDEXUrbPrKDZ4iZsPvjwbC5pHrj4Niel4djY9B%3A1628665594139; r_cap_id="2|1:0|10:1628664703|8:r_cap_id|44:YWI1NDJjZWMyN2NkNDI3MGI2YWQ3MWFkOTNlYzUwN2Q=|0ac1b35368b1b87333d0d8910f9b90e172f777bd7aba8b213955aeb522ebbf3a"; l_n_c=1; cap_id="ZDM0MjhjMmEyN2Y3NDM4ZGIxMzM2YTY4NThmMThjOTE=|1628664713|5484319ad8db598aff3bdfce4d09effda9863624"; l_cap_id="ZjAwNTAzNjA3YzlmNDUzNWIyN2Y1OWE3M2VmZjFhOWI=|1628664713|eb132e13718f49ec0e33cb23468ea33d39b1027e"; n_c=1; atoken=D6D67B4069B60CE4499DFCBCDA83299B; atoken_expired_in=7776000; client_id="OUQxOTkwRTI4MTQUM3RDQzRTM1M0U=|1628664722|80c987621bc12b757cf3578041cfe972142d55ee"; capsion_ticket="2|1:0|10:1628664723|14:capsion_ticket|44:NWJjNDFjMmIwMzBiNGYyMTg4NTA1MWYxNWFkNjFmYmU=|4393e57fcf2fc0719cfac5005f054d90f3a79973be8d3633fa2dffc23e7231a1"; captcha_session_v2="2|1:0|10:1628664888|18:captcha_session_v2|88:QWUweXlmcXNHeGVBcTJMRXlDcVZmSWdEQW1rdEhrNTdlUXhGYThobm8xalZ3SUNwUmxiM2FoM2diTTBIWnJjSw==|d3176dd52a4897549525550efa2ff752507bb7b98c8db2563812f2b91f51ca6d"; captcha_ticket_v2="2|1:0|10:1628665001|17:captcha_ticket_v2|704:eyJ2YWxpZGF0ZSI6IkNOMzFfSkpWQ243QVFzVlJ0QnFoaElrMUdJSFouY1M0VTFndVZIS0RpTXFXa0ZZUUdLSWNzQjFxT3dPdUtUSTFtZEkxTHFneDZaNFZrUjlsTEZxU1J4aXFHZ1VrMEJPWG91YmZwbVl6MHFSQUtBMm5MbEpyOWZ5NGFtMjR3RGNodC5GYkoycnVlaV9aTEtqc09CMjVDa01xdG12blJ0aE1oZ0NjLVRMV0h1OHJqUFc3MkVtMlJKUXN2aWlRekI5MXRrQkN0OEpLajZlN2phbllBYm1PWjFONXZ5alpZVFBPQzhGZkdaX1FHR18tMjdXcW42bEtqVjRRazdRSk9mVzIxVTBtb1lVcjhSS3p4UWVlODFVSkwyYnE4QmtSRFJLb0RkQTdmeE4uWGd6U0Y2S0dtQlJkTnRnQlc5Y1RMN3NrdERBQUdfU0dUMG51RUNYamI4Tkc0ODFfQllsaHFYYk84ZlRRVy4yejBiYVJDeGpLbHc3eHlLOFNWUXFTWlhod1VpcUx4SERWQWZNN0hpcnM3dFJqYUhHSWtqLXVGVklMNzBJbDhZVU9NLkdYMGpQZDkxN1ktUldlR1hmYmdDTlk4dDhuU25NUmVoNmhsVmxnU0w1WW5hd1BvMW44RUtfbl9kWVRfT2hDVWx6cks3Und1U2hlV3dpeUFhTUxOTVhaMyJ9|8944f13d09a8e61fd716fe8586d7543635409716f0fec0375a18299b844bfc76"; z_c0="2|1:0|10:1628665057|4:z_c0|92:Mi4xRmI0UU1RQUFBQUFBd0s5NzlPZGxFQ1lBQUFCZ0FsVk40Y0lBWWdDSDNQZFhUSW5NT2RtS1JMbWJPMUdTaUxBRmxR|ffeef5b0187eb3b1d598cb47fedb36fe8dfe811d39dc8c6d330dbe9c5bc97ec7"; tst=r; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1628665060; KLBRSID=dc02df4a8178e8c4dfd0a3c8cbd8c726|1628665063|1628664691'
    }
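An alternative to sending the raw string in headers is the cookies argument of requests, which takes a dictionary. The helper below is a small sketch of turning a copied cookie string into such a dictionary; the function name parse_cookie_header and the sample values are made up for illustration.

```python
def parse_cookie_header(raw):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in raw.split(';'):
        # Split on the first '=' only, because cookie values may contain '=' themselves.
        name, sep, value = pair.strip().partition('=')
        if sep:                      # skip empty or malformed fragments
            cookies[name] = value
    return cookies


raw_cookie = '_xsrf=abc123; tst=r; token=QUJD=='   # made-up values for the sketch
cookies = parse_cookie_header(raw_cookie)
print(cookies)
```

The result can then be passed as requests.get(url, headers=headers, cookies=cookies).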

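Where to find the User-Agent and cookie: both appear in the request headers shown by the browser's developer tools.

The cookie is in the same place as the User-Agent.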

Parsing JSON data

Besides extracting data from the page source, it is worth checking whether the page gets its data from a JSON API.

    # Fetch JSON data from a JSON API (Toutiao)
    response = requests.get('https://www.toutiao.com/hot-event/hot-board/?origin=toutiao_pc&_signature=_02B4Z6wo00f01MS82QAAAIDARL4jQ3jJuyjEmN2AAFBBfDYzO5Ue5Tr6dyHZEvYM5aPjg9xOHE3LKbbaAQksLHvKEx9q4O10B2Py6VQFVZhhEklIQRg.uiWqNZdo-z5rHcpbpKbT0oYa8I0j61')
    all_title = response.json()
    # print(all_title)
    for title in all_title['data']:
        print(title['Title'])
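API responses do not always contain every key, so indexing directly can raise KeyError. The sketch below shows a more defensive way to walk the same kind of structure; the payload is invented and only mirrors the data/Title shape used above.

```python
import json

# Hypothetical payload: a 'data' list whose items usually, but not always, carry a 'Title'.
payload = json.loads('''
{"data": [{"Title": "first headline"}, {"Title": "second headline"}, {"Url": "no title here"}]}
''')

# .get() falls back to an empty list if 'data' is missing; items without 'Title' are skipped.
titles = [item['Title'] for item in payload.get('data', []) if 'Title' in item]
print(titles)
```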

Finding the JSON API:

Follow the same steps as for the User-Agent, then click Preview to inspect the response data.

Scraping images

    import requests


    def download_image(url: str):
        # Request the image data from the network
        response = requests.get(url)
        data = response.content
        # print(data)
        # The test/images directory must already exist; with closes the file when done
        with open(f'test/images/{url.split("/")[-1]}', 'wb') as file:
            file.write(data)


    if __name__ == '__main__':
        download_image('https://p5.toutiaoimg.com/img/pgc-image/9f5d102756354b6db8fa9408c57d01c8~cs_noop.png')
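For larger files, or URLs that carry a query string, a slightly more defensive variant may be useful. The following is a sketch under assumptions, not the author's method: the helper filename_from_url, the 10-second timeout, the chunk size and the ?from=pc suffix in the demo URL are all choices made for illustration.

```python
import os
from urllib.parse import urlsplit

import requests


def filename_from_url(url):
    """Take the last path segment of the URL, ignoring any query string."""
    return os.path.basename(urlsplit(url).path)


def download_image(url, folder='test/images'):
    os.makedirs(folder, exist_ok=True)            # create the folder if it is missing
    path = os.path.join(folder, filename_from_url(url))
    with requests.get(url, stream=True, timeout=10) as response:
        response.raise_for_status()               # stop on 4xx/5xx instead of saving an error page
        with open(path, 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)                 # write the body piece by piece
    return path


# The '?from=pc' suffix is hypothetical; it only shows that the query string is dropped.
print(filename_from_url('https://p5.toutiaoimg.com/img/pgc-image/9f5d102756354b6db8fa9408c57d01c8~cs_noop.png?from=pc'))
```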

Extracting data with selectors

    from bs4 import BeautifulSoup

1. Prepare the page data to parse (in practice it is fetched with requests or selenium)

    with open('test/practice.html', encoding='utf-8') as f:
        data = f.read()

2. Create a BeautifulSoup object (it automatically repairs malformed HTML structure in the data)

    # BeautifulSoup(data, parser)
    soup = BeautifulSoup(data, 'lxml')

The lxml parser is a separate package and has to be installed first (for example with pip install lxml).

3. Use the BeautifulSoup object to get tags and their content

    # 1) Getting tags
    # soup.select(css_selector): returns every tag matched by the CSS selector (a list of tag objects)
    # soup.select_one(css_selector): returns the first tag matched by the CSS selector (a tag object)
    result = soup.select('p')
    print(result)  # [<p>我是段落1</p>, <p>我是段落2</p>, <p id="p1">我是超链接3</p>]
    result = soup.select_one('p')
    print(result)  # <p>我是段落1</p>
    result = soup.select('#p1')
    print(result)  # [<p id="p1">我是超链接3</p>]
    result = soup.select_one('#p1')
    print(result)  # <p id="p1">我是超链接3</p>
    result = soup.select('div p')
    print(result)  # [<p>我是段落2</p>, <p id="p1">我是超链接3</p>]
    result = soup.select('div>p')
    print(result)  # [<p>我是段落2</p>]

    # 2) Getting tag content
    # a. tag.string: the text inside the tag (only works when the content is plain text, otherwise None)
    p2 = soup.select_one('div>p')
    print(p2)  # <p>我是段落2</p>
    print(p2.string)  # '我是段落2'
    s1 = soup.select_one('#s1')
    print(s1)  # <span id="s1">我是<b>span1</b></span>
    print(s1.string)  # None
    # b. tag.get_text(): all of the text inside the tag, including text in nested tags
    print(p2.get_text())  # '我是段落2'
    print(s1.get_text())  # '我是span1'
    # c. tag.contents: a list of the tag's direct children (strings and tags)
    print(p2.contents)  # ['我是段落2']
    result = s1.contents
    print(result)  # ['我是', <b>span1</b>]
    print(result[-1].get_text())  # 'span1'

    # 3) Getting tag attributes
    # tag.attrs['attribute name']
    a1 = soup.select_one('div>a')
    print(a1)  # <a href="https://www.baidu.com">我是超链接2</a>
    print(a1.attrs['href'])  # 'https://www.baidu.com'
    img1 = soup.select_one('img')
    print(img1)  # <img alt="" src="http://www.gaoimg.com/uploads/allimg/210801/1-210P1151401S1.jpg"/>
    print(img1.attrs['src'])  # 'http://www.gaoimg.com/uploads/allimg/210801/1-210P1151401S1.jpg'

The HTML used above:

    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Title</title>
    </head>
    <body>
        <p>我是段落1</p>
        <a href="">我是超链接1</a>
        <div>
            <a href="https://www.baidu.com">我是超链接2</a>
            <p>我是段落2</p>
            <span>
                <p id="p1">我是超链接3</p>
            </span>
        </div>
        <img src="http://www.gaoimg.com/uploads/allimg/210801/1-210P1151401S1.jpg" alt="">
        <span id="s1">我是<b>span1</b></span>
    </body>
    </html>
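
The selector behaviour can be tried without a file on disk or the lxml package. The sketch below inlines a trimmed copy of the practice HTML and uses Python's built-in html.parser; swapping the parser is a convenience chosen for this sketch, not something the original notes do.

```python
from bs4 import BeautifulSoup

# A trimmed, inline copy of the practice HTML
html = (
    '<p>我是段落1</p>'
    '<div><a href="https://www.baidu.com">我是超链接2</a>'
    '<p>我是段落2</p><span><p id="p1">我是超链接3</p></span></div>'
    '<span id="s1">我是<b>span1</b></span>'
)
soup = BeautifulSoup(html, 'html.parser')   # built-in parser, nothing extra to install

direct = [p.get_text() for p in soup.select('div > p')]   # direct children of div only
nested = [p.get_text() for p in soup.select('div p')]     # any descendant of div
link = soup.select_one('div > a')['href']                 # tag['x'] is shorthand for tag.attrs['x']
missing = soup.select_one('.no-such-class')               # None when nothing matches

print(direct, nested, link, missing)
```

The last line is the case to watch for: calling get_text() on that None result would raise AttributeError.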

Practice: scraping the Douban Top 250

1. Imports

    from bs4 import BeautifulSoup
    import requests
    import csv

2. Add the request headers

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4542.2 Safari/537.36'
    }

3. Scraping function

    # Fetch the data of one page
    def get_info(url='https://movie.douban.com/top250'):
        # Movie titles, one-line reviews and detail-page URLs
        movies_name = []
        movies_view = []
        movies_url = []
        # Request the page
        response = requests.get(url, headers=headers)
        # print(response.text)
        soup = BeautifulSoup(response.text, 'lxml')
        # Get the li tag of each movie
        movies = soup.select('#content div.article li')
        # print(movies)
        for item in movies:
            movies_name.append(item.select_one('.title').get_text())
            movies_url.append(item.select_one('a').attrs['href'])
            # select_one() returns None when nothing matches, and None has no get_text(), so check first
            inq = item.select_one('.inq')
            if inq is not None:
                movies_view.append(inq.get_text())
            else:
                # Movies without a one-line review get the placeholder '暂无' ("none yet")
                movies_view.append('暂无')
        # Items at the same index in the three lists belong to the same movie, so zip them together
        movies1 = zip(movies_name, movies_view, movies_url)
        return movies1
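
4. Save function

    # Append the scraped rows to a CSV file
    def write_file(movies1):
        # The files/douban directory must already exist.
        # newline='' stops the csv module from writing blank lines between rows on Windows.
        with open('files/douban/detail.csv', 'a', encoding='utf-8', newline='') as file:
            writer = csv.writer(file)
            # Columns: movie title, description, detail URL (a header row is written for every page)
            writer.writerow(['电影名称', '电影描述', '电影详情地址'])
            for it in movies1:
                writer.writerow(it)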
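
Because the function above appends a header row for every page, the resulting file contains the header ten times. One possible variant, sketched below with only the standard library, writes the header only when the file is new or empty. The name write_rows and its parameters are hypothetical.

```python
import csv
import os


def write_rows(path, rows, header):
    """Append rows to a CSV file, writing the header only when the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)   # create the target folder if it is missing
    with open(path, 'a', encoding='utf-8', newline='') as file:
        writer = csv.writer(file)
        if need_header:
            writer.writerow(header)
        writer.writerows(rows)
```

It could be called as write_rows('files/douban/detail.csv', movies, ['电影名称', '电影描述', '电影详情地址']) once per page.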

5. Main program

    # Offsets of pages 2 to 10; the first page uses the default URL
    url_list = [25, 50, 75, 100, 125, 150, 175, 200, 225]
    index = -1
    url = 'https://movie.douban.com/top250'
    while index < 9:
        if index == -1:
            movies = get_info()
            write_file(movies)
        else:
            movies = get_info(f'https://movie.douban.com/top250?start={url_list[index]}&filter=')
            write_file(movies)
        index += 1
    else:
        # The else branch of a while loop runs once the condition becomes false
        print('Data written!')
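
The page offsets follow a fixed step of 25, so the list of URLs can also be generated with range. The sketch below assumes that start=0 returns the same first page as the bare URL, which the original code does not rely on; the generator name page_urls is hypothetical.

```python
def page_urls(total=250, per_page=25):
    """Yield the URL of every list page; 'start' is the offset of the first movie on the page."""
    base = 'https://movie.douban.com/top250'
    for start in range(0, total, per_page):
        yield f'{base}?start={start}&filter='


urls = list(page_urls())
print(len(urls))
print(urls[0])
print(urls[-1])
```

The main loop then reduces to iterating over page_urls() and calling write_file(get_info(url)) for each one, optionally with a short time.sleep() between pages to avoid hammering the site.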

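When using get_text(), always check whether the tag you are calling it on is missing (None). Calling it on None raises an error, and if the missing value is simply skipped, the data is incomplete and the values no longer line up with the right movie.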
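
One way to keep the None check in a single place is a small helper that returns a default when nothing matches, combined with building one record per movie instead of three parallel lists. This is a sketch under assumptions: the helper name safe_text and the sample markup are invented, and the markup only loosely imitates the real page.

```python
from bs4 import BeautifulSoup


def safe_text(tag, selector, default='暂无'):
    """Return the text of the first match inside tag, or default when nothing matches."""
    found = tag.select_one(selector)
    return found.get_text(strip=True) if found is not None else default


# Hypothetical markup: the second item has no .inq element.
html = '''
<ol>
  <li><a href="/m/1"><span class="title">Movie A</span></a><span class="inq">A review</span></li>
  <li><a href="/m/2"><span class="title">Movie B</span></a></li>
</ol>
'''
soup = BeautifulSoup(html, 'html.parser')

# One tuple per movie, so a missing field can never shift the other columns.
records = [
    (safe_text(li, '.title'), safe_text(li, '.inq'), li.select_one('a')['href'])
    for li in soup.select('li')
]
print(records)
```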
