Python Web Scraping Basics: The requests Library Explained with Examples

Latest recommended article published on 2025-05-12 10:17:20

Original

CC 4.0 BY-SA license

Tags:

#tcp/ip #network #server #python #web-scraping

1. Using the `requests` module

Introduction to the `requests` module and installation

Purpose: send network requests and return the response data.

Chinese documentation: https://requests.readthedocs.io/projects/cn/zh_CN/latest/

For scraping tasks, the requests module can handle the vast majority of data-fetching jobs, so it is worth learning to use it well.

Installing the module

Install command:

pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple

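`requests` features

Keep-Alive & connection pooling
International domains and URLs
Sessions with persistent cookies
Browser-style SSL verification
Automatic content decoding
Basic/Digest authentication
Elegant key/value cookies
Automatic decompression
Unicode response bodies
HTTP(S) proxy support
Multipart file uploads
Streaming downloads
Connection timeouts
Chunked requests
.netrc support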

Sending requests with `requests` and common response attributes

Task: use requests to send a request to the Baidu home page and fetch the page data.

import requests

url = "https://www.baidu.com"

response = requests.get(url=url)

print("--- status code ---")
print(response.status_code)

print("--- body as bytes ---")
print(response.content)

print("--- body as str ---")
print(response.text)

print("--- body as str (decoded as utf-8) ---")
print(response.content.decode("utf-8"))

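Common attributes:

response.text: response body as str
response.content: response body as bytes
response.status_code: response status code
response.request.headers: headers of the request that produced this response
response.headers: response headers
response.request.headers.get('Cookie'): cookies sent with that request
response.cookies: cookies set by the response (through Set-Cookie)
response.url: final URL of the response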
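The Cookie request header is a single string such as `name=value; other=value`. As a minimal sketch, assuming you want those pairs as a dict, the helper below parses the string with the standard library's `http.cookies.SimpleCookie`. The name `cookie_header_to_dict` and the sample cookie values are hypothetical and are not part of requests.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(header_value):
    """Parse a Cookie header string into a {name: value} dict.

    Hypothetical helper for illustration; returns {} for None or "".
    """
    if not header_value:
        return {}
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}


# Made-up header value standing in for a real Cookie header.
print(cookie_header_to_dict("BAIDUID=abc123; lang=zh"))
```

With a real response, the input would be `response.request.headers.get('Cookie')`, which is None when the request carried no cookies.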

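Differences between `text` and `content`

response.text
- Type: str
- Decoding: requests infers the text encoding from the HTTP headers of the response
- Changing the encoding: response.encoding = 'gbk'
response.content
- Type: bytes
- Decoding: none applied
- Decoding manually: response.content.decode('utf-8')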

General ways to get a page's source:

response.encoding = 'utf-8'
response.content.decode('utf-8')
response.text

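Trying these three in order, from first to last, resolves most page-encoding problems. Pages served in another encoding such as gbk need that encoding name in place of 'utf-8'.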

import requests

r = requests.get("https://www.baidu.com")

print("----- requests can usually decode the response automatically -----")
print(r.text)

print("----- if the text comes out wrong, set the encoding explicitly -----")
r.encoding = "utf-8"
print(r.text)
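The idea of trying encodings in order can be sketched as a small helper that works directly on the bytes from `response.content`. This is an illustration under assumptions: the name `decode_with_fallback` and the default candidate list `("utf-8", "gbk")` are hypothetical choices, and the sample bytes stand in for a real response body.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gbk")):
    """Return raw bytes decoded with the first encoding that succeeds."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway, replacing undecodable bytes with U+FFFD.
    return raw.decode(encodings[0], errors="replace")


sample = "百度一下".encode("gbk")  # stand-in for response.content
print(decode_with_fallback(sample))
```

Order matters here: a permissive encoding placed first can decode bytes that were written in a different encoding without raising an error, so the strictest candidate (usually utf-8) should come first.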

Downloading an image from the web

Task: download the Baidu logo to a local file.

Approach:

URL of the logo: https://www.baidu.com/img/bd_logo1.png
Send a request with the requests module and get the response
Open a file in binary write mode and write the response body to it

import requests

# URL of the image
url = 'https://www.baidu.com/img/bd_logo1.png'

# The response body is the image itself, as binary data
r = requests.get(url)

# print(r.content)

# Open the file in binary write mode
with open('baidu.png', 'wb') as f:
    # r.content is the response body as bytes
    f.write(r.content)
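A server can answer an image URL with an HTML error page, and the code above would still write that page to baidu.png. One way to guard against this, sketched here as an assumption and not as part of the original example, is to check that the bytes start with the 8-byte PNG file signature before writing them. The helper name `looks_like_png` is hypothetical.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def looks_like_png(data):
    """Return True if the bytes start with the PNG file signature."""
    return data.startswith(PNG_SIGNATURE)


# Stand-ins for r.content: a PNG-like payload and an HTML error page.
print(looks_like_png(PNG_SIGNATURE + b"rest of file"))
print(looks_like_png(b"<html>error page</html>"))
```

In the download script this would be used as `if looks_like_png(r.content):` before opening the file. Checking `r.status_code` as well catches failed requests earlier.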

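The `iter_content` method

Downloading a large resource such as a video can take a long time, and the program cannot do anything else while the download is running (multitasking is one way around that). Without multitasking, if you want to know how far the download has progressed, you can fetch the resource piece by piece through iteration.

Using iter_content

Pass stream=True when making the request: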

r = requests.get('https://www.baidu.com', stream=True)

with open('test.html', 'wb') as f:
    for chunk in r.iter_content(chunk_size=100):
        f.write(chunk)

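About stream=True

With stream=True, requests.get returns once the response headers arrive, and the body is downloaded only when it is read, for example through iter_content
Without stream=True, requests.get spends the time downloading the whole body before it returns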
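Because `iter_content` yields the body as a sequence of bytes chunks, the writing loop can be factored into a function that accepts any iterable of bytes. The sketch below is illustrative: the name `write_chunks`, the demo file name, and the fake chunk list are assumptions, with the list standing in for `r.iter_content(chunk_size=100)` so the code runs without a network connection.

```python
import os
import tempfile


def write_chunks(chunks, path):
    """Write an iterable of bytes chunks to path; return bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Stand-in for r.iter_content(chunk_size=100): any iterable of bytes works.
fake_chunks = [b"<html>", b"", b"</html>"]
target = os.path.join(tempfile.gettempdir(), "write_chunks_demo.html")
print(write_chunks(fake_chunks, target))
```

With a streamed response, the call would be `write_chunks(r.iter_content(chunk_size=100), 'test.html')`. Only one chunk is held in memory at a time, which is the point of streaming large files.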

Showing video download progress

import requests


def download_video(url, save_path):
    response = requests.get(url, stream=True)
    # Content-Length may be missing, in which case total_size is 0
    total_size = int(response.headers.get('content-length', 0))
    downloaded_size = 0

    with open(save_path, 'wb') as file:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                file.write(chunk)
                downloaded_size += len(chunk)
                # Only compute a percentage when the total size is known
                if total_size:
                    percent = (downloaded_size / total_size) * 100
                    print(f"Download progress: {percent:.2f}%")

    print("Download complete...")


# Call the download function
video_url = "ht