Python Web Scraping: Requests, Beautiful Soup, and Scrapy (with Video Tutorial)

hweiyu00

Published 2025-08-12 10:23:58

License: CC 4.0 BY-SA

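Overview

Requests, Beautiful Soup, and Scrapy are three of the most widely used tools for web scraping in Python. Requests fetches pages over HTTP, Beautiful Soup parses the HTML that comes back, and Scrapy is a full framework that covers both steps along with crawl management.
Video tutorial: https://pan.quark.cn/s/c4da28467c4f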

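1. Requests

Requests is a concise, easy-to-use HTTP library for sending HTTP requests and retrieving web page content.

Key features:

Simpler syntax than Python's built-in urllib
Supports the common HTTP methods (GET, POST, and so on)
Handles cookies automatically and persists them across requests through sessions
Supports file uploads, proxy configuration, and more

A simple example: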

import requests

# Send a GET request (Requests has no default timeout, so set one explicitly)
response = requests.get('https://www.example.com', timeout=10)

# Inspect the response body
print(response.text)

# Inspect the HTTP status code
print(response.status_code)

# Send a request with query parameters
params = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('https://www.example.com', params=params, timeout=10)
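
The cookie and session handling mentioned in the feature list can be sketched without touching the network. The snippet below is a minimal illustration, not a recommended crawler setup: the User-Agent string and the cookie name and value are hypothetical, and the request is only prepared, not sent, so that the merged URL, headers, and cookies can be inspected.

```python
import requests

# A Session carries headers and cookies across every request made through it.
session = requests.Session()
session.headers.update({'User-Agent': 'example-crawler/0.1'})  # hypothetical UA string

# Hypothetical cookie; in a real crawl the server would normally set it.
session.cookies.set('session_id', 'abc123', domain='www.example.com')

# Build the request without sending it, to see what would go on the wire.
request = requests.Request(
    'GET',
    'https://www.example.com',
    params={'key1': 'value1', 'key2': 'value2'},
)
prepared = session.prepare_request(request)

print(prepared.url)                    # params are encoded into the query string
print(prepared.headers['User-Agent'])  # session-level header is merged in
print(prepared.headers.get('Cookie'))  # matching session cookie is attached

# To actually send it (requires network access):
# response = session.send(prepared, timeout=10)
# response.raise_for_status()  # raise on 4xx/5xx instead of failing silently
```

In everyday use, calling session.get(url, params=..., timeout=...) performs the same preparation and sends the request in one step.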

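2. Beautiful Soup

Beautiful Soup is an HTML and XML parsing library for extracting data from web pages.

Key features:

Parses malformed HTML (missing tags, formatting errors, and so on)
Provides a simple, intuitive API for navigating and searching the parse tree
Supports multiple parsers (html.parser, lxml, and others)

A simple example: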

from bs4 import BeautifulSoup
import requests

# Fetch the page
response = requests.get('https://www.example.com', timeout=10)
html_content = response.text

# Create the Beautiful Soup object
soup = BeautifulSoup(html_content, 'html.parser')

# Find all <a> tags
links = soup.find_all('a')
for link in links:
    # Link text
    print(link.text)
    # Link target (None if the tag has no href attribute)
    print(link.get('href'))

# Find <div> elements with a specific class
specific_divs = soup.find_all('div', class_='specific-class')
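
To try the parsing API without a network request, the same calls can be run against an inline HTML string. The fragment below is a hypothetical example with deliberately unclosed <p> and <li> tags; it is a small sketch showing that a searchable tree is still built, and that select() offers a CSS-selector alternative to find_all().

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment with unclosed <p> and <li> tags.
html = """
<div class="specific-class">
  <p>First paragraph
  <p>Second paragraph
  <ul>
    <li><a href="/page1">Page 1</a>
    <li><a href="/page2">Page 2</a>
  </ul>
</div>
<div class="other">ignored</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# CSS selectors via select(): links with an href inside div.specific-class.
hrefs = [a['href'] for a in soup.select('div.specific-class a[href]')]
print(hrefs)

# get_text(strip=True) trims surrounding whitespace from the link text.
labels = [a.get_text(strip=True) for a in soup.find_all('a')]
print(labels)

# find() returns the first match, or None when nothing matches,
# so check the result before accessing attributes on it.
missing = soup.find('div', class_='does-not-exist')
print(missing is None)
```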

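3. Scrapy

Scrapy is a full-featured crawling framework suited to large, complex scraping projects.

Key features:

Built on Twisted, an asynchronous networking framework, so many requests can be in flight at once
Built-in data extraction with XPath and CSS selectors
Manages the whole crawl workflow (request scheduling, deduplication, data storage, and so on)
Extensible through middleware, which can handle proxies, logins, and other complex scenarios
Built-in feed exports (JSON, CSV, and other formats); writing to a database is typically done in an item pipeline

A simple example (a basic Scrapy spider):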

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    # Keep the crawl on this site; links to other domains are filtered out
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com']

    def parse(self, response):
        # Extract data with XPath
        titles = response.xpath('//h1/text()').getall()
        for title in titles:
            yield {'title': title}

        # Extract links and keep crawling; response.follow resolves relative URLs
        next_links = response.xpath('//a/@href').getall()
        for link in next_links:
            yield response.follow(link, self.parse)