Combining Web Scraping and Data Analysis: Scraping and Analyzing Chinese University Rankings, End to End - CSDN Blog

Article link: https://blog.csdn.net/m0_74408245/article/details/150260600

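I. Background and Goals

This case study uses the Chinese university rankings on the Gaosan (高三网) site (page title: “2021中国的大学排名一览表_高三网”) as its data source and walks through the full workflow, from scraping the data to analyzing and visualizing it. The main goals are:

Scrape each school's name, total score, national rank, star rating, and institution tier
Preprocess the scraped data (handle missing values)
Analyze the distribution of star ratings with charts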

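II. Data Scraping

1. Core Steps

Scraping has three key stages: fetching the page, parsing the data, and saving it as a CSV file.

(1) Fetching the page

Use the requests library to send the HTTP request, and handle encoding and errors: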

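import requests

def get_html(url, timeout=3):
    """Return the page text, or None if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # raise an exception on 4xx/5xx responses
        r.encoding = r.apparent_encoding  # guess the encoding from the response body
        return r.text
    except requests.RequestException as error:
        print(error)
        return None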

(2) Parsing the page

Parse the HTML with BeautifulSoup and locate the table data with CSS selectors:

from bs4 import BeautifulSoup

def parser(html):
    soup = BeautifulSoup(html, "lxml")
    out_list = []
    for row in soup.select("table>tbody>tr"):  # iterate over the table rows
        td_html = row.select("td")  # cells in this row
        if len(td_html) < 6:  # skip rows without the expected cells (e.g. a th-only header row)
            continue
        row_data = [
            td_html[1].text.strip(),  # school name
            td_html[2].text.strip(),  # total score
            td_html[3].text.strip(),  # national rank
            td_html[4].text.strip(),  # star rating
            td_html[5].text.strip()   # institution tier
        ]
        out_list.append(row_data)
    return out_list
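To see what the row loop produces without touching the network, the sketch below runs the same selector logic on a small hand-written table. The sample markup, the school names, and the helper name parse_rows are all invented for illustration, and the real page's structure may differ. It uses the built-in "html.parser" backend rather than lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page's structure may differ.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>序号</th><th>学校名称</th><th>总分</th><th>全国排名</th><th>星级排名</th><th>办学层次</th></tr>
    <tr><td>1</td><td>示例大学A</td><td>100</td><td>1</td><td>8星</td><td>示例层次</td></tr>
    <tr><td>2</td><td>示例大学B</td><td></td><td>2</td><td>7星</td><td>示例层次</td></tr>
  </tbody>
</table>
"""

def parse_rows(html):
    """Return columns 1-5 of every data row, skipping rows with too few td cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table>tbody>tr"):
        cells = tr.select("td")
        if len(cells) < 6:  # the th-only header row has no td cells
            continue
        rows.append([cell.text.strip() for cell in cells[1:6]])
    return rows

rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

An empty cell comes back as an empty string, which is how a missing total score would end up as an empty field in the CSV.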

(3) Saving as a CSV file

Use the csv module to write the data to a file:

import csv

def save_csv(item, path):
    with open(path, "wt", newline="", encoding="utf-8") as f:
        csv_write = csv.writer(f)
        csv_write.writerows(item)  # write all rows at once
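The preprocessing code in section III looks up a column by the name 总分, which only works if the first line of the CSV holds the column names. If the scraped rows do not already include a header, one option is to write it explicitly. The sketch below is a hypothetical variant (save_csv_with_header is not part of the original code, and the column names are assumed from the fields listed in the goals); it writes to a temporary directory and reads the file back to check the result.

```python
import csv
import os
import tempfile

# Assumed column names, taken from the fields the article says it scrapes.
HEADER = ["学校名称", "总分", "全国排名", "星级排名", "办学层次"]

# Invented sample rows; the second one has an empty total score.
rows = [
    ["示例大学A", "100", "1", "8星", "示例层次"],
    ["示例大学B", "", "2", "7星", "示例层次"],
]

def save_csv_with_header(rows, path, header=HEADER):
    """Write the header line first, then all data rows."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

path = os.path.join(tempfile.mkdtemp(), "school_demo.csv")
save_csv_with_header(rows, path)

# Read the file back to confirm the header and rows survived the round trip.
with open(path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))

print(read_back[0])
print(len(read_back) - 1, "data rows")
```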

(4) Main program

if __name__ == "__main__":
    url = "http://www.bspider.top/gaosan/"
    html = get_html(url)
    if html:  # get_html returns None when the request fails
        out_list = parser(html)
        save_csv(out_list, "school.csv")

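III. Data Preprocessing: Handling Missing Values

The “总分” (total score) column of the scraped school.csv contains empty values, which can be handled with Pandas. Four options follow. They are alternatives, so apply one of them to a freshly loaded df rather than running them in sequence. The code assumes the first row of school.csv holds the column names, including 总分.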

1. Drop rows that contain empty values

import pandas as pd
df = pd.read_csv("school.csv")
new_df = df.dropna()  # drop every row that has an empty value
print(new_df.to_string())

2. Replace empty values with a fixed placeholder

df.fillna("暂无分数信息", inplace=True)  # replace every empty value with the same text ("no score available")

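3. Replace empty values with the mean

x = df["总分"].mean()  # mean of the total score column
df["总分"] = df["总分"].fillna(x)  # assign back; inplace fillna on a selected column is unreliable in recent pandas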

4. Replace empty values with the median

x = df["总分"].median()  # median of the total score column
df["总分"] = df["总分"].fillna(x)
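To compare the strategies side by side, the sketch below applies three of them to a small in-memory table instead of school.csv. The sample rows are invented for illustration; only the column names follow the article. Each strategy returns a new object, so the original df is left untouched and the results can be compared directly.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for school.csv; one total score is missing.
CSV_TEXT = """学校名称,总分,全国排名,星级排名,办学层次
示例大学A,100,1,8星,示例层次
示例大学B,,2,7星,示例层次
示例大学C,90,3,6星,示例层次
示例大学D,80,4,6星,示例层次
示例大学E,50,5,5星,示例层次
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Count the missing values before deciding how to handle them.
missing = int(df["总分"].isna().sum())
print("missing 总分 values:", missing)

# Option 1: drop rows whose total score is missing.
dropped = df.dropna(subset=["总分"])

# Options 3 and 4: fill with the mean or the median of the known scores.
mean_filled = df["总分"].fillna(df["总分"].mean())
median_filled = df["总分"].fillna(df["总分"].median())

print(len(dropped), "rows remain after dropna")
print("mean fill:  ", mean_filled.tolist())
print("median fill:", median_filled.tolist())
```

With a skewed column like this one, the mean and the median give different fill values, which is the main reason to choose between them deliberately.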

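IV. Data Analysis and Visualization

1. Data Overview

A total of 820 schools were scraped. Their star ratings are distributed as follows:

8 stars: 8 schools
7 stars: 16 schools
6 stars: 36 schools
5 stars: 59 schools
4 stars: 103 schools
3 stars: 190 schools
2 stars: 148 schools
1 star: 260 schools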
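Each rating's share of the total follows directly from these counts. The short sketch below uses only the counts stated above (the variable names are illustrative) and computes each share as count / total, rounded to one decimal place.

```python
# Star-rating counts as stated in the article.
counts = {
    "8星": 8, "7星": 16, "6星": 36, "5星": 59,
    "4星": 103, "3星": 190, "2星": 148, "1星": 260,
}

total = sum(counts.values())

# Percentage share of each rating, rounded to one decimal place.
shares = {star: round(100 * n / total, 1) for star, n in counts.items()}

print("total schools:", total)
for star, share in shares.items():
    print(f"{star}: {counts[star]} schools, {share}%")
```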

2. Charts

(1) Bar chart: number of schools per star rating

import matplotlib.pyplot as plt
import numpy as np

plt.rcParams["font.sans-serif"] = ["SimHei"]  # use a font with Chinese glyphs; set it before drawing any text

x = np.array(["8星", "7星", "6星", "5星", "4星", "3星", "2星", "1星"])
y = np.array([8, 16, 36, 59, 103, 190, 148, 260])

plt.title("不同星级的学校个数")  # "Number of schools per star rating"
plt.bar(x, y)  # vertical bar chart
# or a horizontal bar chart: plt.barh(x, y)
plt.show()

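(2) Pie chart: share of each star rating

import matplotlib.pyplot as plt
import numpy as np

plt.rcParams["font.sans-serif"] = ["SimHei"]  # use a font with Chinese glyphs
# Pass the raw counts: pie() normalizes them, which avoids hand-rounded percentages
y = np.array([8, 16, 36, 59, 103, 190, 148, 260])
plt.pie(
    y,
    labels=["8星", "7星", "6星", "5星", "4星", "3星", "2星", "1星"],
    autopct="%1.1f%%",  # print each share on its wedge
)
plt.title("不同星级的学校个数占比")  # "Share of schools per star rating"
plt.show()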