2019 Shenzhen Cup Problem D: Scraping TV Ratings Rankings


喜欢coding的谢同学


License: CC 4.0 BY-SA


Article link: https://blog.csdn.net/weixin_44112790/article/details/89929592


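This article describes a way to scrape the daily TV ratings rankings from a specific website: it parses the page source, collects the link to each ranking page, and saves both the links and the ratings data.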


Preface

Problem D concerns local TV stations, so several years of ratings data may come in handy.

Site analysis

http://www.tvtv.hk/archives/category/tv
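Each day's ranking is published as its own static page, so the link to every daily ranking has to be collected first before that day's data can be fetched.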
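The link-collection step can be sketched without any network access. This is a minimal illustration, not the author's code: the helper names `listing_page_urls` and `is_ranking_title` are hypothetical, the `/page/N` URL pattern and the `榜（` title marker are taken from the code later in this post, and the base URL assumes the site root is www.tvtv.hk.

```python
# Assumed base URL of the category listing (site root taken to be www.tvtv.hk).
BASE_URL = "http://www.tvtv.hk/archives/category/tv"


def listing_page_urls(first_page, last_page, base_url=BASE_URL):
    """Return the listing-page URLs for pages first_page..last_page, inclusive."""
    return [f"{base_url}/page/{n}" for n in range(first_page, last_page + 1)]


def is_ranking_title(title):
    """Return True if a post title looks like a daily ranking (contains '榜（')."""
    return bool(title) and "榜（" in title


if __name__ == "__main__":
    # Print the first three listing pages that a crawler would visit.
    for url in listing_page_urls(1, 3):
        print(url)
```

Keeping the URL construction and the title filter as small pure functions makes them easy to check before any requests are sent.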

Each day's top-10 information is published as plain text inside p tags, so after grabbing the paragraph text, split it on spaces.
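The splitting step might look like the sketch below. The function name `split_ranking_paragraph`, its `skip` parameter, and the sample text are all made up for illustration; the real paragraph layout on the site is not shown in this post.

```python
def split_ranking_paragraph(text, skip=0):
    """Split paragraph text on whitespace and drop the first `skip` tokens."""
    # str.split() with no argument collapses runs of whitespace, so
    # consecutive spaces do not produce empty tokens.
    tokens = text.split()
    return tokens[skip:]


if __name__ == "__main__":
    # Hypothetical sample text, not real data from the site.
    sample = "header1 header2 header3 1 ProgramA 1.23 2 ProgramB 1.10"
    print(split_ranking_paragraph(sample, skip=3))
```

Note that `text.split()` differs slightly from `text.split(' ')`: the latter yields empty strings wherever two spaces are adjacent, which would end up as empty CSV columns.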

Code

Fetching the daily ranking links

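import os

import requests
from pyquery import PyQuery as pq

LIST_URL = 'http://www.tvtv.hk/archives/category/tv/page/'


def get_href_list():
    """Collect title -> link for every post on listing pages 1 to 99."""
    hrefs = {}
    for i in range(1, 100):
        print(i)  # progress indicator
        response = requests.get(LIST_URL + str(i))
        doc = pq(response.text)
        # every post on a listing page carries the .status-publish class
        for article in doc.find('.status-publish').items():
            alink = article.find('h2 a')
            title, href = alink.attr('title'), alink.attr('href')
            if title and href:  # skip posts without a usable link
                hrefs[title] = href
    os.makedirs('TV', exist_ok=True)  # the output directory must exist
    with open('TV/排行榜链接列表.csv', 'a') as f:
        for title, href in hrefs.items():
            # keep only the daily ranking posts, whose titles contain '榜（'
            if title.find('榜（') > 0:
                f.write(title + ',' + href + '\n')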

The result is as follows:
[screenshot not preserved]
Opening each day's link

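import requests
from pyquery import PyQuery as pq


def get_audience_proportion():
    """Visit every saved ranking link and write its ratings to a CSV file."""
    with open('TV/排行榜链接列表.csv', 'r') as f, \
            open('TV/收视率排行榜列表.csv', 'w', encoding='utf-8') as out:
        for line in f:
            print(line)
            # each row is "title,link"; strip the trailing newline first
            strs = line.strip().rsplit(',', 1)
            if len(strs) != 2:
                continue  # skip blank or malformed rows
            out.write(strs[0])
            response = requests.get(strs[1], timeout=10)  # timeout in seconds
            doc = pq(response.text)
            paragraph = doc.find('.entry-content p').text()
            items = paragraph.split(' ')
            # the first three space-separated tokens are skipped;
            # the rest are written as columns after the title
            for item in items[3:]:
                out.write(',' + item)
            out.write('\n')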

The result is as follows:
[screenshot not preserved]