Preface
Problem D concerns local TV stations, so several years of viewership-rating data may be of some use.
Site analysis
http://www.tvtv.hk/archives/category/tv
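Each day's ranking is published as a static page, so the link to each day's ranking has to be collected first before that day's data can be fetched.
Each day's top-10 information is published as text inside p tags; after scraping the paragraphs, split the text on spaces.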
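The splitting step can be sketched in isolation as follows. This is only a minimal illustration: the helper name split_ranking_text and the sample string are made up for the example, and the assumption that three leading tokens should be dropped simply mirrors the count check in the scraping code. It uses split() without arguments, which (unlike split(' ')) does not produce empty strings when the text contains runs of spaces.

```python
def split_ranking_text(paragraph, skip=3):
    """Split paragraph text on whitespace and drop the first `skip` tokens.

    `skip=3` is an assumption taken from the scraper's `count > 2` check.
    """
    tokens = paragraph.split()  # collapses runs of whitespace
    return tokens[skip:]


# Placeholder text, not real ranking data; the real page layout may differ.
sample = "h1 h2 h3 1 ChannelA 0.5"
fields = split_ranking_text(sample)
print(fields)
```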
Code
Fetching the link to each day's ranking
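import requests
from pyquery import PyQuery as pq

def get_href_list():
    # Map each article title to its link, across all listing pages.
    hrefs = {}
    for i in range(1, 100):
        print(i)
        url = 'http://www.tvtv.hk/archives/category/tv/page/' + str(i)
        response = requests.get(url)
        html = response.text
        doc = pq(html)
        articles = doc.find('.status-publish')
        for article in articles.items():
            alink = article.find('h2 a')
            title = alink.attr('title')
            if title:  # skip entries without a title attribute
                hrefs[title] = alink.attr('href')
    # Write once, after all pages have been collected.
    with open('TV/排行榜链接列表.csv', 'a') as f:
        for key in hrefs.keys():
            # Keep only ranking posts, whose titles contain '榜('.
            if key.find('榜(') > 0:
                f.write(key + ',' + hrefs[key] + '\n')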
The result, written to TV/排行榜链接列表.csv, has one title,link pair per line.
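As a side note, the page loop uses a fixed upper bound of 100 pages. A possible alternative, sketched below, is to stop as soon as a listing page yields no articles. The function collect_links and its fetch_page parameter are hypothetical names introduced for this sketch; fetch_page stands in for the requests and pyquery calls and is expected to return a list of (title, href) pairs for a page number. The filter here uses a plain substring check, which is slightly looser than find(...) > 0.

```python
def collect_links(fetch_page, max_pages=100):
    """Collect ranking links page by page, stopping at the first empty page.

    `fetch_page(page_number)` is a hypothetical callable that returns a list
    of (title, href) pairs; it is not part of the original code.
    """
    links = {}
    for page in range(1, max_pages + 1):
        entries = fetch_page(page)
        if not entries:
            break  # no articles on this page: assume the listing has ended
        for title, href in entries:
            if title and '榜(' in title:
                links.setdefault(title, href)  # keep the first link per title
    return links


# Usage with an in-memory stand-in for the site (placeholder data only).
fake_pages = {1: [('A榜(1)', 'u1'), ('other', 'u2')], 2: [('B榜(2)', 'u3')]}
print(collect_links(lambda page: fake_pages.get(page, [])))
```

Passing the fetch step in as a function also makes the loop easy to try out without touching the network.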
Opening each day's link
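def get_audience_proportion():
    # Requires: import requests; from pyquery import PyQuery as pq
    with open('TV/收视率排行榜列表.csv', 'w', encoding='utf-8') as out, \
            open('TV/排行榜链接列表.csv', 'r') as f:
        for line in f:
            print(line)
            # Split on the last comma: title first, link second.
            strs = line.strip().rsplit(',', 1)
            if len(strs) < 2:
                continue  # skip blank or malformed lines
            out.write(strs[0])
            # requests timeouts are in seconds; the original value of 10000
            # was probably meant as milliseconds.
            response = requests.get(strs[1], timeout=10)
            html = response.text
            doc = pq(html)
            paragraph = doc.find('.entry-content p').text()
            items = paragraph.split(' ')
            # Skip the first three tokens and write the rest as CSV fields.
            for item in items[3:]:
                out.write(',' + item)
            out.write('\n')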
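The result, written to TV/收视率排行榜列表.csv, has one line per day: the title followed by the comma-separated ranking fields.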
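Because the rows are assembled by hand with ',' as the separator, a title or field that itself contains a comma would shift the columns. One way around this, sketched here under the assumption that standard CSV quoting is acceptable for the output, is to let Python's csv module write each row. The helper write_ranking_row and the sample values are placeholders for illustration, not part of the original script.

```python
import csv
import io


def write_ranking_row(out, title, fields):
    """Write one CSV row (title plus ranking fields), quoting where needed."""
    csv.writer(out, lineterminator='\n').writerow([title, *fields])


# In-memory demonstration with placeholder values (not real ranking data).
buffer = io.StringIO()
write_ranking_row(buffer, 'Ranking, day 1', ['1', 'ChannelA', '0.5'])
print(buffer.getvalue())

# Reading the row back recovers the title intact despite the embedded comma.
row = next(csv.reader(io.StringIO(buffer.getvalue())))
print(row)
```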