
Date: Practical No:10 Roll No:

Aim: Write a program to parse XML text, generate a Web graph, and compute topic-specific PageRank.

Source Code:
# Python code to illustrate parsing of XML files
# importing the required modules
import csv
import requests
import xml.etree.ElementTree as ET


def loadRSS():
    # url of rss feed
    url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'
    # creating HTTP response object from given url
    resp = requests.get(url)
    # saving the xml file
    with open('topnewsfeed.xml', 'wb') as f:
        f.write(resp.content)


def parseXML(xmlfile):
    # create element tree object
    tree = ET.parse(xmlfile)
    # get root element
    root = tree.getroot()
    # create empty list for news items
    newsitems = []
    # iterate news items
    for item in root.findall('./channel/item'):
        # empty news dictionary
        news = {}
        # iterate child elements of item
        for child in item:
            # special checking for namespace object content:media
            if child.tag == '{http://search.yahoo.com/mrss/}content':
                news['media'] = child.attrib['url']
            # skip empty elements so text handling never fails on None
            elif child.text:
                news[child.tag] = child.text.strip()
        # append news dictionary to news items list
        newsitems.append(news)
    # return news items list
    return newsitems


def savetoCSV(newsitems, filename):
    # specifying the fields for csv file
    fields = ['guid', 'title', 'pubDate', 'description', 'link', 'media']
    # writing to csv file; newline='' avoids blank rows on Windows
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        # creating a csv dict writer object; ignore tags outside the field list
        writer = csv.DictWriter(csvfile, fieldnames=fields, extrasaction='ignore')
        # writing headers (field names)
        writer.writeheader()
        # writing data rows
        writer.writerows(newsitems)


def main():
    # load rss from web to update existing xml file
    loadRSS()
    # parse xml file
    newsitems = parseXML('topnewsfeed.xml')
    # store news items in a csv file
    savetoCSV(newsitems, 'topnews.csv')


if __name__ == "__main__":
    # calling main function
    main()
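The listing above handles the XML-parsing and CSV steps of the aim. The web-graph and topic-specific PageRank steps are not part of it; the following is a minimal sketch of how they could be added, assuming the listing is saved as pracs10.py and that the networkx library is installed (pip install networkx, in the same way Requests is installed below). In topic-specific (personalized) PageRank the random surfer teleports not to a uniformly random page but to a page drawn from a topic-biased distribution, which networkx exposes through the personalization argument of nx.pagerank. The helper get_outlinks(), the crawl limit, and the 'cricket' topic keyword are illustrative assumptions, not taken from the original program.

# sketch: build a web graph from the feed links and compute topic-specific PageRank
import re
import requests
import networkx as nx
from pracs10 import parseXML   # reuse the parser from the listing above (assumed file name)


def get_outlinks(url):
    # download a page and pull out its absolute hyperlinks with a simple regex
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        return []
    return re.findall(r'href="(https?://[^"#]+)"', html)


def build_web_graph(seed_links, limit=20):
    # directed graph: an edge u -> v means page u links to page v
    G = nx.DiGraph()
    for url in seed_links[:limit]:
        for out in get_outlinks(url):
            G.add_edge(url, out)
    return G


def topic_specific_pagerank(G, topic_keyword):
    # teleport vector biased towards pages whose URL mentions the topic keyword
    topic_nodes = [n for n in G if topic_keyword in n.lower()]
    if not topic_nodes:
        # no topic pages found: fall back to ordinary PageRank
        return nx.pagerank(G, alpha=0.85)
    personalization = {n: (1.0 if n in topic_nodes else 0.0) for n in G}
    return nx.pagerank(G, alpha=0.85, personalization=personalization)


if __name__ == '__main__':
    newsitems = parseXML('topnewsfeed.xml')
    seed_links = [item['link'] for item in newsitems if 'link' in item]
    G = build_web_graph(seed_links)
    ranks = topic_specific_pagerank(G, 'cricket')
    # print the ten highest-ranked pages for the chosen topic
    for url, score in sorted(ranks.items(), key=lambda x: -x[1])[:10]:
        print(round(score, 4), url)

Changing the topic keyword reranks the same graph: the link structure stays fixed and only the teleport distribution changes, which is the point of topic-specific PageRank.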
In cmd:
C:\Users\Sumit>pip install requests
Collecting requests
  Downloading https://files.pythonhosted.org/packages/7d/e3/20f3d364d6c8e5d2353c72a67778eb189176f08e873c9900e10c0287b84b/requests-2.21.0-py2.py3-none-any.whl (57kB)
    100% |████████████████████████████████| 61kB 84kB/s
Collecting chardet<3.1.0,>=3.0.2 (from requests)
  Downloading https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl (133kB)
    100% |████████████████████████████████| 143kB 122kB/s
Collecting idna<2.9,>=2.5 (from requests)
  Downloading https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl (58kB)
    100% |████████████████████████████████| 61kB 136kB/s
Collecting certifi>=2017.4.17 (from requests)
  Downloading https://files.pythonhosted.org/packages/9f/e0/accfc1b56b57e9750eba272e24c4dddeac86852c2bebd1236674d7887e8a/certifi-2018.11.29-py2.py3-none-any.whl (154kB)
    100% |████████████████████████████████| 163kB 178kB/s
Collecting urllib3<1.25,>=1.21.1 (from requests)
  Downloading https://files.pythonhosted.org/packages/62/00/ee1d7de624db8ba7090d1226aebefab96a2c71cd5cfa7629d6ad3f61b79e/urllib3-1.24.1-py2.py3-none-any.whl (118kB)
    100% |████████████████████████████████| 122kB 204kB/s
Installing collected packages: chardet, idna, certifi, urllib3, requests
Successfully installed certifi-2018.11.29 chardet-3.0.4 idna-2.8 requests-2.21.0 urllib3-1.24.1
You are using pip version 18.1, however version 19.0.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
C:\Users\Sumit>python -m pip install --upgrade pip

Collecting pip
  Downloading https://files.pythonhosted.org/packages/46/dc/7fd5df840efb3e56c8b4f768793a237ec4ee59891959d6a215d63f727023/pip-19.0.1-py2.py3-none-any.whl (1.4MB)
    100% |████████████████████████████████| 1.4MB 579kB/s
Installing collected packages: pip
Found existing installation: pip 18.1
Uninstalling pip-18.1:
Successfully uninstalled pip-18.1
Successfully installed pip-19.0.1
C:\Users\Sumit>
Output:
= RESTART: D:\Ratnam\tycs\2018-19\Information retrival\practicals\pracs10.py =
>>>
topnews.csv

topnewsfeed.xml
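
A successful run leaves these two files next to pracs10.py: topnewsfeed.xml, the raw RSS feed saved by loadRSS(), and topnews.csv, whose header row is guid,title,pubDate,description,link,media followed by one row per <item> element of the feed.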
