Aim: Write a program to parse XML text, generate a Web graph, and compute topic-specific PageRank.
Source Code:
# Python code to illustrate parsing of XML files
# importing the required modules
import csv
import requests
import xml.etree.ElementTree as ET
def loadRSS():
    # url of rss feed
    url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'
    # creating HTTP response object from given url
    resp = requests.get(url)
    # stop early if the server returned an error status
    resp.raise_for_status()
    # saving the xml file
    with open('topnewsfeed.xml', 'wb') as f:
        f.write(resp.content)
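def parseXML(xmlfile):
    # create element tree object
    tree = ET.parse(xmlfile)
    # get root element
    root = tree.getroot()
    # create empty list for news items
    newsitems = []
    # iterate news items
    for item in root.findall('./channel/item'):
        # empty news dictionary
        news = {}
        # iterate child elements of item
        for child in item:
            # special checking for namespace object content:media
            if child.tag == '{http://search.yahoo.com/mrss/}content':
                news['media'] = child.attrib['url']
            else:
                # keep text as str; an empty element has text None
                news[child.tag] = child.text or ''
        # append news dictionary to news items list
        newsitems.append(news)
    # return news items list
    return newsitems

def savetoCSV(newsitems, filename):
    # column names are taken from the parsed items (assumption: the
    # original field list is missing from this listing)
    fields = sorted({key for news in newsitems for key in news})
    # writing to csv file
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        # creating a csv dict writer object
        writer = csv.DictWriter(csvfile, fieldnames=fields)
        # writing headers (field names)
        writer.writeheader()
        # writing data rows
        writer.writerows(newsitems)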
def main():
    # load rss from web to update existing xml file
    loadRSS()
    # parse xml file
    newsitems = parseXML('topnewsfeed.xml')
    # store news items in a csv file
    savetoCSV(newsitems, 'topnews.csv')

if __name__ == "__main__":
    # calling main function
    main()
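The parsing step can be tried without a network connection. The following is a minimal sketch, assuming a small hypothetical RSS document (SAMPLE_RSS, with invented example.com links) and a helper name parse_items that is not part of the listing above; it applies the same ./channel/item lookup and the same media namespace check to an in-memory string.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed, invented purely for illustration.
SAMPLE_RSS = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First story</title>
      <link>http://example.com/first</link>
      <media:content url="http://example.com/first.jpg"/>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""

# ElementTree expands a namespace prefix to "{namespace-uri}localname".
MEDIA_CONTENT = '{http://search.yahoo.com/mrss/}content'


def parse_items(xml_text):
    """Return one dictionary per <item> element of an RSS string."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.findall('./channel/item'):
        record = {}
        for child in item:
            if child.tag == MEDIA_CONTENT:
                # media:content carries its data in the url attribute
                record['media'] = child.attrib.get('url', '')
            else:
                # child.text is None for empty elements
                record[child.tag] = (child.text or '').strip()
        items.append(record)
    return items


if __name__ == "__main__":
    for record in parse_items(SAMPLE_RSS):
        print(record)
```

Only the first item of the sample carries a media:content element, so only its dictionary gets a 'media' key; this is why a CSV writer needs a default value for missing columns.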
In cmd:
C:\Users\Sumit>pip install requests
Collecting requests
  Downloading https://files.pythonhosted.org/packages/7d/e3/20f3d364d6c8e5d2353c72a67778eb189176f08e873c9900e10c0287b84b/requests-2.21.0-py2.py3-none-any.whl (57kB)
    100% |████████████████████████████████| 61kB 84kB/s
Collecting chardet<3.1.0,>=3.0.2 (from requests)
  Downloading https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl (133kB)
    100% |████████████████████████████████| 143kB 122kB/s
Collecting idna<2.9,>=2.5 (from requests)
  Downloading https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl (58kB)
    100% |████████████████████████████████| 61kB 136kB/s
Collecting certifi>=2017.4.17 (from requests)
  Downloading https://files.pythonhosted.org/packages/9f/e0/accfc1b56b57e9750eba272e24c4dddeac86852c2bebd1236674d7887e8a/certifi-2018.11.29-py2.py3-none-any.whl (154kB)
    100% |████████████████████████████████| 163kB 178kB/s
Collecting urllib3<1.25,>=1.21.1 (from requests)
  Downloading https://files.pythonhosted.org/packages/62/00/ee1d7de624db8ba7090d1226aebefab96a2c71cd5cfa7629d6ad3f61b79e/urllib3-1.24.1-py2.py3-none-any.whl (118kB)
    100% |████████████████████████████████| 122kB 204kB/s
Installing collected packages: chardet, idna, certifi, urllib3, requests
Successfully installed certifi-2018.11.29 chardet-3.0.4 idna-2.8 requests-2.21.0 urllib3-1.24.1
You are using pip version 18.1, however version 19.0.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

C:\Users\Sumit>python -m pip install --upgrade pip
Collecting pip
  Downloading https://files.pythonhosted.org/packages/46/dc/7fd5df840efb3e56c8b4f768793a237ec4ee59891959d6a215d63f727023/pip-19.0.1-py2.py3-none-any.whl (1.4MB)
    100% |████████████████████████████████| 1.4MB 579kB/s
Installing collected packages: pip
  Found existing installation: pip 18.1
    Uninstalling pip-18.1:
      Successfully uninstalled pip-18.1
Successfully installed pip-19.0.1

C:\Users\Sumit>
Output:
= RESTART: D:\Ratnam\tycs\2018-19\Information retrival\practicals\pracs10.py =
>>>
Files created in the working directory:
topnews.csv
topnewsfeed.xml
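
The Aim also mentions generating a Web graph and computing topic-specific PageRank, which the listing does not reach. The following is a minimal sketch of those two steps, not the practical's own solution. It assumes a hypothetical edge list (EDGES, pages named A to D) in place of links extracted from real pages, and hypothetical function names build_graph and topic_pagerank. Topic-specific PageRank differs from ordinary PageRank only in the teleport step: the random surfer jumps to pages of the chosen topic instead of to any page. Sending the rank of pages without out-links to the topic pages as well is one common choice, adopted here as an assumption.

```python
def build_graph(edges):
    """Return {page: set of pages it links to}; every page gets an entry."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    return graph


def topic_pagerank(graph, topic_pages, damping=0.85, iterations=100, tol=1e-10):
    """Power iteration with teleportation restricted to topic_pages."""
    pages = sorted(graph)
    topic = [p for p in pages if p in topic_pages]
    if not topic:
        raise ValueError("topic_pages must contain at least one page of the graph")
    # Teleport vector: uniform over topic pages, zero elsewhere.
    teleport = {p: (1.0 / len(topic) if p in topic_pages else 0.0) for p in pages}
    rank = dict(teleport)
    for _ in range(iterations):
        # Rank held by pages without out-links is sent to the topic pages.
        dangling = sum(rank[p] for p in pages if not graph[p])
        new_rank = {p: (1.0 - damping + damping * dangling) * teleport[p]
                    for p in pages}
        # Every other page splits its damped rank evenly among its out-links.
        for p in pages:
            if graph[p]:
                share = damping * rank[p] / len(graph[p])
                for q in graph[p]:
                    new_rank[q] += share
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical link structure: (source page, target page).
EDGES = [('A', 'B'), ('A', 'C'), ('B', 'C'), ('C', 'A'), ('D', 'C')]

if __name__ == "__main__":
    web_graph = build_graph(EDGES)
    ranks = topic_pagerank(web_graph, topic_pages={'A'})
    for page in sorted(ranks, key=ranks.get, reverse=True):
        print(page, round(ranks[page], 4))
```

With the topic set {'A'}, page D ends with rank zero: nothing links to it and it is outside the topic, so neither links nor teleportation ever reach it. Changing topic_pages changes the ranking without touching the graph, which is the point of the topic-specific variant.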