E-Commerce Review Scraper: A Python Mini Project
By
Ankita Bhosle
Dikshant Solanki
Anjana Pattan
University of Mumbai
2021-2022
Vidyavardhini’s College of Engineering & Technology
CERTIFICATE
Internal Examiner : ( )
External Examiner : ( )
Acknowledgement
We would like to express our deepest appreciation to all those who made it possible
for us to complete this report. We owe special gratitude to our project guide,
Mr. Shanmugasundaram Konar, whose stimulating suggestions and encouragement
helped us coordinate our project, especially in writing this report.
Furthermore, we would like to acknowledge with much appreciation the crucial
role of the staff, who gave us permission to use all the required equipment and the
necessary materials to complete the task. Last but not least, many thanks go to the
head of the project, Mr. Shanmugasundaram Konar, who invested his full effort in
guiding the team towards achieving the goal. We also appreciate the guidance given
by the other supervisors, as well as the panels, especially during our project
presentation; their comments and advice have improved our presentation skills.
Ankita Bhosle
Dikshant Solanki
Anjana Pattan
Abstract:
Web scraping is an automated method of browsing websites and other online
sources to locate and access data. It uses software engineering techniques and
custom software programming to extract data or other content from online
sources, copying the information and saving it in an external archive for later
review. Web scraping is often called automatic data gathering, data harvesting,
web crawling, or content mining. Techniques resembling web scraping have
existed since the early days of the World Wide Web, but the practice is used
mainly in the context of data analytics and is generally associated with e-
commerce. Web scraping offers a broad collection of options and can serve
various purposes. At a minimum, a web scraper automates the normally manual
work of gathering price quotes and product details from websites; more
ambitiously, it can uncover previously inaccessible sources of price data and
compile a survey of all accessible price information. The scraping process is
performed using different technologies, which can be automated software tools
or manual methods. This report provides an overall review of web scraping
technology, how it is carried out, and the effects of this technology.
Contents
1. Introduction
2. Literature Survey
3. Detail of the Mini Project
   3.1 Technologies Used
   3.2 Code
   3.3 Output
4. Conclusion
5. References
Introduction
Web scraping was not originally developed for social science research; as a result,
analysts using this method may unknowingly build hidden assumptions into their
work. Because web scraping does not usually require direct contact between the
analyst and those who originally collected the information and posted it online,
data quality issues can easily arise. Research teams using web scraping as a data
gathering method must therefore become acquainted with the accuracy and correct
interpretation of the details retrieved from a website. One final problem analysts
must address is the potential effect of web scraping on a website's availability,
as certain web scraping actions can unintentionally overload and shut down a
webpage. A web scraper that is appropriately designed and executed can help
analysts overcome obstacles to data access, gather online information more
efficiently, and ultimately answer research questions that cannot be addressed
by conventional means of data collection and analysis. Figure 1 below shows an
overview of how web scraping is done.
Literature Survey
Renita Crystal Pereira et al. [1] provided a summary of web scraping and of its
techniques and tools, which face several complexities because data extraction is not
simple. These strategies aim to guarantee that the data collected is correct, consistent,
and has good integrity, because the large amount of data present is hard to handle and
retain. The available techniques still face a few problems; for example, a high volume
of web scraping can cause serious harm to the websites being scraped. The measurement
level of the web scraper will also vary with the measurement units of the original source
file, making the data very difficult to interpret. The use of social networking sites such
as Facebook, Twitter, and LinkedIn is growing day by day, so user information is widely
available on the internet from anywhere. This also gives hackers an advantage in stealing
information. Where the idea of raising income arises, social networking is important from
a business point of view. As with online shopping, it helps consumers shop quickly and
save time; on the other hand, it also supports the company and lets it profit.

Kaushal Parikh et al. [2] proposed web scraping detection with the help of machine
learning, which is valuable for research-dependent companies. Web scraping has always
been a difficult attack to prevent: every time a company places its data on the internet,
it is probable that the data could be copied, pasted, and then used in another context
without the corporation knowing about it. A lot of protection mechanisms are already
in place, but some of them continue to be bypassed. This is where machine learning
steps in. Machine learning is quite effective at pattern detection, so if we succeed in
making the machine understand the behaviour of an intruder, it can prevent these types
of threats from occurring. Web scraping solutions are aimed primarily at translating
complex data obtained over networks into structured data that can be stored and
examined in a central database, and thus have a significant impact on the outcome.

Sameer Padghan et al. [3] proposed an approach where data extraction from web pages
is done easily with the assistance of web scraping. This method enables data to be
scraped from numerous websites, which minimizes human intervention, saves time, and
also enhances the relevance and quality of the data. It also supports the user in
gathering data from a site, saving the data for their own purposes, and using it as
they wish. The scraped information may be used for database development, for research
purposes, and for similar activities. The use of scraping will increase significantly
and will often strain the scraped site's infrastructure in obtaining the details;
however, this can be mitigated by using efficient and safe web scraping methods. This
method should be treated as a blessing that must be used carefully for the advancement
of the human race.
3. Detail of the Mini Project
3.1 Technologies Used
Flask is a web framework: a Python module that lets you develop web
applications easily. It has a small and easy-to-extend core; it is a microframework
that does not include an ORM (Object Relational Mapper) or similar features.
It does, however, have many useful features such as URL routing and a template
engine. It is a WSGI web application framework.
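As a minimal sketch of the routing idea described above (the route and greeting
below are illustrative only, not part of this project's code):

from flask import Flask

app = Flask(__name__)

@app.route("/hello/<name>")   # URL routing: <name> is captured from the URL
def hello(name):
    return "Hello, " + name + "!"

if __name__ == "__main__":
    app.run(debug=True)       # Flask's built-in development server

Visiting /hello/world in a browser would return "Hello, world!"; the project's
own application in Section 3.2 uses the same route-decorator pattern.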
Beautiful Soup is a Python library designed for quick turnaround projects like
screen scraping. Three features make it powerful:
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating,
searching, and modifying a parse tree: a toolkit for dissecting a document and
extracting what you need. It doesn't take much code to write an application.
Beautiful Soup automatically converts incoming documents to Unicode and
outgoing documents to UTF-8. You don't have to think about encodings unless the
document doesn't specify an encoding and Beautiful Soup can't detect one; then
you just have to specify the original encoding.
Beautiful Soup sits on top of popular Python parsers like lxml and html5lib,
allowing you to try out different parsing strategies or trade speed for flexibility.
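As a quick illustration of these navigation and search idioms on an invented HTML
snippet (the tag names and text here are made up for demonstration):

from bs4 import BeautifulSoup

html = '<div class="review"><p class="author">Asha</p><p>Great phone!</p></div>'
soup = BeautifulSoup(html, "html.parser")   # "lxml" or "html5lib" could be used instead

author = soup.find_all("p", {"class": "author"})[0].text   # search by tag and class
comment = soup.div.find_all("p")[1].text                   # navigate to the div, then index
print(author, "-", comment)                                # prints: Asha - Great phone!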
Gunicorn ('Green Unicorn') is a Python WSGI HTTP server for UNIX. It uses a pre-
fork worker model ported from Ruby's Unicorn project. The Gunicorn server is
broadly compatible with various web frameworks, simply implemented, light on
server resource usage, and fairly speedy.
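Assuming the project's code is saved as app.py with the Flask instance named app
(as in Section 3.2; the report does not state the file name, so the module name
here is an assumption), the application could be served with a command such as
gunicorn app:app, which starts pre-forked worker processes in front of the
Flask app instead of its development server.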
3.2 Code
from flask import Flask, render_template, request
from flask_cors import cross_origin
import requests
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen as uReq

app = Flask(__name__)

@app.route('/', methods=['GET'])  # route to display the home page
@cross_origin()
def homePage():
    return render_template("index.html")

@app.route('/review', methods=['POST', 'GET'])  # route to show the review comments in a web UI
@cross_origin()
def index():
    if request.method == 'POST':
        try:
            searchString = request.form['content'].replace(" ", "")
            flipkart_url = "https://www.flipkart.com/search?q=" + searchString
            uClient = uReq(flipkart_url)          # fetch the search results page
            flipkartPage = uClient.read()
            uClient.close()
            flipkart_html = bs(flipkartPage, "html.parser")
            # each search result sits in a div with this Flipkart-generated class
            bigboxes = flipkart_html.findAll("div", {"class": "_1AtVbE col-12-12"})
            del bigboxes[0:3]                     # the first three boxes are banners, not products
            box = bigboxes[0]                     # take the first real product
            productLink = "https://www.flipkart.com" + box.div.div.div.a['href']
            prodRes = requests.get(productLink)   # fetch the product page
            prodRes.encoding = 'utf-8'
            prod_html = bs(prodRes.text, "html.parser")
            commentboxes = prod_html.find_all('div', {'class': "_16PBlm"})

            filename = searchString + ".csv"
            fw = open(filename, "w")
            headers = "Product, Customer Name, Rating, Heading, Comment\n"
            fw.write(headers)
            reviews = []
            for commentbox in commentboxes:
                try:
                    name = commentbox.div.div.find_all('p', {'class': '_2sc7ZR _2V5EHH'})[0].text
                except:
                    name = 'No Name'

                try:
                    rating = commentbox.div.div.div.div.text
                except:
                    rating = 'No Rating'

                try:
                    commentHead = commentbox.div.div.div.p.text
                except:
                    commentHead = 'No Comment Heading'

                try:
                    comtag = commentbox.div.div.find_all('div', {'class': ''})
                    custComment = comtag[0].div.text
                except Exception as e:
                    custComment = 'No Comment'    # fallback so custComment is always defined
                    print("Exception while extracting comment: ", e)

                mydict = {"Product": searchString, "Name": name, "Rating": rating,
                          "CommentHead": commentHead, "Comment": custComment}
                # write the row to the CSV file as well as to the in-memory list
                fw.write("{}, {}, {}, {}, {}\n".format(searchString, name, rating,
                                                       commentHead, custComment))
                reviews.append(mydict)
            fw.close()
            # the last matched box is a navigation link rather than a review, so drop it
            return render_template('results.html', reviews=reviews[0:(len(reviews) - 1)])
        except Exception as e:
            print('The Exception message is: ', e)
            return 'something is wrong'
    else:
        return render_template('index.html')

if __name__ == "__main__":
    # app.run(host='127.0.0.1', port=8001, debug=True)
    app.run(debug=True)
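Assuming the file is saved as app.py (the report does not name it), running
python app.py starts Flask's development server at http://127.0.0.1:5000/.
Submitting a product name on the home page posts it to the /review route, which
scrapes the first matching Flipkart product, writes its reviews to
<searchString>.csv, and renders them through results.html.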
3.3 Output
Conclusion
Extracting data through scraping technology is a newly evolving activity in the
data harvesting arena. Though many companies still use manual processes for
extracting data, web scraping solutions will transform the traditional method of
extraction. With the exponential growth in this field, the day is not far when
scraping becomes a widespread phenomenon and most companies understand the value
of scraping innovation and how dramatically it enables them to remain ahead in
the race. This report presents a survey of web scraping technology, covering what
it is, how it works, the popular tools and technologies of web scraping, the
websites used with this technology, and the fields that make the most use of it.
References
[1] Renita Crystal Pereira and Vanitha T, "Web Scraping of Social Networks," International Journal of Innovative Research in Computer and Communication Engineering, Vol. 3, pp. 237-240, 2015.
[2] Kaushal Parikh, Dilip Singh, Dinesh Yadav and Mansingh Rathod, "Detection of Web Scraping Using Machine Learning," Open Access International Journal of Science and Engineering, Vol. 3, pp. 114-118, 2018.
[3] Sameer Padghan, Satish Chigle and Rahul Handoo, "Web Scraping-Data Extraction Using Java Application and Visual Basics Macros," Journal of Advances and Scholarly Researches in Allied Education, Vol. 15, pp. 691-695, 2018.
[4] Anand V. Saurkar, Kedar G. Pathare and Shweta A. Gode, "An Overview On Web Scraping Techniques And Tools," International Journal on Future Revolution in Computer Science & Communication Engineering, Vol. 4, pp. 363-367, 2018.
[5] Federico Polidoro, Riccardo Giannini, Rosanna Lo Conte, Stefano Mosca and Francesca Rossetti, "Web Scraping Techniques to Collect Data on Consumer Electronics and Airfares for Italian HICP Compilation," Statistical Journal of the IAOS, pp. 165-176, 2015.
[6] Jan Kinne and Janna Axenbeck, "Web Mining of Firm Websites: A Framework for Web Scraping and a Pilot Study for Germany," 2019.
[7] Ingolf Boettcher, "Automatic Data Collection on the Internet," pp. 1-9, 2015.
[8] Erin J. Farley and Lisa Pierotte, "An Emerging Data Collection Method for Criminal Justice Researchers," Justice Research and Statistics Association, pp. 1-9, 2017.