
Analysis of Web Server Log Files

This document discusses analyzing web server log files to increase the effectiveness of a website using a web mining tool. It summarizes the purpose of web usage mining as capturing, analyzing, and modeling web server logs to automatically discover user behavior on a website. The document then describes the types of information contained in web server log files and provides an example. It also briefly summarizes some related work on web usage mining. Finally, it outlines the key information that can be collected from web server logs using the WebLog Expert web mining tool, such as the number of hits, visitors, referring websites, and time spent on the website.


International Journal of Advanced Computer and Mathematical Sciences

ISSN 2230-9624. Vol 4, Issue 1, 2013, pp1-8


http://bipublication.com

ANALYSIS OF WEB SERVER LOG FILES TO INCREASE THE
EFFECTIVENESS OF THE WEBSITE USING WEB MINING TOOL
Arvind K. Sharma1 and P.C. Gupta2
1 School of Computer and System Sciences, Jaipur National University, Jaipur, Rajasthan-India
2 Department of Computer Science, University of Kota, Kota, Rajasthan-India
* Corresponding author: Email: [email protected]

[Received-24/10/2012, Accepted-30/11/2012]

ABSTRACT:
The fundamental role of Web usage mining is to capture, analyze, and model the Web server logs. It
automatically discovers the usage behaviour of the Website users. In this paper, we have applied a Web
mining tool to analyze the Web server log files of a Website. The tool reports important information about
visitors, top errors, web browsers and the platforms most used by the Website users. The obtained
information should help to increase the effectiveness of the Website.

Keywords: Web Usage Mining, Web Server Log Files, WebLog Expert

[I] INTRODUCTION
Due to the rapid growth of information available over the Internet during the past few years, the
World Wide Web has become the most popular and powerful platform to store, propagate and retrieve
information as well as to mine useful knowledge. It is a means of communication and information
dissemination, and it serves as a platform for exchanging various kinds of information. The
volume of information available on the Internet has been increasing rapidly with the explosive
growth of the World Wide Web. While users are provided with more information and service
options, it has become more difficult for them to find the ‘right’ or ‘interesting’ information, a
problem commonly known as information overload. The primary goal of web usage mining
is to extract useful information from web data or web log files. To do this, web usage mining
investigates the potential knowledge in the browsing patterns of the users and finds the
correlation between the pages under analysis.
The rest of the paper is organized as follows: Section-II explains web usage mining. Section-III
discusses web server log files. Section-IV summarizes related works. Section-V demonstrates the
WebLog Expert tool. Section-VI contains experimental results. The conclusion is given in
Section-VII, and the references are listed in the last section.

[II] WEB USAGE MINING
Web usage mining is the application of data mining techniques to discover the knowledge
hidden in the web server log file, such as user access patterns, and to analyze users’ behaviour
patterns. Web mining deals with the data related to the web, which may be the data actually
present in web pages or the data concerning web activities. Web mining refers to the overall
process of discovering potentially useful and previously unknown information from web
documents and services[1]. This helps to improve the service of web-based applications. The
user access log files present very significant information about a web server. Web usage mining
is applied to solve several real-world problems by discovering interesting user navigational
patterns. Web information falls into two categories: deep web and shallow web. The deep web
includes information stored in searchable databases, which is often inaccessible to search
engines and can be accessed only through the Website’s interface. On the other hand, shallow
web information can be accessed by search engines without accessing the web databases[2].

[III] WEB SERVER LOG FILES
The server log files are simple text files which record the activity of the users on the server.
These files reside on the server. If a user visits the Website many times, an entry is created on
the Server for each visit. The main source of raw data is the web access logs, which are known
as web server log files. The log files can be analyzed over a time period, which can be specified
on an hourly, daily, weekly or monthly basis. Typical web server log files contain the following
information: IP address, request time, method (e.g. GET), URL of the requested file, HTTP
version, return code, the number of bytes transferred, the referrer’s URL and the user agent.

3.1 Taxonomy of Web Server Logs
Web server logs are plain text, i.e. ASCII, files which are independent of the server[11]. There
are some differences between server software packages, but traditionally there are four
categories of web server logs, which are shown in fig.1.

Fig.1: Taxonomy of Web Server Logs (Web Server Logs: Transfer Log, Agent Log, Error Log, Referrer Log)

Of these, the Transfer Log and the Error Log are standard. The Referrer and Agent Logs may or
may not be turned on at the Server, or they may be added to the Transfer Log file to create an
Extended log file format.

3.2 Sample of Web Server Log Files
A sample of web log data is shown in table-1 below. A user Id is the unique name used to
identify a user. The user Id is displayed when the user makes a transaction on the website or
interacts with it by other means.

Table-1: Sample of Web Server Log Files
Host             User Id   URL
117.197.6.155    1         /images/pic010.jpg
131.253.41.47    2         images/chemlab_d.jpg
95.108.158.238   3         /images/pic8.jpg
117.201.98.145   4         /images/Result_Scan.jpg

However, other users cannot see the real name or other personal information. Each row of the
web log file represents a URL that the user visits. Attributes of the web log file include Visit
Time, Host, URL and other miscellaneous information about the user’s activity.

[IV] RELATED WORKS
In recent years, web usage mining has been one of the favourite areas of many researchers. In
one work, a novel approach was introduced for classifying user navigation patterns and
predicting users’ future requests[3]. In another work, a methodology was proposed and web log
data was used to improve marketing activities[4]. Valter Cumbi et al. have done a case study of an e-
government portal initiative in Mozambique for visitor analysis[5]. A survey on mining
interesting knowledge from web logs is presented in[6]. Ramya et al. have proposed a
methodology for discovering patterns in usage mining that improves the quality of data by
reducing the quantity of data[7]. Maheswara Rao et al. have introduced a research framework
capable of preprocessing web log data completely and efficiently. This framework helps to mine
the usage behaviour of the users[8]. One work describes a recommender system capable of
online personalization based on user patterns[9]. In one more work, a methodology was
proposed for mining interesting knowledge from web access logs[10].

[V] WEB LOG EXPERT TOOL
WebLog Expert is a fast and powerful Apache and IIS log file analyzer. It is a freeware web
mining tool[12]. It helps to reveal important statistics about Website usage, such as the activity
of visitors, access statistics, paths through the site, visitors' browsers, and much more. The tool
can read log files of the most popular Web servers: Apache (Combined and Common log
formats) and IIS 4/5/6/7. It can also read ZIP, GZ and BZ2 compressed log files, so the user
does not need to unpack the logs manually before analyzing them. The GUI of the WebLog
Expert tool contains a menu, a toolbar and the list of profiles, as shown in fig.2 below.

Fig.2: GUI Interface of WebLog Expert

5.1 Information Collected by WebLog Expert
• Number of Hits– This number signifies the number of times any resource is accessed in a
  Website. A hit is a request to a web server for a file, i.e. a web page, image, JavaScript file,
  Cascading Style Sheet, etc.
• Number of Visitors– A visitor is a human who navigates to the website and browses one or
  more pages on the website.
• Visitor Referring Website– The referring website gives the URL of the website which
  referred the visitor to the website under consideration.
• Visitor Referral Website– The referral website gives the URL of the website to which the
  website under consideration refers its visitors.
• Time and Duration– This information in the web server logs gives the time at which, and the
  duration for which, the website was accessed by a particular user.
• Path Analysis– Path analysis gives the path that a particular user has followed in accessing
  the contents of a website.
• Visitor IP Address– This information gives the IP addresses of the visitors who visited the
  website.
• Browser Type– This information gives the type of web browser that was used for accessing
  the website.
• Platform– This information gives the type of operating system or platform that was used to
  access the website.
• Cookies– A cookie is a message given to a web browser by a web server. The browser stores
  the message in a text file called a cookie. The message is then sent back to the server each
  time the browser requests a page from the server. The main purpose of cookies is to identify
  users and possibly prepare customized web pages for them.
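
Most of the items listed in Section 5.1 can be read directly from a single log entry. The following is a minimal sketch in Python, assuming the Apache Combined log format mentioned in Section V. The regular expression, the function name parse_line and the sample entry are illustrative assumptions; they are not taken from WebLog Expert or from the log files analyzed in this paper.

```python
import re
from datetime import datetime

# Apache Combined Log Format:
# host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # The request field normally holds "METHOD URL PROTOCOL".
    parts = entry.pop("request").split()
    entry["method"] = parts[0] if len(parts) > 0 else None
    entry["url"] = parts[1] if len(parts) > 1 else None
    entry["protocol"] = parts[2] if len(parts) > 2 else None
    entry["status"] = int(entry["status"])
    # A size of "-" means that no response body was sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

# Hypothetical sample entry: the host and URL come from Table-1,
# every other value is invented for illustration.
SAMPLE = (
    '117.197.6.155 - - [08/Oct/2012:18:03:08 +0530] '
    '"GET /images/pic010.jpg HTTP/1.1" 200 5120 '
    '"http://www.example.com/index.html" '
    '"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"'
)

if __name__ == "__main__":
    print(parse_line(SAMPLE))
```

In this sketch the host field corresponds to the visitor IP address, the referrer field to the referring website, and the agent field carries the browser type and platform, which a full analyzer would decode further.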

5.2 Hits, Visits & Page Views
• Hit– Each file sent to a web browser by a server is known as an individual hit.
• Visit– A visit happens when someone visits the website. It contains one or more page
  views/hits. One visitor can have many visits to the website. A unique visitor is determined by
  the IP address or cookie. By default, a visit session is terminated when a user is inactive for
  more than 30 minutes. If the visitor leaves the website and comes back more than 30 minutes
  later, WebLog Expert will report 2 visits. If the visitor comes back within 30 minutes,
  WebLog Expert will still report 1 visit.
• Page View– A page view is counted each time a visitor views a web page on the website,
  irrespective of how many hits are generated. Pages are comprised of files, and every image in
  a page is a separate file. When a visitor looks at a page, i.e. a page view, they may see
  numerous images, graphics, pictures etc. and generate multiple hits. For example, if a web
  page contains 5 images, a request for that page will generate 6 hits on the web server (one for
  the page itself and 5 for the images) but only one page view.

5.3 HTTP Status Codes
Several status codes of the Hypertext Transfer Protocol are shown in table-2 below.

Table-2: HTTP Status Codes
Status Code   Description
101           Switching Protocols
200           OK
201           Created
202           Accepted
203           Non-Authoritative Information
204           No Content
205           Reset Content
206           Partial Content
300           Multiple Choices
302           Found
303           See Other
304           Not Modified
305           Use Proxy
306           (Unused)
307           Temporary Redirect
400           Bad Request
401           Unauthorized
402           Payment Required
403           Forbidden
404           Not Found
405           Method Not Allowed
406           Not Acceptable
407           Proxy Authentication Required
408           Request Timeout
409           Conflict
410           Gone
411           Length Required
412           Precondition Failed
413           Request Entity Too Large
414           Request-URI Too Long
415           Unsupported Media Type
416           Requested Range Not Satisfiable
417           Expectation Failed
500           Internal Server Error
501           Not Implemented
502           Bad Gateway
503           Service Unavailable
504           Gateway Timeout
505           HTTP Version Not Supported

[VI] EXPERIMENTAL RESULTS
In this work we have analyzed the web server log files of an educational institution’s Website,
www.davkota.org[13], with the help of the WebLog Expert tool. The statistical/text log file
data provided by WebLog Expert have been used for experimentation. Various analyses have
been carried out to identify the behaviour of the Website users.

6.1 Experiment-1
The log files contain the data from October 8, 2012 to October 14, 2012 (Time range:
10/8/2012 18:03:08 - 10/14/2012 17:29:07). Over this period the log files stored 25 MB of
data, of which 3.4 MB remained after the preprocessing task. The preprocessed web log data
was loaded into WebLog Expert and the complete analysis was carried out. Fig.3 shows that
the Website is accessed every day, but it receives far more visits from Monday to Saturday.
The higher number of visits received by the Website from Monday to Saturday reinforces the
earlier finding that the Website is mainly used by working people, employees or students.
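
The visit counts reported in this experiment depend on the 30-minute inactivity rule described in Section 5.2. The following is a minimal sketch of that rule in Python. The function name count_visits and the sample timestamps are assumptions made for illustration, and a gap of exactly 30 minutes is treated here as the same visit, which is one possible reading of the rule; it is not the actual implementation used by WebLog Expert.

```python
from datetime import datetime, timedelta

# Inactivity period after which a returning visitor starts a new visit.
SESSION_TIMEOUT = timedelta(minutes=30)

def count_visits(hits, timeout=SESSION_TIMEOUT):
    """Count visits from an iterable of (visitor_id, timestamp) pairs.

    A new visit starts when a visitor is seen for the first time or
    returns after more than `timeout` of inactivity.
    """
    last_seen = {}
    visits = 0
    # Process hits in chronological order so that gaps are measured correctly.
    for visitor, when in sorted(hits, key=lambda hit: hit[1]):
        previous = last_seen.get(visitor)
        if previous is None or when - previous > timeout:
            visits += 1
        last_seen[visitor] = when
    return visits

# Hypothetical hits: the hosts come from Table-1, the times are invented.
hits = [
    ("117.197.6.155", datetime(2012, 10, 8, 18, 3, 8)),
    ("117.197.6.155", datetime(2012, 10, 8, 18, 20, 0)),  # within 30 minutes: same visit
    ("117.197.6.155", datetime(2012, 10, 8, 19, 0, 0)),   # 40-minute gap: new visit
    ("131.253.41.47", datetime(2012, 10, 8, 18, 5, 0)),   # second visitor
]

if __name__ == "__main__":
    print(count_visits(hits))
```

Here the visitor is identified by IP address only; Section 5.2 notes that a cookie may be used instead, which would change only the first element of each pair.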

Fig.3: Total Number of Hits Ratio

The daily web data for the entire week from October 8, 2012 to October 14, 2012 give the
number of Hits, Files, Pages, Visits and Kbytes recorded each day. For this week, the
maximum Hits per Day were 637, the maximum Files per Day were 144, the maximum Pages
per Day were 85, the maximum Visits per Day were 13 and the maximum Kbytes per Day
were 903. (see Appendix-A)
We found different types of errors that occurred during web surfing. These errors are shown in
table-3 below.

Table-3: 404 Errors (Page Not Found)
S. No.   Error                     Hits
1.       404 Not Found             368
2.       503 Service Unavailable   31
3.       403 Forbidden             1
         Total                     400

It is clear from table-3 that 404 is the most frequently occurring error. Some other client and
server errors are also shown. The graphical presentation of daily error types is shown in fig.4.

Fig.4: Daily Errors Types

6.2 Experiment-2
In this experiment, the web server log files collected from October 8, 2012 to October 14,
2012 are analyzed with the WebLog Expert tool. It has been found that the users visit the
Website with a range of Web browsers. Google Chrome is one of the most used web browsers,
as shown in fig.5. (see Appendix-B)

Fig.5: Mostly Used Web Browser

The different versions of the Web browser Internet Explorer are shown in table-4. The largest
share of Internet Explorer visitors (35.48%) accessed the website with version 7.x.

Table-4: Different Versions of Internet Explorer
S. No.   Web Browser             Hits   Visitors   % of Total Visitors
1.       Internet Explorer 7.x   119    11         35.48%
2.       Internet Explorer 8.x   140    9          29.03%
3.       Internet Explorer 6.x   30     7          22.58%
4.       Internet Explorer 9.x   54     4          12.90%
         Total                   343    31         100.00%
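
The last column of table-4 is each version's visitor count divided by the total number of Internet Explorer visitors, for example 11 / 31 × 100 ≈ 35.48%. As a minimal sketch in Python, the calculation can be reproduced from the visitor counts in the table; the function name visitor_shares is an assumption made for illustration.

```python
# Visitor counts per Internet Explorer version, taken from table-4.
visitors = {
    "Internet Explorer 7.x": 11,
    "Internet Explorer 8.x": 9,
    "Internet Explorer 6.x": 7,
    "Internet Explorer 9.x": 4,
}

def visitor_shares(counts):
    """Return each entry's share of the total, in percent, rounded to 2 places."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 2) for name, n in counts.items()}

shares = visitor_shares(visitors)

if __name__ == "__main__":
    for name, share in shares.items():
        print(f"{name}: {share:.2f}%")
```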

The graphical presentation of this information is shown in fig.6.

Fig.6: Different versions of Internet Explorer

Several Operating Systems have been used by the website users to access the Website.
Windows Operating Systems have been used most frequently, and Windows XP has been
used by most of the users to access the Website. The most used Operating Systems are shown
in fig.7 below. (see Appendix-C)

Fig.7: Mostly Used Operating System

[VII] CONCLUSION
The Web is a huge storehouse of web pages and links. It offers a large quantity of data for
Internet users. When users visit the Web they leave copious information in the form of web
access logs, which are heterogeneous, complex, high dimensional and incremental in nature.
Analyzing this type of data helps to determine the browsing interests of the website users. In
this paper, a complete analysis of web server log files has been done using the WebLog Expert
tool. The obtained results should help Website Maintainers, Website Analysts, Website
Designers and Developers to manage their System by identifying the errors that occurred and
the corrupted and broken links. This work should also help to increase the effectiveness of the
Website.

REFERENCES
[1] Margaret H. Dunham, S. Sridhar, “Data Mining: Introductory and Advanced Topics”,
Pearson Education.
[2] Arvind Kumar Sharma, P.C. Gupta, “Exploration of efficient methodologies for the
Improvement in web mining techniques: A survey”, International Journal of Research in IT &
Management (IJRIM), Vol.1, Issue-3, July 2011.
[3] Liu, H., et al., “Combined mining of Web server logs and web contents for classifying user
navigation patterns and predicting user’s future requests”, Data and Knowledge Engineering,
2007, Vol 61, Issue 2, pp.304-330.
[4] Arya, S., et al., “A methodology for web usage mining and its applications to target group
identification”, Fuzzy sets and systems, 2004, pp.139-152.
[5] Valter Cumbi et al., “Mozambican Government Portal Case Study: Visitor Analysis”,
IST-Africa 2007 Conference Proceedings, Paul Cunningham and Miriam Cunningham (Eds),
IIMC International Information Management Corporation, 2007.
[6] F.M. Facca and P.L. Lanzi, “Mining interesting Knowledge from Web logs: a survey”,
Elsevier Science, Data and Knowledge Engineering, 2005, 53, pp.225-241.
[7] G. R.C. et al., “An Efficient Preprocessing Methodology for Discovering Patterns and
Clustering of Web Users using a Dynamic ART1 Neural Network”, Fifth International
Conference on Information Processing, 2011, Springer-Verlag.
[8] Maheswara Rao.V.V.R and Valli Kumari.V, “An Enhanced Pre-Processing Research
Framework for web Log Data Using a Learning Algorithm”, Computer Science and
Information Technology, pp. 1-15, 2011. DOI: 10.5121/csit.2011.1101.
[9] Mehrdad Jalali et al., “A Recommender System for Online Personalization in the WUM
Applications”, Proceedings of the World Congress on Engineering and Computer Science
2009, Vol. II, WCECS 2009, October 20-22, 2009, San Francisco, USA.
[10] K Sudheer Reddy et al., “An Effective Methodology for Pattern Discovery in Web Usage
Mining”, International Journal of Computer Science and Information Technologies, Vol. 3 (2),
2012, 3664-3667.
[11] L.K. Joshila Grace et al., “Web Log Data Analysis and Mining”, in Proc CCSIT-2011,
Springer CCIS, Vol. 133, Jan 2011.
[12] [Online] http://www.weblogexpert.com
[13] DAV Kota website’s server is available at: [Online] http://www.davkota.org

AUTHOR’S PROFILE

Arvind K. Sharma received his Master’s Degree in Computer Science from Maharshi Dayanand
University, Rohtak and his M.Phil Degree in Computer Science from Alagappa University,
Karaikudi. He is pursuing a Ph.D in Computer Science at the School of Computer and Systems
Sciences, Jaipur National University, Jaipur, Rajasthan, India. His areas of interest include
Data Mining, Web Usage Mining and Web Applications.

P. C. Gupta received his Ph.D degree in Computer Science from Bundelkhand University,
Jhansi. Presently he is working as Associate Professor in the Department of Computer Science
& Informatics, University of Kota, Rajasthan, India. He has published various technical and
research papers in National and International Conferences and Journals. His research interest
lies in Artificial Intelligence and Neural Networks.

Appendix–A
General Statistics

Appendix –B

Appendix –C
