Analysis of Web Server Log Files to Increase the Effectiveness of the Website Using Web Mining Tool
[Received-24/10/2012, Accepted-30/11/2012]
ABSTRACT:
The fundamental role of Web usage mining is to capture, analyze, and model Web server logs in order
to automatically discover the usage behaviour of Website users. In this paper, we have applied a Web
mining tool to analyze the Web server log files of a Website. The analysis yields important
information about visitors, the most frequent errors, and the Web browsers and platforms most used
by the Website's users. The obtained information can help increase the effectiveness of the Website.
Keywords: Web Usage Mining, Web Server Log Files, WebLog Expert
[I] INTRODUCTION
Due to the explosive growth of information available over the Internet during the past few years,
the World Wide Web has become the most popular and powerful platform to store, propagate and
retrieve information as well as to mine useful knowledge. It is a means of communication and
information dissemination, and it serves as a platform for exchanging various kinds of information.
The volume of information available on the Internet has been increasing rapidly with the growth of
the World Wide Web. While users are provided with more information and service options, it has
become more difficult for them to find the ‘right’ or ‘interesting’ information, a problem commonly
known as information overload. The primary goal of web usage mining is to find useful information
in web data or web log files. To do this, web usage mining focuses on investigating the potential
knowledge in the browsing patterns of users and on finding the correlation between the pages
analyzed.
The rest of the paper is organized as follows: Section-II explains web usage mining. Section-III
discusses web server log files. Section-IV summarizes related work. Section-V describes the WebLog
Expert tool. Section-VI contains experimental results. The conclusion is given in Section-VII,
followed by the references.

[II] WEB USAGE MINING
Web usage mining is the application of data mining techniques to discover the knowledge
hidden in the web server log file, such as user access patterns, and to analyze users' behaviour
patterns. Web mining deals with data related to the web, which may be the data actually present in
web pages or data concerning web activities. Web mining refers to the overall process of
discovering potentially useful and previously unknown information from web documents and
services[1]. This helps to improve the service of web-based applications. User access log files
hold significant information about a web server. Web usage mining is applied to several real-world
problems by discovering interesting user navigational patterns. Web information falls into two
categories: deep web and shallow web. The deep web includes information stored in searchable
databases that is often inaccessible to search engines and can be accessed only through the
Website's interface. On the other hand, shallow web information can be accessed by search engines
without accessing the web databases[2].

[III] WEB SERVER LOG FILES
Server log files are simple text files which record the activity of users on the server. These
files reside on the server. Each time a user visits the Website, a new entry is created in the log.
The main source of raw data is the web access logs, which are known as web server log files. The
log files can be analyzed over a time period, which can be specified on an hourly, daily, weekly or
monthly basis. Typical web server log files contain the following information: IP address, request
time, method (e.g. GET), URL of the requested file, HTTP version, return code, number of bytes
transferred, the referrer's URL and the user agent.

3.1 Taxonomy of Web Server Logs
Web server logs are plain text (ASCII) files which are independent of the server[11]. There are
some differences between server software packages, but traditionally there are four categories of
web server logs, which are shown in fig.1.

[Figure: Web Server Logs, branching into Transfer Log, Agent Log, Error Log and Referrer Log]
Fig.1: Taxonomy of Web Server Logs

The Transfer Log and the Error Log are standard. The Referrer and Agent Logs may or may not be
turned on at the server, or their fields may be added to the Transfer Log file to create an
Extended log file format.

3.2 Sample of Web Server Log Files
A sample of web log data is shown in table-1 below. A user Id is a unique name used to identify a
user. The user Id is recorded when the user makes a transaction on the website or by other means.
However, other users cannot see the real name and other personal information. Each row of the web
log file represents a URL that the user visited. Attributes of the web log file include Visit Time,
Host, URL and other miscellaneous information about the user's activity.

Table-1: Sample of Web Server Log Files
Host             User Id   URL
117.197.6.155    1         /images/pic010.jpg
131.253.41.47    2         images/chemlab_d.jpg
95.108.158.238   3         /images/pic8.jpg
117.201.98.145   4         /images/Result_Scan.jpg

[IV] RELATED WORKS
In recent years, web usage mining has become a favourite area of many researchers. In one work, a
novel approach was introduced for classifying user navigation patterns and predicting users' future
requests[3]. In another work, a methodology was proposed in which web log data was used to improve
marketing activities[4]. Valter Cumbi et al. carried out a case study of an e-government portal
initiative in Mozambique for visitor analysis[5]. A survey on mining interesting knowledge from web
logs is presented in [6]. Ramya et al. proposed a methodology for discovering patterns in usage
mining that improves the quality of data by reducing its quantity[7]. Maheswara Rao et al.
introduced a research framework capable of preprocessing web log data completely and efficiently;
this framework helps to mine the usage behaviour of users[8]. One work describes a recommender
system capable of online personalization based on user patterns[9]. In one more work, a methodology
was proposed for mining interesting knowledge from web access logs[10].

[V] WEB LOG EXPERT TOOL
WebLog Expert is a fast and powerful log file analyzer for Apache and IIS. It is a freeware web
mining tool[12]. It helps reveal important statistics about Website usage, such as activity of
visitors, access statistics, paths through the site, visitors' browsers, and much more. The tool
can read log files of the most popular Web servers: Apache (Combined and Common log formats) and
IIS 4/5/6/7. It can also read ZIP, GZ and BZ2 compressed log files, so there is no need to unpack
the logs manually before analysis. The GUI of the WebLog Expert tool contains a menu, a toolbar and
the list of profiles, as shown in fig.2 below.

5.1 Information Collected by WebLog Expert
• Number of Hits– The number of times any resource of the Website is accessed. A hit is a request
  to a web server for a file, i.e. a web page, image, JavaScript file, Cascading Style Sheet, etc.
• Number of Visitors– A visitor is a person who navigates to the website and browses one or more
  pages on it.
• Visitor Referring Website– The URL of the website that referred the visitor to the website under
  consideration.
• Visitor Referral Website– The URL of the website to which the website under consideration refers
  its visitors.
• Time and Duration– The time at which, and the duration for which, the website was accessed by a
  particular user.
• Path Analysis– The path that a particular user followed in accessing the contents of the website.
• Visitor IP Address– The IP addresses of the visitors who visited the website.
• Browser Type– The type of web browser that was used to access the website.
• Platform– The type of operating system or platform that was used to access the website.
• Cookies– A cookie is a message given to a web browser by a web server. The browser stores the
  message in a text file. The message is then sent back to the server each time the browser
  requests a page from it. The main purpose of cookies is to identify users and possibly prepare
  customized web pages for them.
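
Several of the items listed above can be derived directly from the raw log lines. The following is a minimal sketch in Python, not the implementation used by WebLog Expert. It assumes the Apache Combined Log Format, approximates a visitor by a distinct host address, and guesses browser and platform from a few substrings of the user agent. The function names and the sample lines (including their timestamps, referrer and user agent strings) are hypothetical and serve only as illustration.

```python
import re
from collections import Counter

# Apache Combined Log Format:
# host ident user [time] "method url protocol" status bytes "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def classify_browser(agent):
    """Rough browser guess from the user agent string (illustrative only)."""
    # Order matters: Chrome user agents also contain the token "Safari".
    for token, name in (("Chrome", "Google Chrome"), ("Firefox", "Firefox"),
                        ("MSIE", "Internet Explorer"), ("Safari", "Safari")):
        if token in agent:
            return name
    return "Other"


def classify_platform(agent):
    """Rough operating system guess from the user agent string (illustrative only)."""
    for token, name in (("Windows NT 5.1", "Windows XP"), ("Windows NT 6.1", "Windows 7"),
                        ("Mac OS X", "Mac OS X"), ("Linux", "Linux")):
        if token in agent:
            return name
    return "Other"


def summarize(lines):
    """Compute hits, distinct hosts, referrers, browsers and platforms."""
    records = [r for r in map(parse_line, lines) if r is not None]
    return {
        "hits": len(records),
        # Assumption: one distinct host address counts as one visitor.
        "visitors": len({r["host"] for r in records}),
        "referrers": Counter(r["referrer"] for r in records if r["referrer"] != "-"),
        "browsers": Counter(classify_browser(r["agent"]) for r in records),
        "platforms": Counter(classify_platform(r["agent"]) for r in records),
    }


# Hypothetical sample lines; only the host addresses and two URLs echo Table-1.
SAMPLE = [
    '117.197.6.155 - - [08/Oct/2012:10:15:32 +0530] "GET /images/pic010.jpg HTTP/1.1" '
    '200 5120 "http://www.example.com/" "Mozilla/5.0 (Windows NT 5.1) '
    'AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4"',
    '131.253.41.47 - - [08/Oct/2012:11:02:10 +0530] "GET /images/chemlab_d.jpg HTTP/1.1" '
    '200 2048 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"',
    '117.197.6.155 - - [09/Oct/2012:09:45:01 +0530] "GET /images/missing.jpg HTTP/1.1" '
    '404 209 "http://www.example.com/" "Mozilla/5.0 (Windows NT 5.1) '
    'AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4"',
]

summary = summarize(SAMPLE)
print(summary["hits"], summary["visitors"])
print(summary["browsers"].most_common(1))
```

Counting distinct hosts is only an approximation of the number of visitors, since several users may share one address and one user may appear under several. Time and duration, path analysis and cookies need session reconstruction and are left out of this sketch.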
Fig.3: Total Number of Hits Ratio
The daily web data for the entire week from October 8, 2012 to October 14, 2012 report the number
of Hits, Files, Pages, Visits and Kbytes per day. For this week, the maximum Hits per Day were 637,
the maximum Files per Day were 144, the maximum Pages per Day were 85, the maximum Visits per Day
were 13 and the maximum Kbytes per Day were 903 (see Appendix-A).
Different types of errors occurred during web surfing. They are shown in table-3 below.
Table-3: 404 Errors (Page Not Found)

6.2 Experiment-2
In this work, the web server log files collected from October 8, 2012 to October 14, 2012 are
analyzed with the WebLog Expert tool. It was found that users visit the Website with a variety of
Web browsers. Google Chrome is one of the most used web browsers, as shown in fig.5 (see
Appendix-B).
Fig.5: Mostly Used Web Browser
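
Per-day figures and a list of 404 errors of the kind reported in the experiments can be approximated from raw log lines. The sketch below is written in Python and is not the procedure used by WebLog Expert. It assumes Apache Common or Combined Log Format, counts every request as a hit, and converts the transferred bytes to Kbytes by dividing by 1024. The function name and the sample lines are hypothetical, and the resulting numbers do not reproduce the values reported in this paper.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Leading fields shared by the Common and Combined Log Formats.
LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def daily_stats(lines):
    """Aggregate hits and Kbytes per calendar day, plus 404 counts per URL."""
    hits = Counter()
    kbytes = defaultdict(float)
    not_found = Counter()
    for line in lines:
        m = LINE.match(line)
        if m is None:
            continue  # skip malformed lines
        day = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z").date()
        hits[day] += 1
        if m["size"].isdigit():  # the size field is "-" when no body was sent
            kbytes[day] += int(m["size"]) / 1024
        if m["status"] == "404":
            not_found[m["url"]] += 1
    return hits, kbytes, not_found


# Hypothetical sample lines for illustration only.
SAMPLE = [
    '117.197.6.155 - - [08/Oct/2012:10:15:32 +0530] "GET /images/pic010.jpg HTTP/1.1" 200 2048',
    '131.253.41.47 - - [08/Oct/2012:11:02:10 +0530] "GET /images/missing.jpg HTTP/1.1" 404 512',
    '95.108.158.238 - - [09/Oct/2012:09:45:01 +0530] "GET /images/pic8.jpg HTTP/1.1" 200 1024',
]

hits, kbytes, not_found = daily_stats(SAMPLE)
busiest_day = max(hits, key=hits.get)
print("Maximum hits per day:", hits[busiest_day], "on", busiest_day)
print("404 errors:", not_found.most_common())
```

The same loop could be extended to count Files and Pages per day, provided a rule is fixed for which requests count as a page, for example by file extension.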
The graphical presentation of this information is shown in fig.6.
Fig.6: Different versions of Internet Explorer

Several operating systems have been used by the website users to access the Website. Windows
operating systems were used most frequently, and Windows XP was used by most of the users. The most
used operating systems are shown in fig.7 below (see Appendix-C).
Fig.7: Mostly Used Operating System

[VII] CONCLUSION
The Web is a huge storehouse of web pages and links. It offers a large quantity of data to Internet
users. When users visit the Web they leave copious information in the form of web access logs,
which are heterogeneous, complex, high dimensional and incremental in nature. Analyzing such data
helps to determine the browsing interests of website users. In this paper, an analysis of web
server log files has been carried out using the WebLog Expert tool. The obtained results can help
Website maintainers, analysts, designers and developers to manage their systems by identifying
errors and corrupted or broken links. This work can also help increase the effectiveness of the
Website.

REFERENCES
[1] Margaret H. Dunham, S. Sridhar, “Data Mining: Introductory and Advanced Topics”, Pearson
Education.
[2] Arvind Kumar Sharma, P.C. Gupta, “Exploration of efficient methodologies for the Improvement in
web mining techniques: A survey”, International Journal of Research in IT & Management (IJRIM),
Vol. 1, Issue 3, July 2011.
[3] Liu, H., et al., “Combined mining of Web server logs and web contents for classifying user
navigation patterns and predicting user’s future requests”, Data and Knowledge Engineering, 2007,
Vol. 61, Issue 2, pp.304-330.
[4] Arya, S., et al., “A methodology for web usage mining and its applications to target group
identification”, Fuzzy Sets and Systems, 2004, pp.139-152.
[5] Valter Cumbi et al., “Mozambican Government Portal Case Study: Visitor Analysis”, IST-Africa
2007 Conference Proceedings, Paul Cunningham and Miriam Cunningham (Eds), IIMC International
Information Management Corporation, 2007.
[6] F.M. Facca and P.L. Lanzi, “Mining interesting Knowledge from Web logs: a survey”, Elsevier
Science, Data and Knowledge Engineering, 2005, 53, pp.225-241.
[7] G. R.C. et al., “An Efficient Preprocessing Methodology for Discovering Patterns and Clustering
of Web Users using a Dynamic ART1 Neural Network”, Fifth International Conference on Information
Processing, 2011, Springer-Verlag.
[8] Maheswara Rao.V.V.R and Valli Kumari.V, “An Enhanced Pre-Processing Research Framework for web
Log Data Using a Learning Algorithm”, Computer Science and Information Technology, pp. 1-15, 2011.
DOI: 10.5121/csit.2011.1101.
[9] Mehrdad Jalali et al., “A Recommender System for Online Personalization in the WUM
Applications”, Proceedings of the World Congress on Engineering and Computer Science 2009, Vol. II,
WCECS 2009, October 20-22, 2009, San Francisco, USA.
[10] K Sudheer Reddy et al., “An Effective Methodology for Pattern Discovery in Web Usage Mining”,
International Journal of Computer Science and Information Technologies, Vol. 3 (2), 2012,
pp.3664-3667.
[11] L.K. Joshila Grace et al., “Web Log Data Analysis and Mining”, in Proc. CCSIT-2011, Springer
CCIS, Vol. 133, Jan 2011.
[12] [Online] http://www.weblogexpert.com
[13] DAV Kota website’s server is available at: [Online] http://www.davkota.org
Appendix –B
Appendix –C