
Cybernetics and Systems: An International Journal, 37: 587–608

Copyright © 2006 Taylor & Francis Group, LLC
ISSN: 0196-9722 print/1087-6553 online
DOI: 10.1080/01969720600734586

THE USE OF DATA MINING TO PREDICT WEB PERFORMANCE

LESZEK BORZEMSKI
Institute of Information Science and Engineering,
Wroclaw University of Technology, Wroclaw, Poland

Web mining is the area of data mining that deals with the extraction of
interesting knowledge from World Wide Web data. The purpose of
this article is to show how data mining may offer a promising strategy
for discovering and building knowledge usable in the prediction of
Web performance. We introduce a novel Web mining dimension—a
Web performance mining that discovers the knowledge about Web
performance issues using data mining. The analysis is aimed at the
characterization of Web performance as seen by the end users. Our
strategy involves discovering knowledge that characterizes Web per-
formance perceived by end users and then making use of this knowl-
edge to guide users in future Web surfing. For that, the predictive
model using a two-phase mining procedure is constructed on the basis
of the clustering and decision tree techniques. The usefulness of the
method for predicting future Web performance has been confirmed in a real-world experiment, which showed an average correct
prediction ratio of about 80%. The WING (Web pING) measurement
infrastructure was used for active measurements and data gathering.

INTRODUCTION
Data Mining (DM), also known as Knowledge Discovery in Databases
(KDD), means a process of nontrivial extraction of implicit, previously
unknown, and potentially useful information (such as knowledge rules,

Address correspondence to Leszek Borzemski, Institute of Information Science and Engineering, Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, Wroclaw, 50-370, Poland. E-mail: [email protected]

constraints, regularities) from data in databases (Chen et al. 1996). The term "data mining" sometimes refers to the overall process of the KDD
and sometimes only to the methods and techniques that are used in data
pattern extraction and knowledge discovery. The terms DM and KDD
are sometimes used interchangeably as well. Here we follow a broader
definition that is oriented toward the whole knowledge discovery pro-
cess, which typically involves the following iterative steps: data selection,
data preparation and transformation, data analysis to identify patterns,
and the evaluation of mining results (Bose and Mahapatra 2001; Zhang
et al. 2003). Mining knowledge from databases has been identified by
many researchers as a key research topic in database systems, machine
learning, and knowledge management. DM is used for a variety of pur-
poses, ranging from improving service or performance to analyzing
and detecting interesting patterns and characteristics in different
application domains. Generally, DM is an approach to knowledge gene-
ration, which is the first and basic process in knowledge management
(Pechenizkiy et al. 2005; Spiegler 2003).
Web mining is the application of DM methods and techniques to dis-
cover useful knowledge from World Wide Web data. Web mining currently focuses on four main research directions related to the categories of Web data, namely Web content mining, Web usage mining,
Web structure mining, and Web user profile mining (Chakrabarti 2003;
Fürnkranz 2005; Wang et al. 2005). Typically in Web mining, we analyze
such data sources as the content of the Web documents (usually text and
graphics), Web data logs (e.g., IP addresses, date and time of Web access),
data describing the Web structure (i.e., HTML and XML tags), and Web
user profile data. Web content mining discovers what Web pages are
about and reveals new knowledge from them. Web usage mining con-
cerns the identification of patterns in user navigation through Web pages
and is performed for reasons of service personalization, system improve-
ment, and usage characterization (Facca and Lanzi 2005). Web usage
mining is also known as clickstream analysis. Web structure mining
investigates how Web documents are structured and discovers the model
underlying the link structures of the World Wide Web. Web user profile
mining discovers user’s profiles based on users’ behavior on the Web, for
example, for the needs of e-commerce recommendation systems (Cho et al.
2002; Kim et al. 2002).
The purpose of this article is to spread some of the basic ideas
underlying the application of data mining in Web performance
evaluation. We propose a method of Web network behavior evaluation
and prediction using DM techniques; thereby, we introduce a novel
Web mining dimension—a Web performance mining that discovers the
knowledge about Web performance behavior from Web data. Our aim
is to characterize Web performance from the perspective of end users
(Web clients, i.e., Web browsers), that is, in the sense of the Web server-to-browser throughput. In particular, driven by the clients' needs, we focus on
predicting the performance behavior of the Web through the knowledge
gained in DM. Thanks to the approach presented in this article, we can extract previously unknown knowledge from data to predict an upcoming state of good or poor performance in access to a Web site. Our method involves dis-
covering knowledge that characterizes variability in Web performance
and using it in order to guide users in their further Internet exploration.
Several applications can benefit from knowing the future possible
performance characteristics of the Web. In fact, network performance
prediction allows applications to optimize their operational efficiency,
which is directly impacted by network interactions.
This article presents a methodology, tools, and empirical study of
Web performance mining. We want to answer how to develop a simple
but yet fully usable DM-based predictive model describing Web perform-
ance from the perspective of end users. To attain our goal, there are a few
separate issues to be dealt with. First, a Web performance problem should
be defined so that it is adequate for the end-users’ performance view. The
second question is what Web performance data we should use. How and
where do we get, measure, or collect needed data? The next question is
how do we predict Web performance parameters using DM?
From the end-user’s perspective, Web performance is measured by
the time between clicking the Web page link and the completion of total
page downloading. This is the period of time from the point at which the
user requests access to the Web page to the point at which the data is
presented on the user's computer. For the user to surf the Web effectively, the entire data transfer path, including the path through the Web server, local computer, Web intermediary systems, and the network, must present as little delay and as much throughput as possible. Down-
loading time depends mostly on the total page size (including embedded
objects), Web page and site design, Web server response, network
latency, and available data transfer rate. However, it has never been easy
to determine whether slow responses perceived by the end users are due to the network or to the end system on either side, i.e., the user and Web server sides. Generally, we cannot exactly diagnose and isolate the key sources of a performance problem. Since some 60–80% of the downloading latency perceived by users is attributable to network bottleneck issues on the transmission path between the user and the Web host (Cardellini et al.
2002), we focus on Web data transmission performance.
Application-level data measured near a client (or in a similar
location) is needed for the evaluation, estimation, and prediction of
Web page downloading performance. Generally, the user is mainly inter-
ested in what throughput (transfer speed) is achieved while downloading
a page. Therefore, in Web page access, there is a major interest in measuring server-to-client HTTP traffic to determine the available bandwidth.
Client-side measurements and processing can be made by the client itself
but only by means of the special clients, not the commonly used Web
browsers. In practice this is a nontrivial task, and special measurement
and processing infrastructure must be provided. To be an effective sol-
ution, such infrastructure might provide its service for a community of cli-
ents located nearby. Therefore, we have developed the measurement
infrastructure called WING for the active probing, measuring, collecting,
and visualization of Web transactions (Borzemski and Nowak 2004b).
WING can instantly or periodically probe selected Web servers; collect and store data about Web page downloading; and preprocess that data for further analysis, including statistical analysis and data mining.
Web performance can be measured and evaluated by means of pass-
ive observations or active probing (benchmarking). The dataset used in
our DM analysis was collected actively by the WING system. We mea-
sured periodical downloading of specific Web pages from several Web
sites all around the world. In Borzemski and Nowak (2004a), the
obtained dataset was used to derive a descriptive overall performance
model of the Web as seen by end users in the Wroclaw University of
Technology campus network. This model was developed using tra-
ditional data analysis approach and showed the correlation between
median values of TCP connections’ Round-Trip Times (RTTs) and
HTTP throughputs over all servers under consideration. Here, we use
the same raw dataset in the deployment of a new predictive performance
model by means of DM.
The prediction of Internet performance has been always a challenge
and a topical issue (Abusina et al. 2005; Arlitt et al. 2005; He et al. 2005).
Such prediction might be useful when a Web client schedules its activity
in time and space, and it chooses when to access the Web server and
what Web server is to be selected. The decision might be based on the pre-
diction of the network performance prior to actually starting the access
and data transfer. Examples include peer-to-peer and overlay networks,
Web-based distributed computing infrastructures and grids (Baker
et al. 2002), as well as distributed corporate portals (Daniel and Ward
2005) used to assess the contributions of intranets, grids, and portals to
knowledge management initiatives (Yousaf and Welzl 2005).
Many techniques are used to generate knowledge by means of DM.
The core mining techniques are clustering, classification, association,
and time series. In mining we used clustering and classification mining
functions as standardized in the DB2 Intelligent Miner for Data software
by IBM (IBM 2005).
The rest of this article is organized as follows. First, we give the
background of the research and review the related work. Next, we
present the proposed DM-based method to predict Web performance.
After that, we overview the measurement methodology and experiment
setup and show the application example of our method using real-world
Web measurements. Finally, we present the conclusion and describe
future work.

BACKGROUND AND RELATED WORK


Performance has always been a key issue in the design and operation of
computer systems. This is especially critical with regard to the Internet.
The performance of the Web is crucial to user satisfaction; users expect high quality from a Web application. As e-business is often overwhelmed by performance problems, Web service providers would act in a manner similar to NSPs (Network Service Providers) and set up the concept of the Quality of Web Service (Cardellini et al. 2002; Casalicchio and Colajanni 2001), which refers to the user's view of Web performance. NSPs often specify service levels, called service level agreements, that they commit to provide to the users. Then
the network performance parameters, such as packet loss, delay, jitter,
and throughput, are observed as well as predicted in the framework of
the network traffic management (Abusina et al. 2005). Therefore, it is
important to understand Web performance issues. Web performance
mining might help in this by showing the type of problems and when they
can occur. The application of DM methods and algorithms would predict
Web performance behavior while the user interacts within the Web. The
predictive models are perhaps the most popular results of DM and have
proven their usefulness in several applications.
Measurements in World Wide Web resulted in many datasets with
huge amounts of performance data collected for the administration or
operational reasons by means of passive and/or active measurements.
They are mostly spatio-temporal datasets organized in the form of a time
series of categorical or numerical type data. Examples of such datasets
include the logs and traffic traces from the Web servers, e-business sites,
and Internet links.
Basically, DM deals with datasets obtained from observational studies, which are connected with passive measurements. In contrast, in our work we deal with datasets collected in active measurements. In
passive measurements, we monitor a network whereas in active mea-
surements, we generate our own traffic and observe the response.
One crucial issue with passive measurements is that they rely on traffic
flowing across the link being measured. If we want to gather data over a long period of time, then there is the problem of the size and complexity of the data repositories, so appropriate sampling is needed. Even when all traffic is collected (which is practically impossible), we can end up with datasets containing mislabeled or missing information. However, very
often we cannot have any datasets without excitation and experimental
design. Therefore, we need to construct an experimental design in
such a way as to be able to estimate the effects of network probing.
Proper data usable for Web performance evaluation as it is considered
in this article can be effectively gathered only in appropriate active
measurements.
The performance prediction may be done in a short-term or long-term
way. We can perform predictions using formula-based or history-based
algorithms. Short-term forecasting requires the instantaneous measuring
of network performance and real-time calculations using forecasting for-
mula. However, very often we are not able to measure and calculate the
performance indexes instantaneously. In such cases we consider long-term forecasting. The purpose of long-term forecasting is to be able
to predict, with a high degree of certainty, how the Web (specific Web
server) will perform in the future based on its past performance and dis-
covered knowledge. Accurate long-term forecasting is generally thought
of as a time-consuming and tedious process but essential for almost any
Web user. It gives the opportunity to schedule user activities related to
the Web, e.g., in Grid systems when a group of users works together
and shares common Web and network resources. In this article, we just
deal with a long-term prediction based on historical information.
Data analysis in Web and Internet measurement projects is typically made using traditional statistical analysis. We first introduced DM
in the performance prediction analysis of Internet paths in our TRACE
project (Borzemski 2004), where we evaluated the IP-related performance
of the communication path between the user and Internet host locations
in a long-term prediction scale period. Using DM, we discovered how
the round-trip times of the packets and the number of hops they pass
on the routing path may vary with the day of the week and the time of
the measurement. After that, we used this knowledge to build a decision tree that predicts the future characteristics of relevant properties of a given Internet path on a long-term scale.
Users perceive a good Internet performance as characterized by low
latency and high throughput. The network latency is usually estimated by
the RTT, which is the delay between sending the request for data and
receiving (the first bit of) the reply. The lower the latency, the faster
we can do low-data activities. The other key element of network perform-
ance, throughput, also affects Web applications. Throughput is the
‘‘network bandwidth’’ metric, which tells about the actual number of
bytes transferred over a network path during a fixed amount of time.
Throughput determines the ‘‘speed’’ of a network as perceived by the
end user. The higher the throughput of the Internet connection, the
faster the user can surf the Internet.
When browsing the Web, users are concerned with the performance
of downloading entire pages, which are constructed from the base page
and embedded objects. Various factors and solutions impact Web
performance. Among them there are Web site architectures, available
network throughput, as well as browsers themselves. For instance, to
speed up Web site service, we can organize a number of servers in a clus-
ter with front-end components called Web switches that distribute
incoming requests among servers (Borzemski and Zatwarnicki 2005).
However, it has never been easy to determine whether slow responses
are due to either network problems or end-system problems on both
sides, i.e., user and server sides, or both. All these factors may affect ulti-
mate user-to-server (and vice versa) performance. User-perceived Web
quality is extremely difficult to study in an integrated way because we
cannot exactly diagnose and isolate Web performance key sources,
which, moreover, are transient and very complex in the relationships
between different factors that may influence each other (Cardellini et al.
2002; Casalicchio and Colajanni 2001).
Although Web applications are usually stateless, there are new appli-
cations that require predictable Web performance. They are becoming
a considerable portion of Internet traffic. They can be implemented, for
example, in Web-based grid infrastructures built within the Internet to
aggregate a wide variety of resources including supercomputers, storage
systems, and data sources distributed all over the world, and used as a sin-
gle unified resource for virtual communities or a service for knowledge
management in scientific laboratories (Baker et al. 2002; Lin and Hsueh
2003; Tian and Nakamori 2005). Well-predicted node-to-node TCP/IP throughput could then be a key issue in such applications.
Several active and passive measurement projects have been built on
the Internet, e.g., those in Brownlee et al. (2001), CAIDA (2005), Claffy and McCreary (1999), Luckie et al. (2001), MyKeynote (2005), SLAC (2005), and Zhang and Duffield (2001). Mostly they are aimed at performance problems related to the whole Internet or a significant part of it, collect large amounts of measured data regarding, for instance, round-trip delay among several node pairs over a few hours, days, or months, and use specific measurement and data analysis infrastructures. These projects can build so-called Internet weather at the IP
level. Most of them only measure the traffic and present the results
as some aggregated and temporary observations but do not provide
any network performance forecasting.
As new grid-based solutions are developing over the Internet, such performance prediction services are needed. The Network
Weather Service (NWS), which has the functionality of being analogous
to weather forecasting (Wolski 1998), is used in grids for making predic-
tions of the performance of various resource components, including the
network itself, by sending out and monitoring lightweight probes through
the network to the sink destinations at regular intervals. It is intended to
be a lightweight, noninvasive monitoring system. NWS operates over a
distributed set of performance sensors (network monitors) from which it gathers readings of the instantaneous network conditions. It can also monitor and forecast the performance of computational resources. NWS sen-
sors also exist for such components as CPU and disk. NWS runs only
in UNIX operating system environments and requires much installation
and administration work. It uses numerical models to generate short-
term forecasts of what the conditions will be for a given time frame.
However, NWS basic prediction techniques are not representative of the
transfer speed obtainable for large files (10 MB to 1 GB) and do not sup-
port long-term forecasts. New NWS developments address these pro-
blems, e.g., Swany and Wolski (2002) show the technique developed
for forecasting long HTTP transfers using a combination of short
NWS TCP/IP bandwidth probes and previously observed HTTP trans-
fers, particularly for longer-ranged predictions.
Besides grids, there is the world of peer-to-peer applications, such as
Gnutella and resilient overlay networks. They are becoming a great
portion of Internet traffic. Such peer–to–peer (P2P) application net-
works are also built among scientific communities. Such initiatives also
require well-predictable Internet performance.
In the area of Web performance, probably the best known is the commercial service MyKeynote (MyKeynote 2005). Our WING system
can also be used for similar measurements providing some featured
evaluations not available in competitive developments (Borzemski 2006).

PROPOSED DATA-MINING–BASED PREDICTION METHOD


In this section, we show how we construct a DM-based Web-
performance prediction model. We do not want to forecast the particular value of the RTT and throughput at a specific time; rather, we want a prediction of Web performance in the sense of general characteristics on a long-term scale. We classify Web performance into one of several classes.
Classes are derived based on past data and define distinguishable
conditions of Web performance behavior described by the RTT and
throughput categories. We assume that the time of day and day of week
mainly explain the variability in the RTT and throughput.
We propose to use a two-phase DM-based method in which the
clustering mining function is followed by the tree classification mining
function in such a way that the result of clustering is the input to
classification.
Clustering segments performance data records into groups (clusters) having similar properties. For this type of discovery, we use in this article the neural clustering algorithm, which employs a Kohonen Feature Map neural network (IBM 2005). The result of the clustering function
shows the number of detected clusters and the characteristics of data
records that make up each cluster. To partition the dataset so that measurement records with similar characteristics are grouped together, we use the day of the week, the time of day, the average round-trip time, and the throughput as the active attributes participating in cluster creation.
One of the disadvantages of cluster models is that there are no
explicit rules to define each cluster. The model obtained by clustering
is thus difficult to implement, and there is no clear understanding of
how the model assigns cluster IDs. Therefore, we propose to employ
the classification that may give a simpler model of classes. The induced
model consists of patterns, essentially generalizations over the data
records that are useful in distinguishing the classes. Once a model is
induced, it can be used to automatically predict the class of other unclas-
sified (future) data records.
The decision tree is one of the most popular classification algorithms
in current use in DM. A decision tree is a tree-shaped structure that represents sets of decisions. These decisions generate rules for the classification of a dataset. Trees can be grown to arbitrary accuracy and use validation datasets to avoid spurious detail. They are easy to understand
and modify. Moreover, in our situation the use of the tree representation
is preferable because it provides explicit, easy-to-understand rules for
each cluster for Web users, who are usually non-experts in data mining.
Hence, in the second step of the method, we use the tree classi-
fication mining function. The classification builds a decision-making
structure (a decision tree). Here, we explore a Classification and Regression Tree (CART) technique, modified for categorical attributes, used for the classification of a dataset (IBM 2005). CART segments a dataset by creating two-way splits on the basis of the two attributes time-of-day and day-of-week. The classes in the decision tree are
cluster IDs obtained in the first step of the method. The decision tree
represents the knowledge in the form of IF-THEN rules. Each rule
can be created for each path from the root to a leaf. The leaf node holds
the class prediction.
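To make the data flow of the two-phase procedure concrete, the sketch below is a minimal, hypothetical reconstruction in Python: synthetic data stands in for the measurement table, k-means stands in for the Kohonen Feature Map, and scikit-learn's CART-style decision tree stands in for the Intelligent Miner classifier. Column names and parameter choices are illustrative assumptions, not the original setup.

```python
# Hypothetical sketch of the two-phase procedure (not the Intelligent Miner setup):
# phase 1 clusters past measurements on all four active attributes,
# phase 2 learns a tree that predicts the cluster ID from day and time only.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "DAY_OF_WEEK": rng.integers(1, 8, n),       # 1..7
    "TIME_OF_DAY": rng.integers(1, 10, n),      # 9 probing slots per day
    "RTT": rng.uniform(40.0, 200.0, n),         # ms
    "THROUGHPUT": rng.uniform(50.0, 400.0, n),  # KB/s
})

# Phase 1: segment records with similar characteristics into clusters
# (k-means used here in place of the Kohonen Feature Map).
scaled = StandardScaler().fit_transform(data)
data["CLUSTER_ID"] = KMeans(n_clusters=9, n_init=10, random_state=0).fit_predict(scaled)

# Phase 2: a decision tree that maps the only attributes known a priori
# (day of week, time of day) to the cluster labels found in phase 1.
X = data[["DAY_OF_WEEK", "TIME_OF_DAY"]]
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, data["CLUSTER_ID"])
print("training accuracy:", round(tree.score(X, data["CLUSTER_ID"]), 2))
```

The point of the second phase is that the tree needs only calendar attributes at prediction time, so the clusters discovered from the full measurement records become reachable from the day of week and time of day alone.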

MEASUREMENT FRAMEWORK
To measure Web performance, we used the WING system developed at
our laboratory (Borzemski and Nowak 2004b). Several tools exist to
measure different parameters of network and Web performance, but
due to our specific needs, we have developed the measurement system
from scratch. Figure 1 shows the WING architecture.

Figure 1. WING architecture.

WING is a network measurement system that measures an end-to-end Web performance path between the Web site and the end user. It is implemented
at our university side only.
WING can send the Web-page requests to the targeted Web site and
monitor the response. It can collect live HTTP trace data near a user
workstation, distill all key aspects of each Web transaction during brows-
ing (Figure 2), and store all time-stamped measurements in the database
for further analysis. WING uses a real browser running under a user
operating system; hence, it perceives Web page downloading in the same manner as a real browser. The system may be freely programmed
for periodic measurements using scripts or may be used in ad hoc
mode—then it returns a visualization of page downloading by showing
the HTTP timeline chart and a number of detailed and aggregated data
about the downloading process. For the needs of Web performance pre-
diction, it determines the average transfer rate of the HTTP objects
downloaded using the TCP connection. WING measures the time inter-
val between the first byte packet and the last byte packet of the object
received by the client using that connection. The transfer rate (through-
put) is then calculated by dividing the number of bytes transferred by the
amount of time taken to transfer them. The FIRST BYTE (Figure 2) is
the time between the sending of the GET request and the reception of the
first packet including a requested component.

Figure 2. A typical Web transaction diagram.

The LEFT BYTES is the time spent for downloading the rest of the requested object. WING also
estimates the RTT from CONNECT time as the time taken to form a
connection by the browser with the server. It is shown in Figure 2 as
the time spacing between the SYN packet sent and the ACK-SYN packet
received by the client. Today's implementation of WING is done for MS IE, which is the most popular Web browser on the Internet; however, the service can monitor the activity of any browser. We should note that browsers download Web pages in different ways, so the Web page downloading time chart can differ; as a result, the actual user-perceived performance can differ as well, and the result may be inadequate. Measurements can be made instantly or periodically. The appli-
cation of instant probing is shown in Borzemski (2006). For the needs of
this article, WING was used in periodic Web measurements.
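As a minimal illustration of the two quantities derived from these timestamps, the sketch below computes the RTT from the connect handshake and the throughput from the first-byte-to-last-byte interval. The field names and sample values are assumptions for the example, not WING's internal format.

```python
# Hypothetical per-transaction record; field names mirror the CONNECT /
# FIRST BYTE / LEFT BYTES description but are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Transaction:
    syn_sent: float          # time the SYN packet was sent (s)
    syn_ack_received: float  # time the SYN-ACK was received (s)
    first_byte: float        # first packet of the object received (s)
    last_byte: float         # last packet of the object received (s)
    bytes_received: int      # object size in bytes

def rtt_ms(t: Transaction) -> float:
    """RTT estimated from the TCP connection setup (CONNECT time)."""
    return (t.syn_ack_received - t.syn_sent) * 1000.0

def throughput_kb_s(t: Transaction) -> float:
    """Bytes transferred divided by the first-byte-to-last-byte interval."""
    return t.bytes_received / 1024.0 / (t.last_byte - t.first_byte)

probe = Transaction(syn_sent=0.000, syn_ack_received=0.063,
                    first_byte=0.210, last_byte=0.820, bytes_received=137582)
print(f"RTT ~ {rtt_ms(probe):.0f} ms, throughput ~ {throughput_kb_s(probe):.0f} KB/s")
```

With the sample numbers above, a probe downloading the 137,582-byte test file would report an RTT of about 63 ms and a throughput of roughly 220 KB/s.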
The measurements analyzed in this article were performed from
September 4, 2002, to April 9, 2003. In the measurements, we used
the rfc1945.txt file as the probe, which was downloaded from several
Web pages. The file rfc1945.txt is large enough (its original size is
137582 bytes) to estimate the average transfer rate and yet not too large
to overload Internet links and target Web servers. The target servers
were chosen randomly by the Google search mechanism. Among a few
hundred links found by Google, we have chosen 209 direct links to that file.

Figure 3. Testbed configuration.

After preliminary tests in which some servers died, we started to
measure 83 servers. These servers were probed at regular intervals of
2 h 40 min, i.e., 10 times a day over a 24-hour period. Figure 3 shows
the testbed configuration and Figure 4 shows the partial list of target ser-
vers. More information about the experiment testbed and statistical data
analysis of the measurements is given in Borzemski and Nowak (2004a).

APPLYING THE METHOD TO REAL-LIFE SITUATIONS


In this section, we exemplify how we can build a DM-based prediction
model for Web performance forecasting in real-life situations. The fol-
lowing data preparation road map for DM is proposed. It involves the
server selection, data selection, data preparation and transformation,
clustering, cluster result analysis, cluster characterization using a
decision tree, and evaluation of the mining results. As our model can
be valid only for a given Web server connection, we must first choose a server for further analysis.

Figure 4. A partial list of target Web servers.

Here we show the performance pre-
diction model for the server that demonstrates network traffic to have
the greatest degree of self-similarity. Thus the selection of the server
for further DM analysis can be done in the following way. First, we elim-
inate servers with more than 10% failed Web transactions. The pool of
servers is reduced to 63 servers. Next, we purge the dataset by filtering
out the data records for those servers that had more than 5 failed mea-
surements per day. We obtain the set of 33 servers. The final selection is
made on the basis of network traffic self-similarity characteristic (Leland
et al. 1994). We chose the server with the traffic exhibiting high self-
similarity evaluated both for RTT and throughput time series. The Hurst
parameter H is calculated for the traffic data from each server. Four
candidate servers are considered in the final selection: #77, #161,
#167, and #181. The server #161 is finally selected (www.ii.uib.no,
Bergen, Norway) where parameter H is around 0.63 for both RTT and
throughput traffic series. Next, in the data selection step, we select
records and clean data. Only non-error Web transactions are considered.
Missing RTT and throughput values are estimated as averages. A data
record with RTT > 200 ms is classified as an outlier. Four attributes
(fields) are selected for DM: RTT, THROUGHPUT, DAY-OF-WEEK,
and TIME-OF-DAY. Thirty-four records are completed, and 75 records
are dropped in this step. In the data preparation and transformation
phase, we digitize TIME-OF-DAY into 9 equal-width intervals (00:00–02:40, ..., 21:20–00:00), DAY-OF-WEEK into 7 equal-width intervals, and RTT into 7 non-equal-width bins with breakpoints (0, 46, 56, 70, 90, 130, 165, 200 ms), and categorize THROUGHPUT by text labels:
low, medium, and high, where medium corresponds to 180–260 KB/s.

Figure 5. (a) Server characteristics; (b) sample database.

Figure 5a
shows characteristics for four final candidate servers, and Figure 5b gives
a sample database used in DM (before THROUGHPUT categorization).
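The discretization just described can be written down compactly. The sketch below (pandas, with assumed column names) applies the nine 2 h 40 min time-of-day slots, the seven RTT bins with the breakpoints listed above, and the three throughput labels; the low/high throughput boundaries are inferred from the stated 180–260 KB/s medium band.

```python
# Sketch of the data preparation and transformation step; column names and
# edge-case handling are assumptions, the breakpoints are those given above.
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # TIME-OF-DAY: 9 equal-width slots of 2 h 40 min (labels 1..9).
    out["TIME_OF_DAY"] = pd.cut(out["hour_decimal"],
                                bins=[h * 8.0 / 3.0 for h in range(10)],
                                labels=list(range(1, 10)), include_lowest=True)
    # RTT: 7 non-equal-width bins (ms); RTT > 200 ms was already dropped as an outlier.
    out["RTT_BIN"] = pd.cut(out["RTT"],
                            bins=[0, 46, 56, 70, 90, 130, 165, 200],
                            labels=list(range(1, 8)), include_lowest=True)
    # THROUGHPUT: medium covers 180-260 KB/s; the low/high cut-offs are inferred.
    out["THROUGHPUT_CAT"] = pd.cut(out["THROUGHPUT"],
                                   bins=[0.0, 180.0, 260.0, float("inf")],
                                   labels=["low", "medium", "high"])
    return out

sample = pd.DataFrame({"hour_decimal": [1.5, 11.0, 22.0],
                       "RTT": [48.0, 75.0, 150.0],
                       "THROUGHPUT": [120.0, 210.0, 300.0]})
print(prepare(sample))
```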
Now we show the clustering and decision tree mining analysis results
for that chosen Web server. We use the IBM Intelligent Miner for Data
8.1, and the measurements are stored in a relational table in a DB2 database whose rows contain the records of measurements collected for server #161 at the sampling times. In clustering, we use the following active attributes: RTT, THROUGHPUT, DAY-OF-WEEK, and
TIME-OF-DAY. Figure 6 presents 9 clusters (the biggest ones among
16 clusters derived), which were identified when all records from the
dataset are mined, i.e., from the whole time horizon under consideration.
The clusters are ordered according to their size; the smallest one includes about 7% of the records and the biggest one about 18%. The
description of clusters shows how the clusters differ from each other.
For instance, cluster #4 (9.41% of population) defines the set of records
where DAY-OF-WEEK is predominantly 4 (Wednesday), TIME-OF-DAY
is predominantly 5 (time interval 10:20–13:00), RTT is predominantly
2 (47–55 ms), and THROUGHPUT is medium.
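Cluster descriptions of this kind, i.e., the predominant value of each attribute within a cluster together with the cluster's share of the population, amount to a per-cluster mode. A small sketch with assumed column names:

```python
# Sketch: characterize each cluster by its population share and the most
# frequent (predominant) value of every discretized attribute.
import pandas as pd

def describe_clusters(mined: pd.DataFrame,
                      attrs=("DAY_OF_WEEK", "TIME_OF_DAY", "RTT_BIN", "THROUGHPUT_CAT")):
    rows = []
    for cid, group in mined.groupby("CLUSTER_ID"):
        row = {"cluster": cid, "share_pct": 100.0 * len(group) / len(mined)}
        for a in attrs:
            row[f"predominant_{a}"] = group[a].mode().iloc[0]
        rows.append(row)
    return pd.DataFrame(rows).sort_values("share_pct", ascending=False)

demo = pd.DataFrame({
    "CLUSTER_ID": [4, 4, 4, 7, 7],
    "DAY_OF_WEEK": [4, 4, 3, 6, 6],
    "TIME_OF_DAY": [5, 5, 5, 2, 3],
    "RTT_BIN": [2, 2, 3, 5, 5],
    "THROUGHPUT_CAT": ["medium", "medium", "medium", "low", "low"],
})
print(describe_clusters(demo))
```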
After clustering, we explore classification using the results of cluster-
ing as the inputs in decision tree deployment. The general objective of
creating the decision-making model is to use it to predict RTT and
throughput behavior most likely to be observed in the future.

Figure 6. Characteristics of clusters.

We assume
that the only a priori information available in the future is the DAY-OF-WEEK and the TIME-OF-DAY. A fragment of the resulting decision tree for the total dataset is shown in Figure 7. This decision tree
shows the accuracy of 75%. To extract the classification rules from the
decision tree, we need to represent knowledge in the form of IF-THEN
rules. One rule is created for each path from the root to a leaf. The leaf
node holds the class prediction. The purity in a leaf node indicates the
percentage of correctly predicted records in that node. As an example,
for the decision tree which part is shown in Figure 7, we can extract
the following classification rule: IF (TIME < 4.5) AND (DAY ≥ 3.5) AND (TIME ≥ 3.5) AND (DAY < 4.5) THEN CLUSTER ID = 4. It
says that if we want to download Web resources from the server #161 between 00:00 a.m. and 10:40 a.m. (TIME < 4.5) on Wednesday, Thursday, Friday, or Saturday (DAY ≥ 3.5), and when this is after 8:00 a.m. (TIME ≥ 3.5) and on Sunday, Monday, Tuesday, or Wednesday (DAY < 4.5), then we can expect that the network behavior is as described by the cluster #4.

Figure 7. A fragment of the decision tree for server #161.
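Rules of this form can be read off mechanically, one per root-to-leaf path. With a scikit-learn tree standing in for the CART model (the labels below are synthetic placeholders for the phase-one cluster IDs), export_text prints exactly this kind of threshold rule:

```python
# Sketch: dumping IF-THEN rules (one per root-to-leaf path) from a decision
# tree; synthetic labels stand in for the cluster IDs obtained in phase one.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
day = rng.integers(1, 8, 300)         # DAY-OF-WEEK, 1..7
time_slot = rng.integers(1, 10, 300)  # TIME-OF-DAY, 1..9
# Placeholder cluster labels loosely tied to day and time.
cluster_id = (day > 3).astype(int) * 4 + (time_slot > 4).astype(int) * 3

X = np.column_stack([day, time_slot])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, cluster_id)

# Each printed path reads as: IF DAY <= ... AND TIME <= ... THEN class = cluster ID.
print(export_text(tree, feature_names=["DAY", "TIME"]))
```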
Further analysis includes moving window (horizon) and incremental
DM. In both analyses, to obtain a single result we make multiple predictions, each time employing the same two-phase mining procedure (clustering and classification), but each time using a specific dataset defined by the "mining window." In incremental DM we use a one-week increment. The DM set starts small (one week of data) at 9/4/2002 and increases incrementally (week by week) up to the whole dataset. The result is shown in Figure 8. The accuracy varies, and we discover "abnormal" network conditions (10/9/02–11/19/02 and after 3/5/03), which
are connected with the network reconfiguration, which is confirmed by
a separate analysis of traceroute data.
Moving window DM is the approach that addresses timeliness in our history-based predictions. Mining over the entire history is probably impractical in this application (the Web is dynamically changing). The mining windows are defined on the basis of the timestamps
and include all samples from the dataset stream in the last n units of time
(n = 1 week, 2 weeks, ..., 20 weeks). Such windows are moved over the
dataset, starting from the beginning date.

Figure 8. Incremental DM.

Figure 9. Moving window mining (window size = 1).

The accuracy of the prediction of such moving DM for one, six, and twenty weeks is shown in
Figures 9, 10, and 11, respectively. As we can see, the accuracy of the
prediction varies very much from one window to another when the window size is equal to one week. Recent data (from one week only) is not enough for a prediction. Six-week windows give fairly high accuracies of around 70–80%. Twenty-week windows give better results, but such long windows could be rather impractical due to the dynamic characteristics
of the Web.
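Both windowing schemes reduce to a loop over the time-stamped measurement table. The sketch below is a hypothetical harness: mine_and_score is a stub standing in for one run of the two-phase clustering-plus-tree procedure, the timestamp column is an assumed name, and the date range matches the measurement period.

```python
# Sketch of incremental (growing) and moving-window mining over a
# time-stamped measurement table; mine_and_score is a placeholder for one
# run of the two-phase procedure and simply has to return an accuracy.
from datetime import datetime, timedelta
import pandas as pd

def mine_and_score(window: pd.DataFrame) -> float:
    # Placeholder: cluster + build a tree on `window`, return prediction accuracy.
    return float("nan") if window.empty else min(1.0, len(window) / 500.0)

def windowed_accuracies(df, start, end, window_weeks=None):
    """window_weeks=None gives a growing (incremental) window, otherwise a sliding one."""
    step = timedelta(weeks=1)
    results = []
    right = start + step
    while right <= end:
        left = start if window_weeks is None else right - timedelta(weeks=window_weeks)
        window = df[(df["timestamp"] >= left) & (df["timestamp"] < right)]
        results.append((right, mine_and_score(window)))
        right += step
    return results

# Exercise the loop on an empty table just to show the mechanics.
frame = pd.DataFrame({"timestamp": pd.to_datetime([])})
print(windowed_accuracies(frame, datetime(2002, 9, 4), datetime(2003, 4, 9), window_weeks=6)[:3])
```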

Figure 10. Moving window mining (window size = 6).

Figure 11. Moving window mining (window size = 20).

CONCLUSION AND FUTURE WORK


In this article, we introduced a new application of DM connected with
the prediction of Web performance called Web performance mining.
We proposed a two-phase mining procedure based on clustering and
classification. We demonstrated in a real-world experiment that our approach can play a usable role in Web performance prediction. The sample model gave a fairly high accuracy of about 80%.
There are still challenges in Web performance mining. We would like
to suggest some future research directions: (1) developing new measure-
ment scenarios to obtain datasets characterized by different
time-scale samplings, different target servers (server types, localizations,
and Internet connectivity), other WING locations, and other probe
definitions (type, size, and sampling); (2) a consideration of new issues,
performance factors, and DM techniques used in the analysis of data;
and (3) attempting to deal with integration issues in building automated
DM solutions for Web-based Grid applications.

REFERENCES
Abusina, Z. U. M., S. M. S. Zabir, A. Asir, D. Chakraborty, T. Suganuma, and
N. Shiratori. 2005. An engineering approach to dynamic prediction of
network performance from application logs. Int. Network Mgmt. 15:151–162.
Arlitt, M., B. Krishnamurthy, and J. C. Mogul. 2005. Predicting short-transfer
latency from TCP arcana: A trace-based validation. In Proc. of International
Measurement Conference IMC'05, Berkeley: USENIX Association, pp. 119–124.
Baker, M., R. Buyya, and D. Laforenza. 2002. Grids and grid technologies for
wide-area distributed computing. Softw. Pract. Exper. 32:1437–1466.
Borzemski, L. 2004. Data mining in evaluation of Internet path performance. In
Innovations in Applied Artificial Intelligence. Proc. 17th International
Conference on Industrial and Engineering Applications of Artificial Intelli-
gence and Expert Systems IEA/AIE 2004. Lecture Notes in Artificial Intelli-
gence, Vol. 3029, edited by B. Orchard, Ch. Yang, and M. Ali. Berlin:
Springer-Verlag, pp. 643–652.
Borzemski, L. 2006. Testing, measuring, and diagnosing Web sites from the
user’s perspective. International Journal of Enterprise Information Systems,
2:54–66.
Borzemski, L. and Z. Nowak. 2004a. An empirical study of Web quality: Measur-
ing the Web from the Wroclaw University of Technology campus. In Engin-
eering advanced Web applications, edited by M. Matera and S. Comai.
Princeton, NJ: Rinton Publishers, pp. 307–320.
Borzemski, L. and K. Nowak. 2004b. WING: A Web probing, visualization, and
performance analysis service. In Web Engineering, Proc. 4th International
Conference on Web Engineering ICWE 2004. Lecture notes in computer
science, Vol. 3140, edited by N. Koch, P. Fraternali, and M. Wirsing. Berlin:
Springer-Verlag, pp. 601–602.
Borzemski, L. and K. Zatwarnicki. 2005. Using adaptive fuzzy-neural control to
minimize response time in cluster-based Web systems. In Advances in Web
Intelligence. Proc. 3rd Atlantic Web Intelligence Conference AWIC’05. Lec-
ture Notes in Artificial Intelligence, Vol. 3528, edited by P. S. Szczepaniak,
J. Kacprzyk, and A. Niewiadomski. Berlin: Springer-Verlag, pp. 63–68.
Bose, I. and R. K. Mahapatra. 2001. Business data mining—A machine learning
perspective. Information & Management, 39:211–225.
Brownlee, N., Kc. Claffy, M. Murray, and E. Nemeth. 2001. Methodology for
passive analysis of a university Internet link. In Proc. Passive and Active
Measurement Workshop, Amsterdam.
CAIDA. 2005. The Cooperative Association for Internet Data Analysis.
http://www.caida.org/home (17 July 2005).
Cardellini, V., E. Casalicchio, C. Colajanni, and P. S. Yu. 2002. The state of the
art in locally distributed Web-server systems. ACM Computing Surveys,
34:263–311.
Casalicchio, E. and M. Colajanni. 2001. A client-aware dispatching algorithm
for Web clusters providing multiple services. In Proc. World Wide Web 10,
Hong Kong, pp. 535–544.
Chakrabarti, S. 2003. Mining the Web: Analysis of Hypertext and Semistructured
Data. San Francisco: Morgan Kaufmann.
Chen, M.-S., J. Han, and P. S. Yu. 1996. Data mining: An overview from a data-
base perspective. IEEE Trans. Knowledge and Data Engineering, 8:866–883.
Cho, Y. H., J. K. Kim, and S. H. Kim. 2002. A personalized recommender system
based on Web usage mining and decision tree induction. Expert Systems with
Applications, 23:329–342.
Claffy, K. C. and S. McCreary. 1999. Internet measurement and data analysis:
Passive and active measurement. University of California, San Diego: CAIDA.
Available: http://www.caida.org/outreach/papers/1999/Nae4hansen/Nae4hansen.html
Daniel, E. and J. Ward. 2005. Enterprise portals: Addressing the organizational
and individual perspectives of information systems. In Proc. Thirteenth Eur-
opean Conference on Information Systems ECIS 2005, edited by D. Bartmann,
F. Rajola, J. Kallinikos, D. Avison, R. Winter, P. Ein–Dor, J. Becker,
F. Bondendorf, and C. Weinhardt. Regensburg, Germany.
Facca, F. and P. Lanzi. 2005. Mining interesting knowledge from weblogs:
A survey. Data & Knowledge Engineering, 53:225–241.
Fürnkranz, J. 2005. Web mining. In Data mining and knowledge discovery handbook,
edited by M. Oded and L. Rokach. Berlin: Springer-Verlag, pp. 899–920.
He, Q., C. Dovrolis, and M. Ammar. 2005. On the predictability of large
transfer TCP throughput. In Proc. SIGCOMM’05, New York: ACM Press,
pp. 145–156.
IBM. 2005. DB2 Intelligent Miner. Available: http://www-306.ibm.com/software/data/iminer/tools.html (17 July 2005).
Kim, J. K., Y. H. Cho, W. J. Kim, J. R. Kim, and J. H. Suh. 2002. A personalized
recommendation procedure for Internet shopping support. Electronic Com-
merce Research and Applications, 1:301–313.
Leland, W., M. Taqqu, W. Willinger, and D. Wilson. 1994. On the self-similar
nature of Ethernet traffic. IEEE/ACM Trans. Networking, 2:1–15.
Lin, F.-R. and C.-M. Hsueh. 2003. Knowledge map creation and maintenance for
virtual communities of practice. In Proc. 36th International Conference on
Systems Sciences HICSS’03, edited by R. Sprague. Los Alamitos: IEEE
Press, P. 69.1.
Luckie, M. J., A. J. McGregor, and H.-W. Braun. 2001. Towards improving
packet probing techniques. In Proc. 1st ACM SIGCOMM Workshop on Inter-
net Measurement, edited by V. Paxson. New York: ACM Press, pp. 145–150.
MyKeynote. 2005. MyKeynote diagnosis page. Available: http://www.mykeynote.com (17 July 2005).
Pechenizkiy, M., A. Tsymbal, and S. Puuronen. 2005. Knowledge management
challenges in knowledge discovery systems. In Proc. 16th International Work-
shop on Database and Expert Systems Applications DEXA’05. Los Alamitos:
IEEE Press, pp. 433–437.
SLAC. 2005. Internet monitoring at Stanford Linear Accelerator Center. Available: http://www.slac.stanford.edu/comp/net/wan-mon.html (17 July 2005).
Spiegler, I. 2003. Technology and knowledge: Bridging a ‘‘generating’’ gap. Infor-
mation & Management, 40:533–539.
Swany, M. and R. Wolski. 2002. Multivariate resource performance forecasting
in the network weather service. In Proc. of the IEEE/ACM SC2002 Confer-
ence. Los Alamitos: IEEE Press, pp. 1–10.
Tian, J. and Y. Nakamori. 2005. Consideration on a service for knowledge man-
agement in scientific laboratories. In Proc. IEEE International Conference on
Services Systems and Services Management. Los Alamitos: IEEE Press, pp.
886–891.
Wang, X., A. Abraham, and K. A. Smith. 2005. Intelligent web traffic mining and
analysis. Journal of Network and Computer Applications, 28:147–165.
Wolski, R. 1998. Dynamically forecasting network performance using the
network weather service. Cluster Computing, 1:119–132.
Yousaf, M. M. and M. Welzl. 2005. A reliable network measurement and predic-
tion architecture for grid scheduling. 1st IEEE/IFIP International Workshop
on Autonomic Grid Networking and Management AGNM’05, edited by M. Z.
Hasan and V. Sander. Barcelona.
Zhang, S., C. Zhang, and Q. Yang. 2003. Data preparation for data mining.
Applied Artificial Intelligence, 17:375–381.
Zhang, Y. and N. Duffield. 2001. On the constancy of Internet path properties. In
Proc. 1st ACM SIGCOMM Workshop on Internet Measurement, edited by
V. Paxson. New York: ACM Press, pp. 197–211.
