Integratingecom PDF
Abstract
We show that the e-commerce domain can provide all
the right ingredients for successful data mining and
claim that it is a killer domain for data mining. We
describe an integrated architecture, based on our experience at Blue Martini Software, for supporting this
integration. The architecture can dramatically reduce
the pre-processing, cleaning, and data understanding
effort often documented to take 80% of the time in
knowledge discovery projects. We emphasize the need
for data collection at the application server layer (not
the web server) in order to support logging of data and
metadata that is essential to the discovery process. We
describe the data transformation bridges required from
the transaction processing systems and customer event
streams (e.g., clickstreams) to the data warehouse. We
detail the mining workbench, which needs to provide
multiple views of the data through reporting, data
mining algorithms, visualization, and OLAP. We conclude with a set of challenges.
Introduction
Integrated Architecture
Figure 1.
Data Collection
This section describes the data collection component of the proposed architecture. This component logs
customer transactions (e.g., purchases and returns)
and event streams (e.g., clickstreams). While the data
collection component is a part of every customer touch
point (e.g., web site, customer service applications, and
wireless applications), in this section we will describe
in detail the data collection at the web site. Most of the
concepts and techniques mentioned in this section
could be easily extended to other customer touch
points.
3.1 Clickstream Logging
3.2
3.3
Analysis
modeling is usually a prerequisite for analysis, our experience shows that many analyses require additional
data transformations that convert the data into forms
more amenable to data mining.
[Figure residue: table column labels SKU, Quantity, Price, Clothing, Clothing/Mens, Books, Books/Travel]
As we mentioned earlier, the business user can define product, promotion, and assortment hierarchies in
the Business Data Definition component. Figure 2
gives a simple example of a product hierarchy. This
hierarchical information is very valuable for analysis,
but few existing data mining algorithms can utilize it
directly. Therefore, we need data transformations to
convert this information to a format that can be used by
data mining algorithms. One possible solution is to
add a column indicating whether the item falls under a
[Figure residue: rule fragments "$12", "THEN Heavy spender", "IF Heavy spender"]
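The hierarchy transformation mentioned above, adding a column that indicates whether an item falls under some node of the product hierarchy, can be sketched as follows. This is only one reading of the approach, not the authors' implementation: the SKU identifiers, the `SKU_PATHS` mapping, and the function names are hypothetical, and the node names simply echo the column labels that survive from the figure.

```python
from typing import Dict, List

# Hypothetical product hierarchy: each SKU maps to its path from the root.
# The SKUs are invented for illustration.
SKU_PATHS: Dict[str, List[str]] = {
    "sku-101": ["Clothing", "Mens"],
    "sku-102": ["Clothing", "Womens"],
    "sku-201": ["Books", "Travel"],
}


def ancestor_nodes(path: List[str]) -> List[str]:
    """Return every hierarchy node a path falls under, e.g. Books and Books/Travel."""
    return ["/".join(path[: depth + 1]) for depth in range(len(path))]


def add_hierarchy_columns(rows, sku_paths, nodes):
    """Append one boolean column per hierarchy node to each order-line row."""
    out = []
    for row in rows:
        under = set(ancestor_nodes(sku_paths[row["sku"]]))
        flags = {node: node in under for node in nodes}
        out.append({**row, **flags})
    return out


order_lines = [
    {"sku": "sku-101", "quantity": 2, "price": 25.0},
    {"sku": "sku-201", "quantity": 1, "price": 30.0},
]
NODES = ["Clothing", "Clothing/Mens", "Books", "Books/Travel"]
flat = add_hierarchy_columns(order_lines, SKU_PATHS, NODES)
```

Each row in `flat` keeps its original fields and gains one flag per node, so an algorithm that only accepts flat attributes can still find patterns at the family or category level.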
Challenges
but generalizations are likely to be found at higher levels (e.g., families and categories). Some algorithms
have been designed to support tree-structured attributes
[15], but they do not scale to large product hierarchies. The challenge is to support such hierarchies
within the data mining algorithms.
Scale Better: Handle Large Amounts of Data
Yahoo! had 465 million page views per day in December of 1999 [16]. The challenge is to find useful
techniques (other than sampling) that will scale to this
volume of data. Are there aggregations that should be
performed on the fly as data is collected?
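As a hedged illustration of what an on-the-fly aggregation could look like, the sketch below keeps hourly page-view counts as events arrive instead of storing every raw event. The class name, the hourly grain, and the choice of page views as the measure are all assumptions made for this example; the paper leaves open which aggregations are appropriate.

```python
from collections import Counter
from datetime import datetime


class HourlyPageViewAggregator:
    """Maintains page-view counts per (hour, page) as events are collected."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, timestamp: datetime, page: str) -> None:
        # Truncate the timestamp to the hour so all events in that hour share a key.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        self.counts[(hour, page)] += 1

    def views(self, hour: datetime, page: str) -> int:
        # Counter returns 0 for keys it has never seen.
        return self.counts[(hour, page)]


agg = HourlyPageViewAggregator()
agg.record(datetime(1999, 12, 1, 9, 5), "/home")
agg.record(datetime(1999, 12, 1, 9, 40), "/home")
agg.record(datetime(1999, 12, 1, 10, 2), "/home")
```

The trade-off is the usual one for pre-aggregation: storage and query cost drop sharply, but any analysis that needs finer detail than the chosen grain is no longer possible from the aggregate alone.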
Support and Model External Events
External events, such as marketing campaigns (e.g.,
promotions and media ads), and site redesigns change
patterns in the data. The challenge is to be able to
model such events, which create new patterns that
spike and decay over time.
Support Slowly Changing Dimensions
Visitors' demographics change: people get married,
their children grow, their salaries change, etc. With
these changes, their needs, which are being modeled,
change. Product attributes change: new choices (e.g.,
colors) may be available, packaging material or design
may change, and even quality may improve or degrade.
These attributes that change over time are often referred to as "slowly changing dimensions" [4]. The
challenge is to keep track of these changes and provide
support for such changes in the analyses.
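One common way to keep track of such changes in a data warehouse is to store a new, dated version of a dimension row whenever an attribute changes, an approach often called a Type 2 slowly changing dimension. The sketch below illustrates that idea under stated assumptions: the class names, the `valid_from`/`valid_to` fields, and the example attributes are hypothetical and are not taken from the architecture described here.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DimensionRow:
    key: str
    attributes: dict
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


class VersionedDimension:
    """Keeps every historical version of a dimension row."""

    def __init__(self) -> None:
        self.rows: List[DimensionRow] = []

    def _current(self, key: str) -> Optional[DimensionRow]:
        for row in self.rows:
            if row.key == key and row.valid_to is None:
                return row
        return None

    def update(self, key: str, attributes: dict, as_of: date) -> None:
        current = self._current(key)
        if current is not None:
            if current.attributes == attributes:
                return  # nothing changed, so no new version is needed
            current.valid_to = as_of  # close the old version
        self.rows.append(DimensionRow(key, dict(attributes), as_of))

    def as_of(self, key: str, when: date) -> Optional[dict]:
        """Return the attributes that were in effect on the given date."""
        for row in self.rows:
            ends_later = row.valid_to is None or when < row.valid_to
            if row.key == key and row.valid_from <= when and ends_later:
                return row.attributes
        return None


dim = VersionedDimension()
dim.update("visitor-1", {"marital_status": "single"}, date(2000, 1, 1))
dim.update("visitor-1", {"marital_status": "married"}, date(2000, 6, 1))
```

With versions retained, a purchase made in March 2000 can be analyzed against the attributes the visitor had at that time rather than the current ones.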
Identify Bots and Crawlers
Bots and crawlers can dramatically change clickstream patterns at a web site. For example, Keynote
(www.keynote.com) provides site performance measurements. The Keynote bot can generate a request
multiple times a minute, 24 hours a day, 7 days a week,
skewing the statistics about the number of sessions,
page hits, and exit pages (the last page of each session).
Search engines conduct breadth-first scans of the site,
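A simple heuristic for the monitoring-bot pattern described above (steady requests around the clock) is to flag clients that are busy in almost every hour of a day. The sketch below is a minimal illustration only: the function name, the thresholds, and the assumption that requests arrive as (client identifier, timestamp) pairs are all hypothetical, and a real site would need additional signals to catch other kinds of crawlers.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def flag_suspected_bots(requests, min_busy_hours=20, min_requests_per_hour=60):
    """Return client ids that sustain a high request rate across many hours.

    requests: iterable of (client_id, datetime) pairs.
    Both thresholds are illustrative assumptions, not tuned values.
    """
    per_client_hours = defaultdict(lambda: defaultdict(int))
    for client_id, ts in requests:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        per_client_hours[client_id][hour] += 1

    suspects = set()
    for client_id, hours in per_client_hours.items():
        busy_hours = sum(1 for n in hours.values() if n >= min_requests_per_hour)
        if busy_hours >= min_busy_hours:
            suspects.add(client_id)
    return suspects


start = datetime(2000, 1, 1)
# A monitoring client issuing a request every 30 seconds for 24 hours.
monitor = [("monitor", start + timedelta(seconds=30 * i)) for i in range(2 * 60 * 24)]
# A human visitor making 15 requests within a quarter of an hour.
human = [("visitor", start + timedelta(minutes=i)) for i in range(15)]
suspects = flag_suspected_bots(monitor + human)
```

Sessions from flagged clients can then be excluded or analyzed separately so that they do not skew session counts, page hits, and exit-page statistics.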
References
[1]
[2]
[3]
Summary
We proposed an architecture that successfully integrates data mining with an e-commerce system. The
proposed architecture consists of three main components: Business Data Definition, Customer Interaction,
and Analysis, which are connected using data transfer
bridges. This integration effectively solves several
major problems associated with horizontal data mining
tools, including the enormous effort required to preprocess the data before it can be used for mining
and the difficulty of making the results of mining actionable. The tight
integration between the three components of the architecture allows for automated construction of a data
warehouse within the Analysis component. The shared
metadata across the three components further simplifies this construction and, coupled with the rich set of
mining algorithms and analysis tools (like visualization, reporting, and OLAP), also increases the efficiency
of the knowledge discovery process. The tight integration and shared metadata also make it easy to deploy
results, effectively closing the loop. Finally, we presented several challenging problems that need to be
addressed for further enhancement of this architecture.
Acknowledgments
We would like to thank the other members of the data
mining and visualization teams at Blue Martini Software and our documentation writer, Cindy Hall. We
wish to thank our clients for sharing their data with us
and helping us refine our architecture and improve
Blue Martini's products.
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]