DWM-theory
Data warehouse architecture and design strategies: top-down and bottom-up
a. A data warehouse is a repository of data collected from multiple heterogeneous
sources and organized under a unified schema.
b. A Data Warehouse therefore can be described as a system that consolidates
and manages data from different sources to assist an organization in making
proper decisions.
A. Top-down design
a. The Top-Down Approach to data warehouse design, introduced by Bill Inmon, starts
by building a comprehensive, centralized data warehouse that integrates data from
all organizational sources.
b. External data is extracted, transformed, and loaded (ETL) into the data warehouse,
which acts as the primary repository for the organization.
c. Once the data warehouse is fully established, specialized data marts are created for
individual departments such as finance, sales, or marketing, ensuring consistent and
unified reporting.
d. Data is stored in its purest form in the warehouse, while the data marts contain
subsets relevant to specific business areas.
e. This method is systematic and ensures robust data governance, making it suitable
for large organizations with complex data needs.
● Advantages:
1. The top-down approach provides better consistency across data marts
since all derive data from the same central data warehouse.
2. It ensures improved scalability, allowing organizations to add new data
marts without disrupting the existing system.
● Disadvantages:
1. It is time-consuming and expensive, making it unsuitable for smaller
organizations with limited resources.
2. The complexity of implementation can lead to challenges in managing and
maintaining the system.
B. Bottom-up approach
a. The Bottom-Up Approach, introduced by Ralph Kimball, focuses on first creating
data marts tailored to specific business needs, such as sales or marketing, and then
integrating them into a centralized data warehouse.
b. Data is extracted, transformed, and loaded into individual data marts, which offer
immediate reporting capabilities.
c. Over time, these data marts are unified to form a cohesive data warehouse, ensuring
that insights from various departments are integrated.
d. This approach allows for quicker implementation and cost efficiency, making it
suitable for smaller organizations or those with specific, immediate reporting needs.
e. It emphasizes flexibility and incremental development.
Advantages:
1. It allows faster and cheaper implementation, since individual data marts start
delivering reports quickly.
2. It supports flexible, incremental development, letting the warehouse grow one
business area at a time.
Disadvantages:
1. It can lead to data silos, with inconsistent or redundant data across different
departments.
2. Ensuring enterprise-wide data integration can be challenging due to varied data
mart structures and granularities.
Briefly explain Star Schema, Snowflake Schema, Fact Constellation Schema.
In data warehousing, schemas define the structure of how data is stored and organized.
Three common types of schemas are Star Schema, Snowflake Schema, and Fact
Constellation Schema, each serving specific analytical and reporting needs.
1. Fact table: A fact table stores quantitative data (facts) like sales, revenue, or counts.
It also contains foreign keys to dimension tables for contextual referencing.
2. Dimension: Dimension tables provide descriptive information (dimensions) such as
product details, time, or location, offering context for the facts in the fact table.
Star Schema
a. The Star Schema is the most straightforward and commonly used structure in data
warehousing.
b. It consists of a central fact table, which contains quantitative data, surrounded by
dimension tables.
c. These dimension tables provide descriptive information (like time, product, or
customer) about the facts, forming a star-like shape.
d. The dimension tables are typically denormalized, meaning they contain redundant
data to allow faster query performance and simpler structure.
e. This setup makes it easy to retrieve information and perform aggregations without
complex joins.
Usage: Star schemas are ideal for business reporting and analysis, as they allow for
easy data access and efficient querying, making them perfect for users who need to
perform straightforward and simple analytical tasks.
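To make the star layout concrete, here is a minimal sketch using pandas (table and column
names are made up): a sales fact table holds the measures and foreign keys, and two
denormalized dimension tables supply the descriptive context for a simple join-and-aggregate
query.
```python
import pandas as pd

# Dimension tables: denormalized, descriptive attributes (hypothetical data)
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Laptop", "Phone"],
    "category": ["Electronics", "Electronics"],
})
dim_date = pd.DataFrame({
    "date_id": [20240101, 20240102],
    "month": ["Jan", "Jan"],
    "year": [2024, 2024],
})

# Fact table: quantitative measures plus foreign keys to the dimensions
fact_sales = pd.DataFrame({
    "date_id": [20240101, 20240101, 20240102],
    "product_id": [1, 2, 2],
    "units_sold": [3, 5, 2],
    "revenue": [3000.0, 2500.0, 1000.0],
})

# A typical star-schema query: join facts to dimensions, then aggregate
report = (
    fact_sales
    .merge(dim_product, on="product_id")
    .merge(dim_date, on="date_id")
    .groupby(["category", "month"], as_index=False)["revenue"].sum()
)
print(report)
```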
Snowflake Schema
a. The Snowflake Schema is a more complex version of the star schema, where the
dimension tables are normalized into multiple related sub-tables.
b. Instead of keeping redundant data in the dimension tables, the snowflake schema
breaks down these tables further to remove redundancy and improve data integrity.
c. This normalization ensures less storage space is used, but the structure becomes
more intricate, requiring additional joins when querying.
d. Usage: Snowflake schemas are suited for scenarios where data integrity and storage
efficiency are more important than query simplicity. They are commonly used when
there is a need to handle large, complex datasets while maintaining a normalized
structure.
Fact Constellation Schema
a. The Fact Constellation Schema (also called a galaxy schema) contains multiple fact
tables that share common dimension tables.
b. It can be viewed as a collection of interlinked star schemas, which makes it suitable
for modeling several related business processes (for example, sales and shipping
sharing the time and product dimensions).
c. Usage: Fact constellation schemas are used in complex applications that require
analysis across multiple business processes, at the cost of a more intricate design
and higher maintenance effort.
ETL Process
a. ETL (Extract, Transform, Load) is a crucial process in data warehousing that helps in
gathering data from various source systems, transforming it into a suitable format,
and then loading it into a data warehouse for analysis and reporting.
b. ETL ensures that data is clean, standardized, and accurate before it is stored,
making it an essential part of maintaining a reliable and efficient data warehouse.
Extraction:
Data is first pulled from the various source systems (databases, files, or APIs) into a
temporary staging area.
Transformation:
a. The next step is to transform the extracted data. In this stage, the data is cleaned,
validated, and transformed according to business rules or standard formats.
b. This could involve tasks such as:
● Filtering: Removing unnecessary data.
● Cleaning: Filling in missing values or correcting errors.
● Joining: Merging data from multiple sources.
● Splitting: Dividing data into different fields.
● Sorting: Arranging data in a specific order based on key attributes.
Loading:
Finally, the transformed data is loaded into the data warehouse. The data can be loaded at
different frequencies—either in batches or in real-time—depending on the needs of the
organization. This step ensures that the data warehouse remains updated and ready for
analysis and reporting.
The three ETL steps in more detail:
1. Extraction:
○ The first step in the ETL process is extraction, where data is pulled from
various source systems.
○ These sources can include databases, APIs, flat files, or even external
systems like web applications.
○ The data in its raw form might be structured (like in relational databases),
semi-structured (like XML or JSON), or unstructured (like text files).
○ The goal of this step is to gather all relevant data and place it in a temporary
storage area called a staging area.
○ This staging area helps to prevent the direct loading of potentially corrupt or
incomplete data into the data warehouse, allowing time to process and clean
the data properly.
2. Transformation:
○ After the data is extracted, the next step is transformation.
○ This is where the data is processed and refined to ensure that it is clean,
standardized, and compatible with the destination database's schema.
○ The transformation process can involve several tasks:
i. Filtering: This involves removing any unnecessary or irrelevant data
that is not needed for the analysis.
ii. Cleaning: This step ensures the data is accurate by handling issues
like missing values, incorrect entries, or duplicates. For example, it
might involve filling missing values with defaults or standardizing
terms (e.g., "USA" vs. "United States").
iii. Joining: In this step, data from different sources may need to be
merged or joined together to create a unified dataset, combining
multiple attributes into one.
iv. Splitting: Sometimes, a single field may need to be split into multiple
fields to better suit the analysis needs.
v. Sorting: This involves arranging the data in a specific order, typically
based on certain attributes like a time or numeric field, to optimize
querying performance.
○ The transformation process ensures that the data is in a clean, consistent,
and useful format for loading into the data warehouse.
3. Loading:
○ The final step in the ETL process is loading the transformed data into the
data warehouse.
○ Once the data has been cleaned and transformed, it is inserted into the
destination system in the appropriate structure.
○ The loading process can occur at different frequencies:
i. Batch Loading: This involves loading large sets of data at specific
intervals (e.g., daily, weekly).
ii. Real-Time Loading: This method loads data continuously or at
shorter intervals to keep the data warehouse up to date with the latest
data.
○ The loading process is crucial for keeping the data warehouse current and
ready for reporting, analysis, or other business intelligence tasks.
○ It may involve inserting, updating, or even deleting data based on the
changes in the source systems.
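As a minimal illustration of the three steps (not any particular ETL tool), the sketch below
extracts rows from an in-memory CSV, applies a few simple transformations, and loads the
result into a SQLite table; the columns and business rules are hypothetical.
```python
import sqlite3
from io import StringIO

import pandas as pd

# --- Extract: read raw data from a source (an in-memory CSV standing in for a file/API) ---
raw_csv = StringIO("order_id,country,amount\n1,USA,100\n2,United States,\n3,India,250\n")
staging = pd.read_csv(raw_csv)          # staging area held as a DataFrame

# --- Transform: clean, standardize, and filter according to (hypothetical) business rules ---
staging["country"] = staging["country"].replace({"United States": "USA"})  # standardize terms
staging["amount"] = staging["amount"].fillna(0)                            # handle missing values
staging = staging[staging["amount"] >= 0]                                  # filter invalid rows

# --- Load: write the cleaned data into the warehouse table (batch load into SQLite) ---
conn = sqlite3.connect(":memory:")
staging.to_sql("fact_orders", conn, if_exists="replace", index=False)
print(pd.read_sql("SELECT * FROM fact_orders", conn))
```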
Types of Metadata in a Data Warehouse
a. Operational Metadata:
● Describes the data source systems from which data is extracted, including
details like field lengths, data types, and structures.
● Helps track the status of data processing activities, such as when data was
last updated or loaded.
Example: Operational metadata might include information about the source system,
such as the type of data (e.g., sales transactions) and the date it was last updated.
b. Extraction and Transformation Metadata:
● Contains information about the extraction of data from the source systems and
the rules applied to transform and cleanse it before loading.
Example: This metadata could include rules like converting dates to a standard
format or aggregating sales data by region during the transformation stage.
c. End-User Metadata:
● Acts as a navigational map for users, making it easier for them to find and
understand the data using business-friendly terminology.
● Allows users to explore the data in ways that align with how they think about
the business processes.
Example: End-user metadata might map complex technical terms like "cust_id" to
more understandable terms like "customer number" for ease of use by business
analysts.
OLAP vs OLTP
a. OLTP (Online Transaction Processing) systems support day-to-day operational work,
handling many short insert, update, and read transactions on current, normalized
data (e.g., order entry or banking transactions).
b. OLAP (Online Analytical Processing) systems support analysis and decision making,
running complex queries and aggregations over large volumes of historical, often
denormalized data stored in a data warehouse.
c. In short, OLTP is optimized for fast, reliable transactions, while OLAP is optimized for
complex queries, summarization, and reporting.
Data Mining Architecture
a. Data Mining (DM) is the process of discovering patterns, trends, and useful
information from large datasets.
b. The architecture of data mining is designed to handle the stages of data retrieval,
processing, pattern discovery, and presenting results in a structured and
comprehensible form.
c. The architecture typically includes several key components that work together to
support efficient and effective data mining.
1. Data Sources:
○ Data sources are where the raw data resides before mining begins. These
sources can be databases, data warehouses, or even the World Wide Web
(WWW).
○ Details: The data can come in many forms, such as relational databases, text
files, spreadsheets, multimedia (photos, videos), or logs. The data is often
unstructured or semi-structured and needs to be processed before mining.
○ Example: A retail company might use transaction data from a database,
customer information from a data warehouse, and social media data from the
WWW to perform data mining.
2. Database Server:
○ Definition: This is where the actual data is stored. The database server holds
the data that needs to be processed for mining.
○ Details: The database server is responsible for managing and retrieving data
in response to requests from the data mining engine. It organizes data into
structured formats and ensures that the data is ready for mining.
○ Example: A database server might store customer demographics, sales
transactions, and product details for a company to mine patterns like
customer purchasing behavior.
3. Data Mining Engine:
○ Definition: This is the core component that performs the actual mining
techniques to extract meaningful patterns and knowledge from the data.
○ Details: The mining engine applies various algorithms and techniques, such
as classification, clustering, regression, association rule mining, and anomaly
detection. It processes the data retrieved from the database and searches for
patterns or predictions.
○ Example: If a company wants to identify which products are often bought
together, the mining engine could apply association rule mining to find
patterns in the sales data.
4. Pattern Evaluation Modules:
○ Definition: These modules are responsible for evaluating the discovered
patterns and determining which ones are useful and interesting.
○ Details: They analyze the results generated by the data mining engine and
rank them based on criteria like usefulness, novelty, and significance. The
patterns might then be passed back to the mining engine or displayed to the
user.
○ Example: After the mining engine identifies product associations, the pattern
evaluation module might filter out irrelevant associations and highlight
significant ones, such as "Customers who buy A also buy B."
5. Graphical User Interface (GUI):
○ Definition: The GUI provides an interface for users to interact with the data
mining system.
○ Details: Since data mining can involve complex algorithms and large
datasets, the GUI helps users access and understand the results through
visual representations like charts, graphs, and dashboards. It allows for easier
manipulation and exploration of the data.
○ Example: A marketing analyst might use a GUI to adjust parameters, run a
classification algorithm, and then visualize the resulting clusters or decision
trees.
6. Knowledge Base:
○ Definition: A knowledge base stores domain-specific knowledge and user
experiences that guide the data mining process.
○ Details: It helps the mining engine by providing prior knowledge that might
improve the accuracy and relevance of the mining process. It can include
information such as business rules, historical data, or expert advice.
○ Example: A knowledge base might include customer profiles, expert
recommendations, or rules on customer behavior, which can guide the mining
engine in generating more relevant patterns or predictions.
Challenges in Data Mining
1. Data Quality: Poor data quality (e.g., missing, noisy, or inconsistent data) can affect
the accuracy of mining results.
2. Privacy Concerns: Extracting sensitive information from personal or organizational
data can raise ethical and legal issues.
3. Algorithm Scalability: Many algorithms struggle to handle the vast size of modern
datasets, leading to inefficiency.
4. Interpretability: Complex mining algorithms may produce results that are difficult for
non-experts to understand or act upon.
5. Integration Challenges: Combining data from various heterogeneous sources for
mining can be complex and time-consuming.
Steps of the KDD (Knowledge Discovery in Databases) Process
1. Data Cleaning:
a. In this step, the data is cleaned by removing noise and inconsistencies.
b. The goal is to correct any errors, handle missing values, and ensure the data
is accurate and reliable.
c. For example, missing values in customer records may be filled in with
appropriate default values or removed entirely if necessary.
2. Data Integration:
a. This step involves combining data from different sources into a cohesive
dataset.
b. The data might come from various databases, flat files, or external sources.
The objective is to merge all relevant data so that it can be analyzed
collectively.
c. For instance, combining sales data, customer demographic information, and
website activity into a single dataset for further analysis.
3. Data Selection:
a. In the data selection step, only the relevant data for the analysis task is
retrieved from the database.
b. This is a crucial step as it ensures that unnecessary or unrelated data is
excluded, focusing only on the data that will help answer specific questions.
c. For example, if the task is to predict customer churn, only customer behavior
and transactional data might be selected.
4. Data Transformation:
a. Data transformation involves converting the selected data into formats
suitable for mining.
b. This can include aggregating the data, normalizing it, or applying summary
operations to reduce complexity.
c. For example, aggregating monthly sales data into quarterly data to identify
broader trends, or normalizing the data to a standard scale for better analysis.
5. Data Mining:
a. This is the core step where data mining techniques are applied to extract
meaningful patterns from the transformed data.
b. Techniques like classification, clustering, regression, and association rule
mining are used to find patterns.
c. For example, a classification algorithm might predict whether a customer is
likely to churn based on historical data.
6. Pattern Evaluation:
a. After mining the data, the patterns discovered are evaluated to determine
their usefulness and significance.
b. This step is important to ensure that the patterns are not only interesting but
also actionable.
c. For example, a pattern showing that high-spending customers often shop at
specific times could be evaluated to guide marketing strategies.
7. Knowledge Presentation:
a. Finally, the discovered knowledge is presented to the users in an
understandable way.
b. Visualization techniques like graphs, charts, or dashboards are used to
represent the patterns and insights clearly.
c. For instance, a marketing team might see a dashboard showing the most
frequent purchasing patterns, helping them design targeted campaigns.
These steps together form the KDD process, enabling organizations to extract valuable
insights from large and complex datasets.
Data preprocessing:
Data preprocessing is an essential step in the data mining process that transforms raw data
into a suitable format for analysis.
It involves cleaning, integrating, and transforming data to improve its quality and make it
more appropriate for mining tasks.
The goal is to ensure that the data is ready for analysis, which in turn leads to more accurate
and efficient results.
1. Data Cleaning:
Data cleaning aims to identify and rectify errors or inconsistencies in the data. This
includes dealing with missing values, noisy data, and duplicate records.
○ Handling Missing Data: If data is missing, it can be filled with the mean,
most frequent value, or predicted values, or the tuple with missing values can
be removed if applicable.
○ Handling Noisy Data: Noise refers to irrelevant or inaccurate data.
Techniques like binning, regression, and clustering can be used to smooth out
noisy data.
2. Data Integration:
This step combines data from multiple sources into a unified dataset. Sources may
include multiple databases, data cubes or data files. It addresses issues that arise
due to differences in data formats, structures, or semantics, and ensures consistency
in the combined data.
3. Data Transformation:
In this step, the data is converted into a format that is suitable for analysis. This could
include normalization, standardization, and discretization.
○ Normalization: Rescaling the data into a standard range, for example, from 0
to 1, ensures that all features are on a similar scale, which is important for
certain algorithms.
○ Discretization: This process converts continuous attributes into discrete
intervals, which makes it easier for some models to handle.
4. Data Reduction:
Data reduction reduces the size of the dataset while preserving essential information,
which makes the analysis faster and more efficient.
○ Feature Selection: This involves choosing a subset of the most relevant
features, removing irrelevant or redundant ones.
○ Feature Extraction: Creating new features by transforming existing data into
a lower-dimensional space.
○ Sampling: Selecting a subset of data from a larger dataset can also reduce
the size of the data without losing important patterns.
5. Data Discretization:
This step involves converting continuous data into discrete categories or intervals,
which can be especially useful for algorithms that require categorical data. Various
binning methods, such as equal width or equal frequency, can be applied.
The steps of data preprocessing, such as cleaning, transforming, and reducing data, are
fundamental in ensuring that the data mining process is effective and the results are
accurate. By preparing data correctly, businesses and researchers can derive valuable
insights and build more efficient, reliable models.
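A small sketch of two of these steps, min-max normalization and equal-width discretization
(binning), on a toy pandas DataFrame with illustrative column names:
```python
import pandas as pd

df = pd.DataFrame({"age": [22, 35, 47, 51, 63],
                   "income": [28000, 52000, 61000, 75000, 90000]})

# Normalization: rescale 'income' to the [0, 1] range (min-max scaling)
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Discretization: bin 'age' into 3 equal-width intervals (binning)
df["age_bin"] = pd.cut(df["age"], bins=3, labels=["young", "middle", "senior"])

print(df)
```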
Data Exploration
Data exploration is the initial step in the data analysis process where the goal is to
understand the structure, quality, and patterns within the dataset. It involves summarizing the
main characteristics of the data, identifying missing or incorrect values, and visualizing the
data to discover relationships and trends. This step helps in preparing the data for further
analysis, such as feature selection, model building, or statistical testing.
1. Data Summarization: Calculate basic statistics like mean, median, mode, range,
and standard deviation for numerical attributes.
2. Handling Missing Values: Identify and decide how to deal with missing or
inconsistent data (e.g., imputation or removal).
3. Visualization: Use plots like histograms, scatter plots, and box plots to understand
data distribution and relationships.
4. Identifying Patterns: Look for trends, correlations, or anomalies within the data.
5. Feature Analysis: Analyze and understand the significance of individual features
Data Visualization
1. Histogram
2. Line Chart
3. Bar Chart
4. Scatter Plot
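The snippet below sketches these exploration and visualization steps with pandas and
matplotlib on a small hypothetical dataset: summary statistics, a missing-value check,
correlations, a histogram, and a scatter plot.
```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "age":    [23, 31, 45, 27, None, 52, 38],
    "income": [30e3, 48e3, 66e3, 41e3, 39e3, 80e3, 55e3],
})

print(df.describe())        # data summarization: mean, std, quartiles, etc.
print(df.isna().sum())      # identify missing values per column
print(df.corr())            # look for correlations between features

df["age"].dropna().plot(kind="hist", title="Age distribution")        # histogram
plt.show()
df.plot(kind="scatter", x="age", y="income", title="Age vs income")   # scatter plot
plt.show()
```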
Decision tree
a. A decision tree is a flowchart-like structure used to make decisions or predictions.
b. It consists of nodes representing decisions or tests on attributes, branches
representing the outcome of these decisions, and leaf nodes representing final
outcomes or predictions.
c. Each internal node corresponds to a test on an attribute, each branch corresponds to
the result of the test, and each leaf node corresponds to a class label or a continuous
value.
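For illustration, a decision tree classifier can be trained and inspected in a few lines with
scikit-learn (a sketch assuming scikit-learn is available; the iris dataset is just a convenient
stand-in):
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Each internal node tests one attribute; leaves hold the predicted class labels
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))   # print the flowchart-like structure of tests and leaves
```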
Naive Bayes
Why Is It Called "Naive"?
The term "naive" comes from the algorithm's key assumption: feature independence.
1. Independence Assumption:
It assumes that all features (attributes) are independent of each other, meaning the
presence or absence of one feature does not affect the presence or absence of
another. For example, in email classification, Naive Bayes assumes that the
occurrence of the word "win" is unrelated to the occurrence of the word "free."
2. Unrealistic in Real-World Data:
This assumption is rarely true in real-world datasets, as features often exhibit
dependencies (e.g., "win" and "free" are often related in spam emails).
Despite this naive assumption, the algorithm performs surprisingly well in practice,
especially for problems like text classification and natural language processing (NLP).
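A minimal text-classification sketch with scikit-learn's multinomial Naive Bayes, mirroring the
spam example above; the tiny training set and its labels are made up:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "free cash win win",
          "meeting agenda for monday", "project status report"]
labels = ["spam", "spam", "ham", "ham"]

# Word counts become the features; Naive Bayes treats each word as independent given the class
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

model = MultinomialNB().fit(X, labels)
print(model.predict(vectorizer.transform(["free prize meeting"])))
```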
Methods for estimating classifier accuracy (hold-out, subsampling, cross-validation, and
bootstrapping)
Hold-out method
a. The hold-out method is a simple technique for evaluating machine learning models.
b. It involves splitting the dataset into two disjoint sets: a training set used to train the
model and a test set used to evaluate its performance.
c. Typically, the dataset is divided in a ratio like 70:30 or 80:20.
d. The method ensures the model's performance is tested on unseen data to estimate
its generalization ability.
Advantages
1. Simple and Fast: Easy to implement and requires minimal computational effort.
2. Prevents Overfitting: Testing on unseen data provides a realistic evaluation of the
model's performance.
Disadvantages
1. Bias-Variance Tradeoff: The results depend on how the data is split, potentially
causing biased evaluations.
2. Data Wastage: A significant portion of the dataset is not used for training, which can
be a drawback for small datasets.
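A minimal hold-out sketch using scikit-learn's train_test_split (the iris dataset and logistic
regression model are just convenient stand-ins):
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80:20 split: train on one part, estimate generalization on the unseen 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))
```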
Subsampling
The subsampling method is a model evaluation technique where multiple random splits of
the dataset into training and testing subsets are performed. In each iteration, the model is
trained on the training subset and evaluated on the test subset, and the final performance is
averaged over all iterations. This method ensures more reliable evaluation by reducing
dependency on a single split.
Advantages
1. Averaging results over several random splits gives a more reliable estimate than a
single hold-out split.
2. Reduces the chance that one unlucky split biases the evaluation.
Disadvantages
1. Test sets from different iterations can overlap, and some records may never appear
in any test set.
2. Training the model multiple times increases the computational cost.
Cross validation
a. In k-fold cross-validation, the dataset is divided into k equal-sized, disjoint folds
(commonly k = 5 or 10).
b. The model is trained k times; in each round, k-1 folds are used for training and the
remaining fold is used for testing.
c. The k test results are averaged to give the final performance estimate.
Advantages
1. Every record is used for both training and testing, so the estimate is less dependent
on any particular split and makes good use of limited data.
Disadvantages
1. Training the model k times is computationally expensive, especially for large
datasets or complex models.
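A k-fold cross-validation sketch using scikit-learn's cross_val_score (5 folds; the dataset and
model are illustrative):
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each of the 5 folds is used once as the test set; the scores are then averaged
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```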
Bootstrap
The Bootstrap method is a resampling technique used to assess the performance and
stability of a machine learning model. In this method, multiple subsets are generated from
the original dataset by randomly sampling with replacement. Each subset can contain
duplicate records from the original data, and some original records may be left out in the
subset. These subsets are then used to train and test the model multiple times, allowing for
a better estimate of the model’s performance by evaluating it on different variations of the
data.
Advantages:
1. Makes efficient use of small datasets, since resampling with replacement allows many
training sets to be generated from limited data.
2. Repeated resampling gives a more stable estimate of the model's performance.
Disadvantages:
1. Because sampling is done with replacement, some records are duplicated while
others are left out, which can bias the estimate.
2. Training and evaluating the model on many bootstrap samples is computationally
expensive.
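A bootstrap sketch using plain NumPy resampling (rather than any specific library routine):
indices are sampled with replacement, the model is trained on the bootstrap sample, and it is
evaluated on the out-of-bag records that were left out. The dataset and model are illustrative.
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
scores = []

for _ in range(20):                                          # 20 bootstrap iterations
    boot = rng.choice(len(X), size=len(X), replace=True)     # sample with replacement (duplicates allowed)
    oob = np.setdiff1d(np.arange(len(X)), boot)              # out-of-bag rows left out of this sample
    model = LogisticRegression(max_iter=1000).fit(X[boot], y[boot])
    scores.append(model.score(X[oob], y[oob]))

print("Mean bootstrap (out-of-bag) accuracy:", np.mean(scores))
```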
K-Means Clustering
K-Means is a simple and widely used unsupervised machine learning algorithm that aims to
partition 'n' observations into 'k' clusters, where each observation belongs to the cluster with
the nearest mean.
The algorithm works iteratively to minimize the variance within each cluster. The main steps
of the K-Means algorithm are:
1. Choose the number of clusters, k, and initialize k centroids (for example, by picking k
random data points).
2. Assign each observation to the cluster whose centroid is nearest.
3. Recompute each centroid as the mean of the observations assigned to it.
4. Repeat steps 2 and 3 until the cluster assignments stop changing (or a maximum
number of iterations is reached).
Advantages of K-Means:
1. Simple to understand and implement.
2. Computationally efficient, so it scales well to large datasets.
3. Works well when clusters are compact and roughly equal in size.
Limitations of K-Means:
1. Fixed Number of Clusters: The number of clusters, 'k', must be predefined, which
can be a challenge if the optimal number of clusters is not known.
2. Sensitive to Initial Centroids: The algorithm's performance is sensitive to the initial
placement of centroids and may lead to suboptimal results or local minima.
3. Assumes Spherical Clusters: K-Means assumes that clusters are spherical and of
roughly equal size, which may not be true for all datasets.
4. Outlier Sensitivity: Outliers can heavily influence the placement of centroids,
leading to inaccurate clustering.
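A minimal K-Means sketch with scikit-learn on synthetic 2-D data (make_blobs is just a
convenient data generator):
```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 3 underlying groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k must be chosen up front; multiple restarts (n_init) reduce sensitivity to initial centroids
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("Cluster centroids:\n", kmeans.cluster_centers_)
print("First 10 cluster labels:", kmeans.labels_[:10])
print("Within-cluster sum of squares (inertia):", kmeans.inertia_)
```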
K-Medoids Clustering
K-Medoids is a partitioning clustering algorithm similar to K-Means, except that each cluster
is represented by an actual data point, called the medoid (the most centrally located object in
the cluster), rather than by the mean of its points. Points are assigned to the nearest medoid,
and medoids are iteratively swapped with non-medoid points whenever a swap reduces the
total dissimilarity within the clusters.
Advantages of K-Medoids:
● Robust to Outliers: Since the medoid is a data point from the dataset, it is less
sensitive to outliers than the centroid-based K-Means algorithm.
● Flexibility in Distance Metrics: K-Medoids can be used with any distance metric
(e.g., Manhattan, Euclidean, etc.), making it more versatile for different types of data.
Limitations of K-Medoids:
● Computationally more expensive than K-Means, because evaluating candidate
medoid swaps is costly on large datasets.
● The number of clusters, k, must still be specified in advance.
Web mining
a. Web Mining is the process of gaining useful information and knowledge from
the internet.
b. It uses data mining tools and algorithms to analyze web data, including web
pages, links, server logs, and other online resources.
c. Web Mining can be used for various purposes, such as market research, user
behavior analysis, personalized content recommendations, and many more.
d. Web mining is further divided into three types: Web Content Mining, Web Structure
Mining, and Web Usage Mining, each described below.
● Web mining also identifies and filters web spam for safe web search
results.
● Web Mining also categorizes web pages into relevant topics or themes.
Web Content Mining focuses on extracting valuable information from the content of web
pages. The data on web pages is often unstructured, making it difficult to analyze directly.
Web Content Mining uses natural language processing, text mining, and other techniques to
convert this unstructured web content into structured data that can be more easily analyzed.
● Process:
○ The content of a web page (e.g., text, images, videos) is extracted and
categorized into structured formats like databases or knowledge graphs.
○ Techniques such as text mining, sentiment analysis, and information retrieval
are employed to discover patterns in the content.
○ Information such as product reviews, news articles, and blog posts can be
mined for insights about public opinion, market trends, and consumer
behavior.
● Applications:
○ Market Research: Analyzing reviews or forums for customer opinions and
product feedback.
○ Competitive Analysis: Extracting data from competitor websites to
understand their offerings, pricing strategies, and user sentiment.
○ Personalized Recommendations: Understanding user preferences based
on the content they engage with to provide tailored suggestions (e.g., articles,
products).
Web Structure Mining deals with analyzing the hyperlink structure of the web. This type of
mining focuses on the relationships between different web pages and websites. By
examining the hyperlink structure, web structure mining helps understand how websites are
interlinked and the connections between various web entities.
● Process:
○ The underlying web structure is modeled as a graph, where web pages are
nodes and hyperlinks are edges.
○ Graph theory and network analysis are often used to identify patterns in the
structure of the web, such as clusters of related pages, frequently visited
paths, or ranking of websites.
○ Techniques like PageRank (used by Google Search) analyze the importance
of web pages based on their connections and inbound links; a small
PageRank sketch is shown after this list.
● Applications:
○ Search Engine Optimization (SEO): Understanding the link structure can
help improve search engine ranking by identifying authoritative pages.
○ Website Navigation Analysis: Helps webmasters and designers optimize
site navigation and internal linking for better user experience.
○ Link Prediction: Predicting potential future links based on existing website
structures to improve web crawling and data collection.
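As referenced in the Process list above, here is a minimal PageRank sketch using the
networkx library (assuming it is installed); the pages and links form a tiny hypothetical link
graph:
```python
import networkx as nx

# Model the web structure as a directed graph: pages are nodes, hyperlinks are edges
G = nx.DiGraph()
G.add_edges_from([
    ("home", "products"), ("home", "blog"),
    ("blog", "products"), ("products", "checkout"),
    ("checkout", "home"),
])

# PageRank scores pages by the importance of their inbound links
ranks = nx.pagerank(G, alpha=0.85)
for page, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{page}: {score:.3f}")
```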
Web Usage Mining focuses on analyzing user interaction with websites. This type of mining
involves examining web activity logs such as page views, clicks, session durations, and
other user actions to understand how visitors engage with a site. It is particularly useful for
profiling user behavior, predicting preferences, and improving user experience.
● Process:
○ Data is collected from server logs or cookies to track user interactions with
web pages.
○ Various metrics such as the number of visits, page click patterns, time spent
on each page, and the sequence of pages visited are analyzed.
○ Techniques such as clustering, classification, and association rule mining are
used to identify patterns in user behavior.
● Applications:
○ Personalized Content Recommendations: Based on past behavior,
websites can suggest products, services, or content tailored to the individual
user (e.g., Amazon’s recommendation engine).
○ Website Optimization: Understanding where users drop off or how they
navigate through a website helps in making design improvements for better
engagement.
○ User Profiling: Creating user profiles based on their actions to segment
users and target them with specific marketing campaigns or offers.
Advantages
1. Helps organizations understand user behavior and preferences, supporting
personalization and better-targeted marketing.
2. Improves website design, navigation, and search relevance based on real usage data.
Limitations
1. Raises privacy concerns, since user activity and personal data are collected and
analyzed.
2. Web data is huge, heterogeneous, and largely unstructured, which makes it difficult
to collect and process.
Multilevel Association Rule Mining
Multilevel association rules are mined at different levels of abstraction using a concept
hierarchy over the items.
Example
Hierarchy: Electronics → Mobile Phones → iPhones
Rules:
1. Level 1 Rule: "If someone buys electronics, they are likely to buy mobile phones."
○ This rule shows a general relationship between broad categories.
2. Level 2 Rule: "If someone buys mobile phones, they are likely to buy iPhones."
○ This rule dives deeper into a specific subcategory, showing relationships at a
narrower level.
These rules demonstrate multilevel association mining because patterns are identified at
different levels of the hierarchy. It begins with broad categories (electronics) and
progressively explores deeper, more specific relationships (mobile phones → iPhones). This
approach provides insights into both general and detailed purchasing behavior.
Advantages
1. Reveals patterns at several levels of granularity, giving both general and detailed
insights.
Limitations
1. Requires a predefined concept hierarchy, and mining at lower levels can generate
many redundant or low-support rules.
Multidimensional Association Rule Mining
Multidimensional association rules involve two or more dimensions or attributes (such as
product category, customer age group, and purchase time) rather than items from a single
dimension.
Example
Scenario:
Consider a retail store dataset with the following attributes:
● Product Category (e.g., electronics, groceries)
● Customer Age Group (e.g., adult, senior)
● Purchase Time (e.g., morning, evening)
Rules:
1. Rule 1 (Multidimensional): "If a customer buys electronics and is an adult, they are
likely to shop in the evening."
○ This rule spans across Product Category, Customer Age Group, and
Purchase Time dimensions.
2. Rule 2 (Multidimensional): "If a senior customer buys groceries, they are likely to
shop in the morning."
○ This connects Customer Age Group with Product Category and Purchase
Time.
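To show how such rules are quantified, here is a small sketch (plain Python, hypothetical
transactions) computing the support and confidence of Rule 1 above:
```python
# Hypothetical transactions with three dimensions: product category, age group, purchase time
transactions = [
    {"category": "electronics", "age_group": "adult",  "time": "evening"},
    {"category": "electronics", "age_group": "adult",  "time": "evening"},
    {"category": "electronics", "age_group": "senior", "time": "morning"},
    {"category": "groceries",   "age_group": "senior", "time": "morning"},
    {"category": "groceries",   "age_group": "adult",  "time": "evening"},
]

# Rule 1: IF category = electronics AND age_group = adult THEN time = evening
antecedent = [t for t in transactions
              if t["category"] == "electronics" and t["age_group"] == "adult"]
both = [t for t in antecedent if t["time"] == "evening"]

support = len(both) / len(transactions)    # fraction of all transactions matching the whole rule
confidence = len(both) / len(antecedent)   # fraction of antecedent transactions matching the consequent
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```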
Extras
What is a dendrogram?
a. A dendrogram is a tree-like diagram used to represent the results of hierarchical
clustering.
b. It shows how individual data points or clusters are grouped together step by step.
c. The diagram visually illustrates the hierarchy of clusters, with each data point starting
as its own cluster.
d. As the algorithm progresses, clusters are merged based on their similarity, and this is
reflected in the branches of the dendrogram.
e. The height of each branch represents the distance or dissimilarity between merged
clusters.
f. Dendrograms are useful for determining the optimal number of clusters by cutting the
tree at a specific level.
g. They are commonly used in unsupervised learning tasks, particularly in biology,
where they help visualize the relationships between species or genes.
h. Overall, dendrograms provide a clear and intuitive way to analyze the results of
hierarchical clustering.
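A minimal sketch of hierarchical clustering and its dendrogram using SciPy and matplotlib
(the 2-D points are synthetic):
```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Small synthetic dataset: two loose groups of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(5, 2)), rng.normal(4, 0.5, size=(5, 2))])

# Agglomerative clustering: merge the closest clusters step by step (Ward linkage)
Z = linkage(X, method="ward")

# Branch heights show the dissimilarity at which clusters were merged
dendrogram(Z)
plt.title("Dendrogram of hierarchical clustering")
plt.xlabel("Data point index")
plt.ylabel("Merge distance")
plt.show()
```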