
Data mining and Warehousing

MISSION NO BACKLOGs

🏡 CONSIDER A DONATION PLEASE, PAY MY COLLEGE FEES

Data mining and Warehousing 1


UNIT - 1

1) What is a Data Warehouse?

"A data warehouse is a subject-oriented, integrated, time-variant and non-volatile


collection of data in support of management's decision making process".

1. Subject Oriented : Data that gives information about a particular subject instead of about a company's ongoing operations.

2. Integrated : Data that is gathered into the data warehouse from a variety of
sources and merged into a coherent whole.

3. Time-variant : All data in the data warehouse is identified with a particular time
period.

4. Non-volatile : Data is stable in a data warehouse. More data is added, but data is never removed.

Data Warehouse is the Central Data Store within your company.

There is a need for a Data Warehouse in every enterprise that wants to make data-driven decisions, because a Data Warehouse is the "Single Source of Truth" for all the data in the company.

Make better business decisions

Data Warehouse Enhances Business Intelligence.

Data Quality and Consistency

It supports analysis and performance reporting.

Multiple years of history.


2) DATABASE VS DATA WAREHOUSE?

Database System                                    Data Warehouse
It supports operational processes.                 It supports analysis and performance reporting.
Capture and maintain the data.                     Explore the data.
Current data.                                      Multiple years of history.
Data is balanced within the scope of one system.   Data must be integrated and balanced from multiple systems.
Data is updated when a transaction occurs.         Data is updated by scheduled processes.
Data verification occurs when entry is done.       Data verification occurs after the fact.
100 MB to GB.                                      100 GB to TB.
ER based.                                          Star/Snowflake.
Application oriented.                              Subject oriented.
Primitive and highly detailed.                     Summarized and consolidated.
Flat relational.                                   Multidimensional.

3) What are the various ways of data preprocessing?

1. Data Cleaning :

a. Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, and identifying and resolving inconsistencies.

2. Data Integration :

a. Data integration involves integrating data from multiple databases, data cubes, or files.

b. Some attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies.

For example, the attribute for customer identification may be referred to as customer_id in one data store and cust_id in another. Naming inconsistencies may also occur for attribute values. Also, some attributes may be inferred from others (e.g., annual revenue). Having a large amount of redundant data may slow down or confuse the knowledge discovery process. Additional data cleaning can be performed to detect and remove redundancies that may have resulted from data integration.

3. Data Transformation :

a. Data transformation operations, such as normalization and aggregation, are additional data preprocessing procedures that contribute toward the success of the mining process.

b. Normalization is scaling the data to be analyzed to a specific range, such as [0.0, 1.0], for providing better results.

4. Data Reduction :

a. Data reduction obtains a reduced representation of the data set that is much
smaller in volume, yet produces the same (or almost the same) analytical
results.
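The normalization step described in (3b) above can be sketched in a few lines of Python. The income figures below are made-up illustrative data, not from any particular dataset:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale a list of numbers into [new_min, new_max] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

# Hypothetical annual incomes; the smallest maps to 0.0 and the largest to 1.0
incomes = [12000, 30000, 45000, 98000]
print(min_max_normalize(incomes))
```

After this step, attributes measured on very different scales (e.g. income vs. age) can be compared on the same [0.0, 1.0] range.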

4) Three-Tier Data Warehouse Architecture


Data Warehouses usually have a three-level (tier) architecture that includes:

Bottom Tier (Data Warehouse Server)

Middle Tier (OLAP Server)

Top Tier (Front end Tools).

Bottom Tier (Data Warehouse Server)


A bottom-tier that consists of the Data Warehouse server, which is almost
always an RDBMS.

It may include several specialized data marts and a metadata repository.

Data from operational databases and external sources (such as user profile
data provided by external consultants) are extracted using application
program interfaces called a gateway.

A gateway is provided by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.

Examples of gateways include ODBC (Open Database Connectivity) and OLE-DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).

Middle Tier (OLAP Server)

A middle-tier which consists of an OLAP server for fast querying of the data
warehouse.

The OLAP server is implemented using either

Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations.

Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly implements multidimensional data and operations.


Top Tier (Front end Tools)

The top tier contains front-end tools for displaying results provided by OLAP, as well as additional tools for data mining of the OLAP-generated data.

5) Trends in data warehousing


There are several trends in data warehousing that are shaping the way
organizations store, manage, and analyze data:

1. Cloud-based data warehousing:



a. Many organizations are moving their data warehousing systems to the
cloud in order to take advantage of the scalability, flexibility, and cost-
efficiency of cloud computing.

b. Cloud-based data warehousing allows organizations to store and


process large amounts of data without the need to invest in expensive
hardware and infrastructure.

2. Big data integration:

a. The proliferation of big data sources such as social media, IoT devices,
and machine learning algorithms has led to the need for data
warehousing systems that can handle large volumes of structured and
unstructured data.

b. Data warehousing solutions are evolving to enable the integration and


analysis of big data in real-time.

3. Self-service analytics:

a. Data warehousing systems are increasingly focusing on enabling users


to access and analyze data on their own, without the need for
specialized IT skills. Self-service analytics tools allow business users to
explore and visualize data, create reports and dashboards, and make
data-driven decisions without the need for IT support.

4. Data lake integration:

a. A data lake is a centralized repository that allows organizations to store


all their structured and unstructured data at any scale.

b. Data warehousing solutions are increasingly integrating with data lakes


to enable the analysis of a wider range of data types and sources.

5. Machine learning and AI:

a. Data warehousing solutions are incorporating machine learning and AI


capabilities to enable the automation of data preparation, analysis, and
reporting tasks.

b. This allows organizations to gain insights from their data faster and more
accurately.



Project planning and management in data warehouse
Project planning and management are critical for the success of a data
warehousing project. Some best practices for project planning and management
in data warehousing include:

1. Define the project scope:

a. Clearly define the objectives and deliverables of the data warehousing


project, as well as the resources and constraints involved. This will help
ensure that the project stays on track and meets the needs of the
business.

2. Develop a project plan:

a. Create a detailed project plan that outlines the tasks, dependencies,


timelines, and resources required to complete the project.

b. The project plan should also include contingency plans for addressing
potential risks and issues.

3. Assemble a project team:

a. Assemble a team of skilled professionals with expertise in data


warehousing, business analysis, and project management.

b. The team should have a clear understanding of the project objectives


and roles and responsibilities.

4. Identify and prioritize data sources:

a. Identify and prioritize the data sources that will be used in the data
warehousing project, and plan for the acquisition, cleansing, and
transformation of the data.

5. Choose the right tools and technologies:

a. Select the tools and technologies that will be used to build and maintain
the data warehouse, taking into account the needs and constraints of the
project.

6. Communicate and coordinate with stakeholders:



a. Regularly communicate and coordinate with stakeholders to ensure that
the project is aligned with business objectives and to address any
concerns or issues that may arise.

7. Monitor and control the project:

a. Use project management techniques such as monitoring and controlling


to track progress, identify and address issues, and make course
corrections as needed to ensure that the project stays on track.

UNIT - 2

What is Dimensional Modeling in Data Warehouse?


Dimensional Modeling (DM) is a data structure technique optimized for data
storage in a Data warehouse.

The purpose of dimensional modeling is to optimize the database for faster


retrieval of data.

These dimensional and relational models have their unique way of data storage
that has specific advantages.

For instance, in the relational model, normalization and ER models reduce redundancy in data. On the contrary, the dimensional model in a data warehouse arranges data in such a way that it is easier to retrieve information and generate reports.

Principles of Dimensional Modelling

1) Identify the Business Process


A Business Process is a very important aspect when dealing with
Dimensional Data Modelling.



The business process helps to identify what sort of Dimension and Facts are
needed and maintain the quality of data.

To describe business processes, you can use Business Process Modelling


Notation (BPMN) or Unified Modelling Language (UML).

2) Identify Grain

Identification of the Grain is the process of deciding the lowest level of detail (granularity) at which information will be stored in the data warehouse.

It is the stage to decide the incoming frequency of data (i.e., daily, weekly, monthly, yearly), how much data we want to store in the database (one day, one month, one year, ten years), and how much the storage will cost.

3) Identify the Dimensions


Dimensions are the key components in the Dimensional Data Modelling
process.

They contain detailed information about objects like date, store, name, address, contacts, etc. For example, in an E-Commerce use case, a Dimension can be:

Product

Order Details

Order Items

Departments

Customers (etc).

4) Identify the Facts


Once the Dimensions are created, the measures/transactions are supposed
to be linked with the associated Dimensions. The Fact Tables hold
measures and are linked to Dimensions via foreign keys. Usually, Facts
contain fewer columns and huge rows.

For example, in an E-Commerce use case, one of the Fact Tables can be of orders, which holds the products' daily ordered quantity. Facts may contain more than one foreign key to build relationships with different Dimensions.

5) Build the Schema


The next step is to tie Dimensions and Facts into the Schema. Schemas are
the table structure, and they align the tables within the database. There are
2 types of Schemas:

Star Schema: The Star Schema is the Schema with the simplest structure. In a Star Schema, the Fact Table is surrounded by a series of Dimension Tables, with each Dimension represented by one Dimension Table. These Dimension Tables are not fully normalized. In this Schema, the Dimension Tables contain a set of attributes that describes the Dimension, and the Fact Table contains foreign keys that are joined with the Dimension Tables to obtain results.

Snowflake Schema: A Snowflake Schema is the extension of a Star Schema, and includes more Dimension Tables. Unlike a Star Schema, the Dimensions are fully normalized and are split into further tables. This Schema uses less disk space because the tables are already normalized. It is easy to add Dimensions to this Schema, and data redundancy is also lower because of the more intricate Schema design.
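A minimal Star Schema for the E-Commerce example above can be sketched with Python's built-in sqlite3 module. The table names, columns, and data here are illustrative assumptions, not a prescribed design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: descriptive attributes of each product
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)")

# Fact table: measures, linked to the Dimension via a foreign key
cur.execute("""CREATE TABLE fact_orders (
    order_id   INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity   INTEGER)""")

cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Laptop", "Electronics"), (2, "Mug", "Kitchen")])
cur.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)",
                [(101, 1, 2), (102, 1, 1), (103, 2, 5)])

# Join the Fact Table with the Dimension Table and aggregate a measure
rows = cur.execute("""SELECT d.category, SUM(f.quantity)
                      FROM fact_orders f JOIN dim_product d USING (product_id)
                      GROUP BY d.category ORDER BY d.category""").fetchall()
print(rows)  # [('Electronics', 3), ('Kitchen', 5)]
```

Note how the report query only needs one join from the Fact Table to a Dimension Table, which is exactly the retrieval pattern the Star Schema is optimized for.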

What is data extraction? Describe the different techniques of data extraction.
Data extraction is the process of retrieving data from a specific source, such as a
database, a website, or a document, and transferring it to a different location or
format.
This can be done manually or through the use of specialized software or tools.
There are several techniques that can be used for data extraction, including:

1. Structured Query Language (SQL):

SQL is a programming language used to interact with databases.



It can be used to extract data from a database by writing a specific query
that specifies the data to be extracted.

2. Web scraping:

Web scraping involves using a program or tool to automatically extract


data from a website.

This can be done by writing code that interacts with the website's HTML
or XML code to extract the desired data.

3. Application Programming Interface (API):

An API is a set of rules and protocols that allows different software


programs to communicate with each other.

APIs can be used to extract data from a specific source by sending a


request and receiving a response with the desired data.

4. Data mining:

Data mining involves using automated techniques to discover patterns


and relationships in large datasets.

It can be used to extract data from a variety of sources, including


databases, websites, and documents.

5. Optical Character Recognition (OCR):

OCR is a technique that uses specialized software to recognize and


extract text from scanned documents or images.

It can be used to extract data from paper documents or PDF files.

6. Data integration:

Data integration involves combining data from multiple sources into a


single, unified dataset.

This can be done through the use of specialized tools or by writing code
to extract and merge the data from different sources.

7. Data transformation:

Data transformation involves converting data from one format to another, such as from a database to a spreadsheet or from a PDF to a Word document.

This can be done manually or through the use of specialized tools or


software.
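As a concrete sketch of technique (2), web scraping, the following uses only Python's standard-library html.parser. The HTML string is a made-up stand-in for a fetched page; in practice the page would first be downloaded (e.g. with urllib):

```python
from html.parser import HTMLParser

class HeadingExtractor(HTMLParser):
    """Collect the text of every <h2> heading on a page."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headings.append(data.strip())

# Hypothetical page content standing in for a downloaded document
html = "<html><body><h2>Prices</h2><p>...</p><h2>Reviews</h2></body></html>"
parser = HeadingExtractor()
parser.feed(html)
print(parser.headings)  # ['Prices', 'Reviews']
```

The same pattern (walk the HTML structure, keep only the elements of interest) underlies most scraping tools, which add conveniences like CSS selectors on top.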

Explain transformation with suitable example.


In the context of data mining, data transformation refers to the process of
converting raw data into a form that is more suitable for analysis.
This can involve a variety of techniques, including:

1. Normalization:

a. Normalization involves scaling the data so that it falls within a specific


range, such as between 0 and 1.

b. This is often done to ensure that different variables are on the same
scale and can be compared more easily.

2. Aggregation:

a. Aggregation involves combining data from multiple sources or records


into a single value, such as taking the average of a set of numbers.

b. This can be useful for reducing the size of a dataset or for summarizing
data in a way that is more meaningful.

3. Filtering:

a. Filtering involves selecting only a subset of the data for further analysis,
based on certain criteria.

b. This can be useful for removing irrelevant or noisy data and focusing on
the most important information.

4. Encoding:

a. Encoding involves converting categorical data, such as text labels, into


numerical values that can be processed by machine learning algorithms.



b. This can be done through techniques like one-hot encoding, where each
category is represented by a separate binary column.

5. Feature selection:

a. Feature selection involves selecting a subset of the variables or features


in a dataset for further analysis.

b. This can be done to improve the performance of a machine learning


model or to remove features that are redundant or irrelevant.

Data transformation is an important step in the data mining process, as it allows


you to prepare the data in a way that is more suitable for analysis and modeling.
By applying the appropriate transformation techniques, you can better uncover
patterns and relationships in the data and gain valuable insights.
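The encoding technique from (4) above, one-hot encoding, can be sketched without any libraries. The colour labels are illustrative assumptions:

```python
def one_hot_encode(labels):
    """Convert categorical labels into one-hot binary vectors,
    with one binary column per distinct category."""
    categories = sorted(set(labels))
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1  # set the column for this label's category
        vectors.append(row)
    return categories, vectors

# Hypothetical categorical attribute
cats, vecs = one_hot_encode(["red", "green", "red", "blue"])
print(cats)  # ['blue', 'green', 'red']
print(vecs)  # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Each text label becomes a purely numerical row, which is the form most machine learning algorithms expect.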

What are the different ways of representing a logical database?
In data mining, a logical database refers to a representation of a database
that is used for analysis and modeling purposes.

It is typically created by extracting and transforming data from a physical


database, which is the actual data storage system that contains the raw data.
A logical database is often used to prepare data for analysis or modeling in a
way that is more suitable for these purposes.
This can involve filtering or cleaning the data to remove irrelevant or noisy
information, encoding categorical data, or aggregating data from multiple
sources.

There are several ways of representing a logical database, including:

1. Entity-Relationship Diagram (ERD):

a. An ERD is a graphical representation of a database that shows the


relationships between different entities or data elements.



b. It is often used to design and model a database before it is
implemented.

2. Relational Model:

a. The relational model is a way of representing a database as a


collection of tables, with each table representing a different entity or
data element.

b. Each table consists of rows (called records or tuples) and columns


(called attributes or fields).

3. Object-Oriented Model:

a. In the object-oriented model, a database is represented as a


collection of objects, with each object representing a different entity or
data element.

b. Objects can have attributes and methods, and can be related to other
objects through relationships.

4. Hierarchical Model:

a. In the hierarchical model, a database is represented as a tree-like


structure, with each data element represented as a node in the tree.

b. There is a one-to-many relationship between parent and child nodes,


with the root node representing the top level of the hierarchy.

5. Network Model:

a. In the network model, a database is represented as a collection of


interconnected nodes, with each node representing a different data
element.

b. Nodes can be connected through relationships, and data can be


accessed through a set of pointers that link the nodes together.

6. XML:

a. XML (Extensible Markup Language) is a way of representing data in


a structured format that can be read by both humans and computers.

b. XML documents are made up of elements, attributes, and values, and


can be used to represent a wide range of data types and structures.



UNIT - 3

How are data warehouses and the web related?


A data warehouse is a database specifically designed for storing and analyzing
large amounts of data from multiple sources.

It is typically used for business intelligence and decision-making purposes, and


can be accessed and queried by users to extract insights and trends from the
data.
Web data refers to the data that is generated by user interactions on the internet,
such as website visits, online purchases, and social media activity.

This data can be collected and stored in a data warehouse, along with other
types of data, to be analyzed and used for business purposes.
There are several ways in which data warehouses and the web are related:

1. Web data can be used to populate a data warehouse:

a. Web data can be collected from various sources, such as website logs,
online surveys, and social media platforms, and used to populate a data
warehouse.

b. This data can then be analyzed to gain insights into customer behavior,
preferences, and trends.

2. Data warehouses can be accessed through the web:

a. Data warehouses can be accessed and queried through a web interface, allowing users to extract and analyze data from anywhere with an internet connection.

b. This can be done through the use of specialized tools or applications,


such as business intelligence software or web-based dashboards.

3. Data warehouses can be used to power web applications:

a. Data stored in a data warehouse can be used to power web applications, such as online stores or personalized websites.

Overall, data warehouses and the web are related in that data warehouses can store and analyze data from the web, can power web applications, and can themselves be accessed and queried through the web.

How is data partitioning helpful in reducing the query access time from a data warehouse?

Data partitioning definition:

Data partitioning is the process of dividing a large dataset into smaller, more manageable pieces, called partitions. It is often used in data warehouses and other database systems to improve query performance, scalability, and access times.

There are several ways in which data partitioning can help to reduce the query
access time from a data warehouse:

1. Improved indexing:

a. When data is partitioned, it can be indexed separately for each partition.

b. This can improve the performance of queries that use the index, as the
index will be smaller and more efficient.

2. Parallel processing:



a. Data partitioning can allow for parallel processing of queries, meaning
that multiple processors can work on different partitions of the data at the
same time.

b. This can greatly reduce the time it takes to execute a query, especially
for large datasets.

3. Data locality:

a. Data partitioning can improve data locality, which refers to the idea that
data that is accessed together should be stored close to each other.

b. By partitioning data based on how it is typically accessed, you can


ensure that the data needed for a specific query is stored together,
reducing the time it takes to retrieve it.

4. Reduced I/O: Data partitioning can also reduce the amount of input/output
(I/O) required to retrieve data from a data warehouse, as it allows you to
retrieve only the data that is needed for a specific query rather than reading
through the entire dataset. This can further improve query performance and
reduce access times.

Overall, data partitioning is a useful technique for reducing the query access time
from a data warehouse by improving indexing, enabling parallel processing,
improving data locality, and reducing I/O.
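A toy sketch of the idea behind points (3) and (4): rows are partitioned by year, so a query for one year touches only its own partition instead of scanning the whole dataset. The sales rows and years are illustrative assumptions:

```python
from collections import defaultdict

# Hypothetical (year, amount) sales rows
rows = [(2021, 100), (2021, 250), (2022, 300), (2023, 50)]

# Partition the rows by year so each year's data lives together (data locality)
partitions = defaultdict(list)
for year, amount in rows:
    partitions[year].append(amount)

def total_sales(year):
    # Partition pruning: only the matching partition is read;
    # the other years' rows are never touched
    return sum(partitions.get(year, []))

print(total_sales(2021))  # 350
```

Real databases apply the same principle at the storage level, skipping entire partitions (and their disk I/O) when the query's filter rules them out.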

List activities performed during data warehouse deployment

There are several activities that are typically performed during the deployment of
a data warehouse, including:

1. Design and planning:

a. Before deployment, the data warehouse design and architecture should


be carefully planned and developed.



b. This can involve deciding on the data sources to be used, the data model
and schema, the ETL (extract, transform, load) process, and the tools
and technologies to be used.

2. Data cleansing and transformation:

a. Before data can be loaded into the data warehouse, it may need to be
cleaned and transformed to ensure it is in a consistent and usable
format.

b. This can involve tasks such as removing duplicates, correcting errors,


and standardizing data formats.

3. Data loading:

a. Once the data has been cleaned and transformed, it can be loaded into
the data warehouse.

b. This can involve using ETL tools or scripts to extract the data from the
source systems, transform it into the appropriate format, and load it into
the data warehouse.

4. Data quality checks:

a. After the data has been loaded into the data warehouse, it is important to
perform quality checks to ensure that the data is accurate and complete.

b. This can involve verifying that all the data has been loaded correctly,
checking for inconsistencies or errors, and comparing the data to the
source systems to ensure it is accurate.

5. Data indexing:

a. Indexing is a process that is used to improve the performance of queries


on the data warehouse.

b. Once the data has been loaded and quality checked, it is typically
indexed to make it easier to search and retrieve.

6. Security and access control:

a. It is important to ensure that the data warehouse is secure and that


access to the data is controlled appropriately.



b. This can involve setting up authentication and authorization protocols, as
well as implementing security measures such as encryption and firewalls.

7. Testing and validation:

a. Before the data warehouse is put into production, it is important to


thoroughly test and validate the system to ensure it is functioning
correctly.

b. This can involve testing the ETL process, running sample queries, and
verifying the data quality.

8. Deployment and maintenance:

a. Once the data warehouse has been tested and validated, it can be
deployed into production and made available for users to access and
query.

b. Ongoing maintenance and support will also be required to ensure the


system remains stable and efficient.

Explain project planning & management in Data warehouse.

Project planning and management are important considerations in the


development of a data warehouse.

A well-planned and managed data warehouse project can help to ensure that
the project is completed on time, within budget, and with the desired level of
quality.



Some key elements of project planning and management in a data warehouse
project include:

1. Defining the project scope and objectives:

a. It is important to clearly define the goals and objectives of the data


warehouse project, as well as the specific tasks and deliverables that will
be required to achieve these goals.

b. This can help to ensure that the project stays focused and on track.

2. Developing a project plan:

a. A project plan is a detailed document that outlines the steps and tasks
required to complete the data warehouse project.

b. It should include information such as the timeline, budget, resources, and


dependencies for each task.

3. Identifying and securing resources:

a. In order to complete the data warehouse project, you will need to identify
and secure the necessary resources, such as personnel, software,
hardware, and data sources.

b. It is important to carefully plan and coordinate these resources to ensure


that the project stays on track.

4. Managing risks:

a. All projects carry some level of risk, and it is important to identify and
manage these risks throughout the data warehouse project.

b. This can involve creating contingency plans, conducting risk


assessments, and monitoring and addressing potential issues as they
arise.

5. Tracking progress:

a. It is important to track the progress of the data warehouse project


regularly and adjust the project plan as needed.

b. This can involve using project management software, holding regular


meetings with the project team, and regularly reviewing the project plan
to ensure it is still on track.



Overall, project planning and management are critical components of a
successful data warehouse project. By carefully planning and managing the
project, you can ensure that it is completed on time, within budget, and with the
desired level of quality.

Explain OLAP operations with examples

OLAP (Online Analytical Processing) is a type of database technology that is


designed for efficient querying and analysis of large amounts of data.
It allows users to quickly perform complex calculations and analysis on data
stored in a data warehouse.
There are several types of OLAP operations, including:

1. Roll-up:

a. A roll-up operation involves aggregating data from multiple dimensions or


levels into a higher level of granularity.

b. For example, you might roll up the data by region to see the total sales
for each region. This might give you a result like this:

Region Sales

North $100,000

South $200,000

East $150,000

West $250,000

2. Drill-down:

a. A drill-down operation involves breaking down data to a lower level of


granularity.



b. For example, you might drill down from the quarter-level to the month-
level to see the sales data for each month within a quarter.

3. Slice and dice:

a. A slice and dice operation involves selecting and organizing data from
multiple dimensions in order to view it from different perspectives.

b. For example, a retail company might use OLAP to analyze sales data by
store location, time period, and product category to identify which
products are selling well in which locations and at which times of year.

4. Pivot:

a. A pivot operation involves rotating the data in a table or matrix so that


different dimensions are displayed as rows or columns.

b. For example, a company might use OLAP to pivot sales data by product
and time period to see how different products are performing over time.

5. Hierarchical analysis:

a. This involves analyzing data at different levels of a hierarchy.

b. For example, a company might use hierarchical analysis to analyze sales data at the country level and then drill down to the regional and store levels.

Overall, OLAP operations are useful for quickly and easily analyzing and
manipulating large amounts of data in a data warehouse, allowing users to gain
insights and make informed decisions.
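The roll-up operation from (1) is essentially a group-by aggregation up the dimension hierarchy (city to region here). The cities, region mapping, and sales figures are made-up so that the totals match the Region/Sales table above:

```python
# Hypothetical city-level sales and a city -> region hierarchy
city_sales = {
    "Delhi": 60000, "Chandigarh": 40000,    # North
    "Chennai": 120000, "Bengaluru": 80000,  # South
    "Kolkata": 90000, "Patna": 60000,       # East
    "Mumbai": 150000, "Pune": 100000,       # West
}
city_region = {
    "Delhi": "North", "Chandigarh": "North",
    "Chennai": "South", "Bengaluru": "South",
    "Kolkata": "East", "Patna": "East",
    "Mumbai": "West", "Pune": "West",
}

# Roll-up: aggregate city-level measures to the coarser region level
region_sales = {}
for city, sales in city_sales.items():
    region = city_region[city]
    region_sales[region] = region_sales.get(region, 0) + sales

print(region_sales)
```

Drill-down is simply the inverse direction: starting from region_sales and going back to the per-city rows.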

Explain the physical design process of a data warehouse.
The physical design of a data warehouse involves designing the underlying database and the hardware infrastructure that will be used to support the data warehouse.
It is an important step in the data warehouse development process because it
determines the performance, scalability, and reliability of the data warehouse.
Here are the main steps in the physical design process of a data warehouse:

1. Identify the hardware and software requirements:

a. This involves determining the hardware and software infrastructure


that will be needed to support the data warehouse, including the
number of servers, the amount of storage, and the type of database
software that will be used.

2. Design the database schema:

a. This involves designing the structure of the database, including the


tables, columns, and relationships between the data.

b. The database schema should be designed to support the data


warehouse queries and to ensure efficient data retrieval.

3. Load the data:

a. This involves transferring the data from the various source systems
into the data warehouse.

b. This can be done using ETL (extract, transform, and load) tools,
which are used to extract the data from the source systems,
transform it into the appropriate format, and load it into the data
warehouse.

4. Optimize the database and hardware:

a. This involves fine-tuning the database and hardware to ensure that


the data warehouse performs optimally.

b. This may include indexing tables, partitioning data, and using


hardware acceleration techniques to improve query performance.

Overall, the physical design process of a data warehouse is an important step in the data warehouse development process because it determines the performance, scalability, and reliability of the data warehouse.
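The load step above (step 3) can be sketched with Python's built-in sqlite3 module standing in for the warehouse database; the table name, columns, and data are illustrative, and a real warehouse would use dedicated ETL tools:

```python
import sqlite3

# Hypothetical extracted rows (amounts arrive as formatted strings).
source_rows = [("2023-01-05", "Store-1", "1,200"), ("2023-01-06", "Store-2", "950")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sale_date TEXT, store TEXT, amount REAL)")

# Transform: strip thousands separators and cast amounts to numbers.
clean = [(d, s, float(a.replace(",", ""))) for d, s, a in source_rows]

# Load: insert the transformed rows into the warehouse table.
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", clean)

# Optimization step: an index on the date column speeds up the
# typical time-based queries (step 4 above).
conn.execute("CREATE INDEX idx_sales_date ON fact_sales(sale_date)")

total = conn.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
print(total)  # 2150.0
```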



Define the types of OLAP with a suitable diagram.
Online Analytical Processing (OLAP) is a type of database technology that is
designed to support the efficient querying and analysis of data stored in a
multidimensional database. There are three main types of OLAP systems:

1. Relational OLAP (ROLAP):

a. ROLAP systems use a traditional relational database management


system (RDBMS) to store data, but they use specialized indexing and
querying techniques to enable fast querying and analysis of data.

b. ROLAP Architecture includes the following components

Database server.

ROLAP server.

Front-end tool.

2. Multidimensional OLAP (MOLAP):

a. MOLAP systems store data in a multi-dimensional array, which allows


for fast querying and analysis of data using pre-computed
aggregations.

b. MOLAP Architecture includes the following components

Database server.



MOLAP server.

Front-end tool.

3. Hybrid OLAP (HOLAP):

a. HOLAP systems combine the best aspects of ROLAP and MOLAP


systems, allowing for both fast querying and analysis of data using
pre-computed aggregations and the ability to handle large amounts of
data.

Describe data mining task primitives.

Data mining task primitives are the basic operations that are used to
perform data mining tasks.

These task primitives can be used to perform a wide range of data mining
tasks, including classification, regression, clustering, association rule
mining, and outlier detection.

The main data mining task primitives are:



1. Selection:

a. The selection primitive allows the user to select a subset of the data
to be used for a data mining task.

b. This can be useful if the user only wants to analyze a specific subset
of the data, or if the user wants to exclude certain data from the
analysis.

2. Projection:

a. The projection primitive allows the user to select specific attributes or


features from the data for use in the data mining task.

b. This can be useful if the user only wants to analyze a specific set of
attributes or if the user wants to exclude certain attributes from the
analysis.

3. Sampling:

a. The sampling primitive allows the user to select a sample of the data
to be used for a data mining task.

b. This can be useful if the user only wants to analyze a small subset of
the data, or if the user wants to reduce the computational burden of
the data mining task.

4. Aggregation:

a. The aggregation primitive allows the user to group data together and
perform calculations on the aggregated data.

b. This can be useful for summarizing data or for identifying patterns in


the data.

5. Sorting:

a. The sorting primitive allows the user to order the data according to
specific attributes or features.

b. This can be useful for organizing data or for identifying patterns in the
data.

6. Ranking:



a. The ranking primitive allows the user to rank the data according to
specific attributes or features.

b. This can be useful for identifying the most important or influential data
points.

7. Transformation:

a. The transformation primitive allows the user to transform the data in


some way, such as by normalizing the data or by applying a specific
function to the data.

b. This can be useful for preparing the data for analysis or for adjusting
the data to meet the requirements of a specific data mining algorithm.

8. Model building:

a. The model building primitive allows the user to build a model based
on the data, which can then be used for prediction or other data
mining tasks.

b. This can be useful for making predictions about future data or for
identifying patterns in the data that are not immediately apparent.

These task primitives can be combined and applied in different ways to perform a wide variety of data mining tasks. For example, a data mining task may involve selecting a subset of the data, projecting specific attributes, aggregating the data, and building a model based on the aggregated data.
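As a sketch of such a pipeline, the selection, projection, aggregation, and sorting/ranking primitives can be chained in plain Python (the records and threshold below are invented for illustration):

```python
# Toy dataset: one dict per record.
records = [
    {"region": "West", "product": "A", "sales": 100},
    {"region": "West", "product": "B", "sales": 40},
    {"region": "East", "product": "A", "sales": 70},
    {"region": "East", "product": "B", "sales": 120},
    {"region": "West", "product": "A", "sales": 60},
]

# Selection: keep only rows with sales of at least 50.
selected = [r for r in records if r["sales"] >= 50]

# Projection: keep only the region and sales attributes.
projected = [(r["region"], r["sales"]) for r in selected]

# Aggregation: total sales per region.
totals = {}
for region, sales in projected:
    totals[region] = totals.get(region, 0) + sales

# Sorting/ranking: order regions by total sales, highest first.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('East', 190), ('West', 160)]
```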

UNIT - 4



What is data mining?
Data mining is the process of discovering patterns and trends in large
datasets.

It involves using various techniques and algorithms to extract useful


information and insights from data.

Data mining is often used in a wide range of fields, including business,


finance, healthcare, and science.

Some common data mining techniques include clustering, classification,


regression, association rule learning, anomaly detection, dimensionality
reduction, neural networks, and genetic algorithms.

Data mining can be used to make predictions, identify trends and patterns,
and generate insights that can inform decision-making.

It can also be used to detect fraudulent activity, identify potential problems or


issues, and optimize processes and systems.

Data mining requires a large amount of data, which may be collected from
various sources such as databases, websites, and sensors.

It also requires specialized software and tools to analyze and process the
data.

Data mining often involves working with large and complex datasets, and
may require the use of specialized hardware and infrastructure such as
clusters or cloud computing platforms.

Data mining is a rapidly evolving field, with new techniques and tools being
developed constantly.

Explain different data mining techniques.



There are many different data mining techniques that can be used to extract
useful information and insights from large datasets. Here are some common
techniques:

1. Classification:

a. This involves training a model to predict which class or category a data


point belongs to.

b. Classification can be used to predict outcomes, such as whether a


customer will churn or whether a patient will develop a particular disease.

c. Data mining frameworks themselves can be classified according to: the type of data sources mined, the database involved, the kind of knowledge discovered, and the data mining techniques used.

2. Clustering:

a. This involves grouping data points into clusters based on their


similarities.

b. Clustering can be used to identify groups of similar items, customers, or


other entities, and can be useful for segmentation and classification
tasks.

3. Regression:



a. This involves predicting a continuous numerical value, such as a price or
temperature. Regression can be used to forecast trends and make
predictions about future values.

4. Association rule learning:

a. This involves identifying rules that describe relationships between


variables in the data.

b. Association rule learning can be used to discover patterns in


transactional data, such as which products are frequently purchased
together.

c. This data mining technique helps to discover a link between two or more
items. It finds a hidden pattern in the data set.

5. Sequential patterns:

a. Sequential pattern analysis is a data mining technique that seeks to discover similar patterns, regular events, or trends in transaction data over a business period.

b. In sales, historical transaction data can reveal sets of items that customers buy together at different times of the year.

c. Businesses can then use this information to recommend these items with better deals, based on customers' purchasing frequency in the past.

6. Prediction:

a. Prediction, as its name implies, is a data mining technique that discovers the relationship between independent variables, and between dependent and independent variables.

b. For instance, prediction analysis can be used in sales to forecast future profit if we consider sales an independent variable and profit a dependent variable.



c. Then, based on the historical sales and profit data, we can draw a fitted regression curve that is used for profit prediction.

7. Decision trees:

a. The decision tree is one of the most widely used data mining techniques because its model is easy for users to understand.

b. In the decision tree technique, the root of the decision tree is a simple question or condition that has multiple answers.

c. Each answer then leads to a further set of questions or conditions that help us classify the data so that we can make the final decision based on it.

d. For example, a decision tree can be used to determine whether or not to play tennis based on weather conditions such as outlook and wind.
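One common way to choose the root question of a decision tree is information gain; a minimal sketch on an invented mini "play tennis" dataset (not the classic full dataset):

```python
import math
from collections import Counter

# Toy rows: (outlook, windy) -> play? (values are illustrative).
rows = [
    ("sunny", False, "no"), ("sunny", True, "no"),
    ("overcast", False, "yes"), ("rain", False, "yes"),
    ("rain", True, "no"), ("overcast", True, "yes"),
]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attr_index):
    """How much splitting on one attribute reduces label entropy."""
    base = entropy([r[-1] for r in rows])
    by_value = Counter(r[attr_index] for r in rows)
    remainder = 0.0
    for value, n in by_value.items():
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += (n / len(rows)) * entropy(subset)
    return base - remainder

# The root of the tree is the attribute with the highest gain.
gains = {i: information_gain(rows, i) for i in (0, 1)}
best = max(gains, key=gains.get)
print(best)  # 0 -> split on outlook first
```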

What do you mean by Clustering? Briefly explain the association rules.

Clustering

Clustering is alternatively referred to as unsupervised learning or segmentation.

It can be thought of as partitioning or segmenting the data into groups that may or may not be disjoint.

Clustering is usually accomplished by determining the similarity among the data on predefined attributes.

The most similar data are grouped into clusters.

A special type of clustering is called segmentation.

With segmentation, a database is partitioned into disjoint groupings of similar tuples called segments.

Segmentation is often viewed as being identical to clustering.

In other circles, segmentation is viewed as a specific type of clustering applied to a database itself.
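Grouping data by similarity can be sketched with a tiny one-dimensional k-means (k = 2; the data and initialization below are invented for illustration):

```python
# Toy one-dimensional data with two visible groups.
data = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids = [data[0], data[-1]]  # naive initialization: the two extremes

for _ in range(10):  # a few refinement passes
    clusters = [[], []]
    for x in data:
        # Assign each point to its most similar (nearest) centroid.
        nearest = min((0, 1), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    # Re-estimate each centroid as the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]

print(clusters)   # [[1.0, 1.5, 2.0], [9.0, 10.0, 11.0]]
print(centroids)  # [1.5, 10.0]
```

Real clustering would work on many attributes and guard against empty clusters; this sketch only shows the assign/re-estimate loop.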

Association rules are a type of rule-based machine learning method for discovering relationships between variables in a large database.

They are often used in the field of data mining to discover patterns of co-occurrence in transactional data.

In the context of clustering, association rules can be used to identify relationships between variables within a cluster. For example, if a clustering algorithm has been used to group a set of customers into different clusters based on their purchasing habits, association rules could be used to identify relationships between the items that customers in a particular cluster tend to purchase together.
Here are a few points about association rules in the context of clustering:

1. Association rules are typically expressed as an "if-then" statement, where the


"if" part represents the antecedent (the predictor variable) and the "then" part
represents the consequent (the predicted variable).

2. Association rules can be used to identify relationships between variables


within a cluster.

3. Association rules can be used to make recommendations to customers or to


optimize product placement in a store.

4. Association rules can be calculated using a variety of measures, such as


support, confidence, and lift.

5. The quality of the association rules discovered will depend on the quality of
the clusters produced by the clustering algorithm.
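The support, confidence, and lift measures mentioned in point 4 can be computed directly; a minimal sketch for one candidate rule {bread} -> {butter} over invented transactions:

```python
# Toy transactional data: one set of items per basket.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(1 for t in transactions if itemset <= t) / n

sup_rule = support({"bread", "butter"})     # P(bread and butter)
confidence = sup_rule / support({"bread"})  # P(butter | bread)
lift = confidence / support({"butter"})     # > 1 indicates positive association

print(sup_rule, confidence, lift)  # 0.6 0.75 0.9375
```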



What is knowledge discovery?
Knowledge discovery is the process of discovering useful, novel, and
potentially valuable information from data.

It involves identifying patterns and trends in data, and using that information
to understand the underlying relationships and structures in the data.

The goal of knowledge discovery is to extract knowledge from data and use it
to make informed decisions and predictions.

This process can be applied to a wide range of fields, including business,


science, and engineering, and it often involves the use of data mining and
machine learning techniques to analyze large datasets.

Describe KDD process in detail.

The knowledge discovery in data mining (KDD) process is a systematic approach


to discovering useful, novel, and potentially valuable information from data. It
typically involves the following steps:



1. Data selection:

a. The first step in KDD is to select the data that will be analyzed.

b. This may involve collecting data from various sources, such as


databases, sensor networks, or the web.

2. Data preprocessing:

a. Once the data has been selected, it is often necessary to clean and
preprocess the data to make it suitable for analysis.

b. This may involve tasks such as correcting errors, handling missing


values, and removing unnecessary data.

3. Data transformation:

a. The next step is to transform the data into a format that is suitable for
analysis.

b. This may involve techniques such as normalization, aggregation, or


feature selection.

4. Data mining:

a. The data is then analyzed using various data mining techniques, such as
machine learning algorithms or statistical models, to identify patterns and
trends in the data.

5. Pattern evaluation:

a. The discovered patterns and trends are then evaluated to determine their
significance and usefulness.

6. Knowledge presentation:

a. The final step is to present the knowledge that has been discovered in a
clear and understandable way, typically through the use of visualizations
or reports.

Throughout the KDD process, it is important to carefully consider the goals of the
analysis and the characteristics of the data, as well as to validate the results
obtained.
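The preprocessing and transformation steps (2 and 3) can be sketched as mean imputation followed by min-max normalization; the values below are invented for illustration:

```python
# Raw attribute values, with None marking missing entries.
raw = [4.0, None, 10.0, 6.0, None, 8.0]

# Preprocessing: fill missing values with the attribute mean.
known = [v for v in raw if v is not None]
mean = sum(known) / len(known)  # 7.0
cleaned = [mean if v is None else v for v in raw]

# Transformation: min-max normalization into the [0, 1] range.
lo, hi = min(cleaned), max(cleaned)
normalized = [(v - lo) / (hi - lo) for v in cleaned]
print(normalized)
```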



UNIT - 5

What is Web mining?

Web mining is the process of using data mining techniques to discover patterns
and trends in data that is collected from the World Wide Web.

It involves collecting, storing, and analyzing data from websites and social media
platforms to gain insights into customer behavior, preferences, and trends.

Web mining can be used in a variety of applications, such as improving search engine results, personalizing web content, detecting fraud, and analyzing social media data.

Define spatial mining.


Spatial mining is the process of analyzing and extracting knowledge from spatial
data, which is data that has a geographic or spatial component.

Spatial data can include information about the location, shape, and attributes of
geographical features, such as points, lines, and polygons.

Spatial mining involves the use of specialized techniques and algorithms to analyze and interpret spatial data, with the goal of discovering patterns and trends that would not be apparent from the raw data alone.

It is commonly used in a variety of fields, including geography, urban planning, environmental science, and marketing.

Spatial mining can be performed using a variety of tools and techniques, including
GIS software, spatial database systems, and machine learning algorithms.



Differentiate between the features provided by web
content mining and web usage mining.

Web content mining and web usage mining are two different types of data mining
techniques that are used to extract information from the World Wide Web.
Web content mining

is the process of automatically extracting structured data from web pages.

It involves analyzing the content of web pages and extracting information that is
relevant to a specific topic or task.

Web content mining can be used to extract a wide range of information,


including text, images, and other media.

Web usage mining

is the process of analyzing and interpreting data about the behavior of users on
the web. It involves tracking the actions of users as they interact with web
pages, such as the pages they visit, the links they click on, and the time they
spend on each page.

Web usage mining is often used to understand user behavior and identify
patterns in how users interact with web pages.

| Web content mining | Web usage mining |
| --- | --- |
| It extracts data from the content of web pages. | It extracts data from the actions of users on the web. |
| It extracts structured data from web pages, such as text and images. | It extracts data about user behavior, such as page visits and clickstream data. |
| It is used to extract specific types of information from web pages, such as product information or news articles. | It is used to understand user behavior and identify patterns in how users interact with web pages. |
| It involves the use of natural language processing and machine learning algorithms to extract and analyze data from web pages. | It involves the use of log file analysis and data visualization tools to analyze and interpret user behavior data. |
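A minimal web usage mining sketch, counting page visits and page-to-page transitions from an invented clickstream log (paths are illustrative):

```python
from collections import Counter

# One toy session: the pages a user visited, in order.
clickstream = ["/home", "/products", "/cart", "/home",
               "/products", "/products", "/cart"]

# Page-visit counts, and counts of consecutive page-to-page transitions.
visits = Counter(clickstream)
transitions = Counter(zip(clickstream, clickstream[1:]))

print(visits["/products"])                  # 3
print(transitions[("/home", "/products")])  # 2
```

Real web usage mining would parse server log files and aggregate over many sessions, but the core is the same counting of visits and click paths.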

Discuss the application of data mining in the field of e-commerce.
Data mining is widely used in the field of e-commerce to extract valuable
insights from large datasets of customer and transactional data.

Some of the ways in which data mining is used in e-commerce include:

1. Customer segmentation:

a. Data mining can be used to identify patterns in customer behavior


and group customers into segments based on common
characteristics.

b. This can be useful for targeted marketing and personalized


recommendations.

2. Fraud detection:

a. Data mining can be used to identify patterns in transactional data


that may indicate fraudulent activity.

b. This can help e-commerce companies prevent losses due to


fraudulent transactions.

3. Personalized recommendations:

a. Data mining can be used to analyze customer behavior and


preferences to make personalized product recommendations.

b. This can increase customer satisfaction and sales.

4. Price optimization:

a. Data mining can be used to analyze customer behavior and market


trends to optimize prices for products and services.



b. This can help e-commerce companies maximize profits and stay
competitive.

5. Marketing campaign optimization:

a. Data mining can be used to analyze the effectiveness of marketing


campaigns and identify which strategies and tactics are most
successful.

b. This can help e-commerce companies optimize their marketing


efforts and improve ROI.

6. Inventory management:

a. Data mining can be used to analyze customer demand and


purchasing patterns to optimize inventory levels and reduce the risk
of stock outs.

Overall, data mining is an essential tool for e-commerce companies to gain a deeper understanding of their customers and optimize their business operations.
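As a toy illustration of customer segmentation (the first application above), customers can be bucketed by total spend; the names, amounts, and thresholds below are invented:

```python
# Toy order history: (customer, order amount).
orders = [
    ("alice", 120.0), ("bob", 20.0), ("alice", 300.0),
    ("carol", 60.0), ("bob", 15.0),
]

# Aggregate total spend per customer.
spend = {}
for customer, amount in orders:
    spend[customer] = spend.get(customer, 0.0) + amount

def segment(total):
    # Illustrative spend thresholds for the segments.
    if total >= 200:
        return "high-value"
    if total >= 50:
        return "mid-value"
    return "low-value"

segments = {c: segment(t) for c, t in spend.items()}
print(segments)  # {'alice': 'high-value', 'bob': 'low-value', 'carol': 'mid-value'}
```

Real segmentation would use richer features (recency, frequency, product mix) and a clustering algorithm rather than fixed thresholds.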

Explain applications and trends in data mining.


Data mining is the process of extracting and analyzing large amounts of
data to identify patterns, trends, and relationships that may not be
immediately apparent.

Some common applications of data mining include:

1. Marketing:

a. Data mining can be used to analyze customer behavior and


preferences to identify patterns and trends that can inform marketing
strategies.



b. This can include targeted marketing campaigns, personalized
product recommendations, and customer segmentation.

2. Healthcare:

a. Data mining can be used to identify patterns in patient data and


predict the likelihood of certain outcomes, such as the likelihood of a
patient developing a particular condition or the effectiveness of a
particular treatment.

3. Finance:

a. Data mining can be used to identify patterns in financial data, such


as stock prices and market trends, to inform investment decisions
and risk management strategies.

4. Retail:

a. Data mining can be used to analyze customer purchasing patterns


and optimize inventory levels, pricing, and marketing campaigns.

5. Fraud detection:

a. Data mining can be used to identify patterns in data that may


indicate fraudulent activity, such as fraudulent credit card
transactions or insurance claims.

6. Manufacturing:

a. Data mining can be used to optimize production processes, identify


patterns in quality control data, and reduce the risk of defects.

7. Telecommunications:

a. Data mining can be used to analyze customer usage patterns and


optimize network resources, as well as identify patterns in network
usage data that may indicate security threats.

Overall, data mining has the potential to unlock valuable insights from large
datasets and inform a wide range of business and research applications.



Explain the Taxonomy of web mining with suitable
diagram.
Web mining is the process of using data mining techniques to extract and
analyze information from the World Wide Web.

It involves the use of specialized tools and algorithms to analyze and


interpret data from web pages, web logs, and other online sources.

Web mining can be divided into three main categories:

1. Web content mining:

a. Web content mining involves the extraction of structured data from web
pages, such as text, images, and other media.

b. It is typically used to extract specific types of information from web


pages, such as product information or news articles.

2. Web structure mining:

a. Web structure mining involves the analysis of the relationships between


different web pages, such as the links between pages and the structure
of the web as a whole.

b. It is often used to identify patterns in the organization and linking of web


pages.

3. Web usage mining:

a. Web usage mining involves the analysis of data about the behavior of
users on the web, such as page visits and clickstream data.

b. It is typically used to understand user behavior and identify patterns in


how users interact with web pages.
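A minimal web structure mining sketch: counting in-links in an invented link graph. Pages with many in-links are often treated as more authoritative, which is the intuition behind algorithms like PageRank:

```python
# Toy link graph: page -> list of pages it links to (URLs are illustrative).
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}

# Count how many pages link to each target page.
in_links = {}
for page, targets in links.items():
    for target in targets:
        in_links[target] = in_links.get(target, 0) + 1

print(in_links)  # c.html has the most in-links
```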
