ALL YOU NEED Data_Mining_and_Warehousing
1. Subject-oriented : Data in the warehouse is organized around the major subjects of the enterprise (such as customers, products, and sales) rather than around individual applications.
2. Integrated : Data that is gathered into the data warehouse from a variety of
sources and merged into a coherent whole.
3. Time-variant : All data in the data warehouse is identified with a particular time
period.
4. Non-volatile : Data is stable in a data warehouse. More data is added, but data is
never removed.
Every enterprise that wants to make data-driven decisions needs a Data Warehouse,
because the Data Warehouse acts as the "Single Source of Truth" for the organization.
2) DATABASE VS DATA WAREHOUSE?
Database System | Data Warehouse
Data is balanced within the scope of this one system. | Data must be integrated and balanced from multiple systems.
Data is updated when a transaction occurs. | Data is updated on scheduled processes.
Data verification occurs when the entry is made. | Data verification occurs after the fact.
ER based. | Star/Snowflake based.
1. Data Cleaning :
a. Data cleaning routines work to "clean" the data by filling in missing values,
smoothing noisy data, and identifying and resolving inconsistencies.
2. Data Integration :
a. Data integration merges data from multiple sources into a coherent data
store, such as a data warehouse.
3. Data Transformation :
a. Data transformation converts the data into forms appropriate for mining,
for example through normalization or aggregation.
4. Data Reduction :
a. Data reduction obtains a reduced representation of the data set that is much
smaller in volume, yet produces the same (or almost the same) analytical
results. (A short sketch of these preprocessing steps follows below.)
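As a rough illustration of these preprocessing steps, here is a minimal Python/pandas sketch. The table, column names, and values are invented for illustration; it fills missing values, removes duplicates, smooths a noisy outlier, and reduces the data by sampling.

```python
import pandas as pd

# Toy table with a missing value, a duplicate row, and a noisy outlier
df = pd.DataFrame({
    "age":    [25.0, None, 47.0, 47.0, 390.0],   # None = missing, 390 = noise
    "income": [50000, 62000, 58000, 58000, 61000],
})

# Data cleaning: fill the missing value with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# Identify and resolve inconsistencies: drop exact duplicate rows
df = df.drop_duplicates()

# Smooth noisy data: clip implausible ages into a valid range
df["age"] = df["age"].clip(0, 100)

# Data reduction: keep a much smaller random sample of the rows
reduced = df.sample(frac=0.5, random_state=42)
print(reduced)
```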
A bottom tier, the warehouse database server. Data from operational databases
and external sources (such as user profile data provided by external
consultants) are extracted using application program interfaces called gateways.
A middle tier, which consists of an OLAP server for fast querying of the data
warehouse.
A top tier, which contains front-end client tools for querying, reporting, and analysis.
a. The proliferation of big data sources such as social media and IoT devices,
together with the rise of machine learning, has led to the need for data
warehousing systems that can handle large volumes of structured and
unstructured data.
3. Self-service analytics:
a. Self-service analytics tools let business users access and analyze
warehouse data directly, without relying on IT teams.
b. This allows organizations to gain insights from their data faster and more
accurately.
b. The project plan should also include contingency plans for addressing
potential risks and issues.
a. Identify and prioritize the data sources that will be used in the data
warehousing project, and plan for the acquisition, cleansing, and
transformation of the data.
a. Select the tools and technologies that will be used to build and maintain
the data warehouse, taking into account the needs and constraints of the
project.
UNIT - 2
The dimensional and relational models each store data in their own way, and
each approach offers specific advantages.
2) Identify Grain
Identifying the grain is the process of deciding the lowest level of information
(the level of detail, or granularity) at which the data will be stored.
A Dimension Table contains detailed information about objects such as date,
store, name, address, contacts, etc. For example, in an E-Commerce use case, a
Dimension can be:
Product
Order Details
Order Items
Departments
Customers (etc).
For example, in an E-Commerce use case, one of the Fact Tables can be of
orders, which holds the products’ daily ordered quantity. Facts may contain
numeric measures, such as quantity and price, together with foreign keys that
reference the Dimension Tables.
Star Schema: The Star Schema is the Schema with the simplest
structure. In a Star Schema, the Fact Table sits at the center and is
surrounded by a set of Dimension Tables; each dimension is represented
by exactly one Dimension Table. These Dimension Tables are not fully
normalized. In this Schema, the Dimension Tables contain a set of
attributes that describe the Dimension, and the Fact Table contains
foreign keys that are joined with the Dimension Tables to obtain results.
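To make the Star Schema join concrete, here is a small hypothetical sketch in Python/pandas (all table names, column names, and values are invented): a Fact Table holding measures and foreign keys is joined with its Dimension Tables to answer a query. In a real warehouse the same joins would be expressed in SQL against the star schema tables.

```python
import pandas as pd

# Hypothetical dimension tables describing products and stores
dim_product = pd.DataFrame({
    "product_id":   [1, 2],
    "product_name": ["Laptop", "Phone"],
})
dim_store = pd.DataFrame({
    "store_id": [10, 20],
    "region":   ["North", "South"],
})

# Fact table: numeric measures plus foreign keys into the dimensions
fact_orders = pd.DataFrame({
    "product_id": [1, 2, 1],
    "store_id":   [10, 10, 20],
    "quantity":   [3, 5, 2],
})

# Join the fact table with its dimensions (the foreign-key joins of a star schema)
result = (fact_orders
          .merge(dim_product, on="product_id")
          .merge(dim_store, on="store_id")
          .groupby(["region", "product_name"])["quantity"].sum())
print(result)
```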
2. Web scraping:
This can be done by writing code that interacts with the website's HTML
or XML code to extract the desired data.
4. Data mining:
6. Data integration:
This can be done through the use of specialized tools or by writing code
to extract and merge the data from different sources.
7. Data transformation:
1. Normalization:
a. Normalization rescales numeric values to a common range, such as 0 to 1.
b. This is often done to ensure that different variables are on the same
scale and can be compared more easily.
2. Aggregation:
a. Aggregation combines many data points into summary values, such as sums
or averages.
b. This can be useful for reducing the size of a dataset or for summarizing
data in a way that is more meaningful.
3. Filtering:
a. Filtering involves selecting only a subset of the data for further analysis,
based on certain criteria.
b. This can be useful for removing irrelevant or noisy data and focusing on
the most important information.
4. Encoding:
a. Encoding converts categorical values into a numeric representation that
analysis algorithms can process.
5. Feature selection:
a. Feature selection keeps only the attributes that are most relevant to the
analysis, discarding the rest. (A short sketch of these techniques follows
below.)
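The sketch below illustrates these transformation techniques with pandas on a made-up table (all names and values are hypothetical):

```python
import pandas as pd

# Hypothetical sales data, invented for illustration
df = pd.DataFrame({
    "region": ["North", "South", "North", "West"],
    "sales":  [100.0, 200.0, 150.0, 250.0],
})

# 1. Normalization: rescale sales to the 0-1 range (min-max normalization)
s = df["sales"]
df["sales_norm"] = (s - s.min()) / (s.max() - s.min())

# 2. Aggregation: summarize sales per region
per_region = df.groupby("region")["sales"].sum()

# 3. Filtering: keep only the rows matching a criterion
big_sales = df[df["sales"] > 120]

# 4. Encoding: convert the categorical region into numeric indicator columns
encoded = pd.get_dummies(df, columns=["region"])

# 5. Feature selection: keep only the columns relevant to the analysis
selected = df[["region", "sales_norm"]]
print(encoded)
```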
2. Relational Model:
a. Data is organized into tables (relations) made up of rows and columns,
linked together by keys.
3. Object-Oriented Model:
a. Data is represented as objects, similar to the objects used in
object-oriented programming.
b. Objects can have attributes and methods, and can be related to other
objects through relationships.
4. Hierarchical Model:
a. Data is organized in a tree structure, with each record having a single
parent.
5. Network Model:
a. Data is organized as records connected by links, allowing many-to-many
relationships.
6. XML:
a. Data is represented as nested, tagged elements within XML documents.
This data can be collected and stored in a data warehouse, along with other
types of data, to be analyzed and used for business purposes.
There are several ways in which data warehouses and the web are related:
a. Web data can be collected from various sources, such as website logs,
online surveys, and social media platforms, and used to populate a data
warehouse.
b. This data can then be analyzed to gain insights into customer behavior,
preferences, and trends.
Overall, data warehouses and the web are related in that data warehouses can
be used to store, analyze, and serve data from the web, and can be accessed
and queried through the web.
Data partitioning is the process of dividing a large dataset into smaller, more
manageable pieces, called partitions. It is often used in data warehouses to
improve query performance and reduce access times.
It is often used to improve the performance and scalability of database systems
and data analysis applications.
There are several ways in which data partitioning can help to reduce the query
access time from a data warehouse:
1. Improved indexing:
a. Data partitioning allows an index to be built on each partition separately.
b. This can improve the performance of queries that use the index, as each
index will be smaller and more efficient.
2. Parallel processing:
a. Data partitioning allows different partitions to be processed in parallel,
across multiple processors or nodes.
b. This can greatly reduce the time it takes to execute a query, especially
for large datasets.
3. Data locality:
a. Data partitioning can improve data locality, which refers to the idea that
data that is accessed together should be stored close to each other.
4. Reduced I/O: Data partitioning can also reduce the amount of input/output
(I/O) required to retrieve data from a data warehouse, as it allows you to
retrieve only the data that is needed for a specific query rather than reading
through the entire dataset. This can further improve query performance and
reduce access times.
Overall, data partitioning is a useful technique for reducing the query access time
from a data warehouse by improving indexing, enabling parallel processing,
improving data locality, and reducing I/O.
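A minimal sketch of the idea follows, using an in-memory pandas partitioning scheme (the data and the year-based partitioning key are invented for illustration). Real warehouses partition at the storage level, but the pruning principle is the same:

```python
import pandas as pd

# Hypothetical sales data spanning several years
sales = pd.DataFrame({
    "year":   [2021, 2021, 2022, 2022, 2023],
    "amount": [100, 150, 200, 250, 300],
})

# Partition the dataset by year: each partition is a smaller, separate piece
partitions = {year: part for year, part in sales.groupby("year")}

# A query for 2022 touches only the 2022 partition and skips the rest
# (partition pruning): less data scanned means less I/O and a faster answer
result = partitions[2022]["amount"].sum()
print(result)  # 450
```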
There are several activities that are typically performed during the deployment of
a data warehouse, including:
2. Data cleaning and transformation:
a. Before data can be loaded into the data warehouse, it may need to be
cleaned and transformed to ensure it is in a consistent and usable
format.
3. Data loading:
a. Once the data has been cleaned and transformed, it can be loaded into
the data warehouse.
b. This can involve using ETL tools or scripts to extract the data from the
source systems, transform it into the appropriate format, and load it into
the data warehouse.
4. Data quality checks:
a. After the data has been loaded into the data warehouse, it is important to
perform quality checks to ensure that the data is accurate and complete.
b. This can involve verifying that all the data has been loaded correctly,
checking for inconsistencies or errors, and comparing the data to the
source systems to ensure it is accurate.
5. Data indexing:
a. Once the data has been loaded and quality checked, it is typically
indexed to make it easier to search and retrieve.
6. Testing and validation:
a. Before the data warehouse goes live, it should be tested and validated.
b. This can involve testing the ETL process, running sample queries, and
verifying the data quality.
7. Production deployment:
a. Once the data warehouse has been tested and validated, it can be
deployed into production and made available for users to access and
query. (A toy sketch of these deployment steps follows below.)
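As a toy end-to-end illustration of the extract, transform, load, and quality-check steps above, here is a minimal Python sketch using in-memory SQLite databases. All table and column names are invented; a real deployment would use ETL tools against actual source systems.

```python
import sqlite3
import pandas as pd

# Extract: read raw rows from a hypothetical operational source (in-memory here)
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 10.0), (2, None), (3, 30.0)])
raw = pd.read_sql_query("SELECT * FROM orders", source)

# Transform: clean the data (here, fill a missing amount with 0)
raw["amount"] = raw["amount"].fillna(0.0)

# Load: write the cleaned rows into the warehouse table
warehouse = sqlite3.connect(":memory:")
raw.to_sql("fact_orders", warehouse, index=False)

# Quality check: verify that all rows arrived
count = pd.read_sql_query("SELECT COUNT(*) AS n FROM fact_orders", warehouse)
assert count["n"][0] == len(raw)
```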
A well-planned and managed data warehouse project can help to ensure that
the project is completed on time, within budget, and with the desired level of
quality.
1. Defining the scope:
a. Clearly define the goals and scope of the data warehouse project at the
outset.
b. This can help to ensure that the project stays focused and on track.
2. Creating a project plan:
a. A project plan is a detailed document that outlines the steps and tasks
required to complete the data warehouse project.
3. Securing resources:
a. In order to complete the data warehouse project, you will need to identify
and secure the necessary resources, such as personnel, software,
hardware, and data sources.
4. Managing risks:
a. All projects carry some level of risk, and it is important to identify and
manage these risks throughout the data warehouse project.
5. Tracking progress:
a. Regularly track progress against the project plan so that delays and
issues can be detected and addressed early.
1. Roll-up:
a. A roll-up operation aggregates data by climbing up a dimension hierarchy
or by reducing the number of dimensions.
b. For example, you might roll up the data by region to see the total sales
for each region. This might give you a result like this:
Region | Sales
North | $100,000
South | $200,000
East | $150,000
West | $250,000
2. Drill-down:
a. A drill-down operation is the reverse of roll-up: it moves from summarized
data down to more detailed data, for example from yearly totals to
monthly figures.
3. Slice and dice:
a. A slice and dice operation involves selecting and organizing data from
multiple dimensions in order to view it from different perspectives.
b. For example, a retail company might use OLAP to analyze sales data by
store location, time period, and product category to identify which
products are selling well in which locations and at which times of year.
4. Pivot:
a. A pivot (rotate) operation reorients the data view, for example by
swapping rows and columns.
b. For example, a company might use OLAP to pivot sales data by product
and time period to see how different products are performing over time.
5. Hierarchical analysis:
a. Hierarchical analysis involves examining data at different levels of a
dimension hierarchy, such as country, state, and city.
Overall, OLAP operations are useful for quickly and easily analyzing and
manipulating large amounts of data in a data warehouse, allowing users to gain
insights and make informed decisions.
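The following hypothetical pandas sketch mimics these OLAP operations on a tiny sales cube (all names and figures are invented); a real OLAP server would perform the same operations against the warehouse:

```python
import pandas as pd

# Hypothetical sales cube stored as a flat table
cube = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["Laptop", "Phone", "Laptop", "Phone"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales":   [100000, 50000, 200000, 80000],
})

# Roll-up: aggregate away the product and quarter dimensions, keeping region
rollup = cube.groupby("region")["sales"].sum()

# Slice: fix one dimension to a single value (quarter = Q1)
q1_slice = cube[cube["quarter"] == "Q1"]

# Dice: restrict several dimensions at once
dice = cube[(cube["quarter"] == "Q1") & (cube["region"] == "North")]

# Pivot: rotate the view so products become columns and quarters become rows
pivot = cube.pivot_table(index="quarter", columns="product",
                         values="sales", aggfunc="sum", fill_value=0)
print(pivot)
```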
a. This involves transferring the data from the various source systems
into the data warehouse.
b. This can be done using ETL (extract, transform, and load) tools,
which are used to extract the data from the source systems,
transform it into the appropriate format, and load it into the data
warehouse.
ROLAP architecture includes the following components:
Database server.
ROLAP server.
Front-end tool.
MOLAP architecture includes the following components:
Database server.
MOLAP server.
Front-end tool.
Data mining task primitives are the basic operations that are used to
perform data mining tasks.
These task primitives can be used to perform a wide range of data mining
tasks, including classification, regression, clustering, association rule
mining, and outlier detection.
a. The selection primitive allows the user to select a subset of the data
to be used for a data mining task.
b. This can be useful if the user only wants to analyze a specific subset
of the data, or if the user wants to exclude certain data from the
analysis.
2. Projection:
a. The projection primitive allows the user to select a subset of the
attributes (columns) of the data to be used for a data mining task.
b. This can be useful if the user only wants to analyze a specific set of
attributes or if the user wants to exclude certain attributes from the
analysis.
3. Sampling:
a. The sampling primitive allows the user to select a sample of the data
to be used for a data mining task.
b. This can be useful if the user only wants to analyze a small subset of
the data, or if the user wants to reduce the computational burden of
the data mining task.
4. Aggregation:
a. The aggregation primitive allows the user to group data together and
perform calculations on the aggregated data.
5. Sorting:
a. The sorting primitive allows the user to order the data according to
specific attributes or features.
b. This can be useful for organizing data or for identifying patterns in the
data.
6. Ranking:
a. The ranking primitive allows the user to rank data points according to a
score or measure.
b. This can be useful for identifying the most important or influential data
points.
7. Transformation:
a. The transformation primitive allows the user to modify the data, for
example by normalizing, discretizing, or deriving new attributes.
b. This can be useful for preparing the data for analysis or for adjusting
the data to meet the requirements of a specific data mining algorithm.
(A combined sketch of these primitives follows this list.)
8. Model building:
a. The model building primitive allows the user to build a model based
on the data, which can then be used for prediction or other data
mining tasks.
b. This can be useful for making predictions about future data or for
identifying patterns in the data that are not immediately apparent.
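To tie the primitives together, here is a minimal Python sketch using pandas and scikit-learn on an invented customer table. It shows selection, projection, sampling, aggregation, sorting, and model building as an illustration of the concepts, not a production workflow:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical customer data, invented for illustration
df = pd.DataFrame({
    "age":    [22, 35, 47, 52, 29, 41],
    "income": [30, 60, 80, 90, 40, 70],
    "bought": [0, 1, 1, 1, 0, 1],
})

subset   = df[df["age"] > 25]                     # selection: subset of the rows
features = subset[["age", "income"]]              # projection: subset of attributes
sample   = df.sample(n=4, random_state=0)         # sampling: reduce the workload
stats    = df.groupby("bought")["income"].mean()  # aggregation: grouped summary
ranked   = df.sort_values("income", ascending=False)  # sorting/ranking

# Model building: fit a classifier that predicts "bought" from the features
model = DecisionTreeClassifier().fit(df[["age", "income"]], df["bought"])
print(model.predict([[30, 55]]))
```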
UNIT - 4
Data mining can be used to make predictions, identify trends and patterns,
and generate insights that can inform decision-making.
Data mining requires a large amount of data, which may be collected from
various sources such as databases, websites, and sensors.
It also requires specialized software and tools to analyze and process the
data.
Data mining often involves working with large and complex datasets, and
may require the use of specialized hardware and infrastructure such as
clusters or cloud computing platforms.
Data mining is a rapidly evolving field, with new techniques and tools being
developed constantly.
1. Classification:
a. Classification assigns data items to predefined classes or categories,
based on a model learned from labeled examples.
2. Clustering:
a. Clustering groups similar data items together without predefined labels.
3. Regression:
a. Regression predicts a continuous numeric value from the input attributes.
4. Association:
a. This data mining technique helps to discover a link between two or more
items. It finds hidden patterns in the data set.
5. Sequential Patterns :
a. Sequential pattern mining discovers patterns that occur in a particular
order over time, such as sequences of purchases.
6. Prediction :
a. Prediction combines other techniques, such as classification and
regression, to forecast future values or events.
7. Decision trees :
a. Decision tree is one of the most used data mining techniques because its
model is easy to understand for users.
Clustering
Clustering can also be combined with association rule mining: the data is first
grouped into clusters, and association rules are then mined within each cluster.
The quality of the association rules discovered will depend on the quality of
the clusters produced by the clustering algorithm.
It involves identifying patterns and trends in data, and using that information
to understand the underlying relationships and structures in the data.
The goal of knowledge discovery is to extract knowledge from data and use it
to make informed decisions and predictions.
1. Data selection:
a. The first step in KDD is to select the data that will be analyzed.
2. Data preprocessing:
a. Once the data has been selected, it is often necessary to clean and
preprocess the data to make it suitable for analysis.
3. Data transformation:
a. The next step is to transform the data into a format that is suitable for
analysis.
4. Data mining:
a. The data is then analyzed using various data mining techniques, such as
machine learning algorithms or statistical models, to identify patterns and
trends in the data.
5. Pattern evaluation:
a. The discovered patterns and trends are then evaluated to determine their
significance and usefulness.
6. Knowledge presentation:
a. The final step is to present the knowledge that has been discovered in a
clear and understandable way, typically through the use of visualizations
or reports.
Throughout the KDD process, it is important to carefully consider the goals of the
analysis and the characteristics of the data, as well as to validate the results
obtained.
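The following minimal scikit-learn sketch walks through the KDD steps on a small invented dataset (all names and values are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Data selection: a hypothetical dataset chosen for analysis
df = pd.DataFrame({
    "age":    [22, 35, 47, 52, 29, 41, 38, 60],
    "income": [30, 60, 80, 90, 40, 70, 65, 95],
    "label":  [0, 1, 1, 1, 0, 1, 1, 1],
})

# 2. Data preprocessing: drop rows with missing values
df = df.dropna()

# 3. Data transformation: rescale the features to a common range
X = MinMaxScaler().fit_transform(df[["age", "income"]])
y = df["label"]

# 4. Data mining: train a model on part of the data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier().fit(X_tr, y_tr)

# 5. Pattern evaluation: check how well the discovered model generalizes
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))

# 6. Knowledge presentation would follow (reports, visualizations)
```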
Web mining is the process of using data mining techniques to discover patterns
and trends in data that is collected from the World Wide Web.
It involves collecting, storing, and analyzing data from websites and social media
platforms to gain insights into customer behavior, preferences, and trends.
Spatial mining is the process of applying data mining techniques to spatial
(geographic) data. Spatial data can include information about the location,
shape, and attributes of geographical features, such as points, lines, and
polygons.
Spatial mining can be performed using a variety of tools and techniques, including
GIS software, spatial database systems, and machine learning algorithms.
Web content mining and web usage mining are two different types of data mining
techniques that are used to extract information from the World Wide Web.
Web content mining
It involves analyzing the content of web pages and extracting information that is
relevant to a specific topic or task.
Web usage mining
It is the process of analyzing and interpreting data about the behavior of users
on the web. It involves tracking the actions of users as they interact with web
pages, such as the pages they visit, the links they click on, and the time they
spend on each page.
Web usage mining is often used to understand user behavior and identify
patterns in how users interact with web pages.
Web Content Mining | Web Usage Mining
It extracts data from the content of web pages. | It extracts data from the actions of users on the web.
It extracts structured data from web pages, such as text and images. | It extracts data about user behavior, such as page visits and clickstream data.
It involves the use of natural language processing and machine learning. | It involves the use of log file analysis and data visualization tools to analyze and interpret user behavior.
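As a small illustration of web usage mining, the sketch below parses a few hypothetical web server log lines (invented for illustration) and counts page visits, a simple clickstream pattern:

```python
from collections import Counter

# Hypothetical web server log lines in common log format
log_lines = [
    '1.2.3.4 - - [01/Jan/2024:10:00:00] "GET /home HTTP/1.1" 200 512',
    '1.2.3.4 - - [01/Jan/2024:10:01:00] "GET /products HTTP/1.1" 200 1024',
    '5.6.7.8 - - [01/Jan/2024:10:02:00] "GET /home HTTP/1.1" 200 512',
]

# Web usage mining: extract the requested page from each log entry
pages = [line.split('"')[1].split()[1] for line in log_lines]

# Count page visits to find the most popular pages (a simple usage pattern)
print(Counter(pages).most_common())  # [('/home', 2), ('/products', 1)]
```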
1. Customer segmentation: grouping customers with similar behavior so they can
be targeted with tailored offers.
2. Fraud detection: identifying unusual transaction patterns that may indicate
fraudulent activity.
3. Personalized recommendations: suggesting products to a customer based on
their past behavior and the behavior of similar customers.
4. Price optimization: analyzing demand patterns to set prices that maximize
revenue.
6. Inventory management: forecasting demand so that stock can be kept at the
right level.
1. Marketing:
2. Healthcare:
3. Finance:
4. Retail:
5. Fraud detection:
6. Manufacturing:
7. Telecommunications:
Overall, data mining has the potential to unlock valuable insights from large
datasets and inform a wide range of business and research applications.
a. Web content mining involves the extraction of structured data from web
pages, such as text, images, and other media.
a. Web usage mining involves the analysis of data about the behavior of
users on the web, such as page visits and clickstream data.