Data Mining - UNIT-I

A data warehouse is a type of data management system that is designed to enable and support business
intelligence (BI) activities, especially analytics. Data warehouses are solely intended to perform queries
and analysis and often contain large amounts of historical data. The data within a data warehouse is usually
derived from a wide range of sources such as application log files and transaction applications.

Data Warehouse

A data warehouse exhibits the following characteristics to support the management's decision-making
process −

 Subject Oriented − A data warehouse is subject oriented because it provides information
around a subject rather than the organization's ongoing operations. These subjects can be products,
customers, suppliers, sales, revenue, etc. The data warehouse does not focus on ongoing
operations; rather, it focuses on the modelling and analysis of data for decision-making.
 Integrated − A data warehouse is constructed by integrating data from heterogeneous sources
such as relational databases, flat files, etc. This integration enhances the effective analysis of data.
 Time Variant − The data collected in a data warehouse is identified with a particular time period.
The data in a data warehouse provides information from a historical point of view.
 Non-volatile − Non-volatile means that previous data is not removed when new data is added.
The data warehouse is kept separate from the operational database, so frequent changes in the
operational database are not reflected in the data warehouse.

A data warehouse centralizes and consolidates large amounts of data from multiple sources. Its analytical
capabilities allow organizations to derive valuable business insights from their data to improve decision-
making. Over time, it builds a historical record that can be invaluable to data scientists and business
analysts. Because of these capabilities, a data warehouse can be considered an organization’s “single
source of truth.”

A typical data warehouse often includes the following elements:

 A relational database to store and manage data
 An extraction, loading, and transformation (ELT) solution for preparing the data for analysis
 Statistical analysis, reporting, and data mining capabilities
 Client analysis tools for visualizing and presenting data to business users
 Other, more sophisticated analytical applications that generate actionable information by applying
data science and artificial intelligence (AI) algorithms, or graph and spatial features that enable
more kinds of analysis of data at scale

Organizations can also select a solution combining transaction processing, real-time analytics across data
warehouses and data lakes, and machine learning in one MySQL Database service—without the
complexity, latency, cost, and risk of extract, transform, and load (ETL) duplication.

Benefits of a Data Warehouse

Data warehouses offer the overarching and unique benefit of allowing organizations to analyze large
amounts of variant data and extract significant value from it, as well as to keep a historical record. Four
unique characteristics (described by computer scientist William Inmon, who is considered the father of the
data warehouse) allow data warehouses to deliver this overarching benefit. According to this definition,
data warehouses are

 Subject-oriented. They can analyze data about a particular subject or functional area (such as
sales).
 Integrated. Data warehouses create consistency among different data types from disparate sources.
 Nonvolatile. Once data is in a data warehouse, it’s stable and doesn’t change.
 Time-variant. Data warehouse analysis looks at change over time.

A well-designed data warehouse will perform queries very quickly, deliver high data throughput, and
provide enough flexibility for end users to “slice and dice” or reduce the volume of data for closer
examination to meet a variety of demands—whether at a high level or at a very fine, detailed level. The
data warehouse serves as the functional foundation for middleware BI environments that provide end users
with reports, dashboards, and other interfaces.

Difference between Operational Database System and Data Warehouse

The operational database is the source of information for the data warehouse. It includes detailed
information used to run the day-to-day operations of the business. The data changes frequently as updates
are made and reflects the current values of the latest transactions. Operational database management
systems, also called OLTP (Online Transaction Processing) databases, are used to manage dynamic data in
real time.

Data warehouse systems serve users or knowledge workers for the purpose of data analysis and decision-
making. Such systems can organize and present information in specific formats to accommodate the
diverse needs of various users. These systems are called Online Analytical Processing (OLAP) systems.

Data Warehouse and the OLTP database are both relational databases. However, the goals of both these
databases are different.

Operational Database vs. Data Warehouse

1. Operational systems are designed to support high-volume transaction processing, whereas data
warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).
2. Operational systems are usually concerned with current data, whereas data warehousing systems are
usually concerned with historical data.
3. Data within operational systems is updated regularly according to need; a data warehouse is
non-volatile, so new data may be added regularly but, once added, is rarely changed.
4. An operational database is designed for real-time business dealings and processes, whereas a data
warehouse is designed for analysis of business measures by subject area, categories, and attributes.
5. An operational database is optimized for a simple set of transactions, generally adding or retrieving
a single row at a time per table, whereas a data warehouse is optimized for bulk loads and large,
complex, unpredictable queries that access many rows per table.
6. An operational database is optimized for validating incoming information during transactions and
uses validation data tables; a data warehouse is loaded with consistent, valid information and requires
no real-time validation.
7. An operational database supports thousands of concurrent clients; a data warehouse supports a few
concurrent clients relative to OLTP.
8. Operational systems are widely process-oriented, whereas data warehousing systems are widely
subject-oriented.
9. Operational systems are usually optimized to perform fast inserts and updates of relatively small
volumes of data, whereas data warehousing systems are usually optimized to perform fast retrievals of
relatively high volumes of data.
10. An operational database is about data in; a data warehouse is about data out.
11. An operational database accesses a small number of records per query; a data warehouse accesses a
large number of records.
12. Relational databases are created for Online Transaction Processing (OLTP); data warehouses are
designed for Online Analytical Processing (OLAP).
Feature-by-feature comparison of OLTP and OLAP:

 Characteristic − An OLTP system is used to manage operational data; an OLAP system is used to
manage informational data.
 Users − OLTP: clerks, clients, and information technology professionals. OLAP: knowledge workers,
including managers, executives, and analysts.
 System orientation − An OLTP system is customer-oriented; transaction and query processing are
done by clerks, clients, and information technology professionals. An OLAP system is market-oriented;
data analysis is done by knowledge workers, including managers, executives, and analysts.
 Data contents − An OLTP system manages current data that is typically too detailed to be easily
used for decision making. An OLAP system manages large amounts of historical data, provides
facilities for summarization and aggregation, and stores and manages data at different levels of
granularity; this makes the data easier to use for informed decision making.
 Database size − OLTP: 100 MB to GB. OLAP: 100 GB to TB.
 Database design − An OLTP system usually uses an entity-relationship (ER) data model and an
application-oriented database design. An OLAP system typically uses a star or snowflake model and a
subject-oriented database design.
 View − An OLTP system focuses mainly on the current data within an enterprise or department,
without referring to historical information or data in different organizations. An OLAP system often
spans multiple versions of a database schema, due to the evolutionary process of an organization, and
also deals with data that originates from various organizations, integrating information from many
data stores.
 Volume of data − OLTP: not very large. OLAP: because of their large volume, OLAP data are
stored on multiple storage media.
 Access patterns − The access patterns of an OLTP system consist mainly of short, atomic
transactions; such a system requires concurrency control and recovery techniques. Accesses to OLAP
systems are mostly read-only, because data warehouses store historical data.
 Access mode − OLTP: read/write. OLAP: mostly read.
 Inserts and updates − OLTP: short, fast inserts and updates initiated by end users. OLAP:
periodic long-running batch jobs refresh the data.
 Number of records accessed − OLTP: tens. OLAP: millions.
 Normalization − OLTP: fully normalized. OLAP: partially normalized.
 Processing speed − OLTP: very fast. OLAP: depends on the amount of data involved; batch data
refreshes and complex queries may take many hours, and query speed can be improved by creating
indexes.
Why have a separate data warehouse?

• A major reason is to help promote the high performance of both the operational database and the data warehouse.

• An operational database is designed and tuned for tasks such as indexing and hashing using primary
keys, searching for particular records and optimizing queries.

• Data warehouse queries are often complex. These queries involve the computation of large groups of
data at summarized levels, and they may require special data organization, access, and
implementation methods based on multidimensional views.

• So, processing OLAP queries in operational databases would degrade the performance of
operational tasks.

• An operational database supports the concurrent processing of multiple transactions; concurrency
control and recovery mechanisms such as locking and logging are required to ensure the consistency and
robustness of transactions.

• An OLAP query needs read-only access to data records for summarization and aggregation. If
concurrency control and recovery mechanisms were applied to such OLAP operations, they would harm
the execution of concurrent transactions and reduce the throughput of the OLTP system.

• A data warehouse requires consolidation of data from heterogeneous sources, which results in high-
quality, clean, and integrated data.

• As the two systems provide different functionalities and require different kinds of data, it is
necessary to maintain them separately.

What is Data Mining?

The process of extracting information from huge sets of data to identify patterns, trends, and useful
insights that allow a business to make data-driven decisions is called data mining.

In other words, data mining is the process of examining data from various perspectives to uncover hidden
patterns and categorize them into useful information. This information is collected and assembled in
repositories such as data warehouses, analyzed efficiently with data mining algorithms, and used to support
decision-making, ultimately helping to cut costs and generate revenue.

Data mining is the act of automatically searching large stores of information to find trends and
patterns that go beyond simple analysis procedures. Data mining uses complex mathematical algorithms to
segment the data and evaluate the probability of future events. Data mining is also called Knowledge
Discovery from Data (KDD).

Data Mining is a process used by organizations to extract specific data from huge databases to solve
business problems. It primarily turns raw data into useful information.

Data mining involves an integration of techniques from multiple disciplines such as database and
data warehouse technology, statistics, machine learning, high-performance computing, pattern recognition,
neural networks, data visualization, information retrieval, image and signal processing, and spatial or
temporal data analysis.

Data mining is the process of discovering interesting knowledge from large amounts of data stored in
databases, data warehouses or other information repositories such as relational database systems,
transaction processing systems and file systems.
What kind of patterns can be mined?

Based on data mining functionalities, patterns can be classified into two categories.

1) Descriptive patterns.

2) Predictive patterns.

Descriptive patterns

Descriptive patterns deal with the general characteristics of the data and convert them into relevant and helpful information.

Descriptive patterns can be divided into the following patterns:

1) Clusters
2) Associations
3) Frequent patterns
4) Correlations
5) Class/concept

 Class/concept description: Data entries are associated with labels or classes. For instance, in a
library, the classes of borrowed items include books and research journals, and the concepts of
customers include registered members and non-registered members. These types of descriptions are
class or concept descriptions.

 Frequent patterns: These are patterns that occur often in the dataset. There are many
kinds of recurring patterns, such as frequent itemsets, frequent subsequences, and frequent
substructures.

 Associations: These show relationships between data items, expressed as association rules. For
instance, a shopkeeper derives an association rule that 70% of the time, when a football is sold, a kit
is bought alongside. The two items can be combined together to form an association.
 Correlations: Correlation analysis is performed to find the statistical relationship between two data
attributes, to determine whether the relationship is positive, negative, or absent.

 Clusters: This is the formation of groups of similar data points. Each point in a cluster is
similar to the other members of its group but quite different from the members of other groups.
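
As an illustration of cluster formation, here is a minimal sketch using the KMeans implementation from
scikit-learn (the library choice, the sample points, and the choice of two clusters are assumptions made
for illustration; the notes do not prescribe any particular algorithm or tooling):

from sklearn.cluster import KMeans

# Two visually separate groups of 2-D points; k-means should recover them.
points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
          [8.0, 8.2], [8.3, 7.9], [7.8, 8.1]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # e.g. [0 0 0 1 1 1]: the cluster each point belongs to
print(km.cluster_centers_)  # the centre of each group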

2) Predictive patterns: These perform inference on the current data in order to make predictions. They
predict future values by analyzing data patterns and their outcomes based on previous data, and they also
help us find missing values in the data.

Predictive patterns can be categorized into the following patterns.

1) Outlier analysis

2) Evolution analysis

3) Regression

4) Classification

• Classification: It helps predict the label of unknown data points with the help of known data points.
For instance, if we have a dataset of X-rays of cancer patients, then the possible labels would be
cancer patient and non-cancer patient. These classes can be obtained by data characterization or
by data discrimination.
• Regression: Unlike classification, regression is used to find missing numeric values in the
dataset. It is also used to predict future numeric values. For instance, we can predict next
year's sales based on the past twenty years' sales by finding the relation between the data.
• Outlier analysis: Not all data points in a dataset follow the same behavior. Data points
that do not follow the usual behavior are called outliers. The analysis of these outliers is called
outlier analysis. Outliers are often set aside while working on the data.
• Evolution analysis: As the name suggests, this analysis concerns data points whose behavior and
trends change with time.
• In some cases users may have no idea what kinds of patterns in their data may be
interesting, so they may like to search for several different kinds of patterns in parallel. It is
therefore important to have a data mining system that can mine multiple kinds of patterns to
accommodate different user expectations or applications.
• The different kinds of patterns that are extracted from various types of data are itemsets, sequences,
rules, graphs, and events.

Class/concept description:

Class/Concept refers to the data to be associated with the classes or concepts.

For example, in a company, the classes of items for sales include computer and printers, and concepts of
customers include big spenders and budget spenders. Such descriptions of a class or a concept are called
class/concept descriptions. These descriptions can be derived by the following two ways −

 Data Characterization − This refers to summarizing the data of the class under study, which is
called the target class. The output of data characterization can be presented in various
forms, for example pie charts, bar charts, curves, multidimensional data cubes, and multidimensional
tables, including crosstabs.
 Data Discrimination − It refers to the mapping or classification of a class with some predefined
group or class. It is the comparison of the general features of target class data objects with the
general features of objects from one or a set of contrasting classes. The target and contrasting
classes can be specified by the user and the corresponding data objects retrieved through database
queries.
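
As a small illustration of the two approaches, the sketch below summarizes a target class alone
(characterization) and compares it against a contrasting class (discrimination). The use of the pandas
library, the customer segments, and all figures are assumptions invented for the example:

import pandas as pd

# Invented customer data with a target class ("big" spenders) and a
# contrasting class ("budget" spenders).
df = pd.DataFrame({
    "segment":      ["big", "big", "big", "budget", "budget", "budget"],
    "age":          [38, 45, 41, 24, 29, 22],
    "annual_spend": [5200, 6100, 5800, 400, 650, 520],
})

# Data characterization: summarize the general features of the target class alone.
print(df[df["segment"] == "big"][["age", "annual_spend"]].mean())

# Data discrimination: compare the general features of the target class with
# those of the contrasting class.
print(df.groupby("segment")[["age", "annual_spend"]].mean())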

Mining of Frequent Patterns

Frequent patterns are those patterns that occur frequently in transactional data. The kinds of frequent
patterns include −

 Frequent Item Set − A set of items that frequently appear together, for example, milk
and bread.
 Frequent Subsequence − A sequence of patterns that occurs frequently, such as purchasing a camera
followed by a memory card.
 Frequent Sub Structure − Substructure refers to different structural forms, such as graphs, trees,
or lattices, which may be combined with itemsets or subsequences.
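
The core computation behind mining frequent itemsets is counting how often each candidate itemset
appears in the transactions, as in algorithms such as Apriori. Below is a minimal sketch of that counting
step; the transactions and the 50% minimum-support threshold are invented for illustration:

from itertools import combinations
from collections import Counter

transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
min_support = 0.5  # keep itemsets appearing in at least 50% of transactions

# Count every 1-item and 2-item subset of each transaction.
counts = Counter()
for t in transactions:
    for k in (1, 2):
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1

# Report the itemsets whose support clears the threshold.
for itemset, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    support = n / len(transactions)
    if support >= min_support:
        print(itemset, f"support={support:.2f}")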

Association and correlation

• In data mining, association and correlation are key techniques for extracting patterns and
relationships from large datasets. Association exposes the relationships between items; in general, it
refers to any relationship between two variables.
• Association is a technique used in data mining to identify the relationships or co-occurrences
between items in a dataset. It involves analyzing large datasets to discover patterns or associations
between items, such as products purchased together in a supermarket or web pages frequently
visited together on a website. Association analysis is based on the idea of finding the most frequent
patterns or itemsets in a dataset, where an itemset is a collection of one or more items.
• Correlation measures the strength of the link between two variables.
• Correlation is a statistical measure that expresses the extent to which two variables are linearly
related (meaning they change together at a constant rate). It's a common tool for describing simple
relationships without making a statement about cause and effect.
• Correlation Analysis is a data mining technique used to identify the degree to which two or more
variables are related or associated with each other. Correlation refers to the statistical relationship
between two or more variables, where the variation in one variable is associated with the variation
in another variable. In other words, it measures how changes in one variable are related to changes
in another variable. Correlation can be positive, negative, or zero, depending on the direction and
strength of the relationship between the variables.
• For example, suppose we are studying the relationship between the hours of study and the grades
obtained by students. If we find that as the number of hours of study increases, the grades obtained also
increase, then there is a positive correlation between the two variables.
• On the other hand, if we find that as the number of hours of study increases, the grades obtained
decrease, then there is a negative correlation between the two variables.
• If there is no relationship between the two variables, we would say that there is zero correlation.
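
The sketch below makes both ideas concrete: it computes the Pearson correlation coefficient for the
hours-of-study versus grades example, and the support and confidence of the football-and-kit association
rule. All numbers are invented for illustration:

# Correlation: how strongly two variables move together (Pearson's r).
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

hours  = [1, 2, 3, 4, 5, 6]
grades = [52, 55, 61, 70, 74, 80]
print(f"r(hours, grades) = {pearson(hours, grades):.2f}")  # close to +1

# Association: support and confidence of the rule {football} -> {kit}.
baskets = [{"football", "kit"}, {"football"}, {"football", "kit"}, {"kit"}]
both = sum(1 for b in baskets if {"football", "kit"} <= b)
football = sum(1 for b in baskets if "football" in b)
print(f"support = {both / len(baskets):.2f}, confidence = {both / football:.2f}")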

Classification and regression for predictive analysis

• Classification: A classification is an ordered set of related categories used to group data
according to its similarities. It consists of codes and descriptors and allows survey responses to be
put into meaningful categories in order to produce useful data. A classification is a useful tool for
anyone developing statistical surveys.
• It is the process of finding a model (or function) that describes and distinguishes data classes or
concepts, for the purpose of being able to use the model to predict the class of objects whose class
label is unknown.
• Classification predicts categorical labels, whereas prediction models continuous-valued functions;
prediction is used to estimate missing or unavailable numerical data values rather than class labels.
• The classification process deals with problems where data can be divided into binary or multiple
discrete labels.
• Regression is a statistical technique that relates a dependent variable to one or more
independent variables. A regression model is able to show whether changes observed in the
dependent variable are associated with changes in one or more of the independent variables.
• It is the process of finding a model or function that distinguishes the data into continuous real
values instead of classes or discrete values. It can also identify the distribution movement depending
on the historical data.
• In regression we have to predict a continuous target variable using independent features.
• Here we have two types of problems: linear regression and nonlinear regression.
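
To make the contrast concrete, here is a minimal sketch of both tasks using scikit-learn (an assumed
library choice; the features, labels, and sales figures are invented for illustration):

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Classification: predict a discrete label from labelled examples.
X_train = [[2.1], [1.9], [8.0], [8.4]]          # a single measured feature
y_train = ["no cancer", "no cancer", "cancer", "cancer"]
clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict([[7.5]]))                      # -> ['cancer']

# Regression: predict a continuous value (e.g. next year's sales).
years = [[1], [2], [3], [4], [5]]
sales = [10.0, 12.1, 13.9, 16.2, 18.0]           # invented sales figures
reg = LinearRegression().fit(years, sales)
print(reg.predict([[6]]))                        # extrapolated sales for year 6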

What is an Outlier in Data Mining?

Whenever we talk about data analysis, the term outlier often comes to mind. As the name
suggests, "outliers" refer to data points that lie outside of what is expected. The major question
about outliers is what you do with them. Whenever you analyze a data set, you will have some
assumptions about how the data was generated. If you find data points that are likely to contain some
form of error, then these are definitely outliers, and depending on the context, you will want to
overcome those errors. In 1969, Grubbs introduced the first formal definition of outliers.

Outliers are discarded in many places when data mining is applied, but outlier analysis is still used in
many applications such as fraud detection and medicine. This is usually because events that occur rarely
can carry much more significant information than events that occur regularly. For example, any unusual
response that occurs due to a medical treatment can be analyzed through outlier analysis in data mining.

The data objects that do not comply with the general behavior or model of data are called outliers. An
outlier is a data point that stands out a lot from other data points in a set.

Most data mining methods discard outliers as noise or exceptions.

The analysis of outlier data is called outlier mining. Outliers may be detected using statistical tests or
using distance measures, in which objects that are a substantial distance from any cluster are considered
outliers.
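
As a minimal sketch of the statistical approach, the z-score test below flags points that lie far from the
mean in units of standard deviation (the data and the threshold of 2 are assumptions chosen for
illustration; thresholds of 2 or 3 are common heuristics):

import statistics

data = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 37.5, 10.3]  # 37.5 is the planted outlier
mean = statistics.mean(data)
stdev = statistics.stdev(data)

# Flag points more than 2 standard deviations away from the mean.
outliers = [x for x in data if abs(x - mean) / stdev > 2]
print(outliers)  # -> [37.5]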

Major issues in Data Mining

Data mining is not an easy task, as the algorithms used can get very complex, and data is not always
available in one place; it needs to be integrated from various heterogeneous data sources. These factors
also create some issues. Here we will discuss the major issues regarding −

 Mining Methodology and User Interaction
 Performance Issues
 Diverse Data Types Issues

Mining Methodology and User Interaction Issues

It refers to the following kinds of issues −

 Mining different kinds of knowledge in databases − Different users may be interested in different
kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge
discovery tasks.
 Interactive mining of knowledge at multiple levels of abstraction − The data mining process
needs to be interactive because interaction allows users to focus the search for patterns, providing and
refining data mining requests based on the returned results.
 Incorporation of background knowledge − Background knowledge can be used to guide the
discovery process and to express the discovered patterns, not only in concise terms but at multiple
levels of abstraction.
 Data mining query languages and ad hoc data mining − A data mining query language that
allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query
language and optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once the patterns are discovered, they
need to be expressed in high-level languages and visual representations. These representations should
be easily understandable.
 Handling noisy or incomplete data − Data cleaning methods are required to handle the noise
and incomplete objects while mining the data regularities. Without data cleaning methods,
the accuracy of the discovered patterns will be poor.
 Pattern evaluation − The patterns discovered should be interesting; patterns that merely represent
common knowledge or lack novelty are of little value.

Performance Issues

There can be performance-related issues such as follows −

 Efficiency and scalability of data mining algorithms − In order to effectively extract
information from the huge amounts of data in databases, data mining algorithms must be efficient and
scalable.
 Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the data
into partitions, which are processed in parallel, and the results from the partitions are then merged,
as in the sketch below. Incremental algorithms update databases without mining the data again from
scratch.
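
Here is a minimal sketch of the partition-and-merge idea using Python's multiprocessing module; the
transactions and the four-way split are invented for illustration:

from collections import Counter
from multiprocessing import Pool

def mine_partition(partition):
    # Mine one partition independently: here, simply count item occurrences.
    counts = Counter()
    for transaction in partition:
        counts.update(transaction)
    return counts

if __name__ == "__main__":
    transactions = [["milk", "bread"], ["bread"], ["milk"], ["milk", "bread"]] * 1000
    # Divide the data into four partitions...
    partitions = [transactions[i::4] for i in range(4)]
    # ...process them in parallel...
    with Pool(4) as pool:
        partial_counts = pool.map(mine_partition, partitions)
    # ...then merge the per-partition results.
    total = sum(partial_counts, Counter())
    print(total.most_common(2))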

Diverse Data Types Issues


 Handling of relational and complex types of data − The database may contain complex data
objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to
mine all these kinds of data.
 Mining information from heterogeneous databases and global information systems − The data
is available at different data sources on a LAN or WAN. These data sources may be structured, semi-
structured, or unstructured. Therefore, mining knowledge from them adds challenges to data
mining.

Data Mining and Society

Data mining is the process of collecting data and then processing it to find useful patterns with the
help of statistics and machine learning processes. By finding relationships within the database,
peculiarities can be easily identified. Aggregating useful datasets from a heap of data in the database
helps the growth of many industries we depend on in our daily life and enhances customer service. We
can't deny the fact that we live in a world of data. From the local grocery store to detecting network
fraud, data mining plays a significant role. Beyond the benefits, data mining has negative impacts on
society such as privacy breaches and security problems. This section shows both the positive and negative
effects of data mining on society.

Positive effects of data mining on society

Data mining has influenced our lives whether we feel it or not. Its applications are widely used in many
fields to reduce strain and save time, and it has improved human life in many ways. Let us look at some
examples.

 Customer relationship management: By using data mining techniques, a company can provide
customized and preferred services for customers, which makes for a pleasant experience
while using the services. By aggregating and grouping the data, the company can create
advertisements only when needed, reaching the right people who require the service. By
targeting the customer, unwanted promotional activities can be avoided, which saves a lot of money
for the company, and the customer is not annoyed by heaps of junk mail and messages. Data mining
can also help in saving time and providing satisfaction to the customers.
 Personalized search engines: In the world of data and networks, our lives have become intertwined
with web browsers. They have obtained an inevitable place in our lifestyle, knowledge, and so on.
With the help of data mining algorithms, the suggestions and the order of websites are tailored
according to the gathered information by summarizing it. Ranking a page according to its
content and its number of visits also helps the web browser provide the results needed for the query
given by the user. By giving a personalized environment, spam and misleading advertisements can be
avoided. Through data mining, frequent spam accounts can be identified and automatically
moved into the spam folder. For example, Gmail has a spam folder where unwanted and frequent junk
messages are placed instead of heaping up in the inbox. Web-wide tracking is a process in which the
system keeps track of every website a user visits. By incorporating the DoubleClick mechanism in
these websites, they can note the websites that have been visited, and personalized lifestyle and
educational ads are made visible on the sites relevant to that user.
 Mining in the health sector: Data mining helps in maintaining human health and welfare.
The layers of data mining embedded in pharmaceutical industries help to analyze data and establish
relationships while creating and improving drugs. It also helps in analyzing the effects of drugs on
patients, including side effects and outcomes. Data mining also helps in tracking the number of
chronically ill patients and ICU patients, which helps reduce the overflow of admissions in hospitals.
Some medicines can also cause side effects or other benefits regardless of what disease they treat. In
such cases, data mining can largely influence the growth of the health sector.
 E-shopping: E-retail platforms are among the fastest-growing major industries in the world. From
books and movies to groceries and lifestyle products, everything is listed on online e-retail platforms.
They cannot run successfully without the help of data mining and predictive analysis. Through these
techniques, cross-selling and holding onto regular customers have become possible. Data mining helps in
announcing offers and discounts to keep customers engaged and to increase sales. By using the algorithms
of data science, an e-commerce website can strongly influence customers using targeted ad
campaigns, which increases the number of users as well as providing satisfactory results to
customers.
 Crime prevention: Data mining plays a huge role in the prevention of crime and the reduction of
fraud rates. In telecommunication industries, it helps in identifying subscription theft and super-imposed
fraud. It also helps in the identification of fraudulent calls. By doing this, user security can be
ensured and the company can be protected from facing a huge loss. Data mining also plays an important
role in police departments, identifying key patterns in crime and predicting crimes. It helps in
identifying unsolved crimes committed by the same criminal by establishing relationships between
previous and present datasets in the crime database. By extracting and aggregating data, the police
department can anticipate future crimes and prevent them, and can also identify the cause of a
crime and the criminal behind it. This application largely supports the safety of people.

Negative effects of data mining on society

 The exploitation of data and discrimination: By agreeing to the terms and conditions provided
by a company, customers give the company access to collect their data. From age group to
economic status, the company profiles its customers. Through customer profiling, it builds
datasets of rich, poor, older, or younger customers. Some unethical or devious companies offer low
credit or inferior deals to customers in areas where a lower sales rate is noted, for example, an
unethical company decreasing the credit scores on the loyalty cards of people connected to a branch
whose transactions are low. Sometimes, while profiling customers, a customer is wrongly accused.
Though he is faultless, his needs and comfort are denied. Even if the company declares the
customer faultless after investigation, the wrongly accused customer still struggles mentally, and
the incident will negatively impact his life. Certain companies also do not take responsibility for
securing customer data, which leaves the data vulnerable and causes privacy breaches.
 Health-related ethical problem: Using data mining techniques, the companies can extract data
about the health problems of the employees. They can also relate the summarized dataset with the
datasets from the past history of previous employees. By discovering the pattern of diseases and
frequencies, the company chooses the specific insurance plans accordingly. But, there is a chance
that the company uses this data while hiring new employees. Hence, they avoid hiring people with a
higher frequency of sickness. Insurance companies collect this data so they can avoid policies with
companies with a high risk of health issues.
 Privacy breach: Every single piece of data we enter into the databases of the internet is indirectly
under the control of data miners. When used for unethical purposes, the privacy of an individual is
invaded. Certain companies use this data to single out latent people with the potential to become
customers of the company. In this way, the company sends targeted advertisements and increases
customer traffic. For example, in telecommunication industries, the call details of customers are
recorded to enhance business growth and to maintain low customer churn. But when the company uses
the data selfishly for its own growth, it leads to the exploitation of privacy. Thus, every single piece
of data given to the network stands at greater risk under the influence of data mining.
 Manipulation of data and unethical problems: There are circumstances in which normal data
provided by a customer or user becomes manipulative data. For instance, when a customer posts a
promotion on social media, it signals that he has a good financial status in his growing business.
Using such information, miners can exploit the data unethically for profit or access. The spreading of
false information and erroneous opinions through social media can mislead people, because when
data miners collect this information it is treated as fact, which leads to scams. Also, when governments
use predictive analytics and machine learning algorithms to predict the outcome of an event, the
prediction may sometimes fail and create a disastrous effect on the public. When a prediction is based
on unverified, unsafe sources, it can lead to severe losses, and the company may fail.
 Invasive marketing: The junk advertisements that pile up on your mobile while using social media
or other social platforms are the result of data mining. Targeted ads benefit both seller and customer
and save time, but when they become intense and unethical, wrong products are pushed through
advertisements that may negatively influence the life of the user. From browser histories to
previously purchased items, data is extracted and used to influence the user to buy other
products, which may sometimes be harmful. This aggressive technique causes undesirable
effects on the user.

Every discovery or field has its own merits and demerits. A part of an application may help humans, and
a part may degrade the values and ethics of society. As part of the society we live in, it is our duty to
use the applications of technology while following the rules and maintaining ethics. Industries,
companies, and marketing agents should respect the privacy of individual humans and should provide the
space they need. When every single person out there takes responsibility for the proper handling of data,
data mining will be a gift of technology that can build and ease our lives in so many ways.

*******************************END OF UNIT I*************************************
