DM - Unit-1 - Fundamentals of Data Mining
UNIT-1
FUNDAMENTALS OF DATA MINING
★HISTORY:
● The origins of data mining can be traced back to the 1950s when the first
computers were developed and used for scientific and mathematical
research. As the capabilities of computers and data storage systems
improved, researchers began to explore the use of computers to analyze and
extract insights from large data sets.
● One of the earliest and most influential pioneers of data mining was Dr.
Herbert Simon, a Nobel laureate in economics and one of the founding
figures of artificial intelligence. In the 1950s and 1960s, Simon and his
colleagues developed a number of algorithms and techniques for extracting
useful information and insights from data, including clustering,
classification, and decision trees.
● In the 1980s and 1990s, the field of data mining continued to evolve, and
new algorithms and techniques were developed to address the challenges of
working with large and complex data sets. The development of data mining
software and platforms, such as SAS, SPSS, and RapidMiner, made it easier
for organizations to apply data mining techniques to their data.
● In recent years, the availability of large data sets and the growth of cloud
computing and big data technologies have made data mining even more
powerful and widely used. Today, data mining is a crucial tool for many
organizations and industries and is used to extract valuable insights and
information from data sets in a wide range of domains.
1. Association
2. Classification
● Decision Tree
● SVM (Support Vector Machine)
● Generalized Linear Models
● Bayesian Classification
● Classification by Backpropagation
● K-NN Classifier
● Rule-Based Classification
● Frequent-Pattern Based Classification
● Rough Set Theory
● Fuzzy Logic
➢ Decision Trees: A decision tree is a flowchart-like tree structure, where
each internal node represents a test on an attribute value, each branch denotes an
outcome of the test, and the leaves represent classes or class distributions.
Decision trees can be easily converted into classification rules. Decision
tree induction is a nonparametric approach for building classification
models: it requires no prior assumptions about the probability distributions
of the class and the other attributes. Decision trees, especially smaller
ones, are relatively easy to interpret, and their accuracy is comparable to
that of other classification techniques on many simple data sets. They
provide an expressive representation for learning discrete-valued functions;
however, they do not generalize well to certain types of Boolean problems.
This figure was generated on the Iris data set of the UCI Machine Learning
Repository. Three class labels are present in the data set:
Setosa, Versicolor, and Virginica.
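The short sketch below shows how such a tree can be induced on the Iris data and printed as rules. It assumes Python with scikit-learn; the exact tree depends on the library version and parameters, so it will not match the referenced figure exactly.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Each internal node tests one attribute value; each leaf is a class label.
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print(export_text(clf, feature_names=iris.feature_names))  # rule-like view of the tree
print("Test accuracy:", clf.score(X_test, y_test))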
➢ K-NN Classifier: To classify a new document, its nearest neighbors among
the training documents are examined; if a large enough proportion of them
belong to a particular class, the new document is also assigned to that
class, otherwise not. Moreover, finding the nearest neighbors can be sped up
using traditional indexing methods.
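A minimal k-NN sketch (assuming Python with scikit-learn) of the majority vote just described: a new observation receives the class held by most of its k closest training points.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# If enough of the 5 nearest neighbors share a class, that class wins the vote.
print(knn.predict([[5.1, 3.5, 1.4, 0.2]]))  # -> class 0 (Setosa)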
3. Prediction
4. Clustering
5. Regression
6. Neural Networks
Recent developments include the extraction of rules from trained neural
networks; these advances contribute to the usefulness of neural networks for
classification in data mining. An artificial neural network is an adaptive
system that changes its structure based on the information that flows through
the network during a learning phase. The ANN relies on the principle of
learning by example. There are two classical types of neural networks: the
perceptron and the multilayer perceptron.
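As an illustration of learning by example, here is a minimal perceptron sketch in plain Python/NumPy; the toy AND data set, learning rate, and epoch count are assumptions chosen only for demonstration.

import numpy as np

# Tiny linearly separable toy data (the logical AND function).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)  # weights
b = 0.0          # bias
lr = 0.1         # learning rate

for epoch in range(20):
    for xi, target in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        # Perceptron rule: adjust the structure (weights) only on mistakes.
        w += lr * (target - pred) * xi
        b += lr * (target - pred)

print(w, b)  # learned weights and bias separating the two classes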
7. Outlier Detection
A database may contain data objects that do not comply with the general
behavior or model of the data; these objects are outliers. The analysis of
outlier data is known as outlier mining. An outlier may be detected using
statistical tests that assume a distribution or probability model for the
data, or using distance measures, where objects having only a small fraction
of "close" neighbors in space are considered outliers. Rather than relying
on statistical or distance measures, deviation-based techniques identify
outliers by examining differences in the principal characteristics of
objects in a group.
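A minimal sketch of the statistical approach described above, assuming NumPy and roughly normally distributed data; the 3-standard-deviation cutoff is a common convention, not a fixed rule.

import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(10, 0.5, 200), [55.0]])  # 55.0 is planted as an outlier

# Flag values lying more than 3 standard deviations from the mean.
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 3])  # -> [55.]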
8. Genetic Algorithm
Data mining turns raw data into meaningful information. Data Mining can be
applied to any type of data, e.g. Data Warehouses, Transactional Databases,
Relational Databases, Multimedia Databases, Spatial Databases, Time-series
Databases, and the World Wide Web.
● Data mining provides competitive advantages in the knowledge economy. It
does this by providing the maximum knowledge needed to rapidly make
valuable business decisions despite the enormous amounts of available data.
● Many measurable benefits have been achieved in different application areas
through data mining. Realizing those benefits, however, means confronting
several significant challenges:
1] Data Quality
The quality of data used in data mining is one of the most significant
challenges. The accuracy, completeness, and consistency of the data affect
the accuracy of the results obtained. The data may contain errors, omissions,
duplications, or inconsistencies, which may lead to inaccurate results.
2] Data Complexity
Data complexity refers to the vast amounts of data generated by various
sources, such as sensors, social media, and the internet of things (IoT). The
complexity of the data may make it challenging to process, analyze, and
understand. In addition, the data may be in different formats, making it
challenging to integrate into a single dataset.
To address this challenge, data mining practitioners use advanced techniques
such as clustering, classification, and association rule mining. These
techniques help to identify patterns and relationships in the data, which can
then be used to gain insights and make predictions.
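As a sketch of one of the techniques named above, the following applies k-means clustering (assuming scikit-learn) to synthetic unlabeled points standing in for a complex data set.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic unlabeled data with three hidden groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)  # cluster assignment for every point

print(km.cluster_centers_)  # one center per discovered group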
4] Scalability
Data mining algorithms must be scalable to handle large datasets efficiently.
As the size of the dataset increases, the time and computational resources
required to perform data mining operations also increase. Moreover, the
algorithms must be able to handle streaming data, which is generated
continuously and must be processed in real-time.
To address this challenge, data mining practitioners use distributed
computing frameworks such as Hadoop and Spark. These frameworks
distribute the data and processing across multiple nodes, making it possible
to process large datasets quickly and efficiently.
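A minimal PySpark sketch of this idea: the same aggregation runs unchanged whether the data fits on one machine or is partitioned across a cluster. The file name sales.csv and its region column are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scalable-mining").getOrCreate()

# Spark splits the file into partitions and reads them in parallel.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# The group-by is executed across the worker nodes, then combined.
df.groupBy("region").count().show()

spark.stop()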
5] Interpretability
Data mining algorithms can produce complex models that are difficult to
interpret. This is because the algorithms use a combination of statistical and
mathematical techniques to identify patterns and relationships in the data.
Moreover, the models may not be intuitive, making it challenging to
understand how the model arrived at a particular conclusion.
To address this challenge, data mining practitioners use visualization
techniques to represent the data and the models visually. Visualization makes
it easier to understand the patterns and relationships in the data and to
identify the most important variables.
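One simple visualization of this kind, assuming matplotlib and scikit-learn: plot which input variables a fitted tree model relied on most, which is usually easier to read than the model itself.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# A bar chart of feature importances highlights the most influential variables.
plt.barh(iris.feature_names, clf.feature_importances_)
plt.xlabel("importance")
plt.tight_layout()
plt.show()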
6] Ethics
Data mining raises ethical concerns related to the collection, use, and
dissemination of data. The data may be used to discriminate against certain
groups, violate privacy rights, or perpetuate existing biases. Moreover, data
mining algorithms may not be transparent, making it challenging to detect
biases or discrimination.
Several key trends are shaping the future of data mining, with potential
implications for businesses and society:
2. Federated Learning
Federated learning trains models across decentralized devices or data silos
without moving the raw data to a central server.
3. Augmented Analytics
Augmented analytics uses machine learning to automate data preparation,
insight discovery, and explanation for analysts.
4. Explainable AI
Explainable AI covers techniques that make a model's predictions transparent
and easier to audit.
5. Edge Computing
Edge computing involves processing data closer to its source, such as IoT
devices or sensors, rather than relying solely on centralized cloud
infrastructure. In the future, edge computing will enable organizations to
perform data mining tasks directly on the edge devices, enabling faster
insights and decision-making in scenarios where real-time processing is
critical.
➢ Performance Issues
● Mining Social Media Data: This is an interesting research area, because
data from platforms such as Twitter and Facebook can be analyzed to derive
interesting trends and patterns.
● Mining Spatiotemporal Data: Data related to both space and time is
spatiotemporal data. Spatiotemporal data mining retrieves interesting
patterns and knowledge from such data; it helps, for example, to estimate
the value of land, date rocks and precious stones, and predict weather
patterns. Spatiotemporal data mining has many practical applications, such
as GPS in mobile phones, timers, Internet-based map services, weather
services, satellites, RFID tags, and sensors.
● Mining Data Streams: Stream data changes dynamically; it is noisy and
inconsistent, and it contains multidimensional features of different data
types, so it is typically stored in NoSQL database systems. The volume of
stream data is very high, which is the main challenge for effective stream
mining. Mining data streams involves tasks such as clustering, outlier
analysis, and the online detection of rare events, as in the sketch below.
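A minimal stream-mining sketch of online rare-event detection: Welford's algorithm keeps a running mean and variance in constant memory, so each arriving value can be checked without storing the stream. The 4-standard-deviation threshold and warm-up length are assumptions.

import math

n, mean, m2 = 0, 0.0, 0.0

def observe(x):
    """Update running statistics with one stream element; return True if x looks rare."""
    global n, mean, m2
    rare = False
    if n > 10:  # wait for a short warm-up before judging
        std = math.sqrt(m2 / (n - 1))
        rare = std > 0 and abs(x - mean) > 4 * std
    n += 1
    delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)  # Welford's update for the running variance
    return rare

stream = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 10.1, 9.9, 10.0, 10.2, 99.0]
print([x for x in stream if observe(x)])  # -> [99.0]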
2. Transactional Databases
● A transactional database is a collection of data organized by time
stamps, dates, etc., to represent transactions in databases.
● This type of database can roll back or undo an operation when a
transaction is not completed or committed, as sketched after this list.
● It is a highly flexible system where users can modify information without
changing any sensitive information.
● It follows the ACID properties of DBMS.
● Application: Banking, Distributed systems, Object databases, etc.
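A minimal roll-back sketch using Python's built-in sqlite3 module; the accounts table and the simulated failure are illustrative. If anything fails mid-transfer, rollback() undoes the partial work, so the database never exposes a half-completed transaction.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    raise RuntimeError("simulated failure before the second update")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()
except Exception:
    conn.rollback()  # the partial debit is undone

print(conn.execute("SELECT * FROM accounts").fetchall())
# -> [('alice', 100.0), ('bob', 50.0)]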
3. Multimedia Databases
● Multimedia databases consist of audio, video, image, and text media.
● They can be stored in object-oriented databases.
● They are used to store complex information in pre-specified formats.
● Application: Digital libraries, video-on-demand, news-on-demand,
musical databases, etc.
4. Spatial Database
● Store geographical information.
● Store data in the form of coordinates, topology, lines, polygons, etc.
● Application: Maps, Global positioning, etc.
5. Time-series Databases
● Time-series databases contain stock exchange data and user-logged
activities.
● They handle arrays of numbers indexed by time, date, etc.
● They require real-time analysis.
● Examples: eXtremeDB, Graphite, InfluxDB, etc.
6. WWW
● The WWW (World Wide Web) is a collection of documents and resources such
as audio, video, and text, identified by Uniform Resource Locators (URLs),
linked through HTML pages, and accessible via the Internet.
● It is the most heterogeneous repository, as it collects data from
multiple sources.
● It is dynamic in nature, as the volume of data is continuously increasing
and changing.
● Application: Online shopping, Job search, Research, studying, etc.
★DATA WAREHOUSES:
A data warehouse is a subject-oriented, integrated, time-variant, and
non-volatile collection of data. Non-volatile means that once data is in the
data warehouse, it will not change; historical data in a data warehouse
should never be altered.
● The top-down approach starts with the overall design and planning. It
is useful in cases where the technology is mature and well known, and
where the business problems that must be solved are clear and well
understood.
● The bottom-up approach starts with experiments and prototypes. This
is useful in the early stage of business modeling and technology
development. It allows an organization to move forward at
considerably less expense and to evaluate the benefits of the
technology before making significant commitments.
● In the combined approach, an organization can exploit the planned
and strategic nature of the top-down approach while retaining the
rapid implementation and opportunistic application of the bottom-up
approach.
The warehouse design process consists of the following steps:
● Choose a business process to model, for example, orders, invoices,
shipments, inventory, account administration, sales, or the general
ledger. If the business process is organizational and involves multiple
complex object collections, a data warehouse model should be
followed. However, if the process is departmental and focuses on the
analysis of one kind of business process, a data mart model should be
chosen.
Tier-1:
The bottom tier is a warehouse database server, which is almost always a
relational database system. This tier also contains a metadata repository,
which stores information about the data warehouse and its contents.
Tier-2:
The middle tier is an OLAP server that is typically implemented using either
a relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
Tier-3:
The top tier is a front-end client layer, which contains query and reporting
tools, analysis tools, and/or data mining tools (e.g., trend analysis,
prediction, and so on).
1. Enterprise warehouse: collects all of the information about subjects
spanning the entire organization.
2. Data mart: contains a subset of corporate-wide data that is of value to a
specific group of users.
3. Virtual warehouse: a set of views over operational databases, where only
selected summary views may be materialized.
➢Metadata Repository:
Metadata are data about data. When used in a data warehouse, metadata are
the data that define warehouse objects. Metadata are created for the data
names and definitions of the given warehouse. Additional metadata are
created and captured for time-stamping any extracted data, recording the
source of the extracted data, and noting missing fields that have been added
by data cleaning or integration processes.