Unit 3: BI & Data Science
Data Mining: The origins of Data Mining- Data Mining Tasks- OLAP and
Multidimensional-Data Analysis-Basic Concept of Association Analysis and Cluster
Analysis.
Machine Learning: History and Evolution – AI Evolution- Statistics vs. Data Mining vs.
Data Analytics vs. Data Science – Supervised Learning- Unsupervised Learning-
Reinforcement Learning- Frameworks for Building Machine Learning Systems.
Data Mining:
The origins of Data Mining
Data mining is a discipline with a long history. It began with early methods such as
Bayes' theorem (1700s) and regression analysis (1800s), which were used mostly to
identify patterns in data.
Data mining is the process of analyzing large data sets (Big Data) from different
perspectives and uncovering correlations and patterns to summarize them into useful
information.
Nowadays it is blended with many techniques such as artificial intelligence, statistics,
data science, database theory and machine learning.
The increasing power of technology and the growing complexity of data sets have led
data mining to evolve from static data delivery to more dynamic and proactive
information delivery, and from tapes and disks to advanced algorithms and massive
databases.
In the late 1980s, the term data mining began to be known and used within the research
community by statisticians, data analysts, and the management information systems
(MIS) community.
By the early 1990s, data mining was recognized as a sub-process, or step, within a
larger process called Knowledge Discovery in Databases (KDD). The most commonly
used definition of KDD is "the nontrivial process of identifying valid, novel,
potentially useful, and ultimately understandable patterns in data" (Fayyad, 1996).
Data mining, also known as knowledge discovery in data (KDD), is the process of
uncovering patterns and other valuable information from large data sets. Given the evolution
of data warehousing technology and the growth of big data, adoption of data mining
techniques has rapidly accelerated over the last couple of decades, assisting companies by
transforming their raw data into useful knowledge.
Data mining functionalities describe the kinds of patterns to be identified in data mining
activities; they are used to specify the types of patterns to be discovered. Data mining is
widely applied for forecasting and for characterizing data in big data settings.
Data mining tasks are broadly categorized into two groups: descriptive and predictive.
1) Class/Concept Description
Data is associated with classes or concepts so that those classes or concepts can be described
in summarized, accurate, yet concise terms (through characterization and discrimination).
2) Prediction
Prediction uses regression analysis to detect inaccessible data and estimate missing numeric
values in the data; when the class label is absent, classification is used to make the prediction.
Prediction is common because of its relevance in business intelligence. There are thus two
ways of predicting data: predicting the class label using a previously built classification
model, and predicting missing or unavailable numeric data using regression analysis.
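As a hedged illustration of the regression side of prediction, the sketch below fits a linear regression on invented employee data (years of experience vs. salary; scikit-learn assumed) and uses it to estimate a missing numeric value:

```python
# Minimal sketch: predicting a missing numeric value with regression analysis.
# The experience/salary numbers are invented for illustration.
from sklearn.linear_model import LinearRegression

# Known examples: years of experience -> salary (in thousands)
X_train = [[1], [3], [5], [7], [9]]
y_train = [30, 45, 60, 75, 90]

model = LinearRegression().fit(X_train, y_train)

# Estimate the missing salary for an employee with 6 years of experience
print(model.predict([[6]]))  # about 67.5
```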
3) Classification
Classification is used to build models of predefined classes; the model is then used to
classify new instances whose class is not known. The instances used to build the model
are known as training data. Such a classification process produces a decision tree or a
set of classification rules, which can be used to classify future data, for example
classifying the likely compensation of an employee based on the classified salaries of
related employees in the company.
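A minimal sketch of this process, assuming scikit-learn and an invented training set, builds a decision tree from labelled examples and classifies a new instance:

```python
# Sketch: train a decision-tree classifier on labelled (training) data,
# then classify a new instance whose class is not known.
# Features ([years_experience, education_level]) and labels are invented.
from sklearn.tree import DecisionTreeClassifier

X_train = [[2, 1], [8, 3], [4, 1], [10, 3], [1, 1], [7, 2]]
y_train = ["low", "high", "medium", "high", "low", "high"]

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Predict the salary band of a new employee
print(clf.predict([[5, 2]]))
```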
4) Association Analysis
Association analysis discovers the links between data items and the rules that bind them,
associating two or more data attributes. It relates attributes that are frequently transacted
together, deriving what are called association rules, which are commonly used in market
basket analysis. Two measures link the attributes: confidence, which indicates the
probability of the associated items occurring together, and support, which reports how
frequently the association has occurred before.
5) Outlier Analysis
Data elements that cannot be grouped into a given class or cluster are outliers. They are
often referred to as anomalies or surprises, and they are important to note. Although in
some contexts outliers are treated as noise and discarded, in other domains they can
disclose useful information, so their study can be very important and beneficial.
6) Cluster Analysis
Clustering is the arrangement of data in groups. Unlike classification, however, class labels
are undefined in clustering and it is up to the clustering algorithm to find suitable classes.
Clustering is often called unsupervised classification, since the classification is not
driven by provided class labels. Most clustering methods are based on the principle of
maximizing the similarity between objects of the same class (intra-class similarity) and
minimizing the similarity between objects of different classes (inter-class similarity).
7) Evolution Analysis
Evolution analysis uncovers patterns and shifts in behaviour over time; with such
analysis we can find features such as time-series trends, periodicity, and similarities in
patterns. Techniques of this kind are applied across many fields, from space science to
retail marketing.
OLAP and Multidimensional Data Analysis
To facilitate this kind of analysis, data is collected from multiple data sources, stored
in data warehouses, and then cleansed and organized into data cubes.
Dimensions are then populated by members (such as customer names, countries, and
months) that are organized hierarchically. OLAP (online analytical processing) cubes
are often pre-summarized across dimensions to drastically improve query time over
relational databases.
1. Relational OLAP
ROLAP servers are placed between the relational back-end server and client front-end tools.
To store and manage warehouse data, ROLAP uses a relational or extended-relational DBMS.
ROLAP includes the following:
- Implementation of aggregation navigation logic
- Optimization for each DBMS back end
- Additional tools and services
2. Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines to provide multidimensional
views of data. With multidimensional data stores, storage utilization may be low if the data
set is sparse. Therefore, many MOLAP servers use two levels of data storage representation
to handle dense and sparse data sets.
3. Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability
of ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large
volumes of detailed information, while the aggregations are stored separately in a MOLAP
store.
4. Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support for
SQL queries over star and snowflake schemas in a read-only environment.
OLAP Operations
Since OLAP servers are based on a multidimensional view of data, we discuss the OLAP
operations on multidimensional data below.
Here is the list of OLAP operations −
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up. This aggregates data by climbing up a concept hierarchy for a dimension
(e.g. from month to year) or by reducing the number of dimensions.
Drill-down. The reverse of roll-up: it navigates from less detailed data to more
detailed data.
Slice. This enables an analyst to take one level of information for display, such as
"sales in 2017."
Dice. This allows an analyst to select data from multiple dimensions to analyze,
such as "sales of blue beach balls in Iowa in 2017."
Pivot. Analysts can gain a new view of the data by rotating the data axes of the cube.
OLAP software then locates the intersection of dimensions, such as all products sold in the
Eastern region above a certain price during a certain time period, and displays them. The
result is the "measure"; each OLAP cube has at least one to perhaps hundreds of measures,
which are derived from information stored in fact tables in the data warehouse.
OLAP begins with data accumulated from multiple sources and stored in a data warehouse.
The data is then cleansed and stored in OLAP cubes, which users run queries against.
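The sketch below emulates these ideas on a toy fact table with pandas. Real OLAP servers query pre-summarized cubes; this only illustrates the operations, and the data and column names are invented:

```python
# Emulating roll-up, slice, and dice on a toy fact table with pandas.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2017, 2017, 2017, 2018],
    "state":   ["Iowa", "Iowa", "Ohio", "Iowa"],
    "product": ["beach ball", "kite", "beach ball", "beach ball"],
    "amount":  [120, 80, 200, 150],
})

# Roll-up: aggregate the measure up to the (year, product) level
cube = sales.pivot_table(values="amount", index="year",
                         columns="product", aggfunc="sum")
print(cube)

# Slice: one level of one dimension, e.g. "sales in 2017"
print(sales[sales["year"] == 2017]["amount"].sum())

# Dice: select on several dimensions, e.g. beach balls in Iowa in 2017
dice = sales[(sales["year"] == 2017) & (sales["state"] == "Iowa")
             & (sales["product"] == "beach ball")]
print(dice["amount"].sum())
```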
Association Rule Mining
Association analysis is useful for discovering interesting relationships hidden in large data
sets. The uncovered relationships can be represented in the form of association rules or sets
of frequent items.
Given a set of transactions, the goal is to find rules that will predict the occurrence of an
item based on the occurrences of other items in the transaction.
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Implication means co-occurrence, not causality!
Example of Association Rules
{Beer} → {Diaper}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
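Support and confidence can be computed directly from the five transactions above. The plain-Python sketch below does this for one illustrative rule, {Milk, Bread} → {Diaper} (the rule is chosen arbitrarily here):

```python
# Support and confidence for the rule X -> Y over the transactions above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

X, Y = {"Milk", "Bread"}, {"Diaper"}
n = len(transactions)
count_X  = sum(1 for t in transactions if X <= t)        # X occurs
count_XY = sum(1 for t in transactions if (X | Y) <= t)  # X and Y occur

support    = count_XY / n        # how often X and Y appear together overall
confidence = count_XY / count_X  # how often Y appears given X appears

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# support = 0.40, confidence = 0.67
```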
Cluster Analysis is the process of finding similar groups of objects in order to form clusters.
It is an unsupervised machine-learning-based approach that acts on unlabelled data. Data
points are grouped together to form a cluster in which all the objects belong to the same
group.
Cluster:
The given data is divided into different groups by combining similar objects into groups;
each such group is a cluster, i.e., a collection of similar data grouped together.
For example, consider a dataset of vehicles containing information about different vehicles
such as cars, buses, and bicycles. As this is unsupervised learning, there are no class labels
like Cars or Bikes for the vehicles; all the data is combined and unstructured.
Our task is to convert this unlabelled data into labelled data, and that can be done using
clusters.
The main idea of cluster analysis is to arrange all the data points into clusters, such as a
cars cluster containing all the cars and a bikes cluster containing all the bikes.
Put simply, it is a partitioning of similar objects applied to unlabelled data.
Properties of Clustering:
In the partitioning method, there is one technique called iterative relocation, which means
the object will be moved from one group to another to improve the partitioning.
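As a minimal sketch of clustering in practice (scikit-learn assumed; the 2-D points are invented), k-means below groups unlabelled points purely by distance, and the resulting cluster indices can then be turned into labels:

```python
# k-means clustering on unlabelled 2-D points: no class labels are given;
# the algorithm groups points by similarity (distance to cluster centroids).
from sklearn.cluster import KMeans

points = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],   # one natural group
          [8.0, 8.2], [7.9, 8.0], [8.3, 7.8]]   # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # centroid of each cluster
```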
Machine Learning:
Machine learning (ML) is the study of computer algorithms that can improve
automatically through experience and by the use of data. It is seen as a part
of artificial intelligence. Machine learning algorithms build a model based on sample
data, known as training data, in order to make predictions or decisions without being
explicitly programmed to do so. Machine learning algorithms are used in a wide
variety of applications, such as in medicine, email filtering, speech recognition,
and computer vision, where it is difficult or unfeasible to develop conventional
algorithms to perform the needed tasks.
A subset of machine learning is closely related to computational statistics, which
focuses on making predictions using computers; but not all machine learning is
statistical learning. The study of mathematical optimization delivers methods, theory
and application domains to the field of machine learning. Data mining is a related
field of study, focusing on exploratory data analysis through unsupervised learning.
Some implementations of machine learning use data and neural networks in a way
that mimics the working of a biological brain. In its application across business
problems, machine learning is also referred to as predictive analytics.
Machine learning systems are often described in terms of three components:
1. A Decision Process: Based on some input data, which can be labelled or unlabelled,
the algorithm produces an estimate about a pattern in the data, such as a prediction or
a classification.
2. An Error Function: An error function evaluates the prediction of the model; if there
are known examples, it compares them with the model's estimates to assess accuracy.
3. A Model Optimization Process: If the model can fit better to the data points in the
training set, then weights are adjusted to reduce the discrepancy between the known
example and the model estimate. The algorithm repeats this evaluate-and-optimize
process, updating weights autonomously, until a threshold of accuracy has been met
(a sketch of this loop follows this list).
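A hedged sketch of that loop, assuming a one-weight model y = w·x, plain gradient descent, and invented data and threshold:

```python
# Evaluate-and-optimize loop: adjust the weight to reduce the discrepancy
# between known examples and the model estimate, until the error (mean
# squared) falls below a chosen accuracy threshold.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.0]   # roughly y = 2x

w, lr = 0.0, 0.01           # initial weight and learning rate
error = float("inf")
for step in range(10_000):
    error = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    if error < 0.02:        # threshold of accuracy has been met
        break
    # gradient of the squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad          # weight update

print(f"learned weight: {w:.3f}, final error: {error:.4f}")
```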
1. Supervised learning
Supervised learning, also known as supervised machine learning, is defined by its use of
labelled datasets to train algorithms to classify data or predict outcomes accurately. As
input data is fed into the model, the model adjusts its weights until it has been fitted
appropriately. This occurs as part of the cross-validation process, which ensures that the
model avoids overfitting or underfitting. Supervised learning helps organizations solve a
variety of real-world problems at scale, such as classifying spam into a separate folder from
your inbox. Methods used in supervised learning include neural networks, naïve Bayes,
linear regression, logistic regression, random forest, support vector machines (SVM), and
more.
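The spam example above might look like the following hedged sketch (scikit-learn assumed; the tiny message corpus and labels are invented):

```python
# Supervised learning on a labelled dataset: a naive Bayes spam classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = ["win a free prize now", "meeting at noon tomorrow",
            "free cash offer", "lunch with the team", "claim your free prize"]
labels = ["spam", "ham", "spam", "ham", "spam"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(messages, labels)   # train on the labelled examples

print(clf.predict(["free prize offer", "team meeting tomorrow"]))
```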
2. Unsupervised learning
Unsupervised learning, also known as unsupervised machine learning, uses machine learning
algorithms to analyze and cluster unlabelled datasets. These algorithms discover hidden
patterns or data groupings without the need for human intervention. Its ability to discover
similarities and differences in information makes it the ideal solution for exploratory data
analysis, cross-selling strategies, customer segmentation, and image and pattern recognition.
It is also used to reduce the number of features in a model through the process of
dimensionality reduction; principal component analysis (PCA) and singular value
decomposition (SVD) are two common approaches for this, as sketched after this paragraph.
Other algorithms used in unsupervised learning include neural networks, k-means clustering,
probabilistic clustering methods, and more.
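A minimal sketch of dimensionality reduction with PCA (scikit-learn assumed; the 3-feature data points are invented), reducing three features to two components:

```python
# PCA: reduce the number of features while keeping most of the variance.
from sklearn.decomposition import PCA

X = [[2.5, 2.4, 0.5], [0.5, 0.7, 1.9], [2.2, 2.9, 0.4],
     [1.9, 2.2, 0.6], [3.1, 3.0, 0.3], [0.6, 0.9, 2.1]]

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (6, 2): same rows, fewer features
print(pca.explained_variance_ratio_)  # variance captured per component
```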
3. Semi-supervised learning
Semi-supervised learning offers a middle ground between supervised and unsupervised
learning. During training, it uses a smaller labelled data set to guide classification and
feature extraction from a larger, unlabelled data set, which helps when there is not enough
labelled data to train a supervised learning algorithm.
Frameworks for Building Machine Learning Systems
Machine learning is a subset of artificial intelligence in which computer algorithms are used
to learn autonomously from data and information. In machine learning, computers don't have
to be explicitly programmed; they can change and improve their algorithms by themselves.
TensorFlow
TensorFlow is an open-source machine learning framework developed by the Google Brain
team. It supports deep learning and large-scale numerical computation, and models built with
it can run on CPUs and GPUs across desktops, servers, and mobile devices.
Theano
Theano is tightly integrated with Keras, a high-level neural networks library that runs
almost in parallel with the Theano library. Keras' main advantage is that it is a minimalist
Python library for deep learning that can run on top of Theano or TensorFlow.
It was created to make implementing deep learning models as fast and simple as possible
for research and development. Released under the permissive MIT license, it runs on Python
2.7 or 3.5 and can seamlessly execute on GPUs and CPUs given the underlying frameworks.
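As a hedged illustration of how minimalist Keras code is, the sketch below builds and trains a tiny network using the TensorFlow backend (tensorflow.keras); the layer sizes and toy data are invented:

```python
# A small fully connected network in Keras, trained on toy binary data.
import numpy as np
from tensorflow import keras

X = np.random.rand(100, 8)             # 100 samples, 8 features
y = (X.sum(axis=1) > 4).astype(int)    # invented binary target

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)

print(model.predict(X[:3]))            # probabilities for three samples
```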
Scikit-learn
Scikit-learn is one of the most well-known ML libraries. It is preferable for supervised and
unsupervised learning algorithms; examples include linear and logistic regression, decision
trees, clustering, k-means, and so on.
This framework provides a wide range of algorithms for common machine learning and data
mining tasks, including clustering, regression, and classification.
Caffe
Caffe is another popular deep learning framework made with expression, speed, and
modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC)
and by community contributors.
H2O
H2O is an open-source, distributed, in-memory machine learning platform that supports
widely used statistical and machine learning algorithms such as gradient boosted machines,
generalized linear models, and deep learning.
Amazon Machine Learning
Amazon Machine Learning provides visualization tools that help you go through the process
of creating machine learning (ML) models without having to learn complex ML
algorithms and technology.
It is a service that makes it easy for developers of all skill levels to use machine learning
technology. It connects to data stored in Amazon S3, Redshift, or RDS, and can run binary
classification, multiclass categorization, or regression on that data to build a model.
Torch
This framework provides wide support for machine learning algorithms, with GPUs put
first. It is easy to use and efficient thanks to its simple and fast scripting language, LuaJIT,
and an underlying C/CUDA implementation.
The goal of Torch is to provide maximum flexibility and speed in building scientific
algorithms along with an extremely simple process.
Google Cloud ML Engine
This framework offers training and prediction services that can be used together or
individually. It is used by enterprises to solve problems such as ensuring food safety,
detecting clouds in satellite images, and responding four times faster to customer emails.
Azure ML Studio
This Framework allows Microsoft Azure users to create and train models, then turn them into
APIs that can be consumed by other services. Also, you can connect your own Azure storage
to the service for larger models.
To use the Azure ML Studio, you don’t even need an account to try out the service. You can
log in anonymously and use Azure ML Studio for up to eight hours.
Spark ML Lib
This is Apache Spark’s machine learning library. The goal of this framework is to make
practical machine learning scalable and easy.
AI Evolution
Although artificial intelligence has been present as an idea for millennia, its real potential
was not investigated until the 1950s. A generation of scientists, mathematicians, and
intellectuals had the idea of AI, but it wasn't until Alan Turing, a British polymath,
suggested that since people solve problems and make decisions using available information
and reason, machines could do the same.
The limitations of computers were the major stumbling block to expansion: they needed to
change fundamentally before AI could grow any further. Machines could execute commands
but could not store them. Until 1974, financing was also a problem.
By 1974, computers had become extremely popular. They were now quicker, less expensive,
and capable of storing more data.
AI Research Today
AI research is ongoing and expanding in today’s world. AI research has grown at a pace of
12.9 percent annually over the last five years, as per Alice Bonasio, a technology journalist.
China is expected to overtake the United States as the world's leading source of AI
technology in the next four years, having already moved past it into second place in 2004,
and is rapidly closing in on Europe's top rank.
In the area of artificial intelligence development, Europe is the largest and most diverse
continent, with significant levels of international collaboration. India is the 3rd largest
country in AI research output, behind China and the USA.
AI in the Present
Artificial intelligence is being utilized for so many things and has so much promise that it's
difficult to imagine our future without it, particularly in business.
Artificial intelligence technologies are boosting productivity like never before, from
workflow management solutions to trend forecasting and even the way companies buy
advertising.
Artificial Intelligence can gather and organize vast volumes of data in order to draw
inferences and estimates that are outside of the human ability to comprehend manually. It
also improves organizational efficiency while lowering the risk of mistakes, and it identifies
unusual patterns, such as spam and fraud, instantaneously to alert organizations about
suspicious behaviour, among other things. AI has grown in importance and sophistication to
the point that a Japanese investment firm became the first to propose an AI Board Member
for its ability to forecast market trends faster than humans.
Artificial intelligence will indeed be and is already being used in many aspects of life, such as
self-driving cars in the coming years, more precise weather forecasting, and earlier health
diagnoses, to mention a few.
AI in The Future
It has been suggested that we are on the verge of the 4th Industrial Revolution, which will be
unlike any of the previous three. From steam and water power, through electricity and
assembly lines, to computerization, and now to a point where the very question of what it is
to be human is being challenged.
Smarter technology in our factories and workplaces, as well as linked equipment that will
communicate, view the entire production process, and make autonomous choices, are just a
few of the methods the Industrial Revolution will lead to business improvements. One of the
most significant benefits of the 4th Industrial Revolution is the ability to improve the world’s
populace’s quality of life and increase income levels. As robots, humans, and smart devices
work on improving supply chains and warehousing, our businesses and organizations are
becoming “smarter” and more productive.
AI in Different Industries
Artificial intelligence (AI) may help you enhance the value of your company in a variety of
ways. It may help you optimize your operations, increase total revenue, and focus your staff
on more essential duties if applied correctly. As a result, AI is being utilized in a variety of
industries throughout the world, including health care, finance, manufacturing, and others.
Health Care
AI is proving to be a boon to the healthcare business. It is enhancing nearly every area of the
industry, from data security to robot-assisted operations, finally giving this sector, long
harmed by inefficient procedures and rising prices, a much-needed facelift.
Automotive
Self-driving vehicles are certainly something you’ve heard of, and they’re a hint that the
future is almost here. It’s no longer science fiction; the autonomous car is already a reality.
As per recent projections, by 2040, roughly 33 million automobiles with self-driving
capability are projected to be on the road.
Finance
According to experts, the banking industry and AI are a perfect combination. Real-time data
transmission, accuracy, and large-scale data processing are the most important elements
driving the financial sector. Because AI is ideal for these tasks, the banking industry is
recognizing its effectiveness and precision and incorporating machine learning, statistical
arbitrage, adaptive cognition, chatbots, and automation into its business operations.
E-Commerce
Have you ever come upon a picture of clothing that you were hunting for on one website but
couldn’t find on another? Well, that is done by AI. It’s due to the machine learning
techniques that businesses employ to develop strong client connections. These technologies
not only personalize customers’ experiences but also assist businesses in increasing sales.
Conclusion
In the early twenty-first century, no place has had a larger influence on AI than the
workplace. Machine-learning techniques are resulting in productivity gains that have never
been observed before. AI is transforming the way we do business, from workflow
management solutions to trend forecasts and even the way businesses buy advertising. AI
research has so much promise that it's becoming difficult to envisage a world without it. Be
it self-driving vehicles, more precise weather predictions, or space travel, AI will be
prevalent in everyday life by 2030.
Data Mining vs. Statistics
- Data mining is the process of extracting useful information, patterns, and trends from huge
data sets and utilizing them to make data-driven decisions; statistics refers to the analysis
and presentation of numeric data and is a major part of all data mining algorithms.
- The data used in data mining can be numeric or non-numeric; the data used in statistics is
numeric only.
- In data mining, data collection is less important; in statistics, data collection is more
important.
- The types of data mining are clustering, classification, association, neural networks,
sequence-based analysis, visualization, etc.; the types of statistics are descriptive statistics
and inferential statistics.
- Data mining is suitable for huge data sets; statistics is suitable for smaller data sets.
- Data mining is an inductive process, meaning the generation of new theory from data;
statistics is a deductive process and does not indulge in making predictions.
- Data cleaning is a part of data mining; in statistics, clean data is used to implement the
statistical method.
- Data mining requires less user interaction to validate the model, so it is easy to automate;
statistics requires user interaction to validate the model, so it is complex to automate.
- Data mining applications include financial data analysis, the retail industry, the
telecommunication industry, biological data analysis, certain scientific applications, etc.;
applications of statistics include biostatistics, quality control, demography, operational
research, etc.