DA Merge Notes (30-09-24)
Course Objectives:
To explore the fundamental concepts of data analytics.
To learn the principles and methods of statistical analysis.
To discover interesting patterns, analyze supervised and unsupervised models and estimate the accuracy of the algorithms.
To understand the various search methods and visualization techniques.
UNIT - I
Data Management: Design Data Architecture and manage the data for analysis, understand various
sources of Data like Sensors/Signals/GPS etc. Data Management, Data Quality (noise, outliers, missing
values, duplicate data) and Data Pre-processing & Processing.
UNIT - II
Data Analytics: Introduction to Analytics, Introduction to Tools and Environment, Application of
Modeling in Business, Databases & Types of Data and variables, Data Modeling Techniques, Missing
Imputations etc., Need for Business Modeling.
UNIT - III
Regression – Concepts, Blue property assumptions, Least Square Estimation, Variable Rationalization,
and Model Building etc.
Logistic Regression: Model Theory, Model fit Statistics, Model Construction, Analytics applications
to various Business Domains etc.
UNIT - IV
Object Segmentation: Regression Vs Segmentation – Supervised and Unsupervised Learning, Tree
Building – Regression, Classification, Overfitting, Pruning and Complexity, Multiple Decision Trees
etc. Time Series Methods: ARIMA, Measures of Forecast Accuracy, STL approach, Extract features
from generated model such as Height, Average Energy etc. and analyze for prediction.
UNIT - V
Data Visualization: Pixel-Oriented Visualization Techniques, Geometric Projection Visualization
Techniques, Icon-Based Visualization Techniques, Hierarchical Visualization Techniques,
TEXT BOOKS:
1. Student’s Handbook for Associate Analytics – II, III.
2. Data Mining Concepts and Techniques, Han, Kamber, 3rd Edition, Morgan Kaufmann Publishers.
REFERENCE BOOKS:
1. Introduction to Data Mining, Tan, Steinbach and Kumar, Addison Wesley, 2006.
2. Data Mining Analysis and Concepts, M. Zaki and W. Meira
3. Mining of Massive Datasets, Jure Leskovec (Stanford Univ.), Anand Rajaraman (Milliway Labs), Jeffrey D. Ullman (Stanford Univ.).
Data Analytics
Big data is a field that treats ways to analyze, systematically extract information from, or
otherwise deal with data sets that are too large or complex to be dealt with by traditional
data-processing application software.
Volume
Variety
Velocity
Veracity
The volume of data refers to the size of the data sets that need to be analyzed and processed,
which are now frequently larger than terabytes and petabytes. The sheer volume of the data
requires distinct and different processing technologies than traditional storage and processing
capabilities. In other words, this means that the data sets in Big Data are too large to process
with a regular laptop or desktop processor. An example of a high-volume data set would be
all credit card transactions on a day within Europe.
Velocity refers to the speed with which data is generated. High velocity data is generated
with such a pace that it requires distinct (distributed) processing techniques. An example of
data that is generated with high velocity would be Twitter messages or Facebook posts.
Variety makes Big Data really big. Big Data comes from a great variety of sources and
generally is one out of three types: structured, semi structured and unstructured data. The
variety in data types frequently requires distinct processing capabilities and specialist
algorithms. An example of high variety data sets would be the CCTV audio and video files
that are generated at various locations in a city.
Veracity refers to the quality of the data that is being analyzed. High veracity data has many
records that are valuable to analyze and that contribute in a meaningful way to the overall
results. Low veracity data, on the other hand, contains a high percentage of meaningless data.
The non-valuable data in these data sets is referred to as noise. An example of a high veracity data
set would be data from a medical experiment or trial.
Data that is high volume, high velocity and high variety must be processed with advanced
tools (analytics and algorithms) to reveal meaningful information. Because of these
characteristics of the data, the knowledge domain that deals with the storage, processing, and
analysis of these data sets has been labeled Big Data.
FORMS OF DATA
– STRUCTURED FORM
– UNSTRUCTURED FORM
• Any form of data that has a predefined structure (for example, rows and columns in a
relational database table) is represented as the structured form of data.
• Any form of data that does not have a predefined structure is represented as the
unstructured form of data. Eg: video, images, comments, posts, and a few
websites such as blogs and Wikipedia.
SOURCES OF DATA
DATA ANALYSIS
Data analysis is a process of inspecting, cleansing, transforming and modeling data with
the goal of discovering useful information, informing conclusions and supporting decision-
making.
DATA ANALYTICS
• Data analytics is the science of analyzing raw data in order to make conclusions about
that information. This information can then be used to optimize processes to increase
the overall efficiency of a business or system.
Types:
– Descriptive analytics: in descriptive statistics the result always leads to a probability
among 'n' options, where each option has an equal chance of occurring.
– Predictive analytics: Eg: healthcare, sports, weather, insurance, social media analysis.
This type of analytics uses past data to make predictions and decisions based on
certain algorithms. In the case of a doctor, the doctor questions the patient about the
past in order to treat the illness through already existing procedures.
Prescriptive analytics works with predictive analytics, which uses data to determine
near-term outcomes. Prescriptive analytics makes use of machine learning to help
businesses decide a course of action based on a computer program's predictions.
Fig 0.1 Relation between Social Media, Data Analysis and Big Data
Social media data are used in a number of domains such as health and political trending
and forecasting, hobbies, e-business, cyber-crime, counter-terrorism, time-evolving opinion
mining, social network analysis, and human-machine interactions.
Finally, summarizing all the above concepts, processing of social media data can be
categorized into 3 parts as shown in figure 0.1. The first part consists of social media
websites, the second part consists of the data analysis part, and the third part consists of the big data
management layer, which schedules the jobs across the cluster.
Predictive analytics means we are trying to find conclusions about the future.
Analysis means we always analyze what has happened in the past.
MACHINE LEARNING
In general, data is passed to a machine learning tool to perform descriptive data analytics
through a set of algorithms built into it. Here both data analytics and data analysis are done by the
tool automatically. Hence we can say that data analysis is a sub-component of data analytics,
and data analytics is a sub-component of the machine learning tool. All these are described in
figure 0.2. The output of this machine learning tool is a model, and from this model
predictive analytics and prescriptive analytics can be performed, because the model feeds
output data back to the machine learning tool. This cycle continues till we get an efficient output.
UNIT - I
1.1 DESIGN DATA ARCHITECTURE AND MANAGE THE DATA FOR ANALYSIS
Data architecture is composed of models, policies, rules or standards that govern which
data is collected, and how it is stored, arranged, integrated, and put to use in data systems
and in organizations. Data is usually one of several architecture domains that form the
pillars of an enterprise architecture or solution architecture.
Various constraints and influences will have an effect on data architecture design. These
include enterprise requirements, technology drivers, economics, business policies and data
processing needs.
• Enterprise requirements
These will generally include such elements as economical and effective system
expansion, acceptable performance levels (especially system access speed), transaction
reliability, and transparent data management. In addition, the conversion of raw data such as
transaction records and image files into more useful information forms through such
features as data warehouses is also a common organizational requirement, since this enables
managerial decision making and other organizational processes. One of the architecture
techniques is the split between managing transaction data and (master) reference data.
Another one is splitting data capture systems from data retrieval systems (as done in a
data warehouse).
• Technology drivers
These are usually suggested by the completed data architecture and database
architecture designs. In addition, some technology drivers will derive from existing
organizational integration frameworks and standards, organizational economics, and
existing site resources (e.g. previously purchased software licensing).
• Economics
These are also important factors that must be considered during the data architecture phase.
It is possible that some solutions, while optimal in principle, may not be potential
candidates due to their cost. External factors such as the business cycle, interest rates,
market conditions, and legal considerations could all have an effect on decisions relevant to
data architecture.
• Business policies
Business policies that also drive data architecture design include internal organizational
policies, rules of regulatory bodies, professional standards, and applicable governmental
laws that can vary by applicable agency. These policies and rules will help describe the
manner in which the enterprise wishes to process its data.
• Data processing needs
These include accurate and reproducible transactions performed in high volumes, data
warehousing for the support of management information systems (and potential data
mining), repetitive periodic reporting, ad hoc reporting, and support of various
organizational initiatives as required (i.e. annual budgets, new product development).
The logical view (user's view) of data analytics represents data in a format that is
meaningful to a user and to the programs that process those data. That is, the logical
view tells the user, in user terms, what is in the database. The logical level consists of data
requirements and process models, which are processed using data modelling techniques to
result in a logical data model.
The physical level is created when we translate the top level design into physical tables in
the database. This model is created by the database architect, software architects, software
developers or the database administrator. The input to this level comes from the logical level, and various
data modelling techniques are used here with input from software developers or the database
administrator. These data modelling techniques are various formats of representation of data
such as relational data model, network model, hierarchical model, object oriented model,
Entity relationship model.
The implementation level contains details about modification and presentation of data through
the use of various data mining tools such as R-Studio, WEKA, Orange, etc. Here each tool
has specific features for how it works and a different way of viewing the same data.
These tools are very helpful to the user, since they are user friendly and do not require much
programming knowledge from the user.
Observation Method:
We need to clearly differentiate our own observations from the observations provided to us by
other people. The range of data storage genres found in archives and collections is suitable
for documenting observations, e.g. audio, visual, textual and digital, including sub-genres
of note taking, audio recording and video recording.
There exist various observation practices, and our role as an observer may vary
according to the research approach. We make observations from either the outsider or insider
point of view in relation to the researched phenomenon and the observation technique can be
structured or unstructured. The degree of the outsider or insider points of view can be seen as
a movable point in a continuum between the extremes of outsider and insider. If you decide
to take the insider point of view, you will be a participant observer in situ and actively
participate in the observed situation or community. The activity of a Participant observer in
situ is called field work. This observation technique has traditionally belonged to the data
collection methods of ethnology and anthropology. If you decide to take the outsider point of
view, you try to distance yourself from your own cultural ties and observe the
researched community as an outsider observer. These details are seen in figure 1.2.
Experimental Designs
There are a number of experimental designs that are used in carrying out an
experiment. However, market researchers have used 4 experimental designs most frequently.
These are –
A completely randomized design (CRD) is one where the treatments are assigned
completely at random so that each experimental unit has the same chance of receiving any
one treatment. For the CRD, any difference among experimental units receiving the same
treatment is considered as experimental error. Hence, CRD is appropriate only for
experiments with homogeneous experimental units, such as laboratory experiments, where
environmental effects are relatively easy to control. For field experiments, where there is
generally large variation among experimental plots in such environmental factors as soil, the
CRD is rarely used. CRD is mainly used in agricultural field.
Step 1. Determine the total number of experimental plots (n) as the product of the number of
treatments (t) and the number of replications (r); that is, n = rt. For our example, n = 5 x 4 =
20. Here, one pot with a single plant in it may be called a plot. In case the number of
replications is not the same for all the treatments, the total number of experimental pots is to
be obtained as the sum of the replications for each treatment, i.e.,
n = Σ rᵢ, where rᵢ is the number of replications of the i-th treatment.
Step 2. Assign a plot number to each experimental plot in any convenient manner; for
example, consecutively from 1 to n.
Step 3. Assign the treatments to the experimental plots randomly using a table of random
numbers.
Example 1: Assume that a farmer wishes to perform an experiment to determine which of
his 3 fertilizers to use on 2800 trees. Assume that the farmer has a farm divided into 3 terraces,
across which those 2800 trees can be divided in the format below.
Solution
Scenario 1
First we divide the 2800 trees into 3 random assignments of almost equal size.
Random Assignment1: 933 trees
Random Assignment2: 933 trees
Random Assignment3: 934 trees
So, for example, to random assignment 1 we can assign fertilizer 1, to random assignment 2
fertilizer 2, and to random assignment 3 fertilizer 3.
Scenario 2
Thus the farmer will be able to analyze and compare the performance of the various fertilizers
on the different terraces.
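A minimal Python sketch of such a completely randomized assignment (the tree IDs, seed and fertilizer names are illustrative, not taken from the original example):

```python
import random

def randomize_crd(n_units=2800,
                  treatments=("Fertilizer1", "Fertilizer2", "Fertilizer3"),
                  seed=42):
    """Completely randomized design: shuffle unit IDs and split them
    almost equally among the treatments."""
    random.seed(seed)                      # reproducible shuffle
    units = list(range(1, n_units + 1))    # hypothetical tree IDs 1..2800
    random.shuffle(units)
    k = len(treatments)
    assignment = {}
    for i, treatment in enumerate(treatments):
        # slicing with step k splits 2800 trees into groups of 934/933/933
        assignment[treatment] = units[i::k]
    return assignment

groups = randomize_crd()
for treatment, trees in groups.items():
    print(treatment, len(trees))
```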
Example 2:
A company wishes to test 4 different types of tyre. The tyre lifetimes, as determined
from their treads, are given, where each tyre has been tried on 6 similar automobiles
assigned at random to the tyres. Determine whether there is a significant difference between
the tyres at the 0.05 level.
Solution:
Null Hypothesis: There is no difference between the tyres in their life time.
We choose a value close to the average of all values in the table and subtract it from each
observation; for example, we may choose 35.
Now by using ANOVA (one way classification) Table, We calculate the F- Ratio.
F-Ratio:
The F-ratio is the ratio of two mean square values. If the null hypothesis is true, you
expect F to have a value close to 1.0 most of the time. A large F-ratio means that the variation
among group means is more than you'd expect to see by chance.
If the value of the F-ratio is close to 1, it suggests that the null hypothesis is true. If the F-ratio is
greater than the critical value, we conclude that the null hypothesis is false.
In this scenario the value of the F-ratio is greater than 1, which indicates some variation
between samples; whether that variation is significant is decided by comparing the F-ratio
with the critical value.
Level of significance = 0.05 (given in the question)
Degrees of freedom = (3, 20)
Critical value = 3.10 (from the 5% F-table)
F-ratio < critical value (i.e. 2.376 < 3.10)
Hence the null hypothesis cannot be rejected: at the 0.05 level there is no significant
difference in lifetime between the tyres.
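Since the original tyre-lifetime table is not reproduced in these notes, the values below are hypothetical placeholders; the sketch only shows how such a one-way ANOVA F-ratio and decision can be computed with SciPy:

```python
from scipy import stats

# Hypothetical lifetimes (thousands of km) for 4 tyre types on 6 cars each;
# the real table from the worked example is not reproduced here.
tyre_a = [33, 38, 36, 40, 31, 35]
tyre_b = [32, 40, 42, 38, 30, 34]
tyre_c = [31, 37, 35, 33, 34, 30]
tyre_d = [29, 34, 32, 30, 33, 31]

f_ratio, p_value = stats.f_oneway(tyre_a, tyre_b, tyre_c, tyre_d)
print(f"F-ratio = {f_ratio:.3f}, p-value = {p_value:.3f}")

# Decision at the 0.05 level: reject the null hypothesis only if p < 0.05
# (equivalently, only if the F-ratio exceeds the critical value for (3, 20) df).
if p_value < 0.05:
    print("Significant difference between tyres")
else:
    print("No significant difference between tyres")
```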
In a randomized block design, the experimenter divides subjects into subgroups called
blocks, such that the variability within blocks is less than the variability between blocks.
Then, subjects within each block are randomly assigned to treatment conditions. Compared to
a completely randomized design, this design reduces variability within treatment conditions
and potential confounding, producing a better estimate of treatment effects.
The table below shows a randomized block design for a hypothetical medical experiment.
             Treatment
Gender     Placebo    Vaccine
Male         250        250
Female       250        250
Subjects are assigned to blocks, based on gender. Then, within each block, subjects are
randomly assigned to treatments (either a placebo or a cold vaccine). For this design, 250
men get the placebo, 250 men get the vaccine, 250 women get the placebo, and 250 women
get the vaccine.
It is known that men and women are physiologically different and react differently to
medication. This design ensures that each treatment condition has an equal proportion of men
and women. As a result, differences between treatment conditions cannot be attributed to
gender. This randomized block design removes gender as a potential source of variability and
as a potential confounding variable.
LSD - Latin Square Design - A Latin square is one of the experimental designs which has a
balanced two-way classification scheme say for example - 4 X 4 arrangement. In this scheme
each letter from A to D occurs only once in each row and also only once in each column. The
balanced arrangement, it may be noted, will not get disturbed if any row is interchanged with
another.
A B C D
B C D A
C D A B
D A B C
The balanced arrangement achieved in a Latin square is its main strength. In this design, the
comparisons among treatments will be free from both differences between rows and
columns. Thus the magnitude of the error will be smaller than in any other design.
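A cyclic Latin square like the 4 x 4 arrangement above can also be generated programmatically; the following is a small illustrative sketch:

```python
def latin_square(treatments):
    """Build a cyclic Latin square: each row is the previous row rotated
    left by one position, so every treatment appears exactly once in each
    row and each column."""
    n = len(treatments)
    return [[treatments[(row + col) % n] for col in range(n)] for row in range(n)]

for row in latin_square(["A", "B", "C", "D"]):
    print(" ".join(row))
# A B C D
# B C D A
# C D A B
# D A B C
```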
FD - Factorial Designs - This design allows the experimenter to test two or more variables
simultaneously. It also measures interaction effects of the variables and analyzes the impacts
of each of the variables.
In a true experiment, randomization is essential so that the experimenter can infer cause and
effect without any bias.
Internal sources
If available, internal secondary data may be obtained with less time, effort and money
than the external secondary data. In addition, they may also be more pertinent to the situation
at hand since they are from within the organization. The internal sources include
Accounting resources- This gives so much information which can be used by the marketing
researcher. They give information about internal factors.
Sales Force Report- It gives information about the sale of a product. The information
provided comes from outside the organization.
Internal Experts- These are people who are heading the various departments. They can give
an idea of how a particular thing is working.
Miscellaneous Reports- These cover the information obtained from operational
reports. If the data available within the organization are unsuitable or inadequate, the marketer
should extend the search to external secondary data sources.
Government Publications- Government sources provide an extremely rich pool of data for
the researchers. In addition, many of these data are available free of cost on internet websites.
There are a number of government agencies generating data. These are:
Director General of Commercial Intelligence- This office operates from Kolkata. It gives
information about foreign trade i.e. import and export. These figures are provided region-
wise and country-wise.
Ministry of Commerce and Industries- This ministry through the office of economic
advisor provides information on wholesale price index. These indices may be related to a
number of sectors like food, fuel, power, food grains etc. It also generates All India
Consumer Price Index numbers for industrial workers, urban non-manual employees and
agricultural labourers.
Reserve Bank of India- This provides information on Banking Savings and investment. RBI
also prepares currency and finance reports.
Labour Bureau- It provides information on skilled, unskilled, white collared jobs etc.
National Sample Survey- This is done by the Ministry of Planning and it provides social,
economic, demographic, industrial and agricultural statistics.
State Statistical Abstract- This gives information on various types of activities related to the
state like - commercial activities, education, occupation etc.
The Bombay Stock Exchange (it publishes a directory containing financial accounts, key
profitability and other relevant matter)
Various Associations of Press Media. Export Promotion Council.
Syndicate Services- These services are provided by certain organizations which collect and
tabulate the marketing information on a regular basis for a number of clients who are the
subscribers to these services. So the services are designed in such a way that the information
suits the subscriber. These services are useful in television viewing, movement of consumer
goods etc. These syndicate services provide information from both households as well as
institutions.
In collecting data from households they use three approaches. Survey- They conduct surveys
regarding lifestyle, sociographics and general topics. Mail Diary Panel- It may be related to 2
fields - Purchase and Media.
Various syndicate services are Operations Research Group (ORG) and The Indian
Marketing Research Bureau (IMRB).
Importance of Syndicate Services
Syndicate services are becoming popular since the constraints of decision making are
changing and we need more of specific decision-making in the light of changing
environment. Also Syndicate services are able to provide information to the industries at a
low unit cost.
Disadvantages of Syndicate Services
The information provided is not exclusive. A number of research agencies provide
customized services which suit the requirements of each individual organization.
International Organizations- These include:
The International Labour Organization (ILO)- It publishes data on the total and active
population, employment, unemployment, wages and consumer prices
The Organization for Economic Co-operation and development (OECD) - It publishes data
on foreign trade, industry, food, transport, and science and technology.
Based on various features (cost, data, process, source time etc.) various sources of
data can be compared as per table 1.
Sensor data is the output of a device that detects and responds to some type of input
from the physical environment. The output may be used to provide information or input to
another system or to guide a process. Examples are as follows
A photosensor detects the presence of visible light, infrared transmission (IR) and/or
ultraviolet (UV) energy.
Lidar, a laser-based method of detection, range finding and mapping, typically uses a
low-power, eye-safe pulsing laser working in conjunction with a camera.
A charge-coupled device (CCD) stores and displays the data for an image in such a way
that each pixel is converted into an electrical charge, the intensity of which is related to a
color in the color spectrum.
Smart grid sensors can provide real-time data about grid conditions, detecting outages,
faults and load and triggering alarms.
Wireless sensor networks combine specialized transducers with a communications
infrastructure for monitoring and recording conditions at diverse locations. Commonly
monitored parameters include temperature, humidity, pressure, wind direction and speed,
illumination intensity, vibration intensity, sound intensity, powerline voltage, chemical
concentrations, pollutant levels and vital body functions.
The simplest form of signal is a direct current (DC) that is switched on and off; this is
the principle by which the early telegraph worked. More complex signals consist of an
alternating-current (AC) or electromagnetic carrier that contains one or more data streams.
Data must be transformed into electromagnetic signals prior to transmission across a
network. Data and signals can be either analog or digital. A signal is periodic if it consists
of a continuously repeating pattern.
The Global Positioning System (GPS) is a space based navigation system that
provides location and time information in all weather conditions, anywhere on or near the
Earth where there is an unobstructed line of sight to four or more GPS satellites. The system
provides critical capabilities to military, civil, and commercial users around the world. The
United States government created the system, maintains it, and makes it freely accessible to
anyone with a GPS receiver.
Accuracy and Precision: This characteristic refers to the exactness of the data. It cannot
have any erroneous elements and must convey the correct message without being misleading.
This accuracy and precision have a component that relates to its intended use. Without
understanding how the data will be consumed, ensuring accuracy and precision could be off-
target or more costly than necessary. For example, accuracy in healthcare might be more
important than in another industry (which is to say, inaccurate data in healthcare could have
more serious consequences) and, therefore, justifiably worth higher levels of investment.
Legitimacy and Validity: Requirements governing data set the boundaries of this
characteristic. For example, on surveys, items such as gender, ethnicity, and nationality
are typically limited to a set of options and open answers are not permitted. Any answers
other than these would not be considered valid or legitimate based on the survey’s
requirement. This is the case for most data and must be carefully considered when
determining its quality. The people in each department in an organization understand what
data is valid or not to them, so the requirements must be leveraged when evaluating data
quality.
Reliability and Consistency: Many systems in today’s environments use and/or collect the
same source data. Regardless of what source collected the data or where it resides, it cannot
contradict a value residing in a different source or collected by a different system. There must
be a stable and steady mechanism that collects and stores the data without contradiction or
unwarranted variance.
Timeliness and Relevance: There must be a valid reason to collect the data to justify the
effort required, which also means it has to be collected at the right moment in time. Data
collected too soon or too late could misrepresent a situation and drive inaccurate
decisions.
Availability and Accessibility: This characteristic can be tricky at times due to legal and
regulatory constraints. Regardless of the challenge, though, individuals need the right level of
access to the data in order to perform their jobs. This presumes that the data exists and is
available for access to be granted.
Granularity and Uniqueness: The level of detail at which data is collected is important,
because confusion and inaccurate decisions can otherwise occur. Aggregated, summarized
and manipulated collections of data could offer a different meaning than the data
implied at a lower level. An appropriate level of granularity must be defined to provide
sufficient uniqueness and distinctive properties to become visible. This is a requirement for
operations to function effectively.
Noisy data is meaningless data. The term has often been used as a synonym for
corrupt data. However, its meaning has expanded to include any data that cannot be
understood and interpreted correctly by machines, such as unstructured text.
Noisy data
Examples: distortion of a person’s voice when talking on a poor phone and “snow” on
television screen
We can talk about signal to noise ratio.
The left image of two clean sine waves has a high signal-to-noise ratio (little or no noise);
the right image shows the two waves combined with noise and has a low SNR.
Origins of noise
Note that missing (null) values may have significance in themselves (e.g. a missing test in a
medical examination, or a missing death date meaning the person is still alive).
Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of one another.
Data Cleaning: Data is cleansed through processes such as filling in missing values,
smoothing the noisy data, or resolving the inconsistencies in the data.
Data Integration: Data with different representations are put together and conflicts
within the data are resolved.
Data Transformation: Data is normalized, aggregated and generalized.
Data Reduction: This step aims to present a reduced representation of the data in a
data warehouse.
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is
done. It involves handling of missing data, noisy data etc.
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is
divided into segments of equal size and then various methods are performed to
complete the task. Each segment is handled separately. One can replace all
data in a segment by its mean, or boundary values can be used to complete the
task (see the sketch after this list).
2. Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go
undetected, or they will fall outside the clusters.
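A minimal sketch of the binning method referred to above (the price values are hypothetical; equal-size bins of 3 values, smoothed by their bin means):

```python
def smooth_by_bin_means(values, bin_size):
    """Binning method: sort the data, split it into equal-size bins,
    and replace every value in a bin by the bin mean."""
    data = sorted(values)
    smoothed = []
    for start in range(0, len(data), bin_size):
        bin_values = data[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([round(bin_mean, 2)] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # hypothetical sorted prices
print(smooth_by_bin_means(prices, bin_size=3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```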
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for mining
process. This involves following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to
1.0); see the sketch after this list.
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help
the mining process.
3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example,
the attribute "city" can be converted to "country".
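A small sketch of the normalization step listed above, using min-max normalization to rescale hypothetical values into the 0.0 to 1.0 range:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale values into [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

incomes = [12000, 35000, 58000, 98000]     # hypothetical raw values
print(min_max_normalize(incomes))          # rescaled values between 0.0 and 1.0
```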
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data. While working
with huge volumes of data, analysis becomes harder. In order to get rid of this, we
use data reduction techniques, which aim to increase storage efficiency and reduce data
storage and analysis costs.
UNIT – II
INTRODUCTION TO ANALYTICS
2.1 Introduction to Analytics
As an enormous amount of data gets generated, the need to extract useful insights is a
must for a business enterprise. Data Analytics has a key role in improving your business.
Here are 4 main factors which signify the need for Data Analytics:
Gather Hidden Insights – Hidden insights from data are gathered and then analyzed
with respect to business requirements.
Generate Reports – Reports are generated from the data and are passed on to the
respective teams and individuals to deal with further actions for a high rise in
business.
Perform Market Analysis – Market Analysis can be performed to understand the
strengths and the weaknesses of competitors.
Improve Business Requirement – Analysis of Data allows improving Business to
customer requirements and experience.
Data Analytics refers to the techniques to analyze data to enhance productivity and
business gain. Data is extracted from various sources and is cleaned and categorized to
analyze different behavioral patterns. The techniques and the tools used vary according to the
organization or individual.
Data analysts translate numbers into plain English. A Data Analyst delivers value to their
companies by taking information about specific topics and then interpreting, analyzing,
and presenting findings in comprehensive reports. So, if you have the capability to collect
data from various sources, analyze the data, gather hidden insights and generate reports, then
you can become a Data Analyst.
In general, data analytics also involves a degree of human knowledge, as discussed in
figure 2.2: under each type of analytics there is a part of human knowledge required
in prediction. Descriptive analytics requires the highest human input, while predictive
analytics requires less human input. In the case of prescriptive analytics no human input is
required, since all the data is predicted.
In general, data analytics deals with three main parts: subject knowledge, statistics, and
a person with computer knowledge to work on a tool to give insight into the business. The
mainly used tools are R and Python, as shown in figure 2.3.
With the increasing demand for Data Analytics in the market, many tools have emerged
with various functionalities for this purpose. Either open-source or user-friendly, the top tools
in the data analytics market are as follows.
R programming – This tool is the leading analytics tool used for statistics and data
modeling. R compiles and runs on various platforms such as UNIX, Windows, and Mac
OS. It also provides tools to automatically install all packages as per user-requirement.
Python – Python is an open-source, object-oriented programming language which is easy
to read, write and maintain. It provides various machine learning and visualization
libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras etc. It can also be
integrated with platforms like SQL Server, a MongoDB database or JSON data.
Tableau Public – This is a free software that connects to any data source such as Excel,
corporate Data Warehouse etc. It then creates visualizations, maps, dashboards etc with
real-time updates on the web.
QlikView – This tool offers in-memory data processing with the results delivered to the
end-users quickly. It also offers data association and data visualization with data being
compressed to almost 10% of its original size.
SAS – A programming language and environment for data manipulation and analytics,
this tool is easily accessible and can analyze data from different sources.
Microsoft Excel – This tool is one of the most widely used tools for data analytics.
Mostly used for clients’ internal data, this tool analyzes the tasks that summarize the data
with a preview of pivot tables.
RapidMiner – A powerful, integrated platform that can integrate with any data source
types such as Access, Excel, Microsoft SQL, Tera data, Oracle, Sybase etc. This tool is
mostly used for predictive analytics, such as data mining, text analytics, machine
learning.
KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics
platform, which allows you to analyze and model data. With the benefit of visual
programming, KNIME provides a platform for reporting and integration through its
modular data pipeline concept.
OpenRefine – Also known as GoogleRefine, this data cleaning software will help you
clean up data for analysis. It is used for cleaning messy data, the transformation of data
and parsing data from websites.
Apache Spark – One of the largest large-scale data processing engines, this tool executes
applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk.
This tool is also popular for data pipelines and machine learning model development.
Apart from the above-mentioned capabilities, a Data Analyst should also possess skills
such as Statistics, Data Cleaning, Exploratory Data Analysis, and Data Visualization. Also, if
you have knowledge of Machine Learning, then that would make you stand out from the
crowd.
Data analytics is mainly used in the field of business, in various concerns, for the
following purposes; its use varies according to business needs, as discussed below in
detail. Nowadays the majority of businesses deal with prediction over large amounts of data.
Using big data as a fundamental factor in decision making requires new capabilities, and most
firms are far away from accessing all data resources. Companies in various sectors have
acquired crucial insight from the structured data collected from different enterprise systems
and analyzed by commercial database management systems. Eg:
1.) Facebook and Twitter are used to gauge the instantaneous influence of campaigns and to
examine consumer opinion about products.
2.) Some companies, like Amazon, eBay, and Google, considered early leaders, examine the
factors that control performance to define what raises sales revenue and
user interactivity.
Hadoop is an open source software platform that enables processing of large data sets in a
distributed computing environment. Work in this area discusses concepts related to big data and
the rules for building, organizing and analyzing huge data-sets in the business environment;
it offers 3 architecture layers and indicates some graphical tools to explore and
represent unstructured data, and it shows how famous companies could improve
their business. Eg: Google, Twitter and Facebook show their attention to processing big data
within a cloud environment.
The Map() step: Each worker node applies the Map() function to the local data and writes the
output to a temporary storage space. The Map() code is run exactly once for each K1 key
value, generating output that is organized by key values K2. A master node arranges it so that
for redundant copies of input data only one is processed.
The Shuffle() step: The map output is sent to the reduce processors, which assign the K2 key
value that each processor should work on, and provide that processor with all of the map-
generated data associated with that key value, such that all data belonging to one key are
located on the same worker node.
The Reduce() step: Worker nodes process each group of output data (per key) in parallel,
executing the user-provided Reduce() code; each function is run exactly once for each K2 key
value produced by the map step.
Produce the final output: The MapReduce system collects all of the reduce outputs and sorts
them by K2 to produce the final outcome.
Fig. 2.4 shows the classical "word count problem" using the MapReduce paradigm. As shown
in Fig. 2.4, initially a process splits the data into a subset of chunks that will later be
processed by the mappers. Once the key/values are generated by the mappers, a shuffling process
is used to mix (combine) these key values (combining the same keys in the same worker
node). Finally, the reduce functions are used to count the words, generating a common
output as a result of the algorithm. As a result of the execution of mappers/reducers, the
output will be a sorted list of word counts from the original text input.
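A minimal single-machine sketch of this word-count flow (map, shuffle, reduce); a real deployment would run the same logic across a Hadoop or Spark cluster, and the input chunks below are hypothetical:

```python
from collections import defaultdict

def map_phase(chunk):
    """Map(): emit a (word, 1) pair for every word in the input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle(): group all values belonging to the same key together."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce(): sum the counts for each key."""
    return {key: sum(values) for key, values in grouped.items()}

chunks = ["Deer Bear River", "Car Car River", "Deer Car Bear"]   # hypothetical input splits
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle_phase(mapped))
print(sorted(counts.items()))
# [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]
```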
IBM and Microsoft are prominent representatives. IBM has presented many big data options
that enable users to store, manage, and analyze data through various resources; it has a
good standing in business intelligence as well as healthcare. Compared with IBM,
Microsoft has shown powerful work in the area of cloud computing activities and techniques.
Another example is Facebook and Twitter, who collect various data from users'
profiles and use it to increase their revenue.
Big data analytics and business intelligence are related fields which have become widely
significant in the business and academic areas; companies are permanently trying to extract
insight from data along the three V's (variety, volume and velocity) to support decision
making.
2.4 Databases
A database is an organized collection of structured information, or data, typically
stored electronically in a computer system. A database is usually controlled by a database
management system (DBMS).
The database can be divided into various categories such as text databases,
desktop database programs, relational database management systems (RDBMS), and NoSQL
and object-oriented databases
A text database is a system that maintains a (usually large) text collection and
provides fast and accurate access to it. Eg: Text book, magazine, journals, manuals, etc..
NoSQL databases are non-tabular, and store data differently than relational
tables. NoSQL databases come in a variety of types based on their data model. The main
types are document, key-value, wide-column, and graph. Eg: JSON document stores, MongoDB, CouchDB, etc.
Object-oriented databases (OODB) are databases that represent data in the form
of objects and classes. In object-oriented terminology, an object is a real-world entity, and a
class is a collection of objects. Object-oriented databases follow the fundamental principles
of object-oriented programming (OOP). Eg: C++, Java, C#, Smalltalk, LISP, etc.
In any database we will be working with data to perform analysis and
prediction. In a relational database management system we normally use rows to represent
data and columns to represent the attributes.
In Nominal Data there is no natural ordering of the values in an attribute of the dataset.
Eg: colour, gender, nouns (name, place, animal, thing). These categories cannot be given a
predefined order; for example, there is no specific way to arrange the gender of 50 students in
a class. In this case the first student can be male or female, and similarly for all 50 students,
so ordering cannot be valid.
In Ordinal Data there is a natural ordering of the values in an attribute of the dataset. Eg:
size (S, M, L, XL, XXL), rating (excellent, good, better, worst). In the above example we can
quantify the data after performing ordering, which gives valuable insights into the data.
A Discrete Attribute takes only a finite (countable) number of numerical values (integers). Eg:
number of buttons, number of days for product delivery, etc. These data can be represented at
every specific interval in time series data mining or even in ratio-based entries.
A Continuous Attribute takes real (fractional) values. Eg: price, discount, height, weight,
length, temperature, speed, etc. These data can be represented at every specific interval in
time series data mining or even in ratio-based entries.
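A small pandas sketch (with hypothetical column values) showing how these four attribute types can be represented: a nominal categorical, an ordered (ordinal) categorical, integer counts for a discrete attribute, and floating-point values for a continuous attribute:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": pd.Categorical(["M", "F", "F", "M"]),          # nominal: no natural order
    "size": pd.Categorical(["S", "XL", "M", "L"],
                           categories=["S", "M", "L", "XL"],
                           ordered=True),                     # ordinal: natural order
    "delivery_days": [2, 5, 3, 7],                            # discrete: integer counts
    "price": [199.99, 349.50, 99.00, 499.95],                 # continuous: fractional values
})

print(df.dtypes)
print(df["size"].sort_values())   # ordering is meaningful only for the ordinal column
```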
Data modelling is nothing but a process through which data is stored structurally in a
format in a database. Data modelling is important because it enables organizations to make
data-driven decisions and meet varied business goals.
The entire process of data modelling is not as easy as it seems, though. You are
required to have a deeper understanding of the structure of an organization and then propose
a solution that aligns with its end-goals and helps it achieve the desired objectives.
Data modeling can be achieved in various ways. However, the basic concept of each
of them remains the same. Let’s have a look at the commonly used data modeling methods:
Hierarchical model
As the name indicates, this data model makes use of hierarchy to structure the data in
a tree-like format as shown in figure 2.6. However, retrieving and accessing data is difficult
in a hierarchical database. This is why it is rarely used now.
Network model
The network model is inspired by the hierarchical model. However, unlike the
hierarchical model, this model makes it easier to convey complex relationships, as each record
can be linked with multiple parent records, as shown in figure 2.8. In this model data can be
shared easily and the computation becomes easier.
Object-oriented model
This database model consists of a collection of objects, each with its own features and
methods. This type of database model is also called the post-relational database model, as
shown in figure 2.8.
Entity-relationship model
The entity relationship diagram explains the relation between variables with their
primary keys and foreign keys, as shown in figure 2.10. Along with this, it also explains the
multiple instances of relations between tables.
Now that we have a basic understanding of data modeling, let’s see why it is important.
You will agree with us that the main goal behind data modeling is to equip your business and
contribute to its functioning. As a data modeler, you can achieve this objective only when
you know the needs of your enterprise correctly.
It is essential to make yourself familiar with the varied needs of your business so that you can
prioritize and discard the data depending on the situation.
Key takeaway: Have a clear understanding of your organization’s requirements and organize
your data properly.
Things will be sweet initially, but they can become complex in no time. This is why it is
highly recommended to keep your data models small and simple, to begin with.
Once you are sure of your initial models in terms of accuracy, you can gradually introduce
more datasets. This helps you in two ways. First, you are able to spot any inconsistencies in
the initial stages. Second, you can eliminate them on the go.
Key takeaway: Keep your data models simple. The best data modeling practice here is to use
a tool which can start small and scale up as needed.
Organize your data based on facts, dimensions, filters, and order
You can find answers to most business questions by organizing your data in terms of four
elements – facts, dimensions, filters, and order.
Let’s understand this better with the help of an example. Let’s assume that you run four e-
commerce stores in four different locations of the world. It is the year-end, and you want to
analyze which e-commerce store made the most sales.
In such a scenario, you can organize your data over the last year. Facts will be the overall
sales data of last 1 year, the dimensions will be store location, the filter will be last 12
months, and the order will be the top stores in decreasing order.
This way, you can organize all your data properly and position yourself to answer an array
of business intelligence questions without breaking a sweat.
Key takeaway: It is highly recommended to organize your data properly using individual
tables for facts and dimensions to enable quick analysis.
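Continuing the e-commerce example, a small pandas sketch (store names, dates and sales figures are hypothetical) that applies the four elements: the fact is the sales amount, the dimension is the store, the filter is the last 12 months, and the order is descending total sales:

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["Delhi", "London", "Tokyo", "NewYork", "Delhi", "London"],
    "date": pd.to_datetime(["2024-01-15", "2024-03-02", "2023-05-20",
                            "2024-06-11", "2024-08-30", "2023-04-01"]),
    "amount": [1200, 800, 1500, 950, 700, 400],   # the fact: hypothetical sales figures
})

cutoff = pd.Timestamp("2024-09-30") - pd.DateOffset(months=12)   # filter: last 12 months
recent = sales[sales["date"] >= cutoff]

top_stores = (recent.groupby("store")["amount"]     # dimension: store location
                    .sum()                           # aggregate the fact
                    .sort_values(ascending=False))   # order: top stores first
print(top_stores)
```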
While you might be tempted to keep all the data with you, do not ever fall for this trap!
Although storage is not a problem in this digital age, you might end up taking a toll on your
machines' performance.
More often than not, just a small yet useful amount of data is enough to answer all the
business-related questions. Spending hugely on hosting enormous amounts of data only leads
to performance issues, sooner or later.
Key takeaway: Have a clear opinion on how many datasets you want to keep. Maintaining
more than what is actually required wastes your data modelling effort and leads to performance
issues.
Data modeling is a big project, especially when you are dealing with huge amounts of data.
Thus, you need to be cautious enough. Keep checking your data model before continuing to
the next step.
For example, if you need to choose a primary key to identify each record in the dataset
properly, make sure that you are picking the right attribute. Product ID could be one such
attribute. Thus, even if two counts match, their product IDs can help you in distinguishing
each record. Keep checking if you are on the right track. Are the product IDs the same too? In those
cases, you will need to look for another dataset to establish the relationship.
Key takeaway: It is the best practice to maintain one-to-one or one-to-many relationships.
The many-to-many relationship only introduces complexity in the system.
Key takeaway: Data models become outdated quicker than you expect. It is necessary that
you keep them updated from time to time.
The Wrap Up
Data modeling plays a crucial role in the growth of businesses, especially when it enables
organizations to base their decisions on facts and figures. To achieve the varied business
intelligence insights and goals, it is recommended to model your data correctly and use
appropriate tools to ensure the simplicity of the system.
In statistics, imputation is the process of replacing missing data with substituted values. ...
Because missing data can create problems for analyzing data, imputation is seen as a way
to avoid the pitfalls involved with list-wise deletion of cases that have missing values.
Imputation using the mean, median or mode of a column is the simplest approach.
Advantages:
• Works well with numerical dataset.
• Very fast and reliable.
Disadvantage:
• Does not work with categorical attributes
• Does not correlate relation between columns
• Not very accurate.
• Does not account for any uncertainty in data
The k-nearest-neighbours (kNN) algorithm, normally used for simple classification, can also be
used for imputation. The algorithm uses 'feature similarity' to predict the values of any new
data points: a new point is assigned a value based on how closely it resembles the points in the
training set. This is useful for making predictions about missing values, by finding the k closest
neighbours of the observation with missing data and then imputing them based on the non-
missing values in the neighbourhood (a sketch of both imputation approaches follows below).
Advantage:
• This method is more accurate than mean, median and mode imputation
Disadvantage:
• Sensitive to outliers
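A short sketch of both imputation approaches, assuming scikit-learn is available; the small feature matrix is hypothetical:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical numerical feature matrix with missing entries (np.nan)
X = np.array([
    [25.0, 50000.0],
    [np.nan, 62000.0],
    [31.0, np.nan],
    [42.0, 81000.0],
])

# Mean imputation: replace each missing value with the column mean
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# kNN imputation: replace each missing value using the k closest rows
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)

print(mean_imputed)
print(knn_imputed)
```

In practice the imputer would be fitted on training data only and then applied to new data, so that the imputation statistics do not leak information from the test set.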
UNIT-3
BLUE Property Assumptions
The Gauss-Markov theorem tells us that if a certain set of assumptions is met, the
ordinary least squares estimate for regression coefficients gives you the Best Linear
Unbiased Estimate (BLUE) possible.
Linearity:
o The parameters we are estimating using the OLS method must be themselves
linear.
Random:
o Our data must have been randomly sampled from the population.
Non-Collinearity:
o The regressors being calculated aren’t perfectly correlated with each other.
Exogeneity:
o The regressors aren’t correlated with the error term.
Homoscedasticity:
o No matter what the values of our regressors might be, the variance of the error is
constant.
Checking how well our data matches these assumptions is an important part of estimating
regression coefficients.
When you know where these conditions are violated, you may be able to plan ways to
change your experiment setup to help your situation fit the ideal Gauss Markov situation
more closely.
In practice, the Gauss Markov assumptions are rarely all met perfectly, but they are still
useful as a benchmark, and because they show us what ‘ideal’ conditions would be.
They also allow us to pinpoint problem areas that might cause our estimated regression
coefficients to be inaccurate or even unusable.
The estimate generated by ordinary least squares is the best linear unbiased estimate
(BLUE) possible if these assumptions hold. The first of these assumptions can be read as
"the expected value of the error term is zero". The second assumption is non-collinearity,
the third is exogeneity, and the fourth is homoscedasticity.
Regression Concepts
Regression
Each xi corresponds to the set of attributes of the ith observation (known as explanatory
variables) and yi corresponds to the target (or response) variable.
The explanatory attributes of a regression task can be either discrete or continuous.
Regression (Definition)
Regression is the task of learning a target function f that maps each attribute set x into a
continuous-valued output y.
To find a target function that can fit the input data with minimum error.
The error function for a regression task can be expressed in terms of the sum of absolute
or squared error:
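In standard notation these error functions are:

\[
E_{\text{abs}} \;=\; \sum_{i=1}^{N} \lvert y_i - f(x_i) \rvert,
\qquad
E_{\text{sq}} \;=\; \sum_{i=1}^{N} \bigl( y_i - f(x_i) \bigr)^{2}
\]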
Suppose we wish to fit the linear model f(x) = w0 + w1x to the observed data,
where w0 and w1 are parameters of the model and are called the regression coefficients.
A standard approach for doing this is to apply the method of least squares, which
attempts to find the parameters (w0,w1) that minimize the sum of the squared error
Setting the partial derivatives of the SSE with respect to w0 and w1 to zero yields a pair of
equations, which can be summarized by the following matrix equation, also known as the
normal equation:
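In its standard form for simple linear regression, the normal equation reads:

\[
\begin{pmatrix} N & \sum_{i} x_i \\ \sum_{i} x_i & \sum_{i} x_i^{2} \end{pmatrix}
\begin{pmatrix} w_0 \\ w_1 \end{pmatrix}
=
\begin{pmatrix} \sum_{i} y_i \\ \sum_{i} x_i y_i \end{pmatrix}
\]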
The normal equations can then be solved to obtain the following estimates for the parameters.
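In standard form these estimates are:

\[
w_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^{2}},
\qquad
w_0 = \bar{y} - w_1 \bar{x}
\]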
Substituting these estimates gives the linear model that best fits the data in terms of minimizing the SSE.
We can show that the general solution to the normal equations (given in D.6) can be
expressed as follows:
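In matrix notation, with design matrix X and target vector y, this general solution is:

\[
\mathbf{w} = \left( \mathbf{X}^{\top} \mathbf{X} \right)^{-1} \mathbf{X}^{\top} \mathbf{y}
\]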
Substituting this solution gives the linear model that results in the minimum squared error.
In summary, the least squares method is a systematic approach to fit a linear model to the
response variable y by minimizing the squared error between the true and estimated values
of y.
Although the model is relatively simple, it seems to provide a reasonably accurate
approximation because a linear model is the first-order Taylor series approximation for
any function with continuous derivatives.
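A short NumPy sketch of such a least squares fit (the x and y observations are hypothetical):

```python
import numpy as np

# Hypothetical observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix with a column of ones for the intercept w0
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations (X^T X) w = X^T y in a numerically stable way
(w0, w1), *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"fitted model: f(x) = {w0:.3f} + {w1:.3f} x")

sse = np.sum((y - (w0 + w1 * x)) ** 2)   # sum of squared errors of the fit
print(f"SSE = {sse:.4f}")
```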
Logistic Regression
Consider a procedure in which individuals are selected on the basis of their scores in a
battery of tests.
After five years the candidates are classified as "good" or "poor.”
We are interested in examining the ability of the tests to predict the job performance of
the candidates.
Here the response variable, performance, is dichotomous.
We can code "good" as 1 and "poor" as 0, for example.
The predictor variables are the scores in the tests.
In a study to determine the risk factors for cancer, health records of several people were
studied.
Data were collected on several variables, such as age, gender, smoking, diet, and the
family's medical history.
The response variable was whether the person had cancer (Y = 1) or did not have cancer (Y = 0).
The relationship between the probability π and X can often be represented by a logistic
response function.
It resembles an S-shaped curve.
The probability π initially increases slowly with increase in X, and then the increase
accelerates, finally stabilizes, but does not increase beyond 1.
Intuitively this makes sense.
Consider the probability of a questionnaire being returned as a function of cash reward,
or the probability of passing a test as a function of the time put in studying for it.
The shape of the S-curve can be reproduced if we model the probabilities as follows:
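The standard logistic response function that produces this S-shaped curve, together with the logit transformation used later for fitting, is:

\[
\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}},
\qquad
\operatorname{logit}(\pi) = \ln\!\left( \frac{\pi}{1 - \pi} \right) = \beta_0 + \beta_1 x
\]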
A sigmoid function is a bounded differentiable real function that is defined for all real
input values and has a positive derivative at each point.
Modeling the response probabilities by the logistic distribution and estimating the
parameters of the model given below constitutes fitting a logistic regression.
In logistic regression the fitting is carried out by working with the logits.
The Logit transformation produces a model that is linear in the parameters.
The method of estimation used is the maximum likelihood method.
The maximum likelihood estimates are obtained numerically, using an iterative
procedure.
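A small sketch of fitting a logistic regression by (iterative) maximum likelihood with scikit-learn; the test scores and good/poor labels are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical test scores (X) and job performance after five years (y: 1 = good, 0 = poor)
X = np.array([[35], [48], [52], [60], [68], [75], [81], [90]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression()        # fitted by an iterative maximum-likelihood procedure
model.fit(X, y)

print("intercept:", model.intercept_[0])
print("coefficient:", model.coef_[0][0])
print("P(good | score = 70):", model.predict_proba([[70]])[0, 1])
```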
OLS:
The ordinary least squares, or OLS, can also be called the linear least squares.
This is a method for approximately determining the unknown parameters located in a
linear regression model.
According to books of statistics and other online sources, the ordinary least squares is
obtained by minimizing the total of squared vertical distances between the observed
responses within the dataset and the responses predicted by the linear approximation.
Through a simple formula, you can express the resulting estimator, especially the single
regressor, located on the right-hand side of the linear regression model.
For example, you have a set of equations which consists of several equations that have
unknown parameters.
You may use the ordinary least squares method because this is the most standard
approach to finding the approximate solution to your overdetermined systems.
In other words, it is your overall solution in minimizing the sum of the squares of errors
in your equation.
Data fitting can be your most suited application. Online sources have stated that the data
that best fits the ordinary least squares minimizes the sum of squared residuals.
“Residual” is “the difference between an observed value and the fitted value provided by
a model.”
MLE:
Maximum likelihood estimation (MLE) is a method used in estimating the parameters of a
statistical model to data.
If you want to find the height measurement of every basketball player in a specific
location, you can use the maximum likelihood estimation.
Normally, you would encounter problems such as cost and time constraints.
If you could not afford to measure all of the basketball players’ heights, the maximum
likelihood estimation would be very handy.
Using the maximum likelihood estimation, you can estimate the mean and variance of the
height of your subjects.
The MLE would set the mean and variance as parameters in determining the specific
parametric values in a given model.
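A minimal sketch of this idea in R, assuming a small hypothetical sample of heights; the normal log-likelihood is maximized numerically with optim():

neg_loglik <- function(par, x) -sum(dnorm(x, mean = par[1], sd = par[2], log = TRUE))
heights <- c(180, 175, 192, 168, 185, 178, 181)    # hypothetical heights in cm
fit <- optim(c(mean(heights), sd(heights)), neg_loglik, x = heights,
             method = "L-BFGS-B", lower = c(-Inf, 1e-6))
fit$par    # MLE of the mean and standard deviation (the MLE of the variance divides by n, not n - 1)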
For a response with k categories, (k − 1) logit equations are fitted, one for each j = 1, 2, ..., (k − 1). The model parameters are estimated by the method of maximum likelihood, and statistical software is available to do this fitting.
UNIT-4
Regression vs. Segmentation
We use linear or logistic regression technique for developing accurate models for
predicting an outcome of interest.
Often, we create separate models for separate segments.
Segmentation methods such as CHAID or CRT are used to judge their effectiveness.
Creating a separate model for each segment may be time consuming and not worth the effort.
But creating a separate model for each segment may provide higher predictive power.
Market Segmentation
Dividing the target market or customers on the basis of some significant features that can help a company sell more products with less marketing expense.
Companies have limited marketing budgets, yet the marketing team is expected to make a large number of sales to ensure rising revenue and profits.
A product is created in two ways:
Create a product after analyzing (research) the needs and wants of target market –
For example: Computer. Companies like Dell, IBM, Microsoft entered this
market after analyzing the enormous market which this product upholds.
Create a product which evokes the needs & wants in target market – For example:
iPhone.
Once the product is created, the ball shifts to the marketing team’s court.
As mentioned above, they make use of market segmentation techniques.
This ensures the product is positioned to the right segment of customers with high
propensity to buy.
A very similar approach can also be used for developing a linear regression model.
Logistic regression uses a 1-or-0 indicator in the historical campaign data, which indicates whether the customer has responded to the offer or not.
Usually, one uses the target (or ‘Y’ known as dependent variable) that has been identified
for model development to undertake an objective segmentation.
Remember, a separate model will be built for each segment.
A segmentation scheme which provides the maximum difference between the segments
with regards to the objective is usually selected.
Below is a simple example of this approach.
Objective Segmentation
Segmentation to identify the type of customers who would respond to a particular offer.
Segmentation to identify high spenders among customers who will use the e-commerce
channel for festive shopping.
Segmentation to identify customers who will default on their credit obligation for a loan
or credit card.
Non-Objective Segmentation
Segmentation of the customer base to understand the specific profiles which exist within
the customer base so that multiple marketing actions can be personalized for each
segment
Segmentation of geographies on the basis of affluence and lifestyle of people living in
each geography so that sales and distribution strategies can be formulated accordingly.
Segmentation of web site visitors on the basis of browsing behavior to understand the
level of engagement and affinity towards the brand.
Hence, it is critical that the segments created on the basis of an objective segmentation
methodology must be different with respect to the stated objective (e.g. response to an
offer).
However, in case of a non-objective methodology, the segments are different with respect
to the “generic profile” of observations belonging to each segment, but not with regards
to any specific outcome of interest.
The most common techniques for building non-objective segmentation are cluster
analysis, K nearest neighbor techniques etc.
Each of these techniques uses a distance measure (e.g. Euclidean distance, Manhattan distance, Mahalanobis distance, etc.).
This is done to maximize the distance between the two segments.
This implies maximum difference between the segments with regards to a combination of
all the variables (or factors).
Tree Building
Goal:
o to create a model that predicts the value of a target variable based on several input variables.
Classification tree analysis is when the predicted outcome is the class to which the
data belongs.
Regression tree analysis is when the predicted outcome can be considered a real
number. (e.g. the price of a house, or a patient’s length of stay in a hospital).
A decision tree
o is a flow-chart-like structure
o each internal (non-leaf) node denotes a test on an attribute
o each branch represents the outcome of a test,
o each leaf (or terminal) node holds a class label.
o The topmost node in a tree is the root node.
Decision-tree algorithms:
o ID3 (Iterative Dichotomiser 3)
o C4.5 (successor of ID3)
o CART (Classification and Regression Tree)
o CHAID (CHI-squared Automatic Interaction Detector). Performs multi-level
splits when computing classification trees.
o MARS: extends decision trees to handle numerical data better.
o Conditional Inference Trees: a statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting. This approach results in unbiased predictor selection and does not require pruning.
ID3 and CART follow a similar approach for learning decision tree from training tuples.
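A minimal sketch of CART-style tree building in R, assuming the rpart package is installed; the built-in iris data is used only for illustration:

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")   # classification tree
print(fit)                                        # text view of the splits
pred <- predict(fit, iris, type = "class")
table(observed = iris$Species, predicted = pred)  # confusion matrix on the training data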
Splits are found that maximize the homogeneity of child nodes with respect to the value
of the dependent variable.
Impurity Measure:
GINI Index: Used by the CART (Classification and Regression Tree) algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset.
To compute the Gini impurity for a set of items, suppose i ∈ {1, 2, ..., m} and let fi be the fraction of items labeled with value i in the set.
Gini impurity is computed by summing the probability fi of each item being chosen times the probability (1 − fi) of a mistake in categorizing that item: Gini = Σi fi(1 − fi) = 1 − Σi fi².
It reaches its minimum (zero) when all cases in the node fall into a single target category.
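A minimal sketch of this calculation in R:

gini_impurity <- function(labels) {
  f <- table(labels) / length(labels)   # fraction fi of each class
  sum(f * (1 - f))                      # equivalently 1 - sum(f^2)
}
gini_impurity(c("yes", "yes", "no", "no"))   # 0.5, the maximum for two classes
gini_impurity(c("yes", "yes", "yes"))        # 0, a pure node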
Pruning
After building the decision tree, a tree-pruning step can be performed to reduce the size
of the decision tree.
Pruning helps by trimming the branches of the initial tree in a way that improves the
generalization capability of the decision tree.
The errors committed by a classification model are generally divided into two types:
o training errors
o generalization errors.
Training error
o also known as resubstitution error or apparent error.
o it is the number of misclassification errors committed on training records.
generalization error
o is the expected error of the model on previously unseen records.
o A good classification model must not only fit the training data well, it must also
accurately classify records it has never seen before.
A good model must have low training error as well as low generalization error.
Model overfitting
o Decision trees that are too large are susceptible to a phenomenon known as
overfitting.
o A model that fits the training data too well can have a poorer generalization error
than a model with a higher training error.
o Such a situation is known as model overfitting.
Model underfitting
o The training and test error rates of the model are large when the size of the tree is
very small.
o This situation is known as model underfitting.
o Underfitting occurs because the model has yet to learn the true structure of the
data.
Model complexity
o To understand the overfitting phenomenon, the training error of a model can be
reduced by increasing the model complexity.
o Overfitting and underfitting are two pathologies that are related to the model
complexity.
Time Series Methods: ARIMA
Applications
o ARIMA models are important for generating forecasts and providing
understanding in all kinds of time series problems from economics to health care
applications.
o In quality and reliability, they are important in process monitoring if observations
are correlated.
o designing schemes for process adjustment
o monitoring a reliability system over time
o forecasting time series
o estimating missing values
o finding outliers and atypical events
o understanding the effects of changes in a system
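A minimal sketch of ARIMA forecasting with base R's arima() on the built-in AirPassengers series; the chosen order is illustrative, not a tuned model:

fit <- arima(log(AirPassengers), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fc <- predict(fit, n.ahead = 12)    # forecast the next 12 months
exp(fc$pred)                        # back-transform to the original scale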
Measures of Forecast Accuracy
Forecast Accuracy can be defined as the deviation of Forecast or Prediction from the
actual results.
Error = Actual demand − Forecast, i.e. et = At − Ft
We measure forecast accuracy by two methods:
Mean Forecast Error (MFE)
o For n time periods where we have actual demand and forecast values: MFE = Σ(At − Ft) / n
o Ideal value = 0;
o MFE > 0: the model tends to under-forecast
o MFE < 0: the model tends to over-forecast
While MFE is a measure of forecast model bias, the Mean Absolute Deviation (MAD = Σ|At − Ft| / n) indicates the absolute size of the errors.
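A minimal sketch of both measures in R, using hypothetical demand and forecast values:

actual <- c(100, 120, 110, 130, 125)   # hypothetical actual demand
fcst   <- c(105, 115, 112, 128, 135)   # hypothetical forecasts
e   <- actual - fcst                   # et = At - Ft
mfe <- mean(e)                         # Mean Forecast Error (bias); ideal value 0
mad <- mean(abs(e))                    # Mean Absolute Deviation (size of the errors)
c(MFE = mfe, MAD = mad)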
ETL Approach
Extract, Transform and Load (ETL) refers to a process in database usage and especially
in data warehousing that:
o Extracts data from homogeneous or heterogeneous data sources
o Transforms the data for storing it in proper format or structure for querying and
analysis purpose
o Loads it into the final target (database, more specifically, operational data store,
data mart, or data warehouse)
Usually all three phases execute in parallel, since the data extraction takes time: while data is being pulled, another transformation process works on the data already received and prepares it for loading, and as soon as some data is ready to be loaded into the target, the loading kicks off without waiting for the previous phases to complete.
ETL systems commonly integrate data from multiple applications (systems), typically
developed and supported by different vendors or hosted on separate computer hardware.
The disparate systems containing the original data are frequently managed and operated
by different employees.
For example, a cost accounting system may combine data from payroll, sales, and
purchasing.
o Extract
The Extract step covers the data extraction from the source system and
makes it accessible for further processing.
The main objective of the extract step is to retrieve all the required data
from the source system with as little resources as possible.
The extract step should be designed so that it does not negatively affect the source system in terms of performance, response time or any kind of locking.
o Transform
The transform step applies a set of rules to transform the data from the
source to the target.
This includes converting any measured data to the same dimension (i.e.
conformed dimension) using the same units so that they can later be
joined.
The transformation step also requires joining data from several sources,
generating aggregates, generating surrogate keys, sorting, deriving new
calculated values, and applying advanced validation rules.
o Load
During the load step, it is necessary to ensure that the load is performed
correctly and with as little resources as possible.
The target of the Load process is often a database.
In order to make the load process efficient, it is helpful to disable any
constraints and indexes before the load and enable them back only after
the load completes.
Referential integrity needs to be maintained by the ETL tool to ensure consistency.
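A minimal sketch of the three phases in R; the file names and column names (sales_raw.csv, amount, region) are hypothetical:

raw <- read.csv("sales_raw.csv")                                 # Extract from a source file
clean <- subset(raw, !is.na(amount))                             # Transform: drop incomplete rows
clean$amount_usd <- clean$amount * 1.1                           # Transform: convert to a conformed unit
agg <- aggregate(amount_usd ~ region, data = clean, FUN = sum)   # Transform: generate aggregates
write.csv(clean, "sales_clean_staging.csv", row.names = FALSE)   # stage the intermediate result
write.csv(agg, "sales_by_region_target.csv", row.names = FALSE)  # Load into the target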
Staging
o It should be possible to restart, at least, some of the phases independently from the others.
o For example, if the transformation step fails, it should not be necessary to
restart the Extract step.
o We can ensure this by implementing proper staging. Staging means that the
data is simply dumped to the location (called the Staging Area) so that it can
then be read by the next processing phase.
o The staging area is also used during ETL process to store intermediate results
of processing.
o This is fine for the ETL process, which uses the staging area for this purpose.
o However, the staging area should be accessed by the load ETL process only.
o It should never be available to anyone else, particularly not to end users, as it is not intended for data presentation to the end user.
o It may contain incomplete or in-the-middle-of-processing data.
UNIT-5
Data Visualization
Data visualization is the art and practice of gathering, analyzing, and graphically
representing empirical information.
They are sometimes called information graphics, or even just charts and graphs.
The goal of visualizing data is to tell the story in the data.
Telling the story is predicated on understanding the data at a very deep level, and
gathering insight from comparisons of data points in the numbers
Gain insight into an information space by mapping data onto graphical primitives
Provide qualitative overview of large data sets
Search for patterns, trends, structure, irregularities, and relationships among data.
Help find interesting regions and suitable parameters for further quantitative analysis.
Provide a visual proof of computer representations derived.
Pixel-Oriented Visualization Techniques
For a data set of m dimensions, create m windows on the screen, one for each dimension.
The m dimension values of a record are mapped to m pixels at the corresponding
positions in the windows.
The colors of the pixels reflect the corresponding values.
To save space and show the connections among multiple dimensions, space filling is
often done in a circle segment.
Geometric Projection Visualization Techniques
Visualization of geometric transformations and projections of the data.
Methods
Direct visualization
Scatterplot and scatterplot matrices
Landscapes
Projection pursuit technique: helps users find meaningful projections of multidimensional data
Prosection views
Hyperslice
Parallel coordinates
Scatter Plots
Scatterplot Matrices
Parallel Coordinates
Icon-Based Visualization Techniques
Chernoff Faces
Stick Figure
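Two of the geometric-projection methods listed above (the scatterplot matrix and parallel coordinates) can be sketched directly in R using the built-in iris data; parcoord() comes from the MASS package that ships with R:

pairs(iris[, 1:4], col = as.numeric(iris$Species))      # scatterplot matrix, coloured by class
library(MASS)
parcoord(iris[, 1:4], col = as.numeric(iris$Species))   # parallel coordinates plot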
For a large data set of high dimensionality, it would be difficult to visualize all
dimensions at the same time.
Hierarchical visualization techniques partition all dimensions into subsets (i.e.,
subspaces).
The subspaces are visualized in a hierarchical manner
“Worlds-within-Worlds,” also known as n-Vision, is a representative hierarchical
visualization method.
To visualize a 6-D data set with dimensions F, X1, X2, X3, X4, X5, suppose we want to observe how F changes with respect to the other dimensions. We can fix X3, X4, X5 at selected values and visualize the changes in F with respect to X1 and X2.
Visualizing Complex Data and Relations
NOTES-2
Data is usually one of several architecture domains that form the pillars of an enterprise architecture or solution architecture.
Designing a data architecture is broken down into three traditional architectural processes to be considered:
Conceptual aspects represent all business entities and their related attributes.
Logical aspects represent the logic of the relationships between entities.
Physical aspects represent the actual data mechanisms for particular types of functionality.
While a data analysis study includes developing and executing a plan for collecting data, a data analysis presumes that the data have already been collected.
Because it presumes that the data have already been collected, it includes the development and refinement of a hypothesis or question and the process of analyzing and interpreting the data.
Epicycle:
• In data analysis, the epicycle is an iterative process that is applied to each of the steps of the analysis.
• The epicycle is repeated for each step along the circumference of the entire data analysis process.
• For example, you might go through all five core activities in the course of a day, but for a large project you may also deal with each of them over the course of many months.
• Although there are many different types of activities that you might engage in while doing data analysis, every aspect of the entire process can be approached through an iterative process that we call the "epicycle of data analysis".
• More specifically, for each of the five core activities, it is critical that you engage in the following three steps:
1. Setting expectations,
2. Collecting information (data) and comparing the data to your expectations, and
3. Revising your expectations or fixing the data so that your data and your expectations match.
• Iterating through this 3-step process is what we call the "epicycle of data analysis."
Whether we are inspecting the data or executing an analysis procedure, the epicycle applies to each core activity of the data analysis process.
• Collecting Information
This step entails collecting information about the question or the data. For the
question, you collect information by performing a literature search or asking
experts in order to ensure that your question is a good one.
The results of that operation are the data we need to collect, and then determine if
the data collected, matches our expectations.
Now that you have data in hand (the check at the restaurant), the next step is to
compare your expectations to the data. There are two possible outcomes: either
your expectations of the cost match the amount on the check, or they do not.
If your expectations and the data match, terrific, you can move onto the next
activity.
If the expectations and the data do not match, there are two possible explanations for the discordance: first, the expectations were wrong and need to be revised, or second, the data were wrong and contain an error.
1. Descriptive questions:
2. Explorative questions:
An exploratory question is one in which you analyze the data to see if there are:
• patterns
• trends
3. Inferential questions
4. Predictive questions
5. Causal questions:
A causal question asks about whether changing one factor will change another
factor, on average, in a population.
6. Mechanistic questions:
Finally, a mechanistic question asks exactly how changing one factor produces a change in another; none of the question types described so far will lead to an answer of that kind.
1. Formulate question
Formulating question can be a useful way to guide the Exploratory data Analysis.
Sometimes data will come in very messy formats and we will need to do some cleaning.
Assume that you don't have any errors or warnings in your data set.
4. Look at top and bottom of your data
This is useful for checking the beginning and end of the data set, and whether the data were read properly and things are correctly formatted.
7. Make a plot
• The main purpose of a model is first to set expectations for the data and then to describe the process that generated the data, as seen by the data analyst.
• The model helps us understand the real world.
• Mean
• Standard Deviation
• Before we get to the data, let's figure out what we expect to see from the data.
• This is used to initiate the discussion about the model and what we expect from reality.
• Consider Implications
Primary Sources
Primary data can be collected through questionnaires, depth interviews, focus group interviews, case studies, experimentation and observation.
Secondary Sources
Internal Experts- These are people who are heading the various
departments. They can give an idea of how a particular thing is working.
Government Publications
It also generates All-India Consumer Price Index numbers for industrial workers, urban non-manual employees and agricultural labourers.
• Planning Commission
• Labour Bureau
This gives information on various types of activities related to the state like -
commercial activities, education, occupation etc.
• Non-Government Publications
Sensor data
• Sensor data is the output of a device that detects and responds to some type of input from the physical environment.
Lidar (which stands for Light Detection and Ranging) is a laser-based method of detection.
Smart grid Sensors - It can provide real-time data about grid conditions,
detecting outages, faults, loads and triggering alarms.
Signals
What is a signal? A signal is a function that conveys information about the behaviour of a system or the attributes of some phenomenon, for example an audio, video, speech or sensor reading.
GPS
• The United States government created the system, maintains it, and makes it freely accessible to anyone with a GPS receiver.
Data Management:
Data Management is concerned with the end-to-end life cycle of data, from
creation to retirement, and the controlled progression of data to and from each
stage within its lifecycle.
• Data mining applications are often applied to data that was collected for another purpose, or for future use.
• The first step is the detection and correction of data quality problems, and it is often called data cleaning.
Issues:
• For continuous attributes, the numerical difference of the measured and true
value is called error.
• The term data Collection error refers to errors such as omitting data objects
or attribute values.
Noise data:
• The term has often been used as a synonym for corrupt data.
• The meaning has expanded to include any data that cannot be understood
and interpreted correctly by machines, such as unstructured text.
Outliers:
Outliers are data objects that have characteristics that differ from most of the other data objects in the data set.
Missing Values:
• Example: Some people decline to give their phone numbers or age details.
• Regardless, missing values should be taken into account during the data analysis.
Replace them with the mean (if the data is numeric) or the most frequent value (if the data is categorical).
Duplicate Data:
A Data set may include data objects that are duplicates or almost duplicates
of one another.
Data quality issues can also be considered from an application viewpoint, as expressed by the statement "data is of high quality if it is suitable for its intended use". This approach to data quality has proven quite useful.
• As with quality issues at the measurement and data collection level, there are many issues that are specific to particular applications and fields.
1. Timeliness:
Example:
If the data provides a snapshot of some ongoing phenomenon or process
such as purchasing behavior of customers or web browsing patterns, then
this snapshot represents the reality for only a limited time.
If the data is out of date, then so are the models and patterns that are based
on it.
2. Relevance:
The available data must contain the information necessary for the
application.
If information about the age and gender of the driver is omitted, then it is likely that the model will have limited accuracy.
Data Preprocessing
• Data Integration: Data with different representations are put together and
conflicts within the data are resolved.
Missing values:
Noisy Data:
Binning method: the sorted values are distributed into bins, and smoothing is performed by consulting the neighbourhood of values within each bin.
Regression:
Regression fits a best-fit line to two attributes so that one attribute can be used to predict the other.
Outlier analysis:
Outliers may be detected by the Clustering.
Data integration:
Data mining often requires data integration – the merging of data from
multiple databases like data cubes, files.
Data transformation:
Normalization: scaling the attribute values so that they fall within a specified range.
Here are 4 main factors which signify the need for Data Analytics:
• Gather Hidden Insights – Hidden insights from data are gathered and then analyzed
with respect to business requirements.
• Generate Reports – Reports are generated from the data and are passed on to the
respective teams and individuals to deal with further actions for a high rise in
business.
Analytics is nowadays used in all fields, ranging from medical sciences to government activities.
R programming – This tool is the leading analytics tool used for statistics and data
modeling. R compiles and runs on various platforms such as UNIX, Windows, and
Mac OS. It also provides tools to automatically install all packages as per user-
requirement.
Tableau Public – This is a free software that connects to any data source such as
Excel, corporate Data Warehouse etc. It then creates visualizations, maps,
dashboards etc. with real-time updates on the web.
QlikView – This tool offers in-memory data processing with the results delivered to
the end-users quickly. It also offers data association and data visualization with data
being compressed to almost 10% of its original size.
Microsoft Excel – This tool is one of the most widely used tools for data analytics.
Mostly used for clients’ internal data, this tool analyzes the tasks that summarize the
data with a preview of pivot tables.
RapidMiner – A powerful, integrated platform that can integrate with any data
source types such as Access, Excel, Microsoft SQL, Tera data, Oracle, Sybase etc.
This tool is mostly used for predictive analytics, such as data mining, text analytics,
machine learning.
OpenRefine – Also known as GoogleRefine, this data cleaning software will help
you clean up data for analysis. It is used for cleaning messy data, the transformation
of data and parsing data from websites.
Apache Spark – One of the largest large-scale data processing engines, this tool executes applications in Hadoop clusters 100 times faster in memory and 10 times
faster on disk. This tool is also popular for data pipelines and machine learning
model development.
Have a clear understanding of your organization’s requirements and organize your data
properly.
Keep your data models simple. The best data modeling practice here is to use a tool
which can start small and scale up as needed.
It is highly recommended to organize your data properly using individual tables for
facts and dimensions to enable quick analysis.
Have a clear opinion on how many datasets you want to keep. Maintaining more than what is actually required wastes your data modeling effort and leads to performance issues.
It is best practice to maintain one-to-one or one-to-many relationships; the many-to-many relationship only introduces complexity into the system.
Data models become outdated quicker than you expect. It is necessary that you keep
them updated from time to time.
Data modeling is nothing but a process through which data is stored structurally in
a format in a database. Data modeling is important because it enables
organizations to make data-driven decisions and meet varied business goals.
The entire process of data modeling is not as easy as it seems, though. You are
required to have a deeper understanding of the structure of an organization and
then propose a solution that aligns with its end-goals and suffices it in achieving
the desired objectives.
Data modeling can be achieved in various ways. However, the basic concept of
each of them remains the same. Let’s have a look at the commonly used data
modeling methods:
Hierarchical model
As the name indicates, this data model makes use of hierarchy to structure the data
in a tree-like format. However, retrieving and accessing data is difficult in a
hierarchical database. This is why it is rarely used now.
Relational model
Proposed as an alternative to hierarchical model by an IBM researcher, here data is
represented in the form of tables. It reduces the complexity and provides a clear
overview of the data.
Network model
The network model is inspired by the hierarchical model. However, unlike the
hierarchical model, this model makes it easier to convey complex relationships as
each record can be linked with multiple parent records.
Object-oriented model
This database model consists of a collection of objects, each with its own features
and methods. This type of database model is also called the post-relational
database model.
Entity-relationship model
Entity-relationship model, also known as ER model, represents entities and their
relationships in a graphical format. An entity could be anything – a concept, a
piece of data, or an object.
Now that we have a basic understanding of data modeling, let’s see why it is important.
Data modeling
Data modeling (data modeling) is the process of creating a data model for the data
to be stored in a Database
This data model is a conceptual representation of Data objects, the
associations between different data objects and the rules.
Data modeling helps in the visual representation of data and enforces
business rules, regulatory compliances, and government policies on the data.
Data models ensure consistency in naming conventions, default values,
semantics, and security while ensuring quality of the data.
Data model emphasizes on what data is needed and how it should be
organized instead of what operations need to be performed on the data.
Data model is like architect’s building plan which helps to build a
conceptual model and set the relationship between data items.
The primary goals of using data model are:
Ensures that all data objects required by the database are accurately represented. Omission of data will lead to the creation of faulty reports and produce incorrect results.
A data model helps design the database at the conceptual, physical and
logical levels.
Data model structure helps to define the relational tables, primary and foreign keys and stored procedures.
It provides a clear picture of the base data and can be used by database
developers to create a physical database.
It is also helpful to identify missing and redundant data.
Though the initial creation of data model is labor and time consuming, in the
long run, it makes your IT infrastructure upgrade and maintenance cheaper
and faster.
There are mainly three different types of data models:
Conceptual: This Data model defines WHAT the system contains. This
model is typically created by Business stakeholders and Data Architects.
The purpose is to organize scope and define business concepts and rules.
Logical: Defines HOW the system should be implemented regardless of the DBMS. This model is typically created by Data Architects and Business Analysts. The purpose is to develop a technical map of rules and data structures.
Physical: This Data model describes HOW the system will be implemented
using a specific DBMS system. This model is typically created by DBA and
developers. The purpose is actual implementation of the database.
Missing Imputations:
• Missing data is a common problem in practical data analysis. In datasets, missing
values could be represented as ‘?’, ‘nan’, ’N/A’, blank cell, or sometimes ‘-999’,
’inf’, ‘-inf’.
Missing Imputation simply means that we replace the missing values with some
guessed/estimated ones.
In R, missing values are represented by the symbol NA (not available).
Impossible values (e.g., dividing by zero) are represented by the symbol NaN (not
a number). Unlike SAS, R uses the same symbol for character and numeric data.
For example, we define y and then check whether there are any missing values; TRUE means a value is missing:

y <- c(1, 2, 3, NA)
is.na(y)              # returns the vector (FALSE FALSE FALSE TRUE)

Arithmetic functions on missing values yield missing values. For example:

x <- c(1, 2, NA, 3)
mean(x)               # returns NA

To remove missing values from our dataset we use the na.omit() function. For example, we can create a new dataset without missing data as below:

newdata <- na.omit(mydata)

Or we can pass na.rm=TRUE as an argument to the operator. Using na.rm on the example above gives the desired result:

x <- c(1, 2, NA, 3)
mean(x, na.rm = TRUE) # returns 2
• The study of missing data was formalized with the concept of missing data mechanisms.
• Missing data mechanism describes the underlying mechanism that generates
missing data and can be categorized into three types — missing completely at
random (MCAR), missing at random (MAR), and missing not at random
(MNAR).
Types of missing data:
• Understanding the reasons why data are missing is important for handling the
remaining data correctly. If values are missing completely at random, the data
sample is likely still representative of the population. But if the values are
missing systematically, analysis may be biased. For example, in a study of the
relation between IQ and income, if participants with an above-average IQ tend to skip the question "What is your salary?", analyses that do not take this missingness into account may falsely fail to find a positive association between IQ and salary. Because of these problems, methodologists routinely advise researchers to design studies to minimize the occurrence of missing values.
Graphical models can be used to describe the missing data mechanism in detail.
• In the case of MCAR, the missingness of data is unrelated to any study variable:
thus, the participants with completely observed data are in effect a random
sample of all the participants assigned a particular intervention. With MCAR, the
random assignment of treatments is assumed to be preserved, but that is usually
an unrealistically strong assumption in practice.
Missing at random:
• Missing at random (MAR) occurs when the missingness is not random, but where
missingness can be fully accounted for by variables where there is complete
information. MAR is an assumption that is impossible to verify statistically, we
must rely on its substantive reasonableness.[8] An example is that males are less
likely to fill in a depression survey but this has nothing to do with their level of
depression, after accounting for maleness. Depending on the analysis method,
these data can still induce parameter bias in analyses due to the contingent
emptiness of cells (male, very high depression may have zero entries). However,
if the parameter is estimated with Full Information Maximum Likelihood, MAR
will provide asymptotically unbiased estimates.
Mean Imputations:
Median Imputation:
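Mean imputation replaces each missing value with the mean of the observed values of the variable; median imputation uses the median instead, which is more robust to outliers. A minimal sketch in R on a hypothetical numeric vector:

x <- c(4, 8, NA, 6, NA, 10)
x_mean   <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)     # mean imputation
x_median <- ifelse(is.na(x), median(x, na.rm = TRUE), x)   # median imputation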
Regression analysis estimates the relationship between two or more variables. Let’s
understand this with an easy example:
Let’s say, you want to estimate growth in sales of a company based on current
economic conditions. You have the recent company data which indicates that the
growth in sales is around two and a half times the growth in the economy. Using this
insight, we can predict future sales of the company based on current & past
information.
There are multiple benefits of using regression analysis. They are as follows:
Linear Regression
Logistic Regression
Polynomial Regression
Ridge Regression
Lasso Regression
1. Linear Regression
It is one of the most widely known modeling techniques. Linear regression is usually among the first few topics which people pick while learning predictive modeling. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the nature of the regression line is linear.
The relationship between the two variables can be of three types:
(i) Positive relationship
(ii) Negative relationship
(iii) No relationship
Where
y is dependent variable
x is independent variable
b is the slope --> how much the line rises for each unit increase in x
a is the y-intercept --> the value of y when x = 0.
Logistic Regression
Logistic regression is used to solve classification problems, so it is called a classification algorithm; it models the probability of the output class.
It is a classification problem where your target element is categorical
Unlike in linear regression, in logistic regression the required output is represented in discrete values like binary 0 and 1.
It estimates relationship between a dependent variable (target) and one or more
independent variable (predictors) where dependent variable is categorical/nominal.
Logistic regression is a supervised learning classification algorithm used to predict
the probability of a dependent variable.
The nature of target or dependent variable is dichotomous(binary), which means
there would be only two possible classes.
In simple words, the dependent variable is binary in nature having data coded as
either 1 (stands for success/yes) or 0 (stands for failure/no), etc. but instead of giving
the exact value as 0 and 1, it gives the probabilistic values which lie between 0 and
1.
Logistic regression is quite similar to linear regression except in how it is used: linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, which predicts two maximum values (0 or 1).
Sigmoid Function:
It is the logistic expression especially used in Logistic Regression.
The sigmoid function converts any line into a curve which has discrete values like
binary 0 and 1.
In this session let's see how a continuous linear regression can be manipulated and converted into a logistic classifier.
The sigmoid function is a mathematical function used to map the predicted values
to probabilities.
It maps any real value into another value within a range of 0 and 1.
The value of the logistic regression must be between 0 and 1, which cannot go
beyond this limit, so it forms a curve like the "S" form. The S-form curve is called
the Sigmoid function or the logistic function.
In logistic regression we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold are mapped to 1, and values below the threshold are mapped to 0.
P = 1 / (1 + e^(−Y)), where P represents the probability of the output class and Y represents the predicted output (the linear combination of the inputs).
Example (class label and input value):
Class (0/1)    Input value x
0              4.2
0              5.1
0              5.5
1              8.2
1              9.0
1              9.9
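A minimal sketch of the sigmoid and the 0.5 threshold applied to input values like those above; the coefficients b0 and b1 are hypothetical, chosen only to illustrate the idea:

sigmoid <- function(z) 1 / (1 + exp(-z))
b0 <- -14; b1 <- 2               # hypothetical coefficients
x <- c(4.2, 5.5, 8.2, 9.9)       # input values from the example
p <- sigmoid(b0 + b1 * x)        # predicted probabilities, all between 0 and 1
ifelse(p > 0.5, 1, 0)            # classify with the 0.5 threshold: 0 0 1 1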
Polynomial Regression
o Polynomial Regression is a type of regression which models the non-linear
dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between
the value of x and corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a
non-linear fashion, so for such case, linear regression will not best fit to those
datapoints. To cover such datapoints, we need Polynomial regression.
o In Polynomial regression, the original features are transformed into
polynomial features of given degree and then modeled using a linear model.
Which means the datapoints are best fitted using a polynomial line.
o The equation for polynomial regression is also derived from the linear regression equation: the linear regression equation Y = b0 + b1x is transformed into the polynomial regression equation Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.
o Here Y is the predicted/target output, b0, b1,... bn are the regression
coefficients. x is our independent/input variable.
o The model is still linear because the coefficients b0, b1, ..., bn enter linearly, even though the terms in x are quadratic or of higher degree.
When we compare the above equations, we can clearly see that all of them are polynomial equations that differ only in the degree of the variables.
The simple and multiple linear equations are also polynomial equations of degree one, and the polynomial regression equation is a linear equation of the nth degree.
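A minimal sketch of polynomial regression in R on hypothetical non-linear data; the fit is still a linear model in the coefficients:

x <- 1:10
y <- 3 + 0.5 * x^2 + rnorm(10)               # quadratic pattern plus noise
fit_poly <- lm(y ~ poly(x, 2, raw = TRUE))   # Y = b0 + b1*x + b2*x^2
coef(fit_poly)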
Stepwise Regression
• This form of regression is used when we deal with multiple independent
variables. In this technique, the selection of independent variables is done
with the help of an automatic process, which involves no human intervention.
• Stepwise regression basically fits the regression model by adding/dropping
co-variates one at a time based on a specified criterion. Some of the most
commonly used Stepwise regression methods are listed below:
Standard stepwise regression does two things: it adds and removes predictors as needed at each step.
Forward selection starts with the most significant predictor in the model and adds a variable at each step.
Backward elimination starts with all predictors in the model and removes the least significant variable at each step.
The aim of this modeling technique is to maximize the prediction power with the minimum number of predictor variables. It is one of the methods used to handle the high dimensionality of a data set.
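A minimal sketch of stepwise selection in R using base R's step() on the built-in mtcars data; note that step() selects by AIC rather than the F-probability criterion described above:

full <- lm(mpg ~ ., data = mtcars)           # start with all predictors
sel  <- step(full, direction = "backward")   # backward elimination
formula(sel)                                 # predictors retained in the final model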
Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression in
which a small amount of bias is introduced so that we can get better long term
predictions.
o The amount of bias added to the model is known as the ridge regression penalty. We can compute this penalty term by multiplying lambda by the squared weight of each individual feature.
o The equation (cost function) for ridge regression will be: minimize Σi (yi − ŷi)² + λ Σj βj²
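A minimal sketch of ridge regression in R using MASS::lm.ridge (the MASS package ships with R); the predictors and the lambda grid are illustrative:

library(MASS)
fit_ridge <- lm.ridge(mpg ~ wt + hp + disp, data = mtcars, lambda = seq(0, 10, by = 0.5))
select(fit_ridge)   # reports the lambda suggested by criteria such as GCV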
WHAT IS AN ESTIMATOR?
• In statistics, an estimator is a rule for calculating an estimate of a given quantity
based on observed data
• Example-
i. X follows a normal distribution, but we do not know the parameters of our distribution, namely the mean (μ) and variance (σ²)
ii. To estimate the unknowns, the usual procedure is to draw a random sample
of size ‘n’ and use the sample data to estimate parameters.
TWO TYPES OF ESTIMATORS
• Point Estimators A point estimate of a population parameter is a single value of a
statistic. For example, the sample mean x is a point estimate of the population mean
μ. Similarly, the sample proportion p is a point estimate of the population proportion
P.
• Interval Estimators An interval estimate is defined by two numbers, between
which a population parameter is said to lie. For example, a < x < b is an interval
estimate of the population mean μ. It indicates that the population mean is greater
than a but less than b.
PROPERTIES OF BLUE
• B-BEST
• L-LINEAR
• U-UNBIASED
• E-ESTIMATOR
An estimator is BLUE if the following hold:
1. It is linear (Regression model)
2. It is unbiased
3. It is an efficient estimator(unbiased estimator with least variance)
LINEARITY
• An estimator is said to be a linear estimator of (β) if it is a linear function of the
sample observations
• Sample mean is a linear estimator because it is a linear function of the X values.
UNBIASEDNESS
• A desirable property of a distribution of estimates is that its mean equals the true
mean of the variables being estimated
• Formally, an estimator is an unbiased estimator if the expected value of its sampling distribution equals the true value of the population parameter.
• We write this as follows: E(β̂) = β
• If this is not the case, we say that the estimator is biased: Bias = E(β̂) − β
MINIMUM VARIANCE
• Just as we want the mean of the sampling distribution to be centered around the true population value, so too it is desirable for the sampling distribution to be as narrow (or precise) as possible.
– Centering around "the truth" but with high variability might be of very little use.
• One way of narrowing the sampling distribution is to increase the sample size.
Imagine you have some points and want to have a line that best fits them.
We can place the line "by eye": try to have the line as close as possible to all points,
and a similar number of points above and below the line.
But for better accuracy let's see how to calculate the line using Least Squares
Regression.
The Line
Our aim is to calculate the values m (slope) and b (y-intercept) in the equation of a
line :
y = mx + b
Where:
y = how far up
x = how far along
m = Slope or Gradient (how steep the line is)
b = the Y Intercept (where the line crosses the Y axis)
Steps
To find the least squares line for N points:
Step 1: For each (x, y) point calculate x² and xy.
Step 2: Sum all x, y, x² and xy, giving Σx, Σy, Σx² and Σxy.
Step 3: Calculate the slope m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)
Step 4: Calculate the intercept b = (Σy − m Σx) / N
Step 5: Assemble the equation of the line: y = mx + b
Done!
Example
Example: Sam found how many hours of sunshine vs how many ice creams were
sold at the shop from Monday to Friday:
"x" "y"
Hours of Ice Creams
Sunshine Sold
2 4
3 5
5 7
7 10
9 15
Let us find the best m (slope) and b (y-intercept) for the line y = mx + b that suits this data.
x        y        x²        xy
2        4        4         8
3        5        9         15
5        7        25        35
7        10       49        70
9        15       81        135
Σx = 26  Σy = 41  Σx² = 168 Σxy = 263

m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)
  = (5 × 263 − 26 × 41) / (5 × 168 − 26²)
  = (1315 − 1066) / (840 − 676)
  = 249 / 164
  = 1.5183...

b = (Σy − m Σx) / N
  = (41 − 1.5183 × 26) / 5
  = 0.3049...

y = mx + b
y = 1.518x + 0.305
x    y (observed)    y = 1.518x + 0.305 (predicted)    error (predicted − observed)
2    4               3.34                              −0.66
3    5               4.86                              −0.14
5    7               7.89                              0.89
7    10              10.93                             0.93
9    15              13.97                             −1.03
Here are the (x,y) points and the line y = 1.518x + 0.305 on a graph:
Sam hears the weather forecast which says "we expect 8 hours of sun tomorrow", so he uses the above equation to estimate that he will sell y = 1.518 × 8 + 0.305 ≈ 12.45, i.e. about 12 or 13 ice creams.
Sam makes fresh waffle cone mixture for 14 ice creams just in case. Yum.
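The hand calculation above can be checked in R with lm(), which fits the same least-squares line:

sunshine  <- c(2, 3, 5, 7, 9)
icecreams <- c(4, 5, 7, 10, 15)
fit <- lm(icecreams ~ sunshine)
coef(fit)                                # intercept ≈ 0.305, slope ≈ 1.518
predict(fit, data.frame(sunshine = 8))   # ≈ 12.45 ice creams for 8 hours of sun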
It works by making the total of the square of the errors as small as possible (that is
why it is called "least squares"):
You can imagine (but not accurately) each data point connected to a straight bar by
springs:
Outliers
Be careful! Least squares is sensitive to outliers. A strange value will pull the line
towards it.
Variable Rationalization:
Method selection allows you to specify how independent variables are entered into
the analysis. Using different methods, you can construct a variety of regression
models from the same set of variables.
Enter (Regression). A procedure for variable selection in which all variables in a
block are entered in a single step.
Stepwise. At each step, the independent variable not in the equation that has the
smallest probability of F is entered, if that probability is sufficiently small. Variables
already in the regression equation are removed if their probability of F becomes
sufficiently large. The method terminates when no more variables are eligible for
inclusion or removal.
Remove. A procedure for variable selection in which all variables in a block are
removed in a single step.
Backward Elimination. A variable selection procedure in which all variables are
entered into the equation and then sequentially removed. The variable with the
smallest partial correlation with the dependent variable is considered first for
removal. If it meets the criterion for elimination, it is removed. After the first variable is
removed, the variable remaining in the equation with the smallest partial
correlation is considered next. The procedure stops when there are no variables in
the equation that satisfy the removal criteria.
Forward Selection. A stepwise variable selection procedure in which variables are
sequentially entered into the model. The first variable considered for entry into the
equation is the one with the largest positive or negative correlation with the
dependent variable.
Model Building:
In regression analysis, model building is the process of developing a probabilistic
model that best describes the relationship between the dependent and independent
variables. The major issues are finding the proper form (linear or curvilinear) of the
relationship and selecting which independent variables to include. In building
models it is often desirable to use qualitative as well as quantitative variables. As
noted above, quantitative variables measure how much or how many; qualitative
variables represent types or categories. For instance, suppose it is of interest to
predict sales of an iced tea that is available in either bottles or cans. Clearly, the
independent variable “container type” could influence the dependent variable
“sales.” Container type is a qualitative variable, however, and must be assigned
numerical values if it is to be used in a regression study. So-called dummy variables
are used to represent qualitative variables in regression analysis. For example, the
dummy variable x could be used to represent container type by setting x = 0 if the
iced tea is packaged in a bottle and x = 1 if the iced tea is in a can. If the beverage
could be placed in glass bottles, plastic bottles, or cans, it would require two dummy
variables to properly represent the qualitative variable container type. In general, k -
1 dummy variables are needed to model the effect of a qualitative variable that may
assume k values.
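A minimal sketch of dummy coding for the iced-tea illustration; the sales figures below are hypothetical:

sales     <- c(120, 95, 130, 90, 110, 100)
container <- factor(c("bottle", "can", "bottle", "can", "bottle", "can"))
fit <- lm(sales ~ container)   # R creates the 0/1 dummy variable automatically
model.matrix(fit)              # shows the dummy column (containercan = 0 or 1)
coef(fit)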
Sigmoid Function:
Sigmoid curve
The sigmoid function also called the logistic function gives an ‘S’ shaped curve that
can take any real-valued number and map it into a value between 0 and 1. If the curve
goes to positive infinity, y predicted will become 1, and if the curve goes to negative
infinity, y predicted will become 0. If the output of the sigmoid function is more than
0.5, we can classify the outcome as 1 or YES, and if it is less than 0.5, we can classify
it as 0 or NO. If the output is 0.75, we can say in terms of probability: there is a
75 percent chance that the patient will suffer from cancer.
The sigmoid function σ(x) = 1 / (1 + e^(−x)) approaches 0 as x approaches −∞ and approaches 1 as x approaches +∞.
Thus, if the output is more than 0.5, we can classify the outcome as 1 (or YES), and if it is less than 0.5, we can classify it as 0 (or NO).
For example: If the output is 0.65, we can say in terms of probability as:
“There is a 65 percent chance that your favorite cricket team is going to win today ”.
For a binary regression, the factor level 1 of the dependent variable should
represent the desired outcome.
The independent variables should be independent of each other. That is, the
model should have little or no multicollinearity.
The measure of total variation, SST, is the sum of the squared deviations of the dependent variable about its mean: Σ(y − ȳ)². This quantity is known as the total sum of squares. The measure of unexplained variation, SSE, is referred to as the residual sum of squares: SSE is the sum of the squared distances from each point to the estimated regression line, Σ(y − ŷ)². SSE is also commonly referred to as the error sum of squares. A key result in the analysis of variance is that SSR + SSE = SST.
The ratio r² = SSR/SST is called the coefficient of determination. If the data points are clustered closely about the estimated regression line, the value of SSE will be small and SSR/SST will be close to 1. Using r², whose values lie between 0 and 1, provides a measure of goodness of fit; values closer to 1 imply a better fit. A value of r² = 0 implies that there is no linear relationship between the dependent and independent variables.
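These quantities are easy to compute directly in R for a fitted simple regression (hypothetical data):

x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 4, 5, 4, 6, 7)
fit <- lm(y ~ x)
sst <- sum((y - mean(y))^2)       # total sum of squares
sse <- sum(residuals(fit)^2)      # residual (error) sum of squares
ssr <- sst - sse                  # regression sum of squares, since SSR + SSE = SST
r2  <- ssr / sst                  # coefficient of determination
c(r2, summary(fit)$r.squared)     # the two values agree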
A widely used model-fit statistic is the Hosmer–Lemeshow test, HL = Σg (Og − Eg)² / [Ng πg (1 − πg)], where Og, Eg, Ng, and πg denote the observed events, expected events, number of observations, and mean predicted risk for the gth risk decile group, and G is the number of groups.
Model Construction:
One Model Building Strategy
We've talked before about the "art" of model building. Unsurprisingly, there are
many approaches to model building, but here is one strategy—consisting of seven
steps—that is commonly used when building a regression model.
On a univariate basis, check for outliers, gross data errors, and missing
values.
Study bivariate relationships to reveal other outliers, to suggest possible
transformations, and to identify possible multicollinearities.
I can't possibly over-emphasize the importance of this step. There's not a data
analyst out there who hasn't made the mistake of skipping this step and later
regretting it when a data point was found in error, thereby nullifying hours of
work.
The training set, with at least 15-20 error degrees of freedom, is used to
estimate the model.
The validation set is used for cross-validation of the fitted model.
The fifth step: using the training set, identify several candidate models.
A classification algorithm may predict a continuous value, but the continuous value
is in the form of a probability for a class label.
A regression algorithm may predict a discrete value, but the discrete value is in the form of an integer quantity.
Some algorithms can be used for both classification and regression with small
modifications, such as decision trees and artificial neural networks. Some algorithms
cannot, or cannot easily be used for both problem types, such as linear regression for
regression predictive modeling and logistic regression for classification predictive
modeling.
Importantly, the way that we evaluate classification and regression predictions varies
and does not overlap, for example:
Supervised learning is the type of machine learning in which machines are trained using well "labelled" training data, and on the basis of that data, the machines predict the output. Labelled data means that some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machines works as the supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.
How Supervised Learning Works?
In supervised learning, models are trained using labelled dataset, where the model
learns about each type of data. Once the training process is completed, the model is tested on the basis of test data (a held-out subset of the data), and then it predicts the output.
The working of supervised learning can be easily understood by the example below:
Suppose we have a dataset of different types of shapes which includes square,
rectangle, triangle, and Polygon. Now the first step is that we need to train the model
for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be
labelled as a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is
to identify the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
1. Regression
Regression algorithms are used if there is a relationship between the input variable
and the output variable. It is used for the prediction of continuous variables, such as
Weather forecasting, Market Trends, etc. Below are some popular Regression
algorithms which come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which
means there are two classes such as Yes-No, Male-Female, True-false, etc.
Spam Filtering,
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
In the previous topic, we learned supervised machine learning in which models are
trained using labeled data under the supervision of training data. But there may be
many cases in which we do not have labeled data and need to find the hidden patterns
from the given dataset. So, to solve such types of cases in machine learning, we need
unsupervised learning techniques.
Below are some main reasons which describe the importance of Unsupervised
Learning:
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much like how a human learns to think from their own experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes unsupervised learning more important.
o In real-world, we do not always have input data with the corresponding output
so to solve such cases, we need unsupervised learning.
Once it applies the suitable algorithm, the algorithm divides the data objects into groups according to the similarities and differences between the objects.
The unsupervised learning algorithm can be further categorized into two types of
problems:
o Clustering: Clustering is a method of grouping the objects into clusters such that objects with the most similarities remain in a group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities (see the short K-means sketch after this list).
o Association: An association rule is an unsupervised learning method which is
used for finding the relationships between variables in the large database. It
determines the set of items that occurs together in the dataset. Association rule
makes marketing strategy more effective. For example, people who buy item X (say, bread) also tend to purchase item Y (butter/jam). A typical example of an association rule is Market Basket Analysis.
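A minimal K-means sketch in R on the numeric columns of the built-in iris data:

set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
table(cluster = km$cluster, species = iris$Species)   # compare clusters with the known species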
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchal clustering
o Anomaly detection
o Neural Networks
o Principle Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
Advantages of Unsupervised Learning
Computational
Decision Trees
There are various algorithms in machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine learning
model. Two common reasons for using a decision tree are that it mimics human
decision-making, so it is easy to understand, and that its logic can be read directly from
the tree-like structure.
In a decision tree, for predicting the class of the given dataset, the algorithm starts
from the root node of the tree. This algorithm compares the values of root attribute
with the record (real dataset) attribute and, based on the comparison, follows the
branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other
sub-nodes and moves further. It continues this process until it reaches a leaf node
of the tree. The complete process can be better understood using the below
algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best
attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset
created in Step-3. Continue this process until a stage is reached where the
nodes cannot be classified further; such a final node is called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or not. To solve this problem, the decision
tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits
further into the next decision node (distance from the office) and one leaf node based on
the corresponding labels. The next decision node further splits into one decision
node (cab facility) and one leaf node. Finally, that decision node splits into two leaf
nodes (Accepted offer and Declined offer). Consider the below diagram:
While implementing a decision tree, the main issue is how to select the best
attribute for the root node and for the sub-nodes. To solve this problem there is a
technique called Attribute Selection Measure (ASM). With this
measure, we can easily select the best attribute for the nodes of the tree. There
are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the
decision tree.
o A decision tree algorithm always tries to maximize the value of information
gain, and the node/attribute having the highest information gain is split first. It
can be calculated using the formula:
Information Gain = Entropy(S) - [(Weighted Avg) x Entropy(each feature)]
where, for a two-class problem, Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no),
S is the set of samples at the node, and P(yes), P(no) are the proportions of the two classes in S.
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision
tree in the CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the
high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to
create binary splits.
o The Gini index can be calculated using the formula: Gini Index = 1 - Σ (Pj)²,
where Pj is the proportion of class j at the node. A worked sketch of both measures follows.
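As a worked illustration (not from the original notes), the functions below compute entropy, information gain and the Gini index for small label lists, assuming NumPy is available; the example labels are made up.

import numpy as np

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(S) = 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def information_gain(parent_labels, subsets):
    # IG = Entropy(parent) - weighted average entropy of the subsets after the split
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

parent = ['yes', 'yes', 'no', 'no', 'yes', 'no']
left, right = ['yes', 'yes', 'yes'], ['no', 'no', 'no']     # a candidate split
print(information_gain(parent, [left, right]))              # 1.0 for this perfectly pure split
print(gini(parent))                                         # 0.5 for a 50/50 class mix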
Before understanding overfitting and underfitting, let's understand some basic
terms that will help in understanding this topic well:
o Signal: It refers to the true underlying pattern of the data that helps the
machine learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance
of the model.
o Bias: Bias is a prediction error that is introduced in the model due to
oversimplifying the machine learning algorithm; in other words, it is the difference
between the predicted values and the actual values.
o Variance: Variance occurs when the machine learning model performs well on the
training dataset but does not perform well on the test dataset.
Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points,
or more data points than required, in the given dataset. Because of this,
the model starts capturing the noise and inaccurate values present in the dataset, and all
these factors reduce the efficiency and accuracy of the model. An overfitted model
has low bias and high variance.
Example: The concept of overfitting can be understood from the graph of a linear
regression output shown below:
As we can see from the graph, the model tries to cover all the data points
present in the scatter plot. It may look efficient, but in reality it is not. The
goal of the regression model is to find the best-fit line; a line that chases every point
does not generalize, so it will generate prediction errors on new data.
Both overfitting and underfitting degrade the performance of the machine learning model,
but the more common problem in practice is overfitting. There are several ways by which
we can reduce the occurrence of overfitting in our model (a small sketch follows the list):
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
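As an illustration of two of the remedies above (not from the original notes), the sketch below assumes scikit-learn and NumPy are available and compares a plain linear regression with a regularized (Ridge) model using 5-fold cross-validation on synthetic data.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(50, 10)                       # small, noisy synthetic dataset
y = X[:, 0] * 3 + rng.randn(50) * 0.5      # only the first feature actually matters

plain = cross_val_score(LinearRegression(), X, y, cv=5).mean()
regularized = cross_val_score(Ridge(alpha=1.0), X, y, cv=5).mean()
print(plain, regularized)                  # compare average scores; regularization often helps generalization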
Underfitting
Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data. For example, to avoid overfitting, the feeding of training
data can be stopped at an early stage, due to which the model may not learn enough
from the training data and may fail to find the best fit for the dominant
trend in the data.
In the case of underfitting, the model is not able to learn enough from the training
data, and hence its accuracy is reduced and it produces unreliable predictions.
As we can see from the above diagram, the model is unable to capture the data
points present in the plot.
Goodness of Fit
The "Goodness of fit" term is taken from the statistics, and the goal of the machine
learning models to achieve the goodness of fit. In statistics modeling, it defines how
closely the result or predicted values match the true values of the dataset.
The model with a good fit is between the underfitted and overfitted model, and
ideally, it makes predictions with 0 errors, but in practice, it is difficult to achieve it.
A too-large tree increases the risk of overfitting, while a small tree may not capture all
the important features of the dataset. A technique that decreases the size
of the learned tree without reducing accuracy is known as Pruning. There are mainly
two types of tree pruning techniques used (a cost-complexity example follows the list):
o Cost Complexity Pruning
o Reduced Error Pruning.
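A minimal sketch of cost complexity pruning, assuming scikit-learn is available; the ccp_alpha value and the bundled breast-cancer dataset are only illustrative choices, not part of the original notes.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

# Pruning shrinks the tree while keeping test accuracy comparable
print(full_tree.get_n_leaves(), full_tree.score(X_test, y_test))
print(pruned_tree.get_n_leaves(), pruned_tree.score(X_test, y_test))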
Random forest is a machine learning algorithm that is used for classification,
regression, etc. During training, multiple decision trees are built, and the output is
the majority vote (for classification) or the mean/average prediction (for regression)
of the individual trees.
Randomness enters in two ways. The first is the random selection of the data points
(bootstrap samples) used for training each of the trees. The second is the random
selection of the features used to build every tree. Since a single decision tree tends to
overfit the data, this randomness results in multiple decision trees, each of which has
good accuracy on a different subset of the available training data.
At prediction time, every test sample is passed to all the trees, and the class of the
test sample is decided by majority voting (or by averaging the individual predictions
for regression). A minimal sketch follows.
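The sketch below, assuming scikit-learn is available, trains a random forest on the bundled Iris dataset; the number of trees and the dataset are illustrative choices only.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample with a random subset of features per split
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.predict(X_test[:5]))      # class decided by majority vote of the trees
print(forest.score(X_test, y_test))    # accuracy on the held-out test set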
CHAID:
CHAID stands for CHI-squared Automatic Interaction Detector.
Morgan and Sonquist (1963) proposed a simple method for fitting
trees to predict a quantitative variable. They called the method AID,
for Automatic Interaction Detection. The algorithm performs stepwise
splitting. It begins with a single cluster of cases and searches a
candidate set of predictor variables for a way to split this cluster into
two clusters. Each predictor is tested for splitting as follows: sort all
the n cases on the predictor and examine all n-1 ways to split the
cluster in two. For each possible split, compute the within-cluster sum
of squares about the mean of the cluster on the dependent variable.
Choose the best of the n-1 splits to represent the predictor’s
contribution. Now do this for every other predictor. For the actual
split, choose the predictor and its cut point which yields the smallest
overall within-cluster sum of squares. Categorical predictors require
a different approach. Since categories are unordered, all possible splits
between categories must be considered. For deciding on one split of k
categories into two groups, this means that 2^(k-1) - 1 possible splits must
be considered. Once a split is found, its suitability is measured on the
same within-cluster sum of squares as for a quantitative predictor.
Morgan and Sonquist called their algorithm AID because it naturally
incorporates interaction among predictors. Interaction is not
correlation. It has to do instead with conditional discrepancies. In the
analysis of variance, interaction means that a trend within one level of
a variable is not parallel to a trend within another level of the same
variable. In the ANOVA model, interaction is represented by cross-
products between predictors. In the tree model, it is represented by
branches from the same nodes which have different splitting
predictors further down the tree.
Regression trees parallel regression/ANOVA modeling, in which the
dependent variable is quantitative. Classification trees parallel
discriminant analysis and algebraic classification methods. Kass
(1980) proposed a modification to AID called CHAID for categorized
dependent and independent variables.
His algorithm incorporated a sequential merge and split procedure
based on a chi-square test statistic. Kass was concerned about
computation time (although this has since proved an unnecessary
worry), so he decided to settle for a sub-optimal split on each
predictor instead of searching for all possible combinations of the
categories. Kass’s algorithm is like sequential cross-tabulation. For
each predictor:
1) cross tabulate the m categories of the predictor with the k categories of the
dependent variable,
2) find the pair of categories of the predictor whose 2xk sub-
table is least significantly different on a chi-square test and
merge these two categories;
3) if the chi-square test statistic for the merged pair is not "significant"
according to a preset critical value, repeat this merging process for the
selected predictor until no non-significant chi-square is found for any
sub-table; then pick the predictor variable whose chi-square is largest and
split the sample into l subsets, where l is the number of categories
resulting from the merging process on that predictor;
4) Continue splitting, as with AID, until no “significant” chi-squares result.
The CHAID algorithm saves some computer time, but it is not
guaranteed to find the splits which predict best at a given step. Only
by searching all possible category subsets can we do that. CHAID is
also limited to categorical predictors, so it cannot be used for
quantitative or mixed categorical- quantitative models.
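CHAID itself is not implemented here; the sketch below only illustrates the pairwise chi-square comparison used in the merging step, assuming SciPy and NumPy are available and using a made-up cross-tabulation.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical cross-tabulation: rows = categories of a predictor,
# columns = categories of the dependent variable.
table = np.array([[30, 10],
                  [28, 12],
                  [10, 30]])

# CHAID-style check: test each pair of predictor categories with a chi-square test
# and consider merging the pair that is least significantly different (largest p-value).
best_pair, best_p = None, -1.0
for i in range(len(table)):
    for j in range(i + 1, len(table)):
        chi2, p, dof, expected = chi2_contingency(table[[i, j]])
        if p > best_p:
            best_pair, best_p = (i, j), p

print(best_pair, best_p)   # rows 0 and 1 look most similar, so they are the merge candidates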
ARIMA:
A popular and widely used statistical method for time series forecasting is the ARIMA model.
ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a class of
model that captures a suite of different standard temporal structures in time series data.
In this tutorial, you will discover how to develop an ARIMA model for time series
forecasting in Python.
After completing this tutorial, you will know:
About the ARIMA model, the parameters used, and the assumptions made by the model.
How to fit an ARIMA model to data and use it to make forecasts.
How to configure the ARIMA model on your time series problem.
This acronym is descriptive, capturing the key aspects of the model itself. Briefly, they are:
p: The number of lag observations included in the model, also called the lag order.
d: The number of times that the raw observations are differenced, also called the degree
of differencing.
q: The size of the moving average window, also called the order of moving average.
A linear regression model is constructed including the specified number and type of terms, and the data is
prepared by a degree of differencing in order to make it stationary, i.e. to remove trend and seasonal
structures that negatively affect the regression model.
A value of 0 can be used for a parameter, which indicates not to use that element of the model. This
way, the ARIMA model can be configured to perform the function of an ARMA model, and even a
simple AR, I, or MA model.
Adopting an ARIMA model for a time series assumes that the underlying process that generated
the observations is an ARIMA process. This may seem obvious, but it helps to motivate the need to
confirm the assumptions of the model in the raw observations and in the residual errors of forecasts
from the model.
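A minimal sketch of fitting an ARIMA(p, d, q) model in Python, assuming the statsmodels and pandas libraries are installed; the series values and the order (1, 1, 1) are made up for illustration.

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly series; in practice, load your own time series here.
data = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
                 index=pd.date_range("2020-01-01", periods=12, freq="MS"))

model = ARIMA(data, order=(1, 1, 1))    # p=1 lag term, d=1 difference, q=1 moving-average term
fitted = model.fit()
print(fitted.summary())                 # coefficients and fit statistics
print(fitted.forecast(steps=3))         # forecast the next three periods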
Forecast error
The forecast error for a single period t is the difference between the actual and the
forecast value, e_t = A_t - F_t. However, the error for one time period does not tell us
very much; we need to measure forecast accuracy over time. The most commonly used
error measures are the mean forecast error (MFE), the mean absolute deviation (MAD)
and the mean squared error (MSE).
1. Mean Forecast Error (MFE)
For n time periods where we have actual demand and forecast values:
MFE = (1/n) Σ (A_t - F_t)
Ideal value = 0;
MFE > 0: the model tends to under-forecast;
MFE < 0: the model tends to over-forecast.
2. Mean Absolute Deviation (MAD)
For n time periods where we have actual demand and forecast values:
MAD = (1/n) Σ |A_t - F_t|
MAD is the average of the absolute errors. While MFE is a measure of forecast model bias,
MAD indicates the absolute size of the errors.
3. Mean Squared Error (MSE)
MSE = (1/n) Σ (A_t - F_t)²
Uses of forecast error: error measures are used to check a forecasting model for bias, to
compare candidate forecasting models, to tune model parameters, and to monitor a model
over time so that it can be revised when its accuracy degrades. A small numeric sketch of
the error measures follows.
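A small numeric sketch of the three error measures, assuming NumPy is installed; the actual and forecast values are made up.

import numpy as np

actual   = np.array([100, 110, 120, 130, 140])
forecast = np.array([ 98, 112, 118, 135, 138])

errors = actual - forecast
mfe = errors.mean()                 # Mean Forecast Error (bias)
mad = np.abs(errors).mean()         # Mean Absolute Deviation
mse = (errors ** 2).mean()          # Mean Squared Error

print(mfe, mad, mse)                # MFE > 0 would indicate the model tends to under-forecast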
STL Approach
So, STL stands for Seasonal and Trend decomposition using Loess. This is a
statistical method of decomposing time series data into three components:
seasonality, trend and residual.
Now, what is time series data? Well, it is a sequence of data points that varies
across a continuous time axis. An example would be a stock-price series where the
time axis is at an hour level and the value of the stock varies across time.
Now let’s talk about trend. Trend gives you a general direction of the overall data.
From the above example, I can say that from 9:00am to 11:00am there is a downward
trend and from 11:00am to 1:00pm there is an upward trend and after 1:00pm the
trend is constant.
Seasonality, on the other hand, is a regular and predictable pattern that recurs at a fixed
interval of time. For example, a plot of the total units sold per month for a retailer shows,
if we watch carefully, an increase in unit sales during the month of December every year.
So, there is a regular pattern, or seasonality, in the unit sales associated with a period of
12 months. A decomposition sketch follows.
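A minimal sketch of STL decomposition, assuming statsmodels and pandas are installed; the monthly series is fabricated so that it repeats every 12 points.

import pandas as pd
from statsmodels.tsa.seasonal import STL

# Hypothetical monthly unit-sales series with a yearly (period=12) seasonal pattern
sales = pd.Series(
    [120, 115, 130, 128, 135, 150, 160, 158, 145, 138, 170, 220] * 3,
    index=pd.date_range("2020-01-01", periods=36, freq="MS"),
)

result = STL(sales, period=12).fit()
print(result.trend.head())        # trend component
print(result.seasonal.head())     # seasonal component
print(result.resid.head())        # residual component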
ETL Approach:
Extract, Transform and Load (ETL) refers to a process in
database usage and especially in data warehousing that:
Extracts data from homogeneous or heterogeneous data sources
Transforms the data into the proper format or
structure for querying and analysis purposes
Loads it into the final target (database, more
specifically, operational data store, data mart, or data
warehouse)
Usually all three phases execute in parallel. Since data extraction takes time,
while one batch of data is being pulled, another transformation process executes on
the data already received and prepares it for loading; as soon as some data is ready
to be loaded into the target, the loading kicks off without waiting for the
completion of the previous phases.
Commonly used ETL tools include:
Anatella
Alteryx
CampaignRunner
ESF Database Migration Toolkit
Informatica PowerCenter
Talend
IBM InfoSphereDataStage
Ab Initio
Oracle Data Integrator (ODI)
Oracle Warehouse Builder (OWB)
Microsoft SQL Server Integration Services (SSIS)
Tomahawk Business Integrator by Novasoft Technologies.
Pentaho Data Integration (or Kettle), an open-source data integration
framework
Stambia
Diyotta DI-SUITE for Modern Data Integration
FlyData
Rhino ETL
There are various steps involved in ETL. They are as below in detail:
Extract:
The Extract step covers the data extraction from the source system and makes it
accessible for further processing. The main objective of the extract step is to
retrieve all the required data from the source system with as few resources as
possible. The extract step should be designed in a way that it does not negatively
affect the source system in terms of performance, response time or any kind of
locking.
Clean:
The cleaning step is one of the most important as it ensures the quality of the data in
the data warehouse. Cleaning should perform basic data unification rules, such as:
Making identifiers unique (sex categories Male/Female/Unknown, M/F/null,
Man/Woman/Not Available are translated to standard Male/Female/Unknown)
Convert null values into a standardized Not Available/Not Provided value
Convert phone numbers, ZIP codes to a standardized form
Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str
Validate address fields against each other (State/Country, City/State, City/ZIP code,
City/Street).
Transform:
The transform step applies a set of rules to transform the data from the source to the
target. This includes converting any measured data to the same dimension (i.e.
conformed dimension) using the same units so that they can later be joined. The
transformation step also requires joining data from several sources, generating
aggregates, generating surrogate keys, sorting, deriving new calculated values, and
applying advanced validation rules.
Load:
During the load step, it is necessary to ensure that the load is performed correctly
and with as few resources as possible. The target of the load process is often a
database. In order to make the load process efficient, it is helpful to disable any
constraints and indexes before the load and enable them back only after the load
completes. Referential integrity needs to be maintained by the ETL tool to ensure
consistency.
Managing ETL Process
The ETL process seems quite straightforward. As with every application, there is a
possibility that the ETL process fails. This can be caused by missing extracts from
one of the systems, missing values in one of the reference tables, or simply a
connection or power outage. Therefore, it is necessary to design the ETL process
keeping fail-recovery in mind. A minimal end-to-end sketch follows.
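A minimal, hypothetical ETL sketch using pandas and SQLite; the file name, column names and cleaning rules are assumptions made only for illustration, not part of the original notes.

import sqlite3
import pandas as pd

# Extract: read raw data from a hypothetical CSV source
raw = pd.read_csv("sales_raw.csv")

# Clean: unify identifiers and drop obviously bad rows (hypothetical rules)
raw["sex"] = raw["sex"].replace({"M": "Male", "F": "Female", None: "Unknown"})
raw = raw.dropna(subset=["order_id"])

# Transform: derive a new calculated value and aggregate
raw["revenue"] = raw["quantity"] * raw["unit_price"]
summary = raw.groupby("region", as_index=False)["revenue"].sum()

# Load: write the result into the target database
with sqlite3.connect("warehouse.db") as conn:
    summary.to_sql("revenue_by_region", conn, if_exists="replace", index=False)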
Feature Extraction:
The feature extraction process results in a much smaller and richer set of attributes.
The maximum number of features is controlled by
the FEAT_NUM_FEATURES build setting for feature extraction models.
Models built on extracted features may be of higher quality, because the data is
described by fewer, more meaningful attributes.
Feature extraction projects a data set with higher dimensionality onto a smaller
number of dimensions. As such it is useful for data visualization, since a complex
data set can be effectively visualized when it is reduced to two or three dimensions.
Feature extraction can be used to extract the themes of a document collection, where
documents are represented by a set of key words and their frequencies. Each theme
(feature) is represented by a combination of keywords. The documents in the
collection can then be expressed in terms of the discovered themes.
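As an illustrative sketch of projecting a high-dimensional dataset onto fewer extracted features, the code below uses PCA from scikit-learn on the bundled digits dataset. Note that the FEAT_NUM_FEATURES setting mentioned above belongs to a specific vendor's tooling; PCA here is just a stand-in example of feature extraction.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 64 pixel attributes per image
pca = PCA(n_components=2)                  # project onto 2 extracted features
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)           # (1797, 64) -> (1797, 2), suitable for visualization
print(pca.explained_variance_ratio_)       # share of variance kept by each new feature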
UNIT V
Data Visualization
Data visualization is the graphic representation of data. It involves producing images
that communicate relationships among the represented data to viewers of the images.
This communication is achieved through the use of a systematic mapping between
graphic marks and data values in the creation of the visualization. This mapping
establishes how data values will be represented visually, determining how and to
what extent a property of a graphic mark, such as size or color, will change to reflect
change in the value of a datum.
Fig: Chernoff faces, where each face represents an n-dimensional data point
(n ≤ 18)
3.2 Stick figures: This technique maps multidimensional data to a five-piece stick figure,
where each figure has four limbs and a body.
A sunburst chart (or a ring chart) is a pie chart with concentric circles, describing
the hierarchy of data values.
The tree diagram allows one to describe the tree-like relations within the data
structure, usually from top to bottom or from left to right.
These forms of data visualization are mostly useful for depicting the hierarchy or
relations of different variables within a data set. However, they are not well suited
for showing the relations between multiple data sets, for which network data models
work best.
Most visualization techniques were mainly for numeric data. Recently, more and
more non-numeric data, such as text and social networks, have become available.
Many people on the Web tag various objects such as pictures, blog entries, and
product reviews. A tag cloud is a visualization of statistics of user-generated tags.
Often, in a tag cloud, tags are listed alphabetically or in a user-preferred order.
The importance of a tag is indicated by font size or color.
NOTES-3
Prerequisites:
1. A course on “Database Management Systems”.
2. Knowledge of probability and statistics.
Course Objectives:
1. To explore the fundamental concepts of data analytics.
2. To learn the principles and methods of statistical analysis
3. Discover interesting patterns, analyze supervised and unsupervised models and estimate the accuracy of the
algorithms.
4. To understand the various search methods and visualization techniques.
INTRODUCTION:
In the early days of computers and the Internet, there was not as much data as there is
today; the data could easily be stored and managed by users and business enterprises
on a single computer, because the total volume of data never exceeded about 19 exabytes.
Now, in this era, about 2.5 quintillion bytes of data are generated every day.
Most of the data is generated from social media sites like Facebook, Instagram, Twitter, etc., and
other sources such as e-business and e-commerce transactions, hospital, school, and bank data. This data
is impossible to manage with traditional data-storing techniques. Whether the data is generated by a
large-scale enterprise or by an individual, every aspect of it needs
to be analysed to benefit from it. But how do we do it? Well, that's where the term 'Data
Analytics' comes in.
Gather Hidden Insights – Hidden insights from data are gathered and then analyzed with
respect to business requirements.
Generate Reports – Reports are generated from the data and are passed on to the respective
teams and individuals to deal with further actions for a high rise in business.
Perform Market Analysis – Market Analysis can be performed to understand the strengths
and weaknesses of competitors.
Improve Business Requirements – Analysis of data allows a business to improve on customer
requirements and experience.
Data architecture in Information Technology is composed of models, policies, rules or standards that
govern which data is collected, and how it is stored, arranged, integrated, and put to use in data
systems and in organizations.
A data architecture should set data standards for all its data systems as a vision or a model of
the eventual interactions between those data systems.
Data architectures address data in storage and data in motion; descriptions of data stores, data
groups and data items; and mappings of those data artifacts to data qualities, applications,
locations etc.
Essential to realizing the target state, Data Architecture describes how data is processed, stored,
and utilized in a given system. It provides criteria for data processing operations that make it
possible to design data flows and also control the flow of data in the system.
The Data Architect is typically responsible for defining the target state, aligning during
development and then following up to ensure enhancements are done in the spirit of the original
blueprint.
During the definition of the target state, the Data Architecture breaks a subject down to the atomic
level and then builds it back up to the desired form.
The Data Architect breaks the subject down by going through 3 traditional architectural processes:
Conceptual model: It is a business model which uses Entity Relationship (ER) model for relation
between entities and their attributes.
Logical model: It is a model where the problem is represented in a logical form, such as rows and
columns of data, classes, XML tags and other DBMS techniques.
Physical model: The physical model holds the database design, such as which type of database technology
will be suitable for the architecture.
The data architecture is formed by dividing it into the three essential models above, which are then combined:
Various constraints and influences will have an effect on data architecture design. These include
enterprise requirements, technology drivers, economics, business policies and data processing needs.
Enterprise requirements:
These will generally include such elements as economical and effective system expansion,
acceptable performance levels (especially system access speed), transaction reliability, and
transparent data management.
In addition, the conversion of raw data such as transaction records and image files into more
useful information forms through such features as data warehouses is also a common
organizational requirement, since this enables managerial decision making and other
organizational processes.
One of the architecture techniques is the split between managing transaction data and (master)
reference data. Another one is splitting data capture systems from data retrieval systems (as
done in a data warehouse).
Technology drivers:
These are usually suggested by the completed data architecture and database architecture
designs.
In addition, some technology drivers will derive from existing organizational integration
frameworks and standards, organizational economics, and existing site resources (e.g.
previously purchased software licensing).
Economics:
These are also important factors that must be considered during the data architecture phase.
It is possible that some solutions, while optimal in principle, may not be potential candidates
due to their cost.
External factors such as the business cycle, interest rates, market conditions, and legal
considerations could all have an effect on decisions relevant to data architecture.
Business policies:
Business policies that also drive data architecture design include internal organizational policies,
rules of regulatory bodies, professional standards, and applicable governmental laws that can
vary by applicable agency.
These policies and rules will help describe the manner in which enterprise wishes to process
their data.
Data processing needs
These include accurate and reproducible transactions performed in high volumes, data
warehousing for the support of management information systems (and potential data mining),
repetitive periodic reporting, ad hoc reporting, and support of various organizational initiatives
as required (i.e. annual budgets, new product development)
The General Approach is based on designing the Architecture at three Levels of Specification.
The Logical Level
1. Primary data:
The data which is Raw, original, and extracted directly from the official sources is known as
primary data. This type of data is collected directly by performing techniques such as
questionnaires, interviews, and surveys. The data collected must be according to the demand
and requirements of the target audience on which the analysis is performed; otherwise it would
become a burden during data processing.
Few methods of collecting primary data:
1. Interview method:
The data collected during this process is through interviewing the target audience by a person
called interviewer and the person who answers the interview is known as the interviewee.
Some basic business or product related questions are asked and noted down in the form of
notes, audio, or video and this data is stored for processing.
These can be both structured and unstructured like personal interviews or formal interviews
through telephone, face to face, email, etc.
2. Survey method:
The survey method is the process of research where a list of relevant questions is asked and
answers are noted down in the form of text, audio, or video.
The survey method can be conducted in both online and offline modes, such as through website forms
and email. The survey answers are then stored for analysis. Examples are online surveys
or surveys through social media polls.
3. Observation method:
The observation method is a method of data collection in which the researcher keenly observes
the behaviour and practices of the target audience using some data collecting tool and stores
the observed data in the form of text, audio, video, or any raw formats.
In this method, the data is collected by directly observing the participants rather than by
posing questions to them. For example, observing a group of customers and their behaviour
towards the products. The data obtained is then sent for processing.
4. Experimental method:
The experimental method is the process of collecting data through performing experiments,
research, and investigation.
The most frequently used experiment methods are CRD, RBD, LSD, FD.
CRD- Completely Randomized design is a simple experimental design used in data analytics which
is based on randomization and replication. It is mostly used for comparing the experiments.
RBD- Randomized Block Design is an experimental design in which the experiment is divided into
small units called blocks.
Random experiments are performed on each of the blocks and results are drawn using a
technique known as analysis of variance (ANOVA). RBD originated in the agriculture
sector.
Randomized Block Design - The Term Randomized Block Design has originated from agricultural
research. In this design several treatments of variables are applied to different blocks of land
to ascertain their effect on the yield of the crop.
Blocks are formed in such a manner that each block contains as many plots as a number of
treatments so that one plot from each is selected at random for each treatment. The production
of each plot is measured after the treatment is given.
These data are then interpreted and inferences are drawn by using the analysis of variance
technique, so as to know the effect of various treatments, such as different doses of fertilizers,
different types of irrigation, etc.
LSD – Latin Square Design is an experimental design that is similar to CRD and RBD blocks but
contains rows and columns.
It is an arrangement of N x N squares with an equal number of rows and columns, containing
letters that occur only once in each row and each column. Hence the differences can be easily
found with fewer errors in the experiment. A Sudoku puzzle is an example of a Latin square design.
A Latin square is one of the experimental designs which has a balanced two-way classification
scheme say for example - 4 X 4 arrangement. In this scheme each letter from A to D occurs only
once in each row and also only once in each column.
The Latin square is probably underused in most fields of research because textbook examples
tend to be restricted to agriculture, the area which spawned most of the original work on ANOVA.
Agricultural examples often reflect geographical designs where rows and columns are literally two
dimensions of a grid in a field.
Rows and columns can be any two sources of variation in an experiment. In this sense a Latin
square is a generalisation of a randomized block design with two different blocking systems
A B C D
B C D A
C D A B
D A B C
The balanced arrangement achieved in a Latin square is its main strength. In this design, the
comparisons among treatments will be free from both differences between rows and differences
between columns. Thus, the magnitude of error will be smaller than in any other comparable design.
FD - Factorial Design is an experimental design in which each experiment has two or more factors,
each with several possible values, and performing the trials yields the effects of the different
combinations of factor levels. This design allows the experimenter to test two or more variables
simultaneously. It also measures the interaction effects of the variables and analyses the impact of
each variable. In a true experiment, randomization is essential so that the experimenter can infer
cause and effect without any bias.
2. Secondary data:
Secondary data is the data which has already been collected and reused again for some valid purpose.
This type of data is previously recorded from primary data and it has two types of sources named
internal source and external source.
Internal source:
These types of data can easily be found within the organization, such as market records, sales records,
transactions, customer data, accounting resources, etc. The cost and time consumption in obtaining
internal sources is low.
Accounting resources- This gives so much information which can be used by the marketing
researcher. They give information about internal factors.
Sales Force Report- It gives information about the sales of a product. The information
provided is from outside the organization.
Internal Experts- These are people who are heading the various departments. They can give
an idea of how a particular thing is working.
Miscellaneous Reports- These are what information you are getting from operational reports.
If the data available within the organization are unsuitable or inadequate, the marketer should
extend the search to external secondary data sources.
External source:
The data which can't be found within the organization and can be gained through external third-party
resources is external source data. The cost and time consumption are higher because this contains a
huge amount of data. Examples of external sources are Government publications, news publications,
Registrar General of India, planning commission, international labour bureau, syndicate services, and
other non-governmental publications.
1. Government Publications-
Government sources provide an extremely rich pool of data for the researchers. In addition,
many of these data are available free of cost on internet websites. There are number of
government agencies generating data.
These are like: Registrar General of India- It is an office which generates demographic data.
It includes details of gender, age, occupation etc.
2. Central Statistical Organization-
This organization publishes the national accounts statistics. It contains estimates of national
income for several years, growth rate, and rate of major economic activities. Annual survey
of Industries is also published by the CSO.
It gives information about the total number of workers employed, production units, material
used and value added by the manufacturer.
3. Director General of Commercial Intelligence-
This office operates from Kolkata. It gives information about foreign trade i.e. import and
export. These figures are provided region-wise and country-wise.
4. Ministry of Commerce and Industries-
This ministry through the office of economic advisor provides information on wholesale price
index. These indices may be related to a number of sectors like food, fuel, power, food grains
etc.
It also generates All India Consumer Price Index numbers for industrial workers, urban
non-manual employees and agricultural labourers.
5. Planning Commission-
It provides the basic statistics of Indian Economy.
6. Reserve Bank of India-
This provides information on Banking Savings and investment. RBI also prepares currency
and finance reports.
7. Labour Bureau-
It provides information on skilled, unskilled, white collared jobs etc.
8. National Sample Survey-
This is done by the Ministry of Planning and it provides social, economic, demographic,
industrial and agricultural statistics.
9. Department of Economic Affairs-
It conducts economic survey and it also generates information on income, consumption,
expenditure, investment, savings and foreign trade.
10. State Statistical Abstract-
This gives information on various types of activities related to the state like - commercial
activities, education, occupation etc.
11. Non-Government Publications-
These includes publications of various industrial and trade associations, such as The Indian
Cotton Mill Association Various chambers of commerce.
12. The Bombay Stock Exchange
It publishes a directory containing financial accounts, key profitability figures and other relevant
matter. Other non-government sources include various associations of the press media,
Export Promotion Council.
Confederation of Indian Industries (CII)
Small Industries Development Board of India
Different Mills like - Woollen mills, Textile mills etc
The only disadvantage of the above sources is that the data may be biased, since such bodies
are likely to gloss over their negative points.
13. Syndicate Services-
These services are provided by certain organizations which collect and tabulate the marketing
information on a regular basis for a number of clients who are the subscribers to these
services.
These services are useful in television viewing, movement of consumer goods etc.
These syndicate services provide information data from both household as well as institution.
Data Management:
Data management is the practice of collecting, keeping, and using data securely, efficiently, and cost-
effectively. The goal of data management is to help people, organizations, and connected things
optimize the use of data within the bounds of policy and regulation so that they can make decisions
and take actions that maximize the benefit to the organization.
Managing digital data in an organization involves a broad range of tasks, policies, procedures, and
practices. The work of data management has a wide scope, covering factors such as how to:
Create, access, and update data across a diverse data tier
Store data across multiple clouds and on premises
Provide high availability and disaster recovery
Use data in a growing variety of apps, analytics, and algorithms
Ensure data privacy and security
Archive and destroy data in accordance with retention schedules and compliance requirements
What is Cloud Computing?
Cloud computing is a term that refers to storing and accessing data over the internet. It doesn't
store any data on the hard disk of your personal computer; in cloud computing, you access data
from a remote server.
Service Models of Cloud computing are the reference models on which the Cloud Computing is based.
These can be categorized into
three basic service models as listed below:
1. INFRASTRUCTURE as a SERVICE (IaaS)
IaaS provides access to fundamental resources such as physical machines, virtual machines, virtual
storage, etc.
2. PLATFORM as a SERVICE (PaaS)
PaaS provides the runtime environment for applications, development & deployment tools, etc.
3. SOFTWARE as a SERVICE (SaaS)
The SaaS model allows software applications to be used as a service by end users.
AWS is one of the popular platforms providing the above service models, and Amazon Cloud
(Web) Services is a popular service platform for data management.
Amazon Cloud (Web) Services Tutorial
What is AWS?
The full form of AWS is Amazon Web Services. It is a platform that offers flexible, reliable, scalable,
easy-to-use and, cost-effective cloud computing solutions.
AWS is a comprehensive, easy-to-use computing platform offered by Amazon. The platform is developed
with a combination of infrastructure as a service (IaaS), platform as a service (PaaS) and packaged
software as a service (SaaS) offerings.
History of AWS
2002- AWS services launched
2006- Launched its cloud products
2012- Holds first customer event
2015- Reveals revenues achieved of $4.6 billion
2016- Surpassed $10 billion revenue target
2016- Release snowball and snowmobile
2019- Offers nearly 100 cloud services
2021- AWS comprises over 200 products and services
A prompt window will open. Click the Create Bucket button at the bottom of the page.
Create a Bucket dialog box will open. Fill the required details and click the Create button.
The bucket is created successfully in Amazon S3. The console displays the list of buckets and its
properties.
Select the Static Website Hosting option. Click the radio button Enable website hosting and fill the
required details.
Click the Add files option. Select those files which are to be uploaded from the system and then
click the Open button.
Click the start upload button. The files will get uploaded into the bucket.
Afterwards, we can create, edit, modify, update the objects and other files in wide formats.
Amazon S3 Features
Low cost and Easy to Use − Using Amazon S3, the user can store a large amount of data at
very low charges.
Secure − Amazon S3 supports data transfer over SSL and the data gets encrypted
automatically once it is uploaded. The user has complete control over their data by configuring
bucket policies using AWS IAM.
Scalable − Using Amazon S3, there need not be any worry about storage concerns. We can
store as much data as we have and access it anytime.
Higher performance − Amazon S3 is integrated with Amazon CloudFront, which distributes
content to end users with low latency and provides high data transfer speeds without any
minimum usage commitments.
Integrated with AWS services − Amazon S3 is integrated with AWS services including Amazon
CloudFront, Amazon CloudWatch, Amazon Kinesis, Amazon RDS, Amazon Route 53, Amazon
VPC, AWS Lambda, Amazon EBS, Amazon DynamoDB, etc.
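The console steps above can also be scripted. The sketch below is a minimal illustration using the boto3 library, assuming AWS credentials are already configured; the bucket and file names are hypothetical, and outside the us-east-1 region create_bucket additionally needs a CreateBucketConfiguration argument.

import boto3

s3 = boto3.client("s3")                               # uses the AWS credentials configured locally

s3.create_bucket(Bucket="my-example-bucket-12345")    # hypothetical, globally unique bucket name
s3.upload_file("index.html", "my-example-bucket-12345", "index.html")

# List the objects now stored in the bucket
for obj in s3.list_objects_v2(Bucket="my-example-bucket-12345").get("Contents", []):
    print(obj["Key"])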
Data Quality:
Data quality measures how well a dataset serves its intended use. Typical data quality problems
include noise, outliers, missing values, and duplicate records; the following subsections discuss
how some of these are detected and handled.
OUTLIERS:
An outlier is a data object that deviates significantly from the rest of the objects in the dataset,
as if it were generated by a different mechanism.
Detect Outliers:
The most commonly used method to detect outliers is visualization. We use various visualization
methods, like the box plot, histogram and scatter plot. A simple numeric check based on the same
fences a box plot draws is sketched below.
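A simple numeric outlier check using the 1.5 x IQR fences that a box plot draws, assuming NumPy; the data values are made up.

import numpy as np

values = np.array([10, 12, 11, 13, 12, 14, 11, 95])     # 95 is a suspicious value

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr            # the fences a box plot would draw

outliers = values[(values < lower) | (values > upper)]
print(outliers)                                           # [95]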
Rejection:
Rejection of outliers is more acceptable in areas of practice where the underlying model of the
process being measured and the usual distribution of measurement error are confidently
known.
An outlier resulting from an instrument reading error may be excluded but it is desirable that
the reading is at least verified.
Missing Values
Missing data in the training data set can reduce the power / fit of a model or can lead to a
biased model, because we have not analyzed the behaviour and relationship with other
variables correctly. It can lead to wrong predictions or classifications.
Impossible values (e.g., zero divided by zero) are represented by the symbol NaN (not a number),
and R outputs the result of dividing a non-zero number by zero as 'Inf' (infinity).
Data Pre-processing:
Preprocessing in Data Mining: Data preprocessing is a data mining technique which is used to
transform raw data into a useful and efficient format.
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It
involves handling of missing data, noisy data etc.
(a). Missing Data:
This situation arises when some values are missing in the data. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are
missing within a tuple.
2. Fill the missing values:
The missing values can be filled manually, with the attribute mean, or with the most probable
value (a pandas sketch of both options follows).
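A minimal pandas sketch of the two options (ignoring tuples vs. filling values) on made-up data; it assumes pandas and NumPy are installed.

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 47],
                   "salary": [30000, 42000, np.nan, 52000]})

dropped = df.dropna()                              # ignore tuples with missing values
filled = df.fillna(df.mean(numeric_only=True))     # fill with the attribute mean instead

print(dropped)
print(filled)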
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process.
Common strategies are:
1. Normalization:
Normalization is a technique often applied as part of data preparation in Data Analytics
through machine learning. The goal of normalization is to change the values of numeric
columns in the dataset to a common scale, without distorting differences in the ranges of
values. For machine learning, not every dataset requires normalization. It is done in order
to scale the data values into a specified range (such as -1.0 to 1.0 or 0.0 to 1.0). A small sketch follows.
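A minimal sketch, assuming scikit-learn and NumPy, that rescales two columns with very different ranges to the 0.0 - 1.0 interval; the values are made up.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[20, 30000],
              [35, 52000],
              [50, 90000]], dtype=float)        # columns on very different scales

scaler = MinMaxScaler(feature_range=(0.0, 1.0))
print(scaler.fit_transform(X))                  # every column rescaled to the 0.0 - 1.0 range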
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.
3. Discretization:
Discretization is the process through which we can transform continuous variables, models
or functions into a discrete form. We do this by creating a set of contiguous intervals (or
bins) that go across the range of our desired variable/model/function. Continuous data is
measured, while discrete data is counted. A small binning sketch follows.
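A minimal binning sketch using pandas.cut, with made-up age values and bin edges; it assumes pandas is installed.

import pandas as pd

ages = pd.Series([3, 17, 25, 34, 48, 61, 79])

# Create contiguous intervals (bins) across the range of the continuous variable
bins = pd.cut(ages, bins=[0, 18, 40, 65, 100],
              labels=["child", "young", "middle-aged", "senior"])
print(bins)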
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and when working with such
volumes, analysis becomes harder. To get rid of this problem, we use data reduction techniques.
They aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example: regression
models.
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If the
original data can be retrieved after reconstruction from the compressed data, the reduction is
called lossless; otherwise it is called lossy. The two effective methods of
dimensionality reduction are: Wavelet transforms and PCA (Principal Component Analysis).
Data Processing:
Data processing occurs when data is collected and translated into usable information. Usually
performed by a data scientist or a team of data scientists, data processing must be done
correctly so as not to negatively affect the end product, or data output.
Data processing starts with data in its raw form and converts it into a more readable format (graphs,
documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized
by employees throughout an organization.
2. Data preparation
Once the data is collected, it then enters the data preparation stage. Data preparation, often referred
to as “pre-processing” is the stage at which raw data is cleaned up and organized for the following
stage of data processing. During preparation, raw data is diligently checked for any errors. The purpose
of this step is to eliminate bad data (redundant, incomplete, or incorrect data) and begin to create
high-quality data for the best business intelligence.
3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data warehouse
like Redshift), and translated into a language that it can understand. Data input is the first stage in
which raw data begins to take the form of usable information.
4. Processing
During this stage, the data inputted to the computer in the previous stage is actually processed for
interpretation. Processing is done using machine learning algorithms, though the process itself may
vary slightly depending on the source of data being processed (data lakes, social networks, connected
devices etc.) and its intended use (examining advertising patterns, medical diagnosis from connected
devices, determining customer needs, etc.).
5. Data output/interpretation
The output/interpretation stage is the stage at which data is finally usable by non-data scientists. It is
translated, readable, and often presented in the form of graphs, videos, images, plain text, etc.
6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is then stored for
future use. While some information may be put to use immediately, much of it will serve a purpose
later on. When data is properly stored, it can be quickly and easily accessed by members of the
organization when needed.
UNIT – II Syllabus
Data Analytics: Introduction to Analytics, Introduction to Tools and Environment, Application of
Modeling in Business, Databases & Types of Data and variables, Data Modeling Techniques, Missing
Imputations etc. Need for Business Modeling.
Topics:
1. Introduction to Data Analytics
2. Data Analytics Tools and Environment
3. Need for Business Modeling.
4. Data Modeling Techniques
5. Application of Modeling in Business
6. Databases & Types of Data and variables
7. Missing Imputations etc.
Unit-2 Objectives:
1. To explore the fundamental concepts of data analytics.
2. To learn the principles of Tools and Environment
3. To explore the applications of Business Modelling
4. To understand the Data Modeling Techniques
5. To understand the Data Types and Variables and Missing imputations
Unit-2 Outcomes:
After completion of this course students will be able to
1. To Describe concepts of data analytics.
2. To demonstrate the principles of Tools and Environment
3. To analyze the applications of Business Modelling
4. To understand and Compare the Data Modeling Techniques
5. To describe the Data Types and Variables and Missing imputations
INTRODUCTION:
Data has been the buzzword for ages now. Whether the data is generated by a large-scale
enterprise or by an individual, each and every aspect of it needs to be
analyzed to benefit from it.
3. Efficient Operations: With the help of data analytics, you can streamline your processes, save
money, and boost production. With an improved understanding of what your audience wants, you
spend less time creating ads and content that aren't in line with your audience's interests.
4. Effective Marketing: Data analytics gives you valuable insights into how your campaigns are
performing. This helps in fine-tuning them for optimal outcomes. Additionally, you can also find
potential customers who are most likely to interact with a campaign and convert into leads.
4. Data Exploration and Analysis: After you gather the right data, the next vital step is to
execute exploratory data analysis. You can use data visualization and business intelligence tools,
data mining techniques, and predictive modelling to analyze, visualize, and predict future outcomes
from this data. Applying these methods can tell you the impact and relationship of a certain feature
as compared to other variables.
Below are the results you can get from the analysis:
You can identify when a customer purchases the next product.
You can understand how long it took to deliver the product.
You get a better insight into the kind of items a customer looks for, product returns, etc.
You will be able to predict the sales and profit for the next quarter.
You can minimize order cancellation by dispatching only relevant products.
You’ll be able to figure out the shortest route to deliver the product, etc.
5. Interpret the results: The final step is to interpret the results and validate if the outcomes
meet your expectations. You can find out hidden patterns and future trends. This will help you gain
insights that will support you with appropriate data-driven decision making.
R programming – This tool is the leading analytics tool used for statistics and data
modeling. R compiles and runs on various platforms such as UNIX, Windows, and Mac OS.
It also provides tools to automatically install all packages as per user-requirement.
Python – Python is an open-source, object-oriented programming language that is easy to
read, write, and maintain. It provides various machine learning and visualization libraries
such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras, etc. It can also work with data
from many platforms, such as a SQL Server database, a MongoDB database, or JSON files.
Tableau Public – This is a free software that connects to any data source such as
Excel, corporate Data Warehouse, etc. It then creates visualizations, maps, dashboards etc
with real-time updates on the web.
QlikView – This tool offers in-memory data processing with the results delivered to the
end-users quickly. It also offers data association and data visualization with data being
compressed to almost 10% of its original size.
SAS – A programming language and environment for data manipulation and analytics, this
tool is easily accessible and can analyze data from different sources.
Microsoft Excel – This tool is one of the most widely used tools for data analytics. Mostly
used for clients’ internal data, this tool analyzes the tasks that summarize the data with a
preview of pivot tables.
RapidMiner – A powerful, integrated platform that can integrate with any data source types
such as Access, Excel, Microsoft SQL, Tera data, Oracle, Sybase etc. This tool is mostly used
for predictive analytics, such as data mining, text analytics, machine learning.
KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics platform,
which allows you to analyze and model data. With the benefit of visual programming, KNIME
provides a platform for reporting and integration through its modular data pipeline concept.
OpenRefine – Also known as GoogleRefine, this data cleaning software will help you clean
up data for analysis. It is used for cleaning messy data, the transformation of data and
parsing data from websites.
Apache Spark – One of the most widely used large-scale data processing engines, this tool executes
applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk. This
tool is also popular for data pipelines and machine learning model development.
5. Logistics: Logistics companies use data analytics to develop new business models and
optimize routes. This, in turn, ensures that the delivery reaches on time in a cost-efficient
manner.
Cluster computing:
Cluster computing is a collection of tightly or loosely
connected computers that work together so that they
act as a single entity.
The connected computers execute operations all
together thus creating the idea of a single system.
The clusters are generally connected through fast local
area networks (LANs)
Apache Spark:
Apache Spark is an open-source, in-memory cluster computing framework for large-scale data
processing. Its core engine is extended by the component libraries described below, and a small
PySpark example is given after them.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called
SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics.
It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations
on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework above Spark because of the distributed memory-
based Spark architecture. It is, according to benchmarks, done by the MLlib developers against the
Alternating Least Squares (ALS) implementations. Spark MLlib is nine times as fast as the Hadoop
disk-based version of Apache Mahout (before Mahout gained a Spark interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for
expressing graph computation that can model the user-defined graphs by using Pregel abstraction
API. It also provides an optimized runtime for this abstraction.
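A minimal PySpark sketch, assuming pyspark is installed, showing the Spark SQL component querying a tiny structured dataset; the data is made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# A tiny structured dataset handled through the Spark SQL component
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 40").show()
spark.stop()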
What is Scala?
Scala is a statically typed programming language that incorporates both functional and
object-oriented programming, and is also suitable for imperative programming approaches,
in order to increase the scalability of applications. It is a general-purpose programming
language with a strong static type system. In Scala, everything is an object, whether it is a
function or a number; it does not have the concept of primitive data types.
Scala primarily runs on the JVM platform, and it can also be used to write software for native
platforms using Scala Native and for JavaScript runtimes through Scala.js.
This language was originally built for the Java Virtual Machine (JVM) and one of Scala’s
strengths is that it makes it very easy to interact with Java code.
Scala is a Scalable Language used to write Software for multiple platforms. Hence, it got
the name “Scala”. This language is intended to solve the problems of Java
while simultaneously being more concise. Initially designed by Martin Odersky, it was
released in 2003.
Why Scala?
Scala is the core language to be used in writing the most popular distributed big
data processing framework Apache Spark. Big Data processing is becoming
inevitable from small to large enterprises.
Extracting the valuable insights from data requires state of the art processing tools
and frameworks.
Scala is easy to learn for object-oriented programmers, Java developers. It is
becoming one of the popular languages in recent years.
Scala offers first-class functions for users
Scala can be executed on JVM, thus paving the way for the interoperability with
other languages.
It is designed for applications that are concurrent (parallel), distributed, resilient (robust), and message-driven. It is one of the most in-demand languages of this decade.
It is concise, powerful language and can quickly grow according to the demand of
its users.
It is object-oriented and has a lot of functional programming features providing a lot
of flexibility to the developers to code in a way they want.
Scala offers many duck types (structural types)
Unlike Java, Scala has many features of functional programming languages like Scheme,
Standard ML and Haskell, including currying, type inference, immutability, lazy evaluation,
and pattern matching.
The name Scala is a portmanteau of "scalable" and "language", signifying that it is
designed to grow with the demands of its users.
Cloudera Impala:
Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL
query engine for data stored in a computer cluster running Apache Hadoop.
Impala is an open source, massively parallel processing (MPP) SQL query engine for a native analytic database in a computer cluster running Apache Hadoop.
It is shipped by vendors such as Cloudera, MapR, Oracle, and Amazon.
Cloudera Impala is a query engine that runs on Apache Hadoop.
The project was announced in October 2012 with a public beta test distribution and
became generally available in May 2013.
Impala enables users to issue low-latency SQL queries against data stored in HDFS and Apache HBase without requiring data movement or transformation.
Impala is integrated with Hadoop to use the same file and data formats, metadata,
security and resource management frameworks used by MapReduce, Apache Hive,
Apache Pig and other Hadoop software.
Impala is promoted for analysts and data scientists to perform analytics on data
stored in Hadoop via SQL or business intelligence tools.
The result is that large-scale data processing (via MapReduce) and interactive
queries can be done on the same system using the same data and metadata –
removing the need to migrate data sets into specialized systems and/or proprietary
formats simply to perform analysis.
Features include:
Supports HDFS and Apache HBase storage,
Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and
Parquet,
Supports Hadoop security (Kerberos authentication),
Fine-grained, role-based authorization with Apache Sentry,
Uses metadata, ODBC driver, and SQL syntax from Apache Hive.
Why NoSQL?
The concept of NoSQL databases became popular with Internet giants like Google,
Facebook, Amazon, etc. who deal with huge volumes of data. The system response
time becomes slow when you use RDBMS for massive volumes of data.
To resolve this problem, we could “scale up” our systems by upgrading our existing
hardware. This process is expensive. The alternative for this issue is to distribute
database load on multiple hosts whenever the load increases. This method is known
as “scaling out.”
Examples of SQL databases include Oracle and MySQL; examples of NoSQL databases include MongoDB and CouchDB.
The table below summarizes the main differences between SQL and NoSQL databases.

SQL                                              | NoSQL
These databases are not suited for hierarchical  | These databases are best suited for hierarchical
data storage.                                    | data storage.
These databases are best suited for complex      | These databases are not so good for complex
queries.                                         | queries.
Benefits of NoSQL
The NoSQL data model addresses several issues that the relational model is not
designed to address:
Large volumes of structured, semi-structured, and unstructured data.
Object-oriented programming that is easy to use and flexible.
Efficient, scale-out architecture instead of expensive, monolithic architecture.
Variables:
Data consist of individuals and variables that give us information about those
individuals. An individual can be an object or a person.
A variable is an attribute, such as a measurement or a label.
Two types of Data
Quantitative data(Numerical)
Categorical data
Categorical variables: Categorical variables represent groupings of some kind. They are
sometimes recorded as numbers, but the numbers represent categories rather than actual
amounts of things.
There are three types of categorical variables: binary, nominal, and ordinal variables.
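As a quick illustration (the small data frame below is hypothetical and not part of the syllabus material), the different variable types can be represented in R as numeric columns and factors, with an ordered factor for an ordinal variable:

survey <- data.frame(
  age    = c(23, 35, 41, 29),                                # quantitative (numerical) variable
  smoker = factor(c("yes", "no", "no", "yes")),              # binary categorical variable
  city   = factor(c("Delhi", "Mumbai", "Delhi", "Chennai")), # nominal categorical variable
  rating = factor(c("low", "high", "medium", "low"),
                  levels = c("low", "medium", "high"),
                  ordered = TRUE)                            # ordinal categorical variable
)
str(survey)  # shows the type of each variable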
Missing Imputations:
Imputation is the process of replacing missing data with substituted values.
1. MCAR
Missing Completely At Random means the missingness is unrelated to both the observed and the unobserved data; the missing values are a completely random subset of the data.
2. MAR
Missing At Random is weaker than MCAR. The missingness is still random, but due entirely
to observed variables. For example, those from a lower socioeconomic status may be less
willing to provide salary information (but we know their SES status). The key is that the
missingness is not due to the values which are not observed. MCAR implies MAR but not
vice-versa.
3. MNAR
If the data are Missing Not At Random, then the missingness depends on the values of the
missing data. Censored data falls into this category. For example, individuals who are
heavier are less likely to report their weight. Another example, the device measuring some
response can only measure values above .5. Anything below that is missing.
Methods for filling in missing values include the following.
3. Use a global constant to fill in the missing value: Replace all missing attribute values with the same constant, such as a label like “Unknown.” If missing values are replaced by, say, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common – that of “Unknown.” Hence, although this method is simple, it is not foolproof.
4. Use the attribute mean to fill in the missing value: Consider the average value of that particular attribute and use it to replace the missing values in that attribute column.
5. Use the attribute mean for all samples belonging to the same class as the
given tuple:
For example, if classifying customers according to credit risk, replace the missing
value with the average income value for customers in the same credit risk category
as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian formalism, or
decision tree induction. For example, using the other customer attributes in your
data set, you may construct a decision tree to predict the missing values for income.
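A minimal R sketch of two of the strategies above, assuming a small hypothetical data frame with an income column and a credit-risk column (the data and column names are invented for illustration): the attribute mean fills missing numeric values and the most frequent category fills missing categorical values.

df <- data.frame(income = c(50, 60, NA, 55, NA),
                 risk   = c("low", "high", NA, "low", "low"))
# Use the attribute mean to fill in the missing numeric values
df$income[is.na(df$income)] <- mean(df$income, na.rm = TRUE)
# Use the most frequent category to fill in the missing categorical values
most_frequent <- names(which.max(table(df$risk)))
df$risk[is.na(df$risk)] <- most_frequent
df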
Although business analytics is being leveraged in most commercial sectors and industries,
the following applications are the most common.
1. Credit Card Companies
Credit and debit cards are an everyday part of consumer spending, and they are an ideal
way of gathering information about a purchaser’s spending habits, financial situation,
behavior trends, demographics, and lifestyle preferences.
2. Customer Relationship Management (CRM)
Excellent customer relations is critical for any company that wants to retain customer loyalty
to stay in business for the long haul. CRM systems analyze important performance indicators
such as demographics, buying patterns, socio-economic information, and lifestyle.
3. Finance
The financial world is a volatile place, and business analytics helps to extract insights that
help organizations maneuver their way through tricky terrain. Corporations turn to business
analysts to optimize budgeting, banking, financial planning, forecasting, and portfolio
management.
4. Human Resources
Business analysts help the process by poring over data that characterizes high-performing candidates, such as educational background, attrition rate, the average length
of employment, etc. By working with this information, business analysts help HR by
forecasting the best fits between the company and candidates.
5. Manufacturing
Business analysts work with data to help stakeholders understand the things that affect
operations and the bottom line. Identifying things like equipment downtime, inventory
levels, and maintenance costs help companies streamline inventory management, risks, and
supply-chain management to create maximum efficiency.
6. Marketing
Business analysts help answer these questions and so many more, by measuring marketing
and advertising metrics, identifying consumer behavior and the target audience, and
analyzing market trends.
Given below are five different types of techniques used to organize data:
1. Hierarchical Technique
The hierarchical model is a tree-like structure. There is one root node, or we can say one parent node, and the other child nodes are sorted in a particular order. The hierarchical model is very rarely used now, but it can be used to model real-world relationships.
2. Object-oriented Model
The object-oriented approach is the creation of objects that contains stored values. The object-
oriented model communicates while supporting data abstraction, inheritance, and encapsulation.
3. Network Technique
The network model provides us with a flexible way of representing objects and relationships
between these entities. It has a feature known as a schema representing the data in the form of a
graph. An object is represented inside a node and the relation between them as an edge, enabling
them to maintain multiple parent and child records in a generalized manner.
4. Entity-relationship Model
The ER model (entity-relationship model) is a high-level relational model which is used to define data elements and relationships for the entities in a system. This conceptual design provides a better view of the data and makes it easier to understand. In this model, the entire database is represented in a diagram called an entity-relationship diagram, consisting of entities, attributes, and relationships.
5. Relational Technique
The relational model is used to describe the different relationships between the entities. There are different sets of relations between the entities, such as one to one, one to many, many to one, and many to many.
UNIT - III
Linear & Logistic Regression
Syllabus
Regression – Concepts, Blue property assumptions, Least Square Estimation, Variable
Rationalization, and Model Building etc.
Logistic Regression: Model Theory, Model fit Statistics, Model Construction, Analytics applications
to various Business Domains etc.
Topics:
1. Regression – Concepts
2. Blue property assumptions
3. Least Square Estimation
4. Variable Rationalization
5. Model Building etc.
6. Logistic Regression - Model Theory
7. Model fit Statistics
8. Model Construction
9. Analytics applications to various Business Domains
Unit-3 Objectives:
1. To explore the Concept of Regression
2. To learn the Linear Regression
3. To explore Blue Property Assumptions
4. To Learn the Logistic Regression
5. To understand the Model Theory and Applications
Unit-3 Outcomes:
After completion of this course students will be able to
1. To Describe the Concept of Regression
2. To demonstrate Linear Regression
3. To analyze the Blue Property Assumptions
4. To explore the Logistic Regression
5. To describe the Model Theory and Applications
Regression – Concepts:
Introduction:
The term regression is used to indicate the estimation or prediction of the average
value of one variable for a specified value of another variable.
Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables.
Regression analysis is a statistical process for estimating the relationships between the dependent variables (criterion variables / response variables) and one or more independent variables (predictor variables).
Regression describes how an independent variable is numerically related to the
dependent variable.
Regression can be used for prediction, estimation and hypothesis testing, and modeling
causal relationships.
Simple Linear Regression: for an input x and output y, the model is y = B0 + B1*x, and the coefficients are estimated as

B1 = Σ (xi − mean(x)) * (yi − mean(y)) / Σ (xi − mean(x))², with the sums taken over i = 1 … n

B0 = mean(y) − B1 * mean(x)
If we had multiple input attributes (e.g., x1, x2, x3, etc.), this would be called multiple linear regression. The procedure for simple linear regression is different from, and simpler than, that for multiple linear regression.
Let us consider the following example for the equation y = 2*x + 3:

x     y     xi-mean(x)   yi-mean(y)   (xi-mean(x))*(yi-mean(y))   (xi-mean(x))²
-3    -3    -4.4         -8.8         38.72                       19.36
-1     1    -2.4         -4.8         11.52                        5.76
 2     7     0.6          1.2          0.72                        0.36
 4    11     2.6          5.2         13.52                        6.76
 5    13     3.6          7.2         25.92                       12.96
mean(x) = 1.4, mean(y) = 5.8           Sum = 90.4                  Sum = 45.2
Applying the formulas

B1 = Σ (xi − mean(x)) * (yi − mean(y)) / Σ (xi − mean(x))²
B0 = mean(y) − B1 * mean(x)

we get B1 = 90.4 / 45.2 = 2 and B0 = 5.8 − 2 * 1.4 = 3.
Example for Linear Regression using R:
Consider the following data set:
x = {1,2,4,3,5} and y = {1,3,3,2,5}
We use R to apply Linear Regression for the above data.
> rm(list=ls()) #removes the list of variables in the current session of R
> x<-c(1,2,4,3,5) #assigns values to x
> y<-c(1,3,3,2,5) #assigns values to y
> x;y
[1] 1 2 4 3 5
[1] 1 3 3 2 5
> graphics.off() #to clear the existing plot/s
> plot(x,y,pch=16, col="red")
> relxy<-lm(y~x)
> relxy
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
        0.4          0.8
> abline(relxy,col="Blue")
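As a follow-up to the session above (these two calls are supplementary, not part of the original listing), the fitted model can be inspected and used for prediction:

> summary(relxy)                            # coefficients, residuals and R-squared
> predict(relxy, data.frame(x = c(6, 7)))   # predicted y for x = 6 and 7: 5.2 and 6.0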
The Root Mean Squared Error (RMSE) is

RMSE = sqrt( Σ (pi − yi)² / n ),

where p is the predicted value and y is the actual value, i is the index for a specific instance, and n is the number of predictions, because we must calculate the error across all predicted values.
Estimating the error for y = 0.8*x + 0.4:

x    y (actual)   p (predicted)   p − y    (p − y)²
1    1            1.2              0.2     0.04
2    3            2.0             -1.0     1.00
4    3            3.6              0.6     0.36
3    2            2.8              0.8     0.64
5    5            4.4             -0.6     0.36

mean(x) = 3; s = sum of (p − y)² = 2.4; s/n = 2.4 / 5 = 0.48; RMSE = sqrt(0.48) ≈ 0.693
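The same RMSE can be computed directly in R; this short sketch reuses the x, y values and the fitted line y = 0.8*x + 0.4 from the example above.

x <- c(1, 2, 4, 3, 5)
y <- c(1, 3, 3, 2, 5)
p <- 0.8 * x + 0.4             # predicted values
rmse <- sqrt(mean((p - y)^2))  # root mean squared error
rmse                           # approximately 0.693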
Homoscedasticity vs Heteroscedasticity:
1. Stepwise Forward Selection: This procedure starts with an empty set of attributes as the minimal set. The most relevant attribute is chosen (having the minimum p-value) and added to the minimal set. In each iteration, one attribute is added to the reduced set.
2. Stepwise Backward Elimination: Here all the attributes are considered in the initial set of attributes. In each iteration, the attribute whose p-value is higher than the significance level is eliminated from the set of attributes.
3. Combination of Forward Selection and Backward Elimination: The stepwise forward selection and backward elimination are combined so as to select the relevant attributes most efficiently. This is the most common technique, which is generally used for attribute selection.
4. Decision Tree Induction: This approach uses a decision tree for attribute selection. It constructs a flow-chart-like structure having nodes denoting a test on an attribute. Each branch corresponds to the outcome of a test, and each leaf node denotes a class prediction. The attributes that are not part of the tree are considered irrelevant and hence discarded.
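As an illustrative sketch only (base R's step() selects attributes by AIC rather than by p-values, and the mtcars variables are our choice, not from the notes), stepwise selection can be run as follows:

full <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)  # start with all candidate attributes
reduced <- step(full, direction = "both")               # combined forward selection / backward elimination
summary(reduced)                                        # the rationalized set of attributes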
Model Building involves the following steps:
1. Problem Definition
2. Hypothesis Generation
3. Data Collection
4. Data Exploration/Transformation
5. Predictive Modelling
6. Model Deployment
1. Problem Definition
The first step in constructing a model is to
understand the industrial problem in a more comprehensive way. To identify the purpose of
the problem and the prediction target, we must define the project objectives appropriately.
Therefore, to proceed with an analytical approach, we have to recognize the obstacles first.
Remember, excellent results always depend on a better understanding of the problem.
2. Hypothesis Generation
Hypothesis generation is the guessing approach through which we derive some essential
data parameters that have a significant correlation with the prediction target.
Your hypothesis research must be in-depth, taking every perspective of all stakeholders into account. We search for every suitable factor that can influence the outcome.
Hypothesis generation focuses on what you can create rather than what is available in the
dataset.
3. Data Collection
Data collection is the process of gathering data from relevant sources for the analytical problem; we then extract meaningful insights from the data for prediction.
4. Data Exploration/Transformation
The data you collected may be in unfamiliar shapes and sizes. It may contain unnecessary
features, null values, unanticipated small values, or immense values. So, before applying
any algorithmic model to data, we have to explore it first.
By inspecting the data, we get to understand the explicit and hidden trends in data. We find
the relation between data features and the target variable.
Usually, a data scientist invests 60–70% of the project time in data exploration alone.
There are several sub steps involved in data exploration:
o Feature Identification:
You need to analyze which data features are available and which ones are
not.
Identify independent and target variables.
Identify data types and categories of these variables.
o Univariate Analysis:
We inspect each variable one by one. This kind of analysis depends on the variable type, whether it is categorical or continuous.
Continuous variable: We mainly look for statistical trends like mean,
median, standard deviation, skewness, and many more in the dataset.
Categorical variable: We use a frequency table to understand the
spread of data for each category. We can measure the counts and
frequency of occurrence of values.
o Bi-variate / Multi-variate Analysis:
This analysis helps to discover the relation between two or more variables.
We can find the correlation in case of continuous variables and the case of
categorical, we look for association and dissociation between them.
o Filling Null Values:
Usually, the dataset contains null values, which lower the potential of the model. For a continuous variable, we fill these null values using the mean or median of that specific column. For the null values present in a categorical column, we replace them with the most frequently occurring categorical value. Remember, don't delete those rows, because you may lose information.
5. Predictive Modeling
Predictive modeling is a mathematical approach to create a statistical model to forecast
future behavior based on input test data.
Steps involved in predictive modeling:
Algorithm Selection:
o When we have the structured dataset, and we want to estimate the continuous or
categorical outcome then we use supervised machine learning methodologies like
regression and classification techniques. When we have unstructured data and want
to predict the clusters of items to which a particular input test sample belongs, we
use unsupervised algorithms. In practice, a data scientist applies multiple algorithms to get a more accurate model.
Train Model:
o After assigning the algorithm and getting the data handy, we train our model using
the input data, applying the preferred algorithm. This step determines the correspondence between the independent variables and the prediction targets.
Model Prediction:
o We make predictions by giving the input test data to the trained model. We measure the accuracy using a cross-validation strategy or an ROC curve, which works well for deriving model output for test data (a small R sketch of this train/predict/evaluate workflow follows at the end of this section).
6. Model Deployment
There is nothing better than deploying the model in a real-time environment. It helps us to
gain analytical insights into the decision-making procedure. You constantly need to update
the model with additional features for customer satisfaction.
To predict business decisions, plan market strategies, and create personalized customer
interests, we integrate the machine learning model into the existing production domain.
When you go through the Amazon website, you notice product recommendations based entirely on your interests. You can experience the increase in customer involvement utilizing these services. That's how a deployed model changes the mindset of the customer and convinces them to purchase the product.
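A small end-to-end sketch of the train/predict/evaluate workflow described in the predictive modelling step (the dataset, the 70/30 split, and the am ~ wt + hp formula are illustrative choices of ours; on such a small sample the logistic fit may emit convergence warnings):

set.seed(1)
idx   <- sample(seq_len(nrow(mtcars)), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]                                       # training data
test  <- mtcars[-idx, ]                                      # held-out test data
fit   <- glm(am ~ wt + hp, data = train, family = binomial)  # train the model
prob  <- predict(fit, newdata = test, type = "response")     # predicted probabilities
pred  <- ifelse(prob > 0.5, 1, 0)                            # threshold at 0.5
mean(pred == test$am)                                        # test-set accuracy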
Logistic Regression:
Model Theory, Model fit Statistics, Model Construction
Introduction:
Logistic regression is one of the most popular Machine Learning algorithms, which comes under the Supervised Learning technique. It is used for predicting a categorical dependent variable using a given set of independent variables.
The outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, true or false, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
The curve from the logistic function indicates the likelihood of something, such as whether or not cells are cancerous, or whether a mouse is obese based on its weight, etc.
Because it uses the concept of predictive modelling as regression, it is called logistic regression, but it is used to classify samples; therefore, it falls under the classification algorithms.
In logistic regression, we use the concept of the threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
On the basis of the categories of the dependent variable, logistic regression can be binomial, multinomial, or ordinal. Ordinal: in ordinal logistic regression, there can be 3 or more possible ordered types of dependent variables, such as "low", "medium", or "high".
Definition: Multicollinearity:
Multicollinearity is a statistical phenomenon in which multiple independent variables show high correlation with each other, i.e. they are too inter-related.
Multicollinearity, also called collinearity, is an undesired situation for any statistical regression model since it diminishes the reliability of the model itself.
If two or more independent variables are too correlated, the data obtained from the
regression will be disturbed because the independent variables are actually dependent
between each other.
Assumptions for Logistic Regression:
The dependent variable must be categorical in nature.
The independent variable should not have multi-collinearity.
Logistic Regression Equation:
The logistic regression equation can be obtained from the linear regression equation. Logistic regression uses a more complex cost function; this cost function is the 'sigmoid function', also known as the 'logistic function', instead of a linear function.
The hypothesis of logistic regression tends to limit the cost function between 0 and 1. Therefore, linear functions fail to represent it, as they can have values greater than 1 or less than 0.
The sigmoid function maps any real value into another value within the range of 0 and 1. The value of the logistic regression must be between 0 and 1, which cannot go beyond this limit, so it forms a curve like the "S" form.
The logistic (sigmoid) function is:

z = sigmoid(y) = σ(y) = 1 / (1 + e^(−y))
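A two-line R sketch of this function (supplementary to the notes):

sigmoid <- function(y) 1 / (1 + exp(-y))
sigmoid(c(-4, 0, 4))   # approximately 0.018, 0.500, 0.982 - always between 0 and 1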
Hypothesis Representation
When using linear regression, we used the following formula for the line equation:

y = b0 + b1*x1 + b2*x2 + ... + bn*xn

In the above equation, y is the response variable, x1, x2, ..., xn are the predictor variables, and b0, b1, b2, ..., bn are the coefficients, which are numeric constants.
> rm(list=ls())
> attach(mtcars) #attaching a
data set into the R environment
> input <- mtcars[,c("mpg","disp","hp","wt")]
> head(input)
mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215
Hornet Sportabout 18.7 360 175 3.440
Valiant 18.1 225 105 3.460
> #model<-lm(mpg~disp+hp+wt);model # Show the model
> model<-glm(mpg~disp+hp+wt);model
Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891
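Note that the glm() call above uses the default Gaussian family, so it fits an ordinary linear model. For an actual logistic regression the family must be binomial; the following supplementary lines (the choice of the binary variable am and of the predictors wt and hp is ours, not from the notes) show the idea on the same mtcars data:

> logit_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)
> summary(logit_model)$coefficients
> predict(logit_model, newdata = data.frame(wt = 2.5, hp = 110), type = "response")  # probability of am = 1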
Confusion Matrix:
A confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making. It is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one. The four outcomes are:
True Positive
True Negative
False Positive – Type 1 Error
False Negative – Type 2 Error
Understanding True Positive, True Negative, False Positive and False Negative in a Confusion Matrix
True Positive (TP)
The predicted value matches the actual value
The actual value was positive and the model predicted a positive value
True Negative (TN)
The predicted value matches the actual value
The actual value was negative and the model predicted a negative value
False Positive (FP) – Type 1 error
The predicted value was falsely predicted
The actual value was negative but the model predicted a positive value
Also known as the Type 1 error
False Negative (FN) – Type 2 error
The predicted value was falsely predicted
The actual value was positive but the model predicted a negative value
Also known as the Type 2 error
To evaluate the performance of a model, we have the performance metrics called,
Accuracy, Precision, Recall & F1-Score metrics
Accuracy:
Accuracy is the most intuitive performance measure and it is simply a ratio of correctly
predicted observation to the total observations.
Accuracy is a great measure, but it is dependable only when you have symmetric datasets, where the numbers of false positives and false negatives are almost the same.
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision:
Precision is the ratio of correctly predicted positive observations to the total predicted
positive observations.
It tells us how many of the correctly predicted cases actually turned out to be positive.
Precision = TP / (TP + FP)
Precision is a useful metric in cases where False Positives are a higher concern than False Negatives.
Recall:
Recall is the ratio of correctly predicted positive observations to all the observations in the actual positive class.

Recall = TP / (TP + FN)
Recall is a useful metric in cases where False Negative trumps False Positive.
Recall is important in medical cases where it doesn’t matter whether we raise a
false alarm but the actual positive cases should not go undetected!
F1-Score:
F1-score is a harmonic mean of Precision and Recall. It gives a combined idea about these
two metrics. It is maximum when Precision is equal to Recall.
Therefore, this score takes both false positives and false negatives into account.
F1 Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 / (1/Recall + 1/Precision)
F1 is usually more useful than accuracy, especially if you have an uneven class
distribution.
Accuracy works best if false positives and false negatives have similar cost.
If the cost of false positives and false negatives are very different, it’s better to look at
both Precision and Recall.
But there is a catch here: the interpretability of the F1-score is poor, meaning that we don't know what our classifier is maximizing – precision or recall. So, we use it in combination with other evaluation metrics, which gives us a complete picture of the result.
Example:
Suppose we had a classification dataset with 1000 data points. We fit a classifier on it and get the following counts:
True Positive (TP) = 560, meaning 560 positive class data points were correctly classified by the model.
True Negative (TN) = 330, False Positive (FP) = 60, False Negative (FN) = 50.
Precision:
It tells us how many of the correctly predicted cases actually turned out to be positive.
Precision = TP / (TP + FP)
This would determine whether our model is reliable or not.
Recall tells us how many of the actual positive cases we were able to predict correctly with
our model.
We can easily calculate Precision and Recall for our model by plugging the values into the above equations:

Precision = TP / (TP + FP) = 560 / (560 + 60) = 0.903

Recall = TP / (TP + FN) = 560 / (560 + 50) = 0.918
F1-Score

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
         = 2 * (0.903 * 0.918) / (0.903 + 0.918)
         = 2 * (0.8289 / 1.821) = 2 * 0.4552 ≈ 0.910
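These calculations can be verified with a few lines of R (the TP, FP and FN counts are taken from the example above; TN = 330 is implied by the 1000 total data points):

TP <- 560; TN <- 330; FP <- 60; FN <- 50
accuracy  <- (TP + TN) / (TP + FP + TN + FN)                # 0.89
precision <- TP / (TP + FP)                                 # ~0.903
recall    <- TP / (TP + FN)                                 # ~0.918
f1        <- 2 * precision * recall / (precision + recall)  # ~0.910
c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)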
AUC – ROC Curve:
The AUC–ROC curve is one of the most important evaluation metrics for checking any classification model's performance. It is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. By analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and no disease.
The ROC curve is plotted with TPR against the FPR where TPR is on the y-axis and FPR is on the
x-axis.
ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of
a classification model at all classification thresholds. This curve plots two parameters:
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
TPR = TP / (TP + FN)
False Positive Rate (FPR) is defined as follows:
FPR = FP / (FP + TN)
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification
threshold classifies more items as positive, thus increasing both False Positives and True Positives.
The following figure shows a typical ROC curve.
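The following base-R sketch (with synthetic scores and labels, purely for illustration) shows how the (FPR, TPR) points of an ROC curve are obtained by sweeping the classification threshold:

set.seed(42)
labels <- rbinom(100, 1, 0.5)                      # actual classes (0/1)
scores <- runif(100)^ifelse(labels == 1, 0.5, 1.5) # positives tend to get higher scores
roc_point <- function(th) {
  pred <- as.integer(scores >= th)                 # classify at threshold th
  c(FPR = sum(pred == 1 & labels == 0) / sum(labels == 0),
    TPR = sum(pred == 1 & labels == 1) / sum(labels == 1))
}
t(sapply(seq(0, 1, by = 0.1), roc_point))          # one (FPR, TPR) pair per threshold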
Business analytics (BA) is the combination of skills, technologies, and practices used to
examine an organization's data and performance as a way to gain insights and make
data-driven decisions in the future using statistical analysis.
Although business analytics is being leveraged in most commercial sectors and industries, the most common application domains are the ones already discussed in Unit II: credit card companies, customer relationship management (CRM), finance, human resources, manufacturing, and marketing.
TO BE DISCUSSED:
Receiver Operating Characteristics:
ROC & AUC
Here, we add the constant term b0 by setting x0 = 1, which gives us K+1 parameters:

ln( P / (1 − P) ) = b0*x0 + b1*x1 + … + bK*xK

The left-hand side of this equation is called the logit of P (hence, the name logistic regression).
The right-hand side of the top equation is the sigmoid of z, which maps the real line to the interval (0, 1) and is approximately linear near the origin. A useful fact about P(z) is that the derivative P'(z) = P(z) (1 − P(z)). Here's the derivation: with P(z) = 1 / (1 + e^(−z)), we have P'(z) = e^(−z) / (1 + e^(−z))² = P(z) * (1 − P(z)).
Later, we will want to take the gradient of P with respect to the set of coefficients b,
rather than z. In that case, P'(z) = P(z) (1 – P(z))z‘, where ‘ is the gradient taken with
respect to b.
UNIT - IV
Object Segmentation & Time Series Methods
Syllabus:
Object Segmentation: Regression Vs Segmentation – Supervised and Unsupervised
Learning, Tree Building – Regression, Classification, Overfitting, Pruning and Complexity,
Multiple Decision Trees etc.
Time Series Methods: Arima, Measures of Forecast Accuracy, STL approach, Extract
features from generated model as Height, Average Energy etc and analyze for prediction
Topics:
Object Segmentation:
Supervised and Unsupervised Learning
Segmentation & Regression Vs Segmentation
Regression, Classification, Overfitting,
Decision Tree Building
Pruning and Complexity
Multiple Decision Trees etc.
Unit-4 Objectives:
1. To explore the Segmentation & Regression Vs Segmentation
2. To learn the Regression, Classification, Overfitting
3. To explore Decision Tree Building, Multiple Decision Trees etc.
4. To Learn the Arima, Measures of Forecast Accuracy
5. To understand the STL approach
Unit-4 Outcomes:
After completion of this course students will be able to
1. To Describe the Segmentation & Regression Vs Segmentation
2. To demonstrate Regression, Classification, Overfitting
3. To analyze the Decision Tree Building, Multiple Decision Trees etc.
4. To explore the Arima, Measures of Forecast Accuracy
5. To describe the STL approach
Supervised Machine Learning:
Supervised learning is a machine learning method in which the model is trained on labelled data, i.e. input data together with the correct output, so that it learns to map inputs to outputs.
Example: Suppose we have an image of different types of fruits. The task of our
supervised learning model is to identify the fruits and classify them accordingly. So
to identify the image in supervised learning, we will give the input data as well as
output for that, which means we will train the model by the shape, size, color, and
taste of each fruit. Once the training is completed, we will test the model by giving
the new set of fruit. The model will identify the fruit and predict the output using a
suitable algorithm.
Unsupervised Machine Learning:
Unsupervised learning is another machine learning method, in which patterns are inferred from unlabeled input data. The goal of unsupervised learning is to find the structure and patterns in the input data. Unsupervised learning does not need any supervision; instead, it finds patterns from the data on its own.
Unsupervised learning can be used for two types of problems: Clustering and
Association.
Example: To understand the unsupervised learning, we will use the example given
above. So unlike supervised learning, here we will not provide any supervision to
the model. We will just provide the input dataset to the model and allow the
model to find the patterns from the data. With the help of a suitable algorithm, the
model will train itself and divide the fruits into different groups according to the
most similar features between them.
The main differences between Supervised and Unsupervised learning are given below:

Supervised Learning | Unsupervised Learning
Takes direct feedback to check if it is predicting the correct output or not. | Does not take any feedback.
Predicts the output. | Finds the hidden patterns in data.
Input data is provided to the model along with the output. | Only input data is provided to the model.
The goal is to train the model so that it can predict the output when it is given new data. | The goal is to find the hidden patterns and useful insights from the unknown dataset.
Needs supervision to train the model. | Does not need any supervision to train the model.
Can be used where we know the inputs as well as the corresponding outputs. | Can be used where we have only input data and no corresponding output data.
Produces an accurate result. | May give a less accurate result compared to supervised learning.
Is not close to true Artificial Intelligence, as we first train the model for each data point and only then can it predict the correct output. | Is closer to true Artificial Intelligence, as it learns in the same way that a child learns daily routine things from experience.
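A compact R sketch contrasting the two approaches on the built-in iris data (the choice of dataset and of rpart/kmeans is illustrative, not prescribed by the notes): the decision tree is trained with the species labels, while k-means only sees the four measurements.

library(rpart)
# Supervised: the labels (Species) are provided during training
tree_fit <- rpart(Species ~ ., data = iris, method = "class")
table(predicted = predict(tree_fit, type = "class"), actual = iris$Species)
# Unsupervised: only the inputs are given; three groups are discovered
set.seed(7)
clusters <- kmeans(iris[, 1:4], centers = 3)
table(cluster = clusters$cluster, actual = iris$Species)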
Segmentation
Segmentation refers to the act of segmenting data according to your company’s
needs in order to refine your analyses based on a defined context. It is a
technique of splitting customers into separate groups depending on their attributes
or behavior.
Steps:
Define purpose – Already mentioned in the statement above
Identify critical parameters – Some of the variables which come to mind are skill, motivation, vintage, department, education, etc. Let us say that, on the basis of past experience, we know that skill and motivation are the most important parameters. Also, for the sake of simplicity, we select just 2 variables. Taking additional variables will increase the complexity, but can be done if it adds value.
Granularity – Let us say we are able to classify both skill and motivation into High
and Low using various techniques.
Objective Segmentation
Segmentation to identify the type of customers who would respond to a
particular offer.
Segmentation to identify high spenders among customers who will use the
e-commerce channel for festive shopping.
Segmentation to identify customers who will default on their credit obligation for
a loan or credit card.
Non-Objective Segmentation
https://www.yieldify.com/blog/types-of-market-segmentation/
Segmentation of the customer base to understand the specific profiles which exist
within the customer base so that multiple marketing actions can be
personalized for each segment
Segmentation of geographies on the basis of affluence and lifestyle of people living
in each geography so that sales and distribution strategies can be formulated
accordingly.
Hence, it is critical that the segments created on the basis of an objective
segmentation methodology must be different with respect to the stated objective
(e.g. response to an offer).
However, in case of a non-objective methodology, the segments are different with
respect to the “generic profile” of observations belonging to each segment, but not
with regards to any specific outcome of interest.
The most common techniques for building non-objective segmentation are cluster
analysis, K nearest neighbor techniques etc.
Regression Vs Segmentation
Regression analysis focuses on finding a relationship between a dependent variable
and one or more independent variables.
Predicts the value of a dependent variable based on the value of at least
one independent variable.
Explains the impact of changes in an independent variable on the
dependent variable.
We use linear or logistic regression technique for developing accurate models
for predicting an outcome of interest.
Often, we create separate models for separate segments.
Segmentation methods such as CHAID or CRT are used to judge their effectiveness.
Creating separate model for separate segments may be time consuming and not
worth the effort. But, creating separate model for separate segments may provide
higher predictive power.
Decision Tree is a supervised learning technique that can be used for both classification and regression problems.
Decision Trees usually mimic human thinking ability while making a decision, so
it is easy to understand.
A decision tree simply asks a question and, based on the answer (Yes/No), further splits into sub-trees. In a decision tree, internal nodes represent the features of the dataset, branches represent the decision rules, and each leaf node represents the outcome.
In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any
further branches.
Basic Decision Tree Learning Algorithm:
Now that we know what a Decision Tree is, we’ll see how it works internally. There
are many algorithms out there which construct Decision Trees, but one of the
best is called as ID3 Algorithm. ID3 Stands for Iterative Dichotomiser 3.
Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.
Decision Tree Representation:
Each non-leaf node is connected to a test that splits its set of possible answers into subsets corresponding to the different outcomes of the test.
More specifically, decision trees classify instances by sorting them down the tree
from the root node to some leaf node, which provides the classification of the
instance. Each node in the tree specifies a test of some attribute of the
instance,
and each branch descending from that node corresponds to one of the possible
values for this attribute.
An instance is classified by starting at the root node of the decision tree, testing the
attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute. This process is then repeated at the
node on this branch and so on until a leaf node is reached.
Decision tree learning is generally best suited to problems with the following characteristics:
Instances are represented by attribute-value pairs.
o There is a finite list of attributes (e.g. hair colour) and each instance stores a value for that attribute (e.g. blonde).
The target function has discrete output values (Boolean classification).
The training data may contain missing attribute values.
o Decision tree methods can be used even when some training
examples have unknown values (e.g., humidity is known for only a
fraction of the examples).
After a decision tree learns classification rules, it can also be re-represented as a set of if-then rules.
Decision trees use multiple algorithms to decide whether to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In other words, we can say that the purity of the node increases with respect to the target variable. The decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes.
Tree Building: Decision tree learning is the construction of a decision tree from class-
labeled training tuples. A decision tree is a flow-chart-like structure, where each internal
(non-leaf) node denotes a test on an attribute, each branch represents the outcome of a
test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the
root node. There are many specific decision-tree algorithms. Notable ones include the
following.
ID3 → (extension of D3)
C4.5 → (successor of ID3)
CART → (Classification And Regression Tree)
CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
The ID3 algorithm builds decision trees using a top-down greedy search approach through
the space of possible branches with no backtracking. A greedy algorithm, as the name
suggests, always makes the choice that seems to be the best at that moment.
In a decision tree, for predicting the class of the given dataset, the algorithm starts
from the root node of the tree. This algorithm compares the values of the root attribute with the record (real dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-
nodes and move further. It continues the process until it reaches the leaf node of the tree.
The complete process can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, says S, which contains the complete
dataset.
Step-2: Find the best attribute in the dataset using Attribute Selection Measure
(ASM).
Step-3: Divide the S into subsets that contains possible values for the best
attributes.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Step-6: Continue this process until a stage is reached where you cannot further classify the nodes; call the final node a leaf node.
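A minimal sketch of growing a classification tree in R with the rpart package (rpart implements CART-style splitting rather than ID3, and the kyphosis data shipped with rpart is used purely as an example):

library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
print(fit)                                        # the splits, node sizes and predicted classes
plot(fit, margin = 0.1); text(fit, use.n = TRUE)  # a simple plot of the tree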
Entropy:
Entropy is a measure of the randomness in the information being processed. For a binary outcome with probability p of the positive class,

Entropy H(X) = −p log2(p) − (1 − p) log2(1 − p)

The entropy H(X) is zero when the probability is either 0 or 1, and it is maximum when the probability is 0.5, because that projects perfect randomness in the data and there is no chance of perfectly determining the outcome.
Information Gain
Information gain (IG) is a statistical property that measures how well a given attribute separates the training examples according to their target classification.
In order to derive the hypothesis space, we compute the Entropy and Information Gain of the class and attributes. For these we use the following formulae:
I(p, n) = − (p / (p + n)) log2( p / (p + n) ) − (n / (p + n)) log2( n / (p + n) )

Entropy(Attribute) = Σ_i ( (p_i + n_i) / (p + n) ) * I(p_i, n_i)

Information Gain(Attribute) = I(p, n) − Entropy(Attribute)

where p and n are the numbers of positive and negative examples in the whole set, and p_i and n_i are the numbers of positive and negative examples for the i-th value of the attribute.
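The formulas can be turned into a small R helper (a sketch; the demonstration counts below correspond to the commonly used "play tennis" weather example, which may or may not be the data set referred to below):

I_pn <- function(p, n) {                       # entropy I(p, n) of a p/n split
  f <- c(p, n) / (p + n)
  f <- f[f > 0]                                # treat 0 * log2(0) as 0
  -sum(f * log2(f))
}
info_gain <- function(p, n, counts) {          # counts: one row (p_i, n_i) per attribute value
  w <- rowSums(counts) / (p + n)
  I_pn(p, n) - sum(w * apply(counts, 1, function(r) I_pn(r[1], r[2])))
}
counts <- rbind(c(2, 3), c(4, 0), c(3, 2))     # e.g. the three values of "Outlook"
info_gain(9, 5, counts)                        # about 0.246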
Data set:
Advantages of decision trees:
Requires little data preparation. Other techniques often require data normalization, dummy variables need to be created and blank values to be removed.
Able to handle both numerical and categorical data. Other techniques are usually
specialized in analysing datasets that have only one type of variable. (For example,
relation rules can be used only with nominal variables while neural networks can be
used only with numerical variables.)
Uses a white box model. If a given situation is observable in a model the
explanation for the condition is easily explained by Boolean logic. (An example of a
black box model is an artificial neural network since the explanation for the results is
difficult to understand.)
Possible to validate a model using statistical tests. That makes it possible to
account for the reliability of the model.
Robust: Performs well with large datasets. Large amounts of data can be analyzed
using standard computing resources in reasonable time.
Tools used to make Decision Tree:
Many data mining software packages provide implementations of one or more decision tree algorithms, for example:
Weka (a free and open-source data mining suite, contains many decision tree
algorithms)
Orange (a free data mining software suite, which includes the tree module orngTree)
KNIME
Microsoft SQL Server
Scikit-learn (a free and open-source machine learning library for the Python
programming language).
Salford Systems CART (which licensed the proprietary code of the original
CART authors)
IBM SPSS Modeler
Rapid Miner
Classification Trees:
A classification tree is an algorithm in which the target variable is fixed or categorical; the algorithm is used to identify the 'class' within which the target variable would most likely fall, for example whether or not a student will graduate from high school.
These are examples of simple binary classifications where the categorical dependent variable can assume only one of two mutually exclusive values.
Regression Trees
A regression tree refers to an algorithm in which the target variable is a continuous variable and the tree is used to predict its value; the prediction can depend on both continuous and categorical factors.
Classification trees are used when the dependent variable is categorical; in the simplest case there are just two classes, which are mutually exclusive. In some cases, there may be more than two classes, in which case a variant of the classification tree algorithm is used.
Regression trees, on the other hand, are used when the response variable is continuous. For instance, if the response variable is something like the price of a property or the temperature of the day, a regression tree is used.
classification trees are used for classification-type problems.
CART (Classification and Regression Trees):
A CART tree is a binary decision tree that is constructed by splitting a node into two child nodes repeatedly, beginning with the root node that contains the whole learning sample. The CART
growing method attempts to maximize within-node homogeneity.
The extent to which a node does not represent a homogenous subset of cases is an
indication of impurity. For example, a terminal node in which all cases have the
same
value for the dependent variable is a homogenous node that requires no further
splitting because it is "pure." For categorical (nominal, ordinal) dependent variables the
common measure of impurity is Gini, which is based on squared probabilities of
membership for each category. Splits are found that maximize the homogeneity of the child nodes with respect to the value of the dependent variable.
Pruning:
Pruning is a technique that reduces the size of decision trees by removing sections of the tree that are non-critical and redundant for classifying instances. Pruning reduces the complexity of the final classifier and hence improves predictive accuracy by reducing overfitting.
One of the questions that arises in a decision tree algorithm is the optimal size of the final
tree. A tree that is too large risks overfitting the training data and poorly generalizing to
new samples. A small tree might not capture important structural information about the
sample space. However, it is hard to tell when a tree algorithm should stop, because it is impossible to tell whether the addition of a single extra node will dramatically decrease error. This problem is known as the horizon effect. A common strategy is to grow the tree until each node contains a small number of instances and then use
pruning to remove nodes that do not provide additional information. Pruning should
reduce the size of a learning tree without reducing predictive accuracy as measured by
a cross-
validation set. There are many techniques for tree pruning, which differ in the measurement that is used to optimize performance.
Pruning Techniques:
Pruning processes can be divided into two types: PrePruning & Post Pruning
Pre-pruning procedures prevent a complete induction of the training set by replacing a stop() criterion in the induction algorithm (e.g. maximum tree depth or information gain(Attr) > minGain). They are considered to be more efficient because they do not induce an entire tree; rather, trees remain small from the start.
Post-Pruning (or just pruning) is the most common way of simplifying trees. Here,
nodes and subtrees are replaced with leaves to reduce complexity.
The procedures are differentiated on the basis of their approach in the tree: top-down or bottom-up.
One of these representatives is pessimistic error pruning (PEP), which brings quite good results with unseen items.
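rpart offers cost-complexity (post-)pruning rather than PEP, but the workflow is similar; a brief sketch (over-growing a tree on the kyphosis data and pruning back to the complexity parameter with the lowest cross-validated error):

library(rpart)
set.seed(3)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", cp = 0.001)      # deliberately over-grown tree
printcp(fit)                                    # cp table with cross-validated error (xerror)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)             # the pruned, simpler tree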
CHAID:
CHAID stands for CHI-squared Automatic Interaction Detector. Morgan and Sonquist
(1963) proposed a simple method for fitting trees to predict a quantitative
variable.
Each predictor is tested for splitting as follows: sort all the n cases on the predictor
and examine all n-1 ways to split the cluster in two. For each possible split, compute
the within-cluster sum of squares about the mean of the cluster on the dependent
variable.
Choose the best of the n-1 splits to represent the predictor’s contribution. Now do
this for every other predictor. For the actual split, choose the predictor and its cut
point which yields the smallest overall within-cluster sum of squares. Categorical
predictors require a different approach. Since categories are unordered, all possible
splits between categories must be considered. For deciding on one split of k
categories into two groups, this means that 2k-1 possible splits must be considered.
Once a split is found, its suitability is measured on the same within-cluster sum of squares criterion.
In the tree model, it is represented by branches from the same nodes which have
different splitting predictors further down the tree. Regression trees parallel
regression/ANOVA modeling, in which the dependent variable is quantitative.
Kass (1980) proposed a modification to AID called CHAID for categorized dependent
and independent variables. His algorithm incorporated a sequential merge and split
procedure based on a chi-square test statistic.
Kass’s algorithm is like sequential cross-tabulation. For each predictor:
1) cross tabulate the m categories of the predictor with the k categories of the
dependent variable.
2) find the pair of categories of the predictor whose 2xk sub-table is least significantly different on a chi-square test and merge these two categories;
3) if the chi-square test statistic is not significant according to a preset critical value, repeat this merging process for the selected predictor until no non-significant chi-square is found for a sub-table; then pick the predictor variable whose chi-square is largest and split the sample into l subsets, where l is the number of categories resulting from the merging process on that predictor.
The CHAID algorithm saves some computer time, but it is not guaranteed to find the splits that predict best.
Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. Gini impurity can be computed by summing the probability f_i of each item being chosen times the probability (1 − f_i) of a mistake in categorizing that item:

Gini = Σ_i f_i (1 − f_i) = 1 − Σ_i f_i²
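A tiny R sketch of this definition (supplementary to the notes):

gini <- function(labels) {
  f <- table(labels) / length(labels)  # class proportions f_i
  sum(f * (1 - f))                     # equivalently 1 - sum(f^2)
}
gini(c("yes", "yes", "yes", "no"))     # 0.375 for a 3/4 vs 1/4 split
gini(c("yes", "yes", "no", "no"))      # 0.5, the maximum for two classes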
Let’s clearly understand overfitting, underfitting and perfectly fit models.
From the three graphs shown above, one can clearly understand that in the leftmost figure the line does not cover all the data points, so we can say that the model is under-fitted. In this case, the model has failed to generalize the pattern to new data. Underfitting is easily seen, as it gives very high errors on both training and testing data. This happens when the dataset is not clean and contains noise, the model has high bias, or the size of the training data is not enough.
When it comes to overfitting, the rightmost graph shows the model covering all the data points exactly, and you might think this is a perfect fit. But actually, no, it is not a good fit! Because the model learns too many details from the dataset, it also learns the noise. Thus, it negatively affects performance on new data: not every detail that the model has learned during training also applies to new data points, which gives poor performance on the testing or validation dataset. This is because the model has trained itself in a very complex manner and has high variance.
The best fit model is shown by the middle graph, where both training and testing
(validation) loss are minimum, or we can say training and testing accuracy should
be near each other and high in value.
Time Series Analysis:
Time series analysis is used in a wide range of fields, from econometrics to geology and earthquake prediction; it is also used in almost all applied sciences and engineering.
They are also useful for studying natural phenomena like atmospheric pressure,
temperature, wind speeds, earthquakes, and medical prediction for treatment.
Time series data is data that is observed at different points in time. Time Series
Analysis finds hidden patterns and helps obtain useful insights from the time series
data.
Time Series Analysis is useful in predicting future values or detecting anomalies
from the data. Such analysis typically requires many data points to be present
in the dataset to ensure consistency and reliability.
The different types of models and analyses that can be created through time series
analysis are:
o Classification: To Identify and assign categories to the data.
o Curve fitting: Plot the data along a curve and study the relationships of
variables present within the data.
o Intervention analysis: The Study of how an event can change the data.
o Segmentation: Splitting the data into segments to discover the underlying properties from the source information.
Components of a time series:
Secular trend – The general tendency of the series to increase or decrease over a long period of time.
Seasonal variation – Patterns of change in a time series within a year which tend to
repeat every year.
Cyclical variation – Its much alike seasonal variation but the rise and fall of time series
over periods are longer than one year.
Irregular variation – Any variation that is not explainable by any of the three above
mentioned components. They can be classified into – stationary and non – stationary
variation.
Stationary variation: When the data neither increases nor decreases, i.e. it is completely random, it is called stationary variation. When the data has some explainable portion remaining and can be analyzed further, such a case is called non-stationary variation.
Univariate models such as these are used to better understand a single time-dependent variable present in the data, such as temperature over time. They predict future data points from that variable.
ARIMA models are applied in cases where the data show evidence of non-stationarity, where an initial differencing step (corresponding to the "integrated" part of the model) can be applied to reduce the non-stationarity. A standard notation used for describing ARIMA is ARIMA(p, d, q).
The parameters are substituted with an integer value to indicate the specific ARIMA
model being used quickly. The parameters of the ARIMA model are further
described as follows:
o p: Stands for the number of lag observations included in the model, also known as the lag order.
o d: The number of times that the raw observations are differenced, also known as the degree of differencing.
o q: The size of the moving average window, also called the order of the moving average.
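A minimal sketch of fitting an ARIMA(p, d, q) model in base R; the built-in AirPassengers series and the order (1, 1, 1) are chosen only for illustration:

fit <- arima(AirPassengers, order = c(1, 1, 1))  # p = 1, d = 1, q = 1
fit                                              # estimated AR and MA coefficients
predict(fit, n.ahead = 12)$pred                  # forecasts for the next 12 months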
Univariate stationary processes (ARMA)
A covariance stationary process is an ARMA(p, q) process of autoregressive order p and moving average order q if it can be written as

y_t = c + φ1*y_(t−1) + … + φp*y_(t−p) + ε_t + θ1*ε_(t−1) + … + θq*ε_(t−q)

where ε_t is white noise.
The acronym ARIMA stands for Auto-Regressive Integrated Moving Average. Lags of the stationarized series in the forecasting equation are called "autoregressive" terms, lags of the forecast errors are called "moving average" terms, and a time series which needs to be differenced to be made stationary is said to be an "integrated" version of a stationary series.
The forecasting equation is constructed as follows. First, let y denote the dth difference
of Y, which means:
If d=0: yt = Yt
If d=1: yt = Yt − Yt-1
If d=2: yt = (Yt − Yt-1) − (Yt-1 − Yt-2)
Note that the second difference of Y (the d=2 case) is not the difference from 2
periods ago. Rather, it is the first-difference-of-the-first difference, which is the discrete
analog of a second derivative, i.e., the local acceleration of the series rather than its
local trend.
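The differencing step above is easy to reproduce with pandas; the short sketch below (with a made-up series `Y`) shows the d = 1 and d = 2 cases.

```python
# Minimal sketch of the differencing step: y is the d-th difference of Y.
# Assumes `Y` is a pandas Series (hypothetical values below).
import pandas as pd

Y = pd.Series([10, 12, 15, 19, 24, 30])

y_d1 = Y.diff()          # d=1: yt = Yt - Yt-1
y_d2 = Y.diff().diff()   # d=2: first difference of the first difference

print(y_d1.tolist())  # [nan, 2.0, 3.0, 4.0, 5.0, 6.0]
print(y_d2.tolist())  # [nan, nan, 1.0, 1.0, 1.0, 1.0]
```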
We measure forecast accuracy by two methods, where e_i = (actual value - forecast value) for period i:
1. Mean Forecast Error (MFE). For n time periods where we have actual and forecast values:
MFE = (1/n) Σ e_i  (i = 1 … n)
Ideal value = 0; MFE > 0 means the model tends to under-forecast, MFE < 0 means the model tends to over-forecast.
2. Mean Absolute Deviation (MAD). For n time periods where we have actual and forecast values:
MAD = (1/n) Σ |e_i|  (i = 1 … n)
While MFE is a measure of forecast model bias, MAD indicates the absolute size of the errors.
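A small sketch of both measures, assuming `actual` and `forecast` are NumPy arrays of equal length (the values are made up):

```python
# Minimal sketch: computing MFE and MAD from actual and forecast values.
import numpy as np

actual = np.array([100, 110, 120, 130])
forecast = np.array([98, 112, 125, 128])

errors = actual - forecast            # e_i for each period
mfe = errors.mean()                   # bias: >0 under-forecast, <0 over-forecast
mad = np.abs(errors).mean()           # average absolute size of the errors

print(f"MFE = {mfe:.2f}, MAD = {mad:.2f}")
```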
Uses of Forecast error:
Forecast model bias
Absolute size of the forecast errors
Compare alternative forecasting models
Approach:
Extract, Transform and Load (ETL) refers to a process in database usage, and especially in data warehousing, that:
Extracts data from homogeneous or heterogeneous data sources,
Transforms the data into the proper format or structure for querying and analysis, and
Loads it into the final target (database, more specifically an operational data store, data mart, or data warehouse).
Usually, all three phases execute in parallel. Since data extraction takes time, a transformation process runs while the data is being pulled, processing the already received data and preparing it for loading; as soon as some data is ready to be loaded into the target, the loading kicks off without waiting for the completion of the previous phases.
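The sketch below shows the three phases as plain Python functions run one after another; a production pipeline would normally stream and parallelise them as described above. The file name `sales.csv` (assumed to contain `id` and `amount` columns), the SQLite target and the column names are all hypothetical.

```python
# Minimal sequential sketch of the three ETL phases.
import csv
import sqlite3

def extract(path):
    # Pull raw rows out of a source CSV file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Apply a simple type conversion so the data is ready for loading.
    for r in rows:
        r["amount"] = float(r["amount"])
    return rows

def load(rows, db_path="warehouse.db"):
    # Write the prepared rows into the target table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (id TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:id, :amount)", rows)
    con.commit()
    con.close()

load(transform(extract("sales.csv")))
```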
ETL systems commonly integrate data from multiple applications (systems), typically developed and supported by different vendors or hosted on separate computer hardware. The disparate systems containing the original data are frequently managed and operated by different employees. For example, a cost accounting system may combine data from payroll, sales, and purchasing.
Commonly used ETL tools include:
Tomahawk Business Integrator by Novasoft Technologies.
Pentaho Data Integration (or Kettle), an open-source data integration framework
Stambia
Extract:
The first step of an ETL process is to extract the data from the source systems and make it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system using as few resources as possible. The extract step should be designed so that it does not negatively affect the source system in terms of performance, response time, or locking.
Incremental extract - some systems may not be able to provide notification that an update has occurred, but they can identify which records have been modified and provide an extract of only those records. During further ETL steps, the system needs to identify the changes and propagate them down. Note that with an incremental (e.g., daily) extract, we may not be able to handle deleted records properly.
Full extract - some systems are not able to identify which data has been
changed at all, so a full extract is the only way one can get the data out of the
system. The full extract requires keeping a copy of the last extract in the same
format in order to be able to identify changes. Full extract handles deletions as
well.
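When only full extracts are available, new, changed and deleted records can be found by comparing the current extract against the stored copy of the previous one. A minimal pandas sketch (with made-up data) is shown below.

```python
# Minimal sketch: identifying new, changed and deleted records by comparing the
# current full extract with a stored copy of the previous one.
import pandas as pd

previous = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
current  = pd.DataFrame({"id": [1, 2, 4], "value": ["a", "B", "d"]})

merged = previous.merge(current, on="id", how="outer",
                        suffixes=("_old", "_new"), indicator=True)

deleted = merged[merged["_merge"] == "left_only"]    # gone from the source
new     = merged[merged["_merge"] == "right_only"]   # newly added
changed = merged[(merged["_merge"] == "both") &
                 (merged["value_old"] != merged["value_new"])]

print(new["id"].tolist(), changed["id"].tolist(), deleted["id"].tolist())
```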
When using incremental or full extracts, the extract frequency is extremely important, particularly for full extracts, where the data volumes can be in the tens of gigabytes.
Clean: The cleaning step is one of the most important as it ensures the quality
of the data in the data warehouse. Cleaning should perform basic data unification
rules, such as:
Making identifiers unique (sex categories Male/Female/Unknown, M/F/null,
Man/Woman/Not Available are translated to standard Male/Female/Unknown)
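A tiny sketch of such a unification rule in Python, using the sex-category example above; the mapping table is illustrative only.

```python
# Minimal sketch of a cleaning/unification rule: translate the many encodings
# of "sex" found in source systems to one standard set of values.
SEX_MAP = {
    "m": "Male", "male": "Male", "man": "Male",
    "f": "Female", "female": "Female", "woman": "Female",
    None: "Unknown", "": "Unknown", "null": "Unknown", "not available": "Unknown",
}

def unify_sex(raw):
    key = raw.strip().lower() if isinstance(raw, str) else raw
    return SEX_MAP.get(key, "Unknown")   # anything unrecognised becomes Unknown

print([unify_sex(v) for v in ["M", "Woman", None, "N/A"]])
# ['Male', 'Female', 'Unknown', 'Unknown']
```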
Transform:
The transform step applies a set of rules to transform the data from the source to
the target.
This includes converting any measured data to the same dimension (i.e. conformed
dimension) using the same units so that they can later be joined.
The transformation step also requires joining data from several sources, generating
aggregates, generating surrogate keys, sorting, deriving new calculated values, and
applying advanced validation rules.
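The sketch below illustrates a few of these transform operations with pandas: joining two hypothetical sources, converting amounts to a conformed currency, generating a surrogate key, and aggregating. All table and column names are invented.

```python
# Minimal sketch of common transform operations during ETL.
import pandas as pd

orders = pd.DataFrame({"cust": ["A", "A", "B"], "amount_usd": [10.0, 20.0, 5.0]})
fx     = pd.DataFrame({"currency": ["USD"], "rate_to_eur": [0.9]})

t = orders.assign(currency="USD").merge(fx, on="currency")   # join two sources
t["amount_eur"] = t["amount_usd"] * t["rate_to_eur"]         # conformed unit
t["order_key"] = range(1, len(t) + 1)                        # surrogate key

agg = t.groupby("cust", as_index=False)["amount_eur"].sum()  # aggregate
print(agg)
```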
Load:
During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database.
In order to make the load process efficient, it is helpful to disable any constraints and indexes before the load and to re-enable them only after the load completes. Referential integrity needs to be maintained by the ETL tool to ensure consistency.
There is always a possibility that the ETL process fails. This can be caused by missing extracts from one of
the systems, missing values in one of the reference tables, or simply a connection or
power outage. Therefore, it is necessary to design the ETL process keeping fail-recovery
in mind.
Staging:
It should be possible to restart, at least, some of the phases independently from the
others. For example, if the transformation step fails, it should not be necessary to restart
the Extract step. We can ensure this by implementing proper staging. Staging means that
the data is simply dumped to the location (called the Staging Area) so that it can then
be read by the next processing phase. The staging area is also used during the ETL process to store intermediate results of processing. However, the staging area should be accessed by the ETL process only; it should never be available to anyone else, particularly not to end users, as it is not intended for presenting data and may contain incomplete or in-progress data.
Decision Tree Example (Unit 4)
In order to derive the Hypothesis space, we compute the Entropy and Information Gain of Class and
attributes. For them we use the following statistics formulae:
Information of a split with p positive and n negative examples:
I(p, n) = -(p / (p + n)) log2(p / (p + n)) - (n / (p + n)) log2(n / (p + n))
Entropy of an Attribute is:
Entropy(Attribute) = Σ_i ((p_i + n_i) / (P + N)) · I(p_i, n_i)
where the attribute's i-th value covers p_i positive and n_i negative examples, out of P positive and N negative examples overall.
Gain = Entropy(Class) - Entropy(Attribute), where Entropy(Class) = I(P, N).
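The formulas above translate directly into code. The sketch below computes I(p, n), the attribute entropy and the gain for a binary class; the example numbers at the end come from the classic play-tennis data set, not necessarily the data set used in the worked example that follows.

```python
# Minimal sketch of the entropy / information-gain formulas above.
from math import log2

def info(p, n):
    # I(p, n) = -p/(p+n) log2(p/(p+n)) - n/(p+n) log2(n/(p+n))
    total = p + n
    result = 0.0
    for x in (p, n):
        if x:  # 0 * log2(0) is treated as 0
            result -= (x / total) * log2(x / total)
    return result

def attribute_entropy(partitions, P, N):
    # partitions: list of (p_i, n_i) counts for each value of the attribute
    return sum(((p + n) / (P + N)) * info(p, n) for p, n in partitions)

def gain(partitions, P, N):
    return info(P, N) - attribute_entropy(partitions, P, N)

# Example: the "Outlook" attribute of the classic play-tennis data set
# (sunny 2+/3-, overcast 4+/0-, rain 3+/2-; overall 9+/5-).
print(round(gain([(2, 3), (4, 0), (3, 2)], 9, 5), 3))  # ≈ 0.247
```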
Illustrative Example (data set and step-by-step entropy / information-gain calculations shown as figures).
Syllabus:
Data Visualization: Pixel-Oriented Visualization Techniques, Geometric Projection
Visualization Techniques, Icon-Based Visualization Techniques, Hierarchical Visualization
Techniques, Visualizing Complex Data and Relations.
Unit-5 Objectives:
1. To explore Pixel-Oriented Visualization Techniques
2. To learn Geometric Projection Visualization Techniques
3. To explore Icon-Based Visualization Techniques
4. To Learn Hierarchical Visualization Techniques
5. To understand Visualizing Complex Data and Relations
Unit-5 Outcomes:
After completion of this course students will be able to
1. Describe the Pixel-Oriented Visualization Techniques
2. Demonstrate Geometric Projection Visualization Techniques
3. Analyze the Icon-Based Visualization Techniques
4. Explore the Hierarchical Visualization Techniques
5. Compare techniques for Visualizing Complex Data and Relations
Data Visualization
Data visualization is the art and practice of gathering, analyzing, and graphically
representing empirical information.
They are sometimes called information graphics, or even just charts and graphs.
The goal of visualizing data is to tell the story in the data.
Telling the story is predicated on understanding the data at a very deep level, and
gathering insight from comparisons of data points in the numbers
Gain insight into an information space by mapping data onto graphical primitives
Provide qualitative overview of large data sets
Search for patterns, trends, structure, irregularities, and relationships among data.
Help find interesting regions and suitable parameters for further quantitative
analysis.
Provide a visual proof of computer representations derived.
To save space and show the connections among multiple dimensions, the space filling in pixel-oriented visualization is often done in a circle segment.
Line Plot
A line plot simply connects the values of a series of data points with straight lines.
The plot may seem very simple, but it has many applications, not only in machine learning but in many other areas. For example, it is used to analyze the performance of a model via the ROC-AUC curve.
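A minimal matplotlib sketch of a line plot; the epoch/accuracy values are invented and simply stand in for any series of points to be connected.

```python
# Minimal sketch of a line plot: points connected by straight lines.
import matplotlib.pyplot as plt

epochs = [1, 2, 3, 4, 5]
accuracy = [0.61, 0.72, 0.80, 0.84, 0.86]   # hypothetical values

plt.plot(epochs, accuracy, marker="o")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.title("Line plot of model accuracy over epochs")
plt.show()
```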
Bar Plot
This is one of the most widely used plots; we see it not just in data analysis but in any field where trend analysis is done.
It lets us visualize the data in a clear plot and convey the details to others in a straightforward way.
The plot is simple and clear, but it is not used as frequently in data science applications.
Stacked Bar Graph:
Unlike a Multi-set Bar Graph which displays their bars side-by-side, Stacked Bar
Graphs segment their bars. Stacked Bar Graphs are used to show how a larger category is divided into smaller categories and what relationship each part has to the total amount. There are two types of Stacked Bar Graphs:
Simple Stacked Bar Graphs place each value for the segment after the previous one.
The total value of the bar is all the segment values added together. Ideal for
comparing the total amounts across each group/segmented bar.
100% Stacked Bar Graphs show the percentage-of-the-whole of each group; each value is plotted as a percentage of the total amount in its group. This makes it easier to see the relative differences between quantities in each group.
One major flaw of Stacked Bar Graphs is that they become harder to read the more
segments each bar has. Also, comparing each segment to the others is difficult, as they are not aligned on a common baseline.
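A minimal matplotlib sketch of a simple stacked bar graph; the quarterly product figures are invented.

```python
# Minimal sketch of a simple stacked bar graph: one bar per quarter, split into
# two product segments stacked on top of each other.
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
product_a = [20, 35, 30, 35]
product_b = [25, 32, 34, 20]

plt.bar(quarters, product_a, label="Product A")
plt.bar(quarters, product_b, bottom=product_a, label="Product B")  # stack on top
plt.ylabel("Sales")
plt.legend()
plt.show()
```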
Scatter Plot
It is one of the most commonly used plots for visualizing simple data in Machine
learning and Data Science.
This plot represents each point in the dataset with respect to any 2 or 3 features (columns).
Scatter plots are available in both 2-D as well as in 3-D. The 2-D scatter plot is the
common one, where we will primarily try to find the patterns, clusters, and separability
of the data.
Colors can be assigned to the data points based on the target column, i.e., we can color the data points according to their class label given in the dataset.
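A minimal sketch of a 2-D scatter plot coloured by class label, using the iris data set bundled with scikit-learn (assumed to be installed).

```python
# Minimal sketch of a 2-D scatter plot coloured by class label.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
x, y = iris.data[:, 0], iris.data[:, 1]           # first two features

plt.scatter(x, y, c=iris.target, cmap="viridis")  # colour = target column
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title("Scatter plot coloured by class label")
plt.show()
```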
Box and Whisker Plot
This plot can be used to obtain more statistical details about the data.
The straight lines at the maximum and minimum are also called whiskers.
Points that lie outside the whiskers are considered outliers.
The box plot also gives us a description of the data's spread through its minimum, first quartile (25th percentile), median, third quartile (75th percentile), and maximum.
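A minimal matplotlib sketch of a box-and-whisker plot; the sample values are invented and include two points that fall beyond the whiskers.

```python
# Minimal sketch of a box-and-whisker plot; 25 and 30 fall beyond the upper
# whisker (1.5 x IQR) and are drawn as outlier points.
import matplotlib.pyplot as plt

data = [7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 25, 30]

plt.boxplot(data)
plt.ylabel("Value")
plt.title("Box and whisker plot")
plt.show()
```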
Pie Chart :
A pie chart shows a static number and how categories represent part of a whole, i.e. the composition of something. A pie chart represents numbers as percentages, and the total sum of all segments needs to equal 100%.
Extensively used in presentations and offices, Pie Charts help show proportions and
percentages between categories, by dividing a circle into proportional segments. Each arc
length represents a proportion of each category, while the full circle represents the total
sum of all the data, equal to 100%.
Donut Chart:
A Donut Chart is essentially a Pie Chart with its centre cut out. Pie Charts are sometimes criticised for focusing readers on comparing the areas of the slices, which makes differences hard to see, especially when multiple Pie Charts are compared together.
A Donut Chart somewhat remedies this problem by de-emphasizing the use of the area.
Instead, readers focus more on reading the length of the arcs, rather than comparing the
proportions between slices.
Also, Donut Charts are more space-efficient than Pie Charts because the blank space inside a
Donut Chart can be used to display information inside it.
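A minimal matplotlib sketch drawing a pie chart and a donut chart side by side; the donut effect comes from giving the wedges a width smaller than the radius, and the category values are invented.

```python
# Minimal sketch of a pie chart and a donut chart.
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D"]
sizes = [40, 30, 20, 10]   # hypothetical values summing to 100%

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.pie(sizes, labels=labels, autopct="%1.0f%%")          # pie chart
ax2.pie(sizes, labels=labels, wedgeprops={"width": 0.4})  # donut chart
ax1.set_title("Pie")
ax2.set_title("Donut")
plt.show()
```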
Marimekko Chart:
Also known as a Mosaic Plot. A Marimekko Chart displays categorical data across two variables: the width of each column and the height of the segments within it both represent quantities, so the whole chart shows part-to-whole relationships in two dimensions.
Chernoff Faces
A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.
The figure shows faces produced using 10 characteristics–head eccentricity,
eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth
shape, mouth size, and mouth opening. Each characteristic is assigned one of 10 possible values.
Stick Figure
The stick figure technique maps multidimensional data to five-piece stick figures: two dimensions are mapped to the display (x and y) axes, and the remaining dimensions are mapped to the angles and/or lengths of the figure's limbs.
Circle Packing
Containment within each circle represents a level in the hierarchy: each branch of
the tree is represented as a circle and its sub-branches are represented as circles
inside of it. The area of each circle can also be used to represent an additional
arbitrary value, such as quantity or file size. Colour may also be used to assign categories or to represent another variable through different shades.
Sunburst Diagram:
Also known as a Sunburst Chart, Ring Chart, Multi-level Pie Chart, Belt Chart, or Radial Treemap.
This type of visualisation shows hierarchy through a series of rings, that are sliced for
each category node. Each ring corresponds to a level in the hierarchy, with the
central circle representing the root node and the hierarchy moving outwards from it.
Rings are sliced up and divided based on their hierarchical relationship to the parent
slice. The angle of each slice is either divided equally under its parent node or can
be made proportional to a value.
Colour can be used to highlight hierarchal groupings or specific categories.
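A minimal sunburst sketch using plotly express (assumed to be installed); the two-level region/country hierarchy is invented.

```python
# Minimal sketch of a sunburst diagram: rings correspond to hierarchy levels.
import plotly.express as px

data = {
    "region":  ["Europe", "Europe", "Asia", "Asia"],
    "country": ["France", "Germany", "India", "Japan"],
    "sales":   [10, 15, 20, 12],
}

fig = px.sunburst(data, path=["region", "country"], values="sales")
fig.show()
```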
Treemap:
Treemaps display a hierarchical structure while also showing a quantity for each category via area size. Each category is assigned a rectangular area, with its subcategory rectangles nested inside of it.
When a quantity is assigned to a category, its area size is displayed in proportion to
that quantity and to the other quantities within the same parent category in a part-
to-whole relationship. Also, the area size of the parent category is the total of its
subcategories. If no quantity is assigned to a subcategory, then its area is divided
equally amongst the other subcategories within its parent category.
The way rectangles are divided and ordered into sub-rectangles is dependent on
the tiling algorithm used. Many tiling algorithms have been developed, but the
"squarified algorithm" which keeps each rectangle as square as possible is the one
commonly used.
A common use is visualising the file directory on a computer without taking up too much space on the screen. This makes Treemaps a compact and space-efficient option for displaying hierarchies, giving a quick overview of the structure. Treemaps are also great at
comparing the proportions between categories via their area size.
The downside to a Treemap is that it doesn't show the hierarchal levels as clearly
as other charts that visualise hierarchal data (such as a Tree Diagram or Sunburst
Diagram).
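A minimal treemap sketch using the squarify library (which implements the squarified tiling algorithm) together with matplotlib; the folder names and sizes are invented.

```python
# Minimal sketch of a treemap: rectangle areas are proportional to the sizes.
import matplotlib.pyplot as plt
import squarify  # pip install squarify

sizes = [500, 300, 150, 50]                     # hypothetical folder sizes
labels = ["Docs", "Photos", "Music", "Other"]

squarify.plot(sizes=sizes, label=labels)        # squarified tiling layout
plt.axis("off")
plt.title("Treemap of disk usage by folder")
plt.show()
```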
Visualizing Complex Data and Relations
Word Cloud:
A visualisation method that displays how frequently words appear in a given body of text, by
making the size of each word proportional to its frequency. All the words are then arranged
in a cluster or cloud of words. Alternatively, the words can also be arranged in any format: