
R22 B.Tech. CSE Syllabus, JNTU HYDERABAD


CS513PE: DATA ANALYTICS (Professional
Elective - I)

III Year B.Tech. CSE I-Sem          L T P C: 3 0 0 3
Prerequisites
1. A course on “Database Management Systems”.
2. Knowledge of probability and statistics.

Course Objectives:
 To explore the fundamental concepts of data analytics.
 To learn the principles and methods of statistical analysis.
 To discover interesting patterns, analyze supervised and unsupervised models and estimate the accuracy
of the algorithms.
 To understand the various search methods and visualization techniques.

Course Outcomes: After completion of this course students will be able to


 Understand the impact of data analytics for business decisions and strategy
 Carry out data analysis/statistical analysis
 Carry out standard data visualization and formal inference procedures
 Design Data Architecture
 Understand various Data Sources

UNIT - I
Data Management: Design Data Architecture and manage the data for analysis, understand various
sources of Data like Sensors/Signals/GPS etc. Data Management, Data Quality (noise, outliers, missing
values, duplicate data) and Data Preprocessing.

UNIT - II
Data Analytics: Introduction to Analytics, Introduction to Tools and Environment, Application of
Modeling in Business, Databases & Types of Data and variables, Data Modeling Techniques, Missing
Imputations etc. Need for Business Modeling.

UNIT - III
Regression – Concepts, Blue property assumptions, Least Square Estimation, Variable Rationalization,
and Model Building etc.
Logistic Regression: Model Theory, Model fit Statistics, Model Construction, Analytics applications
to various Business Domains etc.

UNIT - IV
Object Segmentation: Regression Vs Segmentation – Supervised and Unsupervised Learning, Tree
Building – Regression, Classification, Overfitting, Pruning and Complexity, Multiple Decision Trees
etc. Time Series Methods: ARIMA, Measures of Forecast Accuracy, STL approach, Extract features
from the generated model such as height, average energy etc. and analyze for prediction

UNIT - V
Data Visualization: Pixel-Oriented Visualization Techniques, Geometric Projection Visualization
Techniques, Icon-Based Visualization Techniques, Hierarchical Visualization Techniques,


Visualizing Complex Data and Relations.

TEXT BOOKS:
1. Student’s Handbook for Associate Analytics – II, III.
2. Data Mining Concepts and Techniques, Han, Kamber, 3rd Edition, Morgan Kaufmann Publishers.

REFERENCE BOOKS:
1. Introduction to Data Mining, Tan, Steinbach and Kumar, Addison Wesley, 2006.
2. Data Mining Analysis and Concepts, M. Zaki and W. Meira.
3. Mining of Massive Datasets, Jure Leskovec (Stanford Univ.), Anand Rajaraman (Milliway Labs), Jeffrey D. Ullman (Stanford Univ.).


CS513PE: DATA ANALYTICS (Professional Elective - I)


BASIC TERMINOLOGIES
BIG DATA

Big data is a field that treats ways to analyze, systematically extract information from, or
otherwise deal with data sets that are too large or complex to be dealt with by traditional
data-processing application software.

4V PROPERTIES OF BIG DATA

 Volume
 Variety
 Velocity
 Veracity

Volume of Big Data

The volume of data refers to the size of the data sets that need to be analyzed and processed,
which are now frequently larger than terabytes and petabytes. The sheer volume of the data
requires distinct and different processing technologies than traditional storage and processing
capabilities. In other words, this means that the data sets in Big Data are too large to process
with a regular laptop or desktop processor. An example of a high-volume data set would be
all credit card transactions on a day within Europe.

Velocity of Big Data

Velocity refers to the speed with which data is generated. High velocity data is generated
with such a pace that it requires distinct (distributed) processing techniques. An example of a
data that is generated with high velocity would be Twitter messages or Facebook posts.

Variety of Big Data

Variety makes Big Data really big. Big Data comes from a great variety of sources and
generally is one out of three types: structured, semi structured and unstructured data. The
variety in data types frequently requires distinct processing capabilities and specialist
algorithms. An example of high variety data sets would be the CCTV audio and video files
that are generated at various locations in a city.

Veracity of Big Data

Veracity refers to the quality of the data that is being analyzed. High veracity data has many
records that are valuable to analyze and that contribute in a meaningful way to the overall
results. Low veracity data, on the other hand, contains a high percentage of meaningless data.
The non-valuable in these data sets is referred to as noise. An example of a high veracity data
set would be data from a medical experiment or trial.

Data that is high volume, high velocity and high variety must be processed with advanced
tools (analytics and algorithms) to reveal meaningful information. Because of these
characteristics of the data, the knowledge domain that deals with the storage, processing, and
analysis of these data sets has been labeled Big Data.

FORMS OF DATA

• Information collected and stored in a particular file is represented in one of the following forms of data.

– STRUCTURED FORM

• Any form of relational database structure where a relation between attributes
is possible, i.e. there exists a relation between rows and columns in the
database with a table structure. Eg: databases managed with database
languages/systems (SQL, Oracle, MySQL etc.).

– UNSTRUCTURED FORM.

• Any form of data that does not have a predefined structure is represented as
an unstructured form of data. Eg: video, images, comments, posts, and some
websites such as blogs and Wikipedia.

– SEMI STRUCTURED DATA

• Does not have a tabular form like RDBMS data.
• Predefined, organized formats are available.
• Eg: CSV, XML, JSON, txt file with a tab separator etc. (a small reading sketch follows this list)
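As a small illustration of the difference in practice, here is a minimal, hedged sketch (not part of the original notes) that loads the same records from CSV text and JSON text with pandas; the column names and values are invented for the example.

    # Minimal sketch: reading semi-structured data (CSV and JSON) with pandas.
    # The file contents are created inline so the example is self-contained.
    import io
    import pandas as pd

    csv_text = "name,age,city\nAsha,21,Hyderabad\nRavi,23,Chennai\n"
    json_text = '[{"name": "Asha", "age": 21, "city": "Hyderabad"},' \
                ' {"name": "Ravi", "age": 23, "city": "Chennai"}]'

    df_csv = pd.read_csv(io.StringIO(csv_text))     # tabular view of the CSV records
    df_json = pd.read_json(io.StringIO(json_text))  # the same records parsed from JSON

    print(df_csv)
    print(df_json)

Both calls produce the same structured table, which is why CSV and JSON are called semi-structured: the organization is predefined, but not stored as rows and columns inside an RDBMS.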

SOURCES OF DATA

– There are two types of sources of data available.

– PRIMARY SOURCE OF DATA


• Eg: data created by individual or a business concern on their own.

– SECONDARY SOURCE OF DATA


• Eg: data can be extracted from cloud servers, website sources (kaggle, uci,
aws, google cloud, twitter, facebook, youtube, github etc..)

DATA ANALYSIS

Data analysis is a process of inspecting, cleansing, transforming and modeling data with
the goal of discovering useful information, informing conclusions and supporting decision-
making.

DATA ANALYTICS

• Data analytics is the science of analyzing raw data in order to draw conclusions about
that information. This information can then be used to optimize processes to increase
the overall efficiency of a business or system.

Types:

– Descriptive analytics Eg: (observation, case-study, surveys)

In descriptive analytics the result summarizes what has already happened; with ‘n’
possible options and no other information, each option is treated as having an equal chance.

– Predictive analytics Eg: healthcare, sports, weather, insurance, social media analysis.

This type of analytics uses past data to make predictions and decisions based on
certain algorithms. For example, a doctor questions the patient about the past
history in order to treat the illness through already existing procedures.

– Prescriptive analytics Eg: healthcare, banking.

Prescriptive analytics works with predictive analytics, which uses data to determine
near-term outcomes. Prescriptive analytics makes use of machine learning to help
businesses decide a course of action based on a computer program's predictions.
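To make the three types concrete, here is a tiny, hedged sketch (invented monthly sales figures, not from the notes): descriptive analytics summarizes what has already happened, while predictive analytics fits a simple model on past data to estimate a future value.

    # Descriptive vs predictive analytics on a made-up monthly sales series.
    import numpy as np

    sales = np.array([120.0, 132.0, 128.0, 141.0, 150.0, 158.0])  # past 6 months

    # Descriptive: summarize what has happened.
    print("mean:", sales.mean(), "min:", sales.min(), "max:", sales.max())

    # Predictive: fit a linear trend on the past data and project month 7.
    months = np.arange(1, len(sales) + 1)
    slope, intercept = np.polyfit(months, sales, deg=1)
    print("forecast for month 7:", slope * 7 + intercept)

Prescriptive analytics would go one step further and recommend an action (for example, how much stock to order) based on such a forecast.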

Fig 0.1: Relation between Social Media, Data Analysis and Big Data

Social media data are used in a number of domains such as health and political trending
and forecasting, hobbies, e-business, cyber-crime, counter-terrorism, time-evolving opinion
mining, social network analysis, and human-machine interactions.

Finally, summarizing all the above concepts, processing of social media data can be
categorized into 3 parts as shown in figure 0.1. The first part consists of social media
websites, the second part consists of the data analysis part and the third part consists of the
big data management layer, which schedules the jobs across the cluster.

DIFFERENCE BETWEEN DATA ANALYTICS AND DATA ANALYSIS

Characteristic    Data Analytics                                 Data Analysis

Form              Used in business to make decisions             A form of data analytics used in business
                  from data (data driven)                        to identify useful information in data
Structure         A process of data collection with              Cleaning and transforming the data
                  various strategies
Tools             Excel, Python, R etc.                          KNIME, NodeXL, RapidMiner etc.
Prediction        Analytics tries to reach conclusions           Analysis examines what has happened
                  about the future                               in the past

MACHINE LEARNING

• Machine learning is an application of artificial intelligence (AI) that provides systems


the ability to automatically learn and improve from experience without being
explicitly programmed.
• Machine learning focuses on the development of computer programs that can access
data and use it to learn for themselves.

Fig 0.2: Relation between machine learning and data analytics



In general, data is passed to a machine learning tool to perform descriptive data analytics
through a set of algorithms built into it. Here both data analytics and data analysis are done by the
tool automatically. Hence we can say that data analysis is a sub-component of data analytics,
and data analytics is a sub-component of the machine learning tool. All these are described in
figure 0.2. The output of this machine learning tool is a model, and from this model
predictive analytics and prescriptive analytics can be performed, because the model gives
its output back as data to the machine learning tool. This cycle continues till we get an efficient output.

UNIT - I

1.1 DESIGN DATA ARCHITECTURE AND MANAGE THE DATA FOR ANALYSIS

Data architecture is composed of models, policies, rules or standards that govern which
data is collected, and how it is stored, arranged, integrated, and put to use in data systems
and in organizations. Data is usually one of several architecture domains that form the
pillars of an enterprise architecture or solution architecture.

Various constraints and influences will have an effect on data architecture design. These
include enterprise requirements, technology drivers, economics, business policies and data
processing needs.

• Enterprise requirements
These will generally include such elements as economical and effective system
expansion, acceptable performance levels (especially system access speed), transaction
reliability, and transparent data management. In addition, the conversion of raw data such as
transaction records and image files into more useful information forms through such
features as data warehouses is also a common organizational requirement, since this enables
managerial decision making and other organizational processes. One of the architecture
techniques is the split between managing transaction data and (master) reference data.
Another one is splitting data capture systems from data retrieval systems (as done in a
data warehouse).

• Technology drivers
These are usually suggested by the completed data architecture and database
architecture designs. In addition, some technology drivers will derive from existing
organizational integration frameworks and standards, organizational economics, and
existing site resources (e.g. previously purchased software licensing).

• Economics
These are also important factors that must be considered during the data architecture phase.
It is possible that some solutions, while optimal in principle, may not be potential
candidates due to their cost. External factors such as the business cycle, interest rates,
market conditions, and legal considerations could all have an effect on decisions relevant to
data architecture.

• Business policies
Business policies that also drive data architecture design include internal organizational
policies, rules of regulatory bodies, professional standards, and applicable governmental
laws that can vary by applicable agency. These policies and rules will help describe the
manner in which enterprise wishes to process their data.

• Data processing needs
These include accurate and reproducible transactions performed in high volumes, data
warehousing for the support of management information systems (and potential data
mining), repetitive periodic reporting, ad hoc reporting, and support of various
organizational initiatives as required (e.g. annual budgets, new product development).

The General Approach is based on designing the architecture at three levels of
specification, as shown below in figure 1.1:
 The Logical Level
 The Physical Level
 The Implementation Level

Fig 1.1: Three levels architecture in data analytics.

The logical view/user's view, of a data analytics represents data in a format that is
meaningful to a user and to the programs that process those data. That is, the logical
view tells the user, in user terms, what is in the database. Logical level consists of data
requirements and process models which are processed using any data modelling techniques to
result in logical data model.

Physical level is created when we translate the top level design in physical tables in
the database. This model is created by the database architect, software architects, software
developers or database administrators. The input to this level comes from the logical level, and various
data modelling techniques are used here with input from software developers or database
administrators. These data modelling techniques are various formats of representation of data
such as relational data model, network model, hierarchical model, object oriented model,
Entity relationship model.
Implementation level contains details about modification and presentation of data through
the use of various data mining tools such as (R-studio, WEKA, Orange etc). Here each tool
has a specific feature how it works and different representation of viewing the same data.
These tools are very helpful to the user since they are user friendly and do not require much
programming knowledge from the user.

1.2 Various Sources of Data

Understand various primary sources of the Data


Data can be generated from two types of sources namely Primary and Secondary
Sources of Primary Data

The sources of generating primary data are -


 Observation Method
 Survey Method
 Experimental Method

Observation Method:

Fig 1.2: Data collections

An observation is a data collection method, by which you gather knowledge of the


researched phenomenon through making observations of the phenomena, as and when it
occurs. The main aim is to focus on observations of human behavior, the use of the
phenomenon and human interactions related to the phenomenon. We can also make
observations on verbal and nonverbal expressions. In making and documenting observations,
we need to clearly differentiate our own observations from the observations provided to us by
other people. The range of data storage genre found in Archives and Collections, is suitable
for documenting observations e.g. audio, visual, textual and digital including sub-genres
of note taking, audio recording and video recording.

There exist various observation practices, and our role as an observer may vary
according to the research approach. We make observations from either the outsider or insider
point of view in relation to the researched phenomenon and the observation technique can be
structured or unstructured. The degree of the outsider or insider points of view can be seen as
a movable point in a continuum between the extremes of outsider and insider. If you decide
to take the insider point of view, you will be a participant observer in situ and actively
participate in the observed situation or community. The activity of a Participant observer in
situ is called field work. This observation technique has traditionally belonged to the data
collection methods of ethnology and anthropology. If you decide to take the outsider point of
view, you try to distance yourself from your own cultural ties and observe the
researched community as an outsider observer. These details are seen in figure 1.2.

Experimental Designs
There are a number of experimental designs that are used in carrying out an
experiment. However, market researchers have used 4 experimental designs most frequently.
These are –

CRD - Completely Randomized Design

A completely randomized design (CRD) is one where the treatments are assigned
completely at random so that each experimental unit has the same chance of receiving any
one treatment. For the CRD, any difference among experimental units receiving the same
treatment is considered as experimental error. Hence, CRD is appropriate only for
experiments with homogeneous experimental units, such as laboratory experiments, where
environmental effects are relatively easy to control. For field experiments, where there is
generally large variation among experimental plots in such environmental factors as soil, the
CRD is rarely used. CRD is mainly used in agricultural field.

Step 1. Determine the total number of experimental plots (n) as the product of the number of
treatments (t) and the number of replications (r); that is, n = rt. For our example, n = 5 x 4 =
20. Here, one pot with a single plant in it may be called a plot. In case the number of
replications is not the same for all the treatments, the total number of experimental pots is to
be obtained as the sum of the replications for each treatment. i.e.,

n = Σ rᵢ

where rᵢ is the number of times the i-th treatment is replicated.



Step 2. Assign a plot number to each experimental plot in any convenient manner; for
example, consecutively from 1 to n.

Step 3. Assign the treatments to the experimental plots randomly using a table of random
numbers.

Example 1: Assume that a farmer wishes to perform an experiment to determine which of
his 3 fertilizers to use on 2800 trees. Assume that the farmer has a farm divided into 3 terraces,
where those 2800 trees are distributed as below:

Lower Terrace 1200


Middle Terrace 1000
Upper Terrace 600

Design a CRD for this experiment

Solution
Scenario 1

First we divide the 2800 trees into three random assignments of almost equal size:
Random Assignment 1: 933 trees
Random Assignment 2: 933 trees
Random Assignment 3: 934 trees
So, for example, random assignment 1 can be given fertilizer1, random assignment 2 can be
given fertilizer2, and random assignment 3 can be given fertilizer3.

Scenario 2

The 2800 trees are divided terrace-wise as shown below:

Terrace                        Random assignment    Fertilizer used
Upper Terrace (600 trees)      200                  fertilizer1
                               200                  fertilizer2
                               200                  fertilizer3
Middle Terrace (1000 trees)    333                  fertilizer1
                               333                  fertilizer2
                               334                  fertilizer3
Lower Terrace (1200 trees)     400                  fertilizer1
                               400                  fertilizer2
                               400                  fertilizer3

Thus the farmer will be able to analyze and compare the performance of the various fertilizers on different
terraces.
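A completely randomized assignment like the one above can also be generated mechanically; the sketch below is illustrative only (the seed is arbitrary, and the tree counts follow the example).

    # CRD sketch: randomly assign 2800 trees to 3 fertilizers in (almost) equal groups.
    import numpy as np

    rng = np.random.default_rng(seed=1)
    n_trees = 2800
    treatments = ["fertilizer1", "fertilizer2", "fertilizer3"]

    order = rng.permutation(n_trees)                   # random order of tree indices
    groups = np.array_split(order, len(treatments))    # group sizes 934, 933, 933

    for name, idx in zip(treatments, groups):
        print(name, "gets", len(idx), "trees")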

Example 2:

A company wishes to test 4 different types of tyre. The tyres' lifetimes, as determined
from their treads, are given below, where each tyre has been tried on 6 similar automobiles
assigned at random to the tyres. Determine whether there is a significant difference between
the tyres at the 0.05 level.

Tyres   Automobile 1   Automobile 2   Automobile 3   Automobile 4   Automobile 5   Automobile 6
A       33             38             36             40             31             35
B       32             40             42             38             30             34
C       31             37             35             33             34             30
D       29             34             32             30             33             31

Solution:

Null Hypothesis: There is no difference between the tyres in their life time.

We choose a convenient value close to the average of all values in the table and subtract it from
each entry; for example, choosing 35:

Tyres   Automobile 1   Automobile 2   Automobile 3   Automobile 4   Automobile 5   Automobile 6   Total
A       -2              3              1              5             -4              0               3
B       -3              5              7              3             -5             -1               6
C       -4              2              0             -2             -1             -5             -10
D       -6             -1             -3             -5             -2             -4             -21
T = Sum(X) = -22

N = number of samples = 24 (4 rows x 6 columns)

Correction factor = T²/N = (-22)²/24 ≈ 20.16

Square the deviations to find Sum(X²):

Tyres   Automobile 1   Automobile 2   Automobile 3   Automobile 4   Automobile 5   Automobile 6   Total
A        4              9              1             25             16              0              55
B        9             25             49              9             25              1             118
C       16              4              0              4              1             25              50
D       36              1              9             25              4             16              91
Sum(X²) = 314
Total sum of squares (SST) = Sum(X²) - Correction factor
= 314 - 20.16 = 293.84
Sum of squares between treatments (SSTr)
= ((3)²/6 + (6)²/6 + (-10)²/6 + (-21)²/6) - Correction factor = 97.67 - 20.16 = 77.50
Sum of squares error (SSE) = SST - SSTr = 293.84 - 77.50 = 216.34

Now by using ANOVA (one way classification) Table, We calculate the F- Ratio.

F-Ratio:

The F ratio is the ratio of two mean square values. If the null hypothesis is true, you
expect F to have a value close to 1.0 most of the time. A large F ratio means that the variation
among group mean is more than you'd expect to see by chance

If the value of the F-ratio is close to 1, the null hypothesis is likely true. If the F-ratio is
greater than the critical value from the F-table, we conclude that the null hypothesis is false.

Source of variation    Sum of squares    Degrees of freedom             Mean sum of squares           F-ratio
Between treatments     SSTr = 77.50      No. of treatments - 1          MSTr = SSTr / d.f.            F = MSTr / MSE
                                         = 4 - 1 = 3                    = 77.50 / 3 = 25.83           = 25.83 / 10.82 ≈ 2.39
Within treatments      SSE = 216.34      No. of values - no. of         MSE = SSE / d.f.
                                         treatments = 24 - 4 = 20       = 216.34 / 20 = 10.82

Here the F-ratio (≈ 2.39) is greater than 1, which suggests some variation between the samples,
but whether it is significant must be judged against the critical value.
Level of significance = 0.05 (given in the question)
Degrees of freedom = (3, 20)
Critical value = 3.10 (from the 5% F-table)
F-ratio < critical value, i.e. 2.39 < 3.10

Hence the null hypothesis cannot be rejected at the 0.05 level: the data do not show a significant
difference in lifetime between the tyres. A quick verification of this ANOVA is sketched below.
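The same one-way ANOVA can be cross-checked with a short script; this is a verification sketch using scipy (assumed to be available), not part of the original notes.

    # One-way ANOVA for the tyre lifetime example.
    from scipy import stats

    A = [33, 38, 36, 40, 31, 35]
    B = [32, 40, 42, 38, 30, 34]
    C = [31, 37, 35, 33, 34, 30]
    D = [29, 34, 32, 30, 33, 31]

    f_ratio, p_value = stats.f_oneway(A, B, C, D)
    critical = stats.f.ppf(0.95, dfn=3, dfd=20)   # 5% critical value for (3, 20) d.f.

    print(f"F = {f_ratio:.2f}, p = {p_value:.3f}, critical F(3,20) = {critical:.2f}")
    # F is about 2.39, which is below the critical value of about 3.10,
    # so the null hypothesis is not rejected at the 0.05 level.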

RBD - Randomized Block Design - In a randomized block design, the experimenter divides subjects into subgroups called
blocks, such that the variability within blocks is less than the variability between blocks.
Then, subjects within each block are randomly assigned to treatment conditions. Compared to
a completely randomized design, this design reduces variability within treatment conditions
and potential confounding, producing a better estimate of treatment effects.

The table below shows a randomized block design for a hypothetical medical experiment.

Gender     Treatment
           Placebo    Vaccine
Male       250        250
Female     250        250

Subjects are assigned to blocks, based on gender. Then, within each block, subjects are
randomly assigned to treatments (either a placebo or a cold vaccine). For this design, 250
men get the placebo, 250 men get the vaccine, 250 women get the placebo, and 250 women
get the vaccine.

It is known that men and women are physiologically different and react differently to
medication. This design ensures that each treatment condition has an equal proportion of men
and women. As a result, differences between treatment conditions cannot be attributed to
gender. This randomized block design removes gender as a potential source of variability and
as a potential confounding variable.
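A block-then-randomize assignment like the one in the table can be sketched in a few lines; this is illustrative only, with the subject counts taken from the example above.

    # RBD sketch: randomize treatments within each gender block (250 per treatment per block).
    import random

    random.seed(7)
    treatments = ["placebo", "vaccine"]

    assignment = {}
    for block in ["male", "female"]:
        subjects = [f"{block}_{i}" for i in range(500)]   # 500 subjects per block
        labels = treatments * 250                         # exactly 250 of each treatment
        random.shuffle(labels)                            # randomization happens within the block
        assignment.update(dict(zip(subjects, labels)))

    # Each block ends up with exactly 250 vaccine and 250 placebo subjects.
    print(sum(label == "vaccine" for subject, label in assignment.items()
              if subject.startswith("male")))             # prints 250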

LSD - Latin Square Design - A Latin square is one of the experimental designs which has a
balanced two-way classification scheme say for example - 4 X 4 arrangement. In this scheme
each letter from A to D occurs only once in each row and also only once in each column. The
balance arrangement, it may be noted that, will not get disturbed if any row gets changed with
the other.
A B C D
B C D A
C D A B
D A B C

The balance arrangement achieved in a Latin Square is its main strength. In this design, the
comparisons among treatments, will be free from both differences between rows and
columns. Thus the magnitude of error will be smaller than any other design.
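The cyclic 4 x 4 arrangement shown above can be generated mechanically; the sketch below builds an n x n Latin square by shifting one row, which is just one of many valid constructions.

    # Build an n x n Latin square by cyclically shifting the first row.
    def latin_square(symbols):
        n = len(symbols)
        return [[symbols[(row + col) % n] for col in range(n)] for row in range(n)]

    for row in latin_square(["A", "B", "C", "D"]):
        print(" ".join(row))
    # A B C D
    # B C D A
    # C D A B
    # D A B C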

FD - Factorial Designs - This design allows the experimenter to test two or more variables
simultaneously. It also measures interaction effects of the variables and analyzes the impacts
of each of the variables.

In a true experiment, randomization is essential so that the experimenter can infer cause and
effect without any bias.

Sources of Secondary Data


While primary data can be collected through questionnaires, depth interview, focus
group interviews, case studies, experimentation and observation; The secondary data can be
obtained through

 Internal Sources - These are within the organization
 External Sources - These are outside the organization

Internal sources
If available, internal secondary data may be obtained with less time, effort and money
than the external secondary data. In addition, they may also be more pertinent to the situation
at hand since they are from within the organization. The internal sources include

Accounting resources- This gives so much information which can be used by the marketing
researcher. They give information about internal factors.

Sales Force Report- It gives information about the sales of a product. The information
provided is from outside the organization.

Internal Experts- These are people who are heading the various departments. They can give
an idea of how a particular thing is working.

Miscellaneous Reports- These contain the information you are getting from operational
reports. If the data available within the organization are unsuitable or inadequate, the marketer
should extend the search to external secondary data sources.

External Sources of Data


External Sources are sources which are outside the company in a larger environment.
Collection of external data is more difficult because the data have much greater variety and
the sources are much more numerous.

External data can be divided into following classes.

Government Publications- Government sources provide an extremely rich pool of data for
researchers. In addition, many of these data are available free of cost on internet websites.
There are a number of government agencies generating data. These are:

Registrar General of India- It is an office which generates demographic data. It includes


details of gender, age, occupation etc.

Central Statistical Organization- This organization publishes the national accounts


statistics. It contains estimates of national income for several years, growth rate, and rate of
major economic activities. Annual survey of Industries is also published by the CSO. It gives
information about the total number of workers employed, production units, material used and
value added by the manufacturer.

Director General of Commercial Intelligence- This office operates from Kolkata. It gives
information about foreign trade i.e. import and export. These figures are provided region-
wise and country-wise.

Ministry of Commerce and Industries- This ministry through the office of economic
advisor provides information on wholesale price index. These indices may be related to a
number of sectors like food, fuel, power, food grains etc. It also generates All India
Consumer Price Index numbers for industrial workers, urban non-manual employees and
agricultural labourers.

Planning Commission- It provides the basic statistics of Indian Economy.

Reserve Bank of India- This provides information on Banking Savings and investment. RBI
also prepares currency and finance reports.

Labour Bureau- It provides information on skilled, unskilled, white collared jobs etc.
National Sample Survey- This is done by the Ministry of Planning and it provides social,
economic, demographic, industrial and agricultural statistics.

Department of Economic Affairs- It conducts economic survey and it also generates


information on income, consumption, expenditure, investment, savings and foreign trade.

State Statistical Abstract- This gives information on various types of activities related to the
state like - commercial activities, education, occupation etc.

Non-Government Publications- These include publications of various industrial and trade
associations, such as:

The Indian Cotton Mill Association
Various chambers of commerce
The Bombay Stock Exchange (it publishes a directory containing financial accounts, key
profitability and other relevant matter)
Various Associations of Press Media
Export Promotion Council

Confederation of Indian Industries (CII)


Small Industries Development Board of India

Different Mills like - Woolen mills, Textile mills etc


The only disadvantage of the above sources is that the data may be biased. They are likely to
colour their negative points.

Syndicate Services- These services are provided by certain organizations which collect and
tabulate the marketing information on a regular basis for a number of clients who are the
subscribers to these services. So the services are designed in such a way that the information
suits the subscriber. These services are useful in television viewing, movement of consumer
goods etc. These syndicate services provide information data from both household as well as
institution.
In collecting data from households they use three approaches:
Survey- They conduct surveys regarding lifestyle, sociographics and general topics.
Mail Diary Panel- It may be related to 2 fields - purchase and media.

Electronic Scanner Services- These are used to generate data on volume.


They collect data for institutions from wholesalers, retailers, and industrial firms.

Various syndicate services are Operations Research Group (ORG) and The Indian
Marketing Research Bureau (IMRB).
Importance of Syndicate Services
Syndicate services are becoming popular since the constraints of decision making are
changing and we need more of specific decision-making in the light of changing
environment. Also Syndicate services are able to provide information to the industries at a
low unit cost.
Disadvantages of Syndicate Services
The information provided is not exclusive. A number of research agencies provide
customized services which suit the requirements of each individual organization.
International Organization- These includes
The International Labour Organization (ILO)- It publishes data on the total and active
population, employment, unemployment, wages and consumer prices.
The Organization for Economic Co-operation and Development (OECD)- It publishes data
on foreign trade, industry, food, transport, and science and technology.

The International Monetary Fund (IMF)- It publishes reports on national and
international foreign exchange regulations.

1.2.1 Comparison of sources of data

Based on various features (cost, data, process, source, time etc.) the various sources of
data can be compared as per table 1.

Table 1: Difference between primary data and secondary data.

Comparison feature         Primary data                           Secondary data

Meaning                    Data that is collected by the          Data that is collected by
                           researcher                             other people
Data                       Real-time data                         Past data
Process                    Very involved                          Quick and easy
Source                     Surveys, interviews, experiments,      Books, journals, publications
                           questionnaires etc.                    etc.
Cost effectiveness         Expensive                              Economical
Collection time            Long                                   Short
Specificity                Specific to the researcher's need      May not be specific to the
                                                                  researcher's need
Available form             Crude form                             Refined form
Accuracy and reliability   More                                   Less

1.3 Understanding Sources of Data from Sensor

Sensor data is the output of a device that detects and responds to some type of input
from the physical environment. The output may be used to provide information or input to
another system or to guide a process. Examples are as follows

 A photosensor detects the presence of visible light, infrared transmission (IR) and/or
ultraviolet (UV) energy.
 Lidar, a laser-based method of detection, range finding and mapping, typically uses a
low-power, eye-safe pulsing laser working in conjunction with a camera.
 A charge-coupled device (CCD) stores and displays the data for an image in such a way
that each pixel is converted into an electrical charge, the intensity of which is related to a
color in the color spectrum.
 Smart grid sensors can provide real-time data about grid conditions, detecting outages,
faults and load and triggering alarms.
 Wireless sensor networks combine specialized transducers with a communications
infrastructure for monitoring and recording conditions at diverse locations. Commonly
monitored parameters include temperature, humidity, pressure, wind direction and speed,
illumination intensity, vibration intensity, sound intensity, powerline voltage, chemical
concentrations, pollutant levels and vital body functions.

1.4 Understanding Sources of Data from Signal

The simplest form of signal is a direct current (DC) that is switched on and off; this is
the principle by which the early telegraph worked. More complex signals consist of an
alternating-current (AC) or electromagnetic carrier that contains one or more data streams.
Data must be transformed into electromagnetic signals prior to transmission across a
network. Data and signals can be either analog or digital. A signal is periodic if it consists
of a continuously repeating pattern.

1.5 Understanding Sources of Data from GPS

The Global Positioning System (GPS) is a space based navigation system that
provides location and time information in all weather conditions, anywhere on or near the
Earth where there is an unobstructed line of sight to four or more GPS satellites. The system
provides critical capabilities to military, civil, and commercial users around the world. The
United States government created the system, maintains it, and makes it freely accessible to
anyone with a GPS receiver.

1.6 Data Management

Data management is the development and execution of architectures, policies,


practices and procedures in order to manage the information lifecycle needs of an enterprise
in an effective manner.

1.7 Data Quality


Data quality refers to the state of qualitative or quantitative pieces of information. There are
many definitions of data quality, but data is generally considered high quality if it is "fit for
[its] intended uses in operations, decision making and planning".

The seven characteristics that define data quality are:

1. Accuracy and Precision


2. Legitimacy and Validity
3. Reliability and Consistency
4. Timeliness and Relevance
5. Completeness and Comprehensiveness
6. Availability and Accessibility
7. Granularity and Uniqueness

Accuracy and Precision: This characteristic refers to the exactness of the data. It cannot
have any erroneous elements and must convey the correct message without being misleading.
This accuracy and precision have a component that relates to its intended use. Without
understanding how the data will be consumed, ensuring accuracy and precision could be off-
target or more costly than necessary. For example, accuracy in healthcare might be more
important than in another industry (which is to say, inaccurate data in healthcare could have
more serious consequences) and, therefore, justifiably worth higher levels of investment.

Legitimacy and Validity: Requirements governing data set the boundaries of this
characteristic. For example, on surveys, items such as gender, ethnicity, and nationality
are typically limited to a set of options and open answers are not permitted. Any answers
other than these would not be considered valid or legitimate based on the survey’s
requirement. This is the case for most data and must be carefully considered when
determining its quality. The people in each department in an organization understand what
data is valid or not to them, so the requirements must be leveraged when evaluating data
quality.

Reliability and Consistency: Many systems in today’s environments use and/or collect the
same source data. Regardless of what source collected the data or where it resides, it cannot
contradict a value residing in a different source or collected by a different system. There must
be a stable and steady mechanism that collects and stores the data without contradiction or
unwarranted variance.

Timeliness and Relevance: There must be a valid reason to collect the data to justify the
effort required, which also means it has to be collected at the right moment in time. Data
collected too soon or too late could misrepresent a situation and drive inaccurate
decisions.

Completeness and Comprehensiveness: Incomplete data is as dangerous as inaccurate


data. Gaps in data collection lead to a partial view of the overall picture to be displayed.
Without a complete picture of how operations are running, uninformed actions will occur. It’s
important to understand the complete set of requirements that constitute a comprehensive set
of data to determine whether or not the requirements are being fulfilled.

Availability and Accessibility: This characteristic can be tricky at times due to legal and
regulatory constraints. Regardless of the challenge, though, individuals need the right level of
access to the data in order to perform their jobs. This presumes that the data exists and is
available for access to be granted.

Granularity and Uniqueness: The level of detail at which data is collected is important,
because confusion and inaccurate decisions can otherwise occur. Aggregated, summarized
and manipulated collections of data could offer a different meaning than the data
implied at a lower level. An appropriate level of granularity must be defined to provide
sufficient uniqueness and distinctive properties to become visible. This is a requirement for
operations to function effectively.

Noisy data is meaningless data. The term has often been used as a synonym for
corrupt data. However, its meaning has expanded to include any data that cannot be
understood and interpreted correctly by machines, such as unstructured text.

An outlier is an observation that lies an abnormal distance from other values in a


random sample from a population. In a sense, this definition leaves it up to the analyst (or a
consensus process) to decide what will be considered abnormal.
In statistics, missing data, or missing values, occur when no data value is stored for
the variable in an observation. Missing data are a common occurrence and can have a
significant effect on the conclusions that can be drawn from the data. Missing values can be
replaced by following techniques:

 Ignore the record with missing values.


 Replace the missing term with constant.
 Fill the missing value manually based on domain knowledge.
 Replace them with mean (if data is numeric) or frequent value (if data is
categorical)
 Use of modelling techniques such as decision trees, Bayes' algorithm, the nearest-neighbour
algorithm, etc. (a small imputation sketch follows this list)
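A minimal pandas sketch of the replacement ideas listed above (dropping records, constant fill, mean fill for numeric data, most-frequent fill for categorical data); the column names and values are invented for the example.

    # Missing-value handling sketch: drop, constant fill, mean fill, mode fill.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, np.nan, 31, 29],
                       "city": ["Delhi", None, "Delhi", "Pune"]})

    dropped = df.dropna()                                # ignore records with missing values
    constant = df.fillna({"age": 0, "city": "unknown"})  # replace with a constant
    mean_fill = df.assign(age=df["age"].fillna(df["age"].mean()))        # numeric -> mean
    mode_fill = df.assign(city=df["city"].fillna(df["city"].mode()[0]))  # categorical -> frequent value

    print(mean_fill)
    print(mode_fill)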

In computing, data deduplication is a specialized data compression technique for


eliminating duplicate copies of repeating data. Related and somewhat synonymous terms are
intelligent (data) compression and single instance (data) storage.

Noisy data

For objects, noise is considered an extraneous object.

For attributes, noise refers to modification of original values.

 Examples: distortion of a person’s voice when talking on a poor phone and “snow” on
television screen
 We can talk about signal to noise ratio.
In the usual illustration, a plot of two clean sine waves has a very high SNR, while the same
two waves combined with noise have a much lower SNR.

Origins of noise

 outliers -- values seemingly out of the normal range of data


 duplicate records -- good database design should minimize this (use DISTINCT on
SQL retrievals)
 incorrect attribute values -- again good db design and integrity constraints should
minimize this
 numeric only, deal with rogue strings or characters where numbers should be.
 null handling for attributes (nulls=missing values)

Missing Data Handling

Many causes: malfunctioning equipment, changes in experimental design, collation of


different data sources, measurement not possible. People may wish not to supply information.
Information may not be applicable (children don't have an annual income).

 Discard records with missing values


 Ordinal-continuous data, could replace with attribute means
 Substitute with a value from a similar instance
 Ignore missing values, i.e., just proceed and let the tools deals with them
 Treat missing values as equals (all share the same missing value code)
 Treat missing values as unequal values

BUT...Missing (null) values may have significance in themselves (e.g. missing test in a
medical examination, deathdate missing means still alive!)

Missing completely at random (MCAR)

 Missingness of a value is independent of attributes


 Fill in values based on the attribute as suggested above (e.g. attribute mean)
 Analysis may be unbiased overall

Missing at Random (MAR)

 Missingness is related to other variables


 Fill in values based other values (e.g., from similar instances)
 Almost always produces a bias in the analysis

Missing Not at Random (MNAR)

 Missingness is related to unobserved measurements


 Informative or non-ignorable missingness

Duplicate Data

Data set may include data objects that are duplicates, or almost duplicates of one another

A major issue when merging data from multiple, heterogeneous sources

 Examples: Same person with multiple email addresses

1.8 Data Preprocessing

Data preprocessing is a data mining technique that involves transforming


raw data into an understandable format. Real-world data is often incomplete, inconsistent,
and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data
preprocessing is a proven method of resolving such issues.

Data goes through a series of steps during preprocessing:



 Data Cleaning: Data is cleansed through processes such as filling in missing values,
smoothing the noisy data, or resolving the inconsistencies in the data.
 Data Integration: Data with different representations are put together and conflicts
within the data are resolved.
 Data Transformation: Data is normalized, aggregated and generalized.
 Data Reduction: This step aims to present a reduced representation of the data in a
data warehouse.

Data Discretization: Involves reducing the number of values of a continuous attribute
by dividing the range of the attribute into intervals.

Steps Involved in Data Preprocessing:

1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is
done. It involves handling of missing data, noisy data etc.

(a). Missing Data:


This situation arises when some values are missing in the data. It can be handled in
various ways.
Some of them are:

1. Ignore the tuples:


This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
2. Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing
values manually, by attribute mean or the most probable value.

(b). Noisy Data:


Noisy data is meaningless data that can’t be interpreted by machines. It can be
generated due to faulty data collection, data entry errors etc. It can be handled in the
following ways:

1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is
divided into segments of equal size and then various methods are performed to
complete the task. Each segment is handled separately. One can replace all
data in a segment by its mean, or boundary values can be used to complete the
task (a short binning sketch follows this list).
2. Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go
undetected, or they will fall outside the clusters.
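As referenced in point 1, here is a short, hedged sketch of smoothing by equal-size binning and bin means; the numbers are invented.

    # Smoothing noisy values by equal-size (equal-frequency) binning and bin means.
    import numpy as np

    data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)
    data.sort()                                   # binning works on sorted data

    bins = np.array_split(data, 3)                # three segments of equal size
    smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])

    print(smoothed)   # each value replaced by the mean of its bin: 9, 9, 9, 22, 22, 22, 29, 29, 29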

2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for mining
process. This involves following ways:

1. Normalization:
It is done in order to scale the data values to a specified range (-1.0 to 1.0 or 0.0 to
1.0); a small sketch follows this list.
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help
the mining process.
3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example,
the attribute “city” can be converted to “country”.
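As mentioned in point 1, here is a small sketch of min-max normalization to [0.0, 1.0] together with discretization of a numeric attribute into interval levels; the values are invented.

    # Min-max normalization and simple discretization of a numeric attribute.
    import pandas as pd

    prices = pd.Series([12.0, 30.0, 45.0, 70.0, 99.0])

    normalized = (prices - prices.min()) / (prices.max() - prices.min())   # scaled to [0.0, 1.0]
    levels = pd.cut(prices, bins=3, labels=["low", "medium", "high"])      # interval (conceptual) levels

    print(normalized.round(2).tolist())
    print(levels.tolist())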

3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder as
the volume of data grows. To cope with this, data reduction techniques are used. They aim to
increase storage efficiency and reduce data storage and analysis costs.

The various steps to data reduction are:

1. Data Cube Aggregation:


Aggregation operation is applied to data for the construction of the data cube.
2. Attribute Subset Selection:
The highly relevant attributes should be used; the rest can be discarded. For
performing attribute selection, one can use the level of significance and the p-value of the
attribute: an attribute having a p-value greater than the significance level can be discarded.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example: regression
models.
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If,
after reconstruction from the compressed data, the original data can be retrieved, such a
reduction is called lossless reduction; otherwise it is called lossy reduction. Two
effective methods of dimensionality reduction are wavelet transforms and PCA
(Principal Component Analysis). A small PCA sketch follows this list.
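As noted in point 4, here is a brief PCA sketch (scikit-learn is assumed to be available; the data is random and only illustrates the change in dimensionality).

    # Dimensionality reduction sketch: project 5-dimensional data onto 2 principal components.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))             # 100 records, 5 attributes

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X.shape, "->", X_reduced.shape)                  # (100, 5) -> (100, 2)
    print("explained variance ratio:", pca.explained_variance_ratio_)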

UNIT – II
INTRODUCTION TO ANALYTICS
2.1 Introduction to Analytics

As an enormous amount of data gets generated, the need to extract useful insights is a
must for a business enterprise. Data Analytics has a key role in improving your business.
Here are 4 main factors which signify the need for Data Analytics:

 Gather Hidden Insights – Hidden insights from data are gathered and then analyzed
with respect to business requirements.
 Generate Reports – Reports are generated from the data and are passed on to the
respective teams and individuals to deal with further actions for a high rise in
business.
 Perform Market Analysis – Market Analysis can be performed to understand the
strengths and the weaknesses of competitors.
 Improve Business Requirement – Analysis of Data allows improving Business to
customer requirements and experience.

Data Analytics refers to the techniques to analyze data to enhance productivity and
business gain. Data is extracted from various sources and is cleaned and categorized to
analyze different behavioral patterns. The techniques and the tools used vary according to the
organization or individual.

Data analysts translate numbers into plain English. A Data Analyst delivers value to their
companies by taking information about specific topics and then interpreting, analyzing,
and presenting findings in comprehensive reports. So, if you have the capability to collect
data from various sources, analyze the data, gather hidden insights and generate reports, then
you can become a Data Analyst. Refer to the image below:

Fig 2.1 Data Analytics



In general, data analytics also involves a bit of human knowledge, as discussed below
in figure 2.2: under each type of analytics there is a part of human knowledge required
in prediction. Descriptive analytics requires the highest human input, while predictive
analytics requires less human input. In the case of prescriptive analytics no human input is
required, since all the data is predicted.

Fig 2.2: Data and human work

Fig 2.3: Data Analytics



2.2 Introduction to Tools and Environment

In general, data analytics deals with three main parts: subject knowledge, statistics, and a
person with computer knowledge to work on a tool to give insight into the business. The
most commonly used tools are R and Python, as shown in figure 2.3.

With the increasing demand for Data Analytics in the market, many tools have emerged
with various functionalities for this purpose. Either open-source or user-friendly, the top tools
in the data analytics market are as follows.

 R programming – This tool is the leading analytics tool used for statistics and data
modeling. R compiles and runs on various platforms such as UNIX, Windows, and Mac
OS. It also provides tools to automatically install all packages as per user-requirement.
 Python – Python is an open-source, object-oriented programming language which is easy
to read, write and maintain. It provides various machine learning and visualization
libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras etc. It also can be
assembled on any platform like SQL server, a MongoDB database or JSON
 Tableau Public – This is a free software that connects to any data source such as Excel,
corporate Data Warehouse etc. It then creates visualizations, maps, dashboards etc with
real-time updates on the web.
 QlikView – This tool offers in-memory data processing with the results delivered to the
end-users quickly. It also offers data association and data visualization with data being
compressed to almost 10% of its original size.
 SAS – A programming language and environment for data manipulation and analytics,
this tool is easily accessible and can analyze data from different sources.
 Microsoft Excel – This tool is one of the most widely used tools for data analytics.
Mostly used for clients’ internal data, this tool analyzes the tasks that summarize the data
with a preview of pivot tables.
 RapidMiner – A powerful, integrated platform that can integrate with any data source
types such as Access, Excel, Microsoft SQL, Tera data, Oracle, Sybase etc. This tool is
mostly used for predictive analytics, such as data mining, text analytics, machine
learning.
 KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics
platform, which allows you to analyze and model data. With the benefit of visual
programming, KNIME provides a platform for reporting and integration through its
modular data pipeline concept.
 OpenRefine – Also known as GoogleRefine, this data cleaning software will help you
clean up data for analysis. It is used for cleaning messy data, the transformation of data
and parsing data from websites.
 Apache Spark – One of the largest large-scale data processing engines, this tool executes
applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk.
This tool is also popular for data pipelines and machine learning model development.

Apart from the above-mentioned capabilities, a Data Analyst should also possess skills
such as Statistics, Data Cleaning, Exploratory Data Analysis, and Data Visualization. Also, if
you have knowledge of Machine Learning, then that would make you stand out from the
crowd.

2.3 Application of modelling a business & Need for business Modelling

Data analytics is mainly involved in the field of business, in various concerns, for the
purposes discussed below; its use varies according to business needs. Nowadays the
majority of businesses deal with prediction over large amounts of data.

Using big data as a fundamental factor in decision making requires new capabilities, and most
firms are far away from accessing all data resources. Companies in various sectors have
acquired crucial insight from the structured data collected from different enterprise systems
and analyzed by commercial database management systems. Eg:

1.) Facebook and Twitter are used to gauge the instantaneous influence of campaigns and to
examine consumer opinion about products.
2.) Some companies, like Amazon, eBay, and Google, considered early leaders,
examine the factors that control performance to determine what raises sales revenue and
user interactivity.

2.3.1 Utilizing Hadoop in Big Data Analytics.

Hadoop is an open-source software platform that enables the processing of large data sets in a
distributed computing environment. Work in this area discusses some concepts related to big data and the
rules for building, organizing and analyzing huge data-sets in the business environment; it
offers a 3-layer architecture and indicates some graphical tools to explore and
represent unstructured data, and it shows how famous companies could improve
their business. Eg: Google, Twitter and Facebook show their interest in processing big data
within a cloud environment.

Fig 2.4: Working of Hadoop – With Map Reduce Concept



The Map() step: Each worker node applies the Map() function to the local data and writes the
output to a temporary storage space. The Map() code is run exactly once for each K1 key
value, generating output that is organized by key values K2. A master node arranges it so that
for redundant copies of input data only one is processed.

The Shuffle() step: The map output is sent to the reduce processors, which assign the K2 key
value that each processor should work on, and provide that processor with all of the map-
generated data associated with that key value, such that all data belonging to one key are
located on the same worker node.

The Reduce() step: Worker nodes process each group of output data (per key) in parallel,
executing the user-provided Reduce() code; each function is run exactly once for each K2 key
value produced by the map step.

Produce the final output: The MapReduce system collects all of the reduce outputs and sorts
them by K2 to produce the final outcome.

Fig. 2.4 shows the classical “word count problem” using the MapReduce paradigm. As shown
in Fig. 2.4, initially a process will split the data into a subset of chunks that will later be
processed by the mappers. Once the key/values are generated by mappers, a shuffling process
is used to mix (combine) these key values (combining the same keys in the same worker
node). Finally, the reduce functions are used to count the words and generate a common
output as a result of the algorithm. As a result of the execution of mappers/reducers, the
output will generate a sorted list of word counts from the original text input.
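The word-count flow of Fig. 2.4 can be mimicked in a few lines of plain Python; this is a toy simulation of the map, shuffle and reduce steps, not actual Hadoop code.

    # Toy word count in the MapReduce style: map -> shuffle (group by key) -> reduce.
    from collections import defaultdict

    chunks = ["deer bear river", "car car river", "deer car bear"]

    # Map: emit (word, 1) pairs for every word in every chunk.
    mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

    # Shuffle: group all values belonging to the same key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: sum the counts for each key.
    counts = {key: sum(values) for key, values in groups.items()}
    print(sorted(counts.items()))   # [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]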

2.3.2 The Employment of Big Data Analytics on IBM.

IBM and Microsoft are prominent representatives. IBM offers many big data options
that enable users to store, manage, and analyze data through various resources; it has a
good standing in business intelligence as well as healthcare areas. Compared with IBM,
Microsoft has shown powerful work in the area of cloud computing activities and techniques.
Another example is Facebook and Twitter, who collect various data from users'
profiles and use it to increase their revenue.

2.3.3 The Performance of Data Driven Companies.

Big data analytics and business intelligence are united fields which have become widely
significant in the business and academic areas; companies are permanently trying to gain
insight from the extended three V's (variety, volume and velocity) to support decision
making.

2.4 Databases
A database is an organized collection of structured information, or data, typically
stored electronically in a computer system. A database is usually controlled by
a database management system (DBMS).



The database can be divided into various categories such as text databases,
desktop database programs, relational database management systems (RDBMS), and NoSQL
and object-oriented databases.

A text database is a system that maintains a (usually large) text collection and
provides fast and accurate access to it. Eg: Text book, magazine, journals, manuals, etc..

A desktop database is a database system that is made to run on a single computer or PC. These simpler solutions for data storage are much more limited and constrained than larger data center or data warehouse systems, where primitive database software is replaced by sophisticated hardware and networking setups. Eg: Microsoft Excel, Microsoft Access, etc.

A relational database (RDB) is a collective set of multiple data sets organized by tables, records and columns. RDBs establish a well-defined relationship between database tables. Tables communicate and share information, which facilitates data searchability, organization and reporting. Eg: Oracle, DB2, DBaaS platforms, and other SQL-based systems.

NoSQL databases are non-tabular and store data differently than relational tables. NoSQL databases come in a variety of types based on their data model. The main types are document, key-value, wide-column, and graph. Eg: JSON document stores such as MongoDB and CouchDB.

Object-oriented databases (OODB) are databases that represent data in the form of objects and classes. In object-oriented terminology, an object is a real-world entity, and a class is a collection of objects. Object-oriented databases follow the fundamental principles of object-oriented programming (OOP) and are typically built around languages such as C++, Java, C#, Smalltalk, and LISP.

2.5 Types of Data and variables

In any database we work with data to perform analysis and prediction. In a relational database management system we normally use rows to represent records and columns to represent attributes.

In big data terms, a column from an RDBMS is referred to as an attribute or a variable. A variable can be divided into two types: categorical (qualitative) data and continuous or discrete (quantitative) data, as shown below in figure 2.5.

Qualitative or categorical data is normally represented as a variable that holds characters, and it is divided into two types: nominal data and ordinal data.

In nominal data there is no natural ordering of the values of the attribute. Eg: color, gender, nouns (name, place, animal, thing). These categories cannot be given a predefined order; for example, there is no specific way to order the gender of 50 students in a class. The first student can be male or female, and similarly for all 50 students, so no ordering is valid.

In ordinal data there is a natural ordering of the values of the attribute. Eg: size (S, M, L, XL, XXL), rating (excellent, good, better, worst). In these examples we can quantify the data after ordering it, which gives valuable insights into the data.

Fig 2.5: Types of Data Variables

Quantitative data (discrete or continuous data) can be further divided into two types: discrete attributes and continuous attributes.

A discrete attribute takes only a finite number of numerical values (integers). Eg: number of buttons, number of days for product delivery, etc. These data can be represented at specific intervals in time series data mining or in ratio-based entries.

A continuous attribute takes fractional (real-valued) measurements. Eg: price, discount, height, weight, length, temperature, speed, etc. These data can also be represented at specific intervals in time series data mining or in ratio-based entries.
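
As a small illustration of these variable types, the following Python sketch (using pandas; the column names and values are hypothetical) marks one attribute as nominal, one as ordinal, one as discrete, and one as continuous.

    import pandas as pd

    df = pd.DataFrame({
        "gender": ["M", "F", "F"],            # nominal: no natural order
        "size": ["S", "M", "XL"],             # ordinal: natural order S < M < ... < XXL
        "buttons": [2, 4, 3],                 # discrete quantitative (integer counts)
        "price": [199.99, 250.50, 310.00],    # continuous quantitative
    })

    # declare the ordinal attribute with an explicit category order
    df["size"] = pd.Categorical(df["size"],
                                categories=["S", "M", "L", "XL", "XXL"],
                                ordered=True)
    print(df.dtypes)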

2.6 Data Modelling Techniques

Data modelling is nothing but a process through which data is stored structurally in a
format in a database. Data modelling is important because it enables organizations to make
data-driven decisions and meet varied business goals.

The entire process of data modelling is not as easy as it seems, though. You are required to have a deep understanding of the structure of an organization and then propose a solution that aligns with its end goals and helps it achieve the desired objectives.

Types of Data Models

Data modeling can be achieved in various ways. However, the basic concept of each
of them remains the same. Let’s have a look at the commonly used data modeling methods:

Hierarchical model

As the name indicates, this data model makes use of hierarchy to structure the data in
a tree-like format as shown in figure 2.6. However, retrieving and accessing data is difficult
in a hierarchical database. This is why it is rarely used now.

Fig 2.6: Hierarchical Model Structure

Relational model

Proposed as an alternative to the hierarchical model by an IBM researcher, here data is represented in the form of tables. It reduces the complexity and provides a clear overview of the data as shown below in figure 2.7.

Fig 2.7: Relational Model Structure


Network model

The network model is inspired by the hierarchical model. However, unlike the
hierarchical model, this model makes it easier to convey complex relationships as each record
can be linked with multiple parent records as shown in figure 2.8. In this model data can be
shared easily and the computation becomes easier.

Fig 2.8: Network Model Structure


Object-oriented model

This database model consists of a collection of objects, each with its own features and methods. This type of database model is also called the post-relational database model, as shown in figure 2.9.

Fig 2.9: Object-Oriented Model Structure

Entity-relationship model

The entity-relationship model, also known as the ER model, represents entities and their relationships in a graphical format. An entity could be anything – a concept, a piece of data, or an object.

Fig 2.10: Entity Relationship Diagram

The entity-relationship diagram shows the relationships between entities along with their primary keys and foreign keys, as shown in figure 2.10. It also shows the cardinality of the relationships between tables (how many instances of one entity relate to another).

Now that we have a basic understanding of data modeling, let’s see why it is important.

Importance of Data Modeling


 A clear representation of data makes it easier to analyze the data properly. It provides
a quick overview of the data which can then be used by the developers in varied
applications.
 Data modeling represents the data properly in a model. It rules out any chances of
data redundancy and omission. This helps in clear analysis and processing.
 Data modeling improves data quality and enables the concerned stakeholders to make
data-driven decisions.

Since a lot of business processes depend on successful data modeling, it is necessary to adopt the right data modeling techniques for the best results.

Best Data Modeling Practices to Drive Your Key Business Decisions

Have a clear understanding of your end-goals and results



You will agree with us that the main goal behind data modeling is to equip your business and
contribute to its functioning. As a data modeler, you can achieve this objective only when
you know the needs of your enterprise correctly.
It is essential to make yourself familiar with the varied needs of your business so that you can
prioritize and discard the data depending on the situation.

Key takeaway: Have a clear understanding of your organization’s requirements and organize
your data properly.

Keep it sweet and simple and scale as you grow

Things will be sweet initially, but they can become complex in no time. This is why it is
highly recommended to keep your data models small and simple, to begin with.

Once you are sure of your initial models in terms of accuracy, you can gradually introduce
more datasets. This helps you in two ways. First, you are able to spot any inconsistencies in
the initial stages. Second, you can eliminate them on the go.

Key takeaway: Keep your data models simple. The best data modeling practice here is to use
a tool which can start small and scale up as needed.
Organize your data based on facts, dimensions, filters, and order

You can find answers to most business questions by organizing your data in terms of four
elements – facts, dimensions, filters, and order.

Let’s understand this better with the help of an example. Let’s assume that you run four e-
commerce stores in four different locations of the world. It is the year-end, and you want to
analyze which e-commerce store made the most sales.

In such a scenario, you can organize your data over the last year. Facts will be the overall
sales data of last 1 year, the dimensions will be store location, the filter will be last 12
months, and the order will be the top stores in decreasing order.

This way, you can organize all your data properly and position yourself to answer an array
of business intelligence questions without breaking a sweat.

Key takeaway: It is highly recommended to organize your data properly using individual
tables for facts and dimensions to enable quick analysis.

Keep as much as is needed

While you might be tempted to keep all the data with you, do not ever fall for this trap!
Although storage is not a problem in this digital age, you might end up taking a toll on your machines' performance.

More often than not, just a small yet useful amount of data is enough to answer all the business-related questions. Spending heavily on hosting enormous amounts of data only leads to performance issues, sooner or later.

Key takeaway: Have a clear opinion on how much data you want to keep. Maintaining more than what is actually required undermines your data modeling effort and leads to performance issues.

Keep crosschecking before continuing

Data modeling is a big project, especially when you are dealing with huge amounts of data.
Thus, you need to be cautious enough. Keep checking your data model before continuing to
the next step.

For example, if you need to choose a primary key to identify each record in the dataset properly, make sure that you are picking the right attribute. Product ID could be one such attribute: even if two counts match, their product IDs can help you distinguish each record. Keep checking that you are on the right track. Are the product IDs the same too? In those cases, you will need to look for another dataset to establish the relationship.
Key takeaway: It is the best practice to maintain one-to-one or one-to-many relationships.
The many-to-many relationship only introduces complexity in the system.

Let them evolve


Data models are never written in stone. As your business evolves, it is essential to customize your data modeling accordingly. Thus, it is essential that you keep updating them over time. The best practice here is to store your data models in an easy-to-manage repository so that you can make adjustments on the go.

Key takeaway: Data models become outdated quicker than you expect. It is necessary that
you keep them updated from time to time.

The Wrap Up

Data modeling plays a crucial role in the growth of businesses, especially when organizations base their decisions on facts and figures. To achieve the varied business intelligence insights and goals, it is recommended to model your data correctly and use appropriate tools to ensure the simplicity of the system.

2.7 Missing Imputations

In statistics, imputation is the process of replacing missing data with substituted values. Because missing data can create problems for analyzing data, imputation is seen as a way to avoid the pitfalls involved with list-wise deletion of cases that have missing values.

I. Do nothing to missing data


II. Fill the missing values in the dataset using mean, median.

Eg: for sample dataset given below

SNo   Column1   Column2   Column3
1     3         6         NAN
2     5         10        12
3     6         11        15
4     NAN       12        14
5     6         NAN       NAN
6     10        13        16

The missing values can be replaced using the column means as follows (each mean is computed from the observed, non-missing values: Column1 = 6, Column2 = 10.4, Column3 = 14.25):

SNo   Column1   Column2   Column3
1     3         6         14.25
2     5         10        12
3     6         11        15
4     6         12        14
5     6         10.4      14.25
6     10        13        16

Advantages:
• Works well with numerical datasets.
• Very fast and reliable.

Disadvantages:
• Does not work with categorical attributes
• Does not account for correlations between columns
• Not very accurate.
• Does not account for any uncertainty in the data
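
A minimal Python sketch of mean imputation on the sample dataset above, assuming pandas and NumPy are available (pandas skips missing values when computing a column mean):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "Column1": [3, 5, 6, np.nan, 6, 10],
        "Column2": [6, 10, 11, 12, np.nan, 13],
        "Column3": [np.nan, 12, 15, 14, np.nan, 16],
    })

    # fill every missing cell with the mean of the observed values in its column
    filled = df.fillna(df.mean())
    print(filled)   # Column1 -> 6.0, Column2 -> 10.4, Column3 -> 14.25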

III. Imputations using (most frequent) or (zero / constant) values


This can be used for categorical attributes.
Disadvantages:
• Does not account for correlations between columns
• Creates bias in the data.

IV. Imputation using KNN


It first creates a basic mean impute and then uses the resulting complete dataset to construct a KD-tree. It then uses the KD-tree to compute the nearest neighbours (NN). After it finds the k nearest neighbours, it takes their weighted average.

The k nearest neighbours is an algorithm that is used for simple classification. The algorithm
uses ‘feature similarity’ to predict the values of any new data points. This means that the new
point is assigned a value based on how closely it resembles the points in the training set. This can
be very useful in making predictions about the missing values by finding the k’s closest
neighbours to the observation with missing data and then imputing them based on the non-
missing values in the neighbourhood.
Advantage:
• This method is more accurate than mean, median, or mode imputation

Disadvantage:
• Sensitive to outliers
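
A small Python sketch of KNN imputation on the same sample dataset, assuming scikit-learn (version 0.22 or later, which provides KNNImputer) is available; the choices of k = 2 and distance weighting are only illustrative.

    import numpy as np
    from sklearn.impute import KNNImputer

    X = np.array([[3,      6,      np.nan],
                  [5,      10,     12],
                  [6,      11,     15],
                  [np.nan, 12,     14],
                  [6,      np.nan, np.nan],
                  [10,     13,     16]])

    # each missing value is replaced by the (distance-weighted) average of that
    # feature over the k most similar rows, measured on the non-missing features
    imputer = KNNImputer(n_neighbors=2, weights="distance")
    print(imputer.fit_transform(X))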

UNIT-3
BLUE Property Assumptions

 The Gauss Markov theorem tells us that if a certain set of assumptions are met, the
ordinary least squares estimate for regression coefficients gives you the Best Linear
Unbiased Estimate (BLUE) possible.

 There are five Gauss Markov assumptions (also called conditions):

 Linearity:
o The parameters we are estimating using the OLS method must be themselves
linear.
 Random:
o Our data must have been randomly sampled from the population.
 Non-Collinearity:
o The regressors being calculated aren’t perfectly correlated with each other.
 Exogeneity:
o The regressors aren’t correlated with the error term.
 Homoscedasticity:
o No matter what the values of our regressors might be, the variance of the error term is constant.

Purpose of the Assumptions


 The Gauss Markov assumptions guarantee the validity of ordinary least squares for
estimating regression coefficients.

 Checking how well our data matches these assumptions is an important part of estimating
regression coefficients.

 When you know where these conditions are violated, you may be able to plan ways to
change your experiment setup to help your situation fit the ideal Gauss Markov situation
more closely.

 In practice, the Gauss Markov assumptions are rarely all met perfectly, but they are still
useful as a benchmark, and because they show us what ‘ideal’ conditions would be.

 They also allow us to pinpoint problem areas that might cause our estimated regression
coefficients to be inaccurate or even unusable.

The Gauss-Markov Assumptions in Algebra


 We can summarize the Gauss-Markov Assumptions succinctly in algebra, by saying that
a linear regression model represented by

 and generated by the ordinary least squares estimate is the best linear unbiased estimate
(BLUE) possible if

 The first of these assumptions can be read as "The expected value of the error term is zero." The second assumption is non-collinearity, the third is exogeneity, and the fourth is homoscedasticity.
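
The model and the conditions referred to above appeared as figures in the original notes; a standard reconstruction (an assumption on my part, not a copy of those figures), written in LaTeX notation, is:

    y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1, \dots, n

    E(\varepsilon_i) = 0, \qquad
    \operatorname{Var}(\varepsilon_i) = \sigma^2 \ \text{(constant)}, \qquad
    \operatorname{Cov}(x_{ij}, \varepsilon_i) = 0,

together with the requirement that no regressor is a perfect linear combination of the others.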

Regression Concepts

Regression

 It is a predictive modeling technique where the target variable to be estimated is continuous.

Examples of applications of regression

 predicting a stock market index using other economic indicators


 forecasting the amount of precipitation in a region based on characteristics of the jet
stream
 projecting the total sales of a company based on the amount spent for advertising
 estimating the age of a fossil according to the amount of carbon-14 left in the organic
material.

 Let D denote a data set that contains N observations, D = {(x_i, y_i), i = 1, 2, ..., N}.

 Each xi corresponds to the set of attributes of the ith observation (known as explanatory
variables) and yi corresponds to the target (or response) variable.
 The explanatory attributes of a regression task can be either discrete or continuous.

Regression (Definition)

 Regression is the task of learning a target function f that maps each attribute set x into a
continuous-valued output y.

The goal of regression

 To find a target function that can fit the input data with minimum error.
 The error function for a regression task can be expressed in terms of the sum of absolute
or squared error:
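
The error functions referenced above were shown as figures in the original notes; the standard forms, reconstructed here in LaTeX notation, are:

    \text{Absolute Error} = \sum_{i=1}^{N} \lvert y_i - f(x_i) \rvert,
    \qquad
    \text{Squared Error (SSE)} = \sum_{i=1}^{N} \bigl( y_i - f(x_i) \bigr)^2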

Simple Linear Regression

 Consider the physiological data shown in Figure D.1.


 The data corresponds to measurements of heat flux and skin temperature of a person
during sleep.
 Suppose we are interested in predicting the skin temperature of a person based on the
heat flux measurements generated by a heat sensor.
 The two-dimensional scatter plot shows that there is a strong linear relationship between
the two variables.

Least Square Estimation (Least Squares Method)

 Suppose we wish to fit the linear model f(x) = w1x + w0 to the observed data, where w0 and w1 are parameters of the model and are called the regression coefficients.
 A standard approach for doing this is to apply the method of least squares, which
attempts to find the parameters (w0,w1) that minimize the sum of the squared error

 which is also known as the residual sum of squares.


 This optimization problem can be solved by taking the partial derivative of E with respect
to w0 and w1, setting them to zero, and solving the corresponding system of linear
equations.

 These equations can be summarized by the following matrix equation' which is also
known as the normal equation:

 Since

 the normal equations can be solved to obtain the following estimates for the parameters.

 Thus, the linear model that best fits the data in terms of minimizing the SSE is

 Figure D.2 shows the line corresponding to this model.

 We can show that the general solution to the normal equations given in D.6 can be expressed as follows:

 Thus, the linear model that results in the minimum squared error is obtained by substituting these estimates into f(x) = w1x + w0 (a reconstruction of the estimates is sketched below).

 In summary, the least squares method is a systematic approach to fit a linear model to the response variable y by minimizing the squared error between the true and estimated values of y.
 Although the model is relatively simple, it provides a reasonably accurate approximation because a linear model is the first-order Taylor series approximation for any function with continuous derivatives.
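
The normal-equation solution referred to above was a figure in the original notes; the standard closed-form estimates, reconstructed in LaTeX notation, are:

    \hat{w}_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2},
    \qquad
    \hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}

so that the fitted model is \hat{y} = \hat{w}_1 x + \hat{w}_0, where \bar{x} and \bar{y} are the sample means.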

Logistic Regression

Logistic regression, or Logit regression, or Logit model

o is a regression model where the dependent variable (DV) is categorical.


o was developed by statistician David Cox in 1958.

 The response variable Y has been regarded as a continuous quantitative variable.


 There are situations, however, where the response variable is qualitative.
 The predictor variables, however, have been both quantitative, as well as qualitative.
 Indicator variables fall into the second category.

 Consider a procedure in which individuals are selected on the basis of their scores in a
battery of tests.
 After five years the candidates are classified as "good" or "poor.”
 We are interested in examining the ability of the tests to predict the job performance of
the candidates.
 Here the response variable, performance, is dichotomous.
 We can code "good" as 1 and "poor" as 0, for example.
 The predictor variables are the scores in the tests.

 In a study to determine the risk factors for cancer, health records of several people were
studied.
 Data were collected on several variables, such as age, gender, smoking, diet, and the
family's medical history.
 The response variable was the person had cancer (Y = 1) or did not have cancer (Y = 0).

 The relationship between the probability π and X can often be represented by a logistic
response function.
 It resembles an S-shaped curve.
 The probability π initially increases slowly with increase in X, and then the increase
accelerates, finally stabilizes, but does not increase beyond 1.
 Intuitively this makes sense.
 Consider the probability of a questionnaire being returned as a function of cash reward,
or the probability of passing a test as a function of the time put in studying for it.
 The shape of the S-curve can be reproduced if we model the probabilities as follows:

 A sigmoid function is a bounded differentiable real function that is defined for all real
input values and has a positive derivative at each point.

 It has an “S” shape. It is defined by below function:



 The process of linearization of logistic regression function is called Logit


Transformation.

 Modeling the response probabilities by the logistic distribution and estimating the
parameters of the model given below constitutes fitting a logistic regression.
 In logistic regression the fitting is carried out by working with the logits.
 The Logit transformation produces a model that is linear in the parameters.
 The method of estimation used is the maximum likelihood method.
 The maximum likelihood estimates are obtained numerically, using an iterative
procedure.
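
The logit model itself appeared as a figure in the original notes; its standard form, reconstructed in LaTeX notation, is:

    \operatorname{logit}(\pi) = \ln\!\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p

which is linear in the parameters, as stated above.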

OLS:

 The ordinary least squares, or OLS, can also be called the linear least squares.
 This is a method for approximately determining the unknown parameters located in a
linear regression model.
 According to books of statistics and other online sources, the ordinary least squares is
obtained by minimizing the total of squared vertical distances between the observed
responses within the dataset and the responses predicted by the linear approximation.
 Through a simple formula, you can express the resulting estimator, especially the single
regressor, located on the right-hand side of the linear regression model.
 For example, you have a set of equations which consists of several equations that have
unknown parameters.
 You may use the ordinary least squares method because this is the most standard approach for finding an approximate solution to your overdetermined system.
 In other words, it is your overall solution in minimizing the sum of the squares of errors
in your equation.
 Data fitting can be your most suited application. Online sources have stated that the data
that best fits the ordinary least squares minimizes the sum of squared residuals.
 “Residual” is “the difference between an observed value and the fitted value provided by
a model.”

Maximum likelihood estimation, or MLE,

 is a method used in estimating the parameters of a statistical model, and for fitting a
statistical model to data.
 If you want to find the height measurement of every basketball player in a specific
location, you can use the maximum likelihood estimation.
 Normally, you would encounter problems such as cost and time constraints.
 If you could not afford to measure all of the basketball players’ heights, the maximum
likelihood estimation would be very handy.
 Using the maximum likelihood estimation, you can estimate the mean and variance of the
height of your subjects.
 The MLE would set the mean and variance as parameters in determining the specific
parametric values in a given model.

Multinomial Logistic Regression

 We have n independent observations with p explanatory variables.


 The qualitative response variable has k categories.
 To construct the logits in the multinomial case one of the categories is considered the
base level and all the logits are constructed relative to it. Any category can be taken as the
base level.
 We will take category k as the base level in our description of the method.
 Since there is no ordering, it is apparent that any category may be labeled k. Let πj denote the multinomial probability of an observation falling in the jth category.
 We want to find the relationship between this probability and the p explanatory variables, X1, X2, ..., Xp. The multiple logistic regression model then is

 Since all the π's add to unity, this reduces to

 For j = 1, 2, ..., (k − 1). The model parameters are estimated by the method of maximum likelihood. Statistical software is available to do this fitting.
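
The model equations above were figures in the original notes; a standard reconstruction of the baseline-category (multinomial) logit model, in LaTeX notation and offered as an assumption rather than a copy of the originals, is:

    \ln\!\left(\frac{\pi_j}{\pi_k}\right) = \beta_{0j} + \beta_{1j} X_1 + \cdots + \beta_{pj} X_p, \qquad j = 1, \dots, k-1

    \pi_j = \frac{e^{\beta_{0j} + \beta_{1j} X_1 + \cdots + \beta_{pj} X_p}}
                 {1 + \sum_{l=1}^{k-1} e^{\beta_{0l} + \beta_{1l} X_1 + \cdots + \beta_{pl} X_p}},
    \qquad
    \pi_k = 1 - \sum_{j=1}^{k-1} \pi_j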

UNIT-4
Regression vs. Segmentation

 Regression analysis focuses on finding a relationship between a dependent variable and one or more independent variables.
 Predicts the value of a dependent variable based on the value of at least one independent
variable.
 Explains the impact of changes in an independent variable on the dependent variable.

 We use linear or logistic regression technique for developing accurate models for
predicting an outcome of interest.
 Often, we create separate models for separate segments.
 Segmentation methods such as CHAID or CRT are used to judge their effectiveness.
 Creating separate models for separate segments may be time consuming and not worth the effort.
 But, creating separate model for separate segments may provide higher predictive power.

 Market Segmentation
 Dividing the target market or customers on the basis of some significant features which could help a company sell more products with less marketing expense.

 Companies have limited marketing budgets. Yet, the marketing team is expected to make a large number of sales to ensure rising revenue and profits.
 A product is created in two ways:
 Create a product after analyzing (research) the needs and wants of target market –
For example: Computer. Companies like Dell, IBM, Microsoft entered this
market after analyzing the enormous market which this product upholds.
 Create a product which evokes the needs & wants in target market – For example:
iPhone.
 Once the product is created, the ball shifts to the marketing team’s court.
 As mentioned above, they make use of market segmentation techniques.
 This ensures the product is positioned to the right segment of customers with high
propensity to buy.

 How to create segments for model development?


 Commonly adopted methodology
 Let us consider an example.
 Here we’ll build a logistic regression model for predicting likelihood of a customer to
respond to an offer.

 A very similar approach can also be used for developing a linear regression model.

 Logistic regression uses 1 or 0 indicator in the historical campaign data, which indicates
whether the customer has responded to the offer or not.
 Usually, one uses the target (or ‘Y’ known as dependent variable) that has been identified
for model development to undertake an objective segmentation.
 Remember, a separate model will be built for each segment.
 A segmentation scheme which provides the maximum difference between the segments
with regards to the objective is usually selected.
 Below is a simple example of this approach.

 Fig: Sample segmentation for building a logistic regression – commonly adopted methodology
 The above segmentation scheme is the best possible objective segmentation developed,
because the segments demonstrate the maximum separation with regards to the objectives
(i.e. response rate).

Supervised and Unsupervised Learning

There are two broad set of methodologies for segmentation:

 Objective (supervised) segmentation


 Non-Objective (unsupervised) segmentation

Objective Segmentation
 Segmentation to identify the type of customers who would respond to a particular offer.
 Segmentation to identify high spenders among customers who will use the e-commerce
channel for festive shopping.
 Segmentation to identify customers who will default on their credit obligation for a loan
or credit card.

Non-Objective Segmentation
 Segmentation of the customer base to understand the specific profiles which exist within
the customer base so that multiple marketing actions can be personalized for each
segment
 Segmentation of geographies on the basis of affluence and lifestyle of people living in
each geography so that sales and distribution strategies can be formulated accordingly.
 Segmentation of web site visitors on the basis of browsing behavior to understand the
level of engagement and affinity towards the brand.
 Hence, it is critical that the segments created on the basis of an objective segmentation
methodology must be different with respect to the stated objective (e.g. response to an
offer).
 However, in case of a non-objective methodology, the segments are different with respect
to the “generic profile” of observations belonging to each segment, but not with regards
to any specific outcome of interest.
 The most common techniques for building non-objective segmentation are cluster
analysis, K nearest neighbor techniques etc.
 Each of these techniques uses a distance measure (e.g. Euclidian distance, Manhattan
distance, Mahalanobis distance etc.)
 This is done to maximize the distance between the two segments.
 This implies maximum difference between the segments with regards to a combination of
all the variables (or factors).

Tree Building

 Decision tree learning


o is a method commonly used in data mining.
o is the construction of a decision tree from class-labeled training tuples.

 goal
o to create a model that predicts the value of a target variable based on several input
variables.

 Decision trees used in data mining are of two main types.


o Classification tree analysis
o Regression tree analysis

 Classification tree analysis is when the predicted outcome is the class to which the
data belongs.
 Regression tree analysis is when the predicted outcome can be considered a real
number. (e.g. the price of a house, or a patient’s length of stay in a hospital).

 A decision tree
o is a flow-chart-like structure
o each internal (non-leaf) node denotes a test on an attribute
o each branch represents the outcome of a test,
o each leaf (or terminal) node holds a class label.
o The topmost node in a tree is the root node.

 Decision-tree algorithms:
o ID3 (Iterative Dichotomiser 3)
o C4.5 (successor of ID3)
o CART (Classification and Regression Tree)
o CHAID (CHI-squared Automatic Interaction Detector). Performs multi-level
splits when computing classification trees.
o MARS: extends decision trees to handle numerical data better.
o Conditional Inference Trees: a statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting. This approach results in unbiased predictor selection and does not require pruning.

 ID3 and CART follow a similar approach for learning a decision tree from training tuples.

CHAID (CHI-squared Automatic Interaction Detector)


 A simple method for fitting trees to predict a quantitative variable proposed by Morgan
and Sonquist (1963).
 They called the method AID, for Automatic Interaction Detection.
 The algorithm performs stepwise splitting.
 It begins with a single cluster of cases and searches a candidate set of predictor variables
for a way to split this cluster into two clusters.
 Each predictor is tested for splitting as follows:
o Sort all the n cases on the predictor and examine all n-1 ways to split the cluster in
two.
o For each possible split, compute the within-cluster sum of squares about the mean
of the cluster on the dependent variable.
o Choose the best of the n-1 splits to represent the predictor’s contribution. Now do
this for every other predictor.
o For the actual split, choose the predictor and its cut point which yields the
smallest overall within-cluster sum of squares.
o Categorical predictors require a different approach. Since categories are
unordered, all possible splits between categories must be considered.
o For deciding on one split of k categories into two groups, this means that 2k-1
possible splits must be considered.
o Once a split is found, its suitability is measured on the same within-cluster sum of
squares as for a quantitative predictor.
 Morgan and Sonquist called their algorithm AID because it naturally incorporates
interaction among predictors. Interaction is not correlation.
 It has to do instead with conditional discrepancies.
 In the analysis of variance, interaction means that a trend within one level of a variable is
not parallel to a trend within another level of the same variable.

 In the ANOVA model, interaction is represented by cross-products between predictors.


 In the tree model, it is represented by branches from the same nodes which have different
splitting predictors further down the tree.

 Regression trees parallel regression/ANOVA modeling, in which the dependent variable is quantitative.
 Classification trees parallel discriminant analysis and algebraic classification methods.
 Kass (1980) proposed a modification to AID called CHAID for categorized dependent
and independent variables.
 His algorithm incorporated a sequential merge and split procedure based on a chi-square
test statistic.
 Kass was concerned about computation time, so he decided to settle for a sub-optimal
split on each predictor instead of searching for all possible combinations of the
categories.
 Kass’s algorithm is like sequential cross-tabulation.
o For each predictor:
1. cross tabulate the m categories of the predictor with the k categories of the
dependent variable,
2. find the pair of categories of the predictor whose 2xk sub-table is least
significantly different on a chi-square test and merge these two categories;
3. if the chi-square test statistic is not "significant" according to a preset critical value, repeat this merging process for the selected predictor until no non-significant chi-square is found for a sub-table; then pick the predictor variable whose chi-square is largest and split the sample into l subsets, where l is the number of categories resulting from the merging process on that predictor;
4. Continue splitting, as with AID, until no “significant” chi-squares result. The
CHAID algorithm saves some computer time, but it is not guaranteed to find
the splits which predict best at a given step. Only by searching all possible
category subsets can we do that. CHAID is also limited to categorical
predictors, so it cannot be used for quantitative or mixed categorical
quantitative models.

CART (Classification And Regression Tree)


 The CART algorithm was introduced in Breiman et al. (1984).
 A CART tree is a binary decision tree that is constructed by splitting a node into two
child nodes repeatedly, beginning with the root node that contains the whole learning
sample.
 The CART growing method attempts to maximize within-node homogeneity.
 The extent to which a node does not represent a homogenous subset of cases is an
indication of impurity.
 For example, a terminal node in which all cases have the same value for the dependent
variable is a homogenous node that requires no further splitting because it is "pure."
 For categorical (nominal, ordinal) dependent variables the common measure of impurity
is Gini, which is based on squared probabilities of membership for each category.

 Splits are found that maximize the homogeneity of child nodes with respect to the value
of the dependent variable.

 Impurity Measure:

 GINI Index Used by the CART (classification and regression tree) algorithm, Gini
impurity is a measure of how often a randomly chosen element from the set would be
incorrectly labeled if it were randomly labeled according to the distribution of labels in
the subset.
 Gini impurity can be computed by summing the probability fi of each item being chosen
times the probability 1-fi of a mistake in categorizing that item.
 It reaches its minimum (zero) when all cases in the node fall into a single target category.
 To compute Gini impurity for a set of items, suppose i ∈ {1, 2, ..., m}, and let fi be the fraction of items labeled with value i in the set.
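
The Gini formula itself appeared as a figure in the original notes; written out in LaTeX notation (a standard reconstruction, not a copy of that figure):

    \mathrm{Gini} = \sum_{i=1}^{m} f_i (1 - f_i) = 1 - \sum_{i=1}^{m} f_i^{2}

It equals zero when all cases in the node fall into a single target category, as stated above.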

Advantages of Decision Tree:


 Simple to understand and interpret. People are able to understand decision tree models
after a brief explanation.
 Requires little data preparation. Other techniques often require data normalization,
dummy variables need to be created and blank values to be removed.
 Able to handle both numerical and categorical data. Other techniques are usually
specialized in analysing datasets that have only one type of variable.
 Uses a white box model. If a given situation is observable in a model the explanation for
the condition is easily explained by Boolean logic.
 Possible to validate a model using statistical tests. That makes it possible to account for
the reliability of the model.
 Robust. Performs well even if its assumptions are somewhat violated by the true model
from which the data were generated.
 Performs well with large datasets. Large amounts of data can be analyzed using standard
computing resources in reasonable time.

Tools used to make Decision Tree:


 Many data mining software packages provide implementations of one or more decision
tree algorithms.
 Several examples include:
o Salford Systems CART
o IBM SPSS Modeler
o Rapid Miner
o SAS Enterprise Miner
o Matlab
o R (an open source software environment for statistical computing which includes several CART implementations such as the rpart, party and randomForest packages)
o Weka (a free and open-source data mining suite, contains many decision tree
algorithms)
o Orange (a free data mining software suite, which includes the tree module
orngTree)
o KNIME
o Microsoft SQL Server
o Scikit-learn (a free and open-source machine learning library for the Python
programming language).

 Pruning
 After building the decision tree, a tree-pruning step can be performed to reduce the size
of the decision tree.
 Pruning helps by trimming the branches of the initial tree in a way that improves the
generalization capability of the decision tree.

 The errors committed by a classification model are generally divided into two types:
o training errors
o generalization errors.

 Training error
o also known as resubstitution error or apparent error.
o it is the number of misclassification errors committed on training records.
 generalization error
o is the expected error of the model on previously unseen records.
o A good classification model must not only fit the training data well, it must also
accurately classify records it has never seen before.
 A good model must have low training error as well as low generalization error.

 Model overfitting
o Decision trees that are too large are susceptible to a phenomenon known as
overfitting.
o A model that fits the training data too well can have a poorer generalization error
than a model with a higher training error.
o Such a situation is known as model overfitting.

 Model underfitting
o The training and test error rates of the model are large when the size of the tree is
very small.
o This situation is known as model underfitting.
o Underfitting occurs because the model has yet to learn the true structure of the
data.

 Model complexity
o To understand the overfitting phenomenon, the training error of a model can be
reduced by increasing the model complexity.
o Overfitting and underfitting are two pathologies that are related to the model
complexity.

ARIMA (Autoregressive Integrated Moving Average)

 In time series analysis, the ARIMA model is a generalization of an autoregressive moving average (ARMA) model.
 These models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting).
 They are applied in cases where the data show evidence of non-stationarity, where an initial differencing step (corresponding to the "integrated" part of the model) can be applied to reduce the non-stationarity.

 Non-seasonal ARIMA models


o These are generally denoted ARIMA(p, d, q) where parameters p, d, and q are
non-negative integers, p is the order of the Autoregressive model, d is the degree
of differencing, and q is the order of the Moving-average model.

 Seasonal ARIMA models


o These are usually denoted ARIMA(p, d, q)(P, D, Q)_m, where m refers to the
number of periods in each season, and the uppercase P, D, Q refer to the
autoregressive, differencing, and moving average terms for the seasonal part of
the ARIMA model.

 ARIMA models form an important part of the Box-Jenkins approach to time-series modeling.

 Applications
o ARIMA models are important for generating forecasts and providing
understanding in all kinds of time series problems from economics to health care
applications.
o In quality and reliability, they are important in process monitoring if observations
are correlated.
o designing schemes for process adjustment
o monitoring a reliability system over time
o forecasting time series
o estimating missing values
o finding outliers and atypical events
o understanding the effects of changes in a system
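
A minimal Python sketch of fitting an ARIMA model, assuming the statsmodels library (version 0.12 or later) is installed; the toy random-walk series and the order (1, 1, 1) are only illustrative choices, not a recommendation.

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(0)
    y = np.cumsum(rng.normal(size=200))   # a toy non-stationary (random walk) series

    model = ARIMA(y, order=(1, 1, 1))     # p=1 (AR), d=1 (differencing), q=1 (MA)
    fitted = model.fit()

    print(fitted.summary())
    print(fitted.forecast(steps=5))       # 5-step-ahead forecast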

Measure of Forecast Accuracy

 Forecast Accuracy can be defined as the deviation of Forecast or Prediction from the
actual results.
Error = Actual demand – Forecast
OR
ei = At – Ft
 We measure Forecast Accuracy by 2 methods :
 Mean Forecast Error (MFE)
o For n time periods where we have actual demand and forecast values:

o Ideal value = 0;
o MFE > 0, model tends to under-forecast
o MFE < 0, model tends to over-forecast

 Mean Absolute Deviation (MAD)


o For n time periods where we have actual demand and forecast values:

 While MFE is a measure of forecast model bias, MAD indicates the absolute size of the
errors
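
The MFE and MAD formulas referenced above appeared as figures in the original notes; the standard definitions, reconstructed in LaTeX notation, are:

    \mathrm{MFE} = \frac{1}{n} \sum_{t=1}^{n} (A_t - F_t),
    \qquad
    \mathrm{MAD} = \frac{1}{n} \sum_{t=1}^{n} \lvert A_t - F_t \rvert

where A_t is the actual demand and F_t the forecast in period t.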

Uses of Forecast error:


 Forecast model bias
 Absolute size of the forecast errors
 Compare alternative forecasting models
 Identify forecast models that need adjustment

ETL Approach

 Extract, Transform and Load (ETL) refers to a process in database usage and especially
in data warehousing that:
o Extracts data from homogeneous or heterogeneous data sources
o Transforms the data for storing it in proper format or structure for querying and
analysis purpose
o Loads it into the final target (database, more specifically, operational data store,
data mart, or data warehouse)
 Usually all three phases execute in parallel. Since data extraction takes time, while one batch of data is being pulled a transformation process works on the data already received and prepares it for loading; as soon as some data is ready to be loaded into the target, the loading kicks off without waiting for the completion of the previous phases.
 ETL systems commonly integrate data from multiple applications (systems), typically
developed and supported by different vendors or hosted on separate computer hardware.
 The disparate systems containing the original data are frequently managed and operated
by different employees.
 For example, a cost accounting system may combine data from payroll, sales, and
purchasing.

 Commercially available ETL tools include:


o Anatella
o Alteryx
o CampaignRunner
o ESF Database Migration Toolkit
o InformaticaPowerCenter
o Talend
o IBM InfoSphereDataStage
o Ab Initio
o Oracle Data Integrator (ODI)
o Oracle Warehouse Builder (OWB)
o Microsoft SQL Server Integration Services (SSIS)
o Tomahawk Business Integrator by Novasoft Technologies.
o Stambia
o Diyotta DI-SUITE for Modern Data Integration
o FlyData
o Rhino ETL
o SAP Business Objects Data Services

o SAS Data Integration Studio


o SnapLogic
o Clover ETL opensource engine supporting only basic partial functionality and not
server
o SQ-ALL - ETL with SQL queries from internet sources such as APIs
o North Concepts Data Pipeline

 Various steps involved in ETL.


o Extract
o Transform
o Load

o Extract
 The Extract step covers the data extraction from the source system and
makes it accessible for further processing.
 The main objective of the extract step is to retrieve all the required data
from the source system with as little resources as possible.
 The extract step should be designed in a way that it does not negatively
affect the source system in terms or performance, response time or any
kind of locking.

 There are several ways to perform the extract:


 Update notification - if the source system is able to provide a
notification that a record has been changed and describe the
change, this is the easiest way to get the data.
 Incremental extract - some systems may not be able to provide
notification that an update has occurred, but they are able to
identify which records have been modified and provide an extract
of such records. During further ETL steps, the system needs to
identify changes and propagate it down. Note, that by using daily
extract, we may not be able to handle deleted records properly.
 Full extract - some systems are not able to identify which data has
been changed at all, so a full extract is the only way one can get the
data out of the system. The full extract requires keeping a copy of
the last extract in the same format in order to be able to identify
changes. Full extract handles deletions as well.
 When using Incremental or Full extracts, the extract frequency is
extremely important. Particularly for full extracts; the data
volumes can be in tens of gigabytes.

 Clean - The cleaning step is one of the most important as it ensures the quality of the data in the data warehouse. Cleaning should perform basic data unification rules, such as:
 Making identifiers unique (sex categories Male/Female/Unknown,
M/F/null, Man/Woman/Not Available are translated to standard
Male/Female/Unknown)
 Convert null values into standardized Not Available/Not Provided
value
 Convert phone numbers, ZIP codes to a standardized form
 Validate address fields, convert them into proper naming, e.g.
Street/St/St./Str./Str
 Validate address fields against each other (State/Country,
City/State, City/ZIP code, City/Street).

o Transform
 The transform step applies a set of rules to transform the data from the
source to the target.
 This includes converting any measured data to the same dimension (i.e.
conformed dimension) using the same units so that they can later be
joined.
 The transformation step also requires joining data from several sources,
generating aggregates, generating surrogate keys, sorting, deriving new
calculated values, and applying advanced validation rules.

o Load
 During the load step, it is necessary to ensure that the load is performed
correctly and with as little resources as possible.
 The target of the Load process is often a database.
 In order to make the load process efficient, it is helpful to disable any
constraints and indexes before the load and enable them back only after
the load completes.
 The referential integrity needs to be maintained by ETL tool to ensure
consistency.
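
As a toy illustration of the three phases, here is a minimal Python sketch using pandas and SQLite; the column names and the target table are hypothetical, and a real ETL pipeline of the kind described above would involve far more sources, rules, and error handling.

    import sqlite3
    import pandas as pd

    # Extract: in a real pipeline this would pull from source systems
    # (files, APIs, operational databases); a small in-memory frame stands in here.
    raw = pd.DataFrame({
        "country": ["IN", None, "US"],
        "quantity": [3, 5, 2],
        "unit_price": [10.0, 12.5, 8.0],
    })

    # Transform: apply a basic unification rule and derive a calculated value
    raw["country"] = raw["country"].fillna("Not Available")
    raw["revenue"] = raw["quantity"] * raw["unit_price"]

    # Load: write the prepared rows into the target table
    with sqlite3.connect("warehouse.db") as conn:
        raw.to_sql("fact_sales", conn, if_exists="append", index=False)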

 Managing ETL Process

o The ETL process seems quite straight forward.


o As with every application, there is a possibility that the ETL process fails.
o This can be caused by missing extracts from one of the systems, missing values in
one of the reference tables, or simply a connection or power outage.

o Therefore, it is necessary to design the ETL process keeping fail-recovery in mind.

 Staging
o It should be possible to restart at least some of the phases independently from the others.
o For example, if the transformation step fails, it should not be necessary to restart the Extract step.
o We can ensure this by implementing proper staging. Staging means that the data is simply dumped to a location (called the Staging Area) so that it can then be read by the next processing phase.
o The staging area is also used during the ETL process to store intermediate results of processing.
o This is fine for the ETL process, which uses it for that purpose.
o However, the staging area should be accessed by the ETL load process only.
o It should never be available to anyone else, particularly not to end users, as it is not intended for data presentation to the end user.
o It may contain incomplete or in-the-middle-of-the-processing data.

UNIT-5
Data Visualization


 Data visualization is the art and practice of gathering, analyzing, and graphically
representing empirical information.
 They are sometimes called information graphics, or even just charts and graphs.
 The goal of visualizing data is to tell the story in the data.
 Telling the story is predicated on understanding the data at a very deep level, and
gathering insight from comparisons of data points in the numbers

Why data visualization?

 Gain insight into an information space by mapping data onto graphical primitives
 Provide a qualitative overview of large data sets
 Search for patterns, trends, structure, irregularities, and relationships among data.
 Help find interesting regions and suitable parameters for further quantitative analysis.
 Provide a visual proof of computer representations derived.

Categorization of visualization methods


 Pixel-oriented visualization techniques
 Geometric projection visualization techniques
 Icon-based visualization techniques
 Hierarchical visualization techniques
 Visualizing complex data and relations
Pixel-Oriented Visualization Techniques

 For a data set of m dimensions, create m windows on the screen, one for each dimension.
 The m dimension values of a record are mapped to m pixels at the corresponding
positions in the windows.
 The colors of the pixels reflect the corresponding values.

Laying Out Pixels in Circle Segments

 To save space and show the connections among multiple dimensions, space filling is
often done in a circle segment.
Geometric Projection Visualization Techniques
Visualization of geometric transformations and projections of the data.

Methods

 Direct visualization
 Scatterplot and scatterplot matrices
 Landscapes
 Projection pursuit technique: helps users find meaningful projections of multidimensional data
 Prosection views
 Hyperslice
 Parallel coordinates

Scatter Plots

 A scatter plot displays 2-D data points using Cartesian coordinates.


 A third dimension can be added using different colors or shapes to represent different
data points
 Through this visualization, in the adjacent figure, we can see that points of types “+” and
“×” tend to be colocated

Scatterplot Matrices

 The scatter-plot matrix is an extension to the scatter plot.


 For k-dimensional data a minimum of (k² − k)/2 2-D scatterplots will be required.
 There can be a maximum of k² 2-D plots.
 In the adjoining figure, there are k² plots.
 Out of these, k are X-X plots, and all X-Y plots (where X, Y are distinct dimensions) are given in 2 orientations (X vs Y and Y vs X)

Parallel Coordinates

 The scatter-plot matrix becomes less effective as the dimensionality increases.


 Another technique, called parallel coordinates, can handle higher dimensionality
 n equidistant axes which are parallel to one of the screen axes and correspond to the
attributes (i.e. n dimensions)
 The axes are scaled to the [minimum, maximum]: range of the corresponding attribute
 Every data item corresponds to a polygonal line which intersects each of the axes at the
point which corresponds to the value for the attribute
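
A short Python sketch of both techniques on the well-known Iris data set, assuming pandas, matplotlib, and scikit-learn are installed (the data set is just an example):

    import matplotlib.pyplot as plt
    from pandas.plotting import parallel_coordinates, scatter_matrix
    from sklearn.datasets import load_iris

    iris = load_iris(as_frame=True)
    df = iris.frame                                  # 4 numeric attributes + target
    df["species"] = iris.target_names[iris.target]

    # scatter-plot matrix: every pairwise 2-D scatter plot of the k = 4 dimensions
    scatter_matrix(df[iris.feature_names], figsize=(8, 8))
    plt.show()

    # parallel coordinates: one polygonal line per record across 4 parallel axes
    parallel_coordinates(df[iris.feature_names + ["species"]], "species")
    plt.show()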
Icon-Based Visualization Techniques

 Visualization of the data values as features of icons


 Typical visualization methods
o Chernoff Faces
o Stick Figures
 General techniques
o Shape coding: Use shape to represent certain information encoding
o Color icons: Use color icons to encode more information
o Tile bars: Use small icons to represent the relevant feature vectors in document
retrieval

Chernoff Faces

 A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.
 The figure shows faces produced using 10 characteristics–head eccentricity, eye size, eye
spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size,
and mouth opening): Each assigned one of 10 possible values.

Fig: Chernoff faces and stick figures

Stick Figure

 A census data figure showing age, income, gender, education


 A 5-piece stick figure (1 body and 4 limbs w. different angle/length)
 Age, income are indicated by position of the figure.
 Gender, education are indicated by angle/length.
 Visualization can show a texture pattern
Hierarchical Visualization

 For a large data set of high dimensionality, it would be difficult to visualize all
dimensions at the same time.
 Hierarchical visualization techniques partition all dimensions into subsets (i.e.,
subspaces).
 The subspaces are visualized in a hierarchical manner
 “Worlds-within-Worlds,” also known as n-Vision, is a representative hierarchical
visualization method.
 To visualize a 6-D data set, where the dimensions are F,X1,X2,X3,X4,X5.
 We want to observe how F changes w.r.t. other dimensions. We can fix X3,X4,X5
dimensions to selected values and visualize changes to F w.r.t. X1, X2
Visualizing Complex Data and Relations

 Most visualization techniques were mainly for numeric data.


 Recently, more and more non-numeric data, such as text and social networks, have become available.
 Many people on the Web tag various objects such as pictures, blog entries, and product reviews.
 A tag cloud is a visualization of statistics of user-generated tags.
 Often, in a tag cloud, tags are listed alphabetically or in a user-preferred order.
 The importance of a tag is indicated by font size or color.
Reference:

https://www.slideserve.com/eben/introduction-to-information-visualization
NOTES-2

CS513PE: DATA ANALYTICS (Professional Elective - I)


UNIT - I
DATA MANAGEMENT

DESIGN DATA ARCHITECTURE AND MANAGE THE DATA FOR ANALYSIS
DESIGN DATA ARCHITECTURE:

Data architecture is composed of models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations.

 Data is usually one of several architecture domains that form the pillars of an
enterprise architecture or solution architecture.

 An Enterprise architecture is a conceptual blueprint that defines the structure and operation of an organization.
 The Architecture is the structure and description of a system or Enterprise.
 It enables the implementation, understanding, maintenance, repair and
further development of a system or enterprise.
 Data architecture will be used in achieving the implementation of physical
database.
 It can be compared to a house design, where the description of the house to be built, such as the choice of materials, sizes, and styles of rooms, is laid out in the blueprint.
 In the same way, the data architecture describes the way data will be
processed, stored and used by organization.
 Designing data architecture is a complex process because it involves relating abstract data models to real-life business activities and entities before implementing the database design and finally setting up the IT hardware infrastructure.

The design of a data architecture is broken down into three traditional architectural aspects to be considered.

 The conceptual aspect represents all business entities and their related attributes.
 The logical aspect represents the logic of the relationships between entities.
 The physical aspect represents the actual data mechanisms for particular types of functionality.

Manage the data for Analysis:

A data analysis study includes developing and executing a plan for collecting data, although a data analysis itself presumes that the data have already been collected.

In detail, this study includes the development of a hypothesis or question, the design of the data collection process, collecting the data, analyzing the data, and interpreting the data.

Because it presumes that data have already been collected, it includes development
and refinement of a hypothesis or question and the process of analyzing and
interpreting the data.

A data analysis may appear to follow a linear, one-step-after-another process which at the end arrives at nicely packaged and coherent results.

In reality, data analysis is a highly iterative, non-linear process, better reflected by a series of epicycles, in which information is learned at each step.

Epicycle:

An epicycle is a small circle that moves around the circumference of a larger circle.

• In data analysis, the epicycle is an iterative process that is applied to all the steps of the analysis.

• The epicycle is repeated for each step along the circumference of the entire data analysis process.

Epicycle of data Analysis:

There are 5 core activities of Data Analysis:

• Stating and refining the question

• Exploring the data


• Building formal statistical models

• Interpreting the results

• Communicating the results

• These 5 activities can occur at different time scales: for example, you might go through all 5 in the course of a day, but for a large project you may also deal with each one over the course of many months.

Although there are many different types of activities that you might engage in
while doing data analysis, every aspect of the entire process can be approached
through an iterative process that we call the “epicycle of data analysis”.

More specifically, for each of the five core activities, it is critical that you engage
in the following steps:
1. Setting Expectations,

2. Collecting information (data), comparing the data to your expectations, and


if the expectations don’t match,

3. Revising your expectations or fixing the data so your data and your
expectations match.

• Iterating through this 3-step process is what we call the “epicycle of data
analysis.”

• We will apply the “epicycle” to each of these five core activities.

In summary, the three steps of the epicycle applied to each core activity are:

• Set expectations or Developing Expectations

• Collecting data

• Revise expectations

• Setting Expectations or Developing Expectations:

Developing expectations is the process of deliberately thinking about what you expect before you do anything, such as inspecting your data, performing a procedure, or entering a command.

For experienced data analysts, developing expectations may be an automatic, almost subconscious process, but it is an important activity to cultivate and be deliberate about.

This habit of developing expectations before we inspect the data or execute an analysis procedure applies to each core activity of the data analysis process.

• Collecting Information

This step entails collecting information about the question or the data. For the
question, you collect information by performing a literature search or asking
experts in order to ensure that your question is a good one.
For the data, you collect information by executing the analysis procedure; the results of that operation are the collected data, and you then determine whether the data collected match your expectations.

• Comparing Expectations to Data

Now that you have data in hand (the check at the restaurant), the next step is to
compare your expectations to the data. There are two possible outcomes: either
your expectations of the cost match the amount on the check, or they do not.

If your expectations and the data match, terrific, you can move onto the next
activity.

If the expectations and the data do not match, there are two possible explanations for the discordance: first, the expectations were wrong and need to be revised, or second, the data were wrong and contain an error.

Core activities of data analysis:

1. Stating and refining the question:


Doing data analysis requires quite a bit of thinking; in a good analysis you will typically have spent more time thinking and designing than doing.

The six types of questions are:

1. Descriptive questions:

A descriptive question aims at describing something, mainly functions and characteristics; it seeks to summarize a characteristic of a set of data.

2. Explorative questions:

An exploratory question is one in which you analyze the data to see if there are:

• patterns

• trends

• or relationships between variables.


Exploratory questions are also called hypothesis generators.

3. Inferential questions

An inferential question is a restatement of a proposed hypothesis as a question; it would be answered by analyzing a different set of data from the one that generated the hypothesis.

4. Predictive questions

A predictive question asks what values an outcome will take for new or future observations in a data set.

5. Causal questions:

A causal question asks about whether changing one factor will change another
factor, on average, in a population.

6. Mechanistic questions:

Finally, none of the questions described so far tells us how one factor brings about a change in another; a mechanistic question asks exactly that, i.e., how changing one factor changes another.

2. Exploratory data Analysis:


In this section we will run through an informal checklist of things to do when starting an exploratory data analysis.

The elements of checklist are:

1. Formulate your question

We have already discussed the importance of properly creating a question. Formulating a question can be a useful way to guide the exploratory data analysis.

2. Read in your data

Sometimes the data come in very messy formats and some cleaning is needed first.

3. Check the packaging

Confirm that the data set was read in without errors or warnings and that its size (number of rows and columns) matches what you expect.
4. Look at the top and bottom of your data

Checking the beginning and the end of the data set helps confirm that the data were read in properly and are correctly formatted.

5. ABC your “n”s

“ABC” stands for “always be checking” your n’s (counts); it is a good way to figure out whether anything is wrong.

6. Validate with external data sources

Compare your data against at least one outside (external) data source.

7. Make a plot

Making a plot to visualize your data is a good way to build understanding.

8. Try the easy solution first

Ask what the simplest answer to the question would be, and try that first.

9. Challenge your solution

You should always think about ways to challenge your results.

10. Follow-up questions

Do you have the right data? Do you need other data? Do you have the right question?

(A minimal R sketch of a few of these checklist items follows.)
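The sketch below uses base R; the sales data frame, its month and revenue columns and the commented file name are hypothetical placeholders, not part of the original notes.

# In practice the data would be read in from a file, e.g.:
# sales <- read.csv("sales.csv")        # file name is a placeholder
# For a self-contained illustration we build a small data frame instead:
sales <- data.frame(month   = rep(c("Jan", "Feb", "Mar"), each = 4),
                    revenue = c(10, 12, 9, 11, 14, 13, 15, 12, 16, 18, 17, 19))

# Check the packaging: dimensions and structure of the data set
nrow(sales); ncol(sales)
str(sales)

# Look at the top and bottom of the data
head(sales)
tail(sales)

# ABC your n's: count records per month
table(sales$month)

# Make a plot to visualize the data
hist(sales$revenue)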

Build Formal Statistical Model

• In this model, objects are described.

• Object model: an object model consists of a collection of objects; objects are associated with functions and methods.

• The main purpose of the model is to set expectations about the data first, and then to describe the process observed by the data analyst.

• The model helps us understand the real world.

Applying Normal Model

• This model assumes that the randomness in a set of data can be explained by the normal distribution.

• The normal distribution is specified by two parameters:

• mean

• standard deviation

Drawing a fake picture

• To begin with, we can sketch some fake pictures and histograms of the data.

• Before we get to the data, let us figure out what we expect to see from the data.

• This is used to initiate the discussion about the model and what we expect from reality; a small R sketch of this idea follows.
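The following sketch simulates what we would expect to see under a normal model before looking at the real data; the mean and standard deviation used are illustrative assumptions only.

# Expected model: normal distribution with an assumed mean of 30 and sd of 5
set.seed(1)
fake <- rnorm(100, mean = 30, sd = 5)

# Histogram of the simulated (fake) data -- what we expect reality to look like
hist(fake, main = "Expected data under the normal model")

# The real data can later be plotted the same way and compared with this picture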

Interpreting the Results

• Several principles interpreting the results:

• Revisit your question

• Start with primary statistical model

• Develop Overall Interpretation

• Consider Implications

Communicating the Results

• Convey the key points of the data analysis.

• It makes use of all the tools of data analysis.

• Give careful thought to communicating the final results so that the analysis is useful and informative.
Understand various sources of the Data Like Sensors/Signals/GPS etc.

Understand various sources of the Data

• Data can be generated from two types of sources namely


Primary Sources and
Secondary Sources

Primary Sources

Primary data can be collected through questionnaires, depth interviews, focus group interviews, case studies, experimentation and observation.

Secondary Sources

The secondary data can be obtained through

• Internal Sources - These are within the organization

• External Sources - These are outside the organization

The internal sources include

 Accounting resources - These give a great deal of information which can be used by the marketing researcher; they give information about internal factors.

 Sales Force Report- It gives information about the sale of a product.

 Internal Experts- These are people who are heading the various
departments. They can give an idea of how a particular thing is working.

 Miscellaneous Reports- Information you are getting from operational


reports. If the data available within the organization are unsuitable or
inadequate, the marketer should extend the search to external secondary data
sources.

External Sources of Data Includes

 Sources which are outside the company in a larger environment.


 Collection of external data is more difficult because the data have much
greater variety and the sources are much more numerous.

External data can be divided into following classes.

 Government Publications

 Government sources provide an extremely rich pool of data for the


researchers.

 Many of these data are available free of cost on internet websites.

• Registrar General of India

 It is an office which generates demographic data. It includes details of


gender, age, occupation etc.

• Central Statistical Organization

 This organization publishes the national accounts statistics. It contains


estimates of national income for several years, growth rate, and rate of major
economic activities.

 It gives information about the total number of workers employed,


production units, material used and value added by the manufacturer.

• Ministry of Commerce and Industries


 This ministry through the office of economic advisor provides information
on wholesale price index. These indices may be related to a number of
sectors like food, fuel, power, food grains etc.

 It also generates All India Consumer Price Index numbers for industrial workers, urban non-manual employees and agricultural labourers.

• Planning Commission

It provides the basic statistics of Indian Economy.

• Reserve Bank of India


This provides information on Banking Savings and investment. RBI also prepares
currency and finance reports.

• Labour Bureau

 It provides information on skilled, unskilled, white collared jobs etc.

• Department of Economic Affairs

 It conducts economic survey and it also generates information on income,


consumption, expenditure, investment, savings and foreign trade.

• State Statistical Abstract

 This gives information on various types of activities related to the state like -
commercial activities, education, occupation etc.

• Non-Government Publications

These includes publications of various industrial and trade associations, such as

 The Indian Cotton Mill Association

 Various chambers of commerce

 The Bombay Stock Exchange (it publishes a directory containing financial accounts, key profitability figures and other relevant matter)

 Various Associations of Press Media.

 Export Promotion Council.

 Confederation of Indian Industries (CII)

 Small Industries Development Board of India

 Different Mills like - Woolen mills, Textile mills etc

Understand various sources of the Data like Sensors/Signals/GPS etc.

• The applications of data analytics are seemingly endless.

• More and more data is being collected every day.

• This brings new opportunities to apply data analytics to more parts of business, science and everyday life.

• Data mining is an essential step in any data analytics task.

• Data mining involves extracting data from unstructured data sources; these may include sensors, signals and GPS.

• The key steps in this process are to extract, transform and load (ETL) the data.

• These steps convert the data into useful, manageable formats.

Sensor data

• Sensor data is the output of a device that detects and responds to some type of input from the physical environment.

• The output may be used to provide information or input to another system, or to guide a process.

Here few examples of Sensors

 A photo sensor detects the presence of visible light, infrared (IR) transmission and ultraviolet (UV) energy.

 Lidar (which stands for Light Detection and Ranging) is a laser-based method of detection and ranging.

 Smart grid sensors - these can provide real-time data about grid conditions, detecting outages, faults and loads and triggering alarms.

 Wireless sensor networks - these combine specialized transducers with a communication infrastructure for monitoring and recording conditions at diverse locations.

Signals

What is Signal?

• The simplest form of signal is a direct current (DC) that is switched on and off.

• More complex signals consist of an alternating-current (AC) or electromagnetic carrier that contains one or more data streams.

Data and Signal:

Data must be transformed into electromagnetic signals prior to transmission across a network.

Data and signals can be either analog or digital.

A signal is periodic if it consists of continuously repeating patterns.

GPS

• The Global Positioning System (GPS) is a space-based navigation system that provides location and time information in all weather conditions, anywhere on or near the Earth.

• The system provides critical capabilities to military, civil and commercial users around the world.

• It is completely based on time and the known positions of specialized satellites.

• GPS satellites continuously transmit their time and position.

• The United States government created the system, maintains it, and makes it freely accessible to anyone with a GPS receiver.

Data Management:

Data Management is the implementation of policies and procedures that put


organizations in control of their business data regardless of where it resides.

Data Management is concerned with the end-to-end life cycle of data, from
creation to retirement, and the controlled progression of data to and from each
stage within its lifecycle.

Data Management minimizes the risks and costs of regulatory non-compliance,


legal complications, and security breaches.
Data management practices can prevent ambiguity and make sure that data conforms to organizational best practices for access, storage, backup and retirement, among other things.

The benefits of data management include enhanced compliance, greater security, improved sales and marketing strategies, better product classification, and improved data governance to reduce organizational risk.

Data Quality(Noise, Outliers, Missing Values, Duplicate Data):

• Data mining applications are often applied to data that was collected for another purpose, or for future but unspecified purposes.

• The first step is the detection and correction of data quality problems, often called data cleaning.

Specific Aspects of data quality is

1. Measurement and data collection issues

2. Issues related to applications

1. Measurement and data collection issues

• It is unrealistic to expect the data will be perfect.

• There may be problems due to human error, limitations of measuring


devices or flaws in the data collection process.

Issues:

Measurement and data Collection Errors:


• The term measurement error refers to any problem resulting from the measurement process.

• The values recorded differ from true values.

• For continuous attributes, the numerical difference of the measured and true
value is called error.
• The term data Collection error refers to errors such as omitting data objects
or attribute values.

• For Example: Keyboard error

There often exist well-developed techniques for detecting and correcting these errors, sometimes with human intervention.

Noise data:

• Noisy data is meaningless data.

• The term has often been used as a synonym for corrupt data.

• The meaning has expanded to include any data that cannot be understood
and interpreted correctly by machines, such as unstructured text.

Outliers:

Outliers are

(1) data objects that have characteristics that differ from most of the other data objects in the data set, or

(2) values of an attribute that are unusual with respect to the typical values for that attribute.

• Outliers are also called anomalous objects or values.

Missing Values:

• It is not unusual for an object to be missing one or more attribute values.

• In some cases information was not collected.

• Example: Some people decline to give their phone numbers or age details.

• In other cases, some attributes are not applicable to all objects.

• Regardless, missing values should be taken into account during the data analysis.

The missing values can be replaced by following techniques:


 Ignore the record with missing values.

 Replace the missing term with constant.

 Fill the missing value manually based on domain knowledge.

 Replace them with the mean (if the data is numeric) or the most frequent value (if the data is categorical).

 Use modelling techniques such as decision trees, Bayes' algorithm, the nearest neighbour algorithm, etc.

Duplicate Data:

A Data set may include data objects that are duplicates or almost duplicates
of one another.

Many people receive duplicate mailings because they appear in a database multiple times under slightly different names.

2. Issues related to Applications:

Data Quality issues can also be considered from an application view point as
expressed by the statement.

“Data is of high quality if it is suitable for its intended use”. This approach
to data quality has proven quite useful.

• As with quality issues at the measurement and data collection level, there are many issues that are specific to particular applications and fields.

Few general issues are:

1. Timeliness:

Some data starts to age as soon as it has been Collected.

Example:
If the data provides a snapshot of some ongoing phenomenon or process
such as purchasing behavior of customers or web browsing patterns, then
this snapshot represents the reality for only a limited time.

If the data is out of date, then so are the models and patterns that are based
on it.

2. Relevance:

The available data must contain the information necessary for the
application.

Example: Accident rate for drivers

If information about the age and gender of the driver is omitted, then it is likely that the model will have limited accuracy.

3. Knowledge about the data:

Ideally, data sets are accompanied by documentation that describes different aspects of the data.

Other important characteristics are-

 the precision of the data,

 the type of features(nominal, ordinal, interval, ratio),

 the scale of measurement(meters, feet for length) and

 origin of the data.

Data Preprocessing

• Data preprocessing is a data mining technique that involves transforming


raw data into an understandable format.

• Real-world data is often incomplete, inconsistent, and/or lacking in certain


behaviors or trends, and is likely to contain many errors.

• Data preprocessing is a proven method of resolving such issues. Data


preprocessing prepares raw data for further processing.
Data goes through a series of steps during preprocessing:

• Data Cleaning: Data is cleansed through processes such as filling in missing


values, smoothing the noisy data, or resolving the inconsistencies in the data.

• Data Integration: Data with different representations are put together and
conflicts within the data are resolved.

• Data Transformation: Data is normalized, aggregated and generalized.

• Data Reduction: This step aims to present a reduced representation of the


data in a data warehouse. Produce the Same or Similar Analytical Results.

• Data Discretization: It is a part of Data Reduction process. Divide the range


of Continuous attributes into intervals.

Data Cleaning: Missing values, Noisy data

Missing values:

 Ignore the Missing Records

 Fill the missing values manually

 Fill missing values with constant terms

 Use of modelling techniques such decision trees, baye`s algorithm, nearest


neighbor algorithm Etc.

 Fill missing values with mean.

Noisy Data:

Binning method: smooth a sorted value by consulting its neighbourhood (the values around it), for example replacing it with the bin mean or the bin boundaries.

Regression:

Regression conforms a best-fit line to two attributes, so that one attribute can be used to predict the other.

Outlier analysis:
Outliers may be detected by the Clustering.

Data integration:

Data mining often requires data integration – the merging of data from multiple sources such as databases, data cubes and files.

Data transformation:

Normalization: scaling the attribute values so that they fall within a specified range.

Aggregation: Moving up in the concept hierarchy on numerical attributes.

Generalization: Moving up in the concept hierarchy on nominal attributes.

Data Reduction: Obtain reduced representation of data set that is much


smaller in volume.

Dimensionality reduction: Project the original data into a Smaller space.

Numerosity reduction: replace the original data volume by alternative, smaller forms of data representation, so that typically only the model parameters need to be stored.

Data compression: the original data can be reconstructed from the compressed data without any information loss.

Data discretization: it is a part of the data reduction process; the range of a continuous attribute is divided into intervals. A small R sketch of a few of these preprocessing steps is shown below.
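The following is a brief R sketch of some of these preprocessing steps applied to an illustrative numeric vector; the data values are made up for demonstration.

x <- c(4, 8, 15, 16, 23, 42)

# Data transformation: min-max normalization to the range [0, 1]
x_norm <- (x - min(x)) / (max(x) - min(x))

# Data transformation: z-score scaling (mean 0, standard deviation 1)
x_scaled <- scale(x)

# Data discretization: divide the continuous range into 3 equal-width intervals
x_bins <- cut(x, breaks = 3)
table(x_bins)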
UNIT-II
Data Analytics

Introduction to Data Analytics:


As an enormous amount of data gets generated, the need to extract useful insights is a
must for a business enterprise. Data Analytics has a key role in improving your
business.
• Data has been the buzzword for ages now.
• Whether the data is generated by large-scale enterprises or by individuals, every aspect of it needs to be analyzed in order to benefit from it.
Data Analytics has a key role in improving your business.

What is Data Analytics?


→ Data Analytics refers to the techniques to analyze data to enhance productivity
and business gain.
→ Data is extracted from various sources and is cleaned and categorized to analyze
different behavioral patterns.
→ The techniques and the tools used vary according to the organization or
individual.
→ Data analysts translate numbers into plain English. A Data Analyst delivers value
to their companies by taking information about specific topics and then
interpreting, analyzing, and presenting findings in comprehensive reports.
→ So, if you have the capability to collect data from various sources, analyze the
data, gather hidden insights and generate reports, then you can become a Data
Analyst.
Why Data Analytics Important?
As an enormous amount of data gets generated, the need to extract useful insights is a
must for business enterprises.
Data analytics has a key role in improving your business.

Here are 4 main factors which signify the need for Data Analytics:

• Gather Hidden Insights – Hidden insights from data are gathered and then analyzed
with respect to business requirements.

• Generate Reports – Reports are generated from the data and are passed on to the
respective teams and individuals to deal with further actions for a high rise in
business.

• Perform Market Analysis – Market Analysis can be performed to understand the


strengths and the weaknesses of competitors.

• Improve Business Requirement – Analysis of Data allows improving Business to


customer requirements and experience.

Introduction to Tools and Environment

Analytics is now a days used in all the fields ranging from Medical Sciences to
Government Activities.

Various Steps involved in Analytics:


1. Access
2. Manage
3. Analyze
4. Report
With the increasing demand for data analytics in the market, many tools with various functionalities have emerged for this purpose. Whether open-source or proprietary, the top tools used in the data analytics market are as follows.

R programming – This tool is the leading analytics tool used for statistics and data
modeling. R compiles and runs on various platforms such as UNIX, Windows, and
Mac OS. It also provides tools to automatically install all packages as per user-
requirement.

Python – Python is an open-source, object-oriented programming language which is easy to read, write and maintain. It provides various machine learning and visualization libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras etc. It can also be integrated with platforms such as SQL Server, a MongoDB database or JSON data.

Tableau Public – This is a free software that connects to any data source such as
Excel, corporate Data Warehouse etc. It then creates visualizations, maps,
dashboards etc. with real-time updates on the web.

QlikView – This tool offers in-memory data processing with the results delivered to
the end-users quickly. It also offers data association and data visualization with data
being compressed to almost 10% of its original size.

SAS – A programming language and environment for data manipulation and


analytics, this tool is easily accessible and can analyze data from different sources.

Microsoft Excel – This tool is one of the most widely used tools for data analytics.
Mostly used for clients’ internal data, this tool analyzes the tasks that summarize the
data with a preview of pivot tables.

RapidMiner – A powerful, integrated platform that can integrate with any data
source types such as Access, Excel, Microsoft SQL, Tera data, Oracle, Sybase etc.
This tool is mostly used for predictive analytics, such as data mining, text analytics,
machine learning.

OpenRefine – Also known as GoogleRefine, this data cleaning software will help
you clean up data for analysis. It is used for cleaning messy data, the transformation
of data and parsing data from websites.

Apache Spark – One of the largest large-scale data processing engines, this tool executes applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk. This tool is also popular for data pipelines and machine learning model development.

Application of Modeling in Business


 Higher quality
 Reduced cost
 Clearer scope
 Faster performance
 Better documentation

Importance of Data Modeling


 A clear representation of data makes it easier to analyze the data properly. It
provides a quick overview of the data which can then be used by the
developers in varied applications.
 Data modeling represents the data properly in a model. It rules out any
chances of data redundancy and omission. This helps in clear analysis and
processing.
 Data modeling improves data quality and enables the concerned stakeholders
to make data-driven decisions.
Since a lot of business processes depend on successful data modeling, it is
necessary to adopt the right data modeling techniques for the best results.

 Have a clear understanding of your organization’s requirements and organize your data
properly.
 Keep your data models simple. The best data modeling practice here is to use a tool
which can start small and scale up as needed.
 It is highly recommended to organize your data properly using individual tables for
facts and dimensions to enable quick analysis.
 Have a clear idea of how many datasets you want to keep; maintaining more than is actually required wastes your data modeling effort and leads to performance issues.
 It is best practice to maintain one-to-one or one-to-many relationships; many-to-many relationships only introduce complexity into the system.
 Data models become outdated quicker than you expect. It is necessary that you keep
them updated from time to time.

Databases & Types of Data and variables


Types of data in Analytics
 There are two main types of variables, qualitative (aka categorical
) and quantitative (aka numerical)
 Qualitative variable: has labels or names used to identify an
attribute of an element.
 Qualitative data use either the nominal or ordinal scale of
measurement
 Nominal: order does not matter e.g. Gender
 Ordinal: order does matter e.g. Education levels
 Quantitative variable: has numeric values that indicate how much
or how many of something.
 Quantitative data uses either the interval or ratio scale
 Interval: differences between quantities are meaningful, but ratios of quantities cannot be compared, e.g. temperature on the Celsius scale
 Ratio: ratios of quantities are meaningful, e.g. height
 Cross-sectional vs. time series data
We have two types of data set based on how the data were collected:
Cross-sectional: data collected at the same or approximately the same
point in time
Time series: data collected over several time periods.

A data dictionary, or metadata repository, as defined in the IBM Dictionary of


Computing, is a "centralized repository of information about data such as meaning,
relationships to other data, origin, usage, and format”.
 Data can be categorized on various parameters like Categorical, Type etc.
 Data is of 2 types – Numeric and Character. Again numeric data can be further
divided into sub group of – Discrete and Continuous.
 Again, Data can be divided into 2 categories – Nominal and ordinal.
 Also based on usage data is divided into 2 categories – Quantitative and
Qualitative

Data Modeling Techniques:

Data modeling is nothing but a process through which data is stored structurally in
a format in a database. Data modeling is important because it enables
organizations to make data-driven decisions and meet varied business goals.
The entire process of data modeling is not as easy as it seems, though. You are
required to have a deeper understanding of the structure of an organization and
then propose a solution that aligns with its end goals and helps it achieve the desired objectives.

Types of Data Models

Data modeling can be achieved in various ways. However, the basic concept of
each of them remains the same. Let’s have a look at the commonly used data
modeling methods:

Hierarchical model
As the name indicates, this data model makes use of hierarchy to structure the data
in a tree-like format. However, retrieving and accessing data is difficult in a
hierarchical database. This is why it is rarely used now.
Relational model
Proposed as an alternative to hierarchical model by an IBM researcher, here data is
represented in the form of tables. It reduces the complexity and provides a clear
overview of the data.

Network model
The network model is inspired by the hierarchical model. However, unlike the
hierarchical model, this model makes it easier to convey complex relationships as
each record can be linked with multiple parent records.
Object-oriented model
This database model consists of a collection of objects, each with its own features
and methods. This type of database model is also called the post-relational
database model.
Entity-relationship model
Entity-relationship model, also known as ER model, represents entities and their
relationships in a graphical format. An entity could be anything – a concept, a
piece of data, or an object.

Now that we have a basic understanding of data modeling, let’s see why it is important.
Data modeling
Data modeling (data modeling) is the process of creating a data model for the data
to be stored in a Database
 This data model is a conceptual representation of Data objects, the
associations between different data objects and the rules.
 Data modeling helps in the visual representation of data and enforces
business rules, regulatory compliances, and government policies on the data.
 Data models ensure consistency in naming conventions, default values,
semantics, and security while ensuring quality of the data.
 Data model emphasizes on what data is needed and how it should be
organized instead of what operations need to be performed on the data.
 Data model is like architect’s building plan which helps to build a
conceptual model and set the relationship between data items.
The primary goals of using data model are:
 Ensures that all data objects required by the database are accurately represented. Omission of data will lead to the creation of faulty reports and produce incorrect results.
 A data model helps design the database at the conceptual, physical and
logical levels.
 Data model structure helps to define the relational tables, primary and foreign keys and stored procedures.
 It provides a clear picture of the base data and can be used by database
developers to create a physical database.
 It is also helpful to identify missing and redundant data.
 Though the initial creation of data model is labor and time consuming, in the
long run, it makes your IT infrastructure upgrade and maintenance cheaper
and faster.
There are mainly three different types of data models:
 Conceptual: This Data model defines WHAT the system contains. This
model is typically created by Business stakeholders and Data Architects.
The purpose is to organize scope and define business concepts and rules.
 Logical: Defines HOW the system should be implemented regardless of the DBMS. This model is typically created by Data Architects and Business Analysts. The purpose is to develop a technical map of rules and data structures.
 Physical: This Data model describes HOW the system will be implemented
using a specific DBMS system. This model is typically created by DBA and
developers. The purpose is actual implementation of the database.

Advantages of Data Model:


 The main goal of a designing data model is to make certain that data objects
offered by functional team are represented accurately.
 The data model should be detailed enough to be used for building the
physical database.
 The information in the data model can be used for defining the relationship
between tables, primary and foreign keys, and stored procedures.
 A data model helps the business to communicate within and across organizations.
 A data model helps to document data mappings in the ETL process.
 It helps to recognize correct sources of data to populate the model.
Entity Relationship Diagrams
 Also referred to as ER diagrams or ERDs, Entity-Relationship modeling is the default technique for modeling and designing relational (traditional) databases. In this notation the architect identifies the entities, their attributes and the relationships between them.

UML class Diagrams


 UML (Unified Modeling Language) is a standardized family of notations for the modeling and design of information systems.
Data Dictionary
 Data dictionary is an inventory of data sets/tables with the list of their attributes/columns.
 Core data dictionary elements: List of data sets/tables, list of attributes/columns of each table
with data type.

Missing Imputations:
• Missing data is a common problem in practical data analysis. In datasets, missing
values could be represented as ‘?’, ‘nan’, ’N/A’, blank cell, or sometimes ‘-999’,
’inf’, ‘-inf’.
Missing Imputation simply means that we replace the missing values with some
guessed/estimated ones.
 In R, missing values are represented by the symbol NA (not available).
Impossible values (e.g., dividing by zero) are represented by the symbol NaN (not
a number). Unlike SAS, R uses the same symbol for character and numeric data.

 To test whether there are any missing values in a dataset we use the is.na() function.

For example, we define “y” and then check whether there are any missing values; T or TRUE means that the value is missing.

y <- c(1, 2, 3, NA)
is.na(y)    # returns a vector (F F F T)

Arithmetic functions on missing values yield missing values. For example:

x <- c(1, 2, NA, 3)
mean(x)                # returns NA

To remove missing values from our dataset we use the na.omit() function. For example, we can create a new dataset without missing data as below:

newdata <- na.omit(mydata)

Or, we can pass “na.rm=TRUE” as an argument to the operator. Using na.rm on the example above gives the desired result:

x <- c(1, 2, NA, 3)
mean(x, na.rm = TRUE)  # returns 2

• The study of missing data was formalized by with the concept of missing data
mechanisms.
• Missing data mechanism describes the underlying mechanism that generates
missing data and can be categorized into three types — missing completely at
random (MCAR), missing at random (MAR), and missing not at random
(MNAR).
Types of missing data:

• Understanding the reasons why data are missing is important for handling the remaining data correctly. If values are missing completely at random, the data sample is likely still representative of the population; but if the values are missing systematically, the analysis may be biased. For example, in a study of the relation between IQ and income, if participants with an above-average IQ tend to skip the question “What is your salary?”, analyses that do not take this missingness into account may falsely fail to find a positive association between IQ and salary. Because of these problems, methodologists routinely advise researchers to design studies to minimize the occurrence of missing values. Graphical models can be used to describe the missing data mechanism in detail.

Missing completely at random:


• Values in a data set are Missing Completely at Random (MCAR) if the events
that lead to any particular data-item being missing are independent both of
observable variables and of unobservable parameters of interest, and occur
entirely at random. When data are MCAR, the analysis performed on the data is
unbiased; however, data are rarely MCAR.

• In the case of MCAR, the missingness of data is unrelated to any study variable:
thus, the participants with completely observed data are in effect a random
sample of all the participants assigned a particular intervention. With MCAR, the
random assignment of treatments is assumed to be preserved, but that is usually
an unrealistically strong assumption in practice.

Missing at random:
• Missing at random (MAR) occurs when the missingness is not random, but where
missingness can be fully accounted for by variables where there is complete
information. MAR is an assumption that is impossible to verify statistically; we must rely on its substantive reasonableness.[8] An example is that males are less
likely to fill in a depression survey but this has nothing to do with their level of
depression, after accounting for maleness. Depending on the analysis method,
these data can still induce parameter bias in analyses due to the contingent
emptiness of cells (male, very high depression may have zero entries). However,
if the parameter is estimated with Full Information Maximum Likelihood, MAR
will provide asymptotically unbiased estimates.

Missing not at random:


• Missing not at random (MNAR) (also known as nonignorable nonresponse) is
data that is neither MAR nor MCAR (i.e. the value of the variable that is missing is related to the reason it is missing). To extend the previous example, this would
occur if men failed to fill in a depression survey because of their level of
depression.

Missing Value Replacement Policies:


 Ignore the records with missing values.
 Replace them with a global constant (e.g., “?”).
 Fill in missing values manually based on your domain knowledge.
 Replace them with the variable mean (if numerical) or the most frequent value (if
categorical).
 Use modeling techniques such as nearest neighbors, Bayes‘ rule, decision tree, or
EM algorithm.
• The is.na() function returns TRUE for each data point that is NA.

• The !is.na() function (its negation) returns all the values present in the data set except the missing values.

The remaining worked examples in the original notes (shown there as screenshots) cover: calculating the mean, the median and the standard deviation of a data set while ignoring missing values, followed by mean imputation and median imputation. These are reproduced in the R sketch below.
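The following sketch reproduces those ideas on a small made-up vector; the values are illustrative only.

x <- c(10, 20, NA, 40, 50)

# Calculating the mean, median and standard deviation, ignoring missing values
mean(x, na.rm = TRUE)      # 30
median(x, na.rm = TRUE)    # 30
sd(x, na.rm = TRUE)        # about 18.26

# Mean imputation: replace NA values with the mean of the observed values
x_mean_imputed <- x
x_mean_imputed[is.na(x_mean_imputed)] <- mean(x, na.rm = TRUE)

# Median imputation: replace NA values with the median of the observed values
x_median_imputed <- x
x_median_imputed[is.na(x_median_imputed)] <- median(x, na.rm = TRUE)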

Need for Business Modeling:


• It is defined as process of exploring the range of business decisions.
• Identifying the elements that are essential for business.
• At present business models are generally being used by small and large
companies over the globe.
• It facilitates organizations in improving their existing business and creating new
ones & developing better strategies for future.
• It enables business managers to answer the most essential questions about their business, such as:
• who are the customers,
• what is the value offered,
• how does the business make money, and so on.
A business model assists from the very initial stage of idea generation to the final stage in the following manner:
1. Define:
It helps in defining and assessing the existing business model or plan.
2. Discover:
It helps in determining the “whitespaces” for growth and innovation opportunities.
3. Develop:
It helps in development of new valuable ideas and business models.
4. Deliver:
It helps in testing the developed ideas.
Furthermore, business models can be used at different levels of an organization:
 With the applications of business modeling, decision making, managing
modifications, planning and implementations can be done by organization in a
better way.
 In addition, business modeling has proved helpful not only in transforming the key elements of a business but also in innovating completely new ideas.
Need for Business Modeling:
 A statistical model embodies a set of assumptions concerning the generation of the
observed data, and similar data from a larger population.
 A model represents often in considerably idealized form, the data-generating process.
 Signal processing is an enabling technology that encompasses the fundamental theory,
applications, algorithms, and implementations of processing or transferring information
contained in many different physical, symbolic, or abstract formats broadly designated
as signals.
 It uses mathematical, statistical, computational, heuristic, and linguistic representations, formalisms, and techniques for representation, modeling, analysis, synthesis, discovery, recovery, sensing, acquisition, extraction, learning, security, or forensics.
 In manufacturing, statistical models are used to define warranty policies, solve various conveyor-related issues, perform statistical process control, etc.
UNIT-III
Regression
Regression Concepts:
Regression analysis is a form of predictive modelling technique which investigates
the relationship between a dependent (target) and independent variable (s)
(predictor). This technique is used for forecasting, time series modelling and finding
the causal effect relationship between the variables. For example, relationship
between rash driving and number of road accidents by a driver is best studied
through regression.
 Dependent variable – the target variable, e.g. test score
 Independent variable – the predictor or explanatory variable, e.g. age

Regression analysis estimates the relationship between two or more variables. Let’s
understand this with an easy example:

Let’s say, you want to estimate growth in sales of a company based on current
economic conditions. You have the recent company data which indicates that the
growth in sales is around two and a half times the growth in the economy. Using this
insight, we can predict future sales of the company based on current & past
information.

There are multiple benefits of using regression analysis. They are as follows:

1. It indicates the significant relationships between dependent variable and


independent variable.
2. It indicates the strength of impact of multiple independent variables on a
dependent variable.

There are various kinds of regression techniques available to make predictions.


These techniques are mostly driven by three metrics (number of independent
variables, type of dependent variables and shape of regression line). We’ll discuss
them in detail in the following sections.
For the creative ones, you can even cook up new regressions, if you feel the need to
use a combination of the parameters above, which people haven’t used before. But
before you start that, let us understand the most commonly used regressions:

 Linear Regression
 Logistic Regression
 Polynomial Regression
 Ridge Regression
 Lasso Regression

1. Linear Regression

It is one of the most widely known modeling technique. Linear regression is usually
among the first few topics which people pick while learning predictive modeling. In
this technique, the dependent variable is continuous, independent variable(s) can be
continuous or discrete, and nature of regression line is linear.

The relationship between two variables can be of three types:

(i) Linear relationship

(ii) Non-linear relationship

(iii) No relationship

(The original notes illustrate each case with a scatter plot; the graphs are not reproduced here.)


Linear Regression establishes a relationship between dependent variable (Y) and
one or more independent variables (X) using a best fit straight line (also known as
regression line).

The best fit line can be written as y = a + b·x, where
 y is the dependent variable
 x is the independent variable
 b is the slope --> how much the line rises for each unit increase in x
 a is the y-intercept --> the value of y when x = 0.

Simple linear regression: it represents the relationship between two variables, one independent variable X and one dependent variable Y.

Multiple linear regression: when there are multiple independent variables, we call it multiple linear regression. A minimal R sketch of both cases follows.
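The sketch below uses R's built-in lm() function; the data frame and its columns (age, hours, score) are illustrative assumptions, not data from the notes.

# Illustrative data: age and study hours as predictors of a test score
df <- data.frame(age   = c(18, 20, 22, 24, 26, 28),
                 hours = c(2, 3, 3, 5, 6, 7),
                 score = c(55, 60, 62, 70, 75, 80))

# Simple linear regression: one independent variable
fit_simple <- lm(score ~ age, data = df)
summary(fit_simple)        # gives the intercept (a) and slope (b)

# Multiple linear regression: more than one independent variable
fit_multi <- lm(score ~ age + hours, data = df)
coef(fit_multi)

# Predict the score for a new observation
predict(fit_simple, newdata = data.frame(age = 30))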

Assumptions of linear regression:


 There must be a linear relation between independent and dependent variables.
 There should not be any outliers present.
 No heteroscedasticity
 Sample observations should be independent.
 Error terms should be normally distributed with mean 0 and constant variance.
 Absence of multicollinearity and auto-correlation.

Logistic Regression
Logistic regression is used to solve classification problems, so it is called a classification algorithm; it models the probability of the output class.
 It is used when the target element is categorical.
 Unlike in linear regression, in logistic regression the required output is represented in discrete values like binary 0 and 1.
 It estimates relationship between a dependent variable (target) and one or more
independent variable (predictors) where dependent variable is categorical/nominal.
 Logistic regression is a supervised learning classification algorithm used to predict
the probability of a dependent variable.
 The nature of target or dependent variable is dichotomous(binary), which means
there would be only two possible classes.
 In simple words, the dependent variable is binary in nature having data coded as
either 1 (stands for success/yes) or 0 (stands for failure/no), etc. but instead of giving
the exact value as 0 and 1, it gives the probabilistic values which lie between 0 and
1.
 Logistic Regression is much similar to the Linear Regression except that how they
are used. Linear Regression is used for solving Regression problems,
whereas Logistic regression is used for solving the classification problems.
 In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, which predicts two maximum values (0 or 1).

Sigmoid Function:
 It is the logistic expression especially used in Logistic Regression.
 The sigmoid function converts any line into a curve which has discrete values like
binary 0 and 1.
 In this session let’s see how a continuous linear regression can be manipulated and
converted into Classifies Logistic.
 The sigmoid function is a mathematical function used to map the predicted values
to probabilities.
 It maps any real value into another value within a range of 0 and 1.
 The value of the logistic regression must be between 0 and 1, which cannot go
beyond this limit, so it forms a curve like the "S" form. The S-form curve is called
the Sigmoid function or the logistic function.
 In logistic regression, we use the concept of the threshold value, which defines the
probability of either 0 or 1. Such as values above the threshold value tends to 1, and
a value below the threshold values tends to 0.

The sigmoid function can be written as P = 1 / (1 + e^(-y)), where P represents the probability of the output class and y represents the predicted (linear) output.

Assumptions for Logistic Regression:

• The dependent variable must be categorical in nature.


• The independent variable should not have multi-collinearity.

Logistic Regression Equation: log(P / (1 - P)) = b0 + b1x1 + b2x2 + ... + bnxn

Example

Admission (dependent variable)    CGPA (independent variable)

0 4.2
0 5.1

0 5.5

1 8.2

1 9.0

1 9.9
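A minimal R sketch fitting a logistic regression to the small table above with glm(); the 0.5 threshold is an illustrative choice, and because this toy data set is perfectly separated R may warn that fitted probabilities of 0 or 1 occurred.

# Data from the table above: admission (0/1) versus CGPA
adm  <- c(0, 0, 0, 1, 1, 1)
cgpa <- c(4.2, 5.1, 5.5, 8.2, 9.0, 9.9)

# Fit the logistic (binomial) model; the sigmoid link maps the output to (0, 1)
fit <- glm(adm ~ cgpa, family = binomial)

# Predicted probability of admission for a CGPA of 7.0
p <- predict(fit, newdata = data.frame(cgpa = 7.0), type = "response")

# Apply a threshold (0.5 here) to turn the probability into a class label
ifelse(p > 0.5, 1, 0)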

Logistic regression can be binomial, ordinal or multinomial.


 Binomial or binary logistic regression deals with situations in which the
observed outcome for a dependent variable can have only two possible
types, "0" and "1" (which may represent, for example, "dead" vs. "alive"
or "win" vs. "loss").
 Multinomial logistic regression deals with situations where the outcome
can have three or more possible types (e.g., "disease A" vs. "disease B"
vs. "disease C") that are not ordered.
 Ordinal logistic regression deals with dependent variables that are
ordered.
Differences Between Linear and Logistic Regression: linear regression is used to solve regression problems and predicts a continuous output using a best-fit straight line, whereas logistic regression is used to solve classification problems and predicts a categorical (binary) output using the S-shaped sigmoid curve and a probability threshold.

Polynomial Regression
o Polynomial Regression is a type of regression which models the non-linear
dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between
the value of x and corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a
non-linear fashion, so for such case, linear regression will not best fit to those
datapoints. To cover such datapoints, we need Polynomial regression.
o In Polynomial regression, the original features are transformed into
polynomial features of given degree and then modeled using a linear model.
Which means the datapoints are best fitted using a polynomial line.
o The equation for polynomial regression is also derived from the linear regression equation: the linear regression equation Y = b0 + b1x is transformed into the polynomial regression equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output and b0, b1, ..., bn are the regression coefficients; x is our independent/input variable.
o The model is still linear because the coefficients b0, b1, ..., bn enter linearly; only the features (x^2, x^3, ...) are non-linear.

Need for Polynomial Regression:

• If we apply a linear model on a linear dataset, then it provides us a good


result as we have seen in Simple Linear Regression, but if we apply the same
model without any modification on a non-linear dataset, then it will produce a
drastically worse output: the loss function will increase, the error rate will be high, and accuracy will decrease.
• So for such cases, where data points are arranged in a non-linear fashion, we
need the Polynomial Regression model. We can understand it in a better way
using the below comparison diagram of the linear dataset and non-linear
dataset.
• In the comparison diagram (not reproduced here), the dataset is arranged non-linearly: if we try to cover it with a linear model, it hardly covers any data point, whereas a curve, which is what the polynomial model fits, covers most of the data points.
• Hence, if the datasets are arranged in a non-linear fashion, then we should
use the Polynomial Regression model instead of Simple Linear Regression.

When we compare the above three equations, we can clearly see that all three
equations are Polynomial equations but differ by the degree of variables.

The Simple and Multiple Linear equations are also Polynomial equations with
a single degree, and the Polynomial regression equation is Linear equation
with the nth degree.

So if we add degrees to our linear equation, it is converted into a polynomial linear equation. A small R sketch of fitting a polynomial regression follows.
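The sketch below fits a degree-2 polynomial with lm() and poly(); the non-linear data are made up for illustration.

set.seed(2)
x <- seq(1, 10, by = 0.5)
y <- 3 + 2 * x - 0.5 * x^2 + rnorm(length(x), sd = 2)   # made-up non-linear data

# Degree-2 polynomial regression: y = b0 + b1*x + b2*x^2
fit_poly <- lm(y ~ poly(x, 2, raw = TRUE))
coef(fit_poly)

# Compare with a plain linear fit
fit_lin <- lm(y ~ x)

# The polynomial curve follows the data points much more closely than the line
plot(x, y)
lines(x, predict(fit_poly), col = "blue")
abline(fit_lin, col = "red")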

Stepwise Regression
• This form of regression is used when we deal with multiple independent
variables. In this technique, the selection of independent variables is done
with the help of an automatic process, which involves no human intervention.
• Stepwise regression basically fits the regression model by adding/dropping
co-variates one at a time based on a specified criterion. Some of the most
commonly used Stepwise regression methods are listed below:

Standard stepwise regression does two things. It adds and removes predictors
as needed for each step.

 Forward selection starts with most significant predictor in the model and
adds variable for each step.
 Backward elimination starts with all predictors in the model and removes the
least significant variable for each step.
 The aim of this modeling technique is to maximize prediction power with a minimum number of predictor variables. It is one of the methods for handling high-dimensional data sets; a small R sketch using step() follows.
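A brief sketch of stepwise selection using R's built-in step() function; the data frame and predictor names are illustrative assumptions.

set.seed(3)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
df$y <- 1 + 2 * df$x1 + rnorm(50)          # only x1 truly matters here

full_model <- lm(y ~ x1 + x2 + x3, data = df)

# Standard stepwise regression: predictors are added and removed based on AIC
step_model <- step(full_model, direction = "both", trace = 0)

# direction = "backward" gives backward elimination,
# direction = "forward" (with a scope) gives forward selection
summary(step_model)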

Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression in
which a small amount of bias is introduced so that we can get better long term
predictions.
o The amount of bias added to the model is known as the ridge regression penalty. This penalty term is computed by multiplying the parameter lambda by the sum of the squared weights of the individual features.
o The cost function for ridge regression is therefore: Cost = Σ(yi − ŷi)² + λ Σ bj²

o A general linear or polynomial regression will fail if there is high collinearity


between the independent variables, so to solve such problems, Ridge
regression can be used.
o Ridge regression is a regularization technique, which is used to reduce the
complexity of the model. It is also called as L2 regularization.
o It helps to solve the problems if we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity
of the model.
o It is similar to the Ridge Regression except that penalty term contains only the
absolute weights instead of a square of weights.
o Since it takes absolute values, hence, it can shrink the slope to 0, whereas
Ridge Regression can only shrink it near to 0.
o It is also called L1 regularization. The cost function for lasso regression is: Cost = Σ(yi − ŷi)² + λ Σ |bj|. A short sketch of both techniques using the glmnet package follows.
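A minimal sketch of ridge and lasso regression using the glmnet package (assumed to be installed); alpha = 0 gives the L2 (ridge) penalty and alpha = 1 gives the L1 (lasso) penalty. The data are random and purely illustrative.

library(glmnet)

set.seed(4)
x <- matrix(rnorm(100 * 5), ncol = 5)        # 5 illustrative predictors
y <- 2 * x[, 1] - x[, 2] + rnorm(100)

# Ridge regression (L2 regularization): shrinks coefficients towards zero
ridge_fit <- glmnet(x, y, alpha = 0, lambda = 0.5)
coef(ridge_fit)

# Lasso regression (L1 regularization): can shrink coefficients exactly to zero
lasso_fit <- glmnet(x, y, alpha = 1, lambda = 0.5)
coef(lasso_fit)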

BLUE Property Assumptions:


Best Linear Unbiased Estimator:

WHAT IS AN ESTIMATOR?
• In statistics, an estimator is a rule for calculating an estimate of a given quantity
based on observed data
• Example-
i. X follows a normal distribution, but we do not know the parameters of our distribution, namely the mean (μ) and variance (σ²)
ii. To estimate the unknowns, the usual procedure is to draw a random sample
of size ‘n’ and use the sample data to estimate parameters.
TWO TYPES OF ESTIMATORS
• Point Estimators A point estimate of a population parameter is a single value of a
statistic. For example, the sample mean x is a point estimate of the population mean
μ. Similarly, the sample proportion p is a point estimate of the population proportion
P.
• Interval Estimators An interval estimate is defined by two numbers, between
which a population parameter is said to lie. For example, a < x < b is an interval
estimate of the population mean μ. It indicates that the population mean is greater
than a but less than b.

PROPERTIES OF BLUE
• B-BEST
• L-LINEAR
• U-UNBIASED
• E-ESTIMATOR
An estimator is BLUE if the following hold:
1. It is linear (Regression model)
2. It is unbiased
3. It is an efficient estimator(unbiased estimator with least variance)

LINEARITY
• An estimator is said to be a linear estimator of (β) if it is a linear function of the
sample observations
• Sample mean is a linear estimator because it is a linear function of the X values.

UNBIASEDNESS
• A desirable property of a distribution of estimates is that its mean equals the true
mean of the variables being estimated
• Formally, an estimator is an unbiased estimator if its sampling distribution has as
its expected value equal to the true value of population.
• We write this formally as E(β̂) = β, where β̂ denotes the estimator of the true parameter β.
• If this is not the case, we say that the estimator is biased.
• Bias = E(β̂) − β

MINIMUM VARIANCE
• Just as we wanted the mean of the sampling distribution to be centered around the true population value, so too it is desirable for the sampling distribution to be as narrow (or precise) as possible.
– Centering around “the truth” but with high variability might be of very little use
• One way of narrowing the sampling distribution is to increase the sampling size

Assumptions of Gauss Markov or BLUE assumptions:


1 Linearity in Parameters: The population model is linear in its parameters and
correctly specified
2 Random Sampling: The observed data represent a random sample from the
population described by the model.
3 Variation in X: There is variation in the explanatory variable.
4 Zero conditional mean: Expected value of the error term is zero conditional on all
values of the explanatory variable
5 Homoskedasticity: The error term has the same variance conditional on all values
of the explanatory variable.
6 Normality: The error term is independent of the explanatory variables and
normally distributed.
7 No multicollinearity: there must be no exact correlation among the independent variables.
8 Non-collinearity: the regressors being used are not perfectly correlated with each other.
9 Exogeneity: the regressors are not correlated with the error terms.

Least Square Estimation:


Least squares regression is a way of finding a straight line that best fits the data, called the "Line of Best Fit". Given the data as (x,y) pairs, we can find the equation of the line that best fits the data.
Line of Best Fit

Imagine you have some points, and want to have a line that best fits them like this:
We can place the line "by eye": try to have the line as close as possible to all points,
and a similar number of points above and below the line.

But for better accuracy let's see how to calculate the line using Least Squares
Regression.

The Line

Our aim is to calculate the values m (slope) and b (y-intercept) in the equation of a
line :

y = mx + b

Where:

 y = how far up
 x = how far along
 m = Slope or Gradient (how steep the line is)
 b = the Y Intercept (where the line crosses the Y axis)

Steps

To find the line of best fit for N points:

Step 1: For each (x,y) point calculate x² and xy

Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy (Σ means "sum up")

Step 3: Calculate the slope m:

m = (N Σ(xy) − Σx Σy) / (N Σ(x²) − (Σx)²)

(N is the number of points.)

Step 4: Calculate the intercept b:

b = (Σy − m Σx) / N

Step 5: Assemble the equation of the line: y = mx + b

Done!

Example

Example: Sam found how many hours of sunshine vs how many ice creams were
sold at the shop from Monday to Friday:

"x" "y"
Hours of Ice Creams
Sunshine Sold

2 4

3 5

5 7

7 10

9 15
Let us find the best m (slope) and b (y-intercept) that suit that data:

y = mx + b

Step 1: For each (x,y) calculate x² and xy:

x    y    x²    xy

2 4 4 8

3 5 9 15

5 7 25 35

7 10 49 70

9 15 81 135

Step 2: Sum x, y, x² and xy (gives us Σx, Σy, Σx² and Σxy):

x    y    x²    xy

2 4 4 8

3 5 9 15

5 7 25 35

7 10 49 70

9 15 81 135

Σx: 26   Σy: 41   Σx²: 168   Σxy: 263

Also N (number of data values) = 5

Step 3: Calculate Slope m:


m = [N Σ(xy) − Σx Σy] / [N Σ(x²) − (Σx)²]

  = (5 × 263 − 26 × 41) / (5 × 168 − 26²)

  = (1315 − 1066) / (840 − 676)

  = 249 / 164 = 1.5183...

Step 4: Calculate Intercept b:

b = [Σy − m Σx] / N

  = (41 − 1.5183 × 26) / 5

  = 0.3049...

Step 5: Assemble the equation of a line:

y = mx + b

y = 1.518x + 0.305

Let's see how it works out:

x y y = 1.518x + 0.305 error

2 4 3.34 −0.66

3 5 4.86 −0.14

5 7 7.89 0.89

7 10 10.93 0.93

9 15 13.97 −1.03

Here are the (x,y) points and the line y = 1.518x + 0.305 on a graph:
Sam hears the weather forecast which says "we expect 8 hours of sun tomorrow", so
he uses the above equation to estimate that he will sell

y = 1.518 x 8 + 0.305 = 12.45 Ice Creams

Sam makes fresh waffle cone mixture for 14 ice creams just in case. Yum.
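
The worked example above can be reproduced with a few lines of Python. This is a
minimal sketch of the same least-squares calculation, using only Sam's data from the
example:

xs = [2, 3, 5, 7, 9]   # hours of sunshine
ys = [4, 5, 7, 10, 15] # ice creams sold
n = len(xs)

sum_x = sum(xs)                              # Σx  = 26
sum_y = sum(ys)                              # Σy  = 41
sum_x2 = sum(x * x for x in xs)              # Σx² = 168
sum_xy = sum(x * y for x, y in zip(xs, ys))  # Σxy = 263

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope ≈ 1.518
b = (sum_y - m * sum_x) / n                                   # intercept ≈ 0.305

print(f"y = {m:.3f}x + {b:.3f}")
print("Forecast for 8 hours of sun:", round(m * 8 + b, 2), "ice creams")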

How does it work?

It works by making the total of the square of the errors as small as possible (that is
why it is called "least squares"):

The straight line minimizes the sum of squared errors


So, when we square each of those errors and add them all up, the total is as small as
possible.

You can imagine (but not accurately) each data point connected to a straight bar by
springs:

Outliers

Be careful! Least squares is sensitive to outliers. A strange value will pull the line
towards it.

OLS -> Ordinary Least Square


MLE -> Maximum Likelihood Estimation

The ordinary least squares, or OLS, can also be called the linear least squares. This
is a method for approximately determining the unknown parameters located in a
linear regression model. According to books of statistics and other online sources,
the ordinary least squares is obtained by minimizing the total of squared vertical
distances between the observed responses within the dataset and the responses
predicted by the linear approximation. Through a simple formula, you can express
the resulting estimator, especially the single regressor, located on the right-hand side
of the linear regression model.

For example, suppose you have a system of several equations with unknown parameters.
You may use the ordinary least squares method because it is the most standard approach
for finding an approximate solution to such an overdetermined system. In other words, it
is the overall solution that minimizes the sum of the squares of the errors in your
equations. Data fitting is its most suited
application. Online sources have stated that the data that best fits the ordinary least
squares minimizes the sum of squared residuals. “Residual” is “the difference
between an observed value and the fitted value provided by a model.”

Maximum likelihood estimation, or MLE, is a method used in estimating the


parameters of a statistical model, and for fitting a statistical model to data. If you
want to find the height measurement of every basketball player in a specific location,
you can use the maximum likelihood estimation. Normally, you would encounter
problems such as cost and time constraints. If you could not afford to measure all of
the basketball players’ heights, the maximum likelihood estimation would be very
handy. Using the maximum likelihood estimation, you can estimate the mean and
variance of the height of your subjects. The MLE would set the mean and variance
as parameters in determining the specific parametric values in a given model.
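
To make the MLE idea concrete, here is a minimal sketch, assuming a small hypothetical
sample of player heights and a normal model. For a normal distribution, the MLE of the
mean is the sample mean and the MLE of the variance is the average squared deviation
(dividing by n rather than n − 1):

import numpy as np

heights = np.array([180.0, 175.0, 198.0, 201.0, 185.0, 190.0])  # hypothetical data (cm)

mu_hat = heights.mean()                     # MLE of the mean
var_hat = ((heights - mu_hat) ** 2).mean()  # MLE of the variance (divide by n)

print("MLE mean:", round(mu_hat, 2), " MLE variance:", round(var_hat, 2))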

Variable Rationalization:
Method selection allows you to specify how independent variables are entered into
the analysis. Using different methods, you can construct a variety of regression
models from the same set of variables.
 Enter (Regression). A procedure for variable selection in which all variables in a
block are entered in a single step.
 Stepwise. At each step, the independent variable not in the equation that has the
smallest probability of F is entered, if that probability is sufficiently small. Variables
already in the regression equation are removed if their probability of F becomes
sufficiently large. The method terminates when no more variables are eligible for
inclusion or removal.
 Remove. A procedure for variable selection in which all variables in a block are
removed in a single step.
 Backward Elimination. A variable selection procedure in which all variables are
entered into the equation and then sequentially removed. The variable with the
smallest partial correlation with the dependent variable is considered first for
removal. If it meets the criterion for elimination, it is removed. After the first variable is
removed, the variable remaining in the equation with the smallest partial
correlation is considered next. The procedure stops when there are no variables in
the equation that satisfy the removal criteria.
 Forward Selection. A stepwise variable selection procedure in which variables are
sequentially entered into the model. The first variable considered for entry into the
equation is the one with the largest positive or negative correlation with the
dependent variable.
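
As a hedged sketch of these selection methods, scikit-learn's SequentialFeatureSelector
can perform forward (or backward) selection around any estimator; the dataset and the
number of features to keep below are illustrative assumptions, and the entry/removal
rules differ from the F-probability criteria described above:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=4,   # assumed target number of predictors
    direction="forward",      # use "backward" for backward elimination
)
selector.fit(X, y)
print("Selected feature indices:", np.where(selector.get_support())[0])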
Model Building:
In regression analysis, model building is the process of developing a probabilistic
model that best describes the relationship between the dependent and independent
variables. The major issues are finding the proper form (linear or curvilinear) of the
relationship and selecting which independent variables to include. In building
models it is often desirable to use qualitative as well as quantitative variables. As
noted above, quantitative variables measure how much or how many; qualitative
variables represent types or categories. For instance, suppose it is of interest to
predict sales of an iced tea that is available in either bottles or cans. Clearly, the
independent variable “container type” could influence the dependent variable
“sales.” Container type is a qualitative variable, however, and must be assigned
numerical values if it is to be used in a regression study. So-called dummy variables
are used to represent qualitative variables in regression analysis. For example, the
dummy variable x could be used to represent container type by setting x = 0 if the
iced tea is packaged in a bottle and x = 1 if the iced tea is in a can. If the beverage
could be placed in glass bottles, plastic bottles, or cans, it would require two dummy
variables to properly represent the qualitative variable container type. In general, k -
1 dummy variables are needed to model the effect of a qualitative variable that may
assume k values.
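
A minimal sketch of creating such dummy variables with pandas follows; the container
types and sales figures are hypothetical. With three container types, drop_first=True
leaves k − 1 = 2 dummy columns, as recommended above:

import pandas as pd

df = pd.DataFrame({
    "container": ["glass", "plastic", "can", "can", "glass"],
    "sales": [120, 150, 200, 180, 130],   # hypothetical sales figures
})

# One dummy column per container type except the first (the reference category)
dummies = pd.get_dummies(df, columns=["container"], drop_first=True)
print(dummies)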

The general linear model y = β0 + β1x1 + β2x2 + . . . + βpxp + ε can be used to


model a wide variety of curvilinear relationships between dependent and
independent variables. For instance, each of the independent variables could
be a nonlinear function of other variables. Also, statisticians sometimes find it
necessary to transform the dependent variable in order to build a satisfactory model.
A logarithmic transformation is one of the more common types.
Logistic Regression
Model Theory:
Logistic regression is a statistical method for predicting binary classes. The outcome
or target variable is binary in nature. For example, it can be used for cancer detection
problems. It computes the probability of an event occurrence.

It is a special case of linear regression where the target variable is categorical in


nature. It uses a log of odds as the dependent variable. Logistic Regression predicts
the probability of occurrence of a binary event utilizing a logit function.

Linear Regression Equation:

y = β0 + β1x1 + β2x2 + … + βnxn

where y is the dependent variable and x1, x2, …, xn are explanatory variables.

Sigmoid Function:

p = 1 / (1 + e^(−y))

Apply the sigmoid function to the linear regression equation:

p = 1 / (1 + e^(−(β0 + β1x1 + β2x2 + … + βnxn)))
Properties of Logistic Regression:

 The dependent variable in logistic regression follows Bernoulli Distribution.

 Estimation is done through maximum likelihood.

 There is no R-squared; model fitness is assessed through measures such as concordance and the KS statistic.


Maximum Likelihood Estimation
The MLE is a “likelihood” maximization method, while OLS is a distance-
minimizing approximation method. Maximizing the likelihood function determines
the parameters that are most likely to produce the observed data. From a statistical
point of view, MLE sets the mean and variance as parameters in determining the
specific parametric values for a given model. This set of parameters can be used for
predicting the data needed in a normal distribution.
Sigmoid Function

Sigmoid curve

The sigmoid function also called the logistic function gives an ‘S’ shaped curve that
can take any real-valued number and map it into a value between 0 and 1. If the curve
goes to positive infinity, y predicted will become 1, and if the curve goes to negative
infinity, y predicted will become 0. If the output of the sigmoid function is more than
0.5, we can classify the outcome as 1 or YES, and if it is less than 0.5, we can classify
it as 0 or NO. For example, if the output is 0.75, we can say in terms of probability that
there is a 75 percent chance that the patient will suffer from cancer.

Sigmoid function

The sigmoid curve has a finite limit of:

‘0’ as x approaches −∞
‘1’ as x approaches +∞

The output of sigmoid function when x=0 is 0.5

Thus, if the output is more than 0.5, we can classify the outcome as 1 (or YES) and if
it is less than 0.5, we can classify it as 0 (or NO).

For example: If the output is 0.65, we can say in terms of probability as:

“There is a 65 percent chance that your favorite cricket team is going to win today ”.
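
A small sketch of the sigmoid function and the 0.5 decision threshold described above
(the input values are arbitrary examples):

import numpy as np

def sigmoid(x):
    """Map any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

for x in (-4, 0, 2):
    p = sigmoid(x)
    label = 1 if p > 0.5 else 0
    print(f"x = {x:+d}  ->  p = {p:.3f}  ->  class {label}")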

Logistic Regression Assumptions


 Binary logistic regression requires the dependent variable to be binary.

 For a binary regression, the factor level 1 of the dependent variable should
represent the desired outcome.

 Only meaningful variables should be included.

 The independent variables should be independent of each other. That is, the
model should have little or no multicollinearity.

 The independent variables are linearly related to the log odds.

 Logistic regression requires quite large sample sizes.

Binary Logistic Regression model building in Scikit learn
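
The notes name this step but do not include the code, so the following is a typical,
hedged sketch of building a binary logistic regression model with scikit-learn; the
built-in breast cancer dataset and the train/test split are assumptions for
illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=5000)   # larger max_iter helps convergence
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))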


Model Fit Statistics:
A commonly used measure of the goodness of fit provided by the estimated
regression equation is the coefficient of determination. Computation of this
coefficient is based on the analysis of variance procedure that partitions the total
variation in the dependent variable, denoted SST, into two parts: the part explained
by the estimated regression equation, denoted SSR, and the part that remains
unexplained, denoted SSE.

The measure of total variation, SST, is the sum of the squared deviations of the
dependent variable about its mean: Σ(y − ȳ)². This quantity is known as the total sum
of squares. The measure of unexplained variation, SSE, is referred to as the residual
sum of squares. SSE is the sum of the squared distances from each observed point to the
estimated regression line: Σ(y − ŷ)². SSE is also commonly referred to as the error
sum of squares. A key result in the analysis of variance is that SSR + SSE = SST.

The ratio r2 = SSR/SST is called the coefficient of determination. If the data points
are clustered closely about the estimated regression line, the value of SSE will be
small and SSR/SST will be close to 1. Using r2, whose values lie between 0 and 1,
provides a measure of goodness of fit; values closer to 1 imply a better fit. A value
of r2 = 0 implies that there is no linear relationship between the dependent and
independent variables.

When expressed as a percentage, the coefficient of determination can be interpreted


as the percentage of the total sum of squares that can be explained using the
estimated regression equation. For the stress-level research study, the value of r2 is
0.583; thus, 58.3% of the total sum of squares can be explained by the estimated
regression equation ŷ = 42.3 + 0.49x. For typical data found in the social sciences,
values of r2 as low as 0.25 are often considered useful. For data in the physical
sciences, r2 values of 0.60 or greater are frequently found.
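
A minimal sketch of the SST / SSE / SSR decomposition and r² for an arbitrary set of
observed and fitted values (the numbers below are hypothetical):

import numpy as np

y = np.array([4, 5, 7, 10, 15], dtype=float)        # observed values
y_hat = np.array([3.34, 4.86, 7.89, 10.93, 13.97])  # fitted values from a model

sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
sse = np.sum((y - y_hat) ** 2)      # residual (error) sum of squares
ssr = sst - sse                     # explained sum of squares (SSR = SST - SSE)

r2 = ssr / sst
print(f"SST={sst:.2f}  SSE={sse:.2f}  SSR={ssr:.2f}  r2={r2:.3f}")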

Hosmer Lemeshow Test:


 The Hosmer–Lemeshow test is a statistical test for goodness of fit for
logistic regression models.
 It is used frequently in risk prediction models.
 The test assesses whether or not the observed event rates match
expected event rates in subgroups of the model population.
 The Hosmer–Lemeshow test specifically identifies subgroups as the
deciles of fitted risk values.
 Models for which expected and observed event rates in
subgroups are similar are called well calibrated.

 The Hosmer–Lemeshow test statistic is given by:

H = Σ (from g = 1 to G) [ (Og − Eg)² / (Ng πg (1 − πg)) ]

Here Og, Eg, Ng, and πg denote the observed events, expected events, observations and
predicted risk for the gth risk decile group, and G is the number of groups.
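
A hedged sketch of computing this statistic: group observations into deciles of fitted
risk, then compare observed and expected event counts in each group. The predicted
probabilities and outcomes below are hypothetical:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
p_hat = rng.uniform(0.05, 0.95, size=500)   # fitted risks (hypothetical)
y = rng.binomial(1, p_hat)                  # observed 0/1 outcomes

df = pd.DataFrame({"p": p_hat, "y": y})
df["decile"] = pd.qcut(df["p"], 10, labels=False)   # G = 10 risk groups

H = 0.0
for _, grp in df.groupby("decile"):
    Ng = len(grp)              # observations in the group
    Og = grp["y"].sum()        # observed events
    pig = grp["p"].mean()      # mean predicted risk
    Eg = Ng * pig              # expected events
    H += (Og - Eg) ** 2 / (Ng * pig * (1 - pig))

print("Hosmer-Lemeshow statistic H =", round(H, 2))
# H is compared against a chi-square distribution with G - 2 degrees of freedom.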

Model Construction:
One Model Building Strategy
We've talked before about the "art" of model building. Unsurprisingly, there are
many approaches to model building, but here is one strategy—consisting of seven
steps—that is commonly used when building a regression model.

The first step


Decide on the type of model that is needed in order to achieve the goals of the
study. In general, there are five reasons one might want to build a regression
model. They are:
 For predictive reasons — that is, the model will be used to predict the
response variable from a chosen set of predictors.
 For theoretical reasons — that is, the researcher wants to estimate a model
based on a known theoretical relationship between the response and
predictors.
 For control purposes — that is, the model will be used to control a response
variable by manipulating the values of the predictor variables.
 For inferential reasons — that is, the model will be used to explore the
strength of the relationships between the response and the predictors.
 For data summary reasons — that is, the model will be used merely as a
way to summarize a large set of data by a single equation.
The second step
Decide which predictor variables and response variable on which to collect the
data. Collect the data.

The third step


Explore the data. That is:

 On a univariate basis, check for outliers, gross data errors, and missing
values.
 Study bivariate relationships to reveal other outliers, to suggest possible
transformations, and to identify possible multicollinearities.
I can't possibly over-emphasize the importance of this step. There's not a data
analyst out there who hasn't made the mistake of skipping this step and later
regretting it when a data point was found in error, thereby nullifying hours of
work.

The fourth step


Randomly divide the data into a training set and a validation set:

 The training set, with at least 15-20 error degrees of freedom, is used to
estimate the model.
 The validation set is used for cross-validation of the fitted model.
The fifth step
Using the training set, identify several candidate models:

 Use best subsets regression.


 Use stepwise regression, which of course only yields one model unless
different alpha-to-remove and alpha-to-enter values are specified.
The sixth step
Select and evaluate a few "good" models:
 Select the models based on the criteria we learned, as well as the number and
nature of the predictors.
 Evaluate the selected models for violation of the model conditions.
 If none of the models provide a satisfactory fit, try something else, such as
collecting more data, identifying different predictors, or formulating a
different type of model.
The seventh and final step
Select the final model:

 Compare the competing models by cross-validating them against the


validation data.
 The model with a smaller mean square prediction error (or larger cross-
validation R2) is a better predictive model.
 Consider residual plots, outliers, parsimony, relevance, and ease of
measurement of predictors.
And, most of all, don't forget that there is not necessarily only one good model for
a given set of data. There might be a few equally satisfactory models.

Analytics applications to various Business Domains:


 Finance
BA is of utmost importance to the finance sector. Data Scientists are in high
demand in investment banking, portfolio management, financial planning,
budgeting, forecasting, etc.
For example: Companies these days have a large amount of financial data.
Use of intelligent Business Analytics tools can help use this data to determine
the products’ prices. Also, on the basis of historical information Business
Analysts can study the trends on the performance of a particular stock and
advise the client on whether to retain it or sell it.
 Marketing
Studying buying patterns of consumer behaviour, analysing trends, help in
identifying the target audience, employing advertising techniques that can
appeal to the consumers, forecast supply requirements, etc.
For example: Use Business Analytics to gauge the effectiveness and impact
of a marketing strategy on the customers. Data can be used to build loyal
customers by giving them exactly what they want as per their specifications.
 HR Professionals
HR professionals can make use of data to find information about educational
background of high performing candidates, employee attrition rate, number
of years of service of employees, age, gender, etc. This information can play
a pivotal role in the selection procedure of a candidate.
For example: HR manager can predict the employee retention rate on the
basis of data given by Business Analytics.
 CRM
Business Analytics helps one analyze the key performance indicators, which
further helps in decision making and make strategies to boost the relationship
with the consumers. The demographics, and data about other socio-economic
factors, purchasing patterns, lifestyle, etc., are of prime importance to the
CRM department.
 For example: The company wants to improve its service in a particular
geographical segment. With data analytics, one can predict the customer’s
preferences in that particular segment, what appeals to them, and accordingly
improve relations with customers.
 Manufacturing
Business Analytics can help you in supply chain management, inventory
management, measure performance of targets, risk mitigation plans, improve
efficiency in the basis of product data, etc.
For example: The Manager wants information on performance of a
machinery which has been used past 10 years. The historical data will help
evaluate the performance of the machinery and decide whether costs of
maintaining the machine will exceed the cost of buying a new machinery.
 Credit Card Companies
Credit card transactions of a customer can determine many factors: financial
health, life style, preferences of purchases, behavioral trends, etc.
For example: Credit card companies can help the retail sector by locating the
target audience. According to the transactions reports, retail companies can
predict the choices of the consumers, their spending pattern, preference over
buying competitor’s products, etc. This historical as well as real-time
information helps them direct their marketing strategies in such a way that it
hits the dart and reaches the right audience.
UNIT-IV
OBJECT SEGMENTATION
Regression Vs Classification:

Regression and Classification algorithms are Supervised Learning algorithms. Both


the algorithms are used for prediction in Machine learning and work with the labeled
datasets. But the difference between both is how they are used for different machine
learning problems.

The main difference between Regression and Classification algorithms is that


Regression algorithms are used to predict the continuous values such as price,
salary, age, etc. and Classification algorithms are used to predict/Classify the
discrete values such as Male or Female, True or False, Spam or Not Spam, etc.

Classification predictive modeling problems are different from regression predictive


modeling problems.

 Classification is the task of predicting a discrete class label.


 Regression is the task of predicting a continuous quantity.
There is some overlap between the algorithms for classification and regression; for
example:

 A classification algorithm may predict a continuous value, but the continuous value
is in the form of a probability for a class label.
 A regression algorithm may predict a discrete value, but the discrete value is in the
form of an integer quantity.
Some algorithms can be used for both classification and regression with small
modifications, such as decision trees and artificial neural networks. Some algorithms
cannot, or cannot easily be used for both problem types, such as linear regression for
regression predictive modeling and logistic regression for classification predictive
modeling.

Importantly, the way that we evaluate classification and regression predictions varies
and does not overlap, for example:

 Classification predictions can be evaluated using accuracy, whereas regression


predictions cannot.
 Regression predictions can be evaluated using root mean squared error, whereas
classification predictions cannot.

Supervised and Unsupervised Learning:


Supervised Machine Learning:

Supervised learning is the type of machine learning in which machines are trained
using well "labelled" training data, and on basis of that data, machines predict the
output. The labelled data means some input data is already tagged with the correct
output.

In supervised learning, the training data provided to the machine works as the
supervisor that teaches the machine to predict the output correctly. It applies the
same concept as a student learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output


data to the machine learning model. The aim of a supervised learning algorithm is
to find a mapping function to map the input variable(x) with the output variable(y).

In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.
How Supervised Learning Works?

In supervised learning, models are trained using labelled dataset, where the model
learns about each type of data. Once the training process is completed, the model is
tested on the basis of test data (a separate set of examples held out from training), and then it predicts the
output.

The working of Supervised learning can be easily understood by the below example
and diagram:
Suppose we have a dataset of different types of shapes which includes square,
rectangle, triangle, and Polygon. Now the first step is that we need to train the model
for each shape.

o If the given shape has four sides, and all the sides are equal, then it will be
labelled as a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides then it will be labelled as hexagon.

Now, after training, we test our model using the test set, and the task of the model is
to identify the shape.

The machine is already trained on all types of shapes, and when it finds a new shape,
it classifies the shape on the bases of a number of sides, and predicts the output.

Steps Involved in Supervised Learning:


o First Determine the type of training dataset
o Collect/Gather the labelled training data.
o Split the training dataset into training dataset, test dataset, and validation
dataset.
o Determine the input features of the training dataset, which should have
enough knowledge so that the model can accurately predict the output.
o Determine the suitable algorithm for the model, such as support vector
machine, decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need validation
sets as the control parameters, which are the subset of training datasets.
o Evaluate the accuracy of the model by providing the test set. If the model
predicts the correct outputs, our model is accurate.
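
A hedged sketch that walks through these steps with scikit-learn: split labelled data,
pick an algorithm (a decision tree here), train it, and evaluate accuracy on the
held-out test set. The iris dataset is an illustrative assumption:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)                 # learn from labelled training data

print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))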

Types of supervised Machine learning Algorithms:

Supervised learning can be further divided into two types of problems:

1. Regression

Regression algorithms are used if there is a relationship between the input variable
and the output variable. It is used for the prediction of continuous variables, such as
Weather forecasting, Market Trends, etc. Below are some popular Regression
algorithms which come under supervised learning:

o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression

2. Classification

Classification algorithms are used when the output variable is categorical, which
means there are two classes such as Yes-No, Male-Female, True-false, etc.

Spam filtering is a common example. Popular classification algorithms which come under supervised learning include:

o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines

Advantages of Supervised learning:


o With the help of supervised learning, the model can predict the output on the
basis of prior experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such
as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:


o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is
different from the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.

Unsupervised Machine Learning:

In the previous topic, we learned supervised machine learning in which models are
trained using labeled data under the supervision of training data. But there may be
many cases in which we do not have labeled data and need to find the hidden patterns
from the given dataset. So, to solve such types of cases in machine learning, we need
unsupervised learning techniques.

What is Unsupervised Learning?

As the name suggests, unsupervised learning is a machine learning technique in


which models are not supervised using a training dataset. Instead, the model itself finds
the hidden patterns and insights from the given data. It can be compared to learning
which takes place in the human brain while learning new things. It can be defined
as:

Unsupervised learning is a type of machine learning in which models are trained


using unlabeled dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification


problem because unlike supervised learning, we have the input data but no
corresponding output data. The goal of unsupervised learning is to find the
underlying structure of dataset, group that data according to similarities, and
represent that dataset in a compressed format.

Example: Suppose the unsupervised learning algorithm is given an input dataset


containing images of different types of cats and dogs. The algorithm is never trained
upon the given dataset, which means it does not have any idea about the features of
the dataset. The task of the unsupervised learning algorithm is to identify the image
features on their own. Unsupervised learning algorithm will perform this task by
clustering the image dataset into the groups according to similarities between
images.

Why use Unsupervised Learning?

Below are some main reasons which describe the importance of Unsupervised
Learning:

o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much like how a human learns to think from their own
experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which
makes unsupervised learning more important.
o In real-world, we do not always have input data with the corresponding output
so to solve such cases, we need unsupervised learning.

Working of Unsupervised Learning

Working of unsupervised learning can be understood by the below diagram:


Here, we have taken an unlabeled input data, which means it is not categorized and
corresponding outputs are also not given. Now, this unlabeled input data is fed to the
machine learning model in order to train it. Firstly, it will interpret the raw data to
find the hidden patterns from the data and then will apply suitable algorithms such
as k-means clustering, hierarchical clustering, etc.

Once it applies the suitable algorithm, the algorithm divides the data objects into
groups according to the similarities and difference between the objects.

Types of Unsupervised Learning Algorithm:

The unsupervised learning algorithm can be further categorized into two types of
problems:
o Clustering: Clustering is a method of grouping the objects into clusters such
that objects with most similarities remains into a group and has less or no
similarities with the objects of another group. Cluster analysis finds the
commonalities between the data objects and categorizes them as per the
presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method which is
used for finding the relationships between variables in the large database. It
determines the set of items that occurs together in the dataset. Association rule
makes marketing strategy more effective. For example, people who buy item X
(say, bread) also tend to purchase item Y (butter/jam). A typical
example of Association rule is Market Basket Analysis.

Unsupervised Learning algorithms:

Below is the list of some popular unsupervised learning algorithms:

o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchal clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
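
As a minimal sketch of unsupervised clustering, here is k-means (one of the algorithms
listed above) applied to synthetic, unlabeled 2-D data; the number of clusters and the
data are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Unlabeled data: two blobs of points in 2-D
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(data)
print("Cluster centres:\n", kmeans.cluster_centers_)
print("First ten labels:", kmeans.labels_[:10])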
Advantages of Unsupervised Learning

o Unsupervised learning is used for more complex tasks as compared to


supervised learning because, in unsupervised learning, we don't have labeled
input data.
o Unsupervised learning is preferable as it is easy to get unlabeled data in
comparison to labeled data.
Disadvantages of Unsupervised Learning

o Unsupervised learning is intrinsically more difficult than supervised learning


as it does not have corresponding output.
o The result of the unsupervised learning algorithm might be less accurate as
input data is not labeled, and algorithms do not know the exact output in
advance.

Difference b/w Supervised and Unsupervised Learning :

Criterion                  SUPERVISED LEARNING                UNSUPERVISED LEARNING

Input Data                 Uses known and labeled data        Uses unknown (unlabeled) data
                           as input                           as input

Computational Complexity   Very complex                       Less computational complexity

Real Time                  Uses off-line analysis             Uses real-time analysis of data

Number of Classes          Number of classes is known         Number of classes is not known

Accuracy of Results        Accurate and reliable results      Moderately accurate and
                                                              reliable results

Tree Building – Regression:


Decision tree learning is a method commonly used in data mining. The goal is to
create a model that predicts the value of a target variable based on several input
variables. An example is shown on the right. Each interior node corresponds to one
of the input variables; there are edges to children for each of the possible values of
that input variable. Each leaf represents a value of the target variable given the values
of the input variables represented by the path from the root to the leaf.
Decision trees used in data mining are of two main types:
 Classification tree analysis is when the predicted outcome is the class to
which the data belongs.
 Regression tree analysis is when the predicted outcome can be
considered a real number (e.g. the price of a house, or a
patient’s length of stay in a hospital).
 Decision Tree is a Supervised learning technique that can be used for
both classification and Regression problems, but mostly it is preferred for
solving Classification problems. It is a tree-structured classifier,
where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
 In a Decision tree, there are two nodes, which are the Decision
Node and Leaf Node. Decision nodes are used to make any decision and
have multiple branches, whereas Leaf nodes are the output of those
decisions and do not contain any further branches.
 The decisions or the test are performed on the basis of features of the given
dataset.
 It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
 It is called a decision tree because, similar to a tree, it starts with the root
node, which expands on further branches and constructs a tree-like
structure.
 In order to build a tree, we use the CART algorithm, which stands for
Classification and Regression Tree algorithm.
 A decision tree simply asks a question and, based on the answer (Yes/No),
further splits the tree into subtrees.
 Below diagram explains the general structure of a decision tree:

There are various algorithms in Machine learning, so choosing the best algorithm
for the given dataset and problem is the main point to remember while creating a
machine learning model. Below are the two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a


decision, so it is easy to understand.
o The logic behind the decision tree can be easily understood because it shows
a tree-like structure.

Decision Tree Terminologies


 Root Node: Root node is from where the decision tree starts. It represents
the entire dataset, which further gets divided into two or more homogeneous
sets.
 Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
 Splitting: Splitting is the process of dividing the decision node/root node
into sub-nodes according to the given conditions.
 Branch/Sub Tree: A tree formed by splitting the tree.
 Pruning: Pruning is the process of removing the unwanted branches from
the tree.
 Parent/Child node: The root node of the tree is called the parent node, and
other nodes are called the child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts
from the root node of the tree. This algorithm compares the values of root attribute
with the record (real dataset) attribute and, based on the comparison, follows the
branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other
sub-nodes and moves further. It continues the process until it reaches the leaf node
of the tree. The complete process can be better understood using the below
algorithm:

o Step-1: Begin the tree with the root node, says S, which contains the
complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection
Measure (ASM).
o Step-3: Divide the S into subsets that contains possible values for the best
attributes.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset
created in step 3. Continue this process until a stage is reached where you
cannot further classify the nodes; call the final node a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or Not. So, to solve this problem, the decision
tree starts with the root node (Salary attribute by ASM). The root node splits further
into the next decision node (distance from the office) and one leaf node based on the
corresponding labels. The next decision node further gets split into one decision
node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf
nodes (Accepted offers and Declined offer). Consider the below diagram:

Attribute Selection Measures

While implementing a Decision tree, the main issue arises that how to select the best
attribute for the root node and for sub-nodes. So, to solve such problems there is a
technique which is called as Attribute selection measure or ASM. By this
measurement, we can easily select the best attribute for the nodes of the tree. There
are two popular techniques for ASM, which are:

o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the
decision tree.
o A decision tree algorithm always tries to maximize the value of information
gain, and a node/attribute having the highest information gain is split first. It
can be calculated using the below formula:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It


specifies randomness in data. Entropy can be calculated as:

Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

Where,

o S= Total number of samples


o P(yes)= probability of yes
o P(no)= probability of no

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision
tree in the CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the
high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to
create binary splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 − Σj pj²
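
A small sketch computing entropy and the Gini index for a node using the two formulas
above; the class counts (9 "yes", 5 "no") are hypothetical:

import math

def entropy(p_yes, p_no):
    terms = [p * math.log2(p) for p in (p_yes, p_no) if p > 0]
    return -sum(terms)

def gini(p_yes, p_no):
    return 1 - (p_yes ** 2 + p_no ** 2)

n_yes, n_no = 9, 5
total = n_yes + n_no
p_yes, p_no = n_yes / total, n_no / total

print("Entropy:", round(entropy(p_yes, p_no), 3))   # ~0.940
print("Gini index:", round(gini(p_yes, p_no), 3))   # ~0.459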

Overfitting and Underfitting:


Overfitting and Underfitting are the two main problems that occur in machine
learning and degrade the performance of the machine learning models.
The main goal of each machine learning model is to generalize well.
Here generalization defines the ability of an ML model to provide a suitable output
by adapting the given set of unknown input. It means after providing training on the
dataset, it can produce reliable and accurate output. Hence, the underfitting and
overfitting are the two terms that need to be checked for the performance of the
model and whether the model is generalizing well or not.

Before understanding the overfitting and underfitting, let's understand some basic
term that will help to understand this topic well:

o Signal: It refers to the true underlying pattern of the data that helps the
machine learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance
of the model.
o Bias: Bias is a prediction error that is introduced in the model due to
oversimplifying the machine learning algorithms. Or it is the difference
between the predicted values and the actual values.
o Variance: If the machine learning model performs well with the training
dataset, but does not perform well with the test dataset, then variance occurs.

Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points
or more than the required data points present in the given dataset. Because of this,
the model starts capturing noise and inaccurate values present in the dataset, and all
these factors reduce the efficiency and accuracy of the model. The overfitted model
has low bias and high variance.

The chances of overfitting increase the more training we provide: the more we train
our model, the higher the chance of ending up with an overfitted model.

Overfitting is the main problem that occurs in supervised learning.

Example: The concept of the overfitting can be understood by the below graph of
the linear regression output:
As we can see from the above graph, the model tries to cover all the data points
present in the scatter plot. It may look efficient, but in reality it is not: the goal of
the regression model is to find the best-fit line, and since we have not found a true
best fit here, the model will generate prediction errors on new data.

How to avoid the Overfitting in Model

Both overfitting and underfitting cause the degraded performance of the machine
learning model. But the main cause is overfitting, so there are some ways by which
we can reduce the occurrence of overfitting in our model.

o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling

Underfitting
Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data. To avoid overfitting, the feeding of training data can be
stopped at an early stage, due to which the model may not learn enough
from the training data. As a result, it may fail to find the best fit of the dominant
trend in the data.

In the case of underfitting, the model is not able to learn enough from the training
data, and hence it reduces the accuracy and produces unreliable predictions.

An underfitted model has high bias and low variance.


Example: We can understand the underfitting using below output of the linear
regression model:

As we can see from the above diagram, the model is unable to capture the data
points present in the plot.

How to avoid underfitting:


o By increasing the training time of the model.
o By increasing the number of features.

Goodness of Fit

The "Goodness of fit" term is taken from the statistics, and the goal of the machine
learning models to achieve the goodness of fit. In statistics modeling, it defines how
closely the result or predicted values match the true values of the dataset.

The model with a good fit is between the underfitted and overfitted model, and
ideally, it makes predictions with 0 errors, but in practice, it is difficult to achieve it.

Pruning and Complexity


Pruning is a process of deleting the unnecessary nodes from a tree in order to get the
optimal decision tree.

A too-large tree increases the risk of overfitting, and a small tree may not capture all
the important features of the dataset. Therefore, a technique that decreases the size
of the learning tree without reducing accuracy is known as Pruning. There are mainly
two types of tree pruning technology used:
o Cost Complexity Pruning
o Reduced Error Pruning.

Advantages of the Decision Tree


o It is simple to understand as it follows the same process which a human follow
while making any decision in real-life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random
Forest algorithm.
o For more class labels, the computational complexity of the decision tree may
increase.

Multiple Decision Trees:


Multiple decision trees taken together are called a Random Forest; each tree is built on
a random sample of the data. They make use of the concept of reducing entropy in the
data. At the end of the algorithm, the votes from the various trees are gathered to
generate a final response (outcome) for the observations.

The trees are trained independently of one another, so there is no sequentially
dependent training and the trees can be trained in parallel. In the process of building
the trees, the term "random" in random forest comes in two ways.

The first is the random selection of data points used for training each of the trees.
The second is the random selection of features used to create every tree. Since a single
decision tree tends to overfit the data, this randomness results in multiple decision
trees, each of which has good accuracy on a different subset of the available training
data.
Random forest is a machine learning algorithm used for classification, regression, etc.;
during training, multiple decision trees are built, and the output is the majority vote
(classification) or the mean prediction (regression) of the individual trees.

The steps of this algorithm are as follows,

1. Sampling and training of the data set is performed through the bagging process, which
provides a number of trees.

2. Nodes are split based on certain splitting criteria.

3. Data is divided at every node based on the splitting criteria.

4. Classification is performed on every leaf node.

5. The test data is then sampled, and every sample is distributed to all the trees.

6. Finally, the class of test data set is decided by majority voting or average process.
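
A hedged sketch of a random forest classifier following these steps: many trees are
trained on bootstrap samples with random feature subsets, and the final class is decided
by majority voting. The wine dataset and parameter values are assumptions for
illustration:

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7
)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees built from bootstrap samples
    max_features="sqrt",   # random subset of features considered at each split
    random_state=7,
)
forest.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, forest.predict(X_test)))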

Tools used to make Decision Tree:


Many data mining software packages provide implementations of one
or more decision tree algorithms. Several examples include:
 Salford Systems CART (which licensed the proprietary code of the
original CART authors)
 IBM SPSS Modeler
 Rapid Miner
 SAS Enterprise Miner
 Matlab
 R (an open source software environment for statistical
computing which includes several CART implementations
such as rpart, party and randomForest packages)
 Weka (a free and open-source data mining suite, contains many decision
tree algorithms)
 Orange (a free data mining software suite, which includes the tree module
orngTree)
 KNIME
 Microsoft SQL Server
 Scikit-learn (a free and open-source machine learning
library for the Python programming language).
There are many specific decision-tree algorithms. Notable ones include:
 ID3 (Iterative Dichotomiser 3)
 C4.5 (successor of ID3)
 CART (Classification And Regression Tree)
 CHAID (CHI-squared Automatic Interaction Detector).
Performs multi-level splits when computing
classification trees.
 MARS: extends decision trees to handle numerical data better.
 Conditional Inference Trees. Statistics-based approach that
uses non-parametric tests as splitting criteria, corrected for
multiple testing to avoid over fitting. This approach results in
unbiased predictor selection and does not require pruning.
ID3 and CART were invented independently at around the same
time (between 1970 and 1980), yet follow a similar approach for
learning decision tree from training tuples.

CART stands for Classification And Regression Tree.


CART algorithm was introduced in Breiman et al. (1984). A CART tree is a
binary decision tree that is constructed by splitting a node into two child nodes
repeatedly, beginning with the root node that contains the whole learning
sample. The CART growing method attempts to maximize within-node
homogeneity. The extent to which a node does not represent a homogenous
subset of cases is an indication of impurity. For example, a terminal node in
which all cases have the same value for the dependent variable is a homogenous
node that requires no further splitting because it is "pure." For categorical
(nominal, ordinal) dependent variables the common measure of impurity is Gini,
which is based on squared probabilities of membership for each category. Splits
are found that maximize the homogeneity of child nodes with respect to the
value of the dependent variable.

CHAID:
CHAID stands for CHI-squared Automatic Interaction Detector.
Morgan and Sonquist (1963) proposed a simple method for fitting
trees to predict a quantitative variable. They called the method AID,
for Automatic Interaction Detection. The algorithm performs stepwise
splitting. It begins with a single cluster of cases and searches a
candidate set of predictor variables for a way to split this cluster into
two clusters. Each predictor is tested for splitting as follows: sort all
the n cases on the predictor and examine all n-1 ways to split the
cluster in two. For each possible split, compute the within-cluster sum
of squares about the mean of the cluster on the dependent variable.
Choose the best of the n-1 splits to represent the predictor’s
contribution. Now do this for every other predictor. For the actual
split, choose the predictor and its cut point which yields the smallest
overall within-cluster sum of squares. Categorical predictors require
a different approach. Since categories are unordered, all possible splits
between categories must be considered. For deciding on one split of k
categories into two groups, this means that 2^(k−1) − 1 possible splits must
be considered. Once a split is found, its suitability is measured on the
same within-cluster sum of squares as for a quantitative predictor.
Morgan and Sonquist called their algorithm AID because it naturally
Incorporates interaction among predictors. Interaction is not
correlation. It has to do instead with conditional discrepancies. In the
analysis of variance, interaction means that a trend within one level of
a variable is not parallel to a trend within another level of the same
variable. In the ANOVA model, interaction is represented by cross-
products between predictors. In the tree model, it is represented by
branches from the same nodes which have different splitting
predictors further down the tree.
Regression trees parallel regression/ANOVA modeling, in which the
dependent variable is quantitative. Classification trees parallel
discriminant analysis and algebraic classification methods. Kass
(1980) proposed a modification to AID called CHAID for categorized
dependent and independent variables.
His algorithm incorporated a sequential merge and split procedure
based on a chi-square test statistic. Kass was concerned about
computation time (although this has since proved an unnecessary
worry), so he decided to settle for a sub-optimal split on each
predictor instead of searching for all possible combinations of the
categories. Kass’s algorithm is like sequential cross-tabulation. For
each predictor:

1) cross tabulate the m categories of the predictor with the k categories of the
dependent variable,
2) find the pair of categories of the predictor whose 2xk sub-
table is least significantly different on a chi-square test and
merge these two categories;
3) if the chi-square test statistic is not “significant” according to a
preset critical value, repeat this merging process for the selected
predictor until no non-significant chi-square is found for a sub-
table, and pick the predictor variable whose chi-square is largest and
split the sample into l subsets, where l is the number of categories
resulting from the merging process on that predictor;
4) Continue splitting, as with AID, until no “significant” chi-squares result.
The CHAID algorithm saves some computer time, but it is not
guaranteed to find the splits which predict best at a given step. Only
by searching all possible category subsets can we do that. CHAID is
also limited to categorical predictors, so it cannot be used for
quantitative or mixed categorical- quantitative models.

Time Series Methods:


Components of Time Series
Long term trend – The smooth long term direction of time series
where the data can increase or decrease in some pattern.
Seasonal variation – Patterns of change in a time series within a
year which tends to repeat every year.
Cyclical variation – It is much like seasonal variation, but the rise
and fall of the time series occur over periods longer than one year.
Irregular variation – Any variation that is not explainable by any
of the three above mentioned components. They can be classified
into – stationary and non – stationary variation.
When the data neither increases nor decreases, i.e. it is completely
random, it is called stationary variation.
When the data has some explainable portion remaining and can be
analysed further, such a case is called non-stationary variation.

ARIMA:
A popular and widely used statistical method for time series forecasting is the ARIMA model.

ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a class of
model that captures a suite of different standard temporal structures in time series data.

In this tutorial, you will discover how to develop an ARIMA model for time series
forecasting in Python.
After completing this tutorial, you will know:
About the ARIMA model, the parameters used, and the assumptions made by the model.
How to fit an ARIMA model to data and use it to make forecasts.
How to configure the ARIMA model on your time series problem.

Autoregressive Integrated Moving Average Model


An ARIMA model is a class of statistical models for analyzing and forecasting time series data.
It explicitly caters to a suite of standard structures in time series data, and as such provides
a simple yet powerful method for making skillful time series forecasts.

ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a


generalization of the simpler AutoRegressive Moving Average and adds the notion of integration.

This acronym is descriptive, capturing the key aspects of the model itself. Briefly, they are:

AR: Autoregression. A model that uses the dependent relationship between an


observation and some number of lagged observations.
I: Integrated. The use of differencing of raw observations (e.g. subtracting an observation
from an observation at the previous time step) in order to make the time series stationary.
MA: Moving Average. A model that uses the dependency between an observation and a
residual error from a moving average model applied to lagged observations.
Each of these components are explicitly specified in the model as a parameter. A standard notation
is used of ARIMA(p,d,q) where the parameters are substituted with integer values to quickly
indicate the specific ARIMA model being used.

The parameters of the ARIMA model are defined as follows:

p: The number of lag observations included in the model, also called the lag order.
d: The number of times that the raw observations are differenced, also called thedegree
of differencing.
q: The size of the moving average window, also called the order of the moving average.

A linear regression model is constructed including the specified number and type of terms, and
the data is prepared by a degree of differencing in order to make it stationary, i.e. to remove
trend and seasonal structures that negatively affect the regression model.

A value of 0 can be used for a parameter, which indicates not to use that element of the model.
This way, the ARIMA model can be configured to perform the function of an ARMA model, and even
a simple AR, I, or MA model.
Adopting an ARIMA model for a time series assumes that the underlying process that generated
the observations is an ARIMA process. This may seem obvious, but helps to motivate the need to
confirm the assumptions of the model in the raw observations and in the residual errors of forecasts
from the model.
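
A hedged sketch of fitting an ARIMA(p, d, q) model with statsmodels and producing a
short forecast. The synthetic monthly series and the (1, 1, 1) order are illustrative
assumptions; in practice p, d and q are chosen from the data:

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# A simple trending series with noise, indexed by month
values = np.cumsum(rng.normal(loc=0.5, scale=1.0, size=60))
series = pd.Series(values, index=pd.date_range("2020-01-01", periods=60, freq="MS"))

model = ARIMA(series, order=(1, 1, 1))   # p=1, d=1, q=1
result = model.fit()

print("Fitted parameters:\n", result.params)
print("Next 3 forecasts:\n", result.forecast(steps=3))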

Measure of Forecast Accuracy:


Forecast error is the difference between the actual and forecast value for a given
period:

Et = At − Ft

where Et = forecast error for period t

At = actual value for period t

Ft = forecast for period t

Forecast error

Difference between forecast and actual value for a given period.

However, error for one time period does not tell us very much. We need to measure
forecast accuracy over time. Two of the most commonly used error measures are
the mean absolute deviation (MAD) and the mean squared error (MSE). MAD is the
average of the sum of the absolute errors:

MAD = Σ|Et| / n

We measure Forecast Accuracy by 2 methods :

1. Mean Forecast Error (MFE)

For n time periods where we have actual demand and forecast values:

MFE = Σ(At − Ft) / n

Ideal value = 0;
MFE > 0: the model tends to under-forecast
MFE < 0: the model tends to over-forecast
2. Mean Absolute Deviation (MAD)
For n time periods where we have actual demand and forecast values:

MAD = Σ|At − Ft| / n

While MFE is a measure of forecast model bias, MAD indicates the absolute size of
the errors
Uses of Forecast error:

 Forecast model bias


 Absolute size of the forecast errors
 Compare alternative forecasting models
 Identify forecast models that need adjustment
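
A minimal sketch computing the forecast error, MFE and MAD for a few periods, using
hypothetical actual and forecast values:

import numpy as np

actual = np.array([100, 110, 120, 115, 130], dtype=float)    # At
forecast = np.array([102, 108, 118, 120, 125], dtype=float)  # Ft

errors = actual - forecast          # Et = At - Ft
mfe = errors.mean()                 # Mean Forecast Error (bias)
mad = np.abs(errors).mean()         # Mean Absolute Deviation

print("Errors:", errors)
print("MFE:", mfe, "-> model tends to", "under-forecast" if mfe > 0 else "over-forecast")
print("MAD:", mad)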

STL Approach

What is STL decomposition?

So, STL stands for Seasonal and Trend decomposition using Loess. This is a
statistical method of decomposing a Time Series data into 3 components containing
seasonality, trend and residual.
Now, what is a Time Series data? Well, it is a sequence of data points that varies
across a continuous time axis. Below is an example of a time series data where you
can see the time axis is at an hour level and value of stock varies across the time.
Now let’s talk about trend. Trend gives you a general direction of the overall data.
From the above example, I can say that from 9:00am to 11:00am there is a downward
trend and from 11:00am to 1:00pm there is an upward trend and after 1:00pm the
trend is constant.

Seasonality is a regular and predictable pattern that recurs at a fixed interval of time.
For example, consider the total units sold per month by a retailer: unit sales rise every
December, so there is a regular pattern, or seasonality, in the unit sales with a period
of 12 months.

Randomness, noise or residual is the random fluctuation or unpredictable change, something we
cannot guess. For example, unusually high unit sales in March 2014 that do not repeat in other
March months would show up as a high residual for that month.
Finally, Loess is a regression technique that uses locally weighted regression to fit a smooth
curve through points in a sequence, which in our case is the time series data.
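
A minimal sketch of STL decomposition in Python, assuming statsmodels (0.11 or later) and pandas are installed; the monthly sales series below is synthetic:

import pandas as pd
from statsmodels.tsa.seasonal import STL

idx = pd.date_range("2020-01-01", periods=36, freq="MS")          # 3 years, monthly
sales = pd.Series([100 + i + (30 if m == 12 else 0)               # upward trend plus
                   for i, m in enumerate(idx.month)],             # a December spike
                  index=idx, dtype=float)

result = STL(sales, period=12).fit()        # Loess-based decomposition
print(result.trend.head())                  # trend component
print(result.seasonal.head())               # seasonal component
print(result.resid.head())                  # residual (noise) component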

ETL Approach:
Extract, Transform and Load (ETL) refers to a process in database usage, and especially in data warehousing, that:
 Extracts data from homogeneous or heterogeneous data sources
 Transforms the data into a proper format or structure for querying and analysis
 Loads it into the final target (a database, and more specifically an operational data store, data mart, or data warehouse)

Usually all three phases execute in parallel. Since data extraction takes time, a transformation
process runs on the data that has already been received while the rest is still being pulled,
and as soon as some data is ready to be loaded into the target, the load kicks off without
waiting for the previous phases to complete.

ETL systems commonly integrate data from multiple applications


(systems), typically developed and supported by different vendors or
hosted on separate computer hardware. The disparate systems
containing the original data are frequently managed and operated by
different employees. For example, a cost accounting system may
combine data from payroll, sales, and purchasing.
Commercially available ETL tools include:

 Anatella
 Alteryx
 CampaignRunner
 ESF Database Migration Toolkit
 Informatica PowerCenter
 Talend
 IBM InfoSphere DataStage
 Ab Initio
 Oracle Data Integrator (ODI)
 Oracle Warehouse Builder (OWB)
 Microsoft SQL Server Integration Services (SSIS)
 Tomahawk Business Integrator by Novasoft Technologies.
 Pentaho Data Integration (or Kettle), an open-source data integration framework
 Stambia
 Diyotta DI-SUITE for Modern Data Integration
 FlyData
 Rhino ETL

There are various steps involved in ETL. They are as below in detail:

Extract:

The Extract step covers the data extraction from the source system and makes it
accessible for further processing. The main objective of the extract step is to
retrieve all the required data from the source system with as few resources as
possible. The extract step should be designed in a way that it does not negatively
affect the source system in terms of performance, response time or any kind of
locking.

There are several ways to perform the extract:

 Update notification - if the source system is able to provide a notification that a
record has been changed and describe the change, this is the easiest way to get the
data.
 Incremental extract - some systems may not be able to provide notification that an
update has occurred, but they are able to identify which records have been modified
and provide an extract of such records. During further ETL steps, the system needs
to identify changes and propagate them down. Note that by using a daily extract, we may
not be able to handle deleted records properly.
 Full extract - some systems are not able to identify which data has been changed at
all, so a full extract is the only way to get the data out of the system. A full
extract requires keeping a copy of the last extract in the same format in order to be
able to identify changes. A full extract handles deletions as well.
 When using incremental or full extracts, the extract frequency is extremely
important, particularly for full extracts, where the data volumes can be in the tens of gigabytes.

Clean:

The cleaning step is one of the most important as it ensures the quality of the data in
the data warehouse. Cleaning should perform basic data unification rules, such as:
 Making identifiers unique (sex categories Male/Female/Unknown, M/F/null,
Man/Woman/Not Available are translated to standard Male/Female/Unknown)
 Convert null values into standardized Not Available/Not Provided value
 Convert phone numbers, ZIP codes to a standardized form
 Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str
 Validate address fields against each other (State/Country, City/State, City/ZIP code,
City/Street).

Transform:

The transform step applies a set of rules to transform the data from the source to the
target. This includes converting any measured data to the same dimension (i.e.
conformed dimension) using the same units so that they can later be joined. The
transformation step also requires joining data from several sources, generating
aggregates, generating surrogate keys, sorting, deriving new calculated values, and
applying advanced validation rules.

Load:

During the load step, it is necessary to ensure that the load is performed correctly
and with as few resources as possible. The target of the load process is often a
database. In order to make the load process efficient, it is helpful to disable any
constraints and indexes before the load and re-enable them only after the load
completes. Referential integrity needs to be maintained by the ETL tool to ensure
consistency.
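
To make the phases concrete, here is an illustrative, tool-agnostic sketch of a tiny ETL job in Python using only the standard library; the file name, column names and cleaning rule are hypothetical:

import csv
import sqlite3

def extract(path):                                   # Extract: read the source file
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):                                 # Clean/Transform: unify identifiers
    cleaned = []
    for r in rows:
        sex = {"M": "Male", "F": "Female"}.get((r.get("sex") or "").strip().upper(),
                                               "Unknown")
        cleaned.append((r["customer_id"], sex, float(r["amount"] or 0)))
    return cleaned

def load(rows, db="warehouse.db"):                   # Load: write to the target store
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales(customer_id TEXT, sex TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

load(transform(extract("sales.csv")))                # assumes a sales.csv with these columns

A real warehouse load would add error handling and restartability, which is exactly the fail-recovery concern discussed next.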
Managing ETL Process

The ETL process seems quite straightforward. As with every application, there is a
possibility that the ETL process fails. This can be caused by missing extracts from
one of the systems, missing values in one of the reference tables, or simply a
connection or power outage. Therefore, it is necessary to design the ETL process
keeping fail-recovery in mind.

Feature Extraction:

Feature extraction is an attribute reduction process. Unlike feature selection, which
ranks the existing attributes according to their predictive significance, feature
extraction actually transforms the attributes. The transformed attributes, or features,
are linear combinations of the original attributes.

The feature extraction process results in a much smaller and richer set of attributes.
The maximum number of features is controlled by
the FEAT_NUM_FEATURES build setting for feature extraction models.

Models built on extracted features may be of higher quality, because the data is
described by fewer, more meaningful attributes.

Feature extraction projects a data set with higher dimensionality onto a smaller
number of dimensions. As such it is useful for data visualization, since a complex
data set can be effectively visualized when it is reduced to two or three dimensions.

Some applications of feature extraction are latent semantic analysis, data
compression, data decomposition and projection, and pattern recognition. Feature
extraction can also be used to enhance the speed and effectiveness of supervised
learning.
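
A brief sketch of one common feature extraction technique, PCA, assuming scikit-learn is installed; the 4-attribute customer matrix is made up:

from sklearn.decomposition import PCA

X = [[45000, 5000, 120, 34],
     [52000, 7000, 150, 41],
     [39000, 3000,  80, 29],
     [61000, 9000, 210, 47]]          # income, credit_limit, transaction_volume, age

pca = PCA(n_components=2)             # project 4 dimensions onto 2 extracted features
features = pca.fit_transform(X)       # each feature is a linear combination of attributes
print(features)
print(pca.explained_variance_ratio_)  # share of variance captured by each feature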

Feature extraction can be used to extract the themes of a document collection, where
documents are represented by a set of key words and their frequencies. Each theme
(feature) is represented by a combination of keywords. The documents in the
collection can then be expressed in terms of the discovered themes.
UNIT V
Data Visualization

Data Visualization
Data visualization is the graphic representation of data. It involves producing images
that communicate relationships among the represented data to viewers of the images.
This communication is achieved through the use of a systematic mapping between
graphic marks and data values in the creation of the visualization. This mapping
establishes how data values will be represented visually, determining how and to
what extent a property of a graphic mark, such as size or color, will change to reflect
change in the value of a datum.

Data visualization aims to communicate data clearly and effectively through
graphical representation. Data visualization has been used extensively in many
applications, e.g. business, health care and education. More popularly, data
visualization techniques are used to discover data relationships that are otherwise
not easily observable by looking at the raw data.
Data Visualization techniques:
Pixel oriented visualization techniques:
A simple way to visualize the value of a dimension is to use a pixel whose
color reflects the dimension’s value.
For a data set of m dimensions, pixel oriented techniques create m windows
on the screen, one for each dimension.
The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows.
The color of each pixel reflects the corresponding value.
Inside a window, the data values are arranged in some global order shared by
all windows.
Eg: All Electronics maintains a customer information table, which
consists of 4 dimensions: income, credit_limit, transaction_volume
and age. We analyze the correlation between income and the other
attributes by visualization.
We sort all customers by income in ascending order and use this order
to lay out the customer data in the 4 visualization windows, as shown
in the figure.
The pixel colors are chosen so that the smaller the value, the lighter the
shading.
Using pixel based visualization we can easily observe that credit_limit
increases as income increases, customers whose income is in the middle
range are more likely to purchase more from All Electronics, and there is
no clear correlation between income and age.

Fig: Pixel oriented visualization of 4 attributes, with all customers sorted by income in ascending order.
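
An illustrative sketch of the pixel-oriented idea, assuming numpy and matplotlib are installed; the customer data is synthetic, and each attribute gets its own window with records laid out in income order:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
income = np.sort(rng.uniform(20_000, 90_000, 400))               # global order: income
data = {"income": income,
        "credit_limit": income * 0.3 + rng.normal(0, 2_000, 400),
        "transaction_volume": rng.normal(100, 30, 400),
        "age": rng.uniform(18, 70, 400)}

fig, axes = plt.subplots(1, 4, figsize=(10, 3))
for ax, (name, values) in zip(axes, data.items()):
    ax.imshow(values.reshape(20, 20), cmap="Greys")   # lighter pixel = smaller value
    ax.set_title(name)
    ax.axis("off")
plt.show()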
Geometric Projection visualization techniques

A drawback of pixel-oriented visualization techniques is that they cannot
help us much in understanding the distribution of data in a
multidimensional space.
Geometric projection techniques help users find interesting
projections of multidimensional data sets.
A scatter plot displays 2-D data points using Cartesian co-ordinates. A third
dimension can be added using different colors or shapes to represent
different data points.
Eg: x and y are two spatial attributes and the third dimension is
represented by different shapes.
Through this visualization, we can see that points of types “+”
and “x” tend to be collocated.
Fig: Visualization of a 2-D data set using a scatter plot
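
A small sketch of such a scatter plot in Python, assuming matplotlib is installed; the points and their types are made up:

import matplotlib.pyplot as plt

x   = [1, 2, 3, 4, 5, 6]
y   = [2, 4, 1, 5, 3, 6]
grp = ["+", "+", "x", "x", "+", "x"]     # third dimension encoded as marker shape

for marker in set(grp):
    xs = [xi for xi, g in zip(x, grp) if g == marker]
    ys = [yi for yi, g in zip(y, grp) if g == marker]
    plt.scatter(xs, ys, marker=marker, label=f"type {marker}")

plt.xlabel("spatial attribute x")
plt.ylabel("spatial attribute y")
plt.legend()
plt.show()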

Icon based visualization techniques:

These use small icons to represent multidimensional data values.

Two popular icon based techniques are:
3.1 Chernoff faces: They display multidimensional data of up to 18
variables as a cartoon human face.

Fig: Chernoff faces; each face represents an n-dimensional data point (n < 18)

3.2 Stick figures: These map multidimensional data to a five-piece stick figure,
where each figure has 4 limbs and a body.
Two dimensions are mapped to the display axes and the remaining
dimensions are mapped to the angle and/or length of the limbs.

Hierarchical visualization techniques (i.e. subspaces)


The subspaces are visualized in a hierarchical manner, so it is important to show
how one set of data values compares to one or more other data value sets.
A dendrogram is an illustration of a hierarchical clustering of various data sets,
helping to understand their relations at a glance.


A sunburst chart (or a ring chart) is a pie chart with concentric circles, describing
the hierarchy of data values.
A tree diagram describes the tree-like relations within the data
structure, usually drawn from the top down or from left to right.

These forms of data visualization are mostly useful for depicting the hierarchy or
relations of different variables within a data set. However, they are not well suited
for showing the relations between multiple data sets; network data models work
best for that purpose.

Visualizing Complex Data and Relations

Most visualization techniques were mainly for numeric data. Recently, more and
more non-numeric data, such as text and social networks, have become available.
Many people on the Web tag various objects such as pictures, blog entries, and
product reviews. A tag cloud is a visualization of statistics of user-generated tags.
Often, in a tag cloud, tags are listed alphabetically or in a user-preferred order.
The importance of a tag is indicated by font size or color.
NOTES-3

CS513PE: DATA ANALYTICS (Professional Elective - I)


DATA ANALYTICS UNIT –I

Prerequisites:
1. A course on “Database Management Systems”.
2. Knowledge of probability and statistics.

Course Objectives:
1. To explore the fundamental concepts of data analytics.
2. To learn the principles and methods of statistical analysis
3. Discover interesting patterns, analyze supervised and unsupervised models and estimate the accuracy of the
algorithms.
4. To understand the various search methods and visualization techniques.

Course Outcomes: After completion of this course students will be able to


1. Understand the impact of data analytics for business decisions and strategy
2. Carry out data analysis/statistical analysis
3. To carry out standard data visualization and formal inference procedures
4. Design Data Architecture
5. Understand various Data Sources

INTRODUCTION:

In the early days of computers and the Internet, there was far less data than there is today.
It could easily be stored and managed by users and business enterprises on a single computer,
because the total volume of data never exceeded about 19 exabytes; now, around 2.5 quintillion
bytes of data are generated every day.

Most of this data is generated from social media sites like Facebook, Instagram and Twitter, and other
sources include e-business and e-commerce transactions, and hospital, school and bank data. This data
is impossible to manage with traditional data storing techniques. Whether the data is generated by
large-scale enterprises or by an individual, every aspect of it needs to be analysed to benefit from it.
But how do we do it? That is where the term ‘Data Analytics’ comes in.

Why is Data Analytics important?


Data Analytics has a key role in improving your business as it is used to gather hidden insights,
Interesting Patterns in Data, generate reports, perform market analysis, and improve business
requirements.

What is the role of Data Analytics?

 Gather Hidden Insights – Hidden insights from data are gathered and then analyzed with
respect to business requirements.


 Generate Reports – Reports are generated from the data and are passed on to the respective
teams and individuals to deal with further actions for a high rise in business.
 Perform Market Analysis – Market Analysis can be performed to understand the strengths
and weaknesses of competitors.
 Improve Business Requirement – Analysis of Data allows improving Business to customer
requirements and experience.

What are the tools used in Data Analytics?


With the increasing demand for Data Analytics in the market, many tools have emerged with various
functionalities for this purpose. Either open-source or user-friendly, the top tools in the data analytics
market are as follows.
 R programming
 Python
 Tableau Public
 QlikView
 SAS
 Microsoft Excel
 RapidMiner
 KNIME
 OpenRefine
 Apache Spark

Data and architecture design:

Data architecture in Information Technology is composed of models, policies, rules or standards that
govern which data is collected, and how it is stored, arranged, integrated, and put to use in data
systems and in organizations.

 A data architecture should set data standards for all its data systems as a vision or a model of
the eventual interactions between those data systems.
 Data architectures address data in storage and data in motion; descriptions of data stores, data
groups and data items; and mappings of those data artifacts to data qualities, applications,
locations etc.
 Essential to realizing the target state, Data Architecture describes how data is processed, stored,
and utilized in a given system. It provides criteria for data processing operations that make it
possible to design data flows and also control the flow of data in the system.
 The Data Architect is typically responsible for defining the target state, aligning during
development and then following up to ensure enhancements are done in the spirit of the original
blueprint.


During the definition of the target state, the Data Architecture breaks a subject down to the atomic
level and then builds it back up to the desired form.

The Data Architect breaks the subject down by going through 3 traditional architectural processes:

Conceptual model: It is a business model which uses Entity Relationship (ER) model for relation
between entities and their attributes.
Logical model: It is a model where problems are represented in the form of logic, such as rows and
columns of data, classes, XML tags and other DBMS constructs.
Physical model: The physical model holds the database design, such as which type of database technology
will be suitable for the architecture.

Layer | View | Data (What) | Stakeholder
1 | Scope/Contextual | List of things and architectural standards important to the business | Planner
2 | Business Model/Conceptual | Semantic model or Conceptual/Enterprise Data Model | Owner
3 | System Model/Logical | Enterprise/Logical Data Model | Designer
4 | Technology Model/Physical | Physical Data Model | Builder
5 | Detailed Representations | Actual databases | Subcontractor

The data architecture is formed by dividing it into these three essential models, which are then combined.

Factors that influence Data Architecture:


Various constraints and influences have an effect on data architecture design. These include
enterprise requirements, technology drivers, economics, business policies and data processing needs.
Enterprise requirements:
 These will generally include such elements as economical and effective system expansion,
acceptable performance levels (especially system access speed), transaction reliability, and
transparent data management.
 In addition, the conversion of raw data such as transaction records and image files into more
useful information forms through such features as data warehouses is also a common
organizational requirement, since this enables managerial decision making and other
organizational processes.
 One of the architecture techniques is the split between managing transaction data and (master)
reference data. Another one is splitting data capture systems from data retrieval systems (as
done in a data warehouse).
Technology drivers:
 These are usually suggested by the completed data architecture and database architecture
designs.
 In addition, some technology drivers will derive from existing organizational integration
frameworks and standards, organizational economics, and existing site resources (e.g.
previously purchased software licensing).
Economics:
 These are also important factors that must be considered during the data architecture phase.
It is possible that some solutions, while optimal in principle, may not be potential candidates
due to their cost.
 External factors such as the business cycle, interest rates, market conditions, and legal
considerations could all have an effect on decisions relevant to data architecture.
Business policies:
 Business policies that also drive data architecture design include internal organizational policies,
rules of regulatory bodies, professional standards, and applicable governmental laws that can
vary by applicable agency.
 These policies and rules will help describe the manner in which enterprise wishes to process
their data.
Data processing needs
 These include accurate and reproducible transactions performed in high volumes, data
warehousing for the support of management information systems (and potential data mining),
repetitive periodic reporting, ad hoc reporting, and support of various organizational initiatives
as required (i.e. annual budgets, new product development)
 The general approach is based on designing the architecture at three levels of specification:
 The Logical Level
 The Physical Level
 The Implementation Level

Understand various sources of the Data:


 Data can be generated from two types of sources, namely Primary and Secondary sources.
 Data collection is the process of acquiring, collecting, extracting, and storing the voluminous
amount of data which may be in the structured or unstructured form like text, video, audio,
XML files, records, or other image files used in later stages of data analysis.
 In the process of big data analysis, “Data collection” is the initial step before starting to analyse
the patterns or useful information in data. The data which is to be analysed must be collected
from different valid sources.
 The data which is collected is known as raw data, which is not useful as it is; cleaning the
impure data and utilizing it for further analysis forms information, and the insight obtained
from that information is known as “knowledge”. Knowledge has many meanings, like business
knowledge, sales of enterprise products, disease treatment, etc.
 The main goal of data collection is to collect information-rich data.
 Data collection starts with asking some questions such as what type of data is to be collected
and what is the source of collection.
 Most of the data collected is of two types: qualitative data, which is non-numerical data such
as words and sentences that mostly focus on the behaviour and actions of the group, and
quantitative data, which is in numerical form and can be calculated using different scientific
tools and sampling methods.
The actual data is then further divided mainly into two types known as:
1. Primary data
2. Secondary data

1. Primary data:
 The data which is Raw, original, and extracted directly from the official sources is known as
primary data. This type of data is collected directly by performing techniques such as


questionnaires, interviews, and surveys. The data collected must be according to the demand
and requirements of the target audience on which analysis is performed otherwise it would be
a burden in the data processing.
Few methods of collecting primary data:
1. Interview method:
 The data collected during this process is through interviewing the target audience by a person
called interviewer and the person who answers the interview is known as the interviewee.
 Some basic business or product related questions are asked and noted down in the form of
notes, audio, or video and this data is stored for processing.
 These can be both structured and unstructured like personal interviews or formal interviews
through telephone, face to face, email, etc.

2. Survey method:
 The survey method is the process of research where a list of relevant questions are asked and
answers are noted down in the form of text, audio, or video.
 The survey method can be obtained in both online and offline mode like through website forms
and email. Then that survey answers are stored for analysing data. Examples are online surveys
or surveys through social media polls.
3. Observation method:
 The observation method is a method of data collection in which the researcher keenly observes
the behaviour and practices of the target audience using some data collecting tool and stores
the observed data in the form of text, audio, video, or any raw formats.
 In this method, the data is collected directly by observing the participants rather than by posing
questions to them. For example, observing a group of customers and their behaviour towards the
products. The data obtained is then sent for processing.
4. Experimental method:
 The experimental method is the process of collecting data through performing experiments,
research, and investigation.
 The most frequently used experiment methods are CRD, RBD, LSD, FD.
CRD- Completely Randomized design is a simple experimental design used in data analytics which
is based on randomization and replication. It is mostly used for comparing the experiments.
RBD- Randomized Block Design is an experimental design in which the experiment is divided into
small units called blocks.
 Random experiments are performed on each of the blocks and results are drawn using a
technique known as analysis of variance (ANOVA). RBD was originated from the agriculture
sector.


 Randomized Block Design - The Term Randomized Block Design has originated from agricultural
research. In this design several treatments of variables are applied to different blocks of land
to ascertain their effect on the yield of the crop.
 Blocks are formed in such a manner that each block contains as many plots as a number of
treatments so that one plot from each is selected at random for each treatment. The production
of each plot is measured after the treatment is given.
 These data are then interpreted and inferences are drawn by using the analysis of Variance
technique so as to know the effect of various treatments like different doses of fertilizers,
different types of irrigation etc.
LSD – Latin Square Design is an experimental design that is similar to CRD and RBD blocks but
contains rows and columns.
 It is an arrangement of N x N squares with an equal number of rows and columns, containing
letters that occur only once in each row. Hence the differences can easily be found with fewer errors
in the experiment. A Sudoku puzzle is an example of a Latin square design.
 A Latin square is one of the experimental designs which has a balanced two-way classification
scheme say for example - 4 X 4 arrangement. In this scheme each letter from A to D occurs only
once in each row and also only once in each column.
 The Latin square is probably underused in most fields of research because textbook examples
tend to be restricted to agriculture, the area which spawned most of the original work on ANOVA.
Agricultural examples often reflect geographical designs where rows and columns are literally two
dimensions of a grid in a field.
 Rows and columns can be any two sources of variation in an experiment. In this sense a Latin
square is a generalisation of a randomized block design with two different blocking systems
 A B C D
B C D A
C D A B
D A B C
 The balance arrangement achieved in a Latin Square is its main strength. In this design, the
comparisons among treatments, will be free from both differences between rows and columns.
Thus, the magnitude of error will be smaller than any other design.
FD- Factorial Design is an experimental design where each experiment has two or more factors, each with
possible values, and on performing trials the other combinational factors are derived. This design allows the
experimenter to test two or more variables simultaneously. It also measures the interaction effects of the
variables and analyses the impact of each variable. In a true experiment, randomization is
essential so that the experimenter can infer cause and effect without any bias.

2. Secondary data:


Secondary data is the data which has already been collected and reused again for some valid purpose.
This type of data is previously recorded from primary data and it has two types of sources named
internal source and external source.
Internal source:
These types of data can easily be found within the organization such as market record, a sales record,
transactions, customer data, accounting resources, etc. The cost and time consumption is less in
obtaining internal sources.
 Accounting resources- This gives so much information which can be used by the marketing
researcher. They give information about internal factors.
 Sales Force Report- It gives information about the sales of a product. The information
provided is from outside the organization.
 Internal Experts- These are people who are heading the various departments. They can give
an idea of how a particular thing is working.
 Miscellaneous Reports- These are what information you are getting from operational reports.
If the data available within the organization are unsuitable or inadequate, the marketer should
extend the search to external secondary data sources.
External source:
The data which can’t be found at internal organizations and can be gained through external third-party
resources is external source data. The cost and time consumption are more because this contains a
huge amount of data. Examples of external sources are Government publications, news publications,
Registrar General of India, planning commission, international labour bureau, syndicate services, and
other non-governmental publications.
1. Government Publications-
 Government sources provide an extremely rich pool of data for researchers. In addition,
much of this data is available free of cost on internet websites. There are a number of
government agencies generating data, such as:
Registrar General of India- It is an office which generates demographic data.
It includes details of gender, age, occupation etc.
2. Central Statistical Organization-
 This organization publishes the national accounts statistics. It contains estimates of national
income for several years, growth rate, and rate of major economic activities. Annual survey
of Industries is also published by the CSO.
 It gives information about the total number of workers employed, production units, material
used and value added by the manufacturer.
3. Director General of Commercial Intelligence-
 This office operates from Kolkata. It gives information about foreign trade i.e. import and
export. These figures are provided region-wise and country-wise.
4. Ministry of Commerce and Industries-


 This ministry through the office of economic advisor provides information on wholesale price
index. These indices may be related to a number of sectors like food, fuel, power, food grains
etc.
 It also generates All India Consumer Price Index numbers for industrial workers, urban, non-
manual employees and cultural labourers.
5. Planning Commission-
 It provides the basic statistics of Indian Economy.
6. Reserve Bank of India-
 This provides information on Banking Savings and investment. RBI also prepares currency
and finance reports.
7. Labour Bureau-
 It provides information on skilled, unskilled, white collared jobs etc.
8. National Sample Survey-
 This is done by the Ministry of Planning and it provides social, economic, demographic,
industrial and agricultural statistics.
9. Department of Economic Affairs-
 It conducts economic survey and it also generates information on income, consumption,
expenditure, investment, savings and foreign trade.
10. State Statistical Abstract-
 This gives information on various types of activities related to the state like - commercial
activities, education, occupation etc.
11. Non-Government Publications-
 These includes publications of various industrial and trade associations, such as The Indian
Cotton Mill Association Various chambers of commerce.
12. The Bombay Stock Exchange-
 It publishes a directory containing financial accounts, key profitability figures and other relevant matter.
 Various Associations of Press Media
 Export Promotion Council.
 Confederation of Indian Industries (CII)
 Small Industries Development Board of India
 Different Mills like - Woollen mills, Textile mills etc
 The only disadvantage of the above sources is that the data may be biased, since such bodies
are likely to gloss over their negative points.
13. Syndicate Services-
 These services are provided by certain organizations which collect and tabulate the marketing
information on a regular basis for a number of clients who are the subscribers to these
services.
 These services are useful in television viewing, movement of consumer goods etc.


 These syndicate services provide information data from both household as well as institution.

In collecting data from household, they use three approaches:


Survey- They conduct surveys regarding - lifestyle, sociographic, general topics.
Mail Diary Panel- It may be related to 2 fields - Purchase and Media.
Electronic Scanner Services- These are used to generate data on volume.
They collect data for Institutions from
 Whole sellers
 Retailers, and
 Industrial Firms
 Various syndicate services are Operations Research Group (ORG) and The Indian Marketing
Research Bureau (IMRB).
Importance of Syndicate Services:
 Syndicate services are becoming popular since the constraints on decision making are changing
and we need more specific decision-making in the light of the changing environment. Also,
syndicate services are able to provide information to industries at a low unit cost.
Disadvantages of Syndicate Services:
 The information provided is not exclusive. A number of research agencies provide customized
services which suits the requirement of each individual organization.
International Organization-
These includes
 The International Labour Organization (ILO):
 It publishes data on the total and active population, employment, unemployment, wages
and consumer prices.
 The Organization for Economic Co-operation and development (OECD):
 It publishes data on foreign trade, industry, food, transport, and science and technology.
 The International Monetary Fund (IMF):
 It publishes reports on national and international foreign exchange regulations.
Other sources:
Sensor’s data: With the advancement of IoT devices, the sensors of these devices collect data which
can be used for sensor data analytics to track the performance and usage of products.
Satellite data: Satellites collect a large number of images and terabytes of data on a daily basis through
surveillance cameras, which can be used to extract useful information.
Web traffic: Due to fast and cheap internet facilities, data in many formats uploaded by users on different
platforms can be collected, with their permission, for data analysis. Search engines also provide data on
the keywords and queries searched most often.
Export all the data onto the cloud, e.g. Amazon Web Services S3:
We usually export our data to the cloud for purposes like safety, multiple access and real-time
simultaneous analysis.

Data Management:
Data management is the practice of collecting, keeping, and using data securely, efficiently, and cost-
effectively. The goal of data management is to help people, organizations, and connected things
optimize the use of data within the bounds of policy and regulation so that they can make decisions
and take actions that maximize the benefit to the organization.
Managing digital data in an organization involves a broad range of tasks, policies, procedures, and
practices. The work of data management has a wide scope, covering factors such as how to:
 Create, access, and update data across a diverse data tier
 Store data across multiple clouds and on premises
 Provide high availability and disaster recovery
 Use data in a growing variety of apps, analytics, and algorithms
 Ensure data privacy and security
 Archive and destroy data in accordance with retention schedules and compliance requirements
What is Cloud Computing?
Cloud computing refers to storing and accessing data over the internet. It doesn’t store
any data on the hard disk of your personal computer; in cloud computing, you access data from
a remote server.
Service Models of Cloud computing are the reference models on which the Cloud Computing is based.
These can be categorized into
three basic service models as listed below:
1. INFRASTRUCTURE as a SERVICE (IaaS)
IaaS provides access to fundamental resources such as physical machines, virtual machines, virtual
storage, etc.
2. PLATFORM as a SERVICE (PaaS)
PaaS provides the runtime environment for applications, development & deployment tools, etc.
3. SOFTWARE as a SERVICE (SAAS)
The SaaS model allows software applications to be used as a service by end users.

For providing the above services models AWS is one of the popular platforms. In this Amazon Cloud
(Web) Services is one of the popular service platforms for Data Management
Amazon Cloud (Web) Services Tutorial
What is AWS?
The full form of AWS is Amazon Web Services. It is a platform that offers flexible, reliable, scalable,
easy-to-use and, cost-effective cloud computing solutions.


AWS is a comprehensive, easy-to-use computing platform offered by Amazon. The platform is developed
with a combination of infrastructure as a service (IaaS), platform as a service (PaaS) and packaged
software as a service (SaaS) offerings.

History of AWS
2002- AWS services launched
2006- Launched its cloud products
2012- Holds first customer event
2015- Reveals revenue of $4.6 billion
2016- Surpasses $10 billion revenue target
2016- Releases Snowball and Snowmobile
2019- Offers nearly 100 cloud services
2021- AWS comprises over 200 products and services

Important AWS Services


Amazon Web Services offers a wide range of different business purpose global cloud-based products.
The products include storage, databases, analytics, networking, mobile, development tools,
enterprise applications, with a pay-as-you-go pricing model.

Amazon Web Services - Amazon S3:

 Amazon S3 (Simple Storage Service) is a scalable, high-speed, low-cost web-based
service designed for online backup and archiving of data and application programs.
 It allows you to upload, store, and download any type of file up to 5 TB in size. This service
allows subscribers to access the same systems that Amazon uses to run its own web sites.
 The subscriber has control over the accessibility of data, i.e. whether it is privately or publicly accessible.
1. How to Configure S3?
Following are the steps to configure a S3 account.
Step 1 − Open the Amazon S3 console using this link − https://console.aws.amazon.com/s3/home
Step 2 − Create a Bucket using the following steps.


 A prompt window will open. Click the Create Bucket button at the bottom of the page.

 Create a Bucket dialog box will open. Fill the required details and click the Create button.

 The bucket is created successfully in Amazon S3. The console displays the list of buckets and its
properties.

 Select the Static Website Hosting option. Click the radio button Enable website hosting and fill the
required details.

Step 3 − Add an Object to a bucket using the following steps.


 Open the Amazon S3 console using the following
link. https://console.aws.amazon.com/s3/home


 Click the Upload button.

 Click the Add files option. Select those files which are to be uploaded from the system and then
click the Open button.

 Click the start upload button. The files will get uploaded into the bucket.
 Afterwards, we can create, edit, modify, and update the objects and other files in a wide range of formats (a scripted equivalent is sketched below).
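
The same bucket and object operations can also be scripted; a minimal sketch using boto3 (the AWS SDK for Python), assuming AWS credentials are already configured and the bucket, key and file names below are placeholders:

import boto3

s3 = boto3.client("s3")

# create a bucket (regions other than us-east-1 also need a CreateBucketConfiguration)
s3.create_bucket(Bucket="my-analytics-bucket-example")

# add an object to the bucket, then retrieve it later
s3.upload_file("local_data.csv", "my-analytics-bucket-example", "raw/local_data.csv")
s3.download_file("my-analytics-bucket-example", "raw/local_data.csv", "copy_of_data.csv")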

Amazon S3 Features
 Low cost and Easy to Use − Using Amazon S3, the user can store a large amount of data at
very low charges.
 Secure − Amazon S3 supports data transfer over SSL and the data gets encrypted
automatically once it is uploaded. The user has complete control over their data by configuring
bucket policies using AWS IAM.
 Scalable − Using Amazon S3, there need not be any worry about storage concerns. We can
store as much data as we have and access it anytime.
 Higher performance − Amazon S3 is integrated with Amazon CloudFront, that distributes
content to the end users with low latency and provides high data transfer speeds without any
minimum usage commitments.
 Integrated with AWS services − Amazon S3 integrates with AWS services including Amazon
CloudFront, Amazon CloudWatch, Amazon Kinesis, Amazon RDS, Amazon Route 53, Amazon
VPC, AWS Lambda, Amazon EBS, Amazon DynamoDB, etc.

For further reading on Amazon S3 and the wider AWS service catalogue, see:

https://d1.awsstatic.com/whitepapers/aws-overview.pdf

Data Quality:


What is Data Quality?


There are many definitions of data quality; in general, data quality is the assessment of how
usable the data is and how well it fits its serving context.

Why Data Quality is Important?


Enhancing data quality is a critical concern, as data is considered the core of all activities
within organizations; poor data quality leads to inaccurate reporting, which results in inaccurate
decisions and, ultimately, economic damage.

Many factors help measure data quality, such as:


 Data Accuracy: Data are accurate when data values stored in the database correspond
to real-world values.
 Data Uniqueness: A measure of unwanted duplication existing within or across systems
for a particular field, record, or data set.
 Data Consistency: Violation of semantic rules defined over the dataset.
 Data Completeness: The degree to which values are present in a data collection.
 Data Timeliness: The extent to which the age of the data is appropriate for the task at hand.
Other factors can be taken into consideration such as Availability, Ease of Manipulation,
Believability.


OUTLIERS:

 An outlier is a point or an observation that deviates significantly from the other observations.
 Outlier is a term commonly used by analysts and data scientists, as outliers need close attention,
else they can result in wildly wrong estimations. Simply speaking, an outlier is an observation that
appears far away from, and diverges from, the overall pattern in a sample.
 Reasons for outliers: Due to experimental errors or “special circumstances”.
 There is no rigid mathematical definition of what constitutes an outlier; determining whether
or not an observation is an outlier is ultimately a subjective exercise.
 There are various methods of outlier detection. Some are graphical such as normal probability
plots. Others are model-based. Box plots are a hybrid.
Types of Outliers:

Outlier can be of two types:


Univariate: These outliers can be found when we look at distribution of a single variable.
Multivariate: Multi-variate outliers are outliers in an n-dimensional space.

In order to find them, you have to look at distributions in multi-dimensions.

Impact of Outliers on a dataset:


Outliers can drastically change the results of the data analysis and statistical modelling. There are
numerous unfavourable impacts of outliers in the data set:
 It increases the error variance and reduces the power of statistical tests
 If the outliers are non-randomly distributed, they can decrease normality
 They can bias or influence estimates that may be of substantive interest
 They can also violate the basic assumptions of regression, ANOVA and other statistical models.


Detect Outliers:
The most commonly used method to detect outliers is visualization. We use various visualization
methods, like box plots, histograms and scatter plots.
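
A small sketch of the box-plot (IQR) rule for detecting univariate outliers, assuming numpy is installed; the sample values are made up:

import numpy as np

values = np.array([12, 14, 13, 15, 14, 13, 12, 95, 14, 13])   # 95 looks suspicious
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr                  # box-plot whisker limits

outliers = values[(values < lower) | (values > upper)]
print(outliers)                                                # -> [95]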

Outlier treatments are three types:


Retention:
 There is no rigid mathematical definition of what constitutes an outlier; determining whether
or not an observation is an outlier is ultimately a subjective exercise. There are various methods
of outlier detection. Some are graphical such as normal probability plots. Others are model-
based. Box plots are a hybrid.
Exclusion:
 According to a purpose of the study, it is necessary to decide, whether and which outlier will
be removed/excluded from the data, since they could highly bias the final results of the
analysis.

Rejection:
 Rejection of outliers is more acceptable in areas of practice where the underlying model of the
process being measured and the usual distribution of measurement error are confidently
known.
 An outlier resulting from an instrument reading error may be excluded but it is desirable that
the reading is at least verified.

Other treatment methods


OUTLIER package in R: used to detect and treat outliers in data.
Outlier detection from graphical representations:
– Scatter plot and box plot
– Observations outside the box (beyond the whiskers) are treated as outliers in the data


Missing Data treatment:


Missing Values
 Missing data in the training data set can reduce the
power / fit of a model or can lead to a biased model
because we have not analyzed the behavior and
relationship with other variables correctly. It can lead to
wrong prediction or classification.

 In R, missing values are represented by the symbol NA (not available).

 Impossible values (e.g., dividing by zero) are represented by the symbol NaN (not a number)
and R outputs the result for dividing by zero as ‘Inf’(Infinity).

PMM approach to treat missing values:

• Predictive Mean Matching (PMM) is a semi-parametric imputation approach.
• It is similar to the regression method, except that for each missing value it fills in a value drawn
randomly from among the observed donor values of observations whose regression-predicted values
are closest to the regression-predicted value for the missing value from the simulated regression model.
• A simpler, mean-based imputation is sketched below for contrast.
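
A compact sketch of simple mean imputation with pandas (a much simpler alternative to PMM), assuming pandas and numpy are installed; the small data frame is illustrative:

import pandas as pd
import numpy as np

df = pd.DataFrame({"income": [45000, np.nan, 39000, 61000],
                   "age":    [34, 41, np.nan, 47]})

df_filled = df.fillna(df.mean(numeric_only=True))   # replace NaN with each column's mean
print(df_filled)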

Data Pre-processing:
Preprocessing in Data Mining: Data preprocessing is a data mining technique which is used to
transform raw data into a useful and efficient format.

Steps Involved in Data Preprocessing:


1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It
involves handling of missing data, noisy data etc.
(a). Missing Data:
This situation arises when some data is missing in the data. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are
missing within a tuple.

2. Fill the Missing values:


There are various ways to do this task. You can choose to fill the missing values manually, by
attribute mean or the most probable value.

 (b). Noisy Data:

Noisy data is meaningless data that can’t be interpreted by machines. It can be generated due
to faulty data collection, data entry errors etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. Binning, also called discretization, is a
technique for reducing the cardinality (the total number of unique values for a dimension) of
continuous and discrete data. Binning groups related values together in bins to reduce the number
of distinct values (a short sketch follows this list).
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may
be linear (having one independent variable) or multiple (having multiple independent
variables).
3. Clustering:
This approach groups the similar data in a cluster. The outliers may be undetected or it will fall
outside the clusters.
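
As referenced under the Binning Method above, a short sketch of binning with pandas.cut, assuming pandas is installed; the ages and bin edges are made up:

import pandas as pd

ages = pd.Series([5, 17, 23, 34, 45, 52, 61, 78])
bins = pd.cut(ages, bins=[0, 18, 40, 65, 100],
              labels=["child", "young", "middle-aged", "senior"])
print(bins.value_counts())      # how many values fall in each bin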
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for mining process.
This involves following ways:

1. Normalization:
Normalization is a technique often applied as part of data preparation for data analytics and
machine learning. The goal of normalization is to change the values of numeric columns in the
dataset to a common scale, without distorting differences in the ranges of values. Not every
dataset requires normalization for machine learning. It is done in order to scale the data values
into a specified range (-1.0 to 1.0 or 0.0 to 1.0); a minimal sketch follows this list.
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.

3. Discretization:
Discretization is the process through which we can transform continuous variables, models
or functions into a discrete form. We do this by creating a set of contiguous intervals (or
bins) that go across the range of our desired variable/model/function. Continuous data is
Measured, while Discrete data is Counted

4. Concept Hierarchy Generation:


Here attributes are converted from lower level to higher level in hierarchy. For Example-The
attribute “city” can be converted to “country”.
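
As referenced under Normalization above, a minimal sketch of min-max scaling to the range 0.0–1.0 in plain Python, using made-up values:

values = [200, 400, 800, 1000]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]   # (x - min) / (max - min)
print(normalized)                                     # -> [0.0, 0.25, 0.75, 1.0]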

3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder when
working with such volumes. To deal with this, we use data reduction techniques, which aim to increase
storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.

2. Attribute Subset Selection:


The highly relevant attributes should be used, rest all can be discarded. For performing
attribute selection, one can use level of significance and p- value of the attribute. The
attribute having p-value greater than significance level can be discarded.

3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example regression models.

4. Dimensionality Reduction:
This reduces the size of data using encoding mechanisms. It can be lossy or lossless: if the
original data can be retrieved after reconstruction from the compressed data, the reduction is
called lossless, otherwise it is called lossy. Two effective methods of dimensionality reduction
are wavelet transforms and PCA (Principal Component Analysis).

Data Processing:
Data processing occurs when data is collected and translated into usable information. Usually
performed by a data scientist or team of data scientists, it is important for data processing to be done
correctly so as not to negatively affect the end product, or data output.
Data processing starts with data in its raw form and converts it into a more readable format (graphs,
documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized
by employees throughout an organization.

Six stages of data processing


1. Data collection
Collecting data is the first step in data processing. Data is pulled from available sources, including data
lakes and data warehouses. It is important that the data sources available are trustworthy and well-
built so the data collected (and later used as information) is of the highest possible quality.

2. Data preparation
Once the data is collected, it then enters the data preparation stage. Data preparation, often referred
to as “pre-processing” is the stage at which raw data is cleaned up and organized for the following
stage of data processing. During preparation, raw data is diligently checked for any errors. The purpose
of this step is to eliminate bad data (redundant, incomplete, or incorrect data) and begin to create
high-quality data for the best business intelligence.

3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data warehouse
like Redshift), and translated into a language that it can understand. Data input is the first stage in
which raw data begins to take the form of usable information.

4. Processing
During this stage, the data inputted to the computer in the previous stage is actually processed for
interpretation. Processing is done using machine learning algorithms, though the process itself may
vary slightly depending on the source of data being processed (data lakes, social networks, connected
devices etc.) and its intended use (examining advertising patterns, medical diagnosis from connected
devices, determining customer needs, etc.).


5. Data output/interpretation
The output/interpretation stage is the stage at which data is finally usable by non-data scientists. It is
translated, readable, and often in the form of graphs, videos, images, plain text, etc.

6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is then stored for
future use. While some information may be put to use immediately, much of it will serve a purpose
later on. When data is properly stored, it can be quickly and easily accessed by members of the
organization when needed.

*** End of Unit-1 ***


UNIT – II Syllabus
Data Analytics: Introduction to Analytics, Introduction to Tools and Environment, Application of
Modeling in Business, Databases & Types of Data and variables, Data Modeling Techniques, Missing
Imputations etc. Need for Business Modeling.
Topics:
1. Introduction to Data Analytics
2. Data Analytics Tools and Environment
3. Need for Business Modeling.
4. Data Modeling Techniques
5. Application of Modeling in Business
6. Databases & Types of Data and variables
7. Missing Imputations etc.

Unit-2 Objectives:
1. To explore the fundamental concepts of data analytics.
2. To learn the principles Tools and Environment
3. To explore the applications of Business Modelling
4. To understand the Data Modeling Techniques
5. To understand the Data Types and Variables and Missing imputations

Unit-2 Outcomes:
After completion of this course students will be able to
1. To Describe concepts of data analytics.
2. To demonstrate the principles Tools and Environment
3. To analyze the applications of Business Modelling
4. To understand and Compare the Data Modeling Techniques
5. To describe the Data Types and Variables and Missing imputations


INTRODUCTION:

Data has been the buzzword for ages now. Either the data being generated from large-scale
enterprises or the data generated from an individual, each and every aspect of data needs to be
analyzed to benefit yourself from it.

Why is Data Analytics important?


Data Analytics has a key role in improving your business as it is used to gather hidden insights,
generate reports, perform market analysis, and improve business requirements.

What is the role of Data Analytics?


 Gather Hidden Insights – Hidden insights from data are gathered and then analyzed with
respect to business requirements.
 Generate Reports – Reports are generated from the data and are passed on to the
respective teams and individuals to deal with further actions for a high rise in business.
 Perform Market Analysis – Market Analysis can be performed to understand the strengths
and weaknesses of competitors.
 Improve Business Requirement – Analysis of Data allows improving Business to
customer requirements and experience.

Ways to Use Data Analytics:


Now that you have looked at what data analytics is, let’s understand how we can use data analytics.

Fig: Ways to use Data Analytics


1. Improved Decision Making: Data Analytics eliminates guesswork and manual tasks. Be it
choosing the right content, planning marketing campaigns, or developing products. Organizations
can use the insights they gain from data analytics to make informed decisions. Thus, leading to
better outcomes and customer satisfaction.
2. Better Customer Service: Data analytics allows you to tailor customer service according to
their needs. It also provides personalization and builds stronger relationships with customers.
Analyzed data can reveal information about customers’ interests, concerns, and more. It helps you
give better recommendations for products and services.


3. Efficient Operations: With the help of data analytics, you can streamline your processes, save
money, and boost production. With an improved understanding of what your audience wants, you
spend less time creating ads and content that aren’t in line with your audience’s interests.
4. Effective Marketing: Data analytics gives you valuable insights into how your campaigns are
performing. This helps in fine-tuning them for optimal outcomes. Additionally, you can also find
potential customers who are most likely to interact with a campaign and convert into leads.

Steps Involved in Data Analytics:


The next step in understanding data analytics is to learn how data is analyzed in organizations.
There are a few steps involved in the data analytics lifecycle. Below are the steps you can take
to solve your problems.

Fig: Data Analytics process steps


1. Understand the problem: Understanding the business problems, defining the organizational
goals, and planning a lucrative solution is the first step in the analytics process. E-commerce
companies often encounter issues such as predicting the return of items, giving relevant product
recommendations, cancellation of orders, identifying frauds, optimizing vehicle routing, etc.
2. Data Collection: Next, you need to collect transactional business data and customer-related
information from the past few years to address the problems your business is facing. The data can
have information about the total units that were sold for a product, the sales and profit that were
made, and when the order was placed. Past data plays a crucial role in shaping the future of a
business.
3. Data Cleaning: Now, all the data you collect will often be disorderly, messy, and contain
unwanted missing values. Such data is not suitable or relevant for performing data analysis. Hence,
you need to clean the data to remove unwanted, redundant, and missing values to make it ready
for analysis.


4. Data Exploration and Analysis: After you gather the right data, the next vital step is to
execute exploratory data analysis. You can use data visualization and business intelligence tools,
data mining techniques, and predictive modelling to analyze, visualize, and predict future outcomes
from this data. Applying these methods can tell you the impact and relationship of a certain feature
as compared to other variables.
Below are the results you can get from the analysis:
 You can identify when a customer purchases the next product.
 You can understand how long it took to deliver the product.
 You get a better insight into the kind of items a customer looks for, product returns, etc.
 You will be able to predict the sales and profit for the next quarter.
 You can minimize order cancellation by dispatching only relevant products.
 You’ll be able to figure out the shortest route to deliver the product, etc.
5. Interpret the results: The final step is to interpret the results and validate if the outcomes
meet your expectations. You can find out hidden patterns and future trends. This will help you gain
insights that will support you with appropriate data-driven decision making.

What are the tools used in Data Analytics?


With the increasing demand for Data Analytics in the market, many tools have emerged with various
functionalities for this purpose. Ranging from open-source platforms to commercial, user-friendly
products, the top tools in the data analytics market are as follows.

 R programming – This tool is the leading analytics tool used for statistics and data
modeling. R compiles and runs on various platforms such as UNIX, Windows, and Mac OS.
It also provides tools to automatically install all packages as per user-requirement.
 Python – Python is an open-source, object-oriented programming language that is easy to
read, write, and maintain. It provides various machine learning and visualization libraries
such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras, etc. It can also connect to almost
any data platform, such as a SQL Server or MongoDB database, or JSON data.


 Tableau Public – This is a free software that connects to any data source such as
Excel, corporate Data Warehouse, etc. It then creates visualizations, maps, dashboards etc
with real-time updates on the web.
 QlikView – This tool offers in-memory data processing with the results delivered to the
end-users quickly. It also offers data association and data visualization with data being
compressed to almost 10% of its original size.
 SAS – A programming language and environment for data manipulation and analytics, this
tool is easily accessible and can analyze data from different sources.
 Microsoft Excel – This tool is one of the most widely used tools for data analytics. Mostly
used for clients’ internal data, this tool analyzes the tasks that summarize the data with a
preview of pivot tables.
 RapidMiner – A powerful, integrated platform that can integrate with any data source types
such as Access, Excel, Microsoft SQL, Tera data, Oracle, Sybase etc. This tool is mostly used
for predictive analytics, such as data mining, text analytics, machine learning.
 KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics platform,
which allows you to analyze and model data. With the benefit of visual programming, KNIME
provides a platform for reporting and integration through its modular data pipeline concept.
 OpenRefine – Also known as GoogleRefine, this data cleaning software will help you clean
up data for analysis. It is used for cleaning messy data, the transformation of data and
parsing data from websites.
• Apache Spark – One of the largest large-scale data processing engines, this tool executes
applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk. This
tool is also popular for data pipelines and machine learning model development.

Data Analytics Applications:


Data analytics is used in almost every sector of business, let’s discuss a few of them:
1. Retail: Data analytics helps retailers understand their customer needs and buying habits to
predict trends, recommend new products, and boost their business. They optimize the supply
chain, and retail operations at every step of the customer journey.
2. Healthcare: Healthcare industries analyse patient data to provide lifesaving diagnoses and
treatment options. Data analytics help in discovering new drug development methods as well.
3. Manufacturing: Using data analytics, manufacturing sectors can discover new cost-saving
opportunities. They can solve complex supply chain issues, labour constraints, and equipment
breakdowns.
4. Banking sector: Banking and financial institutions use analytics to find out probable loan
defaulters and the customer churn rate. It also helps in detecting fraudulent transactions
immediately.


5. Logistics: Logistics companies use data analytics to develop new business models and
optimize routes. This, in turn, ensures that the delivery reaches on time in a cost-efficient
manner.

Cluster computing:
 Cluster computing is a collection of tightly or loosely
connected computers that work together so that they
act as a single entity.
 The connected computers execute operations all
together thus creating the idea of a single system.
 The clusters are generally connected through fast local
area networks (LANs)

Why is Cluster Computing important?

• Cluster computing provides a relatively inexpensive, unconventional alternative to large server or
mainframe computer solutions.
• It resolves the demand for content criticality and processing services in a faster way.
 Many organizations and IT companies are implementing cluster computing to augment their
scalability, availability, processing speed and resource management at economic prices.
 It ensures that computational power is always available. It provides a single general
strategy for the implementation and application of parallel high-performance systems
independent of certain hardware vendors and their product decisions.

Apache Spark:


 Apache Spark is a lightning-fast cluster computing technology, designed for fast


computation. It is based on Hadoop MapReduce and it extends the MapReduce model to
efficiently use it for more types of computations, which includes interactive queries and
stream processing.
 The main feature of Spark is its in-memory cluster computing that increases the processing
speed of an application.
 Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming.
 Apart from supporting all these workloads in a respective system, it reduces the
management burden of maintaining separate tools.

Evolution of Apache Spark


Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia.
It was open sourced in 2010 under a BSD license and donated to the Apache Software Foundation
in 2013; Apache Spark has been a top-level Apache project since February 2014.

Features of Apache Spark:


Apache Spark has following features.
Speed − Spark helps to run an application in Hadoop cluster, up to 100 times faster in memory,
and 10 times faster when running on disk. This is possible by reducing number of read/write
operations to disk. It stores the intermediate processing data in memory.
Supports multiple languages − Spark provides built-in APIs in Java, Scala, or Python.
Therefore, you can write applications in different languages. Spark provides 80 high-level
operators for interactive querying.
Advanced Analytics − Spark not only supports ‘Map’ and ‘reduce’. It also supports SQL queries,
Streaming data, Machine learning (ML), and Graph algorithms.


Spark Built on Hadoop


The following diagram shows three ways of how Spark can be built with Hadoop components.

There are three ways of Spark deployment as explained below.


Standalone − Spark Standalone deployment means Spark occupies the place on top of
HDFS(Hadoop Distributed File System) and space is allocated for HDFS, explicitly. Here, Spark and
MapReduce will run side by side to cover all spark jobs on cluster.
Hadoop Yarn − Hadoop Yarn deployment means, simply, spark runs on Yarn (Yet Another
Resource Negotiator) without any pre-installation or root access required. It helps to integrate
Spark into Hadoop ecosystem or Hadoop stack. It allows other components to run on top of stack.
Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in addition to
standalone deployment. With SIMR, the user can start Spark and use its shell without any
administrative access.
Components of Spark
The following illustration depicts the different components of Spark.


Apache Spark Core


Spark Core is the underlying general execution engine for spark platform that all other functionality
is built upon. It provides In-Memory computing and referencing datasets in external storage
systems.

Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called
SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics.
It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations
on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark, taking advantage of the
distributed memory-based Spark architecture. According to benchmarks done by the MLlib developers
against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as
the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for
expressing graph computation that can model the user-defined graphs by using Pregel abstraction
API. It also provides an optimized runtime for this abstraction.
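
For readers who want to try Spark from R, the short sketch below uses the sparklyr package together with dplyr. This is an illustration only, not part of the components described above, and it assumes that sparklyr, dplyr and a local Spark installation are available.

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")        # start a local Spark session
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")  # copy an R data frame into Spark

# dplyr verbs are translated to Spark SQL and executed by Spark
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)                         # stop the Spark session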

What is Scala?
• Scala is a statically typed, general-purpose programming language that combines functional and
object-oriented programming, and is also suitable for an imperative programming style, which helps
increase the scalability of applications. It is a strongly, statically typed language. In Scala,
everything is an object, whether it is a function or a number; it does not have the concept of
primitive data types.
 Scala primarily runs on JVM platform and it can also be used to write software for native
platforms using Scala-Native and JavaScript runtimes through ScalaJs.
 This language was originally built for the Java Virtual Machine (JVM) and one of Scala’s
strengths is that it makes it very easy to interact with Java code.
 Scala is a Scalable Language used to write Software for multiple platforms. Hence, it got
the name “Scala”. This language is intended to solve the problems of Java


while simultaneously being more concise. Initially designed by Martin Odersky, it was
released in 2003.

Why Scala?
 Scala is the core language to be used in writing the most popular distributed big
data processing framework Apache Spark. Big Data processing is becoming
inevitable from small to large enterprises.
 Extracting the valuable insights from data requires state of the art processing tools
and frameworks.
 Scala is easy to learn for object-oriented programmers, Java developers. It is
becoming one of the popular languages in recent years.
 Scala offers first-class functions for users
 Scala can be executed on JVM, thus paving the way for the interoperability with
other languages.
 It is designed for applications that are concurrent (parallel), distributed, and resilient
(robust) message-driven. It is one of the most demanding languages of this decade.
 It is concise, powerful language and can quickly grow according to the demand of
its users.
 It is object-oriented and has a lot of functional programming features providing a lot
of flexibility to the developers to code in a way they want.
 Scala offers many Duck Types(Structural Types)
 Unlike Java, Scala has many features of functional programming languages like Scheme,
Standard ML and Haskell, including currying, type inference, immutability, lazy evaluation,
and pattern matching.
 The name Scala is a portmanteau of "scalable" and "language", signifying that it is
designed to grow with the demands of its users.

Where Scala can be used?


 Web Applications
 Utilities and Libraries
 Data Streaming
 Parallel batch processing
 Concurrency and distributed application
 Data analytics with Spark
 AWS lambda Expression


Cloudera Impala:

 Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL
query engine for data stored in a computer cluster running Apache Hadoop.
 Impala is the open source, massively parallel processing (MPP) SQL query engine for
native analytic database in a computer cluster running Apache Hadoop.
 It is shipped by vendors such as Cloudera, MapR, Oracle, and Amazon.
 Cloudera Impala is a query engine that runs on Apache Hadoop.
 The project was announced in October 2012 with a public beta test distribution and
became generally available in May 2013.
• Impala enables users to issue low-latency SQL queries to data stored in HDFS
and Apache HBase without requiring data movement or transformation.
 Impala is integrated with Hadoop to use the same file and data formats, metadata,
security and resource management frameworks used by MapReduce, Apache Hive,
Apache Pig and other Hadoop software.
 Impala is promoted for analysts and data scientists to perform analytics on data
stored in Hadoop via SQL or business intelligence tools.
 The result is that large-scale data processing (via MapReduce) and interactive
queries can be done on the same system using the same data and metadata –
removing the need to migrate data sets into specialized systems and/or proprietary
formats simply to perform analysis.

Features include:
 Supports HDFS and Apache HBase storage,
 Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and
Parquet,
 Supports Hadoop security (Kerberos authentication),
 Fine-grained, role-based authorization with Apache Sentry,
 Uses metadata, ODBC driver, and SQL syntax from Apache Hive.


Databases & Types of Data and variables


Data Base: A Database is a collection of related data.
Database Management System: DBMS is a software or set of Programs used to define,
construct and manipulate the data.
Relational Database Management System: RDBMS is a software system used to
maintain relational databases. Many relational database systems have an option of using
the SQL.
NoSQL:
 NoSQL Database is a non-relational Data Management System, that does not require
a fixed schema. It avoids joins, and is easy to scale. The major purpose of using a
NoSQL database is for distributed data stores with humongous data storage needs.
NoSQL is used for Big data and real-time web apps. For example, companies like
Twitter, Facebook and Google collect terabytes of user data every single day.
 NoSQL database stands for “Not Only SQL” or “Not SQL.” Though a better term
would be “NoREL”, NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in
1998.
 Traditional RDBMS uses SQL syntax to store and retrieve data for further insights.
Instead, a NoSQL database system encompasses a wide range of database
technologies that can store structured, semi-structured, unstructured and
polymorphic data.


Why NoSQL?
 The concept of NoSQL databases became popular with Internet giants like Google,
Facebook, Amazon, etc. who deal with huge volumes of data. The system response
time becomes slow when you use RDBMS for massive volumes of data.
 To resolve this problem, we could “scale up” our systems by upgrading our existing
hardware. This process is expensive. The alternative for this issue is to distribute
database load on multiple hosts whenever the load increases. This method is known
as “scaling out.”

Types of NoSQL Databases:

• Document-oriented: JSON documents; e.g., MongoDB and CouchDB
• Key-value: e.g., Redis and DynamoDB
• Wide-column: e.g., Cassandra and HBase
• Graph: e.g., Neo4j and Amazon Neptune
Examples:
  Relational Databases (SQL): Oracle, MySQL, SQL Server
  Non-relational Databases (NoSQL): MongoDB, CouchDB, BigTable
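
As a small illustration of the difference in data shape (written in plain base R only, since no particular NoSQL client library is assumed here), a relational row has a fixed set of columns, while a document-style record can nest flexible sub-records:

# A fixed-schema "row", as stored in a relational (SQL) table
sql_style_row <- data.frame(id = 1, name = "Asha", city = "Hyderabad")

# A flexible, nested "document", as stored in a NoSQL document database
nosql_style_doc <- list(
  id     = 1,
  name   = "Asha",
  orders = list(                      # nested sub-records, no fixed schema
    list(item = "book", qty = 2),
    list(item = "pen",  qty = 5)
  )
)

str(sql_style_row)
str(nosql_style_doc)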


SQL vs NOSQL DB:

SQL: Relational database management system (RDBMS).
NoSQL: Non-relational or distributed database system.

SQL: These databases have a fixed, static or predefined schema.
NoSQL: They have a dynamic schema.

SQL: Not suited for hierarchical data storage.
NoSQL: Best suited for hierarchical data storage.

SQL: Best suited for complex queries.
NoSQL: Not so good for complex queries.

SQL: Vertically scalable.
NoSQL: Horizontally scalable.

SQL: Follows the ACID properties.
NoSQL: Follows CAP (consistency, availability, partition tolerance).


Differences between SQL and NoSQL

The table below summarizes the main differences between SQL and NoSQL databases.

Data Storage Model
  SQL: Tables with fixed rows and columns.
  NoSQL: Document: JSON documents; Key-value: key-value pairs; Wide-column: tables with rows and dynamic columns; Graph: nodes and edges.

Development History
  SQL: Developed in the 1970s with a focus on reducing data duplication.
  NoSQL: Developed in the late 2000s with a focus on scaling and allowing for rapid application change driven by agile and DevOps practices.

Examples
  SQL: Oracle, MySQL, Microsoft SQL Server, and PostgreSQL.
  NoSQL: Document: MongoDB and CouchDB; Key-value: Redis and DynamoDB; Wide-column: Cassandra and HBase; Graph: Neo4j and Amazon Neptune.

Primary Purpose
  SQL: General purpose.
  NoSQL: Document: general purpose; Key-value: large amounts of data with simple lookup queries; Wide-column: large amounts of data with predictable query patterns; Graph: analyzing and traversing relationships between connected data.

Schemas
  SQL: Rigid.
  NoSQL: Flexible.

Scaling
  SQL: Vertical (scale-up with a larger server).
  NoSQL: Horizontal (scale-out across commodity servers).

Multi-Record ACID Transactions
  SQL: Supported.
  NoSQL: Most do not support multi-record ACID transactions; however, some, like MongoDB, do.

Joins
  SQL: Typically required.
  NoSQL: Typically not required.

Data to Object Mapping
  SQL: Requires ORM (object-relational mapping).
  NoSQL: Many do not require ORMs; MongoDB documents map directly to data structures in most popular programming languages.

Benefits of NoSQL
 The NoSQL data model addresses several issues that the relational model is not
designed to address:
 Large volumes of structured, semi-structured, and unstructured data.
 Object-oriented programming that is easy to use and flexible.
 Efficient, scale-out architecture instead of expensive, monolithic architecture.

Variables:
• Data consist of individuals and variables that give us information about those individuals. An
individual can be an object or a person.
• A variable is an attribute, such as a measurement or a label.
• There are two types of data:
  • Quantitative data (numerical)
  • Categorical data

• Quantitative Variables: Quantitative data contains numerical values that can be added,
subtracted, divided, etc.
There are two types of quantitative variables: discrete and continuous.


Discrete vs continuous variables

Discrete variables: counts of individual items or values.
  Examples: number of students in a class; number of different tree species in a forest.

Continuous variables: measurements of continuous or non-finite values.
  Examples: distance, volume, age.

Categorical variables: Categorical variables represent groupings of some kind. They are
sometimes recorded as numbers, but the numbers represent categories rather than actual
amounts of things.
There are three types of categorical variables: binary, nominal, and ordinal variables.

Binary variables: yes/no outcomes.
  Examples: heads/tails in a coin flip; win/lose in a football game.

Nominal variables: groups with no rank or order between them.
  Examples: colors, brands, ZIP codes.

Ordinal variables: groups that are ranked in a specific order.
  Examples: finishing place in a race; rating scale responses in a survey.
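
The variable types above can be illustrated in R as follows; the values are made up purely for illustration.

students  <- c(35L, 40L, 28L)                       # discrete quantitative (counts)
distance  <- c(2.5, 7.1, 3.8)                       # continuous quantitative
coin_flip <- factor(c("Heads", "Tails", "Heads"))   # binary categorical
colour    <- factor(c("Red", "Blue", "Green"))      # nominal categorical
rating    <- factor(c("Low", "High", "Medium"),
                    levels = c("Low", "Medium", "High"),
                    ordered = TRUE)                 # ordinal categorical
str(rating)   # shows the ordered levels: Low < Medium < High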


Missing Imputations:
Imputation is the process of replacing missing data with substituted values.

Types of missing data


Missing data can be classified into one of three categories
1. MCAR
Data which is Missing Completely At Random has nothing systematic about which
observations are missing values. There is no relationship between missingness and either
observed or unobserved covariates.

2. MAR
Missing At Random is weaker than MCAR. The missingness is still random, but due entirely
to observed variables. For example, those from a lower socioeconomic status may be less
willing to provide salary information (but we know their SES status). The key is that the
missingness is not due to the values which are not observed. MCAR implies MAR but not
vice-versa.

3. MNAR
If the data are Missing Not At Random, then the missingness depends on the values of the
missing data. Censored data falls into this category. For example, individuals who are
heavier are less likely to report their weight. Another example, the device measuring some
response can only measure values above .5. Anything below that is missing.

Missing values in data can be handled in two ways:

1. Missing Data Imputation
2. Model-based Techniques

Imputations: (Treatment of Missing Values)


1. Ignore the tuple: This is usually done when the class label is missing (assuming
the mining task involves classification). This method is not very effective, unless the
tuple contains several attributes with missing values. It is especially poor when the
percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming
and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute
values by the same constant, such as a label like “Unknown” or -∞. If missing values


are replaced by, say, “Unknown,” then the mining program may mistakenly think
that they form an interesting concept, since they all have a value in common-that of
“Unknown.” Hence, although this method is simple, it is not foolproof.
4. Use the attribute mean to fill in the missing value: Considering the average
value of that particular attribute and use this value to replace the missing value in
that attribute column.
5. Use the attribute mean for all samples belonging to the same class as the
given tuple:
For example, if classifying customers according to credit risk, replace the missing
value with the average income value for customers in the same credit risk category
as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian formalism, or
decision tree induction. For example, using the other customer attributes in your
data set, you may construct a decision tree to predict the missing values for income.
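
A minimal R sketch of methods 4 and 5 above (overall attribute mean and class-wise attribute mean), using a small made-up data frame:

df <- data.frame(income = c(50, 60, NA, 55, NA),
                 risk   = c("low", "low", "low", "high", "high"))

# Method 4: replace missing values with the overall attribute mean
df$income_filled <- ifelse(is.na(df$income),
                           mean(df$income, na.rm = TRUE),
                           df$income)

# Method 5: replace missing values with the mean of the same class (risk group)
class_mean <- ave(df$income, df$risk,
                  FUN = function(v) mean(v, na.rm = TRUE))
df$income_by_class <- ifelse(is.na(df$income), class_mean, df$income)

df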

Need for Business Modelling:


Companies that embrace big data analytics and transform their business models in parallel will
create new opportunities for revenue streams, customers, products and services. This is the main
need for business modelling: having a big data strategy and vision that identifies and capitalizes
on new opportunities.

Analytics applications to various Business Domains


Application of Modelling in Business:
 Applications of Data Modelling can be termed as Business analytics.
 Business analytics involves the collating, sorting, processing, and studying of
business-related data using statistical models and iterative methodologies. The goal
of BA is to narrow down which datasets are useful and which can increase revenue,
productivity, and efficiency.
 Business analytics (BA) is the combination of skills, technologies, and practices used
to examine an organization's data and performance as a way to gain insights and
make data-driven decisions in the future using statistical analysis.


Although business analytics is being leveraged in most commercial sectors and industries,
the following applications are the most common.
1. Credit Card Companies
Credit and debit cards are an everyday part of consumer spending, and they are an ideal
way of gathering information about a purchaser’s spending habits, financial situation,
behavior trends, demographics, and lifestyle preferences.
2. Customer Relationship Management (CRM)
Excellent customer relations is critical for any company that wants to retain customer loyalty
to stay in business for the long haul. CRM systems analyze important performance indicators
such as demographics, buying patterns, socio-economic information, and lifestyle.
3. Finance
The financial world is a volatile place, and business analytics helps to extract insights that
help organizations maneuver their way through tricky terrain. Corporations turn to business
analysts to optimize budgeting, banking, financial planning, forecasting, and portfolio
management.
4. Human Resources
Business analysts help the process by poring over data that characterizes high
performing candidates, such as educational background, attrition rate, the average length
of employment, etc. By working with this information, business analysts help HR by
forecasting the best fits between the company and candidates.
5. Manufacturing
Business analysts work with data to help stakeholders understand the things that affect
operations and the bottom line. Identifying things like equipment downtime, inventory
levels, and maintenance costs help companies streamline inventory management, risks, and
supply-chain management to create maximum efficiency.
6. Marketing
Business analysts help answer these questions and so many more, by measuring marketing
and advertising metrics, identifying consumer behavior and the target audience, and
analyzing market trends.


Data Modelling Techniques in Data Analytics:


What is Data Modelling?
 Data Modelling is the process of analyzing the data objects and their relationship to the
other objects. It is used to analyze the data requirements that are required for the business
processes. The data models are created for the data to be stored in a database.
 The Data Model's main focus is on what data is needed and how we have to organize data
rather than what operations we have to perform.
 Data Model is basically an architect's building plan. It is a process of documenting
complex software system design as in a diagram that can be easily understood.

Uses of Data Modelling:


 Data Modelling helps create a robust design with a data model that can show an
organization's entire data on the same platform.
 The database at the logical, physical, and conceptual levels can be designed with the help
data model.
 Data Modelling Tools help in the improvement of data quality.
 Redundant data and missing data can be identified with the help of data models.
• The data model is quite time consuming to build, but it makes maintenance cheaper and
faster.

Data Modelling Techniques:

Below given are 5 different types of techniques used to organize the data:
1. Hierarchical Technique
The hierarchical model is a tree-like structure. There is one root node, or we can say one parent
node and the other child nodes are sorted in a particular order. But, the hierarchical model is very
rarely used now. This model can be used for real-world model relationships.


2. Object-oriented Model
The object-oriented approach is the creation of objects that contains stored values. The object-
oriented model communicates while supporting data abstraction, inheritance, and encapsulation.
3. Network Technique
The network model provides us with a flexible way of representing objects and relationships
between these entities. It has a feature known as a schema representing the data in the form of a
graph. An object is represented inside a node and the relation between them as an edge, enabling
them to maintain multiple parent and child records in a generalized manner.
4. Entity-relationship Model
ER model (Entity-relationship model) is a high-level relational model which is used to define data
elements and relationship for the entities in a system. This conceptual design provides a better
view of the data that helps us easy to understand. In this model, the entire database is represented
in a diagram called an entity-relationship diagram, consisting of Entities, Attributes, and
Relationships.
5. Relational Technique
Relational is used to describe the different relationships between the entities. And there are different
sets of relations between the entities such as one to one, one to many, many to one, and many to
many.

*** End of Unit-2 ***


UNIT - III
Linear & Logistic Regression
Syllabus
Regression – Concepts, Blue property assumptions, Least Square Estimation, Variable
Rationalization, and Model Building etc.
Logistic Regression: Model Theory, Model fit Statistics, Model Construction, Analytics applications
to various Business Domains etc.

Topics:
1. Regression – Concepts
2. Blue property assumptions
3. Least Square Estimation
4. Variable Rationalization
5. Model Building etc.
6. Logistic Regression - Model Theory
7. Model fit Statistics
8. Model Construction
9. Analytics applications to various Business Domains

Unit-3 Objectives:
1. To explore the Concept of Regression
2. To learn the Linear Regression
3. To explore Blue Property Assumptions
4. To Learn the Logistic Regression
5. To understand the Model Theory and Applications

Unit-3 Outcomes:
After completion of this course students will be able to
1. To Describe the Concept of Regression
2. To demonstrate Linear Regression
3. To analyze the Blue Property Assumptions
4. To explore the Logistic Regression
5. To describe the Model Theory and Applications


Regression – Concepts:
Introduction:
 The term regression is used to indicate the estimation or prediction of the average
value of one variable for a specified value of another variable.
 Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables.

“Regression Analysis is a statistical process for estimating the relationships between the
dependent variables (criterion variables / response variables) and one or more independent
variables (predictor variables).”
 Regression describes how an independent variable is numerically related to the
dependent variable.
 Regression can be used for prediction, estimation and hypothesis testing, and modeling
causal relationships.

When Regression is chosen?


 A regression problem is when the output variable is a real or continuous value, such as
“salary” or “weight”.
 Many different models can be used, the simplest is linear regression. It tries to fit data
with the best hyperplane which goes through the points.
 Mathematically a linear relationship represents a straight line when plotted as a graph.
 A non-linear relationship where the exponent of any variable is not equal to 1 creates
a curve.
Types of Regression Analysis Techniques:
1. Linear Regression
2. Logistic Regression
3. Ridge Regression
4. Lasso Regression
5. Polynomial Regression
6. Bayesian Linear Regression


Advantages & Limitations:


 Fast and easy to model and is particularly useful when the relationship to be modeled
is not extremely complex and if you don’t have a lot of data.
 Very intuitive to understand and interpret.
 Linear Regression is very sensitive to outliers.
Linear regression:
• Linear Regression is a very simple method but has proven to be very useful for a
large number of situations.
• When we have a single input attribute (x) and we want to use linear regression, this
is called simple linear regression.
• In simple linear regression we want to model our data as follows:

  y = B0 + B1 * x

  where x is the input we know, and B0 and B1 are coefficients that we need to estimate; they
move the line around.
• Simple regression is great because, rather than having to search for values by trial
and error or calculate them analytically using more advanced linear algebra, we can
estimate them directly from our data.
OLS Regression:
Linear Regression using Ordinary Least Squares Approximation
Based on Gauss Markov Theorem:
We can start off by estimating the values for B1 and B0 as:

  B1 = [ Σ_{i=1..n} (xi - mean(x)) * (yi - mean(y)) ] / [ Σ_{i=1..n} (xi - mean(x))^2 ]

  B0 = mean(y) - B1 * mean(x)


 If we had multiple input attributes (e.g. x1, x2, x3, etc.) This would be called multiple
linear regression. The procedure for linear regression is different and simpler than
that for multiple linear regression.
Let us consider the following example, for the equation y = 2*x + 3:

  x     y    xi-mean(x)   yi-mean(y)   (xi-mean(x))*(yi-mean(y))   (xi-mean(x))^2
 -3    -3      -4.4         -8.8               38.72                   19.36
 -1     1      -2.4         -4.8               11.52                    5.76
  2     7       0.6          1.2                0.72                    0.36
  4    11       2.6          5.2               13.52                    6.76
  5    13       3.6          7.2               25.92                   12.96
                                          Sum = 90.4              Sum = 45.2

Mean(x) = 1.4 and Mean(y) = 5.8


Substituting these values into the formulas for B1 and B0 given above, we find
B1 = 90.4 / 45.2 = 2 and B0 = 5.8 - 2 * 1.4 = 3.
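
The same result can be checked directly in R by translating the two formulas above into code, using the data from the table:

x <- c(-3, -1, 2, 4, 5)
y <- c(-3, 1, 7, 11, 13)        # i.e. y = 2*x + 3
B1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
B0 <- mean(y) - B1 * mean(x)
B1; B0                          # gives 2 and 3, as computed above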
Example for Linear Regression using R:
Consider the following data set:
x = {1,2,4,3,5} and y = {1,3,3,2,5}
We use R to apply Linear Regression for the above data.
> rm(list=ls()) #removes the list of variables in the current session of R
> x<-c(1,2,4,3,5) #assigns values to x
> y<-c(1,3,3,2,5) #assigns values to y
> x;y
[1] 1 2 4 3 5
[1] 1 3 3 2 5
> graphics.off() #to clear the existing plot/s
> plot(x,y,pch=16, col="red")
> relxy<-lm(y~x)
> relxy
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x  
        0.4          0.8  
> abline(relxy,col="Blue")


> a <- data.frame(x = 7)


> a
  x
1 7
> result <- predict(relxy,a)
> print(result)
1 
6 
> #Note: you can observe that
> 0.8*7+0.4
[1] 6 #The same calculated using the line equation y= 0.8*x +0.4.
Simple linear regression is the simplest form of regression and the most studied.
Calculating B1 & B0 using Correlations and Standard Deviations:
B1 = corr(x, y) * stdev(y) / stdev(x)
B1 = Correlation(x, y) * St.Deviation(y) / St.Deviation(x)
Where cor (x,y) is the correlation between x & y and stdev() is the calculation of the standard
deviation for a variable. The same is calculated in R as follows:
> x<-c(1,2,4,3,5)
> y<-c(1,3,3,2,5)
> x;y
[1] 1 2 4 3 5
[1] 1 3 3 2 5
> B1=cor(x,y)*sd(y)/sd(x)
> B1
[1] 0.8
> B0=mean(y)-B1*mean(x)
> B0
[1] 0.4

Estimating Error: (RMSE: Root Mean Squared Error)


We can calculate the error for our predictions called the Root Mean Squared Error or RMSE.
Root Mean Square Error can be calculated by

Err = RMSE = sqrt( Σ_{i=1..n} (p_i - y_i)^2 / n )

p is the predicted value and y is the actual
value, i is the index for a specific instance, n is
the number of predictions, because we must
calculate the error across all predicted values.
Estimating the error for y = 0.8*x + 0.4:

  x   y (actual)   p (predicted)   p - y   (p - y)^2
  1   1            1.2              0.2    0.04
  2   3            2.0             -1.0    1.00
  4   3            3.6              0.6    0.36
  3   2            2.8              0.8    0.64
  5   5            4.4             -0.6    0.36

  s = sum of (p - y)^2 = 2.4
  s / n = 2.4 / 5 = 0.48
  RMSE = sqrt(s / n) = sqrt(0.48) ≈ 0.693
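
The same calculation can be written in R in a few lines, using the data and the fitted line from above:

x <- c(1, 2, 4, 3, 5)
y <- c(1, 3, 3, 2, 5)
p <- 0.8 * x + 0.4              # predicted values from the fitted line
sqrt(mean((p - y)^2))           # RMSE, approximately 0.693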


Properties and Assumptions of OLS approximation:


1. Unbiasedness:
i. The bias of an estimator is defined as the difference between its expected value and the
true value, i.e., e(y) = y_actual - y_predicted.
ii. If the bias is zero, then the estimator becomes unbiased.
iii. Unbiasedness is important only when it is combined with small variance.
2. Least Variance:
i. An estimator is best when it has the smallest or least variance
ii. Least variance property is more important when it combined with small biased.
3. Efficient estimator:
i. An estimator said to be efficient when it fulfilled both conditions.
ii. Estimator should unbiased and have least variance
4. Best Linear Unbiased Estimator (BLUE Properties):
i. An estimator is said to be BLUE when it fulfills the above properties.
ii. An estimator is BLUE if it is unbiased, has the least variance, and is a linear estimator.
5. Minimum Mean Square Error (MSE):
i. An estimator is said to be MSE estimator if it has smallest mean square error.
ii. Less difference between estimated value and True Value
6. Sufficient Estimator:
i. An estimator is sufficient if it utilizes all the information of a sample about the
True parameter.
ii. It must use all the observations of the sample.
Assumptions of OLS Regression:
1. There are random sampling of observations.
2. The conditional mean should be zero
3. There is homoscedasticity and no Auto-correlation.
4. Error terms should be normally distributed(optional)
5. The Properties of OLS estimates of simple linear regression equation is
y = B0+B1*x + µ (µ -> Error)
6. The above equation is based on the following assumptions
a. Randomness of µ
b. Mean of µ is Zero
c. Variance of µ is constant
d. The variance of µ has normal distribution
e. Error µ of different observations are independent.


Homoscedasticity vs Heteroscedasticity:

 The Assumption of homoscedasticity (meaning “same variance”) is central to linear


regression models. Homoscedasticity describes a situation in which the error term (that is,
the “noise” or random disturbance in the relationship between the independent variables
and the dependent variable) is the same across all values of the independent variables.
 Heteroscedasticity (the violation of homoscedasticity) is present when the size of the error
term differs across values of an independent variable.
 The impact of violating the assumption of homoscedasticity is a matter of degree, increasing
as heteroscedasticity increases.
 Homoscedasticity means “having the same scatter.” For it to exist in a set of data, the points
must be about the same distance from the line, as shown in the picture above.
 The opposite is heteroscedasticity (“different scatter”), where points are at widely varying
distances from the regression line.
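
A common way to inspect this assumption is to plot residuals against fitted values. A minimal R sketch, reusing the small data set from the earlier regression example:

x <- c(1, 2, 4, 3, 5)
y <- c(1, 3, 3, 2, 5)
fit <- lm(y ~ x)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals", pch = 16)
abline(h = 0, col = "blue")
# Roughly constant vertical spread suggests homoscedasticity;
# a funnel-shaped spread suggests heteroscedasticity.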
Variable Rationalization:
 The data set may have a large number of attributes. But some of those attributes can be
irrelevant or redundant. The goal of Variable Rationalization is to improve the Data
Processing in an optimal way through attribute subset selection.
 This process is to find a minimum set of attributes such that dropping of those irrelevant
attributes does not much affect the utility of data and the cost of data analysis could be
reduced.
 Mining on a reduced data set also makes the discovered pattern easier to understand. As
part of Data processing, we use the below methods of Attribute subset selection
1. Stepwise Forward Selection
2. Stepwise Backward Elimination
3. Combination of Forward Selection and Backward Elimination
4. Decision Tree Induction.
All the above methods are greedy approaches for attribute subset selection.


1. Stepwise Forward Selection: This procedure starts with an empty set of attributes as the
minimal set. The most relevant attributes are chosen (having the minimum p-value) and are
added to the minimal set. In each iteration, one attribute is added to the reduced set.
2. Stepwise Backward Elimination: Here all the attributes are considered in the initial set
of attributes. In each iteration, one attribute is eliminated from the set of attributes whose
p-value is higher than the significance level.
3. Combination of Forward Selection and Backward Elimination: The stepwise forward
selection and backward elimination are combined so as to select the relevant attributes most
efficiently. This is the most common technique, which is generally used for attribute selection.
4. Decision Tree Induction: This approach uses a decision tree for attribute selection. It
constructs a flow-chart-like structure having nodes denoting a test on an attribute. Each
branch corresponds to the outcome of the test, and leaf nodes denote a class prediction. Attributes
that are not part of the tree are considered irrelevant and hence discarded.

Model Building Life Cycle in Data Analytics:


When we come across a business analytical problem, we often proceed towards execution without
acknowledging the stumbling blocks, and try to implement the model and predict outcomes before
realizing the pitfalls. To avoid this, the data science model-building life cycle defines a clear
sequence of problem-solving steps. Let's understand every model-building step in depth.
The data science model-building life cycle includes some important steps to follow. The following
are the steps to follow to build a Data Model

1. Problem Definition
2. Hypothesis Generation
3. Data Collection
4. Data Exploration/Transformation
5. Predictive Modelling
6. Model Deployment

1. Problem Definition
 The first step in constructing a model is to
understand the industrial problem in a more comprehensive way. To identify the purpose of
the problem and the prediction target, we must define the project objectives appropriately.
 Therefore, to proceed with an analytical approach, we have to recognize the obstacles first.
Remember, excellent results always depend on a better understanding of the problem.

2. Hypothesis Generation


 Hypothesis generation is the guessing approach through which we derive some essential
data parameters that have a significant correlation with the prediction target.
 Your hypothesis research must be in-depth, looking for every perceptive of all stakeholders
into account. We search for every suitable factor that can influence the outcome.
 Hypothesis generation focuses on what you can create rather than what is available in the
dataset.

3. Data Collection
 Data collection is gathering data from relevant sources regarding the analytical problem,
then we extract meaningful insights from the data for prediction.

The data gathered must have:


 Proficiency in answer hypothesis questions.
 Capacity to elaborate on every data parameter.
 Effectiveness to justify your research.
 Competency to predict outcomes accurately.

4. Data Exploration/Transformation
 The data you collected may be in unfamiliar shapes and sizes. It may contain unnecessary
features, null values, unanticipated small values, or immense values. So, before applying
any algorithmic model to data, we have to explore it first.
 By inspecting the data, we get to understand the explicit and hidden trends in data. We find
the relation between data features and the target variable.
 Usually, a data scientist invests his 60–70% of project time dealing with data exploration
only.
 There are several sub steps involved in data exploration:
o Feature Identification:
 You need to analyze which data features are available and which ones are
not.
 Identify independent and target variables.
 Identify data types and categories of these variables.


o Univariate Analysis:
 We inspect each variable one by one. This kind of analysis depends on the
variable type whether it is categorical and continuous.
 Continuous variable: We mainly look for statistical trends like mean,
median, standard deviation, skewness, and many more in the dataset.
 Categorical variable: We use a frequency table to understand the
spread of data for each category. We can measure the counts and
frequency of occurrence of values.
o Multi-variate Analysis:
 The bi-variate analysis helps to discover the relation between two or more
variables.
 We can find the correlation in case of continuous variables and the case of
categorical, we look for association and dissociation between them.
o Filling Null Values:
   Usually, the dataset contains null values, which lower the potential of the model. For a
continuous variable, we fill these null values using the mean or mode of that specific column.
For null values present in a categorical column, we replace them with the most frequently
occurring categorical value. Remember, don't delete those rows, because you may lose
information.
5. Predictive Modeling
 Predictive modeling is a mathematical approach to create a statistical model to forecast
future behavior based on input test data.
Steps involved in predictive modeling:
 Algorithm Selection:
o When we have the structured dataset, and we want to estimate the continuous or
categorical outcome then we use supervised machine learning methodologies like
regression and classification techniques. When we have unstructured data and want
to predict the clusters of items to which a particular input test sample belongs, we
use unsupervised algorithms. An actual data scientist applies multiple algorithms to
get a more accurate model.
 Train Model:
o After assigning the algorithm and getting the data handy, we train our model using
the input data applying the preferred algorithm. It is an action to determine the
correspondence between independent variables, and the prediction targets.
 Model Prediction:


o We make predictions by giving the input test data to the trained model. We measure
the accuracy by using a cross-validation strategy or ROC curve which performs well
to derive model output for test data.

6. Model Deployment
• There is nothing better than deploying the model in a real-time environment. It helps us to
gain analytical insights into the decision-making procedure. You constantly need to update
the model with additional features for customer satisfaction.
• To predict business decisions, plan market strategies, and create personalized customer
interests, we integrate the machine learning model into the existing production domain.
• When you go through the Amazon website, you notice product recommendations based entirely
on your interests, and you can see how these services increase customer involvement. That is
how a deployed model changes the mindset of customers and convinces them to purchase the product.

Key Takeaways

SUMMARY OF DA MODEL LIFE CYCLE:

• Understand the purpose of the business analytical problem.
• Generate hypotheses before looking at data.
• Collect reliable data from well-known resources.
• Invest most of the time in data exploration to extract meaningful insights from the data.
• Choose a suitable algorithm to train the model and use test data to evaluate it (a minimal
R sketch of these steps follows below).
• Deploy the model into the production environment so it will be available to users, and
strategize to make business decisions effectively.
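
The following is a minimal R sketch of the collect / split / train / predict / evaluate steps, using the built-in mtcars data and mpg as the prediction target purely as an assumption for illustration:

set.seed(1)
idx   <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)))  # train/test split
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit  <- lm(mpg ~ wt + hp, data = train)     # train the model
pred <- predict(fit, newdata = test)        # predict on unseen (test) data

sqrt(mean((pred - test$mpg)^2))             # hold-out RMSE used to evaluate the model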


Logistic Regression:
Model Theory, Model fit Statistics, Model Construction
Introduction:
 Logistic regression is one of the most popular Machine Learning algorithms, which comes

under the Supervised Learning technique. It is used for predicting the categorical

dependent variable using a given set of independent variables.

 The outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true
or False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic

values which lie between 0 and 1.


 In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic

function, which predicts two maximum values (0 or 1).

 The curve from the logistic function indicates the likelihood of something such as whether
or not the cells are cancerous or not, a mouse is obese or not based on its weight, etc.

 Logistic regression uses the concept of predictive modeling as regression; therefore, it is

called logistic regression, but is used to classify samples; therefore, it falls under the

classification algorithm.
 In logistic regression, we use the concept of the threshold value, which defines the

probability of either 0 or 1: values above the threshold tend to class 1, and values below the
threshold tend to class 0.


Types of Logistic Regressions:
On the basis of the categories, Logistic Regression can be classified into three types:
 Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
 Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered

types of the dependent variable, such as "cat", "dogs", or "sheep"

 Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of

dependent variables, such as "low", "Medium", or "High".

Definition: Multi-collinearity:
 Multicollinearity is a statistical phenomenon in which multiple independent variables show high
correlation between each other and they are too inter-related.
 Multicollinearity also called as Collinearity and it is an undesired situation for any statistical
regression model since it diminishes the reliability of the model itself.


 If two or more independent variables are too correlated, the data obtained from the
regression will be disturbed because the independent variables are actually dependent
between each other.
Assumptions for Logistic Regression:
 The dependent variable must be categorical in nature.
 The independent variable should not have multi-collinearity.
Logistic Regression Equation:
• The logistic regression equation can be obtained from the linear regression equation. The
mathematical steps to get the logistic regression equation are given below.
• Logistic regression uses a more complex function than a straight line: the 'Sigmoid function',
also known as the 'logistic function', is used instead of a linear function.
• The hypothesis of logistic regression expects the output to be limited between 0 and 1.
Therefore linear functions fail to represent it, as they can take a value greater than 1 or less
than 0, which is not possible as per the hypothesis of logistic regression.

0 ≤ h(x) ≤ 1 --- Logistic Regression Hypothesis Expectation

Logistic Function (Sigmoid Function):


 The sigmoid function is a mathematical function used to map the predicted values to
probabilities.

 The sigmoid function maps any real value into another value within a range of 0 and 1,

and so forms an S-shaped curve.

 The value of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the "S" form.
 The below image is showing the logistic function:

Fig: Sigmoid Function Graph



The Sigmoid function can be interpreted as a probability indicating Class-1 or Class-0. So the
regression model makes its predictions as

  z = sigmoid(y) = σ(y) = 1 / (1 + e^(-y))
Hypothesis Representation
• When using linear regression, we used a formula for the line equation as:

  y = b0 + b1*x1 + b2*x2 + ... + bn*xn

• In the above equation y is the response variable, x1, x2, ..., xn are the predictor variables,
and b0, b1, b2, ..., bn are the coefficients, which are numeric constants.
• For logistic regression, we need the maximum likelihood hypothesis h(y).
• Applying the sigmoid function on y gives:

  z = σ(y) = σ(b0 + b1*x1 + b2*x2 + ... + bn*xn)

  z = σ(y) = 1 / (1 + e^-(b0 + b1*x1 + b2*x2 + ... + bn*xn))

Example for Sigmoid Function in R:


> #Example for Sigmoid Function
> y<-c(-10:10);y
[1] -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10
> z<-1/(1+exp(-y));z
 [1] 4.539787e-05 1.233946e-04 3.353501e-04 9.110512e-04 2.472623e-03 6.692851e-03 1.798621e-02 4.742587e-02
 [9] 1.192029e-01 2.689414e-01 5.000000e-01 7.310586e-01 8.807971e-01 9.525741e-01 9.820138e-01 9.933071e-01
[17] 9.975274e-01 9.990889e-01 9.996646e-01 9.998766e-01 9.999546e-01
> plot(y,z)

> rm(list=ls())
> attach(mtcars) #attaching a data set into the R environment
> input <- mtcars[,c("mpg","disp","hp","wt")]
> head(input)
mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215

Hornet Sportabout 18.7 360 175 3.440
Valiant 18.1 225 105 3.460
> #model<-lm(mpg~disp+hp+wt);model1# Show the model
> model<-glm(mpg~disp+hp+wt);model

Call: glm(formula = mpg ~ disp + hp + wt)

Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891

Degrees of Freedom: 31 Total (i.e. Null); 28 Residual


Null Deviance: 1126
Residual Deviance: 195 AIC: 158.6
> newx<-data.frame(disp=150,hp=150,wt=4) #new input for prediction
> predict(model,newx)
1
17.08791
> 37.15+(-0.000937)*150+(-0.0311)*150+(-3.8008)*4 #checking with the data newx
[1] 17.14125
y<-input[,c("mpg")]; y
z=1/(1+exp(-y));z
plot(y,z)
> y<-input[,c("mpg")]
> y
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
> z=1/(1+exp(-y));z
 [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9999994 1.0000000
 [9] 1.0000000 1.0000000 1.0000000 0.9999999 1.0000000 0.9999997 0.9999696 0.9999696
[17] 0.9999996 1.0000000 1.0000000 1.0000000 1.0000000 0.9999998 0.9999997 0.9999983
[25] 1.0000000 1.0000000 1.0000000 1.0000000 0.9999999 1.0000000 0.9999997 1.0000000
> plot(y,z)
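
Note that the glm() call above uses the default gaussian family, so it behaves like linear regression. For an actual logistic regression the binomial family is used. A minimal sketch, assuming the binary column am of mtcars (0 = automatic, 1 = manual) as the categorical target:

logit_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(logit_model)

newcar <- data.frame(wt = 2.5, hp = 120)          # a hypothetical car
predict(logit_model, newcar, type = "response")   # predicted probability between 0 and 1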


Confusion Matrix (or) Error Matrix (or) Contingency Table:


What is a Confusion Matrix?
“A Confusion matrix is an N x N matrix used for evaluating the performance of a classification

model, where N is the number of target classes. The matrix compares the actual target values
with those predicted by the machine learning model. This gives us a holistic view of how well

our classification model is performing and what kinds of errors it is making. It is a specific

table layout that allows visualization of the performance of an algorithm, typically a supervised

learning one (in unsupervised learning it is usually called a matching matrix).”


For a binary classification problem, we would have a 2 x 2 matrix as shown below with 4 values:

Let’s decipher the matrix:

 The target variable has two values: Positive or Negative


 The columns represent the actual values of the target variable
 The rows represent the predicted values of the target variable

 True Positive
 True Negative
 False Positive – Type 1 Error
 False Negative – Type 2 Error

Why we need a Confusion matrix?


 Precision vs Recall
 F1-score


Understanding True Positive, True Negative, False Positive and False Negative in a Confusion Matrix
True Positive (TP)
 The predicted value matches the actual value
 The actual value was positive and the model predicted a positive value
True Negative (TN)
 The predicted value matches the actual value
 The actual value was negative and the model predicted a negative value
False Positive (FP) – Type 1 error
 The predicted value was falsely predicted
 The actual value was negative but the model predicted a positive value
 Also known as the Type 1 error
False Negative (FN) – Type 2 error
 The predicted value was falsely predicted
 The actual value was positive but the model predicted a negative value
 Also known as the Type 2 error
To evaluate the performance of a model, we have the performance metrics called,
Accuracy, Precision, Recall & F1-Score metrics
Accuracy:

Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total observations.
 High accuracy alone does not guarantee that the model is the best one.
 Accuracy is dependable only when you have symmetric datasets, where the numbers of false positives and false negatives are almost the same.
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision:
Precision is the ratio of correctly predicted positive observations to the total predicted
positive observations.
It tells us how many of the correctly predicted cases actually turned out to be positive.
Precision = TP / (TP + FP)
 Precision is a useful metric in cases where False Positives are a higher concern than False Negatives.
 Precision is important in music or video recommendation systems, e-commerce websites, etc., where wrong results could lead to customer churn and be harmful to the business.
Recall: (Sensitivity)
Recall is the ratio of correctly predicted positive observations to the all observations in actual
class.

Recall = TP / (TP + FN)
 Recall is a useful metric in cases where False Negative trumps False Positive.
 Recall is important in medical cases where it doesn’t matter whether we raise a
false alarm but the actual positive cases should not go undetected!
F1-Score:
F1-score is a harmonic mean of Precision and Recall. It gives a combined idea about these
two metrics. It is maximum when Precision is equal to Recall.
Therefore, this score takes both false positives and false negatives into account.

F1-Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 / (1/Precision + 1/Recall)
 F1 is usually more useful than accuracy, especially if you have an uneven class
distribution.
 Accuracy works best if false positives and false negatives have similar cost.
 If the cost of false positives and false negatives are very different, it’s better to look at
both Precision and Recall.

 But there is a catch: the F1-score is harder to interpret on its own, because it does not tell us whether the classifier is maximizing precision or recall. So we use it in combination with other evaluation metrics, which gives us a complete picture of the result.


Example:
Suppose we had a classification dataset with 1000 data points. We fit a classifier on it and get

the below confusion matrix:

The different values of the Confusion matrix would be as follows:
 True Positive (TP) = 560: 560 positive class data points were correctly classified by the model.
 True Negative (TN) = 330: 330 negative class data points were correctly classified by the model.
 False Positive (FP) = 60: 60 negative class data points were incorrectly classified as belonging to the positive class by the model.
 False Negative (FN) = 50: 50 positive class data points were incorrectly classified as belonging to the negative class by the model.
This turned out to be a pretty decent classifier for our dataset considering the relatively

larger number of true positive and true negative values.


Precisely we have the outcomes represented in Confusion Matrix as:

TP = 560, TN = 330, FP = 60, FN = 50


Accuracy:
The accuracy for our model turns out to be:
Accuracy = (TP + TN) / (TP + FP + TN + FN)
         = (560 + 330) / (560 + 60 + 330 + 50) = 890 / 1000 = 0.89

Hence Accuracy is 89%...Not bad!


Precision:
Precision tells us how many of the predicted positive cases actually turned out to be positive, which indicates how reliable the model’s positive predictions are.
Precision = TP / (TP + FP) = 560 / (560 + 60) = 0.903
Recall:
Recall tells us how many of the actual positive cases we were able to predict correctly with our model.
Recall = TP / (TP + FN) = 560 / (560 + 50) = 0.918

F1-Score:
F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
         = 2 * (0.903 * 0.918) / (0.903 + 0.918) = 2 * 0.829 / 1.821 ≈ 0.91
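The same calculations can be reproduced in R. The following small sketch (not part of the original handout) simply plugs this example's confusion-matrix counts (TP = 560, TN = 330, FP = 60, FN = 50) into the formulas above:

# Computing the metrics above from the confusion-matrix counts of this example.
TP <- 560; TN <- 330; FP <- 60; FN <- 50
accuracy  <- (TP + TN) / (TP + FP + TN + FN)                 # 0.89
precision <- TP / (TP + FP)                                  # ~0.903
recall    <- TP / (TP + FN)                                  # ~0.918
f1        <- 2 * precision * recall / (precision + recall)   # ~0.91
round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 3)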

AUC (Area Under Curve) ROC (Receiver Operating Characteristics) Curves:


Performance measurement is an essential task in model evaluation, and the AUC - ROC curve is one of the most important evaluation metrics for checking any classification model’s performance. It is also written as AUROC (Area Under the Receiver Operating Characteristics). So when it comes to a classification problem, we can count on an AUC - ROC curve: whenever we need to check or visualize the performance of a classification model (including the multi-class case), we use the AUC (Area Under The Curve) ROC (Receiver Operating Characteristics) curve.

What is the AUC - ROC Curve?

AUC - ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability: it tells how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. By analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and patients without the disease.

The ROC curve is plotted with TPR against the FPR where TPR is on the y-axis and FPR is on the

x-axis.

TPR (True Positive Rate) / Recall / Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
FPR (False Positive Rate) = FP / (FP + TN) = 1 - Specificity
ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of
a classification model at all classification thresholds. This curve plots two parameters:

 True Positive Rate

 False Positive Rate

 True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows: TPR = TP / (TP + FN)

False Positive Rate (FPR) is defined as follows: FPR = FP / (FP + TN)

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification
threshold classifies more items as positive, thus increasing both False Positives and True Positives.
The following figure shows a typical ROC curve.
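The ROC figure itself is not reproduced in these notes. The following base-R sketch (with simulated labels and scores, so it is illustrative only) sweeps the classification threshold, computes TPR and FPR at each setting, plots the ROC curve and approximates the AUC with the trapezoidal rule:

# ROC curve sketch in base R using simulated labels and scores.
set.seed(1)
labels <- rbinom(200, 1, 0.5)        # 1 = positive class, 0 = negative class
scores <- labels + rnorm(200)        # higher scores for positives, on average

thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) sum(scores >= t & labels == 1) / sum(labels == 1))
fpr <- sapply(thresholds, function(t) sum(scores >= t & labels == 0) / sum(labels == 0))

plot(fpr, tpr, type = "l", xlab = "False Positive Rate (FPR)",
     ylab = "True Positive Rate (TPR)", main = "ROC curve")
abline(0, 1, lty = 2)                # diagonal = random classifier (AUC 0.5)

# AUC by the trapezoidal rule over the (FPR, TPR) points, with (0,0) and (1,1) added.
fpr_full <- c(0, fpr, 1); tpr_full <- c(0, tpr, 1)
auc <- sum(diff(fpr_full) * (head(tpr_full, -1) + tail(tpr_full, -1)) / 2)
auc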


Analytics applications to various Business Domains:


Application of Modelling in Business:
 Applications of Data Modelling can be termed as Business analytics.
 Business analytics involves the collating, sorting, processing, and studying of
business-related data using statistical models and iterative methodologies. The goal of
BA is to narrow down which datasets are useful and which can increase revenue,
productivity, and efficiency.

 Business analytics (BA) is the combination of skills, technologies, and practices used to
examine an organization's data and performance as a way to gain insights and make
data-driven decisions in the future using statistical analysis.

Although business analytics is being leveraged in most commercial sectors and industries, the

following applications are the most common.


1. Credit Card Companies
Credit and debit cards are an everyday part of consumer spending, and they are an

ideal way of gathering information about a purchaser’s spending habits, financial

situation, behaviour trends, demographics, and lifestyle preferences.


2. Customer Relationship Management (CRM)
Excellent customer relations is critical for any company that wants to retain customer

loyalty to stay in business for the long haul. CRM systems analyze important

performance indicators such as demographics, buying patterns, socio- economic


information, and lifestyle.
3. Finance
The financial world is a volatile place, and business analytics helps to extract insights

that help organizations maneuver their way through tricky terrain. Corporations turn to

business analysts to optimize budgeting, banking, financial planning, forecasting, and

portfolio management.
4. Human Resources
Business analysts help the process by poring through data that characterizes high performing candidates, such as educational background, attrition rate, the average length of employment, etc. By working with this information, business analysts help HR by forecasting the best fits between the company and candidates.
5. Manufacturing


Business analysts work with data to help stakeholders understand the things that affect

operations and the bottom line. Identifying things like equipment downtime, inventory
levels, and maintenance costs help companies streamline inventory management, risks,

and supply-chain management to create maximum efficiency.


6. Marketing
Business analysts help answer these questions and so many more, by measuring

marketing and advertising metrics, identifying consumer behaviour and the target

audience, and analyzing market trends.

*** End of Unit-3 ***


Add-ons for Unit-3

TO BE DISCUSSED:
Receiver Operating Characteristics:
ROC & AUC

Derivation for Logistic Regression:


The logistic regression model assumes that the log-odds of an observation y can be expressed
as a linear function of the K input variables x:

Here, we add the constant term b0, by setting x0 = 1. This gives us K+1 parameters. The left
hand side of the above equation is called the logit of P (hence, the name logistic regression).

Let’s take the exponent of both sides of the logit equation.

(Since ln(ab)=ln(a)+ln(b) and exp(a+b)=exp(a)exp(b).)


We can also invert the logit equation to get a new expression for P(x):


The right hand side of the top equation is the sigmoid of z, which maps the real line to the
interval (0, 1), and is approximately linear near the origin. A useful fact about P(z) is that the
derivative P'(z) = P(z) (1 – P(z)). Here’s the derivation:
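The equations referred to in this add-on were images in the original notes and are not reproduced above. As a sketch, the standard logistic-regression identities they describe are:

% Log-odds (logit) as a linear function of the inputs, with x_0 = 1:
\ln\frac{P(x)}{1 - P(x)} \;=\; b_0 x_0 + b_1 x_1 + \dots + b_K x_K \;=\; z

% Inverting the logit gives P as the sigmoid of z:
P(x) \;=\; \frac{1}{1 + e^{-z}}

% Derivative of the sigmoid:
P'(z) \;=\; \frac{e^{-z}}{(1 + e^{-z})^2}
       \;=\; \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
       \;=\; P(z)\,(1 - P(z))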

Later, we will want to take the gradient of P with respect to the set of coefficients b,
rather than z. In that case, P'(z) = P(z) (1 – P(z))z‘, where ‘ is the gradient taken with
respect to b.


UNIT - IV
Object Segmentation & Time Series Methods
Syllabus:
Object Segmentation: Regression Vs Segmentation – Supervised and Unsupervised
Learning, Tree Building – Regression, Classification, Overfitting, Pruning and Complexity,
Multiple Decision Trees etc.
Time Series Methods: Arima, Measures of Forecast Accuracy, STL approach, Extract
features from generated model as Height, Average Energy etc and analyze for prediction

Topics:
Object Segmentation:
 Supervised and Unsupervised Learning
 Segmentaion & Regression Vs Segmentation
 Regression, Classification, Overfitting,
 Decision Tree Building
 Pruning and Complexity
 Multiple Decision Trees etc.

Time Series Methods:


 Arima, Measures of Forecast Accuracy
 STL approach
 Extract features from generated model as Height Average Energy etc. and

Unit-4 Objectives:
1. To explore the Segmentaion & Regression Vs Segmentation
2. To learn the Regression, Classification, Overfitting
3. To explore Decision Tree Building, Multiple Decision Trees etc.
4. To Learn the Arima, Measures of Forecast Accuracy
5. To understand the STL approach

Unit-4 Outcomes:
After completion of this course students will be able to
1. To Describe the Segmentaion & Regression Vs Segmentation
2. To demonstrate Regression, Classification, Overfitting
3. To analyze the Decision Tree Building, Multiple Decision Trees etc.
4. To explore the Arima, Measures of Forecast Accuracy
5. To describe the STL approach


Supervised and Unsupervised Learning


Supervised Learning:
 Supervised learning is a machine learning method in which models are trained
using labeled data. In supervised learning, models need to find the mapping
function to map the input variable (X) with the output variable (Y).
 We find a relation between x & y, such that y = f(x)
 Supervised learning needs supervision to train the model, which is similar to as a
student learns things in the presence of a teacher. Supervised learning can be used
for two types of problems: Classification and Regression.

 Example: Suppose we have an image of different types of fruits. The task of our

supervised learning model is to identify the fruits and classify them accordingly. So

to identify the image in supervised learning, we will give the input data as well as
output for that, which means we will train the model by the shape, size, color, and

taste of each fruit. Once the training is completed, we will test the model by giving

the new set of fruit. The model will identify the fruit and predict the output using a

suitable algorithm.
Unsupervised Machine Learning:
 Unsupervised learning is another machine learning method in which patterns
inferred from the unlabeled input data. The goal of unsupervised learning is to find
the structure and patterns from the input data. Unsupervised learning does not
need any supervision. Instead, it finds patterns from the data by its own.
 Unsupervised learning can be used for two types of problems: Clustering and
Association.

 Example: To understand the unsupervised learning, we will use the example given
above. So unlike supervised learning, here we will not provide any supervision to

the model. We will just provide the input dataset to the model and allow the

model to find the patterns from the data. With the help of a suitable algorithm, the

model will train itself and divide the fruits into different groups according to the
most similar features between them.


The main differences between Supervised and Unsupervised learning are given below:

1. Supervised learning algorithms are trained using labeled data, whereas unsupervised learning algorithms are trained using unlabeled data.
2. A supervised learning model takes direct feedback to check whether it is predicting the correct output or not; an unsupervised learning model does not take any feedback.
3. A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in the data.
4. In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
5. The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find hidden patterns and useful insights from the unknown dataset.
6. Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
7. Supervised learning can be categorized into Classification and Regression problems; unsupervised learning can be classified into Clustering and Association problems.
8. Supervised learning is used where we know the inputs as well as the corresponding outputs; unsupervised learning is used where we have only input data and no corresponding output data.
9. A supervised learning model generally produces a more accurate result; an unsupervised learning model may give a less accurate result compared to supervised learning.
10. Supervised learning is not close to true Artificial Intelligence because we first train the model for each data point and only then can it predict the correct output; unsupervised learning is closer to true Artificial Intelligence, as it learns in a way similar to how a child learns daily routine things from experience.
11. Supervised learning includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Random Forest, Bayesian Logic, etc.; unsupervised learning includes K-means clustering, KNN (k-nearest neighbors), hierarchical clustering, anomaly detection, neural networks, Principal Component Analysis, Independent Component Analysis, the Apriori algorithm, etc.
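As a quick illustration of the two settings (a sketch added to these notes, using R's built-in iris data), the same dataset can be used in a supervised way, where the Species labels are given, and in an unsupervised way, where only the measurements are given:

# Supervised: the labels are provided, so we train a model mapping inputs to an output.
data(iris)
sup_model <- glm(as.numeric(Species == "versicolor") ~ Sepal.Length + Sepal.Width,
                 data = iris, family = binomial)
summary(sup_model)

# Unsupervised: the labels are withheld; the model only looks for structure in the inputs.
unsup_model <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
table(cluster = unsup_model$cluster, actual = iris$Species)   # compared only after the fact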


Segmentation
 Segmentation refers to the act of segmenting data according to your company’s
needs in order to refine your analyses based on a defined context. It is a
technique of splitting customers into separate groups depending on their attributes
or behavior.

 The purpose of segmentation is to better understand your customers(visitors), and


to obtain actionable data in order to improve your website or mobile app. In
concrete terms, a segment enables you to filter your analyses based on certain
elements (single or combined).

 Segmentation can be done on elements related to a visit,


as well as on elements related to multiple visits during a

studied period.

Steps:
 Define purpose – Already mentioned in the statement above
 Identify critical parameters – Some of the variables which come up in mind are
skill, motivation, vintage, department, education etc. Let us say that basis past
experience, we know that skill and motivation are most important parameters. Also,
for sake of simplicity we just select 2 variables. Taking additional variables will
increase the complexity, but can be done if it adds value.
 Granularity – Let us say we are able to classify both skill and motivation into High
and Low using various techniques.

There are two broad set of methodologies for segmentation:


 Objective (supervised) segmentation
 Non-Objective (unsupervised) segmentation


Objective Segmentation
 Segmentation to identify the type of customers who would respond to a
particular offer.
 Segmentation to identify high spenders among customers who will use the
e- commerce channel for festive shopping.
 Segmentation to identify customers who will default on their credit obligation for
a loan or credit card.

Non-Objective Segmentation

See: https://www.yieldify.com/blog/types-of-market-segmentation/
 Segmentation of the customer base to understand the specific profiles which exist
within the customer base so that multiple marketing actions can be
personalized for each segment
 Segmentation of geographies on the basis of affluence and lifestyle of people living
in each geography so that sales and distribution strategies can be formulated
accordingly.
 Hence, it is critical that the segments created on the basis of an objective
segmentation methodology must be different with respect to the stated objective
(e.g. response to an offer).
 However, in case of a non-objective methodology, the segments are different with
respect to the “generic profile” of observations belonging to each segment, but not
with regards to any specific outcome of interest.
 The most common techniques for building non-objective segmentation are cluster
analysis, K nearest neighbor techniques etc.
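A minimal R sketch of the ideas above, using hypothetical skill and motivation scores: cut() gives the rule-based High/Low granularity described earlier, and kmeans() gives a non-objective (unsupervised) segmentation of the same variables.

# Hypothetical data with the two chosen parameters: skill and motivation.
set.seed(42)
emp <- data.frame(skill = runif(100, 0, 10), motivation = runif(100, 0, 10))

# Rule-based granularity: split each parameter into Low/High at its median.
emp$skill_seg      <- cut(emp$skill,      c(-Inf, median(emp$skill),      Inf), labels = c("Low", "High"))
emp$motivation_seg <- cut(emp$motivation, c(-Inf, median(emp$motivation), Inf), labels = c("Low", "High"))
table(emp$skill_seg, emp$motivation_seg)          # the four resulting segments

# Non-objective (unsupervised) alternative: cluster analysis on the same variables.
seg_km <- kmeans(scale(emp[, c("skill", "motivation")]), centers = 4, nstart = 20)
aggregate(emp[, c("skill", "motivation")], by = list(segment = seg_km$cluster), FUN = mean)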

Regression Vs Segmentation
 Regression analysis focuses on finding a relationship between a dependent variable
and one or more independent variables.
 Predicts the value of a dependent variable based on the value of at least
one independent variable.
 Explains the impact of changes in an independent variable on the
dependent variable.
 We use linear or logistic regression technique for developing accurate models
for predicting an outcome of interest.
 Often, we create separate models for separate segments.
 Segmentation methods such as CHAID or CRT are used to judge their effectiveness.

 Creating separate model for separate segments may be time consuming and not
worth the effort. But, creating separate model for separate segments may provide
higher predictive power.

Decision Tree Classification Algorithm:

 Decision Tree is a supervised learning technique that can be used for both

classification and Regression problems, but mostly it is preferred for solving

Classification problems.

 Decision Trees usually mimic human thinking ability while making a decision, so

it is easy to understand.
 A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.

 It is a graphical representation for getting all the possible solutions to a


problem/decision based on given conditions.
 It is a tree-structured classifier, where internal nodes represent the features of a

dataset, branches represent the decision rules and each leaf node represents the
outcome.
 In a Decision tree, there are two nodes, which are the Decision Node and Leaf

Node. Decision nodes are used to make any decision and have multiple branches,

whereas Leaf nodes are the output of those decisions and do not contain any

further branches.
 Basic Decision Tree Learning Algorithm:
 Now that we know what a Decision Tree is, we’ll see how it works internally. There

are many algorithms out there which construct Decision Trees, but one of the

best is called as ID3 Algorithm. ID3 Stands for Iterative Dichotomiser 3.

There are two main types of Decision Trees:


1. Classification trees (Yes/No types)
What we’ve seen above is an example of classification tree, where the outcome was a

variable like ‘fit’ or ‘unfit’. Here the decision variable is Categorical.


2. Regression trees (Continuous data types)
Here the decision or the outcome variable is Continuous, e.g. a number like 123.

Decision Tree Terminologies


Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.

Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.

Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.

Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.
Decision Tree Representation:
 Each non-leaf node is connected to a test that splits its set of possible answers

into subsets corresponding to different test results.


 Each branch carries a particular test result's subset to another node.
 Each node is connected to a set of possible answers.
 Below diagram explains the general structure of a decision tree:

 A decision tree is an arrangement of tests that provides an appropriate


classification at every step in an analysis.

 "In general, decision trees represent a disjunction of conjunctions of constraints on


the attribute-values of instances. Each path from the tree root to a leaf corresponds
to a conjunction of attribute tests, and the tree itself to a disjunction of these
conjunctions" (Mitchell, 1997, p.53).

 More specifically, decision trees classify instances by sorting them down the tree
from the root node to some leaf node, which provides the classification of the
instance. Each node in the tree specifies a test of some attribute of the
instance,


and each branch descending from that node corresponds to one of the possible
values for this attribute.
 An instance is classified by starting at the root node of the decision tree, testing the
attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute. This process is then repeated at the
node on this branch and so on until a leaf node is reached.

Appropriate Problems for Decision Tree Learning


Decision tree learning is generally best suited to problems with the following

characteristics:
 Instances are represented by attribute-value pairs.
o There is a finite list of attributes (e.g. hair colour) and each instance
stores a value for that attribute (e.g. blonde).

o When each attribute has a small number of distinct values (e.g.


blonde, brown, red) it is easier for the decision tree to reach a useful
solution.

o The algorithm can be extended to handle real-valued attributes (e.g.


a floating point temperature)

 The target function has discrete output values.


o A decision tree classifies each example as one of the output values.
 Simplest case exists when there are only two possible classes

(Boolean classification).

 However, it is easy to extend the decision tree to produce a


target function with more than two possible output values.
o Although it is less common, the algorithm can also be extended to
produce a target function with real-valued outputs.
 Disjunctive descriptions may be required.
o Decision trees naturally represent disjunctive expressions.
 The training data may contain errors.
o Errors in the classification of examples, or in the attribute values
describing those examples are handled well by decision trees, making them
a robust learning method.

 The training data may contain missing attribute values.
o Decision tree methods can be used even when some training
examples have unknown values (e.g., humidity is known for only a
fraction of the examples).


After a decision tree learns classification rules, it can also be re-represented as a set of

if-then rules in order to improve readability.


How does the Decision Tree algorithm Work?
The decision of making strategic splits heavily affects a tree’s accuracy. The decision

criteria are different for classification and regression trees.

Decision trees use multiple algorithms to decide to split a node into two or more sub-
nodes. The creation of sub-nodes increases the homogeneity of resultant sub-nodes. In

other words, we can say that the purity of the node increases with respect to the target

variable. The decision tree splits the nodes on all available variables and then selects the

split which results in most homogeneous sub-nodes.

Tree Building: Decision tree learning is the construction of a decision tree from class-

labeled training tuples. A decision tree is a flow-chart-like structure, where each internal
(non-leaf) node denotes a test on an attribute, each branch represents the outcome of a

test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the

root node. There are many specific decision-tree algorithms. Notable ones include the

following.
ID3 → (extension of D3)
C4.5 → (successor of ID3)
CART → (Classification And Regression Tree)
CHAID → (Chi-square automatic interaction detection Performs multi-level splits when
computing classification trees)

MARS → (multivariate adaptive regression splines): Extends decision trees to handle


numerical data better
Conditional Inference Trees → Statistics-based approach that uses non-parametric tests as

splitting criteria, corrected for multiple testing to avoid over fitting.

The ID3 algorithm builds decision trees using a top-down greedy search approach through

the space of possible branches with no backtracking. A greedy algorithm, as the name

suggests, always makes the choice that seems to be the best at that moment.

In a decision tree, for predicting the class of the given dataset, the algorithm starts

from the root node of the tree. This algorithm compares the values of root attribute with
the record (real dataset) attribute and, based on the comparison, follows the branch and

jumps to the next node.


For the next node, the algorithm again compares the attribute value with the other sub-

nodes and move further. It continues the process until it reaches the leaf node of the tree.
The complete process can be better understood using the below algorithm:

 Step-1: Begin the tree with the root node, says S, which contains the complete
dataset.
 Step-2: Find the best attribute in the dataset using Attribute Selection Measure
(ASM).
 Step-3: Divide S into subsets that contain possible values for the best attribute.
 Step-4: Generate the decision tree node, which contains the best attribute.
 Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
 Step-6: Continue this process until a stage is reached where you cannot further classify the nodes; call the final node a leaf node.
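As a sketch (not from the original notes), the rpart package, which is listed later in this unit among R's decision-tree implementations, grows a classification tree in essentially this way, repeatedly choosing the best split:

# Growing a classification tree on R's built-in iris data with rpart (CART-style splits).
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)                     # the splits chosen at each node
plot(fit); text(fit)           # simple drawing of the tree structure

# Using the grown tree to classify new observations.
predict(fit, iris[c(1, 51, 101), ], type = "class")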
Entropy:
Entropy is a measure of the randomness in the information being

processed. The higher the entropy, the harder it is to draw any


conclusions from that information. Flipping a coin is an example of an

action that provides information that is random.

From the graph, it is quite evident that the entropy H(X) is zero when the probability is either 0 or 1. The entropy is maximum when the probability is 0.5, because that represents perfect randomness in the data and there is no chance of perfectly determining the outcome.
Information Gain
Information gain or IG is a statistical property that measures

how well a given attribute separates the training examples

according to their target classification. Constructing a


decision tree is all about finding an attribute that returns the

highest information gain and the smallest entropy.


ID3 follows the rule — A branch with an entropy of zero is a leaf node and A branch with
entropy more than zero needs further splitting.

Hypothesis space search in decision tree learning:



In order to derive the Hypothesis space, we compute the Entropy and Information Gain of
Class and attributes. For them we use the following statistics formulae:


Entropy of Class is:

Entropy(Class) = - (P / (P + N)) * log2( P / (P + N) ) - (N / (P + N)) * log2( N / (P + N) )

For any Attribute, the information of the partition with p_i positive and n_i negative examples is:

I(p_i, n_i) = - (p_i / (p_i + n_i)) * log2( p_i / (p_i + n_i) ) - (n_i / (p_i + n_i)) * log2( n_i / (p_i + n_i) )

Entropy of an Attribute is:

Entropy(Attribute) = Σ_i ( (p_i + n_i) / (P + N) ) * I(p_i, n_i)

Gain = Entropy(Class) - Entropy(Attribute)
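A small R sketch implementing these formulas, using the well-known Play Tennis counts for one attribute (Outlook: 2/3, 4/0 and 3/2 positive/negative examples out of P = 9, N = 5); the counts are reproduced here since the data table itself is not shown below:

# Entropy and information gain for a binary class, following the formulas above.
entropy <- function(p, n) {
  probs <- c(p, n) / (p + n)
  probs <- probs[probs > 0]          # treat 0 * log2(0) as 0
  -sum(probs * log2(probs))
}

P <- 9; N <- 5                                        # class counts
counts <- list(c(2, 3), c(4, 0), c(3, 2))             # (p_i, n_i) per attribute value

entropy_class     <- entropy(P, N)                    # ~0.940
entropy_attribute <- sum(sapply(counts, function(cn) (sum(cn) / (P + N)) * entropy(cn[1], cn[2])))
gain              <- entropy_class - entropy_attribute   # ~0.246
c(Entropy_Class = entropy_class, Entropy_Attribute = entropy_attribute, Gain = gain)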

Illustrative Example: Concept: “Play Tennis”:

Data set:


Basic algorithm for inducing a decision tree from training tuples:


Algorithm:
Generate decision tree. Generate a decision tree from the training
tuples of data
partition D.
Input:
Data partition, D, which is a set of training tuples and their
associated class labels;
attribute list, the set of candidate attributes;
Attribute selection method, a procedure to determine the splitting
criterion that “best” partitions the data tuples into individual
classes. This criterion consists of a splitting attribute
and, possibly, either a split point or splitting subset.
Output: A decision tree.
Method:
(1) create a node N;
(2) if tuples in D are all of the same class, C then
return N as a leaf node labeled with the class C;
(3) if attribute list is empty then
return N as a leaf node labeled with the majority class in D;
// majority voting
(4) apply Attribute selection method(D, attribute list) to find the
“best”
splitting criterion;
(5) Label node N with splitting criterion;
(6) if splitting attribute is discrete-valued and multiway splits
allowed
then // not restricted to binary trees
(7) attribute list= attribute list - splitting attribute
(8) for each outcome j of splitting criterion
// partition the tuples and grow subtrees for
each partition
(9) let Dj be the set of data tuples in D satisfying outcome j;
// a partition
(10) if Dj is empty then
attach a leaf labeled with the majority class in D to node N;
else
attach the node returned by Generate decision tree(Dj,
attribute list) to node N;
(11) return N;

Advantages of Decision Tree:


 Simple to understand and interpret. People are able to understand decision
tree models after a brief explanation.


 Requires little data preparation. Other techniques often require data normalization,
dummy variables need to be created and blank values to be removed.

 Able to handle both numerical and categorical data. Other techniques are usually
specialized in analysing datasets that have only one type of variable. (For example,
relation rules can be used only with nominal variables while neural networks can be
used only with numerical variables.)
 Uses a white box model. If a given situation is observable in a model the

explanation for the condition is easily explained by Boolean logic. (An example of a
black box model is an artificial neural network since the explanation for the results is

difficult to understand.)
 Possible to validate a model using statistical tests. That makes it possible to
account for the reliability of the model.

 Robust: Performs well with large datasets. Large amounts of data can be analyzed
using standard computing resources in reasonable time.
Tools used to make Decision Tree:
Many data mining software packages provide implementations of one or more

decision tree algorithms. Several examples include:


 SAS Enterprise Miner
 Matlab
 R (an open source software environment for statistical computing which includes
several CART implementations such as rpart, party and random Forest packages)

 Weka (a free and open-source data mining suite, contains many decision tree
algorithms)
 Orange (a free data mining software suite, which includes the tree module orngTree)
 KNIME
 Microsoft SQL Server
 Scikit-learn (a free and open-source machine learning library for the Python
programming language).

 Salford Systems CART (which licensed the proprietary code of the original
CART authors)
 IBM SPSS Modeler
 Rapid Miner


Multiple Decision Trees:


Classification & Regression Trees:
 Classification and regression trees is a term used to describe decision tree algorithms
that are used for classification and regression learning tasks.
 The Classification and Regression Tree (CART) methodology was introduced in 1984 by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone.

Classification Trees:
A classification tree is an algorithm where the

target variable is fixed or categorical. The

algorithm is then used to identify the “class”

within which a target variable would most


likely fall.
 An example of a classification-type

problem would be determining who will or

will not subscribe to a digital platform; or

who will or will not graduate from high

school.
 These are examples of simple binary classifications where the categorical dependent

variable can assume only one of two, mutually exclusive values.


Regression Trees
 A regression tree refers to an algorithm where the target variable is continuous and the algorithm is used to predict its value.

 As an example of a regression type

problem, you may want to predict the

selling prices of a residential house, which is


a continuous dependent variable.

 This will depend on both continuous factors

like square footage as well as categorical

factors.

Difference Between Classification and Regression Trees


 Classification trees are used when the dataset needs to be split into classes that belong
to the response variable. In many cases, the classes Yes or No.

 In other words, they are just two and mutually exclusive. In some cases, there may be

more than two classes in which case a variant of the classification tree algorithm is

used.
 Regression trees, on the other hand, are used when the response variable is

continuous. For instance, if the response variable is something like the price of a

property or the temperature of the day, a regression tree is used.

 In other words, regression trees are used for prediction-type problems while
classification trees are used for classification-type problems.

CART: CART stands for Classification And Regression Tree.


 CART algorithm was introduced in Breiman et al. (1984). A CART tree is a binary

decision tree that is constructed by splitting a node into two child nodes repeatedly,
beginning with the root node that contains the whole learning sample. The CART

22 | P a g e
DATA ANALYTICS UN I T -4
growing method attempts to maximize within-node homogeneity.

 The extent to which a node does not represent a homogenous subset of cases is an

indication of impurity. For example, a terminal node in which all cases have the
same


value for the dependent variable is a homogenous node that requires no further

splitting because it is "pure." For categorical (nominal, ordinal) dependent variables the
common measure of impurity is Gini, which is based on squared probabilities of

membership for each category. Splits are found that maximize the homogeneity of

child nodes with respect to the value of the dependent variable.

Decision tree pruning:


Pruning is a data compression technique in machine learning and search algorithms that

reduces the size of decision trees by removing sections of the tree that are non-
critical and redundant to classify instances. Pruning reduces the complexity of the final

classifier, and hence improves predictive accuracy by the reduction of overfitting.

One of the questions that arises in a decision tree algorithm is the optimal size of the final

tree. A tree that is too large risks overfitting the training data and poorly generalizing to

new samples. A small tree might not capture important structural information about the
sample space. However, it is hard to tell when a tree algorithm should stop, because it is impossible to tell whether the addition of a single extra node will dramatically decrease error. This problem is known as the horizon effect. A common

strategy is to grow the tree until each node contains a small number of instances then use
pruning to remove nodes that do not provide additional information. Pruning should

reduce the size of a learning tree without reducing predictive accuracy as measured by
a cross-


validation set. There are many techniques for tree pruning that differ in the measurement

that is used to optimize performance.

Pruning Techniques:
Pruning processes can be divided into two types: PrePruning & Post Pruning
 Pre-pruning procedures prevent a complete induction of the training set by applying a stop criterion in the induction algorithm (e.g. maximum tree depth, or information gain(Attr) > minGain). They are considered more efficient because they do not induce an entire tree: the trees remain small from the start.
 Post-Pruning (or just pruning) is the most common way of simplifying trees. Here,
nodes and subtrees are replaced with leaves to reduce complexity.

The procedures are differentiated on the basis of their approach in the tree: Top-down

approach & Bottom-Up approach

Bottom-up pruning approach:


 These procedures start at the last node in the tree (the lowest point).
 Following recursively upwards, they determine the relevance of each individual
node.
 If the relevance for the classification is not given, the node is dropped or
replaced by a leaf.
 The advantage is that no relevant sub-trees can be lost with this method.
 These methods include Reduced Error Pruning (REP), Minimum Cost Complexity
Pruning (MCCP), or Minimum Error Pruning (MEP).

Top-down pruning approach:


 In contrast to the bottom-up method, this method starts at the root of the tree.
Following the structure below, a relevance check is carried out which decides
whether a node is relevant for the classification of all n items or not.
 By pruning the tree at an inner node, it can happen that an entire sub-tree
(regardless of its relevance) is dropped.

 One of these representatives is pessimistic error pruning (PEP), which brings quite
good results with unseen items.
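As an illustration in R (a sketch, using rpart's cost-complexity pruning rather than REP or PEP specifically): a deliberately large tree is grown first and is then pruned back at the complexity parameter (cp) with the lowest cross-validated error.

# Post-pruning sketch with rpart's cost-complexity table.
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0.0, minsplit = 2))   # grow a large tree
printcp(fit)                                                    # cross-validated error per cp

best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)                             # cut back the weakest branches
printcp(pruned)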


CHAID:
 CHAID stands for CHI-squared Automatic Interaction Detector. Morgan and Sonquist
(1963) proposed a simple method for fitting trees to predict a quantitative
variable.

 Each predictor is tested for splitting as follows: sort all the n cases on the predictor
and examine all n-1 ways to split the cluster in two. For each possible split, compute
the within-cluster sum of squares about the mean of the cluster on the dependent
variable.
 Choose the best of the n-1 splits to represent the predictor’s contribution. Now do

this for every other predictor. For the actual split, choose the predictor and its cut

point which yields the smallest overall within-cluster sum of squares. Categorical

predictors require a different approach. Since categories are unordered, all possible
splits between categories must be considered. For deciding on one split of k
categories into two groups, this means that 2k-1 possible splits must be considered.

Once a split is found, its suitability is measured on the same within-cluster sum of

squares as for a quantitative predictor.

 It has to do instead with conditional discrepancies. In the analysis of variance,


interaction means that a trend within one level of a variable is not parallel to a trend
within another level of the same variable. In the ANOVA model, interaction is
represented by cross-products between predictors.

 In the tree model, it is represented by branches from the same nodes which have

different splitting predictors further down the tree. Regression trees parallel
regression/ANOVA modeling, in which the dependent variable is quantitative.

Classification trees parallel discriminant analysis and algebraic classification methods.

Kass (1980) proposed a modification to AID called CHAID for categorized dependent

and independent variables. His algorithm incorporated a sequential merge and split
procedure based on a chi-square test statistic.
 Kass’s algorithm is like sequential cross-tabulation. For each predictor:
1) cross tabulate the m categories of the predictor with the k categories of the
dependent variable.
2) find the pair of categories of the predictor whose 2xk sub-table is least

significantly different on a chi-square test and merge these two

categories.

3) if the chi-square test statistic is not “significant” according to a preset critical

value, repeat this merging process for the selected predictor until no non-
significant chi-square is found for a sub-table, and pick the predictor variable


whose chi-square is largest and split the sample into subsets, where l is the

number of categories resulting from the merging process on that


predictor.

4) Continue splitting, as with AID, until no “significant” chi-squares result. The

CHAID algorithm saves some computer time, but it is not guaranteed to find

the splits which predict best at a given step.


 Only by searching all possible category subsets can we do that. CHAID is also limited
to categorical predictors, so it cannot be used for quantitative or mixed categorical
quantitative models.

GINI Index Impurity Measure:


 GINI Index Used by the CART (classification and regression tree) algorithm, Gini

impurity is a measure of how often a randomly chosen element from the set
would be incorrectly labeled if it were randomly labeled according to the distribution

of labels in the subset. Gini impurity can be computed by summing the probability fi

of each item being chosen times the probability 1-fi of a mistake in categorizing

that item.
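Written as a formula, for class proportions f_i in a node, Gini = Σ f_i (1 - f_i) = 1 - Σ f_i². A one-line R sketch:

# Gini impurity of a node given its class proportions f_i.
gini <- function(f) sum(f * (1 - f))    # equivalently 1 - sum(f^2)

gini(c(0.5, 0.5))    # 0.50 : maximally impure two-class node
gini(c(1, 0))        # 0.00 : pure node
gini(c(0.7, 0.3))    # 0.42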

Overfitting and Underfitting

 Let’s clearly understand overfitting, underfitting and perfectly fit models.


 From the three graphs shown above, one can clearly understand that the leftmost

figure line does not cover all the data points, so we can say that the model is under-
fitted. In this case, the model has failed to generalize the pattern to the new

dataset, leading to poor performance on testing. The under-fitted model can be

easily seen as it gives very high errors on both training and testing data. This is

because the dataset is not clean and contains noise, the model has High Bias, and
the size of the training data is not enough.

 When it comes to overfitting, as shown in the rightmost graph, the model covers all the data points, and you might think this is a perfect fit. But it is not a good fit: the model learns too many details from the training dataset, including the noise. Not every detail learned during training applies to new data points, so the model performs poorly on the testing or validation dataset. This happens because the model has trained itself in a very complex manner and has high variance.

 The best fit model is shown by the middle graph, where both training and testing
(validation) loss are minimum, or we can say training and testing accuracy should
be near each other and high in value.
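The three graphs referred to above are not reproduced here. The following R sketch (synthetic data, polynomial fits of degree 1, 3 and 12) shows the same behaviour through training versus test error:

# Under-/over-fitting sketch: compare training and test error for different model complexity.
set.seed(7)
x <- runif(60, -3, 3)
y <- sin(x) + rnorm(60, sd = 0.3)
train <- sample(60, 40); test <- setdiff(1:60, train)

for (deg in c(1, 3, 12)) {
  fit  <- lm(y ~ poly(x, deg), subset = train)
  rmse <- function(idx) sqrt(mean((y[idx] - predict(fit, data.frame(x = x[idx])))^2))
  cat(sprintf("degree %2d: train RMSE %.3f, test RMSE %.3f\n", deg, rmse(train), rmse(test)))
}
# Typically: degree 1 under-fits (both errors high), degree 12 over-fits
# (low training error, high test error) and degree 3 is close to the best fit.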


Time Series Methods:


 Time series forecasting focuses on analyzing data changes across equally spaced
time intervals.

 Time series analysis is used in a wide variety of domains, ranging from

econometrics to geology and earthquake prediction; it’s also used in almost all

applied sciences and engineering.

 Time-series databases are highly popular and provide a wide spectrum of


numerous applications such as stock market analysis, economic and sales

forecasting, budget analysis, to name a few.

 They are also useful for studying natural phenomena like atmospheric pressure,
temperature, wind speeds, earthquakes, and medical prediction for treatment.
 Time series data is data that is observed at different points in time. Time Series
Analysis finds hidden patterns and helps obtain useful insights from the time series
data.
 Time Series Analysis is useful in predicting future values or detecting anomalies
from the data. Such analysis typically requires many data points to be present
in the dataset to ensure consistency and reliability.
 The different types of models and analyses that can be created through time series
analysis are:
o Classification: To Identify and assign categories to the data.
o Curve fitting: Plot the data along a curve and study the relationships of
variables present within the data.

o Descriptive analysis: Help Identify certain patterns in time-series data such


as trends, cycles, or seasonal variation.

o Explanative analysis: To understand the data and its relationships, the


dependent features, and cause and effect and its tradeoff.

o Exploratory analysis: Describe and focus on the main characteristics of the


time series data, usually in a visual format.

o Forecasting: Predicting future data based on historical trends. Using the


historical data as a model for future data and predicting scenarios that could
happen along with the future plot points.

o Intervention analysis: The Study of how an event can change the data.
o Segmentation: Splitting the data into segments to discover the underlying
properties from the source information.


Components of Time Series:


Long term trend – The smooth long term direction of time series where the data can
increase or decrease in some pattern.

Seasonal variation – Patterns of change in a time series within a year which tends to
repeat every year.

Cyclical variation – Its much alike seasonal variation but the rise and fall of time series
over periods are longer than one year.

Irregular variation – Any variation that is not explainable by any of the three above

mentioned components. They can be classified into – stationary and non – stationary
variation.

Stationary variation – When the data neither increases nor decreases, i.e. it is completely random, the variation is called stationary.
Non-stationary variation – When the data still has some explainable portion remaining and can be analyzed further, the variation is called non-stationary.

ARIMA & ARMA:


What is ARIMA?
 In time series analysis, ARIMA (AutoRegressive Integrated Moving Average) is a generalization of the autoregressive moving average (ARMA) model. These models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting).


 They are applied in cases where the data show evidence of non-stationarity, where an initial differencing step (corresponding to the "integrated" part of the model) can be applied to reduce the non-stationarity.
 A popular and very widely used statistical method for time series forecasting and analysis is the ARIMA model.


 It is a class of models that capture a spectrum of different standard temporal


structures present in time series data. By implementing an ARIMA model, you can
forecast and analyze a time series using past values, such as predicting future
prices based on historical earnings.

 Univariate models such as these are used to better understand a single time-dependent variable present in the data, such as temperature over time, and to predict future data points of that variable.
 A standard notation used for describing an ARIMA model is by its parameters p, d and q.

 Non-seasonal ARIMA models are generally denoted ARIMA(p, d, q) where


parameters p, d, and q are non-negative integers, p is the order of the
Autoregressive model, d is the degree of differencing, and q is the order of the
Moving-average model.

 The parameters are substituted with an integer value to indicate the specific ARIMA
model being used quickly. The parameters of the ARIMA model are further
described as follows:

o p: Stands for the number of lag observations included in the model, also
known as the lag order.

o d: The number of times the raw observations are differentiated, also


called the degree of differencing.

o q: Is the size of the moving average window and also called the order
of moving average.
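A minimal sketch with base R's arima() on the built-in lh series (the orders shown are illustrative, not tuned choices):

# ARIMA(p, d, q) sketch with base R on a built-in time series (lh).
fit <- arima(lh, order = c(1, 0, 0))   # p = 1 lag, d = 0 differences, q = 0 MA terms
fit                                    # estimated coefficients and AIC

fc <- predict(fit, n.ahead = 5)        # forecast the next 5 points
fc$pred                                # point forecasts
fc$se                                  # forecast standard errors

fit2 <- arima(lh, order = c(0, 1, 1))  # a differenced (d = 1) candidate model
AIC(fit, fit2)                         # compare the two candidates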
Univariate stationary processes (ARMA)
A covariance stationary process is an ARMA (p, q) process of autoregressive order p and

moving
average order q if it can be written as
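The equation itself appeared as a figure in the original notes; the standard ARMA(p, q) form it refers to is (with white-noise errors ε_t):

y_t \;=\; c \;+\; \phi_1 y_{t-1} + \dots + \phi_p y_{t-p}
        \;+\; \varepsilon_t \;+\; \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}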

The acronym ARIMA stands for Auto-Regressive Integrated Moving Average. Lags of
the stationarized series in the forecasting equation are called "autoregressive" terms,

lags of the forecast errors are called "moving average" terms, and a time series which

needs to be differenced to be made stationary is said to be an "integrated" version of


a stationary series. Random-walk and random-trend models, autoregressive models,
and exponential smoothing models are all special cases of ARIMA models.


A nonseasonal ARIMA model is classified as an "ARIMA(p,d,q)" model, where:

 p is the number of autoregressive terms,


 d is the number of nonseasonal differences needed for stationarity, and
 q is the number of lagged forecast errors in the prediction equation.

The forecasting equation is constructed as follows. First, let y denote the dth difference
of Y, which means:

If d=0: yt = Yt

If d=1: yt = Yt - Yt-1

If d=2: yt = (Yt - Yt-1) - (Yt-1 - Yt-2) = Yt - 2Yt-1 + Yt-2

Note that the second difference of Y (the d=2 case) is not the difference from 2
periods ago. Rather, it is the first-difference-of-the-first difference, which is the discrete
analog of a second derivative, i.e., the local acceleration of the series rather than its
local trend.

In terms of y, the general forecasting equation is:

ŷt = μ + ϕ1 yt-1 +…+ ϕp yt-p - θ1et-1 -…- θqet-q

Measure of Forecast Accuracy: Forecast Accuracy can be defined as the deviation of

Forecast or Prediction from the actual results.

Error = Actual demand – Forecast, i.e. e_i = A_i - F_i

We measure Forecast Accuracy by two methods:
1. Mean Forecast Error (MFE). For n time periods where we have actual demand and forecast values:
   MFE = ( Σ_{i=1..n} e_i ) / n
   Ideal value = 0; if MFE > 0, the model tends to under-forecast; if MFE < 0, the model tends to over-forecast.
2. Mean Absolute Deviation (MAD). For n time periods where we have actual demand and forecast values:
   MAD = ( Σ_{i=1..n} |e_i| ) / n
While MFE is a measure of forecast model bias, MAD indicates the absolute size of the

errors
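A small R sketch computing both measures from hypothetical actual and forecast values:

# MFE and MAD from hypothetical actuals and forecasts.
actual   <- c(102, 98, 110, 105, 97, 101)
forecast <- c(100, 100, 104, 108, 99, 100)

e   <- actual - forecast     # e_i = A_i - F_i
mfe <- mean(e)               # bias: > 0 tends to under-forecast, < 0 tends to over-forecast
mad <- mean(abs(e))          # absolute size of the errors
c(MFE = mfe, MAD = mad)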

Uses of Forecast error:
 Forecast model bias
 Absolute size of the forecast errors
 Compare alternative forecasting models


 Identify forecast models that need adjustment

Approach:

Extract, Transform and Load (ETL) refers to a process in database usage and especially in

data warehousing that:


 Extracts data from homogeneous or heterogeneous data sources
 Transforms the data for storing it in proper format or structure for querying and
analysis purpose

 Loads it into the final target (database, more specifically, operational data store,
data mart, or data warehouse)
Usually all three phases execute in parallel. Since data extraction takes time, a transformation process runs while the data is being pulled, processing the already received data and preparing it for loading; as soon as some data is ready to be loaded into the target, the data loading kicks off without waiting for the completion of the previous phases.

ETL systems commonly integrate data from multiple applications (systems), typically

developed and supported by different vendors or hosted on separate computer hardware.

The disparate systems containing the original data are frequently managed and operated

by different employees. For example, a cost accounting system may combine data from

payroll, sales, and purchasing.


Commercially available ETL tools include:
 Anatella
 Alteryx
 CampaignRunner
 ESF Database Migration Toolkit
 Informatica PowerCenter
 Talend
 IBM InfoSphere DataStage
 Ab Initio
 Oracle Data Integrator (ODI)
 Oracle Warehouse Builder (OWB)
 Microsoft SQL Server Integration Services (SSIS)

 Tomahawk Business Integrator by Novasoft Technologies.
 Pentaho Data Integration (or Kettle) opensource data integration framework
 Stambia


 Diyotta DI-SUITE for Modern Data Integration


 FlyData
 Rhino ETL
 SAP Business Objects Data Services
 SAS Data Integration Studio
 SnapLogic
 Clover ETL opensource engine supporting only basic partial functionality and
not server
 SQ-ALL - ETL with SQL queries from internet sources such as APIs
 North Concepts Data Pipeline
There are various steps involved in ETL. They are as below in detail:
Extract:
The Extract step covers the data extraction from the source system and makes it

accessible for further processing. The main objective of the extract step is to retrieve all

the required data from the source system with as little resources as possible. The extract

step should be designed in a way that it does not negatively affect the source system in

terms or performance, response time or any kind of locking.


There are several ways to perform the extract:
 Update notification - if the source system is able to provide a notification that a
record has been changed and describe the change, this is the easiest way to get the
data.

 Incremental extract - some systems may not be able to provide notification that an
update has occurred, but they are able to identify which records have been
modified and provide an extract of such records. During further ETL steps, the
system needs to identify changes and propagate it down. Note, that by using daily
extract, we may not be able to handle deleted records properly.
 Full extract - some systems are not able to identify which data has been
changed at all, so a full extract is the only way one can get the data out of the
system. The full extract requires keeping a copy of the last extract in the same
format in order to be able to identify changes. Full extract handles deletions as
well.
 When using incremental or full extracts, the extract frequency is extremely important, particularly for full extracts, where the data volumes can run into tens of gigabytes. A sketch of an incremental extract follows below.
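As an illustration of the incremental approach, the following sketch assumes the source table has a last_modified timestamp column; the table and column names are hypothetical.

import sqlite3

def incremental_extract(conn, last_run_ts):
    """Pull only the records changed since the previous extract."""
    cur = conn.execute(
        "SELECT id, customer, amount, last_modified FROM orders WHERE last_modified > ?",
        (last_run_ts,),
    )
    return cur.fetchall()

# The timestamp of the previous successful run would normally be persisted
# (e.g. in a control table) and updated only after the load completes.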


 Clean: The cleaning step is one of the most important, as it ensures the quality of the data in the data warehouse. Cleaning should apply basic data unification rules, such as:
 Making identifiers unique (sex categories Male/Female/Unknown, M/F/null,
Man/Woman/Not Available are translated to standard Male/Female/Unknown)

 Convert null values into a standardized Not Available/Not Provided value
 Convert phone numbers and ZIP codes to a standardized form
 Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str
 Validate address fields against each other (State/Country, City/State, City/ZIP code,
City/Street).
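A small cleaning sketch using pandas (assumed to be available) that mirrors the unification rules above; the column names and mappings are illustrative, not from the original notes.

import pandas as pd

df = pd.DataFrame({
    "sex":   ["M", "Woman", None, "Male"],
    "phone": ["(040) 123-4567", "040 1234567", None, "0401234567"],
})

# Standardize gender codes to Male/Female/Unknown
sex_map = {"M": "Male", "MALE": "Male", "MAN": "Male",
           "F": "Female", "FEMALE": "Female", "WOMAN": "Female"}
df["sex"] = df["sex"].str.upper().map(sex_map).fillna("Unknown")

# Replace nulls with a standard value and normalize phone numbers to digits only
df["phone"] = df["phone"].fillna("Not Available")
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True).replace("", "Not Available")

print(df)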

Transform:
 The transform step applies a set of rules to transform the data from the source to
the target.

 This includes converting any measured data to the same dimension (i.e. conformed
dimension) using the same units so that they can later be joined.

 The transformation step also requires joining data from several sources, generating
aggregates, generating surrogate keys, sorting, deriving new calculated values, and
applying advanced validation rules.
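The following sketch, again using pandas (assumed available), illustrates a transform step that joins two sources, conforms a unit, generates a surrogate key and computes an aggregate; all names and the conversion rate are illustrative.

import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "cust": ["A", "B", "A"], "amount_usd": [10.0, 20.0, 5.0]})
customers = pd.DataFrame({"cust": ["A", "B"], "country": ["IN", "US"]})

merged = orders.merge(customers, on="cust", how="left")                   # join several sources
merged["amount_inr"] = merged["amount_usd"] * 83.0                        # conform units (assumed rate)
merged["order_sk"] = range(1, len(merged) + 1)                            # generate surrogate keys
totals = merged.groupby("country", as_index=False)["amount_inr"].sum()    # generate aggregates
print(totals)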

Load:
 During the load step, it is necessary to ensure that the load is performed correctly
and with as little resources as possible. The target of the Load process is often a
database.

 In order to make the load process efficient, it is helpful to disable any constraints
and indexes before the load and enable them back only after the load
completes. Referential integrity needs to be maintained by the ETL tool to ensure consistency.
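A load-step sketch with SQLite is shown below; it drops a non-essential index before a bulk insert and recreates it afterwards inside a single transaction. The database and table names are illustrative, and a real warehouse would use the bulk-load facilities of its own engine.

import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS fact_sales(order_sk INTEGER, amount REAL)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_amount ON fact_sales(amount)")

rows = [(i, float(i) * 1.5) for i in range(10_000)]

with conn:                                           # one transaction for the whole load
    conn.execute("DROP INDEX IF EXISTS idx_amount")  # disable the index before the load
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX idx_amount ON fact_sales(amount)")  # rebuild it afterwards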

Managing ETL Process


The ETL process seems quite straightforward. As with every application, there is a possibility that the ETL process fails. This can be caused by missing extracts from one of the systems, missing values in one of the reference tables, or simply a connection or power outage. Therefore, it is necessary to design the ETL process keeping fail-recovery in mind.


Staging:
It should be possible to restart, at least, some of the phases independently from the

others. For example, if the transformation step fails, it should not be necessary to restart
the Extract step. We can ensure this by implementing proper staging. Staging means that

the data is simply dumped to the location (called the Staging Area) so that it can then

be read by the next processing phase. The staging area is also used during the ETL process to store intermediate results of processing. This is acceptable, because the ETL process uses the staging area for exactly this purpose. However, the staging area should be accessed by the ETL process only. It should never be available to anyone else, particularly not to end users, as it is not intended for data presentation and may contain incomplete or partially processed data. A small staging sketch is given below.
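A minimal staging sketch under these assumptions: each phase writes its output to a staging directory, so that on restart a phase whose staged output already exists can be skipped. The paths and phase logic are purely illustrative.

import json
from pathlib import Path

STAGING = Path("staging")
STAGING.mkdir(exist_ok=True)

def run_phase(name, func, input_path=None):
    """Run a phase only if its staged output does not already exist."""
    out = STAGING / f"{name}.json"
    if out.exists():
        return out                       # phase already completed - skip it on restart
    data = func(json.loads(input_path.read_text()) if input_path else None)
    out.write_text(json.dumps(data))
    return out

extracted = run_phase("extract", lambda _: [{"id": 1, "amount": "10"}])
transformed = run_phase("transform", lambda rows: [{**r, "amount": float(r["amount"])} for r in rows], extracted)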

*** End of Unit-4 ***

Decision Tree Example for DATA ANALYTICS (UNIT 4)

Hypothesis space search in decision tree learning:

In order to derive the Hypothesis space, we compute the Entropy and Information Gain of Class and
attributes. For them we use the following statistics formulae:

Entropy of the Class (with P positive and N negative examples) is:

Entropy(Class) = - (P / (P + N)) * log2(P / (P + N)) - (N / (P + N)) * log2(N / (P + N))

For any attribute, if its i-th value covers p_i positive and n_i negative examples:

I(p_i, n_i) = - (p_i / (p_i + n_i)) * log2(p_i / (p_i + n_i)) - (n_i / (p_i + n_i)) * log2(n_i / (p_i + n_i))

Entropy of an Attribute is:

Entropy(Attribute) = Σ_i ((p_i + n_i) / (P + N)) * I(p_i, n_i)

Information gain of the attribute is:

Gain(Attribute) = Entropy(Class) - Entropy(Attribute)
Illustrative Example:

Concept: “Play Tennis”:

Data set:

Day Outlook Temperature Humidity Wind Play Tennis


D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No


By using the said formulae, the decision tree is derived as follows:

[In the original notes, this is followed by worked figures showing the entropy and information-gain calculations for each attribute and the resulting decision tree; a small computational sketch of the same calculation is given below.]
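As a cross-check of the derivation referenced above, the following sketch applies the entropy and information-gain formulae to the Play Tennis data set using only the Python standard library.

from collections import Counter
from math import log2

# (Outlook, Temperature, Humidity, Wind, Play)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p))."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain(rows, attr_index):
    """Information gain = Entropy(Class) - Entropy(Attribute)."""
    class_entropy = entropy([r[-1] for r in rows])
    weighted = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        weighted += (len(subset) / len(rows)) * entropy(subset)   # Entropy(Attribute)
    return class_entropy - weighted

for i, name in enumerate(attributes):
    print(f"Gain({name}) = {gain(data, i):.3f}")
# Outlook has the highest gain (about 0.247), so it becomes the root of the tree.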
Syllabus:
Data Visualization: Pixel-Oriented Visualization Techniques, Geometric Projection
Visualization Techniques, Icon-Based Visualization Techniques, Hierarchical Visualization
Techniques, Visualizing Complex Data and Relations.

 Pixel-Oriented Visualization Techniques


 Geometric Projection Visualization Techniques
 Icon-Based Visualization Techniques
 Hierarchical Visualization Techniques
 Visualizing Complex Data and Relations.

Unit-5 Objectives:
1. To explore Pixel-Oriented Visualization Techniques
2. To learn Geometric Projection Visualization Techniques
3. To explore Icon-Based Visualization Techniques
4. To Learn Hierarchical Visualization Techniques
5. To understand Visualizing Complex Data and Relations

Unit-5 Outcomes:
After completion of this unit, students will be able to
1. Describe the Pixel-Oriented Visualization Techniques
2. Demonstrate Geometric Projection Visualization Techniques
3. Analyze the Icon-Based Visualization Techniques
4. Explore the Hierarchical Visualization Techniques
5. Compare the techniques for Visualizing Complex Data and Relations
Data Visualization

 Data visualization is the art and practice of gathering, analyzing, and graphically
representing empirical information.
 They are sometimes called information graphics, or even just charts and graphs.
 The goal of visualizing data is to tell the story in the data.
 Telling the story is predicated on understanding the data at a very deep level, and on gathering insight from comparisons of data points in the numbers.

Why data visualization?

 Gain insight into an information space by mapping data onto graphical primitives
 Provide a qualitative overview of large data sets
 Search for patterns, trends, structure, irregularities, and relationships among data.
 Help find interesting regions and suitable parameters for further quantitative
analysis.
 Provide a visual proof of computer representations derived.

Categorization of visualization methods


 Pixel-oriented visualization techniques
 Geometric projection visualization techniques
 Icon-based visualization techniques
 Hierarchical visualization techniques
 Visualizing complex data and relations

Pixel-Oriented Visualization Techniques


 For a data set of m dimensions, create m windows on the screen, one for each
dimension.
 The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows.
 The colors of the pixels reflect the corresponding values.
 To save space and show the connections among multiple dimensions, space filling is often done in a circle segment.
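A rough pixel-oriented sketch using matplotlib (assumed available): one window per dimension, records sorted by the first dimension, and each value shown as a coloured pixel. The data here is random and purely illustrative.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.random((500, 4))                 # 500 records, 4 dimensions
order = np.argsort(data[:, 0])              # sort records by dimension 0

fig, axes = plt.subplots(1, 4, figsize=(10, 3))
for d, ax in enumerate(axes):
    # reshape each dimension's values into a small pixel window
    ax.imshow(data[order, d].reshape(25, 20), cmap="viridis", aspect="auto")
    ax.set_title(f"dim {d}")
    ax.axis("off")
plt.show()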

Geometric Projection Visualization Techniques


Visualization of geometric transformations and projections of the data. Methods include:
 Direct visualization
 Scatterplot and scatterplot matrices
 Landscapes
 Projection pursuit technique: helps users find meaningful projections of multidimensional data
 Prosection views
 Hyperslice
 Parallel coordinates
Line Plot:
 This is the plot you will see in almost any analysis of the relationship between two variables.
 A line plot simply connects the values of a series of data points with straight lines.
 The plot may seem very simple, but it has many applications, not only in machine learning but in many other areas.
 It is used, for example, to analyze the performance of a model via the ROC-AUC curve. A minimal sketch follows below.
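A minimal line-plot sketch with matplotlib (assumed available); the monthly sales figures are made up for illustration.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [12, 15, 14, 18, 21, 19]            # illustrative values

plt.plot(months, sales, marker="o")
plt.xlabel("Month")
plt.ylabel("Sales (units)")
plt.title("Line plot of monthly sales")
plt.show()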

Bar Plot
 This is one of the most widely used plots, seen not just in data analysis but wherever there is trend analysis in many fields.
 It visualizes the data in a clear way and conveys the details to others in a straightforward manner.
 Although simple and clear, it is not used very frequently in data science applications. (A combined bar and stacked-bar sketch follows after the next subsection.)
Stacked Bar Graph:

 Unlike a Multi-set Bar Graph, which displays its bars side-by-side, Stacked Bar Graphs segment their bars. Stacked Bar Graphs are used to show how a larger
category is divided into smaller categories and what the relationship of each part has

on the total amount. There are two types of Stacked Bar Graphs:

 Simple Stacked Bar Graphs place each value for the segment after the previous one.
The total value of the bar is all the segment values added together. Ideal for
comparing the total amounts across each group/segmented bar.
 100% Stacked Bar Graphs show the percentage-of-the-whole of each group and are

plotted by the percentage of each value to the total amount in each group. This

makes it easier to see the relative differences between quantities in each group.

 One major flaw of Stacked Bar Graphs is that they become harder to read the more

segments each bar has. Also comparing each segment to each other is difficult, as

they're not aligned on a common baseline.
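A combined sketch of a simple bar plot and a stacked bar plot with matplotlib (assumed available); the product categories and values are illustrative.

import matplotlib.pyplot as plt

products = ["P1", "P2", "P3"]
online = [30, 45, 25]
in_store = [20, 15, 35]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.bar(products, online)                                    # simple bar plot
ax1.set_title("Bar plot")

ax2.bar(products, online, label="Online")                    # stacked bar plot
ax2.bar(products, in_store, bottom=online, label="In store")
ax2.set_title("Stacked bar plot")
ax2.legend()
plt.show()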


Scatter Plot

 It is one of the most commonly used plots used for visualizing simple data in Machine
learning and Data Science.

 In this plot, each point of the dataset is plotted with respect to two or three chosen features (columns).
 Scatter plots are available in both 2-D and 3-D. The 2-D scatter plot is the more common one, where we primarily try to find patterns, clusters, and the separability of the data.
 Colors can be assigned to the data points based on the target column of the dataset, i.e., we can color the data points according to their class label, as in the sketch below.
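A 2-D scatter-plot sketch with matplotlib (assumed available) that colours points by class label; the two clusters are synthetic.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
class0 = rng.normal(loc=(0, 0), scale=0.5, size=(50, 2))
class1 = rng.normal(loc=(2, 2), scale=0.5, size=(50, 2))

plt.scatter(class0[:, 0], class0[:, 1], label="class 0")
plt.scatter(class1[:, 0], class1[:, 1], label="class 1")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.legend()
plt.show()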
Box and Whisker Plot
 This plot can be used to obtain more statistical details about the data.
 The straight lines at the maximum and minimum are also called whiskers.
 Points that lie outside the whiskers are considered outliers.
 The box plot also gives us a description of the 25th, 50th and 75th percentiles (quartiles).
 With the help of a box plot, we can also determine the interquartile range (IQR), within which the bulk of the data lies.
 Box plots fall under univariate analysis, which means that we are exploring the data with only one variable. A sketch follows below.
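A box-and-whisker sketch with matplotlib (assumed available) on synthetic data that contains a few outliers.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
values = np.concatenate([rng.normal(50, 5, 200), [90, 95, 5]])  # three outliers added

plt.boxplot(values)
plt.ylabel("value")
plt.title("Box plot: median, quartiles, whiskers and outliers")
plt.show()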

Pie Chart:
A pie chart shows how categories represent parts of a whole, i.e. the composition of something. A pie chart represents numbers as percentages, and the total sum of all segments needs to equal 100%.
 Extensively used in presentations and offices, Pie Charts help show proportions and
percentages between categories, by dividing a circle into proportional segments. Each arc
length represents a proportion of each category, while the full circle represents the total
sum of all the data, equal to 100%.
Donut Chart:

 A donut chart is essentially a Pie Chart with an


area of the centre cut out. Pie Charts are
sometimes criticised for focusing readers on the

proportional areas of the slices to one another


and to the chart as a whole. This makes it tricky
to see the differences between slices, especially
when you try to compare multiple Pie Charts

together.
 A Donut Chart somewhat remedies this problem by de-emphasizing the use of the area.
Instead, readers focus more on reading the length of the arcs, rather than comparing the
proportions between slices.
 Also, Donut Charts are more space-efficient than Pie Charts because the blank space inside a
Donut Chart can be used to display information inside it.
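A sketch of a pie chart and a donut chart with matplotlib (assumed available); the donut is simply a pie drawn with a reduced wedge width, and the category values are illustrative.

import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D"]
sizes = [40, 30, 20, 10]                     # percentages summing to 100

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.pie(sizes, labels=labels, autopct="%1.0f%%")
ax1.set_title("Pie chart")

ax2.pie(sizes, labels=labels, autopct="%1.0f%%", wedgeprops={"width": 0.4})
ax2.set_title("Donut chart")
plt.show()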

Marimekko Chart:

Also known as a Mosaic Plot.


 Marimekko Charts are used to visualise categorical data over a pair of variables. In a
Marimekko Chart, both axes are variable with a percentage scale, that determines both the
width and height of each segment. So Marimekko Charts work as a kind of two-way 100%
Stacked Bar Graph. This makes it possible to detect relationships between categories and their

subcategories via the two axes.


 The main flaws of Marimekko Charts are that they can be hard to read, especially when there
are many segments. Also, it’s hard to accurately make comparisons between each segment,
as they are not all arranged next to each other along a common baseline. Therefore,
Marimekko Charts are better suited for giving a more general overview of the data.

Icon-Based Visualization Techniques


 It uses small icons to represent multidimensional data values
 Visualization of the data values as features of icons
 Typical visualization methods
o Chernoff Faces
o Stick Figures
Chernoff Faces
A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.
 The figure shows faces produced using 10 characteristics–head eccentricity,
eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth

shape, mouth size, and mouth opening. Each assigned one of 10 possible values.

Stick Figure

 A census data figure showing age, income, gender, education


 A 5-piece stick figure (1 body and 4 limbs with different angle/length)
 Age, income are indicated by position of the figure.
 Gender, education are indicated by angle/length.
 Visualization can show a texture pattern.
 2 dimensions are mapped to the display axes and the remaining dimensions are
mapped to the angle and/or length of the limbs.
Hierarchical Visualization
Circle Packing

 Circle Packing is a variation of a Treemap that uses circles instead of rectangles.

Containment within each circle represents a level in the hierarchy: each branch of

the tree is represented as a circle and its sub-branches are represented as circles
inside of it. The area of each circle can also be used to represent an additional

arbitrary value, such as quantity or file size. Colour may also be used to assign

categories or to represent another variable via different shades.

 As beautiful as Circle Packing appears, it's not as space-efficient as a Treemap, as


there's a lot of empty space within the circles. Despite this, Circle Packing actually
reveals hierarchical structure better than a Treemap.
Sunburst Diagram

 Also known as a Sunburst Chart, Ring Chart, Multi-level Pie Chart, Belt Chart, Radial
Treemap.

 This type of visualisation shows hierarchy through a series of rings, that are sliced for
each category node. Each ring corresponds to a level in the hierarchy, with the
central circle representing the root node and the hierarchy moving outwards from it.
 Rings are sliced up and divided based on their hierarchical relationship to the parent
slice. The angle of each slice is either divided equally under its parent node or can
be made proportional to a value.
 Colour can be used to highlight hierarchal groupings or specific categories.

Treemap:

 Treemaps are an alternative way of visualising the hierarchical structure of


a Tree Diagram while also displaying quantities for each category via area size. Each

category is assigned a rectangle area with their subcategory rectangles nested inside

of it.
 When a quantity is assigned to a category, its area size is displayed in proportion to
that quantity and to the other quantities within the same parent category in a part-
to-whole relationship. Also, the area size of the parent category is the total of its
subcategories. If no quantity is assigned to a subcategory, then its area is divided
equally amongst the other subcategories within its parent category.

 The way rectangles are divided and ordered into sub-rectangles is dependent on
the tiling algorithm used. Many tiling algorithms have been developed, but the
"squarified algorithm" which keeps each rectangle as square as possible is the one
commonly used.

 Ben Shneiderman originally developed Treemaps as a way of visualising a vast file

directory on a computer, without taking up too much space on the screen. This

makes Treemaps a more compact and space-efficient option for displaying

hierarchies, that gives a quick overview of the structure. Treemaps are also great at
comparing the proportions between categories via their area size.

 The downside to a Treemap is that it doesn't show the hierarchical levels as clearly as other charts that visualise hierarchical data (such as a Tree Diagram or Sunburst Diagram).
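A treemap sketch using the third-party squarify package together with matplotlib (both assumed to be installed); the folder names and sizes are illustrative.

import matplotlib.pyplot as plt
import squarify

labels = ["Docs", "Photos", "Music", "Code", "Other"]
sizes = [250, 400, 300, 120, 60]             # e.g. folder sizes in MB

squarify.plot(sizes=sizes, label=labels, pad=True)  # area of each rectangle is proportional to its size
plt.axis("off")
plt.title("Treemap of folder sizes")
plt.show()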
Visualizing Complex Data and Relations

 For a large data set of high dimensionality, it would be difficult to visualize all dimensions at the same time.
 Hierarchical visualization techniques partition all dimensions into subsets (i.e.,
subspaces).
 The subspaces are visualized in a hierarchical manner
 “Worlds-within-Worlds,” also known as n-Vision, is a representative
hierarchical visualization method.
 To visualize a 6-D data set, where the dimensions are F,X1,X2,X3,X4,X5.
 We want to observe how F changes w.r.t. other dimensions. We can fix
the X3, X4, X5 dimensions to selected values and visualize changes to F w.r.t. X1, X2.
 Most visualization techniques were mainly for numeric data.
 Recently, more and more non-numeric data, such as text and social networks, have become available.
 Many people on the Web tag various objects such as pictures, blog entries, and product reviews.

 A tag cloud is a visualization of statistics of user-generated tags.


 Often, in a tag cloud, tags are listed alphabetically or in a user-preferred order.
 The importance of a tag is indicated by font size or color.

Word Cloud:

Also known as a Tag Cloud.

 A visualisation method that displays how frequently words appear in a given body of text, by
making the size of each word proportional to its frequency. All the words are then arranged
in a cluster or cloud of words. Alternatively, the words can also be arranged in any format:

horizontal lines, columns or within a shape.


 Word Clouds can also be used to display words that have meta-data assigned to them.
For example, in a Word Cloud with all the World's country's names, the population could be
assigned to each name to determine its size.
 Colour used on Word Clouds is usually meaningless and is primarily aesthetic, but it can be
used to categorise words or to display another data variable.
 Typically, Word Clouds are used on websites or blogs to depict keyword or tag usage.
Word Clouds can also be used to compare two different bodies of text together.
 Although being simple and easy to understand, Word Clouds have some major flaws:
 Long words are emphasised over short words.
 Words whose letters contain many ascenders and descenders may receive more attention.
 They're not great for analytical accuracy, so used more for aesthetic reasons instead.
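A word-cloud sketch using the third-party wordcloud package with matplotlib (both assumed to be installed); the input text is illustrative.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = ("data visualization chart graph pixel icon hierarchy treemap "
        "sunburst scatter pie donut cloud tag data data visualization")

wc = WordCloud(width=600, height=300, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")   # word size reflects word frequency
plt.axis("off")
plt.show()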
