Preparation Material: Reference Guide for MBA Analytics
PREPARATION MATERIAL
IT INDUSTRY
Information technology (IT) is the use of computers to store, retrieve, transmit, and manipulate
data or information, often in the context of a business or other enterprise. The industry has
changed rapidly over the past few years and is expected to keep changing in the coming
years. The term "IT industry" is generally used collectively to include the Information Technology
enabled Services (ITeS) industry.
Categories include:
• IT Products
• IT Services
India is a major player in the IT Services and BPO segments. It is also a major exporter
of IT services, and the sector contributed 7.7% to the Indian GDP as of 2016-17. Although the industry had
grown continuously since its rise in India, the last 2 to 3 years have been somewhat
tough for the sector. Growth in the initial years was steady because IT was just
a support function. The industry has now become so big that it has started to
have its own ups and downs.
Let us look at some of the IT concepts from a management student’s point of view, and not just
as a technical person.
• Analytics, data science, and the big data industry in India are currently estimated to be $2.71
billion annually in revenues, growing at a healthy rate of 33.5% CAGR.
• Analytics, data science, and the big data industry in India are expected to grow seven times
in the next seven years. It is estimated to become a 20-billion-dollar industry in India by 2025.
• In terms of geographies served, almost 64% of analytics revenues in India come from
analytics exports to the USA. The Indian analytics industry currently earns almost $1.7 billion
in revenue from US firms.
• The Indian domestic market is also significant, with almost 4.7% of analytics revenues
coming from Indian firms.
• The average work experience of analytics professionals in India is 7.9 years, up from 7.7 years
last year.
• 57% of analytics professionals have a Master’s/ Post Graduation degree, which is the same
as last year.
DATA ANALYTICS
Data analytics (DA) is the process of examining data sets to draw conclusions about the information they
contain, increasingly with the aid of specialized systems and software. Data analytics
technologies and techniques are widely used in commercial industries to enable organizations
to make more-informed business decisions and by scientists and researchers to verify or
disprove scientific models, theories, and hypotheses.
Data analytics initiatives can help businesses increase revenues, improve operational
efficiency, optimize marketing campaigns and customer service efforts, respond more quickly
to emerging market trends and gain a competitive edge over rivals -- all with the ultimate goal
of boosting business performance.
ANALYTICS INDUSTRY
Analytics is used across industry sectors, and almost every industry has scope for analytics.
As with most things, however, some sectors adopted it early. For example, Finance &
Banking is the biggest sector for use of analytics, followed by Marketing & Advertising and
E-Commerce.
The data analytics market in India is growing at a fast pace, with companies and startups
offering analytics services and products catering to various industries. Different sectors have
seen different levels of penetration and adoption of analytics, and the revenue generated from these
sectors varies accordingly.
What is Business Analytics?
Decision-making is one of the important elements of transaction cost. Making the right decision about the
target market, product/service, pricing, supply chain, etc. minimizes that cost. Hence,
organizations prefer to take decisions that are backed by data and facts rather than by intuition and
one person’s knowledge of the situation.
And here analytics comes into the picture!
Business Analytics is the combination of statistical and operational research techniques,
information technology, and management strategy for framing a business problem, collecting
and analyzing data to make a decision that adds value to the organization.
3 major components of business analytics are:
1. Business Context
2. Technology
3. Data Science
Business Context: The first step of a BA project is to identify the problem and the needs of the
project, so that the organization asks the right question.
For example, an organization wants to launch a new product in the market for which consumer
behavior needs to be analyzed. Here, the business context is to identify the segment of
customers that can be targeted for the launch of a new product.
Technology: To analyze consumer behavior we need data. The data is captured, stored,
analyzed, and shared using information technology. Technology is also required to deploy the
solutions that the company decides on after analyzing the data. For example, personalized offers based
on customer loyalty scores are a solution deployed according to each customer’s association
with the organization.
Data Science: Data science is used to analyze the data using various statistical models and
machine learning algorithms. It draws on statistical and operations research techniques,
machine learning, and deep learning algorithms. For example, a logistic regression algorithm can be used
to classify customers into Loyal customers and Explorers (customers who try out new
products and services rather than sticking to one).
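As a rough illustration of the logistic regression example, the sketch below fits a one-feature model with plain gradient descent. The feature (number of distinct brands a customer has tried), the six data points, the learning rate, and all function names are hypothetical and chosen only for illustration; a real project would normally rely on a statistics or machine learning library.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=10000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every customer under the current w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

def classify(x, w, b):
    """Label a customer as Explorer when the predicted probability is at least 0.5."""
    return "Explorer" if sigmoid(w * x + b) >= 0.5 else "Loyal"

# Hypothetical training data: brands tried per customer, 1 = Explorer, 0 = Loyal.
brands_tried = [1, 2, 3, 6, 7, 8]
is_explorer = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(brands_tried, is_explorer)
print(classify(2, w, b), classify(7, w, b))
```

The fitted weight w is positive here, meaning that the more brands a customer tries, the higher the predicted probability of being an Explorer.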
There are four major types of analytics:
Descriptive Analytics: This type of data analytics includes statistical analysis of historical data
to gain insights and understand current trends using business intelligence tools. It can
also be used to forecast the demand for a product/service in the near future and be ready for it. This
includes analyzing and visualizing the data using tools like Power BI and Tableau, and languages
like R and Python.
Diagnostic Analytics: This type of analytics focuses on past performance to determine
what has happened and why. In time series analysis, diagnostic analytics can be used to
understand why sales have increased and decreased monthly. Algorithms used for classification
and regression fall under this category.
Predictive Analytics: This type of analytics is used to predict the possible outcomes in the
future using statistical models and machine learning algorithms. A very common application
of predictive analytics is sentiment analysis to determine the sentiment of the person (Positive
or negative) on a particular topic. Predictive models are built on the preliminary descriptive
analytics stage to derive outcomes.
Prescriptive Analytics: This type of analytics is used to recommend a solution based on
the insights obtained from predictive analytics. It uses a feedback system that constantly learns
and updates the relationship between the action and the outcome. An example of prescriptive
analytics is the recommendation algorithm of Spotify, which suggests songs/podcasts to users
based on the genre of songs they listened to in the past.
So, now that you know the major types of analytics, you can see how important it is to
gather data from authentic sources or target audiences.
Now, let’s discuss the basic statistical concepts that are required for the primary analysis of the
data.
Continuous and Discrete variables
DISCRETE VARIABLE
A discrete variable is a type of statistical variable that can assume only a countable number of
distinct values.
Categorical variables, whose categories may lack an inherent order, are also discrete. Count variables are the
most common case: they take separate, indivisible values, and no values
can exist in between two adjacent values, i.e., the variable does not attain all the values within its
limits. For example: -
• The number of printing mistakes in a book.
• The number of road accidents in New Delhi.
• The number of siblings of an individual.
CONTINUOUS VARIABLE
A continuous variable, as the name suggests, is a random variable that assumes all the possible
values in a continuum. Simply put, it can take any value within the given range.
A continuous variable is defined over a range of values, meaning that it can assume any value in
between the minimum and maximum value. For example: -
• Height of a person
• Age of a person
• Profit earned by the company
Distribution
A distribution describes how the data is spread around its center. There are 2 types of distributions:
1. Continuous distribution
E.g., Normal Distribution, Chi-square Distribution.
2. Discrete distribution
E.g., Binomial Distribution, Poisson Distribution, Bernoulli Distribution.
Quantitative attributes are summarized by: mean, median, maximum, minimum, standard deviation
Categorical/qualitative data are summarized by: mode, count
Let’s have an overview of these statistical terms:
Mean: Average of all the values available. It can be used to get an average value of a certain
attribute for a group of samples. Mean is used in parametric statistical tests.
For example: To analyze the average amount of Coca-Cola filled in a bottle of 100 ml based
on 150 samples of Coca-Cola bottles.
Median: It’s the statistical measure that gives the middle value of the data when sorted in
ascending order. Comparing the median with the mean gives a quick check on the shape of the data. When the
mean of all values is approximately equal to the median, the data is roughly symmetric, which is consistent
with (but does not prove) a normal distribution. The distribution of
data always plays an important role in statistical analysis and model formulation.
Mode: Mode is the most frequently occurring value of a particular attribute in a dataset. Mode
is typically used with categorical data, which is analyzed with non-parametric statistical tests such as the
chi-square test.
For example, in a survey of the most used social media channel, if more respondents chose Instagram
than any other channel, Instagram is the mode of the categorical data set. Here, the mode is
simply the response with the highest count.
Standard deviation: Standard deviation is a measure of the dispersion of data: it captures how far
the values typically lie from their mean.
For example, the GPA system used in relative grading is based on standard deviation:
a student’s grade depends on how far the student’s mark in a particular subject lies (below or above) from the average score
of the class, measured in standard deviations.
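The measures described above can be computed with Python's built-in statistics module. The sketch below uses made-up fill volumes and survey answers (all values and variable names are hypothetical) and also shows the "distance from the mean in standard deviations" idea from the grading example.

```python
import statistics

# Hypothetical data: fill volumes in ml and survey answers on the most used channel.
fill_ml = [98, 100, 101, 99, 102]
channels = ["Instagram", "Facebook", "Instagram", "YouTube", "Instagram"]

mean_fill = statistics.mean(fill_ml)      # average of all values
median_fill = statistics.median(fill_ml)  # middle value after sorting
sd_fill = statistics.stdev(fill_ml)       # sample standard deviation (divides by n - 1)
mode_channel = statistics.mode(channels)  # most frequent category

# How many standard deviations each value lies below or above the mean.
z_scores = [(x - mean_fill) / sd_fill for x in fill_ml]

print(mean_fill, median_fill, round(sd_fill, 2), mode_channel)
print([round(z, 2) for z in z_scores])
```

In this small data set the mean and median coincide, which is what one would expect from roughly symmetric data.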
Hypothesis testing:
Hypothesis testing is the statistical procedure used to compare parameters of 2 or more
samples (or of a sample and a population) using tests such as the z-test, t-test, ANOVA, and chi-square. In
simple language, a hypothesis is an educated guess. The assumption made is that the means of the two
sample groups, or of the sample and the population, are either the same or different. The
assumption that has to be proven true is taken as the alternate hypothesis.
For example, H0 (Null Hypothesis): The return from stock B is not higher than the return from stock A
H1 (Alternate): The return from stock B is higher than the return from stock A
Here, the hypothesis that the statistician wants to prove is ‘The return from stock B is higher
than the return from stock A’ based on available historical data of stock prices.
The null hypothesis is either rejected or not rejected based on the p-value obtained by performing
statistical tests.
Now let’s see what the p-value is.
The p-value is the probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is true. The
significance level associated with the test is known as alpha; the corresponding confidence level is 1 - alpha.
Alpha: Alpha is the probability of rejecting the null hypothesis when it is true (Type I error). The
lower the alpha, the better; after all, it is the probability of making an error.
Beta is the counterpart of alpha.
Beta: It is the probability of failing to reject the null hypothesis when it is not true (Type II error). 1 - beta, i.e., the ‘probability
of not making a Type II error’, is known as the power of the test.
Ideally, you would like to keep both errors as low as possible, which is practically not possible
because, for a fixed sample size, lowering one error raises the other. Hence, the commonly used values of alpha are
0.01, 0.05, and 0.10, which give a reasonable balance between alpha and beta.
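The trade-off between alpha and beta can be made concrete with a small calculation. The sketch below assumes a one-sided z-test with a known population standard deviation; the hypothesized mean of 600, the "true" mean of 640, the standard deviation of 100, and the sample size of 20 are hypothetical values picked only to show the pattern, and the function name is likewise made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def beta_one_sided_z(alpha, mu0, mu1, sigma, n):
    """Type II error of a one-sided z-test of H0: mean <= mu0 when the true mean is mu1."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)     # rejection threshold for the z statistic
    shift = (mu1 - mu0) / (sigma / sqrt(n))    # how far the true mean sits from mu0, in standard errors
    return std_normal.cdf(z_crit - shift)      # chance the statistic stays below the threshold

# Hypothetical scenario evaluated at the three common alpha values.
results = {a: beta_one_sided_z(a, mu0=600, mu1=640, sigma=100, n=20)
           for a in (0.01, 0.05, 0.10)}

for a, beta in results.items():
    print(f"alpha={a:.2f}  beta={beta:.3f}  power={1 - beta:.3f}")
```

With everything else held fixed, moving alpha from 0.10 down to 0.01 increases beta, which is why the two errors cannot both be minimized at a given sample size.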
So how is the p-value used to test the hypothesis?
When p-value > alpha, i.e., a result at least this extreme would occur under the null hypothesis more often than the
accepted probability of Type I error (rejecting the null hypothesis when true), we fail to reject the null
hypothesis.
When p-value < alpha, i.e., such a result would occur under the null hypothesis less often than the
accepted probability of Type I error (rejecting the null hypothesis when true), the null hypothesis is
rejected.
Different Statistical Tests
Z-TEST
A z-test is a statistical procedure used to test an alternative hypothesis against a null hypothesis.
It is used to determine whether a sample mean differs from a population mean, or whether two samples’ means are different,
when the population variances are known and the sample is large (n ≥ 30) or the population is normally distributed. The two-sample
version compares the means of two independent groups of samples taken from populations with known variance.
For a one-sample test:
Null: Sample mean is the same as the population mean
Alternate: Sample mean is not the same as the population mean
If the absolute value of the test statistic is lower than the critical value, we fail to reject the null hypothesis; otherwise we reject the
null hypothesis.
Understanding a One-Sample Z-Test
Let’s say we need to determine if girls on average score higher than 600 in the exam. We have
the information that the standard deviation for girls’ scores is 100. So, we collect the data of
20 girls by using random samples and record their marks. Finally, we also set our ⍺ value
(significance level) to be 0.05.
In this example:
• Mean Score for Girls is 641
• The size of the sample is 20
• The population mean is 600
• Standard Deviation for Population is 100
Putting these values into the one-sample formula z = (x̄ - μ) / (σ / √n) = (641 - 600) / (100 / √20) ≈ 1.83, we get a one-tailed p-value of 0.033,
which is less than 0.05; hence we reject the null hypothesis.
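The one-sample calculation can be reproduced with Python's standard library. This is a minimal sketch assuming a one-tailed (upper-tail) test; the function name is made up for this illustration, while the numbers are the ones listed in the example.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean, pop_mean, pop_sd, n):
    """Return the z statistic and the upper-tail p-value for H1: mean > pop_mean."""
    z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
    p = 1 - NormalDist().cdf(z)
    return z, p

z, p = one_sample_z(sample_mean=641, pop_mean=600, pop_sd=100, n=20)
alpha = 0.05
decision = "reject H0" if p < alpha else "fail to reject H0"
print(round(z, 2), round(p, 3), decision)
```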
Understanding a Two-Sample Z-Test
Here, let’s say we want to know if girls on average score more than 10 marks higher than boys. We
have the information that the standard deviation for girls’ scores is 100 and for boys’ scores is
90. We then collect the data of 20 girls and 20 boys by random sampling and record
their marks. Finally, we also set our ⍺ value (significance level) to be 0.05.
In this example:
• Mean Score for Girls (Sample Mean) is 641
• Mean Score for Boys (Sample Mean) is 613.3
• Standard Deviation for the Population of Girls is 100
• Standard deviation for the Population of Boys is 90
• Sample Size is 20 for both Girls and Boys
• Hypothesized difference between population means is 10
Putting these values into the two-sample formula z = ((x̄1 - x̄2) - d) / √(σ1²/n1 + σ2²/n2) = (27.7 - 10) / √(100²/20 + 90²/20) ≈ 0.59, we get a one-tailed p-value of
0.278, which is greater than 0.05; hence we fail to reject the null hypothesis.
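The two-sample calculation follows the same pattern. The sketch below again assumes a one-tailed test and uses a hypothetical function name; the inputs are the figures listed in the example.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, mean2, sd1, sd2, n1, n2, hypothesized_diff=0.0):
    """Return z and the upper-tail p-value for H1: (mean1 - mean2) > hypothesized_diff."""
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = ((mean1 - mean2) - hypothesized_diff) / standard_error
    p = 1 - NormalDist().cdf(z)
    return z, p

z, p = two_sample_z(mean1=641, mean2=613.3, sd1=100, sd2=90,
                    n1=20, n2=20, hypothesized_diff=10)
alpha = 0.05
decision = "reject H0" if p < alpha else "fail to reject H0"
print(round(z, 2), round(p, 3), decision)
```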
T-Test
If we have a sample size of less than 30 and do not know the population variance, then we must
use a t-test.
Understanding a One-Sample t-Test
Let’s say we want to determine if on average girls score more than 600 in the exam. We do not
have the information related to variance (or standard deviation) for girls’ scores. To perform a
t-test, we randomly collect the data of 10 girls with their marks and choose our ⍺ value
(significance level) to be 0.05 for Hypothesis Testing.
In this example:
• Mean Score for Girls is 606.8
• The size of the sample is 10
• The population mean is 600
• Standard Deviation for the sample is 13.14
Putting these values into the one-sample formula t = (x̄ - μ) / (s / √n) = (606.8 - 600) / (13.14 / √10) ≈ 1.64 with 9 degrees of freedom, we get a one-tailed p-value of about 0.07,
which is greater than 0.05; hence we fail to reject the null hypothesis and don’t have
enough evidence to support the hypothesis that on average, girls score more than 600 in the
exam.
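The one-sample t statistic can be checked with a short script. Python's standard library has no t-distribution, so this sketch approximates the upper-tail probability by numerically integrating the Student's t density with Simpson's rule; the helper names and the number of integration steps are choices made for this illustration, and a statistics package would normally provide the p-value directly.

```python
import math

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for Student's t with df degrees of freedom (Simpson's rule)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    # Half of the probability lies above 0; subtract the area between 0 and t.
    return 0.5 - total * h / 3

def one_sample_t(sample_mean, pop_mean, sample_sd, n):
    """Return the t statistic and the upper-tail p-value for H1: mean > pop_mean."""
    t = (sample_mean - pop_mean) / (sample_sd / math.sqrt(n))
    return t, t_upper_tail(t, df=n - 1)

t_stat, p_value = one_sample_t(sample_mean=606.8, pop_mean=600, sample_sd=13.14, n=10)
decision = "reject H0" if p_value < 0.05 else "fail to reject H0"
print(round(t_stat, 2), round(p_value, 2), decision)
```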
Understanding a Two-Sample t-Test
Here, let’s say we want to determine if on average, boys score more than 15 marks higher than girls in the
exam. We do not have the information related to variance (or standard deviation) for girls’
scores or boys’ scores. To perform a t-test, we randomly collect the data of 10 girls and 10 boys
with their marks. We choose our ⍺ value (significance level) to be 0.05 as the criterion for
Hypothesis Testing.
In this example:
• Mean Score for Boys is 630.1
• Mean Score for Girls is 606.8
• Hypothesized difference between population means is 15
• Standard Deviation for Boys’ score is 13.42
• Standard Deviation for Girls’ score is 13.14
Putting these values into the pooled two-sample formula t = ((x̄1 - x̄2) - d) / √(sp² (1/n1 + 1/n2)) = (23.3 - 15) / 5.94 ≈ 1.40 with 18 degrees of freedom, we get a one-tailed p-value of about
0.09, which is greater than 0.05; hence we fail to reject the null hypothesis: there is not
enough evidence that on average boys score more than 15 marks higher than girls in the exam.
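The two-sample statistic can be recomputed from the summary figures listed above. This sketch assumes a pooled-variance (equal variances) t-test with a one-tailed alternative and 10 observations per group; as before, the t-distribution tail is approximated with Simpson's rule because the standard library does not provide it, and all function names are hypothetical. With a hypothesized difference of 15, the statistic comes to about 1.40 on 18 degrees of freedom, which is below the usual one-tailed 5% critical value of 1.734.

```python
import math

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for Student's t with df degrees of freedom (Simpson's rule)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3

def pooled_two_sample_t(mean1, mean2, sd1, sd2, n1, n2, hypothesized_diff=0.0):
    """Return t, degrees of freedom, and upper-tail p for H1: (mean1 - mean2) > hypothesized_diff."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = ((mean1 - mean2) - hypothesized_diff) / standard_error
    return t, df, t_upper_tail(t, df)

t_stat, df, p_value = pooled_two_sample_t(mean1=630.1, mean2=606.8, sd1=13.42, sd2=13.14,
                                          n1=10, n2=10, hypothesized_diff=15)
decision = "reject H0" if p_value < 0.05 else "fail to reject H0"
print(round(t_stat, 2), df, round(p_value, 2), decision)
```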
MBA (ANALYTICS) PI QUESTIONS
• How did you use analytics in the marketing department of your company?
• What projects did you work on in Qlik Sense?
• Explain the clustering project which you worked on related to student performance?
Q4. Questions based on the subject of interest (Statistical Quality control, ML, and supply
chain management)
• What is a P chart?
A p-chart is an attributes control chart that is used with data collected in
subgroups of varying sizes. Because the size of the subgroup can fluctuate, it displays a proportion
rather than a count of nonconforming items. P-charts depict the evolution of a process
over time. The process attribute (or characteristic) is always expressed as a yes/no,
pass/fail, or go/no-go outcome. For example, a p-chart can be used to plot the proportion of incomplete
insurance claim forms received weekly.
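A minimal sketch of the p-chart calculation for the insurance claim example, assuming the commonly used three-sigma limits p̄ ± 3√(p̄(1 - p̄)/n) computed separately for each subgroup size. The weekly counts and the function name are hypothetical.

```python
from math import sqrt

def p_chart(nonconforming, subgroup_sizes):
    """Return the centre line and, per subgroup, (proportion, LCL, UCL, in_control)."""
    p_bar = sum(nonconforming) / sum(subgroup_sizes)  # overall proportion = centre line
    rows = []
    for d, n in zip(nonconforming, subgroup_sizes):
        sigma = sqrt(p_bar * (1 - p_bar) / n)         # limits widen for smaller subgroups
        lcl = max(0.0, p_bar - 3 * sigma)
        ucl = min(1.0, p_bar + 3 * sigma)
        p = d / n
        rows.append((p, lcl, ucl, lcl <= p <= ucl))
    return p_bar, rows

# Hypothetical weekly data: incomplete claim forms out of forms received.
incomplete = [12, 15, 9, 30]
received = [100, 120, 90, 110]

p_bar, rows = p_chart(incomplete, received)
out_of_control_weeks = [week for week, row in enumerate(rows, start=1) if not row[3]]
print(round(p_bar, 3), out_of_control_weeks)
```

In this made-up data only the fourth week falls outside its limits, which is the kind of signal the chart is meant to surface.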
• What is a Control Chart?
Control charts, also known as Shewhart charts or process-behavior charts, are a statistical process control
tool used to determine whether a manufacturing or business process is in a state of
control. They are more accurately described as the graphical device for
Statistical Process Monitoring.
• In an X-bar and R chart, which chart will be plotted first and why?
First, we plot the R chart to check that the process variability is in control, because the control limits of the
X-bar chart are calculated from the average range. If the R chart is in control, we proceed to the X-bar chart.
• What are the various assumptions of linear regression?
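Linearity (the relationship between the independent and dependent variables is linear)
Homoscedasticity (residuals have constant variance)
Normality (residuals are normally distributed)
No autocorrelation (residuals are independent of one another)
No multicollinearity (independent variables are not correlated among themselves)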
• Tell me in one word, the difference between Clustering and Linear regression.
Clustering belongs to Unsupervised learning and linear regression belongs to
supervised learning
• Why don’t we import all packages at once in Python?
Packages are collections of modules, functions, and classes. When we load all the
packages at once, names defined in different packages may clash and override one another, and
memory usage and start-up time also increase. To avoid this complexity, we generally load only the
packages that are required rather than loading everything.
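A small sketch of the name clash described above, using two standard-library modules that both define a function called sqrt. The variable names are chosen only for this illustration.

```python
import math
import cmath

from math import sqrt
from cmath import sqrt  # same name: this silently replaces math's sqrt

# The bare name now refers to cmath.sqrt, so the result is a complex number.
shadowed = sqrt(4)

# Importing only what is needed and keeping the module prefix avoids the ambiguity.
explicit_real = math.sqrt(4)
explicit_complex = cmath.sqrt(-4)

print(shadowed, explicit_real, explicit_complex)
```

The later import wins without any warning, which is why loading many packages wholesale can change the behavior of code that looks unchanged.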