
Applied Data Analysis

Assignment No. 01

Name: Saad Iftikhar


Reg. No. 2022 EMBA-507

Dr. Abdul Aziz Khan Niazi


Techniques of Data Analysis
Hypothesis Testing; z, t & chi-square test
t test: hypothesis test used when the sample size is small (commonly n < 30) and the population standard deviation is unknown.

z test: hypothesis test used for large samples, where the population standard deviation is known or the normal approximation holds.

Chi-square test: test of significance used to determine whether the difference between the observed and expected frequencies of certain observations is larger than chance alone would produce.
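As an illustration, here is a minimal Python sketch of all three tests on synthetic data, using scipy and statsmodels; the samples and frequency counts below are illustrative assumptions, not data from this assignment.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(0)

# t test: compare the means of two small samples
small_a = rng.normal(5.0, 1.0, size=15)
small_b = rng.normal(5.5, 1.0, size=15)
t_stat, t_p = stats.ttest_ind(small_a, small_b)

# z test: compare the mean of a large sample against a hypothesised value
large = rng.normal(5.0, 1.0, size=500)
z_stat, z_p = ztest(large, value=5.0)

# chi-square test: observed vs. expected frequencies (totals must match)
observed = np.array([18, 22, 20, 40])
expected = np.array([25, 25, 25, 25])
chi2_stat, chi2_p = stats.chisquare(observed, expected)

print(t_stat, t_p, z_stat, z_p, chi2_stat, chi2_p)
```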

One-way & Two-way ANOVA & MANOVA

a. ANOVA
There are two main types of ANOVA: (1) a "one-way" ANOVA compares levels (i.e. groups) of a single factor based on a single continuous response variable (e.g. comparing test scores by 'level of education') and (2) a "two-way" ANOVA compares levels of two factors for mean differences on a single continuous response variable (e.g. comparing test scores by both 'level of education' and 'zodiac sign'). In practice, you will see one-way ANOVAs more often, and when the term ANOVA is used generically it usually refers to a one-way ANOVA. Henceforth in this section, the term ANOVA refers to the one-way flavor.

b. MANOVA
The obvious difference between ANOVA and a "Multivariate Analysis of Variance" (MANOVA) is the "M", which stands for multivariate. In basic terms, a MANOVA is an ANOVA with two or more continuous response variables. Like ANOVA, MANOVA has both a one-way flavor and a two-way flavor; the number of factor variables involved distinguishes a one-way MANOVA from a two-way MANOVA.
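A short sketch of a one-way ANOVA and a one-way MANOVA on made-up group data, using scipy and statsmodels; the scores and group labels are assumptions for illustration only.

```python
import pandas as pd
from scipy import stats
from statsmodels.multivariate.manova import MANOVA

# One-way ANOVA: compare a single response across three groups
group_a = [82, 79, 88, 91, 85]
group_b = [75, 78, 80, 72, 77]
group_c = [90, 92, 95, 89, 93]
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)

# One-way MANOVA: two response variables against one factor
df = pd.DataFrame({
    "score1": group_a + group_b + group_c,
    "score2": [55, 60, 58, 62, 59, 48, 50, 52, 47, 49, 65, 68, 70, 66, 69],
    "group": ["a"] * 5 + ["b"] * 5 + ["c"] * 5,
})
manova = MANOVA.from_formula("score1 + score2 ~ group", data=df)
print(manova.mv_test())
```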
Correlation and Different Types of Regression Analysis

c. Correlation Analysis
Correlation analysis is applied to quantify the association between two continuous variables, for example between a dependent and an independent variable or between two independent variables.

d. Regression Analysis
Regression analysis refers to assessing the relationship between an outcome variable and one or more other variables. The outcome variable is known as the dependent or response variable, and the risk factors and confounders are known as predictors or independent variables. The dependent variable is denoted by "y" and the independent variables by "x" in regression analysis.

The sample correlation coefficient is estimated in correlation analysis. It ranges between -1 and +1, is denoted by r, and quantifies the strength and direction of the linear association between two variables. The correlation between two variables can be either positive, i.e. a higher level of one variable is related to a higher level of the other, or negative, i.e. a higher level of one variable is related to a lower level of the other.

The sign of the correlation coefficient shows the direction of the association; its magnitude shows the strength of the association.

For example, a correlation of r = 0.8 indicates a strong positive association between two variables, while a correlation of r = -0.3 indicates a weak negative association. A correlation near zero suggests the absence of a linear association between two continuous variables.

i. Linear Regression
Linear regression is a type of model in which the relationship between an independent variable and a dependent variable is assumed to be linear. The estimate of variable "y" is obtained from the equation y′ − ȳ = b_yx (x − x̄) … (1), and the estimate of variable "x" is obtained from the equation x′ − x̄ = b_xy (y − ȳ) … (2). The graphical representations of equations (1) and (2) are known as regression lines. These lines are obtained through the method of least squares.
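A minimal sketch of estimating r and the least-squares regression line of y on x in Python, assuming simple synthetic data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(50, 10, size=100)
y = 2.0 * x + rng.normal(0, 15, size=100)

# Correlation coefficient r: strength and direction of the linear association
r, r_p = stats.pearsonr(x, y)

# Least-squares regression line of y on x: the slope is b_yx, and the line
# passes through the point of means (x_bar, y_bar)
result = stats.linregress(x, y)
print(r, result.slope, result.intercept)
```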

ii. Polynomial Regression


It is a type of regression analysis that models the relationship between the dependent variable "y" and the independent variable "x" as non-linear. It is a special case of multiple linear regression even though it fits a non-linear curve to the data, because the model remains linear in its coefficients while the powers of x (x², x³, …) act as additional predictors. It is used when the data are correlated but the relationship between the two variables does not look linear.
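A small sketch fitting a degree-2 polynomial by least squares with numpy; the data are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-3, 3, 60)
y = 1.5 * x**2 - 2.0 * x + 0.5 + rng.normal(0, 1.0, size=60)

# Fit a degree-2 polynomial by least squares; this is effectively multiple
# linear regression on the constructed predictors x and x**2
coeffs = np.polyfit(x, y, deg=2)
y_hat = np.polyval(coeffs, x)
print(coeffs)
```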

iii. Logistic Regression


Logistic regression is a method that was first used in the field of biology in the 20th century. It is used to estimate the probability of events with mutually exclusive binary outcomes, for example happy/sad, normal/abnormal, or pass/fail. The predicted probability lies strictly between 0 and 1.
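A sketch of logistic regression with scikit-learn; the hours-studied/pass-fail setup below is a hypothetical example, not taken from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
hours_studied = rng.uniform(0, 10, size=200).reshape(-1, 1)
# Pass/fail outcome generated so that more study time raises the pass probability
pass_prob = 1 / (1 + np.exp(-(hours_studied.ravel() - 5)))
passed = (rng.uniform(size=200) < pass_prob).astype(int)

model = LogisticRegression().fit(hours_studied, passed)
# Predicted probabilities lie strictly between 0 and 1
print(model.predict_proba([[2.0], [8.0]])[:, 1])
```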

iv. Quantile Regression


Quantile regression is an econometric technique used when the necessary conditions for linear regression are not fully met. It is an extension of linear regression analysis; for example, it can be used when outliers are present in the data, since its estimates are more robust to outliers than those of linear regression.
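A minimal median (0.5 quantile) regression sketch using statsmodels, assuming synthetic data with heavy-tailed noise to mimic outliers.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=300)
y = 3.0 * x + rng.standard_t(df=2, size=300) * 5  # heavy-tailed noise with outliers
df = pd.DataFrame({"x": x, "y": y})

# Median (0.5 quantile) regression is less sensitive to outliers than OLS
median_fit = smf.quantreg("y ~ x", df).fit(q=0.5)
print(median_fit.params)
```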

Factor Analysis (CFA, EFA & PCA)


Exploratory Factor Analysis (EFA) is often referred to as Factor Analysis (FA) or as common factor analysis (not abbreviated as CFA, which denotes confirmatory factor analysis), and should be differentiated from its close ally, Principal Components Analysis (PCA). While all of these "explore" (hence "exploratory") the interrelationships among several variables to explain them in terms of their common underlying dimensions (factors), there is a subtle but very important difference between EFA and PCA: EFA models only the shared (common) variance of the variables, whereas PCA summarizes their total variance.
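A brief sketch contrasting PCA and factor analysis in scikit-learn on simulated data with two underlying dimensions; the loadings and noise levels are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(5)
# Two latent dimensions driving six observed variables, plus unique noise
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + rng.normal(scale=0.5, size=(200, 6))

# PCA summarizes total variance; FA models only the shared (common) variance
pca = PCA(n_components=2).fit(X)
fa = FactorAnalysis(n_components=2).fit(X)
print(pca.explained_variance_ratio_)
print(fa.components_)
```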

Interpretive Structural Modeling


Interpretive Structural Modeling (ISM) is used for identifying and summarizing relationships among specific variables that define a problem or an issue. It is an interactive learning process. ISM is used to identify and rank the variables, to establish the interrelationships among them, and to discuss the managerial implications of the research.

Grey Relational Analysis


Grey relational analysis is a method of calculating the grey relational degree and determining the contribution of each factor to the main behavior of the system, or the degree of influence between system factors. The measure of correlation between two factors or between two systems is called the grey correlation degree. The trend, size and speed of change of the grey correlation degree reflect the relative change of the factors in the process of system development. When the relative changes of two factors or systems follow basically the same trend, the two factors have a greater degree of grey correlation; otherwise, they have a smaller degree of grey correlation.

RIDIT Analysis
In statistics, ridit scoring is a statistical method used to analyze ordered qualitative measurements. The
tools of ridit analysis were developed and first applied by Bross, who coined the term "ridit" by analogy
with other statistical transformations such as probit and logit. A ridit describes how the distribution of the
dependent variable in row i of a contingency table compares relative to an identified distribution (e.g., the
marginal distribution of the dependent variable).

Data Envelopment Analysis


DEA is a linear programming methodology that empirically quantifies the relative efficiency of multiple similar entities or decision-making units (DMUs) (Cooper et al., 2007). A DMU is a homogeneous entity responsible for the conversion of inputs into outputs. To carry out a DEA study, a matrix composed of the inputs, outputs, and complementary elements of the sample of DMUs is required. Once the DEA model has been formulated according to a set of features such as metrics and orientation, the matrix is fed into the model to be solved, thus obtaining as main results relative efficiency scores and operational benchmarks for each DMU.

Artificial Neural Networks
An artificial neural network (ANN) is an attempt to simulate the network of neurons that makes up a human brain, so that the computer is able to learn things and make decisions in a humanlike manner. ANNs are created by programming regular computers to behave as though they are interconnected brain cells. Artificial neural networks are created to digitally mimic the human brain. They are currently used for complex analyses in various fields, ranging from medicine to engineering, and such networks can be used to design the next generation of computers.
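A small sketch of a feed-forward neural network classifier using scikit-learn's MLPClassifier on synthetic data; the layer sizes and the data are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic classification data stands in for a real-world problem
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small feed-forward network: two hidden layers of interconnected "neurons"
net = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000, random_state=0)
net.fit(X_train, y_train)
print(net.score(X_test, y_test))
```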

Conjoint Analysis
Conjoint analysis is a research technique used to quantify how people value the individual features of a
product or service. A conjoint survey question shows respondents a set of concepts, asking them to
choose or rank the most appealing ones.

Conjoint analysis is a survey-based statistical technique used in market research that helps determine how people value the different attributes (features, functions, benefits) that make up an individual product or service.

The objective of conjoint analysis is to determine what combination of a limited number of attributes is most influential on respondent choice or decision making. A controlled set of potential products or services is shown to survey respondents, and by analyzing how they make choices among these products, the implicit valuation of the individual elements making up the product or service can be determined. These implicit valuations (utilities or part-worths) can be used to create market models that estimate market share, revenue and even profitability of new designs.

Canonical Correlation
Canonical correlation analysis is used to identify and measure the associations between two sets of variables. Canonical correlation is appropriate in the same situations where multiple regression would be, but where there are multiple intercorrelated outcome variables. Canonical correlation analysis determines a set of canonical variates, orthogonal linear combinations of the variables within each set, that best explain the variability both within and between sets.

Canonical correlation analysis is a method for exploring the relationships between two multivariate sets
of variables (vectors), all measured on the same individual.

Consider, as an example, variables related to exercise and health. On one hand, you have variables
associated with exercise, observations such as the climbing rate on a stair stepper, how fast you can run a
certain distance, the amount of weight lifted on bench press, the number of push-ups per minute, etc. On
the other hand, you have variables that attempt to measure overall health, such as blood pressure,
cholesterol levels, glucose levels, body mass index, etc. Two types of variables are measured and the
relationships between the exercise variables and the health variables are of interest.
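A minimal sketch of canonical correlation analysis with scikit-learn, using simulated stand-ins for the exercise and health variable sets described above.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(6)
n = 200
# Exercise-type measurements (e.g. run time, push-ups, weight lifted)
fitness = rng.normal(size=(n, 3))
# Health-type measurements (e.g. blood pressure, cholesterol, glucose, BMI)
health = fitness @ rng.normal(size=(3, 4)) + rng.normal(scale=1.0, size=(n, 4))

# Find pairs of canonical variates, one per set, with maximal correlation
cca = CCA(n_components=2).fit(fitness, health)
U, V = cca.transform(fitness, health)
print([np.corrcoef(U[:, k], V[:, k])[0, 1] for k in range(2)])
```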

Co-Integration
Co-integration tests identify scenarios where two or more non-stationary time series are integrated together in such a way that they cannot deviate from equilibrium in the long term. The tests are used to identify the degree of sensitivity of two variables to the same average price over a specified period of time. In other words, co-integration is a technique used to find a possible long-run relationship between time series processes. The most popular co-integration tests include the Engle-Granger test, the Johansen test, and the Phillips-Ouliaris test.
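A small sketch of the Engle-Granger co-integration test from statsmodels, applied to two simulated random walks that share a common trend; the data are invented for illustration.

```python
import numpy as np
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(7)
# Two non-stationary (random-walk) series sharing a common stochastic trend
common_trend = np.cumsum(rng.normal(size=500))
series_a = common_trend + rng.normal(scale=1.0, size=500)
series_b = 0.8 * common_trend + rng.normal(scale=1.0, size=500)

# Engle-Granger test: a small p-value suggests the series are co-integrated
t_stat, p_value, crit_values = coint(series_a, series_b)
print(t_stat, p_value)
```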

Multi Criteria Decision Making


Multi-criteria decision making (MCDM) also referred to as multiple criteria decision analysis (MCDA), is
a research area that involves the analysis of various available choices in a situation or research area which
spans daily life, social sciences, engineering, medicine, and many other areas. MCDM is one of the most
popular decision-making tools utilized in various fields.

MCDM analyses the available options to determine whether each one is a favorable or unfavorable choice for a particular application. It also compares each option, based on the selected criteria, against every other available option, in an attempt to assist the decision maker in selecting the option with minimal compromise and maximum advantage. The criteria used in these analyses can be either qualitative or quantitative.

Data Mining
Data mining is the process of analyzing a large batch of information to discern trends and patterns. Data
mining can be used by corporations for everything from learning about what customers are interested in
or want to buy to fraud detection and spam filtering.

Data mining programs break down patterns and connections in data based on what information users
request or provide. Social media companies use data mining techniques to commodify their users in order
to generate profit. This use of data mining has come under criticism lately as users are often unaware of
the data mining happening with their personal information, especially when it is used to influence
preferences.

Cluster Analysis
Cluster analysis is a multivariate data mining technique whose goal is to group objects (e.g., products, respondents, or other entities) based on a set of user-selected characteristics or attributes. It is a basic and important step of data mining and a common technique for statistical data analysis, used in many fields such as data compression, machine learning, pattern recognition, information retrieval, etc.
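A minimal k-means clustering sketch with scikit-learn on synthetic two-attribute data; the choice of three clusters is an illustrative assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic "respondents" described by two attributes, in three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])
print(kmeans.cluster_centers_)
```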
Multi-Dimensional Scaling
Multidimensional scaling is a visual representation of distances or dissimilarities between sets of objects.
“Objects” can be colors, faces, map coordinates, political persuasion, or any kind of real or conceptual
stimuli (Kruskal and Wish, 1978). Objects that are more similar (or have shorter distances) are closer
together on the graph than objects that are less similar (or have longer distances). As well as interpreting dissimilarities as distances on a graph, MDS can also serve as a dimension reduction technique for high-dimensional data (Buja et al., 2007).

The term scaling comes from psychometrics, where abstract concepts (“objects”) are assigned numbers
according to a rule (Trochim, 2006). For example, you may want to quantify a person’s attitude to global
warming. You could assign a “1” to “doesn’t believe in global warming”, a 10 to “firmly believes in
global warming” and a scale of 2 to 9 for attitudes in between. You can also think of “scaling” as the fact
that you’re essentially scaling down the data (i.e. making it simpler by creating lower-dimensional data).
Data that is scaled down in dimension keeps similar properties. For example, two data points that are
close together in high-dimensional space will also be close together in low-dimensional space (Martinez,
2005). The “multidimensional” part is due to the fact that you aren’t limited to two dimensional graphs or
data. Three-dimensional, four-dimensional and higher plots are possible.

Correspondence Analysis
Correspondence analysis reveals the relative relationships between and within two groups of variables,
based on data given in a contingency table. For brand perceptions, these two groups are brands and the
attributes that apply to these brands. For example, let’s say a company wants to learn which attributes
consumers associate with different brands of beverage products. Correspondence analysis helps measure
similarities between brands and the strength of brands in terms of their relationships with different
attributes. Understanding the relative relationships allows brand owners to pinpoint the effects of previous
actions on different brand related attributes, and decide on next steps to take.

Correspondence analysis is valuable in brand perceptions for a couple of reasons. When attempting to
look at relative relationships between brands and attributes, brand size can have a misleading effect;
correspondence analysis removes this effect. Correspondence analysis also gives an intuitive quick view
of brand attribute relationships (based on proximity and distance from origin) that isn’t provided by many
other graphs.

Time Series Analysis


Time series analysis is a specific way of analyzing a sequence of data points collected over an interval of
time. In time series analysis, analysts record data points at consistent intervals over a set period of time
rather than just recording the data points intermittently or randomly. However, this type of analysis is not
merely the act of collecting data over time.

What sets time series data apart from other data is that the analysis can show how variables change over
time. In other words, time is a crucial variable because it shows how the data adjusts over the course of
the data points as well as the final results. It provides an additional source of information and a set order
of dependencies between the data.
Time series analysis typically requires a large number of data points to ensure consistency and reliability.
An extensive data set ensures you have a representative sample size and that analysis can cut through
noisy data. It also ensures that any trends or patterns discovered are not outliers and can account for
seasonal variance. Additionally, time series data can be used for forecasting—predicting future data based
on historical data.
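A small sketch decomposing a simulated monthly series into trend, seasonal and residual components with statsmodels; the trend and seasonal pattern are invented for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(9)
# Monthly data points recorded at consistent intervals: trend + seasonality + noise
index = pd.date_range("2015-01-01", periods=96, freq="MS")
values = (np.linspace(100, 160, 96)
          + 10 * np.sin(2 * np.pi * np.arange(96) / 12)
          + rng.normal(scale=3.0, size=96))
series = pd.Series(values, index=index)

# Split the series into trend, seasonal, and residual components
decomposition = seasonal_decompose(series, model="additive", period=12)
print(decomposition.trend.dropna().head())
```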

Econometric Analysis
Econometrics is the use of statistical methods to develop theories or test existing hypotheses in economics
or finance. Econometrics relies on techniques such as regression models and null hypothesis testing.
Econometrics can also be used to try to forecast future economic or financial trends.

As with other statistical tools, econometricians should be careful not to infer a causal relationship from
statistical correlation. Some economists have criticized the field of econometrics for prioritizing statistical
models over economic reasoning.

Data Stationarity
A common assumption in many time series techniques is that the data are stationary.

A stationary process has the property that the mean, variance and autocorrelation structure do not change
over time. Stationarity can be defined in precise mathematical terms, but for our purposes we mean a flat-looking series, without trend, with constant variance over time, a constant autocorrelation structure over time and no periodic fluctuations (seasonality).

For practical purposes, stationarity can usually be determined from a run sequence plot.
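Beyond the run sequence plot, one formal check is the Augmented Dickey-Fuller test; here is a minimal sketch on simulated stationary and non-stationary series.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(10)
random_walk = np.cumsum(rng.normal(size=500))   # non-stationary
white_noise = rng.normal(size=500)              # stationary

# Augmented Dickey-Fuller test: a small p-value rejects the unit-root
# (non-stationarity) null hypothesis
for name, series in [("random walk", random_walk), ("white noise", white_noise)]:
    adf_stat, p_value, *_ = adfuller(series)
    print(name, adf_stat, p_value)
```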

Granger Causality
If you’ve explored the vector autoregressive literature, it is likely that you have come across the term
Granger causality. Granger causality is an econometric test used to verify the usefulness of one variable to
forecast another.

A variable is said to:

 Granger-cause another variable if it is helpful for forecasting the other variable.


 Fail to Granger-cause if it is not helpful for forecasting the other variable.

At this point, you may be asking yourself what it means for a variable to be "helpful" in forecasting. In simple terms, a variable is "helpful" for forecasting if, when added to the forecast model, it reduces the forecasting error.

In the context of the vector autoregressive models, a variable fails to Granger-cause another variable if its:

 Lags are not statistically significant in the equation for another variable.
 Past values aren’t significant in predicting the future values of another.
Vector Error Correction Model / Vector Auto Regression Model (VEC/VAR)
The vector autoregressive (VAR) model is a general framework used to describe the dynamic
interrelationship among stationary variables. So, the first step in time-series analysis should be to
determine whether the levels of the data are stationary. If not, take the first differences of the series and
try again. Usually, if the levels (or log-levels) of your time series are not stationary, the first differences
will be.

If the time series are not stationary, then the VAR framework needs to be modified to allow consistent estimation of the relationships among the series. The vector error correction (VEC) model is a special case of the VAR for variables that are stationary in their differences (i.e., integrated of order one, I(1)). The VEC model also takes into account any co-integrating relationships among the variables.
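A small sketch fitting a VAR to the first differences of two simulated non-stationary series with statsmodels; the variable names and lag order are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(12)
n = 300
# Two related series; differencing the (non-stationary) levels makes them stationary
shocks = rng.normal(size=(n, 2))
levels = np.cumsum(shocks @ np.array([[1.0, 0.4], [0.3, 1.0]]), axis=0)
df = pd.DataFrame(levels, columns=["gdp", "consumption"])

diffs = df.diff().dropna()            # first differences of the levels
var_model = VAR(diffs).fit(maxlags=2)
print(var_model.summary())

# Forecast the next 4 periods from the last observed lags
forecast = var_model.forecast(diffs.values[-var_model.k_ar:], steps=4)
print(forecast)
```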

Delphi Technique
The Delphi method is a process used to arrive at a group opinion or decision by surveying a panel of
experts. Experts respond to several rounds of questionnaires, and the responses are aggregated and shared
with the group after each round.

The experts can adjust their answers each round, based on how they interpret the “group response”
provided to them. The ultimate result is meant to be a true consensus of what the group thinks.

Game Theory
Game theory is a theoretical framework for conceiving of social situations among competing players. The intention of game theory is to produce optimal decision-making by independent and competing actors in a strategic setting. Using game theory, real-world scenarios for situations such as pricing competition and product releases (and many more) can be laid out and their outcomes predicted.

Scenarios include the prisoner's dilemma and the dictator game among many others. Different types of
game theory include cooperative/non-cooperative, zero-sum/non-zero-sum, and simultaneous/sequential.

Formal Logic
Formal logic is the abstract study of propositions, statements, or assertively used sentences and of deductive arguments. The discipline abstracts from the content of these elements the structures or logical forms that they embody. The logician customarily uses a symbolic notation to express such structures clearly and unambiguously and to enable manipulations and tests of validity to be more easily applied.
Discrete Mathematics
Discrete mathematics is the study of mathematical structures that are countable or otherwise distinct and
separable. Examples of structures that are discrete are combinations, graphs, and logical statements.
Discrete structures can be finite or infinite. Discrete mathematics is in contrast to continuous
mathematics, which deals with structures which can range in value over the real numbers, or have some
non-separable quality.

Thematic Content Analysis


Thematic analysis is a method of analyzing qualitative data. It is usually applied to a set of texts, such as interview transcripts. The researcher closely examines the data to identify common themes: topics, ideas and patterns of meaning that come up repeatedly.

There are various approaches to conducting thematic analysis, but the most common form follows a six-
step process: familiarization, coding, generating themes, reviewing themes, defining and naming themes,
and writing up.

Cybernetic Modeling
Cybernetics is a control theory as it is applied to complex systems. Cybernetics is associated with models
in which a monitor compares what is happening to a system at various sampling times with some standard
of what should be happening, and a controller adjusts the system’s behavior accordingly.

Simulations
A simulation is the imitation of the operation of a real-world process or system. The behavior of a system
is studied by generating an artificial history of the system through the use of random numbers. These
numbers are used in the context of a simulation model, which is the mathematical, logical and symbolic
representation of the relationships between the objects of interest of the system. After the model has been
validated, the effects of changes in the environment on the system, or the effects of changes in the system
on system performance can be predicted using the simulation model.
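A minimal Monte Carlo sketch in Python: random numbers generate an artificial history of daily demand for a hypothetical shop, from which system performance is summarized. All parameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(13)

# Monte Carlo simulation of daily demand over one year: random numbers
# drive an artificial history of the system
n_days = 365
demand = rng.poisson(lam=20, size=n_days)        # customers per day
stock_per_day = 22                               # items stocked each morning
sold = np.minimum(demand, stock_per_day)
lost_sales = demand - sold

print("average units sold per day:", sold.mean())
print("days with lost sales:", int((lost_sales > 0).sum()))
```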

Linear Programming
Linear programming (LP), or linear optimization, may be defined as the problem of maximizing or minimizing a linear function subject to linear constraints. The constraints may be equalities or inequalities. Such optimization problems often involve the calculation of profit and loss. Linear programming problems are an important class of optimization problems that help to find the feasible region and optimize the solution in order to obtain the highest or lowest value of the objective function.

In other words, linear programming is an optimization method for maximizing or minimizing the objective function of a given mathematical model subject to a set of requirements that are represented as linear relationships. The main aim of a linear programming problem is to find the optimal solution.
Linear programming is the method of considering different inequalities relevant to a situation and calculating the best value that can be obtained under those conditions. Some of the assumptions made while working with linear programming are listed below; a small solver sketch follows the list.

 The number of constraints should be expressed in quantitative terms


 The relationship between the constraints and the objective function should be linear
 The linear function (i.e., objective function) is to be optimized
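The solver sketch mentioned above, using scipy's linprog on a hypothetical two-product profit-maximization problem; the coefficients are illustrative assumptions.

```python
from scipy.optimize import linprog

# Maximize profit 3x + 5y subject to resource constraints:
#   2x + 4y <= 100,  3x + 2y <= 90,  x >= 0, y >= 0
# linprog minimizes, so the objective coefficients are negated
result = linprog(
    c=[-3, -5],
    A_ub=[[2, 4], [3, 2]],
    b_ub=[100, 90],
    bounds=[(0, None), (0, None)],
    method="highs",
)
print(result.x, -result.fun)
```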
Software’s of Data Analysis

1. MPLUS
Mplus is a statistical modeling program that provides researchers with a flexible tool to analyze their data.
Mplus offers researchers a wide choice of models, estimators, and algorithms in a program that has an
easy-to-use interface and graphical displays of data and analysis results.

Mplus is a highly flexible, powerful statistical analysis software program that can fit an extensive variety
of statistical models using one of many estimators available. Perhaps its greatest strengths are in its
capabilities to model latent variables, both continuous and categorical, which underlie its flexibility.
Among the many models Mplus can fit are:

 Regression models (linear, logistic, Poisson, Cox proportional hazards, etc.)


 Factor analysis, exploratory and confirmatory
 Structural equation models
 Latent growth models
 Mixture models (latent class, latent profile, etc.)
 Longitudinal analysis (latent transition analysis, growth mixture models, etc.)
 Multilevel models
 Bayesian analysis

2. EViews
EViews, or Econometric Views, is a statistical package used for time-series-oriented econometric analysis. It is widely used by scholars, policy makers, government agencies and academics for econometric analysis and economic decision making. Model simulation can be carried out and forecasts can be generated using EViews quickly and effectively.

Due to its all-inclusive nature, it can be used for the following tasks:

 Estimation
 Forecasting
 Simulation
 Graphics
 Statistical analysis
 Data management
3. Primavera
Primavera is advanced software that is trusted by project managers and companies from different industries globally. It provides sophisticated solutions to plan, manage and execute projects of any size and scale. It increases project efficiency significantly by identifying bottlenecks and schedule overruns.

Oracle Primavera® is used for major projects in industries such as engineering and construction,
aerospace and defense, utilities, oil and gas, chemicals, industrial manufacturing, automotive, financial
services, communications, travel and transportation, healthcare, and government.

4. Lingo
LINGO is a comprehensive tool designed to make building and solving Linear, Nonlinear (convex &
nonconvex/Global), Quadratic, Quadratically Constrained, Second Order Cone, Semi Definite, Stochastic,
and Integer optimization models faster, easier and more efficient. LINGO provides a completely
integrated package that includes a powerful language for expressing optimization models, a full featured
environment for building and editing problems, and a set of fast built-in solvers. The recently released
LINGO 20 includes a number of significant enhancements and new features.

5. Lindo
LINDO's linear, nonlinear, integer, stochastic and global programming solvers have been used by thousands of companies worldwide to maximize profit and minimize cost on decisions involving production planning, transportation, finance, portfolio allocation, capital budgeting, blending, scheduling, inventory, resource allocation and more.

6. Mendeley
Mendeley Reference Manager is a free web and desktop reference management application. It helps you
simplify your reference management workflow so you can focus on achieving your goals. With Mendeley
Reference Manager you can: Store, organize and search all your references from just one library.

7. Visio
With Visio on your PC or mobile device, you can: Organize complex ideas visually. Get started with
hundreds of templates, including flowcharts, timelines, floor plans, and more. Add and connect shapes,
text, and pictures to show relationships in your data.

8. MATLAB
MATLAB® is a programming platform designed specifically for engineers and scientists to analyze and
design systems and products that transform our world. The heart of MATLAB is the MATLAB language,
a matrix-based language allowing the most natural expression of computational mathematics.
MATLAB is a high-performance language for technical computing. It integrates computation,
visualization, and programming in an easy-to-use environment where problems and solutions are
expressed in familiar mathematical notation. Typical uses include:

 Math and computation


 Algorithm development
 Modeling, simulation, and prototyping
 Data analysis, exploration, and visualization
 Scientific and engineering graphics
 Application development, including Graphical User Interface building

9. AMOS
AMOS is statistical software; the name stands for Analysis of Moment Structures. AMOS is an add-on SPSS module and is used especially for structural equation modeling, path analysis, and confirmatory factor analysis. It is also known as analysis of covariance structures or causal modeling software.

10. Smart PLS


Partial Least Squares (PLS) regression is a method that reduces the predictor variables to a smaller set of components, which are then used to perform a regression. Which is better, AMOS or SmartPLS?

These are two different kinds of software. If you want to develop a new theory (exploratory research), then SmartPLS is preferred, and if you want to test a theory (confirmatory research), then AMOS is the better choice. Consider your research objectives; they will assist you in deciding.

11. LISREL
LISREL is statistical software that is used for structural equation modeling. Structural equation models are systems of linear equations. LISREL carries out simultaneous estimation of the structural model and the measurement model. The structural model assumes that all variables are measured without error.

LISREL can be used to fit:

 measurement models,
 structural equation models based on continuous or ordinal data,
 multilevel models for continuous and categorical data using a number of link functions,
 generalized linear models based on complex survey data.

Additional statistical analyses that can be performed include, to name a few:

 exploratory factor analysis (EFA),
 multivariate analysis of variance (MANOVA),
 logistic and probit regression,
 censored regression,
 survival analysis.
