BA - Unit 1 to Unit 3 Notes
BUSINESS ANALYTICS
(Unit 1- Unit 3)
CCW331 BUSINESS ANALYTICS (L T P C: 2 0 2 3)
Syllabus
Data Warehouses and Data Mart - Knowledge Management –Types of Decisions - Decision Making Process
Introduction to Business Forecasting and Predictive analytics - Logic and Data Driven Models – Data
Mining and Predictive Analysis Modeling – Machine Learning for Predictive analytics.
Human Resources – Planning and Recruitment – Training and Development - Supply chain network -
Planning Demand, Inventory and Supply – Logistics – Analytics applications in HR & Supply Chain -
Applying HR Analytics to make a prediction of the demand for hourly employees for a year.
Marketing Strategy, Marketing Mix, Customer Behavior – Selling Process – Sales Planning –
Analytics applications in Marketing and Sales - predictive analytics for customers' behavior in marketing
and sales.
TOTAL(L):30 PERIODS
LIST OF EXPERIMENTS:
Use MS-Excel and Power-BI to perform the following experiments using a Business data set, and make
presentations.
Students may be encouraged to bring their own real-time socially relevant data set.
I Cycle – MS Excel
2. (i) Get the input from the user and perform numerical operations (MAX, MIN, AVG, SUM,
SQRT, ROUND)
3. Perform statistical operations - Mean, Median, Mode and Standard deviation, Variance,
Skewness, Kurtosis
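A quick way to cross-check the Excel results for this experiment is a small Python sketch; the sample values below are illustrative only and not part of the syllabus:

import statistics as st
from scipy import stats  # scipy provides skewness and kurtosis

# Illustrative sample values; replace with your own business data set
data = [12, 15, 11, 18, 15, 21, 15, 19, 17, 14]

print("Mean     :", st.mean(data))
print("Median   :", st.median(data))
print("Mode     :", st.mode(data))
print("Std dev  :", st.stdev(data))        # sample standard deviation
print("Variance :", st.variance(data))     # sample variance
print("Skewness :", stats.skew(data))
print("Kurtosis :", stats.kurtosis(data))  # excess kurtosis by default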
30 PERIODS
UNIT 1
INTRODUCTION TO BUSINESS ANALYTICS
Prescriptive Analytics
In the case of prescriptive analytics, we make use of simulation, data modelling, and
optimization algorithms to answer questions such as “what needs to be done”. This
is used to provide solutions and to identify the potential results of those solutions. This field of
business analytics has emerged recently and is growing rapidly because it offers multiple solutions,
along with their likely effectiveness, for the problems faced by businesses. If Plan A fails,
or there are not enough resources to execute it, then Plan B, Plan C, etc., are still at hand.
Example –
A good example is Google's self-driving car: by looking at past trends and forecasted data, it
identifies when to turn or when to slow down, working much like a human driver.
Business Understanding
Focuses on understanding the project objectives and requirements from a business
perspective. The analyst formulates this knowledge as a data mining problem and develops a
preliminary plan.
Data Understanding
Starting with initial data collection, the analyst proceeds with activities to get familiar with
the data, identify data quality problems & discover first insights into the data. In this phase,
the analyst might also detect interesting subsets to form hypotheses for hidden information
Data Preparation
The data preparation phase covers all activities to construct the final dataset from the initial
raw data
Modelling
The analyst evaluates, selects and applies the appropriate modelling techniques. Since some
techniques, such as neural nets, have specific requirements regarding the form of the data,
there can be a loop back here to data preparation.
Evaluation
The analyst builds and chooses models that appear to have high quality based on the loss
functions that were selected. The analyst then tests them to ensure that the models generalise
to unseen data. Subsequently, the analyst also validates that the models sufficiently
cover all key business issues. The end result is the selection of the champion model(s).
Deployment
Generally, this will mean deploying a code representation of the model into an operational
system. This also includes mechanisms to score or categorize new, unseen data as it arises.
The mechanism should use the new information in the solution of the original business
problem. Importantly, the code representation must also include all the data prep steps
leading up to modelling. This ensures that the model will treat new raw data in the same
manner as during model development
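As a hedged illustration of bundling the data preparation steps with the deployed model, the minimal scikit-learn sketch below (on synthetic data; not part of the original notes) carries imputation and scaling together with the classifier, so new raw data is treated exactly as during development:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the prepared training data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# The pipeline is the "code representation" deployed into the operational system:
# it carries the data prep steps (imputation, scaling) together with the model.
model = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# New, unseen raw data is scored through exactly the same prep steps
X_new = rng.normal(size=(5, 3))
print(model.predict(X_new))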
Project plan
Now you specify every step that you, the data miner, intend to take until the project is
completed and the results are presented and reviewed.
Deliverables for this task include two reports:
● Project plan: Outline your step-by-step action plan for the project. Expand the outline with a
schedule for completion of each step, required resources, inputs (such as data or a meeting
with a subject matter expert), and outputs (such as cleaned data, a model, or a report) for each
step, and dependencies (steps that can’t begin until this step is completed). Explicitly state
that certain steps must be repeated (for example, modeling and evaluation usually call for
several back-and-forth repetitions).
● Initial assessment of tools and techniques: Identify the required capabilities for meeting
your data-mining goals and assess the tools and resources that you have. If something is
missing, you have to address that concern very early in the process.
DATA COLLECTION
Data is a collection of facts, figures, objects, symbols, and events gathered from different
sources. Organizations collect data to make better decisions. Without data, it would be
difficult for organizations to make appropriate decisions, and so data is collected at various
points in time from different audiences.
For instance, before launching a new product, an organization needs to collect data on
product demand, customer preferences, competitors, etc. In case data is not collected
beforehand, the organization’s newly launched product may lead to failure for many reasons,
such as less demand and inability to meet customer needs.
Although data is a valuable asset for every organization, it does not serve any purpose until
analyzed or processed to get the desired results.
Information collected as numerical facts through observation is known as raw data.
There are two types of data: primary data and secondary data. The two types are described
below.
1. Primary Data
When an investigator collects data himself or herself with a definite plan or design, then
the data is known as primary data. Generally, the results derived from the primary data are
accurate as the researcher gathers the information. But, one of the disadvantages of primary
data collection is the expenses associated with it. Primary data research is very time-
consuming and expensive.
2. Secondary Data
Data that the investigator does not collect initially but instead obtains from published or
unpublished sources is secondary data. Secondary data is collected by an individual or an
institution for some purpose and are used by someone else in another context. It is worth
noting that although secondary data is cheaper to obtain, it raises concerns about accuracy.
As the data is second-hand, one cannot fully rely on the information to be authentic.
Data Collection: Methods
Data collection is defined as gathering and analysing data, using appropriate techniques, in
order to validate research. It is done to diagnose a problem and learn its outcome and future
trends. When there is a question to be answered, data collection methods help estimate the
likely result.
We must collect reliable data from the correct sources to make the calculations and analysis
easier. There are two types of data collection methods. This is dependent on the kind of data
that is being collected. They are:
1. Primary Data Collection Methods
2. Secondary Data Collection Methods
Types of Data Collection
Students require primary or secondary data while doing their research. Both primary and
secondary data have their own advantages and disadvantages. Both the methods come into
the picture in different scenarios. One can use secondary data to save time and primary data
to get accurate results.
Primary Data Collection Method
Primary or raw data is obtained directly from the first-hand source through experiments,
surveys, or observations. The primary data collection method is further classified into two
types, and they are given below:
1. Quantitative Data Collection Methods
2. Qualitative Data Collection Methods
Quantitative Data Collection Methods
The term ‘Quantity’ tells us a specific number. Quantitative data collection methods express
the data in numbers using traditional or online data collection methods. Once this data is
collected, the results can be calculated using Statistical methods and Mathematical tools.
Some of the quantitative data collection methods include
Time Series Analysis
The term time series refers to a sequential order of values of a variable, known as a trend, at
equal time intervals. Using patterns, an organization can predict the demand for its products
and services for the projected time.
Smoothing Techniques
In cases where the time series lacks significant trends, smoothing techniques can be used.
They eliminate a random variation from the historical demand. It helps in identifying patterns
and demand levels to estimate future demand. The most common methods used in smoothing
demand forecasting techniques are the simple moving average method and the weighted
moving average method.
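A minimal pandas sketch of the two smoothing methods mentioned above; the monthly demand figures and weights are hypothetical:

import numpy as np
import pandas as pd

# Hypothetical monthly demand figures
demand = pd.Series([120, 132, 125, 140, 138, 150, 145, 160])

# Simple moving average over a 3-period window
sma = demand.rolling(window=3).mean()

# Weighted moving average: the most recent period gets the highest weight
weights = np.array([0.2, 0.3, 0.5])
wma = demand.rolling(window=3).apply(lambda x: np.dot(x, weights), raw=True)

print(pd.DataFrame({"demand": demand, "SMA(3)": sma, "WMA(3)": wma}))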
Barometric Method
Also known as the leading indicators approach, researchers use this method to speculate
future trends based on current developments. When the past events are considered to predict
future events, they act as leading indicators.
Qualitative Data Collection Methods
The qualitative method does not involve any mathematical calculations. This method is
closely connected with elements that are not quantifiable. The qualitative data collection
method includes several ways to collect this type of data, and they are given below:
Interview Method
As the name suggests, data collection is done through the verbal conversation of interviewing
the people in person or on a telephone or by using any computer-aided model. This is one of
the most often used methods by researchers. A brief description of each of these methods is
shown below:
Personal or Face-to-Face Interview: In this type of interview, questions are asked
directly to the respondent in person. For this, a researcher can also use online surveys to take
note of the answers.
Telephonic Interview: This method is done by asking questions on a telephonic call. Data is
collected from the people directly by collecting their views or opinions.
Computer-Assisted Interview: The computer-assisted type of interview is the same as a
personal interview, except that the interviewer and the person being interviewed will be doing
it on a desktop or laptop. Also, the data collected is directly updated in a database to make the
process quicker and easier. In addition, it eliminates a lot of paperwork to be done in updating
the collection of data.
Questionnaire Method of Collecting Data
The questionnaire method is nothing but conducting surveys with a set of quantitative
research questions. These survey questions are created using online survey-creation
software, which also helps ensure that people's trust in the surveys is legitimised. Some
types of questionnaire methods are given below:
Web-Based Questionnaire: The interviewer can send a survey link to the selected
respondents. Then the respondents click on the link, which takes them to the survey
questionnaire. This method is very cost-efficient and quick, which people can do at their own
convenient time. Moreover, the survey has the flexibility of being done on any device. So, it
is reliable and flexible.
Mail-Based Questionnaire: Questionnaires are sent to the selected audience via email. At
times, some incentives are also given to complete this survey which is the main attraction.
The advantage of this method is that the respondent’s name remains confidential to the
researchers, and there is the flexibility of time to complete this survey.
Observation Method
As the word ‘observation’ suggests, in this method data is collected directly by observing. This
can be obtained by counting the number of people or the number of events in a particular
time frame. Generally, it’s effective in small-scale scenarios. The primary skill needed here is
observing and arriving at the numbers correctly. Structured observation is the type of
observation method in which a researcher detects certain specific behaviours.
Document Review Method
The document review method is a data aggregation method used to collect data from existing
documents with data about the past. There are two types of documents from which we can
collect data. They are given below:
Public Records: The data collected in an organisation like annual reports and sales
information of the past months are used to do future analysis.
Personal Records: As the name suggests, the documents about an individual such as type of
job, designation, and interests are taken into account.
Secondary Data Collection Method
The data collected by another person other than the researcher is secondary data. Secondary
data is readily available and does not require any particular collection methods. It is available
in the form of historical archives, government data, organisational records etc. This data can
be obtained directly from the company or the organization where the research is being
organised or from outside sources.
The internal sources of secondary data gathering include company documents, financial
statements, annual reports, team member information, and reports obtained from customers or
dealers. Now, the external data sources include information from books, journals, magazines,
the census taken by the government, and the information available on the internet about
research. The main advantage of this data collection method is that the data is easy to collect
since it is readily accessible.
The secondary data collection methods, too, can involve both quantitative and qualitative
techniques. Secondary data is easily available and hence, less time-consuming and expensive
as compared to the primary data. However, with the secondary data collection methods, the
authenticity of the data gathered cannot be verified.
Collection of Data in Statistics
There are various ways to represent data after gathering. But, the most popular method is to
tabulate the data using tally marks and then represent them in a frequency distribution table.
The frequency distribution table is constructed by using tally marks. Tally marks are a
form of numerical system used for counting. Vertical lines are used for the counting, and a
cross line is placed over each group of four lines, giving a total of 5 for that group.
Example:
Consider a jar containing beads of different colours; counting the beads of each colour with
tally marks gives the frequency distribution table.
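Since the original figure is not reproduced here, the sketch below uses hypothetical colour observations to show how such a frequency distribution can be built with tally-style counting:

from collections import Counter

# Hypothetical colours observed while emptying the jar
colours = ["red", "blue", "red", "green", "blue", "red",
           "yellow", "blue", "red", "red", "green", "red"]

frequency = Counter(colours)
for colour, count in sorted(frequency.items()):
    # "||||/" stands for a crossed group of five; leftover strokes follow
    tally = "||||/ " * (count // 5) + "|" * (count % 5)
    print(f"{colour:<8} {tally:<14} {count}")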
DATA PREPARATION
Data preparation is the process of gathering, combining, structuring and organizing data so it
can be used in business intelligence (BI), analytics and data visualization applications. The
components of data preparation include data preprocessing, profiling, cleansing, validation
and transformation; it often also involves pulling together data from different internal
systems and external sources.
Data preparation work is done by information technology (IT), BI and data management
teams as they integrate data sets to load into a data warehouse, NoSQL database or data lake
repository, and then when new analytics applications are developed with those data sets. In
addition, data scientists, data engineers, other data analysts and business users increasingly
use self-service data preparation tools to collect and prepare data themselves.
Data preparation is often referred to informally as data prep. It's also known as data
wrangling, although some practitioners use that term in a narrower sense to refer to cleansing,
structuring and transforming data; that usage distinguishes data wrangling from the data pre-
processing stage.
Purposes of data preparation
One of the primary purposes of data preparation is to ensure that raw data being readied for
processing and analysis is accurate and consistent so the results of BI and analytics
applications will be valid. Data is commonly created with missing values, inaccuracies or
other errors, and separate data sets often have different formats that need to be reconciled
when they're combined. Correcting data errors, validating data quality and consolidating data
sets are big parts of data preparation projects.
Data preparation also involves finding relevant data to ensure that analytics applications
deliver meaningful information and actionable insights for business decision-making. The
data often is enriched and optimized to make it more informative and useful -- for example,
by blending internal and external data sets, creating new data fields, eliminating outlier values
and addressing imbalanced data sets that could skew analytics results.
In addition, BI and data management teams use the data preparation process to curate data
sets for business users to analyse. Doing so helps streamline and guide self-service BI
applications for business analysts, executives and workers.
What are the benefits of data preparation?
Data scientists often complain that they spend most of their time gathering, cleansing and
structuring data instead of analysing it. A big benefit of an effective data preparation process
is that they and other end users can focus more on data mining and data analysis -- the parts
of their job that generate business value. For example, data preparation can be done more
quickly, and prepared data can automatically be fed to users for recurring analytics
applications.
Done properly, data preparation also helps an organization do the following:
● ensure the data used in analytics applications produces reliable results;
● identify and fix data issues that otherwise might not be detected;
● enable more informed decision-making by business executives and operational workers;
● reduce data management and analytics costs;
● avoid duplication of effort in preparing data for use in multiple applications; and
● get a higher ROI from BI and analytics initiatives.
Effective data preparation is particularly beneficial in big data environments that store a
combination of structured, semi structured and unstructured data, often in raw form until it's
needed for specific analytics uses. Those uses include predictive analytics, machine learning
(ML) and other forms of advanced analytics that typically involve large amounts of data to
prepare. For example, in an article on preparing data for machine learning, Felix Wick,
corporate vice president of data science at supply chain software vendor Blue Yonder, is
quoted as saying that data preparation "is at the heart of ML."
Steps in the data preparation process
Data preparation is done in a series of steps. There's some variation in the data preparation
steps listed by different data professionals and software vendors, but the process typically
involves the following tasks:
1. Data discovery and profiling. Once the data has been collected, the next step is to explore it
to better understand what it contains and what needs to be done to prepare it for the intended uses. To
help with that, data profiling identifies patterns, relationships and other attributes in the data,
as well as inconsistencies, anomalies, missing values and other issues so they can be
addressed.
What is data profiling?
Data profiling refers to the process of examining, analyzing, reviewing and summarizing data
sets to gain insight into the quality of data. Data quality is a measure of the condition of data
based on factors such as its accuracy, completeness, consistency, timeliness and accessibility.
Additionally, data profiling involves a review of source data to understand the data's
structure, content and interrelationships.
This review process delivers two high-level values to the organization: first, it provides a
high-level view of the quality of its data sets; second, it helps the organization identify
potential data projects.
Given those benefits, data profiling is an important component of data preparation programs.
By helping organizations identify quality data, it is an important precursor to data processing
and data analytics activities.
Moreover, an organization can use data profiling and the insights it produces to continuously
improve the quality of its data and measure the results of that effort.
Data profiling may also be known as data archaeology, data assessment, data discovery or
data quality analysis.
Organizations use data profiling at the beginning of a project to determine if enough data has
been gathered, if any data can be reused or if the project is worth pursuing. The process of
data profiling itself can be based on specific business rules that will uncover how the data set
aligns with business standards and goals.
Types of data profiling
There are three types of data profiling.
● Structure discovery. This focuses on the formatting of the data, making sure everything is
uniform and consistent. It uses basic statistical analysis to return information about the
validity of the data.
● Content discovery. This process assesses the quality of individual pieces of data. For
example, ambiguous, incomplete and null values are identified.
● Relationship discovery. This detects connections, similarities, differences and associations
among data sources.
What are the steps in the data profiling process?
Data profiling helps organizations identify and fix data quality problems before the data is
analyzed, so data professionals aren't dealing with inconsistencies, null values or incoherent
schema designs as they process data to make decisions.
Data profiling statistically examines and analyzes data at its source and when loaded. It also
analyzes the metadata to check for accuracy and completeness.
It typically involves either writing queries or using data profiling tools. A high-level
breakdown of the process is as follows:
1. The first step of data profiling is gathering one or multiple data sources and the associated
metadata for analysis.
2. The data is then cleaned to unify structure, eliminate duplications, identify interrelationships
and find anomalies.
3. Once the data is cleaned, data profiling tools will return various statistics to describe the data
set. This could include the mean, minimum/maximum value, frequency, recurring patterns,
dependencies or data quality risks.
For example, by examining the frequency distribution of different values for each column in a
table, a data analyst could gain insight into the type and use of each column. Cross-column
analysis can be used to expose embedded value dependencies; inter-table analysis allows the
analyst to discover overlapping value sets that represent foreign key relationships between
entities.
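A small pandas sketch of the kind of statistics a profiling step typically returns; the customer table below is hypothetical:

import pandas as pd

# Hypothetical source table being profiled
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 4],
    "age": [34, 45, None, 29, 29],
    "segment": ["retail", "retail", "corporate", "retail", "retail"],
})

print(df.describe(include="all"))       # mean, min/max, top values and their frequency
print(df.isna().sum())                  # missing values per column
print(df.duplicated().sum())            # duplicate rows
print(df["segment"].value_counts())     # frequency distribution of one column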
Benefits of data profiling
Data profiling returns a high-level overview of data that can result in the following benefits:
● leads to higher-quality, more credible data;
● helps with more accurate predictive analytics and decision-making;
● makes better sense of the relationships between different data sets and sources;
● keeps company information centralized and organized;
● eliminates errors, such as missing values or outliers, that add costs to data-driven projects;
● highlights areas within a system that experience the most data quality issues, such as data
corruption or user input errors; and
● produces insights surrounding risks, opportunities and trends.
Data profiling challenges
Although the objectives of data profiling are straightforward, the actual work involved is
quite complex, with multiple tasks occurring from the ingestion of data through its
warehousing.
That complexity is one of the challenges organizations encounter when trying to implement
and run a successful data profiling program.
The sheer volume of data being collected by a typical organization is another challenge, as is
the range of sources -- from cloud-based systems to endpoint devices deployed as part of an
internet-of-things ecosystem -- that produce data.
The speed at which data enters an organization creates further challenges to having a
successful data profiling program.
These data prep challenges are even more significant in organizations that have not adopted
modern data profiling tools and still rely on manual processes for large parts of this work.
On a similar note, organizations that don't have adequate resources -- including trained data
professionals, tools and the funding for them -- will have a harder time overcoming these
challenges.
However, those same elements make data profiling more critical than ever to ensure that the
organization has the quality data it needs to fuel intelligent systems, customer
personalization, productivity-boosting automation projects and more.
Examples of data profiling
Data profiling can be implemented in a variety of use cases where data quality is important.
For example, projects that involve data warehousing or business intelligence may require
gathering data from multiple disparate systems or databases for one report or analysis.
Applying data profiling to these projects can help identify potential issues and corrections that
need to be made in extract, transform and load (ETL) jobs and other data integration
processes before moving forward.
Additionally, data profiling is crucial in data conversion or data migration initiatives that
involve moving data from one system to another. Data profiling can help identify data quality
issues that may get lost in translation, or adaptations that must be made to the new system prior
to migration.
The following four methods, or techniques, are used in data profiling:
● column profiling, which assesses tables and quantifies entries in each column;
● cross-column profiling, which features both key analysis and dependency analysis;
● cross-table profiling, which uses key analysis to identify stray data as well as semantic and
syntactic discrepancies; and
● data rule validation, which assesses data sets against established rules and standards to
validate that they're being followed.
Data profiling tools
Data profiling tools replace much, if not all, of the manual effort of this function by
discovering and investigating issues that affect data quality, such as duplication, inaccuracies,
inconsistencies and lack of completeness.
These technologies work by analyzing data sources and linking sources to their metadata to
allow for further investigation into errors.
Furthermore, they offer data professionals quantitative information and statistics around data
quality, typically in tabular and graph formats.
Data management applications, for example, can manage the profiling process through tools
that eliminate errors and apply consistency to data extracted from multiple sources without
the need for hand coding.
Such tools are essential for many, if not most, organizations today as the volume of data they
use for their business activities significantly outpaces even a large team's ability to perform
this function through mostly manual efforts.
Data profile tools also generally include data wrangling, data gap and metadata discovery
capabilities as well as the ability to detect and merge duplicates, check for data similarities
and customize data assessments.
Commercial vendors that provide data profiling capabilities include Datameer, Informatica,
Oracle and SAS. Open source solutions include Aggregate Profiler, Apache Griffin, Quadient
DataCleaner and Talend.
2. Data cleansing. Next, the identified data errors and issues are corrected to create complete
and accurate data sets. For example, as part of cleansing data sets, faulty data is removed or
fixed, missing values are filled in and inconsistent entries are harmonized.
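A minimal pandas sketch of these cleansing operations; the column names and correction rules are illustrative only:

import pandas as pd

# Hypothetical raw data with inconsistent, missing and duplicate entries
df = pd.DataFrame({
    "city":  ["NYC", "nyc", "New York", "Boston", None, "Boston"],
    "sales": [100.0, 100.0, None, 250.0, 300.0, 250.0],
})

# Harmonize inconsistent entries
df["city"] = df["city"].replace({"nyc": "NYC", "New York": "NYC"})

# Fill in missing values
df["sales"] = df["sales"].fillna(df["sales"].median())

# Remove rows that are still faulty (missing the key field) and exact duplicates
df = df.dropna(subset=["city"]).drop_duplicates()

print(df)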
What is data cleansing?
Data cleansing, also referred to as data cleaning or data scrubbing, is the process of fixing
incorrect, incomplete, duplicate or otherwise erroneous data in a data set. It involves
identifying data errors and then changing, updating or removing data to correct them. Data
cleansing improves data quality and helps provide more accurate, consistent and reliable
information for decision-making in an organization.
Data cleansing is a key part of the overall data management process and one of the core
components of data preparation work that readies data sets for use in business intelligence
(BI) and data science applications. It's typically done by data quality analysts and engineers
or other data management professionals. But data scientists, BI analysts and business users
may also clean data or take part in the data cleansing process for their own applications.
Data cleansing vs. data cleaning vs. data scrubbing
Data cleansing, data cleaning and data scrubbing are often used interchangeably. For the most
part, they're considered to be the same thing. In some cases, though, data scrubbing is viewed
as an element of data cleansing that specifically involves removing duplicate, bad, unneeded
or old data from data sets.
Data scrubbing also has a different meaning in connection with data storage. In that context,
it's an automated function that checks disk drives and storage systems to make sure the data
they contain can be read and to identify any bad sectors or blocks.
HYPOTHESIS GENERATION
Data scientists work with data sets small and large, and are tellers of stories. These stories
have entities, properties and relationships, all described by data. Their apparatus and methods
open up data scientists to opportunities to identify, consolidate and validate hypotheses with
data, and use these hypotheses as starting points for their data narratives. Hypothesis
generation is a key challenge for data scientists. Hypothesis generation and by extension
hypothesis refinement constitute the very purpose of data analysis and data science.
Hypothesis generation for a data scientist can take numerous forms, such as:
1. They may be interested in the properties of a certain stream of data or a certain
measurement. These properties and their default or exceptional values may form a
certain hypothesis.
2. They may be keen on understanding how a certain measure has evolved over time. In
trying to understand this evolution of a system’s metric, or a person’s behaviour, they
could rely on a mathematical model as a hypothesis.
3. They could consider the impact of some properties on the states of systems,
interactions and people. In trying to understand such relationships between different
measures and properties, they could construct machine learning models of different
kinds.
Ultimately, the purpose of such hypothesis generation is to simplify some aspect of system
behaviour and represent such behaviour in a manner that’s tangible and tractable based on
simple, explicable rules. This makes story-telling easier for data scientists when they become
new-age raconteurs, straddling data visualisations, dashboards with data summaries and
machine learning models.
5. Passenger details
Passengers can influence the trip duration knowingly or unknowingly. We usually
come across passengers requesting drivers to increase the speed as they are getting
late and there could be other factors to hypothesize which we can look at.
● Age of passengers: Senior citizens as passengers may contribute to higher
trip duration as drivers tend to go slow in trips involving senior citizens
● Medical conditions or pregnancy: Passengers with medical conditions
contribute to a longer trip duration
● Emergency: Passengers with an emergency could contribute to a shorter
trip duration
● Passenger count: Higher passenger count leads to shorter duration trips due
to congestion in seating
6. Date-Time Features
The day and time of the week are important as New York is a busy city and could
be highly congested during office hours or weekdays. Let us now generate a few
hypotheses on the date and time-based features.
Pickup Day:
● Weekends could contribute to more outstation trips and could have a higher
trip duration
● Weekdays tend to have higher trip duration due to high traffic
● If the pickup day falls on a holiday then the trip duration may be shorter
● If the pickup day falls on a festive week then the trip duration could be
lower due to lesser traffic
Time:
● Early morning trips have a lesser trip duration due to lesser traffic
● Evening trips have a higher trip duration due to peak hours
7. Road-based Features
Roads are of different types and the condition of the road or obstructions in the
road are factors that can’t be ignored. Let’s form some hypotheses based on these
factors.
● Condition of the road: The duration of the trip is more if the condition of
the road is bad
● Road type: Trips in concrete roads tend to have a lower trip duration
● Strike on the road: Strikes carried out on roads in the direction of the trip
causes the trip duration to increase
8. Weather Based Features
Weather can change at any time and could possibly impact the commute if the
weather turns bad. Hence, this is an important feature to consider in our
hypothesis.
● Weather at the start of the trip: Rainy weather condition contributes to a
higher trip duration
After writing down the hypotheses and looking at the dataset, you will notice
that you have covered most of the features present in the data set. There could
also be a possibility that you might have to work with fewer features, because the
features on which you have generated hypotheses are not currently being
captured/stored by the business and are not available.
Always go ahead and capture data from external sources if you think that the
data is relevant for your prediction. Ex: Getting weather information
It is also important to note that since hypothesis generation is an educated
guess, a generated hypothesis could turn out to be true or false once exploratory
data analysis and hypothesis testing are performed on the data.
MODELING:
After all the cleaning, formatting and feature selection, we will now feed
the data to the chosen model. But how does one select a model to use?
How to choose a model?
IT DEPENDS. It all depends on what the goal of your task or project is, and this should
already have been identified in the Business Understanding phase.
Steps in choosing a model
1. Determine size of training data — if you have a small dataset with few
observations and a high number of features, you can choose high-bias/low-variance
algorithms (Linear Regression, Naïve Bayes, Linear SVM). If your dataset is large and
has a high number of observations compared to the number of features, you can choose
low-bias/high-variance algorithms (KNN, Decision Trees).
2. Accuracy and/or interpretability of the output — if your goal is inference, choose
restrictive models as they are more interpretable (Linear Regression, Least Squares). If
your goal is higher accuracy, then choose flexible models (Bagging, Boosting, SVM).
3. Speed or training time — always remember that higher accuracy as well as large
datasets means higher training time. Examples of easy to run and to implement
algorithms are: Naïve Bayes, Linear and Logistic Regression. Some examples of
algorithms that need more time to train are: SVM, Neural Networks, and Random
Forests.
4. Linearity — first check the linearity of your data by fitting a linear model or by
running a logistic regression; you can also check the residual errors. Higher
errors mean that the data is not linear and needs complex algorithms to fit. If the data is
linear, you can choose: Linear Regression, Logistic Regression, Support Vector
Machines. If non-linear: Kernel SVM, Random Forest, Neural Nets (a minimal sketch of
this linearity check follows below).
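A minimal sketch of the linearity check in point 4, using synthetic data and scikit-learn (not taken from the original notes):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)  # deliberately non-linear target

linear_fit = LinearRegression().fit(X, y)
residual_mse = mean_squared_error(y, linear_fit.predict(X))

# A high residual error suggests the data is not linear, so a more flexible
# model (kernel SVM, random forest, neural net) may be a better choice.
print("Residual MSE of the linear fit:", residual_mse)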
Parametric vs. Non-Parametric Machine Learning Models
Parametric Machine Learning Algorithms
Parametric ML Algorithms are algorithms that simplify the function to a known form. They
are often called the “Linear ML Algorithms”.
Parametric ML Algorithms
● Logistic Regression
● Linear Discriminant Analysis
● Perceptron
● Naïve Bayes
● Simple Neural Networks
Benefits of Parametric ML Algorithms
● Simpler — easy to understand methods and easy to interpret results
● Speed — very fast to learn from the data provided
● Less data — it does not require as much training data
Limitations of Parametric ML Algorithms
● Limited Complexity —suited only to simpler problems
● Poor Fit — the methods are unlikely to match the underlying mapping function
Non-Parametric Machine Learning Algorithms
Non-Parametric ML Algorithms are algorithms that do not make assumptions about the
form of the mapping function. They are a good choice when you have a lot of data, no prior
knowledge, and you don't want to worry too much about choosing the right features.
Non-Parametric ML Algorithms
● K-Nearest Neighbors (KNN)
● Decision Trees like CART
● Support Vector Machines (SVM)
Benefits of Non-Parametric ML Algorithms
● Flexibility — capable of fitting a large number of functional forms
● Power — they make no (or weak) assumptions about the underlying function
● Performance — able to give a higher-performance model for predictions
Limitations of Non-Parametric ML Algorithms
● Needs more data — requires a large training dataset
● Slower processing — they often have more parameters which means that training time
is much longer
● Overfitting — higher risk of overfitting the training data, and it is harder to explain
why specific predictions were made
In the CRISP-DM process flow, the Data Modeling phase is broken down into four tasks,
each with its projected outcome or output. The main tasks of this phase are described below.
2. Designing tests
The test in this task is the test that you’ll use to determine how well your model works. It
may be as simple as splitting your data into a group of cases for model training and another
group for model testing.
Training data is used to fit mathematical forms to the data model, and test data is used during
the model-training process to avoid overfitting: making a model that’s perfect for one dataset,
but no other. You may also use holdout data, data that is not used during the model-training
process, for an additional test.
The deliverable for this task is your test design. It need not be elaborate, but you should at
least take care that your training and test data are similar and that you avoid introducing any
bias into the data.
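A hedged sketch of such a test design, splitting synthetic data into training, test and holdout groups with scikit-learn:

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)

# First set aside a holdout group that is never used during model training
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remainder into the training group and the test group
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), "training,", len(X_test), "test,", len(X_holdout), "holdout cases")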
3. Building model(s)
Modeling is what many people imagine to be the whole job of the data miner, but it’s just one
task of dozens! Nonetheless, modeling to address specific business goals is the heart of the
data-mining profession.
Deliverables for this task include three items:
● Parameter settings: When building models, most tools give you the option of
adjusting a variety of settings, and these settings have an impact on the structure of
the final model. Document these settings in a report.
● Model descriptions: Describe your models. State the type of model (such as linear
regression or neural network) and the variables used. Explain how the model is
interpreted. Document any difficulties encountered in the modeling process.
● Models: This deliverable is the models themselves. Some model types can be easily
defined with a simple equation; others are far too complex and must be transmitted in
a more sophisticated format.
4. Assessing model(s)
Now you will review the models that you’ve created, from a technical standpoint and also
from a business standpoint (often with input from business experts on your project team).
Deliverables for this task include two reports:
● Model assessment: Summarizes the information developed in your model review. If
you have created several models, you may rank them based on your assessment of
their value for a specific application.
● Revised parameter settings: You may choose to fine-tune settings that were used to
build the model and conduct another round of modeling and try to improve your
results.
VALIDATION:
Why data validation?
Data validation happens immediately after data preparation/wrangling and before
modeling. This is because during data preparation there is a high possibility of things going
wrong, especially in complex scenarios.
Data validation ensures that modeling happens on the right data. Faulty data as input
to the model would generate faulty insight!
How is data validation done?
Data validation should be done by involving at least one external person who has a
proper understanding of the data and the business. It is usually the clients who are technically
good enough to check the data. Once we go through data preparation, and just before data
modeling, we usually create data visualizations and give the newly prepared data to the
client.
The clients, with the help of SQL queries or any other tools, try to validate that the output
contains no errors.
Combining CRISP-DM/ASUM-DM with the agile methodology, steps can be taken in
parallel, meaning you do not have to wait for the green light from data validation to do the
modeling. But once you get feedback from the domain expert that there are faults in the data,
you need to correct the data by re-doing the data preparation and re-modeling the data.
What are the common causes leading to a faulty output from data preparation?
Common causes are:
1. Lack of proper understanding of the data, because of which the logic of the data
preparation is not correct.
2. Common bugs in programming/data preparation pipeline that led to a faulty output.
EVALUATION:
The evaluation phase includes three tasks. These are
● Evaluating results
● Reviewing the process
● Determining the next steps
Data analysts help people make sense of the numerical data that has been
aggregated, transformed, and displayed. There are two main methods for data interpretation:
quantitative and qualitative.
Qualitative Data Interpretation Method
This is a method for breaking down or analyzing so-called qualitative data, also known as
categorical data. It is important to note that no bar graphs or line charts are used in this
method. Instead, they rely on text. Because qualitative data is collected through
person-to-person techniques, it isn't easy to present using a numerical approach.
Surveys are used because they allow you to assign numerical values to answers,
making them easier to analyze. If we relied solely on the text, it would be a time-consuming
and error-prone process. This is why it must be transformed.
Quantitative Data Interpretation Method
This data interpretation method is applied when we are dealing with quantitative or numerical data.
Since we are dealing with numbers, the values can be displayed in a bar chart or pie chart.
There are two main types: Discrete and Continuous. Moreover, numbers are easier to analyze
since they involve statistical modeling techniques like mean and standard deviation.
Mean is the average value of a particular data set, obtained by dividing the sum of the values
within that data set by the number of values within that same set.
Standard deviation is a technique used to ascertain how responses align with or deviate
from the average value or mean. It relies on the mean to describe the consistency of the
replies within a particular data set. You can use this when calculating the average pay for a
certain profession and then displaying the upper and lower values in the data set.
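A short sketch of that calculation, using hypothetical salary figures for one profession:

import statistics as st

# Hypothetical annual salaries for one profession
salaries = [42000, 45000, 47000, 52000, 48000, 61000, 44000]

mean_pay = st.mean(salaries)
std_pay = st.stdev(salaries)  # sample standard deviation

print("Average pay   :", round(mean_pay, 2))
print("Typical range :", round(mean_pay - std_pay, 2), "to", round(mean_pay + std_pay, 2))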
As stated, some tools can do this automatically, especially when it comes to quantitative data.
Whatagraph is one such tool, as it can aggregate data from multiple sources using different
system integrations. It will also automatically organize and analyze the data, which will later be
displayed in pie charts, line charts, or bar charts, however you wish.
Benefits of Data Interpretation
Multiple data interpretation benefits explain its significance within the corporate world,
medical industry, and financial industry:
Informed decision-making. The managing board must examine the data to take action and
implement new methods. This emphasizes the significance of well-analyzed data as well as a
well-structured data collection process.
Anticipating needs and identifying trends. Data analysis provides users with relevant
insights that they can use to forecast trends, based on customer concerns and
expectations.
For example, a large number of people are concerned about privacy and the leakage of
personal information. Products that provide greater protection and anonymity are more likely
to become popular.
Clear foresight. Companies that analyze and aggregate data better understand their own
performance and how consumers perceive them. This provides them with a better
understanding of their shortcomings, allowing them to work on solutions that will
significantly improve their performance.
Some queries are updated in the database, such as "Were the decision and action impactful?",
"What was the return on investment?" and "How did the analysis group compare with the
control group?". The performance-based database is continuously updated once the new
insight or knowledge is extracted.
UNIT II
BUSINESS INTELLIGENCE
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and
online transaction records. It requires performing data cleaning and integration during data
warehousing to ensure consistency in naming conventions, attribute types, etc., among different
data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve files from 3
months, 6 months, 12 months ago, or even older data from a data warehouse. This contrasts
with a transaction system, where often only the most current file is kept.
Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the source
operational RDBMS. The operational updates of data do not occur in the data warehouse, i.e.,
update, insert, and delete operations are not performed. It usually requires only two procedures
in data accessing: Initial loading of data and access to data. Therefore, the DW does not require
transaction processing, recovery, and concurrency capabilities, which allows for substantial
speedup of data retrieval. Non-volatile means that once data has entered the warehouse, it
should not change.
DATA MART
A Data Mart is focused on a single functional area of an organization and contains a
subset of data stored in a Data Warehouse. A Data Mart is a condensed version of Data
Warehouse and is designed for use by a specific department, unit or set of users in an
organization. E.g., Marketing, Sales, HR or finance. It is often controlled by a single department
in an organization.
Data Mart usually draws data from only a few sources compared to a Data warehouse. Data
marts are small in size and are more flexible compared to a Datawarehouse.
Why do we need Data Mart?
● Data Mart helps to enhance user’s response time due to reduction in volume of data
● It provides easy access to frequently requested data.
● Data marts are simpler to implement when compared to a corporate Datawarehouse. At the
same time, the cost of implementing a Data Mart is certainly lower compared with
implementing a full data warehouse.
● Compared to Data Warehouse, a DataMart is agile. In case of change in model, DataMart
can be built quicker due to a smaller size.
● A Datamart is defined by a single Subject Matter Expert. On the contrary, a data warehouse
is defined by interdisciplinary SMEs from a variety of domains. Hence, a Data mart is more
open to change compared to a Datawarehouse.
● Data is partitioned and allows very granular access control privileges.
● Data can be segmented and stored on different hardware/software platforms.
Types of Data Mart
There are three main types of data mart:
1. Dependent: Dependent data marts are created by drawing data directly from an existing,
central data warehouse.
2. Independent: Independent data mart is created without the use of a central data
warehouse.
3. Hybrid: This type of data marts can take data from data warehouses or operational
systems.
Dependent Data Mart
A dependent data mart sources an organization's data from a single, existing Data Warehouse. It
offers the benefit of centralization. If you need to develop one or more physical data marts, then
you need to configure them as dependent data marts.
A dependent Data Mart can be built in two different ways: either where a user can access both
the data mart and the data warehouse, depending on need, or where access is limited only to the
data mart. The second approach is not optimal, as it produces what is sometimes referred to as a
data junkyard, in which all data begins with a common source but is scrapped and mostly
junked.
Dependent Data Mart
Independent Data Mart
An independent data mart is created without the use of a central Data warehouse. This kind of
Data Mart is an ideal option for smaller groups within an organization.
An independent data mart has neither a relationship with the enterprise data warehouse nor with
any other data mart. In Independent data mart, the data is input separately, and its analyses are
also performed autonomously.
Implementation of independent data marts is antithetical to the motivation for building a data
warehouse, which is, first of all, the need for a consistent, centralized store of enterprise data that
can be analysed by multiple users with different interests who want widely varying information.
Independent Data Mart
Hybrid Data Mart:
A hybrid data mart combines input from sources apart from Data warehouse. This could be
helpful when you want ad-hoc integration, like after a new group or product is added to the
organization.
It is the data mart type best suited for multiple database environments and fast
implementation turnaround for any organization. It also requires the least data cleansing effort.
A Hybrid Data mart also supports large storage structures, and it is best suited for flexible,
smaller data-centric applications.
Hybrid Data Mart
Implementing a Data Mart is a rewarding but complex procedure. Here are the detailed steps to
implement a Data Mart:
Designing
Designing is the first phase of Data Mart implementation. It covers all the tasks between
initiating the request for a data mart to gathering information about the requirements. Finally, we
create the logical and physical Data Mart design.
The design step involves the following tasks:
● Gathering the business & technical requirements and Identifying data sources.
● Selecting the appropriate subset of data.
● Designing the logical and physical structure of the data mart.
Data could be partitioned based on following criteria:
● Date
● Business or Functional Unit
● Geography
● Any combination of above
Data could be partitioned at the application or DBMS level, though it is recommended to
partition at the application level, as it allows different data models each year as the business
environment changes.
What Products and Technologies Do You Need?
A simple pen and paper would suffice, though tools that help you create UML or ER
diagrams can also append metadata to your logical and physical designs.
Constructing
This is the second phase of implementation. It involves creating the physical database and the
logical structures.
This step involves the following tasks:
● Implementing the physical database designed in the earlier phase. For instance, database
schema objects like table, indexes, views, etc. are created.
What Products and Technologies Do You Need?
You need a relational database management system to construct a data mart. RDBMS have
several features that are required for the success of a Data Mart.
● Storage management: An RDBMS stores and manages the data to create, add, and delete
data.
● Fast data access: With a SQL query you can easily access data based on certain
conditions/filters.
● Data protection: The RDBMS system also offers a way to recover from system failures
such as power failures. It also allows restoring data from backups in case the disk
fails.
● Multiuser support: The data management system offers concurrent access, the ability for
multiple users to access and modify data without interfering or overwriting changes made
by another user.
● Security: The RDBMS system also provides a way to regulate access by users to objects
and certain types of operations.
Populating:
In the third phase, data is populated in the data mart.
The populating step involves the following tasks:
● Source data to target data Mapping
● Extraction of source data
● Cleaning and transformation operations on the data
● Loading data into the data mart
● Creating and storing metadata
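A very small sketch of this populating step, assuming pandas and a SQLite-based data mart; the table and column names are illustrative only:

import pandas as pd
import sqlite3

# Extract: in practice this comes from the source systems; a tiny in-memory
# extract is used here so the sketch runs on its own
source = pd.DataFrame({
    "order_id": [101, 102, None],
    "cust": ["C1", "C2", "C3"],
    "amt": [250.0, 125.5, 90.0],
})

# Transform: cleaning and source-to-target column mapping
source = source.dropna(subset=["order_id"])
target = source.rename(columns={"cust": "customer_id", "amt": "sales_amount"})

# Load: write the rows into the data mart fact table
conn = sqlite3.connect("sales_mart.db")
target.to_sql("fact_sales", conn, if_exists="append", index=False)
conn.close()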
TYPES OF DECISIONS
Decision-making is one of the core functions of management. And it is actually a very scientific
function with a well-defined decision-making process. There are various types of decisions the
managers have to take in the day-to-day functioning of the firm. Let us take a look at some of
the types of decisions.
1. Programmed and Non-Programmed Decisions
2. Major and Minor Decisions
3. Routine and Strategic Decisions
4. Organizational and Personal Decisions
5. Individual and Group Decisions
6. Policy and Operation Decisions and
7. Long-Term Departmental and Non-Economic Decisions.
Type # 1. Programmed and Non-Programmed Decisions:
(a) Programmed decisions are those made in accordance with some habit, rule or procedure.
Every organisation has written or unwritten policies that simplify decision making in recurring
situations by limiting or excluding alternatives.
For example, we would not usually have to worry about what to pay to a newly hired employee;
organizations generally have an established salary scale for all positions. Routine procedures
exist for dealing with routine problems.
Routine problems are not necessarily simple ones; programmed decisions are used for dealing
with complex as well as with uncomplicated issues. To some extent, of course, programmed
decisions limit our freedom, because the organization rather than the individual decides what to
do.
However, the policies, rules or procedures by which we make decisions free us of the time
needed to work out new solutions to old problems, thus allowing us to devote attention to other,
more important activities in the organization.
(b) Non-programmed decisions are those that deal with unusual or exceptional problems. If a
problem has not come up often enough to be covered by a policy or is so important that it
deserves special treatment, it must be handled by a non-programmed decision.
Such problems as:
(1) How to allocate an organisation’s resources
(2) What to do about a failing product line,
(3) How community relations should be improved will usually require non-programmed
decisions.
As one moves up in the organizational hierarchy, the ability to make non-programmed decisions
becomes more important because progressively more of the decisions made are
non-programmed.
Type # 2. Major and Minor Decisions:
A decision related to the purchase of a CNC machine costing several lakhs is a major decision,
and the purchase of a few reams of typing paper is a minor decision.
Type # 3. Routine and Strategic Decisions:
Routine decisions are of repetitive nature, do not require much analysis and evaluation, are in
the context of day-to-day operations of the enterprise and can be made quickly at middle
management level. An example is, sending samples of a food product to the Government
investigation centre.
Strategic decisions relate to policy matter, are taken at higher levels of management after careful
analysis and evaluation of various alternatives, involve large expenditure of funds and a slight
mistake in decision making is injurious to the enterprise. Examples of strategic decisions are-
capital expenditure decisions, decisions related to pricing, expansion and change in product line
etc.
Type # 4. Organizational and Personal Decisions:
A manager makes organizational decisions in the capacity of a company officer. Such decisions
reflect the basic policy of the company. They can be delegated to others. Personal decisions
relate to the manager as an individual and not as a member of an organization. Such decisions
cannot be delegated.
Type # 5. Individual and Group Decisions:
Individual decisions are taken by a single individual in context of routine decisions where
guidelines are already provided. Group decisions are taken by a committee constituted for this
specific purpose. Such decisions are very important for the organisation.
Type # 6. Policy and Operative Decisions:
Policy decisions are very important, they are taken by top management, they have a long-term
impact and mostly relate to basic policies. Operative decisions relate to day-to-day operations of
the enterprise and are taken at lower or middle management level. Whether to give bonus to
employees is a policy decision but calculating bonus for each employee is an operative decision.
Type # 7. Long-Term Departmental and Non-Economic Decisions:
In case of long-term decisions, the time period covered is long and the risk involved is more.
Departmental decisions relate to a particular department only and are taken by departmental
head. Non-economic decisions relate to factors such as technical values, moral behaviour etc.
1. Data-Driven DSS. The first generic type of Decision Support System is the Data-Driven DSS. These systems take the massive amounts of data available through the company's TPS and MIS systems and cull from them useful information which executives can use to make more informed decisions. Users do not need a theory or model; they can explore the data freely. Data-Driven DSS include file drawer and management reporting systems, data warehousing and analysis systems, Executive Information Systems (EIS) and Spatial Decision Support Systems. Business Intelligence Systems are also examples of Data-Driven DSS. Data-Driven DSS emphasize access to and manipulation of large databases of structured data, especially a time series of internal company data and sometimes external data. Simple file systems accessed by query and retrieval tools provide the most elementary level of functionality. Data warehouse systems that allow the manipulation of data by computerized tools tailored to a specific task and setting, or by more general tools and operators, provide additional functionality. Data-Driven DSS with Online Analytical Processing (OLAP) provide the highest level of functionality and decision support, linked to the analysis of large collections of historical data.
2. Model-Driven DSS A second category, Model-Driven DSS, includes
systems that use accounting and financial models, representational
models, and optimization models. Model-Driven DSS emphasize access
to and manipulation of a model. Simple statistical and analytical tools
provide the most elementary level of functionality. Some OLAP
systems that allow complex analysis of data may be classified as hybrid
DSS systems providing modeling, data retrieval and data summarization
functionality. Model-Driven DSS use data and parameters provided by
decision-makers to aid them in analyzing a situation, but they are not
usually data intensive. Very large databases are usually not needed for
Model-Driven DSS. Model-Driven DSS were isolated from the main
Information Systems of the organization and were primarily used for the
typical “what-if” analysis. That is, “What if we increase production of our
products and decrease the shipment time?” These systems rely heavily on
models to help executives understand the impact of their decisions on the
organization, its suppliers, and its customers.
3. Knowledge-Driven DSS The terminology for this third generic type of
DSS is still evolving. Currently, the best term seems to be
Knowledge-Driven DSS. Adding the modifier “driven” to the word
knowledge maintains a parallelism in the framework and focuses on the
dominant knowledge base component. Knowledge-Driven DSS can
suggest or recommend actions to managers. These DSS are personal
computer systems with specialized problem-solving expertise. The
“expertise” consists of knowledge about a particular domain,
understanding of problems within that domain, and “skill” at solving
some of these problems. A related concept is data mining, which refers to a class of analytical applications that search for hidden patterns in a database; data mining is the process of sifting through large amounts of data to produce data content relationships. (A small rule-based sketch of a Knowledge-Driven DSS appears after this list of DSS types.)
4. Document-Driven DSS A new type of DSS, a Document-Driven DSS or
Knowledge Management System, is evolving to help managers retrieve
and manage unstructured documents and Web pages. A Document-Driven
DSS integrates a variety of storage and processing technologies to
provide complete document retrieval and analysis. The Web provides
access to large document databases including databases of hypertext
documents, images, sounds and video. Examples of documents that
would be accessed by a Document-Based DSS are policies and
procedures, product specifications, catalogs, and corporate historical
documents, including minutes of meetings, corporate records, and
important correspondence. A search engine is a powerful decision aiding
tool associated with a Document-Driven DSS.
5. Communications-Driven and Group DSS Group Decision Support
Systems (GDSS) came first, but now a broader category of
Communications-Driven DSS or groupware can be identified. This fifth
generic type of Decision Support System includes communication,
collaboration and decision support technologies that do not fit within
those DSS types identified. Therefore, we need to identify these systems
as a specific category of DSS. A Group DSS is a hybrid Decision Support
System that emphasizes both the use of communications and decision
models. A Group Decision Support System is an interactive
computer-based system intended to facilitate the solution of problems by
decision-makers working together as a group. Groupware supports
electronic communication, scheduling, document sharing, and other
group productivity and decision-support-enhancing activities. We have a number of technologies and capabilities in this category in the framework: Group DSS, two-way interactive video, whiteboards, bulletin boards, and email.
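To make the Knowledge-Driven DSS idea in item 3 concrete, the following is a minimal Python sketch of a rule-based recommender. It is only an illustration: the rules, thresholds, and field names (inventory_level, reorder_point, days_sales_outstanding, defect_rate) are invented here and are not taken from any particular system.

# Minimal sketch of a rule-based, knowledge-driven DSS (illustrative only).
# The "knowledge base" is a list of (condition, recommendation) pairs.
def recommend(state):
    """Return the recommendations whose conditions match the current business state."""
    knowledge_base = [
        (lambda s: s["inventory_level"] < s["reorder_point"],
         "Raise a purchase order to replenish stock."),
        (lambda s: s["days_sales_outstanding"] > 60,
         "Tighten credit terms for slow-paying customers."),
        (lambda s: s["defect_rate"] > 0.05,
         "Schedule a quality audit of the production line."),
    ]
    return [advice for condition, advice in knowledge_base if condition(state)]

if __name__ == "__main__":
    state = {"inventory_level": 120, "reorder_point": 200,
             "days_sales_outstanding": 72, "defect_rate": 0.02}
    for suggestion in recommend(state):
        print(suggestion)

A real Knowledge-Driven DSS would hold far richer domain knowledge and an inference engine, but the pattern of matching conditions against a knowledge base and suggesting actions is the same.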
Components of Decision Support Systems (DSS)
Decision Support Systems can create advantages for organizations and deliver positive benefits; however, building and using a DSS can also produce negative outcomes in some situations.
In conclusion, before firms invest in a Decision Support System, they must weigh its advantages against its disadvantages to ensure the investment is worthwhile.
BUSINESS INTELLIGENCE
BI is an umbrella term for data analysis techniques, applications and practices used to
support the decision-making processes in business. The term was proposed by Howard
Dresner in 1989 and became widespread in the late 1990s. Business intelligence assists
business owners in making important decisions based on their business data.
Rather than directly telling business owners what to do, business intelligence allows
them to analyze the data they have to understand trends and get insights, thus
scaffolding the decision-making process. BI includes a wide variety of techniques and
tools for data analytics, including tools for ad-hoc analytics and reporting, OLAP tools,
real-time business intelligence, SaaS BI, etc. Another important area of BI is data
visualization software, dashboards, and scorecards.
One of the key BI technologies is OLAP (Online Analytical Processing):
∙ It represents data in a multidimensional form, which makes it convenient for analysts and other business users to analyze numeric values from different perspectives.
∙ OLAP is good for storing, extracting and analyzing large amounts of data. Business
intelligence specialists are able to analyze data accumulated over a long period of time, which
enables more precise results and better forecasting. The architecture of OLAP systems allows
fast access to the data as they typically pre-aggregate data.
∙ OLAP provides wide opportunities for data slicing and dicing, drill down/up/through, which helps analysts narrow down the data used for BI analysis and reporting (a small pandas sketch of these operations follows this list).
∙ OLAP systems usually have an intuitive and easy-to-use interface, which allows nontechnical users to analyze data and generate reports without involving the IT department. Also, OLAP dimensions use familiar business terms, so employees do not need any additional training.
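As a rough illustration of the slicing-and-dicing idea described above, the following Python/pandas sketch builds a small multidimensional view; the table, column names (region, product, quarter, sales) and figures are invented for illustration only.

import pandas as pd

# Toy fact table: one row per sales record (all figures are invented).
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North", "South"],
    "product": ["A", "B", "A", "B", "A", "A"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "sales":   [100, 150, 120, 90, 130, 160],
})

# "Dice": a multidimensional view of total sales by region and product per quarter.
cube = sales.pivot_table(index=["region", "product"], columns="quarter",
                         values="sales", aggfunc="sum", fill_value=0)
print(cube)

# "Slice": restrict the view to one member of the region dimension.
print(cube.loc["North"])

Dedicated OLAP servers pre-aggregate such cubes so the same operations stay fast over very large data volumes.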
Business Forecasting
3.1 Introduction
The growing competition, rapidity of change in circumstances and the trend
towards automation demand that decisions in business are based on a careful
analysis of data concerning the future course of events and not purely on
guesses and hunches. The future is unknown to us, and yet every day we are forced to make decisions involving the future; there is therefore uncertainty, and great risk is associated with business affairs. All businessmen are forced to make forecasts regarding business activities.
Success in business depends upon successful forecasts of business events. In
recent times, considerable research has been conducted in this field. Attempts
are being made to make forecasting as scientific as possible.
Business forecasting is not a new development. Every businessman must forecast, even if the entire product is sold before production. Forecasting has
always been necessary. What is new in the attempt to put forecasting on a
scientific basis is to forecast by reference to past history and statistics rather
than by pure intuition and guess-work.
One of the most important tasks before businessmen and economists these days
is to make estimates for the future. For example, a businessman is interested in
finding out his likely sales next year or, for long-term planning, over the next five or ten years, so that he can adjust his production accordingly and avoid either inadequate production to meet the demand or unsold stocks.
Similarly, an economist is interested in estimating the likely population in the
coming years so that proper planning can be carried out with regard to jobs for
the people, food supply, etc. First step in making estimates for the future
consists of gathering information from the past. In this connection we usually
deal with statistical data which is collected, observed or recorded at successive
intervals of time. Such data is generally referred to as time series. Thus, when
we observe numerical data at different points of time the set of observations is
known as time series.
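A time series in this sense is simply a set of observations indexed by successive points in time. A minimal Python/pandas sketch, with invented monthly sales figures, is:

import pandas as pd

# Monthly sales observed at successive intervals of time (figures are invented).
sales = pd.Series(
    [210, 225, 240, 238, 260, 275],
    index=pd.date_range("2023-01-01", periods=6, freq="MS"),
    name="monthly_sales",
)
print(sales)                        # the time series itself
print(sales.pct_change().dropna())  # month-over-month change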
Objectives:
After studying this unit, you should be able to:
● describe the meaning of business forecasting
● distinguish between prediction, projection and forecast
● describe the forecasting methods available
● apply the forecasting theories in taking effective business decisions
3.2 Business Forecasting
Business forecasting refers to the analysis of past and present economic
conditions with the object of drawing inferences about probable future business
conditions. The process of making definite estimates of the future course of events is referred to as forecasting, and the figure or statement obtained from the process is known as a 'forecast'. The future course of events is rarely known with certainty, so an organised system of forecasting helps in anticipating it. The following are two aspects of scientific business forecasting:
1. Analysis of past economic conditions
For this purpose, the components of time series are to be studied. The secular
trend shows how the series has been moving in the past and what its future
course is likely to be over a long period of time. The cyclic fluctuations would
reveal whether the business activity is subjected to a boom or depression. The
seasonal fluctuations would indicate the seasonal changes in the business
activity.
2. Analysis of present economic conditions
The object of analysing present economic conditions is to study those factors
which affect the sequential changes expected on the basis of the past conditions.
Such factors are new inventions, changes in fashion, changes in economic and
political spheres, economic and monetary policies of the government, war, etc.
These factors may affect and alter the duration of trade cycle. Therefore, it is
essential to keep in mind the present economic conditions since they have an
important bearing on the probable future tendency.
3.2.1 Objectives of forecasting in business
Forecasting is a part of human nature. Businessmen also need to look to the
future. Success in business depends on correct predictions. In fact, when a man enters business, he automatically takes on the responsibility of attempting to forecast the future.
To a very large extent, success or failure would depend upon the ability to
successfully forecast the future course of events. Without some element of
continuity between past, present and future, there would be little possibility of
successful prediction. But history is not likely to repeat itself and we would
hardly expect economic conditions next year or over the next 10 years to follow
a clear cut prediction. Yet, past patterns prevail sufficiently to justify using the
past as a basis for predicting the future.
A businessman cannot afford to base his decisions on guesses. Forecasting
helps a businessman in reducing the areas of uncertainty that surround
management decision
making with respect to costs, sales, production, profits, capital investment,
pricing, expansion of production, extension of credit, development of markets,
increase of inventories and curtailment of loans. These decisions are to be based
on present indications of future conditions.
However, we know that it is impossible to forecast the future precisely. There is
a possibility that the forecast will contain some range of error. Statistical forecasting methods allow us to use the mathematical theory of probability to measure the risk of error in predictions.
3.2.1.1 Prediction, Projection and Forecasting
A great amount of confusion seems to have grown up in the use of words
‘forecast’, ‘prediction’ and ‘projection’.
Forecasts are made by estimating future values of the external factors by means
of prediction, projection or forecast and from these values calculating the
estimate of the dependent variable.
3.2.2 Characteristics of Business Forecasting
● Based on past and present conditions
Business forecasting is based on the past and present economic condition of the business. To forecast the future, various data, information and facts concerning the economic condition of the business in the past and present are analysed.
● Based on mathematical and statistical methods
The process of forecasting includes the use of statistical and mathematical
methods. By using these methods, the actual trend which may take place in
future can be forecasted.
● Period
The forecasting can be made for long term, short term, medium term or any
specific period.
● Estimation of future
Business forecasting is to forecast the future regarding probable economic conditions.
● Scope
Forecasting can be physical as well as financial.
3.2.3 Steps in forecasting
Forecasting of business fluctuations consists of the following steps:
1. Understanding why changes in the past have occurred
One of the basic principles of statistical forecasting is that the forecaster should
use past performance data. The current rate and changes in the rate constitute
the basis of forecasting. Once they are known, various mathematical techniques
can develop projections from them. If an attempt is made to forecast business
fluctuations without understanding why past changes have taken place, the
forecast will be purely mechanical. A forecast based solely upon the application of mathematical formulae is subject to serious error.
2. Determining which phases of business activity must be measured
After understanding the reasons of occurrence of business fluctuations, it is
necessary to measure certain phases of business activity in order to predict
what changes will probably follow the present level of activity.
Quantitative forecasting
The quantitative forecasting method relies on historical data to predict future needs
and trends. The data can be from your own company, market activity, or both. It
focuses on cold, hard numbers that can show clear courses of change and action. This
method is beneficial for companies that have an extensive amount of data at their
disposal.
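As a minimal sketch of quantitative forecasting built purely from historical numbers (the figures below are invented), two simple techniques are a moving average and a fitted linear trend:

import numpy as np

# Twelve months of historical sales (invented figures).
history = np.array([120, 125, 130, 128, 135, 140, 138, 145, 150, 148, 155, 160])

# Forecast the next month with the average of the last three months.
moving_average = history[-3:].mean()

# Forecast the next month by extrapolating a least-squares linear trend.
t = np.arange(len(history))
slope, intercept = np.polyfit(t, history, 1)
trend_forecast = slope * len(history) + intercept

print(f"3-month moving average forecast: {moving_average:.1f}")
print(f"Linear trend forecast:           {trend_forecast:.1f}")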
Qualitative forecasting
The qualitative forecasting method relies on the input of those who influence your
company’s success. This includes your target customer base and even your leadership
team. This method is beneficial for companies that don’t have enough complex data to
conduct a quantitative forecast.
There are two approaches to qualitative forecasting:
1. Market research: The process of collecting data points through direct correspondence
with the market community. This includes conducting surveys, polls, and focus groups
to gather real-time feedback and opinions from the target market. Market research
looks at competitors to see how they adjust to market fluctuations and adapt to
changing supply and demand. Companies commonly utilize market research to
forecast expected sales for new product launches.
2. Delphi method: This method collects forecasting data from company professionals.
The company’s foreseeable needs are presented to a panel of experts, who then work
together to forecast the expectations and business decisions that can be made with the
derived insights. This method is used to create long-term business predictions and can
also be applied to sales forecasts.
3.3 Utility of Business Forecasting
Business forecasting acquires an important place in every field of the economy.
Business forecasting helps the businessmen and industrialists to form the
policies and plans related with their activities. On the basis of the forecasting,
businessmen can forecast the demand of the product, price of the product,
condition of the market and so on. The business decisions can also be reviewed
on the basis of business forecasting.
3.3.1 Advantages of business forecasting
● Helpful in increasing profit and reducing losses
Every business is carried out with the purpose of earning maximum profits. So,
by forecasting the future price of the product and its demand, the businessman can predetermine the production cost, the level of production and the level of stock to be maintained. Thus, business forecasting is regarded as the key to the success of a business.
● Helpful in taking management decisions
Business forecasting provides the basis for management decisions, because in present times management has to take decisions in an atmosphere of uncertainty. Business forecasting also explains future conditions and enables the management to select the best alternative.
● Useful to administration
On the basis of forecasting, the government can control the circulation of
money. It can also modify the economic, fiscal and monetary policies to avoid
adverse effects of trade cycles. So, with the help of forecasting, the government
can control the expected fluctuations in future.
● Basis for capital market
Business forecasting helps in estimating the requirement of capital, position of
stock exchange and the nature of investors.
● Useful in controlling the business cycles
The trade cycles cause various depressions in business such as sudden change in
price level, increase in the risk of business, increase in unemployment, etc. By
adopting a systematic business forecasting, businessmen and government can
handle and control the depression of trade cycles.
● Helpful in achieving the goals
Business forecasting helps to achieve the objective of business goals through
proper planning of business improvement activities.
● Facilitates control
By business forecasting, the tendency of black marketing, speculation,
uneconomic activities and corruption can be controlled.
● Utility to society
With the help of business forecasting the entire society is also benefited because
the adverse effects of fluctuations in the conditions of business are kept under
control.
3.3.2 Limitations of business forecasting
Business forecasting cannot be accurate due to various limitations which are
mentioned below.
● Forecasting cannot be accurate, because it is largely based on future events
and there is no guarantee that they will happen.
● Business forecasting is generally made by using statistical and
mathematical methods. However, these methods cannot claim to make an
uncertain future a definite one.
● The underlying assumptions of business forecasting cannot always be satisfied simultaneously; in such cases, the results of forecasting will be misleading.
● The forecasting cannot guarantee the elimination of errors and mistakes.
The managerial decision will be wrong if the forecasting is done in a wrong
way.
● Factors responsible for economic changes are often difficult to discover and measure. Hence, business forecasting becomes an unreliable exercise.
● Business forecasting does not evaluate risks.
● The forecasting is made on the basis of past information and data and relies
on the assumption that economic events are repeated under the same
conditions. But there may be circumstances where these conditions are not
repeated.
● Forecasting is not a continuous process. In order to be effective, it requires
continuous attention.
Predictive Analytics
Predictive Analytics is a statistical method that utilizes algorithms and machine
learning to identify trends in data and predict future behaviors.
With increasing pressure to show a return on investment (ROI) for implementing
learning analytics, it is no longer enough for a business to simply show how learners
performed or how they interacted with learning content. It is now desirable to go
beyond descriptive analytics and gain insight into whether training initiatives are
working and how they can be improved.
Predictive Analytics can take both past and current data and offer predictions of what
could happen in the future. This identification of possible risks or opportunities enables
businesses to take actionable intervention in order to improve future learning
initiatives.
The software for predictive analytics has moved beyond the realm of statisticians and is
becoming more affordable and accessible for different markets and industries,
including the field of learning & development.
For online learning specifically, predictive analytics is often found incorporated in the
Learning Management System (LMS), but can also be purchased separately as
specialized software.
For the learner, predictive forecasting could be as simple as a dashboard located on the
main screen after logging in to access a course. Analyzing data from past and current
progress, visual indicators in the dashboard could be provided to signal whether the
employee was on track with training requirements.
At the business level, an LMS system with predictive analytic capability can help
improve decision-making by offering in-depth insight into strategic questions and concerns. This could range from course enrolment, to course completion rates, to employee performance.
Predictive analytic models
Because predictive analytics goes beyond sorting and describing data, it relies heavily
on complex models designed to make inferences about the data it encounters. These
models utilize algorithms and machine learning to analyze past and present data in
order to provide future trends.
Each model differs depending on the specific needs of those employing predictive
analytics. Some common basic models that are utilized at a broad level include:
● Decision trees use branching to show possibilities stemming from each
outcome or choice.
● Regression techniques assist with understanding relationships between variables. (A brief sketch of both model types follows.)
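The sketch below illustrates both model families with scikit-learn; the training data (hours of training and a prior assessment score, used to predict pass/fail and a final score) is invented purely for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Invented learner data: [hours of training completed, prior assessment score].
X = np.array([[2, 55], [10, 80], [4, 60], [12, 85], [6, 70], [1, 50]])

# Decision tree: branches on the features to predict pass (1) or fail (0).
passed = np.array([0, 1, 0, 1, 1, 0])
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, passed)
print("Tree prediction for [8, 75]:", tree.predict([[8, 75]])[0])

# Regression: quantifies the relationship between the features and a final score.
final_score = np.array([58, 88, 65, 92, 78, 52])
reg = LinearRegression().fit(X, final_score)
print("Regression coefficients:", reg.coef_, "intercept:", round(reg.intercept_, 2))
print("Predicted score for [8, 75]:", round(reg.predict([[8, 75]])[0], 1))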
Predictive Modeling
Predictive modeling means developing models that can be used to forecast or predict
future events. In business analytics, models can be developed based on logic or data.
Logic-Driven Models
A logic-driven model is one based on experience, knowledge, and logical relationships
of variables and constants connected to the desired business performance outcome
situation.
The question here is how to put variables and constants together to create a model that
can predict the future. Doing this requires business experience. Model building requires
an understanding of business systems and the relationships of variables and constants
that seek to generate a desirable business performance outcome. To help conceptualize
the relationships inherent in a business system, diagramming methods can be helpful.
For example, the cause-and-effect diagram is a visual aid diagram that permits a user to
hypothesize relationships between potential causes of an outcome (see Figure). This
diagram lists potential causes in terms of human, technology, policy, and process
resources in an effort to establish some basic relationships that impact business
performance. The diagram is used by tracing contributing and relational factors from
the desired business performance goal back to possible causes, thus allowing the user to
better picture sources of potential causes that could affect the performance. This
diagram is sometimes referred to as a fishbone diagram because of its appearance.
Fig. Cause-and-effect diagram
For example, a simple logic-driven profit model is: Profit = (Unit Price × Quantity Sold) − [Fixed Cost + (Variable Cost × Quantity Sold)]
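This logic-driven model translates directly into a small function; the sample values below are invented and chosen so the example lands exactly at break-even.

def profit(unit_price, quantity_sold, fixed_cost, variable_cost):
    """Profit = (Unit Price x Quantity Sold) - [Fixed Cost + (Variable Cost x Quantity Sold)]."""
    revenue = unit_price * quantity_sold
    total_cost = fixed_cost + variable_cost * quantity_sold
    return revenue - total_cost

# Invented example: price 50 per unit, 1,000 units, fixed cost 20,000, variable cost 30 per unit.
print(profit(unit_price=50, quantity_sold=1000, fixed_cost=20000, variable_cost=30))  # 0 -> break-even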
Data-Driven Models
A data-driven model, by contrast, is developed from the data itself, often using data mining techniques. Suppose a grocery store has collected a big data file on what customers put into their baskets at the market (the collection of grocery items a customer purchases at one time). The grocery store would like to know if there are any associated items in a typical market basket. (For example, if a customer purchases product A, she will most often associate it with, or purchase it with, product B.) If the customer generally purchases products A and B together, the store
might only need to advertise product A to gain both product A’s and B’s
sales. The value of knowing this association of products can improve the
performance of the store by reducing the need to spend money on
advertising both products. The benefit is real if the association holds true.
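A very simple way to check such an association from transaction data is to count how often items appear together; the baskets below are invented, and the support/confidence measures shown are standard association measures rather than anything specific to this case.

from itertools import combinations
from collections import Counter

# Invented market baskets (one set of items per customer visit).
baskets = [
    {"A", "B", "milk"},
    {"A", "B"},
    {"A", "bread"},
    {"A", "B", "eggs"},
    {"B", "milk"},
]

# Count how often each pair of items is purchased together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of {A, B} and confidence of the rule A -> B.
support_ab = pair_counts[("A", "B")] / len(baskets)
support_a = sum(1 for b in baskets if "A" in b) / len(baskets)
print(f"support(A,B) = {support_ab:.2f}, confidence(A -> B) = {support_ab / support_a:.2f}")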
The SAS K-Mean cluster software can be found in Proc Cluster. Any
integer value can designate the K number of clusters desired. In this
problem set, K=2. The SAS printout of this classification process is shown
in Table 6.3.
The Initial Cluster Centers table lists the initial high (20167) and low (12369) values from the data set as the clustering process begins. As it turns out, the software divided the customers into 9 high-sales customers and 11 low-sales customers.
Considering how large big data sets can be, this kind of classification capability can be a useful tool for identifying and predicting sales based on the cluster mean values.
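The same K = 2 clustering step can be sketched with scikit-learn rather than SAS; the customer sales totals below are invented and simply mimic a mix of high and low values.

import numpy as np
from sklearn.cluster import KMeans

# Invented customer sales totals (one value per customer).
sales = np.array([[20167], [19800], [18550], [12369], [13100],
                  [12800], [19250], [13500], [18900], [12950]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sales)
print("Cluster centers:", kmeans.cluster_centers_.ravel())
print("Cluster labels: ", kmeans.labels_)  # which customers fall in the high- or low-sales group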
The case study firm had collected a random sample of monthly sales
information presented in Figure 6.4 listed in thousands of dollars. What the
firm wants to know is, given a fixed budget of $350,000 for promoting this
service product, when it is offered again, how best should the company
allocate budget dollars in hopes of maximizing the future estimated
month’s product sales? Before the firm makes any allocation of budget,
there is a need to understand how to estimate future product sales. This
requires understanding the behavior of product sales relative to sales
promotion efforts using radio, paper, TV, and point-of-sale (POS) ads.
Figure 6.4 Data for marketing/planning case study
The R-Square Adjusted statistic does not have the same interpretation as R-Square (a precise, proportional measure of variation in the
relationship). It is instead a comparative measure of suitability of
alternative independent variables. It is ideal for selection between
independent variables in a multiple regression model. The R-Square
adjusted seeks to take into account the phenomenon of the R-Square
automatically increasing when additional independent variables are added
to the model. This phenomenon is like a painter putting paint on a canvas,
where more paint additively increases the value of the painting. Yet by
continually adding paint, there comes a point at which some paint covers
other paint, diminishing the value of the original. Similarly, statistically
adding more variables should increase the ability of the model to capture
what it seeks to model. On the other hand, putting in too many variables,
some of which may be poor predictors, might bring down the total
predictive ability of the model. The R-Square adjusted statistic provides
some information to aid in revealing this behavior.
The value of the R-Square adjusted statistic can be negative, but it will always be less than or equal to that of the R-Square to which it is related.
Unlike R-Square, the R-Square adjusted increases when a new independent
variable is included only if the new variable improves the R-Square more
than would be expected in the absence of any independent value being
added. If a set of independent variables is introduced into a regression
model one at a time in forward step-wise regression using the highest
correlations ordered first, the R-Square adjusted statistic will end up being
equal to or less than the R-Square value of the original model. By
systematic experimentation with the R-Square adjusted recomputed for
each added variable or combination, the value of the R-Square adjusted
will reach a maximum and then decrease. The multiple regression model
with the largest R-Square adjusted statistic will be the most
accurate combination of having the best fit without excessive or
unnecessary independent variables. Again, just putting all the variables
into a model may add unneeded variability, which can decrease its
accuracy. Thinning out the variables is important.
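For reference, with n observations and p independent variables, the adjusted R-Square is computed (in standard notation) as:

\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}

The n − p − 1 term in the denominator is what penalises the statistic when an added variable does not improve R-Square enough to justify the extra parameter.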
Table 6.9 SAS Best Variable Combination Regression Model and Statistics:
Marketing/Planning Case Study
where Yp = estimated product sales in dollars, X1 = dollars allocated to radio commercials, and X2 = dollars allocated to TV commercials.
Because all the data used in the model is expressed as dollars, the
interpretation of the model is made easier than using more complex data.
The interpretation of the multiple regression model suggests that for every
dollar allocated to radio commercials (represented by X1), the firm will
receive
$275.69 in product sales (represented by Yp in the model). Likewise, for every
dollar allocated to TV commercials (represented by X2), the firm will receive
$48.34 in product sales.
In summary, for this case study, the predictive analytics analysis has
revealed a more detailed, quantifiable relationship between the generation
of product sales and the sources of promotion that best predict sales. The
best way
to allocate the $350,000 budget to maximize product sales might involve
placing the entire budget into radio commercials because they give the best
return per dollar of budget. Unfortunately, there are constraints and
limitations regarding what can be allocated to the different types of
promotional methods. Optimizing the allocation of a resource and
maximizing business performance necessitate the use of special business
analytic methods designed to accomplish this task. This requires the
additional step of prescriptive analytics analysis in the BA process, which
will be presented in the last section of Chapter 7.
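As a rough back-of-the-envelope check of this reasoning (ignoring the model's intercept, the paper and POS terms, and any allocation constraints, all of which matter in the full case study), the stated per-dollar returns can be compared across a few candidate splits of the $350,000 budget:

# Per-dollar sales returns quoted above; the intercept, paper and POS terms, and
# allocation constraints are ignored, so this is only an illustrative comparison.
RADIO_RETURN = 275.69   # estimated product sales per dollar of radio commercials
TV_RETURN = 48.34       # estimated product sales per dollar of TV commercials
BUDGET = 350_000

for radio_share in (1.0, 0.5, 0.0):
    radio = BUDGET * radio_share
    tv = BUDGET - radio
    estimated_sales = RADIO_RETURN * radio + TV_RETURN * tv
    print(f"radio ${radio:,.0f} / TV ${tv:,.0f} -> estimated sales ${estimated_sales:,.0f}")

This makes visible why radio dominates under these coefficients, and why the allocation question ultimately needs the prescriptive (optimization) step described next.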
Summary
This chapter dealt with the predictive analytics step in the BA process.
Specifically, it discussed logic-driven models based on experience and
aided by methodologies like the cause-and-effect and the influence
diagrams. This chapter also defined data-driven models useful in the
predictive step of the BA analysis. A further discussion of data mining was
presented. Data mining methodologies such as neural networks, discriminant analysis, logistic regression, and hierarchical clustering were described. An illustration of K-means clustering using SAS was presented. Finally, this
chapter discussed the second installment of a case study illustrating the
predictive analytics step of the BA process. The remaining installment of
the case study will be presented in Chapter 7.
Discussion Questions
Problems
3. Assume for this problem the following table would have held true
for the resulting marketing/planning case study problem. Which
combination of variables is estimated here to be the best predictor set?
Explain why.
4. Assume for this problem that the following table would have held
true for the resulting marketing/planning case study problem. Which of the
variables is estimated here to be the best predictor? Explain why.