Business Analytics

DEMGN801

Edited by:
Dr. Suresh Kashyap
Title: BUSINESS_ANALYTICS

Author’s Name: Dr. Mohd Imran Khan

Published By : Lovely Professional University

Publisher Address: Lovely Professional University, Jalandhar Delhi GT road, Phagwara - 144411

Printer Detail: Lovely Professional University

Edition Detail: (I)

ISBN: 978-93-94068-47-6

Copyrights@ Lovely Professional University


Content

Unit 1: Business Analytics and Summarizing Business Data
Dr. Mohd Imran Khan, Lovely Professional University
Unit 2: Summarizing Business Data
Dr. Mohd Imran Khan, Lovely Professional University
Unit 3: Business Data Visualization
Dr. Mohd Imran Khan, Lovely Professional University
Unit 4: Business Forecasting using Time Series
Dr. Mohd Imran Khan, Lovely Professional University
Unit 5: Business Prediction Using Generalised Linear Models
Dr. Mohd Imran Khan, Lovely Professional University
Unit 6: Machine Learning for Businesses
Dr. Mohd Imran Khan, Lovely Professional University
Unit 7: Text Analytics for Business
Dr. Mohd Imran Khan, Lovely Professional University
Unit 8: Business Intelligence
Dr. Mohd Imran Khan, Lovely Professional University
Unit 9: Data Visualization
Dr. Mohd Imran Khan, Lovely Professional University
Unit 10: Data Environment and Preparation
Dr. Mohd Imran Khan, Lovely Professional University
Unit 11: Data Blending
Dr. Mohd Imran Khan, Lovely Professional University
Unit 12: Design Fundamentals and Visual Analytics
Dr. Mohd Imran Khan, Lovely Professional University
Unit 13: Decision Analytics and Calculations
Dr. Mohd Imran Khan, Lovely Professional University
Unit 14: Mapping
Dr. Mohd Imran Khan, Lovely Professional University
Unit 01: Business Analytics and Summarizing Business Data


Dr. Mohd Imran Khan, Lovely Professional University

CONTENTS
Objectives
Introduction
1.1 Overview of Business Analytics
1.2 Scope of Business Analytics
1.3 Use cases of Business Analytics
1.4 What Is R?
1.5 The R Environment
1.6 What is R Used For?
1.7 The Popularity of R by Industry
1.8 How to Install R
1.9 R packages
1.10 Vector in R
1.11 Data types in R
1.12 Data Structures in R
Summary
Keywords
Self Assessment
Answers for Self Assessment
Review Questions
Further Readings

Objectives
• overview of business analytics,
• scope of business analytics,
• applications of business analytics,
• RStudio environment for business analytics,
• basics of R: packages,
• vectors in R programming,
• data types and data structures in R programming.

Introduction
Business analytics is a crucial aspect of modern-day organizations that leverages data and
advanced analytical techniques to make data-driven decisions. The goal of business analytics is to
turn data into insights that can help organizations identify trends, measure performance, and
optimize processes.
One of the most significant benefits of business analytics is that it allows organizations to make
informed decisions based on real data instead of gut instincts or assumptions. This leads to better
decision-making and a more strategic approach to business operations. Additionally, business
analytics enables organizations to predict future trends and allocate resources more effectively,
thereby increasing efficiency and competitiveness.
Another advantage of business analytics is that it can help organizations understand their
customers better. By analyzing customer data, organizations can gain insights into customer
behavior, preferences, and buying patterns, which can help them tailor their products and services
to meet customer needs more effectively.
However, it is important to note that business analytics is not just about collecting and analyzing
data. It requires a deep understanding of statistical and mathematical models, as well as the ability
to effectively communicate insights to key stakeholders. Furthermore, organizations must ensure
that their data is of high quality and that their analytics systems are secure, to ensure that the
insights generated are accurate and trustworthy.

1.1 Overview of Business Analytics


Business analytics is a broad field that encompasses the use of data, statistical algorithms, and
technologies to extract insights and support decision making in organizations. It involves the
collection, analysis, and interpretation of data to help organizations identify trends, measure
performance, and optimize processes.
The goal of business analytics is to turn data into actionable insights that can inform strategy and
drive improvements. This is achieved through a combination of descriptive, diagnostic, predictive,
and prescriptive analytics, which provide different levels of insight and support different types of
decision making.
Descriptive analytics provides a historical perspective on business performance and focuses on
summarizing and describing past data. Diagnostic analytics focuses on identifying root causes of
performance issues. Predictive analytics uses historical data and statistical models to make
predictions about future performance. Prescriptive analytics provides recommendations for
decision-makers to optimize future outcomes.
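The difference between descriptive and predictive analytics can be sketched in a few lines of R. The monthly revenue figures below are invented purely for illustration, and a simple linear trend stands in for whatever statistical model an organization would actually choose.

```r
# Hypothetical monthly revenue figures, invented purely for illustration.
sales <- data.frame(
  month   = 1:6,
  revenue = c(110, 120, 130, 140, 150, 160)
)

# Descriptive analytics: summarise what has already happened.
avg_revenue <- mean(sales$revenue)

# Predictive analytics: fit a linear trend to past data and project month 7.
trend <- lm(revenue ~ month, data = sales)
forecast_month7 <- predict(trend, newdata = data.frame(month = 7))

avg_revenue
forecast_month7
```

Diagnostic and prescriptive analytics build on the same data: the first asks why revenue moved as it did, and the second asks what action to take given the forecast.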
Business analytics tools and technologies include data warehousing, data mining, machine
learning, and visualization tools, among others. The use of these tools and techniques enables
organizations to collect, process, and analyze large amounts of data, providing insights that would
be difficult to extract manually.
Overall, business analytics is a critical tool for modern organizations: it enables them to make
informed, data-driven decisions, improve operations, and stay competitive in a rapidly changing
business environment. While it requires a combination of technical expertise and communication
skills, the benefits it brings make it a valuable investment.

1.2 Scope of Business Analytics


The scope of business analytics covers a wide range of activities and areas within an organization,
including:
Data Collection and Management: The process of gathering, storing, and organizing data from
various sources in a structured manner.
Data Analysis: The process of using statistical and mathematical techniques to identify patterns
and relationships in data, and to gain insights into business problems.
Predictive Modeling: The use of statistical algorithms and machine learning techniques to make
predictions about future events or trends based on historical data.
Data Visualization: The process of creating visual representations of data to help understand and
communicate insights and information more effectively.
Decision-Making Support: Using analytics to provide insights and recommendations to decision-
makers to help them make more informed choices.
Customer Behavior Analysis: The process of analyzing customer data to understand customer
preferences, needs, and purchase patterns, in order to inform business strategy and improve
customer experience.
Market Research: The process of gathering and analyzing data about the market, customers, and
competitors to inform business strategy.
Inventory Management: Using analytics to optimize the management of inventory levels and costs,
and to improve supply chain efficiency.
Financial Forecasting: The process of using data and analytical models to make predictions about
future financial performance and outcomes.
Operations Optimization: Using analytics to optimize business processes and operations, and to
improve efficiency, productivity, and customer satisfaction.
Sales and Marketing Analysis: Evaluating the effectiveness of sales and marketing strategies, and
determining opportunities for improvement.
Supply Chain Optimization: Optimizing supply chain operations, such as inventory management,
logistics, and transportation.
Financial Analysis and Reporting: Analyzing financial data to support budgeting, forecasting, and
decision-making.
Human Resource Management and Analysis: Examining HR data to improve workforce planning,
talent management, and employee satisfaction.
Operations and Process Improvement: Identifying and improving inefficiencies in business
processes to increase efficiency and productivity.
Business Analytics Success Stories
Here are some well-known business data analytics success stories:
Capital One: Capital One uses data analytics to detect fraud and manage risk. The company's
algorithms analyze customer data to identify unusual or suspicious behavior and alert the relevant
departments.
Barclays: The bank uses data analytics to detect fraud, manage risk and improve customer
experience.
Procter & Gamble: The consumer goods company uses data analytics to optimize pricing, improve
supply chain and inform marketing strategies.
Sports teams: Teams in the NFL, NBA and MLB use data analytics to optimize player performance,
inform game strategy and improve fan engagement.

These examples show how businesses can use data analytics to drive efficiency, improve customer
experiences, and make informed decisions.

1.3 Use cases of Business Analytics


Netflix
Netflix uses business analytics in several ways:
Content analysis: They analyze data to determine which content to produce and license, including
genre, budget, and target audience.
Customer behavior: They track viewing habits, search and browsing behavior, and preferences to
make recommendations and personalize the user experience.
Pricing and subscription: Netflix uses analytics to determine optimal pricing and subscription
plans, monitor customer churn, and understand the impact of changes.
Marketing: They analyze the effectiveness of marketing campaigns and adjust them accordingly.
International expansion: They use data to determine which markets to expand into, what content
to offer, and how to localize the user experience.
Overall, Netflix leverages analytics to drive informed decision-making and optimize their
operations, user experience, and revenue.
Amazon
Amazon uses business analytics in several ways:
Sales and revenue: They analyze sales data to understand trends, customer behavior, and revenue
growth.
Inventory and supply chain: Amazon uses analytics to optimize inventory levels, manage the
supply chain, and ensure timely delivery of products.
Customer behavior: They track customer behavior, including browsing, search, and purchase
history, to make recommendations and personalize the user experience.
Pricing: Amazon uses data and analytics to determine optimal pricing for products and to track
competitor pricing.
Marketing: They analyze the effectiveness of marketing campaigns, advertising, and promotions to
make informed decisions about where to allocate budget.
Fraud detection: Amazon uses analytics to detect fraudulent activity and protect the security of
customer data and transactions.
Overall, Amazon leverages analytics to drive informed decision-making and optimize their
operations, customer experience, and revenue.

Walmart
Walmart uses business analytics in several ways, including:
Supply Chain Optimization: Walmart uses data analytics to optimize its supply chain and
improve the efficiency of its operations.
Customer Insights: Walmart collects and analyzes data on customer shopping habits and
preferences to inform its marketing strategies and product offerings.
Inventory Management: Walmart uses data analytics to track inventory levels and sales patterns to
ensure that the right products are in stock at the right time.
Employee Management: Walmart uses data analytics to monitor employee productivity, manage
scheduling, and reduce labor costs.
Pricing Strategies: Walmart uses data analytics to inform its pricing strategies, ensuring that it
remains competitive while maximizing profits.
Overall, Walmart leverages business analytics to gain insights and make data-driven decisions that
improve its operations and drive growth.

Uber
Uber uses business analytics in several ways:
Demand forecasting: To predict demand for rides and optimize pricing and driver incentives.
Customer segmentation: To better understand and target different customer segments.
Driver performance evaluation: To measure driver performance and identify areas for
improvement.
Route optimization: To determine the best routes for drivers and passengers, reducing travel time
and costs.
Fraud detection: To identify and prevent fraudulent activities, such as fake rides and fake drivers.
Marketing and promotions: To measure the effectiveness of marketing campaigns and
promotional offers.
Market expansion: To analyze new markets and determine the viability of expanding into new
cities and regions.
Google
Google uses business analytics in various ways:
Data-driven decision making: Google collects and analyzes massive amounts of data to inform its
decisions and strategies.
Customer behavior analysis: Google analyzes user data to understand customer behavior and
preferences, which helps with product development and marketing strategies.
Financial analysis: Google uses business analytics to track and forecast its financial performance.
Ad campaign optimization: Google uses analytics to measure the effectiveness of its advertising
campaigns and adjust them accordingly.
Market research: Google analyzes market trends and competitor activity to inform its business
strategies.
What is R: Overview, its Applications and what is R used for?

Since there are so many programming languages available today, it’s sometimes hard to decide
which one to choose. As a result, programmers often face the dilemma of too many good choices.
It’s enough to stop people in their tracks, paralyzed with indecision!

To combat this potential source of mental gridlock, we present an analysis of the R programming
language.

1.4 What Is R?
What better place to find a good definition of the language than the R Foundation’s website?
According to R-Project.org, R is “… a language and environment for statistical computing and
graphics.” It’s an open-source programming language often used as a data analysis and statistical
software tool.

R is a GNU project similar to the S language and environment, which was developed at Bell
Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be
considered a different implementation of S. There are some important differences, but much code
written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests,
time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.
The S language is often the vehicle of choice for research in statistical methodology, and R provides
an Open Source route to participation in that activity.

One of R’s strengths is the ease with which well-designed publication-quality plots can be
produced, including mathematical symbols and formulae where needed. Great care has been taken
over the defaults for the minor design choices in graphics, but the user retains full control.

R is available as Free Software under the terms of the Free Software Foundation’s GNU General
Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and
similar systems (including FreeBSD and Linux), Windows and macOS.

The R environment consists of an integrated suite of software facilities designed for data
manipulation, calculation, and graphical display. The environment features:

• A high-performance data storage and handling facility
• A suite of operators for array calculations, mainly matrices
• A vast, easily understandable, integrated assortment of intermediate tools dedicated to data
analysis
• Graphical facilities for data analysis and display, either on-screen or on hardcopy
• A well-developed, simple and effective programming language, featuring user-defined
recursive functions, loops, conditionals, and input and output facilities

The basic syntax of R includes three kinds of elements:

• Variables, which store data
• Comments, which are used to improve code readability
• Keywords, reserved words that have a special meaning for the R interpreter
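The three syntax elements listed above can be seen together in a short sketch. The variable names and values are made up for illustration.

```r
# This line is a comment: R ignores everything after the # symbol.

# Variables store data; <- is the usual assignment operator.
unit_price  <- 250
quantity    <- 4
order_total <- unit_price * quantity

# Keywords such as if, else, for, function, TRUE and FALSE are reserved
# and cannot be used as variable names.
if (order_total > 500) {
  order_size <- "large"
} else {
  order_size <- "small"
}

order_size
```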

R was developed in 1993 by Ross Ihaka and Robert Gentleman and includes linear regression,
machine learning algorithms, statistical inference, time series, and more.

R is a cross-platform programming language that runs on Windows, macOS, UNIX, and Linux. It is
often described as a different implementation of the S language and environment and is considered
highly extensible.

1.5 The R Environment


R is an integrated suite of software facilities for data manipulation, calculation and graphical
display. It includes

• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data analysis,
• graphical facilities for data analysis and display either on-screen or on hardcopy, and
• a well-developed, simple and effective programming language which includes conditionals,
loops, user-defined recursive functions and input and output facilities.

The term “environment” is intended to characterize it as a fully planned and coherent system,
rather than an incremental accretion of very specific and inflexible tools, as is frequently the case
with other data analysis software.

R, like S, is designed around a true computer language, and it allows users to add additional
functionality by defining new functions. Much of the system is itself written in the R dialect of S,
which makes it easy for users to follow the algorithmic choices made. For computationally-
intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can
write C code to manipulate R objects directly.

Many users think of R as a statistics system. We prefer to think of it as an environment within
which statistical techniques are implemented. R can be extended (easily) via packages. There are
about eight packages supplied with the R distribution and many more are available through the
CRAN family of Internet sites covering a very wide range of modern statistics.

R has its own LaTeX-like documentation format, which is used to supply comprehensive
documentation, both on-line in a number of formats and in hardcopy.

Does R Have Any Drawbacks?


What language doesn’t? When answering the question “What is R?” we should also look at some of
R’s not so great aspects:

It’s a complicated language. R has a steep learning curve. It’s a language best suited for people who
have previous programming experience.

It’s not as secure. R doesn’t have basic security measures. Consequently, it’s not a good choice for
making web-safe applications. Also, R can’t be embedded in web browsers.

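It can be slow. R code, particularly code written with explicit loops rather than vectorized
operations, often runs slower than equivalent code in languages such as Python or MATLAB.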

It takes up a lot of memory. Memory management isn’t one of R’s strong points. R’s data must be
stored in physical memory. However, the increasing use of cloud-based memory may eventually
make this drawback moot.

It doesn’t have consistent documentation/package quality. Docs and packages can be patchy and
inconsistent, or incomplete. That’s the price you pay for a language that doesn’t have official,
dedicated support and instead is maintained and added to by the community.

Why use R
R is a state-of-the-art programming language for statistical computing, data analysis, and machine
learning. It has been around for almost three decades, with over 12,000 packages available for
download on CRAN. This means that there is likely to be an R package for most types of analysis
you want to perform. Here are a few reasons why you should learn and use R:

Free and open-source: The R programming language is open-source and is released under the
GNU General Public License (GPL). This means that you can use all the functionalities of R for free,
without licensing fees. Since R is open-source, everyone is welcome to contribute to the project,
and since it’s freely available, bugs are readily detected and fixed by the open-source community.

Popularity: The R programming language was ranked 7th in the 2021 IEEE Spectrum ranking of
top programming languages and 12th in the TIOBE Index ranking of January 2022. It’s the second
most popular programming language for data science, just behind Python, according to edX, and it
is the most popular programming language for statistical analysis. R’s popularity also means that
there is extensive community support on platforms like Stack Overflow. R also has detailed online
documentation that R users can consult for help.

High-quality visualization: The R programming language is famous for high-quality
visualizations. R’s ggplot2 is a detailed implementation of the grammar of graphics, a system to
concisely describe the components of a graph. With R’s high-quality graphics, you can easily
implement intuitive and interactive graphs.

A language for data analytics and data science: The R programming language isn’t a general-
purpose programming language. It’s a specialized programming language for statistical
computing. Therefore, most of R’s functions carry out vectorized operations, meaning you don’t
need to loop through each element. Vectorized code is more concise and usually runs much faster
than an equivalent explicit loop. Distributed computing can be executed in R, whereby tasks are
split among multiple processing computers to reduce execution time. R can be integrated with
Hadoop and Apache Spark to process large amounts of data. R can connect to many kinds of
databases, and it has packages to carry out machine learning and deep learning operations.
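To make the idea of vectorized operations concrete, the sketch below computes the same result twice: once with a single vectorized expression and once with an explicit loop. The price and quantity values are made up for illustration.

```r
prices     <- c(10, 20, 30)
quantities <- c(3, 1, 2)

# Vectorized: one expression operates on every element at once.
revenue_vec <- prices * quantities

# Equivalent explicit loop, shown only for comparison.
revenue_loop <- numeric(length(prices))
for (i in seq_along(prices)) {
  revenue_loop[i] <- prices[i] * quantities[i]
}

identical(revenue_vec, revenue_loop)  # both approaches give the same values
sum(revenue_vec)                      # total revenue across all items
```

On three elements the difference in speed is invisible; it only becomes noticeable on large vectors.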

Opportunity to pursue an exciting career in academia and industry: The R programming language
is trusted and extensively used in the academic community for research. R is increasingly being
used by government agencies and by social media, telecommunications, financial, e-commerce,
manufacturing, and pharmaceutical companies. Top companies that use R include Amazon,
Google, ANZ Bank, Twitter, LinkedIn, Thomas Cook, Facebook, Accenture, Wipro, the New York
Times, and many more. A good mastery of the R programming language opens all kinds of
opportunities in academia and industry.

1.6 What is R Used For?


R is a programming language and software environment for statistical computing, data analysis,
and graphics. It is widely used by statisticians, data scientists, and researchers in academia,
government, and industry for tasks such as statistical modeling, developing statistical software,
machine learning, data visualization, data manipulation, and data mining. With its vast libraries
and packages, R is popular in industries such as finance, healthcare, and e-commerce, as well as
academia and research institutions.

Although R is a popular language used by many programmers, it is especially effective when used
for

• Data analysis
• Statistical inference
• Machine learning algorithms

R offers a wide variety of statistics-related libraries and provides a favorable environment for
statistical computing and design. In addition, many quantitative analysts use R as a programming
tool since it's useful for data importing and cleaning.

As of August 2021, R was one of the top five programming languages of the year, so it’s a favorite
among data analysts and research programmers. It’s also used as a fundamental tool for finance,
which relies heavily on statistical data.

1.7 The Popularity of R by Industry


Thanks to its versatility, many different industries use the R programming language. Here is a list
of industries/disciplines that use the R programming language:

• Fintech Companies (financial services)
• Academic Research
• Government (FDA, National Weather Service)
• Retail
• Social Media
• Data Journalism
• Manufacturing
• Healthcare

This graph, provided by Stack Overflow, gives you a better idea of R programming language usage
in recent history. Given its strength in statistics, it's hardly surprising that R enjoys heavy use in the
world of academia, as illustrated on the chart.

If you’re looking for specifics, here are ten significant companies or organizations that use R,
presented in no particular order.

• Airbnb
• Microsoft
• Uber
• Facebook
• Ford
• Google
• Twitter
• IBM
• American Express
• HP

1.8 How to Install R


To install R, go to https://cloud.r-project.org/ and download the latest version of R for Windows,
Mac or Linux.

When you have downloaded and installed R, you can run R on your computer.

The screenshot below shows how it may look when you run R on a Windows PC:

Installing R on Windows OS

To install R on Windows OS:

Go to the CRAN website. (https://cran.r-project.org/)

Click on "Download R for Windows".

Click on "install R for the first time" link to download the R executable (.exe) file.

Run the R executable file to start installation, and allow the app to make changes to your device.

Select the installation language.

Follow the installation instructions.

Click on "Finish" to exit the installation setup.

R has now been successfully installed on your Windows OS. Open the R GUI to start writing R
code.
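Once the R console is open, a few commands are enough to confirm that the installation works. This is only a suggested first check; the variable name `check` is arbitrary.

```r
# Show which version of R is running.
R.version.string

# Simple arithmetic confirms the interpreter evaluates expressions.
check <- 2 + 3
check

# getwd() reports the folder R reads from and writes to by default.
getwd()
```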

Additional R interfaces

Other than the R GUI, the other ways to interface with R include RStudio Integrated Development
Environment (RStudio IDE) and Jupyter Notebook. To run R on RStudio, you first need to install R
on your computer, while to run R on Jupyter Notebook, you need to install an R kernel. RStudio
and Jupyter Notebook provide an interactive and friendly graphical interface to R that greatly
improves users’ experience.

Installing RStudio Desktop

To install RStudio Desktop on your computer, do the following:

Go to the RStudio website. (https://posit.co/download/rstudio-desktop/)

Click on "DOWNLOAD" in the top-right corner.

Click on "DOWNLOAD" under the "RStudio Open Source License".

Download RStudio Desktop recommended for your computer.

Run the RStudio executable file (.exe) for Windows OS or the Apple Disk Image file (.dmg) for
macOS.

Follow the installation instructions to complete RStudio Desktop installation.

RStudio is now successfully installed on your computer. The RStudio Desktop IDE interface is
shown in the figure below:

1.9 R packages
R packages are collections of functions, data, and compiled code that can be used to extend the
capabilities of R. There are thousands of R packages available, covering a wide range of topics,
including statistics, machine learning, data visualization, and more. Installing and using R
packages is an essential part of working with R, and many packages are designed to be easy to
install and use, with clear documentation and examples.
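The usual workflow is to install a package once and then load it in every session that uses it. The sketch below shows the standard functions for this; the install and load lines are commented out so that running the sketch does not download anything, and dplyr is used only as an example package name.

```r
# Install a package once from CRAN (requires an internet connection).
# install.packages("dplyr")

# Load an installed package in each new session before using its functions.
# library(dplyr)

# requireNamespace() reports whether a package is available without attaching it.
# "stats" ships with R, so this check should succeed on any installation.
has_stats <- requireNamespace("stats", quietly = TRUE)
has_stats
```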

Tidyverse: The tidyverse is a collection of R packages designed for data science. It includes
packages for data manipulation (dplyr), data visualization (ggplot2), data import (readr), and
data tidying (tidyr), among others. The packages in the tidyverse are designed to work together
seamlessly, and they share a common design philosophy, which emphasizes simplicity,
consistency, and understanding. The tidyverse is particularly popular among R users due to its ease
of use, intuitive syntax, and wide range of capabilities, making it a great choice for data analysis
tasks of all types and complexity levels.

ggplot2: ggplot2 is a data visualization package for the R programming language. It provides a
high-level interface for creating statistical graphics. ggplot2 uses a grammar of graphics to build
complex plots from basic components, allowing users to quickly create sophisticated visualizations
of their data. The library is highly customizable and flexible, allowing users to specify a wide range
of visual elements such as colors, shapes, sizes, and labels.
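As a rough sketch of how a plot is assembled from basic components, the example below adds one layer at a time to a ggplot object. The revenue data is invented for illustration, and the plot is only built if ggplot2 happens to be installed.

```r
# Hypothetical monthly revenue data, invented for illustration.
monthly <- data.frame(
  month   = 1:6,
  revenue = c(110, 125, 118, 140, 152, 160)
)

# Only build the plot when ggplot2 is installed; the sketch skips it otherwise.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  revenue_plot <- ggplot2::ggplot(monthly, ggplot2::aes(x = month, y = revenue)) +
    ggplot2::geom_line() +    # layer 1: a line joining the observations
    ggplot2::geom_point() +   # layer 2: a point for each observation
    ggplot2::labs(title = "Monthly revenue", x = "Month", y = "Revenue")
  print(revenue_plot)
}
```

Each `+` adds one component (data and aesthetics, geometric layers, labels), which is the sense in which ggplot2 follows a grammar of graphics.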

dplyr: dplyr is a data manipulation package for R. It provides a set of functions that allow users to
perform common data manipulation tasks such as filtering, summarizing, transforming, and
aggregating data. dplyr is designed to be fast, efficient, and easy to use, and it operates on data
frames and tibbles, making it a popular choice for data wrangling and exploration. The library is
particularly well-suited for working with large datasets, as it provides optimized implementations
for many common data manipulation operations. The syntax of dplyr functions is highly readable
and intuitive, and the library is widely used by data scientists and analysts for data preparation and
exploration.
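Filtering and summarizing can be illustrated on a small data frame. The sales figures are made up; the base R version runs on any installation, and the dplyr version runs only if the package is installed.

```r
# Hypothetical sales records, invented for illustration.
sales <- data.frame(
  region  = c("North", "South", "North", "East"),
  revenue = c(120, 80, 200, 50),
  stringsAsFactors = FALSE
)

# Base R: keep rows with revenue above 60, then total revenue per region.
base_summary <- aggregate(revenue ~ region,
                          data = sales[sales$revenue > 60, ],
                          FUN = sum)
base_summary

# dplyr: the same steps expressed with filter(), group_by() and summarise().
if (requireNamespace("dplyr", quietly = TRUE)) {
  kept          <- dplyr::filter(sales, revenue > 60)
  grouped       <- dplyr::group_by(kept, region)
  dplyr_summary <- dplyr::summarise(grouped, revenue = sum(revenue))
  print(dplyr_summary)
}
```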

tidyr: tidyr is a package for the R programming language that provides tools for "tidying" data. In
the context of data science and analysis, tidying data means restructuring it into a format that is
more suitable for analysis, visualization, and modeling. tidyr provides a suite of functions for
transforming data from a wide variety of formats into a more structured, "tidy" format. This makes
it easier to work with the data and perform common data manipulation tasks such as aggregating,
filtering, and summarizing. The library is designed to work seamlessly with other R libraries such
as dplyr, making it a popular choice for data preparation and wrangling tasks.

Shiny: Shiny is a web application framework for R. It allows R developers to create interactive,
web-based data applications without needing to learn HTML, CSS, or JavaScript. Shiny provides a
simple, high-level syntax for building user interfaces and tying them to R code for data analysis,
visualization, and modeling. Applications built with Shiny can be run locally or hosted on a web
server, making it easy to share results with collaborators and stakeholders. The framework is highly
customizable and can be extended using HTML, CSS, and JavaScript, allowing developers to create
complex, interactive applications with rich user interfaces. Shiny is widely used in data science and
analytics for creating dashboards, data visualization tools, and other interactive applications.
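
The structure of a Shiny application can be sketched as follows: a user interface object, a server function, and a call that combines them. This is only an illustrative sketch; the input name "n", the output name "hist", and the titles are assumptions chosen for the example.

```r
library(shiny)

# User interface: a slider for the sample size and a placeholder for the plot
ui <- fluidPage(
  titlePanel("Histogram of random values"),
  sliderInput("n", "Number of observations:", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# Server logic: redraw the histogram whenever the slider value changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Random normal values", xlab = "Value")
  })
}

# Build the app object; in an interactive session, runApp(app) starts it
app <- shinyApp(ui = ui, server = server)
```

Keeping the interface and the server logic in separate objects is the usual design: the interface declares what the user sees, and the server function recomputes outputs when inputs change.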

12 LOVELY PROFESSIONAL UNIVERSITY


Notes

Unit 01: Business Analytics and Summarizing Business Data

1.10 Vector in R
In R, a vector is a basic data structure that represents an ordered collection of values of the same
type (numeric, character, logical, etc.). Vectors are the simplest type of data structure in R and are
used as the building blocks for more complex data structures such as arrays, data frames, and lists.
A vector can be created using the c() function and can be indexed, sliced, and manipulated using
various R functions and operators. In R, vectors are used for representing variables, input data, and
intermediate results of computations. They play a crucial role in many data analysis and modeling
tasks and are an essential part of the R programming language.
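
The operations described above can be sketched as follows. The vector names and values are made up for illustration.

```r
# Create vectors with c(); every element of a vector has the same type
sales   <- c(120, 95, 143, 110)                  # numeric vector
regions <- c("North", "South", "East", "West")   # character vector

# Index by position (R indexing starts at 1)
first_sale <- sales[1]

# Slice a range of positions
middle_sales <- sales[2:3]

# Logical indexing keeps the elements where the condition is TRUE
high_sales <- sales[sales > 100]

# Arithmetic is vectorized: it is applied to every element
sales_with_tax <- sales * 1.1

# Mixing types coerces all elements to a common type (here, character)
mixed <- c(1, "a", TRUE)

print(first_sale)
print(middle_sales)
print(high_sales)
print(class(mixed))
```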

1.11 Data types in R


In R, the following data types are commonly used:

Numeric: represents numbers and is used for mathematical calculations.

Integer: a whole number, without a fractional part.

Complex: represents complex numbers.

Character: used to represent text.

Logical: used to represent TRUE/FALSE values.

Factor: used to represent categorical variables.

Date: used to represent dates.

Raw: used to represent raw binary data.

R also has several data structures that are built on these types, such as list, matrix, and data frame.
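
A short sketch of these data types is given below, with one example value per type. The variable names are illustrative only; class() is used to report the type of each object.

```r
# One example value for each commonly used data type
x_numeric   <- 3.14                             # numeric (double precision)
x_integer   <- 5L                               # integer (the L suffix marks an integer)
x_complex   <- 2 + 3i                           # complex
x_character <- "analytics"                      # character
x_logical   <- TRUE                             # logical
x_factor    <- factor(c("low", "high", "low"))  # factor
x_date      <- as.Date("2023-01-15")            # Date
x_raw       <- charToRaw("A")                   # raw

# class() reports the type of each object
types <- c(
  numeric   = class(x_numeric),
  integer   = class(x_integer),
  complex   = class(x_complex),
  character = class(x_character),
  logical   = class(x_logical),
  factor    = class(x_factor),
  Date      = class(x_date),
  raw       = class(x_raw)
)
print(types)
```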


1.12 Data Structures in R


In R, data structures include:

Vectors: One-dimensional arrays of homogeneous data (e.g., numbers or characters)

Matrices: Two-dimensional arrays of homogeneous data

Arrays: Multi-dimensional arrays of homogeneous data

Data frames: Two-dimensional arrays of heterogeneous data, with rows and columns labeled

Lists: Heterogeneous collections of objects

Factors: Categorical variables with a limited number of levels

Tables: Tabular data structure for summarizing categorical data.

Each of these data structures can be created and manipulated in various ways in R, and many
functions are available for operating on them.
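
As a brief sketch, each of the structures listed above can be created with a base R function. All object names and values are made up for illustration.

```r
v   <- c(10, 20, 30)                          # vector
m   <- matrix(1:6, nrow = 2, ncol = 3)        # matrix with 2 rows and 3 columns
a   <- array(1:24, dim = c(2, 3, 4))          # three-dimensional array
df  <- data.frame(product = c("A", "B", "C"), # data frame: columns may differ in type
                  units = c(5, 8, 2))
l   <- list(name = "Store 1", sales = v, open = TRUE)  # list of mixed objects
f   <- factor(c("east", "west", "east"))      # factor with two levels
tab <- table(f)                               # table of counts per level

print(dim(m))
print(dim(a))
str(df)
print(levels(f))
print(tab)
```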

Summary
Business analytics is the practice of examining data and using statistical analysis and other methods
to gain insights into the performance and efficiency of a business. It involves the use of data,
statistical algorithms, and technology to uncover hidden patterns and knowledge from large data
sets, and is used to inform decision making and guide the development of strategies and plans.
The goal of business analytics is to improve decision-making, streamline processes, and gain a
competitive advantage through the use of data and predictive modeling. It can be applied in
various areas of a business, such as sales and marketing, supply chain management, finance, and
operations.
Business analytics typically involves several key steps: data collection, data cleaning and
preparation, data analysis, and communication of results. Data scientists and other professionals
use statistical and mathematical methods, such as regression analysis and predictive modeling, to
analyze the data and extract insights. The results of these analyses are then used to inform
decisions, support business strategy development, and identify opportunities for improvement.
In recent years, the rapid growth of digital data and advancements in technology have made it
easier for organizations to collect and analyze large amounts of data, leading to the widespread
adoption of business analytics across a wide range of industries.

Keywords
Business analytics, Descriptive analytics, Predictive analytics, Prescriptive analytics, R
Programming

Self Assessment
1. Which of the following fields typically makes use of data mining techniques?
A. Advertising
B. Government Intelligence
C. Airline Industry
D. All of the above


2. The R language is a dialect of which of the following programming languages?


A. SAS
B. MATLAB
C. C
D. S

3. Which R command generates 1000 random numbers from a normal distribution
with mean 0 and variance 1?
A. norm(1000, 0, 1)
B. rnorm(0, 1, 1000)
C. rnorm(1000, 0, 1)
D. qnorm(0, 1, 1000)

4. For the population y<-c(1,2,3,4,5), write the R command to find the mean?
A. mean{y}
B. means(y)
C. mean(y)
D. mean[y]

5. It is an encompassing and multidimensional field that uses mathematics, statistics,
predictive modeling and machine learning techniques to find meaningful patterns and
knowledge in recorded data.
A. Big Data
B. Analytics
C. Normal Data
D. Analytics Process

6. It is a term applied to a dataset that exceeds the processing capacity of conventional
database systems, or it doesn’t fit the structural requirements of traditional database
architecture.
A. Big Data
B. Business Analytics
C. Analytics
D. Normal Data

7. The first step in the process is _____________. Data relevant to the applicant is collected. The
quality, quantity, validity, and nature of data directly impact the analytical outcome. A
thorough understanding of the data on hand is extremely critical.
A. Results
B. Put Into Use
C. Data Collection


D. Model Building

8. Usually raw data is not in a format that can be directly used to perform data analysis. In
very simple terms, most platforms require data to be in a matrix form with the variables
being in different columns and rows representing various observations. Data may be
available in structured, semi-structured, and unstructured form.
A. Data Collection
B. Data Preparation
C. Data Analysis
D. Model Building

9. Once data is converted into a structured format, the next stage is to perform ___________. At
this stage underlying trends in the data are identified. This step can include fitting a linear
or nonlinear regression model, performing principal component analysis or cluster analysis,
identifying if data is normally distributed or not.
A. Data Collection
B. Data Preparation
C. Data Analysis
D. Model Building

10. In this step, we analyze the data we collected, using a data model to assess and query the
data collected in the process.
A. Data
B. Analyze
C. Generate Reports
D. Smarter Decisions

11. Consists of acquiring the data, implementing advanced data processes, distributing the data
effectively and managing oversight data.
A. Artificial Intelligence
B. Growing Importance of The CDO & CAO
C. Data Discovery
D. Data Quality Management (DQM)

12. __________ is the science aiming to make machines execute what is usually done by complex
human intelligence.
A. Data Discovery
B. Artificial Intelligence
C. Collaborative Business Intelligence
D. Consumer Experience


13. Predictive analytics is widely used by both conventional retail stores as well as e-commerce
firms for analyzing their historical data and building models for customer engagement,
supply chain optimization, price optimization, and space optimization and assortment
planning.
A. Retail Industry
B. Telecom Industry
C. Health Industry
D. Finance Industry

14. Quantitative data refers to:


A. numerical data that could usefully be quantified to help you answer your research
question(s) and to meet your objectives.
B. graphs and tables.
C. any data you present in your report.
D. statistical analysis.

15. Qualitative analysis software cannot:


A. make report writing easier.
B. find concealed data.
C. be done without training.
D. re-analyse data easily.

Answers for Self Assessment


1. D 2. D 3. C 4. C 5. B

6. A 7. C 8. B 9. C 10. B

11. D 12. B 13. A 14. A 15. D

Review Questions
1. What is business analytics and how does it differ from traditional business intelligence?
2. What are the key steps involved in the business analytics process?
3. How can data visualization be used to support business decision-making?
4. What is data mining and how is it used in business analytics?
5. What is predictive analytics and how does it differ from descriptive analytics?
6. What are some common techniques used in predictive modeling, such as regression
analysis, decision trees, and neural networks?
7. How can business analytics be used to support customer relationship management
(CRM)?
8. What are some common applications of business analytics in areas such as supply chain
management, marketing, and finance?


9. What is big data and how does it impact business analytics?


10. What role does machine learning play in business analytics and what are some common
algorithms used in this area?

Further Readings
https://business.wfu.edu/masters-in-business-analytics/articles/what-is-analytics/
Business Analytics, 2ed: The Science of Data-Driven Decision Making by U. Dinesh Kumar,

Dr. Mohd Imran Khan, Lovely Professional University Unit 02: Summarizing Business Data

Unit 02: Summarizing Business Data


CONTENTS
Objectives
Introduction
2.1 Functions in R Programming
2.2 One Variable and Two Variables Statistics
2.3 Basic Functions in R
2.4 User-defined Functions in R Programming Language
2.5 Single Input Single Output
2.6 Multiple Input Multiple Output
2.7 Inline Functions in R Programming Language
2.8 Functions to Summarize Variables- Select, Filter, Mutate & Arrange
2.9 Summarize function in R
2.10 Group by function in R
2.11 Concept of Pipes Operator in R
Summary
Keywords
Self Assessment
Answers for Self Assessment
Review Questions
Further reading

Objectives
• discuss one-variable and two-variable statistics,
• give an overview of functions to summarize variables,
• implement the select, filter, and mutate functions on variables,
• use the arrange, summarize, and group_by functions,
• demonstrate the concept of the pipe operator.

Introduction
R is a programming language and software environment for statistical computing and graphics. It
was developed in 1993 by Ross Ihaka and Robert Gentleman at the University of Auckland, New
Zealand. R provides a wide range of statistical and graphical techniques and is highly extensible,
allowing users to write their own functions and packages.
One of the main strengths of R is its ability to handle and visualize complex data. It has a large and
active community of developers, who have contributed over 15,000 packages to the Comprehensive
R Archive Network (CRAN). These packages cover a wide range of topics, including machine
learning, time series analysis, Bayesian statistics, social network analysis, and many others.
In addition to its statistical and graphical capabilities, R also provides a flexible and interactive
programming environment. R code can be run from the command line, from scripts, or from within


a graphical user interface (GUI) such as RStudio. R supports various data structures such as vectors,
matrices, data frames, and lists, and it has a rich set of functions for data manipulation and
transformation.
R is widely used in academia, industry, and government for data analysis, statistical modelling, and
data visualization. It is also a popular choice for reproducible research, as the code and data used in
an analysis can be easily shared and documented.
In summary, R is a powerful and versatile language for data analysis and statistical computing,
with a large community of users and developers and a wide range of tools and techniques.

2.1 Functions in R Programming


Functions are useful when you want to perform a certain task multiple times. A function accepts
input arguments and produces output by executing the valid R commands inside it. In R, the
function name and the name of the file in which the function is defined need not be the same, and
a single R file can contain one or more function definitions.
In R programming, functions are blocks of code that perform specific tasks and return a value.
Functions encapsulate reusable code, making programs easier to write and maintain. R has
many built-in functions and also allows you to create your own custom functions. To define a
function in R, you use the function keyword, followed by the function's arguments in parentheses
and the code to be executed in curly braces {}. The return value can be specified explicitly with
return(); if return() is omitted, the function returns the value of the last evaluated expression.
To call a function, type its name followed by the arguments in parentheses ().
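
The definition and call syntax can be sketched as follows. The function name addTax and the default rate are hypothetical and are used only to illustrate arguments, default values, and return().

```r
# Define a function with two arguments; the second has a default value
addTax <- function(price, rate = 0.18) {
  total <- price * (1 + rate)
  return(total)
}

# Call the function with the default rate
default_total <- addTax(100)

# Call the function with a named argument that overrides the default
reduced_total <- addTax(100, rate = 0.05)

print(default_total)
print(reduced_total)
```

Giving an argument a default value lets callers omit it in the common case while still allowing it to be overridden by name.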

Types of function in R Language


Built-in functions: functions such as sqrt(), mean(), and max() are provided by R and can be called
directly in a program.
User-defined functions: R also allows us to write our own functions.
Examples of built-in functions in R
R has a large number of built-in functions, covering a wide range of tasks, including:

• Mathematics: sqrt, abs, cos, sin, log, exp, etc.
• Data manipulation: head, tail, sort, unique, cbind, rbind, etc.
• Data analysis: mean, median, summary, t.test, cor, lm, etc.
• Plotting: plot, hist, boxplot, barplot, pie, etc.
• String manipulation: toupper, tolower, substr, gsub, paste, etc.
• File Input/Output: read.csv, write.csv, read.table, write.table, etc.


Use cases of basic built-in functions of R programming

Functions to do Descriptive Analytics in R programming


Descriptive statistics in R programming involves summarizing and analyzing a dataset to gain a
better understanding of its properties and patterns. This can be done through various measures
such as central tendency (mean, median, mode), dispersion (standard deviation, variance, range),
and distribution (histograms, boxplots, density plots).
Here are some of the commonly used functions in R for descriptive statistics:
mean(): calculates the mean of a numeric vector.
median(): calculates the median of a numeric vector.
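mode(): returns the storage mode (type) of an object, such as "numeric"; base R has no built-in function for the statistical mode, which can be derived from table().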
sd(): calculates the standard deviation of a numeric vector.


var(): calculates the variance of a numeric vector.


range(): returns the minimum and maximum of a numeric vector (diff(range(x)) gives the difference between them).
quantile(): calculates specified quantiles of a numeric vector.
hist(): creates a histogram of a numeric vector.
boxplot(): creates a boxplot of a numeric vector.
density(): computes a kernel density estimate of a numeric vector, which plot() can draw as a density plot.
table(): calculates the frequency distribution of a categorical variable.
min(): calculates the minimum value of a numeric vector.
max(): calculates the maximum value of a numeric vector.
sum(): calculates the sum of a numeric vector.
prod(): calculates the product of a numeric vector.
cumsum(): calculates the cumulative sum of a numeric vector.
cumprod(): calculates the cumulative product of a numeric vector.
cor(): calculates the correlation between two numeric vectors.
cov(): calculates the covariance between two numeric vectors.
apply(): applies a function to each row or column of a matrix or data frame.
These are just some of the functions available in R for descriptive statistics. By using these
functions, you can obtain a better understanding of your dataset and draw meaningful conclusions
from your data.
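
Two of these functions are easy to misread, so a short sketch may help. In base R, mode() reports how an object is stored rather than its most frequent value, and range() returns the minimum and maximum rather than their difference. The helper statMode below is a hypothetical function written for this example; it is not part of base R.

```r
x <- c(2, 4, 4, 5, 7, 4, 9)

# Hypothetical helper: the most frequent value of a numeric vector
statMode <- function(v) {
  counts <- table(v)                            # frequency of each distinct value
  as.numeric(names(counts)[which.max(counts)])  # value with the highest count
}

most_frequent <- statMode(x)   # statistical mode
storage_type  <- mode(x)       # storage mode of the object
min_max       <- range(x)      # minimum and maximum
spread        <- diff(range(x))  # difference between maximum and minimum

print(most_frequent)
print(storage_type)
print(min_max)
print(spread)
```

If several values are tied for the highest count, this helper returns only the first of them.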

2.2 One Variable and Two Variables Statistics


Upcoming section shows examples of R functions for one variable and two variable statistics:
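
As an illustrative sketch, the built-in mtcars data set can be used to compute both kinds of statistics. One-variable statistics describe a single column, while two-variable statistics describe the relationship between two columns. The choice of columns here is an assumption made for the example.

```r
data(mtcars)

# One-variable statistics
mpg_summary <- summary(mtcars$mpg)   # minimum, quartiles, mean, maximum
mpg_sd      <- sd(mtcars$mpg)        # standard deviation
cyl_counts  <- table(mtcars$cyl)     # frequency of each cylinder count

# Two-variable statistics
mpg_wt_cor   <- cor(mtcars$mpg, mtcars$wt)           # correlation of mpg and weight
mpg_wt_cov   <- cov(mtcars$mpg, mtcars$wt)           # covariance of mpg and weight
cyl_gear_tab <- table(mtcars$cyl, mtcars$gear)       # cross-tabulation of two categories
mpg_by_cyl   <- tapply(mtcars$mpg, mtcars$cyl, mean) # mean mpg within each cylinder group

print(mpg_summary)
print(cyl_counts)
print(mpg_wt_cor)
print(cyl_gear_tab)
print(mpg_by_cyl)
```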


2.3 Basic Functions in R


# Find sum of numbers 4 to 6.
print(sum(4:6))
# Find max of numbers 4 to 6.


print(max(4:6))
# Find min of numbers 4 to 6.
print(min(4:6))
#Calculate the square root of a number
sqrt(16)
#Calculate the natural logarithm of a number
log(10)
#Calculate the exponential function
exp(2)
#Calculate the sine of an angle (in radians)
sin(pi/4)
#Calculate the sum of two numbers
x <- 2
y <- 3
x+y
#Calculate the difference of two numbers
x-y
#Calculate the product of two numbers
x*y
#Calculate the quotient of two numbers
x/y
#Calculate the power of a number
x^y
#Calculate the cosine of an angle (in radians)
cos(pi/3)
#Calculate the tangent of an angle (in radians)
tan(pi/4)
#Calculate the inverse sine of a value
asin(1)
#Calculate the inverse cosine of a value
acos(0.5)
#Calculate the inverse tangent of a value
atan(1)
#Calculate the mean and standard deviation of a vector x:
x <- c(1, 2, 3, 4, 5)
mean(x)
sd(x)
#Calculate the median and quartiles of a vector x:
x <- c(1, 2, 3, 4, 5)
median(x)


quantile(x, c(0.25, 0.75))


#Calculate the minimum and maximum values of a vector x:
x <- c(1, 2, 3, 4, 5)
min(x)
max(x)
#Calculate the sum and product of a vector x:
x <- c(1, 2, 3, 4, 5)
sum(x)
prod(x)
#Calculate the cumulative sum and cumulative product of a vector x:
x <- c(1, 2, 3, 4, 5)
cumsum(x)
cumprod(x)
#Calculate the correlation between two vectors x and y:
x <- rnorm(100)
y <- rnorm(100)
cor(x, y)
#Calculate the covariance between two vectors x and y:
x <- rnorm(100)
y <- rnorm(100)
cov(x, y)
#Calculate the mean and standard deviation of multiple columns of a data frame:
df <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
apply(df, 2, mean)
apply(df, 2, sd)

2.4 User-defined Functions in R Programming Language


R provides built-in functions like print(), cat(), etc. but we can also create our own functions. These
functions are called user-defined functions.


2.5 Single Input Single Output


The following example creates a function that takes a single input and returns a single output. It
calculates the area of a circle from its radius: the function is named “areaOfCircle”, and the
argument passed to it is the “radius” of the circle.


2.6 Multiple Input Multiple Output


Now create a function in R that takes multiple inputs and returns multiple outputs using a list.
A function in R can take multiple input objects but returns only one object as output. This is not a
limitation, however, because you can collect all the outputs you want in a list, return the list, and
then access its elements to get the individual results.
Consider a function “Rectangle” that takes the “length” and “width” of a rectangle and returns its
area and perimeter. Since R can return only one object, the function creates a list that contains
“Area” and “Perimeter” and returns that list.
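
How the elements of a returned list are accessed can be sketched as follows. The function describeSales and its element names are hypothetical and serve only to show the difference between the $, [[ ]], and [ ] operators.

```r
# Hypothetical function that returns two results packed into one list
describeSales <- function(values) {
  list(Mean = mean(values), Total = sum(values))
}

res <- describeSales(c(10, 20, 30))

mean_value  <- res$Mean         # $ extracts the element itself
total_value <- res[["Total"]]   # [[ ]] also extracts the element itself
total_list  <- res["Total"]     # [ ] returns a list containing that element

print(mean_value)
print(total_value)
print(class(total_list))
```

Use $ or [[ ]] when the value is needed for further calculation, since single brackets keep the result wrapped in a list.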

2.7 Inline Functions in R Programming Language


Sometimes creating an R script file, loading it, and executing it is a lot of work when you just want
to create a very small function. In this situation, you can define an inline function: use the function
keyword with the argument x, followed directly by the expression of the function on the same
line.
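
Inline functions are often passed directly to another function without being given a name. The sketch below shows both a named one-line function and an unnamed one used with sapply(); the names twice and squares are chosen only for illustration.

```r
# A named inline function: the body is a single expression, so no braces are needed
twice <- function(x) 2 * x

# An unnamed (anonymous) inline function applied to each element of 1:5
squares <- sapply(1:5, function(x) x^2)

print(twice(21))
print(squares)
```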


# A simple R function to check whether x is even or odd
evenOdd = function(x){
  if(x %% 2 == 0)
    return("even")
  else
    return("odd")
}
print(evenOdd(4))
print(evenOdd(3))
# A simple R function to calculate area of a circle
areaOfCircle = function(radius){
  area = pi*radius^2
  return(area)
}
print(areaOfCircle(2))
# A simple R function to calculate area and perimeter of a rectangle
Rectangle = function(length, width){
  area = length * width
  perimeter = 2 * (length + width)
  # create an object called result which is
  # a list of area and perimeter
  result = list("Area" = area, "Perimeter" = perimeter)
  return(result)
}
resultList = Rectangle(2, 3)


print(resultList["Area"])
print(resultList["Perimeter"])
# A simple R program to demonstrate the inline function
f = function(x) x^2*4+x/3
print(f(4))
print(f(-2))
print(f(0))

2.8 Functions to Summarize Variables- Select, Filter, Mutate & Arrange
What is the select() function in R?
The select() function is used to pick specific variables or features of a data frame or tibble. It selects
columns by name or with helper functions such as contains(), matches(), starts_with(),
ends_with(), and so on.
Syntax
select(.data, ...)
Example
library(dplyr)
iris <- as_tibble(iris) # so it prints a little nicer
select(iris, starts_with("Petal"))
select(iris, ends_with("Width"))
# Move Species variable to the front
select(iris, Species, everything())
df <- as.data.frame(matrix(runif(100), nrow = 10))
# as_tibble() replaces the deprecated tbl_df()
df <- as_tibble(df[c(3, 4, 7, 1, 9, 8, 5, 2, 6, 10)])
select(df, V4:V6)
select(df, num_range("V", 4:6))
# Drop variables with -
select(iris, -starts_with("Petal"))
# Select a single column or a range of adjacent columns by name
select(mtcars, cyl)
select(mtcars, mpg:disp)
What is the filter() function in R?
The filter() function is used to produce a subset of the data frame, retaining all rows that satisfy the
specified conditions. The filter() function from the dplyr package can be applied to both
grouped and ungrouped data. The expressions can include comparison operators (==, >, >=), logical
operators (&, |, !, xor()), helper functions (between(), near()), and NA checks such as is.na() against
the column values. filter() returns a new data frame, so the result has to be assigned to a variable
to be kept.
Example : R program to filter rows using filter() function
library(dplyr)
# sample data
df=data.frame(x=c(12,31,4,66,78),
y=c(22.1,44.5,6.1,43.1,99),
z=c(TRUE,TRUE,FALSE,TRUE,TRUE))


# condition
filter(df, x<50 & z==TRUE)
Output:
x y z
1 12 22.1 TRUE
2 31 44.5 TRUE
# create a vector of numbers
x <- c(1, 2, 3, 4, 5, 6)
# keep elements that are greater than 3 using logical subsetting
# (dplyr's filter() works on data frames, not on plain vectors)
result <- x[x > 3]
# print the filtered result
print(result)
# Output:
# [1] 4 5 6
In this example, the condition x > 3 is used to subset the vector x, which returns a new vector
containing only the elements of x that are greater than 3. A plain vector is filtered this way because
filter() from dplyr expects a data frame or tibble.
# Creating a vector of numbers
numbers <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# Using base R's Filter() function to extract only even numbers
even_numbers <- Filter(function(x) x %% 2 == 0, numbers)
# Printing the filtered numbers
even_numbers
#Output
[1]  2  4  6  8 10
filter(starwars, species == "Human")
filter(starwars, mass > 1000)
# Multiple criteria
filter(starwars, hair_color == "none" & eye_color == "black")
filter(starwars, hair_color == "none" | eye_color == "black")
# Multiple arguments are equivalent to and
filter(starwars, hair_color == "none", eye_color == "black")
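
The helper functions and NA checks mentioned earlier can be sketched on a small data frame. The data frame orders and its values are made up for illustration; between() and filter() are from dplyr and is.na() is from base R.

```r
library(dplyr)

# Hypothetical order data with some missing amounts
orders <- data.frame(
  id     = 1:6,
  amount = c(250, NA, 90, 400, 150, NA),
  region = c("North", "South", "North", "East", "South", "East")
)

# Rows whose amount lies between 100 and 300 (inclusive); NA amounts are dropped
mid_orders <- filter(orders, between(amount, 100, 300))

# Rows where the amount is missing
missing_amount <- filter(orders, is.na(amount))

# Rows from selected regions that have a recorded amount
north_or_east <- filter(orders, region %in% c("North", "East"), !is.na(amount))

print(mid_orders)
print(missing_amount)
print(north_or_east)
```

Note that filter() keeps only rows where the condition is TRUE, so rows where the condition evaluates to NA are removed unless they are requested explicitly with is.na().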
Example: R program to select columns using select() with the pipe operator
# Load library dplyr
library(dplyr)
# Load iris dataset
data(iris)
# Select only Sepal.Length and Species columns
iris_select <- iris %>% select(Sepal.Length, Species)
# View the first 6 rows
head(iris_select)
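What is the mutate() function in R?
The mutate() function adds new columns to a data frame or transforms existing ones.
# Load library dplyr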
library(dplyr)


# Load iris dataset


data(iris)
# Create a new column "Sepal.Ratio" based on Sepal.Length and Sepal.Width
iris_mutate <- iris %>% mutate(Sepal.Ratio = Sepal.Length / Sepal.Width)
# View the first 6 rows
head(iris_mutate)
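What is the arrange() function in R?
The arrange() function reorders the rows of a data frame by the values of one or more columns.
# Load library dplyr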
library(dplyr)
# Load iris dataset
data(iris)
# Arrange rows by Sepal.Length in ascending order
iris_arrange <- iris %>% arrange(Sepal.Length)
# View the first 6 rows
head(iris_arrange)
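
Sorting in descending order, or by more than one column, can be sketched as follows. This example assumes the dplyr package is installed and uses desc(), which reverses the sort order of a column.

```r
library(dplyr)

# Sort by Species first, then by Sepal.Length from largest to smallest
sorted <- iris %>% arrange(Species, desc(Sepal.Length))

# The three flowers with the longest sepals
top3 <- iris %>%
  arrange(desc(Sepal.Length)) %>%
  head(3)

print(head(sorted))
print(top3)
```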

2.9 Summarize function in R


The summarize() function from the dplyr package reduces a data frame to a summary of one or
more values. When the observations are first grouped by a categorical variable using the
group_by() function, summarize() returns one row per group.
The summarize() function offers a summary based on the action applied to grouped or ungrouped
data.
Summarize grouped data
The operations that can be performed on grouped data include mean, median, sum, count, etc.
# Load library
library(dplyr)
data <- PlantGrowth
# summarize
summarize(data, mean(weight,na.rm=TRUE))
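In the example above, we use the summarize() function to obtain the mean weight of all the plants
in the PlantGrowth dataset. Because the data is not grouped here, the result is a single value;
applying group_by() to the group column first would give one mean per treatment group.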
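
A grouped version of this summary can be sketched as follows. PlantGrowth has a weight column and a group column (ctrl, trt1, trt2); the output column names mean_weight and n are chosen for illustration.

```r
library(dplyr)

# Group the plants by treatment group, then summarize each group separately
by_group <- PlantGrowth %>%
  group_by(group) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE),
            n = n())

print(by_group)
```

The result has one row per treatment group instead of a single overall value.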
Summarize ungrouped data
We can also summarize ungrouped data. This can be done by using three functions.
summarize_all()
summarize_at()
summarize_if()
Examples
# Load dplyr library
library(dplyr)
# Main code
data <- mtcars
# Load the first 6 observations
sample <- head(data)


# Calculating mean value.
sample %>% summarize_all(mean)
In the code snippet above, we load the mtcars dataset into the data variable. In the variable sample,
we store the first six observations to process. The expression sample %>% summarize_all(mean)
shows the mean of each column over those six observations.
summarize_at()
It performs the action on specific columns and generates the summary based on that action.
# Load dplyr library
library(dplyr)
# Main code
data <- mtcars
# Load the first 6 observations
sample <- head(data)
# Calculating mean value of the mpg and hp columns only.
sample %>% summarize_at(c("mpg", "hp"), mean)
In the code snippet above, we load the mtcars dataset into the data variable and store the first six
observations in the variable sample. The expression sample %>% summarize_at(c("mpg", "hp"),
mean) shows the mean of only the mpg and hp columns for those six observations.

summarize_if()
In this function, we specify a condition and the summary will be generated for the columns that
satisfy the condition.
# Load dplyr library
library(dplyr)
# Main code
data <- mtcars
z <- head(data)
z %>% group_by(hp) %>%
  summarize_if(is.numeric, mean)
In the code snippet above, we use the predicate function is.numeric as the condition and mean as
the action.
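
In dplyr version 1.0.0 and later, the three functions above are marked as superseded, and the same summaries can be written with across() inside summarize(). The sketch below assumes such a version of dplyr is installed; the object names are illustrative.

```r
library(dplyr)

sample6 <- head(mtcars)

# Equivalent of summarize_all(mean): every column
all_means <- sample6 %>% summarize(across(everything(), mean))

# Equivalent of summarize_at(): only the named columns
some_means <- sample6 %>% summarize(across(c(mpg, hp), mean))

# Equivalent of summarize_if(is.numeric, mean): columns that satisfy a condition
numeric_means <- sample6 %>% summarize(across(where(is.numeric), mean))

print(all_means)
print(some_means)
print(numeric_means)
```

The older functions still work, so existing code does not need to change; across() simply expresses all three cases with one mechanism.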

2.10 Group by function in R


The group_by() function belongs to the dplyr package in the R programming language and groups
the rows of a data frame by one or more variables. The group_by() function alone does not
produce a summary. It should be followed by the summarise() function with an appropriate action
to perform. It works similarly to GROUP BY in SQL and a pivot table in Excel.
Example
library(dplyr)
df = read.csv("Sample_Superstore.csv")
df_grp_region = df %>% group_by(Region) %>%
  summarise(total_sales = sum(Sales),


            total_profits = sum(Profit),
            .groups = 'drop')
View(df_grp_region)
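
Because the CSV file used above may not be at hand, the same pattern can be tried on a small data frame built in code. The data frame sales_df and its values are made up for illustration; only the column names Region, Sales, and Profit mirror the example above.

```r
library(dplyr)

# Hypothetical sales records
sales_df <- data.frame(
  Region = c("East", "West", "East", "South", "West", "East"),
  Sales  = c(200, 150, 300, 120, 180, 100),
  Profit = c(40, 30, 60, 10, 35, 15)
)

# One row per region with totals and the number of records
region_summary <- sales_df %>%
  group_by(Region) %>%
  summarise(total_sales = sum(Sales),
            total_profits = sum(Profit),
            orders = n(),
            .groups = "drop")

print(region_summary)
```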

2.11 Concept of Pipes Operator in R


The pipe operator in R is the %>% symbol and it is used for chaining together multiple operations
in a readable and concise way. The pipe operator takes the output from the left-hand side of the
operator and "pipes" it as the first argument to the function on the right-hand side. This allows you
to build complex sequences of operations, each relying on the output from the previous step,
without the need for intermediate variables.
# Example 1
library(dplyr)
mtcars %>%
  filter(cyl == 4) %>%
  summarize(mean_mpg = mean(mpg))
In this example, the mtcars data set is filtered to only keep observations with 4 cylinders, and then
the mean miles per gallon (mpg) is calculated for the remaining observations. The output from each
step is passed to the next step using the pipe operator, making the code more readable and concise.
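Note that the %>% pipe operator is not built into base R; it comes from the magrittr package and is
made available when the dplyr package, a popular data manipulation library, is loaded. R version
4.1.0 and later also provides a native pipe operator, |>, in base R.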
# Example 2
data(mtcars)
mtcars %>%
  select(mpg, hp) %>%
  head()

# Example 3
mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg),
            n = n())
# Example 4
# The last break is Inf so that cars above 200 hp fall in "very high" instead of NA
mtcars %>%
  mutate(cyl_factor = factor(cyl),
         hp_group = cut(hp, breaks = c(0, 100, 150, 200, Inf),
                        labels = c("low", "medium", "high", "very high"))) %>%
  group_by(cyl_factor, hp_group) %>%
  summarize(mean_mpg = mean(mpg),
            n = n())
In the second example, select() keeps only the mpg and hp columns of the mtcars data set,
and then only the first six rows are displayed.
In the third example, the mtcars data set is grouped by the number of cylinders (cyl) and the mean
miles per gallon (mpg) and number of observations (n) are calculated for each group.
In the fourth example, two new variables are created and added to the mtcars data set. The number
of cylinders (cyl) is converted to a factor and a new variable (cyl_factor) is created to represent this


factor. Another new variable (hp_group) is created by dividing the horsepower (hp) into groups
using the cut function. The data set is then grouped by the two new variables, and the mean miles
per gallon (mpg) and number of observations (n) are calculated for each group.
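
To make the idea of piping concrete, the sketch below writes the same computation three ways: as nested function calls, with the %>% operator, and with the native |> operator. The native pipe is assumed to be available, which requires R version 4.1.0 or later.

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- summarize(filter(mtcars, cyl == 4), mean_mpg = mean(mpg))

# The same steps with %>%: read from top to bottom
piped <- mtcars %>%
  filter(cyl == 4) %>%
  summarize(mean_mpg = mean(mpg))

# The same steps with the native pipe (R 4.1.0 or later)
native <- mtcars |>
  filter(cyl == 4) |>
  summarize(mean_mpg = mean(mpg))

print(nested)
print(piped)
print(native)
```

All three forms return the same result; the piped versions only change how the steps are written, not what is computed.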

Summary
There are many ways to summarize business data in R, depending on the type of data you are
working with and the goals of your analysis. Here are a few common methods for summarizing
business data.
Descriptive statistics: You can use base R functions such as mean, median, sd, sum, min,
max, and quantile to calculate common summary statistics for your data. For example, you can
calculate the mean, median, and standard deviation of a variable of interest.
Grouping and aggregating: You can use the group_by and summarize functions from the dplyr
package to group your data by one or more variables and calculate summary statistics for each
group. For example, you can group sales data by product and calculate the total sales for each
product.
Cross-tabulation: You can use the table function to create cross-tabulations (also known as
contingency tables) of your data. For example, you can create a cross-tabulation of sales data by
product and region.
Visualization: You can use various plotting functions, such as barplot, hist, and boxplot, to
create visual representations of your data. Visualization can help you quickly identify patterns and
relationships in your data.
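
The cross-tabulation and visualization methods can be sketched with base R alone. The data frame orders and its values are made up for illustration.

```r
# Hypothetical sales records: one row per order
orders <- data.frame(
  product = c("Laptop", "Phone", "Laptop", "Tablet", "Phone", "Phone"),
  region  = c("North", "North", "South", "South", "South", "North")
)

# Cross-tabulation (contingency table) of product by region
product_by_region <- table(orders$product, orders$region)
print(product_by_region)

# The same table with row and column totals added
print(addmargins(product_by_region))

# Share of each region within a product (each row sums to 1)
region_share <- prop.table(product_by_region, margin = 1)
print(region_share)

# Simple visual summary: number of orders per product
barplot(table(orders$product), main = "Orders per product", ylab = "Count")
```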

Keywords
dplyr, R packages, group by, pipe operator, summarize.

Self Assessment
1. Descriptive analysis tells about________?
A. Past
B. Present
C. Future
D. Previous

2. How many types of R objects are present in R data type?


A. 4
B. 5
C. 6
D. 7

3. How many types of data types are present in R?


A. 4
B. 5
C. 6
D. 7

4. In R every operation has a ______call?


A. System
B. Function


C. None of the above


D. Both of the above

5. The ____________ in R is a vector.


A. Basic data structure
B. Basic datatypes
C. Both
D. None

6. _________and_________ are types of matrices functions?


A. Apply and sapply
B. Apply and lapply
C. Both
D. None

7. How many control statements are present in R?


A. 6
B. 7
C. 8
D. 9

8. Which of the following finds the maximum value in the vector x, excluding missing values?
A. rm(x)
B. all(x)
C. max(x, na.rm=TRUE)
D. x%in%y

9. R functionality is divided into a number of ________


A. Packages
B. Functions
C. Domains
D. Library

10. Which of the following return a subset of the columns of a data frame?
A. select
B. retrieve
C. get
D. set

11. Point out the correct statement?


A. The data frame is a key data structure in statistics and in R


B. R has an internal implementation of data frames that is likely the one you will use most
often
C. There are packages on CRAN that implement data frames via things like relational
databases that allow you to operate on very very large data frames
D. All of the mentioned

12. _________ generates summary statistics of different variables in the data frame, possibly
within strata.
A. rename
B. summarize
C. set
D. subset

13. ________ adds new variables/columns or transforms existing variables.
A. mutate
B. add
C. append
D. arrange

14. The _______ operator is used to connect multiple verb actions together into a pipeline.
A. pipe
B. piper
C. start
D. end

15. The dplyr package can be installed from CRAN using __________
A. installall.packages("dplyr")
B. install.packages("dplyr")
C. installed.packages("dplyr")
D. installed.packages("dpl")

Answers for Self Assessment


1. B 2. C 3. C 4. B 5. A

6. A 7. C 8. C 9. A 10. A

11. D 12. B 13. A 14. A 15. B

Review Questions
1) Using the iris data set, apply the group_by and summarize functions.
2) Discuss the pipe operator in R.


3) Discuss functions of dplyr package.


4) List all inbuilt functions of R.
5) Develop a function that returns odd and even numbers.

Further reading
"An Introduction to R" by W. N. Venables, D. M. Smith, and the R Development Core Team
https://www.r-bloggers.com


Unit 03: Business Data Visualization

Dr. Mohd Imran Khan, Lovely Professional University


CONTENTS
Objectives
Introduction
3.1 Use Cases of Business Data Visualization
3.2 Basic Graphs and their Purposes
3.3 R Packages for Data Visualization
3.4 Ggplot2
3.5 Bar Graph using ggplot2
3.6 Line Plot using ggplot2 in R
Summary
Keywords
Self Assessment
Answers for self Assessment
Review Questions
Further Reading

Objectives
• To analyse data visualization in a business context.
• To discover the purpose of basic graphs.
• To understand the grammar of graphics.
• To visualize basic graphs using ggplot2.
• To visualize some advanced graphs.

Introduction
Business data visualization is the representation of business data and information using charts,
graphs, maps, infographics, and other visual elements. It transforms complex data into graphical
form so that it is easy to understand and interpret. The main goal of data visualization in a
business context is to reveal patterns and trends, support decision-making processes, and enhance
communication.
Data visualization offers several benefits to businesses, including:
Improved communication: By using visual representations, data visualization makes it easier for
individuals to understand and interpret data, which leads to better communication and
collaboration among team members.
Increased Insights: Data visualization allows companies to identify patterns and trends in data that
would be difficult to detect through raw data analysis. This leads to new insights and a better
understanding of the data.


Better Decision-Making: Data visualization provides a visual representation of data that supports
decision-making processes. By presenting data in a way that is easy to understand and interpret,
decision-makers can make informed decisions based on accurate data analysis.
Enhanced Presentations: Data visualization adds a visual component to presentations, making
them more engaging and effective for communicating data.

3.1 Use Cases of Business Data Visualization


Data visualization has a wide range of use cases in businesses, including:
Sales and Marketing: Data visualization can be used to analyze sales data, customer
demographics, and marketing campaign performance. This allows companies to make informed
decisions about product development, marketing strategies, and customer engagement.
Financial Analysis: Data visualization can be used to present financial data, such as budget reports,
income statements, and balance sheets, in a way that is easy to understand and interpret.
Supply Chain Management: Data visualization can be used to track the flow of goods and
materials, monitor inventory levels, and analyze supply chain performance.
Operations Management: Data visualization can be used to monitor key performance indicators,
such as production output and efficiency, in real-time. This allows companies to make informed
decisions about operations and production processes.
Business data visualization is a powerful tool for companies to understand and make sense of large
amounts of data. By transforming complex data into graphical representations, data visualization
improves communication, enhances decision-making processes, and provides new insights into
data. With its wide range of use cases, data visualization is an essential tool for businesses in a data-
driven world.

3.2 Basic Graphs and their Purposes


There are several basic graphs that are commonly used in data visualization and each has a specific
purpose:
Bar Graph: A bar graph is used to compare the sizes of different categories of data. The data is
represented as bars, with the height of each bar representing the value of the data. Bar graphs are
best used when comparing data sets with a small number of categories.
Line Graph: A line graph is used to show how a value changes over time. Data points are plotted
on a graph and connected with lines to show the trend over time. Line graphs are best used for data
sets with continuous data, such as stock prices or temperature over time.
Pie Chart: A pie chart is used to show the proportion of different categories in a data set. The data
is represented as slices of a pie, with each slice representing the proportion of a category in the data
set. Pie charts are best used for data sets with a small number of categories and for showing the
proportion of each category in the data set.
Scatter Plot: A scatter plot is used to show the relationship between two variables. Data points are
plotted on a graph to show the relationship between the two variables. Scatter plots are best used
for data sets with continuous data and for showing the relationship between two variables.
Histogram: A histogram is used to show the distribution of data. The data is divided into bins, with
the height of each bin representing the number of data points in that bin. Histograms are best used
for data sets with continuous data and for showing the distribution of data.
Stacked Bar Graph: A stacked bar graph is used to show the proportion of different categories in a
data set, while also showing the total of all the categories. The data is represented as bars, with each
bar representing the total of all the categories, and each category represented as a portion of the bar.
Stacked bar graphs are best used for data sets with a small number of categories and for showing
the proportion of each category in the data set, as well as the total of all the categories.
By selecting the appropriate graph type, you can effectively communicate your data and help
others understand your findings.


3.3 R Packages for Data Visualization


Several R packages are available for data visualization. Some of the most popular are:
ggplot2: One of the most widely used packages for data visualization in R, it provides a high-level
interface for producing attractive and informative visualizations with minimal code.
plotly: An interactive visualization library that allows you to produce charts, maps, and other types
of graphics that can be easily embedded in web pages or R markdown documents.
lattice: A package that provides a high-level interface for producing trellis graphics, which are
multi-panel visualizations that display the relationship between multiple variables.
shiny: A package that makes it easy to build interactive web applications in R, including
visualizations.
leaflet: A package that provides an interface for creating interactive maps, making it easy to display
spatial data in a meaningful way.
dygraphs: A package that provides an interface for producing time-series plots, which are
commonly used to visualize trends in data over time.
rgl: A package that provides a high-level interface for producing interactive 3D graphics, allowing
you to visualize complex data in a way that is not possible with 2D graphics.
rbokeh: A visualization library for R that provides a high-level interface to the Bokeh library for
Python.
googleVis: An R interface to the Google Charts API, which allows you to create interactive web
visualizations from R with minimal effort.
ggvis: A package for creating interactive visualizations, with syntax similar to ggplot2.
rayshader: A package for creating 3D visualizations and animations of ggplot2 graphics.
flexdashboard: A package for creating dashboards, with support for multiple pages and a variety
of interactive visualizations.
These packages cover a wide range of visualization needs and provide many customization
options, making it easy to create high-quality visualizations for your data.

3.4 Ggplot2
ggplot2 is a plotting library for the R programming language, used for creating sophisticated
graphics. It was created by Hadley Wickham and is based on the principles of the grammar of
graphics, which provides a flexible structure for building complex visualizations from simple
components.
One of the key features of ggplot2 is that it allows users to build plots layer by layer, by adding
components such as data, aesthetics (mapping variables to visual properties), geoms
(representations of data, such as points, lines, or bars), and statistics (such as regression lines or
smoothing splines). This approach makes it easier to understand and control the appearance of the
final plot.
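The layer-by-layer idea can be sketched as follows. This is a minimal illustration on the built-in mtcars data. The choice of variables and of a linear smoother is an assumption made for the example, not something the text prescribes.

```r
library(ggplot2)

# Start with data and aesthetic mappings only (this draws an empty panel)
p <- ggplot(mtcars, aes(x = wt, y = mpg))

# Layer 1: a geom that represents each car as a point
p <- p + geom_point()

# Layer 2: a statistic, here a fitted linear regression line
p <- p + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Display the finished plot
p
```

Because each layer is added separately, any one of them can be changed or removed without touching the others.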
ggplot2 also has a large and active user community, which has contributed a variety of additional
packages and extensions that enhance its functionality. As a result, ggplot2 is widely used in
academia and industry, and has become one of the most popular plotting libraries for R.
The library is highly extensible, with a large number of plugins and extensions available that allow
you to create custom visualizations or fine-tune existing ones. Additionally, ggplot2 is designed to
play well with other R packages, such as dplyr and tidyr, which makes it easy to manipulate and
transform your data before creating a visualization.
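A short sketch of how dplyr and ggplot2 can work together is shown below. It assumes the dplyr package is installed, and the summary column name `mean_mpg` is chosen here purely for illustration.

```r
library(dplyr)
library(ggplot2)

# Summarise first: mean miles per gallon for each cylinder count
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# Then plot the summary table; geom_col() draws bars whose heights are the data values
p <- ggplot(mpg_by_cyl, aes(x = factor(cyl), y = mean_mpg)) +
  geom_col(fill = "steelblue") +
  labs(x = "Number of Cylinders", y = "Mean Miles Per Gallon")
p
```

Summarising before plotting keeps the plotting code short, because the data frame passed to ggplot() already contains exactly the values to draw.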
Some of the key features of ggplot2 include:

• A wide variety of plot types, including scatterplots, bar plots, line plots, histograms, density
plots, box plots, and more.
• Customization of every aspect of a plot, from the axis labels and titles to the colors and
themes.
• Built-in support for facets, which allow you to create multiple subplots that share the same
scales and aesthetics.
• The ability to combine multiple layers into a single plot, and to add smooth fits, regression
lines, and other statistical summaries to a plot.
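Faceting, mentioned in the list above, can be sketched with facet_wrap(). The example below is a minimal illustration on mtcars, and splitting by the cyl variable is an arbitrary choice for the demonstration.

```r
library(ggplot2)

# One panel per cylinder count; all panels share the same x and y scales by default
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl) +
  labs(x = "Weight (1000 lbs)", y = "Miles Per Gallon")
p
```

Because the panels share their scales, the groups can be compared directly by eye.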
ggplot2 has a number of advantages over other data visualization tools, including:
Consistency: ggplot2 provides a consistent syntax for creating visualizations, making it easier to
learn and use.
Customization: ggplot2 is highly customizable, allowing you to create visualizations that meet
your specific needs.
Extensibility: ggplot2 is designed to be extended and modified, making it easy to create new
visualizations or modify existing ones.
Large Community: ggplot2 has a large and active community of users who provide support,
resources, and tutorials.
ggplot2 is widely used in the R community and is considered to be one of the best data
visualization libraries for R. It provides a powerful and flexible platform for creating professional-
looking visualizations and has a large and active user community that provides support and
develops new extensions and packages.
The syntax of ggplot2 can be broken down into three main components:
The data: You start by specifying the data you want to visualize. This can be a data frame or a
tibble in R.
The aesthetics: Next, you define the visual mappings, or aesthetics, between the variables in your
data and the visual elements of the plot, such as the x and y positions, color, size, etc.
The geometry: Finally, you specify the type of plot you want to create, such as a scatter plot, bar
plot, histogram, etc., using a geom (short for geometry).
Here's a simple example that demonstrates the basic syntax of ggplot2:
library(ggplot2)
# Load the data
data(mtcars)
# Create the plot
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point()
In this example, the data is mtcars, and the aesthetics are defined as x = wt and y = mpg. The
geom_point() function is used to specify that we want a scatter plot.
The syntax of ggplot2 can be quite dense, but it's also highly expressive and allows for fine-grained
control over the appearance and behavior of your visualizations. With practice, you'll find that you
can create complex and beautiful plots with just a few lines of code.
A few more examples
Barplot
library(ggplot2)
# Load the data
data(mtcars)
# Create the plot
ggplot(data = mtcars, aes(x = factor(cyl))) +


geom_bar(fill = "blue") +
xlab("Number of Cylinders") +
ylab("Count") +
ggtitle("Count of Cars by Number of Cylinders")
Line plot
library(ggplot2)
# Load the data
data(economics)
# Create the plot
ggplot(data = economics, aes(x = date, y = uempmed)) +
geom_line(color = "red") +
xlab("Year") +
ylab("Median Duration of Unemployment (weeks)") +
ggtitle("Median Duration of Unemployment Over Time")
Histogram
library(ggplot2)
# Load the data
data(mtcars)
# Create the plot
ggplot(data = mtcars, aes(x = mpg)) +
geom_histogram(fill = "blue", binwidth = 2) +
xlab("Miles Per Gallon") +
ylab("Frequency") +
ggtitle("Histogram of Miles Per Gallon")
Boxplot
library(ggplot2)
# Load the data
data(mtcars)
# Create the plot
ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
geom_boxplot(fill = "blue") +
xlab("Number of Cylinders") +
ylab("Miles Per Gallon") +
ggtitle("Box Plot of Miles Per Gallon by Number of Cylinders")
These are just a few examples to get you started. You can create many more complex and
interesting visualizations using ggplot2 by combining different geoms, adjusting the aesthetics, and
adding additional elements such as faceting, themes, and annotations.

3.5 Bar Graph using ggplot2


This is the most basic barplot you can build using the ggplot2 package. It follows these steps:
Always start by calling the ggplot() function.


Then specify the data object. It has to be a data frame, and it needs one numeric and one categorical
variable.
Then come the aesthetics, set in the aes() function: map the categorical variable to the X axis and the
numeric variable to the Y axis.
Finally, call geom_bar(). You have to specify stat = "identity" for this kind of dataset, because the bar
heights are already stored in the data rather than counted from it.
Most basic bar plot
# Load ggplot2
library(ggplot2)
# Create data
data <- data.frame(
name=c("A","B","C","D","E") ,
value=c(3,12,5,18,45)
)
# Barplot
ggplot(data, aes(x=name, y=value)) +
geom_bar(stat = "identity")

Control bar color


Here are a few different methods to control bar colors. Note that using a legend in this case is not
necessary since names are already displayed on the X axis. You can remove it with
theme(legend.position="none").
# Libraries
library(ggplot2)
# 1: uniform color. Color is for the border, fill is for the inside
ggplot(mtcars, aes(x=as.factor(cyl) )) +
geom_bar(color="blue", fill=rgb(0.1,0.4,0.5,0.7) )
# 2: Using Hue
ggplot(mtcars, aes(x=as.factor(cyl), fill=as.factor(cyl) )) +
geom_bar( ) +


scale_fill_hue(c = 40) +
theme(legend.position="none")
# 3: Using RColorBrewer
ggplot(mtcars, aes(x=as.factor(cyl), fill=as.factor(cyl) )) +
geom_bar( ) +
scale_fill_brewer(palette = "Set1") +
theme(legend.position="none")

# 4: Using greyscale:
ggplot(mtcars, aes(x=as.factor(cyl), fill=as.factor(cyl) )) +
geom_bar( ) +
scale_fill_grey(start = 0.25, end = 0.75) +
theme(legend.position="none")
# 5: Set manually
ggplot(mtcars, aes(x=as.factor(cyl), fill=as.factor(cyl) )) +
geom_bar( ) +
scale_fill_manual(values = c("red", "green", "blue") ) +
theme(legend.position="none")


Horizontal barplot with coord_flip()


It often makes sense to turn your barplot horizontal. Indeed, it makes the group labels much easier
to read.
Fortunately, the coord_flip() function makes it a breeze.
# Load ggplot2
library(ggplot2)
# Create data
data <- data.frame(
name=c("A","B","C","D","E") ,
value=c(3,12,5,18,45)
)
# Barplot
ggplot(data, aes(x=name, y=value)) +
geom_bar(stat = "identity") +
coord_flip()


Control bar width with width


The width argument of the geom_bar() function lets you control the bar width. It ranges between 0
and 1, 1 being full width.
# Load ggplot2
library(ggplot2)
# Create data
data <- data.frame(
name=c("A","B","C","D","E") ,
value=c(3,12,5,18,45)
)
# Barplot
ggplot(data, aes(x=name, y=value)) +
geom_bar(stat = "identity", width=0.2)


Stacked Bar Graph


If your data contains several groups of categories, you can display the data in a bar graph in one of
two ways. You can decide to show the bars in groups (grouped bars) or you can choose to have
them stacked (stacked bars).
#creating data
survey <- data.frame(group=rep(c("Men", "Women"),each=6),
fruit=rep(c("Apple", "Kiwi", "Grapes", "Banana", "Pears", "Orange"),2),
people=c(22, 10, 15, 23, 12, 18, 18, 5, 15, 27, 8, 17))
ggplot(survey, aes(x=fruit, y=people, fill=group)) +
geom_bar(stat="identity")
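The code above produces stacked bars. A grouped version of the same chart can be sketched by setting the position of the bars to "dodge". The survey data is repeated here so that the example is self-contained, and the object name `p_grouped` is only illustrative.

```r
library(ggplot2)

# Same survey data as in the stacked example
survey <- data.frame(
  group  = rep(c("Men", "Women"), each = 6),
  fruit  = rep(c("Apple", "Kiwi", "Grapes", "Banana", "Pears", "Orange"), 2),
  people = c(22, 10, 15, 23, 12, 18, 18, 5, 15, 27, 8, 17)
)

# position = "dodge" places the bars of each group next to each other
p_grouped <- ggplot(survey, aes(x = fruit, y = people, fill = group)) +
  geom_bar(stat = "identity", position = "dodge")
p_grouped
```

Grouped bars make it easier to compare men and women for a single fruit, while stacked bars emphasize the total for each fruit.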


3.6 Line Plot using ggplot2 in R


A line chart or line graph displays the evolution of one or several numeric variables. Data points
are usually connected by straight line segments.
The input data frame requires at least 2 columns:
An ordered numeric variable for the X axis
Another numeric variable for the Y axis
Once the data is read by ggplot2 and those 2 variables are specified in the x and y arguments of the
aes(), just call the geom_line() function.
Most basic line plot
# Libraries
library(ggplot2)
# create data
xValue <- 1:10
yValue <- cumsum(rnorm(10))
data <- data.frame(xValue,yValue)
# Plot
ggplot(data, aes(x=xValue, y=yValue)) +
geom_line()
Formatting Line
Line Type
The line type is set with the linetype argument of geom_line(). ggplot2 provides several line types,
for example "dotted", "dashed", and "twodash". The argument takes the name of the line type as its value.
library(ggplot2)
# Create data for chart
val <-data.frame(course=c('DSA','C++','R','Python'),
num=c(77,55,80,60))
# Format the line type


ggplot(data=val, aes(x=course, y=num, group=1)) +


geom_line(linetype = "dotted")+
geom_point()

Line Color
The line color is set with the color argument: write the desired color name in double quotes inside
geom_line().
library(ggplot2)
# Create data for chart
val <-data.frame(course=c('DSA','C++','R','Python'),
num=c(77,55,80,60))
# Format the line color
ggplot(data=val, aes(x=course, y=num, group=1)) +
geom_line(color="green")+
geom_point()


Line Size
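The line thickness can be changed with the size argument inside geom_line(). In ggplot2 3.4.0 and
later, linewidth is the preferred name for this argument, and size still works for lines but raises a
deprecation warning.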
library(ggplot2)
# Create data for chart
val <-data.frame(course=c('DSA','C++','R','Python'),
num=c(77,55,80,60))
# Format the line size
ggplot(data=val, aes(x=course, y=num, group=1)) +
geom_line(color="green",size=1.5)+
geom_point()

Histogram in R using ggplot2


Histograms show the distribution of a single variable, while bar charts compare values across
categories. A histogram plots quantitative data, with the range of the data grouped into intervals
(bins), whereas a bar chart plots categorical data.
The geom_histogram() function is part of the ggplot2 package.
Basic histogram with geom_histogram
# library
library(ggplot2)
# dataset:
data=data.frame(value=rnorm(100))
# basic histogram
p <- ggplot(data, aes(x=value)) +
geom_histogram()
# display the plot
p


Control bin size with binwidth


# Libraries
library(tidyverse)
library(hrbrthemes)
# Load dataset from GitHub
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/1_OneNum.csv", header=TRUE)
# Plot: keep prices below 300 and use bins of width 3
p <- data %>%
filter( price<300 ) %>%
ggplot( aes(x=price)) +
geom_histogram( binwidth=3, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
ggtitle("Bin size = 3") +
theme_ipsum() +
theme(
plot.title = element_text(size=15)
)
# Display the plot
p


Box plots in R using ggplot2


Box plots are commonly used to show the distribution of data in a standard way by presenting five
summary values: the minimum, Q1 (first quartile), median, Q3 (third quartile), and maximum.
Points that fall far outside the quartiles are drawn individually, which makes outliers and their
values easy to spot.
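The five summary values can also be computed directly in base R, as in the sketch below. It uses the same ten values that appear in the example code of this section. Note that fivenum() reports Tukey's hinges, which are a close variant of the quartiles, and the variable names `five` and `outliers` are only illustrative.

```r
# The ten example values used in this section
x <- c(17, 32, 8, 53, 1, 45, 56, 678, 23, 34)

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
print(five)

# Points more than 1.5 times the hinge spread away from the box are reported as outliers
outliers <- boxplot.stats(x)$out
print(outliers)
```

Here the value 678 lies far above the rest of the data, so it is reported as an outlier rather than treated as an ordinary maximum.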
In ggplot2, geom_boxplot() is used to create a boxplot.
Most basic box plot using ggplot2.
library(ggplot2)
# Create the dataset or load the dataset
# for the chart
Dataset <- c(17, 32, 8, 53, 1,45,56,678,23,34)
Dataset
# loading data set and storing it in ds variable
ds <- read.csv(
"c://crop//archive//Crop_recommendation.csv", header = TRUE)
# create a boxplot by using geom_boxplot() function
# of ggplot2 package
crop=ggplot(data=ds, mapping=aes(x=label, y=temperature))+geom_boxplot()
crop


Adding mean value to the boxplot


A mean value can also be added to a boxplot. To do this, specify the summary function to use
within stat_summary(). This function computes summary values and adds them to the plot, so you
do not need to calculate the mean values before plotting.
library(ggplot2)
# loading data set and storing it in ds variable
ds <- read.csv("c://crop//archive//Crop_recommendation.csv", header = TRUE)
# add mean to ggplot2 boxplot
ggplot(ds, aes(x = label, y = temperature, fill = label)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape = 8,
size = 2, color = "white")
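The example above depends on a local CSV file. As a self-contained sketch of the same technique, the code below uses the built-in mtcars data instead. Substituting mtcars, and plotting mpg by cyl, is an assumption made here for illustration only.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot() +
  # stat_summary() computes the mean of mpg within each cylinder group
  stat_summary(fun = mean, geom = "point", shape = 8, size = 2, color = "white") +
  labs(x = "Number of Cylinders", y = "Miles Per Gallon", fill = "Cylinders")
p
```

The star-shaped points mark the group means, which can be compared with the median line inside each box.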

Scatter Plot using ggplot2 in R


To draw a scatterplot, we will use the geom_point() function.
Most basic scatterplot
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point()


Scatter plot with groups


Here we distinguish the values by group (i.e. by the levels of a factor). The color aesthetic inside
aes() controls the color of each group, and it should be mapped to a factor variable such as Species.
# Scatter plot with groups
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species))

Graphics for Correlations


Correlation plots, also known as correlograms for more than two variables, help us to visualize the
correlation between continuous variables.


A correlogram is a graph of a correlation matrix. It is useful for highlighting the most correlated
variables in a data table. In this plot, correlation coefficients are colored according to their value.
The correlation matrix can also be reordered according to the degree of association between variables.
Use of ggcorrplot() function to draw a correlogram
library(ggcorrplot)
# Load the data
data(mtcars)
# Calculate the correlation matrix
cor_mat <- cor(mtcars)
# Create the plot
ggcorrplot(cor_mat, method = "circle", hc.order = TRUE, type = "lower",
lab = TRUE, lab_size = 3)
In this example, the cor function is used to calculate the pairwise correlations between the variables
in the mtcars dataset. The ggcorrplot function is then used to create the correlogram, using the circle
method so that each coefficient is drawn as a circle whose color reflects its value (with the default
palette, negative correlations are blue and positive correlations are red).

Graphs for deviation and ranking


Point plot
A point plot represents an estimate of central tendency for a numeric variable by the position of the
dot and provides some indication of the uncertainty around that estimate using error bars.
Point plots can be more useful than bar plots for focusing comparisons between different levels of
one or more categorical variables. They are particularly adept at showing interactions: how the
relationship between levels of one categorical variable changes across levels of a second categorical
variable.
# creating a data frame df
df<-data.frame(Mean=c(0.24,0.25,0.37,0.643,0.54),
sd=c(0.00362,0.281,0.3068,0.2432,0.322),
Quality=as.factor(c("good","bad","good","very good","very good")),
Category=c("A","B","C","D","E"),
Insert= c(0.0, 0.1, 0.3, 0.5, 1.0))


# plot the point plot with error bars of one standard deviation
library(ggplot2)
p<-ggplot(df, aes(x=Category, y=Mean, color=Quality)) +
geom_point()+
geom_errorbar(aes(ymin=Mean-sd, ymax=Mean+sd), width=.2,
position=position_dodge(0.05))
# display the plot
p

Violin Plot
A violin plot is a type of plot that combines aspects of both box plots and kernel density plots, and
is used to visualize the distribution of a numerical variable and its ranking within that distribution.
# Load the ggplot2 library (install it beforehand if necessary)
library(ggplot2)
# Generate some sample data
set.seed(123)
x <- rnorm(100)
group <- rep(c("Group 1", "Group 2"), 50)
# Prepare the data into a format that can be plotted
df <- data.frame(x = x, group = group)
# Create the violin plot using ggplot2
ggplot(df, aes(x = group, y = x, fill = group)) +
geom_violin() +
labs(x = "Group", y = "X")
In this example, the ggplot() function is used to specify the plot, with group and x as the aesthetic
mappings. The geom_violin() layer is then added to the plot to create the violin plot. The labs()
function is used to add labels to the x- and y-axes.


In this example, the x variable is drawn from a normal distribution and assigned to two different
groups, "Group 1" and "Group 2". The violin plot shows the distribution of x for each group. The fill
color of the violin plot is specified by the group variable.

Graphs for distribution and composition


Density plot
A density plot is a type of plot that is used to visualize the distribution of a numerical variable. It
shows the estimated probability density function (PDF) of the data, which reveals the shape of the
distribution, such as where values are concentrated and how widely they are spread.
To create a density plot using ggplot2, you will first need to prepare the data into a format that can
be plotted, and then use the ggplot() function to specify the plot, followed by the geom_density()
layer to add the density plot.
Here's an example of how to create a density plot using ggplot2:
# Load the ggplot2 library (install it beforehand if necessary)
library(ggplot2)
# Generate some sample data
set.seed(123)
x <- rnorm(100)
# Prepare the data into a format that can be plotted
df <- data.frame(x = x)
# Create the density plot using ggplot2
ggplot(df, aes(x = x)) +
geom_density() +
labs(x = "X", y = "Density")
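Density plots are often used to compare groups. The sketch below overlays one density curve per group using the built-in iris data. The choice of iris and of the Sepal.Length variable is an assumption made for the example.

```r
library(ggplot2)

# Overlay one density curve per species; alpha makes overlapping areas visible
p <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.4) +
  labs(x = "Sepal Length", y = "Density")
p
```

Mapping fill to a factor tells ggplot2 to estimate and draw a separate density for each level of that factor.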


Lollipop plot
A lollipop plot is a type of plot that is used to visualize the relationship between two variables,
where one variable is categorical and the other is numerical. In a lollipop plot, the categorical
variable is shown on the x-axis, and the numerical variable is represented by a line (the "stick") that
extends from the x-axis to the corresponding y-value. The end of the stick is marked by a circle (the
"lollipop").
# Load the ggplot2 library (install it beforehand if necessary)
library(ggplot2)
# Generate some sample data
set.seed(123)
x <- c("Group 1", "Group 2", "Group 3")
y <- c(1, 2, 3)
# Prepare the data into a format that can be plotted
df <- data.frame(x = x, y = y)
# Create the lollipop plot using ggplot2
ggplot(df, aes(x = x, y = y)) +
geom_segment(aes(xend = x, yend = 0), color = "gray50") +
geom_point(size = 5) +
labs(x = "Group", y = "Value")


[Figure: lollipop plot of Value (y-axis) by Group (x-axis: Group 1, Group 2, Group 3)]

In this example, the ggplot() function is used to specify the plot, with x and y as the aesthetic
mappings. The geom_segment() layer is then added to the plot to create the stick, with xend and
yend as the endpoint mappings. The geom_point() layer is added to the plot to create the lollipops,
and the labs() function is used to add labels to the x- and y-axes.
You can customize the appearance of the lollipop plot by adding layers or by changing the arguments
of the existing geoms. For example, you can change the color of the sticks and lollipops, add labels
to the lollipops, and more.
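A minimal sketch of such customizations follows. The colors, label placement, and axis limits are arbitrary choices made for illustration.

```r
library(ggplot2)

df <- data.frame(x = c("Group 1", "Group 2", "Group 3"), y = c(1, 2, 3))

p <- ggplot(df, aes(x = x, y = y)) +
  # Colored sticks from the baseline (y = 0) up to each value
  geom_segment(aes(xend = x, yend = 0), color = "steelblue") +
  # Colored lollipop heads
  geom_point(size = 5, color = "darkorange") +
  # Value labels placed just above each head
  geom_text(aes(label = y), vjust = -1.5) +
  # Leave room above the tallest lollipop for its label
  ylim(0, 3.5) +
  labs(x = "Group", y = "Value")
p
```

Each customization is a separate layer or argument, so it can be added or removed independently of the others.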

Summary
Business data visualization refers to the representation of data in graphical format to help
organizations make informed decisions. By visualizing data, it becomes easier to identify patterns,
trends, and relationships that may not be immediately apparent from raw data. The main goal of
business data visualization is to communicate complex information in an easy-to-understand
manner and to support data-driven decision making.

There are various types of data visualizations including bar graphs, line charts, scatter plots, pie
charts, heat maps, and more. The choice of visualization depends on the type and nature of the data
being analyzed.

Benefits of business data visualization include improved communication and understanding of
data, identifying relationships and trends, making informed decisions, and improved data analysis
efficiency.

It's important to note that while visualizing data can greatly enhance understanding and decision
making, it is important to also consider the limitations and potential biases that may arise in the
visual representation of data. Proper data visualization techniques should be used and the results
should be validated and interpreted carefully.

Keywords
Data visualization, ggplot2, R packages, lollipop chart

Self Assessment
1. Point out the correct statement?


A. autoplotgraph is used to complete ggplot appropriate to a particular data type


B. auto_element wraps up a projection of summary functions
C. ggplot.data creates a new ggplot plot from a data frame
D. aes_sdensity displays a smooth density estimate

2. ______ displays a smooth density estimate.


A. geom_density2
B. geom_density
C. aes_sdensity
D. geom_contour

3. Which of the following draws nothing?


A. geom_blank
B. geom
C. geom_bin2d
D. geom_contour

4. Point out the correct statement?


A. is.theme reports whether x is a real object
B. is.object reports whether x is an aesthetic object
C. qplot is used for quick plot
D. ggplot describes the type of plot you will produce

5. _________ describe the type of plot you will produce.


A. geoms
B. ggplot
C. fplot
D. gplot

6. __________ is an interval represented by a vertical line, with a point in the middle.


A. geom_range
B. geom_pointrange
C. printplot
D. geom_contour

7. Which of the following creates a set of identity mappings?


A. ggplot
B. aes_all
C. aes
D. ggorder

8. _________ is a package that makes it easy to “tidy” your data.


A. tidy
B. tidyr
C. tidyneat
D. tidynr

9. Point out the correct statement?


A. Each row is an observation in tidy data
B. Each column is a variable in tidy data
C. Arranging your data in tidy way makes it easier to work
D. All of the mentioned

10. Which of the following takes two columns and spreads them into multiple columns?
A. ggmissplot
B. printplot
C. print.ggplot
D. ggplot

11. How many functions exist for wrangling the data with dplyr package?
A. one
B. seven
C. three
D. five

12. What is the role of exploratory graphs in data analysis?


A. They are made for formal presentations
B. They are typically made very quickly
C. Axes, legends, and other details are clean and exactly detailed
D. They are used in place of formal modeling

13. What is ggplot2 an implementation of?


A. the Grammar of Graphics developed by Leland Wilkinson
B. 3D visualization system
C. the S language originally developed by Bell Labs
D. the base plotting system in R

14. For barchart and _________ non-trivial methods exist for tables and arrays, documented at
barchart.table.
A. scatterplot
B. dotplot
C. xyplot
D. scatterplot and xyplot

15. What is a geom in the ggplot2 system?


A. a plotting object like point, line, or other shape
B. a method for making conditioning plots
C. a method for mapping data to attributes like color and size
D. a statistical transformation

Answers for Self Assessment

1. C 2. B 3. A 4. C 5. A

6. B 7. B 8. B 9. D 10. C

11. B 12. B 13. A 14. B 15. A

Review Questions
1) What is ggplot2 and what is its purpose?
2) How does ggplot2 differ from other data visualization tools in R?
3) What is the structure of a ggplot2 plot?
4) What is a "ggplot" object and how is it constructed in ggplot2?
5) How can you add layers to a ggplot object?
6) What are the different types of geoms available in ggplot2 and what do they represent?
7) How can you customize the appearance of a ggplot plot, such as color, size, and shape of
the data points?
8) How can you add descriptive statistics, such as mean or median, to a ggplot plot?
9) How can you use facets to create multiple plots in a single ggplot plot?
10) What is the difference between scales and themes in ggplot2, and how can you use them to
change the look of your plot?

Further Reading
"R Graphics Cookbook" by Winston Chang
"Data Visualization with ggplot2" by Hadley Wickham
"ggplot2: Elegant Graphics for Data Analysis" by Hadley Wickham
"An Introduction to ggplot2" by Ed Zehl
"Data Visualization with ggplot2: A Practical Guide" by Kim Seefeld
"R Graphics for Data Analysis" by Murrell
"Data Visualization with ggplot2 and the Tidyverse" by Thomas Lin Pedersen
"The ggplot2 Package: A tutorial on its structure and use" by J. Verzani
"Data Visualization with ggplot2: A step-by-step guide" by Thomas Briet.

Unit 04: Business Forecasting using Time Series


Dr. Mohd Imran Khan, Lovely Professional University

CONTENTS
Objectives
Introduction
4.1 What is Business Forecasting?
4.2 Time Series Analysis
4.3 When Time Series Forecasting should be used
4.4 Time Series Forecasting Considerations
4.5 Examples of Time Series Forecasting
4.6 Why Organizations use Time Series Data Analysis
4.7 Exploration of Time Series Data using R
4.8 Forecasting Using ARIMA Methodology
4.9 Forecasting Using GARCH Methodology
4.10 Forecasting Using VAR Methodology
Summary
Keywords
Self Assessment
Answers for Self Assessment
Review Questions
Further Reading

Objectives
After studying this unit, you should be able to:

• make informed decisions based on accurate predictions of future events
• help businesses prepare for the future by providing them with the information they need
to make informed decisions
• help businesses make better decisions by providing them with accurate and reliable
predictions of future events
• identify potential risks and opportunities, enabling businesses to make proactive decisions
to mitigate risks and capitalize on opportunities

Introduction
Business forecasting is a crucial element for any business to sustain its growth and profitability in
the long run. Time series analysis is a popular technique for business forecasting: it analyzes data
collected over a period of time, such as the past performance of a business, to identify trends,
patterns, and relationships that can be used to predict future outcomes.
In business forecasting using time series analysis, various methods can be employed, such as
moving average, exponential smoothing, regression analysis, and trend analysis. These methods
help in identifying the trends and patterns in the data and forecasting the future values of the
variables.
One of the major advantages of time series analysis is that it can help businesses in identifying the
factors that affect their performance and understanding the impact of external factors such as
changes in the economy, consumer behavior, and market trends.
Time series analysis can be used in various business functions such as sales forecasting, inventory
management, financial forecasting, and demand forecasting. It helps businesses to make informed
decisions about their future investments, resource allocation, and overall strategy.
In conclusion, time series analysis is an essential tool for business forecasting, and its applications
are wide-ranging. Accurate forecasting can provide a significant competitive advantage for
businesses and is essential for their long-term success.

4.1 What is Business Forecasting?


Business forecasting refers to the tools and techniques used to predict developments in business,
such as sales, expenditures, and profits. The purpose of business forecasting is to develop better
strategies based on these informed predictions. Past data is collected and analyzed via quantitative
or qualitative models so that patterns can be identified and can direct demand planning, financial
operations, future production, and marketing operations.
Business forecasting is the process of estimating future business performance, including revenue,
expenses, and other metrics. It is an essential part of planning and decision-making for
organizations of all sizes, as it helps companies understand their future financial position and make
informed decisions about investments, resource allocation, and other important business initiatives.
Forecasting can be done using a variety of methods, including qualitative methods such as expert
opinions and market research, and quantitative methods such as time-series analysis and regression
analysis. The choice of method will depend on the specific business, its data and available
resources, as well as the purpose and time frame of the forecast.
Business forecasting is not an exact science, and there is always a degree of uncertainty involved.
However, by using appropriate methods and considering a range of scenarios, companies can make
informed decisions about their future, and be better prepared for the challenges and opportunities
ahead.

The business forecasting process entails the following steps:

1) Identify the problem, data point, or question that will be the basis of the systematic
investigation.
2) Identify relevant, theoretical variables and determine the ideal manner for collecting datasets.
3) Make estimates about future business operations based on information collected through
investigation.
4) Choose the model that best fits the dataset, variables, and estimates. The chosen model conducts
data analysis and a forecast is made.
5) Note the deviations between actual performance and the forecast. Use this information to refine
the process of predicting and improve the accuracy of future forecasts.
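The last step, noting deviations between actual performance and the forecast, can be illustrated with two common error measures. The following is a minimal sketch in Python, not a prescribed method: the function names and the sales figures are hypothetical and chosen only for illustration.

```python
def mean_absolute_error(actual, forecast):
    """Average absolute deviation between actual and forecast values."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and of equal length")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mean_absolute_percentage_error(actual, forecast):
    """Average absolute deviation as a percentage of the actual values.

    Assumes no actual value is zero.
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and of equal length")
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


# Hypothetical monthly sales (units) and the forecasts that were made for them.
actual_sales = [100, 120, 130, 150]
forecast_sales = [110, 115, 135, 140]

mae = mean_absolute_error(actual_sales, forecast_sales)
mape = mean_absolute_percentage_error(actual_sales, forecast_sales)
print(f"MAE: {mae:.2f} units, MAPE: {mape:.2f}%")
```

Tracking such measures from one forecasting cycle to the next gives a simple way to see whether refinements to the process are actually improving accuracy.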

4.2 Time Series Analysis


Time series analysis examines data from past periods to make predictions about future trends.
It considers components such as seasonality, trend, and autocorrelation to make predictions.
Time series analysis is a statistical method used to analyze and forecast future trends based on
historical data. It involves analyzing data collected over a period of time, such as monthly sales
figures or daily stock prices, to identify patterns and trends in the data. The data is then used to
make predictions about future trends.
Time series analysis can be divided into two main categories:
Descriptive Time Series Analysis and Predictive Time Series Analysis.
Descriptive Time Series Analysis involves exploring the data to identify patterns and trends, while
Predictive Time Series Analysis involves making predictions about future trends based on the
identified patterns and trends.
Time series analysis is commonly used in many areas, including finance, economics, marketing,
and operations research. It can be applied to a wide range of data, including sales figures, stock
prices, interest rates, and weather data.
To perform time series analysis, various techniques are used, such as trend analysis, seasonality
analysis, and autoregression. The selection of the appropriate technique depends on the nature of
the data and the purpose of the analysis.
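As an illustration of how seasonality can show up in the data, the sketch below computes the sample autocorrelation of a series at a given lag. It is a minimal example in Python under stated assumptions: the series is synthetic (a linear trend plus a 12-month cycle), and the function name is hypothetical.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    if denominator == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    numerator = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numerator / denominator


# Synthetic monthly data: four years of upward trend plus a 12-month seasonal cycle.
series = [100 + 0.5 * t + 10 * math.sin(2 * math.pi * t / 12) for t in range(48)]

acf_6 = autocorrelation(series, 6)    # half a seasonal cycle apart
acf_12 = autocorrelation(series, 12)  # one full seasonal cycle apart
print(f"lag 6: {acf_6:.2f}, lag 12: {acf_12:.2f}")
```

For data with a yearly cycle, the autocorrelation at lag 12 is noticeably higher than at lag 6, which is one simple signal that a seasonal technique may be appropriate.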
Regression Analysis: This technique uses historical data to establish a relationship between two or
more variables. The relationship is then used to make predictions about future trends.
Regression analysis is a statistical method used to examine the relationship between two or more
variables. It involves analyzing historical data to establish a mathematical relationship between a
dependent variable (the variable being predicted) and one or more independent variables
(predictors). The relationship is then used to make predictions about future values of the dependent
variable.
Regression analysis is commonly used in many fields, including economics, finance, marketing, and
engineering. It can be applied to a wide range of data, including sales figures, stock prices, interest
rates, and weather data.
There are several types of regression analysis, including simple linear regression, multiple linear
regression, and nonlinear regression. Simple linear regression involves a single independent
variable and is used to predict the dependent variable based on a straight-line relationship.
Multiple linear regression involves multiple independent variables and is used to predict the
dependent variable based on more complex relationships. Nonlinear regression involves a
nonlinear relationship between the dependent and independent variables and is used when the
relationship cannot be described by a straight line.
The accuracy of the predictions made using regression analysis depends on the quality of the data
and the validity of the mathematical relationship established between the variables. To ensure the
validity of the relationship, it is important to perform a thorough analysis of the data and to
carefully select the appropriate regression model.
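Simple linear regression, as described above, can be sketched in a few lines of Python using the ordinary least squares formulas. The variable names and the advertising and sales figures are hypothetical; this is an illustration, not a substitute for a statistics package.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least two points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: advertising spend (independent) and sales (dependent).
ad_spend = [1, 2, 3, 4, 5]
sales = [12, 14, 17, 19, 23]

intercept, slope = fit_simple_linear_regression(ad_spend, sales)
predicted_sales = intercept + slope * 6  # prediction for a spend level of 6
print(f"sales = {intercept:.2f} + {slope:.2f} * ad_spend; prediction at 6: {predicted_sales:.2f}")
```

The fitted slope estimates how much the dependent variable changes per unit change in the predictor; predictions far outside the observed range of the predictor should be treated with caution.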
Moving Averages: This technique involves calculating the average of past data over a specific
period of time, such as a month or quarter, to make predictions about future trends.
Exponential Smoothing: This technique involves adjusting past data to account for trends and other
factors, such as seasonality, to make predictions about future trends.

ARIMA (AutoRegressive Integrated Moving Average): This technique uses time series data to
analyze patterns and relationships, including trends and seasonality, to make predictions about
future trends.
Neural Networks: This technique uses artificial intelligence algorithms to analyze large amounts of
data and identify patterns, which can then be used to make predictions about future trends.
Decision Trees: This technique uses historical data to build a tree-like structure that can be used to
make predictions about future trends based on different scenarios.
Monte Carlo Simulation: This technique involves running multiple simulations based on random
sampling of historical data to make predictions about future trends.
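Two of the techniques listed above, moving averages and exponential smoothing, are simple enough to sketch directly. The Python example below shows a one-step-ahead forecast with each. It uses simple exponential smoothing, the most basic variant, which does not model trend or seasonality; the function names, the smoothing weight alpha = 0.5, and the sales figures are assumptions made for illustration.

```python
def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if not 0 < window <= len(series):
        raise ValueError("window must be between 1 and len(series)")
    return sum(series[-window:]) / window


def exponential_smoothing_forecast(series, alpha):
    """Forecast the next value with simple exponential smoothing.

    alpha close to 1 weights recent observations heavily;
    alpha close to 0 smooths more strongly.
    """
    if not series or not 0 < alpha <= 1:
        raise ValueError("series must be non-empty and alpha must be in (0, 1]")
    level = series[0]  # initialise the smoothed level with the first observation
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical quarterly sales figures.
quarterly_sales = [200, 220, 210, 230, 240]

ma_forecast = moving_average_forecast(quarterly_sales, window=3)
es_forecast = exponential_smoothing_forecast(quarterly_sales, alpha=0.5)
print(f"3-quarter moving average: {ma_forecast:.2f}")
print(f"Exponential smoothing (alpha=0.5): {es_forecast:.2f}")
```

The moving average gives equal weight to every observation in the window, while exponential smoothing gives gradually less weight to older observations.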

Business Forecasting Techniques


Business forecasting and planning can be conducted by either quantitative modeling methods or
qualitative modeling methods:
Quantitative Techniques in Business Forecasting
Quantitative forecasting is a long term business forecasting method concerned only with
measurable data such as statistics and historical data. Past performance is used to identify trends or
rates of change. These types of business forecasting are especially useful for long range forecasting
in business. Quantitative models include:
Trend Analysis Method: Also known as “Time Series Analysis,” this forecast method uses past data
to predict future events, excluding outliers and holding more recent data in higher regard. This
method is most effective when there is a large quantity of historical data showing clear and stable
trends. This is the most common and cost-effective method.
Econometric Modeling: This mathematical model makes use of several multiple-regression
equations to test the consistency of datasets over time and the significance of the relationship
between datasets, and to predict significant economic shifts and the potential effect of those shifts
on the company.
Indicator Approach: This approach follows the relationship between certain indicators and uses the
leading indicator data in order to estimate the performance of the lagging indicators. Lagging
indicators are a type of KPI that measure business performance subsequently and provide insight
into the impact of business strategies on the results achieved.
Qualitative Techniques in Business Forecasting
Qualitative forecasting relies on industry experts or “market mavens” to make short-term
predictions. These techniques are especially useful in forecasting markets for which there is
insufficient historical data to make statistically relevant conclusions. Qualitative models include:
Market Research: Polls and surveys are conducted with a large number of prospective consumers
regarding a specific product or service in order to predict the margin by which consumption will
either decrease or increase.
Delphi Model: A panel of experts is polled on their opinions regarding specific topics. Their
predictions are compiled anonymously, and a forecast is made.

What is the Importance of Forecasting in Business?


The use of forecasts in business management is indispensable for nearly every decision in every
industry. The use of business forecasting provides information that helps business managers
identify and understand weaknesses in their planning, adapt to changing circumstances, and
achieve effective control of business operations.
Some business forecasting examples include: determining the feasibility of facing existing
competition, measuring the possibility of creating demand for a product, estimating the costs of
recurring monthly bills, predicting future sales volumes based on past sales information, allocating
resources efficiently, forecasting earnings and budgeting, and scrutinizing the appropriateness of
management decisions.

Business forecasting software can help business managers and forecasters not only generate
forecast reports easily, but also better understand predictions and how to make strategic decisions
based on these predictions. A quality business forecast system should provide clear, real-time
visualization of business performance, which facilitates fast analysis and streamlined business
planning.
The application of forecasting in business is an art and a science, the combination of business
intelligence and data science, and the challenges of business forecasting often stem from poor
judgments and inexperience. Assumptions combined with unexpected events can be dangerous
and result in completely inaccurate predictions. Despite the limitations of business forecasting,
gaining any amount of insight into probable future trends will put an organization at a significant
advantage.

Time Series Forecasting: Definition, Applications, and Examples


Time series forecasting occurs when you make scientific predictions based on historical time
stamped data. It involves building models through historical analysis and using them to make
observations and drive future strategic decision-making. An important distinction in forecasting is
that at the time of the work, the future outcome is completely unavailable and can only be
estimated through careful analysis and evidence-based priors.

What is time series forecasting?


Figure: A Tableau workbook demonstrating a time series forecasting visualization.

Time series forecasting is the process of analyzing time series data using statistics and modeling to
make predictions and inform strategic decision-making. It is not always an exact prediction, and
the likelihood of forecasts can vary widely, especially when dealing with the commonly fluctuating
variables in time series data as well as factors outside our control. However, forecasting provides
insight into which outcomes are more likely, or less likely, to occur than other potential outcomes.
Often, the more comprehensive the data we have, the more accurate the forecasts can be. While
forecasting and “prediction” generally mean the same thing, there is a notable distinction. In some
industries, forecasting might refer to data at a specific future point in time, while prediction refers
to future data in general. Time series forecasting is often used in conjunction with time series
analysis. Time series analysis involves developing models to understand the data and its
underlying causes. Analysis can provide the “why” behind the outcomes you are seeing.
Forecasting then takes the next step: deciding what to do with that knowledge and extrapolating
what might happen in the future.

Applications of time series forecasting

Forecasting has a range of applications in various industries. Practical applications include
weather forecasting, climate forecasting, economic forecasting, healthcare forecasting,
engineering forecasting, finance forecasting, retail forecasting, business forecasting, environmental
studies forecasting, social studies forecasting, and more. Anyone who has consistent historical
data can analyze that data with time series analysis methods and then model, forecast, and
predict. For some industries, the entire point of time series analysis is to facilitate forecasting.
Some technologies, such as augmented analytics, can even automatically select forecasting from
among other statistical algorithms if it offers the most certainty.

4.3 When Time Series Forecasting should be used


Naturally, there are limitations when dealing with the unpredictable and the unknown. Time series
forecasting isn’t infallible and isn’t appropriate or useful for all situations. Because there really is no
explicit set of rules for when you should or should not use forecasting, it is up to analysts and data
teams to know the limitations of analysis and what their models can support. Not every model will
fit every data set or answer every question. Data teams should use time series forecasting when
they understand the business question and have the appropriate data and forecasting capabilities to
answer that question. Good forecasting works with clean, time-stamped data and can identify the
genuine trends and patterns in historical data. Analysts can tell the difference between random
fluctuations and outliers, and can separate genuine insights from seasonal variations. Time series
analysis shows how data changes over time, and good forecasting can identify the direction in
which the data is changing.

4.4 Time Series Forecasting Considerations


The first thing to consider is the amount of data at hand—the more points of observation you have,
the better your understanding. This is a constant across all types of analysis, and time series
analysis forecasting is no exception. However, forecasting relies heavily on the amount of data,
possibly even more so than other analyses. It builds directly off of past and current data. The less
data you have to extrapolate, the less accurate your forecasting will be.
Time horizons
The time frame of your forecast also matters. This is known as a time horizon—a fixed point in time
where a process (like the forecast) ends. It’s much easier to forecast a shorter time horizon with
fewer variables than it is a longer time horizon. The further out you go, the more unpredictable the
variables will be. Alternatively, having less data can sometimes still work with forecasting if you
adjust your time horizons. If you’re lacking long-term recorded data but you have an extensive
amount of short-term data, you can create short-term forecasts.
Dynamic and static states
The state of your forecasting and data makes a difference as to when you want to use it. Will the
forecast be dynamic or static? If the forecast is static, it is set in stone once it is made, so make sure
your data is adequate for a forecast. However, dynamic forecasts can be constantly updated with
new information as it comes in. This means you can have less data at the time the forecast is made,
and then get more accurate predictions as data is added.
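The difference between a static and a dynamic forecast can be sketched as follows. In this minimal Python illustration, the forecast is simply the mean of the history available at the time; the forecasting rule and the demand figures are hypothetical and chosen only to show the updating behaviour.

```python
def mean_forecast(history):
    """Forecast the next value as the mean of all observations so far."""
    if not history:
        raise ValueError("history must not be empty")
    return sum(history) / len(history)


# Hypothetical weekly demand: what was known when the forecast was first made,
# followed by the values that were observed afterwards.
initial_history = [50, 52, 51]
new_observations = [58, 60, 61]

# Static forecast: computed once and never revised.
static_forecast = mean_forecast(initial_history)

# Dynamic forecast: recomputed before each period using everything seen so far.
history = list(initial_history)
dynamic_forecasts = []
for observed in new_observations:
    dynamic_forecasts.append(mean_forecast(history))
    history.append(observed)

static_error = sum(abs(obs - static_forecast) for obs in new_observations)
dynamic_error = sum(
    abs(obs - fc) for obs, fc in zip(new_observations, dynamic_forecasts)
)
print(f"Total absolute error, static: {static_error:.2f}, dynamic: {dynamic_error:.2f}")
```

In this example demand shifts upward after the forecast is first made, so the dynamic forecast, which absorbs each new observation, accumulates less error than the static one.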
Data quality
As always with analysis, the best analysis is only useful if the data is of usable quality. Data that
is dirty, poorly processed, overly processed, or improperly collected can significantly skew
results and create wildly inaccurate forecasts. The typical guidelines for data quality apply here.
Make sure the data:

• is complete,
• is not duplicated or redundant,
• was collected in a timely and consistent manner,
• is in a standard and valid format,
• is accurate for what it is measuring,
• and is uniform across sets.
When dealing with time series analysis, it is even more important that the data was collected at
consistent intervals over the period of time being tracked. This helps account for trends in the data,
cyclic behavior, and seasonality. It also can help identify if an outlier is truly an outlier or if it is part
of a larger cycle. Gaps in the data can hide cycles or seasonal variation, skewing the forecast as a
result.
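Two of the checks described above, duplicated records and gaps in the collection interval, are straightforward to automate. The sketch below is a minimal Python example that assumes one observation per calendar day; the function name and the sample readings are hypothetical.

```python
from datetime import date, timedelta


def check_daily_series(observations):
    """Return (duplicate_dates, missing_dates) for a list of (date, value) pairs.

    Assumes the series is meant to contain exactly one observation per day.
    """
    if not observations:
        raise ValueError("observations must not be empty")
    seen = set()
    duplicates = []
    for day, _value in observations:
        if day in seen:
            duplicates.append(day)
        seen.add(day)
    # Walk every calendar day between the first and last observation.
    missing = []
    current = min(seen)
    last = max(seen)
    while current <= last:
        if current not in seen:
            missing.append(current)
        current += timedelta(days=1)
    return duplicates, missing


# Hypothetical daily sales readings with one duplicate and one gap.
readings = [
    (date(2024, 1, 1), 120),
    (date(2024, 1, 2), 135),
    (date(2024, 1, 2), 135),  # duplicated record
    (date(2024, 1, 4), 128),  # 3 January is missing
    (date(2024, 1, 5), 140),
]

duplicate_dates, missing_dates = check_daily_series(readings)
print("Duplicates:", duplicate_dates)
print("Missing:", missing_dates)
```

Running such a check before modeling makes gaps and duplicates visible, so they can be resolved rather than silently distorting trend and seasonality estimates.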

4.5 Examples of Time Series Forecasting


Here are several examples from a range of industries to make the notions of time series analysis
and forecasting more concrete:
Forecasting the closing price of a stock each day.
Forecasting product sales in units sold each day for a store.
Forecasting unemployment for a state each quarter.
Forecasting the average price of gasoline each day.
Things that are random will never be forecast accurately, no matter how much data we collect or
how consistently. For example: we can observe data every week for every lottery winner, but we
can never forecast who will win next. Ultimately, it is up to your data and your time series data
analysis as to when you should use forecasting, because forecasting varies widely due to various
factors. Use your judgment and know your data. Keep this list of considerations in mind to always
have an idea of how successful forecasting will be.
For as long as we have been recording data, time has been a crucial factor. In time series analysis,
time is a significant variable of the data. Time series analysis helps us study our world and learn
how we progress within it.
Time series analysis is a specific way of analyzing a sequence of data points collected over an
interval of time. In time series analysis, analysts record data points at consistent intervals over a set
period of time rather than just recording the data points intermittently or randomly. However, this
type of analysis is not merely the act of collecting data over time.
What sets time series data apart from other data is that the analysis can show how variables change
over time. In other words, time is a crucial variable because it shows how the data adjusts over the
course of the data points as well as the final results. It provides an additional source of information
and a set order of dependencies between the data.
Time series analysis typically requires a large number of data points to ensure consistency and
reliability. An extensive data set ensures you have a representative sample size and that analysis
can cut through noisy data. It also ensures that any trends or patterns discovered are not outliers
and can account for seasonal variance. Additionally, time series data can be used for forecasting—
predicting future data based on historical data.

4.6 Why Organizations use Time Series Data Analysis


Time series analysis helps organizations understand the underlying causes of trends or systemic
patterns over time. Using data visualizations, business users can see seasonal trends and dig deeper
into why these trends occur. With modern analytics platforms, these visualizations can go far
beyond line graphs.
When organizations analyze data over consistent intervals, they can also use time series forecasting
to predict the likelihood of future events. Time series forecasting is part of predictive analytics. It
can show likely changes in the data, like seasonality or cyclic behavior, which provides a better
understanding of data variables and helps forecast better.
For example, Des Moines Public Schools analyzed five years of student achievement data to
identify at-risk students and track progress over time. Today’s technology allows us to collect
massive amounts of data every day and it’s easier than ever to gather enough consistent data for
comprehensive analysis.
Time series analysis examples
Time series analysis is used for non-stationary data, that is, things that are constantly fluctuating
over time or are affected by time. Industries like finance, retail, and economics frequently use time
series analysis because currency and sales are always changing. Stock market analysis is an
excellent example of time series analysis in action, especially with automated trading algorithms.
Likewise, time series analysis is ideal for forecasting weather changes, helping meteorologists
predict everything from tomorrow’s weather report to future years of climate change. Examples of
time series analysis in action include:
Weather data
Rainfall measurements
Temperature readings
Heart rate monitoring (EKG)
Brain monitoring (EEG)
Quarterly sales
Stock prices
Automated stock trading
Industry forecasts
Interest rates
Time Series Analysis Types
Because time series analysis includes many categories or variations of data, analysts sometimes
must make complex models. However, analysts can’t account for all variances, and they can’t
generalize a specific model to every sample. Models that are too complex or that try to do too many
things can fit poorly. A model that underfits or overfits fails to distinguish between random error
and true relationships, leaving the analysis skewed and the forecasts incorrect.
Models of time series analysis include:
Classification: Identifies and assigns categories to the data.
Curve fitting: Plots the data along a curve to study the relationships of variables within the data.
Descriptive analysis: Identifies patterns in time series data, like trends, cycles, or seasonal variation.
Explanative analysis: Attempts to understand the data and the relationships within it, as well as
cause and effect.
Exploratory analysis: Highlights the main characteristics of the time series data, usually in a visual
format.
Forecasting: Predicts future data. This type is based on historical trends. It uses the historical data
as a model for future data, predicting scenarios that could happen along future plot points.
Intervention analysis: Studies how an event can change the data.
Segmentation: Splits the data into segments to show the underlying properties of the source
information.
Data classification
Further, time series data can be classified into two main categories:
Stock time series data means measuring attributes at a certain point in time, like a static snapshot of
the information as it was.
Flow time series data means measuring the activity of the attributes over a certain period, which is
generally part of the total whole and makes up a portion of the results.
Data variations
In time series data, variations can occur sporadically throughout the data:
Functional analysis can pick out the patterns and relationships within the data to identify notable
events.
