DEMGN801
Business Analytics
Edited by: Dr. Suresh Kashyap
Title: BUSINESS_ANALYTICS
Publisher Address: Lovely Professional University, Jalandhar Delhi GT road, Phagwara - 144411
ISBN: 978-93-94068-47-6
Objectives
overview of business analytics
scope of business analytics
application of business analytics
RStudio environment for business analytics
basics of R: packages
vectors in R programming
datatypes and data structures in R programming
Introduction
Business analytics is a crucial aspect of modern-day organizations that leverages data and
advanced analytical techniques to make data-driven decisions. The goal of business analytics is to
turn data into insights that can help organizations identify trends, measure performance, and
optimize processes.
One of the most significant benefits of business analytics is that it allows organizations to make
informed decisions based on real data instead of gut instincts or assumptions. This leads to better
decision-making and a more strategic approach to business operations. Additionally, business
analytics enables organizations to predict future trends and allocate resources more effectively,
thereby increasing efficiency and competitiveness.
Another advantage of business analytics is that it can help organizations understand their
customers better. By analyzing customer data, organizations can gain insights into customer
behavior, preferences, and buying patterns, which can help them tailor their products and services
to meet customer needs more effectively.
However, it is important to note that business analytics is not just about collecting and analyzing
data. It requires a deep understanding of statistical and mathematical models, as well as the ability
to effectively communicate insights to key stakeholders. Furthermore, organizations must ensure
that their data is of high quality and that their analytics systems are secure, to ensure that the
insights generated are accurate and trustworthy.
The following examples show how businesses can use data analytics to drive efficiency, improve
customer experiences, and make informed decisions.
Walmart
Walmart uses business analytics in several ways, including:
Supply Chain Optimization: Walmart uses data analytics to optimize its supply chain and
improve the efficiency of its operations.
Customer Insights: Walmart collects and analyzes data on customer shopping habits and
preferences to inform its marketing strategies and product offerings.
Inventory Management: Walmart uses data analytics to track inventory levels and sales patterns to
ensure that the right products are in stock at the right time.
Employee Management: Walmart uses data analytics to monitor employee productivity, schedule
management and reduce labor costs.
Pricing Strategies: Walmart uses data analytics to inform its pricing strategies, ensuring that it
remains competitive while maximizing profits.
Overall, Walmart leverages business analytics to gain insights and make data-driven decisions that
improve its operations and drive growth.
Uber
Uber uses business analytics in several ways:
Demand forecasting: To predict demand for rides and optimize pricing and driver incentives.
Customer segmentation: To better understand and target different customer segments.
Driver performance evaluation: To measure driver performance and identify areas for
improvement.
Route optimization: To determine the best routes for drivers and passengers, reducing travel time
and costs.
Fraud detection: To identify and prevent fraudulent activities, such as fake rides and fake drivers.
Marketing and promotions: To measure the effectiveness of marketing campaigns and
promotional offers.
Market expansion: To analyze new markets and determine the viability of expanding into new
cities and regions.
Google
Google uses business analytics in various ways:
Data-driven decision making: Google collects and analyzes massive amounts of data to inform its
decisions and strategies.
Customer behavior analysis: Google analyzes user data to understand customer behavior and
preferences, which helps with product development and marketing strategies.
Financial analysis: Google uses business analytics to track and forecast its financial performance.
Ad campaign optimization: Google uses analytics to measure the effectiveness of its advertising
campaigns and adjust them accordingly.
Market research: Google analyzes market trends and competitor activity to inform its business
strategies.
What Is R: Overview, Applications, and Uses
Since there are so many programming languages available today, it’s sometimes hard to decide
which one to choose. As a result, programmers often face the dilemma of too many good choices.
It’s enough to stop people in their tracks, paralyzed with indecision!
To combat this potential source of mental gridlock, we present an analysis of the R programming
language.
1.4 What Is R?
What better place to find a good definition of the language than the R Foundation’s website?
According to R-Project.org, R is “… a language and environment for statistical computing and
graphics.” It’s an open-source programming language often used as a data analysis and statistical
software tool.
R is a language and environment for statistical computing and graphics. It is a GNU project which
is similar to the S language and environment which was developed at Bell Laboratories (formerly
AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a
different implementation of S. There are some important differences, but much code written for S
runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests,
time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.
The S language is often the vehicle of choice for research in statistical methodology, and R provides
an Open Source route to participation in that activity.
One of R’s strengths is the ease with which well-designed publication-quality plots can be
produced, including mathematical symbols and formulae where needed. Great care has been taken
over the defaults for the minor design choices in graphics, but the user retains full control.
R is available as Free Software under the terms of the Free Software Foundation’s GNU General
Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and
similar systems (including FreeBSD and Linux), Windows and MacOS.
The R environment consists of an integrated suite of software facilities designed for data
manipulation, calculation, and graphical display. The environment features:
graphical facilities for data analysis and display, either on-screen or on hardcopy, and
a well-developed, simple and effective programming language which includes conditionals,
loops, user-defined recursive functions, and input and output facilities.
R was developed in 1993 by Ross Ihaka and Robert Gentleman and includes linear regression,
machine learning algorithms, statistical inference, time series, and more.
R is a universal programming language compatible with the Windows, Macintosh, UNIX, and
Linux platforms. It is often referred to as a different implementation of the S language and
environment and is considered highly extensible.
The term “environment” is intended to characterize it as a fully planned and coherent system,
rather than an incremental accretion of very specific and inflexible tools, as is frequently the case
with other data analysis software.
R, like S, is designed around a true computer language, and it allows users to add additional
functionality by defining new functions. Much of the system is itself written in the R dialect of S,
which makes it easy for users to follow the algorithmic choices made. For computationally-
intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can
write C code to manipulate R objects directly.
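As a brief sketch of the language features mentioned above (the function and variable names here are illustrative, not from the text):

```r
# A user-defined recursive function: factorial via recursion
fact <- function(n) {
  if (n <= 1) return(1)  # base case (conditional)
  n * fact(n - 1)        # recursive step
}

# A loop with a conditional: sum the even numbers from 1 to 10
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) total <- total + i
}

print(fact(5))  # 120
print(total)    # 30
```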
R has its own LaTeX-like documentation format, which is used to supply comprehensive
documentation, both on-line in a number of formats and in hardcopy.
R also has some drawbacks.
It's a complicated language. R has a steep learning curve, and it's best suited for people who
have previous programming experience.
It’s not as secure. R doesn’t have basic security measures. Consequently, it’s not a good choice for
making web-safe applications. Also, R can’t be embedded in web browsers.
It’s slow. R is slower than other programming languages like Python or MATLAB.
It takes up a lot of memory. Memory management isn’t one of R’s strong points. R’s data must be
stored in physical memory. However, the increasing use of cloud-based memory may eventually
make this drawback moot.
It doesn’t have consistent documentation/package quality. Docs and packages can be patchy and
inconsistent, or incomplete. That’s the price you pay for a language that doesn’t have official,
dedicated support and instead is maintained and added to by the community.
Why use R
R is a state-of-the-art programming language for statistical computing, data analysis, and machine
learning. It has been around for almost three decades with over 12,000 packages available for
download on CRAN. This means that there is an R package that supports whatever type of analysis
you want to perform. Here are a few reasons why you should learn and use R:
Free and open-source: The R programming language is open-source and is issued under the
General Public License (GNU). This means that you can use all the functionalities of R for free
without any restrictions or licensing requirements. Since R is open-source, everyone is welcome to
contribute to the project, and since it’s freely available, bugs are easily detected and fixed by the
open-source community.
Popularity: The R programming language was ranked 7th in the 2021 IEEE Spectrum ranking of
top programming languages and 12th in the TIOBE Index ranking of January 2022. It’s the second
most popular programming language for data science just behind Python, according to edX, and it
is the most popular programming language for statistical analysis. R’s popularity also means that
there is extensive community support on platforms like Stack Overflow. R also has detailed
online documentation that R users can consult for help.
A language for data analytics and data science: The R programming language isn’t a general-
purpose programming language. It’s a specialized programming language for statistical
computing. Therefore, most of R’s functions carry out vectorized operations, meaning you don’t
need to loop through each element. This makes running R code very fast. Distributed computing
can be executed in R, whereby tasks are split among multiple processing computers to reduce
execution time. R is integrated with Hadoop and Apache Spark, and it can be used to process large
amount of data. R can connect to all kinds of databases, and it has packages to carry out machine
learning and deep learning operations.
Opportunity to pursue an exciting career in academe and industry: The R programming language
is trusted and extensively used in the academic community for research. R is increasingly being
used by government agencies, social media, telecommunications, financial, e-commerce,
manufacturing, and pharmaceutical companies. Top companies that use R include Amazon,
Google, ANZ Bank, Twitter, LinkedIn, Thomas Cook, Facebook, Accenture, Wipro, the New York
Times, and many more. A good mastery of the R programming language opens all kinds of
opportunities in academe and industry.
R is also prized for statistical computing and data manipulation. With its vast libraries and
packages, R is popular in industries such as finance, healthcare, and e-commerce, as well as
academia and research institutions.
Although R is a popular language used by many programmers, it is especially effective when used
for
Data analysis
Statistical inference
Machine learning algorithms
R offers a wide variety of statistics-related libraries and provides a favorable environment for
statistical computing and design. In addition, the R programming language is used by many
quantitative analysts as a programming tool, since it's useful for data importing and cleaning.
As of August 2021, R is one of the top five programming languages of the year, so it’s a favorite
among data analysts and research programmers. It’s also used as a fundamental tool for finance,
which relies heavily on statistical data.
This graph, provided by Stack Overflow, gives you a better idea of R programming language
usage in recent history. Given its strength in statistics, it's hardly surprising that R enjoys heavy
use in the world of academia, as illustrated on the chart.
If you’re looking for specifics, here are ten significant companies or organizations that use R,
presented in no particular order.
Airbnb
Microsoft
Uber
Facebook
Ford
Google
Twitter
IBM
American Express
HP
When you have downloaded and installed R, you can run it on your computer.
The screenshot below shows how it may look when you run R on a Windows PC:
Installing R on Windows OS
Click on the "install R for the first time" link to download the R executable (.exe) file.
Run the R executable file to start the installation, and allow the app to make changes to your
device.
R has now been successfully installed on your Windows OS. Open the R GUI to start writing R
code.
Additional R interfaces
Other than the R GUI, the other ways to interface with R include RStudio Integrated Development
Environment (RStudio IDE) and Jupyter Notebook. To run R on RStudio, you first need to install R
on your computer, while to run R on Jupyter Notebook, you need to install an R kernel. RStudio
and Jupyter Notebook provide an interactive and friendly graphical interface to R that greatly
improves users’ experience.
Run the RStudio Executable file (.exe) for Windows OS or the Apple Image Disk file (.dmg) for
macOS X.
RStudio is now successfully installed on your computer. The RStudio Desktop IDE interface is
shown in the figure below:
1.9 R packages
R packages are collections of functions, data, and compiled code that can be used to extend the
capabilities of R. There are thousands of R packages available, covering a wide range of topics,
including statistics, machine learning, data visualization, and more. Installing and using R
packages is an essential part of working with R, and many packages are designed to be easy to
install and use, with clear documentation and examples.
Tidyverse: The tidyverse is a collection of R packages designed for data science. It includes
packages for data manipulation (dplyr), data visualization (ggplot2), data import (readr), and
data tidying (tidyr), among others. The packages in the tidyverse are designed to work together
seamlessly, and they share a common design philosophy, which emphasizes simplicity,
consistency, and understanding. The tidyverse is particularly popular among R users due to its ease
of use, intuitive syntax, and wide range of capabilities, making it a great choice for data analysis
tasks of all types and complexity levels.
ggplot2: ggplot2 is a data visualization library for the R programming language. It provides a
high-level interface for creating statistical graphics. ggplot2 uses a grammar of graphics to build
complex plots from basic components, allowing users to quickly create sophisticated visualizations
of their data. The library is highly customizable and flexible, allowing users to specify a wide range
of visual elements such as colors, shapes, sizes, and labels.
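As a minimal sketch of this grammar-of-graphics style (assuming ggplot2 is installed; the built-in mtcars data set is used for illustration):

```r
library(ggplot2)

# Build a scatter plot layer by layer: data, aesthetic mappings, geometry, labels
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(p)  # render the plot
```

Each `+` adds one component of the plot, which is the "grammar" the paragraph above describes.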
dplyr: dplyr is a data manipulation library for R. It provides a set of functions that allow users to
perform common data manipulation tasks such as filtering, summarizing, transforming, and
aggregating data. dplyr is designed to be fast, efficient, and easy to use, and it operates on data
frames and tibbles, making it a popular choice for data wrangling and exploration. The library is
particularly well-suited for working with large datasets, as it provides optimized implementations
for many common data manipulation operations. The syntax of dplyr functions is highly readable
and intuitive, and the library is widely used by data scientists and analysts for data preparation and
exploration.
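A short sketch of typical dplyr verbs on the built-in mtcars data (the derived column name kpl is made up for illustration):

```r
library(dplyr)

# Filter rows, add a derived column, then summarize
result <- mtcars %>%
  filter(cyl == 4) %>%             # keep only 4-cylinder cars
  mutate(kpl = mpg * 0.4251) %>%   # convert miles/gallon to km/litre
  summarize(mean_kpl = mean(kpl),  # average efficiency
            n = n())               # number of rows kept

print(result)
```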
tidyr: tidyr is a library for the R programming language that provides tools for "tidying" data. In
the context of data science and analysis, tidying data means restructuring it into a format that is
more suitable for analysis, visualization, and modeling. tidyr provides a suite of functions for
transforming data from a wide variety of formats into a more structured, "tidy" format. This makes
it easier to work with the data and perform common data manipulation tasks such as aggregating,
filtering, and summarizing. The library is designed to work seamlessly with other R libraries such
as dplyr, making it a popular choice for data preparation and wrangling tasks.
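A minimal tidying sketch using tidyr's pivot_longer() (the wide table below is made-up illustration data):

```r
library(tidyr)

# A "wide" table: one column per year
wide <- data.frame(product = c("A", "B"),
                   y2021   = c(10, 20),
                   y2022   = c(15, 25))

# Reshape to tidy (long) form: one row per product-year observation
long <- pivot_longer(wide, cols = c(y2021, y2022),
                     names_to = "year", values_to = "sales")

print(long)  # 4 rows: product, year, sales
```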
Shiny: Shiny is a web application framework for R. It allows R developers to create interactive,
web-based data applications without needing to learn HTML, CSS, or JavaScript. Shiny provides a
simple, high-level syntax for building user interfaces and tying them to R code for data analysis,
visualization, and modeling. Applications built with Shiny can be run locally or hosted on a web
server, making it easy to share results with collaborators and stakeholders. The framework is highly
customizable and can be extended using HTML, CSS, and JavaScript, allowing developers to create
complex, interactive applications with rich user interfaces. Shiny is widely used in data science and
analytics for creating dashboards, data visualization tools, and other interactive applications.
1.10 Vector in R
In R, a vector is a basic data structure that represents an ordered collection of values of the same
type (numeric, character, logical, etc.). Vectors are the simplest type of data structure in R and are
used as the building blocks for more complex data structures such as arrays, data frames, and lists.
A vector can be created using the c() function and can be indexed, sliced, and manipulated using
various R functions and operators. In R, vectors are used for representing variables, input data, and
intermediate results of computations. They play a crucial role in many data analysis and modeling
tasks and are an essential part of the R programming language.
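The points above can be illustrated with a short base R sketch:

```r
# Create a numeric vector with c()
v <- c(10, 20, 30, 40, 50)

# Indexing is 1-based; slicing and logical indexing are built in
print(v[1])       # first element: 10
print(v[2:4])     # elements 2 through 4: 20 30 40
print(v[v > 25])  # logical indexing: 30 40 50

# Arithmetic is vectorized (applied element-wise)
print(v * 2)      # 20 40 60 80 100
```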
R also has several other specialized data structures, such as lists, matrices, and data frames.
Data frames: two-dimensional, tabular structures of heterogeneous data, with rows and columns
labeled.
Each of these data structures can be created and manipulated in various ways in R, and many
functions are available for operating on them.
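A brief sketch of these structures (the values are illustrative):

```r
# list: ordered collection whose elements may differ in type
lst <- list(name = "R", extensible = TRUE, packages = c("dplyr", "ggplot2"))

# matrix: two-dimensional array of a single type
m <- matrix(1:6, nrow = 2, ncol = 3)

# data frame: tabular structure whose columns may differ in type
df <- data.frame(id = 1:3, score = c(8.5, 9.1, 7.3))

print(lst$name)  # "R"
print(dim(m))    # 2 3
print(nrow(df))  # 3
```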
Summary
Business analytics is the practice of examining data and using statistical analysis and other methods
to gain insights into the performance and efficiency of a business. It involves the use of data,
statistical algorithms, and technology to uncover hidden patterns and knowledge from large data
sets, and is used to inform decision making and guide the development of strategies and plans.
The goal of business analytics is to improve decision-making, streamline processes, and gain a
competitive advantage through the use of data and predictive modeling. It can be applied in
various areas of a business, such as sales and marketing, supply chain management, finance, and
operations.
Business analytics typically involves several key steps: data collection, data cleaning and
preparation, data analysis, and communication of results. Data scientists and other professionals
use statistical and mathematical methods, such as regression analysis and predictive modeling, to
analyze the data and extract insights. The results of these analyses are then used to inform
decisions, support business strategy development, and identify opportunities for improvement.
In recent years, the rapid growth of digital data and advancements in technology have made it
easier for organizations to collect and analyze large amounts of data, leading to the widespread
adoption of business analytics across a wide range of industries.
Keywords
Business analytics, Descriptive analytics, Predictive analytics, Prescriptive analytics, R
Programming
Self Assessment
1. Which of the following fields below typically make use of Data Mining techniques?
A. Advertising
B. Government Intelligence
C. Airline Industry
D. All of the above
3. Which is the R command for obtaining 1000 random numbers through normal distribution
with mean 0 and variance 1?
A. norm(1000, 0, 1)
B. rnorm(0, 1, 1000)
C. rnorm(1000, 0, 1)
D. qnorm(0, 1, 1000)
4. For the population y<-c(1,2,3,4,5), write the R command to find the mean?
A. mean{y}
B. means(y)
C. mean(y)
D. mean[y]
7. The first step in the process is _____________. Data relevant to the applicant is collected. The
quality, quantity, validity, and nature of data directly impact the analytical outcome. A
thorough understanding of the data on hand is extremely critical.
A. Results
B. Put Into Use
C. Data Collection
D. Model Building
8. Usually raw data is not in a format that can be directly used to perform data analysis. In
very simple terms, most platforms require data to be in a matrix form with the variables
being in different columns and rows representing various observations. Data may be
available in structured, semi-structured, and unstructured form.
A. Data Collection
B. Data Preparation
C. Data Analysis
D. Model Building
9. Once data is converted into a structured format, the next stage is to perform ___________. At
this stage underlying trends in the data are identified. This step can include fitting a linear
or nonlinear regression model, performing principal component analysis or cluster analysis,
identifying if data is normally distributed or not.
A. Data Collection
B. Data Preparation
C. Data Analysis
D. Model Building
10. We need to _________ the data we collected, using the data model to assess and query the
data collected in the process.
A. Data
B. Analyze
C. Generate Reports
D. Smarter Decisions
11. __________ consists of acquiring the data, implementing advanced data processes, distributing
the data effectively, and managing oversight data.
A. Artificial Intelligence
B. Growing Importance of The CDO & CAO
C. Data Discovery
D. Data Quality Management (DQM)
12. __________ is the science aiming to make machines execute what is usually done by complex
human intelligence.
A. Data Discovery
B. Artificial Intelligence
C. Collaborative Business Intelligence
D. Consumer Experience
13. Predictive analytics is widely used by both conventional retail stores as well as e-commerce
firms for analyzing their historical data and building models for customer engagement,
supply chain optimization, price optimization, and space optimization and assortment
planning.
A. Retail Industry
B. Telecom Industry
C. Health Industry
D. Finance Industry
Answers: 6. A 7. C 8. B 9. A 10. B
Review Questions
1. What is business analytics and how does it differ from traditional business intelligence?
2. What are the key steps involved in the business analytics process?
3. How can data visualization be used to support business decision-making?
4. What is data mining and how is it used in business analytics?
5. What is predictive analytics and how does it differ from descriptive analytics?
6. What are some common techniques used in predictive modeling, such as regression
analysis, decision trees, and neural networks?
7. How can business analytics be used to support customer relationship management
(CRM)?
8. What are some common applications of business analytics in areas such as supply chain
management, marketing, and finance?
Further Readings
https://business.wfu.edu/masters-in-business-analytics/articles/what-is-analytics/#:~:text=Business%20analytics%20is%20the%20process,to%20create%20insights%20from%20data.
Business Analytics, 2ed: The Science of Data-Driven Decision Making by U. Dinesh Kumar.
Objectives
discuss one-variable and two-variable statistics
overview of functions to summarize variables
implement the select, filter, and mutate functions
use the arrange, summarize, and group_by functions
demonstrate the concept of the pipe operator
Introduction
R is a programming language and software environment for statistical computing and graphics. It
was developed in 1993 by Ross Ihaka and Robert Gentleman at the University of Auckland, New
Zealand. R provides a wide range of statistical and graphical techniques and is highly extensible,
allowing users to write their own functions and packages.
One of the main strengths of R is its ability to handle and visualize complex data. It has a large and
active community of developers, who have contributed over 15,000 packages to the Comprehensive
R Archive Network (CRAN). These packages cover a wide range of topics, including machine
learning, time series analysis, Bayesian statistics, social network analysis, and many others.
In addition to its statistical and graphical capabilities, R also provides a flexible and interactive
programming environment. R code can be run from the command line, from scripts, or from within
a graphical user interface (GUI) such as RStudio. R supports various data structures such as vectors,
matrices, data frames, and lists, and it has a rich set of functions for data manipulation and
transformation.
R is widely used in academia, industry, and government for data analysis, statistical modelling, and
data visualization. It is also a popular choice for reproducible research, as the code and data used in
an analysis can be easily shared and documented.
In summary, R is a powerful and versatile language for data analysis and statistical computing,
with a large community of users and developers and a wide range of tools and techniques.
# Find the maximum of the numbers 4 to 6
print(max(4:6))
# Find the minimum of the numbers 4 to 6
print(min(4:6))
#Calculate the square root of a number
sqrt(16)
#Calculate the natural logarithm of a number
log(10)
#Calculate the exponential function
exp(2)
#Calculate the sine of an angle (in radians)
sin(pi/4)
#Calculate the sum of two numbers
x <- 2
y <- 3
x+y
#Calculate the difference of two numbers
x-y
#Calculate the product of two numbers
x*y
#Calculate the quotient of two numbers
x/y
#Calculate the power of a number
x^y
#Calculate the cosine of an angle (in radians)
cos(pi/3)
#Calculate the tangent of an angle (in radians)
tan(pi/4)
#Calculate the inverse sine of a value
asin(1)
#Calculate the inverse cosine of a value
acos(0.5)
#Calculate the inverse tangent of a value
atan(1)
#Calculate the mean and standard deviation of a vector x:
x <- c(1, 2, 3, 4, 5)
mean(x)
sd(x)
#Calculate the median and quartiles of a vector x:
x <- c(1, 2, 3, 4, 5)
median(x)
quantile(x)
# resultList is assumed to be a named list built in an earlier (omitted) example;
# the values below are illustrative placeholders
resultList <- list(Area = 20, Perimeter = 18)
print(resultList["Area"])
print(resultList["Perimeter"])
# A simple R program to demonstrate an inline (one-line) function
f <- function(x) x^2 * 4 + x / 3
print(f(4))   # 65.33333
print(f(-2))  # 15.33333
print(f(0))   # 0
# condition
filter(df, x<50 & z==TRUE)
Output:
x y z
1 12 22.1 TRUE
2 31 44.5 TRUE
# create a vector of numbers
x <- c(1, 2, 3, 4, 5, 6)
# keep elements that are greater than 3 (logical indexing)
result <- x[x > 3]
# print the filtered result
print(result)
# Output:
# [1] 4 5 6
In this example, logical indexing with the condition x > 3 returns a new vector containing only
the elements of x that are greater than 3. (Note that dplyr's filter() operates on data frames, not
plain vectors.)
# Creating a vector of numbers
numbers <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# Using base R's Filter() function to extract only even numbers
even_numbers <- Filter(function(x) x %% 2 == 0, numbers)
# Printing the filtered numbers
even_numbers
#Output
[1] 2 4 6 8 10
# starwars is a data set bundled with dplyr (load it with library(dplyr))
filter(starwars, species == "Human")
filter(starwars, mass > 1000)
# Multiple criteria
filter(starwars, hair_color == "none" & eye_color == "black")
filter(starwars, hair_color == "none" | eye_color == "black")
# Multiple arguments are equivalent to and
filter(starwars, hair_color == "none", eye_color == "black")
# Load library dplyr
library(dplyr)
# Load iris dataset
data(iris)
# Select only Sepal.Length and Species columns
iris_select <- iris %>% select(Sepal.Length, Species)
# View the first 6 rows
head(iris_select)
# Load library dplyr
library(dplyr)
summarize_if()
In this function, we specify a condition and the summary will be generated if the condition is
satisfied.
library(dplyr)
# Main code
data <- mtcars
z <- head(data)
data %>% summarize_if(is.numeric, mean)
In the code snippet above, we pipe the data frame into summarize_if() with the predicate function
is.numeric and mean as the action, so the mean of every numeric column is returned.
# Group a sales data frame by region (df is assumed to contain Region and Profit columns)
df_grp_region <- df %>% group_by(Region) %>%
summarize(total_profits = sum(Profit),
.groups = 'drop')
View(df_grp_region)
# Example 3
mtcars %>%
group_by(cyl) %>%
summarize(mean_mpg = mean(mpg),
n = n())
# Example 4
mtcars %>%
mutate(cyl_factor = factor(cyl),
hp_group = cut(hp, breaks = c(0, 50, 100, 150, 200),
labels = c("low", "medium", "high", "very high"))) %>%
group_by(cyl_factor, hp_group) %>%
summarize(mean_mpg = mean(mpg),
n = n())
In the second example, the mtcars data set is first filtered to keep only the mpg and hp columns,
and then only the first six rows are displayed.
In the third example, the mtcars data set is grouped by the number of cylinders (cyl) and the mean
miles per gallon (mpg) and number of observations (n) are calculated for each group.
In the fourth example, two new variables are created and added to the mtcars data set. The number
of cylinders (cyl) is converted to a factor and a new variable (cyl_factor) is created to represent this
factor. Another new variable (hp_group) is created by dividing the horsepower (hp) into groups
using the cut function. The data set is then grouped by the two new variables, and the mean miles
per gallon (mpg) and number of observations (n) are calculated for each group.
Summary
There are many ways to summarize business data in R, depending on the type of data you are
working with and the goals of your analysis. Here are a few common methods for summarizing
business data. Descriptive statistics: You can use base R functions such as mean, median, sum, min,
max, and quantile to calculate common summary statistics for your data. For example, you can
calculate the mean, median, and standard deviation of a variable of interest.
Grouping and aggregating: You can use the group_by and summarize functions from the dplyr
package to group your data by one or more variables and calculate summary statistics for each
group. For example, you can group sales data by product and calculate the total sales for each
product.
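The grouping idea can also be sketched in base R with aggregate() (the sales data below is made up for illustration):

```r
# Hypothetical sales records
sales <- data.frame(product = c("A", "B", "A", "B", "A"),
                    amount  = c(100, 150, 200, 50, 120))

# Total sales per product
totals <- aggregate(amount ~ product, data = sales, FUN = sum)
print(totals)
#   product amount
# 1       A    420
# 2       B    200
```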
Cross-tabulation: You can use the table function to create cross-tabulations (also known as
contingency tables) of your data. For example, you can create a cross-tabulation of sales data by
product and region.
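A minimal cross-tabulation sketch with base R's table() (the transaction data is made up for illustration):

```r
# Hypothetical transactions
tx <- data.frame(product = c("A", "A", "B", "B", "B", "A"),
                 region  = c("North", "South", "North", "North", "South", "North"))

# Count transactions by product and region
tab <- table(tx$product, tx$region)
print(tab)
#     North South
#   A     2     1
#   B     2     1
```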
Visualization: You can use various plotting functions, such as barplot, histogram, and boxplot, to
create visual representations of your data. Visualization can help you quickly identify patterns and
relationships in your data.
Keywords
dplyr, R packages, group by, pipe operator, summarize.
Self Assessment
1. Descriptive analysis tells about ________?
A. Past
B. Present
C. Future
D. Previous
8. Which of the following finds the maximum value in the vector x, excluding missing values?
A. rm(x)
B. all(x)
C. max(x, na.rm=TRUE)
D. x%in%y
10. Which of the following return a subset of the columns of a data frame?
A. select
B. retrieve
C. get
D. set
B. R has an internal implementation of data frames that is likely the one you will use most
often
C. There are packages on CRAN that implement data frames via things like relational
databases that allow you to operate on very very large data frames
D. All of the mentioned
12. _________ generates summary statistics of different variables in the data frame, possibly
within strata.
A. rename
B. summarize
C. set
D. subset
14. The _______ operator is used to connect multiple verb actions together into a pipeline.
A. pipe
B. piper
C. start
D. end
15. The dplyr package can be installed from CRAN using __________
A. installall.packages(“dplyr”)
B. install.packages(“dplyr”)
C. installed.packages(“dplyr”)
D. installed.packages(“dpl”)
6. A 7. C 8. B 9. A 10. A
Review Questions
1) Use the iris data set with the group_by and summarize functions.
2) Discuss the pipe operator in R.
Further reading
"An Introduction to R" by W. N. Venables, D. M. Smith, and the R Development Core Team
https://www.r-bloggers.com
Dr. Mohd Imran Khan, Lovely Professional University
Unit 03: Business Data Visualization
Objectives
To analyse data visualization in business context.
To discover the purpose of basic graphs.
To understand the grammar of graphics.
To visualize basics graphs using ggplot2.
To visualize some advanced graphs.
Introduction
Business data visualization is the representation of business data and information using charts,
graphs, maps, infographics, and other visual elements. The goal of data visualization in a business
context is to make complex data easy to understand, reveal patterns and trends, and support
decision-making processes. By transforming raw data into graphical form, it communicates
information in a way that is easy to interpret and enhances communication.
Data visualization offers several benefits to businesses, including:
Improved communication: By using visual representations, data visualization makes it easier for
individuals to understand and interpret data, which leads to better communication and
collaboration among team members.
Increased Insights: Data visualization allows companies to identify patterns and trends in data that
would be difficult to detect through raw data analysis. This leads to new insights and a better
understanding of the data.
Better Decision-Making: Data visualization provides a visual representation of data that supports
decision-making processes. By presenting data in a way that is easy to understand and interpret,
decision-makers can make informed decisions based on accurate data analysis.
Enhanced Presentations: Data visualization adds a visual component to presentations, making
them more engaging and effective for communicating data.
3.4 ggplot2
ggplot2 is a plotting library for the R programming language, used for creating sophisticated
graphics. It was created by Hadley Wickham and is based on the principles of the grammar of
graphics, which provides a flexible structure for building complex visualizations from simple
components.
One of the key features of ggplot2 is that it allows users to build plots layer by layer, by adding
components such as data, aesthetics (mapping variables to visual properties), geoms
(representations of data, such as points, lines, or bars), and statistics (such as regression lines or
smoothing splines). This approach makes it easier to understand and control the appearance of the
final plot.
ggplot2 also has a large and active user community, which has contributed a variety of additional
packages and extensions that enhance its functionality. As a result, ggplot2 is widely used in
academia and industry, and has become one of the most popular plotting libraries for R.
The library is highly extensible, with a large number of plugins and extensions available that allow
you to create custom visualizations or fine-tune existing ones. Additionally, ggplot2 is designed to
play well with other R packages, such as dplyr and tidyr, which makes it easy to manipulate and
transform your data before creating a visualization.
Some of the key features of ggplot2 include:
A wide variety of plot types, including scatterplots, bar plots, line plots, histograms, density
plots, box plots, and more.
Customization of every aspect of a plot, from the axis labels and titles to the colors and
themes.
Built-in support for facets, which allow you to create multiple subplots that share the same
scales and aesthetics.
The ability to combine multiple layers into a single plot, and to add smooth fits, regression
lines, and other statistical summaries to a plot.
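As a brief illustration of two features from the list above, the hedged sketch below combines a regression-line layer with facets, splitting the mtcars scatter plot into one subplot per cylinder count; the variables chosen are just for demonstration.

```r
library(ggplot2)
data(mtcars)

# One scatter subplot per cylinder count, sharing the same scales and aesthetics
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # add a fitted regression line per panel
  facet_wrap(~ cyl)
```

Because facet_wrap builds every panel from the same layered specification, each subplot automatically gets its own regression line without any extra code.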
ggplot2 has a number of advantages over other data visualization tools, including:
Consistency: ggplot2 provides a consistent syntax for creating visualizations, making it easier to
learn and use.
Customization: ggplot2 is highly customizable, allowing you to create visualizations that meet
your specific needs.
Extendibility: ggplot2 is designed to be extended and modified, making it easy to create new
visualizations or modify existing ones.
Large Community: ggplot2 has a large and active community of users who provide support,
resources, and tutorials.
ggplot2 is widely used in the R community and is considered to be one of the best data
visualization libraries for R. It provides a powerful and flexible platform for creating professional-
looking visualizations and has a large and active user community that provides support and
develops new extensions and packages.
The syntax of ggplot2 can be broken down into three main components:
The data: You start by specifying the data you want to visualize. This can be a data frame or a
tibble in R.
The aesthetics: Next, you define the visual mappings, or aesthetics, between the variables in your
data and the visual elements of the plot, such as the x and y positions, color, size, etc.
The geometry: Finally, you specify the type of plot you want to create, such as a scatter plot, bar
plot, histogram, etc., using a geom (short for geometry).
Here's a simple example that demonstrates the basic syntax of ggplot2:
library(ggplot2)
# Load the data
data(mtcars)
# Create the plot
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point()
In this example, the data is mtcars, and the aesthetics are defined as x = wt and y = mpg. The
geom_point() function is used to specify that we want a scatter plot.
The syntax of ggplot2 can be quite dense, but it's also highly expressive and allows for fine-grained
control over the appearance and behavior of your visualizations. With practice, you'll find that you
can create complex and beautiful plots with just a few lines of code.
Few more examples
Barplot
library(ggplot2)
# Load the data
data(mtcars)
# Create the plot
ggplot(data = mtcars, aes(x = factor(cyl))) +
geom_bar(fill = "blue") +
xlab("Number of Cylinders") +
ylab("Count") +
ggtitle("Count of Cars by Number of Cylinders")
Line plot
library(ggplot2)
# Load the data
data(economics)
# Create the plot
ggplot(data = economics, aes(x = date, y = uempmed)) +
geom_line(color = "red") +
xlab("Year") +
ylab("Unemployment Rate") +
ggtitle("Unemployment Rate Over Time")
Histogram
library(ggplot2)
# Load the data
data(mtcars)
# Create the plot
ggplot(data = mtcars, aes(x = mpg)) +
geom_histogram(fill = "blue", binwidth = 2) +
xlab("Miles Per Gallon") +
ylab("Frequency") +
ggtitle("Histogram of Miles Per Gallon")
Boxplot
library(ggplot2)
# Load the data
data(mtcars)
# Create the plot
ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
geom_boxplot(fill = "blue") +
xlab("Number of Cylinders") +
ylab("Miles Per Gallon") +
ggtitle("Box Plot of Miles Per Gallon by Number of Cylinders")
These are just a few examples to get you started. You can create many more complex and
interesting visualizations using ggplot2 by combining different geoms, adjusting the aesthetics, and
adding additional elements such as faceting, themes, and annotations.
First specify the data object. It has to be a data frame, and it needs one numeric and one
categorical variable.
Then come the aesthetics, set in the aes() function: map the categorical variable to the X axis and
the numeric variable to the Y axis.
Finally call geom_bar(). You have to specify stat = "identity" for this kind of dataset.
Most basic bar plot
# Load ggplot2
library(ggplot2)
# Create data
data <- data.frame(
name=c("A","B","C","D","E") ,
value=c(3,12,5,18,45)
)
# Barplot
ggplot(data, aes(x=name, y=value)) +
geom_bar(stat = "identity")
# 2: Using hue
ggplot(mtcars, aes(x=as.factor(cyl), fill=as.factor(cyl) )) +
geom_bar( ) +
scale_fill_hue(c = 40) +
theme(legend.position="none")
# 3: Using RColorBrewer
ggplot(mtcars, aes(x=as.factor(cyl), fill=as.factor(cyl) )) +
geom_bar( ) +
scale_fill_brewer(palette = "Set1") +
theme(legend.position="none")
# 4: Using greyscale:
ggplot(mtcars, aes(x=as.factor(cyl), fill=as.factor(cyl) )) +
geom_bar( ) +
scale_fill_grey(start = 0.25, end = 0.75) +
theme(legend.position="none")
# 5: Set manually
ggplot(mtcars, aes(x=as.factor(cyl), fill=as.factor(cyl) )) +
geom_bar( ) +
scale_fill_manual(values = c("red", "green", "blue") ) +
theme(legend.position="none")
Line Color
The command color is used and the desired color is written in double quotes (" ") inside
geom_line( ).
library(ggplot2)
# Create data for chart
val <-data.frame(course=c('DSA','C++','R','Python'),
num=c(77,55,80,60))
# Format the line color
ggplot(data=val, aes(x=course, y=num, group=1)) +
geom_line(color="green")+
geom_point()
Line Size
The line size can be changed using the command size and providing the value of the size inside
geom_line( ).
library(ggplot2)
# Create data for chart
val <-data.frame(course=c('DSA','C++','R','Python'),
num=c(77,55,80,60))
# Format the line size
ggplot(data=val, aes(x=course, y=num, group=1)) +
geom_line(color="green",size=1.5)+
geom_point()
Correlogram
A correlogram is a graph of a correlation matrix. It is useful to highlight the most correlated
variables in a data table. In this plot, correlation coefficients are colored according to their value.
The correlation matrix can also be reordered according to the degree of association between
variables.
Use the ggcorrplot() function to draw a correlogram:
library(ggcorrplot)
# Load the data
data(mtcars)
# Calculate the correlation matrix
cor_mat <- cor(mtcars)
# Create the plot
ggcorrplot(cor_mat, method = "circle", hc.order = TRUE, type = "lower",
lab = TRUE, lab_size = 3)
In this example, the cor function is used to calculate the pairwise correlations between the variables
in the mtcars dataset. The ggcorrplot function is then used to create the correlogram, with
method = "circle" drawing a circle for each cell whose color represents the sign and strength of the
correlation coefficient (positive correlations in red and negative correlations in blue by default).
Violin Plot
A violin plot is a type of plot that combines aspects of both box plots and kernel density plots, and
is used to visualize the distribution of a numerical variable, showing its full probability density
rather than only summary statistics.
# First, install and load the ggplot2 library
library(ggplot2)
# Generate some sample data
set.seed(123)
x <- rnorm(100)
group <- rep(c("Group 1", "Group 2"), 50)
# Prepare the data into a format that can be plotted
df <- data.frame(x = x, group = group)
# Create the violin plot using ggplot2
ggplot(df, aes(x = group, y = x, fill = group)) +
geom_violin() +
labs(x = "Group", y = "X")
In this example, the ggplot() function is used to specify the plot, with group and x as the aesthetic
mappings. The geom_violin() layer is then added to the plot to create the violin plot. The labs()
function is used to add labels to the x- and y-axes.
In this example, the x variable is drawn from a normal distribution and assigned to two different
groups, "Group 1" and "Group 2". The violin plot shows the distribution of x for each group. The fill
color of the violin plot is specified by the group variable.
Lollipop plot
A lollipop plot is a type of plot that is used to visualize the relationship between two variables,
where one variable is categorical and the other is numerical. In a lollipop plot, the categorical
variable is shown on the x-axis, and the numerical variable is represented by a line (the "stick") that
extends from the x-axis to the corresponding y-value. The end of the stick is marked by a circle (the
"lollipop").
# First, install and load the ggplot2 library
library(ggplot2)
# Generate some sample data
set.seed(123)
x <- c("Group 1", "Group 2", "Group 3")
y <- c(1, 2, 3)
# Prepare the data into a format that can be plotted
df <- data.frame(x = x, y = y)
# Create the lollipop plot using ggplot2
ggplot(df, aes(x = x, y = y)) +
geom_segment(aes(xend = x, yend = 0), color = "gray50") +
geom_point(size = 5) +
labs(x = "Group", y = "Value")
In this example, the ggplot() function is used to specify the plot, with x and y as the aesthetic
mappings. The geom_segment() layer is then added to the plot to create the stick, with xend and
yend as the endpoint mappings. The geom_point() layer is added to the plot to create the lollipops,
and the labs() function is used to add labels to the x- and y-axes.
You can customize the appearance of the lollipop plot by adding additional layers or arguments to
the ggplot() function. For example, you can change the color of the sticks and lollipops, add labels
to the lollipops, and more.
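As one hedged sketch of such customization (the colors and label placement here are arbitrary choices, and the linewidth argument assumes ggplot2 3.4 or later), the same data can be drawn with colored sticks, colored points, and value labels:

```r
library(ggplot2)

# Same illustrative data as above
df <- data.frame(x = c("Group 1", "Group 2", "Group 3"), y = c(1, 2, 3))

ggplot(df, aes(x = x, y = y)) +
  geom_segment(aes(xend = x, yend = 0),
               color = "steelblue", linewidth = 1) +   # colored, thicker sticks
  geom_point(size = 5, color = "tomato") +             # colored lollipop heads
  geom_text(aes(label = y), vjust = -1.5) +            # value label above each head
  labs(x = "Group", y = "Value")
```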
Summary
Business data visualization refers to the representation of data in graphical format to help
organizations make informed decisions. By visualizing data, it becomes easier to identify patterns,
trends, and relationships that may not be immediately apparent from raw data. The main goal of
business data visualization is to communicate complex information in an easy-to-understand
manner and to support data-driven decision making.
There are various types of data visualizations including bar graphs, line charts, scatter plots, pie
charts, heat maps, and more. The choice of visualization depends on the type and nature of the data
being analyzed.
It's important to note that while visualizing data can greatly enhance understanding and decision
making, it is important to also consider the limitations and potential biases that may arise in the
visual representation of data. Proper data visualization techniques should be used and the results
should be validated and interpreted carefully.
Keywords
Data visualization, Ggplot, R packages, lollipop chart
Self Assessment
1. Point out the correct statement?
10. Which of the following takes two columns and spreads them into multiple columns?
A. ggmissplot
B. printplot
C. print.ggplot
D. ggplot
11. How many functions exist for wrangling the data with dplyr package?
A. one
B. seven
C. three
D. five
14. For barchart and _________ non-trivial methods exist for tables and arrays, documented at
barchart.table.
A. scatterplot
B. dotplot
C. xyplot
D. scatterplot and xyplot
6. B 7. D 8. C 9. D 10. C
Review Questions
1) What is ggplot2 and what is its purpose?
2) How does ggplot2 differ from other data visualization tools in R?
3) What is the structure of a ggplot2 plot?
4) What is a "ggplot" object and how is it constructed in ggplot2?
5) How can you add layers to a ggplot object?
6) What are the different types of geoms available in ggplot2 and what do they represent?
7) How can you customize the appearance of a ggplot plot, such as color, size, and shape of
the data points?
8) How can you add descriptive statistics, such as mean or median, to a ggplot plot?
9) How can you use facets to create multiple plots in a single ggplot plot?
10) What is the difference between scales and themes in ggplot2, and how can you use them to
change the look of your plot?
Further Reading
"R Graphics Cookbook" by Winston Chang
"Data Visualization with ggplot2" by Hadley Wickham
"ggplot2: Elegant Graphics for Data Analysis" by Hadley Wickham
"An Introduction to ggplot2" by Ed Zehl
"Data Visualization with ggplot2: A Practical Guide" by Kim Seefeld
"R Graphics for Data Analysis" by Murrell
"Data Visualization with ggplot2 and the Tidyverse" by Thomas Lin Pedersen
"The ggplot2 Package: A tutorial on its structure and use" by J. Verzani
"Data Visualization with ggplot2: A step-by-step guide" by Thomas Briet.
Objectives
After studying this unit, you should be able to
Introduction
Business forecasting is a crucial element for any business to sustain its growth and profitability in
the long run. Time series analysis is a popular technique used for business forecasting, which
involves analyzing the past performance of a business to predict its future performance. Time series
analysis involves analyzing data over a certain period of time to identify trends, patterns, and
relationships that can be used to make accurate predictions about future outcomes.
In business forecasting using time series analysis, various methods can be employed such as
moving average, exponential smoothing, regression analysis, and trend analysis. These methods
help in identifying the trends and patterns in the data and forecasting the future values of the
variables.
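Two of the methods named above, the moving average and exponential smoothing, can be sketched with base R alone on the built-in AirPassengers monthly series. This is an illustrative sketch, not a tuned forecasting model.

```r
data(AirPassengers)

# Simple moving average: a centred 12-month window via stats::filter
ma12 <- stats::filter(AirPassengers, rep(1/12, 12), sides = 2)

# Exponential smoothing: Holt-Winters with trend and seasonal components
fit <- HoltWinters(AirPassengers)
predict(fit, n.ahead = 12)  # point forecasts for the next 12 months
```

The moving average smooths out short-term fluctuations to expose the trend, while Holt-Winters additionally estimates the trend and seasonal pattern explicitly and extrapolates them forward.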
One of the major advantages of time series analysis is that it can help businesses in identifying the
factors that affect their performance and understanding the impact of external factors such as
changes in the economy, consumer behavior, and market trends.
Time series analysis can be used in various business functions such as sales forecasting, inventory
management, financial forecasting, and demand forecasting. It helps businesses to make informed
decisions about their future investments, resource allocation, and overall strategy.
In conclusion, time series analysis is an essential tool for business forecasting, and its applications
are wide-ranging. Accurate forecasting can provide a significant competitive advantage for
businesses and is essential for their long-term success.
A typical business forecasting process involves the following steps:
Make estimates about future business operations based on information collected through
investigation.
Choose the model that best fits the dataset, variables, and estimates. The chosen model conducts
the data analysis and a forecast is made.
Note the deviations between actual performance and the forecast, and use this information to refine
the prediction process and improve the accuracy of future forecasts.
ARIMA (AutoRegressive Integrated Moving Average): This technique uses time series data to
analyze patterns and relationships, including trends and seasonality, to make predictions about
future trends.
Neural Networks: This technique uses artificial intelligence algorithms to analyze large amounts of
data and identify patterns, which can then be used to make predictions about future trends.
Decision Trees: This technique uses historical data to build a tree-like structure that can be used to
make predictions about future trends based on different scenarios.
Monte Carlo Simulation: This technique involves running multiple simulations based on random
sampling of historical data to make predictions about future trends.
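The ARIMA technique from the list above can be illustrated with base R's arima() function. The seasonal (1,1,1)(0,1,1)[12] order used here is a common textbook choice for the AirPassengers series, not a claim about the best model, and a log transform is applied because the seasonal swings grow with the level.

```r
data(AirPassengers)

# Fit a seasonal ARIMA(1,1,1)(0,1,1)[12] to the log-transformed series
fit <- arima(log(AirPassengers), order = c(1, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast 12 months ahead and back-transform to the original scale
fc <- predict(fit, n.ahead = 12)
exp(fc$pred)
```

In practice the model order would be chosen by examining autocorrelation plots or an information criterion rather than fixed in advance.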
Business forecasting software can help business managers and forecasters not only generate
forecast reports easily, but also better understand predictions and how to make strategic decisions
based on them. A quality business forecasting system should provide clear, real-time visualization
of business performance, which facilitates fast analysis and streamlined business planning.
The application of forecasting in business is an art and a science, the combination of business
intelligence and data science, and the challenges of business forecasting often stem from poor
judgments and inexperience. Assumptions combined with unexpected events can be dangerous
and result in completely inaccurate predictions. Despite the limitations of business forecasting,
gaining any amount of insight into probable future trends will put an organization at a significant
advantage.
Time series forecasting is the process of analyzing time series data using statistics and modeling to
make predictions and inform strategic decision-making. It is not always an exact prediction, and
the reliability of forecasts can vary widely, especially when dealing with the commonly fluctuating
variables in time series data and with factors outside our control. However, forecasting provides
insight into which outcomes are more likely, or less likely, to occur than other potential outcomes.
Often, the more comprehensive the data we have, the more accurate the forecasts can be. While
forecasting and "prediction" generally mean the same thing, there is a notable distinction: in some
industries, forecasting might refer to data at a specific future point in time, while prediction refers
to future data in general. Time series forecasting is often used in conjunction with time series
analysis.
Time series analysis involves developing models to gain an understanding of the data to
understand the underlying causes. Analysis can provide the “why” behind the outcomes you are
seeing. Forecasting then takes the next step of what to do with that knowledge and the predictable
extrapolations of what might happen in the future.
Forecasting has a range of applications in various industries. It has many practical applications,
including weather forecasting, climate forecasting, economic forecasting, healthcare forecasting,
engineering forecasting, finance forecasting, retail forecasting, business forecasting, environmental
studies forecasting, social studies forecasting, and more. Basically, anyone who has consistent
historical data can analyze that data with time series analysis methods and then model, forecast,
and predict. For some industries, the entire point of time series analysis is to facilitate forecasting.
Some technologies, such as augmented analytics, can even automatically select forecasting from
among other statistical algorithms if it offers the most certainty.
Time series analysis helps identify trends, cyclic behavior, and seasonality. It can also help identify
if an outlier is truly an outlier or if it is part of a larger cycle. Gaps in the data can hide cycles or
seasonal variation, skewing the forecast as a result.
The stock market is a classic example of time series analysis in action, especially with automated
trading algorithms. Likewise,
time series analysis is ideal for forecasting weather changes, helping meteorologists predict
everything from tomorrow’s weather report to future years of climate change. Examples of time
series analysis in action include:
Weather data
Rainfall measurements
Temperature readings
Heart rate monitoring (EKG)
Brain monitoring (EEG)
Quarterly sales
Stock prices
Automated stock trading
Industry forecasts
Interest rates
Time Series Analysis Types
Because time series analysis includes many categories or variations of data, analysts sometimes
must build complex models. However, analysts can't account for all variances, and they can't
generalize a specific model to every sample. Models that are too complex, or that try to do too
many things, can lead to a lack of fit; under- or overfitted models fail to distinguish between
random error and true relationships, leaving the analysis skewed and the forecasts incorrect.
Models of time series analysis include:
Classification: Identifies and assigns categories to the data.
Curve fitting: Plots the data along a curve to study the relationships of variables within the data.
Descriptive analysis: Identifies patterns in time series data, like trends, cycles, or seasonal variation.
Explanative analysis: Attempts to understand the data and the relationships within it, as well as
cause and effect.
Exploratory analysis: Highlights the main characteristics of the time series data, usually in a visual
format.
Forecasting: Predicts future data. This type is based on historical trends. It uses the historical data
as a model for future data, predicting scenarios that could happen along future plot points.
Intervention analysis: Studies how an event can change the data.
Segmentation: Splits the data into segments to show the underlying properties of the source
information.
Data classification
Further, time series data can be classified into two main categories:
Stock time series data means measuring attributes at a certain point in time, like a static snapshot of
the information as it was.
Flow time series data means measuring the activity of the attributes over a certain period, which is
generally part of the total whole and makes up a portion of the results.
Data variations
In time series data, variations can occur sporadically throughout the data. Functional analysis can
pick out the patterns and relationships within the data to identify notable events.