FBAS Notes

Fundamentals of Business Analytics

INTRODUCTION
Chapter 1

Three developments spurred recent explosive growth in the use of analytical methods in business applications:

First development:
- Technological advances—scanner technology, data collection through e-commerce, Internet social networks, and data generated from personal electronic devices—produce incredible amounts of data for businesses.
- Businesses want to use these data to improve the efficiency and profitability of their operations, better understand their customers, price their products more effectively, and gain a competitive advantage.

Second development:
- Ongoing research has resulted in numerous methodological developments, including:
● Advances in computational approaches to effectively handle and explore massive amounts of data.
● Faster algorithms for optimization and simulation.
● More effective approaches for visualizing data.

Third development:
- The methodological developments were paired with an explosion in computing power and storage capability.
- Better computing hardware, parallel computing, and cloud computing have enabled businesses to solve big problems faster and more accurately than ever before.

DECISION MAKING
Defined as the following process:
1. Identify and define the problem.
2. Determine the criteria that will be used to evaluate alternative solutions.
3. Determine the set of alternative solutions.
4. Evaluate the alternatives.
5. Choose an alternative.

Common approaches to making decisions include:
1. Tradition
2. Intuition
3. Rules of thumb
4. Using the relevant data available

Managers' responsibility: To make (1) strategic, (2) tactical, or (3) operational decisions.

1) Strategic decisions:
- Involve higher-level issues concerned with the overall direction of the organization.
- Define the organization's overall goals and aspirations for the future.

2) Tactical decisions:
- Concern how the organization should achieve the goals and objectives set by its strategy.
- Are usually the responsibility of mid-level management.

3) Operational decisions:
- Affect how the firm is run from day to day.
- Are the domain of operations managers, who are the closest to the customer.
BUSINESS ANALYTICS
- Scientific process of transforming data into insight for making better decisions.
- Used for data-driven or fact-based decision making, which is often seen as more objective than other alternatives for decision making.

Tools of business analytics can aid decision making by:
- Creating insights from data.
- Improving our ability to more accurately forecast for planning.
- Helping us quantify risk.
- Yielding better alternatives through analysis and optimization.

Categorization of Analytical METHODS and MODELS

I. DESCRIPTIVE Analytics
Encompasses the set of techniques that describe what has happened in the past.
Examples:
- Data queries
- Reports
- Descriptive statistics
- Data visualization (including data dashboards)
- Data-mining techniques
- Basic what-if spreadsheet models

Data query: A request for information with certain characteristics from a database.

Data dashboards: Collections of tables, charts, maps, and summary statistics that are updated as new data become available.
Uses of dashboards:
- To help management monitor specific aspects of the company's performance related to their decision-making responsibilities.
- For corporate-level managers, daily data dashboards might summarize sales by region, current inventory levels, and other company-wide metrics.
- Front-line managers may view dashboards that contain metrics related to staffing levels, local inventory levels, and short-term sales forecasts.

Data mining: The use of analytical techniques for better understanding patterns and relationships that exist in large data sets.
Examples:
- Cluster analysis
- Sentiment analysis

II. PREDICTIVE Analytics
- Consists of techniques that use models constructed from past data to predict the future or ascertain the impact of one variable on another.
- Survey data and past purchase behavior may be used to help predict the market share of a new product.

Techniques used in Predictive Analytics:
- Linear regression
- Time series analysis
- Data mining: used to find patterns or relationships among elements of the data in a large database
- Simulation: involves the use of probability and statistics to construct a computer model to study the impact of uncertainty on a decision.

III. PRESCRIPTIVE Analytics
- Indicates a best course of action to take.
- Predictive models provide a forecast or prediction, but do not provide a decision; prescriptive analytics goes further and recommends one.

- Prescriptive Model: A forecast or prediction combined with a rule.
- Rule-based Model: Prescriptive models that rely on a rule or set of rules.
- Simulation optimization: Combines the use of probability and statistics to model uncertainty with optimization techniques to find good decisions in highly complex and highly uncertain settings.

Decision analysis:
- Used to develop an optimal strategy when a decision maker is faced with several decision alternatives and an uncertain set of future events.
- Employs utility theory: assigns values to outcomes based on the decision maker's attitude toward risk, loss, and other factors.

Optimization models: Models that give the best decision subject to the constraints of the situation.
- Portfolio models
- Supply network design models
- Price-markdown models
Examples:
a) Portfolio models
- Finance field
- Use historical investment return data to determine the mix of investments that yield the highest expected return while controlling or limiting exposure to risk.

b) Supply network design models
- Operations field
- Provide the cost-minimizing plant and distribution center locations subject to meeting the customer service requirements.

c) Price-markdown models
- Retailing field
- Use historical data to yield revenue-maximizing discount levels and the timing of discount offers when goods have not sold as planned.
BIG DATA
Any set of data that is too large or too complex to be handled by standard data-processing techniques and typical desktop software.

IBM describes the phenomenon of big data through the four V's:
1. Volume
2. Velocity
3. Variety
4. Veracity

- Represents opportunities.
- Presents challenges in terms of data storage and processing, security, and available analytical talent.

The four V's at a glance:
- Volume (Data at Rest): Terabytes to exabytes of existing data to process
- Velocity (Data in Motion): Streaming data, milliseconds to seconds to respond
- Variety (Data in Many Forms): Structured, unstructured, text, multimedia
- Veracity (Data in Doubt): Uncertainty due to data inconsistency & incompleteness, ambiguities, latency, deception, model approximations

1) Volume
- Because data is collected electronically, we are able to collect more of it.
- To be useful, this data must be stored, and this storage has led to vast quantities of data.

2) Velocity
- Real-time capture and analysis of data present unique challenges both in how data is stored and the speed with which that data can be analyzed for decision making.

3) Variety
- More complicated types of data are now available and are proving to be of great value to businesses.
● Text data: collected by monitoring what is being said about a company's products or services on social media platforms.
● Audio data: collected from service calls.
● Video data: collected by in-store video cameras and used to analyze shopping behavior.
- Analyzing information generated by these nontraditional sources is more complicated in part because of the processing required to transform the data into a numerical form that can be analyzed.

4) Veracity
- How much uncertainty is in the data.
- Inconsistencies in units of measure and the lack of reliability of responses in terms of bias also increase the complexity of the data.
The four Vs have led to new technologies:
-​ Hadoop: An open-source programming
environment that supports big data processing
through distributed storage and processing on
clusters of computers.
- MapReduce: A programming model used within Hadoop that performs two major steps: the map step and the reduce step (see the sketch below this list).
- Cloud computing: aka "the cloud"; refers to the use of data and software on servers housed external to an organization via the internet.
-​ Artificial Intelligence: (AI) is the use of big
data and computers to make decisions that in
the past would have required human
intelligence.
-​ Data security: the protection of stored data
from destructive forces or unauthorized users,
is of critical importance to companies.
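
To make the two MapReduce steps concrete, here is a conceptual sketch in plain Python. This illustrates the programming model only, not the Hadoop API; the word-count task and the documents are made-up:

```python
# Conceptual sketch of MapReduce's two steps (plain Python, not the Hadoop API):
# the map step emits (key, value) pairs; the reduce step aggregates values per key.
from collections import defaultdict

documents = ["big data big insight", "data beats intuition"]

# Map step: emit (word, 1) for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/reduce step: group the pairs by key and sum the values.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))  # e.g. {'big': 2, 'data': 2, 'insight': 1, ...}
```

In Hadoop, the same two steps run in parallel across a cluster, which is what makes the model suitable for big data.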

The complexities of the 4 V's have increased the demand for analysts, but a shortage of qualified analysts has made hiring more challenging.

More companies are searching for data scientists, who know how to process and analyze massive amounts of data.

The Internet of Things (IoT)
The technology that allows data, collected from sensors in all types of machines, to be sent over the Internet to repositories where it can be stored and analyzed.

The Spectrum of Business Analytics
[Figure: analytics techniques arranged along a spectrum of complexity; prescriptive analytics is the most complex.]

Business Analytics in Practice
Predictive & prescriptive analytics are sometimes referred to as advanced analytics.

Financial Analytics:
Use of predictive models to:
- Forecast financial performance.
- Assess the risk of investment portfolios and projects.
- Construct financial instruments such as derivatives.
- Construct optimal portfolios of investments.
- Allocate assets.
- Create optimal capital budgeting plans.
Simulation is also often used to assess risk in the financial sector.

Human Resource (HR) Analytics:
- New area of application for analytics.
- The HR function is charged with ensuring that the organization:
● Has the mix of skill sets necessary to meet its needs.
● Is hiring the highest-quality talent and providing an environment that retains it.
● Achieves its organizational diversity goals.

Marketing Analytics:
- Marketing is one of the fastest-growing areas for the application of analytics.
- A better understanding of consumer behavior through the use of scanner data and data generated from social media has led to an increased interest in marketing analytics.
- A better understanding of consumer behavior through marketing analytics leads to:
● Better use of advertising budgets.
● More effective pricing strategies.
● Improved forecasting of demand.
● Improved product-line management.
● Increased customer satisfaction and loyalty.

Healthcare Analytics:
Descriptive, predictive, and prescriptive analytics are used to improve:
- Patient, staff, and facility scheduling
- Patient flow
- Purchasing
- Inventory control
Use of prescriptive analytics for diagnosis and treatment may prove to be the most important application of analytics in healthcare.

Supply-Chain Analytics:
- The core service of companies such as UPS and FedEx is the efficient delivery of goods, and analytics has long been used to achieve efficiency.
- The optimal sorting of goods, vehicle and staff scheduling, and vehicle routing are all key to profitability for logistics companies such as UPS and FedEx.
- Companies can benefit from better inventory and processing control and more efficient supply chains.

Analytics for Government and Nonprofits:
Analytics for government to:
- Drive out inefficiencies.
- Increase the effectiveness and accountability of programs.
Analytics for nonprofit agencies to ensure their effectiveness and accountability to their donors and clients.

Sports Analytics:
Professional sports teams use analytics to:
- Assess players for the amateur drafts.
- Decide how much to offer players in contract negotiations.
- Assist with on-field decisions such as which pitchers to use in various games of an MLB playoff series.
Professional motorcycle racing teams use sophisticated optimization for gearbox design to gain competitive advantage.
The use of analytics for off-the-field business decisions is increasing rapidly: using prescriptive analytics, franchises across several major sports dynamically adjust ticket prices throughout the season to reflect the relative attractiveness and potential demand for each game.

Web Analytics:
- The analysis of online activity, which includes, but is not limited to, visits to web sites and social media sites such as Facebook and LinkedIn.
- Leading companies apply descriptive and advanced analytics to data collected in online experiments to determine the best way to:
● Configure web sites
● Position ads
● Utilize social networks for the promotion of products and services.
DESCRIPTIVE STATISTICS
Chapter 2

Overview of Using Data: Definitions and Goals

Data
The facts and figures collected, analyzed, and summarized for presentation and interpretation.

Variable
A characteristic or a quantity of interest that can take on different values.

Observation
A set of values corresponding to a set of variables.

Variation
The difference in a variable measured over observations.

Random variable / uncertain variable
A quantity whose values are not known with certainty.

Sample Data
Company | Symbol | Share Price ($)
Apple | AAPL | 160.47
American Express | AXP | 91.69

- Variables: Row 1 (Company, Symbol, Share Price)
- Observations: Rows 2 & 3
- There is variation within the observations.

Types of Data

Population and Sample Data
Population: All elements of interest.
Sample: Subset of the population.
● Random Sampling:
- A sampling method to gather a representative sample of the population data; objective; procedural.
- We assume that data is random.

Quantitative and Categorical Data
- Data can only be one of the two.
Quantitative Data
Data on which numeric and arithmetic operations, such as addition, subtraction, multiplication, and division, can be performed.
Categorical Data
Data on which arithmetic operations cannot be performed.

Cross-Sectional and Time Series Data
Cross-Sectional Data
Data collected from several entities at the same, or approximately the same, point in time.
- ex) Share price of fast food companies in March 2020

Time Series Data
Data collected over several time periods.
- ex) Sales from 2006-2017
- Graphs of time series data are frequently found in business and economic publications.
- Graphs help analysts understand what happened in the past, identify trends over time, and project future levels for the time series.

Sources of Data
Experimental Study
- A variable of interest is first identified.
- Then one or more other variables are identified and controlled or manipulated so that data can be obtained about how they influence the variable of interest.

Nonexperimental / Observational Study
- Makes no attempt to control the variables of interest.
- Survey: Most common type of observational study.

*Modifying Data in Excel: see the MODIFYING DATA IN EXCEL section below.

Creating Distributions from Data

Frequency Distributions for Categorical Data
Frequency Distribution
- A summary of data that shows the number (frequency) of observations in each of several nonoverlapping classes.
- Classes are typically referred to as bins when dealing with distributions (bins used to be called classes).

Raw Data - ungrouped data
Grouped Data - raw data after grouping
Sample Frequency Distribution

Soft Drink | Frequency
Coca-Cola | 19
Diet Coke | 8
Dr. Pepper | 5
Total | 32

- Classes/bins: the left column (the soft drink names)
- Class intervals (e.g., 10-14, 15-19, 20-24) are used when the data are quantitative

The frequency distribution summarizes information about the popularity of the soft drinks:
- Coca-Cola is the leader
- Diet Coke is second
- Dr. Pepper is third

Relative Frequency Distribution
A tabular summary of data showing the relative frequency for each bin.

Percent Frequency Distribution
Summarizes the percent frequency of the data for each bin.
- Used to provide estimates of the relative likelihoods of different values of a random variable

*Ungrouped data is used for these 2 frequency distributions

I. Relative Frequency and Percent Frequency Distributions of Soft Drink Purchases

Soft Drink | Relative Frequency | Percent Frequency (%)
Coca-Cola | 0.59 | 59
Diet Coke | 0.25 | 25
Dr. Pepper | 0.16 | 16
Total | 1.00 | 100

Frequency ÷ Total = Relative Frequency
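
A minimal Python sketch (standard library only) showing how the frequency, relative frequency, and percent frequency columns above are produced from raw, ungrouped data:

```python
from collections import Counter

# Soft drink purchase data (raw/ungrouped), matching the counts in the table above.
purchases = ["Coca-Cola"] * 19 + ["Diet Coke"] * 8 + ["Dr. Pepper"] * 5

freq = Counter(purchases)   # frequency distribution
n = sum(freq.values())      # total number of observations (32)

print(f"{'Soft Drink':<12}{'Freq':>6}{'Rel. Freq':>11}{'Pct. Freq':>11}")
for drink, f in freq.most_common():
    rel = f / n             # relative frequency = frequency / total
    print(f"{drink:<12}{f:>6}{rel:>11.2f}{100 * rel:>10.0f}%")
```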

Frequency Distributions for Quantitative Data
- 3 steps necessary to define the classes for a frequency distribution with quantitative data:
1. Determine the number of nonoverlapping bins
2. Determine the width of each bin
3. Determine the bin limits

Approximate Bin Width
(Largest data value - Smallest data value) ÷ Number of bins
Example: for the audit-time data below, (33 - 12) ÷ 5 = 4.2, which is rounded up to the bin width of 5 used in the table.

i) Year-End Audit Times (Days)
12 14 19 18
15 15 18 17
20 27 22 23
22 21 33 28
14 18 16 13

ii) Frequency, Relative Frequency, and Percent Frequency Distributions for the Audit Time Data
(The Python sketch after this section reproduces these frequencies.)

Audit Times (days) | Freq. | Relative Freq. | Percent Freq.
10-14 | 4 | 0.20 | 20
15-19 | 8 | 0.40 | 40
20-24 | 5 | 0.25 | 25
25-29 | 2 | 0.10 | 10
30-34 | 1 | 0.05 | 5

Histogram
- A common graphical presentation of quantitative data.
- Provides information about the shape, or form, of a distribution.
- Constructed by placing the variable of interest on the horizontal axis and the selected frequency measure (absolute frequency, relative frequency, or percent frequency) on the vertical axis.
- The frequency measure of each class is shown by drawing a rectangle whose base is the class limits on the horizontal axis and whose height is the corresponding frequency measure.

Skewness
- Lack of symmetry.
- An important characteristic of the shape of a distribution.
- Bell-shaped - ideal (symmetrical).
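
A short Python sketch that reproduces the audit-time binning above; the data values and bins are exactly those shown in the tables:

```python
# Bin the audit-time data into the five classes from the table above.
audit_times = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
               22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

width = 5
bins = [(lo, lo + width - 1) for lo in range(10, 35, 5)]  # (10,14) ... (30,34)
n = len(audit_times)

for lo, hi in bins:
    f = sum(lo <= x <= hi for x in audit_times)  # count values falling in the bin
    print(f"{lo}-{hi}: freq={f}, rel={f/n:.2f}, pct={100*f/n:.0f}%")
```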
Cumulative Distributions

Cumulative Frequency Distribution
A variation of the frequency distribution that provides another tabular summary of quantitative data.
- Uses the number of classes, class widths, and class limits developed for the frequency distribution.
- Shows the number of data items with values less than or equal to the upper class limit of each class.

iv) Cumulative Frequency, Cumulative Relative Frequency, and Cumulative Percent Frequency Distributions for the Audit Time Data

Audit Time (days) | Cumulative Freq. | Cumulative Relative Freq. | Cumulative Percent Freq.
≤ 14 | 4 | 0.20 | 20
≤ 19 | 12 | 0.60 | 60
≤ 24 | 17 | 0.85 | 85
≤ 29 | 19 | 0.95 | 95
≤ 34 | 20 | 1.00 | 100
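
A minimal sketch that builds the three cumulative columns from the audit-time frequencies:

```python
# Running totals turn the ordinary frequency distribution into a cumulative one.
bins = ["<=14", "<=19", "<=24", "<=29", "<=34"]
freq = [4, 8, 5, 2, 1]
n = sum(freq)  # 20 observations

cum = 0
for b, f in zip(bins, freq):
    cum += f  # items with values up to this upper class limit
    print(f"{b}: cum freq = {cum}, cum rel = {cum / n:.2f}, cum pct = {100 * cum / n:.0f}")
```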

Measures of Location

Mean (Arithmetic Mean)
Average value for a variable.
- The mean is denoted by x̄
- n = sample size
- x1 = value of variable x for the 1st observation
- x2 = value of variable x for the 2nd observation
- xi = value of variable x for the ith observation
- x̄ = (x1 + x2 + … + xn) / n = Σxi / n
Basically: Add all then divide by no. of observations.

Median
Value in the middle when the data are arranged in ascending order.
- Odd number of observations = middle value
- Even number of observations = average of the 2 middle values

Mode
Value that occurs most frequently in a data set.
Consider the class size data: 32, 42, 46, 46, 54
Observe: 46 is the only value that occurs more than once, so the mode is 46.
- Multimodal Data: At least 2 modes
- Bimodal Data: Exactly 2 modes

Geometric Mean
● A measure of location that is calculated by finding the nth root of the product of n values: x̄g = ⁿ√(x1 · x2 · … · xn)
- Used in analyzing growth rates in financial data.

I. Sample Table
Year | Return (%) | Growth Factor
1 | -22.1 | 0.779
… | … | …
10 | 2.1 | 1.021

Solution:
● Product of the growth factors: $100[(0.779)…(1.021)] = $100(1.3345) = $133.45
● Geometric mean of the growth factors: x̄g = ¹⁰√1.3345 = 1.029
● Conclude that annual returns grew at an average annual rate of (1.029 - 1) × 100%, or 2.9%.

Measures of Variability

Range
Subtracting the smallest value from the largest value in a data set.
- Drawback: The range is based on only 2 of the observations and is thus highly influenced by extreme values.

Variance
A measure of variability that utilizes all the data.
- It is based on the deviation about the mean, which is the difference between the value of each observation (xi) and the mean.
- The deviations about the mean are squared while computing the variance.
- Sample variance: s² = Σ(xi - x̄)² / (n - 1)
- For the population variance, replace s² with σ²: σ² = Σ(xi - μ)² / N

Standard Deviation
The positive square root of the variance: s = √s²
- Measured in the same units as the original data.
- For the population, σ = √σ²
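
As a quick numeric check of these definitions, a Python sketch using the class size data (the standard library's statistics module; the coefficient of variation computed at the end is defined in the next subsection):

```python
import statistics as st

class_sizes = [32, 42, 46, 46, 54]

mean = st.mean(class_sizes)              # (32+42+46+46+54)/5 = 44
median = st.median(class_sizes)          # middle value of the sorted data = 46
mode = st.mode(class_sizes)              # most frequent value = 46
rng = max(class_sizes) - min(class_sizes)  # range = 54 - 32 = 22
var = st.variance(class_sizes)           # sample variance (divides by n-1) = 64
sd = st.stdev(class_sizes)               # positive square root of the variance = 8
cv = sd / mean * 100                     # coefficient of variation (next subsection)

# Note: st.geometric_mean(...) (Python 3.8+) computes the geometric mean directly.
print(mean, median, mode, rng, var, sd, round(cv, 1))
```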
Coefficient of Variation
A descriptive statistic that indicates how large the standard deviation is relative to the mean.
- Expressed as a percentage: CV = (s / x̄ × 100)%

Example:
Class size data: 46, 54, 42, 46, 32
x̄ = 44, s = 8
Coefficient of Variation = (8/44 × 100)% = 18.2%
MODIFYING DATA IN EXCEL

To sort the automobiles by March 2010 sales:
Step 1: Select cells A1:F21.
Step 2: Click the Data tab in the Ribbon.
Step 3: Click Sort in the Sort & Filter group.
Step 4: Select the check box for My data has headers.
Step 5: In the first Sort by dropdown menu, select Sales (March 2010).
Step 6: In the Order dropdown menu, select Largest to Smallest.
Step 7: Click OK.

To identify the automobile models in Table 2.2 for which sales had decreased from March 2010 to March 2011:
Step 1: Starting with the original data shown in Figure 2.3, select cells F1:F21.
Step 2: Click on the Home tab in the Ribbon.
Step 3: Click Conditional Formatting in the Styles group.
Step 4: Select Highlight Cells Rules, and click Less Than from the dropdown menu.
Step 5: Enter 0% in the Format cells that are LESS THAN: box.
Step 6: Click OK.

Conditional Formatting of Data in Excel:
- Makes it easy to identify data that satisfy certain conditions in a data set.
- Uses rules to highlight interesting data.

Quick Analysis button
- Appears just outside the bottom-right corner of a group of selected cells.
- Provides shortcuts for Conditional Formatting, adding Data Bars, and other operations.
Analyzing Distributions

Percentile
- Value of a variable at which a specified (approximate) percentage of observations are below that value.
- The pth percentile tells us the point in the data where:
○ Approximately p percent of the observations have values less than the pth percentile.
○ Approximately (100 - p) percent of the observations have values greater than the pth percentile.

Steps to determine the pth percentile:
1. Arrange the data in ascending order.
2. Compute the location of the percentile: Lp = (p/100)(n + 1), where p is the desired percentile and n is the number of observations.
3. Interpret: if Lp has a decimal part (e.g., Lp = 11.05), the percentile lies that fraction of the way (.05, i.e., 5%) between the values in positions 11 and 12.
4. Let X be the value in the 11th position and Y the value in the 12th position.
5. Solve: pth percentile = X + B(Y - X), where B is the decimal part of Lp.
Percentile -​ Often called the standardized value.
-​ Value of a variable at which a specified
(approximate) percentage of observations are
below that value.
-​ The pth percentile tells us the point in the data
where:
○​ Approximately p percent of the
observations have values less than
the pth percentile.
○​ Approximately (100 - p) percent of the
observations have values greater than
the pth percentile.
if value > mean, then z-score > 0 (POS)
if value < mean, then z-score < 0 (NEG)
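
A minimal z-score sketch; the |z| > 3 flag anticipates the outlier rule covered in the next subsection:

```python
import statistics as st

def z_scores(data):
    """z_i = (x_i - mean) / s; values above the mean get positive z-scores."""
    m, s = st.mean(data), st.stdev(data)
    return [(x - m) / s for x in data]

data = [46, 54, 42, 46, 32]
for x, z in zip(data, z_scores(data)):
    flag = "outlier" if abs(z) > 3 else ""   # outlier rule from the next subsection
    print(f"{x}: z = {z:+.2f} {flag}")
```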
Empirical Rule
- Can be used to determine the percentage of data values that are within a specified number of standard deviations of the mean.
- Used when the distribution of data exhibits a symmetric bell-shaped distribution.
- For data having a bell-shaped distribution:
● Approximately 68% of the data values will be within 1 standard deviation of the mean.
● Approximately 95% of the data values will be within 2 standard deviations of the mean.
● Almost all the data values will be within 3 standard deviations of the mean.
(68% 👉 1, 95% 👉 2, all 👉 3)

Outliers
- Extreme values in a data set.
- Can be identified using standardized values (z-scores).
- Any data value with a z-score less than -3 or greater than +3 is an outlier.
- Such data values can then be reviewed to determine their accuracy and whether they belong in the data set.

Box plot
- A graphical summary of the distribution of data.
- Developed from the quartiles for a data set.

Measures of Association Between 2 Variables

Covariance
- A descriptive measure of the linear association between 2 variables.
- Sample covariance: sxy = Σ(xi - x̄)(yi - ȳ) / (n - 1)
- Population covariance: σxy = Σ(xi - μx)(yi - μy) / N

Scatter Diagrams and Associated Covariance Values
If sxy
● is Positive – x and y are positively linearly related; points fall mainly in the quadrants where both values are above or both below their means (quadrants I and III).
● is Approximately 0 – x and y are not linearly related; points fall in all quadrants.
● is Negative – x and y are negatively linearly related; points fall mainly in quadrants II and IV.

Correlation Coefficient
- Measures the relationship between two variables: rxy = sxy / (sx · sy)
- Not affected by the units of measurement for x and y.

Interpretation of Correlation Coefficient:
-1 ≤ r ≤ +1
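
Translating the sample covariance and correlation formulas directly into Python (the ad-spend/sales numbers are made-up illustrative data):

```python
import statistics as st

def sample_covariance(x, y):
    """s_xy = sum((x_i - xbar)(y_i - ybar)) / (n - 1)"""
    xbar, ybar = st.mean(x), st.mean(y)
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """r_xy = s_xy / (s_x * s_y); always falls between -1 and +1."""
    return sample_covariance(x, y) / (st.stdev(x) * st.stdev(y))

# Made-up example: ad spend (x) vs. sales (y)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 7]
print(sample_covariance(x, y), correlation(x, y))  # positive covariance, r near +0.87
```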
DATA VISUALIZATION
- Good visualization reduces the cognitive load, or the amount of effort necessary to accurately and efficiently process the information being communicated by a data visualization.

Preattentive attributes – features that can be used in a data visualization to reduce the cognitive load required by the user to interpret the visualization. Include attributes such as
● color, size, shape, and length, among others.

Central to creating effective tables and charts for data visualization is the idea of the data-ink ratio.

Data-ink
- Ink used in a table or chart that is necessary to convey the meaning to the audience.

Non-data-ink
- Ink used in a table or chart that serves no useful purpose in conveying the data to the audience.

Data-ink ratio
- Measures the proportion of ink used in a table or chart that is necessary to convey the meaning to the audience (known as "data-ink") to the total ink used for the table or chart.
- Increasing a low data-ink ratio:
○ Removing unnecessary gridlines
○ Removing unnecessary lines and labels
○ Adding labels to axes (useful data-ink)

TABLES should be used when the reader needs:
1. Specific numerical values.
2. Comparisons between different values and not just relative comparisons.
3. Values that have different units or very different magnitudes.

Table Design Principles
- Keep in mind the data-ink ratio and avoid the use of unnecessary ink in tables.
- Avoid using vertical lines in a table unless they are necessary for clarity.
- Horizontal lines are generally necessary only for separating column titles from data values or when indicating that a calculation has taken place.
- In large tables, vertical lines or light shading can be useful to help the reader differentiate the columns and rows.
- To highlight the differences among locations, the shading could be done for every other row instead of every other column.

Alignment
- Numerical values should be right-aligned.
- Text values should be left-aligned.

Crosstabulation
- A tabular summary of data for two variables.
- The left and top margin labels define the classes for the two variables.
- A crosstabulation in Microsoft Excel is known as a PivotTable.

CHARTS (graphs)
- Visual methods for displaying data.
- Examples: scatter charts, line charts, sparklines, bar charts, and column charts.

Scatter chart – a graphical presentation of the relationship between two quantitative variables.

Line charts
- Similar to scatter charts, but a line connects the points in the chart.
- Very useful for time series data collected over a period of time (minutes, hours, days, years, etc.).
- Time series plot: a line chart for time series data.

Bar charts and column charts
- Provide a graphical summary of categorical data using the preattentive attribute of length to convey relative amounts. Very helpful in making comparisons between categorical variables.
- Bar charts – horizontal bars
- Column charts – vertical bars

Pie charts
- Used to compare categorical data.
- Rely on the preattentive attribute of size to convey information to the user.
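
A small sketch of the two most basic chart types above, assuming matplotlib is available; the monthly sales figures are made-up:

```python
import matplotlib.pyplot as plt

months = list(range(1, 13))
sales = [20, 22, 21, 25, 27, 30, 29, 33, 35, 34, 38, 41]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(months, sales)           # scatter chart: two quantitative variables
ax1.set(title="Scatter chart", xlabel="Month", ylabel="Sales")
ax2.plot(months, sales, marker="o")  # line chart: connected points, good for time series
ax2.set(title="Line chart (time series plot)", xlabel="Month", ylabel="Sales")
plt.tight_layout()
plt.show()
```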
For Multiple Variables:

Bubble chart
- A graphical means of visualizing three variables in a two-dimensional graph.
- A preferred alternative to a 3-D graph.

Scatter-chart matrix
- Allows the reader to easily see the relationships among multiple variables.
- Each scatter chart in the matrix is created in the same manner as for creating a single scatter chart.
- Each column and row in the scatter-chart matrix corresponds to one variable.
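
A bubble chart is nearly a one-liner in matplotlib (assumed available); the third variable is passed as the marker size s, and all figures here are made-up:

```python
import matplotlib.pyplot as plt

revenue = [10, 25, 40, 60]       # x-axis: first quantitative variable
profit = [2, 6, 5, 12]           # y-axis: second quantitative variable
employees = [50, 200, 120, 400]  # third variable, shown as bubble area

plt.scatter(revenue, profit, s=employees, alpha=0.5)
plt.xlabel("Revenue ($M)")
plt.ylabel("Profit ($M)")
plt.title("Bubble chart: three variables in two dimensions")
plt.show()
```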

Specialized Data Visualizations:
- Heat maps, treemaps, waterfall charts, stock charts, and parallel-coordinates plots.

Heat Map
- Two-dimensional graphical representation of data that uses different shades of color to indicate magnitude.

Treemap
- A chart that uses the size, color, and arrangement of rectangles to display the magnitudes of a quantitative variable for different categories, each of which is further decomposed into subcategories.
- The size of each rectangle represents the magnitude of the quantitative variable within a category/subcategory.
- The color of the rectangle represents the category; all subcategories of a category are arranged together.
- Categorical data that is further decomposed into subcategories is called hierarchical data.

Hierarchical data
- Represented with a tree-like structure, where the branches of the tree lead to categories and subcategories.

Waterfall chart
- Visual display that shows the cumulative effect of positive and negative changes on a variable of interest.
- The changes in a variable of interest are reported for a series (e.g., time periods).
- The magnitude of each change is represented by a column anchored at the cumulative height of the changes in the preceding categories.

Stock chart
- A graphical display of stock prices over time.

Parallel-coordinates Plot
- A helpful chart for examining data with more than two variables, which includes a different vertical axis for each variable.
- Each observation in the data set is represented by drawing a line on the parallel-coordinates plot connecting each vertical axis.
- The height of the line on each vertical axis represents the value taken by that observation for the variable corresponding to the vertical axis.

Data dashboard
- A data-visualization tool that illustrates multiple metrics and automatically updates these metrics as new data become available.
- Provides timely summary information on KPIs that are important to the user, and it should do so in a manner that informs rather than overwhelms its user.
- Should present all KPIs on a single screen that a user can quickly scan to understand the business's current state of operations.
- Rather than requiring the user to scroll vertically and horizontally to see the entire dashboard, it is better to create multiple dashboards so that each dashboard can be viewed on a single screen.

Key Performance Indicators (KPIs)
- aka key performance metrics (KPMs).
- In a business, these values are often indicative of the business's current operating characteristics, such as its financial position, the inventory on hand, customer service metrics, and the like.
- The KPIs displayed in the data dashboard should convey meaning to its user and be related to the decisions the user makes.

Trendlines
- Lines added to charts to represent the general direction or pattern of data over time, helping to identify trends and forecasts.

Sparklines
- Small, simple charts embedded within a single cell in a spreadsheet to provide a visual summary of data trends.
DATA MINING
- The use of analytical techniques for better understanding patterns and relationships that exist in large data sets.
- Data mining is a logical process that is used to search through large amounts of data in order to find useful data. The goal of this technique is to find patterns that were previously unknown. Once these…
- A process used to turn raw data into useful information.
- Also called the knowledge discovery process, knowledge mining from data, knowledge extraction, or data/pattern analysis.

● Extract data for business
● Collect specific business factors to evaluate data
● Find useful information from data
● Analyze the data

Major Steps in Data Mining
Three steps involved are:
1. Exploration
2. Pattern identification
3. Deployment

Exploration – Data is cleaned and transformed into another form, and important variables and the nature of the data based on the problem are determined.

Pattern Identification – Identify and choose the patterns which make the best prediction.

Deployment – Patterns are deployed for the desired outcome.

Data Mining Processes
1. Data Extraction
2. Data Pre-Processing
3. Data Mining
4. Extracting Patterns
5. Visualizing Data

Data Mining Methods
1. Classification
2. Association Rule Mining
3. Outlier Detection
4. Clustering
5. Regression

Classification
- The data classification process involves learning and classification.
- In learning, the training data are analyzed by a classification algorithm.
- In classification, test data are used to estimate the accuracy of the classification rule…

Association rule mining
- A technique/method used to uncover hidden relationships between variables in large datasets.
- (usage examples: market basket analysis, customer segmentation, fraud detection. Source: google.com)
- Need to be able to generate rules with confidence… (a support/confidence sketch follows at the end of this section)

Clustering
- Grouping together subjects/cases with similar characteristics/attributes.
- Can be described as identification of similar classes of objects. By using…

Regression – measuring the effect of one variable over another (linear or multiple) variable.
Types of regression methods: …

The increase in the use of data-mining techniques in business has been caused largely by three events:
a. The explosion in the amount of data being produced and electronically tracked
b. The ability to electronically warehouse these data
c. The affordability of computer power to analyze the data

Data cleaning – The data wrangling step in which errors in the raw data are corrected.

Data classification – A predictive data mining task requiring the prediction of an observation's outcome class or category.
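
To make "rules with confidence" concrete, a minimal Python sketch that scores one candidate association rule on made-up transactions:

```python
# Hypothetical market-basket transactions (illustrative only).
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"diapers", "milk"},
]

antecedent, consequent = {"beer"}, {"diapers"}  # candidate rule: beer -> diapers

n = len(transactions)
both = sum(antecedent | consequent <= t for t in transactions)  # contain both itemsets
ante = sum(antecedent <= t for t in transactions)               # contain the antecedent

support = both / n        # fraction of all transactions containing beer AND diapers
confidence = both / ante  # among beer transactions, fraction that also contain diapers
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```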
DATA CLEANSING / DATA CLEANING
Data sets commonly include observations with missing values for one or more variables.

Poor data (due to human error):
typo errors, inconsistent formatting (uppercase/lowercase/mixed), blank fields, the same piece of data repeated, rows that are duplicates, misspelled words, missing data, spelling variations, mixed-up letters, inconsistent punctuation, inconsistent currencies, inconsistent field lengths (determines how many letters can fit into the row or column), inconsistent names, etc.

Data Preparation
- Makes heavy use of descriptive statistics and data-visualization methods.
- The data in a data set are often said to be "dirty" and "raw" before they have been preprocessed.
- We need to put them into a form that is best suited for a data-mining algorithm.

Dirty Data: incomplete, incorrect, or irrelevant to the problem.
Clean Data: complete, correct, and relevant to the problem.

Clean Data
- Essential to data integrity and reliable solutions and decisions.

Tools and Techniques:
1. Back up the data.
2. Remove unwanted data manually or use Excel: duplicates, irrelevant data, extra spaces and blanks, fix…

Treatment of Missing Data
- Discard observations with any missing values.
- Discard any variable with missing values.
- Fill in missing entries with estimated values.
- Apply a data-mining algorithm that can handle missing values (e.g., classification and regression trees).

Legitimately Missing Data – In some cases missing data naturally occur; no remedial action is taken.
Illegitimately Missing Data – In other cases missing data occur for different reasons.

Types of missing data:
1. Missing completely at random (MCAR): The tendency for an observation to be missing the value for some variable is entirely random; whether data are missing does not depend on either the value of the missing data or the value of any other variable in the data.
2. Missing at random (MAR): The tendency for an observation to be missing a value for some variable is related to the value of some other variable(s) in the data.
3. Missing not at random (MNAR): The tendency for the value of a variable to be missing is related to the value that is missing.

Replacement:
4. Imputation: The systematic replacement of missing values with values that seem reasonable.
5. Prediction: Build a model to predict a value.

Identification of Erroneous Outliers and Other Erroneous Values:
Examining the variables in the data set by use of summary statistics, frequency distributions, bar charts and histograms, z-scores, scatter plots, correlation coefficients, and other tools can uncover data-quality issues and outliers.

Much software ignores missing values when calculating various summary statistics.

If missing values in a data set are indicated with a unique value (such as 9999999), these values may be used by the software when calculating various summary statistics.

Both cases can result in misleading values for summary statistics.

Many analysts prefer to deal with missing data issues PRIOR to using summary statistics to attempt to identify erroneous outliers and other erroneous values in the data.
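
A common imputation workflow, sketched with pandas and numpy (both assumed available; the column name and values are made-up). Note how the 9999999 sentinel must be converted to a proper missing value before the mean is computed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"audit_days": [12, 14, 9999999, 18, np.nan, 15]})

# Sentinel codes like 9999999 would silently distort the mean,
# so convert them to proper missing values first...
df["audit_days"] = df["audit_days"].replace(9999999, np.nan)

# ...then impute: replace missing entries with the mean of the observed values.
df["audit_days"] = df["audit_days"].fillna(df["audit_days"].mean())
print(df)
```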
Variable Representation:
In many data-mining applications, it may be prohibitive to analyze the data because of the number of variables recorded.

Dimension reduction – the process of removing variables from the analysis without losing crucial information.

A critical part of data mining is determining how to represent the measurements of the variables and which variables to consider.

Often data sets contain variables that, considered separately, are not particularly insightful but that, when appropriately combined, result in a new variable that reveals an important relationship.
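
Principal component analysis (PCA) is one standard way to combine variables while reducing their number; a sketch assuming scikit-learn is available, with a made-up data matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Five observations measured on three variables (illustrative data).
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.2]])

pca = PCA(n_components=2)          # keep 2 combined variables instead of 3
X_reduced = pca.fit_transform(X)   # each component is a weighted mix of the originals
print(pca.explained_variance_ratio_)  # share of the variation each component retains
```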
OTHER NOTES
Data Cleaning is part of Data Wrangling.

Data Wrangling
The process of transforming and structuring data from one raw form into a desired format with the intent of improving data quality and making it more consumable and useful for analytics or machine learning. It often includes transforming, cleansing, and enriching data from multiple sources. The data being analyzed is more accurate and meaningful, leading to better solutions, decisions, and outcomes (www.alteryx.com).
Similar words: data cleaning, data remediation, data munging.

Data Mining
The process of discovering patterns and knowledge in large amounts of data; it focuses on analyzing and finding insights.
finding insights

Steps in Data Wrangling

Accessing the Data


1.​ Discovery - the analyst becomes familiar with,
and understands, the raw data with an eye
toward how the data should be organized to
facilitate its use and analysis
○​ Accessing and Formatting from
unstructured to structured data,
arrayed as a rectangle
2.​ Structuring - so that raw data can be more
readily analyzed in the intended referred to as
flat file manner. This may include the
formatting of the data fields, how the data are
to be arranged, spliting one field with severaÏ
im portant pieces of information into several
fields, and merging several fields into a single
more meaningful field.
3.​ Cleaning - This step includes identifying
missing data, erroneous data, duplicate
records, and outliers and determining the best
actions for addressing these issues. Data
cleaning makes heavy use of descriptive
statistics and data-visualization methods to
identify missing data, erroneous data, and
outliers.

Data Mining
-​ The process of sifting through large volumes
of data to uncover patterns, trends, and
insights that are not immediately obvious.
-​ Context: It has gained immense popularity due to its
applications in various industries, especially following
high-profile cases like the Cambridge Analytica scandal,
which highlighted the power and potential misuse of data
mining techniques.
Core Techniques in Data Mining

A. Classification
Purpose – Categorize data into predefined classes (e.g., "probably pregnant" vs. "probably not pregnant").

Process:
- Data attributes for each instance are quantified (e.g., purchasing patterns).
- Labels are assigned to the data based on known outcomes (e.g., baby registries).
- Algorithms analyze these labeled examples to identify patterns.

Example:
Target's ability to send pregnancy-related coupons demonstrates effective classification. Their methods were so accurate that they could identify a customer's pregnancy before she even informed her family.

B. Regression
Purpose – Used to predict numerical outcomes rather than categorical ones (e.g., estimating a baby's due date).

Process:
- Involves creating a mathematical model to describe the relationship between variables.
- Similar to classification, weights are assigned to features that influence the prediction.

Example:
Google Flu Trends uses regression to provide estimates of flu prevalence based on search data.

C. Clustering
Purpose – To group similar data points together based on shared characteristics without pre-existing labels.

Process:
- Products or data points are analyzed to identify natural groupings (e.g., eBay categorizing millions of products).
- Techniques like hierarchical clustering allow for the creation of a taxonomy, organizing data into a tree structure.

Example:
Howard Moskowitz's pasta sauce research revealed distinct consumer preferences, leading to targeted marketing strategies.
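
A minimal clustering sketch using k-means, assuming scikit-learn is available; the two-feature points are made-up so that the two natural groupings are obvious:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.1],   # one natural grouping
              [8.0, 9.0], [8.3, 8.7], [7.9, 9.2]])  # another

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point (no labels given)
print(kmeans.cluster_centers_)  # the "center" of each discovered group
```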
D. Anomaly Detection
Purpose – To identify unusual or unexpected patterns that may indicate problems or opportunities.

Process:
- Anomalies can be detected through statistical methods that look for deviations from the norm (e.g., unusual spending patterns).

Example:
The IRS and credit card companies use anomaly detection to identify potential fraud.

E. Association Rule
Purpose – To uncover relationships between different variables in datasets.

Process:
- Analyzes transactions to find items that are frequently purchased together.

Example:
The discovery that beer and diapers are often bought together highlights how retailers can optimize store layouts and inventory based on consumer behavior.

The Power of Data Mining

Standard Tools:
The five techniques—classification, regression, clustering, anomaly detection, and association learning—form the backbone of data mining and can be applied to various sectors.

Practical Applications:
From personalized marketing strategies to predictive healthcare analytics, data mining enables businesses to make informed decisions.

Challenges and Considerations
Data Quality – The success of data mining heavily relies on the quality and preparation of data. Poorly curated data can lead to misleading insights.
Ethical Implications – The use of personal data raises ethical concerns regarding privacy and consent. Companies must navigate the "creep factor" associated with data collection and usage.
The Human Element – Understanding the social implications of data mining is crucial. The algorithms are only as good as the data fed into them, and human oversight is essential for responsible application.