Bda Unit 5
Syllabus:
UNIT V:
Predictive Analytics and Visualizations: Predictive Analytics,
Simple linear regression, Multiple linear regression, Interpretation
of regression coefficients, Visualizations, Visual data analysis
techniques, interaction techniques, Systems and application
1. Predictive analytics
Predictive analytics is the use of statistical algorithms, machine learning techniques, and data mining to
analyze historical data and make predictions about future events or behaviors. It involves using
statistical models to identify patterns in historical data, which can then be used to make predictions
about future outcomes.
Predictive analytics is used in a wide range of industries, including finance, healthcare, marketing, and
retail. Some common applications of predictive analytics include:
Fraud detection: Predictive analytics can be used to identify fraudulent transactions by analyzing
patterns in historical data.
Customer retention: Predictive analytics can be used to identify customers who are at risk of leaving a
company and develop strategies to retain them.
Inventory management: Predictive analytics can be used to forecast demand for products and optimize
inventory levels to minimize stockouts and overstocking.
Marketing: Predictive analytics can be used to target customers with personalized offers based on their
past behavior and predicted future behavior.
Risk management: Predictive analytics can be used to assess the likelihood of future events, such as
credit defaults or insurance claims, and manage risk accordingly.
Predictive analytics involves several steps, including data collection, data cleaning, data preparation,
model development, model validation, and deployment. The goal is to develop accurate and reliable
models that can be used to make predictions about future outcomes.
The predictive-analysis process involves collecting historical data, identifying patterns and
relationships, and using this information to make informed predictions about future trends or events. It
typically proceeds through the following stages:
Data Collection: The first step in predictive analysis is to collect and organize relevant data. This may
involve gathering data from multiple sources, such as databases, spreadsheets, and online sources.
Data Cleaning and Preparation: Once the data has been collected, it must be cleaned and prepared for
analysis. This may involve removing missing data, handling outliers, and transforming the data to ensure
that it is in a format that can be analyzed.
Data Exploration: In this stage, analysts use various techniques to explore the data and identify patterns
and relationships. This may involve data visualization, descriptive statistics, and hypothesis testing.
Model Building: Once the data has been explored and understood, analysts can begin building
predictive models. This involves selecting an appropriate algorithm, training the model on the historical
data, and tuning the model to optimize its performance.
Model Validation: After the model has been built, it must be validated to ensure that it is accurate and
reliable. This may involve testing the model on a separate set of data or using cross-validation techniques.
Deployment: Finally, the predictive model can be deployed to make predictions about future events or
outcomes. This may involve integrating the model into a larger system or using it to guide decision-
making in a business or organizational context.
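The stages above can be sketched end-to-end in a few lines of Python. This is a minimal illustration, not a production pipeline: the demand figures are made up, and the "model" is a deliberately simple moving-average baseline standing in for a real predictive model.

```python
import statistics

# Hypothetical historical data: monthly product demand over two years.
history = [120, 135, 128, 140, 150, 149, 160, 155, 170, 165,
           180, 175, 190, 185, 200, 195, 210, 205, 220, 215,
           230, 225, 240, 235]

# Data cleaning: drop obviously invalid (negative) observations.
clean = [d for d in history if d >= 0]

# Model building: a naive baseline that predicts the recent average.
train, test = clean[:18], clean[18:]

def predict(window):
    """Forecast the next value as the mean of the last 6 observations."""
    return statistics.mean(window[-6:])

# Model validation: one-step-ahead mean absolute error on held-out data.
errors = []
window = list(train)
for actual in test:
    errors.append(abs(actual - predict(window)))
    window.append(actual)
mae = statistics.mean(errors)

# Deployment: forecast the next (future) month.
forecast = predict(window)
print(round(forecast, 1))  # 227.5
```

Swapping the baseline `predict` function for a fitted regression model, while keeping the same clean/train/validate/deploy skeleton, is the usual next step.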
Predictive analysis has a wide range of applications in various fields, including finance, healthcare,
marketing, and sports. For example, in finance, predictive analysis can be used to forecast stock prices
or identify potential risks in investment portfolios. In healthcare, predictive analysis can be used to
predict patient outcomes or identify potential health risks. In marketing, predictive analysis can be used
to identify potential customers or predict sales trends. In sports, predictive analysis can be used to
predict game outcomes or identify potential draft picks.
Regression analysis is a statistical method used in predictive analysis to identify and quantify the
relationship between a dependent variable (also known as the outcome or target variable) and one or
more independent variables (also known as predictor variables).
The role of regression in predictive analysis is to build a model that can predict the value of the
dependent variable based on the values of the independent variables. The regression model uses the
historical data to identify the relationship between the dependent variable and the independent variables
and then applies that relationship to make predictions about future outcomes.
There are several types of regression models that can be used in predictive analysis, including linear
regression, logistic regression, and polynomial regression, among others. Each of these models is used to
model different types of relationships between the dependent variable and the independent variables.
Linear regression is one of the most commonly used regression models in predictive analysis. It
assumes a linear relationship between the dependent variable and the independent variables and
estimates the coefficients of the linear equation to predict the value of the dependent variable.
Logistic regression, on the other hand, is used when the dependent variable is binary, meaning it can
only take on two values (e.g., true/false or 0/1). Logistic regression estimates the probability of the
dependent variable being in one category or another based on the values of the independent variables.
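The probability in logistic regression comes from applying the logistic (sigmoid) function to a linear combination of the predictors. A minimal sketch follows; the credit-default scenario and the coefficients are invented for illustration, not taken from a fitted model.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_default_probability(income, debt):
    """Hypothetical fitted model: P(default) from income and debt.
    The coefficients below are made up for illustration."""
    z = -1.0 - 0.00005 * income + 0.0008 * debt
    return sigmoid(z)

p = predict_default_probability(income=40000, debt=5000)
label = 1 if p >= 0.5 else 0   # classify using a 0.5 threshold
```

The sigmoid squashes the linear score `z` into (0, 1), which is why logistic regression outputs probabilities rather than raw values.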
Polynomial regression is used when the relationship between the dependent variable and the
independent variables is nonlinear. It estimates a polynomial equation to predict the value of the
dependent variable.
2. Simple linear regression
Simple linear regression models the relationship between a single independent variable and a dependent
variable by fitting a straight line to the data. The main goal of simple linear regression is to identify and quantify the relationship between the
dependent variable and the independent variable. This is done by estimating a linear equation that best
fits the data and using it to predict the value of the dependent variable for a given value of the
independent variable.
The formula for simple linear regression is:
y = b0 + b1*x + e
Where:
y = the dependent (predicted) variable
b0 = the intercept (the value of y when x = 0)
b1 = the slope (the change in y for a one-unit change in x)
x = the independent (predictor) variable
e = the error term (the part of y not explained by the line)
The slope and intercept are estimated using a method called least squares regression, which minimizes
the sum of the squared errors between the actual values of y and the predicted values of y.
For example, let's say we want to analyze the relationship between the number of hours studied (x) and
the test score (y) for a group of students. We collect data on 10 students and obtain the following results:
Hours Studied (x) Test Score (y)
2 60
4 70
5 75
6 80
8 90
9 95
10 96
11 100
12 105
14 110
Using simple linear regression, we can estimate the equation of the line that best fits the data. From the
table, n = 10, ∑x = 81, ∑y = 881, ∑xy = 7690, and ∑x² = 787. The slope (b1) and intercept (b0) of the
line are estimated as follows:
b1 = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
   = (10*7690 − 81*881) / (10*787 − 81²)
   = 5539 / 1309 ≈ 4.23
b0 = (∑y − b1∑x) / n
   = (881 − 4.23*81) / 10
   ≈ 53.83
So the estimated regression line is:
y = 53.83 + 4.23*x
Using this equation, we can predict the test score for a student who studies for 7 hours:
y ≈ 53.83 + 4.23*7 ≈ 83.4
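This least-squares computation can be checked with a short pure-Python script that works directly from the data in the table:

```python
# Hours studied (x) and test scores (y) from the table above.
x = [2, 4, 5, 6, 8, 9, 10, 11, 12, 14]
y = [60, 70, 75, 80, 90, 95, 96, 100, 105, 110]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_xx = sum(a * a for a in x)

# Least-squares estimates of slope (b1) and intercept (b0).
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
b0 = (sum_y - b1 * sum_x) / n

# Predicted test score for a student who studies 7 hours.
prediction = b0 + b1 * 7
print(round(b1, 2), round(b0, 2), round(prediction, 1))
```

The same estimates are available from `statistics.linear_regression` in Python 3.10+, but writing out the sums makes the least-squares formulas explicit.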
3. Multiple linear regression
Multiple linear regression is used to estimate the relationship between two or more independent
variables and one dependent variable. You can use multiple linear regression when you want to know:
1. How strong the relationship is between two or more independent variables and one dependent
variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
2. The value of the dependent variable at a certain value of the independent variables (e.g. the
expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).
Multiple linear regression makes several assumptions about the data:
Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t change
significantly across the values of the independent variable.
Independence of observations: the observations in the dataset were collected using statistically valid
sampling methods, and there are no hidden relationships among variables.
In multiple linear regression, it is possible that some of the independent variables are actually correlated
with one another, so it is important to check these before developing the regression model. If two
independent variables are too highly correlated (r2 > ~0.6), then only one of them should be used in the
regression model.
Linearity: the line of best fit through the data points is a straight line, rather than a curve or some sort of
grouping factor.
Let's take an example of multiple linear regression to understand it better. Suppose we are interested in
predicting the salary of an employee based on their level of education, years of experience, and age. In
this case, salary is the dependent variable and education, years of experience, and age are the
independent variables.
To build a multiple linear regression model, we first need to collect data on the dependent and
independent variables from a sample of employees. Let's say we have collected data on 100 employees
and fit a model of the form:
Salary = b0 + b1*Education + b2*Experience + b3*Age
Where:
Salary = the dependent variable (annual salary in dollars)
Education = years of education
Experience = years of work experience
Age = age in years
b0, b1, b2, b3 = the regression coefficients estimated from the sample
For example, let's say we want to predict the salary of an employee who has 16 years of education, 8
years of experience, and is 40 years old. Plugging these values into the fitted equation, suppose we
obtain $104,703.4. So, based on our model, we would predict that this employee's salary is $104,703.4.
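A model of this form can be fitted by solving the least-squares normal equations X'X b = X'y. The sketch below does this in pure Python; the eight employee records are made up (they are not the 100-employee sample described above), and the salaries are generated from a known rule so the fit can be checked exactly.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_ols(rows, targets):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations X'X b = X'y."""
    X = [[1.0] + list(r) for r in rows]           # prepend an intercept column
    k = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(len(X))) for b in range(k)]
           for a in range(k)]
    Xty = [sum(X[i][a] * targets[i] for i in range(len(X))) for a in range(k)]
    return solve(XtX, Xty)

# Hypothetical employees: (education years, experience years, age).
rows = [(12, 2, 25), (16, 5, 30), (14, 10, 40), (18, 3, 28),
        (12, 8, 35), (20, 6, 33), (16, 12, 45), (14, 4, 29)]
# Salaries generated from a known rule, so the fit is checkable:
# salary = 20000 + 3000*education + 2000*experience + 500*age
salary = [20000 + 3000 * e + 2000 * x + 500 * a for e, x, a in rows]

b0, b1, b2, b3 = fit_ols(rows, salary)
predicted = b0 + b1 * 16 + b2 * 8 + b3 * 40  # 16 yrs education, 8 yrs experience, age 40
```

Because the salaries here are exactly linear in the predictors, least squares recovers the generating coefficients; with real, noisy data the coefficients are estimates with standard errors.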
4. Interpreting regression coefficients
Consider interpreting the coefficients in a model with two predictors: a continuous and a categorical variable.
The example here is a linear regression model. But this works the same way for interpreting coefficients
from any regression model without interactions.
A linear regression model with two predictor variables results in the following equation:
Y = B0 + B1*X1 + B2*X2
where Y is the predicted value, B0 is the intercept, and B1 and B2 are the coefficients of X1 and X2.
One example would be a model of the height of a shrub (Y) based on the amount of bacteria in the soil
(X1) and whether the plant is located in partial or full sun (X2).
Height is measured in cm. Bacteria is measured in thousand per ml of soil. And type of sun = 0 if the
plant is in partial sun and type of sun = 1 if the plant is in full sun.
Let’s say it turned out that the regression equation was estimated as follows:
Y = 42 + 2.3*X1 + 11*X2
B0, the intercept, is the expected value of Y when both predictors are 0: here, we would expect an
average height of 42 cm for shrubs in partial sun with no bacteria in the soil.
However, this is only a meaningful interpretation if it is reasonable that both X1 and X2 can be 0, and if
the data set actually included values for X1 and X2 that were near 0.
If neither of these conditions are true, then B0 really has no meaningful interpretation. It just anchors the
regression line in the right place. In our case, it is easy to see that X2 sometimes is 0, but if X1, our
bacteria level, never comes close to 0, then our intercept has no real interpretation.
B1 is the coefficient of X1, the continuous predictor. It means that if X1 differed by one unit (and X2 did not differ), Y would differ by B1 units, on average.
In our example, shrubs with a 5000/ml bacteria count would, on average, be 2.3 cm taller than those
with a 4000/ml bacteria count. They likewise would be about 2.3 cm taller than those with 3000/ml
bacteria, as long as they were in the same type of sun.
(Don’t forget that since the measurement unit for bacteria count is 1000 per ml of soil, 1000 bacteria
represent one unit of X1).
B2 is then the average difference in Y between the category for which X2 = 0 (the reference group) and
the category for which X2 = 1 (the comparison group).
So compared to shrubs that were in partial sun, we would expect shrubs in full sun to be 11 cm taller, on
average, at the same level of soil bacteria.
Therefore, each coefficient does not measure the total effect on Y of its corresponding variable. It would
if it were the only predictor variable in the model. Or if the predictors were independent of each other.
Rather, each coefficient represents the additional effect of adding that variable to the model, if the
effects of all other variables in the model are already accounted for.
This means that adding or removing variables from the model will change the coefficients. This is not a
problem, as long as you understand why and interpret accordingly.
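These interpretations can be checked directly from the estimated equation Y = 42 + 2.3*X1 + 11*X2, treating it as a prediction function:

```python
def height(bacteria, full_sun):
    """Predicted shrub height in cm.
    bacteria: soil bacteria in thousands per ml; full_sun: 1 for full sun, 0 for partial."""
    return 42 + 2.3 * bacteria + 11 * full_sun

# B1: one extra unit of bacteria (1000/ml) adds 2.3 cm, sun type held fixed.
print(height(5, 1) - height(4, 1))   # 2.3 (up to float rounding)

# B2: full sun adds 11 cm over partial sun at the same bacteria level.
print(height(5, 1) - height(5, 0))   # 11.0
```

Holding the other predictor fixed in each comparison is exactly the "all other variables already accounted for" condition described above.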
5. Visual data analysis techniques
There are several techniques for visual data analysis that can help to gain insights and communicate
findings effectively. Here are some commonly used techniques:
Histograms: Histograms are used to visualize the distribution of a variable. The data is divided into bins
or intervals and the number of observations in each bin is plotted on the y-axis. This helps to identify
patterns such as skewness, kurtosis, and multimodality.
Scatter plots: Scatter plots are used to visualize the relationship between two variables. Each
observation is plotted as a point with the x-axis representing one variable and the y-axis representing the
other variable. This helps to identify patterns such as linearity, curvature, and outliers.
Box plots: Box plots are used to visualize the distribution of a variable and to compare the distributions
of two or more groups. A box is drawn with the top and bottom representing the upper and lower
quartiles, and the line inside the box representing the median. The whiskers represent the range of the
data and outliers are shown as points.
Bar charts: Bar charts are used to visualize the frequency or proportion of categorical variables. The
categories are plotted on the x-axis and the frequency or proportion is plotted on the y-axis.
Heat maps: Heat maps are used to visualize the relationship between two categorical variables. The
categories are plotted on both the x and y axes and the cells are colored according to the frequency or
proportion of each combination.
Line charts: Line charts are used to visualize the change in a variable over time. The time variable is
plotted on the x-axis and the variable of interest is plotted on the y-axis.
Bubble charts: Bubble charts are used to visualize the relationship between three variables. The x-axis
and y-axis represent two variables and the size of the bubble represents the third variable.
Geographic maps: Geographic maps are used to visualize the distribution of a variable geographically.
The variable of interest is plotted on a map and the intensity of the color or shading represents the
magnitude of the variable. This helps to identify patterns such as spatial clustering or outliers.
Histograms
A histogram is a graphical display of data using bars of different heights. In a histogram, each bar
groups numbers into ranges. Taller bars show that more data falls in that range. A histogram displays the
shape and spread of continuous sample data.
It is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set of
continuous data. This allows the inspection of the data for its underlying distribution (e.g., normal
distribution), outliers, skewness, etc. It is an accurate representation of the distribution of numerical
data, and it relates to only one variable. A histogram uses bins (or buckets): ranges of values that divide
the entire span of the data into a series of intervals, with a count of how many values fall into each
interval. Bins are consecutive, non-overlapping intervals of a variable. As the adjacent bins leave no
gaps, the rectangles of a histogram touch each other to indicate that the original variable is continuous.
Histograms are based on area, not height of bars
In a histogram, the height of the bar does not necessarily indicate how many occurrences of scores there
were within each bin. It is the product of height multiplied by the width of the bin that indicates the
frequency of occurrences within that bin. One reason the height of the bars is often incorrectly read as
the frequency (rather than the area) is that many histograms have equally spaced bins, and under those
circumstances the height of the bar does reflect the frequency.
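Bin counting, and the height-versus-area point, can be illustrated in a few lines of Python. The scores and bin edges are hypothetical, chosen so that one bin is wider than the others:

```python
# Hypothetical sample of 12 exam scores.
scores = [52, 55, 58, 61, 64, 65, 67, 71, 74, 78, 85, 93]

# Unequal-width bins: [50, 60), [60, 70), [70, 90), [90, 100).
edges = [50, 60, 70, 90, 100]

# Count how many values fall into each consecutive, non-overlapping bin.
counts = [
    sum(1 for s in scores if lo <= s < hi)
    for lo, hi in zip(edges, edges[1:])
]

# For a density histogram, bar HEIGHT is count / (n * bin width), so the
# frequency is recovered from height * width, i.e. from the bar's AREA.
n = len(scores)
heights = [
    c / (n * (hi - lo))
    for c, (lo, hi) in zip(counts, zip(edges, edges[1:]))
]
```

With these bins, the third bar is twice as wide as the first two, so comparing bar heights alone would understate how many scores it contains; the areas, which sum to 1, tell the true story.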
Heat Maps
A heat map is a data visualization technique that uses colour the way a bar graph uses height and width:
each value is encoded as a colour.
If you’re looking at a web page and you want to know which areas get the most attention, a heat map
shows you in a visual way that’s easy to assimilate and make decisions from. It is a graphical
representation of data where the individual values contained in a matrix are represented as colours.
Useful for two purposes: for visualizing correlation tables and for visualizing missing values in the data.
In both cases, the information is conveyed in a two-dimensional table.
Note that heat maps are useful when examining a large number of values, but they are not a replacement
for more precise graphical displays, such as bar charts, because colour differences cannot be perceived
accurately.
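A correlation table of the kind a heat map displays can be computed in pure Python. The three series below are made up for illustration; each cell of the resulting matrix would be mapped to a colour in the heat map.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

# Hypothetical variables for a 3x3 correlation table.
data = {
    "ads":   [10, 20, 30, 40, 50],
    "sales": [12, 24, 33, 41, 52],
    "churn": [9, 7, 6, 4, 2],
}

names = list(data)
corr = [[pearson(data[r], data[c]) for c in names] for r in names]
```

The diagonal is 1 (every variable correlates perfectly with itself), strong positive relationships show values near +1, and strong negative ones near −1; a heat map colours this matrix so those extremes stand out at a glance.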
Charts
Line Chart
The simplest technique, a line plot is used to plot the relationship or dependence of one variable on
another. In most plotting libraries (for example, Matplotlib), this takes a single call to a plot function.
Bar Charts
Bar charts are used for comparing the quantities of different categories or groups. Values of a category
are represented with the help of bars and they can be configured with vertical or horizontal bars, with the
length or height of each bar representing the value.
Pie Chart
It is a circular statistical graph which is divided into slices to illustrate numerical proportion. Here the arc
length of each slice is proportional to the quantity it represents. As a rule, they are used to compare the
parts of a whole and are most effective when there are limited components and when text and
percentages are included to describe the content. However, they can be difficult to interpret because the
human eye has a hard time estimating areas and comparing visual angles.
Scatter Charts
Another common visualization technique is a scatter plot that is a two-dimensional plot representing the
joint variation of two data items. Each marker (symbols such as dots, squares and plus signs) represents
an observation. The marker position indicates the value for each observation. When you assign more
than two measures, a scatter plot matrix is produced: a series of scatter plots displaying every possible
pairing of the measures that are assigned to the visualization. Scatter plots are used for examining the
relationship, or correlations, between X and Y variables.
Bubble Charts
It is a variation of scatter chart in which the data points are replaced with bubbles, and an additional
dimension of data is represented in the size of the bubbles.
Timeline Charts
Timeline charts illustrate events, in chronological order — for example the progress of a project,
advertising campaign, acquisition process — in whatever unit of time the data was recorded — for
example week, month, year, quarter. It shows the chronological sequence of past or future events on a
timescale.
Tree Maps
A treemap is a visualization that displays hierarchically organized data as a set of nested rectangles,
parent elements being tiled with their child elements. The sizes and colours of rectangles are
proportional to the values of the data points they represent. A leaf node rectangle has an area
proportional to the specified dimension of the data. Depending on the choice, the leaf node is coloured,
sized or both according to chosen attributes. They make efficient use of space and can display thousands
of items on the screen simultaneously.
Network Diagrams
Another visualization technique that can be used for semi-structured or unstructured data is the network
diagram. Network diagrams represent relationships as nodes (individual actors within the network) and
ties (relationships between the individuals). They are used in many applications, for example for
analysis of social networks or mapping product sales across geographic areas.
6. Interaction techniques
Interactive data visualization refers to the creation of visual displays of data that allow users to interact
with the data and manipulate it in real-time. This type of visualization provides a dynamic and engaging
way to explore and understand data.
Interactive data visualization is typically implemented using specialized software tools that enable the
creation of interactive dashboards, reports, and charts. These tools allow users to interact with data in
various ways, such as by filtering, sorting, zooming, panning, and selecting data points.
One of the key benefits of interactive data visualization is that it enables users to gain insights and
answer questions about the data quickly and efficiently. Users can explore the data at their own pace,
drilling down into specific areas of interest, and discovering patterns and trends that may not be
immediately apparent in static visualizations.
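Under the hood, interactions such as filtering, sorting, and drilling down amount to re-querying the data behind the chart and redrawing it. A toy sketch of that filter-and-sort step in Python follows; the sales records and field names are hypothetical.

```python
# Hypothetical records behind a sales dashboard.
records = [
    {"region": "North", "month": "Jan", "sales": 120},
    {"region": "South", "month": "Jan", "sales": 95},
    {"region": "North", "month": "Feb", "sales": 140},
    {"region": "South", "month": "Feb", "sales": 110},
]

def apply_interaction(rows, region=None, sort_by=None):
    """Simulate a dashboard interaction: optionally filter by region, then sort descending."""
    if region is not None:
        rows = [r for r in rows if r["region"] == region]
    if sort_by is not None:
        rows = sorted(rows, key=lambda r: r[sort_by], reverse=True)
    return rows

# A user clicks the "North" filter, then sorts by sales descending;
# the chart would redraw from this filtered, sorted view.
view = apply_interaction(records, region="North", sort_by="sales")
```

Tools like Tableau or D3.js wire such re-queries to mouse events and animate the redraw, but the underlying select-filter-sort logic is the same.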
Interactive data visualization also supports collaboration and sharing of insights. Users can easily share
their visualizations with others and collaborate on data analysis, leading to more informed decision-
making and better outcomes.
Interactive data visualization supports exploratory thinking so that decision-makers can actively
investigate intriguing findings. Interactive visualization supports faster decision making, greater data
access and stronger user engagement along with desirable results in several other metrics. Some of the
key findings include:
• 70% of the interactive visualization adopters improve collaboration and knowledge sharing.
• 64% of the interactive visualization adopters improve user trust in underlying data.
• Interactive Visualization users engage data more frequently.
• Interactive visualization users are more likely than static visualization users to be satisfied with the
analytical tools they use.
Examples of interactive data visualization tools include Tableau, Power BI, D3.js, Plotly, and Bokeh.
These tools provide a wide range of capabilities for creating interactive data visualizations, from simple
charts and graphs to complex dashboards and maps.
7. Data visualization systems
There are many data visualization systems and applications available, ranging from simple charting tools
to complex business intelligence platforms. Here are some of the most popular ones:
Tableau: Tableau is a powerful data visualization and business intelligence tool that allows users to
create interactive dashboards, reports, and charts. It supports a wide range of data sources and provides
intuitive drag-and-drop interfaces for creating visualizations.
Power BI: Power BI is a cloud-based business intelligence platform that enables users to visualize and
analyze data from a variety of sources. It offers a range of visualization tools, including charts, graphs,
and maps, as well as the ability to create custom dashboards and reports.
D3.js: D3.js is a JavaScript library that provides a framework for creating custom, interactive data
visualizations. It supports a wide range of data formats and provides a flexible API for building complex
visualizations.
Excel: Excel is a popular spreadsheet application that includes a range of charting and graphing tools for
visualizing data. It is widely used in business and finance for creating simple visualizations.
QlikView: QlikView is a business intelligence platform that allows users to create interactive
dashboards and reports. It supports a wide range of data sources and provides advanced visualization
capabilities, such as heat maps and scatter plots.
Google Data Studio: Google Data Studio is a cloud-based data visualization tool that allows users to
create interactive reports and dashboards. It supports a wide range of data sources and provides intuitive
drag-and-drop interfaces for creating visualizations.
Plotly: Plotly is a web-based data visualization tool that allows users to create interactive charts and
graphs. It supports a wide range of data formats and provides a flexible API for building custom
visualizations.
8. Data visualization applications:
Data visualization applications are software tools that enable users to create visual displays of data in a
variety of formats, such as charts, graphs, maps, and dashboards. These applications are used across a
wide range of industries and domains, including business, science, engineering, healthcare, and social
sciences.
Here are some examples of how data visualization applications are used in different domains:
Business: In business, data visualization applications are used to track performance metrics, analyze
customer behavior, and identify trends and patterns in sales data. For example, a retailer might use a data
visualization application to create a dashboard that displays sales by product category, store location,
and time period.
Science: In science, data visualization applications are used to visualize experimental results, model
complex systems, and communicate research findings. For example, a biologist might use a data
visualization application to create a heat map of gene expression data, highlighting regions of the
genome that are active in response to specific stimuli.
Engineering: In engineering, data visualization applications are used to analyze and optimize complex
systems, such as manufacturing processes and supply chain networks. For example, a manufacturer
might use a data visualization application to create a flow chart of production processes, identifying
bottlenecks and areas for improvement.
Healthcare: In healthcare, data visualization applications are used to analyze patient data, track disease
outbreaks, and monitor healthcare delivery. For example, a public health agency might use a data
visualization application to create a map of COVID-19 cases by region, highlighting areas of high
transmission rates and targeting interventions accordingly.
Social Sciences: In social sciences, data visualization applications are used to analyze survey data, track
social trends, and visualize complex social networks. For example, a sociologist might use a data
visualization application to create a network diagram of interpersonal relationships in a community,
identifying key nodes and clusters of social activity.
Overall, data visualization applications play a critical role in helping users to understand and
communicate complex data in a clear and concise manner, enabling better decision-making and driving
innovation in a wide range of fields.