Bda Unit 5
Syllabus:
UNIT V:
Predictive Analytics and Visualizations: Predictive Analytics,
Simple linear regression, Multiple linear regression, Interpretation
of regression coefficients, Visualizations, Visual data analysis
techniques, interaction techniques, Systems and application
1. Predictive analytics
Predictive analytics is the use of statistical algorithms, machine learning techniques, and data mining to
analyze historical data and make predictions about future events or behaviors. It involves using
statistical models to identify patterns in historical data, which can then be used to make predictions
about future outcomes.
Predictive analytics is used in a wide range of industries, including finance, healthcare, marketing, and
retail. Some common applications of predictive analytics include:
Fraud detection: Predictive analytics can be used to identify fraudulent transactions by analyzing
patterns in historical data.
Customer retention: Predictive analytics can be used to identify customers who are at risk of leaving a
company and develop strategies to retain them.
Inventory management: Predictive analytics can be used to forecast demand for products and optimize
inventory levels to minimize stockouts and overstocking.
Marketing: Predictive analytics can be used to target customers with personalized offers based on their
past behavior and predicted future behavior.
Risk management: Predictive analytics can be used to assess the likelihood of future events, such as
credit defaults or insurance claims, and manage risk accordingly.
Predictive analytics involves several steps, including data collection, data cleaning, data preparation,
model development, model validation, and deployment. The goal is to develop accurate and reliable
models that can be used to make predictions about future outcomes.
The predictive-analysis process involves collecting historical data, identifying patterns and
relationships, and using this information to make informed predictions about future trends or events. It
typically proceeds through the following stages:
Data Collection: The first step in predictive analysis is to collect and organize relevant data. This may
involve gathering data from multiple sources, such as databases, spreadsheets, and online sources.
Data Cleaning and Preparation: Once the data has been collected, it must be cleaned and prepared for
analysis. This may involve removing missing data, handling outliers, and transforming the data to ensure
that it is in a format that can be analyzed.
Data Exploration: In this stage, analysts use various techniques to explore the data and identify patterns
and relationships. This may involve data visualization, descriptive statistics, and hypothesis testing.
Model Building: Once the data has been explored and understood, analysts can begin building
predictive models. This involves selecting an appropriate algorithm, training the model on the historical
data, and tuning the model to optimize its performance.
Model Validation: After the model has been built, it must be validated to ensure that it is accurate and
reliable. This may involve testing the model on a separate set of data or using cross-validation techniques.
Deployment: Finally, the predictive model can be deployed to make predictions about future events or
outcomes. This may involve integrating the model into a larger system or using it to guide decision-
making in a business or organizational context.
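The stages above can be sketched end-to-end in a few lines of Python. This is a minimal illustration, not a production pipeline: the demand figures are made up, and the "model" is a deliberately simple moving-average baseline standing in for a real predictive model.

```python
import statistics

# Hypothetical historical data: monthly product demand over two years.
history = [120, 135, 128, 140, 150, 149, 160, 155, 170, 165,
           180, 175, 190, 185, 200, 195, 210, 205, 220, 215,
           230, 225, 240, 235]

# Data cleaning: drop obviously invalid (negative) observations.
clean = [d for d in history if d >= 0]

# Model building: a naive baseline that predicts the recent average.
train, test = clean[:18], clean[18:]

def predict(window):
    """Forecast the next value as the mean of the last 6 observations."""
    return statistics.mean(window[-6:])

# Model validation: one-step-ahead mean absolute error on held-out data.
errors = []
window = list(train)
for actual in test:
    errors.append(abs(actual - predict(window)))
    window.append(actual)
mae = statistics.mean(errors)

# Deployment: forecast the next (future) month.
forecast = predict(window)
print(round(forecast, 1))  # 227.5
```

Swapping the baseline `predict` function for a fitted regression model, while keeping the same clean/train/validate/deploy skeleton, is the usual next step.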
Predictive analysis has a wide range of applications in various fields, including finance, healthcare,
marketing, and sports. For example, in finance, predictive analysis can be used to forecast stock prices
or identify potential risks in investment portfolios. In healthcare, predictive analysis can be used to
predict patient outcomes or identify potential health risks. In marketing, predictive analysis can be used
to identify potential customers or predict sales trends. In sports, predictive analysis can be used to
predict game outcomes or identify potential draft picks.
Regression analysis is a statistical method used in predictive analysis to identify and quantify the
relationship between a dependent variable (also known as the outcome or target variable) and one or
more independent variables (also known as predictor variables).
The role of regression in predictive analysis is to build a model that can predict the value of the
dependent variable based on the values of the independent variables. The regression model uses the
historical data to identify the relationship between the dependent variable and the independent variables
and then applies that relationship to make predictions about future outcomes.
There are several types of regression models that can be used in predictive analysis, including linear
regression, logistic regression, and polynomial regression, among others. Each of these models is used to
model different types of relationships between the dependent variable and the independent variables.
Linear regression is one of the most commonly used regression models in predictive analysis. It
assumes a linear relationship between the dependent variable and the independent variables and
estimates the coefficients of the linear equation to predict the value of the dependent variable.
Logistic regression, on the other hand, is used when the dependent variable is binary, meaning it can
only take on two values (e.g., true/false or 0/1). Logistic regression estimates the probability of the
dependent variable being in one category or another based on the values of the independent variables.
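The probability in logistic regression comes from applying the logistic (sigmoid) function to a linear combination of the predictors. A minimal sketch follows; the credit-default scenario and the coefficients are invented for illustration, not taken from a fitted model.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_default_probability(income, debt):
    """Hypothetical fitted model: P(default) from income and debt.
    The coefficients below are made up for illustration."""
    z = -1.0 - 0.00005 * income + 0.0008 * debt
    return sigmoid(z)

p = predict_default_probability(income=40000, debt=5000)
label = 1 if p >= 0.5 else 0   # classify using a 0.5 threshold
```

The sigmoid squashes the linear score `z` into (0, 1), which is why logistic regression outputs probabilities rather than raw values.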
Polynomial regression is used when the relationship between the dependent variable and the
independent variables is nonlinear. It estimates a polynomial equation to predict the value of the
dependent variable.
2. Simple linear regression
Simple linear regression models the relationship between a single independent variable and a dependent
variable by fitting a straight line to the data. The main goal of simple linear regression is to identify and quantify the relationship between the
dependent variable and the independent variable. This is done by estimating a linear equation that best
fits the data and using it to predict the value of the dependent variable for a given value of the
independent variable.
The formula for simple linear regression is:
y = b0 + b1*x + e
Where:
y = the dependent (predicted) variable
b0 = the intercept (the value of y when x = 0)
b1 = the slope (the change in y for a one-unit change in x)
x = the independent (predictor) variable
e = the error term (the part of y not explained by the line)
The slope and intercept are estimated using a method called least squares regression, which minimizes
the sum of the squared errors between the actual values of y and the predicted values of y.
For example, let's say we want to analyze the relationship between the number of hours studied (x) and
the test score (y) for a group of students. We collect data on 10 students and obtain the following results:
Hours Studied (x) Test Score (y)
2 60
4 70
5 75
6 80
8 90
9 95
10 96
11 100
12 105
14 110
Using simple linear regression, we can estimate the equation of the line that best fits the data. From the
table, n = 10, ∑x = 81, ∑y = 881, ∑xy = 7690, and ∑x² = 787. The slope (b1) and intercept (b0) of the
line are estimated as follows:
b1 = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
   = (10*7690 − 81*881) / (10*787 − 81²)
   = 5539 / 1309 ≈ 4.23
b0 = (∑y − b1∑x) / n
   = (881 − 4.23*81) / 10
   ≈ 53.83
So the estimated regression line is:
y = 53.83 + 4.23*x
Using this equation, we can predict the test score for a student who studies for 7 hours:
y ≈ 53.83 + 4.23*7 ≈ 83.4
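This least-squares computation can be checked with a short pure-Python script that works directly from the data in the table:

```python
# Hours studied (x) and test scores (y) from the table above.
x = [2, 4, 5, 6, 8, 9, 10, 11, 12, 14]
y = [60, 70, 75, 80, 90, 95, 96, 100, 105, 110]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_xx = sum(a * a for a in x)

# Least-squares estimates of slope (b1) and intercept (b0).
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
b0 = (sum_y - b1 * sum_x) / n

# Predicted test score for a student who studies 7 hours.
prediction = b0 + b1 * 7
print(round(b1, 2), round(b0, 2), round(prediction, 1))
```

The same estimates are available from `statistics.linear_regression` in Python 3.10+, but writing out the sums makes the least-squares formulas explicit.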
3. Multiple linear regression
Multiple linear regression is used to estimate the relationship between two or more independent
variables and one dependent variable. You can use multiple linear regression when you want to know:
1. How strong the relationship is between two or more independent variables and one dependent
variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
2. The value of the dependent variable at a certain value of the independent variables (e.g. the
expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).
Multiple linear regression makes several assumptions about the data:
Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t change
significantly across the values of the independent variable.
Independence of observations: the observations in the dataset were collected using statistically valid
sampling methods, and there are no hidden relationships among variables.
In multiple linear regression, it is possible that some of the independent variables are actually correlated
with one another, so it is important to check these before developing the regression model. If two
independent variables are too highly correlated (r2 > ~0.6), then only one of them should be used in the
regression model.
Linearity: the line of best fit through the data points is a straight line, rather than a curve or some sort of
grouping factor.
Let's take an example of multiple linear regression to understand it better. Suppose we are interested in
predicting the salary of an employee based on their level of education, years of experience, and age. In
this case, salary is the dependent variable and education, years of experience, and age are the
independent variables.
To build a multiple linear regression model, we first need to collect data on the dependent and
independent variables from a sample of employees. Let's say we have collected data on 100 employees
and fit a model of the form:
Salary = b0 + b1*Education + b2*Experience + b3*Age
Where:
Salary = the dependent variable (annual salary in dollars)
Education = years of education
Experience = years of work experience
Age = age in years
b0, b1, b2, b3 = the regression coefficients estimated from the sample
For example, let's say we want to predict the salary of an employee who has 16 years of education, 8
years of experience, and is 40 years old. Plugging these values into the fitted equation, suppose we
obtain $104,703.4. So, based on our model, we would predict that this employee's salary is $104,703.4.
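A model of this form can be fitted by solving the least-squares normal equations X'X b = X'y. The sketch below does this in pure Python; the eight employee records are made up (they are not the 100-employee sample described above), and the salaries are generated from a known rule so the fit can be checked exactly.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_ols(rows, targets):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations X'X b = X'y."""
    X = [[1.0] + list(r) for r in rows]           # prepend an intercept column
    k = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(len(X))) for b in range(k)]
           for a in range(k)]
    Xty = [sum(X[i][a] * targets[i] for i in range(len(X))) for a in range(k)]
    return solve(XtX, Xty)

# Hypothetical employees: (education years, experience years, age).
rows = [(12, 2, 25), (16, 5, 30), (14, 10, 40), (18, 3, 28),
        (12, 8, 35), (20, 6, 33), (16, 12, 45), (14, 4, 29)]
# Salaries generated from a known rule, so the fit is checkable:
# salary = 20000 + 3000*education + 2000*experience + 500*age
salary = [20000 + 3000 * e + 2000 * x + 500 * a for e, x, a in rows]

b0, b1, b2, b3 = fit_ols(rows, salary)
predicted = b0 + b1 * 16 + b2 * 8 + b3 * 40  # 16 yrs education, 8 yrs experience, age 40
```

Because the salaries here are exactly linear in the predictors, least squares recovers the generating coefficients; with real, noisy data the coefficients are estimates with standard errors.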
4. Interpreting regression coefficients
Consider interpreting the coefficients in a model with two predictors: a continuous and a categorical variable.
The example here is a linear regression model. But this works the same way for interpreting coefficients
from any regression model without interactions.
A linear regression model with two predictor variables results in the following equation:
Y = B0 + B1*X1 + B2*X2
where Y is the predicted value, B0 is the intercept, and B1 and B2 are the coefficients of X1 and X2.
One example would be a model of the height of a shrub (Y) based on the amount of bacteria in the soil
(X1) and whether the plant is located in partial or full sun (X2).
Height is measured in cm. Bacteria is measured in thousand per ml of soil. And type of sun = 0 if the
plant is in partial sun and type of sun = 1 if the plant is in full sun.
Let’s say it turned out that the regression equation was estimated as follows:
Y = 42 + 2.3*X1 + 11*X2
B0, the intercept, is the expected value of Y when both predictors are 0: here, we would expect an
average height of 42 cm for shrubs in partial sun with no bacteria in the soil.
However, this is only a meaningful interpretation if it is reasonable that both X1 and X2 can be 0, and if
the data set actually included values for X1 and X2 that were near 0.
If neither of these conditions are true, then B0 really has no meaningful interpretation. It just anchors the
regression line in the right place. In our case, it is easy to see that X2 sometimes is 0, but if X1, our
bacteria level, never comes close to 0, then our intercept has no real interpretation.
B1 is the coefficient of X1, the continuous predictor. It means that if X1 differed by one unit (and X2 did not differ), Y would differ by B1 units, on average.
In our example, shrubs with a 5000/ml bacteria count would, on average, be 2.3 cm taller than those
with a 4000/ml bacteria count. They likewise would be about 2.3 cm taller than those with 3000/ml
bacteria, as long as they were in the same type of sun.
(Don’t forget that since the measurement unit for bacteria count is 1000 per ml of soil, 1000 bacteria
represent one unit of X1).
B2 is then the average difference in Y between the category for which X2 = 0 (the reference group) and
the category for which X2 = 1 (the comparison group).
So compared to shrubs that were in partial sun, we would expect shrubs in full sun to be 11 cm taller, on
average, at the same level of soil bacteria.
Therefore, each coefficient does not measure the total effect on Y of its corresponding variable. It would
if it were the only predictor variable in the model. Or if the predictors were independent of each other.
Rather, each coefficient represents the additional effect of adding that variable to the model, if the
effects of all other variables in the model are already accounted for.
This means that adding or removing variables from the model will change the coefficients. This is not a
problem, as long as you understand why and interpret accordingly.
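These interpretations can be checked directly from the estimated equation Y = 42 + 2.3*X1 + 11*X2, treating it as a prediction function:

```python
def height(bacteria, full_sun):
    """Predicted shrub height in cm.
    bacteria: soil bacteria in thousands per ml; full_sun: 1 for full sun, 0 for partial."""
    return 42 + 2.3 * bacteria + 11 * full_sun

# B1: one extra unit of bacteria (1000/ml) adds 2.3 cm, sun type held fixed.
print(height(5, 1) - height(4, 1))   # 2.3 (up to float rounding)

# B2: full sun adds 11 cm over partial sun at the same bacteria level.
print(height(5, 1) - height(5, 0))   # 11.0
```

Holding the other predictor fixed in each comparison is exactly the "all other variables already accounted for" condition described above.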
5. Visual data analysis techniques
There are several techniques for visual data analysis that can help to gain insights and communicate
findings effectively. Here are some commonly used techniques:
Histograms: Histograms are used to visualize the distribution of a variable. The data is divided into bins
or intervals and the number of observations in each bin is plotted on the y-axis. This helps to identify
patterns such as skewness, kurtosis, and multimodality.
Scatter plots: Scatter plots are used to visualize the relationship between two variables. Each
observation is plotted as a point with the x-axis representing one variable and the y-axis representing the
other variable. This helps to identify patterns such as linearity, curvature, and outliers.
Box plots: Box plots are used to visualize the distribution of a variable and to compare the distributions
of two or more groups. A box is drawn with the top and bottom representing the upper and lower
quartiles, and the line inside the box representing the median. The whiskers represent the range of the
data and outliers are shown as points.
Bar charts: Bar charts are used to visualize the frequency or proportion of categorical variables. The
categories are plotted on the x-axis and the frequency or proportion is plotted on the y-axis.
Heat maps: Heat maps are used to visualize the relationship between two categorical variables. The
categories are plotted on both the x and y axes and the cells are colored according to the frequency or
proportion of each combination.
Line charts: Line charts are used to visualize the change in a variable over time. The time variable is
plotted on the x-axis and the variable of interest is plotted on the y-axis.
Bubble charts: Bubble charts are used to visualize the relationship between three variables. The x-axis
and y-axis represent two variables and the size of the bubble represents the third variable.
Geographic maps: Geographic maps are used to visualize the distribution of a variable geographically.
The variable of interest is plotted on a map and the intensity of the color or shading represents the
magnitude of the variable. This helps to identify patterns such as spatial clustering or outliers.
Histograms
A histogram is a graphical display of data using bars of different heights. In a histogram, each bar
groups numbers into ranges. Taller bars show that more data falls in that range. A histogram displays the
shape and spread of continuous sample data.
It is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set of
continuous data. This allows the inspection of the data for its underlying distribution (e.g., normal
distribution), outliers, skewness, etc. It is an accurate representation of the distribution of numerical
data, and it relates to only one variable. A histogram uses bins (or buckets): ranges of values that divide
the entire span of the data into a series of intervals, with a count of how many values fall into each
interval. Bins are consecutive, non-overlapping intervals of a variable. As the adjacent bins leave no
gaps, the rectangles of a histogram touch each other to indicate that the original variable is continuous.
Histograms are based on area, not height of bars
In a histogram, the height of the bar does not necessarily indicate how many occurrences of scores there
were within each bin. It is the product of height multiplied by the width of the bin that indicates the
frequency of occurrences within that bin. One reason the height of the bars is often incorrectly read as
the frequency (rather than the area) is that many histograms have equally spaced bins, and under those
circumstances the height of the bar does reflect the frequency.
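Bin counting, and the height-versus-area point, can be illustrated in a few lines of Python. The scores and bin edges are hypothetical, chosen so that one bin is wider than the others:

```python
# Hypothetical sample of 12 exam scores.
scores = [52, 55, 58, 61, 64, 65, 67, 71, 74, 78, 85, 93]

# Unequal-width bins: [50, 60), [60, 70), [70, 90), [90, 100).
edges = [50, 60, 70, 90, 100]

# Count how many values fall into each consecutive, non-overlapping bin.
counts = [
    sum(1 for s in scores if lo <= s < hi)
    for lo, hi in zip(edges, edges[1:])
]

# For a density histogram, bar HEIGHT is count / (n * bin width), so the
# frequency is recovered from height * width, i.e. from the bar's AREA.
n = len(scores)
heights = [
    c / (n * (hi - lo))
    for c, (lo, hi) in zip(counts, zip(edges, edges[1:]))
]
```

With these bins, the third bar is twice as wide as the first two, so comparing bar heights alone would understate how many scores it contains; the areas, which sum to 1, tell the true story.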
Heat Maps
A heat map is a data visualization technique that uses colour the way a bar graph uses height and width:
each value is encoded as a colour.
If you’re looking at a web page and you want to know which areas get the most attention, a heat map
shows you in a visual way that’s easy to assimilate and make decisions from. It is a graphical
representation of data where the individual values contained in a matrix are represented as colours.
Useful for two purposes: for visualizing correlation tables and for visualizing missing values in the data.
In both cases, the information is conveyed in a two-dimensional table.
Note that heat maps are useful when examining a large number of values, but they are not a replacement
for more precise graphical displays, such as bar charts, because colour differences cannot be perceived
accurately.
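A correlation table of the kind a heat map displays can be computed in pure Python. The three series below are made up for illustration; each cell of the resulting matrix would be mapped to a colour in the heat map.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

# Hypothetical variables for a 3x3 correlation table.
data = {
    "ads":   [10, 20, 30, 40, 50],
    "sales": [12, 24, 33, 41, 52],
    "churn": [9, 7, 6, 4, 2],
}

names = list(data)
corr = [[pearson(data[r], data[c]) for c in names] for r in names]
```

The diagonal is 1 (every variable correlates perfectly with itself), strong positive relationships show values near +1, and strong negative ones near −1; a heat map colours this matrix so those extremes stand out at a glance.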
Charts
Line Chart
The simplest technique, a line plot is used to plot the relationship or dependence of one variable on
another. In most plotting libraries (for example, Matplotlib), this takes a single call to a plot function.
Bar Charts
Bar charts are used for comparing the quantities of different categories or groups. Values of a category
are represented with the help of bars and they can be configured with vertical or horizontal bars, with the
length or height of each bar representing the value.
Pie Chart
It is a circular statistical graph which is divided into slices to illustrate numerical proportion. Here the arc
length of each slice is proportional to the quantity it represents. As a rule, they are used to compare the
parts of a whole and are most effective when there are limited components and when text and
percentages are included to describe the content. However, they can be difficult to interpret because the
human eye has a hard time estimating areas and comparing visual angles.
Scatter Charts
Another common visualization technique is a scatter plot that is a two-dimensional plot representing the
joint variation of two data items. Each marker (symbols such as dots, squares and plus signs) represents
an observation. The marker position indicates the value for each observation. When you assign more
than two measures, a scatter plot matrix is produced: a series of scatter plots displaying every possible
pairing of the measures that are assigned to the visualization. Scatter plots are used for examining the
relationship, or correlations, between X and Y variables.
Bubble Charts
It is a variation of scatter chart in which the data points are replaced with bubbles, and an additional
dimension of data is represented in the size of the bubbles.
Timeline Charts
Timeline charts illustrate events, in chronological order — for example the progress of a project,
advertising campaign, acquisition process — in whatever unit of time the data was recorded — for
example week, month, year, quarter. It shows the chronological sequence of past or future events on a
timescale.
Tree Maps
A treemap is a visualization that displays hierarchically organized data as a set of nested rectangles,
parent elements being tiled with their child elements. The sizes and colours of rectangles are
proportional to the values of the data points they represent. A leaf node rectangle has an area
proportional to the specified dimension of the data. Depending on the choice, the leaf node is coloured,
sized or both according to chosen attributes. They make efficient use of space and can display thousands
of items on the screen simultaneously.
Network Diagrams
Another visualization technique that can be used for semi-structured or unstructured data is the network
diagram. Network diagrams represent relationships as nodes (individual actors within the network) and
ties (relationships between the individuals). They are used in many applications, for example for
analysis of social networks or mapping product sales across geographic areas.
6. Interaction techniques
Interactive data visualization refers to the creation of visual displays of data that allow users to interact
with the data and manipulate it in real-time. This type of visualization provides a dynamic and engaging
way to explore and understand data.
Interactive data visualization is typically implemented using specialized software tools that enable the
creation of interactive dashboards, reports, and charts. These tools allow users to interact with data in
various ways, such as by filtering, sorting, zooming, panning, and selecting data points.
One of the key benefits of interactive data visualization is that it enables users to gain insights and
answer questions about the data quickly and efficiently. Users can explore the data at their own pace,
drilling down into specific areas of interest, and discovering patterns and trends that may not be
immediately apparent in static visualizations.
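Under the hood, interactions such as filtering, sorting, and drilling down amount to re-querying the data behind the chart and redrawing it. A toy sketch of that filter-and-sort step in Python follows; the sales records and field names are hypothetical.

```python
# Hypothetical records behind a sales dashboard.
records = [
    {"region": "North", "month": "Jan", "sales": 120},
    {"region": "South", "month": "Jan", "sales": 95},
    {"region": "North", "month": "Feb", "sales": 140},
    {"region": "South", "month": "Feb", "sales": 110},
]

def apply_interaction(rows, region=None, sort_by=None):
    """Simulate a dashboard interaction: optionally filter by region, then sort descending."""
    if region is not None:
        rows = [r for r in rows if r["region"] == region]
    if sort_by is not None:
        rows = sorted(rows, key=lambda r: r[sort_by], reverse=True)
    return rows

# A user clicks the "North" filter, then sorts by sales descending;
# the chart would redraw from this filtered, sorted view.
view = apply_interaction(records, region="North", sort_by="sales")
```

Tools like Tableau or D3.js wire such re-queries to mouse events and animate the redraw, but the underlying select-filter-sort logic is the same.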
Interactive data visualization also supports collaboration and sharing of insights. Users can easily share
their visualizations with others and collaborate on data analysis, leading to more informed decision-
making and better outcomes.
Interactive data visualization supports exploratory thinking so that decision-makers can actively
investigate intriguing findings. Interactive visualization supports faster decision making, greater data
access and stronger user engagement along with desirable results in several other metrics. Some of the
key findings include:
• 70% of the interactive visualization adopters improve collaboration and knowledge sharing.
• 64% of the interactive visualization adopters improve user trust in underlying data.
• Interactive Visualization users engage data more frequently.
• Interactive visualization users are more likely than static visualization users to be satisfied with the
analytical tools they use.
Examples of interactive data visualization tools include Tableau, Power BI, D3.js, Plotly, and Bokeh.
These tools provide a wide range of capabilities for creating interactive data visualizations, from simple
charts and graphs to complex dashboards and maps.
7. Data visualization systems
There are many data visualization systems and applications available, ranging from simple charting tools
to complex business intelligence platforms. Here are some of the most popular ones:
Tableau: Tableau is a powerful data visualization and business intelligence tool that allows users to
create interactive dashboards, reports, and charts. It supports a wide range of data sources and provides
intuitive drag-and-drop interfaces for creating visualizations.
Power BI: Power BI is a cloud-based business intelligence platform that enables users to visualize and
analyze data from a variety of sources. It offers a range of visualization tools, including charts, graphs,
and maps, as well as the ability to create custom dashboards and reports.
D3.js: D3.js is a JavaScript library that provides a framework for creating custom, interactive data
visualizations. It supports a wide range of data formats and provides a flexible API for building complex
visualizations.
Excel: Excel is a popular spreadsheet application that includes a range of charting and graphing tools for
visualizing data. It is widely used in business and finance for creating simple visualizations.
QlikView: QlikView is a business intelligence platform that allows users to create interactive
dashboards and reports. It supports a wide range of data sources and provides advanced visualization
capabilities, such as heat maps and scatter plots.
Google Data Studio: Google Data Studio is a cloud-based data visualization tool that allows users to
create interactive reports and dashboards. It supports a wide range of data sources and provides intuitive
drag-and-drop interfaces for creating visualizations.
Plotly: Plotly is a web-based data visualization tool that allows users to create interactive charts and
graphs. It supports a wide range of data formats and provides a flexible API for building custom
visualizations.
8. Data visualization applications:
Data visualization applications are software tools that enable users to create visual displays of data in a
variety of formats, such as charts, graphs, maps, and dashboards. These applications are used across a
wide range of industries and domains, including business, science, engineering, healthcare, and social
sciences.
Here are some examples of how data visualization applications are used in different domains:
Business: In business, data visualization applications are used to track performance metrics, analyze
customer behavior, and identify trends and patterns in sales data. For example, a retailer might use a data
visualization application to create a dashboard that displays sales by product category, store location,
and time period.
Science: In science, data visualization applications are used to visualize experimental results, model
complex systems, and communicate research findings. For example, a biologist might use a data
visualization application to create a heat map of gene expression data, highlighting regions of the
genome that are active in response to specific stimuli.
Engineering: In engineering, data visualization applications are used to analyze and optimize complex
systems, such as manufacturing processes and supply chain networks. For example, a manufacturer
might use a data visualization application to create a flow chart of production processes, identifying
bottlenecks and areas for improvement.
Healthcare: In healthcare, data visualization applications are used to analyze patient data, track disease
outbreaks, and monitor healthcare delivery. For example, a public health agency might use a data
visualization application to create a map of COVID-19 cases by region, highlighting areas of high
transmission rates and targeting interventions accordingly.
Social Sciences: In social sciences, data visualization applications are used to analyze survey data, track
social trends, and visualize complex social networks. For example, a sociologist might use a data
visualization application to create a network diagram of interpersonal relationships in a community,
identifying key nodes and clusters of social activity.
Overall, data visualization applications play a critical role in helping users to understand and
communicate complex data in a clear and concise manner, enabling better decision-making and driving
innovation in a wide range of fields.