Ad404 Data Science Notes Unit-2
JAI NARAIN COLLEGE OF TECHNOLOGY, BHOPAL(M.P.)
Approved by AICTE New Delhi & Govt. of M.P.
Affiliated with Rajiv Gandhi Technical University (RGPV), Bhopal
__________________________________________________________________________________
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE
Syllabus: AD404_DATA SCIENCE
Unit II: Unstructured Data Analytics- Importance of Unstructured Data, Unstructured Data
Analytics: Descriptive, diagnostic, predictive and prescriptive data Analytics based on Case
study. Data Visualization: box plots, histograms, scatterplots, features map visualization, t-SNE.
Overview of Advance Excel- Introduction, Data validation, Introduction to charts, pivot table,
Scenario manager, Protecting data, Excel minor, Introduction to macros.
Importance of Unstructured Data:
Unstructured data, which lacks a predefined format, is crucial for businesses because it holds valuable, often untapped, insights that can inform decisions, improve customer experiences, and drive innovation, going beyond what structured data alone can provide.
Unstructured data analytics is the process of extracting valuable insights and knowledge from
data that doesn't conform to a structured format, such as text, images, audio, and video, using
specialized techniques and tools.
Predictive Analytics
Predictive analytics turns data into valuable, actionable information. It uses data to determine the probable outcome of an event or the likelihood of a situation occurring.
Predictive analytics draws on a variety of statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events. Techniques used for predictive analytics include:
● Linear Regression
● Time Series Analysis and Forecasting
● Data Mining
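As a quick illustration of the first technique, here is a minimal sketch of linear regression in Python with scikit-learn; the monthly sales figures are invented for the example:

```python
# A minimal predictive-analytics sketch: fitting a linear trend to
# historical data and forecasting a future value (figures invented).
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.array([[1], [2], [3], [4], [5], [6]])  # past months
sales = np.array([100, 110, 125, 140, 150, 165])   # past sales

model = LinearRegression()
model.fit(months, sales)                # learn the historical trend

print(model.predict(np.array([[7]])))   # probable outcome for month 7
```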
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It examines past performance by mining historical data to understand the causes of past success or failure. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify
customers or prospects into groups. Unlike a predictive model that focuses on predicting the
behavior of a single customer, descriptive analytics identifies many different relationships between customers and products.
Common examples of descriptive analytics are company reports that provide historical reviews, such as:
● Data Queries
● Reports
● Descriptive Statistics
● Data dashboard
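For instance, the "Descriptive Statistics" item above can be produced in one line with pandas; this small sketch uses invented sales figures:

```python
# Descriptive-analytics sketch: summarizing historical sales with
# descriptive statistics (figures invented).
import pandas as pd

sales = pd.Series([100, 120, 110, 150, 140, 130])
print(sales.describe())   # count, mean, std, min, quartiles, max
```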
Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical science, business rules, and machine learning to make a prediction and then suggests decision options that take advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions that benefit from the predictions and showing the decision maker the implications of each decision option. It anticipates not only what will happen and when it will happen, but also why it will happen. Further, prescriptive analytics can suggest decision options for taking advantage of a future opportunity or mitigating a future risk, and illustrate the implications of each option.
For example, prescriptive analytics can benefit healthcare strategic planning by leveraging operational and usage data combined with data on external factors such as economic conditions, population demographics, etc.
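As a toy sketch of the prescriptive idea, the following Python snippet uses linear programming (via SciPy) to suggest a decision option, here hypothetical production quantities under resource constraints; all numbers are invented:

```python
# Prescriptive-analytics sketch: recommend production quantities that
# maximize profit subject to resource limits (all numbers invented).
from scipy.optimize import linprog

c = [-40, -30]        # maximize 40x + 30y  ->  minimize the negative
A = [[1, 2],          # machine hours:  x + 2y <= 40
     [3, 1]]          # labour hours:  3x +  y <= 60
b = [40, 60]

res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
print(res.x)          # suggested quantities of products x and y
```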
Diagnostic Analytics
In this analysis, we generally use historical data to answer a question or solve a problem, looking for dependencies and patterns in the historical data of the particular problem.
For example, companies favor this analysis because it gives great insight into a problem and keeps detailed information at their disposal; otherwise, data would have to be collected afresh for every problem, which would be very time-consuming. Common techniques used for Diagnostic Analytics are:
● Data discovery
● Data mining
● Correlations
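The "Correlations" technique above is easy to demonstrate; this small sketch uses pandas with invented figures to see which factors moved together with sales:

```python
# Diagnostic-analytics sketch: pairwise correlations in historical
# data to look for dependencies (all columns invented).
import pandas as pd

df = pd.DataFrame({
    "ad_spend":  [10, 15, 12, 20, 25, 22],
    "discounts": [5, 3, 6, 2, 1, 2],
    "sales":     [100, 130, 110, 160, 190, 175],
})
print(df.corr())   # correlation matrix across all pairs of columns
```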
Data Visualization:
The primary goal of data visualization is to make data more accessible and easier to interpret, allowing users to identify patterns, trends, and outliers quickly. This is particularly important in big data, where the large volume of information can be overwhelming without effective visualization techniques.
Why is Data Visualization Important?
Let’s take an example. Suppose you compile data of the company’s profits from 2013 to 2023
and create a line chart. It would be very easy to see the line going constantly up with a drop in
just 2018. So you can observe in a second that the company has had continuous profits in all the
years except a loss in 2018.
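As a quick sketch, such a line chart could be drawn in Python with matplotlib; the profit figures below are invented to match the story (a single loss year in 2018):

```python
# Line-chart sketch of yearly profits with one loss year
# (all figures invented for illustration).
import matplotlib.pyplot as plt

years = list(range(2013, 2024))
profit = [50, 55, 60, 66, 70, -10, 75, 80, 85, 90, 95]  # 2018 loss

plt.plot(years, profit, marker="o")
plt.axhline(0, color="gray", linewidth=0.5)  # zero line shows the loss
plt.title("Company profit, 2013-2023")
plt.xlabel("Year")
plt.ylabel("Profit")
plt.show()
```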
It would not be that easy to get this information so fast from a data table. This is just one
demonstration of the usefulness of data visualization. Let’s see some more reasons why
visualization of data is so important.
1. Data Visualization Simplifies Complex Data: Large and complex data sets can be challenging to understand. Data visualization helps break down complex information into simpler, visual formats, making it easier for the audience to grasp.
2. Enhances Data Interpretation: Visualization highlights patterns, trends, and correlations in
data that might be missed in raw data form. This enhanced interpretation helps in making
informed decisions. Consider a Tableau visualization that demonstrates the relationship between sales and profit: it might show that higher sales do not necessarily equate to higher profits, a trend that could be difficult to spot in raw data alone. This perspective helps
businesses adjust strategies to focus on profitability rather than just sales volume.
3. Data Visualization Saves Time: It is much faster to gather insights from a visualization than from a raw data table. In the Tableau screenshot below, it is very easy to identify the states that have suffered a net loss rather than a profit, because all the cells with a loss are coloured red using a heat map, making it obvious which states have suffered a loss. Compare this to a normal table, where you would need to check each cell for a negative value to determine a loss. Visualizing data can save a lot of time in this situation.
4. Improves Communication: Visual representations of data make it easier to share findings
with others especially those who may not have a technical background. This is important in
business, where stakeholders need to understand data-driven insights quickly. Consider the TreeMap visualization on Tableau below, showing the number of sales in each region of the United States, with the largest rectangle representing California due to its high sales volume. This visual context is much easier to grasp than a detailed table of numbers.
5. Data Visualization Tells a Data Story: Data visualization is also a medium to tell a data story
to the viewers. The visualization can be used to present the data facts in an easy-to-understand
form while telling a story and leading the viewers to an inevitable conclusion. This data story should have a good beginning, a basic plot, and an ending that it leads towards. For
example, if a data analyst has to craft a data visualization for company executives detailing the
profits of various products then the data story can start with the profits and losses of multiple
products and move on to recommendations on how to tackle the losses.
Types of Data Visualization Analysis
Data visualization is used to analyze visually the behavior of the different variables in a dataset,
such as a relationship between data points in a variable or the distribution. Depending on the
number of variables you want to study at once, you can distinguish three types of data
visualization analysis.
● Univariate analysis: Used to summarize the behavior of only one variable at a time.
● Bivariate analysis: Helps to study the relationship between two variables.
● Multivariate analysis: Allows data practitioners to analyze more than two variables at once.
Box Plots:
A box plot summarizes a distribution using five values: the minimum, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the maximum. The area inside the box (50% of the data) is known as the Inter Quartile Range (IQR). The IQR is calculated as –
IQR = Q3-Q1
Outliers are the data points below the lower limit and above the upper limit. These limits are calculated as –
Lower Limit = Q1 - 1.5*IQR
Upper Limit = Q3 + 1.5*IQR
The values below and above these limits are considered outliers, and the minimum and maximum values are calculated from the points that lie within the lower and upper limits.
How to create a box plot:
Let us take some sample data to understand how to create a box plot.
Here are the runs scored by a cricket team in a league of 12 matches – 100, 120, 110, 150, 110,
140, 130, 170, 120, 220, 140, 110.
To draw a box plot for the given data first we need to arrange the data in ascending order and
then find the minimum, first quartile, median, third quartile and the maximum.
Ascending Order
100, 110, 110, 110, 120, 120, 130, 140, 140, 150, 170, 220
Median (Q2) = (120+130)/2 = 125; since there is an even number of values, the median is the mean of the two central values.
To find the First Quartile we take the first six values and find their median.
Q1 = (110+110)/2 = 110
For the Third Quartile, we take the next six and find their median.
Q3 = (140+150)/2 = 145
Note: If the total number of values is odd, then we exclude the median while calculating Q1 and Q3. Here, since there were two central values, we included them. Now we need to calculate the Inter Quartile Range.
IQR = Q3-Q1 = 145-110 = 35
We can now calculate the Upper and Lower Limits to find the minimum and maximum values
and also the outliers if any.
Lower Limit = Q1-1.5*IQR = 110-1.5*35 = 57.5
Upper Limit = Q3+1.5*IQR = 145+1.5*35 = 197.5
So, the minimum and maximum values within the range [57.5, 197.5] for our given data are –
Minimum = 100
Maximum = 170
The outliers which are outside this range are –
Outliers = 220
Now we have all the information, so we can draw the box plot, which is shown below.
We can see from the diagram that the median is not exactly at the center of the box and one whisker is longer than the other. We also have one outlier.
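The same plot can be reproduced in Python with matplotlib, using the runs data from the worked example; matplotlib's default whisker rule is also 1.5*IQR, so 220 appears as an outlier point:

```python
# Box plot of the runs from the worked example; matplotlib's default
# whiskers use the 1.5*IQR rule, matching the calculation above.
import matplotlib.pyplot as plt

runs = [100, 120, 110, 150, 110, 140, 130, 170, 120, 220, 140, 110]

plt.boxplot(runs)
plt.title("Runs scored in 12 matches")
plt.ylabel("Runs")
plt.show()
```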
Use-Cases of Box Plot
● Box plots provide a visual summary of the data with which we can quickly identify the central value of the data, how dispersed the data is, and whether the data is skewed (skewness).
● The Median gives you the central (middle) value of the data.
● Box plots show the skewness of the data:
a) If the Median is at the center of the box and the whiskers are almost the same at both ends, then the data is Normally Distributed.
b) If the Median lies closer to the First Quartile and the whisker at the lower end is shorter (as in the above example), then the data has a Positive Skew (Right Skew).
c) If the Median lies closer to the Third Quartile and the whisker at the upper end is shorter, then the data has a Negative Skew (Left Skew).
● The dispersion or spread of the data can be seen from the minimum and maximum values found at the ends of the whiskers.
● The box plot gives us an idea about the Outliers, the points which are numerically distant from the rest of the data.
Histograms:
A histogram is a type of graphical representation used in statistics to show the distribution of numerical data. It looks somewhat like a bar chart, but unlike bar graphs, which are used for
categorical data, histograms are designed for continuous data, grouping it into logical ranges
which are also known as "bins."
A histogram helps in visualizing the distribution of data across a continuous interval or period, which makes the data more understandable and highlights trends and patterns.
Parts of a Histogram
A histogram is a graph that represents the distribution of data. Here are the essential
components, presented in simple terms:
● Title: This is similar to the name of the histogram. It explains what the histogram is
about and what data it displays.
● X-axis: X-axis is a horizontal line at the bottom of the histogram. It displays the
categories or groups that the data is sorted into. For example, if you're measuring
people's heights, the X-axis may indicate several height ranges such as "5-6 feet" or
"6-7 feet".
● Y-axis: The Y-axis is a vertical line on the side of the histogram. It displays the
number of times something occurs in each category or group shown on the X-axis.
So, if you're measuring heights, the Y-axis may display how many individuals are in
each height range.
● Bars: Bars are the vertical rectangles you see on the chart. Each bar represents a category or group on the X-axis; its height indicates how many occurrences fall inside that category, and its width indicates the range it covers on the X-axis. Higher bars indicate more occurrences, shorter bars fewer.
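A minimal matplotlib sketch ties these parts together; the height values below are invented:

```python
# Histogram sketch: continuous data grouped into bins, with the
# parts described above (title, X-axis, Y-axis, bars).
import matplotlib.pyplot as plt

heights = [5.1, 5.3, 5.4, 5.5, 5.6, 5.8, 5.9, 6.0, 6.1, 6.3]

plt.hist(heights, bins=5, edgecolor="black")  # bars over 5 bins
plt.title("Distribution of heights")          # title
plt.xlabel("Height (feet)")                   # X-axis: ranges (bins)
plt.ylabel("Frequency")                       # Y-axis: counts per bin
plt.show()
```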
Scatter plots:
A scatter plot is a graphical technique used to represent data. A scatter plot, also called a scatter graph or scatter chart, uses dots to describe two different numeric variables. The position of each dot on the horizontal and vertical axes indicates the values for an individual data point.
The scatter plot is one of the most important data visualization techniques, and it is considered one of the Seven Basic Tools of Quality. A scatter plot is used to plot the relationship between two variables on a two-dimensional graph known mathematically as the Cartesian plane.
It is generally used to plot the relationship between one independent variable and one dependent
variable, where an independent variable is plotted on the x-axis and a dependent variable is
plotted on the y-axis so that you can visualize the effect of the independent variable on the
dependent variable. These plots are known as Scatter Plot Graph or Scatter Diagram.
Applications of Scatter Plot
As already mentioned, a scatter plot is a very useful data visualization technique. A few
applications of Scatter Plots are listed below.
● Correlation Analysis: Scatter plot is useful in the investigation of the correlation
between two different variables. It can be used to find out whether two variables have
a positive correlation, negative correlation or no correlation.
● Outlier Detection: Outliers are data points which differ markedly from the rest of the data set. A scatter plot helps bring these outliers to the surface.
● Cluster Identification: In some cases, scatter plots can help identify clusters or
groups within the data.
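A short matplotlib sketch with invented data shows the usual layout, independent variable on the x-axis and dependent variable on the y-axis:

```python
# Scatter-plot sketch: each dot is one data point; a rising pattern
# suggests a positive correlation (data invented).
import matplotlib.pyplot as plt

hours = [1, 2, 3, 4, 5, 6, 7, 8]          # independent variable
marks = [35, 42, 50, 55, 62, 70, 74, 82]  # dependent variable

plt.scatter(hours, marks)
plt.xlabel("Hours studied")
plt.ylabel("Marks obtained")
plt.show()
```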
Feature Map Visualization:
Data visualization using feature maps, especially in the context of Convolutional Neural
Networks (CNNs), helps understand what features a network learns and how it processes data
by visualizing the output of each filter at different layers. This allows for debugging,
optimization, and a deeper understanding of the network's inner workings.
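As a hedged sketch of the idea (assuming a small Keras CNN built purely for illustration, not a specific model from these notes), one can expose a convolutional layer's output and plot a single filter's feature map:

```python
# Feature-map visualization sketch: build a tiny CNN, then a sub-model
# that returns the first conv layer's activations for plotting.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

inputs = tf.keras.Input(shape=(28, 28, 1))
conv = tf.keras.layers.Conv2D(8, 3, activation="relu")(inputs)
x = tf.keras.layers.MaxPooling2D()(conv)
outputs = tf.keras.layers.Dense(10)(tf.keras.layers.Flatten()(x))
model = tf.keras.Model(inputs, outputs)    # the full network

feat_model = tf.keras.Model(inputs, conv)  # exposes the feature maps

image = np.random.rand(1, 28, 28, 1)       # stand-in for a real image
maps = feat_model.predict(image)           # shape: (1, 26, 26, 8)

plt.imshow(maps[0, :, :, 0], cmap="viridis")  # first filter's map
plt.title("Feature map of conv filter 0")
plt.show()
```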
Data Validation:
Data validation is the process of checking the accuracy, consistency, and completeness of data.
It is a type of data cleansing that is performed before using, importing, or processing data.
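In Excel this lives under the Data Validation command on the Data tab; as a scripted analogue, here is a sketch using the third-party openpyxl library (an assumption, not built into Excel) to restrict a column to whole numbers between 0 and 100:

```python
# Data-validation sketch with openpyxl: restrict A1:A100 to whole
# numbers between 0 and 100 (library choice is an assumption).
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

wb = Workbook()
ws = wb.active

dv = DataValidation(type="whole", operator="between",
                    formula1=0, formula2=100)
ws.add_data_validation(dv)
dv.add("A1:A100")          # the cells the rule applies to
wb.save("validated.xlsx")
```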
Introduction to Charts:
Charts are visual representations of data that transform information into easily understandable formats, such as graphs, diagrams, or maps, to help uncover patterns, trends, and relationships.
What they are: Charts are a powerful tool for communicating data because they make complex
information more accessible and easier to grasp at a glance.
Why they are important:
Charts are used to:
● Analyze data: They help identify patterns, trends, and relationships that might not be
immediately obvious from raw numbers.
● Emphasize a point: Charts can be used to highlight specific data points or
comparisons, making an argument more compelling.
● Compare multiple sets of data: Charts allow for easy comparison of different
datasets, making it easier to see similarities and differences.
Common chart types:
● Bar charts: Use rectangular bars to compare different categories.
● Line charts: Show trends over time by connecting data points with lines.
● Pie charts: Represent parts of a whole as slices of a circle.
● Scatter plots: Display the relationship between two variables as points on a graph.
● Area charts: Use filled areas to show trends and magnitudes.
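As one quick sketch of the first type in this list, a bar chart comparing categories can be drawn with matplotlib (figures invented):

```python
# Bar-chart sketch: rectangular bars comparing categories.
import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]
sales = [250, 180, 210, 300]

plt.bar(regions, sales)
plt.title("Sales by region")
plt.ylabel("Sales")
plt.show()
```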
How to effectively use charts
● Choose the right chart type for your data and purpose.
● Ensure your charts are clear, concise, and easy to understand.
● Use appropriate labels and titles.
● Consider your audience and tailor your charts accordingly.
Pivot Table:
In Excel, a PivotTable is an interactive tool that allows you to quickly summarize and analyze
data by grouping and aggregating values, offering flexible ways to explore and present data.
1. Creating a PivotTable:
● Select Data: Choose a cell within the data range or table you want to analyze.
● Insert PivotTable: Go to the "Insert" tab and click "PivotTable".
● Choose Location: Decide whether to place the PivotTable on a new worksheet or an
existing one, and specify the location if creating it on an existing worksheet.
● Create the Table: Click "OK" to create the blank PivotTable and display the PivotTable
Fields list.
2. Arranging Fields:
Fields List: This list on the right side of the screen shows all the fields (column headers) from your data.
Drag and Drop: Drag fields from the list into the "Rows", "Columns", "Values", and
"Filters" areas to structure your PivotTable.
● Rows: Fields that will be displayed as rows.
● Columns: Fields that will be displayed as columns.
● Values: Fields that will be summarized (e.g., sum, count, average).
● Filters: Fields that can be used to filter the data.
Summarize Values:
By default, Excel sums numerical values in the "Values" area, but you can change this to
other calculations (e.g., count, average, min, max) by right-clicking on the value field and
selecting "Value Field Settings".
3. Additional Features:
● PivotCharts: You can create charts directly from PivotTables by clicking on the "Insert
Chart" button in the PivotTable ribbon.
● Slicers: Add slicers to filter your PivotTable by specific values.
● Recommended PivotTables: Excel can suggest PivotTable layouts based on your data.
● Refresh: If your source data changes, you can refresh the PivotTable to see the updated
results.
● External Data Sources: You can also create PivotTables based on data from external
sources like SQL Server.
Example:
Let's say you have sales data with columns for "Region", "Product", and "Sales Amount". You
can create a PivotTable to see total sales by region and product:
● Select the data (including headers).
● Insert a PivotTable.
● Drag "Region" to "Rows".
● Drag "Product" to "Columns".
● Drag "Sales Amount" to "Values".
● Choose "Sum" as the calculation.
Scenario Manager:
In the context of Microsoft Excel, the "Scenario Manager" is a built-in tool that allows users
to create, save, and compare different sets of input values (cells) within a worksheet, enabling
"what-if" analysis.
Excel Minor (XLMiner): XLMiner is a robust and user-friendly solution in the vast landscape of data analysis tools. Developed to empower users to glean valuable insights from data, XLMiner simplifies the complex world of analytics. This section looks at XLMiner's different parts, including its features and functions, and how it can help make decisions based on data.
Understanding the Basics
XLMiner is a Microsoft Excel add-in designed to make data analysis accessible to all kinds of users, from beginners to seasoned professionals.
Its seamless integration with Excel provides users with a familiar environment, reducing the
learning curve and allowing for a smooth transition into data analytics.
Features and Capabilities
One of the critical strengths of XLMiner lies in its diverse set of features. From fundamental statistical analysis to advanced machine learning, XLMiner covers a wide range of analytical techniques.
Data Preparation Made Effortless
With XLMiner, data preparation becomes a breeze. Raw data, often scattered and unorganized, can be effortlessly transformed into a structured format suitable for analysis. Whether it is handling missing values, dealing with outliers, or merging datasets, XLMiner simplifies these tasks, allowing users to focus on deriving meaningful insights.
Models Built with Ease
In the realm of predictive analytics, building models can be a daunting task. XLMiner, however, streamlines this process, providing a user-friendly interface for model development. Models can be trained with minimal manual effort, with the software handling the intricate details behind the scenes. This simplicity empowers users to harness the power of predictive analytics without delving into the complexities of model building.
Exploring Advanced Analytics
For those venturing into advanced analytics, XLMiner proves to be a reliable companion. Its support for machine learning algorithms opens up possibilities for predictive modeling, classification, and clustering. Complex analyses that once required specialized knowledge can now be executed with little manual effort, thanks to XLMiner's intuitive design.
Visualizing Insights
Data visualization is essential to data analysis because it helps people understand and discuss the results. XLMiner excels in this area, offering a range of visualization options, from basic charts to interactive dashboards, that let users present their insights in a visually appealing and comprehensible way, with much of the visualization work automated to minimize the effort required from the user.
Time Series Analysis and Forecasting
In business and finance, predicting future trends is invaluable. XLMiner facilitates time series analysis and forecasting, allowing users to uncover patterns and make informed predictions without delving into the complexities of time series modeling.
Optimization for Decision-Making
Making optimal decisions is a cornerstone of effective management. XLMiner supports optimization techniques, enabling users to find the best solutions to complex problems. Whether it's resource allocation, production planning, or budget optimization, XLMiner guides users through the process, making decision-making more data-driven and informed.
Introduction to Macros:
In essence, a macro is a sequence of actions or instructions that can be recorded and then
executed repeatedly, automating tasks and saving time.
What it is: A macro is essentially a small program or script that records your actions (like
keystrokes and mouse clicks) within a software application, allowing you to perform a series of
operations with a single command.
Why use them: Macros are particularly useful for automating repetitive or complex tasks,
streamlining workflows, and increasing efficiency.
How they work: You record a macro by performing the desired actions, and then the
software stores these actions as a macro that can be run later.
Examples:
● In Microsoft Excel: You can create a macro to automatically format cells, perform
calculations, or generate reports.
● In Microsoft Word: You can use macros to insert boilerplate text, change page
layouts, or perform complex formatting tasks.
Underlying Technology:
In many applications, macros are often implemented using a scripting language, such as
VBA (Visual Basic for Applications) in Microsoft Office applications.
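As a rough analogue of the Excel example above, a recorded "format the header row" macro is sketched here as a Python script using the third-party openpyxl library (an assumption for illustration; real Excel macros are recorded as VBA):

```python
# Macro-style automation sketch: bold the header row and save, the
# kind of repetitive task a recorded macro would handle (openpyxl
# is an assumption; actual Excel macros use VBA).
from openpyxl import Workbook
from openpyxl.styles import Font

wb = Workbook()
ws = wb.active
ws.append(["Region", "Product", "Sales Amount"])  # header row
ws.append(["East", "Pens", 100])                  # sample data row

for cell in ws[1]:               # ws[1] is the first (header) row
    cell.font = Font(bold=True)  # apply bold formatting

wb.save("report.xlsx")
```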
Benefits:
● Time-saving: Automates repetitive tasks, saving significant time and effort.
● Error Reduction: Minimizes the risk of human error by automating tasks.
● Increased Efficiency: Streamlines workflows and improves overall productivity.
● Flexibility: Allows for customization and automation of a wide range of tasks.