
DataScience&Analytics DataVisualiztn

Data visualization transforms data into graphical formats like charts and graphs to enhance understanding and decision-making. It simplifies complex data, reveals patterns, and engages viewers, with various methods such as bar charts, pie charts, and treemaps for different applications. Effective visualization relies on key variables and tools to present data clearly and meaningfully.

Uploaded by

swagatatalekar06

Data Visualization

Data visualization is the process of turning data into pictures or graphics, like charts, graphs, and maps, to help
people understand and interpret information more easily. Instead of looking at rows and rows of numbers, data
visualization allows you to see patterns, trends, and outliers in a more intuitive way, helping you make decisions
based on that data.

Importance:

Simplifies Complex Data: Large sets of data can be hard to understand. By using graphs, charts, or maps, you
can see the big picture at a glance. For example, a line graph showing sales over time can instantly show whether
the sales are increasing or decreasing.
Reveals Patterns and Trends: Visuals make it easier to spot trends or patterns that might be hard to see in a
table of numbers. For instance, you could quickly spot a spike in sales during a particular month just by looking at
a bar chart.
Helps with Decision Making: When data is visualized, it’s easier to analyze, compare, and make decisions
based on it. A manager might use a dashboard showing key performance metrics to make quick business
decisions.
Engages Viewers: People tend to remember and understand visual information better than raw data. Visuals
capture attention and make complex concepts easier to grasp.
Conventional Data Visualization Methods

Bar Chart
A bar chart is a type of data visualization used to represent categorical data. It displays rectangular bars, one
per category, where the length or height of each bar is proportional to the value of the corresponding category.
Bar charts are commonly used to compare quantities such as frequencies, counts, or other measures (sums or
averages) across different groups or categories.

Categories: The discrete groups or classifications represented along one axis.
Values: The numerical data associated with each category, displayed on the other axis.
Applications:

Bar charts are commonly used for:


Comparison: To compare data across different categories or groups.
Ranking: To rank categories based on their values.
Distribution: In some cases, to show the distribution of values across categories, though histograms are
typically used for continuous data.

Advantages:
Easy to interpret and visualize comparisons across categories.
Can effectively display both nominal and ordinal data.
Provides a clear, intuitive way to represent and compare quantities.

Limitations:
Not suitable for displaying relationships or trends in continuous data.
May become cluttered or difficult to read when there are too many categories or data points.
Example:
Suppose a company sells three product types (A, B, and C) in four different regions (North,
South, East, West). A stacked bar chart can be used to show the distribution of total sales
across these regions, with the sales data broken down by product type.

Each bar represents the total sales for a region.
The segments within each bar represent the sales from each product type (A, B, and C) in that region.
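
The stacking in this example can be sketched in a few lines of Python. This is a minimal sketch with one bar per region; the sales figures and the helper name `stacked_segments` are hypothetical and only illustrate how each segment starts where the previous one ends.

```python
# Hypothetical sales figures (units sold); not taken from the original slides.
sales = {
    "North": {"A": 120, "B": 80, "C": 50},
    "South": {"A": 90, "B": 110, "C": 40},
    "East": {"A": 60, "B": 70, "C": 95},
    "West": {"A": 130, "B": 45, "C": 75},
}
products = ["A", "B", "C"]


def stacked_segments(sales, products):
    """Return, per region, a list of (product, bottom, height) segments."""
    layout = {}
    for region, by_product in sales.items():
        bottom = 0
        segments = []
        for product in products:
            height = by_product[product]
            segments.append((product, bottom, height))
            bottom += height  # the next segment starts where this one ends
        layout[region] = segments
    return layout


layout = stacked_segments(sales, products)

# The top of the last segment is the total height of the bar.
totals = {region: segs[-1][1] + segs[-1][2] for region, segs in layout.items()}

for region, segs in layout.items():
    print(region, segs, "total =", totals[region])
```

Each (bottom, height) pair is the kind of input a plotting library expects for a stacked bar; in matplotlib, for example, it could be passed as `ax.bar(x, height, bottom=bottom)`.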
Pie Charts
A pie chart is a type of circular statistical graphic that is used to represent the proportions of a whole. Each slice
(or segment) of the pie corresponds to a category, and the size of each slice is proportional to the quantity or
percentage that category represents out of the total.

Key Components:

Circle: The entire circle represents the whole data set or 100%.

Slices: Each slice of the pie represents a category, with its size corresponding to the value or percentage that
category contributes to the total.

Percentages: Often, pie charts display the percentage or proportion each slice represents, either inside the slice
or in a legend.

Labels: Each slice can be labeled with either the category name, the numerical value, or the percentage it
represents.
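
How slice sizes follow from the data can be shown with a short calculation. The sketch below assumes hypothetical market-share figures and a helper named `pie_slices`; it converts each value into a percentage of the whole and into start and end angles on the 360-degree circle.

```python
def pie_slices(values):
    """Return (label, percentage, start_angle, end_angle) for each category."""
    total = sum(values.values())
    slices = []
    start = 0.0
    for label, value in values.items():
        share = value / total           # fraction of the whole
        end = start + share * 360.0     # the whole circle is 360 degrees
        slices.append((label, share * 100.0, start, end))
        start = end                     # next slice begins where this one ends
    return slices


# Hypothetical figures, for illustration only.
shares = {"Product A": 50, "Product B": 30, "Product C": 20}
slices = pie_slices(shares)

for label, pct, start, end in slices:
    print(f"{label}: {pct:.1f}% of the total, from {start:.0f} to {end:.0f} degrees")
```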
Parallel Coordinates

Parallel coordinates are a technique used in data visualization to represent high-dimensional datasets.

A parallel coordinates plot (or parallel plot) makes it possible to compare the features of several individual
observations (series) across a set of numeric variables. Each vertical axis represents a variable and often has its
own scale (the units can even be different). Values are then plotted as a series of line segments connected across
the axes.

It is used for multivariate numeric data and allows several quantitative variables to be studied together, even
when they have different ranges or different units.

Key Characteristics:
1. Axes: Each axis corresponds to one feature of the dataset. These axes are typically placed parallel to each
other.
2. Lines: Each data point is represented as a polyline that connects its corresponding values along each axis. The
lines can represent individual data points or entire data subsets.
3. Interpretation: Patterns, correlations, and relationships between different dimensions can be seen through
the lines and their intersections.
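
The characteristics above can be sketched as a small computation: every axis is rescaled independently to its own min/max range, and each observation becomes a polyline of (axis index, scaled value) points. The three rows are illustrative measurements in the style of the Iris dataset, and the function name `to_polylines` is an assumption made for this sketch.

```python
# Illustrative measurements in centimetres, one row per flower.
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
rows = {
    "setosa": [5.1, 3.5, 1.4, 0.2],
    "versicolor": [7.0, 3.2, 4.7, 1.4],
    "virginica": [6.3, 3.3, 6.0, 2.5],
}


def to_polylines(rows):
    """Scale every axis to [0, 1] on its own and return one polyline per row."""
    columns = list(zip(*rows.values()))      # one tuple of values per axis
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]

    polylines = {}
    for name, values in rows.items():
        points = []
        for axis, value in enumerate(values):
            span = highs[axis] - lows[axis]
            # A constant axis has no range, so place its points in the middle.
            y = (value - lows[axis]) / span if span else 0.5
            points.append((axis, y))
        polylines[name] = points
    return polylines


polylines = to_polylines(rows)
for name, points in polylines.items():
    print(name, [(features[axis], round(y, 2)) for axis, y in points])
```

Drawing each list of points as a connected line, with the axes placed side by side, gives the parallel coordinates plot.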
IRIS dataset

The Iris dataset compares flowers belonging to 3 species: Setosa, Versicolor, and Virginica.
Each flower is measured by four attributes: Sepal Width, Sepal Length, Petal Width, and Petal Length.
Each variable is represented by its own axis, and each axis has its own scale (min/max values).

Observations:
Flowers belonging to the setosa species have large Sepal Widths but low Sepal Lengths, Petal Widths, and
Petal Lengths.

Flowers belonging to the versicolor species have low Sepal Widths and medium Sepal Lengths, Petal Widths,
and Petal Lengths.

Flowers belonging to the virginica species have low to medium Sepal Widths, medium to large Sepal Lengths,
and large Petal Widths and Petal Lengths.
Drawbacks:

Overcrowding: With many data points and dimensions, the chart can become cluttered,
making it hard to interpret.

Interpretation: Understanding relationships between dimensions can be challenging, especially with complex
datasets.
Treemap
A treemap is a data visualization technique that represents hierarchical (tree-structured) data as a set of nested
rectangles. Each branch of the hierarchy is represented by a rectangle, which is subdivided into smaller rectangles
that represent its sub-branches. The size and color of each rectangle can be used to encode different aspects of the
data, such as numerical values, categories, or other variables.

Composed of nested rectangles


The space in the visualization is split into rectangles that are sized and ordered by a quantitative variable.

Key Characteristics:
1. Hierarchy: Treemaps are particularly suited for visualizing hierarchical data, such as file systems,
organizational structures, or any data that can be structured into parent-child relationships.

2. Rectangular Layout: The data is represented by nested rectangles, with the size of each rectangle typically
corresponding to a quantitative variable (e.g., sales, revenue, population).

3. Color Encoding: Different colors can be used to represent categorical data, or to highlight differences in a
certain metric (e.g., performance, growth rate).

4. Compact: Treemaps are efficient in terms of space utilization, as they can fit a large amount of hierarchical
data into a small area.
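
One simple way to turn a hierarchy into nested rectangles is the slice-and-dice layout, sketched below. This is only one possible layout (visualization tools often use other algorithms, such as squarified layouts), and the nested dictionary of sales values is hypothetical. The area of every rectangle is proportional to the total value beneath it.

```python
def total(node):
    """Sum of all leaf values under a node (a leaf is a plain number)."""
    if isinstance(node, dict):
        return sum(total(child) for child in node.values())
    return node


def slice_and_dice(tree, x, y, w, h, depth=0):
    """Return a list of (name, depth, x, y, w, h) rectangles for the tree."""
    rects = []
    whole = total(tree)
    offset = 0.0
    for name, child in tree.items():
        share = total(child) / whole
        if depth % 2 == 0:
            # Even depths: place children side by side along the x axis.
            cx, cy, cw, ch = x + offset * w, y, share * w, h
        else:
            # Odd depths: stack children along the y axis.
            cx, cy, cw, ch = x, y + offset * h, w, share * h
        rects.append((name, depth, cx, cy, cw, ch))
        if isinstance(child, dict):
            rects.extend(slice_and_dice(child, cx, cy, cw, ch, depth + 1))
        offset += share
    return rects


# Hypothetical sales by continent -> country -> city.
sales = {
    "Africa": {"Morocco": {"Casablanca": 60, "Rabat": 20}},
    "Europe": {"France": {"Cannes": 50, "Paris": 30}},
    "Asia": {"India": {"Bangalore": 15}, "China": {"Hong Kong": 25}},
}

rects = slice_and_dice(sales, 0.0, 0.0, 100.0, 100.0)
areas = {name: w * h for name, _, _, _, w, h in rects}

for name, depth, x, y, w, h in rects:
    print("  " * depth + f"{name}: x={x:.1f} y={y:.1f} w={w:.1f} h={h:.1f}")
```

On a 100 x 100 canvas with a grand total of 200, a city with sales of 60 receives an area of 3000, that is, 30% of the canvas.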
Example: a treemap whose rectangles are sized and colored by the sales in certain cities.

Largest rectangle: top-left corner
Smallest rectangle: bottom-right corner
The treemap contains data on one level.

Observations:
Casablanca and Cannes have the highest total sales.
Hong Kong and Bangalore have the lowest.

When the columns Country and Continent are added to the treemap, the rectangles are nested.
Each rectangle that represents a country consists of rectangles for the cities in that country.
It is still possible to see the city with the highest sales, but you can also see that Africa is the continent
with the highest sales and Asia is the continent with the lowest.
Seven Variables for Visualization
(Retinal Variables)
The "seven variables for visualization" typically refer to key factors that influence the way visualizations are designed
and interpreted. These variables are essential when creating effective visual representations of data, helping to
ensure clarity and insight. While there may be slight variations in how these are framed, one common set of seven
visualization variables includes:

1. Position: The placement of elements within a visual space, often used in graphs and charts (e.g., x and y axes on a
scatter plot). Positioning is a powerful way to show relationships between data points.
2. Size: The size of visual elements (such as bars, dots, or lines) can be used to represent magnitude or volume. Larger
elements often indicate higher values, while smaller elements indicate lower values.
3. Shape: Different shapes can be used to distinguish categories or represent different types of data. For example, circles,
squares, or triangles might represent different groups or variables.
4. Color: Color is often used to differentiate between categories, highlight trends, or represent values (e.g., using a
color gradient to show intensity or value). Color can help in visually grouping or distinguishing parts of the data.

5. Orientation: The angle or direction of elements can convey different types of information. For example, tilted
bars or lines can represent trends or directional relationships.

6. Texture: Texture refers to the surface detail of visual elements (like patterns or gradients). It can be used to
represent additional layers of data or simply to add aesthetic distinction.

7. Connection: The use of lines or arrows to connect elements in the visualization, which helps to show
relationships, flows, or networks between data points. This is particularly useful in network diagrams or flowcharts.

These variables, when thoughtfully combined, can help create more effective and interpretable visualizations,
enabling viewers to quickly grasp insights from complex data.
Mapping Variables to Encoding
Mapping variables to encoding involves assigning specific variables to certain types of encoding or transformations
in order to prepare data for processing, such as in machine learning or data analysis.

1. Categorical Variables to Numerical Encoding


When you have categorical data (like "red", "blue", "green"), you'll often need to map them to numerical values to
work with algorithms that require numerical input.
Common methods:
Label Encoding: Assign each unique category a numerical value.
Example: {"red": 0, "blue": 1, "green": 2}
One-Hot Encoding: Create a binary column for each category.
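
Both methods can be sketched in plain Python. The helper names `label_encode` and `one_hot_encode` are hypothetical; codes are assigned in order of first appearance, which reproduces the mapping in the example above.

```python
colors = ["red", "blue", "green", "blue", "red"]


def label_encode(values):
    """Assign each unique category an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping, [mapping[value] for value in values]


def one_hot_encode(values, mapping):
    """Create one binary column per category; exactly one 1 per row."""
    width = len(mapping)
    rows = []
    for value in values:
        row = [0] * width
        row[mapping[value]] = 1
        rows.append(row)
    return rows


mapping, codes = label_encode(colors)
one_hot = one_hot_encode(colors, mapping)

print(mapping)   # {'red': 0, 'blue': 1, 'green': 2}
print(codes)     # [0, 1, 2, 1, 0]
print(one_hot)   # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Label encoding implies an order between categories (0 < 1 < 2) that may not exist in the data, which is one reason one-hot encoding is often preferred for nominal variables.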
2. Numerical Variables to Binned Encoding
If you're working with continuous numerical data, you might want to map it into bins (or ranges).
Binning: Divide the numerical range into intervals and assign a label to each interval.

Example: If you have an age variable and decide to group ages into categories, you could bin it as:
0-18 -> "child",
19-35 -> "young adult",
36-65 -> "adult",
66+ -> "senior"
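
The age bins above can be sketched with the standard-library `bisect` module. The function name `bin_age` is hypothetical, and the sketch assumes whole-number ages, with each upper edge (18, 35, 65) belonging to the lower bin.

```python
from bisect import bisect_left

# Inclusive upper edges of the first three bins; anything above 65 is "senior".
upper_edges = [18, 35, 65]
labels = ["child", "young adult", "adult", "senior"]


def bin_age(age):
    """Map an age in years to its bin label."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_left counts how many upper edges lie strictly below the age,
    # which is exactly the index of the bin the age falls into.
    return labels[bisect_left(upper_edges, age)]


ages = [4, 18, 19, 35, 36, 65, 66, 90]
binned = [bin_age(age) for age in ages]
print(list(zip(ages, binned)))
```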

3. Mapping Text to Vector Representations


If you're working with text data, you might want to map words to numerical vectors.
Bag of Words: Assign each unique word in your text corpus a unique index, then represent each document as a
vector of word counts.
Example: "cat" → 1, "dog" → 2.
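
A bag-of-words representation can be sketched as follows. The two sample documents and the helper names are hypothetical, and tokenization is a simple lowercase split on whitespace. Note that the indices here start at 0 (Python list positions), whereas the example above starts numbering at 1.

```python
from collections import Counter

docs = ["the cat chased the dog", "the dog slept"]


def build_vocabulary(docs):
    """Assign each unique word an index, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def vectorize(doc, vocab):
    """Represent a document as a vector of word counts, one slot per vocabulary word."""
    counts = Counter(doc.lower().split())
    # Dictionaries preserve insertion order, so slots follow the vocabulary indices.
    return [counts.get(word, 0) for word in vocab]


vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]

print(vocab)     # {'the': 0, 'cat': 1, 'chased': 2, 'dog': 3, 'slept': 4}
print(vectors)   # [[2, 1, 1, 1, 0], [1, 0, 0, 1, 1]]
```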
4. Feature Scaling
Interval and ratio data are continuous numerical data with meaningful relationships and scales (e.g., height,
weight, temperature, income). Such data often requires scaling or normalization so that a model can process it
efficiently, especially when the values are on different scales (e.g., income in thousands and age in years).

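Normalization (Min-Max Scaling):
This technique rescales the data to a [0, 1] range: the smallest value maps to 0, the largest to 1, and every other
value keeps its relative position in between. It is especially useful when you want to preserve the relationships
between the values while bringing them onto a common scale.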
Example:
If the data for "income" ranges from 20k to 200k, using min-max scaling will rescale those values into a [0, 1]
range.
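
Min-max scaling computes scaled = (x - min) / (max - min) for every value. The sketch below applies it to a few hypothetical incomes spanning the 20k to 200k range from the example; the function name `min_max_scale` is an assumption.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(value - lo) / (hi - lo) for value in values]


# Hypothetical incomes between 20k and 200k.
incomes = [20_000, 65_000, 110_000, 200_000]
scaled = min_max_scale(incomes)

print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The spacing between values is preserved: 65k sits a quarter of the way between 20k and 200k, so it maps to 0.25.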
Types of Data and Effective Visualization Types
Graphics Selection Chart by Dr. Andrew Abela
It doesn’t present all possible charts, nor does it always recommend the best charts for every situation.

Helps reduce the complexity of chart selection, particularly for beginners.

A user's informational needs fall into 4 primary visual demands:

Comparison: When the goal is to compare different sets of data, either over time or across different
items.

Distribution: When there is a need to understand how data is distributed over a range, charts in this
category are divided based on the number of variables analyzed.

Relationship: Observe the correlation or relationship between two or more variables.

Composition: When there is a need to understand how different components add up to form a whole,
which can be either over time or static.

Guides users to specific charts based on data characteristics.


Big Data Visualization Tools:
Q1 . What is the need for such tools?

Q2. Short notes on the following Tools:

Google Charts
Tableau
QlikView
Datawrapper
Oracle Visual Analyzer
FusionCharts
Highcharts
Microsoft Power BI
Plotly
Sisense
Q3. What is the importance of Big Data Visualization?

1. Review of large amounts of data
2. Spot trends
3. Identify correlations and unexpected relationships in the data
4. Present the data to others

Q4. What are the key issues of Big Data Visualization?

1. Availability of visualization specialists
2. Visualization hardware resources
3. Data Quality

Q5. What are the benefits of Data Visualization Tools?


Q6. What is SAS Visual Analytics? Give an example.

Q7. Use any free online tool to create a Word Cloud from any
pdf document of your choice.
Correlation Matrix

A correlation matrix is a table that shows the correlation between variables. It gives the correlation for
every possible pair of variables in matrix format.

Correlation is a statistical technique used to evaluate the relationship between two variables in a data set.

Every cell of the matrix contains a correlation coefficient, where +1 indicates a perfect positive linear
relationship between the two variables, 0 indicates no linear relationship, and -1 indicates a perfect negative
linear relationship.

Commonly used to build regression models.
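
A correlation matrix can be built by applying Pearson's coefficient to every pair of variables. The sketch below uses a small hypothetical dataset; the variable names and the helpers `pearson` and `correlation_matrix` are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)


def correlation_matrix(data):
    """Return a nested dict with one coefficient for every pair of variables."""
    names = list(data)
    return {a: {b: pearson(data[a], data[b]) for b in names} for a in names}


# Hypothetical observations for five students.
data = {
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "hours_gaming": [9, 7, 6, 4, 2],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(f"{name:>14}", "  ".join(f"{value:+.2f}" for value in row.values()))
```

The diagonal is always 1 (each variable correlates perfectly with itself) and the matrix is symmetric, so only the cells on one side of the diagonal carry new information.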


Calculate the Correlation between Age (X) and Glucose Level (Y)

Age (X)    Glucose (Y)
43         99
21         65
25         79
42         75
57         87
59         81

Step 1: For each pair of values, calculate XY, X² and Y²
Step 2: Calculate the summation of X, Y, XY, X² and Y²
Step 3: Calculate (Xi - X̄), (Yi - Ȳ) and the product of the two, where X̄ and Ȳ are the means of X and Y
Step 4: Calculate (Xi - X̄)² and (Yi - Ȳ)²
Step 5: Apply Pearson's Correlation Coefficient:
r = Σ(Xi - X̄)(Yi - Ȳ) / sqrt( Σ(Xi - X̄)² × Σ(Yi - Ȳ)² )
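
The steps can be checked with a short script. This sketch computes the coefficient for the Age and Glucose values in the table in two ways: from the raw sums of Steps 1 and 2, and from the deviations of Steps 3 and 4. In the deviations, each value is compared with the mean of its variable. Both routes give the same result.

```python
from math import sqrt

age = [43, 21, 25, 42, 57, 59]        # X
glucose = [99, 65, 79, 75, 87, 81]    # Y
n = len(age)

# Steps 1 and 2: products, squares, and their sums.
sum_x = sum(age)                                   # 247
sum_y = sum(glucose)                               # 486
sum_xy = sum(x * y for x, y in zip(age, glucose))  # 20485
sum_x2 = sum(x * x for x in age)                   # 11409
sum_y2 = sum(y * y for y in glucose)               # 40022

# Pearson's r from the raw sums.
r_from_sums = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Steps 3 and 4: deviations from the means and their squares.
mean_x = sum_x / n
mean_y = sum_y / n
sum_dev_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, glucose))
sum_dev_x2 = sum((x - mean_x) ** 2 for x in age)
sum_dev_y2 = sum((y - mean_y) ** 2 for y in glucose)

# Step 5: Pearson's r from the deviations.
r_from_deviations = sum_dev_products / sqrt(sum_dev_x2 * sum_dev_y2)

print(round(r_from_sums, 4))        # 0.5298
print(round(r_from_deviations, 4))  # 0.5298
```

A coefficient of about 0.53 indicates a moderate positive relationship between age and glucose level in this sample.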
