3 Data Science Intro
• Data may be generated by humans (surveys, logs, etc.) or machines (weather data,
road vision, etc.), and could be in different formats (text, audio, video, augmented or
virtual reality, etc.)
1
Where Do We See Data Science?
• Finance
• Public Policy
• Politics
• Healthcare
• Urban Planning
• Education
• Libraries
• …
2
Where Do We See Data Science?
Finance
• Data scientists capture and analyze new sources of data
• Building predictive models and running real-time simulations of market events.
• Help the finance industry obtain the information necessary to make accurate predictions.
• Fraud detection and risk reduction.
• Minimize the chance of loan defaults via information such as customer profiling, past expenditures, and
other essential variables that can be used to analyze the probabilities of risk and default.
• Analyze a customer’s purchasing power to more effectively try to sell additional banking products.
• Identify the creditworthiness of potential customers
3
Where Do We See Data Science?
Public Policy
• Data science helps governments and agencies gain insights into citizen behaviors that affect the quality
of public life, including traffic, public transportation, social welfare, community wellbeing, etc. This
information, or data, can be used to develop plans that address the betterment of these areas.
Politics
• Data scientists have been quite successful in constructing accurate voter targeting models and increasing
voter participation.
4
Data Analytics Life Cycle
5
Data Analytics Life Cycle
• Phase 1—Discovery: In Phase 1, the team learns the business domain, including attempted similar
projects in the past from which they can learn. The team assesses the resources available to support the
project in terms of people, technology, time, and data. Important activities in this phase include framing
the business problem as an analytics challenge that can be addressed in subsequent phases and
formulating initial hypotheses to test and begin learning the data.
• Phase 2—Data preparation: In Phase 2, the team works with data and performs analytics for the duration of
the project. The team needs to execute extract, load, and transform (ELT) or extract, transform and load
(ETL) to get data into the sandbox. The ELT and ETL are sometimes abbreviated as ETLT. Data should be
transformed in the ETLT process so the team can work with it and analyze it. In this phase, the team also
needs to familiarize itself with the data thoroughly and take steps to condition the data.
6
Data Analytics Life Cycle
• Phase 3—Model planning: Phase 3 is model planning, where the team determines the methods,
techniques, and workflow it intends to follow for the subsequent model building phase. The team
explores the data to learn about the relationships between variables and subsequently selects key
variables and the most suitable models.
• Phase 4—Model building: In Phase 4, the team develops datasets for testing, training, and production
purposes. In addition, in this phase the team builds and executes models based on the work done in the
model planning phase. The team also considers whether its existing tools will suffice for running the
models, or if it will need a more robust environment for executing models and workflows (for example,
fast hardware and parallel processing, if applicable).
7
Data Analytics Life Cycle
• Phase 5—Communicate results: In Phase 5, the team, in collaboration with major stakeholders,
determines if the results of the project are a success or a failure based on the criteria developed in
Phase 1. The team should identify key findings, quantify the business value, and develop a narrative to
summarize and convey findings to stakeholders.
• Phase 6—Operationalize: In Phase 6, the team delivers final reports, briefings, code, and technical
documents. In addition, the team may run a pilot project to implement the models in a production
environment.
8
What Do Data Scientists Do?
• Data collection
• Descriptive statistics
• Correlation
• Data visualization
• Model building
• Extrapolation and regression analysis
9
Data
• Data can be viewed as the raw material from which information is obtained.
• Structured data refers to highly organized information that can be seamlessly included in a database and
readily searched via simple search operations; whereas unstructured data is essentially the opposite,
devoid of any underlying structure
• Structured data is data that has been predefined and formatted to a set structure before being placed in
data storage.
• Unstructured data is data stored in its native format and not processed until it is used. Example: social
media posts, presentations, chats, and satellite imagery
10
How to Collect Data?
11
Data Storage and Presentation
• Depending on its nature, data is stored in various formats.
• We will start with simple kinds – data in text form. If such data is structured, it is common to store and
present it in some kind of delimited way. That means various fields and values of the data are separated
using delimiters, such as commas or tabs
• The two most commonly used formats that store data as simple text – comma-separated values (CSV)
and tab-separated values (TSV).
12
CSV (Comma-Separated Values)
• the most common import and export format for spreadsheets and databases.
• the first row mentions the variable names. The remaining rows each individually represent one data
point.
• Example: A snippet from a dataset that represents the effectiveness of different treatment procedures
on separate individuals with clinical depression
[Figure: example CSV snippet, with the variable names in the first row and one data point per remaining row]
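A minimal sketch of reading such a CSV file in Python; the file name and columns below are hypothetical, not the dataset from the figure:

    import csv

    # Hypothetical file: first row holds variable names, remaining rows are data points.
    with open("treatment.csv", newline="") as f:
        reader = csv.DictReader(f)   # uses the header row as field names
        for row in reader:
            print(row)               # e.g. {'id': '1', 'treatment': 'A', 'score': '42'}

The same file could also be loaded into a DataFrame with pandas.read_csv("treatment.csv").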
13
TSV (Tab-Separated Values)
• TSV files are used for raw data and can be imported into and exported from spreadsheet software. Tab-
separated values files are essentially text files, and the raw data can be viewed by text editors, though
such files are often used when moving raw data between spreadsheets.
• Example
• An advantage of TSV format is that the delimiter (tab) will not need to be avoided because it is unusual
to have the tab character within a field. In fact, if the tab character is present, it may have to be
removed. On the other hand, TSV is less common than other delimited formats such as CSV.
14
XML (eXtensible Markup Language)
• XML was designed to be both human and machine readable, and can thus be used to store and
transport data. In the real world, computer systems and databases contain data in incompatible formats.
As the XML data is stored in plain text format, it provides a software and hardware independent way of
storing data. This makes it much easier to create data that can be shared by different applications.
• Example
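Since the example document is not reproduced here, below is a minimal, hypothetical XML snippet and a sketch of parsing it with Python's built-in xml.etree.ElementTree:

    import xml.etree.ElementTree as ET

    # Hypothetical XML describing the same kind of tabular data as the CSV example.
    xml_data = """
    <patients>
      <patient id="1"><treatment>A</treatment><score>42</score></patient>
      <patient id="2"><treatment>B</treatment><score>35</score></patient>
    </patients>
    """

    root = ET.fromstring(xml_data)
    for patient in root.findall("patient"):
        print(patient.get("id"), patient.find("treatment").text, patient.find("score").text)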
15
JSON (JavaScript Object Notation)
• JSON is a lightweight data-interchange format. It is not only easy for humans to read and write, but also
easy for machines to parse and generate.
• JSON is built on two structures:
• A collection of name–value pairs. In various languages, this is realized as an object, record, structure, dictionary, hash table,
keyed list, or associative array.
• An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
• Example
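A minimal, hypothetical JSON example showing both structures (an object of name-value pairs whose "scores" field is an ordered list), parsed with Python's json module:

    import json

    text = '{"name": "Alice", "treatment": "A", "scores": [42, 38, 35]}'

    record = json.loads(text)             # parse JSON text into a Python dict
    print(record["name"], record["scores"][0])

    print(json.dumps(record, indent=2))   # serialize back to formatted JSON text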
16
Data Pre-processing
• Data in the real world is often dirty – not ready to be used for the desired purpose.
17
Data Pre-processing
• factors that indicate that data is not clean or ready to process:
• Incomplete. When some of the attribute values are lacking, certain attributes of interest are
lacking, or attributes contain only aggregate data.
• Noisy. When data contains errors or outliers. For example, some of the data points in a dataset may
contain extreme values that can severely affect the dataset’s range. (e.g., Salary=“-10”)
• Inconsistent. Data contains discrepancies in codes or names. For example, if the “Name” column
for registration records of employees contains values other than alphabetical letters, or if records
do not start with a capital letter, discrepancies are present. (e.g., Age=“42” Birthday=“03/07/1997”;
Was rating “1,2,3”, now rating “A, B, C”)
18
Forms of data pre-processing
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
19
Data Cleaning
Following are different ways to clean data:
• Munging or Wrangling: Involves reformatting data into a usable format.
Often, the data is not in a format that is easy to work with. For example, it may be stored or presented in a way that
is hard to process. Thus, we need to convert it to something more suitable for a computer to understand. To
accomplish this, there is no specific scientific method. The approaches to take are all about manipulating or
wrangling (or munging) the data to turn it into something that is more convenient or desirable. This can be done
manually, automatically, or, in many cases, semi-automatically.
• Handling missing data: Sometimes data may be in the right format, but some of the values are missing. This may be due to problems with the process of collecting data or an equipment malfunction. Or, comprehensiveness may not have been considered important at the time of collection. Furthermore, some data may get lost due to system or human error while storing or transferring the data.
Strategies to combat missing data: Global constant, Ignoring, Imputing or Inference approach
• Filtering noisy data: Identifying outliers and removing from database.
20
Data Cleaning - Missing Values
Strategies to handle missing data:
• Ignore the tuple: This is usually done when the class label is missing. This method is not very effective,
unless the tuple contains several attributes with missing values. By ignoring the tuple, we do not make
use of the remaining attributes in the tuple. Such data could have been useful to the task at hand.
• Fill in the missing value manually: In general, this approach is time consuming and may not be feasible
given a large data set with many missing values.
• Use a global constant to fill in the missing value: Replace all missing attribute values by the same
constant such as a label like “Unknown” or −∞. If missing values are replaced by, say, “Unknown,” then
the mining program may mistakenly think that they form an interesting concept, since they all have a
value in common. Hence, although this method is simple, it is not foolproof.
21
Data Cleaning - Missing Values (Cont.)
Strategies to handle missing data:
• Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing
value: For normal (symmetric) data distributions, the mean can be used, while skewed data distributions
should employ the median. We will talk about skewed distribution later in the lecture.
• Use the most probable value to fill in the missing value: This may be determined with regression,
inference-based tools using a Bayesian formalism, or decision tree induction.
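A minimal pandas sketch of these strategies; the DataFrame and its "income" column are hypothetical:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"income": [52000, np.nan, 61000, np.nan, 58000]})

    dropped   = df.dropna()                                # ignore the tuple (drop rows with missing values)
    constant  = df["income"].fillna("Unknown")             # global constant (simple, but not foolproof)
    mean_fill = df["income"].fillna(df["income"].mean())   # central tendency: mean (symmetric data)
    med_fill  = df["income"].fillna(df["income"].median()) # central tendency: median (skewed data)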
22
Data Cleaning – Noisy Data
Noise is a random error or variance in a measured variable. Some basic statistical description techniques
and methods of data visualization (e.g., boxplots and scatter plots) can be used to identify outliers, which
may represent noise.
Here are some of the common data smoothing techniques:
• Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the
values around it.
• first sort data and partition into (equal-frequency, or equal-width) bins
• then one can smooth by bin mean, smooth by bin median, smooth by bin boundaries, etc.
• Regression: Data smoothing can also be done by regression, a technique that conforms data values to a
function.
• Outlier analysis: Outliers may be detected by clustering, for example, where similar values are organized
into groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be considered
outliers.
23
Binning methods for data smoothing
• Smoothing by bin means: each value in a bin is replaced by the mean value of the bin. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median.
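A small sketch of smoothing by bin means on a toy list of sorted prices; the values and the choice of three equal-frequency bins are illustrative only:

    # Sorted data, partitioned into 3 equal-frequency bins.
    prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
    bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

    # Smoothing by bin means: replace each value with its bin's mean.
    smoothed = [[sum(b) / len(b)] * len(b) for b in bins]
    print(smoothed)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]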
24
Importance of Data Cleaning
25
Data Integration
To be as efficient and effective for various data analyses as possible, data from various sources commonly needs
to be integrated. The following steps describe how to integrate multiple databases or files.
• Combine data from multiple sources into a coherent storage place (e.g., a single file or a database).
• Address redundant data in data integration. Redundant data is commonly generated in the process of
integrating multiple databases. For example:
• The same attribute may have different names in different databases.
• One attribute may be a “derived” attribute in another table; for example, annual revenue.
• Correlation analysis may detect instances of redundant data.
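A minimal sketch of integrating two hypothetical sources with pandas, where the same attribute appears under different names (cust_id vs. customer_id):

    import pandas as pd

    sales = pd.DataFrame({"cust_id": [1, 2], "revenue": [100, 250]})
    crm   = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bo"]})

    # Reconcile the differing attribute names, then combine into one coherent table.
    merged = sales.merge(crm.rename(columns={"customer_id": "cust_id"}), on="cust_id")

    # Correlation analysis can flag redundant (highly correlated) numeric attributes.
    print(merged.select_dtypes("number").corr())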
26
Data Transformation
Some of the typical data manipulation and transformation techniques are:
• Reduction
• Data Cube Aggregation: use the smallest representation that is sufficient to address the given task.
For example, suppose you have the All Electronics sales data per quarter from 2018 to 2022. If you want the annual sales per year, you just have to aggregate the quarterly sales for each year (a short sketch follows this list).
• Dimensionality Reduction: dimensionality reduction method works with respect to the nature of the data. Here, a dimension
or a column in your data spreadsheet is referred to as a “feature,” and the goal of the process is to identify which features to
remove or collapse to a combined feature. (later in this course)
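A sketch of the aggregation step described above, using a shortened set of hypothetical quarterly sales figures:

    import pandas as pd

    quarterly = pd.DataFrame({
        "year":  [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
        "sales": [224, 408, 350, 586, 260, 420, 380, 610],
    })

    # Aggregate quarterly sales into annual sales per year.
    annual = quarterly.groupby("year")["sales"].sum()
    print(annual)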
27
Data Transformation
Some of the typical data manipulation and transformation techniques are:
• Conversion
• Feature Construction
New attributes constructed from the given ones
28
Data Transformation
Some of the typical data manipulation and transformation techniques are:
• Normalization: values are scaled to fall within a small, specified range. Some of the techniques used for accomplishing normalization:
• Min-max: $v' = \frac{v - \min_X}{\max_X - \min_X}(newMax_X - newMin_X) + newMin_X$
• Z-score: $v' = \frac{v - \mu_X}{\sigma_X}$
• Decimal scaling: $v' = \frac{v}{10^j}$
29
Data Transformation: Normalization
• Since the range of values of raw data varies widely, in some machine learning algorithms, objective
functions will not work properly without normalization.
• For example, the majority of classifiers calculate the distance between two points by the Euclidean
distance. If one of the features has a broad range of values, the distance will be governed by this
particular feature.
• Therefore, the range of all features should be normalized so that each feature contributes approximately
proportionately to the final distance.
• Another reason why feature scaling is applied is that gradient descent converges much faster with
feature scaling than without it.
30
Data Transformation: Min-Max Normalization
Min-Max Normalization:
$v' = \frac{v - \min_X}{\max_X - \min_X}(newMax_X - newMin_X) + newMin_X$
• The value range should be known and predefined both in training and in real world applications
• The closer the distribution is to a linear form, the better it will work
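A minimal sketch of min-max normalization to the range [0, 1] on toy values:

    values = [200, 300, 400, 600, 1000]
    new_min, new_max = 0.0, 1.0

    lo, hi = min(values), max(values)
    scaled = [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]
    print(scaled)   # [0.0, 0.125, 0.25, 0.5, 1.0]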
31
Data Transformation: z-Score Normalization
• The result of standardization (or Z-score normalization) is that the features will be rescaled so that they'll
have the properties of a standard normal distribution with μ = 0 and σ = 1
$v' = \frac{v - \mu_X}{\sigma_X}$
• Standardizing the features so that they are centered around 0 with a standard deviation of 1 is not only
important if we are comparing measurements that have different units, but it is also a general
requirement for many machine learning algorithms.
• Gradient descent is a prominent example (an optimization algorithm often used in logistic regression, SVMs, perceptrons, neural networks, etc.).
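A minimal sketch of z-score standardization on toy values, using the population standard deviation:

    import statistics

    values = [200, 300, 400, 600, 1000]
    mu = statistics.mean(values)         # 500
    sigma = statistics.pstdev(values)    # population standard deviation

    standardized = [(v - mu) / sigma for v in values]
    # The standardized values have mean ~0 and standard deviation ~1.
    print(round(statistics.mean(standardized), 10), round(statistics.pstdev(standardized), 10))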
32
Z-score standardization or Min-Max scaling?
• There is no obvious answer to this question: it really depends on the application.
• In clustering analyses, standardization may be especially crucial in order to compare similarities between
features based on certain distance measures.
• Another prominent example is Principal Component Analysis (PCA), where we usually prefer standardization over Min-Max scaling, since we are interested in the components that maximize the variance.
• However, this doesn't mean that Min-Max scaling is not useful at all! A popular application is image
processing, where pixel intensities have to be normalized to fit within a certain range (i.e., 0 to 255 for
the RGB color range). Also, typical neural network algorithms require data on a 0-1 scale.
33
Data Transformation: Normalization by decimal scaling
• Normalizes by moving the decimal point of values of feature X. The number of decimal points moved
depends on the maximum absolute value of X. A modified value corresponding to v is obtained using
$v' = \frac{v}{10^j}$
• Usually the normalization factor $j$ is chosen as the smallest integer such that $\max(|v'|) < 1$.
• This approach (similarly to the min-max method) needs the maximum value to be predefined.
• Example: suppose the range of attribute X is -500 to 45. The maximum absolute value of X is 500. To
normalize by decimal scaling we will divide each value by 1,000 (j = 3). In this case, -500 becomes -0.5
while 45 will become 0.045.
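The example above as a short sketch:

    values = [-500, 45]
    j = 3                                   # max(|v|) is 500, so divide by 10**3
    scaled = [v / 10 ** j for v in values]
    print(scaled)                           # [-0.5, 0.045]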
34
Data Analysis and Data Analytics
• These two terms – data analysis and data analytics – are often used interchangeably and could be
confusing.
• However, there are some subtle but important differences between analysis and analytics.
35
Categorization of Analysis and Analytics
36
Descriptive Analysis
• Descriptive analysis is about: “What is happening now based on incoming data.” It is a method for
quantitatively describing the main features of a collection of data.
• Example: categorize customers by their likely product preferences and purchasing patterns.
• Descriptive statistics come into play to facilitate analyzing and summarizing the data (e.g., mean, median, mode, etc.).
37
Summary Statistics/Quantitative Methods
• Central Tendency measures. They are computed to give a “center” around which the measurements in
the data are distributed.
• Variation or Variability measures. They describe “data spread” or how far away the measurements are
from the center.
• Relative Standing measures. They describe the relative position of specific measurements in the data.
38
Central Tendency Measures - Mean
• Mean is commonly known as average, though they are not exactly synonyms. Mean is most often used to measure the central tendency of continuous data as well as a discrete dataset. If there are $n$ values in a dataset, $x_1, x_2, \ldots, x_n$, then the mean is calculated as $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$.
• There is a significant drawback to using the mean as a central statistic: it is susceptible to the influence
of outliers. Also, mean is only meaningful if the data is normally distributed, or at least close to looking
like a normal distribution.
• The mean is useful for predicting future results when there are no extreme values in the data set
39
Central Tendency Measures - Median
• Median is the middle value of the data that has been sorted according to the values of the data.
• When the data has an even number of values, the median is calculated as the average of the two middle
values.
• Example
40
Which Location Measure Is Best?
• Mean is best for symmetric distributions without outliers.
41
Central Tendency Measures - Mode
• Mode is the most frequently occurring value in a dataset.
• The mode is useful when the most common item, characteristic or value of a data set is required.
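A quick sketch computing all three central tendency measures with Python's statistics module on toy data:

    import statistics

    data = [2, 3, 3, 5, 7, 10, 30]       # 30 is an outlier

    print(statistics.mean(data))          # ~8.57, pulled up by the outlier
    print(statistics.median(data))        # 5
    print(statistics.mode(data))          # 3, the most frequent value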
42
Mean vs. Median vs. Mode
43
Measures of Shape: Skewness
Skewness
• Absence of symmetry
• Extreme values in one side of a distribution
44
Variation Measures - Variance
• It is a measure used to indicate how spread out the data points are.
• If the individual observations vary greatly from the group mean, then the variance is big; and vice versa.
• It is computed as the average squared deviation from the mean: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$ (the sample variance divides by $N - 1$ instead). The standard deviation $\sigma$ is the square root of the variance.
• The advantage of the standard deviation: its units are the same as the units of the data, which is not the case for the variance.
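A short sketch of the population variance and standard deviation with the statistics module:

    import statistics

    data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]

    var = statistics.pvariance(data)   # average squared deviation from the mean
    std = statistics.pstdev(data)      # square root of the variance, same units as the data

    print(var, std)                    # 5.76 2.4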
46
Standard Deviation: Interesting Theoretical Result
For many lists of observations, especially if their histogram is bell-shaped
• Roughly 68% of the observations lie within 1 standard deviation of the mean.
47
Diagnostic Analytics
• Diagnostic analytics are used for discovery, or to determine why something happened.
• It involves at least one cause (usually more than one) and one effect.
• There are many techniques available in diagnostic analytics, which should be capable of recognizing
patterns, detecting anomalies, and surfacing ‘unusual’ events.
48
Diagnostic Analytics - Correlations
• Correlation is a statistical analysis that is used to measure and describe the strength and direction of the
relationship between two variables.
• Strength indicates how closely two variables are related to each other, and direction indicates how one
variable would change its value as the value of the other variable changes.
• An important statistic, Pearson's r correlation, is widely used to measure the degree of the relationship between linearly related variables. The following formula is used to calculate Pearson's r correlation:
$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$
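A minimal sketch of Pearson's r on toy data (numpy's corrcoef would give the same result):

    import math

    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 5]

    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))
    r = num / den
    print(round(r, 3))   # 0.775, a fairly strong positive correlation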
49
Diagnostic Analytics - Correlations
• Directions: All correlation coefficients between 0 and 1 represent positive correlations, while all coefficients between 0 and -1 are negative correlations. A positive relation means that if one variable increases (decreases), the other will increase (decrease). A negative relation means that if one variable increases (decreases), the other will decrease (increase).
• Strength: The closer a correlation coefficient is to 1 or to -1, the stronger it is. The following picture suggests a guideline for interpreting the strength.
50
Predictive Analytics
• Predictive analytics has its roots in our ability to predict what might happen. These analytics are about
understanding the future using the data and the trends we have seen in the past, as well as emerging
new contexts and processes.
51
Prescriptive Analytics
• Prescriptive analytics is the area of business analytics dedicated to finding the best course of action for
a given situation.
• Specific techniques used in prescriptive analytics include optimization, simulation, game theory, and
decision-analysis methods.
52
Exploratory Analysis
• Often when working with data, we may not have a clear understanding of the problem or the situation.
And yet, we may be called on to provide some insights. In other words, we are asked to provide an
answer without knowing the question! This is where we go for an exploration.
• It typically involves data visualization techniques. The idea is that plotting the data in different forms could provide us with some clues regarding what we may (or want to) find in the data. It consists of a range of techniques. The most common goal is looking for patterns in the data.
• Exploratory data analysis is an approach that postpones the usual assumptions about what kind of model the data follows in favor of the more direct approach of allowing the data itself to reveal its underlying structure in the form of a model.
53
Mechanistic Analysis
• Mechanistic analysis involves understanding the exact changes in variables that lead to changes in other
variables for individual objects.
• For instance, studying how the increased amount of CO2 in the atmosphere is causing the overall
temperature to change.
54
Introduction to Data Visualization
What is data visualization
• Data visualization means encoding information about the data in a graphical way.
• It helps to highlight the most useful insights from a dataset, making it easier to spot trends, patterns,
outliers, and correlations.
• The ultimate goal of data visualization is to tell a story. We are trying to convey information about the
data as efficiently as possible.
• Using graphics allows the reader to quickly and accurately understand the message we are trying to
transmit.
56
Benefits of effective data visualization
Data visualization allows you to:
• Get an initial understanding of your data by making trends, patterns, and outliers easily visible to the
naked eye.
• Communicate insights and findings to non-data experts, making your data accessible and actionable.
• Tell a meaningful and impactful story, highlighting only the most relevant information for a given
context.
[Source: https://guides.library.jhu.edu/datavisualization] 58
Benefits of effective data visualization (Cont.)
Upon visual inspection, it becomes immediately clear that these datasets, while seemingly identical
according to common summary statistics, are each unique. This is the power of effective data visualization:
it allows us to bypass cognition by communicating directly with our perceptual system.
[Source: https://guides.library.jhu.edu/datavisualization] 59
What is data visualization used for?
Within the broader goal of conveying key insights, different visualizations can be used to tell different
stories. Data visualizations can be used to:
• Convey changes over time: For example, a line graph could be used to present how the value of Bitcoin
changed over a certain time period.
• Determine the frequency of events: You could use a histogram to visualize the frequency distribution of
a single event over a certain time period (e.g. number of internet users per year from 2007 to 2021).
• Highlight interesting relationships or correlations between variables: If you wanted to highlight the
relationship between two variables (e.g. marketing spend and revenue, or hours of weekly exercise vs.
cardiovascular fitness), you could use a scatter plot to see, at a glance, if one increases as the other
decreases (or vice versa).
• Examine a network: If you want to understand what’s going on within a certain network (for example,
your entire customer base), network visualizations can help you to identify (and depict) meaningful
connections and clusters within your network of interest.
• Analyze value and risk
[Source: Emily Stevens]
60
Types of data visualization
• It is important to distinguish between two main types: exploratory and explanatory data visualization.
• In a nutshell, exploratory data visualization helps you figure out what’s in your data, while explanatory
visualization helps you to communicate what you’ve found.
• Exploration takes place while you’re still analyzing the data, while explanation comes towards the end of
the process when you’re ready to share your findings.
• Think about what encodings are best for what kind of variable. For example, color would work pretty badly for a continuous variable, but would work very well for a discrete variable.
• Other Visual Encodings: For example, “spatial” encodings exploit the cortex’s spatial awareness to encode information. This can be achieved through position in a scale, length, area, or volume.
• Hierarchical visualizations organize groups within larger groups, and are often used to display clusters of
information. Examples include tree diagrams, ring charts, and sunburst diagrams.
• Network visualizations show the relationships and connections between multiple datasets. Examples
include matrix charts, word clouds, and node-link diagrams.
• Multidimensional or 3D visualizations are used to depict two or more variables. Examples include pie
charts, Venn diagrams, stacked bar graphs, and histograms.
• Geospatial visualizations convey various data points in relation to physical, real-world locations (for
example, voting patterns across a certain country). Examples include heat maps, cartograms, and
density maps.
[Source: Emily Stevens]
65
Common types of data visualization - Scatterplots
• Scatterplots (or scatter graphs) visualize the relationship between two variables. One variable is shown
on the x-axis, and the other on the y-axis, with each data point depicted as a single “dot” or item on the
graph. This creates a “scatter” effect, hence the name.
• Scatterplots are best used for large datasets when there’s no temporal element. For example, if you
wanted to visualize the relationship between a person’s height and weight, or between how many carats
a diamond measures and its monetary value, you could easily visualize this using a scatterplot. It’s
important to bear in mind that scatterplots simply describe the correlation between two variables; they
don’t infer any kind of cause-and-effect relationship.
• Bar charts, by contrast, plot your categorical data on the x-axis against your discrete values on the y-axis. The height of the bars is directly proportional to the values they represent, making it easy to compare your data at a glance. [Source: Emily Stevens]
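A minimal matplotlib sketch of both chart types, using hypothetical data:

    import matplotlib.pyplot as plt

    # Scatterplot: relationship between two numeric variables.
    heights = [150, 160, 165, 172, 180, 188]
    weights = [52, 58, 63, 70, 80, 90]
    plt.scatter(heights, weights)
    plt.xlabel("Height (cm)")
    plt.ylabel("Weight (kg)")
    plt.show()

    # Bar chart: categories on the x-axis, values on the y-axis.
    categories = ["A", "B", "C"]
    values = [23, 45, 12]
    plt.bar(categories, values)
    plt.show()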
67
Common types of data visualization - Histograms
• Plot values of observations on one of the axes (typically the x-axis). The values can be ordered (if applicable).
• Plot perpendicular bars to the above axis. The height of the bar shows how many times each value occurred in the dataset.
• When you have numeric values, it does not make sense to count the occurrences of each value. Thus, we introduce the concept of buckets or bins.
• The idea is to plot the bars for each bin/bucket, where the height of the bar indicates the number of values that belong to the bin/bucket.
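A minimal matplotlib histogram sketch on simulated numeric values, using 10 bins:

    import matplotlib.pyplot as plt
    import random

    random.seed(0)
    values = [random.gauss(50, 10) for _ in range(1000)]   # 1000 simulated numeric values

    plt.hist(values, bins=10)       # bar height = number of values falling in each bin
    plt.xlabel("Value")
    plt.ylabel("Frequency")
    plt.show()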
68
Common types of data visualization - Box Plots
To build a boxplot, the following things are required from the data:
• The min and max values in the data.
• The first and third quartiles of the data.
• The median of the data.
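A minimal boxplot sketch with matplotlib, which computes the quartiles and median for you (same values as the quartile example that follows):

    import matplotlib.pyplot as plt

    data = [6, 7, 15, 36, 39, 41, 41, 43, 43, 47, 49]

    plt.boxplot(data)   # box spans Q1-Q3, line at the median, whiskers toward min/max
    plt.show()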
69
Quartiles and IQR
• The quartiles of a population or a sample are the three values which divide the distribution or observed
data into even fourths.
• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
• Q2 is the same as the median (50% are smaller, 50% are larger)
• Only 25% of the observations are greater than the third quartile
• The IQR is used when the researcher wishes to eliminate the influence of extreme values and consider
the variation for the more typical cases in a distribution.
70
Quartiles and IQR - Example
Find the quartiles of this data set: 6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36
• You first need to arrange the data points in increasing order:
idx:    1   2   3   4   5   6   7   8   9  10  11
value:  6   7  15  36  39  41  41  43  43  47  49
• Then you need to find the rank of the median (Q2) to split the data set in two: rank = (n+1)/2 = (11+1)/2 = 6, so Q2 = 41.
• Then you need to split the lower half of the data in two again to find the lower quartile. The lower quartile will be the point of rank (5 + 1) ÷ 2 = 3. The result is Q1 = 15. The second half must also be split in two to find the value of the upper quartile. The rank of the upper quartile will be 6 + 3 = 9. So Q3 = 43.
• Once you have the quartiles, you can easily measure the spread. The interquartile range will be Q3 - Q1, which gives 28 (43 - 15).
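The same computation as a short numpy sketch (assuming numpy ≥ 1.22 for the method argument; the default linear interpolation would give slightly different quartiles than the rank method above):

    import numpy as np

    data = [6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36]

    # method="lower" reproduces the rank-based quartiles worked out above.
    q1, q2, q3 = np.percentile(data, [25, 50, 75], method="lower")
    print(q1, q2, q3, q3 - q1)   # Q1=15, Q2=41, Q3=43, IQR=28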
71
Common types of data visualization - Pie charts
• Just like bar charts, pie charts are used to visualize categorical data. However, while bar charts represent
multiple categories of data, pie charts are used to visualize just one single variable broken down into
percentages or proportions.
• A pie chart is essentially a circle divided into different “slices,” with each slice representing the
percentage it contributes to the whole. Thus, the size of each pie slice is proportional to how much it
contributes to the whole “pie.”
Python Packages
• Seaborn
• Matplotlib
• Pandas
• Plotly
• …
75