Pant D. Statistics for Data Scientists and Analysts...using Python 2025
Dipendra Pant
Suresh Kumar Mukhiya
www.bpbonline.com
First Edition 2025
ISBN: 978-93-65897-128
www.bpbonline.com
https://rebrand.ly/68f7c9
The code bundle for the book is also hosted on GitHub at
https://github.com/bpbpublications/Statistics-for-Data-Scientists-and-Analysts. In case there’s an update
to the code, it will be updated on the existing GitHub
repository.
We have code bundles from our rich catalogue of books and
videos available at https://github.com/bpbpublications.
Check them out!
Errata
We take immense pride in our work at BPB Publications and
follow best practices to ensure the accuracy of our content
and to provide our subscribers with an engaging reading
experience. Our readers are our mirrors, and we use their
inputs to reflect on and improve upon any human errors that
may have occurred during the publishing process. To help us
maintain quality and reach out to any readers who might be
having difficulties due to unforeseen errors, please write to us at:
[email protected]
Your support, suggestions, and feedback are highly
appreciated by the BPB Publications’ Family.
Did you know that BPB offers eBook versions of every book published, with
PDF and ePub files available? You can upgrade to the eBook version at
www.bpbonline.com and as a print book customer, you are entitled to a
discount on the eBook copy. Get in touch with us at :
[email protected] for more details.
At www.bpbonline.com, you can also read a collection of free technical
articles, sign up for a range of free newsletters, and receive exclusive
discounts and offers on BPB books and eBooks.
Piracy
If you come across any illegal copies of our works in any form on the internet,
we would be grateful if you would provide us with the location address or
website name. Please contact us at [email protected] with a link to
the material.
Reviews
Please leave a review. Once you have read and used this book, why not leave
a review on the site that you purchased it from? Potential readers can then
see and use your unbiased opinion to make purchase decisions. We at BPB
can understand what you think about our products, and our authors can see
your feedback on their book. Thank you!
For more information about BPB, please visit www.bpbonline.com.
CHAPTER 1
Foundations of Data
Analysis and Python
Introduction
In today’s data-rich landscape, data is much more than a
collection of numbers or facts; it is a powerful resource that
can influence decision-making, policy formation, product
development, and scientific discovery. To turn these raw
inputs into meaningful insights, we rely on statistics, the
discipline dedicated to collecting, organizing, summarizing,
and interpreting data. Statistics not only helps us
understand patterns and relationships but also guides us in
making evidence-based decisions with confidence. This
chapter examines fundamental concepts at the heart of
data analysis. We’ll explore what data is and why it matters,
distinguish between various types of data and their levels of
measurement, and consider how data can be categorized as
univariate, bivariate, or multivariate. We’ll also highlight
different data sources, clarify the roles of populations and
samples, and introduce crucial data preparation tasks
including cleaning, wrangling, and manipulation to ensure
data quality and integrity.
For example, consider that you have records of customer
purchases at an online store: everything from product
categories and prices to transaction dates and customer
demographics. Applying statistical principles and effective
data preparation techniques to this information can reveal
purchasing patterns, highlight which product lines drive the
most revenue, and suggest targeted promotions that
improve the shopping experience.
Structure
In this chapter, we will discuss the following topics:
Environment setup
Software installation
Basic overview of technology
Statistics, data, and its importance
Types of data
Levels of measurement
Univariate, bivariate, and multivariate data
Data sources, methods, population, and samples
Data preparation tasks
Wrangling and manipulation
Objectives
By the end of this chapter, readers will learn the basics of
statistics and data, such as what they are, why they are
important, how they vary in type and application, and the
basic data collection and manipulation techniques.
Moreover, this chapter explains the different levels of
measurement, data analysis techniques, data sources and
collection methods, and data quality and cleaning. You will also
learn how to work with data using Python, a powerful and
popular programming language that offers many tools and
libraries for data analysis.
Environment setup
To set up the environment and to run the sample code for
statistics and data analysis in Python, the three options are
as follows:
Download and install Python from
https://www.python.org/downloads/. Other packages
need to be installed explicitly on top of Python. Then,
use any integrated development environment (IDE)
like Visual Studio Code to execute Python code.
You can also use Anaconda, a Python distribution
designed for large-scale data processing, predictive
analytics, and scientific computing. The Anaconda
distribution is the easiest way to code in Python. It
works on Linux, Windows, and Mac OS X. It can be
downloaded from
https://www.anaconda.com/distribution/.
You can also use cloud services, which is the easiest of
all options but requires internet connectivity to use.
Cloud providers like Microsoft Azure Notebooks,
GitHub Codespaces, and Google Colaboratory are very
popular. Following are a few links:
Microsoft Azure Notebooks:
https://notebooks.azure.com/
GitHub Codespaces: Create a GitHub account
from https://github.com/join then, once logged in,
create a repository from https://github.com/new.
Once the repository is created, open the repository
in the codespace by using the following
instructions:
https://docs.github.com/en/codespaces/developing-in-codespaces/creating-a-codespace-for-a-repository.
Google Colaboratory: Create a Google account,
open https://colab.research.google.com/, and
create a new notebook.
Azure Notebooks, GitHub Codespaces, and Google
Colaboratory are cloud-based, easy-to-use platforms.
To run and set up an environment locally, install the
Anaconda distribution on your machine and follow the
software installation instructions.
Software installation
Now, let us look at the steps to install Anaconda to run the
sample code and tutorials on the local machine as follows:
1. Download the Anaconda Python distribution from the
following link: https://www.anaconda.com/download
2. Once the download is complete, run the setup to begin
the installation process.
3. Once the Anaconda application has been installed, click
Close and move to the next step to launch the
application.
Check Anaconda installation instructions in the following:
https://docs.anaconda.com/free/anaconda/install/index.html
Launch application
Now, let us launch the installed Anaconda Navigator and
JupyterLab within it.
Following are the steps:
1. After installing Anaconda, open the Anaconda
Navigator, and then install and launch
JupyterLab from it.
2. This will start the Jupyter server listening on port 8888.
Usually, JupyterLab opens automatically in the default
browser, but you can also open the JupyterLab application
in any web browser (Google Chrome preferred) and go to the
following URL:
http://localhost:8888/
3. A blank notebook is launched in a new window. You can
write Python code in it.
4. Select the cell and press Run to execute the code.
The environment is now ready for writing and running the
tutorials.
Python
To know more about Python and installation you can refer
to the following link:
https://www.python.org/about/gettingstarted/. Execute
python --version in a terminal or command prompt, and if you
see the Python version as output, you are good to go; otherwise,
install Python. There are different ways to install Python
packages on Jupyter Notebook, depending on the package
manager you use and the environment you work in, as
follows:
If you use pip as your package manager, you can install
packages directly from a code cell in your notebook by
typing !pip install <package_name> and running the
cell. Then replace <package_name> with the name of
the package you want to install.
If you use conda as your package manager, you can
install packages from a JupyterLab cell by typing
!conda install <package_name> --yes and running
the cell. The --yes flag is to avoid prompts that ask for
confirmation.
If you want to install a specific version of Python for
your notebook, you can use the ipykernel module to
create a new kernel with that version. For example, if
you have Python 3.11 and pip installed on your
machine, you can type !pip3.11 install ipykernel and
!python3.11 -m ipykernel install --user in two
separate code cells and run them. Then, you can select
Python 3.11 as your kernel from the kernel menu.
Further tutorials will be based on the JupyterLab.
pandas
pandas is mainly used for data analysis and manipulation in
Python. More can be read at:
https://pandas.pydata.org/docs/
Following are the ways to install pandas:
In Jupyter Notebook, execute pip install pandas
In the conda environment, execute conda install
pandas --yes
NumPy
NumPy is a Python package for numerical computing,
multi-dimensional arrays, and mathematical computation. More
can be read at https://numpy.org/doc/.
Following are the ways to install NumPy:
In Jupyter Notebook, execute pip install numpy
In the conda environment, execute conda install
numpy --yes
Sklearn
Sklearn is a Python package that provides tools for machine
learning, such as data preprocessing, model selection,
classification, regression, clustering, and dimensionality
reduction. Sklearn is mainly used for predictive data
analysis and building machine learning models. More can
be read at https://scikit-learn.org/0.21/documentation.html.
Following are the ways to install Sklearn:
In Jupyter Notebook, execute pip install scikit-learn
In the conda environment, execute conda install
scikit-learn --yes
Matplotlib
Matplotlib is mainly used to create static, animated, and
interactive visualizations (plots, figures, and customized
visual style and layout) in Python. More can be read at
https://matplotlib.org/stable/index.html.
Following are the ways to install Matplotlib:
In Jupyter Notebook, execute pip install matplotlib
In the conda environment, execute conda install
matplotlib --yes
Types of data
Data can come in different forms and types, but generally it can
be divided into two types, that is, qualitative and
quantitative.
Qualitative data
Qualitative data cannot be measured or counted in
numbers. Also known as categorical data, it is descriptive,
interpretation-based, subjective, and unstructured. It
describes the qualities or characteristics of something. It
helps to understand the reasoning behind it by asking why,
how, or what. It includes nominal and ordinal data.
Examples include a person's gender, race, smartphone
brand, hair color, marital status, and occupation.
Tutorial 1.1: To implement creating a data frame
consisting of only qualitative data.
To create a data frame with pandas, import pandas as pd,
then use the DataFrame() function and pass a data source,
such as a dictionary, list, or array, as an argument.
# Import the pandas library to create a pandas DataFrame
import pandas as pd
# Sample qualitative data
qualitative_data = {
    'Name': ['John', 'Alice', 'Bob', 'Eve', 'Michael'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'San Francisco', 'Miami'],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Occupation': ['Engineer', 'Artist', 'Teacher', 'Doctor', 'Lawyer'],
    'Race': ['Black', 'White', 'Asian', 'Indian', 'Mongolian'],
    'Smartphone Brand': ['Apple', 'Samsung', 'Xiaomi', 'Apple', 'Google']
}
# Create the DataFrame
qualitative_df = pd.DataFrame(qualitative_data)
# Print the created DataFrame
print(qualitative_df)
Output:
      Name           City  Gender Occupation       Race Smartphone Brand
0     John       New York    Male   Engineer      Black            Apple
1    Alice    Los Angeles  Female     Artist      White          Samsung
2      Bob        Chicago    Male    Teacher      Asian           Xiaomi
3      Eve  San Francisco  Female     Doctor     Indian            Apple
4  Michael          Miami    Male     Lawyer  Mongolian           Google
The column of numbers 0, 1, 2, 3, and 4 is the index, not
part of the qualitative data. To exclude it from the
output, hide the index using to_string() as follows:
print(qualitative_df.to_string(index=False))
Output:
   Name          City Gender Occupation      Race Smartphone Brand
   John      New York   Male   Engineer     Black            Apple
  Alice   Los Angeles Female     Artist     White          Samsung
    Bob       Chicago   Male    Teacher     Asian           Xiaomi
    Eve San Francisco Female     Doctor    Indian            Apple
Michael         Miami   Male     Lawyer Mongolian           Google
While we often think of data in terms of numbers, many
other forms, such as images, audio, video, and text, can
also represent quantitative information when suitably
encoded (e.g., pixel intensity values in images, audio
waveforms, or textual features like word counts).
Tutorial 1.2: To implement accessing and creating a data
frame consisting of the image data.
In this tutorial, we’ll work with the open-source Olivetti
faces dataset, which consists of grayscale face images
collected at AT&T Laboratories Cambridge between April
1992 and April 1994. Each face is represented by
numerical pixel values, making them a form of quantitative
data. By organizing this data into a DataFrame, we can
easily manipulate, analyze, and visualize it for further
insights.
To create a data frame consisting of the Olivetti faces
dataset, you can use the following steps:
1. Fetch the Olivetti faces dataset from sklearn using the
sklearn.datasets.fetch_olivetti_faces function. This
will return an object that holds the data and some
metadata.
2. Use the pandas.DataFrame constructor to create a
data frame from the data and the feature names. You
can also add a column for the target labels using the
target and target_names attributes of the object.
3. Use the pandas method to display and analyze the data
frame. For example, you can use df.head(),
df.describe(), df.info().
import pandas as pd
import matplotlib.pyplot as plt
# Import datasets from the sklearn library
from sklearn import datasets
# Fetch the Olivetti faces dataset
faces = datasets.fetch_olivetti_faces()
# Create a dataframe from the pixel data
df = pd.DataFrame(faces.data)
# Add a column for the target labels
df["target"] = faces.target
# Display the first 3 rows of the dataframe
print(f"{df.head(3)}")
# Print a new line
print("\n")
# Display the first image in the dataset
plt.imshow(df.iloc[0, :-1].values.reshape(64, 64), cmap="gray")
plt.title(f"Image of person {df.iloc[0, -1]}")
plt.show()
Quantitative data
Quantitative data is measurable and can be expressed
numerically. It is useful for statistical analysis and
mathematical calculations. For example, if you inquire
about the number of books people have read in a month,
their responses constitute quantitative data. They may
reveal that they have read, let us say, three books, zero
books, or ten books, providing information about their
reading habits. Quantitative data is easily comparable and
allows for calculations. It can provide answers to questions
such as: How many? How much? How often? How fast?
Tutorial 1.3: To implement creating a data frame
consisting of only quantitative data is as follows:
# Import the pandas library to create a pandas DataFrame
import pandas as pd
quantitative_df = pd.DataFrame({
    "price": [300000, 250000, 400000, 350000, 450000],
    "distance": [10, 15, 20, 25, 30],
    "height": [170, 180, 190, 160, 175],
    "weight": [70, 80, 90, 60, 75],
    "salary": [5000, 6000, 7000, 8000, 9000],
    "temperature": [25, 30, 35, 40, 45],
})
# Print the DataFrame without the index
print(quantitative_df.to_string(index=False))
Output:
 price  distance  height  weight  salary  temperature
300000        10     170      70    5000           25
250000        15     180      80    6000           30
400000        20     190      90    7000           35
350000        25     160      60    8000           40
450000        30     175      75    9000           45
Tutorial 1.4: To implement accessing and creating a data
frame by loading the tabular iris data.
The iris tabular dataset contains 150 samples of iris flowers
with four features, that is, sepal length, sepal width, petal
length, and petal width, and three classes, that is, setosa,
versicolor, and virginica. Sepal length, sepal width,
petal length, petal width, and target (class) are the columns of
the table1.
To create a data frame consisting of the iris dataset, you
can use the following steps:
1. First, you need to load the iris dataset from sklearn
using the sklearn.datasets.load_iris function. This will
return a bunch object that holds the data and some
metadata.
2. Next, you can use the pandas.DataFrame constructor
to create a data frame from the data and the feature
names. You can also add a column for the target labels
using the target and target_names attributes of the
bunch object.
3. Finally, you can use the pandas methods to display and
analyze the data frame. For example, you can use
df.head(), df.describe(), df.info() as follows:
import pandas as pd
# Import datasets from sklearn
from sklearn import datasets
# Load the iris dataset
iris = datasets.load_iris()
# Create a dataframe from the data and feature names
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Add a column for the target labels
df["target"] = iris.target
# Display the first 5 rows of the dataframe
df.head()
Level of measurement
The level of measurement is a way of classifying data based on
how precise it is and what we can do with it. There are
generally four levels: nominal, ordinal, interval, and ratio.
Nominal is a category with no inherent order, such as
colors. Ordinal is a category with a meaningful order, such
as education levels. Interval has equal intervals but no true
zero, such as temperature in degrees Celsius, and ratio has
equal intervals with a true zero, such as age in years.
Nominal data
Nominal data is qualitative data that does not have a
natural ordering or ranking. For example, gender, religion,
ethnicity, color, brand ownership of electronic appliances,
and a person's favorite meal.
Tutorial 1.5: To implement creating a data frame
consisting of qualitative nominal data, is as follows:
# Import the pandas library to create a pandas DataFrame
import pandas as pd
nominal_data = {
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Religion": ["Hindu", "Muslim", "Christian", "Buddhist", "Jewish"],
    "Ethnicity": ["Indian", "Pakistani", "American", "Chinese", "Israeli"],
    "Color": ["Red", "Green", "Blue", "Yellow", "White"],
    "Electronic Appliances Ownership": ["Samsung", "LG", "Apple", "Huawei", "Sony"],
    "Person Favorite Meal": ["Biryani", "Kebab", "Pizza", "Noodles", "Falafel"],
    "Pet Preference": ["Dog", "Cat", "Parrot", "Fish", "Hamster"]
}
# Create the DataFrame
nominal_df = pd.DataFrame(nominal_data)
# Display the DataFrame
print(nominal_df)
Output:
   Gender   Religion  Ethnicity   Color Electronic Appliances Ownership  \
0    Male      Hindu     Indian     Red                         Samsung
1  Female     Muslim  Pakistani   Green                              LG
2    Male  Christian   American    Blue                           Apple
3  Female   Buddhist    Chinese  Yellow                          Huawei
4    Male     Jewish    Israeli   White                            Sony

  Person Favorite Meal Pet Preference
0              Biryani            Dog
1                Kebab            Cat
2                Pizza         Parrot
3              Noodles           Fish
4              Falafel        Hamster
Ordinal data
Ordinal data is qualitative data that has a natural ordering
or ranking. For example, student ranking in class (1st, 2nd,
or 3rd), educational qualification (high school,
undergraduate, or graduate), satisfaction level (bad,
average, or good), income level range, level of agreement
(agree, neutral, or disagree).
Tutorial 1.6: To implement creating a data frame
consisting of qualitative ordinal data is as follows:
import pandas as pd
ordinal_data = {
    "Student Rank in a Class": ["1st", "2nd", "3rd", "4th", "5th"],
    "Educational Qualification": ["Graduate", "Undergraduate", "High School", "Graduate", "Undergraduate"],
    "Satisfaction Level": ["Good", "Average", "Bad", "Average", "Good"],
    "Income Level Range": ["80,000-100,000", "60,000-80,000", "40,000-60,000", "100,000-120,000", "50,000-70,000"],
    "Level of Agreement": ["Agree", "Neutral", "Disagree", "Neutral", "Agree"]
}
ordinal_df = pd.DataFrame(ordinal_data)
print(ordinal_df)
Output:
  Student Rank in a Class Educational Qualification Satisfaction Level  \
0                     1st                  Graduate               Good
1                     2nd             Undergraduate            Average
Discrete data
Discrete data is quantitative data consisting of integers or whole
numbers that cannot be subdivided into parts. Examples include
the total number of students present in a class, the cost
of a cell phone, the number of employees in a company, the total
number of players who participated in a competition, the days
in a week, and the number of books in a library. For example, the
number of coins in a jar can only be a whole number like
1, 2, 3, and so on.
Tutorial 1.7: To implement creating a data frame
consisting of quantitative discrete data is as follows:
import pandas as pd
discrete_data = {
    "Students": [25, 30, 35, 40, 45],
    "Cost": [500, 600, 700, 800, 900],
    "Employees": [100, 150, 200, 250, 300],
    "Players": [50, 40, 30, 20, 10],
    "Week": [7, 7, 7, 7, 7]
}
discrete_df = pd.DataFrame(discrete_data)
discrete_df
Output:
   Students  Cost  Employees  Players  Week
0        25   500        100       50     7
1        30   600        150       40     7
2        35   700        200       30     7
3        40   800        250       20     7
4        45   900        300       10     7
Continuous data
Continuous data is quantitative data that can take any
value (including fractional values) within a range and has
no gaps between values. No gaps means that if a person's
height is 1.75 meters, there is always a possibility of a height
between 1.75 and 1.76 meters, such as 1.751 or
1.755 meters.
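Unlike the other data types in this section, no numbered tutorial for continuous data appears here, so the following is a minimal sketch in the style of the previous tutorials; the column names and measurement values are made up for illustration.
import pandas as pd
# Hypothetical continuous measurements: fractional values are meaningful
continuous_data = {
    "Height_m": [1.75, 1.802, 1.651, 1.688, 1.793],
    "Weight_kg": [70.4, 82.15, 59.8, 64.25, 77.9],
    "Temperature_C": [36.6, 37.05, 36.82, 36.49, 37.2]
}
continuous_df = pd.DataFrame(continuous_data)
# Print the DataFrame without the index, as in the earlier tutorials
print(continuous_df.to_string(index=False))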
Interval data
Interval data is quantitative, numerical data with an inherent
order. It always has an arbitrary zero, that is, a zero point
chosen by convention rather than by nature, with no meaningful
absence of the quantity. For example, a temperature of zero degrees
Fahrenheit does not mean that there is no heat or
temperature; here, zero is an arbitrary zero point.
Examples include temperature (Celsius or Fahrenheit), GMAT score
(200-800), and SAT score (400-1600).
Tutorial 1.8: To implement creating a data frame
consisting of quantitative interval data is as follows:
import pandas as pd
interval_data = {
    "Temperature": [10, 15, 20, 25, 30],
    "GMAT_Score": [600, 650, 700, 750, 800],
    "SAT_Score (400 - 1600)": [1200, 1300, 1400, 1500, 1600],
    "Time": ["9:00", "10:00", "11:00", "12:00", "13:00"]
}
interval_df = pd.DataFrame(interval_data)
# A DataFrame is displayed as is, even without print()
interval_df
Output:
   Temperature  GMAT_Score  SAT_Score (400 - 1600)   Time
0           10         600                    1200   9:00
1           15         650                    1300  10:00
2           20         700                    1400  11:00
3           25         750                    1500  12:00
4           30         800                    1600  13:00
Ratio data
Ratio data is numerical data with a natural order and an
absolute zero, where zero is not arbitrary but meaningful. For
example, height, weight, age, and tax amount have a true zero
point that is fixed by nature, and they are measured on a
ratio scale. Zero height means no height at all, like a point
in space; there is nothing shorter than zero height. A zero
tax amount means no tax at all, like being exempt; there is
nothing lower than a zero tax amount.
Tutorial 1.9: To implement creating a data frame
consisting of quantitative ratio data is as follows:
import pandas as pd
ratio_data = {
    "Height": [170, 180, 190, 200, 210],
    "Weight": [60, 70, 80, 90, 100],
    "Age": [20, 25, 30, 35, 40],
    "Speed": [80, 90, 100, 110, 120],
    "Tax Amount": [1000, 1500, 2000, 2500, 3000]
}
ratio_df = pd.DataFrame(ratio_data)
ratio_df
Output:
   Height  Weight  Age  Speed  Tax Amount
0     170      60   20     80        1000
1     180      70   25     90        1500
2     190      80   30    100        2000
3     200      90   35    110        2500
4     210     100   40    120        3000
Tutorial 1.10: To implement loading the ratio data in a
JSON format and displaying it.
Sometimes, data can be in JSON format. The data used in the
following Tutorial 1.10 is in JSON format; in that case, the
json.loads() method can load it. JSON is a text format for
data interchange based on JavaScript, as follows:
# Import json
import json
# The JSON string:
json_data = """
[
  {"Height": 170, "Weight": 60, "Age": 20, "Speed": 80, "Tax Amount": 1000},
  {"Height": 180, "Weight": 70, "Age": 25, "Speed": 90, "Tax Amount": 1500},
  {"Height": 190, "Weight": 80, "Age": 30, "Speed": 100, "Tax Amount": 2000},
  {"Height": 200, "Weight": 90, "Age": 35, "Speed": 110, "Tax Amount": 2500},
  {"Height": 210, "Weight": 100, "Age": 40, "Speed": 120, "Tax Amount": 3000}
]
"""
# Convert to a Python object (list of dicts):
data = json.loads(json_data)
data
Output:
[{'Height': 170, 'Weight': 60, 'Age': 20, 'Speed': 80, 'Tax Amount': 1000},
 {'Height': 180, 'Weight': 70, 'Age': 25, 'Speed': 90, 'Tax Amount': 1500},
 {'Height': 190, 'Weight': 80, 'Age': 30, 'Speed': 100, 'Tax Amount': 2000},
 {'Height': 200, 'Weight': 90, 'Age': 35, 'Speed': 110, 'Tax Amount': 2500},
 {'Height': 210, 'Weight': 100, 'Age': 40, 'Speed': 120, 'Tax Amount': 3000}]
Bivariate data
Bivariate data consists of observing two variables or
attributes for each individual or unit. For example, if you
wanted to study the relationship between the age and
height of students in a class, you would collect the age and
height of each student. Age and height are two variables or
attributes, and each student is an individual or unit.
Bivariate analysis analyzes how two different variables,
columns, or attributes are related. For example, the
correlation between people's height and weight or between
hours worked and monthly salary.
Tutorial 1.23: To implement bivariate data and bivariate
analysis by selecting two columns or variables or attributes
from the CSV dataset and to describe them, as follows:
import pandas as pd
from IPython.display import display
diabities_df = pd.read_csv('/workspaces/ImplementingStatisticsWithPython/data/chapter1/diabetes.csv')
# To view all the column names
print(diabities_df.columns)
# Select the Glucose and Age columns as a DataFrame from the diabities_df DataFrame
display(diabities_df[['Glucose','Age']])
# describe() gives the mean, standard deviation, and other summary statistics
print(diabities_df[['Glucose','Age']].describe())
# Use mode() for computing the most frequent value, i.e., the mode
print(diabities_df[['Glucose']].mode())
# To get the range, subtract the DataFrame minimum value from the maximum value.
# Use df.max() and df.min() for the maximum and minimum values
glucose_range = diabities_df[['Glucose']].max() - diabities_df[['Glucose']].min()
print(glucose_range)
# For the frequency or distribution of variables use value_counts()
diabities_df[['Glucose']].value_counts()
Here, we compared two columns, Glucose and Age, in the
diabities_df data frame; because the analysis involves two
data frame columns, it is a bivariate analysis.
Alternatively, two or more columns can be accessed using
loc[row_start:row_stop, column_start:column_stop] or
through the column index via slicing with
iloc[row_start:row_stop, column_start:column_stop] as
follows:
# Using loc
diabities_df.loc[:, ['Glucose','Age']]
# Using iloc, column index and slicing
diabities_df.iloc[:, 0:2]
Further, to compute the correlation between two variables
or two columns, such as glucose and age, we can use
columns along with corr() as follows:
diabities_df['Glucose'].corr(diabities_df['Age'])
Correlation is a statistical measure that indicates how two
variables are related to each other. A positive correlation
means that the variables increase or decrease together,
while a negative correlation means that the variables move
in opposite directions. A correlation value close to zero
means that there is no linear relationship between the
variables.
In the context of
diabities_df['Glucose'].corr(diabities_df['Age']), the
resulting positive correlation value of about 0.26 means that there
is a weak positive correlation between glucose level and
age in the diabetes dataset. This implies that older people
tend to have higher glucose levels than younger people, but
the relationship is not very strong or consistent.
Correlation can be computed using different methods, such
as Pearson, Kendall, or Spearman, by specifying
method='...' in corr() as follows:
diabities_df['Glucose'].corr(diabities_df['Age'], method='kendall')
Multivariate data
Multivariate data consists of observing three or more
variables or attributes for each individual or unit. For
example, if you want to study the relationship between the
age, gender, and income of customers in a store, you would
collect this data for each customer. Age, gender, and
income are the three variables or attributes, and each
customer is an individual or unit. In this case, the data you
collect will be multivariate data because it requires
observations on three variables or attributes for each
individual or unit. For example, the correlation between
age, gender, and sales in a store or between temperature,
humidity, and air quality in a city.
Tutorial 1.24: To implement multivariate data and
multivariate analysis by selecting multiple columns or
variables or attributes from the CSV dataset and describe
them, as follows:
import pandas as pd
from IPython.display import display
diabities_df = pd.read_csv('/workspaces/ImplementingStatisticsWithPython/data/chapter1/diabetes.csv')
# To view all the column names
print(diabities_df.columns)
# Select the Glucose, BMI, Age, and Outcome columns as a DataFrame from the diabities_df DataFrame
display(diabities_df[['Glucose','BMI', 'Age', 'Outcome']])
# describe() gives the mean, standard deviation, and other summary statistics
print(diabities_df[['Glucose','BMI', 'Age', 'Outcome']].describe())
Alternatively, multivariate analysis can be performed by
describing the whole data frame as follows:
# describe() gives the mean, standard deviation, and other summary statistics
print(diabities_df.describe())
# Use mode() for computing the most frequent value, i.e., the mode
print(diabities_df.mode())
# To get the range, subtract the DataFrame minimum value from the maximum value.
# Use df.max() and df.min() for the maximum and minimum values
range_values = diabities_df.max() - diabities_df.min()
print(range_values)
# For the frequency or distribution of variables use value_counts()
diabities_df.value_counts()
Further, to compute the correlation between all the
variables in the data frame, use corr() after the data frame
variable name as follows:
diabities_df.corr()
You can also apply various multivariate analysis techniques,
as follows:
Principal Component Analysis (PCA): It transforms
high-dimensional data into a smaller set of uncorrelated
variables (principal components) that capture the most
variance, thereby simplifying the dataset while
retaining essential information. It makes it easier to
visualize, interpret, and model multivariate
relationships.
Library: Scikit-learn
Method: PCA(n_components=___)
Multivariate regression: This is used to analyze the
relationship between multiple dependent and
independent variables.
Library: Statsmodels
Method: statsmodels.api.OLS for ordinary least
squares regression. It allows you to perform
multivariate linear regression and analyze the
relationship between multiple dependent and
independent variables. Regression can also be
performed using scikit-learn's
LinearRegression(), LogisticRegression(), and
many more.
Cluster analysis: This is used to group similar data
points together based on their characteristics (a minimal
K-means sketch follows this list).
Library: Scikit-learn
Method: sklearn.cluster.KMeans for K-means
clustering. It allows you to group similar data
points together based on their characteristics,
among many other clustering algorithms.
Factor analysis: This is used to identify underlying
latent variables that explain the observed variance.
Library: FactorAnalyzer
Method: FactorAnalyzer for factor analysis. It
allows you to perform Exploratory Factor
Analysis (EFA) to identify underlying latent
variables that explain the observed variance.
Canonical Correlation Analysis (CCA): To explore
the relationship between two sets of variables.
Library: Scikit-learn
Method: sklearn.cross_decomposition.CCA, which
allows you to explore the relationship
between two sets of variables and find linear
combinations that maximize the correlation
between the two sets.
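As referenced in the cluster analysis item above, the following is a minimal K-means sketch on the same diabetes dataset used in the earlier tutorials; the file path and the choice of three clusters are assumptions for illustration, not part of the original text.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# Load the diabetes dataset (path assumed to match the earlier tutorials)
diabities_df = pd.read_csv('/workspaces/ImplementingStatisticsWithPython/data/chapter1/diabetes.csv')
# Use the feature columns only, leaving out the Outcome label
X = diabities_df.drop("Outcome", axis=1)
# Standardize the features so no single column dominates the distance measure
X_scaled = StandardScaler().fit_transform(X)
# Group the observations into an assumed three clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
diabities_df["Cluster"] = kmeans.fit_predict(X_scaled)
# Inspect how many observations fall into each cluster
print(diabities_df["Cluster"].value_counts())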
Tutorial 1.25: To implement Principal Component
Analysis (PCA) for dimensionality reduction is as follows:
import pandas as pd
# Import principal component analysis
from sklearn.decomposition import PCA
# StandardScaler standardizes features to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
# Import matplotlib to plot the visualization
import matplotlib.pyplot as plt
# Step 1: Load your dataset into a DataFrame
# Assuming your dataset is stored in a CSV file, load it into a pandas DataFrame.
data = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter1/diabetes.csv")
# Step 2: Separate the features and the outcome variable (if applicable)
# If the "Outcome" column represents the dependent variable and not a feature,
# separate it from the features. If that is not the case, you can skip this step.
X = data.drop("Outcome", axis=1)  # Features
y = data["Outcome"]  # Outcome (if applicable)
# Step 3: Standardize the features
# PCA is sensitive to the scale of features, so it is crucial to standardize them
# to have zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 4: Apply PCA for dimensionality reduction
# Create a PCA instance and specify the number of components you want to retain.
# To reduce the dataset to a certain number of dimensions (e.g., 2 or 3),
# set 'n_components' accordingly.
pca = PCA(n_components=2)  # Reduce to 2 principal components
X_pca = pca.fit_transform(X_scaled)
# Step 5: Explained variance ratio
# The explained variance ratio gives us an idea of how much information
# each principal component captures.
explained_variance_ratio = pca.explained_variance_ratio_
# Step 6: Visualize the explained variance ratio
plt.bar(range(len(explained_variance_ratio)), explained_variance_ratio)
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.title("Explained Variance Ratio for Each Principal Component")
# Save and show the figure
plt.savefig('pca_explained_variance.jpg', dpi=600, bbox_inches='tight')
plt.show()
PCA reduces the dimensions but it also results in some loss
of information as we only retain the most important
components. Here, the original 8-dimensional diabetes data
set has been transformed into a new 2-dimensional data
set. The two new columns represent the first and second
principal components, which are linear combinations of the
original features. These principal components capture the
most significant variation in the data.
The columns of the data set pregnancies, glucose, blood
pressure, skin thickness, insulin, BMI, diabetes pedigree
function, and age are reduced to 2 principal components
because we specify n_components=2 as shown in Figure
1.1.
Output: a bar chart of the explained variance ratio for each of the two principal components (Figure 1.1).
Data source
Data can be primary or secondary. Its sources can be of two
types, that is, statistical sources like surveys, censuses,
experiments, and statistical reports, and non-statistical
sources like business transactions, social media posts,
weblogs, data from wearables and sensors, or personal
records.
Tutorial 1.26: To implement reading data from different
sources and view statistical and non-statistical data is as
follows:
import pandas as pd
# Import the urllib library for opening and reading URLs
import urllib.request
# To access a CSV file, replace the file name
df = pd.read_csv('url_to_csv_file.csv')
To access or read data from different sources, pandas
provides read_csv() and read_json(), NumPy provides
loadtxt() and genfromtxt(), and there are many others. A URL can
also be used, like
https://api.nobelprize.org/v1/prize.json, but it must be
accessible. Most data servers require authentication to
access them.
To read JSON files replace file name in the script as
follows:
# To access JSON data, replace the file name
df = pd.read_json('your_file_name.json')
To read an XML file from a server with NumPy, you can use
the np.loadtxt() function and pass it a file
object created using the urllib.request.urlopen() function
from the urllib.request module. You must also specify the
delimiter parameter as < or > to separate XML tags from
the data values. To read an XML file, replace the file name with
an appropriate one in the script as follows:
import numpy as np
import urllib.request
# To access and read the XML file using a URL, open the XML file
# from the URL and store it in a file object
file = urllib.request.urlopen('your_url_to_accessible_xml_file.xml')
arr = np.loadtxt(file, delimiter='<')
print(arr)
Collection methods
Collection methods are surveys, interviews, observations,
focus groups, experiments, and secondary data analysis. These
methods can be quantitative, based on numerical data and statistical
analysis, or qualitative, based on words, images, actions,
and interpretive analysis. Sometimes, mixed methods,
which combine qualitative and quantitative approaches, can be used.
Data quality
Data quality indicates how suitable, accurate, useful,
complete, reliable, and consistent the data is for its
intended use. Verifying data quality is an important step in
analysis and preprocessing.
Tutorial 1.30: To implement checking the data quality of
CSV file data frame, is as follows:
Check missing values with isna() or isnull()
Check summary with describe() or info()
Check shape with shape, size with size, and memory
usage with memory_usage()
Check duplicates with duplicated() and remove
duplicate with drop_duplicates()
Based on this instruction, let us see the implementation as
follows:
import pandas as pd
diabities_df = pd.read_csv('/workspaces/ImplementingStatisticsWithPython/data/chapter1/diabetes.csv')
# Check for missing values using isna() or isnull()
print(diabities_df.isna().sum())
# Describe the dataframe with describe() or info()
print(diabities_df.describe())
# Check the shape, size, and memory usage
print(f'Shape: {diabities_df.shape} Size: {diabities_df.size} Memory Usage: {diabities_df.memory_usage()}')
# Check for duplicates using duplicated() and drop them if necessary using drop_duplicates()
print(diabities_df.duplicated())
Now, we use synthetic transaction narrative data
containing unstructured information about the nature of
the transaction.
Tutorial 1.30: To implement viewing the text information
in the text files (synthetic transaction narrative files), is as
follows:
import pandas as pd
import numpy as np
# Import the glob library for finding files and directories using patterns
import glob
# Assign the path of the directory containing the text files to a variable
path = "/workspaces/ImplementingStatisticsWithPython/data/chapter1/TransactionNarrative"
# Find all the files in the directory that have a .txt extension and store them in a list
files = glob.glob(path + "/*.txt")
# Loop through each file in the list
for file in files:
    # Open each file in read mode with utf-8 encoding and assign it to a file object
    with open(file, "r", encoding="utf-8") as f:
        print(f.read())
Output:
Date: 2023-08-01
Merchant: VideoStream Plus
Amount: $9.99
Description: Monthly renewal of VideoStream Plus subscription.
Your subscription to VideoStream Plus has been successfully renewed for $9.99.
Tutorial 1.31: To implement checking the data quality of
multiple .txt files (synthetic transaction narrative files) that
contain text information as shown in the Tutorial 1.30 output.
To check the quality of the information in them, we use
file_size, line_count, and missing_fields, as follows:
import os
import glob

def check_file_quality(content):
    # Check for the presence of required fields
    required_fields = ['Date:', 'Merchant:', 'Amount:', 'Description:']
    missing_fields = [field for field in required_fields if field not in content]
    # Calculate the file size in bytes
    file_size = len(content.encode('utf-8'))
    # Count the lines in the content
    line_count = content.count('\n') + 1
    # Return the quality assessment
    quality_assessment = {
        "file_name": file,
        "file_size_bytes": file_size,
        "line_count": line_count,
        "missing_fields": missing_fields
    }
    return quality_assessment

# Assign the path of the directory containing the text files to a variable
path = "/workspaces/ImplementingStatisticsWithPython/data/chapter1/TransactionNarrative"
# Find all the files in the directory that have a .txt extension and store them in a list
files = glob.glob(path + "/*.txt")
# Loop through each file in the list
for file in files:
    with open(file, "r", encoding="utf-8") as f:
        content = f.read()
        print(content)
        quality_result = check_file_quality(content)
        print(f"\nQuality Assessment for {quality_result['file_name']}:")
        print(f"File Size: {quality_result['file_size_bytes']} bytes")
        print(f"Line Count: {quality_result['line_count']} lines")
        if quality_result['missing_fields']:
            print("Missing Fields:", ', '.join(quality_result['missing_fields']))
        else:
            print("All required fields present.")
        print("=" * 40)
Output (Only one transaction narrative output is shown):
Date: 2023-08-01
Merchant: VideoStream Plus
Amount: $9.99
Description: Monthly renewal of VideoStream Plus subscription.

Your subscription to VideoStream Plus has been successfully renewed for $9.99.


Quality Assessment for /workspaces/ImplementingStatisticsWithPython/data/chapter1/TransactionNarrative/3.txt:
File Size: 201 bytes
Line Count: 7 lines
All required fields present.
========================================
Cleaning
Data cleansing involves identifying and resolving
inconsistencies and errors in raw data sets to improve data
quality. High-quality data is critical to gaining accurate and
meaningful insights. Data cleansing also includes data
handling. Different ways of cleaning and handling data
are described below.
Missing values
Missing values refer to data points or observations with
incomplete or absent information. For example, in a survey,
if people do not answer a certain question, the related
entries will be empty. Appropriate methods, like imputation
or exclusion, are used to address them. If there are missing
values, one way to handle them is to drop them, as shown in
Tutorial 1.32.
Tutorial 1.32: To implement finding the missing values and
dropping them.
Let us check the prize_csv_df data frame for null values and
drop the null ones, as follows:
import pandas as pd
from IPython.display import display
# Read the prize CSV file from the directory
prize_csv_df = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter1/prize.csv")
# Display the count of null values in the dataframe
print(prize_csv_df.isna().sum())
Output:
year                     374
category                 374
overallMotivation        980
laureates__id             49
laureates__firstname      50
laureates__surname        82
laureates__motivation     49
laureates__share          49
Since prize_csv_df has null values, let us drop them and
view the count of null values after the drop, as follows:
print("\n \n **** After dropping the null values in prize_csv_df****")
after_dropping_null_prize_df = prize_csv_df.dropna()
print(after_dropping_null_prize_df.isna().sum())
Finally, after applying the above code, the output will be as
follows:
**** After dropping the null values in prize_csv_df****
year                     0
category                 0
overallMotivation        0
laureates__id            0
laureates__firstname     0
laureates__surname       0
laureates__motivation    0
laureates__share         0
dtype: int64
This shows there are now zero null values in all the columns.
Imputation
Imputation means placing a substitute value in place of the
missing values, for example, constant-value imputation, mean
imputation, or mode imputation.
Tutorial 1.33: To implement imputing the mean value of
the column laureates__share.
Mean imputation only applies to numeric data types, and
fillna() expects a scalar, so we cannot use the
mean() method to fill missing values in object columns.
import pandas as pd
from IPython.display import display
# Read the prize CSV file from the directory
prize_csv_df = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter1/prize.csv")
# View the number of null values in the original DataFrame
print("Null Value Before", prize_csv_df['laureates__share'].isna().sum())
# Calculate the mean of the column
prize_col_mean = prize_csv_df['laureates__share'].mean()
# Fill missing values with the column mean; inplace=True replaces the values in the original DataFrame
prize_csv_df['laureates__share'].fillna(value=prize_col_mean, inplace=True)
# View the number of null values in the new DataFrame
print("Null Value After", prize_csv_df['laureates__share'].isna().sum())
Output:
Null Value Before 49
Null Value After 0
Also, to fill missing values in object columns, you have to
use a different strategy, such as a constant value, that is,
df[column_name].fillna(' '), a mode value, or a custom
function.
Tutorial 1.34: To implement imputing the mode value in
the object data type column.
import pandas as pd
from IPython.display import display
# Read the prize CSV file from the directory
prize_csv_df = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter1/prize.csv")
# Display the count of null values in the original DataFrame
print(prize_csv_df.isna().sum())
# Select the object columns
object_cols = prize_csv_df.select_dtypes(include='object').columns
# Calculate the mode of each object data type column
col_mode = prize_csv_df[object_cols].mode().iloc[0]
# Fill missing values with the mode of each object data type column
prize_csv_df[object_cols] = prize_csv_df[object_cols].fillna(col_mode)
# Display the count of null values after filling the object data type columns
print(prize_csv_df.isna().sum())
Output:
year                     374
category                 374
overallMotivation        980
laureates__id             49
laureates__firstname      50
laureates__surname        82
laureates__motivation     49
laureates__share          49
dtype: int64
year                     374
category                   0
overallMotivation          0
laureates__id             49
laureates__firstname       0
laureates__surname         0
laureates__motivation      0
laureates__share          49
dtype: int64
Duplicates
Data may be duplicated or contain duplicate values. Duplicates
will affect the final statistical results, so identifying and
removing them is a necessary step, as explained in this
section.
Tutorial 1.35: To implement identifying and removing
duplicate rows in data frame with duplicated(), as follows:
# Identify duplicate rows and display their index
print(prize_csv_df.duplicated().index[prize_csv_df.duplicated()])
Since there are no duplicates, the output is empty; it displays
the indexes of duplicates as follows:
Index([], dtype='int64')
Also, you can find the duplicate values in a specific column
by using the following code:
prize_csv_df.duplicated(subset=['name_of_the_column'])
To remove duplicate rows, the drop_duplicates() method can be
used. More generally, the drop() method removes rows or
columns, with the syntax dataframe.drop(labels, axis='columns',
inplace=False); it can be applied to rows and columns using
label and index values, as follows:
import pandas as pd
# Create a sample dataframe
people_df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35], 'gender': ['F', 'M', 'M']})
# Print the original dataframe
print("original dataframe \n", people_df)
# Drop the 'gender' column and return a new dataframe
new_df = people_df.drop('gender', axis='columns')
# Print the new dataframe
print("dataframe after drop \n", new_df)
Output:
original dataframe
       name  age gender
0    Alice   25      F
1      Bob   30      M
2  Charlie   35      M
dataframe after drop
       name  age
0    Alice   25
1      Bob   30
2  Charlie   35
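Since the text above mentions drop_duplicates() while the example only shows drop(), here is a minimal sketch of identifying and removing duplicate rows; the small data frame and its values are made up for illustration.
import pandas as pd
# A small sample dataframe with one duplicated row
scores_df = pd.DataFrame({'name': ['Alice', 'Bob', 'Alice'], 'score': [90, 85, 90]})
# Boolean mask marking rows that repeat an earlier row
print(scores_df.duplicated())
# Keep only the first occurrence of each duplicated row
deduplicated_df = scores_df.drop_duplicates()
print(deduplicated_df)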
Outliers
Outliers are data points that are very different from the
other data points. They can be much higher or lower than
the standard range of values. For example, if the heights of
ten people in centimeters are measured, the values might
be as follows:
160, 165, 170, 175, 180, 185, 190, 195, 200, 1500.
Most of the heights are alike but the last measurement is
much larger than the others. This data point is an outlier
because it is not like the rest of the data. The best way to
handle outliers is to identify outliers and then correct,
resolve, or leave as needed. Ways to identify outliers are to
compute mean, standard deviation, and quantile (a
common approach is to compute interquartile range).
Another way to identify outliers is by computing the z-score
of the data points and then considering points beyond the
threshold values as outliers.
Tutorial 1.36: To implement identifying outliers in a data
frame with zscore.
Z-score measures how many standard deviations a value is
from the mean. In the following code, z_score identifies
outliers in the laureates’ share column:
import pandas as pd
import numpy as np
# Read the prize CSV file from the directory
prize_csv_df = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter1/prize.csv")
# Calculate the mean, standard deviation, and Z-scores for the column
z_scores = np.abs((prize_csv_df['laureates__share'] - prize_csv_df['laureates__share'].mean()) / prize_csv_df['laureates__share'].std())
# Define a threshold for outliers (e.g., 2)
threshold = 2
# Display the row index of the outliers
print(prize_csv_df.index[z_scores > threshold])
Output:
Index([  17,   18,   22,   23,   34,   35,   48,   49,   54,   55,   62,   63,
         73,   74,   86,   87,   97,   98,  111,  112,  144,  145,  146,  147,
        168,  169,  180,  181,  183,  184,  215,  216,  242,  243,  249,  250,
        255,  256,  277,  278,  302,  303,  393,  394,  425,  426,  467,  468,
        471,  472,  474,  475,  501,  502,  514,  515,  556,  557,  563,  564,
        607,  608,  635,  636,  645,  646,  683,  684,  760,  761,  764,  765,
       1022, 1023],
      dtype='int64')
The output shows the row index of the outliers in the
laureates’ share column of the prize.csv file. Outliers are
values that are unusually high or low compared to the rest
of the data. The code uses a z-score to measure how many
standard deviations a value is from the mean of the column.
A higher z-score means a more extreme value. The code
defines a threshold of two, which means that any value with
a z-score greater than two is considered an outlier.
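The quantile-based approach mentioned earlier can also be implemented directly. The following is a minimal sketch, assuming the same prize.csv file and laureates__share column, that flags values outside 1.5 times the interquartile range.
import pandas as pd
# Load the same prize.csv used above (path assumed from the earlier tutorials)
prize_csv_df = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter1/prize.csv")
# First and third quartiles of the laureates__share column
q1 = prize_csv_df['laureates__share'].quantile(0.25)
q3 = prize_csv_df['laureates__share'].quantile(0.75)
iqr = q3 - q1
# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outlier_mask = (prize_csv_df['laureates__share'] < lower_bound) | (prize_csv_df['laureates__share'] > upper_bound)
# Display the row index of the IQR-based outliers
print(prize_csv_df.index[outlier_mask])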
Additionally, preparing data, cleaning it, manipulating it,
and doing data wrangling includes the following:
Checking typos and spelling errors. Python provides
libraries like PySpellChecker, NLTK, TextBlob, or
Enchant to check typos and spelling errors (a minimal
spell-check sketch follows this list).
Data transformation is a change from one form to
another desired form. It involves aggregation,
conversion, normalization, and more; these are
covered in detail in Chapter 2, Exploratory Data
Analysis.
Handling inconsistencies, which involves identifying
conflicting information and resolving it. For
example, a body temperature listed as 1400 degrees Celsius
is not correct.
Standardizing formats and units of measurement to
ensure consistency.
Further, data integrity and validation ensure that data
is unchanged, not altered or corrupted. Data validation
verifies that the data to be used is correct (using
techniques like validation rules and manual review).
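As referenced in the first item above, a minimal spell-check sketch using the pyspellchecker library might look like the following; it assumes the package is installed with pip install pyspellchecker, and the sample words are made up.
from spellchecker import SpellChecker
spell = SpellChecker()
words = ["statistcs", "data", "analyssi"]
# Words the dictionary does not recognize
misspelled = spell.unknown(words)
for word in misspelled:
    # Suggest the most likely correction for each misspelled word
    print(word, "->", spell.correction(word))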
Conclusion
Statistics provides a structured framework for
understanding and interpreting the world around us. It
empowers us to gather, organize, analyze, and interpret
information, thereby revealing patterns, testing
hypotheses, and informing decisions. In this chapter, we
examined the foundations of data and statistics: from the
distinction between qualitative (descriptive) and
quantitative (numeric) data to the varying levels of
measurement—nominal, ordinal, interval, and ratio. We
also considered the scope of analysis in terms of the
number of variables involved—whether univariate,
bivariate, or multivariate—and recognized that data can
originate from diverse sources, including surveys,
experiments, and observations.
We explored how careful data collection methods—whether
sampling from a larger population or studying an entire
group—can significantly affect the quality and applicability
of our findings. Ensuring data quality is key, as the validity
and reliability of statistical results depend on accurate,
complete, and consistent information. Data cleaning
addresses errors and inconsistencies, while data wrangling
and manipulation techniques help us prepare data for
meaningful analysis.
By applying these foundational concepts, we establish a
platform for more advanced techniques. In the upcoming
Chapter 2, Exploratory Data Analysis, we will learn to transform
and visualize data in ways that reveal underlying
structures, guide analytical decisions, and communicate
insights effectively, enabling us to extract even greater
value from data.
1 Source: https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-dataset
2 Source: https://github.com/jdorfman/awesome-json-datasets#nobel-prize
3 Source: https://github.com/jdorfman/awesome-json-datasets#nobel-prize
Join our book’s Discord space
Join the book's Discord Workspace for Latest updates,
Offers, Tech happenings around the world, New Release and
Sessions with the Authors:
https://discord.bpbonline.com
CHAPTER 2
Exploratory Data Analysis
Introduction
Exploratory Data Analysis (EDA) is the technique of
examining, understanding, and summarizing data using
various methods. EDA uncovers important insights,
features, characteristics, patterns, relationships, and
outliers. It also generates hypotheses for the research
questions and covers descriptive statistics, a graphical
representation of data in a meaningful way, and data
exploration in general. In this chapter, we present
techniques for data aggregation, transformation,
normalization, standardization, binning, grouping, data
coding, and encoding, handling missing data and outliers,
and the appropriate data visualization methods.
Structure
In this chapter, we will discuss the following topics:
Exploratory data analysis and its importance
Data aggregation
Data normalization, standardization, and transformation
Data binning, grouping, encoding
Missing data, detecting and treating outliers
Visualization and plotting of data
Objectives
By the end of this chapter, readers will learn techniques to
explore data and gather meaningful insights in order to
know the data well and understand it better. You will learn
different data preprocessing methods and how to
apply them. Further, this chapter also explains data
encoding, grouping, cleansing, and visualization techniques
with Python.
Data aggregation
Data aggregation in statistics involves summarizing
numerical data using statistical measures like mean,
median, mode, standard deviation, or percentile. This
approach helps detect irregularities and outliers, and
enables effective analysis. For example, to determine the
average height of students in a class, their individual
heights can be aggregated using the mean function,
resulting in a single value representing the central tendency
of the data. To evaluate the extent of variation in student
heights, aggregate the same heights with the standard
deviation function, which indicates how spread out the data
is from the average. The practice of data aggregation in
statistics can simplify large data sets and aid in
comprehending them.
Mean
The mean is a statistical measure used to determine the
average value of a set of numbers. To obtain the mean, add
all numbers and divide the sum by the number of values.
For example, if you have five test scores: 80, 90, 70, 60, and
100, the mean will be as follows:
Mean = (80 + 90 + 70 + 60 + 100) / 5 = 400 / 5 = 80
The mean of 80 represents the typical score for this series of
tests.
Tutorial 2.1: An example to compute the mean from a list
of numbers, is as follows:
1. # Define a list of test scores
2. test_scores = [80, 90, 70, 60, 100]
3. # Calculate the sum of the test scores
4. total = sum(test_scores)
5. # Calculate the number of test scores
6. count = len(test_scores)
7. # Calculate the mean by dividing the sum by the count
8. mean = total / count
9. # Print the mean
10. print("The mean is", mean)
The Python sum() function takes a list of numbers and
returns their sum. For instance, sum([1, 2, 3]) equals 6.
On the other hand, the len() function calculates the number
of elements in a sequence like a string, a list, or a tuple. For
example, len("hello") returns 5.
Output:
1. The mean is 80.0
Median
Median determines the middle value of a data set by
locating the value positioned at the center when the data is
arranged from smallest to largest. When there is an even
number of data points, the median is calculated as the
average of the two middle values. For example, among test
scores: 75, 80, 85, 90, 95. To determine the median, we
must sort the data and locate the middle value. In this case
the middle value is 85 thus, the median is 85. If we add
another score of 100 to the dataset, we now have six data
points: 75, 80, 85, 90, 95, 100. Therefore, the median is the
average of the two middle values 85 and 90. The average of
the two values: (85 + 90) / 2 = 87.5. Hence, the median is
87.5.
Tutorial 2.2: An example to compute the median is as
follows:
1. # Define the dataset as a list
2. data = [75, 80, 85, 90, 95, 100]
3. # Calculate the number of data points
4. num_data_points = len(data)
5. # Sort the data in ascending order
6. data.sort()
7. # Check if the number of data points is odd
8. if num_data_points % 2 == 1:
9. # If odd, find the middle value (median)
10. median = data[num_data_points // 2]
11. else:
12. # If even, calculate the average of the two middle valu
es
13. middle1 = data[num_data_points // 2 - 1]
14. middle2 = data[num_data_points // 2]
15. median = (middle1 + middle2) / 2
16. # Print the calculated median
17. print("The median is:", median)
Output:
1. The median is: 87.5
The median is a useful tool for summarizing data that is
skewed or has outliers. It is more reliable than the mean,
which can be impacted by extreme values. Furthermore, the
median separates the data into two equal halves.
Mode
Mode represents the value that appears most frequently in a
given data set. For example, consider a set of shoe sizes
that is, 6, 7, 7, 8, 8, 8, 9, 10. To find the mode, count how
many times each value appears and identify the value that
occurs most frequently. The mode is the most common
value. In this case, the mode is 8 since it appears three
times, more than any other value.
Tutorial 2.3: An example to compute the mode, is as
follows:
1. # Define the dataset as a list
2. shoe_sizes = [6, 7, 7, 8, 8, 8, 9, 10]
3. # Create an empty dictionary to store the count of each
value
4. size_counts = {}
5. # Iterate through the dataset to count occurrences
6. for size in shoe_sizes:
7. if size in size_counts:
8. size_counts[size] += 1
9. else:
10. size_counts[size] = 1
11. # Find the mode by finding the key with the maximum v
alue in the dictionary
12. mode = max(size_counts, key=size_counts.get)
13. # Print the mode
14. print("The mode is:", mode)
max(), used in Tutorial 2.3, is a Python function that returns
the highest value from an iterable such as a list or
dictionary. Here it retrieves the key (the shoe size) with the
highest count in the size_counts dictionary. The .get()
method of the dictionary is passed as the key function to
max(): it retrieves the count associated with each shoe size,
and max() uses these counts to determine which shoe size
occurs most often, that is, the mode.
Output:
1. The mode is: 8
Variance
Variance measures the deviation of data values from their
average in a dataset. It is calculated by averaging the
squared differences between each value and the mean. A
high variance suggests that data is spread out from the
mean, while a low variance suggests that data is tightly
grouped around the mean. For example, suppose we have
two sets of test scores: A = [90, 92, 94, 96, 98] and B =
[70, 80, 90, 100, 130]. The mean of both sets is 94, but
the variance of A is 8 and B is 424. Lower variance in A
means the scores in A are more consistent and closer to the
mean than the scores in B. We can use the var() function
from the numpy module to see the variance in Python.
Tutorial 2.4: An example to compute the variance is as
follows:
1. import numpy as np
2. # Define two sets of test scores
3. A = [90, 92, 94, 96, 98]
4. B = [70, 80, 90, 100, 130]
5. # Calculate and print the mean of A and B
6. print("The mean of A is", sum(A)/len(A))
7. print("The mean of B is", sum(B)/len(B))
8. # Calculate and print the variance of A and B
9. var_A = np.var(A)
10. var_B = np.var(B)
11. print("The variance of A is", var_A)
12. print("The variance of B is", var_B)
To compute the variance in a pandas data frame, the
simplest way is the var() method: if we have a data frame
named df, df.var() returns the variance of each numeric
column. The describe() method returns a summary of the
descriptive statistics for each column, including the standard
deviation (the square root of the variance), but not the
variance itself. Another way is to use the apply() method,
which applies a function to each column or row of a data
frame. For example, to compute the variance of each row,
we can use df.apply(np.var, axis=1), where np.var is the
NumPy function for variance and axis=1 means that the
function is applied along the row axis. A short sketch of
these options follows the output below.
Output:
1. The mean of A is 94.0
2. The mean of B is 94.0
3. The variance of A is 8.0
4. The variance of B is 424.0
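A minimal sketch of these pandas options on a small, hypothetical data frame (the column names here are illustrative, not from the book's datasets):
1. import numpy as np
2. import pandas as pd
3. # A small illustrative data frame with two numeric columns
4. df = pd.DataFrame({"A": [90, 92, 94, 96, 98], "B": [70, 80, 90, 100, 130]})
5. # Variance of each column (pandas uses the sample variance, dividing by n - 1)
6. print(df.var())
7. # describe() summarizes each column; it reports the standard deviation, not the variance
8. print(df.describe())
9. # Variance of each row, applying NumPy's population variance along axis=1
10. print(df.apply(np.var, axis=1))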
Standard deviation
Standard deviation is a measure of how much the values in
a data set vary from the mean. It is calculated by taking the
square root of the variance. A high standard deviation
means that the data is spread out, while a low standard
deviation means that the data is concentrated around the
mean. For example, suppose we have two sets of test
scores: A = [90, 92, 94, 96, 98] and B = [70, 80, 90,
100, 110]. The mean of A is 94 and the mean of B is 90, but
the standard deviation of A is about 2.83 while the standard
deviation of B is about 14.14. This means that the scores in
A are more consistent and closer to their mean than the
scores in B. To
find the standard deviation in Python, we can use the std()
function from the numpy module.
Tutorial 2.5: An example to compute the standard
deviation is as follows:
1. # Import numpy module
2. import numpy as np
3. # Define two sets of test scores
4. A = [90, 92, 94, 96, 98]
5. B = [70, 80, 90, 100, 110]
6. # Calculate and print the standard deviation of A and B
7. std_A = np.std(A)
8. std_B = np.std(B)
9. print("The standard deviation of A is", std_A)
10. print("The standard deviation of B is", std_B)
Output:
1. The standard deviation of A is 2.82
2. The standard deviation of B is 14.14
Quantiles
A quantile is a value that divides a data set into groups of
equal size, typically four (quartiles), five (quintiles),
or ten (deciles). The groups are formed by ranking the data
set in ascending order, ensuring that each group contains
the same number of values. Quantiles are useful for
summarizing data distribution and comparing different data
sets.
For example, let us consider a set of 15 heights in
centimeters: [150, 152, 154, 156, 158, 160, 162, 164,
166, 168, 170, 172, 174, 176, 178]. To calculate the
quartiles (a specific subset of quantiles) for this dataset,
divide it into four roughly equal groups. Q2, the second
quartile, is the median of the entire data set, which is 164.
Q1, the first quartile, is the median of the lower half of the
data, which is 157 (halfway between 156 and 158), and Q3,
the third quartile, is the median of the upper half, which is
171 (halfway between 170 and 172). The quartiles split the
data into four segments: values below Q1, values between
Q1 and Q2, values between Q2 and Q3, and values above Q3.
This separation facilitates understanding and comparison of
distinct segments of the data's distribution.
Tutorial 2.6: An example to compute the quantiles is as
follows:
1. # Import numpy module
2. import numpy as np
3. # Define a data set of heights in centimeters
4. heights = [150 ,152 ,154 ,156 ,158 ,160 ,162 ,164 ,166 ,
168 ,170 ,172 ,174 ,176 ,178]
5. # Calculate and print the quartiles of the heights
6. Q1 = np.quantile(heights ,0.25)
7. Q2 = np.quantile(heights ,0.5)
8. Q3 = np.quantile(heights ,0.75)
9. print("The first quartile is", Q1)
10. print("The second quartile is", Q2)
11. print("The third quartile is", Q3)
Output:
1. The first quartile is 157.0
2. The second quartile is 164.0
3. The third quartile is 171.0
Tutorial 2.7: An example to compute the mean, median,
mode, variance, standard deviation, maximum, and minimum
values in a pandas data frame.
The mean, median, mode, variance, standard deviation,
maximum, and minimum values of a data frame can be
computed easily with mean(), median(), mode(), var(),
std(), max(), and min() respectively, as follows:
1. # Import the pandas library
2. import pandas as pd
3. # Import display function
4. from IPython.display import display
5. # Load the diabetes data from a csv file
6. diabetes_df = pd.read_csv(
7. "/workspaces/ImplementingStatisticsWithPython/data
/chapter1/diabetes.csv")
8. # Print the mean of each column
9. print(f'Mean: \n {diabetes_df.mean()}')
10. # Print the median of each column
11. print(f'Median: \n {diabetes_df.median()}')
12. # Print the mode of each column
13. print(f'Mode: \n {diabetes_df.mode()}')
14. # Print the variance of each column
15. print(f'Variance: \n {diabetes_df.var()}')
16. # Print the standard deviation of each column
17. print(f'Standard Deviation: \n{diabetes_df.std()}')
18. # Print the maximum value of each column
19. print(f'Maximum: \n {diabetes_df.max()}')
20. # Print the minimum value of each column
21. print(f'Minimum: \n {diabetes_df.min()}')
Tutorial 2.8: An example to compute mean, median, mode,
standard deviation, maximum, minimum value in NumPy
array, is as follows:
1. # Import the numpy and statistics libraries
2. import numpy as np
3. import statistics as st
4. # Create a numpy array with some data
5. data = np.array([12, 15, 20, 25, 30, 30, 35, 40, 45, 50])
6. # Calculate the mean of the data using numpy
7. mean = np.mean(data)
8. # Calculate the median of the data using numpy
9. median = np.median(data)
10. # Calculate the mode of the data using statistics
11. mode_result = st.mode(data)
12. # Calculate the standard deviation of the data using num
py
13. std_dev = np.std(data)
14. # Find the maximum value of the data using numpy
15. maximum = np.max(data)
16. # Find the minimum value of the data using numpy
17. minimum = np.min(data)
18. # Print the results to the console
19. print("Mean:", mean)
20. print("Median:", median)
21. print("Mode:", mode_result)
22. print("Standard Deviation:", std_dev)
23. print("Maximum:", maximum)
24. print("Minimum:", minimum)
Output:
1. Mean: 30.2
2. Median: 30.0
3. Mode: 30
4. Standard Deviation: 11.93
5. Maximum: 50
6. Minimum: 12
Tutorial 2.9: An example to compute variance, quantiles,
and percentiles using var() and quantile from diabetes
dataset data frame, and also describe() to describe the
data frame, is as follows:
1. import pandas as pd
2. from IPython.display import display
3. # Load the diabetes data from a csv file
4. diabetes_df = pd.read_csv(
5. "/workspaces/ImplementingStatisticsWithPython/data
/chapter1/diabetes.csv")
6. # Calculate the variance of each column using pandas
7. variance = diabetes_df.var()
8. # Calculate the quantiles (25th, 50th, and 75th percentil
es) of each column using pandas
9. quantiles = diabetes_df.quantile([0.25, 0.5, 0.75])
10. # Calculate the percentiles (90th and 95th percentiles) of
each column using pandas
11. percentiles = diabetes_df.quantile([0.9, 0.95])
12. # Display the results using the display function
13. display("Variance:", variance)
14. display("Quantiles:", quantiles)
15. display("Percentiles:", percentiles)
This will calculate the variance, quantile and percentile of
each column in the diabetes_df data frame.
Data normalization
Data normalization rescales data entries to a common range,
which improves their suitability for analysis and comparison
and results in higher quality data. It also reduces the impact
of differing scales and outliers, enhances algorithm
performance, increases data interpretability, and uncovers
underlying patterns among variables. For example, consider
the following scores:

Name     Score
Alice    80
Bob      60
Carol    90
David    40

After normalization, the scores lie between 0 and 1:

Name     Score    Normalized score
Alice    80       0.8
Bob      60       0.6
Carol    90       0.9
David    40       0.4
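A minimal sketch of normalization with pandas, using the illustrative scores above (the column names are assumptions for this example):
1. import pandas as pd
2. # Illustrative scores from the table above
3. df = pd.DataFrame({"Name": ["Alice", "Bob", "Carol", "David"], "Score": [80, 60, 90, 40]})
4. # Min-max normalization rescales values to the range 0 to 1
5. df["Score_minmax"] = (df["Score"] - df["Score"].min()) / (df["Score"].max() - df["Score"].min())
6. # Simple scaling by the maximum possible score (100) gives the values shown in the table above
7. df["Score_scaled"] = df["Score"] / 100
8. print(df)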
Data standardization
Data standardization is a type of data transformation that
adjusts data to have a mean of zero and a standard
deviation of one. It helps compare variables with different
scales or units and is necessary for algorithms like
Principal Component Analysis (PCA), Linear
Discriminant Analysis (LDA), or k-means clustering that
require standardized data. By standardizing values, we can
measure how far each value is from the mean in terms of
standard deviations. This can help us identify outliers,
perform hypothesis tests, or apply machine learning
algorithms that require standardized data. There are
different ways to standardize data like min-max
normalization described in normalization of data frames, but
the z-score formula remains the most widely used. This
formula adjusts each value in a dataset by subtracting the
mean and dividing it by the standard deviation. The formula
is as follows:
z = (x - μ) / σ
Where x represents the original value, μ represents the
mean, and σ represents the standard deviation.
Suppose we have a dataset of two variables, height (in
centimeters) and weight (in kilograms), for a group of
people:

Height    Weight
160       50
175       70
180       80
168       60

After applying the z-score formula to each column, the
standardized values are as follows:

Height    Weight
-1.18     -1.07
0.79      0.66
1.45      1.52
-0.13     -0.21
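A minimal sketch of z-score standardization in Python, using only the rows visible above (so the resulting values are illustrative and need not match the table exactly):
1. import pandas as pd
2. # Illustrative height (cm) and weight (kg) values from the table above
3. df = pd.DataFrame({"Height": [160, 175, 180, 168], "Weight": [50, 70, 80, 60]})
4. # z = (x - mean) / standard deviation, applied column-wise
5. # ddof=0 uses the population standard deviation; pandas defaults to the sample version (ddof=1)
6. z_scores = (df - df.mean()) / df.std(ddof=0)
7. print(z_scores)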
Data transformation
Data transformation is essential as it satisfies the
requirements for particular statistical tests, enhances data
interpretation, and improves the visual representation of
charts. For example, consider a dataset that includes the
heights of 100 students measured in centimeters. If the
distribution of data is positively skewed (more students are
shorter than taller), assumptions like normality and equal
variances must be satisfied before conducting a t-test. A t-
test (a statistical test used to compare the means of two
groups) on the average height of male and female students
may produce inaccurate results if skewness violates these
assumptions.
To mitigate this problem, transform the height data by
taking the square root or logarithm of each measurement,
which reduces the skewness of the distribution. A t-test on
the transformed data then compares the average heights of
male and female students more reliably. To report results on
the original scale, apply the inverse of the transformation;
for example, if the transformation was the square root,
square the result to express the values in centimeters
again. Another reason to use data
transformation is to improve data visualization and
understanding. For example, suppose you have a dataset of
the annual income of 1000 people in US dollars that is
skewed to the right, indicating that more participants are in
the lower-income bracket. If you want to create a histogram
that shows income distribution, you will see that most of the
data is concentrated in a few bins on the left, while some
outliers exist on the right side. For improved clarity in
identifying the distribution pattern and range, apply a
transformation to the income data by taking the logarithm
of each value. This distributes the data evenly across bins
and minimizes the effect of outliers. After that, plot a
histogram of the log-transformed income to show the
income fluctuations among individuals.
Tutorial 2.17: An example to show the data transformation
of the annual income of 1000 people in US dollars, which is
a skewed data set, is as follows:
1. # Import the libraries
2. import numpy as np
3. import matplotlib.pyplot as plt
4. # Generate some random data for the annual income of
1000 people in US dollars
5. np.random.seed(42) # Set the seed for reproducibility
6. income = np.random.lognormal(mean=10, sigma=1, size
=1000) # Generate 1000 incomes from a lognormal distri
bution with mean 10 and standard deviation 1
7. income = income.round(2) # Round the incomes to two
decimal places
8. # Plot a histogram of the original income
9. plt.hist(income, bins=20)
10. plt.xlabel("Income (USD)")
11. plt.ylabel("Frequency")
12. plt.title("Histogram of Income")
13. plt.show()
Figure 2.1 shows the initial distribution of the annual income
of 1000 people in US dollars.
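To illustrate the log transformation described above, the following minimal sketch continues from the income array generated in Tutorial 2.17 and plots the transformed histogram; it is a sketch of the idea, not one of the book's numbered tutorials:
1. import numpy as np
2. import matplotlib.pyplot as plt
3. # Regenerate the same skewed income data as in Tutorial 2.17
4. np.random.seed(42)
5. income = np.random.lognormal(mean=10, sigma=1, size=1000)
6. # Apply a logarithmic transformation to reduce the right skew
7. log_income = np.log(income)
8. # The log-transformed values are spread far more evenly across the bins
9. plt.hist(log_income, bins=20)
10. plt.xlabel("log(Income)")
11. plt.ylabel("Frequency")
12. plt.title("Histogram of Log-Transformed Income")
13. plt.show()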
Data binning
Data binning groups continuous or discrete values into a
smaller number of bins or intervals. For example, if you
have data on the ages of 100 people, you may group them
into five bins: [0-20), [20-40), [40-60), [60-80), and [80-100],
where [0-20) includes values greater than or equal to 0 and
less than 20, and [80-100] includes values greater than or
equal to 80 and less than or equal to 100. Each bin represents a
range of values, and the number of cases in each bin can be
counted or visualized. Data binning reduces noise, outliers,
and skewness in the data, making it easier to view
distribution and trends.
Tutorial 2.19: A simple implementation of data binning for
grouping the ages of 100 people into five bins: [0-20), [20-
40), [40-60), [60-80), and [80-100] is as follows:
1. # Import the libraries
2. import numpy as np
3. import pandas as pd
4. import matplotlib.pyplot as plt
5. # Generate some random data for the ages of 100 peopl
e
6. np.random.seed(42) # Set the seed for reproducibility
7. ages = np.random.randint(low=0, high=101, size=100)
# Generate 100 ages between 0 and 100
8. # Create a pandas dataframe with the ages
9. df = pd.DataFrame({"Age": ages}) # Create a dataframe
with one column: Age
10. # Define the bins and labels for the age groups
11. bins = [0, 20, 40, 60, 80, 100] # Define the bin edges
12. labels = ["[0-20)", "[20-40)", "[40-60)", "[60-80)", "[80-
100]"] # Define the bin labels
13. # Apply data binning to the ages using the pd.cut functi
on
14. df["Age Group"] = pd.cut(df["Age"], bins=bins, labels=la
bels, right=False) # Create a new column with the age g
roups
15. # Print the first 10 rows of the dataframe
16. print(df.head(10))
Output:
1. Age Age Group
2. 0 51 [40-60)
3. 1 92 [80-100]
4. 2 14 [0-20)
5. 3 71 [60-80)
6. 4 60 [60-80)
7. 5 20 [20-40)
8. 6 82 [80-100]
9. 7 86 [80-100]
10. 8 74 [60-80)
11. 9 74 [60-80)
Tutorial 2.20: An example to apply binning on diabetes
dataset by grouping the ages of all the people in dataset
into three bins: [< 30], [30-60], [60-100], is as follows:
1. import pandas as pd
2. # Read the csv file from the directory
3. diabetes_df = pd.read_csv(
4. "/workspaces/ImplementingStatisticsWithPython/data
/chapter1/diabetes.csv")
5. # Define the bin intervals
6. bin_edges = [0, 30, 60, 100]
7. # Use cut to create a new column with bin labels
8. diabetes_df['Age_Group'] = pd.cut(diabetes_df['Age'],
bins=bin_edges, labels=[
9. '<30', '30-60', '60-100'])
10. # Count the number of people in each age group
11. age_group_counts = diabetes_df['Age_Group'].
value_counts().sort_index()
12. # View new DataFrame with the new bin(categories) colu
mns
13. diabetes_df
The output is a new data frame with Age_Group column
consisting appropriate bin label.
Tutorial 2.21: An example to apply binning on NumPy
array data by grouping the scores of students in exam into
five bins based on the scores obtained: [< 60], [60-69], [70-
79], [80-89] , [90+], is as follows:
1. import numpy as np
2. import matplotlib.pyplot as plt
3. # Create a sample NumPy array of exam scores
4. scores = np.array([75, 82, 95, 68, 90, 85, 78, 72, 88, 93,
60, 72, 80])
5. # Define the bin intervals
6. bin_edges = [0, 60, 70, 80, 90, 100]
7. # Use histogram to count the number of scores in each
bin
8. bin_counts, _ = np.histogram(scores, bins=bin_edges)
9. # Plot a histogram of the binned scores
10. plt.bar(range(len(bin_counts)), bin_counts, align='center
')
11. plt.xticks(range(len(bin_edges) - 1), ['<60', '60-69', '70-
79', '80-89', '90+'])
12. plt.xlabel('Score Range')
13. plt.ylabel('Number of Scores')
14. plt.title('Distribution of Exam Scores')
15. plt.savefig("data_binning2.jpg",dpi=600)
16. plt.show()
Output:
Figure 2.3: Distribution of student’s exam scores in five bins
In text files, data binning means grouping and categorizing
text data based on some criterion. To apply data binning to
text data, keep the following points in mind:
Determine a criterion for binning, for example, the
number of sentences, the word count, a sentiment score,
or the topic.
Read each text and calculate the selected criterion, for
example, count the number of words in it.
Define bins based on ranges of values of the selected
criterion, for example, short, medium, and long based
on word count.
Assign each text file to the appropriate bin based on the
calculated value.
Analyze or summarize the data in the new bins.
Some use cases of binning in text files include grouping files
based on their length, binning based on a sentiment
analysis score, topic binning by performing topic modelling,
language binning if the text files are in different languages,
and time-based binning if the text files have timestamps.
Tutorial 2.22: An example showing data binning of text
files using word counts in the files with three bins: [<26
words] as short [26 and 30 words (inclusive)] as medium,
[>30] as long, is as follows:
1. # Import the os, glob, and pandas modules
2. import os
3. import glob
4. import pandas as pd
5. # Define the path of the folder that contains the files
6. path = "/workspaces/ImplementingStatisticsWithPython
/data/chapter1/TransactionNarrative"
7. files = glob.glob(path + "/*.txt") # Get a list of files that
match the pattern "/*.txt" in the folder
8. # Display the information in the first file
9. file_one = glob.glob("/workspaces/ImplementingStatisti
csWithPython/data/chapter1/TransactionNarrative/1.txt
")
10. for file1 in file_one: # Loop through the file_one list
11. # Open each file in read mode with utf-8 encoding
12. with open(file1, "r", encoding="utf-8") as f1:
13. print(f1.read()) # Print the content of the file object
14. # Function that takes a file name as an argument and ret
urns the word count of that file
15. def word_count(file): # Define a function named word_c
ount that takes a file name as an argument
16. # Open the file in read mode
17. with open(file, "r") as f: # Open the file in read mode
and assign it to a file object named f
18. # Read the file content
19. content = f.read() # Read the content of the file obj
ect and assign it to a variable named content
20. # Split the content by whitespace characters
21. words = content.split() # Split the content by white
space characters and assign it to a variable named word
s
22. # Return the length of the words list
23. return len(words) # Return the length of the words
list as the output of the function
24. counts = [word_count(file) for file in files] # Use a list co
mprehension to apply the word_count function to each fil
e in the files list and assign it to a variable named counts
25. binning_df = pd.DataFrame({"file": files, "count": counts}
) # Create a pandas dataframe with two columns: file an
d count, using the files and counts lists as values
26. binning_df["bin"] = pd.cut(binning_df["count"], bins=
[0, 26, 30, 35]) # Create a new column named bin, using
the pd.cut function to group the count values into three b
ins: [0-26), [26-30), and [30-35]
27. binning_df["bin"] = pd.cut(binning_df["count"], bins=
[0, 26, 30, 35], labels=
["Short", "Medium", "Long"]) # Replace the bin values w
ith labels: Short, Medium, and Long, using the labels arg
ument of the pd.cut function
28. binning_df # Display the dataframe
Output:
The output shows a sample text file, then, the file names,
the number of words in each file, and the assigned bin
labels as follows:
1. Date: 2023-08-05
2. Merchant: Bistro Delight
3. Amount: $42.75
4. Description: Dinner with colleagues - celebrating a
successful project launch.
5.
6. Thank you for choosing Bistro Delight. Your payment of
$42.75 has been processed.
7.
8. file count bin
9. 0 /workspaces/ImplementingStatisticsWithPython/d... 2
5 Short
10. 1 /workspaces/ImplementingStatisticsWithPython/d... 3
0 Medium
11. 2 /workspaces/ImplementingStatisticsWithPython/d... 3
1 Long
12. 3 /workspaces/ImplementingStatisticsWithPython/d... 2
7 Medium
13. 4 /workspaces/ImplementingStatisticsWithPython/d... 3
3 Long
In unstructured data, the data binning can be used for text
categorization and modelling of text data, color quantization
and feature extraction on image data, audio segmentation
and feature extraction on audio data.
Data grouping
Data grouping aggregates data by criteria or categories. For
example, if sales data exists for different products or market
regions, grouping by product type or region can be
beneficial. Each group represents a subset of data that
shares some common attribute, allowing for comparison of
summary statistics or measures. Data grouping simplifies
information, emphasizes group differences or similarities,
and exposes patterns or relationships.
Tutorial 2.23: An example for grouping sales data by
product and region for three different products, is as
follows:
1. # Import pandas library
2. import pandas as pd
3. # Create a sample sales data frame with columns for pro
duct, region, and sales
4. sales_data = pd.DataFrame({
5. "product": ["A", "A", "B", "B", "C", "C"],
6. "region": ["North", "South", "North", "South",
"North", "South"],
7. "sales": [100, 200, 150, 250, 120, 300]
8. })
9. # Print the sales data frame
10. print("\nOriginal dataframe")
11. print(sales_data)
12. # Group the sales data by product and calculate the total
sales for each product
13. group_by_product = sales_data.groupby("product").sum(
)
14. # Print the grouped data by product
15. print("\nGrouped by product")
16. print(group_by_product)
17. # Group the sales data by region and calculate the total sales for each region
18. group_by_region = sales_data.groupby("region").sum()
19. # Print the grouped data by region
20. print("\nGrouped by region")
21. print(group_by_region)
Output:
1. Original dataframe
2. product region sales
3. 0 A North 100
4. 1 A South 200
5. 2 B North 150
6. 3 B South 250
7. 4 C North 120
8. 5 C South 300
9.
10. Grouped by product
11. region sales
12. product
13. A NorthSouth 300
14. B NorthSouth 400
15. C NorthSouth 420
16.
17. Grouped by region
18. product sales
19. region
20. North ABC 370
21. South ABC 750
Tutorial 2.24: An example to show grouping of data based
on age interval through binning and calculate the mean
score for each group, is as follows:
1. # Import pandas library to work with data frames
2. import pandas as pd
3. # Create a data frame with student data, including name
, age, and score
4. data = {'Name': ['John', 'Anna', 'Peter', 'Carol', 'David', 'O
ystein','Hari'],
5. 'Age': [15, 16, 17, 15, 16, 14, 16],
6. 'Score': [85, 92, 78, 80, 88, 77, 89]}
7. df = pd.DataFrame(data)
8. # Create age intervals based on the age column, using bi
ns of 13-16 and 17-18
9. age_intervals = pd.cut(df['Age'], bins=[13, 16, 18])
10. # Group the data frame by the age intervals and calculat
e the mean score for each group
11. grouped_data = df.groupby(age_intervals)
['Score'].mean()
12. # Print the grouped data with the age intervals and the
mean score
13. print(grouped_data)
Output:
1. Age
2. (13, 16] 85.166667
3. (16, 18] 78.000000
4. Name: Score, dtype: float64
Tutorial 2.25: An example of grouping a scikit-learn digit
image dataset based on target labels, where target labels
are numbers from 0 to 9, is as follows:
1. # Import the sklearn library to load the digits dataset
2. from sklearn.datasets import load_digits
3. # Import the matplotlib library to plot the images
4. import matplotlib.pyplot as plt
5.
6. # Class to display and perform grouping of digits
7. class Digits_Grouping:
8. # Constructor method to initialize the object's attributes
9. def __init__(self, digits):
10. self.digits = digits
11.
12. def display_digit_image(self):
13. # Get the images and labels from the dataset
14. images = self.digits.images
15. labels = self.digits.target
16. # Display the first few images along with their label
s
17. num_images_to_display = 5 # You can change this
number as needed
18. # Plot the selected few image in a subplot
19. plt.figure(figsize=(10, 4))
20. for i in range(num_images_to_display):
21. plt.subplot(1, num_images_to_display, i + 1)
22. plt.imshow(images[i], cmap='gray')
23. plt.title(f"Label: {labels[i]}")
24. plt.axis('off')
25. # Save the figure to a file with no padding
26. plt.savefig('data_grouping.jpg', dpi=600, bbox_inch
es='tight')
27. plt.show()
28.
29. def display_label_based_grouping(self):
30. # Group the data based on target labels
31. grouped_data = {}
32. # Iterate through each image and its corresponding
target in the dataset.
33. for image, target in zip(self.digits.images, self.digit
s.target):
34. # Check if the current target value is not already
present as a key in grouped_data.
35. if target not in grouped_data:
36. # If the target is not in grouped_data, add it as
a new key with an empty list as the value.
37. grouped_data[target] = []
38. # Append the current image to the list associated
with the target key in grouped_data.
39. grouped_data[target].append(image)
40. # Print the number of samples in each group
41. for target, images in grouped_data.items():
42. print(f"Target {target}: {len(images)} samples")
43.
44. # Create an object of Digits_Grouping class with the digit
s dataset as an argument
45. displayDigit = Digits_Grouping(load_digits())
46. # Call the display_digit_image method to show some ima
ges and labels from the dataset
47. displayDigit.display_digit_image()
48. # Call the display_label_based_grouping method to show
how many samples are there for each label
49. displayDigit.display_label_based_grouping()
Output:
Data encoding
Data encoding converts categorical or text-based data into
numeric or binary form. For example, you can encode
gender data of 100 customers as 0 for male and 1 for
female. This encoding corresponds to a specific value or
level of the categorical variable to assist machine learning
algorithms and statistical models. Encoding data helps
manage non-numeric data, reduces data dimensionality, and
enhances model performance. It is useful because it allows
us to convert data from one form to another, usually for the
purpose of transmission, storage, or analysis. Data encoding
can help us prepare data for analysis, develop features,
compress data, and protect data.
There are several techniques for encoding data, depending
on the type and purpose of the data as follows:
One-hot encoding: This technique converts categorical
variables, which have a finite number of discrete values
or categories, into binary vectors of 0s and 1s. Each
category is represented by a unique vector where only
one element is 1 and the rest are 0. Appropriate when the
categories have no inherent order. One-hot encoding generates a
column for every unique category variable value, and
binary 1 or 0 values indicate the presence or absence of
each value in each row. This approach encodes
categorical data in a manner that facilitates
comprehension and interpretation by machine learning
algorithms. Nevertheless, it expands data dimensions
and produces sparse matrices.
Tutorial 2.26: An example of applying one-hot encoding in
gender and color, is as follows:
1. import pandas as pd
2. # Create a sample dataframe with 3 columns: name, gen
der and color
3. df = pd.DataFrame({
4. 'name': ['Alice', 'Eve', 'Lee', 'Dam', 'Eva'],
5. 'gender': ['F', 'F', 'M', 'M', 'F'],
6. 'color': ['yellow', 'green', 'green', 'yellow', 'pink']
7. })
8. # Print the original dataframe
9. print("Original dataframe")
10. print(df)
11. # Apply one hot encoding on the gender and color colum
ns using pandas.get_dummies()
12. df_encoded = pd.get_dummies(df, columns=
['gender', 'color'], dtype=int)
13. # Print the encoded dataframe
14. print("One hot encoded dataframe")
15. df_encoded
Tutorial 2.27: An example of applying one-hot encoding in
object data type column in data frame using UCI adult
dataset, is as follows:
1. import pandas as pd
2. # Import the display function used to show the data frame
3. from IPython.display import display
4. # Read the UCI adult dataset csv file from the directory
5. adult_df = pd.read_csv(
6. "/workspaces/ImplementingStatisticsWithPython/data/chapter2/Adult_UCI/adult.data")
7.
8. # Define a function for one hot encoding
9. def one_hot_encoding(df):
10. # Identify the categorical (object) columns to apply one hot encoding on them only
11. columns_for_one_hot = df.select_dtypes(include="object").columns
12. # Apply one hot encoding to the categorical columns
13. df = pd.get_dummies(df, columns=columns_for_one_hot, prefix=columns_for_one_hot, dtype=int)
14. # Display the first rows of the transformed dataframe
15. display(df.head(5))
16.
17. # Call the one hot encoding function by passing the dataframe as argument
18. one_hot_encoding(adult_df)
Label encoding: This technique assigns a numeric value
to each category of a categorical variable. The
numerical values are usually sequential integers
starting from 0. It is appropriate when the categories
have a natural order. The transformed variable will have
numerical values instead of categorical values. Its
drawback is that the integer codes can imply an order or
distance between categories that may not actually exist.
Tutorial 2.28: An example of applying label encoding for
categorical variables, is as follows:
1. import pandas as pd
2. # Create a data frame with name, gender, and color colu
mns
3. df = pd.DataFrame({
4. 'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Ane', 'B
o'],
5. 'gender': ['F', 'M', 'M', 'M', 'F', 'F', 'M'],
6. 'color': ['red', 'blue', 'green', 'yellow', 'pink', 'red', 'blue
']
7. })
8. # Convert the gender column to a categorical variable an
d assign numerical codes to each category
9. df['gender_label'] = df['gender'].astype('category').cat.c
odes
10. # Convert the color column to a categorical variable and
assign numerical codes to each category
11. df['color_label'] = df['color'].astype('category').cat.codes
12. # Print the data frame with the label encoded columns
13. print(df)
Binary encoding: Binary encoding converts categorical
variables into fixed-length binary codes. Each unique
category is assigned an integer value, which is then
converted into its binary representation, and each binary
digit becomes a separate column of 0s and 1s. This
reduces the number of columns necessary to describe
categorical data, unlike one-hot encoding, which requires
a new column for each unique category. However, binary
encoding has certain downsides, such as the creation of
ordinality or hierarchy within categories that did not
previously exist, making interpretation and analysis more
challenging.
Tutorial 2.29: An example of applying binary encoding for
categorical variables using category_encoders package
from pip, is as follows:
1. # Import pandas library and category_encoders library
2. import pandas as pd
3. import category_encoders as ce
4. # Create a sample dataframe with 3 columns: name, gen
der and color
5. df = pd.DataFrame({
6. 'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Ane', 'B
o'],
7. 'gender': ['F', 'M', 'M', 'M', 'F', 'F', 'M'],
8. 'color': ['red', 'blue', 'green', 'yellow', 'pink', 'red', 'blue
']
9. })
10. # Print the original dataframe
11. print("Original dataframe")
12. print(df)
13. # Create a binary encoder object
14. encoder = ce.BinaryEncoder(cols=['gender', 'color'])
15. # Fit and transform the dataframe using the encoder
16. df_encoded = encoder.fit_transform(df)
17. # Print the encoded dataframe
18. print("Binary encoded dataframe")
19. print(df_encoded)
Output:
1. Original dataframe
2. name gender color
3. 0 Alice F red
4. 1 Bob M blue
5. 2 Charlie M green
6. 3 David M yellow
7. 4 Eve F pink
8. 5 Ane F red
9. 6 Bo M blue
10. Binary encoded dataframe
11. name gender_0 gender_1 color_0 color_1 color_2
12. 0 Alice 0 1 0 0 1
13. 1 Bob 1 0 0 1 0
14. 2 Charlie 1 0 0 1 1
15. 3 David 1 0 1 0 0
16. 4 Eve 0 1 1 0 1
17. 5 Ane 0 1 0 0 1
18. 6 Bo 1 0 0 1 0
The difference between binary encoders and one-hot
encoders is in how they encode categorical variables. One-
hot encoding, which creates a new column for each
categorical value and marks their existence with either 1 or
0. However, binary encoding converts each categorical
variable value into a binary code and separates them into
distinct columns. For example, a data frame's color column
can be one-hot encoded, as shown below. The same color
column can also be binary encoded, where each unique
combination of bits represents a specific color, as follows:
Tutorial 2.30: An example to illustrate difference of one-
hot encoding and binary encoding, is as follows:
1. # Import the display function to show the data frames
2. from IPython.display import display
3. # Import pandas library to work with data frames
4. import pandas as pd
5. # Import category_encoders library to apply different en
coding techniques
6. import category_encoders as ce
7.
8. # Class to compare the difference between one-
hot encoding and binary encoding
9. class Encoders_Difference:
10. # Constructor method to initialize the object's attribut
e
11. def __init__(self, df):
12. self.df = df
13.
14. # Method to apply one-hot encoding to the color column
15. def one_hot_encoding(self):
16. # Use the get_dummies function to create binary vectors for each color category
17. df_encoded1 = pd.get_dummies(self.df, columns=['color'], dtype=int)
18. # Display the encoded data frame
19. print("One-hot encoded dataframe")
20. print(df_encoded1)
21.
22. # Method to apply binary encoding to the color column
23. def binary_encoder(self):
24. # Create a binary encoder object with the color column as the target
25. encoder = ce.BinaryEncoder(cols=['color'])
26. # Fit and transform the data frame with the encoder object
27. df_encoded2 = encoder.fit_transform(self.df)
28. # Display the encoded data frame
29. print("Binary encoded dataframe")
30. print(df_encoded2)
31.
32. # Create a sample data frame with 3 columns: name, ge
nder and color
33. df = pd.DataFrame({
34. 'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Ane'],
35. 'gender': ['F', 'M', 'M', 'M', 'F', 'F'],
36. 'color': ['red', 'blue', 'green', 'blue', 'green', 'red']
37. })
38.
39. # Create an object of Encoders_Difference class with the
sample data frame as an argument
40. encoderDifference_obj = Encoders_Difference(df)
41. # Call the one_hot_encoding method to show the result o
f one-hot encoding
42. encoderDifference_obj.one_hot_encoding()
43. # Call the binary_encoder method to show the result of
binary encoding
44. encoderDifference_obj.binary_encoder()
Output:
1. One-hot encoded dataframe
2. name gender color_blue color_green color_red
3. 0 Alice F 0 0 1
4. 1 Bob M 1 0 0
5. 2 Charlie M 0 1 0
6. 3 David M 1 0 0
7. 4 Eve F 0 1 0
8. 5 Ane F 0 0 1
9. Binary encoded dataframe
10. name gender color_0 color_1
11. 0 Alice F 0 1
12. 1 Bob M 1 0
13. 2 Charlie M 1 1
14. 3 David M 1 0
15. 4 Eve F 1 1
16. 5 Ane F 0 1
Hash encoding: This technique applies a hash function to
each category of a categorical variable and maps it to a
numeric value within a predefined range. The hash
function is deterministic (the same input always gives the
same output), although different categories can
occasionally collide into the same value. A conceptual
sketch follows after this list.
Feature scaling: This technique transforms numerical
variables into a common scale or range, usually between
0 and 1 or -1 and 1. Different methods of feature scaling,
such as min-max scaling, standardization, and
normalization, are discussed above.
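A minimal conceptual sketch of hash encoding, using Python's built-in hash() to map categories into a fixed number of buckets (this illustrates the idea only; libraries such as category_encoders also offer a hashing encoder):
1. # Conceptual sketch of hash encoding using Python's built-in hash()
2. # Note: Python randomizes string hashes between runs unless PYTHONHASHSEED is set
3. colors = ['red', 'blue', 'green', 'yellow', 'pink', 'red', 'blue']
4. # Number of buckets (the predefined range of numeric values)
5. n_buckets = 4
6. # Map each category to a bucket index between 0 and n_buckets - 1
7. hashed = [hash(color) % n_buckets for color in colors]
8. for color, code in zip(colors, hashed):
9.     print(color, "->", code)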
The table below shows four students' scores with missing values, marked with ?:

Ram      90    85    95    ?
Deep     80    ?     75    70
John     ?     65    80    60
David    70    75    ?     65

The same scores after the missing values have been filled in are as follows:

Ram      90    85    95    73.3
Deep     80    75    75    70
John     80    65    80    60
David    70    75    78.3  65
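The tables above show missing scores being filled in. A minimal sketch of one common approach, replacing missing values with the column mean using pandas (the column names and the fill strategy here are assumptions for illustration, not necessarily the one used to produce the table above):
1. import numpy as np
2. import pandas as pd
3. # Scores with missing values represented as NaN; column names are illustrative
4. scores = pd.DataFrame({
5.     "Name": ["Ram", "Deep", "John", "David"],
6.     "Test1": [90, 80, np.nan, 70],
7.     "Test2": [85, np.nan, 65, 75],
8.     "Test3": [95, 75, 80, np.nan],
9.     "Test4": [np.nan, 70, 60, 65]})
10. # Replace each missing value with the mean of its column
11. numeric_cols = ["Test1", "Test2", "Test3", "Test4"]
12. scores[numeric_cols] = scores[numeric_cols].fillna(scores[numeric_cols].mean())
13. print(scores.round(1))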
Line plot
Line plots are ideal for displaying trends and changes in
continuous or ordered data points, especially for time series
data that depicts how a variable evolves over time. For
instance, one could use a line plot to monitor a patient's
blood pressure readings taken at regular intervals
throughout the year, to monitor their health.
Tutorial 2.34: An example to plot patient blood pressure
reading taken at different months of year using line plot, is
as follows:
1. # Import matplotlib.pyplot module
2. import matplotlib.pyplot as plt
3. # Create a list of dates for the x-axis
4.
dates = ["01/08/2023", "01/09/2023", "01/10/2023", "01
/11/2023", "01/12/2023"]
5. # Create a list of blood pressure readings for the y-axis
6. bp_readings = [120, 155, 160, 170, 175]
7. # Plot the line plot with dates and bp_readings
8. plt.plot(dates, bp_readings)
9. # Add a title for the plot
10.
plt.title("Patient's Blood Pressure Readings Throughout t
he Year")
11. # Add labels for the x-axis and y-axis
12. plt.xlabel("Date")
13. plt.ylabel("Blood Pressure (mmHg)")
14. # Show the plot
15.
plt.savefig("lineplot.jpg", dpi=600, bbox_inches='tight')
16. plt.show()
Output:
Figure 2.5: Patient's blood pressure over the month in a line graph.
Pie chart
Pie chart is useful when showing the parts of a whole and
the relative proportions of different categories. Pie charts
are best suited for categorical data with only a few different
categories. Use pie charts to display the percentages of
daily calories consumed from carbohydrates, fats, and
proteins in a diet plan.
Tutorial 2.35: An example to display the percentages of
daily calories consumed from carbohydrates, fats, and
proteins in a pie chart, is as follows:
1. # Import matplotlib.pyplot module
2. import matplotlib.pyplot as plt
3. # Create a list of percentages of daily calories consumed from carbohydrates, fats, and proteins
4. calories = [50, 30, 20]
5. # Create a list of labels for the pie chart
6. labels = ["Carbohydrates", "Fats", "Proteins"]
7. # Plot the pie chart with calories and labels
8. plt.pie(calories, labels=labels, autopct="%1.1f%%")
9. # Add a title for the pie chart
10. plt.title("Percentages of Daily Calories Consumed from Carbohydrates, Fats, and Proteins")
11. # Show the pie chart
12. plt.savefig("piechart1.jpg", dpi=600, bbox_inches='tight')
13. plt.show()
Output:
Figure 2.6: Daily calories consumed from carbohydrates, fats, and proteins in a
pie chart
Bar chart
Bar charts are suitable for comparing values of different
categories or showing the distribution of categorical data.
Mostly useful for categorical data with distinct categories
data type. For example: comparing the average daily step
counts of people in their 20s, 30s, 40s, and so on, to assess
the relationship between age and physical activity.
Tutorial 2.36: An example to plot average daily step counts
of people in their 20s, 30s, 40s, and so on using bar chart, is
as follows:
1. # Import matplotlib.pyplot module
2. import matplotlib.pyplot as plt
3. # Create a list of age groups for the x-axis
4. age_groups = ["20s", "30s", "40s", "50s", "60s"]
5. # Create a list of illustrative average daily step counts for each age group
6. step_counts = [9000, 8000, 7000, 6000, 5000]
7. # Plot the bar chart with age_groups and step_counts
8. plt.bar(age_groups, step_counts)
9. # Add a title for the bar chart
10. plt.title("Average Daily Step Counts by Age Group")
11. # Add labels for the x-axis and y-axis
12. plt.xlabel("Age Group")
13. plt.ylabel("Average Daily Steps")
14. # Show the bar chart
15. plt.savefig("barchart.jpg", dpi=600, bbox_inches='tight')
16. plt.show()
Output:
Figure 2.7: Daily step counts of people in different age category using bar
chart
Histogram
Histograms are used to visualize the distribution of
continuous data or to understand the frequency of values
within a range. Mostly used for continuous data. For
example, to show Body Mass Indexes (BMIs) in a large
sample of individuals to see how the population's BMIs are
distributed.
Tutorial 2.37: An example to plot distribution of individual
BMIs in a histogram plot, is as follows:
1. import matplotlib.pyplot as plt
2. import numpy as np
3. # Generate a large sample of BMIs using numpy.random
.normal function
4. # The mean BMI is 25 and the standard deviation is 5
5. bmis = np.random.normal(25, 5, 1000)
6. # Plot the histogram with bmis and 20 bins
7. plt.hist(bmis, bins=20)
8. # Add a title for the histogram
9. plt.title("Histogram of BMIs in a Large Sample of Individ
uals")
10. # Add labels for the x-axis and y-axis
11. plt.xlabel("BMI")
12. plt.ylabel("Frequency")
13. # Show the histogram
14. plt.savefig('histogram.jpg', dpi=600, bbox_inches='tight'
)
15. plt.show()
Output:
Figure 2.8: Distribution of Body Mass Index of individuals in histogram
Scatter plot
Scatter plots are ideal for visualizing relationships between
two continuous variables. It is mostly used for two
continuous variables that you want to analyze for
correlation or patterns. For example, plotting the number of
hours of sleep on the x-axis and the self-reported stress
levels on the y-axis to see if there is a correlation between
the two variables.
Tutorial 2.38: An example to plot number of hours of sleep
and stress levels to show their correlation in a scatter plot,
is as follows:
1. import matplotlib.pyplot as plt
2. import numpy as np
3. # Generate a sample of hours of sleep using numpy.rand
om.uniform function
4. # The hours of sleep range from 4 to 10
5. sleep = np.random.uniform(4, 10, 100)
6. # Generate a sample of stress levels using numpy.rando
m.normal function
7. # The stress levels range from 1 to 10, with a negative c
orrelation with sleep
8. stress = np.random.normal(10 - sleep, 1)
9. # Plot the scatter plot with sleep and stress
10. plt.scatter(sleep, stress)
11. # Add a title for the scatter plot
12. plt.title("Scatter Plot of Hours of Sleep and Stress Levels
")
13. # Add labels for the x-axis and y-axis
14. plt.xlabel("Hours of Sleep")
15. plt.ylabel("Stress Level")
16. # Show the scatter plot
17. plt.savefig("scatterplot.jpg", dpi=600, bbox_inches='tigh
t')
18. plt.show()
Output:
Figure 2.9: Number of hours of sleep and stress levels in a scatter plot
Dendrograms
Dendrogram illustrates the hierarchy of clustered data
points based on their similarity or distance. It allows for
exploration of data patterns and structure, as well as
identification of clusters or groups of data points that are
similar.
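A minimal sketch of a dendrogram, using SciPy's hierarchical clustering on a few illustrative two-dimensional points (the data values are assumptions for this example):
1. import numpy as np
2. import matplotlib.pyplot as plt
3. from scipy.cluster.hierarchy import dendrogram, linkage
4. # A few illustrative two-dimensional data points
5. points = np.array([[1, 2], [2, 1], [8, 8], [9, 9], [5, 5]])
6. # Compute the hierarchical clustering linkage matrix
7. linked = linkage(points, method='ward')
8. # Plot the dendrogram showing how points merge into clusters
9. dendrogram(linked, labels=[f"P{i}" for i in range(len(points))])
10. plt.title("Dendrogram of Sample Points")
11. plt.ylabel("Distance")
12. plt.show()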
Violin plot
Violin plot shows how numerical data is distributed across
different categories, allowing for comparisons of shape,
spread, and outliers. This can reveal similarities or
differences between categories.
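A minimal sketch of a violin plot with Matplotlib, comparing two illustrative groups of values (the data here is randomly generated for demonstration):
1. import numpy as np
2. import matplotlib.pyplot as plt
3. # Randomly generated values for two illustrative groups
4. np.random.seed(42)
5. group_a = np.random.normal(60, 5, 200)
6. group_b = np.random.normal(70, 10, 200)
7. # Draw one violin per group to compare shape and spread
8. plt.violinplot([group_a, group_b], showmedians=True)
9. plt.xticks([1, 2], ["Group A", "Group B"])
10. plt.ylabel("Value")
11. plt.title("Violin Plot of Two Groups")
12. plt.show()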
Word cloud
Word cloud is a type of visualization that shows the
frequency of words in a text or a collection of texts. It is
useful when you want to explore the main themes or topics
of the text, or to see which words are most prominent or
relevant.
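A minimal sketch of a word cloud, assuming the third-party wordcloud package is installed (pip install wordcloud); the text used here is illustrative:
1. import matplotlib.pyplot as plt
2. from wordcloud import WordCloud
3. # Illustrative text whose word frequencies will be visualized
4. text = ("data statistics python analysis data visualization "
5.         "statistics data python data insight")
6. # Generate the word cloud image from the text
7. wc = WordCloud(width=400, height=300, background_color="white").generate(text)
8. # Display the image; more frequent words appear larger
9. plt.imshow(wc, interpolation="bilinear")
10. plt.axis("off")
11. plt.show()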
Graph
A graph visually displays the relationship between two or
more variables using points, lines, bars, or other shapes. It
offers valuable insights into data patterns, trends, and
correlations, and allows for the comparison of values or
categories. Choosing the right kind of graph is therefore an
important part of any data analysis.
Conclusion
Exploratory data analysis involves several critical steps to
prepare and analyze data effectively. Data is first
aggregated, normalized, standardized, transformed, binned,
and grouped. Missing data and outliers are detected and
treated appropriately before visualization and plotting. Data
encoding is also used to handle categorical variables. These
preprocessing steps are essential for EDA because they
improve the quality and reliability of the data and help
uncover useful insights and patterns. EDA includes many
steps beyond these, and the exact steps depend on the data,
the problem statement, and the objective. To summarize the
main steps: data aggregation combines data from different
sources or groups to form a summary or a new data set,
reducing the complexity and size of the data and revealing
patterns or trends across different categories or dimensions.
Data normalization scales the numerical values of the data
to a common range, such as 0 to 1 or -1 to 1, reducing the
effect of different units or scales and making the data
comparable and consistent. Data standardization helps
remove the effect of outliers or extreme values and gives
the data a mean of zero and a standard deviation of one.
Data transformation changes the shape or distribution of the
data to make it more suitable for certain analyses or models.
Data binning divides the numerical values of the data into
discrete intervals or bins, such as low, medium, and high,
which reduces noise or variability and creates categorical
variables from numerical ones. Data grouping organizes the
data based on certain criteria or attributes, such as age,
gender, or location, segmenting it into meaningful categories
or clusters so that differences or similarities between groups
can be analyzed. Data encoding techniques, such as one-hot
encoding, label encoding, and binary encoding, convert
categorical variables into numerical variables, making the
data compatible with analyses or models that require
numerical inputs. Data cleaning detects and treats missing
data and outliers. Finally, data visualization helps us
understand the data, display summaries, and view the
relationships among variables through charts, graphs, and
other graphical representations. These steps cover the
essentials to consider as you begin working with data in
data science and statistics; everything starts here.
In Chapter 3: Frequency Distribution, Central Tendency,
Variability, we will start with descriptive statistics, which
will delve into ways to describe and understand the pre-
processed data based on frequency distribution, central
tendency, variability.
CHAPTER 3
Frequency Distribution, Central Tendency, Variability
Introduction
Descriptive statistics is a way of describing and summarizing
data and its characteristics in a meaningful way. It covers
measures of frequency distribution, measures of central
tendency (mean, median, and mode), measures of
variability, measures of association, and measures of shape.
Descriptive statistics simply show what the data shows. A
frequency distribution is primarily used to show the
distribution of categorical or numerical observations by
counting them in different categories or ranges. Central
tendency is summarized by the mode, which is the most
frequent value in the data set, the median, which is the
middle value in an ordered set, and the mean, which is the
average value. Measures of variability estimate how much
the values of a variable are spread out; they allow us to
understand how far the data deviate from the typical or
average value. Range, variance, and standard deviation are
commonly used measures of variability. Measures of
association estimate the relationship between two or more
variables, through scatterplots, correlation, and regression.
Measures of shape describe the pattern of the distribution,
for example its skewness or symmetry, its modality
(unimodal, bimodal, or uniform), and its kurtosis.
Structure
In this chapter, we will discuss the following topics:
Measures of frequency
Measures of central tendency
Measures of variability or dispersion
Measures of association
Measures of shape
Objectives
By the end of this chapter, readers will learn about
descriptive statistics and how to use them to gain
meaningful insights. You will gain the skills necessary to
calculate measures of frequency distribution, central
tendency, variability, association, shape, and how to apply
them using Python.
Measure of frequency
A measure of frequency counts the number of times a
specific value or category appears within a dataset. For
example, to find out how many children in a class like each
animal, you can apply the measure of frequency on a data
set of their favorite animals. Table 3.1 displays how many
times each animal was chosen by the 10 children. Out of the
10 children, 4 like dogs, 3 like cats, 2 like cows, and 1 likes
rabbits.
Animal    Frequency
Dog       4
Cat       3
Cow       2
Rabbit    1

The relative frequency of each animal is its frequency divided by the total number of children, and the cumulative relative frequency accumulates these proportions up to 1:

Animal    Frequency    Relative frequency    Cumulative relative frequency
Dog       4            0.4                   0.4
Cat       3            0.3                   0.7
Cow       2            0.2                   0.9
Rabbit    1            0.1                   1
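A minimal sketch of computing such a frequency table with pandas value_counts(), using the animal choices above:
1. import pandas as pd
2. # The animal chosen by each of the 10 children
3. animals = pd.Series(["Dog", "Dog", "Dog", "Dog", "Cat", "Cat", "Cat", "Cow", "Cow", "Rabbit"])
4. # Absolute frequency of each animal
5. frequency = animals.value_counts()
6. # Relative frequency (proportion of children choosing each animal)
7. relative = animals.value_counts(normalize=True)
8. print(frequency)
9. print(relative)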
USA 57,000
Norway 54,000
Nepal 50,000
India 50,000
China 50,000
Canada 53,000
Sweden 53,000
Measure of association
Measure of association is used to describe how multiple
variables are related to each other. The measure of
association varies and depends on the nature and level of
measurement of variables. We can measure the relationship
between variables by evaluating their strength and direction
of association while also determining their independence or
dependence through hypothesis testing. Before we go any
further, let us understand what hypothesis testing is.
Hypothesis testing is used in statistics to investigate ideas
about the world. It is often used by scientists to test certain
predictions (called hypotheses) that arise from theories.
There are two types of hypotheses: null hypotheses and
alternative hypotheses. Let us understand them with an
example where a researcher wants to see if there is a
relationship between gender and height. The hypotheses are
then as follows:
Null hypothesis (H₀): States the prediction that there
is no relationship between the variables of interest. So,
for the example above, the null hypothesis will be that
men are not, on average, taller than women.
Alternative hypothesis (Hₐ or H₁): Predicts a
particular relationship between the variables. So, for the
example above, the alternative hypothesis to null
hypothesis will be that men are, on average, taller than
women.
Returning to measures of association, they can help identify
potential causal factors, confounding variables, or
moderation effects that impact the outcome in question.
Covariance, correlation, chi-squared, Cramer's V, and
contingency coefficients, discussed below, are used in
statistical analyses to understand the relationships between
variables.
To demonstrate the importance of a measure of association,
let us take a simple example. Suppose we wish to
investigate the correlation between smoking habits and lung
cancer. We collect data from a sample of individuals,
recording whether or not they smoke and whether or not
they have lung cancer. Then, we can employ a measure of
association, like the chi-square test (described further
below), to ascertain if there is a link between smoking and
lung cancer. The chi-square test assesses the extent to
which smoking, and lung cancer frequencies observed differ
from expected frequencies, assuming their independence. A
high chi-square value demonstrates a notable correlation
between the variables, while a low chi-square value
suggests that they are independent.
For example, suppose we have the following data, and we
want to see the effect of smoking on lung cancer:

Smoking    Lung Cancer    No Lung Cancer    Total
Yes        80             20                100
No         20             80                100

Smoking    Lung Cancer    No Lung Cancer    Total
Yes        18             18                36
No         18             18                36
Total      36             36                72
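A minimal sketch of running the chi-square test on the observed smoking table above with SciPy (scipy.stats.chi2_contingency is assumed to be available; Tutorial 3.10 below walks through the book's own music and mood example):
1. from scipy.stats import chi2_contingency
2. # Observed frequencies: rows are smoking (yes, no), columns are lung cancer (yes, no)
3. observed = [[80, 20], [20, 80]]
4. # The test returns the chi-square statistic, the p-value, the degrees of freedom,
5. # and the expected frequencies under independence
6. chi2, p_value, dof, expected = chi2_contingency(observed)
7. print("Chi-square:", chi2)
8. print("p-value:", p_value)
9. print("Expected frequencies:", expected)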
Consider the math and English scores of five students (Table 3.9):

Student    Math Score    English Score
A          80            90
B          70            80
C          60            70
D          50            60
E          40            50
The sample covariance between the two sets of scores is
given by:
Cov(x, y) = ∑(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
Where xᵢ and yᵢ are the individual scores for math and
English, x̄ and ȳ are the mean scores for math and
English, and n is the number of students.
Using the data from Table 3.9, the mean x̄ is 60 and the
mean ȳ is 70. The sum of the products of paired
deviations, ∑(xᵢ − x̄)(yᵢ − ȳ), is 1000. Dividing by n − 1 = 4,
the covariance between the math and English scores is 250.
This means there is a positive linear relationship between a
student's math and English scores: as one variable
increases, the other also tends to increase.
Tutorial 3.8: An example to compute the covariance in
data, is as follows:
import pandas as pd
# Define the dataframe as a dictionary
df = {"Student": ["A", "B", "C", "D", "E"],
      "Math Score": [80, 70, 60, 50, 40],
      "English Score": [90, 80, 70, 60, 50]}
# Convert the dictionary to a pandas dataframe
df = pd.DataFrame(df)
# Calculate the covariance between math and english scores using the cov method
covariance = df["Math Score"].cov(df["English Score"])
# Print the result
print(f"The covariance between math and english score is {covariance}")
Output:
The covariance between math and english score is 250.0
Covariance and correlation are similar, but not the same.
They both measure the relationship between two variables,
but they differ in how they scale and interpret the results.
Following are some key differences between covariance and
correlation:
Covariance can take any value from negative infinity to
positive infinity, while correlation ranges from -1 to 1.
This means that correlation is a normalized and
standardized measure of covariance, which makes it
easier to compare and interpret the strength of the
relationship.
Covariance has units, which depend on the units of the
two variables. Correlation is dimensionless, which
means it has no units. This makes correlation
independent of the scale and units of the variables, while
covariance is sensitive to them.
Covariance only indicates the direction of the linear relationship between two variables (positive, negative, or zero). Correlation indicates both the direction and the degree to which the two variables are linearly related. A correlation of -1 or 1 means a
perfect linear relationship, while a correlation of 0
means no linear relationship.
Tutorial 3.9: An example to compute the correlation in the
Math and English score data, is as follows:
import pandas as pd
# Create a dictionary with the data
data = {"Student": ["A", "B", "C", "D", "E"],
        "Math Score": [80, 70, 60, 50, 40],
        "English Score": [90, 80, 70, 60, 50]}
df = pd.DataFrame(data)
# Compute the correlation between the two columns
correlation = df["Math Score"].corr(df["English Score"])
print("Correlation between math and english score:", correlation)
Output:
Correlation between math and english score: 1.0
Chi-square
Chi-square tests whether there is a significant association between two categorical variables. For example, to determine if there is a connection between the music individuals listen to and
their emotional state, chi-squared association tests can be
used to compare observed frequencies of different moods
with different types of music to expected frequencies if
there is no relationship between music and mood. The test
computes the chi-squared value by summing, over every cell of the table, the squared difference between the observed and expected frequency divided by the expected frequency: χ² = Σ (O − E)² / E. A higher chi-squared value suggests a stronger likelihood of a significant association between the variables.
The next step confirms the significance of the chi-squared
value by comparing it to a critical value from a table that
considers the degree of freedom and level of significance. If
the chi-squared value is higher than the critical value, we
will discard the assumption of no relationship.
Tutorial 3.10: An example to show the use of chi-square
test to find association between different types of music and
mood of a person, is as follows:
import pandas as pd
# Import chi-squared test function from scipy.stats module
from scipy.stats import chi2_contingency
# Create a sample data frame with music and mood categories
data = pd.DataFrame({"Music": ["Rock", "Pop", "Jazz", "Classical", "Rap"],
                     "Happy": [25, 30, 15, 10, 20],
                     "Sad": [15, 10, 20, 25, 30],
                     "Angry": [10, 15, 25, 30, 15],
                     "Calm": [20, 15, 10, 5, 10]})
# Print the original data frame
print(data)
# Perform chi-square test of association
chi2, p, dof, expected = chi2_contingency(data.iloc[:, 1:])
# Print the chi-square test statistic, p-value, and degrees of freedom
print("Chi-square test statistic:", chi2)
print("P-value:", p)
print("Degrees of freedom:", dof)
# Print the expected frequencies
print("Expected frequencies:")
print(expected)
Output:
       Music  Happy  Sad  Angry  Calm
0       Rock     25   15     10    20
1        Pop     30   10     15    15
2       Jazz     15   20     25    10
3  Classical     10   25     30     5
4        Rap     20   30     15    10
Chi-square test statistic: 50.070718462823734
P-value: 1.3577089704505725e-06
Degrees of freedom: 12
Expected frequencies:
[[19.71830986 19.71830986 18.73239437 11.83098592]
 [19.71830986 19.71830986 18.73239437 11.83098592]
 [19.71830986 19.71830986 18.73239437 11.83098592]
 [19.71830986 19.71830986 18.73239437 11.83098592]
 [21.12676056 21.12676056 20.07042254 12.67605634]]
The chi-square test results indicate a significant connection
between the type of music and the mood of listeners. This
suggests that the observed frequencies of different music-
mood combinations are not random occurrences but rather
signify an underlying relationship between the two
variables. A higher chi-square value signifies a greater
disparity between observed and expected frequencies. In
this instance, the chi-square value is 50.07, a notably large
figure. Given that the p-value is less than 0.05, we can
reject the null hypothesis and conclude that there is indeed
a significant association between music and mood. The
degrees of freedom, indicating the number of independent
categories in the data, is calculated as (number of rows - 1)
x (number of columns - 1), resulting in 12 degrees of
freedom in this case. Expected frequencies represent what
would be anticipated under the null hypothesis of no
association, calculated by multiplying row and column totals
and dividing by the grand total. Comparing observed and
expected frequencies reveals the expected distribution if
music and mood were independent. Notably, rap and
sadness are more frequent than expected (30 vs 21.13),
suggesting that rap music is more likely to induce sadness.
Conversely, classical and calm are less frequent than
expected (5 vs 11.83), indicating that classical music is less
likely to induce calmness.
Cramer’s V
Cramer's V is a measure of the strength of the association
between two categorical variables. It ranges from 0 to 1,
where 0 indicates no association and 1 indicates perfect
association. Cramer's V and chi-square are related but different concepts. Cramer's V is an effect size that
describes how strongly two variables are related, while chi-
square is a test statistic that evaluates whether the
observed frequencies are different from the expected
frequencies. Cramer's V is based on chi-square, but also
takes into account the sample size and the number of
categories. Cramer's V is useful for comparing the strength
of association between different tables with different
numbers of categories. Chi-square can be used to test
whether there is a significant association between two
nominal variables, but it does not tell us how strong or weak
that association is. Cramer's V can be calculated from the
chi-squared value and the degrees of freedom of the
contingency table.
Cramer’s V = √(χ² / (n × min(c − 1, r − 1)))
Where:
χ²: The chi-square statistic
n: Total sample size
r: Number of rows
c: Number of columns
For example, Cramer's V can be used to compare the association between gender and eye color in two different populations. Suppose we have the following data:
Population Gender Eye color Frequency
A Male Blue 10
A Male Brown 20
A Female Blue 15
A Female Brown 25
B Male Blue 5
B Male Brown 25
B Female Blue 25
B Female Brown 5
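As a minimal sketch (not one of the book's numbered tutorials), Cramer's V for the two populations above can be computed with scipy.stats.chi2_contingency and the formula given earlier; note that scipy applies Yates' continuity correction to 2 x 2 tables by default, which slightly lowers the value:
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    # Cramer's V from a table of observed frequencies
    chi2, _, _, _ = chi2_contingency(table)
    n = table.sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * min(r - 1, c - 1)))

# Gender x eye color tables built from the frequencies above
# rows: Male, Female; columns: Blue, Brown
population_A = np.array([[10, 20],
                         [15, 25]])
population_B = np.array([[5, 25],
                         [25, 5]])

print("Cramer's V for population A:", cramers_v(population_A))
print("Cramer's V for population B:", cramers_v(population_B))
Population B, where eye color differs sharply between the genders, yields a much larger Cramer's V than population A, where the distribution of eye color is similar for both genders.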
Contingency coefficient
The contingency coefficient is a measure of association in
statistics that indicates whether two variables or data sets
are independent or dependent on each other. It is also known as Pearson's contingency coefficient.
The contingency coefficient is based on the chi-square
statistic and is defined by the following formula:
C = √(χ² / (χ² + N))
Where:
χ2 is the chi-square statistic
N is the total number of cases or observations in our
analysis/study.
C is the contingency coefficient
The contingency coefficient can range from 0 (no
association) to 1 (perfect association). If C is close to zero
(or equal to zero), you can conclude that your variables are
independent of each other; there is no association between
them. If C is away from zero, there is some association.
Contingency coefficient is important because it can help us
summarize the relationship between two categorical
variables in a single number. It can also help us compare
the degree of association between different tables or
groups.
Tutorial 3.12: An example to measure the association
between two categorical variables gender and product using
contingency coefficient, is as follows:
import pandas as pd
from scipy.stats import chi2_contingency
# Create a simple dataframe
data = {'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female'],
        'Product': ['Product A', 'Product B', 'Product A', 'Product A', 'Product B', 'Product B']}
df = pd.DataFrame(data)
# Create a contingency table
contingency_table = pd.crosstab(df['Gender'], df['Product'])
# Perform Chi-Square test
chi2, p, dof, expected = chi2_contingency(contingency_table)
# Calculate the contingency coefficient
contingency_coefficient = (chi2 / (chi2 + df.shape[0])) ** 0.5
print('Contingency Coefficient is:', contingency_coefficient)
Output:
Contingency Coefficient is: 0.0
In this case, the contingency coefficient is 0 which shows
there is no association at all between gender and product.
Tutorial 3.13: Similarly, using the gender and eye color data shown above, if we want to know whether gender and eye color are related in two
different populations, we can calculate the contingency
coefficient for each population and see which one has a
higher value. A higher value indicates a stronger association
between the variables.
Code:
import pandas as pd
from scipy.stats import chi2_contingency
import numpy as np
df = pd.DataFrame({"Population": ["A", "A", "A", "A", "B", "B", "B", "B"],
                   "Gender": ["Male", "Male", "Female", "Female", "Male", "Male", "Female", "Female"],
                   "Eye Color": ["Blue", "Brown", "Blue", "Brown", "Blue", "Brown", "Blue", "Brown"],
                   "Frequency": [10, 20, 15, 25, 5, 25, 25, 5]})
# Create a pivot table
pivot_table = pd.pivot_table(df, values='Frequency', index=['Population', 'Gender'],
                             columns=['Eye Color'], aggfunc=np.sum)
# Calculate chi-square statistic
chi2, _, _, _ = chi2_contingency(pivot_table)
# Calculate the total number of observations
N = df['Frequency'].sum()
# Calculate the Contingency Coefficient
C = np.sqrt(chi2 / (chi2 + N))
print(f"Contingency Coefficient: {C}")
Output:
Contingency Coefficient: 0.43
This gives a contingency coefficient of 0.4338, which indicates a moderate association between the variables in the above data (population, gender, and eye color). This
means that knowing the category of one variable gives some
information about the category of the other variables.
However, the association is not very strong because the
coefficient is closer to 0 than to 1. Furthermore, the
contingency coefficient has some limitations, such as being
affected by the size of the table and not reaching 1 for
perfect association. Therefore, some alternative measures of
association, such as Cramer’s V or the phi coefficient, may
be preferred in some situations.
Measures of shape
Measures of shape are used to describe the general shape
of a distribution, including its symmetry, skewness, and
kurtosis. These measures help to give a sense of how the
data is spread out, and can be useful for identifying potential outliers or unusual data points. For example, imagine you are a teacher, and you want to evaluate your students' performance on a recent math test. Here, skewness tells you whether the scores are more spread out on one side of the mean than on the other, and kurtosis tells you how peaked or flat the distribution of scores is.
Skewness
Skewness measures the degree of asymmetry in a
distribution. A distribution is symmetrical if the two halves
on either side of the mean are mirror images of each other.
Positive skewness indicates that the right tail of the
distribution is longer or thicker than the left tail, while
negative skewness indicates the opposite.
Tutorial 3.14: Let us consider a class of 10 students who recently took a math test. Their scores (out of 100) are as follows. Based on these scores we can see the skewness of the students' scores, that is, whether they are positively skewed (toward high scores) or negatively skewed (toward low scores).
Refer to the following table:
Student ID    1   2   3   4   5   6   7   8   9   10
Score        85  90  92  95  96  96  97  98  99  100
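A minimal sketch of how the skewness of these scores could be computed with scipy.stats.skew (an illustration, not the book's own tutorial code) is as follows:
import scipy.stats as stats
# Math test scores of the 10 students from the table above
scores = [85, 90, 92, 95, 96, 96, 97, 98, 99, 100]
# skew() is negative when the longer tail is on the low-score side
# and positive when it is on the high-score side
skewness = stats.skew(scores)
print(f"Skewness of the test scores: {skewness}")
Because most scores cluster near 100 while a few lower scores (such as 85) stretch the left tail, the computed skewness is negative, meaning these scores are negatively skewed.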
Kurtosis
Kurtosis measures the tailedness of a distribution (that is, the concentration of values in the tails). It indicates whether the tails of a given distribution contain extreme values. If you
think of a data distribution as a mountain, the kurtosis
would tell you about the shape of the peak and the tails. A
high kurtosis means that the data has heavy tails or outliers.
In other words, the data has a high peak (more data in the
middle) and fat tails (more extreme values). This is called a
leptokurtic distribution. Low kurtosis in a data set is an
indicator that the data has light tails or lacks outliers. The
data points are moderately spread out (less in the middle
and less extreme values), which means it has a flat peak.
This is called a platykurtic distribution. A normal distribution has an excess kurtosis of zero (the convention used by scipy.stats.kurtosis) and is called mesokurtic. Understanding the kurtosis of
a data set helps to identify volatility, risk, or outlier
detection in various fields such as finance, quality control,
and other statistical modeling where data distribution plays
a key role.
Tutorial 3.15: An example to understand how viewing the kurtosis of a dataset helps in identifying the presence of outliers.
Let us look at three different data sets, as follows:
Dataset A: [1, 1, 2, 2, 3, 3, 4, 4, 4, 30] - This dataset has one extreme value (30).
Dataset B: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] - This dataset has no extreme values and is evenly distributed.
Dataset C: [1, 2, 3, 3, 3, 3, 3, 3, 4, 5] - This dataset has most of its values concentrated around the middle (3).
Let us calculate the kurtosis for these data sets.
Code:
import scipy.stats as stats
# Datasets
dataset_A = [1, 1, 2, 2, 3, 3, 4, 4, 4, 30]
dataset_B = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
dataset_C = [1, 2, 3, 3, 3, 3, 3, 3, 4, 5]
# Calculate kurtosis
kurtosis_A = stats.kurtosis(dataset_A)
kurtosis_B = stats.kurtosis(dataset_B)
kurtosis_C = stats.kurtosis(dataset_C)
print(f"Kurtosis of Dataset A: {kurtosis_A}")
print(f"Kurtosis of Dataset B: {kurtosis_B}")
print(f"Kurtosis of Dataset C: {kurtosis_C}")
Output:
Kurtosis of Dataset A: 4.841818043320611
Kurtosis of Dataset B: -1.2242424242424244
Kurtosis of Dataset C: 0.3999999999999999
Here we see that data set A: [1, 1, 2, 2, 3, 3, 4, 4, 4, 30] has a kurtosis of 4.84. This is a high positive value, indicating that the data set has heavy tails and a sharp peak. This means that there are extreme values in the data set, as indicated by the value 30. This is an example of a leptokurtic distribution. Data set B: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] has a kurtosis of -1.22. This is a negative value, indicating that the data set has light tails and a flat peak. This means that there are fewer extreme values in the data set and the values are evenly distributed. This is an example of a platykurtic distribution. Data set C: [1, 2, 3, 3, 3, 3, 3, 3, 4, 5] has a kurtosis of 0.4, which is close to zero. This indicates that the data set has a distribution shape similar to a normal distribution (mesokurtic). The values are somewhat evenly distributed around the mean, with a balance between extreme values and values close to the mean.
Conclusion
Descriptive statistics is a branch of statistics that organizes,
summarizes, and presents data in a meaningful way. It uses
different types of measures to describe various aspects of
the data. For example, measures of frequency, such as
relative and cumulative frequency, frequency tables and
distribution, help to understand how many times each value
of a variable occurs and what proportion it represents in the
data. Measures of central tendency, such as mean, median,
and mode, help to find the average or typical value of the
data. Measures of variability or dispersion, such as range,
variance, standard deviation, and interquartile range, help
to measure how much the data varies or deviates from the
center. Measures of association, such as correlation and
covariance, help to examine how two or more variables are
related to each other. Finally, measures of shape, such as
skewness and kurtosis, help to describe the symmetry and
the heaviness of the tails of a probability distribution. These
methods are vital in descriptive statistics because they give
a detailed summary of the data. This helps us understand
how the data behaves, find patterns, and make
knowledgeable choices. They are fundamental for additional
statistical analysis and hypothesis testing.
In Chapter 4: Unravelling Statistical Relationships we will
see more about the statistical relationship and understand
the meaning and implementation of covariance, correlation
and probability distribution.
Introduction
Understanding the connection between different variables is
part of unravelling statistical relationships. Covariance and
correlation, outliers, and probability distributions are critical to unravelling statistical relationships and to making accurate interpretations based on data. Covariance and
correlation essentially measure the same concept, the
change in two variables with respect to each other. They aid
in comprehending the relationship between two variables in
a dataset and describe the extent to which two random
variables or random variable sets are prone to deviate from
their expected values in the same manner. Covariance
illustrates the degree to which two random variables vary
together. Correlation is a mathematical method for determining the degree of statistical dependence between two variables, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation). Statistical
relationships are based on data and most data contains
outliers. Outliers are observations that differ significantly from other data points; they can arise from natural variability in the data or from experimental errors. Such outliers can significantly skew
data analysis and statistical modeling, potentially leading to
erroneous conclusions. Therefore, it is essential to identify
and manage outliers to ensure accurate results. To understand and predict data patterns, we need to measure likelihood and how that likelihood is distributed. For this, statisticians use probability and probability distributions. Probability measures the likelihood of a
specific event occurring and is denoted by a value between
0 and 1, where 0 implies impossibility and 1 signifies
certainty.
A probability distribution which is a mathematical function
describes how probabilities are spread out over the values
of a random variable. For instance, in a fair roll of a six-
sided dice, the probability distribution would indicate that
each outcome (1, 2, 3, 4, 5, 6) has a probability of 1/6.
While probability measures the likelihood of a single event,
a probability distribution considers all potential events and
their respective probabilities. It offers a comprehensive
view of the randomness or variability of a particular data
set. Sometimes many data points, or large datasets, need to be represented and manipulated together. In such cases, organizing the data points as arrays and matrices allows us to explore statistical relationships, distinguish true correlations from spurious ones, and visualize complex dependencies in data.
All of these concepts in the structure below are basic, but
very important steps in unraveling and understanding the
statistical relationship.
Structure
In this chapter, we will discuss the following topics:
Covariance and correlation
Outliers and anomalies
Probability
Array and matrices
Objectives
By the end of this chapter, readers will see what covariance,
correlation, outliers, anomalies are, how they affect data
analysis, statistical modeling, and learning, how they can
lead to misleading conclusions, and how to detect and deal
with them. We will also look at probability concepts and the
use of probability distributions to understand data, its
distribution, and its properties, how they can help in making
predictions, decisions, and estimating uncertainty.
Covariance
Covariance in statistics measures how much two variables
change together. In other words, it is a statistical tool that
shows us how much two numbers vary together. A positive
covariance indicates that the two variables tend to increase
or decrease together. Conversely, a negative covariance
indicates that as one variable increases, the other tends to
decrease and vice versa. Covariance and correlation are
important in measuring association, as discussed in Chapter
3, Frequency Distribution, Central Tendency, Variability.
While correlation is limited to -1 to +1, covariance can be
practically any number. Now, let us consider a simple
example.
Suppose you are a teacher with a class of students, and you observe that when the temperature is high in the summer, the students' test scores generally decrease, while in the winter, when it is low, the scores tend to rise. This is a negative
covariance because as one variable, temperature, goes up,
the other variable, test scores, goes down. Similarly, if
students who study more hours tend to have higher test
scores, this is a positive covariance. As study hours
increase, test scores also increase. Covariance helps
identify the relationship between different variables.
Tutorial 4.1: An example to calculate the covariance between temperature and test scores, and between study hours and test scores, is as follows:
import numpy as np
# Let's assume these are the temperatures in Celsius
temperatures = np.array([30, 32, 28, 31, 33, 29, 34, 35, 36, 37])
# And these are the corresponding test scores
test_scores = np.array([70, 68, 72, 71, 67, 73, 66, 65, 64, 63])
# And these are the corresponding study hours
study_hours = np.array([5, 6, 7, 6, 5, 7, 4, 3, 2, 1])
# Calculate the covariance between temperature and test scores
cov_temp_scores = np.cov(temperatures, test_scores)[0, 1]
print(f"Covariance between temperature and test scores: {cov_temp_scores}")
# Calculate the covariance between study hours and test scores
cov_study_scores = np.cov(study_hours, test_scores)[0, 1]
print(f"Covariance between study hours and test scores: {cov_study_scores}")
Output:
Covariance between temperature and test scores: -10.277777777777777
Covariance between study hours and test scores: 6.733333333333334
As output shows, covariance between temperature and test
score is negative (indicating that as temperature increases,
test scores decrease), and the covariance between study
hours and test scores is positive (indicating that as study
hours increase, test scores also increase).
Tutorial 4.2: Following is an example to calculate the covariance in a data frame; here we only compute the covariance of three selected columns from the diabetes dataset:
# Import the pandas library and the display function
import pandas as pd
from IPython.display import display
# Load the diabetes dataset csv file
diabities_df = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter1/diabetes.csv")
diabities_df[['Glucose', 'Insulin', 'Outcome']].cov()
Output:
            Glucose       Insulin   Outcome
Glucose  1022.248314   1220.935799  7.115079
Insulin  1220.935799  13281.180078  7.175671
Outcome     7.115079      7.175671  0.227483
The diagonal elements (1022.24 for glucose, 13281.18 for
insulin, and 0.22 for outcome) represent the variance of
each variable. Looking at glucose, its variance is 1022.24, which means that glucose levels vary quite a bit, and insulin varies even more. The covariance between glucose and insulin is a positive number, which means that high glucose levels tend to be associated with high insulin levels and vice versa. The covariance between glucose and outcome is 7.11 and between insulin and outcome is 7.17; since these are positive numbers, high glucose and insulin levels tend to be associated with a positive outcome and vice versa.
While covariance is a powerful tool for understanding
relationships in numerical data, other techniques are
typically more appropriate for text and image data. For
example, term frequency-inverse document frequency
(TF-IDF), cosine similarity, or word embeddings (such as
Word2Vec) are often used to understand relationships and
variations in text data. For image data, convolutional
neural networks (CNNs), image histograms, or feature
extraction methods are used.
Correlation
Correlation in statistics measures the magnitude and
direction of the connection between two or more variables.
It is important to note that correlation does not imply
causality between the variables. The correlation coefficient
assigns a value to the relationship on a -1 to 1 scale. A
positive correlation, closer to 1, indicates that as one
variable increases, so does the other. Conversely, a
negative correlation, closer to -1 means that as one
variable increases, the other decreases. A correlation of zero suggests no linear association between the two variables. More
about correlation is also discussed in Chapter 1,
Introduction to Statistics and Data, and Chapter 3,
Frequency Distribution, Central Tendency, Variability.
Remember that while covariance and correlation are related, correlation provides a more interpretable measure of association, especially when comparing variables with different units of measurement.
Let us understand correlation with an example, consider
relationship between study duration and exam grade. If
students who spend more time studying tend to achieve
higher grades, we can conclude that there is a positive
correlation between study time and exam grades, as an
increase in study time corresponds to an increase in exam
grades. On the other hand, an analysis of the correlation
between the amount of time devoted to watching television
and test scores reveals a negative correlation. Specifically,
as the duration of television viewing (one variable)
increases, the score on the exam (the other variable) drops.
Bear in mind that correlation does not necessarily suggest
causation. Mere correlation between two variables does not
reveal a cause-and-effect relationship.
Tutorial 4.3: An example to calculate the correlation between study time and test scores, and between TV watching time and test scores, is as follows:
import numpy as np
# Let's assume these are the study hours
study_hours = np.array([5, 6, 7, 6, 5, 7, 4, 3, 2, 1])
# And these are the corresponding test scores
test_scores = np.array([70, 72, 75, 72, 70, 75, 68, 66, 64, 62])
# And these are the corresponding TV watching hours
tv_hours = np.array([1, 2, 1, 2, 3, 1, 4, 5, 6, 7])
# Calculate the correlation between study hours and test scores
corr_study_scores = np.corrcoef(study_hours, test_scores)[0, 1]
print(f"Correlation between study hours and test scores: {corr_study_scores}")
# Calculate the correlation between TV watching hours and test scores
corr_tv_scores = np.corrcoef(tv_hours, test_scores)[0, 1]
print(f"Correlation between TV watching hours and test scores: {corr_tv_scores}")
Output:
Correlation between study hours and test scores: 0.9971289059323629
Correlation between TV watching hours and test scores: -0.9495412844036697
The output shows that an increase in study hours corresponds to a higher test score, indicating a positive correlation. A negative correlation exists between the number of hours spent
watching television and test scores. This suggests that an
increase in TV viewing time is linked to a decline in test
scores.
Probability distribution
Probability distribution is a mathematical function that
provides the probabilities of occurrence of different possible
outcomes in an experiment. Let us consider flipping a fair
coin. The experiment has two possible outcomes, Heads
(H) and Tails (T). Since the coin is fair, the likelihood of
both outcomes is equal.
This experiment can be represented using a probability
distribution, as follows:
Probability of getting heads P(H) = 0.5
Probability of getting tails P(T) = 0.5
In probability theory, the sum of all probabilities within a
distribution must always equal 1, representing every
possible outcome of an experiment. For instance, in our coin
flip example, P(H) + P(T) = 0.5 + 0.5 = 1. This is a
fundamental rule in probability theory.
Probability distributions can be discrete and continuous as
follows:
Discrete probability distributions are used for
scenarios with finite or countable outcomes. For
example, you have a bag of 10 marbles, 5 of which are
red and 5 of which are blue. If you randomly draw a
marble from the bag, the possible outcomes are a red
marble or a blue marble. Since there are only two
possible outcomes, this is a discrete probability
distribution. The probability of getting a red marble is
1/2, and the probability of getting a blue marble is 1/2.
Tutorial 4.8: An example to illustrate a discrete probability distribution, based on the example of 10 marbles, 5 of which are red and 5 of which are blue, is as follows:
import random
# Define the sample space
sample_space = ['red', 'red', 'red', 'red', 'red', 'blue', 'blue', 'blue', 'blue', 'blue']
# Conduct the experiment (draw a marble from the bag)
outcome = random.choice(sample_space)
# Check if the outcome is red or blue
if outcome == 'red':
    print(f"Outcome is a: {outcome}")
elif outcome == 'blue':
    print(f"Outcome is a: {outcome}")
# Calculate the probability of the events
probability_red = sample_space.count('red') / len(sample_space)
probability_blue = sample_space.count('blue') / len(sample_space)
print(f"Overall probability of drawing a red marble: {probability_red}")
print(f"Overall probability of drawing a blue marble: {probability_blue}")
Output:
Outcome is a: red
Overall probability of drawing a red marble: 0.5
Overall probability of drawing a blue marble: 0.5
Continuous probability distributions are used for
scenarios with an infinite number of possible outcomes.
For example, you have a scale that measures the weight
of objects to the nearest gram. When you weigh an apple,
the possible outcomes are any weight between 0 and
1000 grams. This is a continuous probability distribution
because there are an infinite number of possible
outcomes in the range of 0 to 1000 grams. The probability
of getting any particular weight, such as 150 grams, is
zero. However, we can calculate the probability of getting
a weight within a certain range, such as between 100 and
200 grams.
Tutorial 4.9: An example to illustrate a continuous probability distribution is as follows:
import numpy as np
# Define the range of possible weights
min_weight = 0
max_weight = 1000
# Generate a random weight for the apple
apple_weight = np.random.uniform(min_weight, max_weight)
print(f"Weight of the apple is {apple_weight} grams")
# Define a weight range
min_range = 100
max_range = 200
# Check if the weight is within the range
if min_range <= apple_weight <= max_range:
    print(f"Weight of the apple is within the range of {min_range}-{max_range} grams")
else:
    print(f"Weight of the apple is not within the range of {min_range}-{max_range} grams")
# Calculate the probability of the weight being within the range
probability_range = (max_range - min_range) / (max_weight - min_weight)
print(f"Probability of the weight of the apple being within the range of {min_range}-{max_range} grams is {probability_range}")
Output:
Weight of the apple is 348.2428034693577 grams
Weight of the apple is not within the range of 100-200 grams
Probability of the weight of the apple being within the range of 100-200 grams is 0.1
Uniform distribution
In a uniform distribution, all possible outcomes are equally likely. Flipping a fair coin follows a uniform distribution: there are two possible outcomes, Heads (H) and Tails (T), and each outcome is equally likely.
Tutorial 4.10: An example to illustrate uniform probability
distributions, is as follows:
import random
# Define the sample space
sample_space = ['H', 'T']
# Conduct the experiment (flip the coin)
outcome = random.choice(sample_space)
# Print the outcome
print(f"Outcome of the coin flip: {outcome}")
# Calculate the probability of the events
probability_H = sample_space.count('H') / len(sample_space)
probability_T = sample_space.count('T') / len(sample_space)
print(f"Probability of getting heads (P(H)): {probability_H}")
print(f"Probability of getting tails (P(T)): {probability_T}")
Output:
Outcome of the coin flip: T
Probability of getting heads (P(H)): 0.5
Probability of getting tails (P(T)): 0.5
Normal distribution
Normal distribution is symmetric about the mean,
meaning that data near the mean is more likely to occur
than data far from the mean. It is also known as the
Gaussian distribution and describes data with bell-shaped
curves. For example, consider measuring the test scores of 100 students. The resulting data would likely follow a normal
distribution, with most students' scores falling around the
mean and fewer students having very high or low scores.
Tutorial 4.11: An example to illustrate normal probability
distributions, is as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Define the parameters for the normal distribution,
# where loc is the mean and scale is the standard deviation.
# Let's assume the average test score is 70 and the standard deviation is 10.
loc, scale = 70, 10
# Generate a sample of test scores
test_scores = np.random.normal(loc, scale, 100)
# Create a histogram of the test scores
plt.hist(test_scores, bins=20, density=True, alpha=0.6, color='g')
# Plot the probability density function
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, loc, scale)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mean = %.2f, std = %.2f" % (loc, scale)
plt.title(title)
plt.savefig('normal_distribution.jpg', dpi=600, bbox_inches='tight')
plt.show()
Output:
Figure 4.2: Plot showing the normal distribution
Binomial distribution
Binomial distribution describes the number of successes
in a series of independent trials that only have two possible
outcomes: success or failure. It is determined by two
parameters, n, which is the number of trials, and p, which is
the likelihood of success in each trial. For example, suppose you flip a coin ten times, where each flip has a 50-50 chance of landing heads or tails. We can use the binomial distribution to figure out how likely it is to get a specific number of heads in those ten flips.
For instance, the likelihood of getting exactly three heads is as follows:
P(X = x) = C(n, x) * p^x * (1-p)^(n-x)
Where:
C(n, x) is the binomial coefficient, the number of ways to choose x successes out of n trials
p is the probability of success on each trial (0.5 in this case)
(1-p) is the probability of failure on each trial (0.5 in this case)
x is the number of successes (3 in this case)
n is the number of trials (10 in this case)
Substituting the values provided, C(10, 3) × 0.5³ × 0.5⁷ = 120/1024 ≈ 0.1172, so there is roughly an 11.72% chance of getting exactly 3 heads out of ten coin tosses.
Tutorial 4.12: An example to illustrate binomial probability
distributions, using coin toss example, is as follows:
from scipy.stats import binom
import matplotlib.pyplot as plt
import numpy as np
# number of trials, probability of each trial
n, p = 10, 0.5
# generate a range of numbers from 0 to n (number of trials)
x = np.arange(0, n+1)
# calculate binomial distribution
binom_dist = binom.pmf(x, n, p)
# display the probability of each outcome
for i in x:
    print(f"Probability of getting exactly {i} heads in {n} flips is: {binom_dist[i]:.5f}")
# plot the binomial distribution
plt.bar(x, binom_dist)
plt.title('Binomial Distribution PMF: 10 coin Flips, Odds of Success for Heads is p=0.5')
plt.xlabel('Number of Heads')
plt.ylabel('Probability')
plt.savefig('binomial_distribution.jpg', dpi=600, bbox_inches='tight')
plt.show()
Output:
Probability of getting exactly 0 heads in 10 flips is: 0.00098
Probability of getting exactly 1 heads in 10 flips is: 0.00977
Probability of getting exactly 2 heads in 10 flips is: 0.04395
Probability of getting exactly 3 heads in 10 flips is: 0.11719
Probability of getting exactly 4 heads in 10 flips is: 0.20508
Probability of getting exactly 5 heads in 10 flips is: 0.24609
Probability of getting exactly 6 heads in 10 flips is: 0.20508
Probability of getting exactly 7 heads in 10 flips is: 0.11719
Probability of getting exactly 8 heads in 10 flips is: 0.04395
Probability of getting exactly 9 heads in 10 flips is: 0.00977
Probability of getting exactly 10 heads in 10 flips is: 0.00098
Figure 4.3: Plot showing the binomial distribution
Poisson distribution
Poisson distribution is a discrete probability distribution
that describes the number of events occurring in a fixed
interval of time or space if these events occur independently
and with a constant rate. The Poisson distribution has only
one parameter, λ (lambda), which is the mean number of
events. For example, assume you run a website that gets an
average of 500 visitors per day. This is your λ (lambda).
Now you want to find the probability of getting exactly 550
visitors in a day. This is a Poisson distribution problem
because the number of visitors can be any non-negative
integer, the visitors arrive independently, and you know the
average number of visitors per day. Using the Poisson distribution formula, P(X = k) = e^(-λ) × λ^k / k!, you can calculate this probability.
Tutorial 4.13: An example to illustrate Poisson probability
distributions, is as follows:
from scipy.stats import poisson
import matplotlib.pyplot as plt
import numpy as np
# average number of visitors per day
lambda_ = 500
# generate a range of numbers from 0 to 600
x = np.arange(0, 600)
# calculate Poisson distribution
poisson_dist = poisson.pmf(x, lambda_)
# number of visitors we are interested in
k = 550
prob_k = poisson.pmf(k, lambda_)
print(f"Probability of getting exactly {k} visitors in a day is: {prob_k:.5f}")
# plot the Poisson distribution
plt.bar(x, poisson_dist)
plt.title('Poisson Distribution PMF: λ=500')
plt.xlabel('Number of Visitors')
plt.ylabel('Probability')
plt.savefig('poisson_distribution.jpg', dpi=600, bbox_inches='tight')
plt.show()
We set lambda_ to 500 in the program, representing the
average daily visitors. The average number of visitors per
day is 500. We generate numbers between 0 and 600 for x
to cover your desired number of visitors, specifically 550.
The program calculates and displays a bar chart of the
Poisson distribution once executed. This chart represents
the probability of receiving a specific number of visitors per
day. The horizontal axis indicates the number of visitors,
and the vertical axis displays the probability. The chart
displays the likelihood of having a certain number of visitors
in a day. Each bar on the chart represents the probability of
obtaining that exact number of visitors in one day.
Output: The program prints the probability of getting exactly 550 visitors in a day and displays the bar chart of the Poisson distribution PMF for λ = 500.
Conclusion
Understanding covariance and correlation is critical to
determining relationships between variables, while
understanding outliers and anomalies is essential to
ensuring the accuracy of data analysis. The concept of
probability and its distributions is the backbone of statistical
prediction and inference. Finally, understanding arrays and
matrices is fundamental to performing complex
computations and manipulations in data analysis. These
concepts are not only essential in statistics, but also have
broad applications in fields as diverse as data science,
machine learning, and artificial intelligence. Using covariance and correlation, observing outliers and anomalies, and understanding how data and probability concepts are used to predict outcomes and analyze the likelihood of events all help to untangle statistical relationships. This concludes our coverage of descriptive statistics.
In Chapter 5, Estimation and Confidence Intervals, we will turn to the important concepts of inferential statistics and see how estimation is done and how confidence intervals are measured.
Introduction
Estimation involves making an inference on the true value,
while the confidence interval provides a range of values
that we can be confident contains the true value. For
example, suppose you are a teacher and you want to
estimate the average height of the students in your school.
It is not possible to measure the height of every student, so
you take a sample of 30 students and measure their
heights. Let us say the average height of your sample is
160 cm and the standard deviation is 10 cm. This average
of 160 cm is your point estimate of the average height of all
students in your school. However, it should be noted that
the 30 students sampled may not be a perfect
representation of the entire class, as there may be taller or
shorter students who were not included. Therefore, it
cannot be definitively concluded that the average height of
all students in the class is exactly 160 cm. To address this
uncertainty, a confidence interval can be calculated. A
confidence interval is an estimate of the range in which the
true population mean, the average height of all students in
the class, is likely to lie. It is based on the sample mean and
standard deviation and provides a measure of the
uncertainty in the estimate. In this example, a 95%
confidence interval was calculated, indicating that there is
a 95% probability that the true average height of all
students in the class falls between 155 cm and 165 cm.
These concepts from descriptive statistics aid in making
informed decisions based on the available data by
quantifying uncertainty, understanding variations around
an estimate, comparing different estimates, and testing
hypotheses.
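As a minimal sketch, assuming the summary statistics from the example above (sample mean 160 cm, standard deviation 10 cm, n = 30), a 95% confidence interval for the mean can be computed from the standard error; using the t distribution the interval comes out to roughly 156 to 164 cm, in line with the chapter's illustrative 155 to 165 cm range:
import numpy as np
from scipy import stats

# Summary statistics from the teacher example above
sample_mean = 160   # cm
sample_std = 10     # cm
n = 30

# Standard error of the sample mean
standard_error = sample_std / np.sqrt(n)

# 95% confidence interval using the t distribution with n - 1 degrees of freedom
ci = stats.t.interval(0.95, df=n - 1, loc=sample_mean, scale=standard_error)
print(f"95% confidence interval for the mean height: {ci}")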
Structure
In this chapter, we will discuss the following topics:
Points and interval estimation
Standard error and margin of error
Confidence intervals
Objectives
By the end of this chapter, readers will be introduced to the
concept of estimation in data analysis and explain how to
perform it using different methods. Estimation is the
process of inferring unknown population parameters from
sample data. There are two types of estimation: point
estimation and interval estimation. This chapter will also
discuss the types of errors in estimation, and how to
measure them. Moreover, this chapter will demonstrate
how to construct and interpret various confidence intervals
for different scenarios, such as comparing means,
proportions, or correlations. Finally, this chapter will show
how to use t-tests and p-values to test hypotheses about
population parameters based on confidence intervals.
Examples and exercises will be provided throughout the
chapter to help the reader understand and apply the
concepts and methods of estimation.
Confidence intervals
All confidence intervals are interval estimates, but not all
interval estimates are confidence intervals. Interval
estimate is a broader term that refers to any range of
values that is likely to contain the true value of a population
parameter. For instance, if you have a population of
students and want to estimate their average height, you
might reason that it is likely to fall between 5 feet 2 inches
and 6 feet 2 inches. This is an interval estimate, but it does
not have a specific probability associated with it.
Confidence interval, on the other hand, is a specific type
of interval estimate that is accompanied by a probability
statement. For example, a 95% confidence interval means
that if you repeatedly draw different samples from the
same population, 95% of the time, the true population
parameter will fall within the calculated interval.
As discussed, confidence interval is also used to make
inferences about the population based on the sample data.
Tutorial 5.9: Suppose you want to estimate the average
height of all adult women in your city. You take a sample of
100 women and find that their average height is 5 feet 5
inches. You want to estimate the true average height of all
adult women in the city with 95% confidence. This means
that you are 95% confident that the true average height is
between 5 feet 3 inches and 5 feet 7 inches. Based on this
example a Python program illustrating confidence intervals,
is as follows:
import numpy as np
from scipy import stats
# Sample data
data = np.array([5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6])
# Calculate sample mean and standard deviation
mean = np.mean(data)
std = np.std(data)
# Calculate confidence interval with 95% confidence level
margin_of_error = stats.norm.ppf(0.975) * std / np.sqrt(len(data))
confidence_interval = (mean - margin_of_error, mean + margin_of_error)
print("Sample mean:", mean)
print("Standard deviation:", std)
print("95% confidence interval:", confidence_interval)
Output:
Sample mean: 5.55
Standard deviation: 0.2872281323269015
95% confidence interval: (5.371977430445669, 5.728022569554331)
The sample mean is 5.55, indicating that the average
height in the sample is 5.55 feet. The standard deviation is
0.287, indicating that the heights in the sample vary by
about 0.287 feet. The 95% confidence interval is (5.37, 5.73), which suggests that we can be 95% confident that the true average height of all adult women in the city falls within this range. To put it simply, if we were to take multiple samples of 10 women from the city and compute a 95% confidence interval from each sample, about 95% of those intervals would contain the true average height.
Tutorial 5.10: A Python program to illustrate confidence
interval for the age column in the diabetes dataset, is as
follows:
import pandas as pd
from scipy import stats
# Load the diabetes data from a csv file
diabities_df = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter1/diabetes.csv")
# Calculate the mean and standard deviation of the 'Age' column
mean = diabities_df['Age'].mean()
std_dev = diabities_df['Age'].std()
# Calculate the standard error
std_err = std_dev / (len(diabities_df['Age']) ** 0.5)
# Calculate the 95% Confidence Interval
ci = stats.norm.interval(0.95, loc=mean, scale=std_err)
print(f"95% confidence interval for the 'Age' column is {ci}")
Output:
95% confidence interval for the 'Age' column is (32.40915352661263, 34.0726173067207)
Conclusion
In this chapter, we have learned how to estimate unknown
population parameters from sample data using various
methods. We saw that there are two types of estimation:
point estimation and interval estimation. Point estimation
gives a single value as the best guess for the parameter,
while interval estimation gives a range of values that
includes the parameter with a certain degree of confidence.
We have also discussed the errors in estimation and how to
measure them using standard error and margin of error. In
addition, we have shown how to construct and interpret
different confidence intervals for different scenarios, such
as comparing means, proportions, or correlations. We
learned how to use t-tests and p-values to test hypotheses
about population parameters based on confidence intervals.
We applied the concepts and methods of estimation to real-
world examples using the diabetes dataset and the
transaction narrative.
Similarly, estimation is a fundamental and useful tool in
data analysis because it allows us to make inferences and
predictions about a population based on a sample. By using
estimation, we can quantify the uncertainty and variability
of our estimates and provide a measure of their reliability
and accuracy. Estimation also allows us to test hypotheses
and draw conclusions about the population parameters of
interest. It is used in a wide variety of fields and disciplines,
including economics, medicine, engineering, psychology,
and the social sciences.
We hope this chapter has helped you understand and apply
the concepts and methods of estimation in data analysis.
The next chapter will introduce the concept of hypothesis
and significance testing.
Introduction
Testing a claim and drawing a conclusion from the result is one of the most common tasks in statistics. Hypothesis testing defines the claim and, using a significance level and a range of different tests, checks the validity of the claim in relation to the data.
Hypothesis testing is a method of making decisions based
on data analysis. It involves stating a null hypothesis and an
alternative hypothesis, which are mutually exclusive
statements about a population parameter. Significance tests
are procedures that assess how likely it is that the observed
data are consistent with the null hypothesis. There are
different types of statistical tests that can be used for
hypothesis testing, depending on the nature of the data and
the research question. Such as z-test, t-test, chi-square test,
ANOVA. These are described later in the chapter, with
examples. Sampling techniques and sampling distributions
are important concepts, and sometimes they are critical in
hypothesis testing because they affect the validity and
reliability of the results. Sampling techniques are methods
of selecting a subset of individuals or units from a
population that is intended to be representative of the
population. Sampling distributions are the probability
distributions of the possible values of a sample statistic
based on repeated sampling from the population.
Structure
In this chapter, we will discuss the following topics:
Hypothesis testing
Significance tests
Role of p-value and significance level
Statistical test
Sampling techniques and sampling distributions
Objectives
The objective of this chapter is to introduce the concept of
hypothesis testing, determining significance, and
interpreting hypotheses through multiple testing. A
hypothesis is a claim or technique for drawing a conclusion,
and a significance test checks the likelihood that the claim
or conclusion is correct. We will see how to perform them
and interpret the result obtained from the data. This
chapter also discusses the types of tests used for hypothesis
testing and significance testing. In addition, this chapter
will explain the role of the p-value and the significance level.
Finally, this chapter shows how to use various hypothesis
and significance tests and p-values to test hypotheses.
Hypothesis testing
Hypothesis testing is a statistical method that uses data
from a sample to draw conclusions about a population. It
involves testing an assumption, known as the null
hypothesis, to determine whether it is likely to be true or
false. The null hypothesis typically states that there is no
effect or difference between two groups, while the
alternative hypothesis is the opposite and what we aim to
prove. Hypothesis testing checks if an idea about the world
is true or not. For example, you might have an idea that
men are taller than women on average, and you want to see
if the data support your idea or not.
Tutorial 6.1: An illustration of the hypothesis testing using
the example ‘men are taller than women on average’, as
mentioned in above example, is as follows:
import scipy.stats as stats
# define the significance level
# alpha = 0.05, which means there is a 5% chance of making a type I error
# (rejecting the null hypothesis when it is true)
alpha = 0.05
# generate some random data for men and women heights (in cm)
# you can replace this with your own data
men_heights = stats.norm.rvs(loc=175, scale=10, size=100)  # mean = 175, std = 10
women_heights = stats.norm.rvs(loc=165, scale=8, size=100)  # mean = 165, std = 8
# calculate the sample means and standard deviations
men_mean = men_heights.mean()
men_std = men_heights.std()
women_mean = women_heights.mean()
women_std = women_heights.std()
# print the sample statistics
print("Men: mean = {:.2f}, std = {:.2f}".format(men_mean, men_std))
print("Women: mean = {:.2f}, std = {:.2f}".format(women_mean, women_std))
# perform a two-sample t-test
# the null hypothesis is that the population means are equal
# the alternative hypothesis is that the population means are not equal
t_stat, p_value = stats.ttest_ind(men_heights, women_heights)
# print the test statistic and the p-value
print("t-statistic = {:.2f}".format(t_stat))
print("p-value = {:.4f}".format(p_value))
# compare the p-value with the significance level and make a decision
if p_value <= alpha:
    print("Reject the null hypothesis: the population means are not equal.")
else:
    print("Fail to reject the null hypothesis: the population means are equal.")
Output: Numbers and results may vary because the data is randomly generated. Following is a snippet of the output:
Men: mean = 174.48, std = 9.66
Women: mean = 165.16, std = 7.18
t-statistic = 7.70
p-value = 0.0000
Reject the null hypothesis: the population means are not equal.
Here is a simple explanation of how hypothesis testing
works. Suppose you have a jar of candies, and you want to
determine whether there are more red candies than blue
candies in the jar. Since counting all the candies in the jar is
not feasible, you can extract a handful of them and
determine the number of red and blue candies. This process
is known as sampling. Based on the sample, you can make
an inference about the entire jar. This inference is referred
to as a hypothesis, which is akin to a tentative answer to a
question. However, to determine the validity of this
hypothesis, a comparison between the sample and the
expected outcome is necessary. For instance, consider the
hypothesis: There are more red candies than blue candies in
the jar. This comparison is known as a hypothesis test,
which determines the likelihood of the sample matching the
hypothesis. For instance, if the hypothesis is correct, the
sample should contain more red candies than blue candies.
However, if the hypothesis is incorrect, the sample should
contain roughly the same number of red and blue candies. A
test provides a numerical measurement of how well the
sample aligns with the hypothesis. This measurement is
known as a p-value, which indicates the level of surprise in
the sample. A low p-value indicates a highly significant
result, while a high p-value indicates a result that is not
statistically significant. For instance, if you randomly select
a handful of candies and they are all red, the result would
be highly significant, and the p-value would be low.
However, if you randomly select a handful of candies and
they are half red and half blue, the result would not be
statistically significant, and the p-value would be high.
Based on the p-value, one can decide whether the data provide enough evidence to support the hypothesis. This decision is akin to a final answer to the question. For instance, if the p-value is
low, it can be concluded that the hypothesis is true, and one
can state that there are more red candies than blue candies
in the jar. Conversely, if the p-value is high, it can be
concluded that the hypothesis is false, and one can state:
The jar does not contain more red candies than blue
candies.
Tutorial 6.2: An illustration of the hypothesis testing using
the example jar of candies, as mentioned in above example,
is as follows:
1. # import the scipy.stats library
2. import scipy.stats as stats
3. # define the significance level
4. alpha = 0.05
5. # generate some random data for the number of red and blue candies in a handful
6. # you can replace this with your own data
7. n = 20 # number of trials (candies)
8. p = 0.5 # probability of success (red candy)
9. red_candies = stats.binom.rvs(n, p) # number of red candies
10. blue_candies = n - red_candies # number of blue candies
11. # print the sample data
12. print("Red candies: {}".format(red_candies))
13. print("Blue candies: {}".format(blue_candies))
14. # perform a binomial test
15. # the null hypothesis is that the probability of success is 0.5
16. # the alternative hypothesis is that the probability of success is not 0.5
17. p_value = stats.binomtest(red_candies, n, p, alternative='two-sided')
18. # print the p-value
19. print("p-value = {:.4f}".format(p_value.pvalue))
20. # compare the p-value with the significance level and make a decision
21. if p_value.pvalue <= alpha:
22.     print("Reject the null hypothesis: the probability of success is not 0.5.")
23. else:
24.     print("Fail to reject the null hypothesis: the probability of success is 0.5.")
Output: Numbers and results may vary because the data are randomly generated. Following is a snippet of the output:
1. Red candies: 6
2. Blue candies: 14
3. p-value = 0.1153
4. Fail to reject the null hypothesis: the probability of success is 0.5.
Significance testing
Significance testing evaluates the likelihood of a claim or
statement about a population being true using data. For
instance, it can be used to test if a new medicine is more
effective than a placebo or if a coin is biased. The p-value is
a measure used in significance testing that indicates how
frequently you would obtain the observed data or more
extreme data if the claim or statement were false. The
smaller the p-value, the stronger the evidence against the
claim or statement. Significance testing is different from
hypothesis testing, although they are often confused and
used interchangeably. Hypothesis testing is a formal
procedure for comparing two competing statements or
hypotheses about a population, and making a decision based
on the data. One of the hypotheses is called the null
hypothesis, the other hypothesis is called the alternative
hypothesis, as described above in hypothesis testing.
Hypothesis testing involves choosing a significance level,
which is the maximum probability of making a wrong
decision when the null hypothesis is true. Usually, the
significance level is set to 0.05. Hypothesis testing also
involves calculating a test statistic, which is a number that
summarizes the data and measures how far it is from the
null hypothesis. Based on the test statistic, a p-value is
computed, which is the probability of getting the data (or
more extreme) if the null hypothesis is true. If the p-value is
less than the significance level, the null hypothesis is
rejected and the alternative hypothesis is accepted. If the p-
value is greater than the significance level, the null
hypothesis is not rejected and the alternative hypothesis is
not accepted.
Suppose, you have a friend who claims to be able to guess
the outcome of a coin toss correctly more than half the time,
you can test their claim using significance testing. Ask them
to guess the outcome of 10-coin tosses and record how
many times they are correct. If the coin is fair and your
friend is just guessing, you would expect them to be right
about 5 times out of 10, on average. However, if they get 6,
7, 8, 9, or 10 correct guesses, how likely is it to happen by
chance? The p-value answers the question of the probability
of getting the same or more correct guesses as your friend
did, assuming a fair coin and random guessing. A smaller p-
value indicates a lower likelihood of this happening by
chance, and therefore raises suspicion about your friend's
claim. Typically, a p-value cutoff of 0.05 is used. If the p-
value is less than 0.05, we consider the result statistically
significant and reject the claim that the coin is fair, and the
friend is guessing. If the p-value is greater than 0.05, we
consider the result not statistically significant and do not
reject the claim that the coin is fair, and the friend is
guessing.
Tutorial 6.11: An illustration of the significance testing,
based on above coin toss example, is as follows:
1. # Import the binomtest function from scipy.stats
2. from scipy.stats import binomtest
3. # Ask the user to input the number of correct guesses by their friend
4. correct = int(input("How many correct guesses did your friend make out of 10 coin tosses? "))
5. # Calculate the p-value using the binomtest function
6. # The arguments are: number of successes, number of trials, probability of success, alternative hypothesis
7. p_value = binomtest(correct, 10, 0.5, "greater")
8. # Print the p-value
9. print("p-value = {:.4f}".format(p_value.pvalue))
10. # Compare the p-value with the cutoff of 0.05
11. if p_value.pvalue < 0.05:
12.     # If the p-value is less than 0.05, reject the claim that the coin is fair and the friend is guessing
13.     print("This result is statistically significant. We reject the claim that the coin is fair and the friend is guessing.")
14. else:
15.     # If the p-value is greater than 0.05, do not reject the claim that the coin is fair and the friend is guessing
16.     print("This result is not statistically significant. We do not reject the claim that the coin is fair and the friend is guessing.")
Output: For nine correct guesses, the output is as follows:
1. How many correct guesses did your friend make out of 10 coin tosses? 9
2. p-value = 0.0107
3. This result is statistically significant. We reject the claim that the coin is fair and the friend is guessing.
For two correct guesses, the output is not statistically significant, as follows:
1. How many correct guesses did your friend make out of 10 coin tosses? 2
2. p-value = 0.9893
3. This result is not statistically significant. We do not reject the claim that the coin is fair and the friend is guessing.
The following is another example to better understand the
relation between hypothesis and significance testing.
Suppose, you want to know whether a new candy makes
children smarter. You have two hypotheses: The null
hypothesis is that the candy has no effect on children's
intelligence. The alternative hypothesis is that the candy
increases children's intelligence.
You decide to test your hypotheses by giving the candy to 20
children and a placebo to another 20 children. You then
measure their IQ scores before and after the treatment. You
choose a significance level of 0.05, meaning that you are
willing to accept a 5% chance of being wrong if the candy
has no effect. You calculate a test statistic, which is a
number that tells you how much the candy group improved
compared to the placebo group. Based on the test statistic,
you calculate a p-value, which is the probability of getting
the same or greater improvement than you observed if the
candy had no effect.
If the p-value is less than 0.05, you reject the null
hypothesis and accept the alternative hypothesis. You
conclude that the candy makes the children smarter.
If the p-value is greater than 0.05, you do not reject the null
hypothesis and you do not accept the alternative hypothesis.
You conclude that the candy has no effect on the children's
intelligence.
Tutorial 6.12: An illustration of the significance testing,
based on above candy and smartness example, is as follows:
1. # Import the ttest_rel function from scipy.stats
2. from scipy.stats import ttest_rel
3. # Define the IQ scores of the candy group before and after the treatment
4. candy_before = [100, 105, 110, 115, 120, 125, 130, 135, 140]
5. candy_after = [104, 105, 110, 120, 123, 125, 135, 135, 144]
6. # Define the IQ scores of the placebo group before and after the treatment
7. placebo_before = [101, 106, 111, 116, 121, 126, 131, 136, 141]
8. placebo_after = [100, 104, 109, 113, 117, 121, 125, 129, 133]
9. # Calculate the difference in IQ scores for each group
10. candy_diff = [candy_after[i] - candy_before[i] for i in range(9)]
11. placebo_diff = [placebo_after[i] - placebo_before[i] for i in range(9)]
12. # Perform a paired t-test on the difference scores
13. # The null hypothesis is that the mean difference is zero
14. # The alternative hypothesis is that the mean difference is positive
15. t_stat, p_value = ttest_rel(candy_diff, placebo_diff, alternative="greater")
16. # Print the test statistic and the p-value
17. print(f"The test statistic is {t_stat:.4f}")
18. print(f"The p-value is {p_value:.4f}")
19. # Compare the p-value with the significance level of 0.05
20. if p_value < 0.05:
21.     # If the p-value is less than 0.05, reject the null hypothesis and accept the alternative hypothesis
22.     print("This result is statistically significant. We reject the null hypothesis and accept the alternative hypothesis.")
23.     print("We conclude that the candy makes the children smarter.")
24. else:
25.     # If the p-value is greater than 0.05, do not reject the null hypothesis and do not accept the alternative hypothesis
26.     print("This result is not statistically significant. We do not reject the null hypothesis and do not accept the alternative hypothesis.")
27.     print("We conclude that the candy has no effect on the children's intelligence.")
Output:
1. The test statistic is 5.6127
2. The p-value is 0.0003
3. This result is statistically significant. We reject the null hypothesis and accept the alternative hypothesis.
4. We conclude that the candy makes the children smarter.
The above conclusion depends on the p-value, which in turn depends on the before and after scores; different data would lead to a different result.
Statistical tests
Commonly used statistical tests include the z-test, t-test,
and chi-square test, which are typically applied to different
types of data and research questions. Each of these tests
plays a crucial role in the field of statistics, providing a
framework for making inferences and drawing conclusions
from data. The z-test, t-test, chi-square test, one-way ANOVA, and two-way ANOVA are used both for hypothesis testing and for assessing statistical significance.
Z-test
The z-test is a statistical test that compares the mean of a
sample to the mean of a population or the means of two
samples when the population standard deviation is known.
It can determine if the difference between the means is
statistically significant. For example, you can use a z-test to
determine if the average height of students in your class
differs from the average height of all students in your
school, provided you know the standard deviation of the
height of all students. To explain it simply, imagine you have
two basketball teams, and you want to know if one team is
taller than the other. You can measure the height of each
player on both teams, calculate the average height for each
team, and then use a z-test to determine if the difference
between the averages is significant or just due to chance.
Tutorial 6.14: To illustrate the z-test, based on the above basketball team height example, is as follows:
1. # import the ztest function from statsmodels package
2. from statsmodels.stats.weightstats import ztest
3. # create a list of heights (in cm) for each team
4. teamA = [180, 182, 185, 189, 191, 191, 192, 194, 199, 199, 205, 209, 209, 209, 210, 212, 212, 213, 214, 214]
5. teamB = [190, 191, 191, 191, 195, 195, 199, 199, 208, 209, 209, 214, 215, 216, 217, 217, 228, 229, 230, 233]
6. # perform a two sample z-test to compare the mean heights of the two teams
7. # the null hypothesis is that the mean heights are equal
8. # the alternative hypothesis is that the mean heights are different
9. # we use a two-tailed test with a significance level of 0.05
10. z_stat, p_value = ztest(teamA, teamB, value=0)
11. # print the test statistic and the p-value
12. print("Z-statistic:", z_stat)
13. print("P-value:", p_value)
14. # interpret the result
15. if p_value < 0.05:
16.     print("We reject the null hypothesis and conclude that the mean heights of the two teams are significantly different.")
17. else:
18.     print("We fail to reject the null hypothesis and conclude that the mean heights of the two teams are not significantly different.")
Output:
1. Z-statistic: -2.020774406815312
2. P-value: 0.04330312332391124
3. We reject the null hypothesis and conclude that the mean heights of the two teams are significantly different.
This means that, based on the sample data, there is enough evidence to conclude that the two teams differ in average height; because Team B's sample mean is higher, the data suggest that Team B is, on average, taller than Team A and that this difference is unlikely to be due to chance alone.
T-test
A t-test is a statistical test that compares the mean of a
sample to the mean of a population or the means of two
samples. It can determine if the difference between the
means is statistically significant or not, even when the
population standard deviation is unknown and estimated
from the sample. Here is a simple example: Suppose, you
want to compare the delivery times of two different pizza
places. You can order a pizza from each restaurant and
record the time it takes for each pizza to arrive. Then, you
can use a t-test to determine if the difference between the
times is significant or if it could have occurred by chance.
Another example is, you can use a t-test to determine
whether the average score of students who took a math test
online differs from the average score of students who took
the same test on paper, provided that you are unaware of
the standard deviation of the scores of all students who took
the test.
Tutorial 6.15: To illustrate a two-sample comparison of means, based on the above pizza delivery example, is as follows (note that this tutorial reuses the statsmodels ztest function; a t-test version with scipy is sketched after the output):
1. # import the ztest function from statsmodels package
2. from statsmodels.stats.weightstats import ztest
3. # create a list of delivery times (in minutes) for each pizza place
4. placeA = [15, 18, 20, 22, 25, 28, 30, 32, 35, 40]
5. placeB = [12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
6. # perform a two sample z-test to compare the mean delivery times of the two pizza places
7. # the null hypothesis is that the mean delivery times are equal
8. # the alternative hypothesis is that the mean delivery times are different
9. # we use a two-tailed test with a significance level of 0.05
10. z_stat, p_value = ztest(placeA, placeB, value=0)
11. # print the test statistic and the p-value
12. print("Z-statistic:", z_stat)
13. print("P-value:", p_value)
14. # interpret the result
15. if p_value < 0.05:
16.     print("We reject the null hypothesis and conclude that the mean delivery times of the two pizza places are significantly different.")
17. else:
18.     print("We fail to reject the null hypothesis and conclude that the mean delivery times of the two pizza places are not significantly different.")
Output:
1. Z-statistic: 1.7407039045950503
2. P-value: 0.08173549351419786
3. We fail to reject the null hypothesis and conclude that the mean delivery times of the two pizza places are not significantly different.
This means that, based on the sample data, there is not enough evidence to conclude that the two pizza places differ in average delivery time; although place B's sample mean is lower, the observed difference could plausibly be due to chance.
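Since this section concerns the t-test, a minimal sketch of the same comparison using scipy's independent two-sample t-test is shown below. This is an illustrative variant rather than the book's tutorial code, and its p-value differs slightly from the z-test above because the t distribution is used:
1. # A minimal sketch (illustrative): the same pizza delivery comparison with an independent two-sample t-test
2. from scipy import stats
3. placeA = [15, 18, 20, 22, 25, 28, 30, 32, 35, 40]
4. placeB = [12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
5. # Null hypothesis: the mean delivery times are equal; alternative: they are different (two-sided)
6. t_stat, p_value = stats.ttest_ind(placeA, placeB)
7. print("t-statistic:", t_stat)
8. print("p-value:", p_value)
9. if p_value < 0.05:
10.     print("We reject the null hypothesis: the mean delivery times differ significantly.")
11. else:
12.     print("We fail to reject the null hypothesis: no significant difference in mean delivery times.")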
Chi-square test
The chi-square test is a statistical tool that compares
observed and expected frequencies of categorical data
under a null hypothesis. It can determine if there is a
significant association between two categorical variables or
if the distribution of a categorical variable differs from the
expected distribution. To determine if there is a relationship
between the type of pet a person owns and their favorite
color, or if the proportion of people who prefer chocolate ice
cream is different from 50%, you can use a chi-square test.
Tutorial 6.16: Suppose, based on the above example of
pets and favorite colors, you have data consisting of the
observed frequencies of categories in Table 6.1, then
implementation of the chi-square test on it, is as follows:
Pet     Red    Blue    Green    Yellow
Cat     12     18      10       15
Dog     8      14      12       11
Bird    5      9       15       6
Table 6.1: Observed frequencies of pet type and favorite color
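A minimal sketch of how the test could be run on these observed counts with scipy.stats.chi2_contingency is shown below (an illustrative sketch; the book's own tutorial code may differ):
1. # A minimal sketch (illustrative): chi-square test of independence on the observed counts in Table 6.1
2. from scipy.stats import chi2_contingency
3. # Rows: Cat, Dog, Bird; columns: Red, Blue, Green, Yellow
4. observed = [[12, 18, 10, 15],
5.             [8, 14, 12, 11],
6.             [5, 9, 15, 6]]
7. # Null hypothesis: pet type and favorite color are independent
8. chi2_stat, p_value, dof, expected = chi2_contingency(observed)
9. print("Chi-square statistic:", chi2_stat)
10. print("p-value:", p_value)
11. print("Degrees of freedom:", dof)
12. if p_value < 0.05:
13.     print("Reject the null hypothesis: pet type and favorite color appear to be associated.")
14. else:
15.     print("Fail to reject the null hypothesis: no significant association detected.")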
One-way ANOVA
A one-way ANOVA is a statistical test that compares the
means of three or more groups that have been split on one
independent variable. A one-way ANOVA can tell you if
there is a significant difference among the group means or
not. For example, you can use a one-way ANOVA to see if
the average weight of dogs varies by breed, if you have data
on the weight of dogs from three or more breeds. Another
example is, you can use an analogy of a baking contest to
know if the type of flour you use affects the taste of your
cake. You can bake three cakes using different types of flour
and ask some judges to rate the taste of each cake. Then
you can use a one-way ANOVA to see if the average rating
of the cakes is different depending on the type of flour, or if
they are all similar.
Tutorial 6.17: To illustrate the one-way ANOVA test, based
on above baking contest example, is as follows.
1. import numpy as np
2. import scipy.stats as stats
3. # Define the ratings of the cakes by the judges
4. cake1 = [8.4, 7.6, 9.2, 8.9, 7.8] # Cake made with flour type 1
5. cake2 = [6.5, 5.7, 7.3, 6.8, 6.4] # Cake made with flour type 2
6. cake3 = [7.1, 6.9, 8.2, 7.4, 7.0] # Cake made with flour type 3
7. # Perform one-way ANOVA
8. f_stat, p_value = stats.f_oneway(cake1, cake2, cake3)
9. # Print the results
10. print("F-statistic:", f_stat)
11. print("P-value:", p_value)
Output:
1. F-statistic: 11.716117216117217
2. P-value: 0.001509024295003377
The p-value is very small, which means that we can reject
the null hypothesis that the means of the ratings are equal.
This suggests that the type of flour affects the taste of the
cake.
Two-way ANOVA
A two-way ANOVA is a statistical test that compares the
means of three or more groups split on two independent
variables. It can determine if there is a significant
difference among the group means, if there is a significant
interaction between the two independent variables, or both.
For example, if you have data on the blood pressure of
patients from different genders and age groups, you can use
a two-way ANOVA to determine if the average blood
pressure of patients varies by gender and age group.
Another example is an analogy of a science fair project. Imagine you want to find out whether the type of music you listen to and the time of day you study affect your memory.
Volunteers can be asked to memorize a list of words while
listening to different types of music (such as classical, rock,
or pop) at various times of the day (such as morning,
afternoon, or evening). Their recall of the words can then be
tested, and their memory score measured. A two-way
ANOVA can be used to determine if the average memory
score of the volunteers differs depending on the type of
music and time of day, or if there is an interaction between
these two factors. For instance, it may show, listening to
classical music may enhance memory more effectively in the
morning than in the evening, while rock music may have the
opposite effect.
Tutorial 6.18: The implementation of the two-way ANOVA test, based on the above music and study-time memory example, is as follows:
1. import pandas as pd
2. import statsmodels.api as sm
3. from statsmodels.formula.api import ols
4. from statsmodels.stats.anova import anova_lm
5. # Define the data
6. data = {"music": ["classical", "classical", "classical", "classical", "classical",
7.                   "rock", "rock", "rock", "rock", "rock",
8.                   "pop", "pop", "pop", "pop", "pop"],
9.         "time": ["morning", "morning", "afternoon", "afternoon", "evening",
10.                  "morning", "morning", "afternoon", "afternoon", "evening",
11.                  "morning", "morning", "afternoon", "afternoon", "evening"],
12.         "score": [12, 14, 11, 10, 9,
13.                   8, 7, 9, 8, 6,
14.                   10, 11, 12, 13, 14]}
15. # Create a pandas DataFrame
16. df = pd.DataFrame(data)
17. # Perform two-way ANOVA
18. model = ols("score ~ C(music) + C(time) + C(music):C(time)", data=df).fit()
19. aov_table = anova_lm(model, typ=2)
20. # Print the results
21. print(aov_table)
Output:
1.                      sum_sq   df          F    PR(>F)
2. C(music)          54.933333  2.0  36.622222  0.000434
3. C(time)            1.433333  2.0   0.955556  0.436256
4. C(music):C(time)  24.066667  4.0   8.022222  0.013788
5. Residual           4.500000  6.0        NaN       NaN
Since the p-value for music is less than 0.05, the music has
a significant effect on memory score, while time has no
significant effect. And since the p-value for the interaction
effect (0.013788) is less than 0.05, this tells us that there is
a significant interaction effect between music and time.
Conclusion
In this chapter, we learned about the concept and process of
hypothesis testing, which is a statistical method for testing
whether or not a statement about a population parameter is
true. Hypothesis testing is important because it allows us to
draw conclusions from data and test the validity of our
claims.
We also learned about significance tests, which are used to
evaluate the strength of evidence against the null
hypothesis based on the p-value and significance level.
Significance testing uses the p-value and significance level
to determine whether the observed effect is statistically
significant, meaning that it is unlikely to occur by chance.
We explored different types of statistical tests, such as z-
test, t-test, chi-squared test, one-way ANOVA, and two-way
ANOVA, and how to choose the appropriate test based on
the research question, data type, and sample size. We also
discussed the importance of sampling techniques and
sampling distributions, which are essential for conducting
valid and reliable hypothesis tests. To illustrate the
application of hypothesis testing, we conducted two
examples using a diabetes dataset. The first example tested
the null hypothesis that the mean BMI of diabetic patients is
equal to the mean BMI of non-diabetic patients using a two-
sample t-test. The second example tested the null hypothesis that there is no association between the number of pregnancies and the outcome (diabetic versus non-diabetic) using a chi-squared test.
Chapter 7, Statistical Machine Learning discusses the
concept of machine learning and how to apply it to make
artificial intelligent models and evaluate them.
Introduction
Statistical Machine Learning (ML) is a branch of
Artificial Intelligence (AI) that combines statistics and
computer science to create models that can learn from data
and make predictions or decisions. Statistical machine
learning has many applications in fields as diverse as
computer vision, speech recognition, bioinformatics, and
more.
There are two main types of learning problems: supervised
and unsupervised learning. Supervised learning involves
learning a function that maps inputs to outputs, based on a
set of labeled examples. Unsupervised learning involves
discovering patterns or structure in unlabeled data, such as
clustering, dimensionality reduction, or generative
modeling. Evaluating the performance and generalization of
different machine learning models is also important. This
can be done using methods such as cross-validation, bias-
variance tradeoff, and learning curves. When purely supervised or unsupervised approaches are not sufficient, semi-supervised and self-supervised techniques may be useful. This chapter covers only supervised machine learning, along with semi-supervised and self-supervised learning. The topics covered in this chapter are listed in the Structure section below.
Structure
In this chapter, we will discuss the following topics:
Machine learning
Supervised learning
Model selection and evaluation
Semi-supervised and self-supervised learning
Semi-supervised techniques
Self-supervised techniques
Objectives
By the end of this chapter, readers will be introduced to the
concept of machine learning, its types, and the topics associated with supervised machine learning, with simple
examples and tutorials. At the end of this chapter, you will
have a solid understanding of the principles and methods of
statistical supervised machine learning and be able to apply
and evaluate them to various real-world problems.
Machine learning
ML is a prevalent form of AI. It powers many of the digital
goods and services we use daily. Algorithms trained on data
sets create models that enable machines to perform tasks
that would otherwise only be possible for humans. Deep learning is also a popular subbranch of machine learning that uses neural networks with multiple layers. Facebook uses
machine learning to suggest friends, pages, groups, and
events based on your activities, interests, and preferences.
Additionally, it employs machine learning to detect and
remove harmful content, such as hate speech,
misinformation, and spam. Amazon, on the other hand,
utilizes machine learning to analyze your browsing history,
purchase history, ratings, reviews, and other factors to
suggest products that may interest or benefit you. In
healthcare it is used to detect cancer, diabetes, heart
disease, and other conditions from medical images, blood
tests, and other data sources. It can also monitor patient
health, predict outcomes, and suggest optimal treatments
and many more. Types of learning include supervised,
unsupervised, reinforcement, self-supervised, and semi-
supervised.
Supervised learning
Supervised learning uses labeled data sets to train
algorithms to classify data or predict outcomes accurately.
Examples include using labeled data of dogs and cats to train a model to classify them, sentiment analysis, hospital readmission prediction, and spam email filtering.
Figure 7.1: Plot fitting number of hours studied and test score
In Figure 7.1, the data (dots) points represent the actual
values of the number of hours studied and the test score for
each student and the red line represents the fitted linear
regression model that predicts the test score based on the
number of hours studied. Figure 7.1 shows that the line fits
the data well and that the student's test score increases by
almost five points for every hour they study. The line also
predicts that if students did not study at all, their score
would be around 45.
Linear regression
Linear regression uses linear models to predict the target
variable based on the input characteristics. A linear model
is a mathematical function that assumes a linear
relationship between the variables, meaning that the output
can be expressed as a weighted sum of the inputs plus a
constant term. For example, a linear model used to predict the price of a house based on its size and location can be represented as follows:
price = w1 *size + w2*location + b
Where w1 and w2 are the weights or coefficients that
measure the influence of each feature on the price, and b is
the bias or intercept that represents the base price.
Before moving to the tutorials let us look at the syntax for
implementing linear regression with sklearn, which is as
follows:
1. # Import linear regression
2. from sklearn.linear_model import LinearRegression
3. # Create a linear regression model
4. linear_regression = LinearRegression()
5. # Train the model
6. linear_regression.fit(X_train, y_train)
Tutorial 7.2: To implement and illustrate the concept of
linear regression models to fit a model to predict house
price based on size and location as in the example above, is
as follows:
1. # Import the sklearn linear regression library
2. import sklearn.linear_model as lm
3. # Create some fake data
4. x = [[50, 1], [60, 2], [70, 3], [80, 4], [90, 5]] # Size and location of the houses
5. y = [100, 120, 140, 160, 180] # Price of the houses
6. # Create a linear regression model
7. model = lm.LinearRegression()
8. # Fit the model to the data
9. model.fit(x, y)
10. # Print the intercept (b) and the slope (w1 and w2)
11. print(f"Intercept: {model.intercept_}") # b
12. print(f"Coefficient/Slope: {model.coef_}") # w1 and w2
13. # Predict the price of a house with size 75 and location 3
14. print(f"Prediction: {model.predict([[75, 3]])}") # y
Output:
1. Intercept: 0.7920792079206933
2. Coefficient/Slope: [1.98019802 0.1980198 ]
3. Prediction: [149.9009901]
Now let us see how the above fitted house price prediction model looks in a plot.
Tutorial 7.3: To visualize the fitted line in Tutorial 7.2 and
the data points in a scatter plot, is as follows:
1. import matplotlib.pyplot as plt
2. # Extract the x and y values from the data
3. x_values = [row[0] for row in x]
4. y_values = y
5. # Plot the data points as a scatter plot
6. plt.scatter(x_values, y_values, color="blue", label="Data points")
7. # Plot the fitted line as a line plot
8. plt.plot(x_values, model.predict(x), color="red", label="Fitted linear regression model")
9. # Add some labels and a legend
10. plt.xlabel("Size of the house")
11. plt.ylabel("Price of the house")
12. plt.legend()
13. plt.savefig('fitting_models_to_independent_data.jpg', dpi=600, bbox_inches='tight') # Save the figure
14. plt.show() # Show the plot
Output:
Logistic regression
Logistic regression is a type of statistical model that
estimates the probability of an event occurring based on a
given set of independent variables. It is often used for
classification and predictive analytics, such as predicting
whether an email is spam or not, or whether a customer will
default on a loan or not. Logistic regression predicts the
probability of an event or outcome using a set of predictor
variables based on the concept of a logistic (sigmoid)
function mapping a linear combination into a probability
score between 0 and 1. Here, the predicted probability can
be used to classify the observation into one of the categories
by choosing a cutoff value. For example, if the probability is
greater than 0.5, the observation is classified as a success,
otherwise it is classified as a failure.
A simple example of logistic regression is predicting whether a student will pass an exam based on the number of hours they studied. Suppose we have the following data:
Hours studied    0.5  1  1.5  2  2.5  3  3.5  4  4.5  5
Passed           0    0  0    0  0    1  1    1  1    1
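A minimal sketch of how such a model could be fitted with scikit-learn on the data above is shown below (an illustrative sketch; the exact tutorial code behind Figure 7.6 may differ):
1. # A minimal sketch (illustrative): fit a logistic regression model to the hours-studied data above
2. import numpy as np
3. from sklearn.linear_model import LogisticRegression
4. # Hours studied (feature) and pass/fail outcome (target)
5. hours = np.array([0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]).reshape(-1, 1)
6. passed = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
7. # Fit the logistic regression model
8. model = LogisticRegression()
9. model.fit(hours, passed)
10. # Predicted probability of passing after 2.75 hours of study
11. print("P(pass | 2.75 hours):", model.predict_proba([[2.75]])[0][1])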
Figure 7.6. Plot of fitted logistic regression model for prediction of student
score
Figure 7.6. shows that the probability of passing the final
exam increases as the number of hours studied increases,
and that the logistic regression curve captures this trend
well.
Decision tree
Decision tree is a way of making decisions based on some
data, they are used for both classification and regression
problems. It looks like a tree with branches and leaves.
Each branch represents a choice or a condition, and each
leaf represents an outcome or a result. For example,
suppose you want to decide whether to play tennis based on the weather: if the weather is nice and sunny, you want to play tennis; if not, you do not.
The decision tree works by starting with the root node,
which is the top node. The root node asks a question about
the data, such as Is it sunny? If the answer is yes, follow
the branch to the right. If the answer is no, you follow the
branch to the left. You keep doing this until you reach a leaf
node that tells you the final decision, such as Play tennis or
Do not play tennis.
Before moving to the tutorials let us look at the syntax for
implementing decision tree with sklearn, which is as
follows:
1. # Import decision tree
2. from sklearn.tree import DecisionTreeClassifier
3. # Create a decision tree classifier
4. tree = DecisionTreeClassifier()
5. # Train the classifier
6. tree.fit(X_train, y_train)
Tutorial 7.10: To implement a decision tree algorithm on
patient data to classify the blood pressure of 20 patients
into low, normal, high is as follows:
1. import pandas as pd
2. from sklearn.tree import DecisionTreeClassifier
3. # Read the data
4. data = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter7/patient_data.csv")
5. # Separate the features and the target
6. X = data.drop("blood_pressure", axis=1)
7. y = data["blood_pressure"]
8. # Encode the categorical features
9. X["gender"] = X["gender"].map({"M": 0, "F": 1})
10. # Build and train the decision tree
11. tree = DecisionTreeClassifier()
12. tree.fit(X, y)
Tutorial 7.11: To view graphical representation of the
above fitted decision tree (Tutorial 7.10), showing the
features, thresholds, impurity, and class labels at each node,
is as follows:
1. import matplotlib.pyplot as plt
2. # Import the plot_tree function from the sklearn.tree module
3. from sklearn.tree import plot_tree
4. # Plot the decision tree
5. plt.figure(figsize=(10, 8))
6. # Fill the nodes with colors, round the corners, and add feature and class names
7. plot_tree(tree, filled=True, rounded=True, feature_names=X.columns, class_names=["Low", "Normal", "High"], fontsize=12)
8. # Save the figure
9. plt.savefig('decision_tree.jpg', dpi=600, bbox_inches='tight')
10. plt.show()
Output:
Figure 7.7: Fitted decision tree plot with features, thresholds, impurity, and
class labels at each node
It is often a better idea to separate the dependent and independent variables and split the dataset into training and test sets before fitting the model. Independent data are the features or variables that are used as input to the model, and dependent data are the target or outcome that is predicted by the model. Splitting the data into training and test sets is important because it allows us to evaluate the performance of the model on unseen data and avoid overfitting or underfitting. From the split, the training set is used to fit the model and the test set is used to evaluate it.
Tutorial 7.12: To implement decision tree by including the
separation of dependent and independent variables, train
test split and then fitting data on train set, based on Tutorial
7.10 is as follows:
1. import pandas as pd
2. from sklearn.tree import DecisionTreeClassifier
3. from sklearn.model_selection import train_test_split
4. # Import the accuracy_score function
5. from sklearn.metrics import accuracy_score
6. # Read the data
7. data = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter7/patient_data.csv")
8. # Separate the features and the target
9. X = data.drop("blood_pressure", axis=1) # independent variables
10. y = data["blood_pressure"] # dependent variable
11. # Encode the categorical features
12. X["gender"] = X["gender"].map({"M": 0, "F": 1})
13. # Split the data into training and test sets
14. X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
15. # Build and train the decision tree on the training set
16. tree = DecisionTreeClassifier()
17. tree.fit(X_train, y_train)
18. # Further, the test set can be used to evaluate the model
19. # Predict the values for the test set
20. y_pred = tree.predict(X_test) # Get the predicted values for the test data
21. # Calculate the accuracy score on the test set
22. accuracy = accuracy_score(y_test, y_pred) # Compare the predicted values with the actual values
23. # Print the accuracy score
24. print("Accuracy of the decision tree model on the test set :", accuracy)
After fitting the model on the training set, to use the
remaining test set for evaluation of fitted model you need to
import the accuracy_score() from the sklearn.metrics
module. Then use the predict() of the model on the test set
to get the predicted values for the test data. Compare the
predicted values with the actual values in the test set using
the accuracy_score(), which returns a fraction of correct
predictions. Finally print the accuracy score to see how well
the model performs on the test data. More of this is discussed in the Model selection and evaluation section.
Output:
1. Accuracy of the decision tree model on the test set : 1.0
This accuracy is quite high because we only have 20 data
points in this dataset. Once we have adequate data, the
above script will present more realistic results.
Random forest
Random forest is an ensemble learning method that combines multiple decision trees to make predictions. It is accurate and robust, making it a popular choice for a variety of tasks, including classification and regression. It works by constructing a large number of decision trees at training time, where each tree is trained on a random subset of the training data and uses a random subset of the features to prevent overfitting. The forest then combines the predictions of all the trees (majority vote for classification, averaging for regression), which reduces prediction variance and improves accuracy.
For example, you have a large dataset of student data,
including information about their grades, attendance, and
extracurricular activities. As a teacher, you can use random
forest to predict which students are most likely to pass their
exams. To build a model, you would train a group of
decision trees on different subsets of your data. Each tree
would use a random subset of the features to make its
predictions. After training all of the trees, you would
average their predictions to get your final result. This is like
having a group of experts who each look at different pieces
of information about your students. Each expert is like a
decision tree, and they all make predictions about whether
each student will pass or fail. After all the experts have
made their predictions, you take an average of all the expert
answers to give you the most likely prediction for each
student.
Before moving to the tutorials let us look at the syntax for
implementing random forest classifier with sklearn, which is
as follows:
1. # Import RandomForestClassifier
2. from sklearn.ensemble import RandomForestClassifier
3. # Create a Random Forest classifier
4. rf = RandomForestClassifier()
5. # Train the classifier
6. rf.fit(X_train, y_train)
Tutorial 7.13. To implement a random forest algorithm on
patient data to classify the blood pressure of 20 patients
into low, normal, high is as follows:
1. import pandas as pd
2. from sklearn.ensemble import RandomForestClassifier
3. from sklearn.model_selection import train_test_split
4. # Read the data
5. data = pd.read_csv("/workspaces/ImplementingStatisticsWithPython/data/chapter7/patient_data.csv")
6. # Separate the features and the target
7. X = data.drop("blood_pressure", axis=1) # independent variables
8. y = data["blood_pressure"] # dependent variable
9. # Encode the categorical features
10. X["gender"] = X["gender"].map({"M": 0, "F": 1})
11. # Split the data into training and test sets
12. X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
13. # Create a Random Forest classifier
14. rf = RandomForestClassifier()
15. # Train the classifier
16. rf.fit(X_train, y_train)
Tutorial 7.14: To evaluate the random forest classifier fitted in Tutorial 7.13 on the test set, append these lines of code at the end of Tutorial 7.13:
1. from sklearn.metrics import accuracy_score
2. # Further, the test set can be used to evaluate the model
3. # Predict the values for the test set
4. y_pred = rf.predict(X_test) # Get the predicted values for the test data
5. # Calculate the accuracy score on the test set
6. accuracy = accuracy_score(y_test, y_pred) # Compare the predicted values with the actual values
7. # Print the accuracy score
8. print("Accuracy of the Random Forest classifier model on the test set :", accuracy)
K-nearest neighbor
K-Nearest Neighbor (KNN) is a machine learning
algorithm used for classification and regression. It finds the
k nearest neighbors of a new data point in the training data
and uses the majority class of those neighbors to classify the
new data point. KNN is useful when the data is not linearly
separable, meaning that there is no clear boundary between
different classes or outcomes. KNN is useful when dealing
with data that has many features or dimensions because it
makes no assumptions about the distribution or structure of
the data. However, it can be slow and memory-intensive
since it must store and compare all the training data for
each prediction.
A simpler example to explain it is, suppose you want to
predict the color of a shirt based on its size and price. The
training data consists of ten shirts, each labeled as either
red or blue. To classify a new shirt, we need to find the k
closest shirts in the training data, where k is a number
chosen by us. For example, if k = 3, we look for the 3
nearest shirts based on the difference between their size
and price. Then, we count how many shirts of each color are
among the 3 nearest neighbors, and assign the most
frequent color to the new shirt. For example, if 2 of the 3
nearest neighbors are red, and 1 is blue, we predict that the
new shirt is red.
Let us see a tutorial to predict the type of flower based on
its features, such as petal length, petal width, sepal length,
and sepal width. The training data consists of 150 flowers,
each labeled as one of three types: Iris setosa, Iris
versicolor, or Iris virginica. The number of k is chosen by us.
For instance, if k = 5, we look for the 5 nearest flowers
based on the Euclidean distance between their features. We
count the number of flowers of each type among the 5
nearest neighbors and assign the most frequent type to the
new flower. For instance, if 3 out of the 5 nearest neighbors
are Iris versicolor and 2 are Iris virginica, we predict that
the new flower is Iris versicolor.
Tutorial 7.16: To implement KNN on iris dataset to predict
the type of flower based on its features, such as petal
length, petal width, sepal length, and sepal width and also
evaluate the result, is as follows:
1. # Load the Iris dataset
2. from sklearn.datasets import load_iris
3. # Import the KNeighborsClassifier class
4. from sklearn.neighbors import KNeighborsClassifier
5. # Import train_test_split for data splitting
6. from sklearn.model_selection import train_test_split
7. # Import accuracy_score for evaluating model performance
8. from sklearn.metrics import accuracy_score
9. # Load the Iris dataset
10. iris = load_iris()
11. # Separate the features and the target variable
12. X = iris.data # Features (sepal length, sepal width, petal length, petal width)
13. y = iris.target # Target variable (species: Iris-setosa, Iris-versicolor, Iris-virginica)
14. # Encode categorical features (if any)
15. # No categorical features in the Iris dataset
16. # Split the data into training (90%) and test sets (10%)
17. X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
18. # Create a KNeighborsClassifier object
19. knn = KNeighborsClassifier(n_neighbors=5) # Set number of neighbors to 5
20. # Train the classifier
21. knn.fit(X_train, y_train)
22. # Make predictions on the test data
23. y_pred = knn.predict(X_test)
24. # Evaluate the model's performance using accuracy
25. accuracy = accuracy_score(y_test, y_pred)
26. # Print the accuracy score
27. print("Accuracy of the KNN classifier on the test set :", accuracy)
Output:
1. Accuracy of the KNN classifier on the test set : 1.0
Semi-supervised techniques
Semi-supervised learning bridges the gap between fully
supervised and unsupervised learning. It leverages both
labeled and unlabeled data to improve model performance.
Semi-supervised techniques allow us to make the most of
limited labeled data by incorporating unlabeled examples.
By combining these methods, we achieve better generalization and performance in real-world scenarios. In this chapter, we explore three essential semi-supervised techniques, namely self-training, co-training, and graph-based methods, each with a specific task or idea, along with examples that address or solve them.
Self-training: Self-training is a simple yet effective
approach. It starts with an initial model trained on the
limited labeled data available. The model then predicts
labels for the unlabeled data, and confident predictions
are added to the training set as pseudo-labeled
examples. The model is retrained using this augmented
dataset, iteratively improving its performance. Suppose
we have a sentiment analysis task with a small labeled
dataset of movie reviews. We train an initial model on
this data. Next, we apply the model to unlabeled
reviews, predict their sentiments, and add the confident
predictions to the training set. The model is retrained,
and this process continues until convergence.
Idea: Iteratively label unlabeled data using model
predictions.
Example: Train a classifier on labeled data, predict
labels for unlabeled data, and add confident
predictions to the labeled dataset.
Tutorial 7.32: To implement self-training classifier on Iris
dataset, as follows:
1. from sklearn.semi_supervised import SelfTrainingClassifier
2. from sklearn.datasets import load_iris
3. from sklearn.model_selection import train_test_split
4. from sklearn.linear_model import LogisticRegression
5. # Load the Iris dataset (labeled data)
6. X, y = load_iris(return_X_y=True)
7. # Split data into labeled and unlabeled portions
8. X_labeled, X_unlabeled, y_labeled, y_unlabeled = train_test_split(X, y, test_size=0.8, random_state=42)
9. # Initialize a base classifier (e.g., logistic regression)
10. base_classifier = LogisticRegression()
11. # Create a self-training classifier
12. self_training_clf = SelfTrainingClassifier(base_classifier)
13. # Fit the model using labeled data
14. self_training_clf.fit(X_labeled, y_labeled)
15. # Predict on unlabeled data
16. y_pred_unlabeled = self_training_clf.predict(X_unlabeled)
17. # Print the original labels for the unlabeled data
18. print("Original labels for unlabeled data:")
19. print(y_unlabeled)
20. # Print the predictions
21. print("Predictions on unlabeled data:")
22. print(y_pred_unlabeled)
Output:
1. Original labels for unlabeled data:
2. [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1
3.  0 0 0 2 1 1 0 0 1 2 2 1 2 1 2 1 0 2 1 0 0 0 1 2 0 0 0 1 0 1 2 0 1 2 0 2 2
4.  1 1 2 1 0 1 2 0 0 1 1 0 2 0 0 1 1 2 1 2 2 1 0 0 2 2 0 0 0 1 2 0 2 2 0 1 1
5.  2 1 2 0 2 1 2 1 1]
6. Predictions on unlabeled data:
7. [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1
8.  0 0 0 2 1 1 0 0 1 2 2 1 2 1 2 1 0 2 1 0 0 0 1 2 0 0 0 1 0 1 2 0 1 2 0 2 2
9.  1 1 2 1 0 1 2 0 0 1 2 0 2 0 0 2 1 2 2 2 2 1 0 0 1 2 0 0 0 1 2 0 2 2 0 1 1
10. 2 1 2 0 2 1 2 1 1]
The above output has a few wrong predictions. Now, let us look at the evaluation metrics.
Tutorial 7.33: To evaluate the trained self-training
classifier performance using appropriate metrics (e.g.,
accuracy, F1-score, etc.), as follows:
1. from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
2. # y_unlabeled contains the true labels for the unlabeled data
3. accuracy = accuracy_score(y_unlabeled, y_pred_unlabeled)
4. f1 = f1_score(y_unlabeled, y_pred_unlabeled, average='weighted')
5. precision = precision_score(y_unlabeled, y_pred_unlabeled, average='weighted')
6. recall = recall_score(y_unlabeled, y_pred_unlabeled, average='weighted')
7. print(f"Accuracy: {accuracy:.2f}")
8. print(f"F1-score: {f1:.2f}")
9. print(f"Precision: {precision:.2f}")
10. print(f"Recall: {recall:.2f}")
Output:
1. Accuracy: 0.97
2. F1-score: 0.97
3. Precision: 0.97
4. Recall: 0.97
Here, we see an accuracy of 0.97 means that approximately
97% of the predictions were correct. F1-score of 0.97
suggests a good balance between precision and recall,
where higher values indicate better performance. A
precision of 0.97 means that 97% of the positive predictions
were accurate. A recall of 0.97 indicates that 97% of the
positive instances were correctly identified. Further
calibration of the classifier is essential for better results.
You can fine-tune hyperparameters or use techniques like
Platt scaling or isotonic regression to improve calibration.
Co-training: Co-training leverages multiple views of the
data. It assumes that different features or
representations can provide complementary information.
Two or more classifiers are trained independently on
different subsets of features or views. During training,
they exchange their confident predictions on unlabeled
data, reinforcing each other’s learning. Consider a text
classification problem where we have both textual
content and associated metadata, for example, author,
genre. We train one classifier on the text and another on
the metadata. They exchange predictions on unlabeled
data, improving their performance collectively.
Idea: Train multiple models on different views of
data and combine their predictions.
Example: Train one model on text features and
another on image features, then combine their
predictions for a joint task.
Tutorial 7.34: To show an easy implementation of co-training with two views of data, on the UCImultifeature dataset from mvlearn.datasets, is as follows:
1. from mvlearn.semi_supervised import CTClassifier
2. from mvlearn.datasets import load_UCImultifeature
3. from sklearn.linear_model import LogisticRegression
4. from sklearn.ensemble import RandomForestClassifier
5. from sklearn.model_selection import train_test_split
6. data, labels = load_UCImultifeature(select_labeled=[0, 1])
7. X1 = data[0] # Text view
8. X2 = data[1] # Metadata view
9. X1_train, X1_test, X2_train, X2_test, l_train, l_test = train_test_split(X1, X2, labels)
10. # Co-training with two views of data and 2 estimator types
11. estimator1 = LogisticRegression()
12. estimator2 = RandomForestClassifier()
13. ctc = CTClassifier(estimator1, estimator2, random_state=1)
14. # Use different matrices for each view
15. ctc = ctc.fit([X1_train, X2_train], l_train)
16. preds = ctc.predict([X1_test, X2_test])
17. print("Accuracy: ", sum(preds == l_test) / len(preds))
This code snippet illustrates the application of co-training, a
semi-supervised learning technique, using the CTClassifier
from mvlearn.semi_supervised. Initially, a multi-view
dataset is loaded, focusing on two specified classes. The
dataset is divided into two views: text and metadata.
Following this, the data is split into training and testing
sets. Two distinct classifiers, logistic regression and random
forest, are instantiated. These classifiers are then
incorporated into the CTClassifier. After training on the
training data from both views, the model predicts labels for
the test data. Finally, the accuracy of the co-training model
on the test data is computed and displayed. The output will display the accuracy of the model.
Graph-based methods: Graph-based methods exploit
the inherent structure in the data. They construct a
graph where nodes represent instances (labeled and
unlabeled), and edges encode similarity or relationships.
Label propagation or graph-based regularization is then
used to propagate labels across the graph, benefiting
from both labeled and unlabeled data. In a
recommendation system, users and items can be
represented as nodes in a graph. Labeled interactions
(e.g., user-item ratings) provide initial labels. Unlabeled
interactions contribute to label propagation, enhancing
recommendations as follows:
Idea: Leverage data connectivity (e.g., graph
Laplacians) for label propagation.
Example: Construct a graph where nodes represent
data points, and edges represent similarity.
Propagate labels across the graph.
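A minimal sketch of graph-based label propagation, using scikit-learn's LabelPropagation on the Iris data with most labels hidden (an illustrative sketch, not from the book's code bundle), is as follows:
1. # A minimal sketch (illustrative): graph-based label propagation on Iris
2. import numpy as np
3. from sklearn.datasets import load_iris
4. from sklearn.semi_supervised import LabelPropagation
5. X, y = load_iris(return_X_y=True)
6. # Hide 80% of the labels by marking them as -1 (the convention for "unlabeled")
7. rng = np.random.RandomState(42)
8. y_partial = y.copy()
9. unlabeled_mask = rng.rand(len(y)) < 0.8
10. y_partial[unlabeled_mask] = -1
11. # Build a similarity graph over all points and propagate the known labels
12. label_prop = LabelPropagation()
13. label_prop.fit(X, y_partial)
14. # Compare the propagated labels with the true labels of the hidden points
15. accuracy = (label_prop.transduction_[unlabeled_mask] == y[unlabeled_mask]).mean()
16. print("Accuracy on originally unlabeled points:", accuracy)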
Self-supervised techniques
Self-supervised learning techniques empower models to
learn from unlabeled data, reducing the reliance on
expensive labeled datasets. These methods exploit inherent
structures within the data itself to create meaningful
training signals. In this chapter, we delve into three
essential self-supervised techniques: word
embeddings, masked language models, and language
models.
Word embeddings: A word embedding is a
representation of a word as a real-valued vector. These
vectors encode semantic meaning, allowing similar
words to be close in vector space. Word embeddings are
crucial for various Natural Language Processing
(NLP) tasks. They can be obtained using techniques like
neural networks, dimensionality reduction, and
probabilistic models. For
instance, Word2Vec and GloVe are popular methods for
generating word embeddings. Let us consider an
example, suppose we have a corpus of text. Word
embeddings capture relationships between words. For
instance, the vectors for king and queen should be
similar because they share a semantic relationship.
Idea: Pretrained word representations.
Use: Initializing downstream models, for example
natural language processing tasks.
Tutorial 7.35: To implement word embeddings as a self-supervised task using the Word2Vec method, is as follows:
1. # Install Gensim and import Word2Vec for word embeddings
2. import gensim
3. from gensim.models import Word2Vec
4. # Example sentences
5. sentences = [
6.     ["I", "love", "deep", "learning"],
7.     ["deep", "learning", "is", "fun"],
8.     ["machine", "learning", "is", "easy"],
9.     ["deep", "learning", "is", "hard"],
10.     # Add more sentences, the embeddings change with new words...
11. ]
12. # Train the Word2Vec model
13. model = Word2Vec(sentences, vector_size=10, window=5, min_count=1, sg=1)
14. # Get the word embeddings
15. word_vectors = model.wv
16. # Example: Get the embedding for each word in the sentence "I love deep learning"
17. print("Embedding for 'I':", word_vectors["I"])
18. print("Embedding for 'love':", word_vectors["love"])
19. print("Embedding for 'deep':", word_vectors["deep"])
20. print("Embedding for 'learning':", word_vectors["learning"])
Output:
1. Embedding for 'I': [-0.00856557 0.02826563 0.05401429 0.07052656 -0.05703121 0.0185882
2. 0.06088864 -0.04798051 -0.03107261 0.0679763 ]
3. Embedding for 'love': [ 0.05455794 0.08345953 -0.01453741 -0.09208143 0.04370552 0.00571785
4. 0.07441908 -0.00813283 -0.02638414 -0.08753009]
5. Embedding for 'deep': [ 0.07311766 0.05070262 0.06757693 0.00762866 0.06350891 -0.03405366
6. -0.00946401 0.05768573 -0.07521638 -0.03936104]
7. Embedding for 'learning': [-0.00536227 0.00236431 0.0510335 0.09009273 -0.0930295 -0.07116809
8. 0.06458873 0.08972988 -0.05015428 -0.03763372]
Masked Language Models (MLM): MLM is a powerful self-supervised technique used by models like Bidirectional Encoder Representations from Transformers (BERT). In MLM, some tokens in an input sequence are masked, and the model learns to predict these masked tokens based on context. It considers both preceding and following tokens, making it bidirectional. Given the sentence: The cat sat on the [MASK]. The model predicts the masked token, which could be mat, chair, or any other valid word based on context (a small sketch of this appears after this list), as follows:
Idea: Bidirectional pretrained language representations.
Use: Full downstream model initialization for various language understanding tasks.
Language models: A language model is a
probabilistic model of natural language. It estimates
the likelihood of a sequence of words. Large language
models, such as GPT-4 and ELMo, combine neural
networks and transformers. They have superseded
earlier models like n-gram language models. These
models are useful for various NLP tasks, including
speech recognition, machine translation, and
information retrieval. Imagine a language model
trained on a large corpus of text. Given a partial
sentence, it predicts the most likely next word. For
instance, if the input is The sun is shining, the
model might predict brightly as follows:
Idea: Autoregressive (typically left-to-right) pretrained language representations.
Use: Full downstream model initialization for tasks like text classification and sentiment analysis.
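As an illustrative sketch of masked-token prediction (assuming the Hugging Face transformers package is installed and the bert-base-uncased model can be downloaded; this is not part of the book's code bundle), the masked sentence above can be completed as follows:
1. # A minimal sketch (illustrative): masked-token prediction with a pretrained BERT model
2. from transformers import pipeline
3. # Load a fill-mask pipeline backed by BERT (downloads the model on first use)
4. unmasker = pipeline("fill-mask", model="bert-base-uncased")
5. # Ask the model to predict the masked token
6. predictions = unmasker("The cat sat on the [MASK].")
7. # Print the top predicted tokens and their scores
8. for pred in predictions:
9.     print(pred["token_str"], round(pred["score"], 4))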
Conclusion
In this chapter, we explored the basics and applications of
statistical machine learning. Supervised machine learning is
a powerful and versatile tool for data analysis and AI for
labeled data. Knowing the type of problem, whether
supervised or unsupervised, solves half the learning
problems; the next step is to implement different models
and algorithms. Once this is done, it is critical to evaluate
and compare the performance of different models using
techniques such as cross-validation, bias-variance trade-off,
and learning curves. Some of the best known and most
commonly used supervised machine learning techniques
have been demonstrated. These techniques include decision
trees, random forests, support vector machines, K-nearest
neighbors, and linear and logistic regression. We have also discussed semi-supervised and self-supervised learning, and the techniques for implementing them. We have also mentioned the
advantages and disadvantages of each approach, as well as
some of the difficulties and unanswered questions in the
field of machine learning.
Chapter 8, Unsupervised Machine Learning explores the
other type of statistical machine learning, unsupervised
machine learning.
CHAPTER 8
Unsupervised Machine
Learning
Introduction
Unsupervised learning is a key area within statistical
machine learning that focuses on uncovering patterns and
structures in unlabelled data. This includes techniques like
clustering, dimensionality reduction, and generative
modelling. Given that most real-world data is unstructured,
extensive preprocessing is often required to transform it
into a usable format, as discussed in previous chapters. The
abundance of unstructured and unlabelled data makes
unsupervised learning increasingly valuable. Unlike
supervised learning, which relies on labelled examples and
predefined target variables, unsupervised learning
operates without such guidance. It can group similar items
together, much like sorting a collection of coloured marbles
into distinct clusters, or reduce complex datasets into
simpler forms through dimensionality reduction, all without
sacrificing important information. Evaluating the
performance and generalization in unsupervised learning
also requires different metrics compared to supervised
learning.
Structure
In this chapter, we will discuss the following topics:
Unsupervised learning
Model selection and evaluation
Objectives
The objective of this chapter is to introduce unsupervised machine learning and ways to evaluate a trained unsupervised model, with real-world examples and tutorials to explain and demonstrate the implementation.
Unsupervised learning
Unsupervised learning is a machine learning technique
where algorithms are trained on unlabeled data without
human guidance. The data has no predefined categories or
labels and the goal is to discover patterns and hidden
structures. Unsupervised learning works by finding
similarities or differences in the data and grouping them
into clusters or categories. For example, an unsupervised
algorithm can analyze a collection of images and sort them
by color, shape, or size. This is useful when there is a lot of data and labeling it is difficult. For example, imagine
you have a bag of 20 candies with various colors and
shapes. You wish to categorize them into different groups,
but you are unsure of the number of groups or their
appearance. Unsupervised learning can help find the
optimal way to sort or group items.
Another example uses the iris dataset without the flower type labels. Suppose you take data on 100 flowers with different features, such as petal length, petal width, sepal length, and sepal width. You want to group the flowers into different types, but you do not know how many types there are or what they look like. You can use unsupervised learning to find the optimal number of clusters and assign each flower to one of them. You can use any unsupervised learning algorithm, for example the K-means algorithm for clustering, which is described in the K-means section. The algorithm randomly chooses K points as the centers of the clusters and then assigns each flower to the nearest center. Then, it updates the centers by taking the average of the features of the flowers in each cluster. It repeats this process until the clusters are stable and no more changes occur.
There are many unsupervised learning algorithms; the most common ones are described in this chapter. Unsupervised learning models are used for three main tasks: clustering, association, and dimensionality reduction. Table 8.1 summarizes these tasks:
Algorithm                      Task                       Description
Principal component analysis   Dimensionality reduction   Finds a lower-dimensional representation of data while preserving as much information as possible.
K-means
K-means clustering is an iterative algorithm that divides
data points into a predefined number of clusters. It works
by first randomly selecting K centroids, one for each
cluster. It then assigns each data point to the nearest
centroid. The centroids are then updated to be the average
of the data points in their respective clusters. This process
is repeated until the centroids no longer change. It is used
to cluster numerical data. It is often used in marketing to
segment customers, in finance to detect fraud and in data
mining to discover hidden patterns in data.
For example, K-means can be applied here. Imagine you
have a shopping cart dataset of items purchased by
customers. You want to group customers into clusters
based on the items they tend to buy together.
Before moving to the tutorials let us look at the syntax for
implementing K-means with sklearn, which is as follows:
1. from sklearn.cluster import KMeans
2. # Load the dataset
3. data = ...
4. # Create and fit the k-
means model, n_clusters can be any number of clusters
5. kmeans = KMeans(n_clusters=...)
6. kmeans.fit(data)
Tutorial 8.1: To implement K-means clustering using
sklearn on sample data, is as follows:
1. from sklearn.cluster import KMeans
2. # Load the dataset
3. data = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7]]
4. # Create and fit the k-means model
5. kmeans = KMeans(n_clusters=3)
6. kmeans.fit(data)
7. # Predict the cluster labels for each data point
8. labels = kmeans.predict(data)
9. print(f"Clusters labels for data: {labels}")
Following is an output which shows the respective cluster label for each of the above six data points (the exact labels may vary between runs because the initial centroids are chosen randomly):
1. Clusters labels for data: [1 1 2 2 0 0]
K-prototype
K-prototype clustering is a generalization of K-means clustering that allows for mixed clusters with both numerical and categorical data. It works by first randomly selecting K centroids, just like K-means. It then assigns each data point to the nearest centroid. The centroids are then updated using the mean of the numerical features and the mode of the categorical features of the data points in their respective clusters. This process is repeated until the centroids no longer change. It is used for clustering data that has both numerical and categorical characteristics, and it can also be used for textual data once the text is encoded as categorical features.
For example, K-prototype can be applied here. Imagine you
have a social media dataset of users and their posts. You
want to group users into clusters based on both their
demographic information (e.g., age, gender) and their
posting behavior (e.g., topics discussed, sentiment).
Before moving to the tutorials, let us look at the syntax for implementing K-prototype with the kmodes package, which is as follows:
1. from kmodes.kprototypes import KPrototypes
2. # Load the dataset
3. data = ...
4. # Create and fit the k-prototypes model
5. kproto = KPrototypes(n_clusters=3, init='Cao')
6. kproto.fit(data, categorical=[0, 1])
Tutorial 8.2: To implement K-prototype clustering with the kmodes package on sample data that has two numerical features and one categorical feature, is as follows:
1. import numpy as np
2. from kmodes.kprototypes import KPrototypes
3. # Load the dataset: columns 0 and 1 are numerical, column 2 is categorical
4. data = [[1, 2, 'A'], [2, 3, 'B'], [3, 4, 'A'], [4, 5, 'B'], [5, 6, 'B'], [6, 7, 'A']]
5. # Convert the data to a NumPy object array so the numerical columns keep their type
6. data = np.array(data, dtype=object)
7. # Define the number of clusters
8. num_clusters = 3
9. # Create the k-prototypes model
10. kprototypes = KPrototypes(n_clusters=num_clusters, init='random')
11. # Fit the model, marking column 2 as categorical
12. kprototypes.fit(data, categorical=[2])
13. # Predict the cluster labels for each data point
14. labels = kprototypes.predict(data, categorical=[2])
15. print(f"Clusters labels for data: {labels}")
Output (the exact labels may vary between runs because of the random initialization):
1. Clusters labels for data: [2 0 2 1 0 2]
Hierarchical clustering
Hierarchical clustering is an algorithm that creates a tree-
like structure of clusters by merging or splitting groups of
data points. There are two main types of hierarchical
clustering, that is, agglomerative and divisive.
Agglomerative hierarchical clustering starts with each data
point in its own cluster and then merges clusters until the
desired number of clusters is reached. On the other hand,
divisive hierarchical clustering starts with all data points in
a single cluster and then splits clusters until the desired
number of clusters is reached. It is a versatile algorithm that can cluster any type of data. It is often used in social network analysis to identify communities and in data mining to discover hierarchical relationships in data.
For example, hierarchical clustering can be applied here.
Imagine you have a network of people connected by
friendship ties. You want to group people into clusters
based on the strength of their ties.
Before moving to the tutorials let us look at the syntax for
implementing hierarchical clustering with sklearn, which is
as follows:
1. from sklearn.cluster import AgglomerativeClustering
2. # Load the dataset
3. data = ...
4. # Create and fit the hierarchical clustering model
5. hier = AgglomerativeClustering(n_clusters=3)
6. hier.fit(data)
Tutorial 8.3: To implement hierarchical clustering using
sklearn on sample data, is as follows:
1. from sklearn.cluster import AgglomerativeClustering
2. # Load the dataset
3. data = [[1, 1], [1, 2], [2, 2], [2, 3], [3, 3], [3, 4]]
4. # Create and fit the hierarchical clustering model
5. cluster = AgglomerativeClustering(n_clusters=3)
6. cluster.fit(data)
7. # Predict the cluster labels for each data point
8. labels = cluster.labels_
9. print(f"Clusters labels for data: {labels}")
Output:
1. Clusters labels for data: [2 0 2 1 0 2]
DBSCAN
Density-Based Spatial Clustering of Applications with
Noise (DBSCAN) is a density-based clustering algorithm
that identifies groups of data points that are densely
packed together. It works by identifying core points, which
are points that have a minimum number of neighbors
within a specified radius. These core points form the basis
of clusters and other points are assigned to clusters based
on their proximity to core points. It is useful when the
number of clusters is unknown. Commonly used for data
that is not well-separated, particularly in computer vision,
natural language processing, and social network analysis.
For example, DBSCAN can be applied here. Imagine you
have a dataset of customer locations. You want to group
customers into clusters based on their proximity to each
other.
Before moving to the tutorials let us look at the syntax for
implementing DBSCAN with sklearn, which is as follows:
1. from sklearn.cluster import DBSCAN
2. # Load the dataset
3. data = ...
4. # Create and fit the DBSCAN model
5. dbscan = DBSCAN(eps=0.5, min_samples=5)
6. dbscan.fit(data)
Tutorial 8.7: To implement DBSCAN using sklearn on generated sample data, is as follows:
1. import numpy as np
2. from sklearn.cluster import DBSCAN
3. from sklearn.datasets import make_moons
4. # Generate some data
5. X, y = make_moons(n_samples=200, noise=0.1)
6. # Create a DBSCAN clusterer
7. dbscan = DBSCAN(eps=0.3, min_samples=10)
8. # Fit the DBSCAN clusterer to the data
9. dbscan.fit(X)
10. # Predict the cluster labels for each data point
11. labels = dbscan.labels_
12. print(f"Clusters labels for data: {labels}")
Output (one label per generated point; the exact values vary because the sample data is generated randomly):
1. Clusters labels for data: [0 0 1 0 1 0 0 1 1 1 0 0 1 ... 0 1 0 0 0 1 1 0 0 0 0 1 1 1 1]
Apriori
Apriori is a frequent itemset mining algorithm that
identifies frequent item sets in transactional datasets. It
works by iteratively finding item sets that meet a minimum
support threshold. It is often used in market basket
analysis to identify patterns in customer behavior. It can
also be used in other domains, such as recommender
systems and fraud detection. For example, apriori can be
applied here. Imagine you have a dataset of customer
transactions. You want to identify common patterns of
items that customers tend to buy together.
Before moving to the tutorials let us look at the syntax for
implementing Apriori with apyori package, which is as
follows:
1. from apyori import apriori
2. # Load the dataset
3. data = ...
4. # Create and fit the apriori model
5. rules = apriori(data, min_support=0.01, min_confidence
=0.5)
Tutorial 8.9: To implement Apriori to find all the frequently bought items from a grocery item dataset, is as follows:
1. import pandas as pd
2. from apyori import apriori
3. # Load the dataset
4. data = pd.read_csv(
5. '/workspaces/ImplementingStatisticsWithPython/dat
a/chapter7/Groceries.csv')
6. # Reshape the data from wide to long format
7. data = pd.melt(data, id_vars='Channel',
8. var_name='Product', value_name='Quantity')
9. # Group the data by customer and aggregate the produ
ct categories into a list
10. data = data.groupby('Channel')['Product'].apply(list)
11. # Convert the data into a list of lists
12. data = data.tolist()
13. # Create the apriori model
14. rules = apriori(data, min_support=0.00003)
15. # Print the rules
16. for rule in rules:
17.     print(list(rule.items))
Tutorial 8.9 output will display the items in each frequent
item set as a list.
Tutorial 8.10: To implement Apriori and view only the first five frequent items from a grocery item dataset, is as follows:
1. import pandas as pd
2. from apyori import apriori
3. # Load the dataset
4. data = pd.read_csv(
5. '/workspaces/ImplementingStatisticsWithPython/dat
a/chapter7/Groceries.csv')
6. # Reshape the data from wide to long format
7. data = pd.melt(data, id_vars='Channel',
8. var_name='Product', value_name='Quantity')
9. # Group the data by customer and aggregate the produ
ct categories into a list
10. data = data.groupby('Channel')['Product'].apply(list)
11. # Convert the data into a list of lists
12. data = data.tolist()
13. # Create the apriori model
14. rules = apriori(data, min_support=0.00003)
15. # Print the rules and the first 5 elements
16. rules = list(rules)
17. rules = rules[:5]
18. for rule in rules:
19. for item in rule.items:
20. print(item)
Output:
1. Delicassen
2. Detergents_Paper
3. Fresh
4. Frozen
5. Grocery
Tutorial 8.11: To implement Apriori and view all frequent itemsets with the support value of each itemset from the grocery item dataset, is as follows:
1. import pandas as pd
2. from apyori import apriori
3. # Load the dataset
4. data = pd.read_csv(
5. '/workspaces/ImplementingStatisticsWithPython/dat
a/chapter7/Groceries.csv')
6. # Reshape the data from wide to long format
7. data = pd.melt(data, id_vars='Channel',
8. var_name='Product', value_name='Quantity')
9. # Group the data by customer and aggregate the produ
ct categories into a list
10. data = data.groupby('Channel')['Product'].apply(list)
11. # Convert the data into a list of lists
12. data = data.tolist()
13. # Create the apriori model
14. rules = apriori(data, min_support=0.00003)
15. # Print the rules
16. for rule in rules:
17. # Join the items in the itemset with a comma
18. itemset = ", ".join(rule.items)
19. # Get the support value of the itemset
20. support = rule.support
21. # Print the itemset and the support in one line
22. print("{}: {}".format(itemset, support))
Eclat
Eclat is a frequent itemset mining algorithm similar to
Apriori, but more efficient for large datasets. It works by
using a vertical data format to represent transactions. It is
also used in market basket analysis to identify patterns in
customer behavior. It can also be used in other areas such
as recommender systems and fraud detection. For example,
Eclat can be applied here. Imagine you have a dataset of
customer transactions. You want to identify frequent item
sets in transactional datasets efficiently.
Tutorial 8.12: To implement frequent itemset mining with Eclat using a sample dataset of transactions, is as follows:
1. # Define a function to convert the data from horizontal
to vertical format
2. def horizontal_to_vertical(data):
3. # Initialize an empty dictionary to store the vertical fo
rmat
4. vertical = {}
5. # Loop through each transaction in the data
6. for i, transaction in enumerate(data):
7. # Loop through each item in the transaction
8. for item in transaction:
9. # If the item is already in the dictionary, append the
transaction ID to its value
10. if item in vertical:
11. vertical[item].append(i)
12. # Otherwise, create a new key-
value pair with the item and the transaction ID
13. else:
14. vertical[item] = [i]
15. # Return the vertical format
16. return vertical
17. # Define a function to generate frequent item sets using
the ECLAT algorithm
18. def eclat(data, min_support):
19. # Convert the data to vertical format
20. vertical = horizontal_to_vertical(data)
21. # Initialize an empty list to store the frequent item sets
22. frequent = []
23. # Initialize an empty list to store the candidates
24. candidates = []
25. # Loop through each item in the vertical format
26. for item in vertical:
27. # Get the support count of the item by taking the leng
th of its value
28. support = len(vertical[item])
29. # If the support count is greater than or equal to the
minimum support, add the item to the frequent list and t
he candidates list
30. if support >= min_support:
31. frequent.append((item, support))
32. candidates.append((item, vertical[item]))
33. # Loop until there are no more candidates
34. while candidates:
35. # Initialize an empty list to store the new candidates
36. new_candidates = []
37. # Loop through each pair of candidates
38. for i in range(len(candidates) - 1):
39. for j in range(i + 1, len(candidates)):
40. # Get the first item set and its transaction IDs fro
m the first candidate
41. itemset1, tidset1 = candidates[i]
42. # Get the second item set and its transaction IDs fr
om the second candidate
43. itemset2, tidset2 = candidates[j]
44. # If the item sets have the same prefix, they can be
combined
45. if itemset1[:-1] == itemset2[:-1]:
46. # Combine the item sets by adding the last eleme
nt of the second item set to the first item set
47. new_itemset = itemset1 + itemset2[-1]
48. # Intersect the transaction IDs to get the support
count of the new item set
49. new_tidset = list(set(tidset1) & set(tidset2))
50. new_support = len(new_tidset)
51. # If the support count is greater than or equal to t
he minimum support, add the new item set to the freque
nt list and the new candidates list
52. if new_support >= min_support:
53. frequent.append((new_itemset, new_support))
54. new_candidates.append((new_itemset, new_tids
et))
55. # Update the candidates list with the new candidates
56. candidates = new_candidates
57. # Return the frequent item sets
58. return frequent
59. # Define a sample data set of transactions
60. data = [
61. ["A", "B", "C", "D"],
62. ["A", "C", "E"],
63. ["A", "B", "C", "E"],
64. ["B", "C", "D"],
65. ["A", "B", "C", "D", "E"]
66. ]
67. # Define a minimum support value
68. min_support = 3
69. # Call the eclat function with the data and the minimum
support
70. frequent = eclat(data, min_support)
71. # Print the frequent item sets and their support counts
72. for itemset, support in frequent:
73. print(itemset, support)
Output:
1. A 4
2. B 4
3. C 5
4. D 3
5. E 3
6. AB 3
7. AC 4
8. AE 3
9. BC 4
10. BD 3
11. CD 3
12. CE 3
13. ABC 3
14. ACE 3
15. BCD 3
FP-Growth
FP-Growth is a frequent itemset mining algorithm based
on the FP-tree data structure. It works by recursively
partitioning the dataset into smaller subsets and then
identifying frequent item sets in each subset. FP-Growth is
a popular association rule mining algorithm that is often
used in market basket analysis to identify patterns in
customer behavior. It is also used in recommendation
systems and fraud detection. For example, FP-Growth can
be applied here. Imagine you have a dataset of customer
transactions. You want to identify frequent item sets in
transactional datasets efficiently using a pattern growth
approach.
Before moving to the tutorials let us look at the syntax for
implementing FP-Growth with
mlxtend.frequent_patterns, which is as follows:
1. from mlxtend.frequent_patterns import fpgrowth
2. # Load the dataset
3. data = ...
4. # Create and fit the FP-Growth model
5. patterns = fpgrowth(data, min_support=0.01, use_colna
mes=True)
Tutorial 8.13: To implement frequent itemset mining using FP-Growth from mlxtend.frequent_patterns, is as follows:
1. import pandas as pd
2. # Import fpgrowth function from mlxtend library for fre
quent pattern mining
3. from mlxtend.frequent_patterns import fpgrowth
4. # Import TransactionEncoder class from mlxtend librar
y for encoding data
5. from mlxtend.preprocessing import TransactionEncoder
6. # Define a list of transactions, each transaction is a list
of items
7. data = [["A", "B", "C", "D"],
8. ["A", "C", "E"],
9. ["A", "B", "C", "E"],
10. ["B", "C", "D"],
11. ["A", "B", "C", "D", "E"]]
12. # Create an instance of TransactionEncoder
13. te = TransactionEncoder()
14. # Fit and transform the data to get a boolean matrix
15. te_ary = te.fit(data).transform(data)
16. # Convert the matrix to a pandas dataframe with colum
n names as items
17. df = pd.DataFrame(te_ary, columns=te.columns_)
18. # Apply fpgrowth algorithm on the dataframe with a mi
nimum support of 0.8
19. # and return the frequent itemsets with their correspon
ding support values
20. fpgrowth(df, min_support=0.8, use_colnames=True)
Output:
1. support itemsets
2. 0 1.0 (C)
3. 1 0.8 (B)
4. 2 0.8 (A)
5. 3 0.8 (B, C)
6. 4 0.8 (A, C)
Figure 8.2 and the SI, CI, DI, RI scores show that
agglomerative clustering performs better than K-means on
the iris dataset according to all four metrics. Agglomerative
clustering has a higher SI score, which means that the
clusters are more cohesive and well separated. It also has a
lower DI, which means that the clusters are more distinct
and less overlapping. In addition, agglomerative clustering
has a higher CI score, which means that the clusters have a
higher ratio of inter-cluster variance to intra-cluster
variance. Finally, agglomerative clustering has a higher RI,
which means that the predicted labels are more consistent
with the true labels. Therefore, agglomerative clustering is a
better model choice for this data.
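The code behind Figure 8.2 is not reproduced here, but the following is a minimal sketch, under the assumption that SI, CI, DI, and RI refer to the silhouette index, the Calinski-Harabasz index, the Davies-Bouldin index, and the (adjusted) Rand index, of how such scores can be computed with sklearn when comparing K-means and agglomerative clustering on the iris dataset:
1. from sklearn.datasets import load_iris
2. from sklearn.cluster import KMeans, AgglomerativeClustering
3. from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score, adjusted_rand_score
4. X, y_true = load_iris(return_X_y=True)
5. models = {"K-means": KMeans(n_clusters=3, n_init=10, random_state=0), "Agglomerative": AgglomerativeClustering(n_clusters=3)}
6. for name, model in models.items():
7.     labels = model.fit_predict(X)
8.     print(name)
9.     print("  SI (silhouette, higher is better):", round(silhouette_score(X, labels), 3))
10.     print("  CI (Calinski-Harabasz, higher is better):", round(calinski_harabasz_score(X, labels), 1))
11.     print("  DI (Davies-Bouldin, lower is better):", round(davies_bouldin_score(X, labels), 3))
12.     print("  RI (adjusted Rand vs. true labels):", round(adjusted_rand_score(y_true, labels), 3))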
Conclusion
In this chapter, we explored unsupervised learning and
algorithms for uncovering hidden patterns and structures
within unlabeled data. We delved into prominent clustering
algorithms like K-means, K-prototype, and hierarchical
clustering, along with probabilistic approaches like
Gaussian mixture models. Additionally, we covered
dimensionality reduction techniques like PCA and SVD for
simplifying complex datasets. This knowledge lays a
foundation for further exploration of unsupervised
learning's vast potential in various domains. From
customer segmentation and anomaly detection to image
compression and recommendation systems, unsupervised
learning plays a vital role in unlocking valuable insights
from unlabeled data.
We hope that this chapter has helped you understand and
apply the concepts and methods of statistical machine
learning, and that you are motivated and inspired to learn
more and apply these techniques to your own data and
problems.
The next chapter, Chapter 9, Linear Algebra, Nonparametric Statistics, and Time Series Analysis, explores time series data, linear algebra, and nonparametric statistics.
CHAPTER 9
Linear Algebra, Nonparametric Statistics, and Time Series Analysis
Introduction
This chapter explores the essential mathematical
foundations, statistical techniques, and methods for
analyzing time-dependent data. We will cover three
interconnected topics: linear algebra, nonparametric
statistics, and time series analysis, incorporating survival
analysis. The journey begins with linear algebra, where we
will unravel key concepts such as linear functions, vectors,
and matrices, providing a solid framework for
understanding complex data structures. Nonparametric
statistics will enable us to analyze data without the
restrictive assumptions of parametric models. We will
explore techniques like rank-based tests and kernel density
estimation, which offer flexibility in analyzing a wide range
of data types.
Time series data, prevalent in diverse areas such as stock
prices, weather patterns, and heart rate variability, will be
examined with a focus on trend and seasonality analysis. In
the realm of survival analysis, where life events such as
disease progression, customer churn, or equipment failure
are unpredictable, we will delve into the analysis of time-to-
event data. We will demystify techniques such as Kaplan-
Meier estimators, making survival analysis accessible and
understandable. Throughout the chapter, each concept will
be illustrated with practical examples and real-world
applications, providing a hands-on guide for
implementation.
Structure
In this chapter, we will discuss the following topics:
Linear algebra
Nonparametric statistics
Survival analysis
Time series analysis
Objectives
This chapter provides the reader with the necessary tools, insight, and theoretical understanding to implement linear algebra, nonparametric statistics, and time series analysis techniques with Python. By the last page, you will be equipped to tackle complex data challenges and interpret results on these topics with clarity.
Linear algebra
Linear algebra is a branch of mathematics that focuses on
the study of vectors, vector spaces and linear
transformations. It deals with linear equations, linear
functions and their representations through matrices and
determinants.
Let us understand vectors, linear function and matrices in
linear algebra.
Following is the explanation of vectors:
Vectors: Vectors are a fundamental concept in linear
algebra as they represent quantities that have both
magnitude and direction. Examples of such quantities
include velocity, force and displacement. In statistics,
vectors organize data points. Each data point can be
represented as a vector, where each component
corresponds to a specific feature or variable.
Tutorial 9.1: To create a 2D vector with NumPy and
display, is as follows:
1. import numpy as np
2. # Create a 2D vector
3. v = np.array([3, 4])
4. # Access individual components
5. x, y = v
6. # Calculate magnitude (Euclidean norm) of the vecto
r
7. magnitude = np.linalg.norm(v)
8. print(f"Vector v: {v}")
9. print(f"Components: x = {x}, y = {y}")
10. print(f"Magnitude: {magnitude:.2f}")
Output:
1. Vector v: [3 4]
2. Components: x = 3, y = 4
3. Magnitude: 5.00
Linear function: A linear function is represented by the
equation f(x) = ax + b, where a and b are constants. They
model relationships between variables. For example,
linear regression shows how a dependent variable
changes linearly with respect to an independent variable.
Tutorial 9.2: To create a simple linear function, f(x) =
2x + 3 and plot it, is as follows:
1. import matplotlib.pyplot as plt
2. import numpy as np
3. # Define a linear function: f(x) = 2x + 3
4. def linear_function(x):
5. return 2 * x + 3
6. # Generate x values
7. x_values = np.linspace(-5, 5, 100)
8. # Calculate corresponding y values
9. y_values = linear_function(x_values)
10. # Plot the linear function
11. plt.plot(x_values, y_values, label="f(x) = 2x + 3")
12. plt.xlabel("x")
13. plt.ylabel("f(x)")
14. plt.title("Linear Function")
15. plt.grid(True)
16. plt.legend()
17. plt.savefig("linearfunction.jpg",dpi=600,bbox_inches
='tight')
18. plt.show()
Output:
It plots the f(x) = 2x + 3 as shown in Figure 9.1:
Figure 9.1: Plot of a linear function
Matrices: Matrices are rectangular arrays of numbers
that are commonly used to represent systems of linear
equations and transformations. In statistics, matrices
are used to organize data, where rows correspond to
observations and columns represent variables. For
example, a dataset with height, weight, and age can be
represented as a matrix.
Tutorial 9.3: To create a matrix (rectangular array) of
numbers with NumPy and transpose it, as follows:
1. import numpy as np
2. # Create a 2x3 matrix
3. A = np.array([[1, 2, 3],
4. [4, 5, 6]])
5. # Access individual elements
6. element_23 = A[1, 2]
7. # Transpose the matrix
8. A_transposed = A.T
9. print(f"Matrix A:\n{A}")
10. print(f"Element at row 2, column 3: {element_23}")
11. print(f"Transposed matrix A:\n{A_transposed}")
Output:
1. Matrix A:
2. [[1 2 3]
3. [4 5 6]]
4. Element at row 2, column 3: 6
5. Transposed matrix A:
6. [[1 4]
7. [2 5]
8. [3 6]]
Linear algebra models and analyzes relationships between variables, aiding our comprehension of how changes in one variable affect another. Its further applications include cryptography (to create solid encryption techniques), regression analysis, dimensionality reduction, and solving systems of linear equations. We discussed linear regression earlier in Chapter 7, Statistical Machine Learning. For example, imagine we want to predict a
person’s weight based on their height. We collect data from
several individuals and record their heights (in inches) and
weights (in pounds). Linear regression allows us to create a
straight line (a linear model) that best fits the data points
(height and weight). Using this method, we can predict
someone’s weight based on their height using the linear
equation. The use and implementation of linear algebra in
statistics is shown in the following tutorials:
Tutorial 9.4: To illustrate the use of linear algebra, solve a
linear system of equations using the linear algebra
submodule of SciPy, is as follows:
1. import numpy as np
2. # Import the linear algebra submodule of SciPy and assig
n it the alias "la"
3. import scipy.linalg as la
4. A = np.array([[1, 2], [3, 4]])
5. b = np.array([3, 17])
6. # Solving a linear system of equations
7. x = la.solve(A, b)
8. print(f"Solution x: {x}")
9. print(f"Check if A @ x equals b: {np.allclose(A @ x, b)}")
Output:
1. Solution x: [11. -4.]
2. Check if A @ x equals b: True
Tutorial 9.5: To illustrate the use of linear algebra in
statistics to compare performance, solving vs. inverting for
linear systems, using SciPy, is as follows:
1. import numpy as np
2. import scipy.linalg as la
3. A1 = np.random.random((1000, 1000))
4. b1 = np.random.random(1000)
5. # Uses %timeit magic command to measure the executio
n time of la.solve(A1, b1) and la.solve solves linear equat
ions
6. solve_time = %timeit -o la.solve(A1, b1)
7. # Measures the time for solving by first inverting A1 usi
ng la.inv(A1) and then multiplying the inverse with b1.
8. inv_time = %timeit -o la.inv(A1) @ b1
9. # Prints the best execution time for the la.solve method in seconds
10. print(f"Solve time: {solve_time.best:.2f} s")
11. # Prints the best execution time for the inversion method in seconds
12. print(f"Inversion time: {inv_time.best:.2f} s")
Output:
1. 31.3 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs,
10 loops each)
2. 112 ms ± 4.51 ms per loop (mean ± std. dev. of 7 runs, 1
0 loops each)
3. Solve time: 0.03 s
4. Inversion time: 0.11 s
Tutorial 9.6: To illustrate the use of linear algebra in
statistics to perform basic matrix properties, using the
linear algebra submodule of SciPy, is as follows:
1. import numpy as np
2. import scipy.linalg as la
3. # Create a complex matrix C
4. C = np.array([[1, 2 + 3j], [3 - 2j, 4]])
5. # Print the conjugate of C (element-
wise complex conjugate)
6. print(f"Conjugate of C:\n{C.conjugate()}")
7. # Print the trace of C (sum of diagonal elements)
8. print(f"Trace of C: {np.diag(C).sum()}")
9. # Print the matrix rank of C (number of linearly indepen
dent rows/columns)
10. print(f"Matrix rank of C: {np.linalg.matrix_rank(C)}")
11. # Print the Frobenius norm of C (square root of sum of s
quared elements)
12. print(f"Frobenius norm of C: {la.norm(C, None)}")
13. # Print the largest singular value of C (largest eigenvalu
e of C*C.conjugate())
14. print(f"Largest singular value of C: {la.norm(C, 2)}")
15. # Print the smallest singular value of C (smallest eigenv
alue of C*C.conjugate())
16. print(f"Smallest singular value of C: {la.norm(C, -2)}")
Output:
1. Conjugate of C:
2. [[1.-0.j 2.-3.j]
3. [3.+2.j 4.-0.j]]
4. Trace of C: (5+0j)
5. Matrix rank of C: 2
6. Frobenius norm of C: 6.557438524302
7. Largest singular value of C: 6.389028023601217
8. Smallest singular value of C: 1.4765909770949925
Tutorial 9.7: To illustrate the use of linear algebra in
statistics to compute the least squares solution in a square
matrix, using the linear algebra submodule of SciPy, is as
follows:
1. import numpy as np
2. import scipy.linalg as la
3. # Define a square matrix A1 and vector b1
4. A1 = np.array([[1, 2], [2, 4]])
5. b1 = np.array([3, 17])
6. # Attempt to solve the system of equations A1x = b1 usi
ng la.solve
7. try:
8. x = la.solve(A1, b1)
9. print(f"Solution using la.solve: {x}") # Print solution if
successful
10. except la.LinAlgError as e: # Catch potential error if ma
trix is singular
11. print(f"Error using la.solve: {e}") # Print error messa
ge
12. # Compute the least-squares solution
13. x, residuals, rank, s = la.lstsq(A1, b1)
14. print(f"Least-squares solution x: {x}")
Output:
1. Error using la.solve: Matrix is singular.
2. Least-squares solution x: [1.48 2.96]
Tutorial 9.8: To illustrate the use of linear algebra in
statistics to compute the least squares solution of a random
matrix, using the linear algebra submodule of SciPy, is as
follows:
1. import numpy as np
2. import scipy.linalg as la
3. import matplotlib.pyplot as plt
4. A2 = np.random.random((10, 3))
5. b2 = np.random.random(10)
6. #Computing least square from random matrix
7. x, residuals, rank, s = la.lstsq(A2, b2)
8. print(f"Least-squares solution for random A2: {x}")
Output:
1. Least-
squares solution for random A2: [0.34430232 0.5421179
6 0.18343947]
Tutorial 9.9: To illustrate the implementation of linear
regression to predict car prices based on historical data, is
as follows:
1. import numpy as np
2. from scipy import linalg
3. # Sample data: car prices (in thousands of dollars) and f
eatures
4. prices = np.array([20, 25, 30, 35, 40])
5. features = np.array([[2000, 150],
6. [2500, 180],
7. [2800, 200],
8. [3200, 220],
9. [3500, 240]])
10. # Fit a linear regression model
11. coefficients, residuals, rank, singular_values = linalg.lsts
q(features, prices)
12. # Predict price for a new car with features [3000, 170]
13. new_features = np.array([3000, 170])
14. # Calculate predicted price using the dot product of the
new features and their corresponding coefficients
15. predicted_price = np.dot(new_features, coefficients)
16. print(f"Predicted price: ${predicted_price:.2f}k")
Output:
1. Predicted price: $41.60k
Nonparametric statistics
Nonparametric statistics is a branch of statistics that does
not rely on specific assumptions about the underlying
probability distribution. Unlike parametric statistics, which
assume that data follow a particular distribution (such as
the normal distribution), nonparametric methods are more
flexible and work well with different types of data.
Nonparametric statistics make inferences without assuming
a particular distribution. They often use ordinal data (based
on rankings) rather than numerical values. As mentioned
unlike parametric methods, nonparametric statistics do not
estimate specific parameters (such as mean or variance) but
focus on the overall distribution.
Let us understand nonparametric statistics and its use
through an example of clinical trial rating, as follows:
Clinical trial rating: Imagine that a researcher is
conducting a clinical trial to evaluate the effectiveness of
a new pain medication. Participants are asked to rate
their treatment experience on a scale of one to five
(where one is very poor and five is excellent). The data
collected consist of ordinal ratings, not continuous
numerical values. These ratings are inherently
nonparametric because they do not follow a specific
distribution.
To analyze the treatment’s impact, the researcher can
apply nonparametric statistical tests like the Wilcoxon
signed-rank test. Wilcoxon signed-rank test is a
statistical method used to compare paired data,
specifically when you want to assess whether there is a
significant difference between two related groups. It
compares the median ratings before and after treatment
and does not assume a normal distribution and is
suitable for paired data.
Hypotheses:
Null hypothesis (H₀): The median rating before
treatment is equal to the median rating after
treatment.
Alternative hypothesis (H₁): The median rating
differs before and after treatment.
If the p-value from the test is small (typically less than
0.05), we reject the null hypothesis, indicating a
significant difference in treatment experience.
This example shows that nonparametric methods allow us to
make valid statistical inferences without relying on specific
distributional assumptions. They are particularly useful
when dealing with ordinal data or situations where
parametric assumptions may not hold.
Tutorial 9.10: To illustrate the use of nonparametric statistics to compare treatment ratings (ordinal data), we collect ratings before and after a new drug and want to know whether the drug improves the patient's experience, as follows:
1. import numpy as np
2. from scipy.stats import wilcoxon
3. # Example data (ratings on a scale of 1 to 5)
4. before_treatment = [3, 4, 2, 3, 4]
5. after_treatment = [4, 5, 3, 4, 5]
6. # Null Hypothesis (H₀): The median treatment rating befo
re the new drug is equal to the median rating after the dr
ug.
7. # Alternative Hypothesis (H₁): The median rating differs
before and after the drug.
8. # Perform Wilcoxon signed-rank test
9. statistic, p_value = wilcoxon(before_treatment, after_tre
atment)
10. if p_value < 0.05:
11. print("P-value:", p_value)
12. print("P-
value is less than 0.05, so reject the null hypothesis, we
can confidently say that the new drug led to better treat
ment experience.")
13. else:
14. print("P-value:", p_value)
15. print("No significant change")
16. print("P value is greater than or equal to 0.05, so we c
annot reject the null hypothesis and therefore cannot co
nclude that the drug had a significant effect.")
Output:
1. P-value: 0.0625
2. No significant change
3. P value is greater than or equal to 0.05, so we cannot reject the null hypothesis and therefore cannot conclude that the drug had a significant effect.
Rank-based tests
Rank-based tests compare rankings or orders of data points between groups. They include the Mann-Whitney U test (Wilcoxon rank-sum test) and the Wilcoxon signed-rank test. The Mann-Whitney U test compares two independent groups (e.g., a treatment group vs. a control group). It determines whether their distributions differ significantly and is useful when assumptions of normality are violated. The Wilcoxon signed-rank test compares paired samples (e.g., before and after treatment), as in Tutorial 9.10. It tests whether the median difference is zero and is robust to non-Gaussian data.
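Tutorial 9.10 already demonstrates the Wilcoxon signed-rank test; the following is a minimal sketch of the Mann-Whitney U test with SciPy, using made-up ratings for two independent groups:
1. from scipy.stats import mannwhitneyu
2. # Hypothetical ratings (1 to 5) from two independent groups
3. treatment_group = [4, 5, 3, 4, 5, 4, 3, 5]
4. control_group = [2, 3, 2, 3, 4, 2, 3, 3]
5. # H0: the two groups come from the same distribution
6. statistic, p_value = mannwhitneyu(treatment_group, control_group, alternative="two-sided")
7. print("U statistic:", statistic)
8. print("p-value:", round(p_value, 4))
9. if p_value < 0.05:
10.     print("Reject H0: the two groups differ significantly.")
11. else:
12.     print("Fail to reject H0: no significant difference detected.")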
Goodness-of-fit tests
Goodness-of-fit tests assess whether observed data fit a specific distribution. They include the chi-squared goodness-of-fit test, which checks whether observed frequencies match expected frequencies in different categories. Suppose you
are a data analyst working for a shop owner who claims that
an equal number of customers visit the shop each weekday.
To test this hypothesis, you record the number of customers
that come into the shop during a given week, as follows:
Days                   Monday   Tuesday   Wednesday   Thursday   Friday
Number of customers      50        60         40          47        53
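The total number of customers in the table is 250, so under the owner's claim each weekday would be expected to receive 250 / 5 = 50 customers. A minimal sketch of the chi-squared goodness-of-fit test for these counts with SciPy is as follows:
1. from scipy.stats import chisquare
2. # Observed customer counts, Monday through Friday (from the table above)
3. observed = [50, 60, 40, 47, 53]
4. # H0: customers are spread evenly across the weekdays
5. expected = [sum(observed) / len(observed)] * len(observed)
6. statistic, p_value = chisquare(f_obs=observed, f_exp=expected)
7. print("Chi-squared statistic:", round(statistic, 3))
8. print("p-value:", round(p_value, 3))
9. if p_value < 0.05:
10.     print("Reject H0: footfall differs across weekdays.")
11. else:
12.     print("Fail to reject H0: the data are consistent with equal footfall.")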
Independence tests
Independence tests determine whether two categorical variables are independent. They include the chi-squared test of independence and Kendall's tau or Spearman's rank correlation. The chi-squared test of independence examines the association between variables in a contingency table, as discussed earlier in Chapter 6, Hypothesis Testing and Significance Tests. Kendall's tau and Spearman's rank correlation assess the correlation between ranked variables. Suppose two basketball coaches rank 12 players from worst to best. The rankings assigned by each coach are as follows:
Players Coach #1 Rank Coach #2 Rank
A 1 2
B 2 1
C 3 3
D 4 5
E 5 4
F 6 6
G 7 8
H 8 7
I 9 9
J 10 11
K 11 10
L 12 12
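A minimal sketch of computing Spearman's rank correlation and Kendall's tau for these two sets of rankings with SciPy is as follows:
1. from scipy.stats import spearmanr, kendalltau
2. # Rankings of players A to L by the two coaches (from the table above)
3. coach1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
4. coach2 = [2, 1, 3, 5, 4, 6, 8, 7, 9, 11, 10, 12]
5. rho, p_spearman = spearmanr(coach1, coach2)
6. tau, p_kendall = kendalltau(coach1, coach2)
7. print("Spearman's rho:", round(rho, 3), "p-value:", round(p_spearman, 5))
8. print("Kendall's tau:", round(tau, 3), "p-value:", round(p_kendall, 5))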
Kruskal-Wallis test
The Kruskal-Wallis test is a nonparametric alternative to one-way ANOVA. It allows us to compare medians across multiple independent groups and generalizes the Mann-Whitney U test.
Suppose researchers want to determine if three different
fertilizers lead to different levels of plant growth. They
randomly select 30 different plants and split them into three
groups of 10, applying a different fertilizer to each group.
After one month, they measure the height of each plant.
Tutorial 9.13: To implement the Kruskal-Wallis test to
compare median heights across multiple groups, is as
follows:
1. from scipy import stats
2. # Create three arrays to hold the plant measurements fo
r each of the three groups
3. group1 = [7, 14, 14, 13, 12, 9, 6, 14, 12, 8]
4. group2 = [15, 17, 13, 15, 15, 13, 9, 12, 10, 8]
5. group3 = [6, 8, 8, 9, 5, 14, 13, 8, 10, 9]
6. # Perform Kruskal-Wallis Test
7. # Null hypothesis (H₀): The median is equal across all gr
oups.
8. # Alternative hypothesis (Hₐ): The median is not equal ac
ross all groups
9. result = stats.kruskal(group1, group2, group3)
10. print("Kruskal-
Wallis Test Statistic:", round(result.statistic, 3))
11. print("p-value:", round(result.pvalue, 3))
Output:
1. Kruskal-Wallis Test Statistic: 6.288
2. p-value: 0.043
Here, p-value is less than our chosen significance level (e.g.,
0.05), so we reject the null hypothesis. We conclude that the
type of fertilizer used leads to statistically significant
differences in plant growth.
Bootstrapping
Bootstrapping is a resampling technique used to estimate parameters or confidence intervals, such as bootstrapping the mean or median from a sample. It generates simulated samples by repeatedly drawing, with replacement, from the original dataset; each simulated sample is the same size as the original sample. By creating these simulated samples, we can explore the variability of sample statistics and make inferences about the population. It is especially useful when the population distribution is unknown or does not follow a standard form, when sample sizes are small, or when you want to estimate parameters (e.g., mean, median) or construct confidence intervals.
For example, imagine we have a dataset of exam scores
(sampled from an unknown population). We resample the
exam scores with replacement to create bootstrap samples.
We want to estimate the mean exam score and create a
bootstrapped confidence interval. The bootstrapped mean
provides an estimate of the population mean. The
confidence interval captures the uncertainty around this
estimate.
Tutorial 9.14: To implement nonparametric statistical
method bootstrapping to bootstrap the mean or median
from a sample, is as follows:
1. import numpy as np
2. # Example dataset (exam scores)
3. scores = np.array([78, 85, 92, 88, 95, 80, 91, 84, 89, 87]
)
4. # Number of bootstrap iterations
5. # The bootstrapping process is repeated 10,000 times (1
0,000 iterations is somewhat arbitrary).
6. # Allowing us to explore the variability of the statistic (m
ean in this case). And construct confidence intervals.
7. n_iterations = 10_000
8. # Initialize an array to store bootstrapped means
9. bootstrapped_means = np.empty(n_iterations)
10. # Perform bootstrapping
11. for i in range(n_iterations):
12. bootstrap_sample = np.random.choice(scores, size=le
n(scores), replace=True)
13. bootstrapped_means[i] = np.mean(bootstrap_sample)
14. # Calculate the bootstrap means of all bootstrapped sam
ples from the main exam score data set
15. print(f"Bootstrapped Mean: {np.mean(bootstrapped_mea
ns):.2f}")
16. # Calculate the 95% confidence interval
17. lower_bound = np.percentile(bootstrapped_means, 2.5)
18. upper_bound = np.percentile(bootstrapped_means, 97.5)
19. print(f"95% Confidence Interval: [{lower_bound:.2f}, {up
per_bound:.2f}]")
Output:
1. Bootstrapped Mean: 86.89
2. 95% Confidence Interval: [83.80, 90.00]
This means that we expect the average exam score in the entire population (from which our sample was drawn) to be around 86.89, and we are 95% confident that the true population mean falls within the interval [83.80, 90.00]. The exact numbers vary slightly from run to run because the resampling is random.
Other nonparametric methods include Kernel Density Estimation (KDE), a nonparametric way to estimate probability density functions (the probability distribution of a continuous random variable) that is useful for visualizing data distributions. Survival analysis is also nonparametric in that it estimates survival probabilities without making strong assumptions about the underlying distribution of event times; the Kaplan-Meier estimator is a nonparametric method used to estimate the survival function.
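Survival analysis and the Kaplan-Meier estimator are covered in the next section; as a brief illustration of KDE, the following is a minimal sketch using SciPy's gaussian_kde on made-up sample values:
1. import numpy as np
2. import matplotlib.pyplot as plt
3. from scipy.stats import gaussian_kde
4. # Made-up sample of a continuous variable
5. data = np.array([78, 85, 92, 88, 95, 80, 91, 84, 89, 87, 73, 99, 86, 90, 82])
6. # Fit a Gaussian kernel density estimate and evaluate it on a grid
7. kde = gaussian_kde(data)
8. grid = np.linspace(data.min() - 10, data.max() + 10, 200)
9. plt.plot(grid, kde(grid), label="KDE")
10. plt.hist(data, bins=6, density=True, alpha=0.3, label="Histogram")
11. plt.xlabel("Value")
12. plt.ylabel("Estimated density")
13. plt.legend()
14. plt.show()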
Survival analysis
Survival analysis is a statistical method used to analyze the
amount of time it takes for an event of interest to occur
(helping to understand the time it takes for an event to
occur). It is also known as time-to-event analysis or duration
analysis. Common applications include studying time to
death (in medical research), disease recurrence, or other
significant events. It is not limited to medicine; it can be used in various fields such as finance, engineering, and social sciences. For example, imagine a clinical trial for lung cancer patients. Researchers want to study the time until death (survival time) for patients receiving different treatments. Other examples include analyzing the time until finding a new job after unemployment, mechanical system failure, bankruptcy of a company, pregnancy, and recovery from a disease.
Kaplan-Meier estimator is one of the most widely used and
simplest methods of survival analysis. It handles censored
data, where some observations are partially observed (e.g.,
lost to follow-up). Kaplan-Meier estimation includes the
following:
Sort the data by time
Calculate the proportion of surviving patients at each
time point
Multiply the proportions to get the cumulative survival
probability
Plot the survival curve
For example, imagine that video game players are
competing in a battle video game tournament. The goal is to
use survival analysis to see which player can stay alive (not
killed) the longest.
In the context of survival analysis, data censoring is an often-encountered concept. Sometimes we do not observe the event for the entire study period, which is when censoring comes into play. In the tournament example, the organizer may have to end the game early. In this case, some players may still be alive when the final whistle blows. We know they survived at least that long, but we do not know exactly how much longer they would have lasted. This is censored data in survival analysis. Censoring comes in two types, right and left, as in the video game competition above. Right-censored data occurs when we know an event has not happened yet, but we do not know exactly when it will happen in the future. Players who were still alive in the game when the whistle blew are right-censored: we know that they survived at least that long (until the whistle blew), but their true survival time (how long they would have survived if the game had continued) is unknown. Left-censored data is the opposite of right-censored data. It occurs when we know that an event has already happened, but we do not know exactly when it happened in the past.
Tutorial 9.15: To implement the Kaplan-Meier method to
estimate the survival function (survival analysis) of a video
game player in a battling video game competition, is as
follows:
1. from lifelines import KaplanMeierFitter
2. import numpy as np
3. import matplotlib.pyplot as plt
4. # Let's create a sample dataset
5. # durations represents the time until the event (e.g., time until the player is eliminated from the game)
6. # event_observed is a boolean array that denotes if the e
vent was observed (True) or censored (False)
7. durations = [24, 18, 30, 12, 36, 15, 8, 42, 21, 6,
8. 10, 27, 33, 5, 19, 45, 28, 9, 39, 14,
9. 22, 7, 48, 31, 17, 20, 40, 25, 3, 37]
10. event_observed = [0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0,
11. 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1]
12. # Create an instance of KaplanMeierFitter
13. kmf = KaplanMeierFitter()
14. # Fit the data into the model
15. kmf.fit(durations, event_observed)
16. # Plot the survival function
17. kmf.plot_survival_function()
18. # Customize plot (optional)
19. plt.xlabel('Time')
20. plt.ylabel('Survival Probability')
21. plt.title('Kaplan-Meier Survival Curve')
22. plt.grid(True)
23. # Save the plot
24. plt.savefig('kaplan_meier_survival.png', dpi=600, bbox_i
nches='tight')
25. plt.show()
Output:
Figure 9.2 and Figure 9.3 show that the probability of survival decreases over time, with a steeper decline observed roughly between time points 10 and 40. This suggests that subjects are more likely to experience the event as time progresses. The KM_estimate line in Figure 9.2 represents the Kaplan-Meier survival curve, which is the estimated survival probability over time, and the shaded area is the Confidence Interval (CI). The narrower the CI, the more precise our estimate of the survival curve. If the CI widens at certain points, it indicates greater uncertainty in the survival estimate at those time intervals.
Figure 9.2: Kaplan-Meier curve showing change in probability of survival over
time
Let us see another example, suppose we want to estimate
the lifespan of patients (time until death) with certain
conditions using a sample dataset of 30 patients with their
IDs, time of observation (in months) and event status (alive
or death). Let us say we are studying patients with heart
failure. We will follow them for two years to see if they have
a heart attack during that time.
Following is our data set:
Patient A: Has a heart attack after six months (event
observed).
Patient B: Still alive after two years (right censored).
Patient C: Drops out of the study after one year (right
censored).
In this case, the way censoring works is as follows:
Patient A: We know the exact time of the event (heart
attack).
Patient B: Their data are right-censored because we did
not observe the event (heart attack) during the study.
Patient C: Also, right-censored because he dropped out
before the end of the study.
Tutorial 9.16: To implement Kaplan-Meier method to
estimate survival function (survival analysis) of the patients
with a certain condition over time, is as follows:
1. import matplotlib.pyplot as plt
2. import pandas as pd
3. # Import Kaplan Meier Fitter from the lifelines library
4. from lifelines import KaplanMeierFitter
5. # Create sample healthcare data (change names as need
ed)
6. data = pd.DataFrame({
7. # IDs from 1 to 10
8. "PatientID": range(1, 31),
9. # Time is how long a patient was followed up from the
start of the study,
10. # until the end of the study or the occurrence of the e
vent.
11. "Time": [24, 18, 30, 12, 36, 15, 8, 42, 21, 6,
12. 10, 27, 33, 5, 19, 45, 28, 9, 39, 14,
13. 22, 7, 48, 31, 17, 20, 40, 25, 3, 37],
14. # Event indicates the event status of the patient at the end of observation,
15. # whether the patient was dead or alive at the end of the study period
16. "Event": ['Alive', 'Death', 'Alive', 'Death', 'Alive', 'Alive'
, 'Death', 'Alive', 'Alive', 'Death',
17. 'Alive', 'Death', 'Alive', 'Death', 'Alive', 'Alive', 'D
eath', 'Alive', 'Alive', 'Death',
18. 'Alive', 'Death', 'Alive', 'Alive', 'Death', 'Alive', 'Al
ive', 'Death', 'Alive', 'Death']
19. })
20. # Convert Event to boolean (Event indicates occurrence
of death)
21. data["Event"] = data["Event"] == "Death"
22. # Create Kaplan-
Meier object (focus on event occurrence)
23. kmf = KaplanMeierFitter()
24. kmf.fit(data["Time"], event_observed=data["Event"])
25. # Estimate the survival probability at different points
26. time_points = range(0, max(data["Time"]) + 1)
27. survival_probability = kmf.survival_function_at_times(ti
me_points).values
28. # Plot the Kaplan-Meier curve
29. plt.step(time_points, survival_probability, where='post')
30. plt.xlabel('Time (months)')
31. plt.ylabel('Survival Probability')
32. plt.title('Kaplan-Meier Curve for Patient Survival')
33. plt.grid(True)
34. plt.savefig('Survival_Analysis2.png', dpi=600, bbox_inch
es='tight')
35. plt.show()
Output:
Figure 9.3: Kaplan-Meier curve showing change in probability of survival over
time
Following is an example of a survival analysis project: it analyzes and demonstrates patient survival after surgery on a fictitious dataset of patients who have undergone a specific type of surgery. The goal is to understand the factors that affect patient survival time after surgery, specifically: What is the overall survival rate of patients after surgery? How does survival vary with patient age? Is there a significant difference in survival between men and women?
The data includes the following columns:
Columns           Description
survival_time     Time (in days) from surgery to the event (if it occurred) or the end of the follow-up period (if censored).
Figure 9.9: Time series analysis of monthly sales to assess the impact of
seasons, holidays, and festivals
Conclusion
Finally, this chapter served as an engaging exploration of
powerful data analysis techniques like linear algebra,
nonparametric statistics, time series analysis and survival
analysis. We experienced the elegance of linear algebra, the
foundation for maneuvering complex data structures. We
embraced the liberating power of nonparametric statistics,
which allows us to analyze data without stringent
assumptions. We ventured into the realm of time series
analysis, revealing the hidden patterns in sequential data.
Finally, we delved into survival analysis, a meticulous
technique for understanding the time frames associated
with the occurrence of events. This chapter, however,
serves only as a stepping stone, providing you with the basic
knowledge to embark on a deeper exploration. The path to
data mastery requires ongoing learning and
experimentation.
Following are some suggested next steps to keep you moving forward: deepen your understanding through practice by tackling real-world problems; master the relevant software, packages, and tools; and keep learning. Chapter 10,
Generative AI and Prompt Engineering ventures into the
cutting-edge realm of GPT-4, exploring the exciting
potential of prompt engineering for statistics and data
science. We will look at how this revolutionary language
model can be used to streamline data analysis workflows
and unlock new insights from your data.
CHAPTER 10
Generative AI and Prompt
Engineering
Introduction
Generative Artificial Intelligence (AI) has emerged as
one of the most influential and beloved technologies in
recent years, particularly since the widespread accessibility
of models like ChatGPT to the general public. This powerful
technology generates diverse content based on the input it
receives, commonly referred to as prompts. As generative AI
continues to evolve, it finds applications across various
fields, driving innovation and refinement.
Researchers are actively exploring its capabilities, and
there is a growing sense that generative AI is inching
closer to achieving Artificial General Intelligence (AGI).
AGI represents the holy grail of AI, a system that can
understand, learn, and perform tasks across a wide range
of domains akin to human intelligence. The pivotal moment
in this journey was the introduction of Transformers, a
groundbreaking architecture that revolutionized natural
language processing. Generative AI, powered by
Transformers, has significantly impacted people’s lives,
from chatbots and language translation to creative writing
and content generation.
In this chapter, we will look into the intricacies of prompt
engineering—the art of crafting effective inputs to coax
desired outputs from generative models. We will explore
techniques, best practices, and real-world examples,
equipping readers with a deeper understanding of this
fascinating field.
Structure
In this chapter, we will discuss the following topics:
Generative AI
Large language model
Prompt engineering and types of prompts
Open-ended prompts vs. specific prompts
Zero-shot, one-shot, and few-shot learning
Using LLM and generative AI models
Best practices for building effective prompts
Industry-specific use cases
Objectives
By the end of this chapter, you would have learned the
concept of generative AI, prompt engineering techniques,
ways to access generative AI, and many examples of
writing prompts.
Generative AI
Generative AI is an artificially intelligent computer program that has a remarkable ability to create new content, sometimes producing fresh and original artifacts. It can generate audio, images, text, video, code, and more. It produces new things based on what it has learned from existing examples.
Now, let us look at how generative AI is built. They
leverage powerful foundation models trained on massive
datasets and then fine-tuned with complex algorithms for
specific creative tasks. Generative AI is based on four
major components: the foundation model, training data,
fine-tuning, complex mathematics, and computation. Let us
look at them in detail as follows:
Foundation models are the building blocks. Generative
AI often relies on foundation models, such as Large
Language Models (LLMs). These models are trained
on large amounts of text data, learning patterns,
context, and grammar.
Training data is a large reference database of existing
examples. Generative AIs learn from training data,
which includes everything from books and articles to
social media posts, reports, news articles, dissertations,
etc. The more diverse the data, the better they become
at generating content.
After initial training, the models undergo fine-tuning.
Fine-tuning customizes them for specific tasks. For
example, GPT-4 can be fine-tuned to generate
conversational responses or to write poetry.
Building these models involves complex mathematics
and requires massive computing power. However, at
their core, they are essentially predictive algorithms.
Understanding generative AI
Generative AI works from a prompt: you provide a
question, phrase, or topic, and the model uses the patterns
it learned from training data to generate an answer. It does
not just regurgitate existing content; it creates something
new. The two main approaches used by generative AI are
Generative Adversarial Networks (GANs) and autoregressive
models:
GANs: Imagine two AI models competing against each
other. One, the generator, tries to generate realistic
data (images, text, etc.), while the other, the
discriminator, tries to distinguish the generated data
from real data. Through this continuous competition,
the generator learns to produce increasingly realistic
output.
Autoregressive models: These models analyze
sequences of data, such as sentences or image pixels.
They predict the next element in the sequence based on
the previous ones. This builds a probabilistic
understanding of how the data is structured, allowing
the model to generate entirely new sequences that
adhere to the learned patterns (a minimal sketch
follows this list).
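To make the autoregressive idea concrete, the following is a minimal sketch, assuming only a toy corpus invented for illustration. It is a simple bigram counter, not the Transformer architecture that modern LLMs actually use, but it shows how each next element can be predicted from the previous one:

import random
from collections import defaultdict, Counter

# Toy corpus invented for illustration; real models learn from vastly larger datasets.
corpus = "the moon rises over the lake . the lake reflects the moon ."
tokens = corpus.split()

# Count how often each word follows each other word (a bigram model).
followers = defaultdict(Counter)
for current_word, next_word in zip(tokens, tokens[1:]):
    followers[current_word][next_word] += 1

def generate(start="the", length=8):
    # Autoregressively sample each next word from the learned counts.
    word, output = start, [start]
    for _ in range(length):
        candidates = followers.get(word)
        if not candidates:
            break
        words, weights = zip(*candidates.items())
        word = random.choices(words, weights=weights)[0]
        output.append(word)
    return " ".join(output)

print(generate())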
Beyond foundational approaches such as GANs and
autoregressive models, generative AI also relies on several
key mechanisms that enable it to process and generate
sophisticated outputs. Behind the scenes, generative AI
performs embedding and uses attention mechanisms. These
two critical components are described as follows:
Embedding: Complex data such as text or images is
converted into numerical representations. Each word or
pixel is assigned a vector capturing its characteristics
and relationships to other elements, which allows the
model to process and manipulate the data efficiently (a
small numeric sketch follows this list).
Attention mechanisms: In text-based models,
attention allows the AI to focus on specific parts of the
input sequence when generating output. Imagine
reading a sentence; you pay more attention to relevant
words for comprehension. Similarly, the model
prioritizes critical elements within the input prompt to
create a coherent response.
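To make the embedding idea concrete, here is a small numeric sketch. The three-dimensional vectors below are invented for illustration (real embedding models use hundreds of dimensions); the point is only that once words become vectors, the model can measure how related they are:

import numpy as np

# Invented 3-dimensional embeddings; real models use hundreds of dimensions.
embeddings = {
    "moon":  np.array([0.90, 0.10, 0.30]),
    "lunar": np.array([0.85, 0.15, 0.35]),
    "bank":  np.array([0.10, 0.90, 0.50]),
}

def cosine_similarity(a, b):
    # Similarity of direction between two vectors, from -1 (opposite) to 1 (identical).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related words sit close together in the embedding space.
print(cosine_similarity(embeddings["moon"], embeddings["lunar"]))  # close to 1
print(cosine_similarity(embeddings["moon"], embeddings["bank"]))   # noticeably lower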
While understanding generative AI is crucial, it is equally
important to keep a human in the loop. Even though
generative AI can produce impressive results, it is not
perfect, so human involvement remains essential to ensure
the reliability and ethical use of AI systems. Validation
means that AI-generated content requires human evaluation
to ensure accuracy, factuality, and lack of bias. Control
means that humans define the training data and prompts
that guide the AI's direction and output style.
One-shot
One-shot learning is used to deal with limited labeled data
and is ideal for scenarios where labeled examples are
scarce, for example, training a model with only one example
per class, such as recognizing rare species or ancient
scripts. In one-shot prompting, a model is expected to
understand and complete a task (such as writing a poem)
based on a single prompt, without needing additional
examples or instructions. Now, let us look at a few
examples as follows:
Example 1:
Prompt: Write a short poem about the moon.
Technique: A single input prompt is given to generate
content.
Example 2:
Prompt: Describe a serene lakeside scene.
Technique: The model is given a one-shot description
(i.e., a vivid scene) in the prompt.
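As one possible way to send such a one-shot prompt programmatically, the sketch below uses the openai Python client, one of the access routes discussed later in this chapter. The model name gpt-4o-mini is an illustrative assumption, and the code assumes an OPENAI_API_KEY environment variable is set:

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# One-shot: a single prompt, no additional examples or instructions.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name; any chat-capable model works
    messages=[{"role": "user", "content": "Write a short poem about the moon."}],
)
print(response.choices[0].message.content)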
Few-shot
Few-shot learning lets a model learn from very few labeled
samples, which makes it useful for bridging the gap between
one-shot learning and traditional supervised learning. For
example, it addresses tasks such as medical diagnosis
with minimal patient data or personalized
recommendations. Now, let us look at a few examples:
Example 1:
Prompt: Continue the story: Once upon a time, in a
forgotten forest
Technique: Few-shot prompting allows the model to
build on a partial narrative.
Example 2:
Prompt: List three benefits of meditation.
Technique: Few-shot information retrieval. The model
provides relevant points based on limited context.
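A common way to express few-shot prompting in code is to place a handful of labelled examples inside the prompt before the new input. The sketch below is a minimal example of that pattern; the sentiment-classification task, the example reviews, and the model name are all invented for illustration:

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# A few labelled examples precede the new input the model must complete.
few_shot_prompt = (
    "Classify the sentiment of each review as Positive or Negative.\n"
    "Review: 'Loved every chapter.' -> Positive\n"
    "Review: 'Dull and repetitive.' -> Negative\n"
    "Review: 'The examples made the ideas click.' ->"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": few_shot_prompt}],
)
print(response.choices[0].message.content)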
Chain-of-thought
Chain-of-Thought (CoT) encourages models to maintain
coherent thought processes across multiple responses. It is
useful for generating longer, contextually connected
outputs. For example, crafting multi-turn dialogues or
essay-like responses. Now, let us look at a few examples as
follows:
Example 1:
Prompt: Write a paragraph about the changing
seasons.
Technique: Chain of thought involves generating
coherent content by building upon previous sentences.
Here, writing about the changing seasons means
keeping the previous season in mind.
Example 2:
Prompt: Discuss the impact of technology on
human relationships.
Technique: Chain of thought essay. The model
elaborates on the topic step by step.
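A simple way to elicit chain-of-thought behaviour is to ask the model explicitly to reason step by step before it answers. The sketch below follows that pattern; the word problem and the model name are illustrative assumptions:

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Ask the model to show its intermediate reasoning before the final answer.
cot_prompt = (
    "A clinic sees 12 patients per hour and is open 6 hours a day. "
    "How many patients does it see in a 5-day week? "
    "Let's think step by step, then state the final answer."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)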
Self-consistency
Self-consistency prompting is a technique used to ensure
that a model's responses are coherent and consistent with
its previous answers. This method plays a crucial role in
preventing the generation of contradictory or nonsensical
information, especially in tasks that require logical
reasoning or factual accuracy. The goal is to make sure
that the model's output follows a clear line of thought and
maintains internal harmony. For instance, when performing
fact-checking or engaging in complex reasoning, it's vital
that the model doesn't contradict itself within a single
response or across multiple responses. By applying self-
consistency prompting, the model is guided to maintain
logical coherence, ensuring that all parts of the response
are in agreement and that the conclusions drawn are based
on accurate and consistent information. This is particularly
important in scenarios where accuracy and reliability are
key, such as in medical diagnostics, legal assessments, or
research. Now, let us look at a few examples as follows:
Example 1:
Prompt: Create a fictional character named Gita
and describe her personality.
Technique: Self-consistency will ensure coherence
within the generated content.
Example 2:
Prompt: Write a dialogue between two friends
discussing their dreams.
Technique: Self-consistent conversation. The model
has to maintain character consistency throughout.
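Self-consistency is often implemented by sampling several independent answers at a non-zero temperature and keeping the answer that appears most often. The sketch below shows one such arrangement under the same illustrative assumptions as the earlier snippets (model name, API key in the environment); the arithmetic question is invented:

from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

question = (
    "A pharmacy has 240 tablets and dispenses 8 tablets per patient. "
    "How many patients can it serve? Reply with the number only."
)

# Sample several answers independently, then keep the most frequent one.
answers = []
for _ in range(5):
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model name
        messages=[{"role": "user", "content": question}],
        temperature=0.8,       # non-zero temperature so samples can differ
    )
    answers.append(response.choices[0].message.content.strip())

best_answer, votes = Counter(answers).most_common(1)[0]
print(f"Self-consistent answer: {best_answer} ({votes}/5 samples agree)")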
Generated knowledge
Generated knowledge prompting encourages models to
generate novel information. It is useful for creative writing,
brainstorming, or expanding existing knowledge. For
example, crafting imaginative stories, inventing fictional
worlds, or suggesting innovative ideas. Since this is an area
of keen interest for many researchers, ongoing efforts aim
to make models better at generating reliable knowledge.
Now, let us look at a few examples as follows:
Example 1:
Prompt: Explain the concept of quantum
entanglement.
Technique: Generated knowledge provides accurate
information.
Example 2:
Prompt: Describe the process of photosynthesis.
Technique: Generated knowledge. The model produces
an accurate scientific explanation.
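Generated knowledge prompting is often carried out in two passes: first ask the model to generate relevant background facts, then feed those facts back in alongside the real question. The sketch below is one such arrangement, with the same illustrative model name and API-key assumption as before:

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set
MODEL = "gpt-4o-mini"  # illustrative model name

# Pass 1: ask the model to generate background knowledge first.
facts = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user",
               "content": "List three key facts about photosynthesis."}],
).choices[0].message.content

# Pass 2: reuse the generated knowledge when answering the actual question.
answer = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user",
               "content": "Using these facts:\n" + facts +
                          "\n\nDescribe the process of photosynthesis in one paragraph."}],
).choices[0].message.content

print(answer)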
Conclusion
The field of generative AI, driven by LLMs, is at the
forefront of technological innovation. Its impact is
reverberating across multiple domains, simplifying tasks,
and enhancing human productivity. From chatbots that
engage in natural conversations to content generation that
sparks creativity, generative AI has become an
indispensable ally. However, this journey is not without its
challenges. The occasional hallucination, where models
produce plausible-sounding but incorrect or nonsensical
results, the need for alignment with human values, and
ethical considerations all demand our attention. These
hurdles are stepping stones to progress.
Imagine a future where generative AI seamlessly assists us,
a friendly collaborator that creates personalized emails,
generates creative writing, and solves complex problems. It
is more than a tool; it is a companion on our digital journey.
This chapter serves as a starting point: an invitation to
explore further. Go deeper, experiment, and shape the
future. Curiosity will be your guide as you navigate this
ever-evolving landscape. Generative AI awaits your
ingenuity, and together, we will create harmonious
technology that serves humanity.
In the final chapter, Chapter 11, Data Science in Action: Real-World
Statistical Applications, we explore two key projects. The
first applies data science to banking data, revealing
insights that inform financial decisions. The second focuses
on health data, using statistical analysis to enhance patient
care and outcomes. These real-world applications will
demonstrate how data science is transforming industries
and improving lives.
Introduction
As we reach the climax of the book, this final chapter serves
as a practical bridge between theoretical knowledge and
real-world applications. Throughout this book, we have
moved from the basics of statistical concepts to advanced
techniques. In this chapter, we want to solidify your
understanding by applying the principles you have learned
to real-world projects. We will delve into two
comprehensive case studies: one focused on banking data
and the other on healthcare data. These projects are
designed not only to reinforce the concepts covered in
earlier chapters but also to challenge you to use your
analytical skills to solve complex problems and generate
actionable insights. By implementing the statistical methods
and data science techniques discussed in this book, you will
see how data visualization, exploratory analysis, inferential
statistics, and machine learning come together to solve real-
world problems. This hands-on approach will help you
appreciate the power of statistics in data science and
prepare you to apply these skills in your future endeavors,
whether in academia or industry. The final chapter puts
theory into practice, ensuring that you leave with both the
knowledge and the confidence to tackle statistical data
science projects on your own.
Structure
In this chapter, we will discuss the following topics:
Project I: Implementing data science and statistical
analysis on banking data
Project II: Implementing data science and statistical
analysis on health data
Objectives
This chapter aims to demonstrate the practical
implementation of data science and statistical concepts
using synthetic banking and health data, generated
exclusively for this book, as case studies. By analyzing
these datasets, we will illustrate how to derive meaningful
insights and make informed decisions based on statistical
inference.
Figure 11.4: Data frame with customer bank details and credit card risk
Figure 11.4 shows the data frame with a new column, credit
card risk type, which indicates each customer's risk level for
issuing credit cards.
Figure 11.7: Scatter plot to view relationship between cholesterol and glucose
level
The following code displays the summary statistics of the
selected features in the data:
# Print descriptive statistics for the selected features
display(data[features].describe())
Figure 11.8 shows that the platelets variable has a wide
range of values, with a minimum of 150 and a maximum of
400. This suggests considerable variation in platelet counts
within the dataset, which may be important for
understanding potential health outcomes.
Figure 11.8: Summary statistics of selected features
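To double-check the range reported for a single variable directly, a short aggregation such as the one below can be used; the column name platelets is assumed to match the health dataset used in this project:

# Confirm the spread of one feature; 'platelets' is the assumed column name.
print(data["platelets"].agg(["min", "max", "mean", "std"]))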
Conclusion
This chapter provided a hands-on experience in the
practical application of data science and statistical analysis
in two critical sectors: banking and healthcare. Using
synthetic data, the chapter demonstrated how the theories,
methods, and techniques covered throughout the book can
be skillfully applied to real-world contexts. However, the use
of statistics, data science, and Python programming extends
far beyond these examples. In banking, additional
applications include fraud detection and risk assessment,
customer segmentation, and forecasting. In healthcare,
applications extend to predictive modelling for patient
outcomes, disease surveillance and public health
management, and improving operational efficiency in
healthcare systems.
Despite these advances, the real-world use of data requires
careful consideration of ethical, privacy, and security issues,
which are paramount and must always be carefully
addressed. In addition, the success of statistical applications
is highly dependent on the quality and granularity of the
data, making data quality and management equally critical.
With ongoing technological advancements and regulatory
changes, there is a constant need to learn and adapt new
methodologies and tools. This dynamic nature of data
science requires practitioners to remain current and flexible
to effectively navigate the evolving landscape.
B
Bidirectional Encoder Representations from Transformers (BERT) 253
binary coding 84, 85
binomial distribution 151
binom.interval function 176
bivariate analysis 26, 27
bivariate data 26, 27
body mass index (BMI) 96, 213
Bokeh 92
bootstrapping 289, 293
C
Canonical Correlation Analysis (CCA) 30
Chain-of-Thought (CoT) 318
chi-square test 118-120, 210
clinical trial rating 287
cluster analysis 29
collection methods 33
Comma Separated Value (CSV) files 332
confidence interval 161, 172, 173
estimation for diabetes data 179-183
estimation in text 183-185
for differences 177-179
for mean 175
for proportion 176, 177
confidence intervals 169, 170
types 170, 171
contingency coefficient 124
continuous data 13
continuous probability distributions 148
convolutional neural networks (CNNs) 138
correlation 117, 138, 139
negative correlation 138, 139
positive correlation 138
co-training 251
covariance 116, 117, 136-138
Cramer's V 120-123
cumulative frequency 106
D
data 5
qualitative data 6-8
quantitative data 8
data aggregation 50
mean 50, 51
median 51, 52
mode 52, 53
quantiles 55
standard deviation 54
variance 53, 54
data binning 72-77
data cleaning
duplicates 42, 43
imputation 40, 41
missing values 39, 40
outliers 43-45
data encoding 82, 83
data frame
standardization 66
data grouping 77-79
data manipulation 45, 46
data normalization 58, 59
NumPy array 59-61
pandas data frame 61-64
data plotting 92, 93
bar chart 95, 96
dendrograms 100
graphs 100
line plot 93
pie chart 94
scatter plot 97
stacked area chart 99
violin plot 100
word cloud 100
data preparation tasks 35
cleaning 39
data quality 35-37
data science and statistical analysis, on banking data
credit card risk, analyzing 332-335
exploratory data analysis (EDA) 329-331
implementing 328, 329
predictive modeling 335-338
statistical testing 331, 332
data science and statistical analysis, on health data
exploratory data analysis 339-342
implementing 338, 339
inferential statistics 344, 345
statistical analysis 342-344
statistical machine learning 345, 346
data sources 32, 33
data standardization 58, 64, 65
data frame 66
NumPy array 66
data transformation 58, 67-70
data wrangling 45, 46
decision tree 235-238
dendrograms 100
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) 264
describe() 18
descriptive statistics 103
detect_outliers function 142
discrete data 12
discrete probability distributions 147
dtype() 17
E
Eclat 270
implementing 270
effective prompts
best practices 322, 323
Enchant 45
environment setup 2
Exploratory Data Analysis (EDA) 49
importance 50
Exploratory Factor Analysis (EFA) 30
F
factor analysis 30
feature scaling 88
few-shot learning 317
First Principal Component (PC1) 32
FP-Growth 273
implementing 273, 274
frequency distribution 106
frequency tables 106
G
Gaussian distribution 150
Gaussian Mixture Models (GMMs) 260
implementing 261
generated knowledge prompting 319
Generative Adversarial Networks (GANs) 313
generative AI models 320
Generative Artificial Intelligence (AI) 311-313
GitHub Codespaces 3
goodness-of-fit tests 289
Google Colaboratory 3
GPT-4
setting up in Python, OpenAI API used 320-322
graph-based methods 252
graphs 100
groupby() 22
groupby().sum() 23
H
hash coding 87
head() 21
hierarchical clustering 259
implementing 260
histograms 96
hypothesis testing 114, 187-190
in diabetes dataset 213-215
one-sided testing 193
performing 191-193
two-sample testing 196
two-sided testing 194, 195
I
independence tests 289, 290
independent tests 197
industry-specific use cases, LLMs 324
info() 20
integrated development environment (IDE) 2
Interquartile Range (IQR) 61
interval data 13
interval estimate 164-166
is_numeric_dtype() 19
is_string_dtype() 19
K
Kaplan-Meier estimator 295
Kaplan-Meier survival curve analysis
implementing 300-304
Kendall’s Tau 291
Kernel Density Estimation (KDE) 294
K-means clustering 257, 258
K modes 259
K-Nearest Neighbor (KNN) 242
implementing 242
K-prototype clustering 258, 259
Kruskal-Wallis test 289, 292
kurtosis 132, 133
L
label coding 83
language model 254
Large Language Model (LLM) 312, 314, 320
industry-specific use cases 324, 325
left skew 128
leptokurtic distribution 132
level of measurement 10
continuous data 13
discrete data 12
interval data 13
nominal data 10
ordinal data 11
ratio data 14, 15
linear algebra 280
using 283-286
Linear Discriminant Analysis (LDA) 64
linear function 281
Linear Mixed-Effects Models (LMMs) 233-235
linear regression 225-231
log10() function 69
logistic regression 231-233
fitting models to dependent data 233
M
machine learning (ML) 222, 223
algorithm 223
data 223
fitting models 223
inference 223
prediction 223
statistics 223
supervised learning 224
margin of error 167, 168
Masked Language Models (MLM) 253
Matplotlib 5, 50, 92
matrices 155, 282
uses 157, 158
mean 50, 51
mean deviation 113
measure of association 114-116
chi-square 118-120
contingency coefficient 124-126
correlation 116
covariance 116
Cramer's V 120-124
measure of central tendency 108, 109
measure of frequency 104
frequency tables and distribution 106
relative and cumulative frequency 106, 107
visualizing 104
measures of shape 126
skewness 126-130
measures of variability or dispersion 110-113
median 51, 52
Microsoft Azure Notebooks 3
missing data
data imputation 88-92
model selection and evaluation methods 243
evaluation metrics 243-248
multivariate analysis 28, 29
multivariate data 28, 29
multivariate regression 29
N
Natural Language Processing (NLP) 142, 252
negative skewness 128
NLTK 45
nominal data 10
nonparametric statistics 287
bootstrapping 293, 294
goodness-of-fit tests 289, 290
independence tests 290-292
Kruskal-Wallis test 292, 293
rank-based tests 289
using 288, 289
nonparametric test 198, 199
normal probability distributions 150
null hypothesis 114, 200
NumPy 4, 50
NumPy array
normalization 59-61
standardization 66
numpy.genfromtxt() 25
numpy.loadtxt() 25
O
one-hot encoding 82
one-shot learning 317
one-way ANOVA 211
open-ended prompts 315
versus specific prompts 315
ordinal data 11
outliers 139-144
detecting 88
treating 88-92
P
paired test 197
pandas 4, 50
pandas data frame
normalization 61-64
parametric test 198
platykurtic distribution 132
Plotly 92
point estimate 162, 163
Poisson distribution 153
population and sample 34, 35
Principal Component Analysis (PCA) 29-32, 64, 262
probability 145, 146
probability distributions 147
binomial distribution 151, 152
continuous probability distributions 148
discrete probability distributions 147
normal probability distributions 150
Poisson distribution 153, 154
uniform probability distributions 149
prompt engineering 314
prompt types 315
p-value 173, 190, 206
using 174
PySpellChecker 45
Python 4
Q
qualitative data 6
example 6-8
versus quantitative data 17-25
quantile 55-58
quantitative data 8
example 9, 10
R
random forest 238-240
rank-based tests 289
ratio data 14, 15
read_csv() 24
read_json() 24
Receiver Operating Characteristic (ROC) curve 345
relative frequency 106
retrieval augmented generation (RAG) 319
Robust Scaler 61
S
sample 216
sample mean 216
sampling 189
sampling distribution 216-219
sampling techniques 216-218
scatter plot 97
Scikit-learn 50
Scipy 50
Seaborn 50, 92
Second Principal Component (PC2) 32
select_dtypes(include='____') 22
self-consistency prompting 318
self-supervised learning 248
self-supervised techniques
word embedding 252
self-training classifier 249
semi-supervised learning 248
semi-supervised techniques 249-251
significance levels 206
significance testing 187, 199-203
ANOVA 205
chi-square test 206
correlation test 206
in diabetes dataset 213-215
performing 203-205
regression test 206
t-test 205
Singular Value Decomposition (SVD) 263
skewness 126
Sklearn 5
specific prompts 315
stacked area chart 99
standard deviation 54
standard error 166, 167
Standard Error of the Mean (SEM) 173
Standard Scaler 61
statistical relationships 135
correlation 138
covariance 136-138
statistical tests 207
chi-square test 210, 211
one-way ANOVA 211, 212
t-test 208, 209
two-way ANOVA 212, 213
z-test 207, 208
statistics 5
Statsmodels 50
supervised learning 224
fitting models to independent data 224, 225
Support Vector Machines (SVMs) 240
implementing 241
survival analysis 294-299
T
tail() 21
t-Distributed Stochastic Neighbor Embedding (t-SNE) 265
implementing 266, 267
term frequency-inverse document frequency (TF-IDF) 138
TextBlob 45
time series analysis 304, 305
implementing 305-309
train_test_split() 35
t-test 172, 208
two-way ANOVA 212
type() 23
U
uniform probability distributions 149
Uniform Resource Locators (URLs) 320
univariate analysis 25, 26
univariate data 25, 26
unsupervised learning 256, 257
Apriori 267-269
DBSCAN 264
Eclat 270
evaluation metrics 275-278
FP-Growth 273, 274
Gaussian Mixture Models (GMMs) 260, 261
hierarchical clustering 259, 260
K-means clustering 257, 258
K-prototype clustering 258, 259
model selection and evaluation 275
Principal Component Analysis (PCA) 262
Singular Value Decomposition (SVD) 263
t-SNE 265-267
V
value_counts() 18
variance 53
vectors 280
Vega-altair 92
violin plot 100
W
Word2Vec 138
word cloud 100
word embeddings 252
implementing 253
Z
zero-shot learning 316, 317
z-test 207, 208