
School of Computer Science and Electronics Engineering, University of Essex

Lecture 3: Data Exploration: Summarising, presenting and compressing data
CE880: An Approachable Introduction to Data Science

Haider Raza
Tuesday, 31 Jan 2023

1
About Myself

- Name: Haider Raza
- Position: Senior Lecturer in Artificial Intelligence
- Research interests: AI, Machine Learning, Data Science
- Contact: [email protected]
- Academic Support Hours: 1-2 PM on Friday via Zoom. The Zoom link is available on Moodle
- Website: www.sagihaider.com

2
Common file formats in Data Science

Source: https://www.weirdgeek.com/

3
Reading Zipped file

Zip files are a gift from the coding gods. It is like they have fallen from heaven to save
our storage space and time. Old school programmers and computer users will certainly
relate to how we used to copy gigantic installation files in Zip format.
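As a sketch, the standard-library `zipfile` module can both create and read Zip archives without fully extracting them; the archive and member names below are made up for the example:

```python
import zipfile

# Create a small zip archive so the example is self-contained
with zipfile.ZipFile("example.zip", "w") as zf:
    zf.writestr("data.txt", "hello from inside the zip")

# Read a member back out of the archive without extracting it to disk
with zipfile.ZipFile("example.zip", "r") as zf:
    print(zf.namelist())              # lists the archive members
    with zf.open("data.txt") as f:
        text = f.read().decode("utf-8")
print(text)
```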

Source: analyticsvidhya

4
Reading Text Files

Text files are one of the most common file formats for storing data, and Python makes it very easy to read them. The built-in `open()` function takes the file path and the file access mode as its parameters. For reading a text file, the access mode is `r`. The other access modes are listed below:

- `w` – writing to a file
- `r+` or `w+` – read and write to a file
- `a` – appending to an already existing file
- `a+` – append to a file after reading
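A minimal sketch of these modes in action (`sample.txt` is a throwaway example file):

```python
with open("sample.txt", "w") as f:   # 'w' -- write (creates or truncates the file)
    f.write("line 1\n")

with open("sample.txt", "a") as f:   # 'a' -- append to the existing file
    f.write("line 2\n")

with open("sample.txt", "r") as f:   # 'r' -- read
    lines = f.readlines()
print(lines)
```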


5
Reading CSV Files

A CSV (Comma-Separated Values) file is the most common type of file a data scientist will ever work with. These files use a `,` as a delimiter to separate the values, and each row in a CSV file is one data record.
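For instance, with pandas (`people.csv` is a made-up file created inline so the sketch is self-contained):

```python
import pandas as pd

# Write a tiny CSV file so the example has something to read
with open("people.csv", "w") as f:
    f.write("name,age\nAlice,30\nBob,25\n")

df = pd.read_csv("people.csv")   # each row becomes one data record
print(df.shape)
```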


6
Reading CSV Files . . .

But CSV can run into problems if the values themselves contain commas. This can be overcome by using a different delimiter to separate the fields, such as `;` or a tab. These files can also be imported with the `read_csv()` function by specifying the delimiter in the `sep` parameter, for example when reading a TSV (Tab-Separated Values) file:
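A sketch of reading a TSV this way (`scores.tsv` is invented and created inline):

```python
import pandas as pd

# Write a small tab-separated file to read back
with open("scores.tsv", "w") as f:
    f.write("name\tscore\nAlice\t90\nBob\t85\n")

# The same read_csv() function handles other delimiters via the sep parameter
df = pd.read_csv("scores.tsv", sep="\t")
print(df)
```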


7
Reading Excel Files

Pandas has a very handy function called `read_excel()` to read Excel files.

We can easily read data from any sheet we wish by providing its name in the `sheet_name` parameter of the `read_excel()` function.


8
Importing Data from a Database

Data in databases is stored in tables, and these systems are known as relational database management systems (RDBMS). Connecting to an RDBMS and retrieving data from it can prove to be quite a challenging task. You will need to import the `sqlite3` module to use SQLite.
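A minimal sketch using the standard-library `sqlite3` module with an in-memory database; the table and values are invented for the example:

```python
import sqlite3

# An in-memory database keeps the example self-contained
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
conn.commit()

# Retrieve all rows from the table
rows = cur.execute("SELECT * FROM users").fetchall()
print(rows)
conn.close()
```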

9
Reading JSON Files

JSON (JavaScript Object Notation) files are a lightweight, human-readable way to store and exchange data. They are easy for machines to parse and generate, and are based on the JavaScript programming language. JSON files store data within `{ }`, similar to how a dictionary stores it in Python.
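For example, with the standard-library `json` module (the record itself is made up):

```python
import json

raw = '{"name": "Alice", "skills": ["Python", "SQL"]}'
record = json.loads(raw)          # parse JSON text into a Python dict
print(record["name"])

# json.load(f) reads from an open file object in the same way
```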


10
Reading Data from Pickle

Pickle files store the serialized form of Python objects. This means objects such as lists, sets, tuples, dicts, etc. are converted to a byte stream before being stored on disk, which allows you to continue working with them later. Pickles are particularly useful when you have trained a machine learning model and want to save it to make predictions later on.
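A small sketch with the standard-library `pickle` module; `model.pkl` and the parameter dict are stand-ins for a real trained model:

```python
import pickle

model_params = {"weights": [0.1, 0.2], "bias": 0.5}   # stand-in for a trained model

with open("model.pkl", "wb") as f:    # serialize (pickle) the object to disk
    pickle.dump(model_params, f)

with open("model.pkl", "rb") as f:    # deserialize (unpickle) it later
    restored = pickle.load(f)
print(restored)
```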


11
Reading HTML using Python

12
Uploading Data to Colab

There are three different ways of uploading data to Colab:

- Manually locate the file
- Mounting Google Drive
- Using Git or an API

13
Manually locate the file

14
Mounting Google Drive

15
Using Git or API

16
Types of Analytics in Data Science

17
Types of Analytics in Data Science

- Descriptive Analytics tells us what happened in the past and helps a business understand how it is performing by providing context that helps stakeholders interpret information. Example: year-over-year pricing changes, month-over-month sales growth, or the total revenue per subscriber
- Diagnostic Analytics takes descriptive data a step further and helps you understand why something happened in the past. Example: examining market demand, explaining customer behavior, identifying technology issues
- Predictive Analytics predicts what is most likely to happen in the future and provides companies with actionable insights based on that information. Example: forecasting future cash flow, early detection of disease
- Prescriptive Analytics provides recommendations regarding actions that will take advantage of the predictions and guides the possible actions toward a solution. Example: investment decisions, fraud detection, algorithmic recommendations (Instagram, TikTok)

18
Central Tendency

- Mean: The average of the dataset.
- Median: The middle value of an ordered dataset.
- Mode: The most frequent value in the dataset. If multiple values occur most frequently, we have a multimodal distribution.
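These three measures can be computed with the standard-library `statistics` module, for example:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]
print(statistics.mean(data))     # average: 30 / 6 = 5
print(statistics.median(data))   # middle of the ordered data: (3 + 5) / 2 = 4.0
print(statistics.mode(data))     # most frequent value: 3
```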

19
Shape of the Distribution

- Skewness: A measure of the asymmetry of a distribution.
- Kurtosis: A measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
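A sketch using SciPy's `skew` and `kurtosis` functions (assuming SciPy is available; the datasets are invented):

```python
from scipy.stats import skew, kurtosis  # assumes SciPy is installed

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 10]

s_sym = skew(symmetric)        # 0 for perfectly symmetric data
s_right = skew(right_skewed)   # positive: a long right tail
k_sym = kurtosis(symmetric)    # Fisher definition: a normal distribution gives 0
print(s_sym, s_right, k_sym)
```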

20
Variability

21
Variability

Range: The difference between the highest and lowest values in the dataset.
Percentiles, Quartiles and Interquartile Range (IQR)

- Percentiles – values that indicate the value below which a given percentage of observations in a group of observations falls.
- Quartiles – values that divide the data points into four more or less equal parts, or quarters.
- Interquartile Range (IQR) – a measure of statistical dispersion and variability based on dividing a data set into quartiles: IQR = Q3 − Q1
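For example, with `statistics.quantiles` from the standard library (the data are invented):

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, q2, q3 = statistics.quantiles(data, n=4)   # the three quartile cut points
iqr = q3 - q1                                  # interquartile range
rng = max(data) - min(data)                    # range
print(q1, q2, q3, iqr, rng)
```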

22
Variability

Variance: The average squared difference of the values from the mean; it measures how spread out the data are relative to the mean.
Standard Deviation: The square root of the variance; it measures the typical distance between each data point and the mean.
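A quick check of that relationship with the standard-library `statistics` module:

```python
import statistics
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
var = statistics.pvariance(data)   # population variance
std = statistics.pstdev(data)      # population standard deviation
print(var, std)                    # std is the square root of var
```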

23
Variability

24
Relationship Between Variables

Causality: Relationship between two events where one event is affected by the other.
Covariance: A quantitative measure of the joint variability between two or more
variables.
Correlation: Measures the relationship between two variables; it ranges from -1 to 1 and is the normalized version of covariance.
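A sketch with NumPy (assumed available), using a pair of perfectly linearly related variables:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])     # y = 2x: perfectly correlated

cov_xy = np.cov(x, y)[0, 1]            # joint variability (unnormalised)
corr_xy = np.corrcoef(x, y)[0, 1]      # normalised to the range [-1, 1]
print(cov_xy, corr_xy)
```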

25
Hypothesis Testing and Statistical Significance

Null Hypothesis: A general statement that there is no relationship between two measured phenomena or no association among groups.
Alternative Hypothesis: A statement contrary to the null hypothesis.
In statistical hypothesis testing, a type I error is the rejection of a true null hypothesis, while a type II error is the non-rejection of a false null hypothesis.

“ Students who eat breakfast will perform better on a math exam than students who
do not eat breakfast. ”
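A sketch of testing this breakfast hypothesis with a two-sample t-test, assuming SciPy is available; the exam scores are entirely made up:

```python
from scipy.stats import ttest_ind  # assumes SciPy is installed

breakfast = [78, 85, 90, 88, 82, 91]       # invented scores
no_breakfast = [70, 75, 72, 80, 68, 74]    # invented scores

# Null hypothesis: the two groups have equal mean exam scores
t_stat, p_value = ttest_ind(breakfast, no_breakfast)
print(t_stat, p_value)
# A small p-value (e.g. < 0.05) would lead us to reject the null hypothesis
```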

26
Clustering

We would like to group our data into different groups.

27
Code: Generate the half-moon data
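One common way to generate such data is scikit-learn's `make_moons` helper (assumed available):

```python
from sklearn.datasets import make_moons  # assumes scikit-learn is installed

# 1500 points in two interleaving half-moon shapes, with a little noise
X, y = make_moons(n_samples=1500, noise=0.05, random_state=42)
print(X.shape, y.shape)
```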

28
Half-Moon Data with 1500 points

29
Code: Generate the circle data
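Similarly, scikit-learn's `make_circles` helper (assumed available) can generate this dataset:

```python
from sklearn.datasets import make_circles  # assumes scikit-learn is installed

# 1500 points: a large circle enclosing a smaller one, with a little noise
X, y = make_circles(n_samples=1500, factor=0.5, noise=0.05, random_state=42)
print(X.shape, y.shape)
```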

30
Circle Data with 1500 points

31
K-means Algorithm

Possibly the most popular algorithm for clustering.

k-Means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

- Initialise with `n_clusters` random centroids
- Iterate over two steps:
  - Assign each point to its closest centroid, using Euclidean distance
  - Recompute each centroid as the mean, in each dimension, of the points assigned to it
- Repeat
- The algorithm is unstable: different starting positions will result in different clusters
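The steps above can be sketched with scikit-learn's `KMeans` (assumed available), here on an easy synthetic two-blob dataset rather than the slide's moon data:

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
# Two well-separated blobs of 100 points each (synthetic stand-in data)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])

# n_init restarts from several random initialisations to tame the instability
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(km.cluster_centers_)
```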

32
Metrics for clustering

Completeness: a clustering satisfies completeness if all data points that are members of a single class are assigned to a single cluster.

33
Metrics for clustering

Silhouette Coefficient:
- +1 indicates that the sample is far away from the neighboring clusters
- 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters
- negative values indicate that the sample might have been assigned to the wrong cluster
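A sketch computing both metrics with scikit-learn (assumed available) on a toy two-blob dataset:

```python
import numpy as np
from sklearn.metrics import silhouette_score, completeness_score  # assumes scikit-learn

rng = np.random.default_rng(0)
# Two tight, well-separated blobs of 50 points each
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])
true_labels = [0] * 50 + [1] * 50

predicted = [0] * 50 + [1] * 50              # a clustering that matches the classes
sil = silhouette_score(X, predicted)         # near +1 for tight, separated clusters
comp = completeness_score(true_labels, predicted)  # 1.0: each class in one cluster
print(sil, comp)
```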

34
Let's run k-means on the moon data

35
Two clusters on moon data

36
Let's run k-means on the circle data

37
Two clusters on Circle data

38
Disadvantage of k-means clustering

- Difficult to predict the value of k
- Use an elbow plot to select the best value of k
- Different initial partitions can result in different final clusters

39
Density-based spatial clustering of applications with noise (DBSCAN)

The DBSCAN algorithm is used to find associations and structures in data that are hard to find manually but that can be relevant and useful for finding patterns and predicting trends.
It depends on two parameters:

- eps: the maximum distance between two points for them to be considered neighbors. If the distance between two points is lower than or equal to eps, these points are neighbors.
- minPoints: the minimum number of points required to form a dense region. For example, if we set the minPoints parameter to 5, then we need at least 5 points to form a dense region.
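A sketch with scikit-learn's `DBSCAN` (assumed available) on half-moon data; the values of `eps` and `min_samples` (scikit-learn's name for minPoints) are chosen by eye for this dataset:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons  # assumes scikit-learn is installed

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

# eps: neighbourhood radius; min_samples plays the role of minPoints
db = DBSCAN(eps=0.15, min_samples=5)
labels = db.fit_predict(X)

# DBSCAN labels noise points -1, so exclude them when counting clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)
```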

40
DBSCAN: Advantages and disadvantages

Advantages:

- Can discover arbitrarily shaped clusters
- Can find a cluster completely surrounded by a different cluster

Disadvantages:

- Datasets with varying densities are tricky
- Sensitive to its two parameters

41
Let's cluster the moon data with DBSCAN

42
Two clusters on moon data using DBSCAN

43
Two clusters on circle data using DBSCAN

44
