
School of Computer Science and Electronics Engineering, University of Essex

Lecture 3: Data Exploration: Summarising, presenting and compressing data
CE880: An Approachable Introduction to Data Science

Haider Raza
Tuesday, 31 Jan 2023

1
About Myself

- Name: Haider Raza
- Position: Senior Lecturer in Artificial Intelligence
- Research interests: AI, Machine Learning, Data Science
- Contact: [email protected]
- Academic Support Hours: 1-2 PM on Friday via Zoom. The Zoom link is available on Moodle
- Website: www.sagihaider.com

2
Common file formats in Data Science

Source: https://www.weirdgeek.com/

3
Reading Zipped file

Zip files are a gift from the coding gods. It is like they have fallen from heaven to save
our storage space and time. Old school programmers and computer users will certainly
relate to how we used to copy gigantic installation files in Zip format.
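As a sketch, the standard-library `zipfile` module can both create and read Zip archives without fully extracting them; the archive and member names below are made up for the example:

```python
import zipfile

# Create a small zip archive so the example is self-contained
with zipfile.ZipFile("example.zip", "w") as zf:
    zf.writestr("data.txt", "hello from inside the zip")

# Read a member back out of the archive without extracting it to disk
with zipfile.ZipFile("example.zip", "r") as zf:
    print(zf.namelist())              # lists the archive members
    with zf.open("data.txt") as f:
        text = f.read().decode("utf-8")
print(text)
```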

Source: analyticsvidhya

4
Reading Text Files

Text files are one of the most common file formats for storing data, and Python makes it very easy to read them. The built-in `open()` function takes the file path and the file access mode as its parameters. For reading a text file, the access mode is `r`. The other access modes are listed below:

- `w` – writing to a file
- `r+` or `w+` – read and write to a file
- `a` – appending to an already existing file
- `a+` – append to a file after reading
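A minimal sketch of these modes in action (`sample.txt` is a throwaway example file):

```python
with open("sample.txt", "w") as f:   # 'w' -- write (creates or truncates the file)
    f.write("line 1\n")

with open("sample.txt", "a") as f:   # 'a' -- append to the existing file
    f.write("line 2\n")

with open("sample.txt", "r") as f:   # 'r' -- read
    lines = f.readlines()
print(lines)
```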


5
Reading CSV Files

A CSV (Comma-Separated Values) file is the most common type of file a data scientist will ever work with. These files use a `,` as a delimiter to separate the values, and each row in a CSV file is one data record.
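For instance, with pandas (`people.csv` is a made-up file created inline so the sketch is self-contained):

```python
import pandas as pd

# Write a tiny CSV file so the example has something to read
with open("people.csv", "w") as f:
    f.write("name,age\nAlice,30\nBob,25\n")

df = pd.read_csv("people.csv")   # each row becomes one data record
print(df.shape)
```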


6
Reading CSV Files . . .

But CSV can run into problems if the values themselves contain commas. This can be overcome by using a different delimiter to separate the fields, such as `;` or a tab. These files can also be imported with the `read_csv()` function by specifying the delimiter in the `sep` parameter, for example when reading a TSV (Tab-Separated Values) file:
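A sketch of reading a TSV this way (`scores.tsv` is invented and created inline):

```python
import pandas as pd

# Write a small tab-separated file to read back
with open("scores.tsv", "w") as f:
    f.write("name\tscore\nAlice\t90\nBob\t85\n")

# The same read_csv() function handles other delimiters via the sep parameter
df = pd.read_csv("scores.tsv", sep="\t")
print(df)
```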


7
Reading Excel Files

Pandas has a very handy function called `read_excel()` to read Excel files.

We can easily read data from any sheet we wish by providing its name in the `sheet_name` parameter of the `read_excel()` function.


8
Importing Data from a Database

Data in databases is stored in tables, and these systems are known as relational database management systems (RDBMS). Connecting to an RDBMS and retrieving data from it can prove to be quite a challenging task. You will need to import the `sqlite3` module to use SQLite.
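A minimal sketch using the standard-library `sqlite3` module with an in-memory database; the table and values are invented for the example:

```python
import sqlite3

# An in-memory database keeps the example self-contained
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
conn.commit()

# Retrieve all rows from the table
rows = cur.execute("SELECT * FROM users").fetchall()
print(rows)
conn.close()
```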

9
Reading JSON Files

JSON (JavaScript Object Notation) files are a lightweight, human-readable way to store and exchange data. They are easy for machines to parse and generate, and are based on the JavaScript programming language. JSON files store data within `{ }`, similar to how a dictionary stores it in Python.
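For example, with the standard-library `json` module (the record itself is made up):

```python
import json

raw = '{"name": "Alice", "skills": ["Python", "SQL"]}'
record = json.loads(raw)          # parse JSON text into a Python dict
print(record["name"])

# json.load(f) reads from an open file object in the same way
```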


10
Reading Data from Pickle

Pickle files store the serialized form of Python objects. This means objects such as lists, sets, tuples, dicts, etc. are converted to a byte stream before being stored on disk, which allows you to continue working with them later. Pickles are particularly useful when you have trained a machine learning model and want to save it to make predictions later on.
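A small sketch with the standard-library `pickle` module; `model.pkl` and the parameter dict are stand-ins for a real trained model:

```python
import pickle

model_params = {"weights": [0.1, 0.2], "bias": 0.5}   # stand-in for a trained model

with open("model.pkl", "wb") as f:    # serialize (pickle) the object to disk
    pickle.dump(model_params, f)

with open("model.pkl", "rb") as f:    # deserialize (unpickle) it later
    restored = pickle.load(f)
print(restored)
```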


11
Reading HTML using Python

12
Uploading Data to Colab

There are three different ways of uploading data to Colab:

- Manually locate the file
- Mounting Google Drive
- Using Git or an API

13
Manually locate the file

14
Mounting Google Drive

15
Using Git or API

16
Types of Analytics in Data Science

17
Types of Analytics in Data Science

- Descriptive Analytics tells us what happened in the past and helps a business understand how it is performing by providing context that helps stakeholders interpret information. Example: year-over-year pricing changes, month-over-month sales growth, or the total revenue per subscriber
- Diagnostic Analytics takes descriptive data a step further and helps you understand why something happened in the past. Example: examining market demand, explaining customer behavior, identifying technology issues
- Predictive Analytics predicts what is most likely to happen in the future and provides companies with actionable insights based on that information. Example: forecasting future cash flow, early detection of disease
- Prescriptive Analytics provides recommendations regarding actions that will take advantage of the predictions and guides the possible actions toward a solution. Example: investment decisions, fraud detection, algorithmic recommendations (Instagram, TikTok)

18
Central Tendency

- Mean: The average of the dataset.
- Median: The middle value of an ordered dataset.
- Mode: The most frequent value in the dataset. If multiple values occur most frequently, we have a multimodal distribution.
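These three measures can be computed with the standard-library `statistics` module, for example:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]
print(statistics.mean(data))     # average: 30 / 6 = 5
print(statistics.median(data))   # middle of the ordered data: (3 + 5) / 2 = 4.0
print(statistics.mode(data))     # most frequent value: 3
```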

19
Shape of the Distribution

- Skewness: A measure of the asymmetry of a distribution.
- Kurtosis: A measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
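A sketch using SciPy's `skew` and `kurtosis` functions (assuming SciPy is available; the datasets are invented):

```python
from scipy.stats import skew, kurtosis  # assumes SciPy is installed

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 10]

s_sym = skew(symmetric)        # 0 for perfectly symmetric data
s_right = skew(right_skewed)   # positive: a long right tail
k_sym = kurtosis(symmetric)    # Fisher definition: a normal distribution gives 0
print(s_sym, s_right, k_sym)
```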

20
Variability

21
Variability

Range: The difference between the highest and lowest values in the dataset.
Percentiles, Quartiles and Interquartile Range (IQR)

- Percentiles – values that indicate the value below which a given percentage of observations in a group of observations falls.
- Quartiles – values that divide the data points into four more or less equal parts, or quarters.
- Interquartile Range (IQR) – a measure of statistical dispersion and variability based on dividing a data set into quartiles: IQR = Q3 − Q1
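For example, with `statistics.quantiles` from the standard library (the data are invented):

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, q2, q3 = statistics.quantiles(data, n=4)   # the three quartile cut points
iqr = q3 - q1                                  # interquartile range
rng = max(data) - min(data)                    # range
print(q1, q2, q3, iqr, rng)
```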

22
Variability

Variance: The average squared difference of the values from the mean; it measures how spread out the data are relative to the mean.
Standard Deviation: The square root of the variance; it measures the typical distance between each data point and the mean.
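A quick check of that relationship with the standard-library `statistics` module:

```python
import statistics
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
var = statistics.pvariance(data)   # population variance
std = statistics.pstdev(data)      # population standard deviation
print(var, std)                    # std is the square root of var
```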

23
Variability

24
Relationship Between Variables

Causality: Relationship between two events where one event is affected by the other.
Covariance: A quantitative measure of the joint variability between two or more
variables.
Correlation: Measures the relationship between two variables; it ranges from -1 to 1 and is the normalized version of covariance.
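A sketch with NumPy (assumed available), using a pair of perfectly linearly related variables:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])     # y = 2x: perfectly correlated

cov_xy = np.cov(x, y)[0, 1]            # joint variability (unnormalised)
corr_xy = np.corrcoef(x, y)[0, 1]      # normalised to the range [-1, 1]
print(cov_xy, corr_xy)
```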

25
Hypothesis Testing and Statistical Significance

Null Hypothesis: A general statement that there is no relationship between two measured phenomena or no association among groups.
Alternative Hypothesis: A statement contrary to the null hypothesis.
In statistical hypothesis testing, a type I error is the rejection of a true null hypothesis, while a type II error is the non-rejection of a false null hypothesis.

“ Students who eat breakfast will perform better on a math exam than students who
do not eat breakfast. ”
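A sketch of testing this breakfast hypothesis with a two-sample t-test, assuming SciPy is available; the exam scores are entirely made up:

```python
from scipy.stats import ttest_ind  # assumes SciPy is installed

breakfast = [78, 85, 90, 88, 82, 91]       # invented scores
no_breakfast = [70, 75, 72, 80, 68, 74]    # invented scores

# Null hypothesis: the two groups have equal mean exam scores
t_stat, p_value = ttest_ind(breakfast, no_breakfast)
print(t_stat, p_value)
# A small p-value (e.g. < 0.05) would lead us to reject the null hypothesis
```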

26
Clustering

We would like to group our data into different groups.

27
Code: Generate the half-moon data
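One common way to generate such data is scikit-learn's `make_moons` helper (assumed available):

```python
from sklearn.datasets import make_moons  # assumes scikit-learn is installed

# 1500 points in two interleaving half-moon shapes, with a little noise
X, y = make_moons(n_samples=1500, noise=0.05, random_state=42)
print(X.shape, y.shape)
```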

28
Half-Moon Data with 1500 points

29
Code: Generate the circle data
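Similarly, scikit-learn's `make_circles` helper (assumed available) can generate this dataset:

```python
from sklearn.datasets import make_circles  # assumes scikit-learn is installed

# 1500 points: a large circle enclosing a smaller one, with a little noise
X, y = make_circles(n_samples=1500, factor=0.5, noise=0.05, random_state=42)
print(X.shape, y.shape)
```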

30
Circle Data with 1500 points

31
K-means Algorithm

Possibly the most popular algorithm for clustering.

k-Means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

- Initialise with `n_clusters` random centroids
- Iterate over two steps:
  - Assign each point to its closest centroid, using Euclidean distance
  - Recompute each centroid as the mean, in each dimension, of the points assigned to it
- Repeat
- The algorithm is unstable: different starting positions will result in different clusters
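The steps above can be sketched with scikit-learn's `KMeans` (assumed available), here on an easy synthetic two-blob dataset rather than the slide's moon data:

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
# Two well-separated blobs of 100 points each (synthetic stand-in data)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])

# n_init restarts from several random initialisations to tame the instability
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(km.cluster_centers_)
```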

32
Metrics for clustering

Completeness: a clustering satisfies completeness if all data points that are members of a single class are assigned to a single cluster.

33
Metrics for clustering

Silhouette Coefficient:
- +1 indicates that the sample is far away from the neighboring clusters
- 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters
- negative values indicate that the sample might have been assigned to the wrong cluster
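A sketch computing both metrics with scikit-learn (assumed available) on a toy two-blob dataset:

```python
import numpy as np
from sklearn.metrics import silhouette_score, completeness_score  # assumes scikit-learn

rng = np.random.default_rng(0)
# Two tight, well-separated blobs of 50 points each
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])
true_labels = [0] * 50 + [1] * 50

predicted = [0] * 50 + [1] * 50              # a clustering that matches the classes
sil = silhouette_score(X, predicted)         # near +1 for tight, separated clusters
comp = completeness_score(true_labels, predicted)  # 1.0: each class in one cluster
print(sil, comp)
```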

34
Let's run k-means on the moon data

35
Two clusters on moon data

36
Let's run k-means on the circle data

37
Two clusters on Circle data

38
Disadvantage of k-means clustering

- Difficult to predict the value of k
- Use an elbow plot to select the best value of k
- Different initial partitions can result in different final clusters

39
Density-based spatial clustering of applications with noise (DBSCAN)

The DBSCAN algorithm is used to find associations and structures in data that are hard to find manually but that can be relevant and useful for finding patterns and predicting trends.
It depends on two parameters:

- eps: the maximum distance between two points for them to be considered neighbors. If the distance between two points is lower than or equal to eps, these points are neighbors.
- minPoints: the minimum number of points required to form a dense region. For example, if we set the minPoints parameter to 5, then we need at least 5 points to form a dense region.
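A sketch with scikit-learn's `DBSCAN` (assumed available) on half-moon data; the values of `eps` and `min_samples` (scikit-learn's name for minPoints) are chosen by eye for this dataset:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons  # assumes scikit-learn is installed

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

# eps: neighbourhood radius; min_samples plays the role of minPoints
db = DBSCAN(eps=0.15, min_samples=5)
labels = db.fit_predict(X)

# DBSCAN labels noise points -1, so exclude them when counting clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)
```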

40
DBSCAN: Advantages and disadvantages

Advantages:

- Can discover arbitrarily shaped clusters
- Can find a cluster completely surrounded by a different cluster

Disadvantages:

- Datasets with varying densities are tricky
- Sensitive to its two parameters

41
Let's cluster the moon data with DBSCAN

42
Two clusters on moon data using DBSCAN

43
Two clusters on circle data using DBSCAN

44
