CE880_Lecture3_slides
Haider Raza
Tuesday, 31 Jan 2023
About Myself
Common file formats in Data Science
https://www.weirdgeek.com/
Reading Zipped file
Zip files are a gift from the coding gods. It is like they have fallen from heaven to save
our storage space and time. Old school programmers and computer users will certainly
relate to how we used to copy gigantic installation files in Zip format.
analyticsvidhya
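Reading a zipped file can be sketched with Python's standard zipfile module. This is a minimal illustration, not taken from the cited source: the archive name data.zip and the member name sample.txt are hypothetical, and the example first creates the archive in a temporary directory so that it runs on its own.

```python
import tempfile
import zipfile
from pathlib import Path

# Build a small archive so the example is self-contained.
# "data.zip" and "sample.txt" are hypothetical names.
workdir = Path(tempfile.mkdtemp())
archive_path = workdir / "data.zip"
with zipfile.ZipFile(archive_path, "w") as zf:
    zf.writestr("sample.txt", "hello from inside the zip\n")

# Open the archive for reading, list its members, and read one of them.
with zipfile.ZipFile(archive_path, "r") as zf:
    member_names = zf.namelist()
    zip_text = zf.read("sample.txt").decode("utf-8")

print(member_names)
print(zip_text)
```

To unpack everything to disk instead of reading a member into memory, ZipFile.extractall() can be used in the same way.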
Reading Text Files
Text files are one of the most common file formats for storing data, and Python makes
it very easy to read them. The built-in 'open()' function takes the file path and the
file access mode as its parameters. For reading a text file, the access mode is 'r'.
Other access modes include 'w' (write), 'a' (append), 'x' (create, failing if the file
already exists), and 'r+' (read and write).
analyticsvidhya
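The use of open() with an access mode can be sketched as follows. The file name notes.txt is hypothetical, and the file is written to a temporary directory first so the example is self-contained.

```python
import tempfile
from pathlib import Path

# Create a small text file first ("notes.txt" is a hypothetical name).
text_path = Path(tempfile.mkdtemp()) / "notes.txt"
with open(text_path, "w", encoding="utf-8") as f:  # 'w' = write mode
    f.write("first line\nsecond line\n")

# Read the whole file in one go using read mode 'r'.
with open(text_path, "r", encoding="utf-8") as f:
    content = f.read()

# Read it again line by line; each line keeps its trailing newline,
# so it is stripped here.
with open(text_path, "r", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

print(content)
print(lines)
```

The with statement closes the file automatically, even if an error occurs while reading.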
Reading CSV Files
A CSV (Comma-Separated Values) file is one of the most common file types that a data
scientist works with. These files use a ',' as the delimiter to separate the values,
and each row in a CSV file is a data record.
analyticsvidhya
Reading CSV Files (continued)
But CSV can run into problems if the values themselves contain commas. This can be
overcome by using a different delimiter to separate the values, such as '\t' (tab) or
';'. Such files can also be imported with the 'read_csv()' function by specifying the
delimiter as a parameter, for example when reading a TSV (Tab-Separated Values) file.
analyticsvidhya
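A minimal sketch of both cases with pandas, assuming pandas is installed. The data is hypothetical and held in memory with io.StringIO so the example needs no files; passing a file path to read_csv() works the same way.

```python
import io

import pandas as pd

# Hypothetical data held in memory; a file path works the same way.
csv_text = "name,score\nAda,90\nGrace,85\n"
tsv_text = "name\tcity\nAda\tLondon, UK\nGrace\tNew York, US\n"

# The comma is the default delimiter for read_csv().
df_csv = pd.read_csv(io.StringIO(csv_text))

# For a TSV file, pass the tab character as the separator. Commas inside
# the values are then no longer treated as delimiters.
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df_csv)
print(df_tsv)
```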
Reading Excel Files
Pandas has a very handy function called 'read_excel()' to read Excel files.
We can easily read data from any sheet we wish by providing its name in the
sheet_name parameter of the 'read_excel()' function.
analyticsvidhya
Importing Data from a Database
Relational databases store data in the form of tables, and the systems that manage
them are known as relational database management systems (RDBMS). Connecting to an
RDBMS and retrieving data from it can be challenging, but for SQLite it is
straightforward in Python: you only need to import the built-in sqlite3 module.
analyticsvidhya
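Importing data with sqlite3 can be sketched as follows. The table name students and its rows are hypothetical, and an in-memory database is used so that nothing is written to disk.

```python
import sqlite3

# ":memory:" creates a throwaway in-memory database; pass a file path
# instead to open a database stored on disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create and fill a hypothetical table.
cur.execute("CREATE TABLE students (name TEXT, score INTEGER)")
cur.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Ada", 90), ("Grace", 85), ("Linus", 60)],
)
conn.commit()

# Retrieve rows with a parameterised query; "?" placeholders avoid
# building SQL strings by hand.
rows = cur.execute(
    "SELECT name, score FROM students WHERE score > ? ORDER BY name",
    (80,),
).fetchall()
conn.close()

print(rows)
```

Each row comes back as a tuple, which can then be passed to a DataFrame constructor if a table-like structure is needed.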
Reading JSON Files
JSON (JavaScript Object Notation) is a lightweight, human-readable format for storing
and exchanging data. It is easy for machines to parse and generate, and its syntax is
derived from the JavaScript programming language. JSON files store data as key-value
pairs within { }, similar to how a dictionary stores it in Python.
analyticsvidhya
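The similarity between JSON objects and Python dictionaries can be seen with the standard json module. The record below is hypothetical; for a file on disk, json.load(f) plays the role that json.loads() plays here for a string.

```python
import json

# A hypothetical JSON document held as a string.
json_text = '{"name": "Ada", "scores": [90, 85], "enrolled": true}'

# Parsing turns JSON objects into dicts, arrays into lists,
# and true/false into True/False.
record = json.loads(json_text)

# Serialising goes the other way; indent=2 makes the output readable.
round_trip = json.dumps(record, indent=2)

print(record["name"], record["scores"])
print(round_trip)
```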
Reading Data from Pickle
Pickle files are used to store the serialized form of Python objects. This means objects
such as list, set, tuple, dict, etc. are converted to a byte stream before being
stored on the disk. This allows you to reload the objects and continue working with
them later on. This is particularly useful when you have trained a machine learning
model and want to save it to make predictions later on.
analyticsvidhya
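A minimal sketch of the round trip with the standard pickle module. The dictionary below is a hypothetical stand-in for a trained model; a real model object would be handled the same way.

```python
import pickle

# A hypothetical stand-in for a trained model: any picklable Python object.
model_like = {"weights": [0.1, 0.2], "classes": ("a", "b"), "seen": {1, 2, 3}}

# Serialise to a byte stream. pickle.dump(obj, f) does the same into a
# file opened in binary mode ('wb').
blob = pickle.dumps(model_like)

# Restore the object. Only unpickle data from sources you trust, because
# loading a pickle can execute arbitrary code.
restored = pickle.loads(blob)

print(type(blob), restored == model_like)
```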
Reading HTML using Python
Uploading Data to Colab
Manually locate the file
Mounting Google Drive
Using Git or API
Types of Analytics in Data Science
- Descriptive Analytics tells us what happened in the past and helps a business
  understand how it is performing by providing context to help stakeholders
  interpret information. Example: year-over-year pricing changes,
  month-over-month sales growth, or the total revenue per subscriber.
- Diagnostic Analytics takes descriptive data a step further and helps you
  understand why something happened in the past. Example: examining market
  demand, explaining customer behavior, identifying technology issues.
- Predictive Analytics predicts what is most likely to happen in the future and
  provides companies with actionable insights based on the information. Example:
  forecasting future cash flow, early detection of disease.
- Prescriptive Analytics provides recommendations regarding actions that will take
  advantage of the predictions and guide the possible actions toward a solution.
  Example: investment decisions, fraud detection, algorithmic recommendations
  (Instagram, TikTok).
Central Tendency
Variability
Range: The difference between the highest and lowest value in the dataset.
Percentiles, Quartiles and Interquartile Range (IQR)
Variance: The average squared difference of the values from the mean; it measures
how spread out a set of data is relative to the mean.
Standard Deviation: The square root of the variance; it gives the typical distance
between a data point and the mean, in the same units as the data.
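The measures of variability above can be computed with Python's standard statistics module. This is a small sketch on a hypothetical sample. It uses the population formulas (dividing by n), which match the "average squared difference" definition, and the "inclusive" quartile method, which is one of several conventions for computing quartiles.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

# Range: highest value minus lowest value.
data_range = max(data) - min(data)

# Quartiles split the sorted data into four parts; IQR = Q3 - Q1.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Population variance (mean squared deviation) and its square root.
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

print(data_range, (q1, q2, q3), iqr, variance, std_dev)
```

For a sample drawn from a larger population, statistics.variance() and statistics.stdev() divide by n - 1 instead.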
Relationship Between Variables
Causality: A relationship between two events where one event is affected by the other.
Covariance: A quantitative measure of the joint variability of two variables.
Correlation: Measures the strength and direction of the linear relationship between
two variables. It is the covariance normalized by the product of the two standard
deviations, so it ranges from -1 to 1.
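The link between covariance and correlation can be sketched in plain Python. The helper names covariance and correlation and the three small data series are hypothetical, and the population form (dividing by n) is assumed.

```python
import math


def covariance(x, y):
    """Population covariance: mean of the products of the deviations."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)


def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    # covariance(x, x) is the variance of x, so its square root is the
    # standard deviation.
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical data series.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 70]   # rises with hours
falling = [10, 8, 6, 4, 2]     # falls linearly as hours rise

print(covariance(hours, marks), correlation(hours, marks))
print(covariance(hours, falling), correlation(hours, falling))
```

Covariance depends on the units of the data, whereas correlation does not, which is why correlation is easier to compare across pairs of variables. Neither of them establishes causality.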
Hypothesis Testing and Statistical Significance
“Students who eat breakfast will perform better on a math exam than students who
do not eat breakfast.”
Clustering
Code: Generate the half-moon data
Half-Moon Data with 1500 points
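The code on the original slide is not available in this text. One possible way to generate such a dataset is scikit-learn's make_moons, assuming scikit-learn is installed. The noise level and random_state below are assumed values, and only the 1500 points come from the slide title.

```python
from sklearn.datasets import make_moons

# 1500 points arranged in two interleaving half circles.
# noise=0.05 and random_state=0 are assumptions, not values from the slide.
X_moons, y_moons = make_moons(n_samples=1500, noise=0.05, random_state=0)

print(X_moons.shape)       # coordinates: one row per point, two columns
print(sorted(set(y_moons.tolist())))  # the two moon labels
```

X_moons holds the 2D coordinates and y_moons records which moon each point belongs to, which can be used to colour a scatter plot.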
Code: Generate the Circle data
Circle Data with 1500 points
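The code on the original slide is not available in this text. A comparable dataset can be sketched with scikit-learn's make_circles, assuming scikit-learn is installed. The factor, noise, and random_state values are assumptions, and only the 1500 points come from the slide title.

```python
from sklearn.datasets import make_circles

# 1500 points on two concentric circles. factor is the ratio of the inner
# radius to the outer radius. factor, noise and random_state are assumed
# values, not taken from the slide.
X_circles, y_circles = make_circles(
    n_samples=1500, factor=0.5, noise=0.05, random_state=0
)

print(X_circles.shape)
print(sorted(set(y_circles.tolist())))  # outer circle vs inner circle
```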
K-means Algorithm
Metrics for clustering
Completeness: a clustering is complete if all data points that are members of a
single class are assigned to a single cluster.
Silhouette Coefficient:
- Values near +1 indicate that the sample is far away from the neighboring clusters.
- A value of 0 indicates that the sample is on or very close to the decision
  boundary between two neighboring clusters.
- Negative values indicate that those samples might have been assigned to the
  wrong cluster.
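Both metrics are available in scikit-learn. The following is a sketch, assuming scikit-learn is installed, that runs k-means with two clusters on half-moon data and scores the result. The dataset parameters, n_init, and random_state are assumed values rather than the settings used in the lecture.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import completeness_score, silhouette_score

# Half-moon data; noise and random_state are assumed values.
X, y_true = make_moons(n_samples=1500, noise=0.05, random_state=0)

# k-means with two clusters; n_init and random_state are assumptions.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X)

# Completeness compares the clusters with the known classes (0 to 1).
completeness = completeness_score(y_true, kmeans_labels)

# The silhouette coefficient needs only the data and the cluster labels
# (-1 to 1); this is the mean over all samples.
silhouette = silhouette_score(X, kmeans_labels)

print(f"completeness: {completeness:.3f}")
print(f"silhouette:   {silhouette:.3f}")
```

Completeness requires ground-truth labels, whereas the silhouette coefficient does not, so only the latter is usable when the true classes are unknown.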
Let's run it for clustering moon data
Two clusters on moon data
Let's run it for clustering Circle data
Two clusters on Circle data
Disadvantage of k-means clustering
Density-based spatial clustering of applications with noise (DBSCAN)
DBSCAN groups together points that lie in dense regions and marks points in sparse
regions as noise. It is useful for finding structures in data that are hard to find
manually, including clusters of arbitrary shape, and that can be relevant for finding
patterns and predicting trends.
It depends on two parameters:
- eps: the maximum distance between two points for them to be considered
  neighbors. If the distance between two points is lower than or equal to this
  value (eps), these points are considered neighbors.
- minPoints: the minimum number of points to form a dense region. For example,
  if we set the minPoints parameter to 5, then we need at least 5 points to form a
  dense region.
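The two parameters map onto scikit-learn's DBSCAN as eps and min_samples (scikit-learn's name for minPoints, which counts the point itself). The following is a sketch on half-moon data, assuming scikit-learn is installed. The values eps=0.15 and min_samples=5 and the dataset settings are assumptions chosen for illustration, not the lecture's settings.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Half-moon data; noise and random_state are assumed values.
X, y_true = make_moons(n_samples=1500, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: points needed for a dense region.
# Both values are assumptions chosen for this dataset.
dbscan = DBSCAN(eps=0.15, min_samples=5)
db_labels = dbscan.fit_predict(X)

# DBSCAN labels noise points as -1, so they are excluded from the count.
n_clusters = len(set(db_labels.tolist()) - {-1})
n_noise = int((db_labels == -1).sum())

print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```

Unlike k-means, the number of clusters is not passed in. It follows from eps and min_samples, so these values usually need tuning to the scale and density of the data.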
DBSCAN: Advantages and disadvantages
Advantages:
Let’s run it for clustering moon data with DBSCAN
Two clusters on moon data using DBSCAN
Two clusters on circle data using DBSCAN