DSC Unit 1
Data Science
T. Y. BTECH
1
Data Science: Why all the Excitement?
Exciting new, effective applications of data analytics, e.g., Google Flu Trends.
2
“Big Data” Sources
It’s All Happening On-line – every:
• Click
• Ad impression
• Billing event
• Fast forward, pause, …
• Server request
• Transaction
• Network message
• Fault
• …
User Generated (Web & Mobile)
3
Graph Data
Lots of interesting data has a graph structure:
• Social networks
• Communication networks
• Computer Networks
• Road networks
• Citations
• Collaborations/Relationships
• …
4
What can you do with the data?
6
5 Vs of Big Data
Volume, Velocity, Variety, Veracity, and Value.
7
DATA SCIENCE – WHAT IS IT?
8
Data Science – A Definition
9
Ben Fry’s Model
Visualizing Data Process
1. Acquire
2. Parse
3. Filter
4. Mine
5. Represent
6. Refine
7. Interact
10
Jeff Hammerbacher’s Model
1. Identify problem
2. Instrument data sources
3. Collect data
4. Prepare data (integrate, transform, clean, filter, aggregate)
5. Build model
6. Evaluate model
7. Communicate results
11
Data Scientist’s Practice
[Diagram: iterative practice loop – clean/prep the data, evaluate, interpret]
12
The Big Picture
[Diagram: ETL pipeline – Extract, Transform, Load]
13
The Big Picture
14
Data Science: Getting Value out of Data
15
Why the Increased Interest in Data Science?
19
20
Applications
21
Contrast: Databases
              Databases                          Data Science
Data Value    "Precious"                         "Cheap"
Data Volume   Modest                             Massive
Examples      Bank records, personnel records,   Online clicks, GPS logs, tweets,
              census, medical records            building sensor readings
Priorities    Consistency, error recovery,       Speed, availability,
              auditability                       query richness
Structured    Strongly (schema)                  Weakly or none (text)
Properties    Transactions, ACID*                CAP* theorem (2 of 3),
                                                 eventual consistency
Realizations  SQL                                NoSQL: MongoDB, CouchDB,
                                                 HBase, Cassandra, …

*ACID = Atomicity, Consistency, Isolation, Durability
*CAP = Consistency, Availability, Partition Tolerance
22
Contrast: BI
23
Contrast: AI
24
Modern Data Science Skills
25
Data Analysts vs Data Scientists
26
27
Why Python for Data Science???
https://www.kdnuggets.com/2019/05/poll-top-data-science-machine-learning-platforms.html
28
The Structure Spectrum
29
Key Concept: Structured Data
A data model is a collection of concepts for
describing data.
30
NumPy/Python
• NumPy is a Python library used for working with arrays.
• It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
• NumPy stands for Numerical Python.
• In Python, lists can serve the purpose of arrays, but they are slow to process.
• NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
• The array object in NumPy is called ndarray; it provides many supporting functions that make working with ndarray very easy.
• Arrays are used very frequently in data science, where speed and resources are very important.
• NumPy arrays are stored in one contiguous block of memory, unlike lists, so processes can access and manipulate them very efficiently (a short usage sketch follows this list).
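A minimal sketch of basic ndarray usage (assuming NumPy is installed and imported as np; the values are illustrative):

```python
import numpy as np

# Create an ndarray from a Python list
a = np.array([1, 2, 3, 4, 5])

# Vectorized arithmetic operates on the whole array at once,
# with no explicit Python-level loop
b = a * 2          # -> array([ 2,  4,  6,  8, 10])
c = a + b          # -> array([ 3,  6,  9, 12, 15])

# Handy attributes and aggregates
print(a.shape)     # (5,)
print(a.dtype)     # e.g. int64 (platform dependent)
print(c.mean())    # 9.0
```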
31
Pandas
• Pandas stands for ‘Python Data Analysis Library’ and is one of the most important Python tools used by data scientists today.
• Pandas is an open-source Python library that allows users to explore, manipulate and visualise data in an extremely efficient manner. It is often described as Microsoft Excel in Python.
• It is easy to read and learn.
• It is extremely fast and powerful.
• It integrates well with other visualisation libraries.
• Pandas can take in a huge variety of data; the most common sources are CSV, Excel, SQL tables, or even a web page (a short loading sketch follows this list).
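A minimal loading-and-exploration sketch; the file name sales.csv and the region column are illustrative assumptions, not from the slides:

```python
import pandas as pd

# Load a CSV file into a DataFrame (file and column names are hypothetical)
df = pd.read_csv("sales.csv")

print(df.head())                     # first five rows
print(df.describe())                 # summary statistics for numeric columns
print(df["region"].value_counts())  # frequency of each value in one column
```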
32
Pandas/Python
33
Operations
• map() functions
• filter (apply a predicate to rows)
• sort / group by / aggregate: sum, count, average, max, min
• Pivot or reshape – to reshape a data frame from long to wide in Pandas, we can use the pd.pivot() method; its columns argument names the column whose values become the new frame's columns (e.g., 'Year Month').
• Relational:
– union, intersection, difference, Cartesian product (CROSS JOIN)
– select/filter, project
– join: natural join (INNER JOIN), theta join, semi-join, etc.
– rename
(A short Pandas sketch of these operations follows this list.)
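A minimal Pandas sketch of these operations; the DataFrame, column names and threshold are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "Year Month": ["2023-01", "2023-01", "2023-02", "2023-02"],
    "store":      ["A", "B", "A", "B"],
    "sales":      [100, 150, 120, 130],
})

# map(): apply a function element-wise to a column
df["sales_k"] = df["sales"].map(lambda x: x / 1000)

# filter: keep rows that satisfy a predicate
big = df[df["sales"] > 110]

# group by + aggregate
totals = df.groupby("store")["sales"].sum()

# pivot: reshape from long to wide, one column per 'Year Month' value
wide = df.pivot(index="store", columns="Year Month", values="sales")
print(wide)
```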
34
Applications
Amazing real-time Data Science applications:
Recommendation – Most apps and websites like Amazon, YouTube, Flipkart, etc. give recommendations based on the viewer's interests. Online music applications like Spotify give recommendations according to your taste in music. These are good examples of data science recommendation applications.
Search Results – Machine learning algorithms are used to find the most relevant results in search engines such as Google. Similar algorithms surface the most-visited sites in Google Chrome.
Intelligent Assistants – Google Assistant and Siri are examples of intelligent assistants. Advanced machine learning algorithms convert voice input into text output. These smart assistants recognise the voice and provide the required information as both voice and text output.
Autonomous driving vehicles – Automobile companies like Waymo and Tesla are working on the next generation of autonomous vehicles. 3D images are captured by cameras and the information is provided to algorithms for further processing.
Piracy Detection – YouTube is an example of piracy detection using machine learning algorithms. Because the content base is so large, copied content cannot be detected manually, so machine learning helps detect and remove it, reducing human effort.
Image Recognition – Facebook uses image recognition, powered by data science and machine learning, for friend suggestions. Google Lens likewise uses an image recognition algorithm to provide related information to you.
35
Data Cleaning-Dirty Data
• The Statistics View:
• There is a process that produces data
• We want to model ideal samples of that process, but
in practice we have non-ideal samples:
• Distortion – some samples are corrupted by a process
• Selection Bias - likelihood of a sample depends on its
value
• Left and right censorship - users come and go from our
scrutiny
• Dependence – samples are supposed to be independent,
but are not (e.g. social networks)
36
Dirty Data
Solution:
39
Data Quality Problems
• (Source) Data is dirty on its own.
• Transformations corrupt the data (complexity of
software pipelines).
• Data sets are clean but integration (i.e., combining
them) screws them up.
• “Rare” errors can become frequent after
transformation or integration.
• Data sets are clean but suffer “bit rot”
• Old data loses its value/accuracy over time
• Any combination of the above
40
Big Picture: Where can Dirty Data Arise?
[Diagram: dirty data can arise at any stage – extract, transform, load, integrate, clean]
41
Numeric Outliers
43
Conventional Definition of Data Quality
• Accuracy
– The data was recorded correctly.
• Completeness
– All relevant data was recorded.
• Uniqueness
– Entities are recorded once.
• Timeliness
– The data is kept up to date.
• Special problems in federated data: time consistency.
• Consistency
– The data agrees with itself.
44
How can we deal with noisy data?
• Data binning: the data is sorted and then smoothed by consulting the values in its neighbourhood. This method is also known as local smoothing (a short binning sketch follows this list).
• Clustering: outliers may be detected by grouping similar data into the same group, i.e., into the same cluster.
• Machine learning: a machine learning algorithm can be used to smooth the data. For example, a regression algorithm can smooth the data by fitting it to a specified linear function.
• Removing manually: noisy data can be deleted manually, but this is a time-consuming process, so this method is usually not preferred.
Noisy data is meaningless data. The term has often been used as a synonym for corrupt data, but its meaning has expanded to include any data that cannot be understood and interpreted correctly by machines, such as unstructured data.
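A minimal sketch of smoothing by bin means (equal-frequency binning); the values and the bin size are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Toy noisy values (illustrative)
x = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-frequency binning: sort, split into bins of 4 values,
# then smooth by replacing each value with its bin mean
bins = np.array_split(np.sort(x.values), len(x) // 4)
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)   # each group of 4 values shares one mean value
```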
45
Missing, Noisy and inconsistent Data
46
What is Data Preprocessing?
47
Data Preprocessing
48
Why is Data Preparation Important?
• Data preprocessing is necessary because real-world data is often incomplete, inconsistent, and unformatted.
49
How is Data Preprocessing performed?
How missing data can be handled: three different approaches can be used, given below –
• Ignoring the missing record – This is the simplest and most efficient method for handling missing data. However, it should not be used when the number of missing values is immense or when the pattern of missingness is itself related to the underlying problem being studied.
• Filling the missing values manually – This is one of the best-chosen methods, but it has a limitation: when the data set is large and the missing values are numerous, the approach becomes inefficient and time-consuming.
• Filling using computed values – Missing values can also be filled in by computing the mean, median or mode of the observed values. Another option is to predict the missing values using a machine learning or deep learning algorithm (a short Pandas sketch follows this list).
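A minimal Pandas sketch of dropping versus filling missing values; the DataFrame and column names are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Small illustrative frame with missing entries
df = pd.DataFrame({"age":  [25, np.nan, 31, 40, np.nan],
                   "city": ["Pune", "Mumbai", None, "Pune", "Pune"]})

# Option 1: ignore (drop) records that contain missing values
dropped = df.dropna()

# Option 2: fill with computed values - mean for numbers, mode for categories
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```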
50
Tasks of Data Preparation
• Data Cleaning: This is the first step implemented in data preprocessing. The primary focus is on handling missing data and noisy data, detecting and removing outliers, and minimizing duplication and computed biases within the data.
• Data Integration: This process is used when data gathered from various data sources is combined to form consistent data. This consistent data, after data cleaning, is used for analysis.
• Data Transformation: This step converts the raw data into the format required by the model. The options used for transforming the data are given below –
• Normalization – numerical data is converted into a specified range, e.g., between 0 and 1, so that the data is on a common scale (a short min-max sketch follows this list).
• Aggregation – this method combines features into one; for example, two categories can be combined to form a new group.
• Generalization – lower-level attributes are converted to a higher-level representation.
• Data Reduction: After transformation and scaling, duplication (i.e., redundancy) within the data is removed and the data is organized efficiently.
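A minimal sketch of min-max normalization into the range [0, 1]; the column name and values are illustrative assumptions:

```python
import pandas as pd

# Illustrative numeric column
df = pd.DataFrame({"income": [25000, 54000, 31000, 88000]})

# Min-max normalization: rescale values into the range [0, 1]
col = df["income"]
df["income_scaled"] = (col - col.min()) / (col.max() - col.min())
print(df)
```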
51
What Is Data Wrangling?
• Data wrangling converts raw data into a format that is convenient for downstream consumption of the data. It is a technique executed at the time of making an interactive model.
• Steps:
– extract the data from different data sources
– sort the data using a chosen algorithm
– decompose the data into a different structured format
– finally, store the data in another database.
52
Data Wrangling
53
Data Wrangling in Python
Data wrangling in Python deals with the functionalities below:
1. Data exploration: the data is studied, analyzed and understood, often by visualizing representations of it.
2. Dealing with missing values: most large datasets contain missing (NaN) values; these need to be taken care of by replacing them with the mean, mode or most frequent value of the column, or simply by dropping the rows that contain a NaN value.
3. Reshaping data: the data is manipulated according to the requirements; new data can be added or pre-existing data can be modified.
4. Filtering data: datasets sometimes contain unwanted rows or columns which need to be removed or filtered out.
5. Other: after applying the above functionalities to the raw dataset, we get an efficient dataset that fits our requirements and can be used for a purpose such as data analysis, machine learning, data visualization or model training.
(A short exploration-and-filtering sketch follows this list.)
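A minimal exploration-and-filtering sketch in Pandas; the DataFrame, column names and the filter threshold are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Illustrative DataFrame with a missing value and an unwanted row
df = pd.DataFrame({"temp": [21.0, np.nan, 23.5, 55.0],
                   "city": ["Pune", "Pune", "Mumbai", "Mumbai"]})

# 1. Explore: structure and missing-value counts
df.info()
print(df.isnull().sum())

# 2. Missing values: drop rows containing NaN (or fill them instead)
df = df.dropna()

# 3./4. Filter: keep only rows that satisfy a condition
df = df[df["temp"] < 50]
print(df)
```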
54
Why is Data Wrangling Important?
55
Data Leakage
Data leakage occurs when information from outside the training dataset is used while building the model, which makes the estimated performance unrealistically optimistic. It can arise in many ways, some of which are discussed below –
56
Minimizing Data Leakage
57
How is Data Wrangling performed?
• If the complete data set is used for normalization and standardization before cross-validation is performed to estimate the performance of the model, data leakage is introduced.
• The effect of data leakage can be minimized by recalculating the required data preparation within each fold of the cross-validation process, including feature selection, outlier detection and removal, projection methods, scaling of selected features and much more (a short pipeline sketch follows this list).
• Another solution is to divide the complete dataset into a training dataset used to train the model and a validation dataset used to evaluate the performance and accuracy of the applied model.
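A minimal sketch of leakage-safe preparation using a scikit-learn Pipeline (scikit-learn is not named in the slides and is assumed available); fitting the scaler inside each cross-validation fold keeps the held-out fold out of the scaling step:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler sits inside the pipeline, so it is re-fitted on the
# training folds only; the held-out fold never influences the scaling
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression(max_iter=1000))])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```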
58
Tasks of Data Wrangling
• Discovering: First, the data should be understood thoroughly to determine which approach will suit it best. For example, with weather data, analysis may show that the data comes from a single area, so the primary focus is on determining patterns.
• Structuring: As the data is gathered from different sources, it will be present in various shapes and sizes, so it needs to be structured into a proper format.
• Cleaning: Data that can degrade the performance of the analysis should be cleaned or removed.
• Enrichment: Extract new features or data from the given data set to optimize the performance of the applied model.
• Validating: This step improves the quality of the data by applying consistency rules so that the transformations applied to the data can be verified.
• Publishing: After the Data Wrangling steps are completed, they can be documented so that similar steps can be performed for the same kind of data, saving time.
59