
Unit 1

Introduction to Data Science and Big Data


Contents

Basics and need of Data Science and Big Data, Applications of Data
Science, Data explosion, 5 V’s of Big Data, Relationship between
Data Science and Information Science, Business intelligence versus
Data Science, Data Science Life Cycle, Data: Data Types, Data
Collection.

Need of Data wrangling, Methods: Data Cleaning, Data Integration,
Data Reduction, Data Transformation, Data Discretization.
Introduction to Data Science

Data science, in simple words, can be defined as an interdisciplinary field of
study that uses data for research and reporting to derive insights and meaning
from that data.


Data science is an interdisciplinary academic field that uses statistics,
scientific computing, scientific methods, processes, algorithms and
systems to extract knowledge and insights from noisy, structured, and
unstructured data.
Applications of Data Science

1. Healthcare: Healthcare companies use data science to build sophisticated
medical instruments to detect and cure diseases.

2. Gaming: Video and computer games are now created with the help of data
science, which has taken the gaming experience to the next level.

3. Image Recognition: Identifying patterns in images and detecting objects in
an image is one of the most popular data science applications.

4. Recommendation Systems: Netflix and Amazon give movie and product
recommendations based on what you like to watch, purchase, or browse on
their platforms.

5. Logistics: Data science is used by logistics companies to optimize routes to
ensure faster delivery of products and increase operational efficiency.

6. Fraud Detection: Banking and financial institutions use data science and
related algorithms to detect fraudulent transactions.

7. Internet Search: Search engines such as Google, Yahoo and Bing employ data
science algorithms to return the best results for a search query in a matter of
seconds.

8. Speech Recognition: Speech recognition is dominated by data science
techniques. We see the excellent work of these algorithms every day in virtual
assistants such as Google Assistant, Alexa, and Siri.

9. Airline Route Planning: Data science makes it easier for the airline industry
to predict flight delays. It also helps decide whether to fly directly to the
destination or to make a stop along the way, for example on a long route such
as Delhi to the United States.

10. Augmented Reality: There is a fascinating relationship between data
science and augmented reality. An AR headset combines computing expertise,
algorithms, and data to create the best possible viewing experience. The
popular game Pokemon GO is an early step in that direction.
Data Explosion

The rapid or exponential increase in the amount of data that is generated and stored
in computing systems, reaching a level where data management becomes difficult, is
called Data Explosion.

The key drivers of data growth are the following:
➢ Increase in storage capacities.
➢ Cheaper storage.
➢ Increase in data processing capabilities by modern computing devices.
➢ Data generated and made available by different sectors.
5 V’s of Big Data

Big data is a collection of data from many different sources and is often
described by five characteristics: volume, variety, velocity, value and
veracity.

Volume: The size and amounts of big data that companies manage and
analyze. If the volume of data is large enough, it can be considered big data.

Variety: Variety describes the diversity of the data types and its
heterogeneous sources. The data comes in three different types: Structured
Data, Unstructured Data and Semi-Structured Data.

Velocity: It describes how rapidly the data is generated and how quickly it
moves. This data flow comes from sources such as mobile phones, social
media, networks, servers, etc. An organization that uses big data will have a
large and continuous flow of data that is being created and sent to its end
destination. This data needs to be digested and analyzed quickly.

Value: This refers to the value that big data can provide, and it relates
directly to what organizations can do with that collected data. The
value of big data usually comes from insight discovery and pattern
recognition that lead to more effective operations, stronger customer
relationships and other clear and quantifiable business benefits. The
more insights derived from the Big Data, the higher its value.

Veracity: Veracity describes the data’s accuracy and quality. Since the
data is pulled from diverse sources, the information can have
uncertainties, errors, redundancies, gaps, and inconsistencies.
Veracity, overall, refers to the level of trust there is in the collected
data.
Relationship between Data Science and
Information Science

Data Science and Information Science are distinct but complementary disciplines.

Data Science:- Data science is used in business functions such as strategy formation,
decision making and operational processes. It transforms raw, messy data into
actionable knowledge which supports decision-making. It touches on practices such
as artificial intelligence, analytics, predictive analytics and algorithm design. Data
science is an interdisciplinary field about scientific methods, processes, and systems
to extract knowledge or insights from data in various forms, either structured or
unstructured.

Information Science:- Information Science is the use of computers, storage,
networking and other physical devices, infrastructure and processes to create,
process, store, secure and exchange all forms of electronic data. It is the
development, maintenance, and use of computer software, systems, and networks. It
includes their use for the processing and distribution of data.
Business Intelligence Vs Data Science

Perspective: BI systems are designed to look backwards based on real
data from real events. Data Science looks forward, interpreting the
information to predict what might happen in the future.

Focus: BI delivers detailed reports and trends, but it does not tell you what
the data may look like in the future through patterns and experimentation.

Process: Traditional BI systems tend to be static and comparative. They
do not offer room for exploration and experimentation.

Data sources: Because of its static nature, BI data sources tend to be pre-
planned and added slowly. Data science offers a much more flexible
approach as it means data sources can be added on the go as needed.

Transform: How the data delivers a difference to the business is
important. BI helps you answer the questions you know, whereas Data
Science helps you to discover new questions because of the way it
encourages companies to apply insights to new data.


Storage: Like any business asset, data needs to be flexible. BI systems
tend to be warehoused, which makes them difficult to deploy across the
business. Data Science results can be distributed in real time.

Data quality: Any data analysis is only as good as the quality of the data
captured. BI provides a single version of truth while data science offers
precision, confidence level and much wider probabilities with its findings.

IT owned vs. business owned: In the past, BI systems were often owned and
operated by the IT department, sending along intelligence to analysts who
interpreted it. With Data Science, the analysts are in charge. The new Big
Data solutions are designed to be owned by analysts, who spend most of
their time analyzing data and making predictions upon which to base
business decisions.
Data Science Life Cycle
1. Business Understanding:


The complete cycle revolves around the business goal. What will you solve if
you do not have a specific problem?

It is extremely important to understand the business goal clearly, because it
is the ultimate aim of the analysis.

e.g. you need to understand whether the customer wants to minimize losses, or
whether they want to predict the price of a commodity, etc.
2. Data Understanding (Data Mining):


After business understanding, the next step is data understanding.

This step includes describing the data, their structure, their relevance, and
their data types.

Explore the information using graphical plots.
3. Preparation of Data (Data Cleaning)


Next comes the data preparation stage.

This consists of steps such as selecting the relevant data, integrating the data
by merging the data sets, cleaning it, treating missing values by either removing
or imputing them, removing inaccurate data, and checking for outliers with box
plots and handling them.

Construct new data and derive new features from existing ones.

Format the data into the desired structure and remove unwanted columns and
features.

Data preparation is the most time-consuming but arguably the most essential step
in the complete life cycle.

Your model will only be as accurate as your data.
4. Exploratory Data Analysis (Data Exploration):


This step involves getting some idea about the solution and the factors affecting
it, before building the actual model.

The distribution of data within individual variables is explored graphically using
bar graphs, and relations between different features are captured using graphical
representations such as scatter plots and heat maps.

Many data visualization techniques are used to explore each feature individually
and in combination with other features.
5. Data Modelling (Feature Engineering)


Data modelling is the heart of data analysis.

A model takes the prepared data as input and gives the desired output.

This step consists of selecting the suitable kind of model, depending on whether
the problem is a classification, regression, or clustering problem.

After deciding on the model family, we need to carefully choose which algorithms
within that family to implement, and then implement them.

We need to tune the hyperparameters of each model to obtain the desired
performance.
6. Model Evaluation (Predictive Modelling):


Here the model is evaluated to check whether it is ready to be deployed.

The model is tested on unseen data and evaluated on a carefully chosen set of
evaluation metrics.

If we do not obtain a satisfactory result in the evaluation, we have to iterate
over the complete modelling process until the desired level of the metrics is
achieved.

We can build more than one model for a given phenomenon; model evaluation helps
us select and build the best one.
7. Model Deployment (Data Visualization):


After rigorous evaluation, the model is finally deployed in the desired format
and channel.

This is the last step in the data science life cycle.

Each step in the data science life cycle described above must be worked on
carefully. If any step is performed improperly, it affects the subsequent steps
and the complete effort goes to waste.

Right from business understanding to model deployment, every step has to be given
appropriate attention, time, and effort.
Data Preprocessing
Why Data Preprocessing?
● Data in the real world is dirty
➢ Incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
e.g. occupation=“ ”
➢ Noisy: containing errors or outliers
e.g. Salary=“-10”
➢ Inconsistent: containing discrepancies in codes or names
e.g. Age=“42” Birthday=“03/07/1997”
e.g. Was rating “1,2,3”, now rating “A, B, C”

● No quality data, no quality mining results!

Major tasks in Data Preprocessing
1. Data Cleaning
2. Data Integration
3. Data Reduction
4. Data Transformation
5. Discretization

1. Data Cleaning
Data cleaning tasks:

1) Fill in missing values

2) Identify outliers and smooth out noisy data

3) Correct inconsistent data

4) Resolve redundancy caused by data integration

Missing Data
● Data is not always available
e.g. many tuples have no recorded value for several attributes
● Missing data may be due to
➢ equipment malfunction
➢ inconsistent with other recorded data and thus deleted
➢ data not entered due to misunderstanding
➢ certain data may not be considered important at the time of entry

Handling of Missing values
1) Ignore the tuple: usually done when class label is missing
2) Fill in the missing value manually: tedious + infeasible?
3) Fill in it automatically with

- a global constant : e.g., “unknown”

- the attribute mean

- the attribute mean for all samples belonging to the same class

4) The most probable value
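As an illustration, a minimal pandas sketch of these strategies; the small DataFrame, its column names, and the chosen fill values are hypothetical:

import pandas as pd

# Hypothetical data set with missing values
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", None],
    "income": [50000, None, 42000, 61000, 38000],
})

# 1) Ignore the tuple (usually when the class label is missing)
df_dropped = df.dropna(subset=["class"])

# 3a) Fill in automatically with a global constant such as "unknown"
df_const = df.fillna({"class": "unknown"})

# 3b) Fill in with the attribute mean
df_mean = df.fillna({"income": df["income"].mean()})

# 3c) Fill in with the attribute mean of samples belonging to the same class
df_class_mean = df.copy()
df_class_mean["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)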

Noisy Data
● Noise: random error in a measured variable, i.e. an attribute may take an
incorrect value (an outlier). e.g. Salary=“-10”
● Incorrect attribute values may be due to
➢ faulty data collection instruments
➢ data entry problems
➢ data transmission problems

Handling of Noisy Data
● Binning method:
✔ first sort data and partition into equal-frequency (equi-depth) or Equal-width
(distance) bins
✔ then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries etc.
✔ used also for discretization (discussed later)
● Clustering
✔ detect and remove outliers
● Regression
✔ smooth by fitting the data into regression functions

Binning method
Binning Methods for Data Smoothing:
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins::
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
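A small NumPy sketch that reproduces the example above; np.array_split is used here only to form the three equal-frequency bins from the already sorted prices:

import numpy as np

prices = np.sort([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-frequency (equi-depth) partitioning: three bins of four values each
bins = np.array_split(prices, 3)

# Smoothing by bin means: every value in a bin is replaced by the bin mean
by_means = [np.full(len(b), int(round(b.mean()))) for b in bins]   # 9, 23, 29

# Smoothing by bin boundaries: each value is replaced by the closer bin boundary
by_bounds = [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins]

print(by_means)    # bins smoothed to their means
print(by_bounds)   # values pulled to the nearest bin boundary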

Cluster Analysis

Regression
● Data smoothing can also be done by regression.
● Linear regression involves finding the “best” line to fit two attributes (or variables)
so that one attribute can be used to predict the other.
● Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are fit to a multidimensional surface.
● Using regression, we find a mathematical equation that fits the data and helps to
smooth out the noise.
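A minimal sketch of regression-based smoothing with NumPy, using synthetic noisy data around the line y = x + 1 (the noise level and sample size are arbitrary):

import numpy as np

# Synthetic noisy observations around the true relation y = x + 1
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = x + 1 + rng.normal(scale=1.0, size=x.size)

# Linear regression: fit the "best" straight line y = a*x + b by least squares
a, b = np.polyfit(x, y, deg=1)

# Smoothed values: each noisy y is replaced by the value the line predicts from x
y_smooth = a * x + b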

Continued...

[Figure: regression line y = x + 1 used to predict attribute Y1 from attribute X1]
2. Data Integration
● Data integration combines data from multiple sources into a coherent store (unified
view of that data)

● Schema integration: e.g., A.cust-id = B.cust-#


Integrate metadata from different sources

● Entity identification problem:


Identify real world entities from multiple data sources
e.g., Bill Clinton = William Clinton

● Detecting and resolving data value conflicts:


➢ For the same real world entity, attribute values from different sources are
different
➢ Possible reasons: different representations, different scales, e.g., metric vs. British
units
Handling Redundancy in Data Integration

Redundant data often occur when integrating multiple databases


➔ Object identification: The same attribute or object may have different names
in different databases
➔ Derivable data: One attribute may be a “derived” attribute in another table,
e.g., annual revenue
➔ Redundant attributes may be detected by correlation analysis
➔ Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality

Correlation analysis
➢ Correlation coefficient (also called Pearson’s product moment
coefficient):

r(A,B) = Σ(A − Ā)(B − B̄) / ((n − 1) σA σB) = (Σ(A·B) − n·Ā·B̄) / ((n − 1) σA σB)

where n is the number of tuples, Ā and B̄ are the respective means of A and B, and σA
and σB are the respective standard deviations of A and B.

- If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The
higher the value, the stronger the correlation.
- rA,B < 0 : negatively correlated;

- rA,B = 0 : independent
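A short sketch computing the coefficient directly from this formula and cross-checking it with NumPy; the two attribute arrays are made-up sample values:

import numpy as np

# Two hypothetical numeric attributes measured over the same tuples
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 2.0, 5.0, 6.0])

n = len(A)
sigma_A, sigma_B = A.std(ddof=1), B.std(ddof=1)   # sample standard deviations, matching (n - 1)

# Pearson correlation coefficient, directly from the formula above
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * sigma_A * sigma_B)

# Cross-check against NumPy's built-in correlation matrix
assert np.isclose(r, np.corrcoef(A, B)[0, 1])
print(r)   # positive value, so A and B are positively correlated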

Standard Deviation
● The standard deviation (SD, also represented by the Greek letter sigma σ or the
Latin letter s) is a measure that is used to quantify the amount of variation or
dispersion of a set of data values.

σ = √( (1/N) · Σ (xᵢ − µ)² ), where N is the number of tuples and µ is the mean.


● Square of standard deviation is known as variance.
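A tiny NumPy check of these definitions on made-up values:

import numpy as np

x = np.array([4, 8, 15, 16, 23, 42], dtype=float)   # made-up values

mu = x.mean()
sigma = np.sqrt(((x - mu) ** 2).mean())   # population standard deviation

assert np.isclose(sigma, x.std())         # NumPy's default std() uses the same formula
assert np.isclose(sigma ** 2, x.var())    # variance is the square of the standard deviation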

3. Data Transformation
➔ Smoothing: remove noise from data (use of binning, regression and clustering)
➔ Aggregation: summarization (e.g. daily sales data aggregated to compute monthly
or annual sales)
➔ Generalization: concept hierarchy climbing (e.g. street can be generalized to
higher level concepts like city or country)
➔ Discretization: values of numeric attribute (e.g age) are replaced by interval
labels(0-10,10-20,..) or conceptual labels(e.g. youth, adult, senior)
➔ Normalization: Attributes are scaled to fall within a smaller range such as -1.0 to
1.0 or 0.0 to 1.0
➢ min-max normalization
➢ z-score normalization
➢ normalization by decimal scaling
➔ Attribute/feature construction: New attributes constructed from the given ones
Data Transformation: Normalization
● Min-max normalization: to [new_minA, new_maxA]

v' = ((v − minA) / (maxA − minA)) · (new_maxA − new_minA) + new_minA

Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is
mapped to

((73,600 − 12,000) / (98,000 − 12,000)) · (1.0 − 0) + 0 = 0.716

Normalization Continued...

● Z-score normalization (μ: mean, σ: standard deviation):

v' = (v − μ) / σ

Ex. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to

(73,600 − 54,000) / 16,000 = 1.225

● Normalization by decimal scaling:

v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
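A minimal NumPy sketch of the three normalization methods, reusing the income figures from the examples above (the other array values are made up):

import numpy as np

income = np.array([12_000, 30_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to the new range [0.0, 1.0]
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min()) \
         * (new_max - new_min) + new_min              # 73,600 -> 0.716

# Z-score normalization, using the mean and standard deviation from the example
mu, sigma = 54_000, 16_000
zscore = (income - mu) / sigma                        # 73,600 -> 1.225

# Decimal scaling: divide by the smallest power of 10 that brings every |v'| below 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))  # +1 guards exact powers of 10
decimal = income / 10 ** j                            # 98,000 -> 0.98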

4. Data Reduction
Why data reduction?
➢ A database/data warehouse may store terabytes of data. Complex data
analysis/mining may take a very long time to run on the complete data set
Data reduction: Obtain a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results
Data reduction strategies:
➔ Data cube aggregation
➔ Dimensionality reduction
➔ Data Compression
➔ Numerosity reduction
➔ Discretization and concept hierarchy generation

A. Data cube aggregation

B. Dimensionality reduction

● Most data mining algorithms are implemented column-wise, which makes them
slower and slower as the number of data columns (i.e. dimensions) grows.
● So it is necessary to remove unimportant attributes/dimensions from the dataset
in order to improve the performance of the data mining algorithm.
● Attribute selection methods:
1) Missing Values Ratio: Data columns with too many missing values are unlikely
to carry much useful information. Thus data columns with number of missing values
greater than a given threshold can be removed.
2) Low Variance Filter: Similarly to the previous technique, data columns with
little changes in the data carry little information. Thus all data columns with
variance lower than a given threshold are removed.
3) High Correlation Filter: Data columns with very similar trends are also likely to
carry very similar information. In this case, only one of them will suffice to feed the
machine learning model. Pairs of columns with correlation coefficient higher than a
threshold are reduced to only one.
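A rough pandas sketch of these first three filters; the function name and the threshold values are illustrative assumptions, not recommended defaults:

import pandas as pd

def reduce_columns(df, max_missing_ratio=0.4, min_variance=1e-3, max_correlation=0.95):
    """Apply the missing-values-ratio, low-variance and high-correlation filters."""
    # 1) Missing Values Ratio: drop columns with too large a fraction of missing values
    df = df[[c for c in df.columns if df[c].isna().mean() <= max_missing_ratio]]

    # 2) Low Variance Filter: drop numeric columns whose values barely change
    variances = df.select_dtypes("number").var()
    df = df.drop(columns=[c for c in variances.index if variances[c] < min_variance])

    # 3) High Correlation Filter: of each highly correlated pair of columns, keep only one
    corr = df.select_dtypes("number").corr().abs()
    to_drop = set()
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if a not in to_drop and corr.loc[a, b] > max_correlation:
                to_drop.add(b)
    return df.drop(columns=list(to_drop))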
Continued...

4) Principal Component Analysis (PCA): Principal Component Analysis (PCA) is


a statistical procedure that orthogonally transforms the original n coordinates of a
data set into a new set of n coordinates called principal components. As a result of
the transformation, the first principal component has the largest possible variance;
each succeeding component has the highest possible variance under the constraint
that it is orthogonal to (i.e., uncorrelated with) the preceding components. Keeping
only the first m < n components reduces the data dimensionality while retaining
most of the data information, i.e. the variation in the data.
5) Backward Feature Elimination: In this technique, at a given iteration, the
selected classification algorithm is trained on n input features. Then we remove one
input feature at a time and train the same model on n-1 input features n times. The
input feature whose removal has produced the smallest increase in the error rate is
removed, leaving us with n-1 input features. The classification is then repeated
using n-2 features, and so on.

Continued...

6) Forward Feature Construction: This is the inverse process to the Backward


Feature Elimination. We start with 1 feature only, progressively adding 1 feature at a
time, i.e. the feature that produces the highest increase in performance.
● Both algorithms, Backward Feature Elimination and Forward Feature Construction,
are quite time-consuming and computationally expensive. They are practically only applicable
to a data set with an already relatively low number of input columns.
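A brief scikit-learn sketch of PCA and of both sequential procedures on the built-in Iris data; note that SequentialFeatureSelector is a close cross-validated variant of Backward Feature Elimination / Forward Feature Construction, and the number of features to keep is an arbitrary choice here:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)   # 150 samples, 4 numeric features

# PCA: keep the first components that together explain 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
print(X.shape, "->", X_pca.shape)        # dimensionality after the transform
print(pca.explained_variance_ratio_)     # variance captured by each kept component

# Backward Feature Elimination: start with all features, drop the least useful one at a time
model = LogisticRegression(max_iter=1000)
backward = SequentialFeatureSelector(model, n_features_to_select=2,
                                     direction="backward").fit(X, y)

# Forward Feature Construction: start with one feature, add the most useful one at a time
forward = SequentialFeatureSelector(model, n_features_to_select=2,
                                    direction="forward").fit(X, y)

print(backward.get_support())   # boolean masks of the features kept by each strategy
print(forward.get_support())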

C. Numerosity reduction

● Reduce data volume by choosing alternative, smaller forms of data


representation
1) Parametric methods:
➢ Assume the data fits some model, estimate model parameters, store only
the parameters, and discard the data (except possible outliers)
➢ Regression, Log-Linear Models.
2) Non-parametric methods
➢ Do not assume models
➢ Major families: histograms, clustering, sampling

Regression

● Linear regression involves finding the “best” line to fit two attributes (or variables)
so that one attribute can be used to predict the other.
● Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are fit to a multidimensional surface.

Continued...

[Figure: regression line y = x + 1 used to predict attribute Y1 from attribute X1]
Histogram

● A histogram is a graphical
representation of the distribution of
numerical data.
● Divide data into buckets and store
average (sum) for each bucket
● Partitioning rules:
- Equal-width: equal bucket range
- Equal-frequency (or equal-depth)
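A small NumPy sketch of the two partitioning rules on the price values used earlier:

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width buckets: three buckets covering ranges of equal width
counts, edges = np.histogram(prices, bins=3)
print(edges)     # bucket boundaries (each bucket spans the same range)
print(counts)    # number of values falling into each bucket

# Equal-frequency (equal-depth) buckets: boundaries placed at the quantiles
eq_freq_edges = np.quantile(prices, [0, 1/3, 2/3, 1])
print(eq_freq_edges)   # each bucket holds roughly the same number of values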

Clustering

Partition the data set into clusters based on similarity, and store only the
cluster representation (e.g., centroid and diameter).

Sampling
● Sampling is the process of selecting units from a population of interest so
that by studying the sample we may fairly generalize our results back to the
population from which they were chosen
● Types of sampling:
1) Simple random sampling without replacement (SRSWOR): The
sampling units are chosen without replacement in the sense that the units once chosen
are not placed back in the population.

2) Simple random sampling with replacement (SRSWR): The sampling


units are chosen with replacement in the sense that the chosen units are placed back
in the population.

3) Cluster sample: The population is divided into N groups, called clusters. We can
randomly select n clusters to include in the sample.

4) Stratified sampling: We divide the population into separate groups, called strata.
Then, a simple random sample is drawn from each group.
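A pandas sketch of the four sampling types; the population, its size, and the sample sizes are hypothetical:

import pandas as pd

# Hypothetical population of 1,000 customers spread over two groups ("city")
population = pd.DataFrame({
    "customer_id": range(1000),
    "city": ["Pune"] * 600 + ["Mumbai"] * 400,
})

# Simple random sampling without replacement (SRSWOR)
srswor = population.sample(n=100, replace=False, random_state=0)

# Simple random sampling with replacement (SRSWR)
srswr = population.sample(n=100, replace=True, random_state=0)

# Stratified sampling: a simple random sample drawn from each stratum ("city")
stratified = population.groupby("city").sample(frac=0.1, random_state=0)

# Cluster sampling: randomly pick whole clusters and keep every unit in them
chosen = pd.Series(population["city"].unique()).sample(n=1, random_state=0)
cluster_sample = population[population["city"].isin(chosen)]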

5. Discretization
● Discretization techniques can be used to reduce the number of values for a
given continuous attribute, by dividing the attribute into a range of intervals.
● Interval labels can then be used to replace actual data values.
● Many discretization techniques can be applied recursively in order to
provide a hierarchical partitioning of the attribute values known as concept
hierarchy.

Continued...

● Discretization techniques can be categorized based on which direction it


proceeds, as:
➢ Top-down: The process starts by first finding one or a few points to split
the entire attribute range, and then repeats this recursively on the resulting
intervals.
➢ Bottom-up: Starts by considering all of the continuous values as potential
split-points, removes some by merging neighborhood values to form
intervals, and then recursively applies this process to the resulting intervals.

Continued...

● Methods for concept hierarchy generation are:


➢ Binning: Attribute values can be discretized by distributing the values into
bins and replacing each bin by the bin mean or the bin median value. This
technique can be applied recursively to the resulting partitions in order to
generate concept hierarchies.
➢ Histogram analysis: Histograms can also be used for discretization.
Partitioning rules can be applied to define range of values as
- Equal-width: equal bucket range
- Equal-frequency (or equal-depth)
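A short pandas sketch of binning-based discretization; the ages, bin counts, and concept labels are illustrative:

import pandas as pd

ages = pd.Series([3, 7, 15, 22, 29, 35, 41, 48, 56, 63, 71, 84])

# Equal-width discretization: interval labels replace the raw ages
equal_width = pd.cut(ages, bins=3)

# Equal-frequency (equal-depth) discretization
equal_depth = pd.qcut(ages, q=3)

# Conceptual labels form the next level of a simple concept hierarchy
concepts = pd.cut(ages, bins=[0, 18, 60, 120], labels=["youth", "adult", "senior"])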

Continued...

➢ Cluster analysis:
A clustering algorithm can be applied to partition data into clusters or
groups. Each cluster forms a node of a concept hierarchy, where all nodes
are at the same conceptual level. Each cluster may be further decomposed
into sub-clusters, forming a lower level in the hierarchy. Clusters may also
be grouped together to form a higher-level concept hierarchy.
➢ Data segmentation by “natural partitioning”:
Breaking up annual salaries into uniform, natural ranges such as ($50,000 -
$100,000) is often more desirable than ranges arrived at by cluster analysis.
