UNIT-1 – CHAPTER -1
Introduction to Data Science
Contents
Definition of Data Science
Big Data and Data Science hype
Getting past the hype
Datafication
Current landscape of perspectives
Statistical Inference
Populations and samples
Statistical modelling
Probability distributions
Fitting a model
Overfitting
Exploratory Data Analysis
Definition of Data Science
Data Science is the area of study that extracts, manages, manipulates, and interprets knowledge from vast amounts of data using various scientific methods, algorithms, and processes.
Data Science is a multidisciplinary field that allows
you to extract knowledge from structured or
unstructured data.
Data science enables you to translate a business
problem into a research project and then translate it
back into a practical solution.
Definition of Data Science
Data science refers to a set of theories and techniques from many fields and disciplines that are used to investigate and analyze large amounts of data to help decision makers in many industries such as science, engineering, e-commerce, economics, politics, finance, and education.
Data Science Process or Life cycle
1. Discovery: Discovery step
involves acquiring data from
all the identified internal &
external sources, which helps
you answer the business
question.
2. Preparation: Data can have many inconsistencies such as missing values, blank columns, and incorrect data formats, which need to be cleaned.
Data Science Process or Life cycle
3. Model Planning: In this stage, you determine the methods and techniques for drawing the relationship between input and output variables.
4. Model Building: The actual model building process starts here. The data scientist splits the dataset into training and testing sets (a minimal split is sketched after these steps).
Data Science Process or Life cycle
5.Operationalize: You deliver
the final baselined model
with reports, code, and
technical documents in this
stage.
6.Communicate Results: In
this stage, the key findings
are communicated to all
stakeholders.
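As a hedged illustration of the split mentioned in step 4 above, the following minimal R sketch divides a dataset into training and testing sets; the built-in mtcars data and the 70/30 ratio are assumptions chosen for illustration only.

    # Minimal sketch: split a data frame into training and testing sets
    set.seed(42)                                            # reproducible split
    data(mtcars)                                            # built-in example dataset
    n <- nrow(mtcars)
    train_idx <- sample(seq_len(n), size = floor(0.7 * n))  # 70% of rows
    train <- mtcars[train_idx, ]                            # training set
    test  <- mtcars[-train_idx, ]                           # testing set
    nrow(train); nrow(test)                                 # check the split sizes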
Applications of Data Science
Internet Search: Google Search uses data science technology to return a specific result within a fraction of a second.
Recommendation Systems: Data science is used to build recommendation systems, for example, "suggested friends" on Facebook or "suggested videos" on YouTube.
Image & Speech Recognition: Speech recognition systems like Siri, Google Assistant, and Alexa run on data science techniques. Moreover, Facebook recognizes your friends when you upload a photo with them.
Gaming world: EA Sports, Sony, and Nintendo use data science technology to enhance your gaming experience.
Online Price Comparison: PriceRunner, Junglee, and Shopzilla work on data science mechanisms.
Why Data Science is important?
To process large volumes of data: According to IDC, global data will grow to 175 zettabytes by 2025; data science provides the means to process such volumes.
Data Science enables companies to efficiently understand
complex structured data from multiple sources and derive
valuable insights to make smarter data-driven decisions.
Data Science is widely used in various industry domains,
including marketing, healthcare, finance, banking, policy
work, and more.
Big Data and Data Science hype
and
Getting past the hype
What is Big Data?
Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently.
Example of Big Data:
Social Media: Statistics show that 500+ terabytes of new data are inserted into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, comments, etc.
The New York Stock Exchange is an example of Big Data
that generates about one terabyte of new trade data per day.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Types of Big Data
Structured data: Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data (e.g., tables).
Nowadays, we foresee issues as the size of such data grows to a huge extent; typical sizes are in the range of multiple zettabytes.
Employee_ID | Employee_Name   | Gender | Department | Salary_In_lacs
2365        | Rajesh Kulkarni | Male   | Finance    | 650000
3398        | Pratibha Joshi  | Female | Admin      | 650000
7465        | Shushil Roy     | Male   | Admin      | 500000
7500        | Shubhojit Das   | Male   | Finance    | 500000
7699        | Priya Sane      | Female | Finance    | 550000
Types of Big Data
Unstructured: Any data with unknown form or structure is classified as unstructured data. In addition to being huge in size, unstructured data poses multiple challenges when it comes to processing it to derive value.
A typical example of unstructured data is a heterogeneous
data source containing a combination of simple text files,
images, videos etc. (The output of a Google search)
Types of Big Data
Semi-structured: Semi-structured data contains elements of both forms. It appears structured in form, but the structure is not formally defined (there is no fixed schema).
An example of semi-structured data is data represented in an XML file:
<rec><name>Prashant Rao</name><gen>Male</gen><age>35</age></rec>
<rec><name>Seema R.</name><gen>Female</gen><age>41</age></rec>
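As a brief, hedged illustration of how such semi-structured records can be converted into structured form, the R sketch below parses the XML above; it assumes the xml2 package is installed (install.packages("xml2")).

    # Parse the semi-structured XML records into a structured data frame
    library(xml2)
    doc <- read_xml("<people>
      <rec><name>Prashant Rao</name><gen>Male</gen><age>35</age></rec>
      <rec><name>Seema R.</name><gen>Female</gen><age>41</age></rec>
    </people>")
    names <- xml_text(xml_find_all(doc, ".//name"))
    ages  <- as.integer(xml_text(xml_find_all(doc, ".//age")))
    data.frame(name = names, age = ages)   # now in tabular (structured) form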
Characteristics of Big Data
Volume – Refers to the amount of data that exists. If the volume of
data is large enough, it can be considered big data.
Variety – Variety refers to heterogeneous sources and the nature
of data, both structured and unstructured.
Velocity – Refers to the speed at which data is generated. How fast the data is generated and processed to meet demands determines the real potential of the data.
Variability – It refers to the inconsistency which can be shown by
the data at times, thus hampering the process of being able to
handle and manage the data effectively.
Value - It refers to the value that big data can provide, and it
relates directly to what organizations can do with that collected
data.
Applications of Big Data
Banking and Insurance sectors
Communications, Media and Entertainment
Healthcare Providers
Education
Manufacturing and Natural Resources
Government
Retail and Wholesale trade
Transportation
Energy and Utilities
Limitations of Big Data
Storage: Datasets can require considerable resources to
store
Formatting and Data cleaning: Advanced formatting and
cleaning methods may be required before data analysis.
Quality control: Can be difficult and often has to be done
through small representative samples
Security and Privacy concerns: Often more complex than
for traditional data sets.
Accuracy and consistency of methods: Many approaches are relatively new and imperfect, although they may continue to improve over time.
Limitations of Data Science
Data Science is a Blurry Term: Data Science is a very general term without a precise definition. While it has become a buzzword, it is very hard to pin down the exact meaning of 'Data Scientist'.
Mastering Data Science is nearly impossible: Being a mixture of many fields, Data Science draws on Statistics, Computer Science, and Mathematics. It is hardly possible to master every field and be equally expert in all of them.
Large Amount of Domain Knowledge Required: Another disadvantage of Data Science is its dependency on domain knowledge. A person with a considerable background in Statistics and Computer Science will still find it difficult to solve a Data Science problem without the relevant domain knowledge.
Arbitrary Data May Yield Unexpected Results: A Data Scientist analyzes the data
and makes careful predictions in order to facilitate the decision-making process. Many
times, the data provided is arbitrary and does not yield expected results.
Problem of Data Privacy: For many industries, data is their fuel. Data Scientists help
companies make data-driven decisions. However, the data utilized in the process may
breach the privacy of customers.
Big Data and Data Science Hype
Given the hype around data
science, the reality is that most
companies still fail to use much of
the data they collect and store
during business activities.
Why Now: Technology makes this
possible
Infrastructure for large data
processing
Increased memory and
bandwidth
Datafication
Datafication: It is the process of “taking all aspects of life and
turning them into data”.
(or)
Datafication aims to transform most aspects of a business into
quantifiable data that can be tracked, monitored, and
analyzed.
It refers to the use of tools and processes to turn an
organization into a data-driven enterprise.
Example:
Twitter “datafies” stray thoughts
Linkedin “datafies” professional networks
Google’s augmented reality glasses “datify” gaze (looks)
Current landscape of perspectives
Data science is not merely Statistics or
Hacking or Mathematics. Data science
is the civil engineering of data. It
includes
Statistics (traditional mathematical
analysis)
Data munging (parsing, scraping, and formatting data)
Visualization (graphs, tools, etc.)
It is practical knowledge of tools and materials, coupled with a theoretical understanding of what's possible.
Current landscape of perspectives
Math and Statistics knowledge: Mathematics is the critical part of
data science. Mathematics involves the study of quantity, structure,
space, and changes. For a data scientist, a good knowledge of mathematics is essential. Statistics is one of the most important components of data science. Statistics is a way to collect and analyze numerical data in large quantities and to find meaningful insights from it.
Substantive (Domain) Expertise: The Substantive Knowledge is the
knowledge specific to the area where data science is applied. It is
often referred to as “domain knowledge”. For example, if you are
applying data science to genome problems, you should have
“substantive knowledge” on that topic.
Current landscape of perspectives
Hacking Skills: The hacking skills refer to the computer science skills.
Data is digital. In order to efficiently manipulate the data, you need to
have some programming skills. You need to be comfortable at the
command line, be able to manipulate files of different formats,
program algorithms that will modify the data, etc.
Machine Learning: Machine learning is the backbone of data science. Machine learning is about training a machine so that it can act like a human brain. In data science, we use supervised, unsupervised, and reinforcement learning algorithms to solve problems. Algorithms broadly used in data science include regression, decision trees, clustering, principal component analysis, support vector machines, Naive Bayes, artificial neural networks, and the Apriori algorithm (a clustering/PCA sketch follows below).
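As a small, hedged illustration of two of the algorithms named above, here is a base-R sketch of k-means clustering and principal component analysis on the built-in iris dataset; the dataset and the choice of three clusters are assumptions for illustration.

    # Clustering and PCA on the iris measurements (base R only)
    data(iris)
    x <- iris[, 1:4]                    # numeric columns only
    set.seed(1)
    km <- kmeans(x, centers = 3)        # k-means with 3 clusters
    table(km$cluster, iris$Species)     # compare clusters with the true species
    pca <- prcomp(x, scale. = TRUE)     # principal component analysis
    summary(pca)                        # variance explained per component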
Statistical Inference
Statistics is a branch of Mathematics, that deals with the collection,
analysis, interpretation, and the presentation of the numerical data.
The main purpose of Statistics is to make an accurate conclusion
using a limited sample about a greater population.
Types of Statistics:
Descriptive Statistics: Describes the data.
Inferential Statistics: Helps make predictions from the data.
Statistical inference is, informally, an educated guess: making inferences about something from data.
Statistical Inference
Statistical inference is the discipline concerned with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) processes.
The overall process, starting from the activities or processes in the world to the data, manipulating the data, and then going from the data back to the world, is the field of statistical inference.
Example:
Process or Activity – Employees sending and receiving emails
Data – Number of emails sent and received every day for the last 3 months
Inference – Estimate how many emails will be sent or received in the next 3 months
Statistical Inference – Process and Data
Process:
The activities or functions happening in and around the world are called processes.
One should know ways to describe, understand, and make sense of these processes, because understanding them is part of the solution to many problems.
Data:
Data represents the traces of real-world processes, and exactly which traces we gather is decided by our data collection or sampling method.
Once we have the data, to derive new ideas and to simplify the captured traces (data) into something more comprehensible, we build mathematical models or functions of the data, known as statistical models or estimators.
Note that both the process and the data are random and uncertain in nature.
Statistical Inference – Example
Example: From a shuffled pack of cards, a card is drawn. This trial is repeated 400 times, and the suits drawn are given below:

Suit               | Spade | Clubs | Hearts | Diamonds
No. of times drawn |  90   |  100  |  120   |   90

Question: When a card is drawn at random, what is the probability of getting a Diamond card?
Solution:
Total number of events = 400
Number of trials in which diamond card is drawn = 90
Therefore, P(diamond card) = 90/400 = 0.225
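The same calculation can be transcribed directly into R:

    # Empirical probability from the observed counts
    draws <- c(Spade = 90, Clubs = 100, Hearts = 120, Diamonds = 90)
    p_diamond <- draws["Diamonds"] / sum(draws)
    p_diamond                           # 0.225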
Populations and Samples
Population refers to the entire group of individuals about whom
you wish to draw conclusions.
Sample refers to the subset of people (from the population) from which you will be collecting data.
Populations and Samples
In Statistical Inference, the term Population denotes the set of objects or units, such as tweets, photographs, or stars.
The set of characteristics measured or extracted from the objects is called the observations; N denotes the number of observations in the population.
Example:
Population: The emails sent last year by employees
Observation: The sender’s name, The list of recipients,
Date sent, Text of email,
No. of characters and sentences in the email,
No. of verbs in the email, and
The length of time until first reply.
Populations and Samples
Sample refers to a subset of units of size n from the population, considered in order to examine the observations, draw conclusions, and make inferences about the population.
There are different ways that can be followed for getting this subset
of data, which are called sampling mechanisms.
Note that some sampling mechanisms may introduce biases into the data and distort it. Once that happens, any conclusions you draw will simply be wrong and distorted.
Populations and Samples
Example: Employee Emails
Sample 1: 1/10 of employees, together with their emails, chosen at random
Sample 2: 1/10 of emails, together with their senders, chosen at random
If we counted how many email messages each person sent and used that to estimate the underlying distribution of emails sent by all employees, these two samples could give entirely different answers.
Notice that even something as basic as counting can get distorted by the choice of sampling method (a simulation sketch follows below).
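The distortion can be made concrete with a small, hedged simulation in R; all numbers below (1,000 employees, the skewed email counts) are invented purely for illustration.

    # Simulate a skewed population: a few employees send most of the email
    set.seed(7)
    n_emp <- 1000
    sent  <- rpois(n_emp, lambda = 5) + rbinom(n_emp, 1, 0.05) * 200
    mean(sent)                                    # true mean emails per employee

    # Sample 1: 1/10 of EMPLOYEES at random -> roughly unbiased estimate
    mean(sample(sent, n_emp / 10))

    # Sample 2: 1/10 of EMAILS at random -> heavy senders dominate the sample,
    # so per-person counts computed from it are biased upward
    sender  <- rep(seq_len(n_emp), times = sent)  # one entry per email
    sampled <- sample(sender, floor(length(sender) / 10))
    mean(table(sampled)) * 10                     # distorted per-person estimate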
Populations Vs Samples
Basis for Comparison | Population | Sample
Meaning | Population refers to the collection of all elements possessing common characteristics that comprises the universe. | Sample means a subgroup of the members of the population chosen for participation in the study.
Includes | Each and every unit of the group. | Only a handful of units of the population.
Characteristic | Parameter | Statistic
Data collection | Complete enumeration or census | Sample survey or sampling
Focus on | Identifying the characteristics. | Making inferences about the population.
Big Data – Population and Samples
This Big Data world is defined by the enormous amount of ever-expanding, diverse data being generated, collected, and analyzed by researchers and practitioners alike.
While large data sets allow us to gain useful insights about general
trends, smaller segments contained within the larger data set are still
useful.
For example, consider how the concept of personalization works (personalized medicine). Here, from the large data set, we create smaller, homogeneous data sets to make predictions within smaller groups.
Big Data – Population and Samples
In this context, one can apply the concepts of population and samples to derive useful insights from smaller data sets (samples) drawn from larger data sets (populations).
Issues that need to be addressed:
Sampling solves some engineering challenges
Hidden biases of big data
Sampling method
Underlying assumptions
Sampling distribution
Modelling
Modelling is describing mathematically a situation in reality for the purpose of solving a problem or finding an answer to a question in that situation (from data).
The modelling process is an iterative process that requires creativity and inventiveness, in which mathematical, scientific, and technical knowledge is applied to describe new situations (data).
The modelling process consists of activities related to
determining a strategy to design the model,
analyzing or getting to the bottom of the problem,
choosing variables, setting up relations between the variables, and
deploying mathematical and computational tools.
Modelling - Examples
Architects capture attributes of buildings through blueprints and three-dimensional, scaled-down versions.
Molecular biologists capture protein structure with three-dimensional visualizations of the connections between amino acids.
Statisticians and data scientists capture the uncertainty and randomness of data-generating processes with mathematical functions.
Note that, a model is an artificial construction where
all external detail has been removed or abstracted.
Modelling - Activities
On the left-hand side are
activities related to research,
such as collecting data that are
used in the model and/or can
be used to assess the modelling
results.
On the right-hand side are
conceptual activities that must
lead to the development of a
model, including creative
thinking and formulating
hypotheses to be tested.
How to Build a Model
The key steps involved in Data Science Modelling are:
Step 1: Understanding the Problem:
The first step involved in Data Science Modelling is understanding the
problem. A Data Scientist listens for keywords and phrases when
interviewing a line-of-business expert about a business challenge. The
Data Scientist breaks down the problem into a procedural flow that
always involves a holistic understanding of the business challenge.
Step 2: Data Extraction:
Not just any data, but the unstructured data pieces you collect that are relevant to the business problem you are trying to address. Data extraction is done from various sources: online, surveys, and existing databases.
How to Build a Model
Step 3: Data Cleaning: Data Cleaning is needed because you must sanitize data while gathering it. The following are some of the most typical causes of data inconsistencies and errors:
Duplicate items drawn from a variety of databases.
Errors in the precision of the input data.
Changes, updates, and deletions made to the data entries.
Variables with missing values across multiple databases.
Step 4: Exploratory Data Analysis: Exploratory Data Analysis (EDA) is a robust technique for familiarizing yourself with data and extracting useful insights. Data scientists use statistics and visualization tools to summarize central measurements and variability when performing EDA.
Step 5: Feature Selection: Feature Selection is the process of identifying and
selecting the features that contribute the most to the prediction variable or
output that you are interested in, either automatically or manually.
How to Build a Model
Step 6: Incorporating Machine Learning Algorithms
This is one of the most crucial processes in Data Science Modelling, as the Machine Learning Algorithm aids in creating a usable Data Model. There are many algorithms to pick from, and the model is selected based on the problem. Three types of Machine Learning methods are incorporated:
1) Supervised Learning
Linear Regression
Random Forest
Support Vector Machines
KNN (k-Nearest Neighbors)
2) Unsupervised Learning
K-means Clustering
Hierarchical Clustering
Anomaly Detection
3) Reinforcement Learning
Q-Learning
State-Action-Reward-State-Action (SARSA)
Deep Q Network
How to Build a Model
Step 7: Testing the Models:
The Data Model is applied to the Test Data to check if it’s
accurate and houses all desirable features. You can further test
your Data Model to identify any adjustments that might be
required to enhance the performance and achieve the desired
results.
Step 8: Deploying the Model:
The model that provides the best results based on the test findings is finalized and deployed in the production environment once the desired result has been achieved through proper testing as per the business needs.
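To make steps 6 and 7 concrete, here is a hedged end-to-end sketch in R; the built-in mtcars data, the logistic model (glm), the predictors wt and hp, and the 70/30 split are all illustrative assumptions rather than part of the process description above.

    # Fit a model on training data, then test it on held-out data
    set.seed(123)
    n     <- nrow(mtcars)
    idx   <- sample(seq_len(n), size = floor(0.7 * n))
    train <- mtcars[idx, ]
    test  <- mtcars[-idx, ]
    fit  <- glm(am ~ wt + hp, data = train, family = binomial)  # logistic model
    prob <- predict(fit, newdata = test, type = "response")     # predicted P(am = 1)
    pred <- ifelse(prob > 0.5, 1, 0)                            # classify at 0.5
    mean(pred == test$am)                                       # test-set accuracy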
How to Build a Statistical Model - Issues
The major issues involved in building a model are:
The underlying process behind the problem
Assumptions about the problem
Simple vs. complex models
Mathematical expressions vs. visualization methods
Probability Distributions - Variables
A variable is a quantity whose value changes.
A discrete variable is a variable whose value is obtained by counting.
Example: number of students present
A continuous variable is a variable whose value is obtained by measuring.
Example: heights of all the students in class
A random variable is a variable whose value is a numerical outcome of a
random phenomenon.
The probability distribution of a random variable X tells what the possible values of X are and how probabilities are assigned to them.
A random variable can be discrete or continuous (a small sketch follows below).
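As a small, hedged illustration, the R sketch below draws values of a discrete and a continuous random variable; the distributions and their parameters are assumptions chosen for illustration.

    # Discrete vs. continuous random variables
    set.seed(1)
    flips   <- rbinom(10, size = 1, prob = 0.5)   # discrete: 10 coin flips (0/1)
    heights <- rnorm(10, mean = 165, sd = 8)      # continuous: heights in cm
    table(flips)                                  # counted outcomes
    summary(heights)                              # measured, real-valued outcomes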
Probability Distributions
A statistical model is a non-deterministic model, in which variables are stochastic (random) in nature, i.e., they have probability distributions. Probability distributions are therefore the foundation of statistical models.
A probability distribution is a mathematical function that describes the
probability of different possible values of a variable. Probability
distributions are often depicted using graphs or probability tables.
Example: a single coin flip

Outcome     | Heads | Tails
Probability |  0.5  |  0.5

(figure: a second example showed the distribution of test scores)
Probability Distributions - Types
Three cases are considered here:
1. Probability distribution of one random variable
2. Probability distribution of multiple random variables (joint probability distribution)
Joint Probability: Probability of events A and B.
Conditional Probability: Probability of event A given event B.
3. Independence and exclusivity of events
Probability Distributions - Types
Probability distribution of One Random Variable:
It quantifies how likely a specific outcome is for a random variable, such as the flip of a coin, the roll of a die, or drawing a playing card from a deck.
For a random variable x, P(x) is a function that assigns a probability
to all values of x.
Probability Distribution of x = P(x)
Probability is calculated as the number of desired outcomes divided
by the total possible outcomes.
Probability = (number of desired outcomes) / (total number of possible outcomes)
For example, the probability of rolling a 5 with a die is calculated as the one outcome of rolling a 5 (1) divided by the total number of discrete outcomes (6): 1/6, or about 0.1667 (16.67%). A quick check in R follows below.
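The die example can be checked in R, both exactly and by simulation:

    # Probability of rolling a 5 with a fair die
    p_exact <- 1 / 6                          # one desired outcome out of six
    rolls   <- sample(1:6, 10000, replace = TRUE)
    p_sim   <- mean(rolls == 5)               # relative frequency of a 5
    c(exact = p_exact, simulated = p_sim)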
Probability Distributions - Types
Probability distribution of One Random Variable: Example
Let x be a random variable: the amount of time until the next bus arrives.
Let p(x) be the corresponding probability distribution (density), which maps x to a positive real number, and assume the density of the next bus's arrival time is given.
Then the probability (likelihood) of the next bus arriving between 12 and 13 minutes is the area under p(x) on that interval:

P(12 <= x <= 13) = integral of p(x) dx, from x = 12 to x = 13

A numerical version of this calculation, under an assumed density, follows below.
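Since the slide does not specify p(x), the R sketch below assumes an exponential density with a mean of 10 minutes purely for illustration, and evaluates the integral numerically with base R's integrate().

    # P(12 <= x <= 13) under an ASSUMED exponential waiting-time density
    p <- function(x) dexp(x, rate = 1 / 10)       # assumed density, mean 10 min
    integrate(p, lower = 12, upper = 13)$value    # area under p(x) on [12, 13]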
Probability Distributions - Types
Probability distribution of 2 Random Variable: (Joint Probability)
The probability of two (or more) events is called the joint probability. The
joint probability of two or more random variables is referred to as the joint
probability distribution.
For the random variables x and y, P(x, y) is a joint probability, represented as
Probability Distribution P(x, y) = P(x and y) = P(x | y) * P(y)
which reduces to P(x) * P(y) when x and y are independent.
The calculation of the joint probability is sometimes called the fundamental rule of probability, the "product rule" of probability, or the "chain rule" of probability.
Probability Distributions - Types
Probability distribution of 2 Random Variable: (Joint Probability) - Example
Example: What is the joint probability of drawing a king that is black?
Event "A" = the probability of drawing a king = 4/52 = 0.0769
Event "B" = the probability of drawing a black card = 26/52 = 0.50
Since a card's rank and colour are independent, the joint probability of events "A" and "B" is
P(A) * P(B) = (4/52) * (26/52) = 0.0385, or about 3.9% (matching the direct count of 2 black kings out of 52 cards).
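The same example can be computed, and then checked by simulation, in R; the rank/suit encoding of the deck is an illustrative assumption (rank 13 stands for the king).

    # Exact joint probability
    p_king  <- 4 / 52
    p_black <- 26 / 52
    p_king * p_black                               # 0.0385, about 3.9%

    # Simulated check: draw random cards from a 52-card deck
    set.seed(9)
    deck  <- expand.grid(rank = 1:13, suit = c("S", "H", "D", "C"))
    cards <- deck[sample(nrow(deck), 100000, replace = TRUE), ]
    mean(cards$rank == 13 & cards$suit %in% c("S", "C"))   # black kings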
Probability Distributions - Types
Probability distribution of two Random Variables: (Conditional Probability)
The probability of one event given the occurrence of another event is called the conditional probability.
The conditional probability of one variable given one or more other random variables is referred to as the conditional probability distribution.
The conditional probability for events A given event B is calculated
as follows:
P(A | B) = P(A given B) = P(A and B) / P(B)
Note:
This notation assumes that the probability of event B is not zero.
The notion of event A given event B does not mean that event B has
occurred, instead, it is the probability of event A occurring after or in
the presence of event B for a given trial.
Probability Distributions - Types
Probability distribution of 2 Random Variable: (Conditional Probability) - Example
Example:
Susanth took two tests. The probability of passing both tests is 0.6. The probability of passing the first test is 0.8. What is the probability of passing the second test given that she has passed the first test?
Solution: P(Second | First) = P(First and Second) / P(First) = 0.6 / 0.8 = 0.75.
Different Probability Distribution Functions
(figure: common probability distribution functions)
Fitting a model
Model fitting is the measure of how well a mathematical model generalizes to data similar to that with which it was trained.
A good model fit refers to a model that accurately approximates the
output when it is provided with test inputs.
Fitting a model means estimating the parameters of the model using the observed data: we use the data as evidence to approximate the real-world mathematical process that generated it.
When coding your model, the code reads the data and you specify the functional form of the model.
R or Python will then use built-in optimization methods to give you the most likely values of the parameters given the data (a minimal sketch follows below).
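A minimal sketch of this in R, assuming the built-in mtcars data and a simple linear functional form; lm() estimates the parameters (here by least squares):

    # Specify the functional form; R estimates the parameters from the data
    data(mtcars)
    fit <- lm(mpg ~ wt, data = mtcars)   # mpg as a linear function of weight
    coef(fit)                            # estimated intercept and slope
    summary(fit)$r.squared               # how well the model fits the data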
Fitting a model
Fitting a model refers to adjusting the parameters in the model to
improve accuracy. The process involves
Running an algorithm on data for which the target variable is
known to produce a mathematical model.
Then, the model’s outcomes are compared to the real,
observed values of the target variable to determine the
accuracy.
The next step involves adjusting the algorithm’s standard
parameters in order to reduce the level of error and make the
model more accurate.
This process is repeated several times until the model finds the
optimal parameters to make predictions with substantial
accuracy.
Overfitting and Underfitting
When random fluctuations or noise in the training data are picked up and learned as concepts by the model, the model "overfits".
Overfitting negatively impacts the performance of the model on
new data.
It will perform well on the training set, but very poorly on the test
set. This negatively impacts the model’s ability to generalize and
make accurate predictions for new data.
Overfitting and Underfitting
Underfitting happens when the model can neither sufficiently model the training data nor generalize to new data.
An underfit model is not a suitable model; this will be obvious, as it will perform poorly even on the training data (a demonstration follows below).
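Both failure modes can be demonstrated with a hedged R sketch: polynomial models of increasing degree are fitted to noisy data invented for illustration. The low-degree model underfits (poor even on training data), while the high-degree model overfits (low training error, high test error).

    # Under- and overfitting with polynomials of increasing degree
    set.seed(42)
    x <- runif(60, 0, 1)
    y <- sin(2 * pi * x) + rnorm(60, sd = 0.3)    # noisy nonlinear data
    train <- 1:40; test <- 41:60
    for (deg in c(1, 3, 12)) {                    # underfit, good fit, overfit
      fit  <- lm(y ~ poly(x, deg), subset = train)
      pred <- predict(fit, newdata = data.frame(x = x[test]))
      cat("degree", deg,
          "train RMSE:", sqrt(mean(resid(fit)^2)),
          "test RMSE:",  sqrt(mean((y[test] - pred)^2)), "\n")
    }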
Data Science Process
The complete picture of data science process can be
depicted as shown below.
Data Science Process
Inside the Real World are lots of raw data: logs, Olympics records, employee emails, or
recorded genetic material.
We want to process this to make it clean for analysis. So we build and use pipelines of
data munging: joining, scraping, wrangling, or whatever you want to call it. To do this we
use tools such as Python, shell scripts, R, SQL, or all of them.
Once we have this clean dataset, we should do some kind of EDA. In the course of doing EDA, we may realize that the data isn't actually clean, because of duplicates, missing values, absurd outliers, or data that wasn't logged or was logged incorrectly. If that's the case, we may have to go back to collect more data or spend more time cleaning the dataset.
Next, we design the model using some algorithm like k-nearest neighbors (k-NN), linear regression, Naive Bayes, or something else. The model we choose depends on the type of problem we're trying to solve.
We then can interpret, visualize, report, or communicate our results. This could take the
form of reporting the results up to business to make decisions.
Alternatively, the goal may be to build or prototype a “data product”; e.g., a spam
classifier, or a search ranking algorithm, or a recommendation system.
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach to analyzing data using visual techniques.
It is used to discover trends, patterns, or to check assumptions
with the help of statistical summary and graphical
representations.
Exploratory data analysis is a significant step to take before
diving into statistical modeling or machine learning, to ensure
the data is really what it is claimed to be and that there are no
obvious errors.
EDA should be part of data science projects in every
organization.
Objectives of Exploratory Data Analysis
The goal of EDA is to allow data scientists to get deep insight into a
data set and at the same time provide specific outcomes that a
data scientist would want to extract from the data set. It includes:
List of outliers
Estimates for parameters
Uncertainties for those estimates
List of all important factors
Conclusions or assumptions as to whether certain
individual factors are statistically essential
Optimal settings
A good predictive model
Exploratory Data Analysis Tools
The basic tools of EDA are plots, graphs and summary statistics.
EDA is a method of systematically going through the data to do the following (a minimal R sketch follows this list):
Plotting distributions of all variables (using box plots),
plotting time series of data,
transforming variables,
looking at all pairwise relationships between variables using
scatterplot matrices,
Generating summary statistics.
Computing the mean, minimum, maximum, the upper and lower
quartiles, and identifying outliers.
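A minimal sketch of these tools in base R, using the built-in airquality dataset as an assumed example:

    # Basic EDA: summary statistics, distributions, and pairwise relationships
    data(airquality)
    summary(airquality)                          # summary statistics per variable
    boxplot(airquality$Ozone, main = "Ozone")    # distribution of one variable
    plot(airquality$Temp, type = "l",
         main = "Temperature over time")         # simple time-series plot
    pairs(airquality[, 1:4])                     # scatterplot matrix of pairs
    quantile(airquality$Ozone, na.rm = TRUE)     # quartiles, useful for outliers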
UNIT-1 – CHAPTER -2
Basics of R Language
Contents
Introduction
R- Environment Setup
Programming with R
Basic Data Types.
Introduction to R language
Introduction:
R is an open-source programming language and environment
used for statistical analysis, data visualization, and data science.
Being open-source, R has a massive community that
continuously works to improve the environment as well as helps
members worldwide to improve and innovate.
It has over 10,000 different libraries and packages to enhance
and add on to its already significant capabilities.
Introduction to R language
History:
R is an extension of the S programming language, which was created by John Chambers at Bell Laboratories (then part of AT&T) in 1976. S was a premier tool for statistical research.
In 1992, Ross Ihaka and Robert Gentleman created R at the
University of Auckland, New Zealand, as a tool that their students
could learn and use easily.
Ihaka and Gentleman released the initial version in 1995, and a
stable beta version was released in 2000.
Introduction to R language
Advantages/Features:
Open source: R is an open-source environment. It is cost-effective for projects of any size and is widely available.
Advanced graphics: R has various libraries and packages available for plotting attractive and elegant graphs. These can also be used to create highly interactive graphics for data-driven storytelling.
R has a massive community that works tirelessly to improve and add to R's abilities. CRAN, the Comprehensive R Archive Network, has over 10,000 packages or extensions, covering everything from producing high-definition graphics to creating interactive web apps.
Introduction to R language
Advantages/Features:
R can perform complex mathematical and statistical operations on vectors, matrices, data frames, arrays, and other data objects of varying sizes (a small sketch follows after this list).
R is an interpreted language and does not need a compiler. It generates machine-independent code that is easy to debug and is highly portable.
R is a comprehensive programming language that supports
object-oriented as well as procedural programming with
generic and first-class functions.
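A small sketch of the vectorized operations mentioned above (the values are arbitrary examples):

    # Vectors, matrices, and data frames in base R
    v <- c(1, 2, 3, 4)             # numeric vector
    v * 2                          # elementwise arithmetic
    m <- matrix(1:6, nrow = 2)     # 2 x 3 matrix
    m %*% t(m)                     # matrix multiplication (2 x 2 result)
    df <- data.frame(name = c("A", "B"), score = c(90, 85))
    mean(df$score)                 # statistics on a data frame column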
Introduction to R language
Advantages/Features:
R supports both a Command Line Interface and a Graphical User Interface, allowing users to program at the console level as well as work with scripts.
R supports a wide variety of packages to handle problems in the areas of finance, healthcare, high-performance computing, distributed computing, statistics, and many more.
Compatible with various other technologies: R can integrate
with a number of different technologies and programming
languages.
Introduction to R language
Disadvantages:
R seems relatively easy to learn at the beginning, but it is hard to master.
Command-based R can be highly inconvenient for statisticians and non-computing professionals to use.
R commands do not concern themselves with memory management, and R can therefore consume a large amount of memory.
Due to the large number of packages available and the redundancy among them, some packages can be of poor quality.
R- Environment Setup – Install R on windows
Step – 1: Go to CRAN R project website.
Step – 2: Click on the Download R for Windows link.
R- Environment Setup – Install R on windows
Step – 3: Click on the base subdirectory link or install R for the first
time link.
Step – 4: Click Download R 3.3.4 for Windows and save the executable
.exe file.
R- Environment Setup – Install R on windows
Step – 5: Run the .exe file and follow the installation instructions.
5.a. Select the desired language and then click Next.
R- Environment Setup – Install R on windows
Step – 5: Run the .exe file and follow the installation instructions.
5.b. Read the license agreement and click Next.
R- Environment Setup – Install R on windows
Step – 5: Run the .exe file and follow the installation instructions.
5.c. Select the components you wish to install (it is recommended to
install all the components). Click Next.
R- Environment Setup – Install R on windows
Step – 5: Run the .exe file and follow the installation instructions.
5.d. Enter/browse the folder/path you wish to install R into and
then confirm by clicking Next.
R- Environment Setup – Install R on windows
Step – 5: Run the .exe file and follow the installation instructions.
5.e. Select additional tasks like creating desktop shortcuts etc. then
click Next.
R- Environment Setup – Install R on windows
Step – 5: Run the .exe file and follow the installation instructions.
5.f. Wait for the installation process to complete.
R- Environment Setup – Install R on windows
Step – 5: Run the .exe file and follow the installation instructions.
5.g. Click on Finish to complete the installation.
R- Environment Setup – Install R Studio on windows
Step – 1: To begin, go to the RStudio download page and click the download button for RStudio Desktop.
Step – 2: Click on the link for the windows version of RStudio and save the .exe
file.
Step – 3: Run the .exe and follow
the installation instructions.
3.a. Click Next on the welcome
window.
R- Environment Setup – Install R Studio on windows
3.b. Enter/browse the path to the installation folder and
click Next to proceed.
R- Environment Setup – Install R Studio on windows
3.c. Select the folder for the start menu shortcut or click on do
not create shortcuts and then click Next.
R- Environment Setup – Install R Studio on windows
3.d. Wait for the installation process to complete.
R- Environment Setup – Install R Studio on windows
3.e. Click Finish to end the installation.
R- Environment Setup – Install R on Linux
R- Environment Setup – Install R Studio on Linux