Data Science - UNIT-2 - Notes

The document discusses four units related to data science and Microsoft Excel. Unit I introduces Excel tables, formulas, functions and data visualization. Unit II covers introduction to data science, probability theory, and importing SQL data into Excel. Unit III discusses machine learning algorithms, linear regression, and data standardization in Excel. Unit IV focuses on data visualization using charts and statistical analysis in Excel.

Uploaded by

catsa dogga

Unit-I

Introduction to Microsoft Excel


Creating Excel tables, understand how to Add, Subtract, Multiply, Divide in Excel. Excel Data Validation, Filters,
Grouping. Introduction to formulas and functions in Excel. Logical functions (operators) and conditions. Visualizing
data using charts in Excel. Import XML Data into Excel, How to Import CSV Data (Text) into Excel, How to Import
MS Access Data into Excel, Working with Multiple Worksheets.
Unit-II
Introduction to Data Science: What is Data Science? Probability theory, Bayes theorem, Bayes probability;
Cartesian plane, equations of lines, graphs; exponents. Introduction to SQL: SQL: creation, insertion, deletion,
retrieval of Tables by experimental demonstrations. Import SQL Database Data into Excel
Unit-III
Data science components: Tools for data science, definition of AI, types of machine learning (ML), list of ML
algorithms for classification, clustering, and feature selection. Description of linear regression and Logistic
Regression. Introducing the Gaussian, Introduction to Standardization, Standard Normal Probability Distribution
in Excel, Calculating Probabilities from Z-scores, Central Limit Theorem, Algebra with Gaussians, Markowitz
Portfolio Optimization, Standardizing x and y Coordinates for Linear Regression, Standardization Simplifies
Linear Regression, Modeling Error in Linear Regression, Information Gain from Linear Regression.
Unit-IV
Data visualization using scatter plots, charts, graphs, histograms and maps: Statistical Analysis: Descriptive
statistics- Mean, Standard Deviation for Continuous Data, Frequency, Percentage for Categorical Data.
Applications of Data Science: Data science life cycle, Applications of data science with demonstration of
experiments by using Microsoft Excel.
Unit-II
Introduction to Data Science: What is Data Science? Probability theory, Bayes theorem, Bayes probability;
Cartesian plane, equations of lines, graphs; exponents. Introduction to SQL: SQL: creation, insertion,
deletion, retrieval of Tables by experimental demonstrations. Import SQL Database Data into Excel
Introduction to Data Science

The Ascendance of Data


We live in a world that’s drowning in data. Websites track every user’s every click. Your smartphone is
building up a record of your location and speed every second of every day. “Quantified selfers” wear pedometers-
on-steroids that are ever recording their heart rates, movement habits, diet, and sleep patterns. Smart cars collect
driving habits, smart homes collect living habits, and smart marketers collect purchasing habits. The Internet itself
represents a huge graph of knowledge that contains (among other things) an enormous cross-referenced
encyclopedia; domain-specific databases about movies, music, sports results, pinball machines, memes, and
cocktails; and too many government statistics (some of them nearly true!) from too many governments to wrap
your head around. Buried in these data are answers to countless questions that no one’s ever thought to ask.

What Is Data Science?


A data scientist is someone who knows more statistics than a computer scientist and more computer
science than a statistician. In fact, some data scientists are — for all practical purposes — statisticians, while
others are pretty much indistinguishable from software engineers. Some are machine-learning experts, while
others couldn’t machine-learn their way out of kindergarten. Some are PhDs with impressive publication records,
while others have never read an academic paper. A data scientist is someone who extracts insights from messy
data. Today’s world is full of people trying to turn data into insight. Data science is an interdisciplinary field that
uses scientific methods, processes, algorithms and systems to extract valuable information from large volumes of
structured and unstructured data. The life cycle of data science is as follows:
• The business requirement step deals with identifying the problem and the objectives of the organization.
It also identifies the parameters that are to be forecasted or predicted.
• The data acquisition step deals with finding the sources of data, collecting the data, and storing it so that
the information of interest that meets the business requirements can be found.
• The data processing step transforms the data into a form better suited to finding the required
information. The major task of this step is data cleaning, i.e. the removal of unwanted data from the available
raw data.
• The data exploration step is a brainstorming step in which patterns are identified. Here visualization
charts are used to extract the required information.
• The data modeling step deals with building data models and training the models using the data sets. It
uses machine learning algorithms and techniques for better prediction and forecasting.
• The deployment step deals with the deployment of the model in the business environment.

Probability theory

There are many sources of uncertainty in AI, including variance in the specific data values, the sample of data
collected from the domain, and the imperfect nature of any models developed from such data.
• Uncertainty is the biggest source of difficulty for beginners in machine learning, especially developers.
• Noise in data, incomplete coverage of the domain, and imperfect models provide the three main sources
of uncertainty in machine learning.
• Probability provides the foundation and tools for quantifying, handling, and harnessing uncertainty in
applied machine learning.
Uncertainty means working with imperfect or incomplete information. Probability is a numerical description of
how likely an event is to occur or how likely it is that a proposition is true. Probability is a number between 0 and
1, where, roughly speaking, 0 indicates impossibility and 1 indicates certainty.
How to compute the probability?
Given: A statistical experiment has n equally likely outcomes, of which r are “successes”.
Find: The probability of a successful outcome (S).
P(S) = Number of Successes / Total Number of Outcomes = r/n
Example 1:
Given: 10 marbles: 2 red, 3 green, 5 blue.
Find: The probability of selecting a green marble.
Solution: P(G) = 3/10 = 0.30
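The r/n rule can be checked with a few lines of Python. This is only a minimal sketch: the helper name `probability` is made up for illustration, and exact fractions are used so the result is not affected by rounding.

```python
from fractions import Fraction


def probability(successes: int, total: int) -> Fraction:
    """Return r/n for an experiment with `total` equally likely outcomes.

    `probability` is a hypothetical helper name, not part of the notes.
    """
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need total > 0 and 0 <= successes <= total")
    return Fraction(successes, total)


# Marble counts taken from the example: 2 red, 3 green, 5 blue.
marbles = {"red": 2, "green": 3, "blue": 5}
total = sum(marbles.values())

p_green = probability(marbles["green"], total)
print(p_green, float(p_green))  # 3/10 0.3
```

The same helper gives 2/10 for red and 5/10 for blue, and the three probabilities add up to 1.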
A random variable is a variable whose value is the numerical outcome of a random experiment.
Example: Throw a die once.
Random variable X = "The score shown on the top face". X could be 1, 2, 3, 4, 5 or 6, so the sample space is
{1, 2, 3, 4, 5, 6}.
We can show the probability of any one value using this style: P(X = value) = probability of that value
X = {1, 2, 3, 4, 5, 6}
In this case they are all equally likely, so the probability of any one is 1/6:
• P(X = 1) = 1/6
• P(X = 2) = 1/6
• P(X = 3) = 1/6
• P(X = 4) = 1/6
• P(X = 5) = 1/6
• P(X = 6) = 1/6
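The list of probabilities for the die can be written as a small lookup table. The sketch below assumes a fair six-sided die; the variable names `pmf` and `p_even` are illustrative only.

```python
from fractions import Fraction

# Fair die: every face is equally likely (assumption taken from the example).
faces = [1, 2, 3, 4, 5, 6]
pmf = {face: Fraction(1, len(faces)) for face in faces}

# The probabilities of all possible values must add up to 1.
total_probability = sum(pmf.values())

# Probability of a compound event: add the probabilities of its outcomes.
p_even = sum(p for face, p in pmf.items() if face % 2 == 0)

print(pmf[4], total_probability, p_even)  # 1/6 1 1/2
```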
Atomic event: A complete specification of the state of the world about which the agent is uncertain. E.g., if the
world consists of only two Boolean variables, Cavity and Toothache, then there are 4 distinct atomic events:
Cavity = false ∧ Toothache = false
Cavity = false ∧ Toothache = true
Cavity = true ∧ Toothache = false
Cavity = true ∧ Toothache = true
Atomic events are mutually exclusive and exhaustive.
Mutually exclusive means that if one atomic event is true, then all other atomic events are false. Exhaustive
means that at least one atomic event is always true.
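With two Boolean variables there are 2 × 2 = 4 atomic events, and they can be enumerated mechanically. A minimal sketch using the standard library (the variable names come from the example; everything else is illustrative):

```python
from itertools import product

variables = ["Cavity", "Toothache"]

# Every combination of truth values is one atomic event.
atomic_events = [
    dict(zip(variables, values))
    for values in product([False, True], repeat=len(variables))
]

for event in atomic_events:
    print(event)

# Exactly one atomic event matches any given state of the world,
# which is what "mutually exclusive and exhaustive" means.
world = {"Cavity": True, "Toothache": False}
matches = [event for event in atomic_events if event == world]
print(len(atomic_events), len(matches))  # 4 1
```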
Joint Probability
It is the likelihood of more than one event occurring at the same time.
Two cases arise when combining events:
1. Mutually exclusive events (without common outcomes)
2. Non-mutually exclusive events (with common outcomes)
Mutually exclusive means that A and B cannot occur together, i.e. P(A and B) = 0, so the probability of A or B is
the sum of the individual probabilities, i.e. P(A or B) = P(A) + P(B)
For non-mutually exclusive events, the probability of A or B is the sum of the individual probabilities minus the
probability of both occurring, i.e.
P(A or B) = P(A) + P(B) - P(A and B)
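Both cases can be verified by counting outcomes of a single die throw. The events below are chosen only for illustration and are not part of the notes: A = "even score", B = "score greater than 3", C = "score is 1", D = "score is 6".

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}


def prob(event: set) -> Fraction:
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


# Non-mutually exclusive events: they share the outcomes 4 and 6.
A = {2, 4, 6}
B = {4, 5, 6}
p_a_or_b = prob(A) + prob(B) - prob(A & B)

# Mutually exclusive events: no common outcome, so P(C and D) = 0.
C = {1}
D = {6}
p_c_or_d = prob(C) + prob(D)

print(p_a_or_b, p_c_or_d)  # 2/3 1/3
```

In both cases the result agrees with counting the outcomes of the union directly, `prob(A | B)` and `prob(C | D)`.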
The conditional probability of an event B in relationship to an event A is the probability that event B occurs given
that event A has already occurred. The notation for conditional probability is P(B|A), read as the probability of B
given A, i.e. the probability of B given that event A has already occurred.
When two events, A and B, are dependent, the probability of both occurring is:
P(A and B) = P(A) × P(B|A)
Rearranging gives P(B|A) = P(A and B) / P(A), provided P(A) > 0.
Problem 1: A math teacher gave her class two tests. 25% of the class passed both tests and 42% of the class passed
the first test. What percent of those who passed the first test also passed the second test?
Answer: P(Second | First) = P(First and Second)/P(First) = 0.25/0.42 ≈ 0.60 = 60%
Problem 2: A jar contains black and white marbles. Two marbles are chosen without replacement. The probability
of selecting a black marble and then a white marble is 0.34, and the probability of selecting a black marble on the
first draw is 0.47. What is the probability of selecting a white marble on the second draw, given that the first
marble drawn was black?
Answer: P(White | Black) = P(Black and White)/P(Black) = 0.34/0.47 ≈ 0.72 = 72%
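The two answers can be reproduced with a short script. This is a sketch only; `conditional` is a hypothetical helper name, and the inputs are the figures given in Problems 1 and 2.

```python
def conditional(p_both: float, p_given: float) -> float:
    """Return P(B|A) = P(A and B) / P(A); requires P(A) > 0."""
    if not 0 < p_given <= 1:
        raise ValueError("P(A) must be in (0, 1]")
    if not 0 <= p_both <= p_given:
        raise ValueError("P(A and B) must be between 0 and P(A)")
    return p_both / p_given


# Problem 1: passed both tests = 0.25, passed the first test = 0.42.
p_second_given_first = conditional(0.25, 0.42)

# Problem 2: black then white = 0.34, black on the first draw = 0.47.
p_white_given_black = conditional(0.34, 0.47)

print(f"{p_second_given_first:.2f}")  # 0.60
print(f"{p_white_given_black:.2f}")   # 0.72
```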

Bayes Theorem Statement


Let E1, E2, …, En be a set of events associated with a sample space S, where all the events E1, E2, …, En have
nonzero probability of occurrence and they form a partition of S. Let A be any event associated with S with
P(A) > 0. Then, according to Bayes theorem,
P(Ei|A) = P(Ei)P(A|Ei) / [P(E1)P(A|E1) + P(E2)P(A|E2) + … + P(En)P(A|En)], for i = 1, 2, …, n

Bayes Theorem Proof
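
The standard argument combines the definition of conditional probability with the law of total probability. It is sketched below in LaTeX; the summation index k is notation introduced here, and the events E_k and A are as in the statement of the theorem.

```latex
\begin{align*}
P(E_i \mid A) &= \frac{P(E_i \cap A)}{P(A)}
  && \text{(definition of conditional probability)} \\
P(E_i \cap A) &= P(E_i)\, P(A \mid E_i)
  && \text{(multiplication rule)} \\
P(A) &= \sum_{k=1}^{n} P(E_k \cap A) = \sum_{k=1}^{n} P(E_k)\, P(A \mid E_k)
  && \text{(total probability, since the } E_k \text{ partition } S\text{)} \\
\Rightarrow \quad P(E_i \mid A) &= \frac{P(E_i)\, P(A \mid E_i)}{\sum_{k=1}^{n} P(E_k)\, P(A \mid E_k)}
\end{align*}
```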

Note:
The following terminologies are also used when the Bayes theorem is applied:
Hypotheses: The events E1, E2, …, En are called the hypotheses.
Prior (a priori) probability: The probability P(Ei) is the prior probability of hypothesis Ei.
Posterior (a posteriori) probability: The probability P(Ei|A) is the posterior probability of hypothesis Ei.

Bayes Theorem Formula


If A and B are two events with P(B) ≠ 0, then the formula for Bayes theorem is given by:

P(A|B) = P(A∩B)/P(B) = P(B|A)P(A)/P(B)

Where P(A|B) is the conditional probability of event A given that event B has already occurred,
P(B|A) is the conditional probability of event B given that event A has already occurred,
P(A ∩ B) is the probability of both event A and event B occurring,
P(A) and P(B) are the probabilities of event A and event B.
Some illustrations will improve the understanding of the concept.
Example 1: Bag I contains 4 white and 6 black balls, while Bag II contains 4 white and 3 black balls. One
ball is drawn at random from one of the bags, and it is found to be black. Find the probability that it was drawn
from Bag I.
Solution:
Let E1 be the event of choosing Bag I, E2 the event of choosing Bag II, and A be the event of drawing a
black ball.
Then, P(E1) = P(E2) = 1/2
Also, P(A|E1) = P(drawing a black ball from Bag I) = 6/10 = 3/5
P(A|E2) = P(drawing a black ball from Bag II) = 3/7
By using Bayes’ theorem, the probability that the black ball was drawn from Bag I is
P(E1|A) = P(E1)P(A|E1) / [P(E1)P(A|E1) + P(E2)P(A|E2)]
= (1/2 × 3/5)/(1/2 × 3/5 + 1/2 × 3/7) = (3/10)/(36/70) = 7/12
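The same calculation can be written once as a reusable function and applied to the two bags. This is a minimal sketch, not a library routine: `bayes_posterior` is a name chosen here for illustration, and exact fractions are used so that the result can be compared with 7/12 directly.

```python
from fractions import Fraction


def bayes_posterior(priors, likelihoods):
    """Return the list of P(Ei|A) for hypotheses E1..En that partition S.

    priors[i] is P(Ei) and likelihoods[i] is P(A|Ei).
    (Hypothetical helper, written for this example.)
    """
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    if sum(priors) != 1:
        raise ValueError("priors of a partition must add up to 1")
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(Ei) * P(A|Ei)
    evidence = sum(joint)                                  # P(A)
    if evidence == 0:
        raise ValueError("P(A) is zero, so the posterior is undefined")
    return [j / evidence for j in joint]


# Example 1: E1 = Bag I, E2 = Bag II, A = a black ball is drawn.
priors = [Fraction(1, 2), Fraction(1, 2)]
likelihoods = [Fraction(6, 10), Fraction(3, 7)]

posterior = bayes_posterior(priors, likelihoods)
print(posterior[0], posterior[1])  # 7/12 5/12
```

The two posteriors add up to 1, as they must, because the ball came from exactly one of the two bags.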
Example 2: A man is known to speak the truth 2 out of 3 times. He throws a die and reports that the
number obtained is a four. Find the probability that the number obtained is actually a four.
Solution:
Let A be the event that the man reports that number four is obtained.
Let E1 be the event that four is obtained and E2 be its complementary event.
Then, P(E1) = Probability that four occurs = 1/6
P(E2) = Probability that four does not occur = 1 - P(E1) = 1 - 1/6 = 5/6
Also, P(A|E1) = Probability that the man reports four given that it is actually a four = 2/3
P(A|E2) = Probability that the man reports four given that it is not a four = 1/3
By using Bayes’ theorem, the probability that the number obtained is actually a four is
P(E1|A) = P(E1)P(A|E1) / [P(E1)P(A|E1) + P(E2)P(A|E2)] = (1/6 × 2/3)/(1/6 × 2/3 + 5/6 × 1/3) = 2/7
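A quick simulation offers an independent check of the value 2/7 ≈ 0.286. The sketch below makes the same simplifying assumption as the solution: whenever the man lies about a throw that is not a four, he reports "four". The function name, the seed and the number of trials are arbitrary choices made for this illustration.

```python
import random


def simulate_report_four(trials: int, seed: int = 0) -> float:
    """Estimate P(actually a four | man reports four) by simulation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    reports = 0      # times the man reports "four"
    true_fours = 0   # of those, times the throw really was a four
    for _ in range(trials):
        is_four = rng.randint(1, 6) == 4
        tells_truth = rng.random() < 2 / 3
        # Truthful about a four, or lying about a non-four (assumption above).
        reports_four = tells_truth if is_four else not tells_truth
        if reports_four:
            reports += 1
            true_fours += is_four
    if reports == 0:
        raise ValueError("no 'four' reports were generated; use more trials")
    return true_fours / reports


estimate = simulate_report_four(200_000)
print(round(estimate, 3), round(2 / 7, 3))
```

With this many trials the estimate should land close to 2/7; the exact figure depends on the seed.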
Practice Problems: Solve the following problems using Bayes Theorem.
1. A bag contains 5 red and 5 black balls. A ball is drawn at random, its color is noted, and the ball is
returned to the bag. In addition, 2 more balls of the color drawn are put in the bag. A ball is then
drawn at random from the bag. What is the probability that the second ball drawn from the bag is red?
2. Of the students in a college, 60% reside in the hostel and 40% are day
scholars. The previous year's results show that 30% of all students who stay in the hostel scored an A grade and
20% of day scholars scored an A grade. At the end of the year, one student is chosen at random and found
to have an A grade. What is the probability that the student is a hosteller?
3. From a pack of 52 cards, one card is lost. From the remaining cards of the pack, two cards are drawn and
both are found to be diamonds. What is the probability that the lost card is a diamond?
Problem: A factory production line is manufacturing bolts using three machines, A, B and C. Of the
total output, machine A is responsible for 25%, machine B for 35% and machine C for the rest. It is
known from previous experience with the machines that 5% of the output from machine A is
defective, 4% from machine B and 2% from machine C. A bolt is chosen at random from the
production line and found to be defective. What is the probability that it came from (a) machine A (b)
machine B (c) machine C?
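One way to check an answer to this problem is to apply Bayes theorem with the three machines as the hypotheses and "the bolt is defective" as the observed event. The sketch below takes the figures from the problem statement; the dictionary names are illustrative.

```python
from fractions import Fraction

# Share of total output per machine; C produces "the rest".
share = {"A": Fraction(25, 100), "B": Fraction(35, 100)}
share["C"] = 1 - share["A"] - share["B"]

# P(defective | machine), from the problem statement.
defect_rate = {"A": Fraction(5, 100), "B": Fraction(4, 100), "C": Fraction(2, 100)}

# Total probability that a randomly chosen bolt is defective.
p_defective = sum(share[m] * defect_rate[m] for m in share)

# Bayes theorem: P(machine | defective).
posterior = {m: share[m] * defect_rate[m] / p_defective for m in share}

for machine, p in posterior.items():
    print(machine, p, f"{float(p):.3f}")
# A 25/69 0.362
# B 28/69 0.406
# C 16/69 0.232
```

Although machine A has the highest defect rate, machine B is the most likely source of a defective bolt because it produces a larger share of the output.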
Problem: An engineering company advertises a job in three newspapers, A, B and C. It is known that
these papers attract undergraduate engineering readerships in the proportions 2:3:1. The probabilities
that an engineering undergraduate sees and replies to the job advertisement in these papers are 0.002,
0.001 and 0.005 respectively. Assume that the undergraduate sees only one job advertisement. (a) If the
engineering company receives only one reply to its advertisements, calculate the probability that the
applicant has seen the job advertised in paper (i) A, (ii) B, (iii) C. (b) If the company receives two
replies, what is the probability that both applicants saw the job advertised in paper A?
