2. Statistics and Data Analysis

The document outlines a module on Statistics & Data Analysis, covering key topics such as Probability, Descriptive Statistics, and Data Visualization. It provides detailed explanations of probability concepts, formulas, and examples, including probability rules and distributions. Additionally, it introduces tools like NumPy and Pandas for data manipulation and visualization techniques using Matplotlib and Seaborn.

Uploaded by

ssen29750
Copyright
© © All Rights Reserved

Data Science

Module Name: Statistics & Data Analysis

1
Statistics & Data Analysis

Chapters
1. Probability
2. Descriptive Statistics
3. Inferential Statistics
4. NumPy for Mathematical Computing
5. Data Manipulation using Pandas
6. Data Visualization with Matplotlib and Seaborn
7. Web Scraping Using BeautifulSoup

2
Chapter 1

Probability

3
1. Probability

What is probability?
• Probability measures how likely an event is to occur

• For example:
• What is the chance of getting a head (H) when a coin is tossed?
• What is the chance of getting 1 when a die is rolled?
• What is the chance of getting two heads (HH) when two coins are tossed?

4
1. Probability

Probability Formula
• Probability can also be defined as the ratio of the number of favorable outcomes to the total number of equally likely outcomes of an experiment

• Probability of an event can be written as:

• Probability(event) or P(event) = Possible Outcome / Total Outcomes, where "possible outcome" means the number of outcomes favorable to the event

• The probability of an event ranges from 0 to 1

• 0 means the event cannot occur at all
• 1 means the event is certain to occur (100% chance)

• In practice, most events have a probability strictly between 0 and 1
5
1. Probability

Examples
• What is the probability of getting/landed head, when a coin is tossed?

Solution:
Total outcomes = 2 (H, T)
Possible outcome = 1 (H)

P(H) = 1/2
= 0.5

The probability of getting head is 0.5 or 50%


6
1. Probability

Examples
• What is the probability of getting/landed tail, when a coin is tossed?

Solution:
Total outcomes = 2 (H, T)
Possible outcome = 1 (T)

P(T) = 1/2
= 0.5

The probability of getting tail is 0.5 or 50%


7
1. Probability

Examples
• What is the probability of getting/landed 2 consecutive heads, when a coin is tossed twice?

Solution:
Total outcomes = 4 (HH, HT, TH, TT)
Possible outcome = 1 (HH)

P(2 consecutive heads) = 1/4


= 0.25

The probability of getting 2 consecutive heads is 0.25 or 25%


8
1. Probability

Examples
• What is the probability of getting/landed at least one head, when a coin is tossed twice?

Solution:
Total outcomes = 4 (HH, HT, TH, TT)
Possible outcome = 3 (HH, HT, TH)

P(at least one head) = 3/4


= 0.75

The probability of getting at least one head is 0.75 or 75%


9
1. Probability

Examples
• What is the probability of getting 1, when a dice is rolled?

Solution:
Total outcomes = 6 (1, 2, 3, 4, 5, 6)
Possible outcome = 1 (1)

P(1) = 1/6
= 0.1667 (rounded)

The probability of getting 1 is about 0.1667


10
1. Probability

Examples
• What is the probability of getting a sum of 8 when a die is rolled twice?

Solution:
Total outcomes = 36 [(1, 1), (1, 2), (1, 3), ..., (6, 4), (6, 5), (6, 6)]
Possible outcomes = 5 [(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)]

P(sum 8) = 5/36
= 0.1389 (rounded)

The probability of getting a sum of 8 is about 0.1389
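
The counting approach used in the examples above can be checked by enumerating the sample space. The following is a minimal sketch, assuming all outcomes are equally likely; the helper name `probability` is an illustrative assumption and not part of the original slides.

```python
from fractions import Fraction
from itertools import product


def probability(sample_space, is_favorable):
    """Return favorable outcomes / total outcomes for equally likely outcomes."""
    outcomes = list(sample_space)
    favorable = [o for o in outcomes if is_favorable(o)]
    return Fraction(len(favorable), len(outcomes))


# Two coin tosses: HH, HT, TH, TT (4 outcomes)
two_tosses = list(product("HT", repeat=2))
p_two_heads = probability(two_tosses, lambda o: o == ("H", "H"))
p_at_least_one_head = probability(two_tosses, lambda o: "H" in o)

# Two rolls of a die: 36 ordered pairs
two_rolls = list(product(range(1, 7), repeat=2))
p_sum_8 = probability(two_rolls, lambda o: sum(o) == 8)

print(p_two_heads, p_at_least_one_head, p_sum_8)  # 1/4 3/4 5/36
```

Using `Fraction` keeps the results exact, so they can be compared directly with the hand-worked ratios.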


11
1. Probability

Probability Terminology
• Experiment
• An activity whose outcomes are not known is an experiment
• Example: Experiment to find Gravity

• Random Experiment
• A random experiment is an experiment for which the set of possible outcomes is known, but
the particular outcome of any single run cannot be predicted before the experiment is
performed
• Example: tossing a coin

12
1. Probability

Probability Terminology
• Trial
• Each repetition of an experiment is called a trial
• Example: each toss of a coin

• Event
• An event is a clearly defined outcome (or set of outcomes) of a trial
• Example: getting 2 when rolling a die

• Random Event
• An event that cannot be predicted with certainty is a random event
• Example: whether a person survives a severe accident
13
1. Probability

Probability Terminology
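• Random Variable: a variable whose value is the numerical outcome of a random experiment
• Discrete Random Variable: takes countable values
• Examples: result of a coin toss, result of a die roll

• Continuous Random Variable: takes any value within a range
• Examples: age, salary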

14
1. Probability

Probability Terminology
• Outcome
• The result of a trial
• Example: head or tail when a coin is tossed

• Possible Outcome
• The list of all the outcomes in an experiment can be referred to as possible outcomes.
• Example: getting head, when coin is tossed

• Equally likely Outcomes


• Outcomes of an experiment that each have the same probability of occurring
• Example: H and T are equally likely when a fair coin is tossed
15
1. Probability

Probability Terminology
• Sample Space
• It is the set of outcomes of all the trials in an experiment
• Example: S = {H, T}, when coin is tossed

• Probable Event
• An event that can be predicted is called a probable event
• Example: probability an employee getting promotion

• Impossible Event
• An event that cannot occur because it is not in the sample space of the experiment
• Example: getting 7 when a single die is rolled
16
1. Probability

Probability Terminology
• Complementary Events
• Two events are complementary when exactly one of them must occur
• Example: {success, failure}, when a game is played
• P(success) + P(Failure) = 1

• Independent Events
• A and B are independent if the occurrence of A does not affect the probability of B, and vice versa
• Example: the results of the first and second toss when a coin is tossed twice

17
1. Probability

Probability Terminology
• Dependent Events
• A and B are dependent if the occurrence of A affects the probability of B, or vice versa
• Example: getting a blue ball on the second pick (without replacement) from 5 red and 4 blue balls

• Mutually Exclusive Events


• Two events such that the happening of one event prevents the happening of another event are
referred to as mutually exclusive events
• Example: if a coin is tossed, it results either H or T, not both

18
1. Probability

Probability Rules
• The probabilities of all outcomes in the sample space of an experiment sum to 1
• The probability of the opposite event = 1 – probability of the event
• Let us assume A and B are opposite (complementary) events in an experiment
• P(A) = 1 – P(B)
• Or
• P(B) = 1 – P(A)

19
1. Probability

Probability Rules
• The probabilities of all outcomes in the sample space of an experiment sum to 1
• Let us assume A and B are the only two outcomes in a sample space, then
• P(A) + P(B) = 1

• The probability of the opposite/complement event = 1 – probability of the event

• Let us assume A and A1 are complements of each other, then
• P(A) = 1 – P(A1) or P(A1) = 1 – P(A)

20
1. Probability

Probability Rules
• Addition rule
• Say A and B are mutually exclusive events. Then
• P(A or B) = P(A) + P(B)
• If they are not mutually exclusive events. Then
• P(A or B) = P(A) + P(B) – P(A and B)

• Multiplication Rule
• P(A and B) = P(A) * P(B), where A, B are independent events
• P(A and B) = P(A) * P(B|A), where A, B are dependent events
• P(A and B) = 0, where A, B are mutually exclusive events
21
1. Probability

Probability Rules
• Conditional Probability
• P(A|B) = P(A and B) / P(B)

22
1. Probability

Examples
• A box contains 3 red, 3 blue and 4 green balls. What is the probability that both the 1st and 2nd
picks are red, if the first ball is not put back?

Solution:
Total balls = 10 (3 red, 3 blue, 4 green)

P(1st ball red) = 3/10

P(2nd ball red | 1st ball red) = 2/9
P(1st and 2nd ball red) = (3/10) * (2/9) = 6/90 = 0.0667 (rounded)

The probability that the 1st and 2nd picks are both red is about 0.0667
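
The multiplication rule for dependent events in this example can be sketched in code. This is an illustrative helper (the name `prob_all_same_color` is an assumption), and it assumes draws are made without replacement.

```python
from fractions import Fraction


def prob_all_same_color(count, total, picks):
    """P(every one of `picks` draws has the given color), drawing without replacement."""
    p = Fraction(1)
    for i in range(picks):
        # After i successful draws, (count - i) balls of that color remain out of (total - i)
        p *= Fraction(count - i, total - i)
    return p


# 3 red balls out of 10; both the 1st and 2nd picks are red
p_two_red = prob_all_same_color(count=3, total=10, picks=2)
print(p_two_red)         # 1/15
print(float(p_two_red))  # about 0.0667
```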
23
1. Probability

Probability Distribution
• Let X be the number of heads obtained when a coin is tossed twice. The probability
distribution is as follows:

X       0      1      2
P(X)    0.25   0.5    0.25

24
1. Probability

Probability Mass Function


• The probability mass function (PMF) gives the probability that a discrete random variable is
exactly equal to a given value.
• The PMF is used only with discrete random variables.

• Example:
• Suppose a coin is tossed twice and the sample space is recorded as S = [HH, HT, TH, TT].
• The probability of getting heads needs to be determined.
• Let X be the random variable that shows how many heads are obtained.
• X can take on the values 0, 1, 2. The probability that X will be equal to 1 is 0.5.
• Thus, it can be said that the probability mass function of X evaluated at 1 will be 0.5.
25
1. Probability

Probability Mass Function


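• Probability Mass Function Formula for Binomial Distribution

• $P(X = k) = \binom{n}{k} p^{k} (1-p)^{n-k}$, where n is the number of trials, k is the number of successes and p is the probability of success in a single trial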

26
1. Probability

PMF Example
• What is the probability of getting two heads when a coin is tossed twice?
• Number of trials n = 2
• Number of heads k = 2, with p = 0.5

• P(X = 2) = (2C2)(0.5)^2(1-0.5)^0

• P(X = 2) = (2!/(2!0!)) * (0.25) * (1)
• P(X = 2) = (1) * (0.25)
• P(X = 2) = 0.25
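
The binomial PMF can also be evaluated in code. This sketch assumes the standard binomial formula with n trials, k successes and success probability p; the function name `binomial_pmf` is illustrative.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Number of heads in two tosses of a fair coin
distribution = {k: binomial_pmf(k, n=2, p=0.5) for k in range(3)}
print(distribution)  # {0: 0.25, 1: 0.5, 2: 0.25}
```

The three values sum to 1, as required of any probability distribution.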

27
1. Probability

Probability of a Continuous Random Variable


• A random variable 𝑋 has the uniform distribution on the interval
[0,1] : the density function is 𝑓(𝑥)=1 if 𝑥 is between 0 and 1
and 𝑓(𝑥)=0 for all other values of 𝑥

• What is the probability that 𝑋 assumes a value greater than 0.92, i.e. 𝑃(𝑋>0.92)?

• The area under the density for X>0.92 is a rectangle with height = 1
• and base (1-0.92) = 0.08
• P(X>0.92) = height * base
• P(X>0.92) = 1 * 0.08 = 0.08
28
1. Probability

Probability of a Continuous Random Variable


• What is the probability that 𝑋 assumes a value between 0.3 and
0.8, i.e. 𝑃(0.3<𝑋<0.8)?
• Height = 1
• Base (Area b/w 0.8 and 0.3) = 0.8-0.3 = 0.5

• P(0.3<X<0.8) = 1 * 0.5 = 0.5

29
1. Probability

Probability of a Continuous Random Variable - Example


• A boy arrives at a bus stop at a random time to catch the next bus. Buses run every 20 minutes
without fail, hence the next bus will come any time during the next 20 minutes with evenly
distributed probability (a uniform distribution).

• 1. Find the probability that a bus will come within the next 12 minutes

• Solution:
• Height = 1/20
• Base = 12
• P(0<=X<=12) = (1/20) * 12 = 0.05 * 12 = 0.6
30
1. Probability

Probability of a Continuous Random Variable - Example


• A boy arrives at a bus stop at a random time to catch the next bus. Buses run every 20 minutes
without fail, hence the next bus will come any time during the next 20 minutes with evenly
distributed probability (a uniform distribution).

• 1. Find the probability that a bus will come after 12 minutes and before 15 minutes

• Solution:
• Height = 1/20
• Base = 15-12 = 3
• P(12<X<15) = (1/20) * 3 = 0.05 * 3 = 0.15
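
The height-times-base reasoning for a uniform distribution can be written as a small function. This is a sketch under the assumption that X is uniform on [a, b]; the name `uniform_probability` is illustrative.

```python
def uniform_probability(a, b, low, high):
    """P(low < X < high) for X uniform on [a, b], computed as base / (b - a)."""
    # Clip the requested interval to the support [a, b]
    lo, hi = max(low, a), min(high, b)
    base = max(0, hi - lo)
    # The height of the density is 1 / (b - a), so area = base / (b - a)
    return base / (b - a)


# Bus example: waiting time is uniform on [0, 20] minutes
p_within_12 = uniform_probability(0, 20, 0, 12)
p_between_12_and_15 = uniform_probability(0, 20, 12, 15)
print(p_within_12)          # about 0.6
print(p_between_12_and_15)  # about 0.15
```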
31
1. Probability

Probability Density Function


• Probability Density Function Formula

• $P(a < X \le b) = \int_{a}^{b} f(x)\,dx$, where f(x) is the probability density function

32
1. Probability

Probability Density Function - Example


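• Let X be a continuous random variable with the PDF given by:

• f(x) = x; where x <= 2

• f(x) = 0; where x > 2

• What is P(1.5 < X < 2.8)?

• $P(1.5 < X < 2.8) = \int_{1.5}^{2} f(x)\,dx + \int_{2}^{2.8} f(x)\,dx$
• $= \left[ \frac{x^2}{2} \right]_{1.5}^{2} + \left[ 0 \right]_{2}^{2.8}$
• = ([2*2 / 2] – [1.5*1.5 / 2]) + (0 – 0), after substituting the limits for x
• = (2 – 1.125) + 0 = 0.875
• Note: f(x) = x on [0, 2] has a total area of 2, so it is not a normalized density; with f(x) = x/2 on [0, 2] the same steps give 0.4375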
33
1. Probability

Bayes’ Theorem
• It is a mathematical formula that describes the probability of an event based on prior knowledge
or experience.
• The theorem is named after Thomas Bayes. It is also known as the formula for the probability of
“causes”
• It allows us to update our prior beliefs about the likelihood of an event based on new evidence.
• It is used in many fields:
• Statistics
• data science
• machine learning
• artificial intelligence.
34
1. Probability

Bayes’ Theorem
• Formula as follows:
• P(A|B) = (P(B|A) * P(A)) / P(B)

• Where:
• P(A|B) is the probability of event A given that event B has occurred
• P(B|A) is the probability of event B given that event A has occurred
• P(A) is the prior probability of event A
• P(B) is the overall (marginal) probability of event B, also called the evidence
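
A short numerical illustration of the formula follows. The scenario and all numbers (a 1% prior, a 95% detection rate, a 5% false-positive rate) are hypothetical values chosen only to show the arithmetic; they do not come from the slides.

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) using Bayes' theorem, with P(B) from the law of total probability."""
    # P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return (p_b_given_a * prior_a) / p_b


# Hypothetical: A = "has the condition", B = "test is positive"
posterior = bayes_posterior(prior_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.05)
print(round(posterior, 3))  # 0.161
```

Even with a fairly accurate test, the posterior stays low here because the prior P(A) is small; this is the kind of belief update the theorem describes.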

35
Chapter 2

Descriptive Statistics

36
2. Descriptive Statistics

Statistics
• The area of mathematics known as statistics deals with the principles regulating random events,
as well as the gathering, examination, interpretation, and presentation of numerical data.
• Type of Statistics:
• Descriptive Statistics
• Used to summarize the data
• Inferential Statistics
• Used to analyze and make inferences from the data

37
2. Descriptive Statistics

Statistical Terms
• Population
• Entire data set is called population
• Example Indian Population
• Sample
• Subset of the data set is called sample
• Example Bangalore Population
• Parameter
• Value calculated on population
• Statistic
• Value calculated on sample
38
2. Descriptive Statistics

Statistical Method Steps


• Gathering data
• Collect data from population
• Describing and visualizing data
• Calculate values and visualize on collected data
• Making conclusions
• Make decisions/predictions based on visuals

39
2. Descriptive Statistics

Data Types
• Based on Measurement
• Qualitative
• Nominal
• apple, banana
• Ordinal
• ratings
• Quantitative
• Interval
• temperature in °C, calendar year
• Ratio
• age, salary
40
2. Descriptive Statistics

Basics
• Descriptive statistics includes
• Measure of Central Tendency
• Mean
• Median
• Mode
• Measure of Dispersion
• Range
• Variance
• Standard Deviation

41
2. Descriptive Statistics

Data Types
• Based on Structure
• Structured
• Data in the form of rows and columns
• Semi Structured
• Data in the form of XML or JSON
• Unstructured
• Data in the form of audio, video

42
2. Descriptive Statistics

Measure of Central Tendency


• It aims to summarise an entire set of data with a single value that corresponds to the middle or
center of its distribution

• Central Tendency Measures:


• Mean
• Median
• Mode

43
2. Descriptive Statistics

Mean
• The arithmetic mean of a data set is the sum of all observations divided by the total number
of observations

• If x1, x2, x3, x4, x5 are the five observations in a dataset, then

• $\text{Mean}\ (\mu \text{ or } \bar{x}) = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5}$

• By using summation notation

• $\text{Mean}\ (\mu \text{ or } \bar{x}) = \frac{\sum_{i=1}^{n} x_i}{n}$

44
2. Descriptive Statistics

Mean Example

Age: 20, 19, 20, 18, 21, 22

Total = 120
N = 6

Mean = Total / N
Mean = 120/6
= 20

45
2. Descriptive Statistics

Median
• The median is the middle value of a set of observations after arranging the data in
ascending or descending order

• $\text{Median} = \left(\frac{n+1}{2}\right)^{th} \text{item}$, where n is odd

• $\text{Median} = \frac{\left(\frac{n}{2}\right)^{th} \text{item} + \left(\frac{n+2}{2}\right)^{th} \text{item}}{2}$, where n is even

46
2. Descriptive Statistics

Median Example

Age: 20, 19, 20, 18, 21, 22, 60

Age in sorted order: 18, 19, 20, 20, 21, 22, 60

Median = 20 (the 4th of 7 values)

47
2. Descriptive Statistics

Mode
• The most frequently occurring item in a data set is called the mode.
• The mode is the value around which the other observations cluster most densely.

48
2. Descriptive Statistics

Mode Example

Sys-Cores: 1, 2, 3, 1, 2, 2, 4

Mode = 2 (occurs 3 times)
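
The three measures of central tendency can be checked with Python's built-in `statistics` module, using the same data as the mean, median and mode examples. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

ages = [20, 19, 20, 18, 21, 22]
ages_with_outlier = [20, 19, 20, 18, 21, 22, 60]
sys_cores = [1, 2, 3, 1, 2, 2, 4]

mean_age = statistics.mean(ages)                  # sum of values / number of values
median_age = statistics.median(ages_with_outlier) # middle value after sorting
mode_cores = statistics.mode(sys_cores)           # most frequent value

print(mean_age, median_age, mode_cores)  # 20 20 2
```

The median example includes the outlier 60 yet the median stays at 20, which shows why the median is less sensitive to extreme values than the mean.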

49
2. Descriptive Statistics

Measure of Dispersion/Spread
• Describes how widely the data are spread around the central value

• Measure of Dispersion:
• Range
• Variance
• Standard Deviation

50
2. Descriptive Statistics

Range
• Range is the simplest measure of dispersion; it gives the researcher an overall sense of the
spread of the data. A larger range means the data set is more spread out, and vice
versa.

• Range is calculated by subtracting the minimum value from the maximum value.

• Range (R) = Maximum Value − Minimum Value

51
2. Descriptive Statistics

Range Example

Age: 20, 19, 20, 18, 21, 22

Max = 22
Min = 18

Range = Max – Min
Range = 22 – 18
Range = 4

52
2. Descriptive Statistics

Variance
• Variance is the square of the standard deviation.
• Variance is commonly used to test whether the spread of two data sets differs significantly.
• Variance describes how far, on average, the data points deviate from their mean.

• $\text{variance}\ (\sigma^2) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$

53
2. Descriptive Statistics

Variance Example

Age(x)   Mean x   x - Mean x   (x - Mean x)^2
20       20        0           0
19       20       -1           1
20       20        0           0
18       20       -2           4
21       20        1           1
22       20        2           4
Total
120                            10

Sum of squared deviations = 10
n = 6
Variance = 10/(6-1)
Variance = 2

54
2. Descriptive Statistics

Standard Deviation
• Standard deviation is the square root of the sum of squared deviations of the observations
from their arithmetic mean, divided by the degrees of freedom.
• SD tells how far, on average, the observations deviate from their mean.

• $\text{Standard deviation}\ (\sigma) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$

55
2. Descriptive Statistics

Standard Deviation
• Benefits:
• Standard deviation is useful for comparing two or more sets of data to see how far the data
deviate from the central value (mean).

• Standard deviation helps in finding the standard error of the sample mean.

• Standard deviation helps in finding the coefficient of variation.

56
2. Descriptive Statistics

Standard Deviation Example

Age(x)   Mean x   x - Mean x   (x - Mean x)^2
20       20        0           0
19       20       -1           1
20       20        0           0
18       20       -2           4
21       20        1           1
22       20        2           4
Total
120                            10

Standard Deviation = sqrt(Variance)
Variance = 2
Standard Deviation = sqrt(2)
Standard Deviation = 1.414
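
The dispersion measures for the age data can be reproduced with the standard library. This sketch assumes the sample (n-1) formulas used in the slides, which is what `statistics.variance` and `statistics.stdev` compute.

```python
import statistics

ages = [20, 19, 20, 18, 21, 22]

data_range = max(ages) - min(ages)           # maximum minus minimum
sample_variance = statistics.variance(ages)  # divides by n - 1
sample_sd = statistics.stdev(ages)           # square root of the sample variance

print(data_range)           # 4
print(sample_variance)      # 2
print(round(sample_sd, 3))  # 1.414
```

For a whole population rather than a sample, `statistics.pvariance` and `statistics.pstdev` divide by n instead of n-1.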

57
2. Descriptive Statistics

Inter Quartile Range


• It tells us the spread of the middle half of our distribution
• IQR = Q3 - Q1

58
2. Descriptive Statistics

5 Number Summary
• When conducting descriptive analyses or conducting an initial analysis of a sizable data set, a
five-number summary is particularly helpful.
• The maximum and minimum values in the data set, the lower and upper quartiles, and the median
make up a summary's five values.
• Together, these values are shown in the following order:
• minimum value
• lower quartile (Q1)
• median value (Q2)
• upper quartile (Q3)
• maximum value
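
A five-number summary and the inter-quartile range can be sketched as follows. Quartile conventions differ between tools; this example assumes the default "exclusive" method of `statistics.quantiles`, so other software may report slightly different Q1 and Q3 values. The function name is illustrative.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(data, n=4)  # three cut points for four equal parts
    return min(data), q1, q2, q3, max(data)


ages = [20, 19, 20, 18, 21, 22, 60]
minimum, q1, median, q3, maximum = five_number_summary(ages)
iqr = q3 - q1  # spread of the middle half of the data

print(minimum, q1, median, q3, maximum)  # 18 19.0 20.0 22.0 60
print(iqr)                               # 3.0
```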
59
2. Descriptive Statistics

Distribution
• Normal distribution
• Poisson distribution

60
2. Descriptive Statistics

Normal Distribution
• A probability distribution that is symmetric
about the mean is the normal distribution
• Sometimes referred to as the Gaussian
distribution.
• It demonstrates that data that are close to the
mean occur more frequently than data that
are far from the mean.

61
2. Descriptive Statistics

Skewness and Kurtosis


• Skewness is a measure of symmetry, or more
precisely, the lack of symmetry
• Right skewed
• Left skewed

• Kurtosis is a measure of whether the data are


heavy-tailed or light-tailed relative to a normal
distribution.
• Heavy tailed
• Light tailed
62
2. Descriptive Statistics

Empirical Rule
• Also known as the three-sigma rule or the 68-95-99.7 rule, it states that for a normal distribution
about 68% of observations lie within one standard deviation of the mean, about 95% within two, and about 99.7% within three
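
The percentages in the empirical rule can be verified from the normal cumulative distribution function. This is a small sketch using `statistics.NormalDist`; the helper name `within_k_sigma` is an assumption for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)


def within_k_sigma(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(within_k_sigma(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```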

63
2. Descriptive Statistics

Covariance
• It measures how two variables change together, i.e., whether an increase in one
variable tends to accompany an increase or a decrease in the other.
• The units of covariance are the product of the units of the two variables, and its value
lies between -∞ and +∞
• Types:
• Positive Covariance
• Negative Covariance

64
2. Descriptive Statistics

Covariance
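• Formula as follows (sample covariance):

• $\text{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$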

65
2. Descriptive Statistics

Covariance Example
• Calculate Covariance for the following data set

Study Hours Marks


2.1 56
2.5 60
4.0 85
3.6 75
3.5 70
4.5 92
4.2 89
4.1 85
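
One way to work this example is to compute the sample covariance directly from its definition. The sketch below assumes the sample (n-1) form of the formula; the function name `sample_covariance` is illustrative.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


study_hours = [2.1, 2.5, 4.0, 3.6, 3.5, 4.5, 4.2, 4.1]
marks = [56, 60, 85, 75, 70, 92, 89, 85]

cov = sample_covariance(study_hours, marks)
print(round(cov, 2))  # 11.24
```

The positive value indicates that marks tend to increase as study hours increase.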

66
2. Descriptive Statistics

Correlation Coefficient
• The correlation coefficient measures how strongly two variables are linearly related.
• It is a dimensionless, standardized measure of covariance.
• In other words, the correlation coefficient has no units and does not change when the variables are rescaled.
• Its value lies between -1 and 1
• Correlation Coefficient as follows:

• $r = \frac{\text{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$

67
2. Descriptive Statistics

Correlation Coefficient Example


• Calculate Correlation Coefficient for the following data set

Study Hours Marks


2.1 56
2.5 60
4.0 85
3.6 75
3.5 70
4.5 92
4.2 89
4.1 85
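
The correlation coefficient for this data set can be computed from the deviations of each variable. This is a sketch of Pearson's correlation coefficient; the function name `pearson_r` is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # The (n - 1) factors cancel, so sums of squares can be used directly
    return sxy / math.sqrt(sxx * syy)


study_hours = [2.1, 2.5, 4.0, 3.6, 3.5, 4.5, 4.2, 4.1]
marks = [56, 60, 85, 75, 70, 92, 89, 85]

r = pearson_r(study_hours, marks)
print(round(r, 2))  # 0.98
```

A value close to 1 indicates a strong positive linear relationship between study hours and marks.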

68
Chapter 3

Inferential Statistics

69
3. Inferential Statistics

Introduction
• It uses various analytical tools to draw inferences about the population from
sample data.
• It helps us reach conclusions and make predictions based on the available data
• We use inferential statistics to understand a population parameter by using a test statistic
• It has two main uses:
• Making estimates about population
• Testing hypotheses to draw conclusion

70
3. Inferential Statistics

Hypothesis Testing
• A hypothesis is an assumption about a population parameter that is tested using a sample statistic
• Testing the assumption is called hypothesis testing
• Hypotheses are of two kinds:
• Null Hypothesis
• H0 is used to represent null hypothesis
• Alternate Hypothesis
• H1 or Ha is used to represent alternate hypothesis

71
3. Inferential Statistics

Hypothesis Example
• XYZ college believes its students score 90% on average in the final exam
• Null Hypothesis
• H0: average = 90
• Alternate Hypothesis
• H1: average ≠ 90

72
3. Inferential Statistics

Terms Used in Testing


• Significance Level
• alpha=0.05
• Confidence Level
• 1-alpha = 95%
• Critical Value
• Value from the distribution table at alpha
• Rejection Region
• Area of the distribution in which the null
hypothesis is rejected

73
3. Inferential Statistics

One-tailed and Two-tailed Test

74
3. Inferential Statistics

Type-I and Type-II Error

• Type-I Error:
• Rejecting the null hypothesis when it is
true

• Type-II Error:
• Failing to reject the null hypothesis when it is
false

75
3. Inferential Statistics

Hypothesis Tests
• T-test
• Also called Student's t-test
• It is used when the sample size is small (<30)
• Z-test
• It is used when the sample size is large (>=30)
• ANOVA
• It is used to compare the means of three or more groups

77
3. Inferential Statistics

T-test
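• When to conduct a T-test
• Sample size is small
• Data follows normal distribution
• Population standard deviation is unknown
• Formula for the one-sample t-test is as follows:

• $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, where $\bar{x}$ is the sample mean, $\mu$ the hypothesized population mean, $s$ the sample standard deviation and $n$ the sample size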

78
3. Inferential Statistics

T-test
• Formula for the two-sample t-test is as follows:

79
3. Inferential Statistics

T-test Example
• XYZ college wants to improve its student performance. Previous records show that the
average performance of its students was 80%. After some training (extra study hours), a sample
of 28 students showed an average performance of 88% with a standard deviation of 20%. Did the
extra study hours improve the performance?

80
3. Inferential Statistics

T-test Example - Solution


• Sample size = 28
• Population mean = 80
• Sample mean = 88
• Sample standard deviation = 20

• Null Hypothesis:
• H0: mean = 80 (no improvement)

• Alternate hypothesis
• H1: mean > 80 (performance improved)
81
3. Inferential Statistics

T-test Example - Solution

• T-statistic = (88-80)/(20/sqrt(28))
• T-statistic = (8)/(3.78)
• T-statistic = 2.11
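
The same calculation can be written as a small function. This is a sketch of the one-sample t-statistic only; looking up the p-value still needs a t-distribution table or calculator, because the standard library has no t-distribution. The function name is illustrative.

```python
import math


def one_sample_t_statistic(sample_mean, population_mean, sample_sd, n):
    """t = (sample mean - hypothesized mean) / (sample SD / sqrt(n))."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error


t_stat = one_sample_t_statistic(sample_mean=88, population_mean=80, sample_sd=20, n=28)
degrees_of_freedom = 28 - 1

print(round(t_stat, 3), degrees_of_freedom)  # 2.117 27
```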

82
3. Inferential Statistics

T-test Example - Solution


• P-value for t-score 2.11 (one-tailed, 27 degrees of freedom) = 0.022
• Alpha = 0.05
• P-value<0.05 => reject the null hypothesis; the extra study hours improved performance

• Use the link below to calculate the p-value:

• https://www.socscistatistics.com/pvalues/tdistribution.aspx

83
3. Inferential Statistics

Z-test
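• When to conduct a Z-test
• Sample size is large >=30
• Data follows normal distribution
• Population standard deviation is known
• Formula for the one-sample z-test is as follows:

• $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$, where $\bar{x}$ is the sample mean, $\mu$ the population mean, $\sigma$ the population standard deviation and $n$ the sample size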

84
3. Inferential Statistics

Z-test
• Formula for the two-sample z-test is as follows:

85
3. Inferential Statistics

Z-test Example
• A school teacher claims that the students in his/her school are of above-average intelligence. A
random sample of 40 students' IQ scores has a mean of 120. The mean population IQ is 110 with a
standard deviation of 18. Is there sufficient evidence to support the teacher's claim?

86
3. Inferential Statistics

Z-test Example
• Sample size = 40
• Population mean = 110
• Sample mean = 120
• Population standard deviation = 18

• Null Hypothesis:
• H0: mean = 110

• Alternate hypothesis
• H1: mean > 110
87
3. Inferential Statistics

Z-test Example

• Z-statistic:
• Z=(120-110)/(18/sqrt(40))
• Z=3.513

88
3. Inferential Statistics

Z-test Example
• P-value for Z-value 3.513 (one-tailed):
• P-value ≈ 0.0002

• P-value<0.05 => reject the null hypothesis

• Use the link below to calculate the p-value:

• https://www.socscistatistics.com/pvalues/normaldistribution.aspx

89
3. Inferential Statistics

Z-test Example-2
• A school teacher claims that the students in his/her school are of above-average intelligence. A
random sample of 30 students' IQ scores has a mean of 120. The mean population IQ is 116 with a
standard deviation of 15. Is there sufficient evidence to support the teacher's claim?

90
3. Inferential Statistics

Z-test Example-2
• Sample size = 30
• Population mean = 116
• Sample mean = 120
• Population standard deviation = 15

• Null Hypothesis:
• H0: mean = 116

• Alternate hypothesis
• H1: mean>116
91
3. Inferential Statistics

Z-test Example-2

• Z-statistic:
• Z=(120-116)/(15/sqrt(30))
• Z=1.4605

92
3. Inferential Statistics

Z-test Example-2
• P-value for Z-value 1.4605 (one-tailed):
• P-value = 0.072076

• P-value>0.05 => do not reject the null hypothesis

• Use the link below to calculate the p-value:

• https://www.socscistatistics.com/pvalues/normaldistribution.aspx
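
As an alternative to an online calculator, both z-test examples can be computed with the standard library. This sketch assumes a one-tailed (upper-tail) test, matching the claim that the mean is above the population mean; the function name `one_sample_z_test` is illustrative.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, population_mean, population_sd, n):
    """Return (z statistic, one-tailed upper p-value) for a one-sample z-test."""
    z = (sample_mean - population_mean) / (population_sd / math.sqrt(n))
    p_value = 1 - NormalDist().cdf(z)  # area to the right of z
    return z, p_value


alpha = 0.05

z1, p1 = one_sample_z_test(120, 110, 18, 40)  # first example
z2, p2 = one_sample_z_test(120, 116, 15, 30)  # second example

print(round(z1, 3), p1 < alpha)  # 3.514 True  -> reject H0
print(round(z2, 3), p2 < alpha)  # 1.461 False -> do not reject H0
```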

93
3. Inferential Statistics

ANOVA Test
• ANOVA stands for Analysis of Variance; it compares the variance between groups with the variance within groups to test whether the group means differ
• One-way ANOVA
• H0: all group means are equal
• H1: at least one group mean is different

94
3. Inferential Statistics

One-way ANOVA

95
3. Inferential Statistics

One-way ANOVA Example

Marks
Morning Study Hours 56, 60, 70, 80, 90
Noon Study Hours 60, 67, 69, 78, 92
Evening Study Hours 85, 88, 89, 90, 90

96
3. Inferential Statistics

One-way ANOVA Example

Group                  Marks                 Sample Size   Sample Mean   Sample Variance
Morning Study Hours    56, 60, 70, 80, 90    5             71.2          197.2
Noon Study Hours       60, 67, 69, 78, 92    5             73.2          151.7
Evening Study Hours    85, 88, 89, 90, 90    5             88.4          4.3
Overall                                      15            77.6          164.11

97
3. Inferential Statistics

One-way ANOVA Example

• SSB = 5*sqr(71.2-77.6) + 5*sqr(73.2-77.6) + 5*sqr(88.4-77.6) = 884.8
• MSB = SSB/(k-1) => 884.8/2 => 442.4

• SSW = (5-1)*197.2 + (5-1)*151.7 + (5-1)*4.3 = 1412.8

• MSW = SSW/(n-k) => 1412.8/12 => 117.733

• F = MSB/MSW => 442.4/117.733 => 3.76

• F-critical value = 3.89 (degrees of freedom 2 and 12, alpha = 0.05)

• Fail to reject H0 => F-statistic < F-critical
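
The F-statistic can be recomputed directly from the raw marks, which avoids any confusion between sample variance and sample standard deviation. This is a plain-Python sketch of one-way ANOVA; the function name is illustrative, and the critical value 3.89 is taken from the slide rather than computed.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, MSB, MSW) for a list of groups of observations."""
    k = len(groups)                  # number of groups
    n = sum(len(g) for g in groups)  # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares
    ssw = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    msb = ssb / (k - 1)
    msw = ssw / (n - k)
    return msb / msw, msb, msw


morning = [56, 60, 70, 80, 90]
noon = [60, 67, 69, 78, 92]
evening = [85, 88, 89, 90, 90]

f_stat, msb, msw = one_way_anova([morning, noon, evening])
f_critical = 3.89  # F(2, 12) at alpha = 0.05, as quoted in the slide

print(round(msb, 1), round(msw, 3), round(f_stat, 2))  # 442.4 117.733 3.76
print(f_stat > f_critical)                             # False
```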
98
Chapter 4

NumPy for Mathematical Computing

99
4. NumPy for Mathematical Computing

Introduction
• NumPy stands for Numerical Python
• Mainly developed for numerical operations on vectors and matrices
• It focuses on linear algebra and matrices
• Operations on NumPy arrays are typically much faster than equivalent loops over Python lists

• Installing NumPy package


• pip install numpy

• Using NumPy in a script


• import numpy as np
100
4. NumPy for Mathematical Computing

array() function
• Arrays in NumPy are called ndarrays
• We can use array() from numpy package to create array

• Following script creates simple 1D array with 5 values


• np.array([10, 20, 30, 40, 50])

101
4. NumPy for Mathematical Computing

array() function
• Arrays in NumPy are called ndarrays
• We can use array() from numpy package to create array

• Following script creates simple 1D array with 5 values


• a1 = np.array([10, 20, 30, 40, 50])

• Check the array


• print(a1)

102
4. NumPy for Mathematical Computing

array() function
• array() syntax:
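• numpy.array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)

• Where
• object -> data (any array-like object, such as a list)
• dtype -> data type of elements
• copy -> whether to copy the data into a new array
• order -> memory layout: 'K', 'A', 'C' or 'F'
• subok -> if True, sub-classes are passed through
• ndmin -> minimum number of dimensions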
103
4. NumPy for Mathematical Computing

array() function
• Following script creates 1D array
• a1 = np.array([1,2,3,4])

• Following script creates 2D array


• a1 = np.array([[1,2],[3,4]])

• Note: if data starts with [ then it is 1D, [[ then it is 2D, [[[ then it is 3D
• 1D collection of normal values
• 2D collection of 1D arrays
• 3D collection of 2D arrays
104
4. NumPy for Mathematical Computing

Array Attributes
• shape -> returns shape of the array in terms of dimensions
• ndim -> returns number of dimensions
• size -> returns number of elements in the array
• itemsize -> returns memory occupied by each element

105
4. NumPy for Mathematical Computing

Array Attributes Example


import numpy as np

a1 = np.array([[1,2], [3,4]])

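print(a1.shape)     # (2, 2): 2 rows and 2 columns
print(a1.dtype)     # data type of the elements (platform dependent, e.g. int64)
print(a1.size)      # 4: total number of elements
print(a1.itemsize)  # bytes occupied by each element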

106
4. NumPy for Mathematical Computing

Array Creation Functions


• a2 = np.empty((2,2), dtype='i1')

• a3 = np.zeros((2,2), dtype='i1')

• a4 = np.ones((2,2), dtype='i1')

• a5 = np.arange(1, 11)

107
4. NumPy for Mathematical Computing

Reshaping Array
• Converting one shape to other shape
• We use reshape() function for above

• Example:
a1 = np.arange(1, 21)
a2 = np.reshape(a1, (5, 4))

108
4. NumPy for Mathematical Computing

Generating Random Values and Arrays


• We use random sub module from numpy to generate random values and arrays

• Example:

r1 = np.random.rand() #returns 1 random float in [0, 1)


r2 = np.random.randn(10) #returns 1D array with 10 samples from the standard normal distribution

109
4. NumPy for Mathematical Computing

Indexing and Slicing


• We use +ve and –ve indexes to access elements from an array
• +ve index starts with 0 (first element), -ve index starts with -1 (last element)
• We can use start:stop:step for range selection

• Note: similar to list slicing

110
4. NumPy for Mathematical Computing

Indexing and Slicing Example


a = np.random.randint(90, size=100)

print(a)

print(a[0]) #returns 1st value


print(a[-1]) #returns last value
print(a[2:6]) #returns values at index 2 to 5
print(a[5:20:2]) #returns values at index 5 to 19, taking every 2nd element

111
4. NumPy for Mathematical Computing

Important Methods
• np.sort() -> returns sorted array
• np.max() -> returns maximum from the array
• np.mean() -> returns the mean of elements in the array
• np.argmax() -> returns the position of the maximum value in the array
• np.argmin() -> returns the position of the minimum value in the array
• np.unique() -> returns array with unique values in the array

112
4. NumPy for Mathematical Computing

Important Methods
• np.std() -> returns standard deviation of elements in the array
• np.var() -> returns the variance
• np.cov()-> returns covariance
• np.corrcoef()->returns the correlation coefficient

• np.save() -> saves array into a file


• np.load() -> reads array from a file

113
4. NumPy for Mathematical Computing

Important Methods Example


a = np.random.randint(90, size=10)

print(a)
print(np.sort(a))
print(np.mean(a))
print(np.unique(a))
print(np.argmax(a))
print(np.argmin(a))
print(np.std(a))

114
4. NumPy for Mathematical Computing

Important Methods Example


a = np.random.randint(90, size=10)

#saving numpy array to a file


np.save('data', a)

#reading numpy array from a file


a1 = np.load('data.npy')

115
Chapter 5

Data Manipulation using Pandas

116
5. Data Manipulation using Pandas

Pandas Introduction
• Pandas stands for Panel Data
• Mainly created for data manipulation with high performance
• Introduced in 2008 by Wes McKinney
• It is built on top of NumPy
• We can perform data analysis tasks using Pandas effectively and with ease
• We need to install the pandas package from the repository to work with it
• pip install pandas

117
5. Data Manipulation using Pandas

Pandas Types
• Pandas has the following data types:
• Series
• It is a 1D array with an index
• Dataframe
• It is a 2D array for tabular data
• Panel
• It was a 3D array, a collection of dataframes; it has been removed from recent pandas releases

Note: we mainly work with Dataframes for data analysis and building machine learning models
118
5. Data Manipulation using Pandas

Dataframe
• A Dataframe is a 2D data set with row names/indexes and column names
• Each column can hold a different type of data

Columns: Eno, Name, Salary (each line below is one row)

Eno    Name      Salary
100    Nagalli   45000
200    Shiva     35000
300    Kumar     38000

119
5. Data Manipulation using Pandas

Dataframe

Creating Dataframe

import pandas as pd

emp_df = pd.DataFrame(
{
'eno':[100,200,300,400,500,600],
'ename':['ab','cd','xy','mn','df','er'],
'salary':[12000, 14500, 50000, 45000, 20000, 25000],
'did':[1,1,2,2,3,3]
})

#displaying entire dataframe


emp_df

120
5. Data Manipulation using Pandas

Dataframe

Creating Dataframe

dept_df = pd.DataFrame(
{
'did':[1,2,3,4],
'dname':['Accounting', 'Sales', 'Marketing', 'IT'], #department names
'location':[1,1,2,2] #one value per department; all columns must have the same length
})

#displaying dataframe
dept_df

121
5. Data Manipulation using Pandas

Dataframe Attributes
• Following are important attributes of a Dataframe
• shape
• Returns the shape of the dataframe in terms of rows and columns
• size
• Returns the number of elements in the dataframe (rows * columns)
• dtypes
• Returns datatype of each column
• columns & index
• Returns name of each column and name of each row
• values
• Returns the entire data as a NumPy array
5. Data Manipulation using Pandas

Dataframe Attributes

DF Attributes

print(emp_df.shape) #prints shape

print(emp_df.size) #prints size(rows*columns)

print(emp_df.dtypes) #prints column data types

print(emp_df.columns) #prints column names

print(emp_df.index) #prints row index

print(emp_df.values) #prints values in numpy array

123
5. Data Manipulation using Pandas

Dataframe Column Selection


• We use column names to select specific columns
• Selecting one column
• df['col_name1']
• Selecting more than one column
• df[['col_name1', 'col_name4']]

124
5. Data Manipulation using Pandas

Dataframe Column Selection

Column Selection

emp_df['eno'] #selects only eno column values

emp_df[['eno', 'ename']] #selects eno and ename column values

emp_df[['eno', 'did']] #selects eno and did column values

emp_df[['did', 'ename']] #selects did and ename columns (columns are returned in the order listed)

125
5. Data Manipulation using Pandas

Dataframe Row Selection


• We mainly use the loc attribute to select rows and columns by label (name)
• With the default index (0, 1, 2, ...) the row labels match the row positions
• Selecting first row with all columns
• df.loc[0, ]
• Selecting third row with all columns
• df.loc[2, ]
• Selecting first and fifth row with all columns
• df.loc[[0,4], ]
• Selecting first and fifth row with eno column
• df.loc[[0,4], 'eno']
• Selecting first and fifth row with eno and ename column
• df.loc[[0,4], ['eno', 'ename']]
5. Data Manipulation using Pandas

Dataframe Row Selection

Row Selection (loc)

emp_df.loc[2,] #select 3rd row with all column values

emp_df.loc[0,] #select 1st row with all column values

emp_df.loc[[0,4],] #selecting 1st and 5th rows with all column values

emp_df.loc[[0,4], ['eno']] #selecting 1st and 5th rows with eno column values

emp_df.loc[[0,4], ['eno', 'did']] #selecting 1st and 5th rows with eno and did column values

emp_df.loc[1:4,'eno':'did'] #selecting rows from 2nd to 5th (loc slices include the end label) and columns from eno to did

127
5. Data Manipulation using Pandas

Dataframe Row Selection


• We use the iloc attribute to select rows and columns by integer position
• It works like loc, but it accepts only positions (numbers), never labels

• Selecting first row with all columns


• df.iloc[0, ]
• Selecting third row with all columns
• df.iloc[2, ]
• Selecting first and fifth row with all columns
• df.iloc[[0,4], ]
• Selecting first and fifth row with first column
• df.iloc[[0,4], 0]
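The difference between loc and iloc is easiest to see when the row labels are not 0, 1, 2, .... The following is a minimal sketch with a made-up dataframe whose index holds employee numbers (an assumption for illustration; emp_df in this chapter uses the default index).

```python
import pandas as pd

df = pd.DataFrame(
    {'ename': ['ab', 'cd', 'xy'], 'salary': [12000, 14500, 50000]},
    index=[100, 200, 300],   # row labels are employee numbers here
)

by_label = df.loc[200, 'salary']   # row labelled 200, column named salary
by_position = df.iloc[1, 1]        # second row, second column

label_slice = df.loc[100:200]      # label slice: the end label is included
position_slice = df.iloc[0:2]      # position slice: the end position is excluded

print(by_label, by_position)
print(label_slice.equals(position_slice))
```

Here df.loc[1] would raise a KeyError because no row is labelled 1, while df.iloc[1] returns the second row.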
5. Data Manipulation using Pandas

Dataframe Row Selection

Row Selection (iloc)

emp_df.iloc[2,] #select 3rd row with all column values

emp_df.iloc[0,] #select 1st row with all column values

emp_df.iloc[[0,4],] #selecting 1st and 5th rows with all column values

emp_df.iloc[[0,4],0] #selecting 1st and 5th rows with eno column values

emp_df.iloc[[0,4], [0, 2]] #selecting 1st and 5th rows with eno and salary column values (positions 0 and 2)

emp_df.iloc[1:4,0:3] #selecting rows from 2nd to 4th and columns from eno to salary (the end position is excluded)

129
5. Data Manipulation using Pandas

Dataframe Important Methods


• head()
• Returns top 5 rows by default
• head(10)
• Returns top 10 rows

• tail()
• Returns bottom 5 rows
• tail(10)
• Returns bottom 10 rows

130
5. Data Manipulation using Pandas

Dataframe Important Methods

head() and tail()

emp_df.head() #selects top 5 rows from dataframe

emp_df.head(3) #selects top 3 rows from dataframe

emp_df.tail() #selects bottom 5 rows from dataframe

emp_df.tail(3) #selects bottom 3 rows from dataframe

131
5. Data Manipulation using Pandas

Dataframe Important Methods


• describe()
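• Returns a descriptive-statistics summary of the data set (count, mean, std, min, quartiles, max)
• apply()
• Applies a function along an axis (column-wise by default)
• applymap()
• Applies a function to every element (renamed to map() in pandas 2.1, where applymap() is deprecated)
• eval()
• Evaluates an expression given as a string, using column names
• count()
• Returns the number of non-missing values in each column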
132
5. Data Manipulation using Pandas

Dataframe Important Methods

emp_df.describe() #displays descriptive summary of the dataframe (numeric columns only)

emp_df.apply('max') #displays max value from each column

emp_df.apply('count', axis=1) #displays number of non-missing values in each row

emp_df.apply(lambda x: x*2) #displays value*2 for each column (text values are repeated twice)

emp_df.apply(lambda x: x*4) #displays value*4 for each column

emp_df.applymap(lambda x: x*3) #displays value*3 for each element (use emp_df.map() in pandas 2.1+)

133
5. Data Manipulation using Pandas

Dataframe Important Methods

emp_df.eval('salary*0.1') #displays 10% bonus for each employee

emp_df.count() #displays number of non-missing values in each column

emp_df.iloc[0,2] = np.nan #updating value to NaN at 1st row and 3rd column (needs import numpy as np)

emp_df.iloc[1,2] = np.nan #updating value to NaN at 2nd row and 3rd column

emp_df.iloc[1,1] = np.nan #updating value to NaN at 2nd row and 2nd column

emp_df.count() #counts are now lower for the columns that contain NaN

134
5. Data Manipulation using Pandas

Dataframe Important Methods


• value_counts()
• Returns the count of each unique row (or of each unique value for a single column)
• rename()
• Renames the column names or row indexes
• duplicated()
• Returns True if the row repeats an earlier row
• drop_duplicates()
• Removes duplicated rows
• isna() / isnull()
• Returns True where a value is missing
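Because emp_df has no repeated rows, the effect of these methods is easier to see on a small dataframe that does. The following sketch uses made-up values (an assumption for illustration) with one repeated row pair and one missing salary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'did':    [1, 1, 2, 2, 2],
    'salary': [12000, 12000, 50000, np.nan, 50000],
})

counts = df['did'].value_counts()   # how many times each did occurs
dup_mask = df.duplicated()          # True for rows that repeat an earlier row
deduped = df.drop_duplicates()      # keeps the first occurrence of each row
missing = df.isna().sum()           # number of missing values per column
renamed = df.rename(columns={'did': 'deptid'})

print(counts)
print(dup_mask.tolist())
print(deduped)
print(missing)
```

None of these calls changes df itself; each returns a new object, so the result must be assigned if it is needed later.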
135
5. Data Manipulation using Pandas

Dataframe Important Methods

emp_df.value_counts() #displays count for each unique row

emp_df['did'].value_counts() #displays count for each unique value; useful for categorical columns

emp_df.rename(columns={'did':'deptid', 'ename':'Name'}) #rename columns old_name:new_name

emp_df.duplicated() #displays True if row repeats an earlier row

emp_df['did'].duplicated() #displays True if value repeats an earlier value

emp_df.drop_duplicates() #returns the dataframe without repeated rows

emp_df['did'].drop_duplicates() #returns the did values without repeats

emp_df.isna() #displays True where a value is missing


5. Data Manipulation using Pandas

Dataframe Important Methods


• sort_values()
• Sorts the rows by one or more columns
• notna() / notnull()
• Returns True where a value is not missing
• dropna()
• Deletes rows with missing (NA) values
• fillna()
• Replaces missing (NA) values with a given value
• groupby()
• Groups the rows based on the values of a column
137
5. Data Manipulation using Pandas

Dataframe Important Methods

emp_df.sort_values(by='salary') #displays in increasing order by salary

emp_df.sort_values(by='salary', ascending=False) #displays in decreasing order by salary

emp_df.sort_values(by=['did', 'salary']) #displays in increasing order by did, then by salary

emp_df.notna() #displays True where a value is not missing, opposite of isna()

emp_df.dropna() #deletes rows that contain any missing value

emp_df.dropna(thresh=2) #keeps only the rows that have at least 2 non-missing values

138
5. Data Manipulation using Pandas

Dataframe Important Methods

emp_df.dropna(axis=1) #deletes columns that contain missing values

emp_df.fillna(15000) #replaces every missing value with 15000

emp_df.ffill() #replaces missing value with the previous row's value (fillna(method='ffill') is deprecated in pandas 2.1+)

emp_df.bfill() #replaces missing value with the next row's value (fillna(method='bfill') is deprecated in pandas 2.1+)

emp_df.fillna({'salary': emp_df['salary'].mean()}) #replaces missing salary values with the mean salary

139
5. Data Manipulation using Pandas

Dataframe Important Methods

emp_df_by_did = emp_df.groupby(by='did') #create object with data grouped by did

emp_df_by_did.keys #displays column names used for grouping

emp_df_by_did.groups #displays keys and row labels as a dictionary

emp_df_by_did[['eno', 'salary']].agg(['mean']) #displays mean for each group (numeric columns only; mean is not defined for ename)

emp_df_by_did[['eno', 'salary']].agg(['mean', 'max']) #displays mean and max for each group

emp_df_by_did.agg({'eno':'mean', 'salary':'max'}) #displays mean of eno and max of salary

140
5. Data Manipulation using Pandas

Dataframe Important Methods

Appending dataframes

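pd.concat([emp_df, emp1]) #appends emp1 to emp_df (DataFrame.append() was removed in pandas 2.0)

pd.concat([emp_df, emp1, emp2]) #appends emp1 and emp2 to emp_df; emp1 and emp2 are other dataframes with the same columns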

141
5. Data Manipulation using Pandas

Dataframe Important Methods

Merging dataframes

emp_df.merge(dept_df, how='inner', left_on='did', right_on='did') #inner join

emp_df.merge(dept_df, how='left', left_on='did', right_on='did') #left outer join

emp_df.merge(dept_df, how='right', left_on='did', right_on='did') #right outer join

emp_df.merge(dept_df, how='outer', left_on='did', right_on='did') #full outer join

emp_df.merge(dept_df, how='cross') #cross join
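To see how the join types differ, it helps to use keys that do not all match. The following is a minimal sketch with two small made-up dataframes (emp, dept and their values are assumptions for illustration): did 5 exists only on the employee side and did 4 only on the department side.

```python
import pandas as pd

emp = pd.DataFrame({'eno': [100, 200, 300], 'did': [1, 2, 5]})
dept = pd.DataFrame({'did': [1, 2, 4], 'dname': ['Accounting', 'Sales', 'IT']})

inner = emp.merge(dept, how='inner', on='did')  # only did values present in both
left = emp.merge(dept, how='left', on='did')    # all employees; dname is NaN for did 5
right = emp.merge(dept, how='right', on='did')  # all departments; eno is NaN for did 4
outer = emp.merge(dept, how='outer', on='did', indicator=True)  # _merge shows the source side
cross = emp.merge(dept, how='cross')            # every employee paired with every department

print(len(inner), len(left), len(right), len(outer), len(cross))
```

When the key column has the same name in both dataframes, on='did' is a shorter equivalent of left_on='did', right_on='did'.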

142
5. Data Manipulation using Pandas

Dataframe Important Methods

Creating New Columns

emp_df['bonus'] = emp_df['salary'] * 0.1 #creates new column bonus

emp_df['new_salary'] = emp_df['salary'] + emp_df['salary'] * 0.2 #creates new column new_salary

emp_df['new_salary'] = emp_df['salary'] + emp_df['salary'] * 0.1 #update column new_salary if exist

143
5. Data Manipulation using Pandas

Dataframe Important Methods

Filtering Rows

emp_df.where(emp_df['eno']>100) #keeps every row, but replaces the values with NaN where eno>100 is false

emp_df[emp_df['eno']>100] #selects rows where eno>100

emp_df[(emp_df['eno']>100) & (emp_df['salary']>20000)] #selects rows where eno>100 and salary>20000

emp_df.loc[(emp_df['eno']>100) & (emp_df['salary']>20000)] #same selection using loc with a boolean mask

emp_df.query('eno>100 and salary>20000') #same selection using a query string

144
5. Data Manipulation using Pandas

Dataframe Important Methods

Filtering Rows

emp_df.query('ename=="xy"') #selects rows where ename is xy

emp_df.query('ename.str.contains("x")', engine='python') #selects rows where ename contains x

emp_df.query('did in [1,3]') #selects rows where did is 1 or 3

emp_df.query('did in [1,3]').sort_values(by='bonus', ascending=False) #filters the rows, then sorts them by bonus in decreasing order

emp_df.query('did not in [1,3]') #selects rows where did is neither 1 nor 3

145
5. Data Manipulation using Pandas

Dataframe

Reading CSV

emp_df1 = pd.read_csv('emp.csv') #reads emp.csv, 1st row as column names

emp_df1 = pd.read_csv('emp.csv', skiprows=2) #skips the first two lines; the next line becomes the column names

emp_df1 = pd.read_csv('emp.csv', skiprows=2, header=None) #skips two lines; no header row, so columns are numbered 0, 1, ...

emp_df1 = pd.read_csv('emp.csv', skiprows=2, header=None, names=['col1','col2']) #skips two lines and uses the given column names
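The effect of skiprows, header and names can be tried without a file on disk. The following sketch reads from an in-memory string that stands in for emp.csv; the file contents and column names are assumptions for illustration.

```python
import io
import pandas as pd

# Stand-in for emp.csv: two note lines followed by data rows without a header
csv_text = """exported from HR system
do not edit
100,ab,12000
200,cd,14500
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=2,                        # ignore the first two lines
    header=None,                       # the remaining lines contain no header row
    names=['eno', 'ename', 'salary'],  # so supply the column names explicitly
)
print(df)
```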

146
5. Data Manipulation using Pandas

Dataframe

Writing CSV

emp_df1 = pd.read_csv('emp.csv')

emp_df1_d50 = emp_df1.query('DEPARTMENT_ID==50') #filtering emp_df1 where DEPARTMENT_ID is 50

emp_df1_d50.to_csv('emp_d50.csv') #writing emp_df1_d50 to emp_d50.csv

emp_df1_d50not = emp_df1.query('DEPARTMENT_ID!=50') #filtering emp_df1 where DEPARTMENT_ID is not 50

emp_df1_d50not.to_csv('emp_d50not.csv') #writing emp_df1_d50not to emp_d50not.csv

147
5. Data Manipulation using Pandas

Dataframe

Reading SQL Table

import pandas as pd
import pymysql
from sqlalchemy import create_engine

con = create_engine('mysql+pymysql://root:Staragile_123@localhost/my_db')
df_product = pd.read_sql('SELECT * FROM product', con) #read the entire table

148
5. Data Manipulation using Pandas

Dataframe

Writing SQL Table

df_product_apple = df_product.query('vendor == "Apple"') #filtering df_product where vendor is Apple

df_product_apple.to_sql('product_apple', con) #writes the filtered rows to a new table product_apple
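The same read and write round trip can be tried without a MySQL server. The following sketch swaps in Python's built-in sqlite3 module with an in-memory database, which pandas accepts directly as a connection; the table contents are made-up values for illustration.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')   # throwaway in-memory database

df_product = pd.DataFrame({
    'name': ['iPhone', 'Galaxy', 'MacBook'],
    'vendor': ['Apple', 'Samsung', 'Apple'],
})
df_product.to_sql('product', con, index=False)   # create and fill the table

# read back only the Apple rows, then store them in a second table
df_apple = pd.read_sql("SELECT * FROM product WHERE vendor = 'Apple'", con)
df_apple.to_sql('product_apple', con, index=False, if_exists='replace')

row_count = pd.read_sql('SELECT COUNT(*) AS n FROM product_apple', con)['n'][0]
print(row_count)
con.close()
```

index=False keeps the dataframe index out of the table, and if_exists='replace' avoids an error when the table already exists.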

149
Chapter 6

Data Visualization with Matplotlib and Seaborn

151
6. Data Visualization with Matplotlib and Seaborn

What is Data Visualization?


• Data visualization is the graphical representation of data
• We use visual elements such as charts and graphs
• These visuals reveal patterns, outliers and trends in data
• They also give non-technical people, such as managers and end customers, an idea about the data
• Visualization is important for analyzing massive amounts of data and making data-driven decisions
• We can use the following two packages for visualization:
• Matplotlib
• Seaborn
152
6. Data Visualization with Matplotlib and Seaborn

Matplotlib
• Matplotlib is a Python package for creating visualizations from data
• It was created by John D. Hunter
• It is open source and free

Installing Matplotlib

pip install matplotlib

153
6. Data Visualization with Matplotlib and Seaborn

Matplotlib

importing Matplotlib

#importing matplotlib
import matplotlib

#finding version of the matplotlib


print(matplotlib.__version__)

154
6. Data Visualization with Matplotlib and Seaborn

pyplot
• pyplot is a submodule of Matplotlib
• It provides the plotting functions for all chart types

Importing pyplot

from matplotlib import pyplot as plt


#here plt is the alias for pyplot

155
6. Data Visualization with Matplotlib and Seaborn

plot
• plot is the function to draw a line chart
• It mainly takes two vectors, one for the x-axis and one for the y-axis

Plotting line graph

from matplotlib import pyplot as plt

plt.plot([10,20,30,40,50], [100, 120, 130,110,100])

156
6. Data Visualization with Matplotlib and Seaborn

plot without x values


• Without x-axis values
• The x values default to [0, 1, 2, ...]

Plotting line graph

from matplotlib import pyplot as plt

plt.plot([100, 120, 130,110,100])

#here x-axis is [0,1,2,3,4]

157
6. Data Visualization with Matplotlib and Seaborn

plot line style


• We can change the style of the line as follows:
• solid (-)
• dotted (:)
• dashed (--)
• dashdot (-.)
• None ('None', '' or ' ')
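As a minimal sketch, the four visible line styles can be compared on one chart. The vertical offset is only there so the lines do not overlap, and the Agg backend is an assumption made so the script runs without a display; drop that line in a notebook.

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend; remove this line in a notebook
from matplotlib import pyplot as plt

x = [1, 2, 3, 4, 5]
y = [100, 120, 130, 110, 100]
styles = ['-', ':', '--', '-.']   # solid, dotted, dashed, dashdot

fig, ax = plt.subplots()
lines = []
for offset, style in enumerate(styles):
    # shift each line upwards so the four styles do not overlap
    shifted = [value + 20 * offset for value in y]
    line, = ax.plot(x, shifted, ls=style, label=repr(style))
    lines.append(line)
ax.legend()
```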

158
6. Data Visualization with Matplotlib and Seaborn

plot line style

Line style

from matplotlib import pyplot as plt

x = [1, 2, 3, 4, 5]
y = [100, 120, 130, 110, 100]

plt.plot(x, y, ls=':')

#we can use linestyle or ls as parameter
#we can use dotted or : as value

159
6. Data Visualization with Matplotlib and Seaborn

Marker Style
• A marker is a symbol drawn at each data point
• We can set the marker as follows:
• Circle (o)
• Star (*)
• Plus (+)
• Filled Plus (P)

160
6. Data Visualization with Matplotlib and Seaborn

Marker Style

Marker Style

from matplotlib import pyplot as plt

x = [1, 2, 3, 4, 5]
y = [100, 120, 130, 110, 100]

plt.plot(x, y, ls=':', marker='o')

#here o is circle

161
6. Data Visualization with Matplotlib and Seaborn

Marker Size
• It sets the size of the marker
• We can use markersize or ms to set the size

162
6. Data Visualization with Matplotlib and Seaborn

Marker Size

Marker Size

from matplotlib import pyplot as plt

x = [1, 2, 3, 4, 5]
y = [100, 120, 130, 110, 100]

plt.plot(x, y, ls=':', marker='o', markersize=20)

163
6. Data Visualization with Matplotlib and Seaborn

Marker Color
• It sets the color of the marker
• We can use markeredgecolor or mec to set the edge color
• We can use markerfacecolor or mfc to set the face (fill) color
• Colors can be given as follows:
• Red ('r')
• Green ('g')
• Blue ('b')
• Hex RGB ('#ffffff')

164
6. Data Visualization with Matplotlib and Seaborn

Marker Color

Marker Color

from matplotlib import pyplot as plt

x = [1, 2, 3, 4, 5]
y = [100, 120, 130, 110, 100]

plt.plot(x, y, ls=':', marker='o', ms=20, mec='r', mfc='#ffff00')

165
6. Data Visualization with Matplotlib and Seaborn

Chart Title
• We can use the title() function from pyplot to set the title of the chart
• Important parameters of title():
• fontdict
• To set the font style
• loc
• To set where the title appears ('left', 'center' or 'right')

166
6. Data Visualization with Matplotlib and Seaborn

Chart Title

title()

from matplotlib import pyplot as plt

plt.plot(x, y, ls=':', marker='o', ms=20, mec='r', mfc='#ffff00')

chart_title = 'Chart Title'
font_style = {'family':'serif','color':'green','size':25}
title_location = 'left'

plt.title(chart_title, fontdict=font_style, loc=title_location)

167
6. Data Visualization with Matplotlib and Seaborn

X and Y Labels
• xlabel()
• Displays x-axis name/label
• ylabel()
• Displays y-axis name/label

• Important parameters:
• fontdict
• loc
• labelpad

168
6. Data Visualization with Matplotlib and Seaborn

X and Y Labels

Labels

from matplotlib import pyplot as plt

plt.plot(x, y, ls=':', marker='o')

plt.title('Chart Title')

plt.xlabel('X-axis', labelpad=50)
plt.ylabel('Y-axis', loc='top')

169
6. Data Visualization with Matplotlib and Seaborn

X and Y Ticks
• Ticks are the marks and labels along the x and y axes
• By default they are chosen automatically from the data
• To change we can use following:
• xticks() for x-axis
• yticks() for y-axis

170
6. Data Visualization with Matplotlib and Seaborn

X and Y Ticks

Ticks

from matplotlib import pyplot as plt

plt.plot(x, y, ls=':', marker='o')


plt.title('Chart Title')

plt.xticks(ticks=[0,1,2,3,4,5])
plt.yticks(ticks=[100, 120, 130], labels=['low', 'mid', 'high'])

171
6. Data Visualization with Matplotlib and Seaborn

X and Y Grid
• A grid is a set of reference lines drawn across the chart for the x and y axes
• By default no grid is shown
• grid() can be used to display the lines
• Parameters are:
• axis
• 'x', 'y' or 'both'
• color
• linestyle
• linewidth
• alpha
172
6. Data Visualization with Matplotlib and Seaborn

X and Y Grid

Grid

from matplotlib import pyplot as plt

plt.plot(x, y, ls=':', marker='o')


plt.title('Chart Title')

plt.grid(color = '#ff00ff', linestyle = '-', linewidth = 1)

173
6. Data Visualization with Matplotlib and Seaborn

Scatter plot
• It is also referred to as a dot plot
• scatter() can be used to display a scatter plot
• scatter() parameters:
• c
• Array of colors for each dot
• s
• Array of sizes for each dot
• cmap
• Color map

174
6. Data Visualization with Matplotlib and Seaborn

Scatter plot

Scatter plot - 1

import matplotlib.pyplot as plt


import numpy as np

x = np.random.randint(1,20, size=10)
y = np.random.randint(100,1000, size=10)

plt.scatter(x, y)
plt.title('Scatter Plot')

175
6. Data Visualization with Matplotlib and Seaborn

Scatter plot

Scatter plot - 2

import matplotlib.pyplot as plt


import numpy as np

x = np.random.randint(1,20, size=10)
y = np.random.randint(100,1000, size=10)
sizes = np.random.randint(10,200, size=10)

plt.scatter(x, y, s=sizes, c=sizes, cmap='rainbow') #s sets the size and c the color of each dot


plt.title('Scatter Plot')

176
6. Data Visualization with Matplotlib and Seaborn

Bar plot
• bar() and barh() can be used to display vertical and horizontal bar plots
• bar() parameters:
• color
• Bar color
• edgecolor
• Bar outline color
• width
• Bar width
• height
• Bar heights for bar(); bar thickness for barh()
177
6. Data Visualization with Matplotlib and Seaborn

Bar plot

Bar plot - 1

import matplotlib.pyplot as plt


import numpy as np

x = ['iPhone', 'Galaxy', 'Realme', 'Vivo', 'Nokia']


y = [20, 35, 18, 25, 38]

plt.bar(x, y)

178
6. Data Visualization with Matplotlib and Seaborn

Bar plot

Bar plot - 2

import matplotlib.pyplot as plt


import numpy as np

x = ['iPhone', 'Galaxy', 'Realme', 'Vivo', 'Nokia']


y = [20, 35, 18, 25, 38]

plt.bar(x, y, color='green', width=0.2, linewidth=1, edgecolor='red')

179
6. Data Visualization with Matplotlib and Seaborn

Bar plot

Bar plot - 3

import matplotlib.pyplot as plt


import numpy as np

x = ['iPhone', 'Galaxy', 'Realme', 'Vivo', 'Nokia']


y = [20, 35, 18, 25, 38]

plt.barh(x, y, height=0.5)

180
6. Data Visualization with Matplotlib and Seaborn

Pie plot
• pie() can be used to display a pie plot
• pie() parameters:
• labels
• Displays a name for each portion
• startangle
• Rotates the start of the pie from the default (the x-axis) to the specified angle
• explode
• Separates portions from the centre
• shadow
• Displays a shadow
181
6. Data Visualization with Matplotlib and Seaborn

Pie plot

Pie plot - 1

import matplotlib.pyplot as plt


import numpy as np

x = [20, 35, 18, 25, 38]

plt.pie(x)

182
6. Data Visualization with Matplotlib and Seaborn

Box plot
• boxplot() can be used to display a box plot
• boxplot() parameters:
• notch
• Draws a notch around the median line
• vert
• Vertical (True) or horizontal (False) boxes

183
6. Data Visualization with Matplotlib and Seaborn

Box plot

Box plot

import matplotlib.pyplot as plt


import numpy as np

marks = np.random.randint(30, 80, 25)


marks = np.append(marks, 180)

plt.boxplot(marks)
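The value 180 is appended so that the box plot shows an outlier. As a sketch of the arithmetic behind the default whiskers (1.5 times the interquartile range beyond the quartiles), the same quantities can be computed with NumPy. The marks below are fixed, made-up values rather than the random ones used above, so the result is reproducible.

```python
import numpy as np

marks = np.array([35, 42, 48, 50, 55, 58, 61, 64, 70, 78, 180])

q1, median, q3 = np.percentile(marks, [25, 50, 75])   # box edges and median line
iqr = q3 - q1                                          # interquartile range (box height)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# points beyond the fences are drawn as individual outlier markers
outliers = marks[(marks < lower_fence) | (marks > upper_fence)]
print(q1, median, q3, iqr, outliers)
```

With these values the quartiles are 49 and 67, the fences are 22 and 94, and only 180 falls outside them.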

184
6. Data Visualization with Matplotlib and Seaborn

Pie plot

Pie plot - 2

import matplotlib.pyplot as plt


import numpy as np

labels = ['iPhone', 'Galaxy', 'Realme', 'Vivo', 'Nokia']


x = [20, 35, 18, 25, 38]
expld = [0.15, 0, 0, 0, 0]

plt.pie(x, labels=labels, explode=expld, shadow=True)

185
6. Data Visualization with Matplotlib and Seaborn

Histogram plot
• hist() can be used to display a histogram (frequency distribution) plot
• hist() parameters:
• color
• Bin color

186
6. Data Visualization with Matplotlib and Seaborn

Histogram plot

Histogram plot - 1

import matplotlib.pyplot as plt


import numpy as np

x = np.random.randint(10, 1000, 20)

plt.hist(x)

187
6. Data Visualization with Matplotlib and Seaborn

Histogram plot

Histogram plot - 2

import matplotlib.pyplot as plt


import numpy as np

x = np.random.normal(200, 10, 400)

plt.hist(x, color='black')

188
6. Data Visualization with Matplotlib and Seaborn

Multiple plot
• subplot() can be used to draw more than one plot in the same figure
• subplot() parameters:
• nrows
• Number of rows
• ncols
• Number of columns
• index
• Subplot number/position, always starts with 1
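A common alternative to calling subplot() once per position is plt.subplots(), which creates the figure and all axes in one call. The following is a minimal sketch under that swap; the data is illustrative and the Agg backend is assumed only so the script runs without a display.

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)
y = x ** 2

# one row, two columns: axes is an array holding one Axes object per cell
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))
axes[0].plot(x, y)
axes[0].set_title('Line')
axes[1].scatter(x, y)
axes[1].set_title('Scatter')
fig.tight_layout()   # avoid overlapping titles and tick labels
```

Unlike the index argument of subplot(), which starts at 1, the axes array is indexed from 0.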

189
6. Data Visualization with Matplotlib and Seaborn

Multiple plot

Multiple plot

import matplotlib.pyplot as plt


import numpy as np

x = np.array(range(10))
y = np.random.randint(10, 20, 10)
plt.subplot(1, 2, 1)
plt.plot(x,y)

plt.subplot(1, 2, 2)
plt.scatter(x,y)
190
6. Data Visualization with Matplotlib and Seaborn

Customizing Plots
• We can change the size, style and other defaults of a plot by using the following:
• figure
• To change width and height of the plot
• style
• To change style of the plot
• rcParams
• To customize the plot defaults

191
6. Data Visualization with Matplotlib and Seaborn

Customizing Plots

Customizing

import matplotlib.pyplot as plt


import numpy as np

plt.figure(figsize=(7,4))
plt.style.use('ggplot')
x = np.array(range(10))
y = np.random.randint(10, 20, 10)
plt.rcParams['lines.linewidth'] = 2
plt.rcParams['lines.linestyle'] = '-'
plt.rcParams['lines.marker'] = '^'
plt.plot(x,y,color='green')
6. Data Visualization with Matplotlib and Seaborn

Seaborn
• It is also a data visualization library, like matplotlib
• It is built on top of matplotlib
• It makes difficult visualization tasks easier
• It has built-in themes
• Works well with numpy and pandas
• Comes with some datasets for exploring visualization

Installing seaborn

pip install seaborn

193
6. Data Visualization with Matplotlib and Seaborn

Seaborn
• Seaborn comes with built-in datasets for exploring visualizations
• get_dataset_names() function returns all built-in dataset names
• load_dataset() function returns a dataframe of the specified dataset (it is downloaded on first use, so an internet connection is needed)

Loading dataset
import seaborn as sns
from matplotlib import pyplot as plt

#sns.get_dataset_names()

tips_df = sns.load_dataset('tips')
tips_df.info()
194
6. Data Visualization with Matplotlib and Seaborn

countplot()
• It is used to display count for each category in a variable
• Parameters:
• x
• Variable
• data
• Dataframe
• color
• Color name
• hue
• Separates bars based on one more categorical variable
195
6. Data Visualization with Matplotlib and Seaborn

countplot()

Countplot() - I
import seaborn as sns
from matplotlib import pyplot as plt

tips_df = sns.load_dataset('tips')
sns.countplot(x='sex', data=tips_df)

196
6. Data Visualization with Matplotlib and Seaborn

countplot()

Countplot() - II
import seaborn as sns
from matplotlib import pyplot as plt

tips_df = sns.load_dataset('tips')
sns.countplot(data=tips_df, x='sex', hue='day')

197
6. Data Visualization with Matplotlib and Seaborn

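displot()
• It is used to display the distribution of the data (it replaces distplot(), which is deprecated since seaborn 0.11)
• Parameters:
• data
• 1D array or dataframe
• bins
• Number of bins for the histogram
• kde
• Whether to overlay a kernel density estimate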

198
6. Data Visualization with Matplotlib and Seaborn

displot()

displot() - I
import seaborn as sns
from matplotlib import pyplot as plt

tips_df = sns.load_dataset('tips')
sns.displot(tips_df['total_bill'], kde=True, bins=10)

199
6. Data Visualization with Matplotlib and Seaborn

pairplot()
• It is used to display pairwise (scatter) relationships among all numeric variables in a dataframe
• Parameters:
• data
• Dataframe
• hue
• Colors for categorical variable
• kind
• Scatter or reg
• diag_kind
• Kind of plot for the diagonal subplots
6. Data Visualization with Matplotlib and Seaborn

pairplot()

pairplot() - I
import seaborn as sns
from matplotlib import pyplot as plt

tips_df = sns.load_dataset('tips')
sns.pairplot(tips_df)

201
6. Data Visualization with Matplotlib and Seaborn

stripplot()
• It is used to display a scatter plot of a numeric variable for each category of a categorical variable
• Parameters:
• data
• Dataframe
• x
• Categorical variable
• y
• Numeric variable

202
6. Data Visualization with Matplotlib and Seaborn

stripplot()

stripplot() - I
import seaborn as sns
from matplotlib import pyplot as plt

iris_df = sns.load_dataset('iris')

sns.stripplot(x = "species", y = "petal_width", data = iris_df, jitter=False)

203
6. Data Visualization with Matplotlib and Seaborn

boxplot()
• It is used to display box plot
• Parameters:
• data
• Dataframe
• x
• Categorical variable
• y
• Numeric value

204
6. Data Visualization with Matplotlib and Seaborn

boxplot()

boxplot() - I
import seaborn as sns
from matplotlib import pyplot as plt

iris_df = sns.load_dataset('iris')
sns.boxplot(x = "species", y = "petal_width", data
= iris_df)

205
6. Data Visualization with Matplotlib and Seaborn

barplot()
• It is used to display the relationship between a categorical variable and a continuous variable
• Parameters:
• data
• Dataframe
• x
• Categorical variable
• y
• Numeric variable

206
6. Data Visualization with Matplotlib and Seaborn

barplot()

barplot() - I
import seaborn as sns
from matplotlib import pyplot as plt

titanic_df = sns.load_dataset('titanic')
sns.barplot(x = "sex", y = "survived", data =
titanic_df, hue='class')

207
6. Data Visualization with Matplotlib and Seaborn

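catplot() (formerly factorplot())
• It is used to display plots for a categorical variable; factorplot() was renamed to catplot() in seaborn 0.9 and is not available in current versions
• Parameters:
• data
• Dataframe
• x
• Categorical variable
• y
• Numeric variable
• kind
• Type of plot; 'point' reproduces the old factorplot() default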

208
6. Data Visualization with Matplotlib and Seaborn

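catplot()

catplot() - I
import seaborn as sns
from matplotlib import pyplot as plt

titanic_df = sns.load_dataset('titanic')
#catplot() replaces factorplot(); kind='point' gives the plot that factorplot() drew by default
sns.catplot(x = "sex", y = "survived", data = titanic_df, kind='point')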

209
6. Data Visualization with Matplotlib and Seaborn

catplot()

catplot() - II
import seaborn as sns
from matplotlib import pyplot as plt

titanic_df = sns.load_dataset('titanic')
#col='class' draws one panel per passenger class
sns.catplot(x = "sex", y = "survived", data = titanic_df, col='class', kind='point')

210
6. Data Visualization with Matplotlib and Seaborn

lmplot()
• It is used to display a regression plot (scatter points with a fitted line)
• Parameters:
• data
• Dataframe
• x
• Numeric variable (predictor)
• y
• Numeric variable (response)

211
6. Data Visualization with Matplotlib and Seaborn

lmplot()

lmplot() - I
import seaborn as sns
from matplotlib import pyplot as plt

iris_df = sns.load_dataset('iris')
sns.lmplot(x = "petal_length", y = "petal_width",
data = iris_df)

212
6. Data Visualization with Matplotlib and Seaborn

FacetGrid()
• It is used to display a grid of plots, one for each category of a variable
• Parameters:
• data
• Dataframe
• col
• Column whose categories define the plots
• col_wrap
• Maximum number of plots per row before wrapping to a new row

213
6. Data Visualization with Matplotlib and Seaborn

FacetGrid()

FacetGrid() - I
import seaborn as sns
from matplotlib import pyplot as plt

iris_df = sns.load_dataset('iris')
grid = sns.FacetGrid(col='species', data=iris_df,
col_wrap=2)
grid.map(plt.scatter, 'sepal_length', 'petal_length')

214
6. Data Visualization with Matplotlib and Seaborn

PairGrid()

PairGrid() - I
import seaborn as sns
from matplotlib import pyplot as plt

iris_df = sns.load_dataset('iris')
grid = sns.PairGrid(iris_df)
grid.map(plt.scatter)
grid.map_diag(plt.hist)

215
Chapter 7

Web Scraping Using Beautifulsoup

216
Web Scraping Using Beautifulsoup

Introduction
• It is a technique to extract a large amount of data from a website
• Scraping means obtaining data from another resource and saving it into the local environment
• It is sometimes referred to as web data mining or web harvesting
• Web scraping components:
• Extractor
• Data Transformation and Cleaning Module
• Storage Module
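The three components can be sketched in a few lines. To keep the example runnable offline, the HTML is a made-up string (tag names, class names and prices are assumptions for illustration); in a real scraper it would come from requests.get(url).content.

```python
import csv
import io
from bs4 import BeautifulSoup

# Extractor: obtain the raw HTML (a fixed string here instead of a web request)
html = """
<ul>
  <li class="item"><span class="name">Phone A</span><span class="price"> 9,999 </span></li>
  <li class="item"><span class="name">Phone B</span><span class="price"> 14,499 </span></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# Transformation and cleaning: strip whitespace, convert prices to integers
rows = []
for item in soup.find_all('li', class_='item'):
    name = item.find('span', class_='name').text.strip()
    price = int(item.find('span', class_='price').text.strip().replace(',', ''))
    rows.append({'name': name, 'price': price})

# Storage: write the cleaned rows as CSV (to an in-memory buffer here)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['name', 'price'])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```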

217
Web Scraping Using Beautifulsoup

Introduction
• Web scraping modules:
• requests
• bs4 (Beautiful Soup)
• html.parser (HTML Parser)

218
Web Scraping Using Beautifulsoup

HTML Page
• Most of the data in web pages is in HTML format, as follows:

<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>

<p>My first paragraph.</p>


</body>
</html>
219
Web Scraping Using Beautifulsoup

HTML DOM
• HTML content is loaded into memory as a tree of objects called the DOM
• DOM stands for Document Object Model
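Beautiful Soup builds a similar tree from the HTML text, so parent and child relationships can be followed in code. The following sketch parses the small page shown on the HTML Page slide; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>"""

soup = BeautifulSoup(html, 'html.parser')

heading = soup.find('h1')
print(heading.text)            # text inside the <h1> tag
print(heading.parent.name)     # the tag that contains <h1>
print(soup.body.parent.name)   # the tag that contains <body>

# direct child tags of <body>, in document order
child_tags = [child.name for child in soup.body.find_all(recursive=False)]
print(child_tags)
```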

220
Web Scraping Using Beautifulsoup

Beautiful Soup
• Methods:
• find_all('tag_name', class_='class_name')
• Returns all tags with the specified class name
• find('tag_name', class_='class_name')
• Returns the first tag with the specified class name
• find_parent()
• Returns the parent tag
• findChild()
• Returns the first matching child tag (older name for find())

221
Web Scraping Using Beautifulsoup

Beautiful Soup
• Properties
• text
• Returns the text of the tag together with the text of its child tags
• attrs
• Returns the attributes of the tag as a dictionary
• contents
• Returns a list of the tag's direct children (tags and text)
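A minimal sketch of these properties on a one-line, made-up HTML fragment (the id and class values are assumptions for illustration):

```python
from bs4 import BeautifulSoup

html = '<div id="intro" class="box"><p>Hello <b>world</b></p></div>'
soup = BeautifulSoup(html, 'html.parser')

div = soup.find('div', class_='box')
print(div.text)       # Hello world  (text of the tag and all its descendants)
print(div.attrs)      # {'id': 'intro', 'class': ['box']}
print(div.contents)   # [<p>Hello <b>world</b></p>]  (list of direct children)

bold = soup.find('b')
print(bold.find_parent('div')['id'])   # intro
```

The class attribute comes back as a list because an HTML element can carry several classes.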

222
Web Scraping Using Beautifulsoup

Installing Modules

!pip install requests

!pip install bs4

223
Web Scraping Using Beautifulsoup

Importing Modules

import requests

from bs4 import BeautifulSoup as bs

224
Web Scraping Using Beautifulsoup

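Scraping Data

python_page = requests.get('https://en.wikipedia.org/wiki/Python')

soup = bs(python_page.content, 'html.parser')

#data_toc must hold the table-of-contents tags found in soup (for example the result of a soup.find_all(...) call); its definition is not shown on this slide
for toc in data_toc:
    print(toc.contents[-1].text)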

225
Web Scraping Using Beautifulsoup

Scraping From flipkart

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

#empty list to collect one dictionary per product
products=[]

226
Web Scraping Using Beautifulsoup

Scraping From flipkart

#getting each mobile details from page 1 to 10
for i in range(10):
    page_url = ('https://www.flipkart.com/search?q=mobiles&as=on&as-show=on'
                '&otracker=AS_Query_TrendingAutoSuggest_1_0_na_na_na'
                '&otracker1=AS_Query_TrendingAutoSuggest_1_0_na_na_na'
                '&as-pos=1&as-type=HISTORY&suggestionId=mobiles'
                '&requestId=a83b1026-4c50-46af-b37f-169ed3e41c8f&page=' + str(i+1))

    page = requests.get(page_url)
    soup = bs(page.content, 'html.parser')

227
Web Scraping Using Beautifulsoup

Scraping From flipkart

    #this loop runs inside the page loop (for i in range(10)) from the previous slide
    for data in soup.find_all('div', class_='_2kHMtA'):
        name = data.find('div', attrs={'class':'_4rR01T'})
        price = data.find('div', attrs={'class':"_30jeq3 _1_WHN1"})
        rating = data.find('div', attrs={'class':'_3LWZlK'})
        specification = data.find('div', attrs={'class':'fMghEO'})

        for each in specification:
            col = each.find_all('li', attrs={'class':'rgWa7D'})
            ram_rom = col[0].text
            display = col[1].text
            camera = col[2].text

228
Web Scraping Using Beautifulsoup

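Scraping From flipkart

        #still inside the product loop: store one dictionary per product
        products.append({"Name":name.text, "Price":price.text, "RAM_ROM":ram_rom, "Display":display,
                         "Camera":camera, "Rating":rating.text})

#creating data frame after both loops have finished (needs import pandas as pd)
mobile_ds = pd.DataFrame(products)

mobile_ds.head()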

229
