2. Statistics and Data Analysis
Statistics & Data Analysis
Chapters
1. Probability
2. Descriptive Statistics
3. Inferential Statistics
4. NumPy for Mathematical Computing
5. Data Manipulation using Pandas
6. Data Visualization with Matplotlib and Seaborn
7. Web Scraping Using BeautifulSoup
Chapter 1
Probability
1. Probability
What is probability?
• Probability measures how likely an event is to occur
• For example:
• What is the chance of getting a head (H) when a coin is tossed?
• What is the chance of getting a 1 when a die is rolled?
• What is the chance of getting two heads (HH) when two coins are tossed?
Probability Formula
• Probability can also be defined as the ratio of the number of favorable outcomes to the total number of possible outcomes of an experiment
• P(event) = number of favorable outcomes / total number of outcomes
• A probability always lies between 0 and 1; in practice most events have a value strictly between 0 and 1, not exactly 0 (impossible) or 1 (certain)
Examples
• What is the probability of getting a head when a coin is tossed?
Solution:
Total outcomes = 2 (H, T)
Favorable outcomes = 1 (H)
P(H) = 1/2
= 0.5
Examples
• What is the probability of getting a tail when a coin is tossed?
Solution:
Total outcomes = 2 (H, T)
Favorable outcomes = 1 (T)
P(T) = 1/2
= 0.5
Examples
• What is the probability of getting 2 consecutive heads when a coin is tossed twice?
Solution:
Total outcomes = 4 (HH, HT, TH, TT)
Favorable outcomes = 1 (HH)
P(HH) = 1/4
= 0.25
Examples
• What is the probability of getting at least one head when a coin is tossed twice?
Solution:
Total outcomes = 4 (HH, HT, TH, TT)
Favorable outcomes = 3 (HH, HT, TH)
P(at least one head) = 3/4
= 0.75
Examples
• What is the probability of getting 1 when a die is rolled?
Solution:
Total outcomes = 6 (1, 2, 3, 4, 5, 6)
Favorable outcomes = 1 (1)
P(1) = 1/6
= 0.1667
Examples
• What is the probability of getting a sum of 8 when a die is rolled twice?
Solution:
Total outcomes = 36 [(1, 1), (1, 2), (1, 3), ….. , (6, 4), (6, 5), (6, 6)]
Favorable outcomes = 5 [(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)]
P(sum 8) = 5/36
= 0.1389
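The coin and dice examples above can be checked by listing the sample space and counting favorable outcomes. The following is a minimal sketch using only the Python standard library; the helper name `probability` is an assumption made for illustration and does not come from the slides.

```python
from fractions import Fraction
from itertools import product


def probability(sample_space, is_favorable):
    """Favorable outcomes divided by total outcomes, as an exact fraction."""
    outcomes = list(sample_space)
    favorable = [o for o in outcomes if is_favorable(o)]
    return Fraction(len(favorable), len(outcomes))


two_tosses = list(product("HT", repeat=2))        # HH, HT, TH, TT
two_rolls = list(product(range(1, 7), repeat=2))  # 36 ordered pairs

p_two_heads = probability(two_tosses, lambda o: o == ("H", "H"))
p_at_least_one_head = probability(two_tosses, lambda o: "H" in o)
p_sum_8 = probability(two_rolls, lambda o: sum(o) == 8)

print(p_two_heads, p_at_least_one_head, p_sum_8)  # 1/4 3/4 5/36
```

Using `Fraction` keeps the results exact, so they can be compared directly with the hand-computed ratios.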
Probability Terminology
• Experiment
• An activity whose outcomes are not known is an experiment
• Example: Experiment to find Gravity
• Random Experiment
• A random experiment is one whose set of possible outcomes is known in advance, but whose outcome on any particular run cannot be predicted before the experiment is performed
• Example: tossing a coin
Probability Terminology
• Trial
• Each repetition of an experiment is called a trial
• Example: tossing a coin
• Event
• An event is a clearly defined outcome, or set of outcomes, of a trial
• Example: getting 2 when rolling a die
• Random Event
• An event that cannot be predicted with certainty is a random event
• Example: whether a person survives a severe accident
Probability Terminology
• Random Variable:
• Discrete Random Variable:
• Coin, Dice
Probability Terminology
• Outcome
• The result of a trial
• Example: getting a head when a coin is tossed
• Possible Outcome
• The list of all the outcomes that can occur in an experiment is referred to as the possible outcomes.
• Example: head and tail, when a coin is tossed
Probability Terminology
• Sample Space
• It is the set of all possible outcomes of an experiment
• Example: S = {H, T}, when a coin is tossed
• Probable Event
• An event whose likelihood can be estimated is called a probable event
• Example: an employee getting a promotion
• Impossible Event
• An event that cannot occur because it is not in the sample space; its probability is 0
• Example: getting 7 when a single die is rolled
Probability Terminology
• Complementary Events
• Two events are complementary when exactly one of them must occur, so together they cover all outcomes
• Example: {success, failure}, when a game is played
• P(success) + P(failure) = 1
• Independent Events
• A and B are independent if the occurrence of A does not affect the probability of B, and vice versa
• Example: getting two consecutive heads when a coin is tossed twice
Probability Terminology
• Dependent Events
• A and B are dependent if the occurrence of one affects the probability of the other
• Example: getting a blue ball on the second pick from a bag of 5 red and 4 blue balls, when the first ball is not put back
Probability Rules
• The sum of the probabilities of all outcomes in the sample space of an experiment is 1
• The probability of the opposite (complementary) event = 1 – probability of the event
• Let A and B be complementary events in an experiment, then
• P(A) = 1 – P(B)
• Or
• P(B) = 1 – P(A)
Probability Rules
• The sum of the probabilities of all outcomes in the sample space is 1
• If A and B are the only two outcomes in a sample space, then
• P(A) + P(B) = 1
Probability Rules
• Addition rule
• If A and B are mutually exclusive events, then
• P(A or B) = P(A) + P(B)
• If A and B are not mutually exclusive, then
• P(A or B) = P(A) + P(B) – P(A and B)
• Multiplication Rule
• P(A and B) = P(A) * P(B), where A, B are independent events
• P(A and B) = P(A) * P(B|A), where A, B are dependent events
• P(A and B) = 0, where A, B are mutually exclusive events
Probability Rules
• Conditional Probability
• P(A|B) = P(A and B) / P(B), where P(B) > 0
Examples
• A box contains 3 red, 3 blue and 4 green balls. What is the probability that both the 1st and 2nd picks are red balls, when the first ball is not put back?
Solution:
Total balls = 10 (3 red, 3 blue, 4 green)
P(1st red) = 3/10
P(2nd red | 1st red) = 2/9
P(1st and 2nd red) = (3/10) * (2/9) = 6/90
The probability that the 1st and 2nd picks are both red is 0.0667
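The ball example applies the multiplication rule for dependent events, P(A and B) = P(A) * P(B|A). The sketch below computes it exactly and cross-checks the result by enumerating every ordered pair of picks. It assumes the balls are drawn without replacement; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

# Multiplication rule for dependent events: P(R1 and R2) = P(R1) * P(R2 | R1)
p_first_red = Fraction(3, 10)               # 3 red balls out of 10
p_second_red_given_first = Fraction(2, 9)   # 2 red balls left out of 9
p_both_red = p_first_red * p_second_red_given_first

# Cross-check: list every ordered pair of two distinct balls
balls = ["red"] * 3 + ["blue"] * 3 + ["green"] * 4
pairs = list(permutations(range(len(balls)), 2))
both_red = [p for p in pairs if balls[p[0]] == "red" and balls[p[1]] == "red"]
p_enumerated = Fraction(len(both_red), len(pairs))

print(p_both_red, round(float(p_both_red), 4))  # 1/15 0.0667
```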
Probability Distribution
• Let X be the event of getting at least one head when a coin is tossed twice. The probability distribution is as follows:
• P(X) = 3/4 and P(not X) = 1/4
• Example:
• Suppose a coin is tossed twice and the sample space is recorded as S = {HH, HT, TH, TT}.
• The probability of getting heads needs to be determined.
• Let X be the random variable that counts how many heads are obtained.
• X can take on the values 0, 1, 2. The probability that X will be equal to 1 is 0.5 (outcomes HT and TH).
• Thus, the probability mass function of X evaluated at 1 is 0.5.
PMF Example
• What is the probability of getting two heads when a coin is tossed twice?
• Total outcomes = 4
• Outcomes with two heads = 1 (HH)
• P(X = 2) = 1/4 = 0.25
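The probability mass function of X (the number of heads in two tosses) can be tabulated by counting outcomes. This is a small sketch using the standard library only; the name `pmf` is chosen for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

sample_space = list(product("HT", repeat=2))  # HH, HT, TH, TT

# Count how many outcomes give each number of heads
heads_count = Counter(outcome.count("H") for outcome in sample_space)

# PMF: P(X = x) = (outcomes with x heads) / (total outcomes)
pmf = {x: Fraction(n, len(sample_space)) for x, n in sorted(heads_count.items())}

for x, p in pmf.items():
    print(f"P(X = {x}) = {p}")
```

The probabilities of a PMF add up to 1, which is a quick sanity check on any table built this way.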
• 1. Find the probability that a bus will come within the next 12 minutes
• Solution:
• Height = 1/20 = 0.05
• Base = 12
• P(0<=X<=12) = (1/20) * 12 = 0.05 * 12 = 0.6
• 2. Find the probability that a bus will come after 12 minutes and before 15 minutes
• Solution:
• Height = 1/20 = 0.05
• Base = 15-12 = 3
• P(12<X<15) = (1/20) * 3 = 0.05 * 3 = 0.15
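The bus example treats the waiting time as uniformly distributed, so a probability is the area of a rectangle (base * height). The sketch below assumes the waiting time is uniform on 0 to 20 minutes, which is what the height of 1/20 implies; the function name `uniform_probability` is hypothetical.

```python
import math


def uniform_probability(a, b, low=0.0, high=20.0):
    """P(a < X < b) for X uniform on [low, high], computed as base * height."""
    a, b = max(a, low), min(b, high)   # clip the interval to the support
    if b <= a:
        return 0.0
    height = 1.0 / (high - low)        # constant density
    return (b - a) * height


p_within_12 = uniform_probability(0, 12)
p_between_12_and_15 = uniform_probability(12, 15)

print(round(p_within_12, 2), round(p_between_12_and_15, 2))  # 0.6 0.15
```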
• $P(a < X \le b) = \int_a^b f(x)\,dx$, where f(x) is the probability density function
Bayes’ Theorem
• It is a mathematical formula that describes the probability of an event based on prior knowledge
or experience.
• The theorem is named after Thomas Bayes. It is also known as the formula for the probability of
“causes”
• It allows us to update our prior beliefs about the likelihood of an event based on new evidence.
• It is used in many fields:
• Statistics
• Data science
• Machine learning
• Artificial intelligence
Bayes’ Theorem
• Formula as follows:
• P(A|B) = (P(B|A) * P(A)) / P(B)
• Where:
• P(A|B) is the probability of event A given that event B has occurred
• P(B|A) is the probability of event B given that event A has occurred
• P(A) is the prior probability of event A
• P(B) is the overall (marginal) probability of event B, with P(B) > 0
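Bayes' theorem can be illustrated with a small numeric case. All numbers below are hypothetical and chosen only for illustration (a condition with 1% prevalence, a test that detects it 95% of the time and gives a false positive 5% of the time); the function name `bayes` is also an assumption.

```python
import math


def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Hypothetical inputs (not from the slides)
p_a = 0.01              # prior: P(A), person has the condition
p_b_given_a = 0.95      # P(B|A): test is positive given the condition
p_b_given_not_a = 0.05  # P(B|not A): false positive rate

# Total probability of a positive test
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = bayes(p_b_given_a, p_a, p_b)
print(round(p_a_given_b, 3))  # 0.161
```

Even with a fairly accurate test, the updated probability stays low because the prior P(A) is small, which is the kind of belief update the theorem describes.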
Chapter 2
Descriptive Statistics
2. Descriptive Statistics
Statistics
• Statistics is the branch of mathematics that deals with the collection, analysis, interpretation, and presentation of numerical data, as well as the principles governing random events.
• Types of Statistics:
• Descriptive Statistics
• Used to summarize the data
• Inferential Statistics
• Used to analyze and make inferences from the data
Statistical Terms
• Population
• The entire data set is called the population
• Example: the population of India
• Sample
• A subset of the population is called a sample
• Example: the population of Bangalore
• Parameter
• A value calculated on the population
• Statistic
• A value calculated on a sample
Data Types
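• Based on Measurement
• Qualitative
• Nominal
• Categories with no order, e.g. apple, banana
• Ordinal
• Categories with an order, e.g. ratings
• Quantitative
• Interval
• Numeric values with equal spacing but no true zero, e.g. temperature in °C
• Ratio
• Numeric values with a true zero, e.g. height, weight, age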
Basics
• Descriptive statistics includes
• Measure of Central Tendency
• Mean
• Median
• Mode
• Measure of Dispersion
• Range
• Variance
• Standard Deviation
Data Types
• Based on Structure
• Structured
• Data in the form of rows and columns
• Semi Structured
• Data in the form of XML or JSON
• Unstructured
• Data in the form of audio, video
Mean
• The arithmetic mean of a data set is the sum of all the observations divided by the total number of observations
• Using notation:
• Mean ($\mu$ or $\bar{x}$) $= \frac{\sum_{i=1}^{n} x_i}{n}$
Mean Example
Age: 20, 19, 20, 18, 21, 22
Total = 120
N = 6
Mean = Total / N
Mean = 120/6
= 20
Median
• The median is the middle value of a set of observations after arranging the data in ascending or descending order
• Median $= \left(\frac{n+1}{2}\right)^{th}$ item, where n is odd
• Median $= \frac{\left(\frac{n}{2}\right)^{th} \text{item} + \left(\frac{n+2}{2}\right)^{th} \text{item}}{2}$, where n is even
Median Example
Age: 20, 19, 20, 18, 21, 22, 60
Age in sorted order: 18, 19, 20, 20, 21, 22, 60
Median = 20
Mode
• The most frequently occurring item in the data set is called the mode of the given data points.
• The mode is a kind of average around which the other observations cluster most densely.
Mode Example
Sys-Cores: 1, 2, 3, 1, 2, 2, 4
Mode = 2
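The three measures of central tendency can be reproduced with Python's built-in `statistics` module. The sketch below reuses the age and system-core values from the examples above; only the variable names are assumptions.

```python
import statistics

ages = [20, 19, 20, 18, 21, 22]
ages_with_outlier = [20, 19, 20, 18, 21, 22, 60]
sys_cores = [1, 2, 3, 1, 2, 2, 4]

mean_age = statistics.mean(ages)                    # 120 / 6
median_age = statistics.median(ages_with_outlier)   # middle of the sorted values
mode_cores = statistics.mode(sys_cores)             # most frequent value

print(mean_age, median_age, mode_cores)  # 20 20 2
```

Note that the outlier 60 does not move the median, while it would pull the mean of the same seven values up to about 25.7.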
Measure of Dispersion/Spread
• Describes how the data are spread around the central value
• Measures of Dispersion:
• Range
• Variance
• Standard Deviation
Range
• Range is the simplest measure of dispersion; it gives an overall idea of how spread out the data are. The larger the range, the more spread out the data set, and vice versa.
Range Example
Age: 20, 19, 20, 18, 21, 22
Max = 22
Min = 18
Range = Max – Min
Range = 22 – 18
Range = 4
Variance
• Variance can be simply defined as the square of the standard deviation.
• Variance is often used to compare the spread of two different data sets.
• Variance describes how far each data point deviates from the mean.
• Variance ($\sigma^2$) $= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
Variance Example
Age: 20, 19, 20, 18, 21, 22
Total = 120, Mean = 20
Sum of squared deviations from the mean = 10
Variance = 10 / (6 – 1) = 2
Standard Deviation
• Standard deviation is the square root of the sum of squared deviations of the observations from their arithmetic mean, divided by the degrees of freedom (n – 1).
• SD tells how far, on average, the observations deviate from their mean.
• Standard deviation ($\sigma$) $= \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
Standard Deviation
• Benefits:
• Standard deviation is very useful for comparing two or more data sets to see how far the data deviate from the central value (mean).
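The range, variance, and standard deviation of the age data can be checked with the standard library. `statistics.variance` and `statistics.stdev` divide by n – 1, which matches the formulas in this chapter; the variable names are illustrative.

```python
import math
import statistics

ages = [20, 19, 20, 18, 21, 22]

data_range = max(ages) - min(ages)        # 22 - 18
sample_variance = statistics.variance(ages)  # sum of squared deviations / (n - 1)
sample_sd = statistics.stdev(ages)           # square root of the sample variance

print(data_range, sample_variance, round(sample_sd, 4))  # 4 2 1.4142
```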
5 Number Summary
• When conducting descriptive analyses or conducting an initial analysis of a sizable data set, a
five-number summary is particularly helpful.
• The maximum and minimum values in the data set, the lower and upper quartiles, and the median
make up a summary's five values.
• Together, these values are shown in the following order:
• minimum value
• lower quartile (Q1)
• median value (Q2)
• upper quartile (Q3)
• maximum value
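A five-number summary can be computed with the standard library. The sketch below reuses the age data from the median example and uses `statistics.quantiles` with `method='inclusive'`; other quartile conventions exist and can give slightly different Q1 and Q3 values, so treat the choice of method as an assumption.

```python
import statistics

ages = [20, 19, 20, 18, 21, 22, 60]

# Quartiles Q1, Q2 (median), Q3 using linear interpolation over the data range
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")

five_number_summary = {
    "min": min(ages),
    "Q1": q1,
    "median": q2,
    "Q3": q3,
    "max": max(ages),
}
print(five_number_summary)
```

The large gap between Q3 and the maximum in this data hints at the outlier (60), which is one reason the summary is useful for a first look at a data set.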
Distribution
• Normal distribution
• Poisson distribution
Normal Distribution
• The normal distribution is a probability distribution that is symmetric about the mean
• It is sometimes referred to as the Gaussian distribution.
• It shows that data close to the mean occur more frequently than data far from the mean.
Empirical Rule
• It is also known as the three-sigma rule or 68-95-99.7 rule. It holds that, for a normal distribution, about 68% of the data lie within one standard deviation of the mean, about 95% within two, and about 99.7% (almost all) within three
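The 68-95-99.7 figures can be verified numerically from the normal cumulative distribution function. A minimal sketch with `statistics.NormalDist` from the standard library (a standard normal with mean 0 and standard deviation 1 is assumed, since the percentages are the same for any normal distribution):

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations of the mean
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, p in within.items():
    print(f"within {k} SD: {p:.4f}")
```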
Covariance
• Covariance measures how two variables change together, i.e., whether an increase in one variable is accompanied by an increase or a decrease in the other.
• Its value changes under a linear rescaling of the variables (for example, a change of units), so its magnitude is hard to compare across data sets.
• The units of covariance are the product of the units of the two variables, and its value can lie anywhere between -∞ and +∞
• Types:
• Positive Covariance (the variables tend to move in the same direction)
• Negative Covariance (the variables tend to move in opposite directions)
Covariance
• Formula as follows:
• $Cov(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$
Covariance Example
• Calculate Covariance for the following data set
Correlation Coefficient
• The correlation coefficient measures how strongly two variables are linearly related.
• It is a dimensionless, normalized measure of covariance.
• To put it another way, the correlation coefficient has no units and always lies in a fixed range.
• Its value is between -1 and 1
• Correlation coefficient as follows:
• $r = \frac{Cov(X, Y)}{\sigma_X \, \sigma_Y}$
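Covariance and the correlation coefficient can be computed directly from their definitions. The data below are hypothetical (the slide's own data set is not available), and the function names are assumptions; the sample (n – 1) versions are used to stay consistent with the variance formula earlier in this chapter.

```python
import math


def covariance(xs, ys):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)


# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = covariance(x, y)
r_xy = correlation(x, y)
print(cov_xy, round(r_xy, 4))  # 1.5 0.7746
```

The covariance carries the units of x times the units of y, while the correlation coefficient is unit-free and stays within -1 and 1.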
Chapter 3
Inferential Statistics
3. Inferential Statistics
Introduction
• It uses various analytical tools to draw inferences about the population from sample data.
• It helps us come to conclusions and make predictions based on the data available
• We use inferential statistics to understand a population parameter by using a test statistic
• It has two main uses:
• Making estimates about the population
• Testing hypotheses to draw conclusions
Hypothesis Testing
• A hypothesis is an assumption about a population parameter that is evaluated using a sample statistic
• Testing the assumption is called hypothesis testing
• There are two kinds of hypothesis:
• Null Hypothesis
• H0 is used to represent the null hypothesis
• Alternate Hypothesis
• H1 or Ha is used to represent the alternate hypothesis
Hypothesis Example
• XYZ college or institute believes its students score 90% on average in the final exam
• Null Hypothesis
• H0: average = 90
• Alternate Hypothesis
• H1: average ≠ 90
• Type-I Error:
• Rejecting the null hypothesis when it is true
• Type-II Error:
• Failing to reject (accepting) the null hypothesis when it is false
Hypothesis Tests
• T-test
• Also called Student's t-test
• It is used when the sample size is small (<30)
• Z-test
• It is used when the sample size is large (>=30)
• ANOVA
• It is used to compare the means of several groups
T-test
• When to conduct T-test
• Sample size is small
• Data follows normal distribution
• Population standard deviation is unknown
• Formula for one sample t-test as follows:
• $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, where s is the sample standard deviation
T-test
• Formula for two sample t-test as follows:
T-test Example
• XYZ college wants to improve its students' performance. Previous data show that the average performance of 28 students was 80%. After some training (extra study hours), the current data show an average performance of 88%, with a standard deviation of 20%. Did the extra study hours improve the performance?
• Null Hypothesis:
• H0: mean = 80 (extra study hours made no difference)
• Alternate hypothesis
• H1: mean > 80 (performance improved)
• T-statistic = (88-80)/(20/sqrt(28))
• T-statistic = (8)/(3.78)
• T-statistic = 2.12
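The t-statistic from the example can be reproduced with a few lines of Python. This sketch only computes the statistic; the function name `one_sample_t` is an assumption, and deciding significance still requires a t-table or a statistics library for the critical value at n – 1 = 27 degrees of freedom.

```python
import math


def one_sample_t(sample_mean, hypothesized_mean, sample_sd, n):
    """t = (sample mean - hypothesized mean) / (s / sqrt(n))."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error


t_stat = one_sample_t(sample_mean=88, hypothesized_mean=80, sample_sd=20, n=28)
print(round(t_stat, 2))  # 2.12
```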
Z-test
• When to conduct Z-test
• Sample size is large (>=30)
• Data follows normal distribution
• Population standard deviation is known
• Formula for one sample z-test as follows:
• $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$, where $\sigma$ is the population standard deviation
Z-test
• Formula for two sample z-test as follows:
Z-test Example
• A school teacher claims that the students in his/her school have above-average intelligence. A random sample of 40 students' IQ scores has a mean of 120. The population mean IQ is 110, with a standard deviation of 18. Is there sufficient evidence to support the teacher's claim?
Z-test Example
• Sample size = 40
• Population mean = 110
• Sample mean = 120
• Population standard deviation = 18
• Null Hypothesis:
• H0: mean = 110
• Alternate hypothesis
• H1: mean > 110
Z-test Example
• Z-statistic:
• Z=(120-110)/(18/sqrt(40))
• Z=3.514
Z-test Example
• P-value (one-tailed) for Z = 3.514:
• P-value ≈ 0.00022
Z-test Example-2
• A school teacher claims that the students in his/her school have above-average intelligence. A random sample of 30 students' IQ scores has a mean of 120. The population mean IQ is 116, with a standard deviation of 15. Is there sufficient evidence to support the teacher's claim?
Z-test Example-2
• Sample size = 30
• Population mean = 116
• Sample mean = 120
• Population standard deviation = 15
• Null Hypothesis:
• H0: mean = 116
• Alternate hypothesis
• H1: mean>116
Z-test Example-2
• Z-statistic:
• Z=(120-116)/(15/sqrt(30))
• Z=1.4606
Z-test Example-2
• P-value (one-tailed) for Z = 1.4606:
• P-value ≈ 0.0721
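Both z-test examples can be checked with the standard library, using `statistics.NormalDist` for the right-tail p-value. The function name `one_sample_z` is an assumption, and a one-tailed (greater-than) alternative is assumed because the claim is "above average".

```python
import math
from statistics import NormalDist


def one_sample_z(sample_mean, population_mean, population_sd, n):
    """Return (z, right-tail p-value) for a one-sample z-test."""
    z = (sample_mean - population_mean) / (population_sd / math.sqrt(n))
    p_value = 1.0 - NormalDist().cdf(z)   # P(Z > z) under H0
    return z, p_value


z1, p1 = one_sample_z(120, 110, 18, 40)   # Example 1
z2, p2 = one_sample_z(120, 116, 15, 30)   # Example 2

print(f"Example 1: z = {z1:.3f}, p = {p1:.5f}")
print(f"Example 2: z = {z2:.4f}, p = {p2:.4f}")
```

At a 5% significance level (an assumed threshold, not stated in the examples), the first p-value would lead to rejecting H0 while the second would not.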
ANOVA Test
• ANOVA stands for Analysis of Variance; it tests whether group means differ by comparing the variance between groups with the variance within groups
• One-way ANOVA
• H0: all means are equal
• H1: at least one mean is different
One-way ANOVA
Marks
Morning Study Hours: 56, 60, 70, 80, 90
Noon Study Hours: 60, 67, 69, 78, 92
Evening Study Hours: 85, 88, 89, 90, 90
• MSB = 442.4 (between groups), MSW = 117.73 (within groups)
• F = MSB/MSW => 442.4/117.73 => 3.76
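The F-statistic for the study-hours data can be computed step by step from the group means. The sketch below is a plain-Python one-way ANOVA on the marks listed above; the function name `one_way_anova` is an assumption.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (MSB, MSW, F) for a list of groups of observations."""
    values = [v for g in groups for v in g]
    grand_mean = mean(values)

    # Sum of squares between groups and within groups
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ssw = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_between = len(groups) - 1
    df_within = len(values) - len(groups)

    msb = ssb / df_between
    msw = ssw / df_within
    return msb, msw, msb / msw


morning = [56, 60, 70, 80, 90]
noon = [60, 67, 69, 78, 92]
evening = [85, 88, 89, 90, 90]

msb, msw, f_stat = one_way_anova([morning, noon, evening])
print(round(msb, 1), round(msw, 2), round(f_stat, 2))  # 442.4 117.73 3.76
```

The resulting F value is then compared with the critical F value for (2, 12) degrees of freedom at the chosen significance level.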
4. NumPy for Mathematical Computing
Introduction
• NumPy stands for Numerical Python
• Mainly developed for numerical operations on vectors and matrices
• It focuses on linear algebra and matrices
• Vectorized operations on NumPy arrays are typically much faster than equivalent loops over Python lists (often quoted as more than 30 times, depending on the operation)
array() function
• Arrays in NumPy are called ndarrays
• We can use array() from the numpy package to create an array
array() function
• array() syntax:
• numpy.array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
• Where
• object -> data
• dtype -> data type of elements
• copy -> whether to copy the data into a new array
• order -> memory layout: 'K', 'A', 'C' (row-major) or 'F' (column-major)
• subok -> whether sub-classes are passed through
• ndmin -> minimum number of dimensions of the result
array() function
• The following script creates a 1D array
• a1 = np.array([1,2,3,4])
• Note: if the data starts with [ then it is 1D, [[ then it is 2D, [[[ then it is 3D
• 1D: a collection of scalar values
• 2D: a collection of 1D arrays
• 3D: a collection of 2D arrays
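The bracket rule above can be seen by creating one array of each kind and inspecting `ndim` and `shape`. A short sketch, assuming NumPy is installed and imported as `np`; the array contents are illustrative.

```python
import numpy as np

a1 = np.array([1, 2, 3, 4])                              # starts with [   -> 1D
a2 = np.array([[1, 2], [3, 4]])                          # starts with [[  -> 2D
a3 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])      # starts with [[[ -> 3D

for name, arr in (("a1", a1), ("a2", a2), ("a3", a3)):
    print(name, "ndim =", arr.ndim, "shape =", arr.shape)
```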
Array Attributes
• shape -> returns the shape of the array (the size along each dimension)
• ndim -> returns the number of dimensions
• size -> returns the number of elements in the array
• itemsize -> returns the memory occupied by each element, in bytes
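import numpy as np
a1 = np.array([[1,2], [3,4]])
print(a1.shape)     # (2, 2): 2 rows, 2 columns
print(a1.dtype)     # element data type (platform dependent, e.g. int64)
print(a1.size)      # 4 elements in total
print(a1.itemsize)  # bytes used by each element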
• a3 = np.zeros((2,2), dtype='i1')
• a4 = np.ones((2,2), dtype='i1')
• a5 = np.arange(1, 11)
Reshaping Array
• Converting an array from one shape to another
• We use the reshape() function for this
• Example:
a1 = np.arange(1, 21)
a2 = np.reshape(a1, (5, 4))
• Example:
print(a)
Important Methods
• np.sort() -> returns sorted array
• np.max() -> returns maximum from the array
• np.mean() -> returns the mean of elements in the array
• np.argmax() -> returns the position of the maximum value in the array
• np.argmin() -> returns the position of the minimum value in the array
• np.unique() -> returns array with unique values in the array
Important Methods
• np.std() -> returns standard deviation of elements in the array
• np.var() -> returns the variance
• np.cov()-> returns covariance
• np.corrcoef()->returns the correlation coefficient
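import numpy as np
a = np.array([3, 1, 2, 3, 5, 1])  # illustrative sample data
print(a)
print(np.sort(a))    # sorted copy of the array
print(np.mean(a))    # arithmetic mean
print(np.unique(a))  # distinct values, sorted
print(np.argmax(a))  # position of the maximum value
print(np.argmin(a))  # position of the first minimum value
print(np.std(a))     # standard deviation (divides by n by default)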
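One detail worth checking when using `np.var()` and `np.std()`: by default they divide by n, whereas the formulas in Chapter 2 divide by n – 1. Passing `ddof=1` gives the sample versions. The sketch below uses the age data from Chapter 2 and a small hypothetical pair of variables for `np.cov()` and `np.corrcoef()`.

```python
import numpy as np

ages = np.array([20, 19, 20, 18, 21, 22])

population_var = np.var(ages)       # divides by n     -> 10 / 6
sample_var = np.var(ages, ddof=1)   # divides by n - 1 -> 10 / 5
sample_sd = np.std(ages, ddof=1)    # square root of the sample variance

# Hypothetical paired data, for illustration only
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

cov_xy = np.cov(x, y)[0, 1]         # np.cov divides by n - 1 by default
r_xy = np.corrcoef(x, y)[0, 1]      # correlation coefficient

print(population_var, sample_var, sample_sd, cov_xy, r_xy)
```

Both `np.cov` and `np.corrcoef` return a 2x2 matrix for two inputs; the off-diagonal entry `[0, 1]` is the value relating x and y.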
Chapter 5
5. Data Manipulation using Pandas
Pandas Introduction
• Pandas stands for Panel Data
• Mainly created for high-performance data manipulation
• Introduced in 2008 by Wes McKinney
• It is built on top of NumPy
• We can perform data analysis tasks using Pandas effectively and with ease
• We need to install the pandas package from the repository to work with it
• pip install pandas
Pandas Types
• Pandas has the following data types:
• Series
• It is a 1D array with an index
• Dataframe
• It is a 2D array for tabular data
• Panel
• It is a 3D array, a collection of dataframes (Panel has been removed from recent pandas versions)
Note: we mainly work with Dataframes for data analysis and building machine learning models
Dataframe
• A Dataframe is a 2D data set that contains row names/indexes and column names
• Each column can hold a different type of data
Dataframe
Creating Dataframe
import pandas as pd
emp_df = pd.DataFrame(
{
'eno':[100,200,300,400,500,600],
'ename':['ab','cd','xy','mn','df','er'],
'salary':[12000, 14500, 50000, 45000, 20000, 25000],
'did':[1,1,2,2,3,3]
})
Dataframe
Creating Dataframe
dept_df = pd.DataFrame(
{
'did':[1,2,3,4],
'dname':['Accounting', 'Sales', 'Marketing', 'IT'], #department names
'location':[1,1,2,2] #one value per department; all columns must have the same length (values are illustrative)
})
#displaying dataframe
dept_df
Dataframe Attributes
• The following are important attributes of a Dataframe
• shape
• Returns the shape of the dataframe as (rows, columns)
• size
• Returns the number of elements in the dataframe
• dtypes
• Returns the data type of each column
• columns & index
• Return the column names and the row labels
• values
• Returns the entire data in the form of an array
Column Selection
emp_df[['did', 'ename']] #selects the did and ename columns, in the order listed
emp_df.loc[[0,4], :] #selecting 1st and 5th rows with all column values
emp_df.loc[[0,4], ['eno']] #selecting 1st and 5th rows with eno column values
emp_df.loc[[0,4], ['eno', 'did']] #selecting 1st and 5th rows with eno and did column values
emp_df.loc[1:4,'eno':'did'] #selecting rows from 2nd to 5th (label slices include the end label) and columns from eno to did
emp_df.iloc[[0,4], :] #selecting 1st and 5th rows with all column values
emp_df.iloc[[0,4],0] #selecting 1st and 5th rows with eno column values
emp_df.iloc[[0,4], [0, 2]] #selecting 1st and 5th rows with eno and salary column values
emp_df.iloc[1:4,0:3] #selecting rows from 2nd to 4th and columns from eno to salary (the end position is excluded)
• tail()
• Returns bottom 5 rows
• tail(10)
• Returns bottom 10 rows
emp_df.describe() #displays a descriptive summary of the dataframe (numeric columns only)
emp_df['did'].value_counts() #displays how many times each value occurs; useful for categorical columns
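emp_df_by_did = emp_df.groupby('did') #group the rows by department id (assumed definition; it is not shown in the source)
emp_df_by_did.agg({'eno':'mean', 'salary':'max'}) #displays the mean of eno and the max of salary for each did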
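The grouping and aggregation step can be run end to end as follows. This sketch rebuilds the `emp_df` dataframe from earlier in the chapter and assumes `emp_df_by_did` is the result of `groupby('did')`, which the slides do not show explicitly.

```python
import pandas as pd

emp_df = pd.DataFrame({
    'eno': [100, 200, 300, 400, 500, 600],
    'ename': ['ab', 'cd', 'xy', 'mn', 'df', 'er'],
    'salary': [12000, 14500, 50000, 45000, 20000, 25000],
    'did': [1, 1, 2, 2, 3, 3],
})

# Group the rows by department id (assumed definition of emp_df_by_did)
emp_df_by_did = emp_df.groupby('did')

# One aggregate per column: mean of eno, max of salary
summary = emp_df_by_did.agg({'eno': 'mean', 'salary': 'max'})
print(summary)
```

The result has one row per `did` value, with `did` as the index.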
Appending dataframes
Merging dataframes
Filtering Rows
emp_df.query('ename.str.contains("x")', engine='python') #selects rows where ename contains x
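Appending, merging, and filtering can be sketched with `pd.concat`, `pd.merge`, and a boolean mask. The code below is an illustration rather than the slides' own code: `new_emp` is a hypothetical extra row, and the department column is named `dname` by assumption. `pd.concat` is used for appending because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

emp_df = pd.DataFrame({
    'eno': [100, 200, 300, 400, 500, 600],
    'ename': ['ab', 'cd', 'xy', 'mn', 'df', 'er'],
    'salary': [12000, 14500, 50000, 45000, 20000, 25000],
    'did': [1, 1, 2, 2, 3, 3],
})
dept_df = pd.DataFrame({
    'did': [1, 2, 3, 4],
    'dname': ['Accounting', 'Sales', 'Marketing', 'IT'],
})

# Appending: stack a new (hypothetical) employee row under the existing rows
new_emp = pd.DataFrame({'eno': [700], 'ename': ['gh'], 'salary': [30000], 'did': [4]})
all_emp = pd.concat([emp_df, new_emp], ignore_index=True)

# Merging: attach the department name to each employee via the shared did column
merged = pd.merge(all_emp, dept_df, on='did', how='inner')

# Filtering: keep only the rows that satisfy a condition
high_paid = merged[merged['salary'] > 20000]

print(merged)
print(high_paid[['ename', 'salary', 'dname']])
```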
Dataframe
Reading CSV
emp_df1 = pd.read_csv('emp.csv') #reads emp.csv into a dataframe
Dataframe
Writing CSV
emp_df.to_csv('emp.csv', index=False) #writes emp_df to emp.csv without the row index (assumed call; the original line is missing)
Dataframe
import pandas as pd
import pymysql
from sqlalchemy import create_engine
con = create_engine('mysql+pymysql://root:Staragile_123@localhost/my_db')
df_product = pd.read_sql('SELECT * FROM product', con) #read the entire table
Dataframe
df_product_apple.to_sql('product_apple', con) #writes the dataframe to the product_apple table
Chapter 6
6. Data Visualization with Matplotlib and Seaborn
Matplotlib
• Matplotlib is a Python package for creating visualizations from data
• It was created by John D. Hunter
• It is open source and free
Installing Matplotlib
• pip install matplotlib
Matplotlib
Importing Matplotlib
#importing matplotlib
import matplotlib
pyplot
• pyplot is a submodule of Matplotlib
• It provides the functions for drawing charts
Importing pyplot
from matplotlib import pyplot as plt
plot
• plot() is the function used to draw a line chart
• It mainly takes two vectors, one for the x-axis and one for the y-axis
Line Style
x = [1, 2, 3, 4, 5]
y = [100, 120, 130, 110, 100]
plt.plot(x, y, ls=':') #ls (linestyle) ':' draws a dotted line
Marker Style
• A marker is the symbol drawn at each data point of the line
• We can set the marker as follows:
• Circle (o)
• Star (*)
• Plus (+)
• Filled Plus (P)
Marker Style
x = [1, 2, 3, 4, 5]
y = [100, 120, 130, 110, 100]
#here o is circle
plt.plot(x, y, marker='o')
Marker Size
• It sets the size of the marker
• We can use 'markersize' or 'ms' to set the size
Marker Size
x = [1, 2, 3, 4, 5]
y = [100, 120, 130, 110, 100]
plt.plot(x, y, marker='o', ms=10) #ms=10 is an illustrative marker size
Marker Color
• It sets the color of the marker
• We can use 'markeredgecolor' or 'mec' to set the edge color
• We can use 'markerfacecolor' or 'mfc' to set the fill (face) color
• Colors as follows:
• Red ('r')
• Green ('g')
• Blue ('b')
• Hex RGB ('#ffffff')
Marker Color
x = [1, 2, 3, 4, 5]
y = [100, 120, 130, 110, 100]
plt.plot(x, y, marker='o', mec='r', mfc='g') #illustrative colors: red edge, green fill
Chart Title
• We can use the title() function from pyplot to set the title of the chart
• Important parameters of title():
• fontdict
• To set the font style
• loc
• To set the location where the title should appear
X and Y Labels
• xlabel()
• Displays x-axis name/label
• ylabel()
• Displays y-axis name/label
• Important parameters:
• fontdict
• loc
• labelpad
X and Y Labels
Labels
plt.title('Chart Title')
plt.xlabel('X-axis', labelpad=50)
plt.ylabel('Y-axis', loc='top')
X and Y Ticks
• Ticks are the marks, with their labels, along the x and y axes
• By default they are chosen from the data given
• To change them we can use the following:
• xticks() for x-axis
• yticks() for y-axis
X and Y Ticks
Ticks
plt.xticks(ticks=[0,1,2,3,4,5])
plt.yticks(ticks=[100, 120, 130], labels=['low', 'mid', 'high'])
X and Y Grid
• A grid is a set of reference lines drawn on the chart for the x and y axes
• By default it is not shown
• grid() can be used to display the lines
• Parameters are:
• axis
• For x or y or both
• color
• linestyle
• linewidth
• alpha
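The line, marker, title, label, tick, and grid options covered so far can be combined in one chart. The following is a sketch that reuses the x and y values from the earlier slides; the specific sizes, colors, and the output file name `line_chart.png` are illustrative assumptions. The `Agg` backend is selected so that the script also runs where no display is available.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file instead of a window
from matplotlib import pyplot as plt

x = [1, 2, 3, 4, 5]
y = [100, 120, 130, 110, 100]

plt.figure(figsize=(7, 4))

# Dotted line with circle markers, red marker edge and green marker fill
plt.plot(x, y, ls=':', marker='o', ms=8, mec='r', mfc='g')

plt.title('Chart Title')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')

# Custom tick positions and labels
plt.xticks(ticks=[1, 2, 3, 4, 5])
plt.yticks(ticks=[100, 120, 130], labels=['low', 'mid', 'high'])

# Horizontal grid lines only
plt.grid(axis='y', color='gray', linestyle='--', linewidth=0.5, alpha=0.7)

plt.savefig('line_chart.png')
```

In a notebook or interactive session, the backend line can be dropped and `plt.show()` used in place of `plt.savefig()`.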
Scatter plot
• It is also referred to as a dot plot
• scatter() can be used to display a scatter plot
• scatter() parameters:
• c
• Array of colors, one for each dot
• s
• Array of sizes, one for each dot
• cmap
• Color map
Scatter plot
Scatter plot - 1
import numpy as np
x = np.random.randint(1,20, size=10)
y = np.random.randint(100,1000, size=10)
plt.scatter(x, y)
plt.title('Scatter Plot')
Scatter plot
Scatter plot - 2
x = np.random.randint(1,20, size=10)
y = np.random.randint(100,1000, size=10)
sizes = np.random.randint(10,200, size=10)
plt.scatter(x, y, s=sizes) #s sets the size of each dot
Bar plot
• bar() and barh() can be used to display vertical and horizontal bar plots
• bar() parameters:
• color
• Bar color
• edgecolor
• Bar outline color
• width
• Bar width
• height
• Bar height
Bar plot
Bar plot - 1
plt.bar(x, y)
Bar plot - 2
plt.barh(x, y, height=0.5)
Pie plot
• pie() can be used to display a pie plot
• pie() parameters:
• labels
• Displays a name for each portion
• startangle
• Changes the start angle from the default (the x-axis) to the specified angle
• explode
• Separates portions from the center
• shadow
• Displays a shadow
Pie plot
Pie plot - 1
plt.pie(x)
Box plot
• boxplot() can be used to display a box plot
• boxplot() parameters:
• notch
• Draws a notch around the median line
• vert
• Vertical or horizontal orientation
Box plot
Box plot
plt.boxplot(marks)
Histogram plot
• hist() can be used to display a histogram (frequency distribution) plot
• hist() parameters:
• color
• Bin color
Histogram plot
Histogram plot - 1
plt.hist(x)
Histogram plot - 2
plt.hist(x, color='black')
Multiple plot
• subplot() can be used to draw more than one plot in the same figure
• subplot() parameters:
• nrows
• Number of rows
• ncols
• Number of columns
• index
• Subplot number/position, always starts with 1
Multiple plot
Multiple plot
x = np.array(range(10))
y = np.random.randint(10, 20, 10)
plt.subplot(1, 2, 1)
plt.plot(x,y)
plt.subplot(1, 2, 2)
plt.scatter(x,y)
Customizing Plots
• To change the background, colors and more, we can use the following:
• figure
• To change the width and height of the plot
• style
• To change the style of the plot
• rcParams
• To customize the default plot settings
Customizing Plots
Customizing
plt.figure(figsize=(7,4))
plt.style.use('ggplot')
x = np.array(range(10))
y = np.random.randint(10, 20, 10)
plt.rcParams['lines.linewidth'] = 2
plt.rcParams['lines.linestyle'] = '-'
plt.rcParams['lines.marker'] = '^'
plt.plot(x,y,color='green')
Seaborn
• It is also kind of data visualization library like matplotlib
• It built on top of matplotlib
• It simplifies difficult visualization tasks into easy visualization task
• It has built-in themes
• Works well with numpy and pandas
• Comes with some datasets for exploring visualization
Installing seaborn
pip install seaborn
• Seaborn comes with built-in datasets for exploring visualizations
• get_dataset_names() returns the names of all available datasets
• load_dataset() returns the specified dataset as a dataframe (requires an internet connection to download it)
Loading dataset
import seaborn as sns
from matplotlib import pyplot as plt
#sns.get_dataset_names()
tips_df = sns.load_dataset('tips')
tips_df.info()
countplot()
• It is used to display the number of observations in each category of a variable
• Parameters:
• x
• Categorical variable to count
• data
• Dataframe
• color
• Color name
• hue
• Splits the bars by a second categorical variable
countplot() - I
import seaborn as sns
from matplotlib import pyplot as plt
tips_df = sns.load_dataset('tips')
sns.countplot(x='sex', data=tips_df)
countplot() - II
import seaborn as sns
from matplotlib import pyplot as plt
tips_df = sns.load_dataset('tips')
sns.countplot(data=tips_df, x='sex', hue='day')
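displot() (replaces distplot())
• It is used to display the distribution of the data; distplot() has been deprecated since seaborn 0.11 in favor of displot() and histplot()
• Parameters:
• data
• 1D array
• bins
• Number of bins (or bin edges) for the histogram
• kde
• If True, overlays a kernel density estimate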
displot() - I
import seaborn as sns
from matplotlib import pyplot as plt
tips_df = sns.load_dataset('tips')
sns.displot(tips_df['total_bill'], kde=True, bins=10)
pairplot()
• It is used to display pairwise (scatter) relationships among all numeric variables in a dataframe
• Parameters:
• data
• Dataframe
• hue
• Categorical variable used to color the points
• kind
• Kind of plot for the off-diagonal subplots, e.g. 'scatter' or 'reg'
• diag_kind
• Kind of plot for the diagonal subplots
pairplot() - I
import seaborn as sns
from matplotlib import pyplot as plt
tips_df = sns.load_dataset('tips')
sns.pairplot(tips_df)
stripplot()
• It is used to display a scatter plot of a numeric variable for each category of a categorical variable
• Parameters:
• data
• Dataframe
• x
• Categorical variable
• y
• Numeric variable
stripplot() - I
import seaborn as sns
from matplotlib import pyplot as plt
iris_df = sns.load_dataset('iris')
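The slide's code stops before the plotting call, so the following is only a sketch of how stripplot() might be called with a categorical x and a numeric y. The small dataframe is hand-written illustrative data shaped like the iris dataset (an assumption, used so the snippet runs without a download), and the choice of petal_length for y is likewise an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import pandas as pd
import seaborn as sns

# Illustrative values only; with internet access, sns.load_dataset('iris') could be used instead
df = pd.DataFrame({
    "species": ["setosa"] * 4 + ["versicolor"] * 4 + ["virginica"] * 4,
    "petal_length": [1.4, 1.3, 1.5, 1.7, 4.7, 4.5, 4.9, 4.0, 6.0, 5.1, 5.9, 5.6],
})

# x is the categorical variable, y the numeric variable; one strip of points per category
ax = sns.stripplot(x="species", y="petal_length", data=df)
print(ax.get_xlabel(), ax.get_ylabel())
```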
boxplot()
• It is used to display a box plot of a numeric variable for each category
• Parameters:
• data
• Dataframe
• x
• Categorical variable
• y
• Numeric variable
boxplot() - I
import seaborn as sns
from matplotlib import pyplot as plt
iris_df = sns.load_dataset('iris')
sns.boxplot(x="species", y="petal_width", data=iris_df)
barplot()
• It is used to display the relationship between a categorical variable and a continuous variable (by default, the mean of the continuous variable per category)
• Parameters:
• data
• Dataframe
• x
• Categorical variable
• y
• Numeric variable
barplot() - I
import seaborn as sns
from matplotlib import pyplot as plt
titanic_df = sns.load_dataset('titanic')
sns.barplot(x="sex", y="survived", data=titanic_df, hue='class')
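catplot() (formerly factorplot())
• It is used to display plots for categorical variables; factorplot() was renamed to catplot() in seaborn 0.9
• Parameters:
• data
• Dataframe
• x
• Categorical variable
• y
• Numeric variable
• kind
• Kind of plot, e.g. 'strip' (default) or 'point'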
catplot() - I
import seaborn as sns
from matplotlib import pyplot as plt
titanic_df = sns.load_dataset('titanic')
# factorplot() was renamed to catplot(); kind="point" matches factorplot's old default
sns.catplot(x="sex", y="survived", data=titanic_df, kind="point")
catplot() - II
import seaborn as sns
from matplotlib import pyplot as plt
titanic_df = sns.load_dataset('titanic')
# factorplot() was renamed to catplot(); col='class' draws one subplot per passenger class
sns.catplot(x="sex", y="survived", data=titanic_df, col='class', kind="point")
lmplot()
• It is used to display a scatter plot with a fitted regression line
• Parameters:
• data
• Dataframe
• x
• Numeric variable (predictor)
• y
• Numeric variable (response)
lmplot() - I
import seaborn as sns
from matplotlib import pyplot as plt
iris_df = sns.load_dataset('iris')
sns.lmplot(x="petal_length", y="petal_width", data=iris_df)
FacetGrid()
• It is used to display a grid of plots, one for each subset of the data
• Parameters:
• data
• Dataframe
• col
• Variable whose categories define the columns of the grid
• col_wrap
• Maximum number of columns before wrapping to a new row
FacetGrid() - I
import seaborn as sns
from matplotlib import pyplot as plt
iris_df = sns.load_dataset('iris')
grid = sns.FacetGrid(col='species', data=iris_df, col_wrap=2)
grid.map(plt.scatter, 'sepal_length', 'petal_length')
PairGrid()
PairGrid() - I
import seaborn as sns
from matplotlib import pyplot as plt
iris_df = sns.load_dataset('iris')
grid = sns.PairGrid(iris_df)
grid.map(plt.scatter)
grid.map_diag(plt.hist)
Chapter 7
Web Scraping Using Beautifulsoup
Introduction
• Web scraping is a technique for extracting large amounts of data from websites
• Scraping means obtaining data from another source and saving it to the local environment
• It is sometimes referred to as web data mining or web harvesting
• Web scraping steps:
• Extractor
• Data Transformation and Cleaning Module
• Storage Module
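The three steps above (extract, transform and clean, store) might be sketched as follows. This is an illustration under stated assumptions, not the deck's own code: the HTML snippet, the class names item, name, and price, and the function names are all hypothetical, and an inline string is used in place of a real web request so the example runs offline.

```python
import csv
import io
from bs4 import BeautifulSoup

# Hypothetical page content standing in for a downloaded web page
SAMPLE_HTML = """
<ul>
  <li class="item"><span class="name"> Phone A </span><span class="price">$10,999</span></li>
  <li class="item"><span class="name">Phone B</span><span class="price">$7,499</span></li>
</ul>
"""

def extract(html):
    """Extractor: pull the raw name and price text out of the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li", class_="item"):
        rows.append({
            "name": item.find("span", class_="name").text,
            "price": item.find("span", class_="price").text,
        })
    return rows

def transform(rows):
    """Transformation and cleaning: trim whitespace, convert prices to integers."""
    return [
        {"name": row["name"].strip(),
         "price": int(row["price"].replace("$", "").replace(",", ""))}
        for row in rows
    ]

def store(rows, fileobj):
    """Storage: write the cleaned rows as CSV to any file-like object."""
    writer = csv.DictWriter(fileobj, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)

records = transform(extract(SAMPLE_HTML))
buffer = io.StringIO()  # in-memory file; a real script might open a .csv file instead
store(records, buffer)
print(buffer.getvalue())
```

Keeping the three stages as separate functions means the storage target or the cleaning rules can change without touching the extraction code.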
• Web scraping modules:
• requests
• bs4 (Beautiful Soup)
• html.parser (HTML Parser)
HTML Page
• Most of the data in web pages is in HTML format, as follows:
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
</body>
</html>
HTML DOM
• HTML content loads in memory as DOM.
• DOM stands for Document Object Model
Beautiful Soup
• Methods:
• find_all('tag_name', class_='class_name')
• Returns all tags with the specified name and class
• find('tag_name', class_='class_name')
• Returns the first tag with the specified name and class
• find_parent()
• Returns the parent tag
• findChild()
• Returns the first matching descendant tag (legacy alias of find())
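The search methods listed above can be tried on a small inline document, as in this sketch. The HTML string, its class names, and the variable names are made up for illustration; find() is used in place of the legacy findChild() spelling.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only to demonstrate the search methods
html = """
<div id="main">
  <h1>My First Heading</h1>
  <p class="intro" id="first">Hello <b>world</b></p>
  <p class="intro">Second paragraph</p>
  <p class="other">Third paragraph</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every <p> tag whose class is "intro"
intro_tags = soup.find_all("p", class_="intro")

# find() returns only the first matching tag
first_intro = soup.find("p", class_="intro")

# find_parent() walks up the tree to the enclosing tag
parent = first_intro.find_parent()

# find() called on a tag searches only inside that tag
heading = parent.find("h1")

print(len(intro_tags), first_intro.get("id"), parent.name, heading.text)
```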
• Properties
• text
• Returns the text of the tag, including the text of its child tags
• attrs
• Returns the attributes of the tag as a dictionary
• contents
• Returns a list of the tag's direct children (tags and text)
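A short sketch of the three properties on a single tag is shown below. The HTML fragment is a made-up example; note that Beautiful Soup returns the class attribute as a list because a tag can carry several classes.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one paragraph with two attributes and a nested <b> tag
soup = BeautifulSoup('<p class="intro" id="first">Hello <b>world</b></p>', "html.parser")
tag = soup.find("p")

# text: all text inside the tag, including text of nested tags
print(tag.text)

# attrs: dictionary of the tag's attributes (class is a list of class names)
print(tag.attrs)

# contents: list of direct children, here a text node and the <b> tag
print(tag.contents)
```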
Installing Modules
pip install requests beautifulsoup4
Importing Modules
import requests
from bs4 import BeautifulSoup as bs
Scraping Data
python_page = requests.get('https://en.wikipedia.org/wiki/Python')
for i in range(10):
    # Build the URL of result page i+1 (page numbers start at 1)
    page_url = ('https://www.flipkart.com/search?q=mobiles&as=on&as-show=on'
                '&otracker=AS_Query_TrendingAutoSuggest_1_0_na_na_na'
                '&otracker1=AS_Query_TrendingAutoSuggest_1_0_na_na_na'
                '&as-pos=1&as-type=HISTORY&suggestionId=mobiles'
                '&requestId=a83b1026-4c50-46af-b37f-169ed3e41c8f&page=' + str(i+1))
    page = requests.get(page_url)
    soup = bs(page.content, 'html.parser')
products.append({"Name": name.text, "Price": price.text, "RAM_ROM": ram_rom,
                 "Display": display, "Camera": camera, "Rating": rating.text})
mobile_ds.head()