Informatics Practices Class 12 Cbse Notes Data Handling
Reshaping Data Structures:
In Pandas, data reshaping means transforming the structure of a table (i.e. a DataFrame or Series) to
make it suitable for further analysis. Two functions are mainly used to reshape data:
1. pivot()
2. pivot_table()
pivot():
The pivot function is used to create a new derived table out of a given one.
Pivot reshapes data and uses unique values from index/columns to form axes of the resulting
DataFrame.
The pivot() function is used to reshape and manipulate an already existing DataFrame. It takes three main parameters:
• index
• columns
• values
For each of the above parameters, a column name from the existing table should be defined.
Then the pivot function will create a new table, whose row and column indices are the unique
values of the respective parameters. The cell values of the new table are taken from the column
specified as the values parameter.
However, if the values parameter is not specified, all the other columns in the original table will be
taken as values, and an individual table will be created for each of them.
NOTE: If the index-column combination of two or more rows is the same, a ValueError is raised.
Syntax:
DataFrame.pivot(index=<name>, columns=<name>, values=<name>)
(or)
pandas.pivot(<DataFrame>, index=<name>, columns=<name>, values=<name>)
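Example (a minimal sketch; the names, degrees and scores below are assumed for illustration). It shows a successful pivot() and the ValueError raised when an index-column combination repeats:

import pandas as pd

# Hypothetical data; each Name-Degree combination is unique
df = pd.DataFrame({'Name':   ['John', 'John', 'Mary'],
                   'Degree': ['Masters', 'Graduate', 'Graduate'],
                   'Score':  [27, 23, 30]})

# Rows are the unique Names, columns are the unique Degrees,
# and the cells are filled from the Score column
print(df.pivot(index='Name', columns='Degree', values='Score'))

# Adding a duplicate John-Masters row makes the index-column combination
# repeat, so pivot() raises a ValueError
dup = pd.DataFrame({'Name': ['John'], 'Degree': ['Masters'], 'Score': [23]})
df2 = pd.concat([df, dup], ignore_index=True)
try:
    df2.pivot(index='Name', columns='Degree', values='Score')
except ValueError as err:
    print("ValueError:", err)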
pivot_table():
As we saw in the above example, whenever the index and column combination is identical, a
ValueError is raised. Using pivot_table(), this problem can be solved.
When there are two or more values for the same index-column combination, the aggfunc parameter
aggregates the duplicate values.
Consider the same example as above, but with the pivot_table function:
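A possible version of this example (the data is assumed for illustration, with John-Masters appearing twice with scores 27 and 23):

import pandas as pd

# Hypothetical data; the Name-Degree combination John-Masters appears twice
df = pd.DataFrame({'Name':   ['John', 'John', 'Mary'],
                   'Degree': ['Masters', 'Masters', 'Graduate'],
                   'Score':  [27, 23, 30]})

# The duplicate John-Masters rows are aggregated using aggfunc (here, sum)
print(df.pivot_table(index='Name', columns='Degree', values='Score', aggfunc='sum'))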
Here, the aggregate function is the sum function. It added up the two values (27, 23) for John-
Masters.
Sorting:
Often you want to sort a Pandas data frame in a specific way. Typically, one may want to sort a Pandas
data frame based on the values of one or more columns, or based on the values of the row index or
row names of the DataFrame. A Pandas data frame has two useful functions for this:
1. sort_values()
2. sort_index()
Each of these functions comes with numerous options, like sorting the data frame in a specific order
(ascending or descending), sorting in place, sorting with missing values, sorting by a specific algorithm,
and so on.
Sorting by values:
The sort_values() function sorts a data frame by the values of one or more columns. In case NaN
values are present, the na_position parameter determines whether they should be placed at the
beginning or at the end.
Syntax:
df = df.sort_values(by=[<a>, <b>, ...], ascending=[True/False, True/False, ...])
Example:
In the example below, the elements are first sorted by age; in case the ages of two rows are the
same, they are then sorted by score.
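A minimal sketch of such a sort (the names, ages and scores are assumed for illustration):

import pandas as pd

# Hypothetical data; Dhoni and Virat share the same age
df = pd.DataFrame({'Name':  ['Rohit', 'Dhoni', 'Virat'],
                   'Age':   [30, 25, 25],
                   'Score': [80, 95, 70]})

# Sort by Age (ascending); ties are broken by Score (descending)
df = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
print(df)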
Thus, the data frame can be sorted even when there are duplicate values in the first column by
which it is to be sorted. For example, here Dhoni and Virat have the same value for age (25).
Since there is no other criterion to decide whose data should come first, a second sorting
criterion is added: the score. Since the score is sorted in descending order, the one with the
highest score comes first in the table, as clearly seen in the output.
Sorting by index:
The sort_index() function is used to sort the values according to the indexes.
Syntax:
DataFrame.sort_index(axis=0/1, ascending=True/False,
na_position='last'/'first', sort_remaining=True)
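Example (a small sketch with an assumed, unordered row index):

import pandas as pd

# Hypothetical DataFrame whose row index is not in order
df = pd.DataFrame({'Marks': [88, 72, 95]}, index=[3, 1, 2])

# Sort the rows by index label in ascending order
print(df.sort_index(axis=0, ascending=True))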
Aggregations:
There are many aggregations like count, sum, min, max, median, quantile, etc. They are also called
descriptive statistics.
Syntax:
X = <dataframe name>[['<column name>']].<aggregate function>()
Examples:
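A minimal sketch (the column name and marks are assumed) showing a few of these aggregate functions:

import pandas as pd

# Hypothetical DataFrame of student marks
df = pd.DataFrame({'Marks': [45, 67, 89, 90, 72]})

print(df[['Marks']].count())   # number of non-null values
print(df[['Marks']].sum())     # total of all values
print(df[['Marks']].min())     # smallest value
print(df[['Marks']].max())     # largest value
print(df[['Marks']].median())  # middle value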
Variance:
var() - The variance function in Python Pandas is used to calculate the variance of a given set of numbers.
The variance of a complete data frame, of a column, or of rows can be calculated.
Syntax:
df.var()   (or)   df[<column name>].var()
The variance of the given data can be calculated using the formula:
Variance = Σ(xi − x̄)² / n
where xi is the ith term, x̄ is the mean, and n is the total number of terms.
Example:
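A minimal sketch (column names and values assumed) showing var() on a DataFrame:

import pandas as pd

# Hypothetical DataFrame
df = pd.DataFrame({'A': [10, 20, 30, 40],
                   'B': [5, 15, 25, 35]})

print(df.var())        # variance of every column
print(df['A'].var())   # variance of a single column
print(df.var(axis=1))  # variance of every row

# Note: pandas divides by (n - 1) by default (sample variance);
# pass ddof=0 to divide by n as in the formula above
print(df.var(ddof=0))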
Quantiles:
A quantile statistic describes a part of a data set. It is used to describe data in a clear and understandable
way.
A quantile is part of a distribution which is divided into equal sized subgroups. It is also called a
fractile.
The 0.30 quantile basically says that 30% of the observations in our data set are below a given
line. On the other hand, it also states that the remaining 70% are above the line we set.
Common Quantiles
Certain types of quantiles are used commonly enough to have specific names.
Below is a list of these:
• 2 quantile is called the median
• 3 quantiles are called terciles
• 4 quantiles are called quartiles
• 5 quantiles are called quintiles
• 6 quantiles are called sextiles
• 7 quantiles are called septiles
• 8 quantiles are called octiles
• 10 quantiles are called deciles
• 12 quantiles are called duodeciles
• 20 quantiles are called vigintiles
• 100 quantiles are called percentiles
• 1000 quantiles are called permilles
Finding quantiles:
Sample question: Find the number in the following set of data where 30 percent of values fall
below it, and 70 percent fall above:
Step 1: Order the data from smallest to largest. The data in the question is already in ascending
order.
Step 2: Count how many observations you have in your data set. This particular data set has 40
items.
Step 3: Convert any percentage to a decimal for “q”. We are looking for the number where 30
percent of the values fall below it, so convert that to .3.
Step 4: Insert your values into the formula:
ith observation = q(n + 1)
Here q is the quantile number, and n is the total number of elements in the distribution.
Answer: The ith observation is at 0.3 × (40 + 1) = 12.3, so we round down to 12 (remembering that this formula is
an estimate). The 12th number in the set is 31, which is the number where 30 percent of the values
fall below it.
Here, first the ith term is calculated normally: q is 0.5 and n is 4, which gives the ith term as 0.5 × (4 + 1) = 2.5.
For column b, it is calculated by the interpolation formula i + q(j − i), where j is the element after i.
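A minimal sketch of DataFrame.quantile() with four rows (column names and values assumed), matching the kind of calculation described above:

import pandas as pd

# Hypothetical DataFrame with four rows
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [10, 20, 30, 40]})

# 0.5 quantile (the median) of each column; the ith term falls at 2.5,
# so the result is interpolated between the 2nd and 3rd values
print(df.quantile(0.5))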
Histograms
A histogram is a powerful technique in data visualization. It is an accurate graphical representation of the
distribution of numerical data. It was first introduced by Karl Pearson.
To construct a histogram, the first step is to “bin” the range of values —i.e. divide the entire range of values
into a series of intervals — and then count how many values fall into each interval.
The bins (intervals) must be adjacent, and are often (but are not required to be) of equal size.
The bins are represented on the x-axis, and the y-axis represents the frequency of each interval.
Example code:
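A minimal sketch using matplotlib.pyplot (the marks data and the number of bins are assumed):

import matplotlib.pyplot as plt

# Hypothetical data: marks of 15 students
marks = [12, 25, 33, 37, 41, 45, 48, 52, 55, 61, 64, 72, 78, 85, 92]

# Divide the range into 5 bins and count how many values fall into each
plt.hist(marks, bins=5)
plt.xlabel('Marks (bins)')
plt.ylabel('Frequency')
plt.title('Histogram of Marks')
plt.show()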
Function Applications
To apply functions to a data structure (for example, increasing all values by 2) we mainly use 3 functions:
1. pipe()
2. apply()
3. applymap()
Table wise Function Application in python pandas: pipe()
The pipe() function applies a user-defined function to the whole DataFrame, along with any arguments.
Syntax:
X=df.pipe(<func.name>,value)
Example:
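A minimal sketch (the adder function and the DataFrame values are assumed for illustration):

import pandas as pd

# A hypothetical user-defined function that adds a number to every value
def adder(data, num):
    return data + num

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]})

# pipe() passes the whole DataFrame to adder() together with the value 5
result = df.pipe(adder, 5)
print(result)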
In the above code, 5 has been added to each value in the dataframe.
The pipe function can be chained any number of times.
Syntax:
df.pipe(<func.name>,value).pipe(<func.name>,value).pipe(<func>,value)....
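Example (a small sketch; the adder and multiplier functions are assumed for illustration):

import pandas as pd

# Hypothetical functions: add a number, then multiply by a number
def adder(data, num):
    return data + num

def multiplier(data, num):
    return data * num

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]})

# First 5 is added to every value, then every value is multiplied by 2
result = df.pipe(adder, 5).pipe(multiplier, 2)
print(result)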
Row/Column wise Function Application in python pandas: apply()
The apply() function applies a function along an axis of the DataFrame: column-wise when axis=0 and row-wise when axis=1.
Syntax:
X=df.apply(<funct>,axis=1/0)
Example program:
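A minimal sketch (the DataFrame values are assumed) showing apply() along both axes:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]})

# axis=0 (the default) applies the function to each column
print(df.apply(sum, axis=0))   # column totals

# axis=1 applies the function to each row
print(df.apply(sum, axis=1))   # row totals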
Element wise Function Application in python pandas: applymap()
The applymap() function performs the specified operation for all the elements of the DataFrame.
applymap() commonly uses a lambda function. This is a short version of a user-defined function: instead of
the def syntax for function declaration, we can use a lambda expression to write Python functions. The
lambda syntax closely follows the def syntax.
The lambda expression takes in a comma separated sequence of inputs (like def). Then,
immediately following the colon, it returns the expression without using an explicit return
statement.
Syntax:
X=df.applymap(lambda <argument>: <expression>)
Example:
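A minimal sketch (the DataFrame values are assumed); the lambda doubles every element:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]})

# The lambda is applied element-wise: every value is multiplied by 2
print(df.applymap(lambda x: x * 2))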