Informatics Practices Class 12 Cbse Notes Data Handling

The document discusses various techniques for reshaping, sorting, aggregating and analyzing Pandas dataframes. It covers reshaping data using pivot() and pivot_table() functions. It describes sorting dataframes using sort_values() and sort_index() functions. It discusses calculating aggregates like count, sum, mean using agg() function. Finally, it covers applying functions to entire dataframes or rows/columns using pipe(), apply() and applymap() functions.

Uploaded by

ellastark
Copyright
© © All Rights Reserved

Chapter 1: Advanced Operations on DataFrames.

Reshaping Data Structures:
In Pandas, data reshaping means transforming the structure of a table (i.e. a DataFrame or Series) to
make it suitable for further analysis. Two functions are mainly used to reshape data:

1. pivot()
2. pivot_table()

pivot():
The pivot() function is used to create a new derived table out of a given one.

It reshapes an already existing DataFrame and uses the unique values of the chosen index/columns to
form the axes of the resulting DataFrame.

pivot() takes 3 arguments with the following names:

• index
• columns
• values

For each of the above parameters, a column name from the existing table should be given.

The pivot function will then create a new table whose row and column indices are the unique
values of the respective parameters. The cell values of the new table are taken from the column
specified as the values parameter.

However, if the values parameter is not specified, all the other columns in the original table will be
taken as values, and the result has hierarchical (multi-level) columns, with one block of columns
for each of those value columns.

NOTE: If two or more rows share the same index-columns combination, a ValueError is raised.

Syntax:
DataFrame.pivot(index=<name>, columns=<name>, values=<name>)

(or, in pandas versions before 2.0 only, with positional arguments)

DataFrame.pivot(<indexname>, <columnsname>, <valuesname>)


Example for Pivot():
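The original example is not reproduced in these notes, so the following is only a minimal sketch of the behaviour described above. The table and its column names (name, degree, value) are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one value for each (name, degree) combination.
df = pd.DataFrame({
    "name": ["John", "John", "Mary", "Mary"],
    "degree": ["Bachelors", "Masters", "Bachelors", "Masters"],
    "value": [21, 27, 20, 25],
})

# Unique names become the row index, unique degrees become the columns.
pivoted = df.pivot(index="name", columns="degree", values="value")
print(pivoted)

# Adding a second John-Masters row makes the index-columns combination
# ambiguous, so pivot() raises ValueError.
extra = pd.DataFrame({"name": ["John"], "degree": ["Masters"], "value": [23]})
with_duplicate = pd.concat([df, extra], ignore_index=True)
try:
    with_duplicate.pivot(index="name", columns="degree", values="value")
    duplicate_error = None
except ValueError as err:
    duplicate_error = err
    print("ValueError:", err)
```

Keyword arguments are used so that the call works on both older and newer pandas versions.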
pivot_table():
This tool enables users to automatically sort, count, total, or average the data stored in one
table.

As we saw in the above example, whenever the same index and column combination occurs more
than once, pivot() raises a ValueError. Using pivot_table(), the problem can be solved.

When there are two or more values for the same index-column combination, the aggfunc
argument aggregates the duplicate values into a single value.

Consider the same example as above (ln 16), but with the pivot_table function:

Here, the aggregate function is the mean.

Syntax for pivot_table():

X=pandas.pivot_table(<nameofdataframe>, values=<name>, index=<name>,
columns=<name>, aggfunc=...)

Example for pivot_table():

Here, the aggregate function is the sum function. It adds up the two values (27, 23) for
John-Masters, giving 50.

NOTE: The default aggregate function is the mean.
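As a rough sketch of the case described above, the following assumes a small made-up table in which John-Masters appears twice with the values 27 and 23. The other rows and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: the John-Masters combination occurs twice (27 and 23).
df = pd.DataFrame({
    "name": ["John", "John", "John", "Mary"],
    "degree": ["Bachelors", "Masters", "Masters", "Masters"],
    "value": [21, 27, 23, 25],
})

# aggfunc="sum" adds the duplicate values together.
total = pd.pivot_table(df, values="value", index="name",
                       columns="degree", aggfunc="sum")
print(total)

# With no aggfunc given, the default (mean) is used.
average = pd.pivot_table(df, values="value", index="name", columns="degree")
print(average)
```

Combinations that never occur in the data (Mary-Bachelors here) show up as NaN in the result.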


Sorting in DataFrames:
Sorting means arranging the data according to the values or the index.

Often you want to sort a Pandas data frame in a specific way. Typically, one may want to sort it
based on the values of one or more columns, or based on the values of the row index (the row
names). A Pandas data frame has two useful functions for this:

sort_values(): to sort a pandas data frame by one or more columns

sort_index(): to sort a pandas data frame by row index

Each of these functions comes with numerous options, such as sorting in a specific order
(ascending or descending), sorting in place, handling missing values, and choosing the sorting
algorithm.

1) sort_values(): The sort_values() function sorts a data frame in ascending or descending order of
the column passed to it.
It accepts a 'by' argument, which names the column (or columns) of the DataFrame by whose
values the rows are to be sorted.
i) Ascending order:
Syntax:
X=df.sort_values(by=<columnname>)
By default, the order is ascending.
Example:
ii) Descending order:
Syntax:

DataFrame.sort_values(by, axis=0, ascending=True/False, na_position='last'/'first')

Pass ascending=False to sort in descending order. If NaN values are present, na_position
determines whether they are placed at the beginning ('first') or at the end ('last', the default).
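A short sketch of both orderings follows. The names and scores are invented, and one score is deliberately left missing to show the effect of na_position.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing score.
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena", "Kiran"],
    "score": [72.0, np.nan, 91.0, 65.0],
})

# Ascending order is the default; NaN goes last by default.
ascending = df.sort_values(by="score")
print(ascending)

# Descending order, with the NaN row moved to the top.
descending = df.sort_values(by="score", ascending=False, na_position="first")
print(descending)
```

Both calls return a new data frame and leave df itself unchanged.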

Sorting by multiple columns:

The values can also be sorted by multiple columns. In case of multiple columns, the column listed
first (leftmost) in the command is given the highest priority.

Syntax:

df=df.sort_values(by=[<a>,<b>....],ascending=[True/False,
True/False, True/False....])

Example:

In the example below, the rows are first sorted by age, and when two rows have the same age,
they are then sorted by score.

Thus, the data frame can be sorted even when there are duplicate values in the first column by
which it is to be sorted. For example, Dhoni and Virat have the same value for age: 25.
Since age alone cannot decide whose data comes first, a second sorting criterion is added: the
score. Since score is sorted in descending order, the one with the highest score comes first in
the table, as seen in the output.
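The original table is not included in these notes, so the sketch below rebuilds a similar one. Only the shared age of 25 for Dhoni and Virat comes from the text; the scores and the other rows are assumptions.

```python
import pandas as pd

# Hypothetical table: Dhoni and Virat share the same age (25).
df = pd.DataFrame({
    "name": ["Dhoni", "Virat", "Rohit", "Jadeja"],
    "age": [25, 25, 30, 22],
    "score": [80, 95, 70, 60],
})

# Sort by age ascending; break ties by score descending.
result = df.sort_values(by=["age", "score"], ascending=[True, False])
print(result)
```

With these made-up scores, Virat is placed before Dhoni because both are 25 and Virat has the higher score.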

Sorting by index:
The sort_index() function is used to sort the data frame by its index labels rather than by its
values.

Syntax:
DataFrame.sort_index(axis=0/1, ascending=True/False,
na_position='last'/'first', sort_remaining=True)

axis=0 (the default) sorts by the row index, and axis=1 sorts by the column labels. Older pandas
versions also accepted a by=<name> argument here; it has since been removed, so use
sort_values() to sort by column values.

Consider the example below,

In the example, df is reindexed randomly. To arrange the new index in ascending
or descending order, sort_index() is used.
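A minimal sketch of sort_index() on a data frame whose row labels are out of order. The data and labels are made up for illustration.

```python
import pandas as pd

# Hypothetical data frame with shuffled row labels and unsorted column names.
df = pd.DataFrame({"b": [10, 20, 30], "a": [1, 2, 3]}, index=[2, 0, 1])

# axis=0 (default): order the rows by their index labels.
by_rows = df.sort_index()
by_rows_desc = df.sort_index(ascending=False)

# axis=1: order the columns by their labels instead.
by_cols = df.sort_index(axis=1)

print(by_rows)
print(by_rows_desc)
print(by_cols)
```

Each row keeps its own values; only the order in which rows (or columns) appear changes.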
Data aggregation:
Aggregation is the process of turning the values of a dataset (or a subset of it) into one single value.

There are many aggregations, such as count, sum, min, max, median and quantile. They are also
called descriptive statistics.

Syntax:
X=<dataframename>[['<columnname>']].<aggfunc>()

Here <aggfunc> stands for the name of the aggregate function to be used, for example sum or mean.

Examples:
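The following sketch shows a single aggregate on one column and several aggregates at once through agg(). The marks and ages are invented for illustration.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"marks": [40, 50, 60, 70], "age": [16, 17, 16, 18]})

# One aggregate on one column, following the syntax above.
total = df[["marks"]].sum()
print(total)

# Several aggregates for every column at once.
summary = df.agg(["count", "sum", "mean"])
print(summary)

# A different aggregate for each column.
per_column = df.agg({"marks": "max", "age": "min"})
print(per_column)
```

Passing a list to agg() gives one row per aggregate, while passing a dictionary gives one value per column.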
Variance:
var(): The variance function in pandas calculates the variance of a given set of numbers. The
variance of a whole data frame, of a single column, or of each row can be calculated.

Syntax:

df.var() : variance of each numeric column (same as axis=0)

df.loc[:,"<columnname>"].var() : variance of a specific column

df.var(axis=0) : variance of each column

df.var(axis=1) : variance of each row

Note that var() by default divides by n - 1 (sample variance). Pass ddof=0 to divide by n instead.

Example:
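A small sketch of the var() calls listed above, on made-up numbers. It also shows the difference between the default (dividing by n - 1) and ddof=0 (dividing by n).

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"a": [2, 4, 4, 6], "b": [1, 3, 5, 7]})

sample_var = df.var()            # per column, divides by n - 1
population_var = df.var(ddof=0)  # per column, divides by n
row_var = df.var(axis=1)         # one variance per row
column_a = df.loc[:, "a"].var()  # a single column

print(sample_var)
print(population_var)
print(row_var)
print(column_a)
```

For column a the mean is 4 and the squared deviations add up to 8, so ddof=0 gives 8 / 4 = 2.0 while the default gives 8 / 3.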

The variance of given data can be calculated using the formula:

$$\mathrm{Var}(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

where $x_i$ is the ith term, $\bar{x}$ is the mean, and $n$ is the total number of terms.
Quantiles:
Quantiles are cut points that divide an ordered data set into parts. They are used to describe
data in a clear and understandable way.

Quantiles divide a distribution into equal-sized subgroups. A quantile is also called a
fractile.

It can also refer to dividing a distribution into areas of equal probability.

The 0.30 quantile is the value below which 30% of the observations in the data set fall. The
remaining 70% of the observations lie above it.

Common Quantiles
Certain types of quantiles are used commonly enough to have specific names.
Below is a list of these:
• The 2-quantile is called the median
• 3 quantiles are called terciles
• 4 quantiles are called quartiles
• 5 quantiles are called quintiles
• 6 quantiles are called sextiles
• 7 quantiles are called septiles
• 8 quantiles are called octiles
• 10 quantiles are called deciles
• 12 quantiles are called duodeciles
• 20 quantiles are called vigintiles
• 100 quantiles are called percentiles
• 1000 quantiles are called permilles

Finding quantiles:
Sample question: Find the number in the following set of data where 30 percent of values fall
below it, and 70 percent fall above:

2 4 5 7 9 11 12 17 19 21 22 31 35 36 44 45 55 68 79 80 81 88 90 91 92 100 112 113 114 120 121 132 145 148 149 152 157 170 180 190

Step 1: Order the data from smallest to largest. The data in the question is already in ascending
order.

Step 2: Count how many observations you have in your data set. This particular data set has 40
items.

Step 3: Convert any percentage to a decimal for "q". We are looking for the number where 30
percent of the values fall below it, so convert that to 0.3.

Step 4: Insert your values into the formula:

ith observation = q (n + 1)

Here q is the quantile expressed as a decimal, and n is the total number of elements in the
distribution.

ith observation = 0.3 (40 + 1) = 12.3

Answer: The ith observation is at 12.3, so we round down to 12 (remembering that this formula is
an estimate). The 12th number in the set is 31, which is the number below which roughly 30
percent of the values fall.
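The four steps above can be sketched in Python as follows. The variable names are illustrative, and the final line uses pandas only for comparison, since pandas interpolates between neighbouring values by default instead of rounding down.

```python
import math
import pandas as pd

data = [2, 4, 5, 7, 9, 11, 12, 17, 19, 21, 22, 31, 35, 36, 44, 45, 55, 68,
        79, 80, 81, 88, 90, 91, 92, 100, 112, 113, 114, 120, 121, 132, 145,
        148, 149, 152, 157, 170, 180, 190]

ordered = sorted(data)           # Step 1: order the data
n = len(ordered)                 # Step 2: count the observations (40)
q = 0.3                          # Step 3: 30 percent as a decimal
position = q * (n + 1)           # Step 4: q (n + 1), about 12.3
rank = math.floor(position)      # round down to the 12th observation
estimate = ordered[rank - 1]     # lists are 0-based, so subtract 1
print(n, rank, estimate)

# For comparison: pandas' default linear interpolation.
interpolated = pd.Series(data).quantile(q)
print(interpolated)
```

The rounded-down estimate is 31, while the pandas result lies between the 12th and 13th values (31 and 35), so the two methods need not agree exactly.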

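Consider another example:

Here, the position of the ith term is first calculated as before: q is 0.5 and n is 4, which gives
a position of 0.5 (4 + 1) = 2.5, i.e. halfway between the 2nd and 3rd values.

For column b, the value is then found by linear interpolation using the formula
i + fraction × (j - i), where i is the value at the lower position, j is the value after it, and
fraction is the fractional part of the position (here 0.5).

So the value is given by 10 + 0.5 (100 - 10) = 55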
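The data frame used in the original example is not shown in these notes. The sketch below assumes a four-row table in which the 2nd and 3rd values of column b are 10 and 100, which is consistent with the arithmetic above. Column a and the remaining values are made up.

```python
import pandas as pd

# Assumed data: column b has 10 and 100 as its 2nd and 3rd values.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 10, 100, 100]})

# The 0.5 quantile (median) of every column, using linear interpolation.
medians = df.quantile(0.5)
print(medians)
```

For column a the same rule gives 2 + 0.5 (3 - 2) = 2.5.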

Histograms
A histogram is a powerful technique in data visualization. It is a graphical representation of the
distribution of numerical data. It was first introduced by Karl Pearson.

To construct a histogram, the first step is to "bin" the range of values (i.e. divide the entire range
of values into a series of intervals) and then count how many values fall into each interval.

The bins are usually specified as consecutive, non-overlapping intervals of a variable.

The bins (intervals) must be adjacent, and are often (but are not required to be) of equal size.

The bins are represented on the x axis, and the y axis represents the frequency of each interval.
Example code:
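The original plotting code is not reproduced in these notes. As a stand-in, the sketch below carries out the "bin and count" step with numpy.histogram on made-up marks. Drawing the bars themselves would need a plotting library such as matplotlib, which is not assumed here.

```python
import numpy as np

# Hypothetical marks and four equal-width, adjacent bins.
marks = [12, 15, 22, 25, 27, 31, 38, 41, 45, 49]
bin_edges = [10, 20, 30, 40, 50]

# counts[i] is the number of marks in the bin from bin_edges[i] to bin_edges[i + 1].
counts, edges = np.histogram(marks, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {count}")

# If matplotlib is installed, plt.hist(marks, bins=bin_edges) draws the same
# bins on the x axis and these counts on the y axis.
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.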
Function Applications

To apply functions to a data structure (for example, increasing all values by 2) we mainly use 3 functions:

• Table wise Function Application: pipe()
• Row or Column Wise Function Application: apply()
• Element wise Function Application: applymap()

Table wise Function Application: pipe():

The pipe() function passes the entire data frame to a user-defined or library function and
returns that function's result.

The function receives the whole table as its first argument, so an operation such as adding a
number is carried out on all elements in the table.

Syntax:
X=df.pipe(<func.name>,value)

Example:
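The original code is not included in these notes. A minimal sketch of what it may have looked like follows; the function name add_value and the data are assumptions.

```python
import pandas as pd

def add_value(frame, value):
    """Return a new data frame with value added to every element."""
    return frame + value

# Hypothetical data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# pipe() calls add_value(df, 5) and returns its result.
result = df.pipe(add_value, 5)
print(result)
```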

In the above code, 5 has been added to each value in the dataframe.

Calls to pipe() can be chained, so that the result of one function is passed on to the next, any
number of times.

Syntax:
df.pipe(<func.name>,value).pipe(<func.name>,value).pipe(<func>,value)....

It can be seen in the following code:
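A short sketch of chaining, using two hypothetical helper functions (add_value and multiply_by) on made-up data:

```python
import pandas as pd

def add_value(frame, value):
    """Add value to every element."""
    return frame + value

def multiply_by(frame, value):
    """Multiply every element by value."""
    return frame * value

# Hypothetical data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Each pipe() receives the result of the previous one: (x + 5) * 2 + 1.
result = df.pipe(add_value, 5).pipe(multiply_by, 2).pipe(add_value, 1)
print(result)
```

The chain reads left to right, in the same order in which the operations are carried out.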


apply() function:
The apply() function is used for applying a function or an operation along the rows or the
columns of a data frame.

Syntax:
X=df.apply(<funct>,axis=1/0)

axis=0 (the default) applies the function to each column, and axis=1 applies it to each row.

Example program:
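The original program is missing from these notes, so here is a minimal sketch on made-up data. It applies the built-in sum and a small lambda along each axis.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: the function receives one column at a time.
column_totals = df.apply(sum, axis=0)
column_range = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives one row at a time.
row_totals = df.apply(sum, axis=1)

print(column_totals)
print(column_range)
print(row_totals)
```

The result has one entry per column when axis=0 and one entry per row when axis=1.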
Element wise Function Application in python pandas: applymap()
The applymap() function performs the specified operation on every element of the DataFrame.

applymap() is often used with a lambda function. This is a short version of a user-defined
function: instead of the def syntax for function declaration, we can use a lambda expression to
write Python functions. The lambda syntax closely follows the def syntax.

The lambda expression takes in a comma-separated sequence of inputs (like def). Then,
immediately following the colon, it gives the expression whose value is returned, without
using an explicit return statement.

Syntax:

X=df.applymap(lambda x: <operation on x>)

Example:
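A minimal sketch on made-up data follows. One detail goes beyond these notes: from pandas 2.1 onwards applymap() is deprecated in favour of DataFrame.map(), so the code picks whichever name the installed version provides.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Newer pandas offers DataFrame.map(); older versions only have applymap().
elementwise = df.map if hasattr(df, "map") else df.applymap

# The lambda is called once for every element x of the data frame.
doubled = elementwise(lambda x: x * 2)
print(doubled)
```

On older pandas versions, df.applymap(lambda x: x * 2) gives the same result.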
