Chapter 1 - Part 2 - DataFrame (1)

The document provides a comprehensive overview of DataFrames in Python's pandas library, detailing their structure, features, and significance in handling multi-dimensional data. It covers how to create DataFrames from various data structures, manipulate data, and access specific data points using methods like loc[], iloc[], and others. Additionally, it explains operations such as sorting, adding/removing columns and rows, and renaming indices.

Uploaded by

Reshmi Manoj
10/26/2024

DATA FRAMES
Data frames:
• A data frame is a two-dimensional data structure of python pandas that holds
heterogeneous data, usually represented in table format.
• The data is arranged in rows and columns.
• It is similar to a spreadsheet or an SQL table.

Need of DataFrame
• A series is one-dimensional, so it cannot handle the 2D or multi-dimensional
data found in real-world problems.
• For such tasks, pandas provides another data structure called the data frame.

Basic Features of Data Frame

• Columns may be of different types.
• Size is mutable (no. of rows / columns can be increased / decreased)
• Data is mutable (data can be changed later)
• Arithmetic operations can be performed on rows and columns
• Labelled axes (rows / columns)
• Index can be numbers / strings

Significance of data frame index

• A data frame has two indices or two axes:
    • row index (axis=0) and
    • column index (axis=1)
• The index can be of any data type.
• You can give the row index (row labels) explicitly using the index argument;
the default row index is 0, 1, 2, …
• The column index (column headings) is given using the columns argument.

Creating Data frames

Data frames can be created using :

• Lists
• Series
• Dictionary
• Numpy ndarrays
• Text / csv


Creating Data frames

Syntax:
DataFrame_object_name = pandas.DataFrame(data, index, columns, dtype, copy)

A data frame is created using the function DataFrame() available in the pandas
library, which has the following parameters:
1. data: can be a list, series, dictionary, numpy ndarray or constant
2. index: row labels; optional. Default is from 0 to length-1
3. columns: column labels or names; optional. Default is from 0 to length-1
4. dtype: data type to apply to the columns; optional. Default is None (types are inferred)
5. copy: whether to copy the input data; optional. Older pandas versions default to False;
recent versions default to None, which acts like False for ndarray input and like True for dictionary input
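
A minimal sketch of the syntax above. The sample values and the labels r1, r2, A and B are hypothetical, not taken from the slides:

```python
import pandas as pd

# Hypothetical sample data: two rows, two columns.
data = [[10, 20], [30, 40]]

# index gives the row labels, columns gives the column labels,
# dtype asks pandas to store every column as float.
df = pd.DataFrame(data, index=["r1", "r2"], columns=["A", "B"], dtype=float)
print(df)
```

Leaving out index and columns gives the default labels 0, 1, 2, … on both axes.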

Creating an Empty Data frame
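
The slide's own code is not preserved in this copy. A minimal sketch of one way to create an empty data frame, by calling DataFrame() with no arguments:

```python
import pandas as pd

df = pd.DataFrame()   # no data, no row labels, no column labels
print(df)
print(df.empty)       # True
print(df.shape)       # (0, 0)
```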


Creating Data frames using List

• data1 is a list passed as argument, together with a column label
• The index is generated by default (0 to length-1)

Create data frame using Nested List with Column label

Creating data frame using Nested List with column label and row label
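
The three list-based cases can be sketched as follows. The name data1 comes from the slide; its values, data2 and all labels are hypothetical:

```python
import pandas as pd

# 1. A plain list becomes a single column; the index defaults to 0..length-1.
data1 = [10, 20, 30]
df1 = pd.DataFrame(data1, columns=["Value"])

# 2. A nested list: each inner list is one row; columns names the columns.
data2 = [["Anu", 14], ["Ravi", 15], ["Mini", 13]]
df2 = pd.DataFrame(data2, columns=["Name", "Age"])

# 3. The same nested list with explicit row labels as well.
df3 = pd.DataFrame(data2, columns=["Name", "Age"], index=["I", "II", "III"])

print(df1, df2, df3, sep="\n\n")
```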

HOMEWORK
• Solved qns : 3, 17, 18, 27


Create data frame using Series

• Data frames are 2D representation of series.
• Series arranged in the form of rows and columns become a data frame.

Eg. Consider two series, st_marks (names and marks of students, created from a dictionary) and
st_age (names and ages of the same students, created from a dictionary). To create a dataframe
using these series:

• The index of the series (the keys of the dictionaries used to create the
series) will appear as the row index of the data frame.
• The keys of the dictionary used to create the data frame (eg: df)
appear as the column index / names of the dataframe.
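
A minimal sketch of this example. The series names st_marks and st_age come from the slide; the student names, values and the column names Marks and Age are hypothetical:

```python
import pandas as pd

# Each series is built from a dictionary: the keys become the series index.
st_marks = pd.Series({"Anu": 85, "Ravi": 72, "Mini": 90})
st_age = pd.Series({"Anu": 14, "Ravi": 15, "Mini": 13})

# Keys of this dictionary become the column names of the data frame;
# the shared series index becomes the row index.
df = pd.DataFrame({"Marks": st_marks, "Age": st_age})
print(df)
```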


Create data frame using dictionary of series [ series created from a dictionary ]

• Index of series will appear as the row index of the data frame.

Create data frame using dictionary of Series [ series created from a list ]

• Index of series will appear as the row index of the data frame.


Create data frame using dictionary with values as list

• A dictionary has items of the form (key : value).
• A value can be of any type - a list, a series, a dictionary etc.

• The keys of the dictionary become the column index.
• The row index is assigned automatically (0 onwards).
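
A minimal sketch of a dictionary whose values are lists; the names and ages are hypothetical sample data:

```python
import pandas as pd

# Every key is a column name; every list holds that column's values.
data = {"Name": ["Anu", "Ravi", "Mini"], "Age": [14, 15, 13]}

df = pd.DataFrame(data)
print(df)   # row index is 0, 1, 2
```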

Write a program to create a dataframe which contains the names and marks in
3 subjects of 4 students, using a dictionary. The row index is their Rollno.

Sorting Data in Dataframes – using sort_values() method

• The sort_values() method of a Dataframe helps us to display the data in sorted form.
• Syntax : <Dataframe>.sort_values(by = [column name], ascending = True/False)
• The by parameter defines the column based on which the data is to be sorted
• By default, data is sorted in ascending order.
• To sort the data in descending order, specify ascending = False

Example :

Using sort_values() method
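
The slide's example is not preserved in this copy. A minimal sketch with hypothetical names and marks:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Ravi", "Mini"], "Mark": [85, 72, 90]})

ascending_df = df.sort_values(by=["Mark"])                     # lowest mark first
descending_df = df.sort_values(by=["Mark"], ascending=False)   # highest mark first

print(ascending_df)
print(descending_df)
```

sort_values() returns a new data frame here; df itself keeps its original order.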


Create data frame using dictionary with different index

• We can specify our own index by using the index argument
• The number of index labels must match the number of rows, otherwise pandas
raises a ValueError.

Create data frame using list of dictionaries

• NaN is automatically added in missing places.
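
A minimal sketch of both cases. The name df2 appears on the slide; df3, records and all data values are hypothetical:

```python
import pandas as pd

# Dictionary of lists with our own row labels.
df2 = pd.DataFrame({"Name": ["Anu", "Ravi", "Mini"], "Age": [14, 15, 13]},
                   index=["I", "II", "III"])

# Too few index labels for the data: pandas raises ValueError.
mismatch_rejected = False
try:
    pd.DataFrame({"Name": ["Anu", "Ravi"]}, index=["I"])
except ValueError as err:
    mismatch_rejected = True
    print("Index length mismatch:", err)

# List of dictionaries: each dictionary is one row, each key is a column.
# The second row has no "Age" key, so that cell is filled with NaN.
records = [{"Name": "Anu", "Age": 14}, {"Name": "Ravi"}]
df3 = pd.DataFrame(records)
print(df2, df3, sep="\n\n")
```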


HOMEWORK
• Solved Questions: Q 4, 16

CHANGING THE INDEX COLUMN

• By default, the index is 0, 1, 2, …, length-1
• Pandas allows us to set a column of the dataframe as the index column
• This is done using the set_index() method
Syntax : <df>.set_index(<column name>, inplace = True)

RESET THE INDEX COLUMN

• If a column has been set as the index of a dataframe and we want to reset the index
to the default values 0, 1, 2, …, we can use the reset_index() method. The old index
goes back to being an ordinary column.
Syntax : <df>.reset_index(inplace = True)
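
A minimal sketch of set_index() followed by reset_index(); the Rollno and Name data are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"Rollno": [101, 102, 103], "Name": ["Anu", "Ravi", "Mini"]})

df.set_index("Rollno", inplace=True)     # Rollno is now the row index
labels_after_set = list(df.index)
columns_after_set = list(df.columns)
print(df)

df.reset_index(inplace=True)             # back to 0, 1, 2; Rollno is a column again
print(df)
```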

Attributes of a Dataframe
• Data frames have some properties known as attributes.
• We can access an attribute of a dataframe using the dot (.) operator

• Syntax:
<Data frame object>.<attribute name>

• Eg:
df.index
df.columns

1. index:
Returns the index (row labels) of the dataframe. By default, it is 0, 1, 2, …
We can use alternate names / labels as the index

2. columns:
Returns the column names of the dataframe

3. axes:
Returns both the index and the column names of the dataframe

4. dtypes:
Returns the data type of each column in the dataframe

5. size:
Returns the number of elements in the dataframe, which is the product of the no. of rows and columns

6. shape:
Returns the shape of the dataframe as a tuple (no. of rows, no. of columns)

7. ndim:
Returns the number of dimensions of the dataframe (2 for a dataframe)

8. empty:
Returns a boolean value – True if the dataframe is completely empty,
otherwise False

9. count():
count() is a method, not an attribute, so it is called with parentheses.
count() or count(0) or count(axis='index') displays the no. of non-missing
items in each column (default)
count(1) or count(axis='columns') displays the no. of non-missing
items in each row

10. T (Transpose):
Allows us to transpose the dataframe, i.e., rows become columns and
columns become rows
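
The attributes listed above can be tried on a small data frame. This is a sketch with hypothetical data; one age is left missing so that count() has something to skip:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Ravi", "Mini"], "Age": [14, 15, np.nan]},
                  index=["I", "II", "III"])

print(df.index)      # row labels
print(df.columns)    # column names
print(df.axes)       # [row labels, column names]
print(df.dtypes)     # data type of each column
print(df.size)       # 6 = 3 rows x 2 columns
print(df.shape)      # (3, 2)
print(df.ndim)       # 2
print(df.empty)      # False

per_column = df.count()                 # non-missing items in each column
per_row = df.count(axis="columns")      # non-missing items in each row
transposed = df.T                       # rows and columns swapped
print(per_column, per_row, transposed, sep="\n\n")
```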


Accessing data from data frame

• To access the column ‘Name’ of dataframe df
• To access more than one column
• To access the data in a specified row of a column
• To access the data in a range of rows of a column
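
The code for these four cases is not preserved in this copy. A minimal sketch, assuming a data frame with the default index 0, 1, 2 and hypothetical columns Name, Age and Mark:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Ravi", "Mini"],
                   "Age": [14, 15, 13],
                   "Mark": [85, 72, 90]})

names = df["Name"]                  # one column (a series)
name_mark = df[["Name", "Mark"]]    # several columns (a data frame)
second_name = df["Name"][1]         # row with index 1 of column Name
first_two = df["Name"][0:2]         # rows 0 and 1 of column Name (stop value excluded)

print(names, name_mark, second_name, first_two, sep="\n\n")
```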

Accessing data from data frame using loc[ ]

• df.loc[ ] allows us to access the data in a specified range of rows & columns
by label (inclusive of the stop value)
• : as the 2nd argument means all columns
• : as the 1st argument means all rows
• To access the data in a single row
• Prints contents of all rows and columns
• To access the data in a single column
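
A minimal sketch of these loc[ ] cases, assuming hypothetical row labels I, II, III and columns Name, Age, Mark:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Ravi", "Mini"],
                   "Age": [14, 15, 13],
                   "Mark": [85, 72, 90]},
                  index=["I", "II", "III"])

block = df.loc["I":"II", "Name":"Age"]   # label ranges; stop labels are included
one_row = df.loc["II", :]                # single row, all columns
one_column = df.loc[:, "Age"]            # all rows, single column
everything = df.loc[:, :]                # all rows and all columns

print(block, one_row, one_column, everything, sep="\n\n")
```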


Accessing data from data frame using iloc[ ]

Df.iloc[

Using iloc[ ]
• To access the data in a single row
• To access the data in a single column
• To access the data in rows with index 1 & 2 (stop value 3 is not included)
• To access the data in rows with index 0 & 1 and columns with index 1 & 2
• To access the data in columns with index 1 & 2
• Prints contents from row index 1 to 2 and column index 1 to 2
• Prints contents of all rows and columns
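
A minimal sketch of these iloc[ ] cases on a hypothetical 3 x 3 data frame. Positions start at 0 and the stop value of a slice is excluded:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Ravi", "Mini"],
                   "Age": [14, 15, 13],
                   "Mark": [85, 72, 90]},
                  index=["I", "II", "III"])

single_row = df.iloc[1]            # row at position 1
single_column = df.iloc[:, 1]      # column at position 1
rows_1_2 = df.iloc[1:3]            # rows 1 and 2 (3 is not included)
top_right = df.iloc[0:2, 1:3]      # rows 0 and 1, columns 1 and 2
cols_1_2 = df.iloc[:, 1:3]         # all rows, columns 1 and 2
lower_right = df.iloc[1:3, 1:3]    # rows 1 to 2, columns 1 to 2
everything = df.iloc[:, :]         # all rows and columns

print(single_row, rows_1_2, top_right, sep="\n\n")
```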


Accessing data from data frame using at[] and iat[]


• df.at[ ] is used to access a single value corresponding to the given row & column labels

• df.iat[ ] is used to access a single value corresponding to the numeric index of row & column
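
A minimal sketch comparing the two accessors on a hypothetical data frame:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Ravi", "Mini"], "Age": [14, 15, 13]},
                  index=["I", "II", "III"])

by_label = df.at["II", "Age"]    # row label II, column label Age
by_position = df.iat[1, 1]       # row position 1, column position 1 (same cell)

print(by_label, by_position)
```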

Adding a new column to data frame & fill with same value
Consider a program which creates a data frame with name and
age as columns and row index as I, II, III

A new column mark is added to the dataframe, with all students’
marks as 100

Adding a new column filled with different values

• First adds a column Marks with values 80, 95, 88
• Changes the values of column Marks to 100, 90, 95
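
The programs for these two slides are not preserved in this copy. A minimal sketch, assuming hypothetical names and ages for the three students:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Ravi", "Mini"], "Age": [14, 15, 13]},
                  index=["I", "II", "III"])

# A single value is repeated for every row.
df["Mark"] = 100

# A list gives one value per row; the column is created if it does not exist.
df["Marks"] = [80, 95, 88]
first_marks = list(df["Marks"])

# Assigning to an existing column replaces its values.
df["Marks"] = [100, 90, 95]
print(df)
```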

Insert a new column - insert() function

• By using the insert() function, a new column can be inserted into the existing
dataframe at any position / column index.
• Syntax : df.insert(n, new column name, [data])
• n → index of the column where the new column is to be inserted
• [data] → list of values to be added to the new column
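
A minimal sketch of insert() on a hypothetical data frame. The new column is placed at column index 1, between Name and Age:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Ravi", "Mini"], "Age": [14, 15, 13]})

# insert() changes df itself and returns None.
df.insert(1, "Mark", [80, 95, 88])
print(df)
```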


head() and tail() functions of Dataframe

• head() function is used to get the first n rows of a dataframe
• Syntax : df.head(n) where n is the no. of rows to be extracted.
• By default, it displays the first 5 rows

• tail() function is used to get the last n rows of a dataframe
• Syntax : df.tail(n) where n is the no. of rows to be extracted.
• By default, it displays the last 5 rows
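
A minimal sketch using a hypothetical ten-row data frame:

```python
import pandas as pd

df = pd.DataFrame({"n": list(range(1, 11))})   # values 1 to 10

first_five = df.head()     # default: first 5 rows
first_three = df.head(3)
last_five = df.tail()      # default: last 5 rows
last_two = df.tail(2)

print(first_three)
print(last_two)
```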


Adding a new row to a data frame


Consider this dataframe, df

To add a new row IV with Name Arun, Age 12 and Mark 92
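
The original data frame is not preserved in this copy. A minimal sketch, assuming df has columns Name, Age and Mark and hypothetical rows I to III; assigning to a new label with loc[ ] is one common way to add the row:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Ravi", "Mini"],
                   "Age": [14, 15, 13],
                   "Mark": [85, 72, 90]},
                  index=["I", "II", "III"])

# IV is not an existing label, so loc[] appends a new row.
df.loc["IV"] = ["Arun", 12, 92]
print(df)
```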


Change or modify a single data value


Consider this dataframe, df

To change Ravi’s age to 14
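
A minimal sketch, assuming Ravi is stored in the row labelled II (the slide's data frame is not preserved here, so the other values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Ravi", "Mini"], "Age": [14, 15, 13]},
                  index=["I", "II", "III"])

# at[] addresses one cell by row label and column label.
df.at["II", "Age"] = 14
print(df)
```

If the row label is not known, a condition such as df.loc[df["Name"] == "Ravi", "Age"] = 14 selects the row by its value instead.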

Deleting columns in DataFrame

Consider this dataframe, df
To delete column Mark
Method 1: Using pop()
Method 2: Using del statement
Method 3: Using drop()

axis = 1 represents columns, and hence the column with the specified name will be deleted
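
A minimal sketch of the three methods. The data is hypothetical, and each method works on its own copy so that all three can be shown side by side:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Ravi", "Mini"],
                   "Age": [14, 15, 13],
                   "Mark": [85, 72, 90]})

# Method 1: pop() removes the column in place and returns it as a series.
df1 = df.copy()
popped = df1.pop("Mark")

# Method 2: del removes the column in place and returns nothing.
df2 = df.copy()
del df2["Mark"]

# Method 3: drop() returns a new data frame; df is unchanged unless inplace=True.
df3 = df.drop("Mark", axis=1)

print(df1, df2, df3, sep="\n\n")
```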


Deleting rows in DataFrame

Consider this dataframe, df
To delete the row with index ‘II’

• As the value of the axis parameter is not specified, its default value is taken,
which is 0.
• axis = 0 represents rows, and hence the row with the specified label will be
deleted
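
A minimal sketch of deleting row II with drop(); the data is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Ravi", "Mini"], "Age": [14, 15, 13]},
                  index=["I", "II", "III"])

# axis is not given, so the default axis=0 (rows) applies.
remaining = df.drop("II")
print(remaining)
```

drop() returns a new data frame; pass inplace=True to change df itself.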

RENAMING COLUMN LABEL

Method 1: Using rename function
This method is useful when we need to rename only selected columns, because we
need to specify information only for the columns which are to be renamed.
Syntax: <DF>.rename(columns = {old1 : new1, old2 : new2, …}, inplace = True)
inplace is a Boolean parameter, by default False.
If True, the changes are made in the current dataframe.
If False, it returns a new dataframe with renamed column labels and the existing
dataframe remains unchanged

Method 2: Using columns attribute of dataframe
Syntax: <df>.columns = [new names]
The list must contain one name for every column of the dataframe.
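
A minimal sketch of both methods; the column names and data are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Ravi"], "Age": [14, 15], "Mark": [85, 72]})

# Method 1, inplace=False (default): a renamed copy is returned.
renamed_copy = df.rename(columns={"Name": "Student"})
columns_before = list(df.columns)        # df itself is unchanged so far

# Method 1, inplace=True: df itself is changed.
df.rename(columns={"Mark": "Score"}, inplace=True)
columns_after_inplace = list(df.columns)

# Method 2: replace all column names at once.
df.columns = ["N", "A", "S"]
print(df)
```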


RENAMING ROW INDEX

Method 1: Using rename function
This method is useful when we need to rename only selected row indexes,
because we need to specify information only for the rows which are to be
renamed.
Syntax : <DF>.rename(index = {old1 : new1, old2 : new2}, inplace = True)
inplace is a Boolean parameter, by default False.
If True, the changes are made in the current dataframe.
If False, it returns a new dataframe with renamed row labels and the existing
dataframe remains unchanged

Method 2: Using index attribute of the dataframe
Syntax : <DF>.index = [new names]
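
A minimal sketch of both methods; the row labels and data are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Ravi", "Mini"]}, index=["I", "II", "III"])

# Method 1, default inplace=False: a renamed copy is returned.
renamed_copy = df.rename(index={"I": "R1"})

# Method 1, inplace=True: df itself is changed.
df.rename(index={"II": "R2"}, inplace=True)
labels_after_inplace = list(df.index)

# Method 2: replace all row labels at once (one label per row).
df.index = ["A", "B", "C"]
print(df)
```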

Indexing in Pandas means selecting particular rows and columns of data from a DataFrame. That can be
selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each
of the rows and columns. Indexing is also known as Subset Selection.

Indexing a Dataframe using indexing operator []

• The indexing operator refers to the square brackets following an object.
• The .loc and .iloc indexers also use the indexing operator to make selections

Selecting a single row

To select a single row using .loc[], pass a single row label to .loc[].
Dataframe.loc["row1"] Eg: df.loc["Rollno1"]
(Writing the label inside a list, Dataframe.loc[["row1"]], returns a one-row dataframe instead of a series.)

Selecting two rows and three columns

Dataframe.loc[["row1", "row2"], ["column1", "column2", "column3"]]
Eg: df.loc[["Rollno1", "Rollno2"], ["Age", "College", "Salary"]]

Selecting all of the rows and some columns using [:]

Dataframe.loc[:, ["column1", "column2", "column3"]]
Eg: df.loc[:, ["Age", "College", "Salary"]]

Selecting some rows and all the columns using [:]

Dataframe.loc[["row1", "row2", "row3"], :]
Eg: df.loc[["r1", "r2", "r3"], :]


Indexing a DataFrame using iloc[]

This indexer allows us to retrieve rows and columns by position. The df.iloc indexer is very similar
to df.loc but uses only integer positions (counted from 0) to make its selections.

Selecting a single row

To select a single row using .iloc[], pass a single integer to .iloc[]. Eg: df.iloc[3]

Selecting multiple rows

To select multiple rows, pass a list of integers to .iloc[].
Eg: df.iloc[[3, 5, 7]] to select the rows at positions 3, 5 and 7 (the 4th, 6th and 8th rows)

Selecting two rows and two columns

To select two rows and two columns, pass a list of integers for the rows and a list of integers for the columns to
.iloc[]. Eg: df.iloc[[3, 4], [1, 2]] to select the rows at positions 3 and 4 and the columns at positions 1 and 2

Selecting all the rows and some columns

Use a single colon [:] to select all of the rows and, for the columns, pass a list of integers to
.iloc[]. Eg: df.iloc[:, [1, 2]] to select all rows and 2 columns

Selecting all the columns and some rows

Use a single colon [:] to select all of the columns and, for the rows, pass a list of integers to
.iloc[]. Eg: df.iloc[[1, 2], :] to select 2 rows and all columns

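Concatenation – append()
• append() adds the rows of one dataframe to the end of another dataframe.
• DataFrame.append() was removed in pandas 2.0; pandas.concat() gives the same result in current versions.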
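
The slide's example is not preserved in this copy. A minimal sketch of row-wise concatenation using pandas.concat(), which works in both older and current pandas versions; the two data frames are hypothetical:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Anu", "Ravi"], "Age": [14, 15]})
df2 = pd.DataFrame({"Name": ["Mini"], "Age": [13]})

# Rows of df2 are placed after the rows of df1.
# ignore_index=True renumbers the rows 0, 1, 2 instead of keeping 0, 1, 0.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```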


Binary Operations
add(), sub(), mul() and div() perform the basic mathematical
operations of addition, subtraction, multiplication and division
on two dataframes

radd(), rsub(), rmul() and rdiv() perform the corresponding right
side mathematical operations

For example:
df1.sub(df2) means df1 – df2
df1.rsub(df2) means df2 – df1
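
A minimal sketch of these methods on two small hypothetical data frames with matching row and column labels:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [10, 20], "B": [30, 40]})
df2 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

total = df1.add(df2)          # df1 + df2
difference = df1.sub(df2)     # df1 - df2
product = df1.mul(df2)        # df1 * df2
quotient = df1.div(df2)       # df1 / df2
reverse_diff = df1.rsub(df2)  # df2 - df1

print(difference)
print(reverse_diff)
```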

Binary Operations - Example


Matching and Broadcasting Operations


Missing Data & Filling values


- Using fillna() method
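
The slide's example is not preserved in this copy. A minimal sketch of fillna() on a hypothetical data frame with two missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Mark": [80, np.nan, 95], "Age": [14, 15, np.nan]})

# Replace every missing value with the same number.
filled_zero = df.fillna(0)

# Replace missing values with a different number for each column.
filled_per_column = df.fillna({"Mark": 50, "Age": 13})

print(filled_zero)
print(filled_per_column)
```

fillna() returns a new data frame here; df still contains its NaN values.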

Iteration on Rows and Columns

• If we want to access records or data from a Data frame row wise or
column wise, iteration is used.
• Pandas provides 2 functions to perform iterations:
1. iterrows()
2. iteritems() (called items() in current pandas; iteritems() was removed in pandas 2.0)

iterrows()
• It is used to access the data row wise.
• iterrows() iterates over the rows of the dataframe, giving each row as a pair (row label, series).

iteritems()
• It is used to access the data column wise.
• iteritems() iterates over the columns of the dataframe, giving each column as a pair (column name, series).
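
A minimal sketch of both kinds of iteration on a hypothetical data frame. It uses items(), which is available in old and new pandas versions, in place of iteritems():

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Ravi"], "Age": [14, 15]}, index=["I", "II"])

# Row wise: each step gives the row label and the row as a series.
row_labels = []
for label, row in df.iterrows():
    row_labels.append(label)
    print(label, row["Name"], row["Age"])

# Column wise: each step gives the column name and the column as a series.
column_names = []
for name, column in df.items():
    column_names.append(name)
    print(name, list(column))
```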


Boolean Indexing in Dataframe

• Boolean indexing is a type of indexing in which subsets of data in the
DataFrame are selected using boolean (True / False) values, usually obtained
by testing the actual data values
• Selection of data is not based on row/column labels or integer locations.
• A boolean vector is used to filter the data from the dataframe.
• To access a dataframe with a boolean index, we first have to create a
dataframe whose index holds boolean values, that is True or False.
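
A minimal sketch of filtering with a boolean vector. The data is hypothetical, and the sketch covers only the filtering case, not a data frame whose index itself is boolean:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Ravi", "Mini"], "Mark": [85, 72, 90]})

# A comparison produces a boolean vector, one True/False per row.
mask = df["Mark"] > 80
high_scorers = df[mask]            # keeps only the rows where mask is True

# A hand-written boolean list of the same length works the same way.
first_and_last = df[[True, False, True]]

print(mask)
print(high_scorers)
```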


CSV file
• CSV (Comma Separated Values) is a simple file format used to store tabular
data, such as a spreadsheet or a database.
• It stores tabular data in plain text
• Each line is a record
• Each record consists of one or more fields, separated by commas.
• To work with CSV files in Python, there is a module called csv.

Creating a CSV file

• A CSV file can be created using any text editor
• It can also be created using Microsoft Excel.

Method:
• Start Excel and enter the data in it.
• Save the file with file type as CSV (comma delimited) (*.csv)
OR
• Create a text file in notepad, with values separated by commas.
• Save the file as .csv

Create a DataFrame from .csv file

• read_csv() converts a simple csv file, without formatting, to a dataframe
• Enter comma separated values in notepad
• Save it with extension .csv (eg: TestingDF.csv)
• If the CSV file and the python program are in the same
folder, only the filename is needed
• Otherwise the complete path of the file should be given
• In the file path use \\ or / instead of \
• Eg: C:\\NCERT\\TestingDF.csv
OR C:/NCERT/TestingDF.csv


Creating a dataframe from CSV / Importing data to Dataframe from CSV
• The function read_csv() can be used to read the CSV file, if the file path is known.
• read_csv() loads the data from the csv file into a Pandas dataframe
• Missing values in the csv file are treated as NaN (Not a Number)
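
A minimal sketch of read_csv(). To stay self-contained it reads hypothetical CSV text through io.StringIO, which stands in for a file on disk; with a real file, the path string is passed instead:

```python
import io
import pandas as pd

# Hypothetical file contents; Ravi's age is left blank.
csv_text = "Name,Age,Mark\nAnu,14,85\nRavi,,72\n"

df = pd.read_csv(io.StringIO(csv_text))   # with a file: pd.read_csv("TestingDF.csv")
print(df)                                 # the blank age appears as NaN
```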

Creating a CSV from Dataframe / Exporting data from Dataframe to CSV
• To export your Pandas DataFrame to a CSV file:
df.to_csv('Path of CSV file', index = False)
• It creates a CSV file in the specified path and stores the
data from the dataframe in it
• To include the index in the file, remove index = False
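
A minimal sketch of to_csv(). The data is hypothetical, and the file is written to a temporary folder so the example can run anywhere:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Ravi"], "Age": [14, 15]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "TestingDF.csv")
    df.to_csv(path, index=False)          # row labels are not written

    with open(path) as f:
        header = f.readline().strip()     # first line of the file
    read_back = pd.read_csv(path)         # load the file again to check it

print(header)
print(read_back)
```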


DataFrame - REVISION
• Create dataframe using list
• Create dataframe using dictionary
• Create dataframe using series
• Attributes of DataFrame
    • index
    • columns
    • axes
    • empty
    • ndim
    • shape
    • size
    • dtypes
    • T
    • count()
• Accessing Data – loc[], iloc[], at[], iat[]
• Add new row
• Add new column - insert()
• Remove a row – drop()
• Remove a column – drop(), pop(), del
• set_index()
• reset_index()
• sort_values()
• rename()
• fillna()
• append()
• head()
• tail()
• Binary Operations – add(), sub(), mul(), div(), radd(), rsub(), rmul(), rdiv()
• iterrows() & iteritems()
• Create dataframe from csv - read_csv(), to_csv()
• Create dataframe from text files – read_table()

Create a DataFrame from .txt file

• Create a text file in notepad, with values separated by tabs.
• Save the file as .txt
• If the file path is c:/NCERT/newdata.txt, that path is passed to read_table()

• read_table() converts a simple text file, without formatting, to a data frame
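
A minimal sketch of read_table(), which splits on tabs by default. It reads hypothetical text through io.StringIO in place of a real file; with a real file, the path string is passed instead:

```python
import io
import pandas as pd

# Hypothetical tab-separated contents of a .txt file.
text = "Name\tAge\nAnu\t14\nRavi\t15\n"

df = pd.read_table(io.StringIO(text))   # with a file: pd.read_table("c:/NCERT/newdata.txt")
print(df)
```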

Creating a DataFrame from .txt file

• Enter TAB separated values in notepad
• Save it with extension .txt (eg: TestingDF.txt)
• If the text file and the python program are in the same
folder, only the filename is needed
• Otherwise the complete path of the file should be given
• In the file path use \\ or / instead of \
• Eg: C:\\NCERT\\TestingDF.txt
OR C:/NCERT/TestingDF.txt

Displaying the shape (no. of rows & columns) of a CSV file

• After the file is read into a dataframe, the total number of rows and columns
present in the table can be obtained by using the shape attribute (eg: df.shape)

HOMEWORK
• Pg. No. 1.78
Unsolved questions : Q.no. 32 to 37

Homework Answers Q: 30


HOMEWORK
• Pg. No. 1.78 & 1.79
Unsolved questions : Q.no. 38 & 43

HOMEWORK
• Pg. No. 1.78 & 1.79
Unsolved questions : Q.no. 39, 40, 41 & 42


HOMEWORK
• Solved Questions: Q 6, 13, 15, 19 & 20

Qn: 26 Pg. No. 2.54

Write a program to select the name and score columns from the following
dataframe. Sample python dictionary - exam_data and labels:

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
'Micheal', 'Mathew', 'Laura', 'Kevin', 'Jonas'], 'score': [12.5, 9, 16.5, np.nan, 9, 20,
14.5, np.nan, 8, 19], 'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1], 'qualify': ['yes', 'no', 'yes',
'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}

label = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

Qn: 27 Pg. No. 2.54


Write a program to select name and score columns in rows 1,3,5,6 from the previous dataframe

Qn: 28 Pg. No. 2.54


Select rows where number of attempts in the exam is greater than 2
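
The answers on these slides are not preserved in this copy. One possible solution sketch for Qn 26 to 28, using the exam_data dictionary and labels from Qn 26. It assumes "rows 1, 3, 5, 6" means integer positions counted from 0; the variable names q26, q27 and q28 are hypothetical:

```python
import numpy as np
import pandas as pd

exam_data = {
    "name": ["Anastasia", "Dima", "Katherine", "James", "Emily",
             "Micheal", "Mathew", "Laura", "Kevin", "Jonas"],
    "score": [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    "attempts": [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    "qualify": ["yes", "no", "yes", "no", "no", "yes", "yes", "no", "no", "yes"],
}
label = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
df = pd.DataFrame(exam_data, index=label)

# Qn 26: the name and score columns.
q26 = df[["name", "score"]]

# Qn 27: name and score (columns 0 and 1) in rows at positions 1, 3, 5, 6.
q27 = df.iloc[[1, 3, 5, 6], [0, 1]]

# Qn 28: rows where the number of attempts is greater than 2.
q28 = df[df["attempts"] > 2]

print(q26, q27, q28, sep="\n\n")
```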


HOMEWORK
• Pg. No. 1.73
Solved questions : Q. no. 26, 28
• Pg. No. 1.77
Unsolved questions : Q.no. 31

Home work :
Unsolved questions : Q.no. 29
