
Python For Data Science - Mr. Vinumon V S - Unit II 2


PYTHON FOR DATA SCIENCE


Unit II - Data Structures and Manipulation

Lists

Python lists are like the dynamically sized arrays of other languages (vector in C++ and ArrayList in Java). In simple terms, a Python list is a collection of items, enclosed in [ ] and separated by commas.
A list is a sequence data type used to store a collection of data. Tuples and strings are other sequence data types.
Example of list in Python
Here we are creating Python List using [].

Var = ["Python", "for", "Data Science"]
print(Var)

Output:
['Python', 'for', 'Data Science']
Lists are the simplest containers and an integral part of the Python language. A list need not be homogeneous, which makes it one of the most flexible tools in Python: a single list may contain integers, strings, and other objects. Lists are mutable, so they can be altered even after their creation.
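The two properties above, mixed element types and mutability, can be seen in a short sketch. The variable names here are illustrative and not part of the original notes.

```python
# A list may hold values of different types in the same container.
mixed = [1, "two", 3.0, (4, 5), [6, 7]]
item_types = [type(item).__name__ for item in mixed]
print(item_types)

# Lists are mutable: items can be replaced, added, or deleted in place.
mixed[1] = "TWO"      # replace the item at index 1
mixed.append(None)    # add a new item at the end
del mixed[0]          # delete the item at index 0
print(mixed)
```

The same name refers to the same list object throughout; only its contents change.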
Creating a List in Python
Lists in Python are created by placing a comma-separated sequence inside square brackets []. No built-in function is needed: [] alone creates an empty list, whereas an empty set must be created with set().
Note: Unlike sets, a list may contain mutable elements.

Example 1: Creating a list in Python

# Python program to demonstrate
# creation of a list

# Creating an empty list
List = []
print("Blank List: ")
print(List)

# Creating a list of numbers
List = [10, 20, 14]
print("\nList of numbers: ")
print(List)

# Creating a list of strings and
# accessing items using an index
List = ["Python", "For", "Data Science"]
print("\nList Items: ")
print(List[0])
print(List[2])

Output
Blank List:
[]

List of numbers:
[10, 20, 14]

List Items:
Python
Data Science
Complexities for Creating Lists
Time Complexity: O(n) for a list of n elements (O(1) for an empty list)
Space Complexity: O(n)

Example 2: Creating a list with multiple distinct or duplicate elements

A list may contain duplicate values, each at its own position, so distinct or duplicate values can be passed together as a sequence when the list is created.

# Creating a List with
# the use of Numbers
# (Having duplicate values)
List = [1, 2, 4, 4, 3, 3, 3, 6, 5]
print("\nList with the use of Numbers: ")
print(List)

# Creating a List with
# mixed type of values
# (Having numbers and strings)
List = [1, 2, 'Python', 4, 'For', 6, 'Data', 'Science']
print("\nList with the use of Mixed Values: ")
print(List)

Output
List with the use of Numbers:
[1, 2, 4, 4, 3, 3, 3, 6, 5]

List with the use of Mixed Values:
[1, 2, 'Python', 4, 'For', 6, 'Data', 'Science']

Accessing elements from the List


To access a list item, refer to its index number with the index operator [ ]. The index must be an integer, and indexing starts at 0. Nested lists are accessed using nested indexing.
Example 1: Accessing elements from list

# Python program to demonstrate
# accessing an element from a list

# Creating a List with
# the use of multiple values
List = ["Python", "For", "Data", "Science"]

# accessing an element from the
# list using index number
print("Accessing an element from the list")
print(List[0])
print(List[2])

Output
Accessing an element from the list
Python
Data
Example 2: Accessing elements from a multi-dimensional list

# Creating a Multi-Dimensional List
# (By Nesting a list inside a List)
List = [['PYTHON', 'For'], ['DATA']]

# accessing an element from the
# Multi-Dimensional List using
# index number
print("Accessing an element from a Multi-Dimensional list")
print(List[0][1])
print(List[1][0])

Output
Accessing an element from a Multi-Dimensional list
For
DATA

Negative indexing

In Python, negative indexes represent positions counted from the end of the list. Instead of computing the offset as in List[len(List)-3], it is enough to write List[-3]. Negative indexing starts from the end: -1 refers to the last item, -2 refers to the second-last item, and so on.

List = [1, 2, 'PYTHON', 4, 'For', 6, 'DATA']

# accessing an element using
# negative indexing
print("Accessing element using negative indexing")

# print the last element of list
print(List[-1])

# print the third last element of list
print(List[-3])

Output
Accessing element using negative indexing
DATA
For

Getting the size of Python list


The built-in len() function returns the number of items in a list.

# Creating an empty list
List1 = []
print(len(List1))

# Creating a List of numbers
List2 = [10, 20, 14]
print(len(List2))

Output
0
3

Taking Input of a Python List


List elements can be read from user input as strings, integers, floats, and so on. input() always returns a string, so the elements are strings unless they are converted.

Example 1:

# Python program to take space
# separated input as a string,
# split and store it to a list
# and print the string list

# input the list as string
string = input("Enter elements (Space-Separated): ")

# split the string and store the parts in a list
lst = string.split()
print('The list is:', lst)  # printing the list

Output:
Enter elements (Space-Separated): PYTHON FOR DATA
The list is: ['PYTHON', 'FOR', 'DATA']
Example 2:

# input size of the list
n = int(input("Enter the size of list : "))

# store integers in a list using map,
# split and strip functions, keeping at most n values
lst = list(map(int, input("Enter the integer elements: ").strip().split()))[:n]

# printing the list
print('The list is:', lst)

Output:
Enter the size of list : 4
Enter the integer elements: 6 3 9 10
The list is: [6, 3, 9, 10]


Adding Elements to a Python List

Method 1: Using append() method

Elements can be added to a list with the built-in append() method. append() adds exactly one element to the end of the list; to add several elements with append(), use a loop. Because a list can hold any object, a tuple or even another list can be appended, and it is stored as a single nested element. A set, by contrast, cannot contain a list, because lists are not hashable.

# Python program to demonstrate
# Addition of elements in a List

# Creating a List
List = []
print("Initial blank List: ")
print(List)

# Addition of Elements
# in the List
List.append(1)
List.append(2)
List.append(4)
print("\nList after Addition of Three elements: ")
print(List)

# Adding elements to the List
# using a loop
for i in range(1, 4):
    List.append(i)
print("\nList after Addition of elements from 1-3: ")
print(List)

# Adding Tuples to the List
List.append((5, 6))
print("\nList after Addition of a Tuple: ")
print(List)

# Addition of List to a List
List2 = ['For', 'Data']
List.append(List2)
print("\nList after Addition of a List: ")
print(List)

Output
Initial blank List:
[]

List after Addition of Three elements:
[1, 2, 4]

List after Addition of elements from 1-3:
[1, 2, 4, 1, 2, 3]

List after Addition of a Tuple:
[1, 2, 4, 1, 2, 3, (5, 6)]

List after Addition of a List:
[1, 2, 4, 1, 2, 3, (5, 6), ['For', 'Data']]

Method 2: Using insert() method

append() only adds elements at the end of the list. To add an element at a desired position, use insert(). Unlike append(), which takes only one argument, insert() requires two arguments (position, value).

# Python program to demonstrate
# Addition of elements in a List

# Creating a List
List = [1, 2, 3, 4]
print("Initial List: ")
print(List)

# Addition of Element at
# specific Position
# (using Insert Method)
List.insert(3, 12)
List.insert(0, 'PDS')
print("\nList after performing Insert Operation: ")
print(List)

Output
Initial List:
[1, 2, 3, 4]

List after performing Insert Operation:
['PDS', 1, 2, 3, 12, 4]

Method 3: Using extend() method

Besides append() and insert(), there is one more method for adding elements: extend(). It adds multiple elements at the same time to the end of the list.
Note: append() and extend() can only add elements at the end of the list.
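The difference between append() and extend() is easiest to see when both receive the same list argument. The following is a minimal sketch; the names base, appended, and extended are illustrative.

```python
base = [1, 2, 3]

appended = base.copy()
appended.append([4, 5])   # the whole list becomes one nested element

extended = base.copy()
extended.extend([4, 5])   # each item of the argument is added individually

print(appended)
print(extended)
print(len(appended), len(extended))
```

append() grows the list by exactly one element, while extend() grows it by the number of items in its argument.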

# Python program to demonstrate
# Addition of elements in a List

# Creating a List
List = [1, 2, 3, 4]
print("Initial List: ")
print(List)

# Addition of multiple elements
# to the List at the end
# (using Extend Method)
List.extend([8, 'Python', 'Always'])
print("\nList after performing Extend Operation: ")
print(List)

Output
Initial List:
[1, 2, 3, 4]

List after performing Extend Operation:
[1, 2, 3, 4, 8, 'Python', 'Always']

Reversing a List
Method 1: A list can be reversed by using the reverse() method in Python.

# Reversing a list
mylist = [1, 2, 3, 4, 5, 'Data', 'Python']
mylist.reverse()
print(mylist)

Output
['Python', 'Data', 5, 4, 3, 2, 1]

Method 2: Using the reversed() function:


The reversed() function returns a reverse iterator, which can be converted to a list using the
list() function.

my_list = [1, 2, 3, 4, 5]
reversed_list = list(reversed(my_list))
print(reversed_list)

Output
[5, 4, 3, 2, 1]
Removing Elements from the List

Method 1: Using remove() method

Elements can be removed from a list with the built-in remove() method, which removes the specified item. A ValueError is raised if the element does not exist in the list. remove() removes only one element at a time; to remove a range of elements, call it inside a loop.
Note: remove() removes only the first occurrence of the searched element.
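Both points above, first-occurrence removal and the error for a missing element, can be checked with a small sketch. The helper remove_if_present is hypothetical and written only for this illustration; it is not a built-in list method.

```python
values = [3, 7, 3, 9]

values.remove(3)          # removes only the first 3
print(values)             # the second 3 is still present


def remove_if_present(items, target):
    """Hypothetical helper: remove target if present and report success."""
    try:
        items.remove(target)
        return True
    except ValueError:    # raised by remove() when target is not in the list
        return False


found = remove_if_present(values, 7)
missing = remove_if_present(values, 100)
print(found, missing, values)
```

Catching ValueError is one option; testing with `target in items` before calling remove() is another.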
Example 1:

# Python program to demonstrate
# Removal of elements in a List

# Creating a List
List = [1, 2, 3, 4, 5, 6,
        7, 8, 9, 10, 11, 12]
print("Initial List: ")
print(List)

# Removing elements from List
# using remove() method
List.remove(5)
List.remove(6)
print("\nList after Removal of two elements: ")
print(List)

Output
Initial List:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

List after Removal of two elements:
[1, 2, 3, 4, 7, 8, 9, 10, 11, 12]
Example 2:

# Creating a List
List = [1, 2, 3, 4, 5, 6,
        7, 8, 9, 10, 11, 12]

# Removing elements from List
# using a loop
for i in range(1, 5):
    List.remove(i)
print("\nList after Removing a range of elements: ")
print(List)

Output
List after Removing a range of elements:
[5, 6, 7, 8, 9, 10, 11, 12]
Complexities for deleting elements from a list (remove() method):
Time Complexity: O(n)
Space Complexity: O(1)

Method 2: Using pop() method

The pop() method removes and returns an element from the list. By default it removes the last element; to remove the element at a specific position, pass its index as an argument to pop().

List = [1, 2, 3, 4, 5]

# Removing the last element from the
# List using the pop() method
List.pop()
print("\nList after popping an element: ")
print(List)

# Removing the element at a
# specific location from the
# List using the pop() method
List.pop(2)
print("\nList after popping a specific element: ")
print(List)

Output
List after popping an element:
[1, 2, 3, 4]

List after popping a specific element:
[1, 2, 4]
Complexities for deleting elements from a list (pop() method):
Time Complexity: O(1) for removing the last element, O(n) for removing the first or a middle element
Space Complexity: O(1)

Slicing of a List
A slice extracts a sublist from a list (or a substring from a string). There are several ways to print a whole list, but to print a specific range of elements we use the slice operation, which is written with a colon (:).
To print elements from the beginning up to, but not including, an index, use:
[:Index]
To print all elements except the last Index elements, use:
[:-Index]
To print elements from a specific index to the end, use:
[Index:]
To print the whole list in reverse order, use:
[::-1]
Note: To count positions from the rear end of the list, use negative indexes.

UNDERSTANDING SLICING OF LISTS:

For the list pr = [2, 3, 5, 7, 11, 13]:
• pr[0] accesses the first item, 2.
• pr[-4] accesses the fourth item from the end, 5.
• pr[2:] accesses [5, 7, 11, 13], a list of items from third to last.
• pr[:4] accesses [2, 3, 5, 7], a list of items from first to fourth.
• pr[2:4] accesses [5, 7], a list of items from third to fourth.
• pr[1::2] accesses [3, 7, 13], alternate items, starting from the second item.
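The slices listed above can be verified directly. This sketch assumes pr = [2, 3, 5, 7, 11, 13], which is the list that the listed results imply; the dictionary name slices is illustrative.

```python
pr = [2, 3, 5, 7, 11, 13]

# Each key is the slice expression as text, each value is what it evaluates to.
slices = {
    "pr[0]": pr[0],
    "pr[-4]": pr[-4],
    "pr[2:]": pr[2:],
    "pr[:4]": pr[:4],
    "pr[2:4]": pr[2:4],
    "pr[1::2]": pr[1::2],
}

for expression, value in slices.items():
    print(expression, "->", value)
```

Note that the stop index is excluded, which is why pr[2:4] contains only two items.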

# Python program to demonstrate
# slicing of a List

# Creating a List
List = ['P', 'Y', 'T', 'H', 'O', 'N',
        'F', 'O', 'R', 'D', 'A', 'T', 'A']
print("Initial List: ")
print(List)

# Print elements of a range
# using Slice operation
Sliced_List = List[3:8]
print("\nSlicing elements in a range 3-8: ")
print(Sliced_List)

# Print elements from a
# pre-defined point to end
Sliced_List = List[5:]
print("\nElements sliced from 5th "
      "element till the end: ")
print(Sliced_List)

# Printing elements from
# beginning till end
Sliced_List = List[:]
print("\nPrinting all elements using slice operation: ")
print(Sliced_List)

Output
Initial List:
['P', 'Y', 'T', 'H', 'O', 'N', 'F', 'O', 'R', 'D', 'A', 'T', 'A']
Slicing elements in a range 3-8:
['H', 'O', 'N', 'F', 'O']
Elements sliced from 5th element till the end:
['N', 'F', 'O', 'R', 'D', 'A', 'T', 'A']
Printing all elements using slice operation:
['P', 'Y', 'T', 'H', 'O', 'N', 'F', 'O', 'R', 'D', 'A', 'T', 'A']

Negative index List slicing

# Creating a List
List = ['P', 'Y', 'T', 'H', 'O', 'N',
        'F', 'O', 'R', 'D', 'A', 'T', 'A']
print("Initial List: ")
print(List)

# Print elements from beginning
# to a pre-defined point using Slice
Sliced_List = List[:-6]
print("\nElements sliced till 6th element from last: ")
print(Sliced_List)

# Print elements of a range
# using negative index List slicing
Sliced_List = List[-6:-1]
print("\nElements sliced from index -6 to -1")
print(Sliced_List)

# Printing elements in reverse
# using Slice operation
Sliced_List = List[::-1]
print("\nPrinting List in reverse: ")
print(Sliced_List)

Output
Initial List:
['P', 'Y', 'T', 'H', 'O', 'N', 'F', 'O', 'R', 'D', 'A', 'T', 'A']
Elements sliced till 6th element from last:
['P', 'Y', 'T', 'H', 'O', 'N', 'F']
Elements sliced from index -6 to -1
['O', 'R', 'D', 'A', 'T']
Printing List in reverse:
['A', 'T', 'A', 'D', 'R', 'O', 'F', 'N', 'O', 'H', 'T', 'Y', 'P']

List Comprehension
List comprehensions create new lists from other iterables such as tuples, strings, arrays, and lists. A list comprehension consists of brackets containing an expression, which is evaluated for each element, together with a for clause that iterates over the elements and an optional if clause that filters them.
Syntax:
newList = [ expression(element) for element in oldList if condition ]
Example:

# Python program to demonstrate list
# comprehension in Python

# below list contains square of all
# odd numbers from range 1 to 10
odd_square = [x ** 2 for x in range(1, 11) if x % 2 == 1]
print(odd_square)

Output
[1, 9, 25, 49, 81]
For better understanding, the above code is equivalent to the following:

# for understanding, the above generation is the same as
odd_square = []

for x in range(1, 11):
    if x % 2 == 1:
        odd_square.append(x ** 2)

print(odd_square)

Output
[1, 9, 25, 49, 81]

List Methods
Function     Description
append()     Adds an element to the end of the list
extend()     Adds all elements of an iterable to the end of the list
insert()     Inserts an item at the given index
remove()     Removes the first occurrence of an item from the list
clear()      Removes all items from the list
index()      Returns the index of the first matched item
count()      Returns the number of times the item passed as an argument occurs
sort()       Sorts the items of the list in ascending order
reverse()    Reverses the order of items in the list
copy()       Returns a shallow copy of the list
pop()        Removes and returns the item at the specified index. If no index is provided, it removes and returns the last item.

Apart from index(), count(), and copy(), the methods above modify the list itself.
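A few of the methods from the table can be tried together. This is a small sketch with illustrative names, showing which calls change the list in place and which only return a value.

```python
scores = [40, 10, 30, 10]
backup = scores.copy()              # independent shallow copy of the original order

scores.sort()                       # sorts in place and returns None
sorted_scores = scores.copy()       # snapshot of the sorted list

position_of_30 = scores.index(30)   # index of the first match
count_of_10 = scores.count(10)      # how many times 10 occurs

scores.clear()                      # empties the list in place
print(sorted_scores, position_of_30, count_of_10, scores, backup)
```

Because copy() returns a new list, backup keeps the original order even after scores is sorted and cleared.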

Built-in functions with List

Function       Description
reduce()       functools.reduce(): applies a two-argument function cumulatively to the list elements, keeps the intermediate result, and returns only the final value
sum()          Sums up the numbers in the list
ord()          Returns an integer representing the Unicode code point of the given Unicode character
cmp()          Python 2 only (removed in Python 3): returns 1 if the first list is "greater" than the second list
max()          Returns the maximum element of a given list
min()          Returns the minimum element of a given list
all()          Returns True if all elements are true or if the list is empty
any()          Returns True if any element of the list is true; returns False if the list is empty
len()          Returns the length (size) of the list
enumerate()    Returns an enumerate object that yields (index, item) pairs of the list
accumulate()   itertools.accumulate(): applies a two-argument function cumulatively to the list elements and returns an iterator over the intermediate results
filter()       Returns an iterator over the elements for which the given function returns true
map()          Returns an iterator of the results after applying the given function to each item of a given iterable
lambda         A keyword rather than a function: creates an anonymous function that can take any number of arguments but only one expression, which is evaluated and returned
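Several entries in the table come from standard-library modules rather than the built-in namespace, and in Python 3 map() and filter() return iterators. The sketch below assumes Python 3 and uses illustrative variable names.

```python
from functools import reduce
from itertools import accumulate

nums = [1, 2, 3, 4]

total = reduce(lambda a, b: a + b, nums)            # only the final result
running = list(accumulate(nums))                    # every intermediate sum
squares = list(map(lambda x: x * x, nums))          # function applied to each item
evens = list(filter(lambda x: x % 2 == 0, nums))    # items that pass the test
pairs = list(enumerate(nums))                       # (index, item) tuples

print(total, running, squares, evens, pairs)
print(all([]), any([]))                             # behaviour on an empty list
```

Wrapping the iterators in list() is only needed when the results must be stored or printed as a list.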

Python String

A string is a data structure in Python that represents a sequence of characters. It is an immutable data type, meaning that once you have created a string, you cannot change it. Python strings are used widely in many different applications, such as storing and manipulating text data and representing names, addresses, and other data that can be represented as text.
What is a String in Python?
Python does not have a character data type; a single character is simply a string with a length of 1. Let's see the Python string syntax:
Syntax of String Data Type in Python
string_variable = 'Hello, world!'
Example of string data type in Python

string_0 = "A Computer Science portal for Python"
print(string_0)
print(type(string_0))
Output:
A Computer Science portal for Python
<class 'str'>
String Data Type in Python

Creating a String in Python


Strings in Python can be created using single quotes, double quotes, or triple quotes.
Example:
In this example, we demonstrate different ways to create a Python string: with single quotes (' '), double quotes (" "), and triple quotes (''' ''' or """ """). Triple quotes can also be used to declare multiline strings.

# Creating a String
# with single Quotes
String1 = 'Welcome to the Python World'
print("String with the use of Single Quotes: ")
print(String1)

# Creating a String
# with double Quotes
String1 = "I'm a Geek"
print("\nString with the use of Double Quotes: ")
print(String1)

# Creating a String
# with triple Quotes
String1 = '''I'm a Geek and I live in a world of "Python"'''
print("\nString with the use of Triple Quotes: ")
print(String1)

# Creating String with triple
# Quotes allows multiple lines
String1 = '''Python
For
Life'''
print("\nCreating a multiline String: ")
print(String1)
Output:
String with the use of Single Quotes:
Welcome to the Python World
String with the use of Double Quotes:
I'm a Geek
String with the use of Triple Quotes:
I'm a Geek and I live in a world of "Python"
Creating a multiline String:
Python
For
Life
Accessing characters in Python String
Individual characters of a string can be accessed by indexing. Indexing also allows negative references to access characters from the back of the string: -1 refers to the last character, -2 refers to the second-last character, and so on.
Accessing an index outside the range raises an IndexError. Only integers can be used as an index; a float or any other type raises a TypeError.
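The two errors described above can be observed safely with try/except. The function safe_char is a hypothetical helper written for this sketch, not part of Python.

```python
text = "Python"


def safe_char(s, index):
    """Hypothetical helper: return s[index], or the name of the error raised."""
    try:
        return s[index]
    except IndexError:      # index is an integer but outside the valid range
        return "IndexError"
    except TypeError:       # index is not an integer (for example a float)
        return "TypeError"


last = safe_char(text, -1)
out_of_range = safe_char(text, 10)
wrong_type = safe_char(text, 1.5)
print(last, out_of_range, wrong_type)
```

Valid indexes for a string of length n run from -n to n-1.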

Python String syntax indexing

Example:
In this example, we define a string and access its characters using positive and negative indexing. Index 0 is the first character of the string, whereas index -1 is the last character.

# Python Program to Access
# characters of String

String1 = "PythonForDataScience"
print("Initial String: ")
print(String1)

# Printing First character
print("\nFirst character of String is: ")
print(String1[0])

# Printing Last character
print("\nLast character of String is: ")
print(String1[-1])
Output:
Initial String:
PythonForDataScience
First character of String is:
P
Last character of String is:
e
String Slicing
String slicing is used to access a range of characters in a string. It is done with the slicing operator, a colon (:). Keep in mind that the string returned by slicing includes the character at the start index but not the character at the end index.
Example:
In this example, we use string slicing to extract a substring of the original string. [3:12] means that the slice starts at index 3 and runs up to, but does not include, index 12. We can also use negative indexing in string slicing.

# Python Program to
# demonstrate String slicing

# Creating a String
String1 = "PythonForDataScience"
print("Initial String: ")
print(String1)

# Printing 3rd to 12th character
print("\nSlicing characters from 3-12: ")
print(String1[3:12])

# Printing characters between
# 3rd and 2nd last character
print("\nSlicing characters between " +
      "3rd and 2nd last character: ")
print(String1[3:-2])

Output:

Initial String:
PythonForDataScience
Slicing characters from 3-12:
honForDat
Slicing characters between 3rd and 2nd last character:
honForDataScien

Reversing a Python String


By accessing characters from a string, we can also reverse it. One way to reverse a string is with string slicing.
Example:
In this example, we reverse a string with the slice [::-1]. Leaving out the first two parts of the slice means that the whole string is considered, and the step of -1 walks through it from the last character to the first.

# Program to reverse a string
gfg = "PythonForDataScience"
print(gfg[::-1])
Output:
ecneicSataDroFnohtyP
Example:
We can also reverse a string by using built-in join and reversed functions, and passing the
string as the parameter to the reversed() function.

# Program to reverse a string

gfg = "PythonForDataScience"

# Reverse the string using reversed and join function


gfg = "".join(reversed(gfg))

print(gfg)
Output:
ecneicSataDroFnohtyP

Deleting/Updating from a String


In Python, updating or deleting individual characters of a string is not allowed. Strings are immutable, so item assignment and item deletion raise a TypeError. The entire string can still be deleted with the del keyword, and a new string can be assigned to the same name.
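The restriction on item assignment can be demonstrated without stopping the program by catching the error. This is a minimal sketch with illustrative names.

```python
word = "Hello"

try:
    word[0] = "J"               # item assignment is not supported for str
    outcome = "assignment worked"
except TypeError:
    outcome = "TypeError"

# Building a new string leaves the original unchanged.
replaced = "J" + word[1:]
print(outcome, word, replaced)
```

The original string is never modified; replaced is a separate string object.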
Updating a character
A character of a string can be updated by first converting the string into a list and then updating the element in the list. Because lists are mutable, we can change the character and then convert the list back into a string.
Another method uses string slicing: slice the string before the character you want to update, add the new character, and then add the rest of the string with another slice.
Example:
In this example, we use both the list method and the string-slicing method to update a character. We convert String1 to a list, change the value at a particular index, and then convert it back to a string using the join() method.
In the string-slicing method, we slice the string up to the character we want to update, concatenate the new character, and finally concatenate the remaining part of the string.

# Python Program to Update
# character of a String

String1 = "Hello, I'm a Geek"
print("Initial String: ")
print(String1)

# Updating a character of the String
# As Python strings are immutable, they don't support item updates directly.
# There are the following two ways:

# 1. Convert to a list, update, and join back
list1 = list(String1)
list1[2] = 'p'
String2 = ''.join(list1)
print("\nUpdating character at 2nd Index: ")
print(String2)

# 2. Rebuild the string with slicing
String3 = String1[0:2] + 'p' + String1[3:]
print(String3)
Output:
Initial String:
Hello, I'm a Geek
Updating character at 2nd Index:
Heplo, I'm a Geek
Heplo, I'm a Geek
Updating Entire String
Because Python strings are immutable, we cannot update an existing string in place. We can only assign a completely new value to the variable with the same name.
Example:
In this example, we first assign a value to String1 and then update it by assigning a completely different value. This simply changes what the name refers to.

# Python Program to Update
# entire String

String1 = "Hello, I'm a Geek"
print("Initial String: ")
print(String1)

# Updating a String
String1 = "Welcome to the Python World"
print("\nUpdated String: ")
print(String1)
Output:
Initial String:
Hello, I'm a Geek
Updated String:
Welcome to the Python World
Deleting a character
Python strings are immutable, which means we cannot delete a character from a string. Trying to delete a character with the del keyword raises an error.

# Python Program to delete
# character of a String

String1 = "Hello, I'm a Geek"
print("Initial String: ")
print(String1)

print("Deleting character at 2nd Index: ")
del String1[2]
print(String1)
Output:
Initial String:
Hello, I'm a Geek
Deleting character at 2nd Index:
Traceback (most recent call last):
File "e:\GFG\Python codes\Codes\demo.py", line 9, in <module>
del String1[2]
TypeError: 'str' object doesn't support item deletion
But using slicing we can remove the character from the original string and store the result in a
new string.
Example:
In this example, we slice the string up to the character that we want to delete and then concatenate the part of the string that follows the deleted character.

# Python Program to Delete
# characters from a String

String1 = "Hello, I'm a Geek"
print("Initial String: ")
print(String1)

# Deleting a character
# of the String
String2 = String1[0:2] + String1[3:]
print("\nDeleting character at 2nd Index: ")
print(String2)

Output:
Initial String:
Hello, I'm a Geek
Deleting character at 2nd Index:
Helo, I'm a Geek
Deleting Entire String
The entire string can be deleted with the del keyword. If we then try to print the string, Python raises a NameError because the name no longer exists.

# Python Program to Delete
# entire String

String1 = "Hello, I'm a Geek"
print("Initial String: ")
print(String1)

# Deleting a String
# with the use of del
del String1
print("\nDeleting entire String: ")
print(String1)
Error:
Traceback (most recent call last):
File "/home/e4b8f2170f140da99d2fe57d9d8c6a94.py", line 12, in <module>
print(String1)
NameError: name 'String1' is not defined
Escape Sequencing in Python
A string that contains both single and double quotes causes a SyntaxError if it is delimited by either kind of quote without escaping, because the matching quote inside the string ends it early. To write such a string, use either triple quotes or escape sequences.
Escape sequences start with a backslash, which tells Python to interpret the next character differently. If single quotes delimit a string, every single quote inside it must be escaped, and the same applies to double quotes.
Example:

# Python Program for
# Escape Sequencing
# of String

# Initial String
String1 = '''I'm a "Geek"'''
print("Initial String with use of Triple Quotes: ")
print(String1)

# Escaping Single Quote
String1 = 'I\'m a "Geek"'
print("\nEscaping Single Quote: ")
print(String1)

# Escaping Double Quotes
String1 = "I'm a \"Geek\""
print("\nEscaping Double Quotes: ")
print(String1)

# Printing Paths with the
# use of Escape Sequences
String1 = "C:\\Python\\Python\\"
print("\nEscaping Backslashes: ")
print(String1)

# Printing a String with the
# use of Tab
String1 = "Hi\tPython"
print("\nTab: ")
print(String1)

# Printing a String with the
# use of New Line
String1 = "Python\nPython"
print("\nNew Line: ")
print(String1)
Output:
Initial String with use of Triple Quotes:
I'm a "Geek"
Escaping Single Quote:
I'm a "Geek"
Escaping Double Quotes:
I'm a "Geek"
Escaping Backslashes:
C:\Python\Python\
Tab:
Hi Python
New Line:
Python
Python
Example:
To ignore the escape sequences in a string, prefix it with r or R. This marks the string as a raw string, in which escape sequences are left uninterpreted.

# Printing Hello in octal
String1 = "\110\145\154\154\157"
print("\nPrinting in Octal with the use of Escape Sequences: ")
print(String1)

# Using raw String to
# ignore Escape Sequences
String1 = r"This is \110\145\154\154\157"
print("\nPrinting Raw String in Octal Format: ")
print(String1)

# Printing Geeks in HEX
String1 = "This is \x47\x65\x65\x6b\x73 in \x48\x45\x58"
print("\nPrinting in HEX with the use of Escape Sequences: ")
print(String1)

# Using raw String to
# ignore Escape Sequences
String1 = r"This is \x47\x65\x65\x6b\x73 in \x48\x45\x58"
print("\nPrinting Raw String in HEX Format: ")
print(String1)
Output:
Printing in Octal with the use of Escape Sequences:
Hello
Printing Raw String in Octal Format:
This is \110\145\154\154\157
Printing in HEX with the use of Escape Sequences:
This is Geeks in HEX
Printing Raw String in HEX Format:
This is \x47\x65\x65\x6b\x73 in \x48\x45\x58
Formatting of Strings
Strings in Python can be formatted with the format() method, a versatile and powerful tool for formatting strings. The string contains curly braces {} as placeholders, which are filled by arguments selected by position or by keyword.
Example 1:
In this example, we declare strings that contain curly braces {} as placeholders and pass values to format() to see how the position of each argument matters.

# Python Program for
# Formatting of Strings

# Default order
String1 = "{} {} {}".format('Python', 'For', 'Life')
print("Print String in default order: ")
print(String1)

# Positional Formatting
String1 = "{1} {0} {2}".format('Python', 'For', 'Life')
print("\nPrint String in Positional order: ")
print(String1)

# Keyword Formatting
String1 = "{l} {f} {g}".format(g='Python', f='For', l='Life')
print("\nPrint String in order of Keywords: ")
print(String1)
Output:
Print String in default order:
Python For Life
Print String in Positional order:
For Python Life
Print String in order of Keywords:
Life For Python
Example 2:
Integers can be displayed in binary, hexadecimal and other bases, and floats can be rounded or
displayed in exponent form, with the use of format specifiers.

# Formatting of Integers
String1 = "{0:b}".format(16)
print("\nBinary representation of 16 is ")
print(String1)

# Formatting of Floats
String1 = "{0:e}".format(165.6458)
print("\nExponent representation of 165.6458 is ")
print(String1)

# Rounding off Floats
String1 = "{0:.2f}".format(1/6)
print("\none-sixth is : ")
print(String1)
Output:
Binary representation of 16 is
10000
Exponent representation of 165.6458 is
1.656458e+02
one-sixth is :
0.17
Example 3:
A string can be left-, right-, or center-aligned with the use of format specifiers, which follow
a colon (:) inside the placeholder. (<) indicates that the string should be aligned to the left,
(>) indicates that the string should be aligned to the right and (^) indicates that the
string should be aligned to the center. We can also specify the width of the field in which it
should be aligned. For example, (<10) means that the string should be aligned to the left within
a field 10 characters wide. A value longer than the field width is printed in full, without padding.

# String alignment
String1 = "|{:<10}|{:^10}|{:>10}|".format('Python', 'for', 'Python')
print("\nLeft, center and right alignment with Formatting: ")
print(String1)

# To demonstrate aligning of spaces
String1 = "\n{0:^16} was founded in {1:<4}!".format("PythonForDataScience", 2009)
print(String1)
Output:
Left, center and right alignment with Formatting:
|Python    |   for    |    Python|
PythonForDataScience was founded in 2009!
Example 4:
Old-style formatting uses the % operator instead of the format() method.

# Python Program for
# Old Style Formatting
# of Floats

Integer1 = 12.3456789
print("Formatting in 3.2f format: ")
print('The value of Integer1 is %3.2f' % Integer1)
print("\nFormatting in 3.4f format: ")
print('The value of Integer1 is %3.4f' % Integer1)
Output:
Formatting in 3.2f format:
The value of Integer1 is 12.35
Formatting in 3.4f format:
The value of Integer1 is 12.3457
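The three formatting styles can be compared side by side. The sketch below is illustrative only: the variable names are assumptions, and the f-string form requires Python 3.6 or later. All three lines apply the same 3.2f specifier to the same value.

```python
value = 12.3456789

# Old style, format() method, and f-string with the same specifier
old_style = 'The value is %3.2f' % value
format_method = 'The value is {:3.2f}'.format(value)
f_string = f'The value is {value:3.2f}'

print(old_style)
print(format_method)
print(f_string)
```

Each statement produces the same text, so the choice between them is mostly a matter of readability.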
Useful Python String Operations
 Logical Operators on String
 String Formatting using %
 String Template Class
 Split a string
 Python Docstrings
 String slicing
 Find all duplicate characters in string
 Reverse string in Python (5 different ways)
 Python program to check if a string is palindrome or not
Python String constants
Built-In Function        Description
string.ascii_letters     Concatenation of the ascii_lowercase and ascii_uppercase constants.
string.ascii_lowercase   The lowercase letters 'abcdefghijklmnopqrstuvwxyz'.
string.ascii_uppercase   The uppercase letters 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'.
string.digits            The string '0123456789'.
string.hexdigits         The string '0123456789abcdefABCDEF'.
string.letters           Concatenation of the strings lowercase and uppercase (Python 2 only; removed in Python 3).
string.lowercase         A string containing all lowercase letters (Python 2 only; removed in Python 3).
string.octdigits         The string '01234567'.
string.punctuation       String of ASCII characters which are considered punctuation characters.
string.printable         String of ASCII characters which are considered printable.
str.endswith()           Returns True if a string ends with the given suffix, otherwise returns False.
str.startswith()         Returns True if a string starts with the given prefix, otherwise returns False.
str.isdigit()            Returns True if all characters in the string are digits, otherwise returns False.
str.isalpha()            Returns True if all characters in the string are alphabetic, otherwise returns False.
str.isdecimal()          Returns True if all characters in a string are decimal characters.
str.format()             One of the string formatting methods in Python, which allows multiple substitutions and value formatting.
str.index()              Returns the position of the first occurrence of a substring in a string.
string.uppercase         A string containing all uppercase letters (Python 2 only; removed in Python 3).
string.whitespace        A string containing all characters that are considered whitespace.
str.swapcase()           Converts all uppercase characters to lowercase and vice versa, and returns the new string.
str.replace()            Returns a copy of the string where all occurrences of a substring are replaced with another substring.
Deprecated string functions
Note: the names below were deprecated as functions of the string module. In Python 3 the same
operations are called as methods on a string object, for example s.find(sub), while len(), max()
and min() are built-in functions.

Built-In Function     Description
str.isdecimal()       Returns True if all characters in a string are decimal.
str.isalnum()         Returns True if all the characters in a given string are alphanumeric.
str.istitle()         Returns True if the string is a title-cased string.
str.partition()       Splits the string at the first occurrence of the separator and returns a tuple.
str.isidentifier()    Checks whether a string is a valid identifier or not.
len()                 Returns the length of the string.
str.rindex()          Returns the highest index of the substring inside the string if the substring is found.
max()                 Returns the highest alphabetical character in a string.
min()                 Returns the minimum alphabetical character in a string.
str.splitlines()      Returns a list of lines in the string.
str.capitalize()      Returns the string with its first character capitalized.
str.expandtabs()      Expands tabs in a string, replacing them by one or more spaces.
str.find()            Returns the lowest index at which the substring is found, or -1 if it is not found.
str.rfind()           Returns the highest index at which the substring is found, or -1 if it is not found.
str.count()           Returns the number of (non-overlapping) occurrences of substring sub in the string.
str.lower()           Returns a copy of the string with upper case letters converted to lower case.
str.split()           Returns a list of the words of the string; if the optional argument sep is absent or None, the words are separated by whitespace.
str.rsplit()          Returns a list of the words of the string, scanning it from the end.
str.rpartition()      Splits the string at the last occurrence of the separator and returns a tuple of three parts.
string.splitfields    Returns a list of the words of the string when used with two arguments (Python 2 only).
str.join()            Concatenates a list or tuple of words with intervening occurrences of sep.
str.strip()           Returns a copy of the string with both leading and trailing white spaces removed.
str.lstrip()          Returns a copy of the string with leading white spaces removed.
str.rstrip()          Returns a copy of the string with trailing white spaces removed.
str.swapcase()        Converts lower case letters to upper case and vice versa.
str.translate()       Translates the characters using a translation table.
str.upper()           Returns a copy of the string with lower case letters converted to upper case.
str.ljust()           Left-justifies the string in a field of given width.
str.rjust()           Right-justifies the string in a field of given width.
str.center()          Center-justifies the string in a field of given width.
str.zfill()           Pads a numeric string on the left with zero digits until the given width is reached.
str.replace()         Returns a copy of the string with all occurrences of substring old replaced by new.
str.casefold()        Returns the string in lowercase, which can be used for caseless comparisons.
str.encode()          Encodes the string into any encoding supported by Python. The default encoding is utf-8.
str.maketrans()       Returns a translation table usable for str.translate().
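A few of the operations listed above, called as string methods, might look like the following. This is only a sketch; the sample text and variable names are assumptions chosen for illustration.

```python
s = "  Python for Data Science  "

clean = s.strip()            # remove leading and trailing whitespace
words = clean.split()        # split on whitespace

print(clean.upper())
print(clean.replace("Data", "Computer"))
print(clean.find("for"), clean.count("a"))
print("-".join(words))
print("42".zfill(5), "ab".center(6, "*"))
```

Every call returns a new value; the original string s is left unchanged.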

Use of String in Python


 Strings are extensively used for text processing tasks such as searching,
extracting, modifying, and formatting text data.
 Strings are used to read input from users via the standard input (stdin) or
command-line arguments and to display output using print statements.
 Strings are used to represent data in various formats, including JSON, XML, CSV,
and more. They are often manipulated to extract specific information or transform
data structures.
 Strings are used to read from and write to text files. They facilitate operations
such as reading the contents of a file, writing data to a file, and manipulating file
paths.
 Strings support a wide range of operations such as concatenation, slicing,
indexing, searching, replacing, and splitting. These operations enable developers
to manipulate and transform text efficiently.
Advantages of String in Python:
 Strings are widely used, for a broad range of tasks such as storing
and manipulating text data and representing names, addresses, and other types of data
that can be represented as text.
 Python has a rich set of string methods that allow you to manipulate and work
with strings in a variety of ways. These methods make it easy to perform common
tasks such as converting strings to uppercase or lowercase, replacing substrings,
and splitting strings into lists.
 Strings are immutable, meaning that once you have created a string, you cannot
change it. This can be beneficial in certain situations because it means that you
can be confident that the value of a string will not change unexpectedly.

 Python has built-in support for strings, which means that you do not need to
import any additional libraries or modules to work with strings. This makes it easy
to get started with strings and reduces the complexity of your code.
 Python has a concise syntax for creating and manipulating strings, which makes it
easy to write and read code that works with strings.
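The immutability mentioned in the list above can be observed directly. In this minimal sketch (variable names are illustrative), assigning to a single character fails, while string methods return new strings and leave the original untouched.

```python
text = "python"

try:
    text[0] = "P"              # strings do not support item assignment
except TypeError as exc:
    error_message = str(exc)

upper_text = text.upper()      # returns a new string object

print(error_message)
print(text, upper_text)
```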
Drawbacks of String in Python:
 When we are dealing with large text data, strings can be inefficient. Because strings
are immutable, every operation such as replacing substrings or splitting the string
creates new string objects, so performing a large number of operations can be slow and
consume a lot of resources.
 Strings can be difficult to work with when you need to represent complex data
structures, such as lists or dictionaries. In these cases, it may be more efficient to
use a different data type, such as a list or a dictionary, to represent the data.

Python Tuples

A Python tuple is a collection of objects, much like a list. The values stored in a tuple
can be of any type, and they are indexed by integers. Values of a
tuple are syntactically separated by ‘commas’. Although it is not necessary, it is more
common to define a tuple by enclosing the sequence of values in parentheses, which makes
tuples easier to recognize.
Creating a Tuple
In Python, tuples are created by placing a sequence of values separated by
‘comma’, with or without the use of parentheses for grouping the data sequence.
Note: Creation of a Python tuple without the use of parentheses is known as Tuple Packing.
Python program to demonstrate the creation of tuples.

# Creating an empty Tuple
Tuple1 = ()
print("Initial empty Tuple: ")
print(Tuple1)

# Creating a Tuple
# with the use of string
Tuple1 = ('Python', 'For')
print("\nTuple with the use of String: ")
print(Tuple1)

# Creating a Tuple with
# the use of list
list1 = [1, 2, 4, 5, 6]
print("\nTuple using List: ")
print(tuple(list1))

# Creating a Tuple
# with the use of built-in function
Tuple1 = tuple('Python')
print("\nTuple with the use of function: ")
print(Tuple1)
Output:
Initial empty Tuple:
()

Tuple with the use of String:
('Python', 'For')

Tuple using List:
(1, 2, 4, 5, 6)

Tuple with the use of function:
('P', 'y', 't', 'h', 'o', 'n')
Creating a Tuple with Mixed Datatypes.
Python Tuples can contain any number of elements and of any datatype (like strings,
integers, list, etc.). Tuples can also be created with a single element, but it is a bit tricky.
Having one element in the parentheses is not sufficient; there must be a trailing ‘comma’ to
make it a tuple.

# Creating a Tuple
# with Mixed Datatype
Tuple1 = (5, 'Welcome', 7, 'Python')
print("\nTuple with Mixed Datatypes: ")
print(Tuple1)

# Creating a Tuple
# with nested tuples
Tuple1 = (0, 1, 2, 3)
Tuple2 = ('python', 'geek')
Tuple3 = (Tuple1, Tuple2)
print("\nTuple with nested tuples: ")
print(Tuple3)

# Creating a Tuple
# with repetition
Tuple1 = ('Python',) * 3
print("\nTuple with repetition: ")
print(Tuple1)

# Creating a Tuple
# with the use of loop
Tuple1 = ('Python')
n = 5
print("\nTuple with a loop")
for i in range(int(n)):
    Tuple1 = (Tuple1,)
    print(Tuple1)

Output:

Tuple with Mixed Datatypes:
(5, 'Welcome', 7, 'Python')

Tuple with nested tuples:
((0, 1, 2, 3), ('python', 'geek'))

Tuple with repetition:
('Python', 'Python', 'Python')

Tuple with a loop
('Python',)
(('Python',),)
((('Python',),),)
(((('Python',),),),)
((((('Python',),),),),)
Time complexity: O(1)
Auxiliary Space : O(n)
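The trailing-comma rule and tuple packing described above are easy to get wrong, so a short check may help. This is a sketch with illustrative variable names.

```python
not_a_tuple = (5)       # parentheses alone only group the expression
single = (5,)           # the trailing comma makes a one-element tuple
packed = 1, 2, 3        # tuple packing without parentheses

print(type(not_a_tuple).__name__)
print(type(single).__name__, len(single))
print(packed)
```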
Python Tuple Operations
Here, below are the Python tuple operations.
 Accessing of Python Tuples
 Concatenation of Tuples
 Slicing of Tuple
 Deleting a Tuple
Accessing of Tuples
In Python Programming, Tuples are immutable, and usually, they contain a sequence of
heterogeneous elements that are accessed via unpacking or indexing (or even by attribute in
the case of named tuples). Lists are mutable, and their elements are usually homogeneous and
are accessed by iterating over the list.
Note: When unpacking a tuple, the number of variables on the left-hand side must be equal to
the number of values in the tuple.

# Accessing Tuple
# with Indexing
Tuple1 = tuple("Python")
print("\nFirst element of Tuple: ")
print(Tuple1[0])

# Tuple unpacking
Tuple1 = ("Python", "For", "Data Science")

# This line unpacks
# values of Tuple1
a, b, c = Tuple1
print("\nValues after unpacking: ")
print(a)
print(b)
print(c)
Output:

First element of Tuple:
P

Values after unpacking:
Python
For
Data Science
Time complexity: O(1)
Space complexity: O(1)
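Access by attribute, mentioned above for named tuples, can be sketched with collections.namedtuple from the standard library. The Point type and its field names are hypothetical and used only for illustration.

```python
from collections import namedtuple

# Point is a hypothetical record type with two named fields
Point = namedtuple("Point", ["x", "y"])
p = Point(x=3, y=4)

print(p[0], p.x)      # the same element by index and by attribute
x, y = p              # unpacking works as for any tuple
print(x + y)
```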
Concatenation of Tuples
Concatenation of tuples is the process of joining two or more Tuples. Concatenation is done
with the ‘+’ operator. The elements of the second tuple are always added at the end of the
first. Apart from repetition with the ‘*’ operator, other arithmetic operations do not apply to Tuples.
Note- Only the same datatypes can be combined with concatenation; an error arises if a list
and a tuple are combined.

# Concatenation of tuples
Tuple1 = (0, 1, 2, 3)
Tuple2 = ('Python', 'For', 'Python')

Tuple3 = Tuple1 + Tuple2

# Printing first Tuple
print("Tuple 1: ")
print(Tuple1)

# Printing Second Tuple
print("\nTuple2: ")
print(Tuple2)

# Printing Final Tuple
print("\nTuples after Concatenation: ")
print(Tuple3)
Output:
Tuple 1:
(0, 1, 2, 3)

Tuple2:
('Python', 'For', 'Python')

Tuples after Concatenation:
(0, 1, 2, 3, 'Python', 'For', 'Python')
Time Complexity: O(n + m), where n and m are the lengths of the two tuples
Auxiliary Space: O(n + m) for the new tuple
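The note about mixing a list and a tuple can be checked with a short sketch. The variable names are illustrative; converting the list with tuple() first is one possible way around the error.

```python
numbers = (0, 1, 2)
words = ["a", "b"]

try:
    combined = numbers + words            # tuple + list raises TypeError
except TypeError:
    combined = numbers + tuple(words)     # convert first, then concatenate

print(combined)
```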
Slicing of Tuple
Slicing of a Tuple is done to fetch a specific range or slice of sub-elements from a Tuple.
Slicing can also be applied to lists and arrays. Indexing fetches a single
element, whereas slicing fetches a sequence of elements.
Note- Negative increment values can also be used to reverse the sequence of a Tuple.

# Slicing of a Tuple
# with Numbers
Tuple1 = tuple('GEEKSFORGEEKS')

# Removing First element
print("Removal of First Element: ")
print(Tuple1[1:])

# Reversing the Tuple
print("\nTuple after sequence of Element is reversed: ")
print(Tuple1[::-1])

# Printing elements of a Range
print("\nPrinting elements between Range 4-9: ")
print(Tuple1[4:9])
Output:
Removal of First Element:
('E', 'E', 'K', 'S', 'F', 'O', 'R', 'G', 'E', 'E', 'K', 'S')

Tuple after sequence of Element is reversed:
('S', 'K', 'E', 'E', 'G', 'R', 'O', 'F', 'S', 'K', 'E', 'E', 'G')

Printing elements between Range 4-9:
('S', 'F', 'O', 'R', 'G')
Time complexity: O(k), where k is the length of the slice
Space complexity: O(k)
Deleting a Tuple
Tuples are immutable, so they do not allow deletion of a part of the tuple. The entire tuple
can be deleted with the del statement.
Note- Printing a Tuple after deletion results in an error.
# Deleting a Tuple

Tuple1 = (0, 1, 2, 3, 4)
del Tuple1

print(Tuple1)
Output
Traceback (most recent call last):
  File "/home/efa50fd0709dec08434191f32275928a.py", line 7, in <module>
    print(Tuple1)
NameError: name 'Tuple1' is not defined
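Deleting a single element, as opposed to the whole tuple, fails with a TypeError. The sketch below shows the error and one possible workaround, building a new tuple by slicing; the variable names are illustrative.

```python
Tuple1 = (0, 1, 2, 3, 4)

try:
    del Tuple1[0]                 # deleting one element is not allowed
except TypeError as exc:
    message = str(exc)
print(message)

Tuple1 = Tuple1[1:]               # build a new tuple without the first element
print(Tuple1)
```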
Built-In Methods
Built-in Method    Description
index()            Finds the given value in the tuple and returns the index of its first occurrence
count()            Returns the frequency of occurrence of a specified value

Built-In Functions
Built-in Function    Description
all()                Returns True if all elements are true or if the tuple is empty
any()                Returns True if any element of the tuple is true; if the tuple is empty, returns False
len()                Returns the length of the tuple or size of the tuple
enumerate()          Returns an enumerate object of the tuple
max()                Returns the maximum element of the given tuple
min()                Returns the minimum element of the given tuple
sum()                Sums up the numbers in the tuple
sorted()             Takes the elements of the tuple and returns a new sorted list
tuple()              Converts an iterable to a tuple.
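The methods and functions from the two tables above can be tried on a small tuple. The values below are arbitrary sample data chosen for illustration.

```python
scores = (7, 3, 9, 3)   # arbitrary sample values

print(len(scores), max(scores), min(scores), sum(scores))
print(sorted(scores))                      # a new list; the tuple is unchanged
print(scores.count(3), scores.index(9))    # tuple methods
print(all(scores), any(()))                # any() of an empty tuple is False

for position, value in enumerate(scores):
    print(position, value)
```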

Tuples VS Lists:
Similarities
 Functions that can be used for both lists and tuples: len(), max(), min(), sum(), any(), all(), sorted()
 Methods that can be used for both lists and tuples: count(), index()
 Tuples can be stored in lists.
 Lists can be stored in tuples.
 Both ‘tuples’ and ‘lists’ can be nested.
Differences
 Methods that cannot be used for tuples: append(), insert(), remove(), pop(), clear(), sort(), reverse()
 We generally use ‘tuples’ for heterogeneous (different) data types and ‘lists’ for homogeneous (similar) data types.
 Iterating through a ‘tuple’ is faster than in a ‘list’.
 ‘Lists’ are mutable whereas ‘tuples’ are immutable.
 Tuples that contain immutable elements can be used as a key for a dictionary.
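The last difference, using tuples as dictionary keys, can be sketched as follows. The grid dictionary and its contents are hypothetical; the second assignment fails because that tuple contains a list.

```python
# grid is a hypothetical mapping from (row, column) tuples to labels
grid = {(0, 0): "origin", (1, 2): "marker"}
print(grid[(1, 2)])

try:
    grid[(0, [1])] = "invalid"    # this tuple contains a list, so it is unhashable
except TypeError as exc:
    reason = str(exc)
print(reason)
```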

Dictionaries in Python

A Python dictionary is a data structure that stores values in key:value pairs.
Example:
As the example shows, data in a dictionary is stored in key:value pairs, which
makes it easy to look up a value by its key.

Dict = {1: 'Python', 2: 'For', 3: 'Python'}
print(Dict)

Output:
{1: 'Python', 2: 'For', 3: 'Python'}

Python Dictionary Syntax


dict_var = {key1 : value1, key2 : value2, …..}
What is a Dictionary in Python?
A dictionary in Python is a data structure used to store values in key:value format. This
makes it different from lists, tuples, and arrays, because in a dictionary each key has an associated
value.
Note: As of Python version 3.7, dictionaries preserve insertion order. A dictionary cannot contain duplicate keys.
How to Create a Dictionary
In Python, a dictionary can be created by placing a sequence of key:value pairs within
curly {} braces, separated by ‘comma’.
Each element of a dictionary is a pair: a Key and its corresponding value.
Values in a dictionary can be of any data type and can be duplicated, whereas keys can’t be
repeated and must be immutable.
Note – Dictionary keys are case sensitive: keys with the same name but different case are
treated as distinct.
The code demonstrates creating dictionaries with different types of keys. The first dictionary
uses integer keys, and the second dictionary uses a mix of string and integer keys with
corresponding values. This showcases the flexibility of Python dictionaries in handling
various data types as keys.

Dict = {1: 'Python', 2: 'For', 3: 'Python'}
print("\nDictionary with the use of Integer Keys: ")
print(Dict)

Dict = {'Name': 'Python', 1: [1, 2, 3, 4]}
print("\nDictionary with the use of Mixed Keys: ")
print(Dict)

Output
Dictionary with the use of Integer Keys:
{1: 'Python', 2: 'For', 3: 'Python'}
Dictionary with the use of Mixed Keys:
{'Name': 'Python', 1: [1, 2, 3, 4]}
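The notes above about case-sensitive and non-repeating keys can be checked with a short sketch. The dictionaries and their contents are illustrative assumptions.

```python
# 'Name' and 'name' differ only in case, so they are two distinct keys
settings = {"Name": "Python", "name": "python"}
print(len(settings))

# A repeated key does not create a second entry: the later value wins
duplicates = {"a": 1, "a": 2}
print(duplicates)
```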

Dictionary Example
A dictionary can also be created by the built-in function dict(). An empty dictionary can be
created by just placing curly braces{}.
Different Ways to Create a Python Dictionary
The code demonstrates different ways to create dictionaries in Python. It first creates an
empty dictionary, and then shows how to create dictionaries using the dict() constructor with
key-value pairs specified within curly braces and as a list of tuples.

Dict = {}
print("Empty Dictionary: ")
print(Dict)

Dict = dict({1: 'Python', 2: 'For', 3: 'Python'})
print("\nDictionary with the use of dict(): ")
print(Dict)

Dict = dict([(1, 'Python'), (2, 'For')])
print("\nDictionary with each item as a pair: ")
print(Dict)

Output:
Empty Dictionary:
{}
Dictionary with the use of dict():
{1: 'Python', 2: 'For', 3: 'Python'}
Dictionary with each item as a pair:
{1: 'Python', 2: 'For'}

Complexities for Creating a Dictionary:


 Time complexity: O(len(dict))
 Space complexity: O(n)
Nested Dictionaries

Example: The code defines a nested dictionary named ‘Dict’ with multiple levels of key-
value pairs. It includes a top-level dictionary with keys 1, 2, and 3. The value associated with
key 3 is another dictionary with keys ‘A,’ ‘B,’ and ‘C.’ This showcases how Python
dictionaries can be nested to create hierarchical data structures.

Dict = {1: 'Python', 2: 'For',
        3: {'A': 'Welcome', 'B': 'To', 'C': 'Python'}}

print(Dict)

Output:
{1: 'Python', 2: 'For', 3: {'A': 'Welcome', 'B': 'To', 'C': 'Python'}}


Adding Elements to a Dictionary
Elements can be added in multiple ways. One value at a time can be added to a
Dictionary by assigning it to a key, e.g. Dict[Key] = ‘Value’.
An existing value in a Dictionary can be updated with the built-
in update() method. Nested key values can also be added to an existing Dictionary.
Note- While adding a value, if the key already exists, its value gets updated; otherwise
a new Key with the value is added to the Dictionary.
Example: Add Items to a Python Dictionary with Different DataTypes
The code starts with an empty dictionary and then adds key-value pairs to it. It demonstrates
adding elements with various data types, updating a key’s value, and even nesting
dictionaries within the main dictionary. The code shows how to manipulate dictionaries in
Python.

Dict = {}
print("Empty Dictionary: ")
print(Dict)
Dict[0] = 'Python'
Dict[2] = 'For'
Dict[3] = 1
print("\nDictionary after adding 3 elements: ")
print(Dict)

Dict['Value_set'] = 2, 3, 4
print("\nDictionary after adding 3 elements: ")
print(Dict)

Dict[2] = 'Welcome'
print("\nUpdated key value: ")
print(Dict)
Dict[5] = {'Nested': {'1': 'Life', '2': 'Python'}}
print("\nAdding a Nested Key: ")
print(Dict)

Output:
Empty Dictionary:
{}
Dictionary after adding 3 elements:
{0: 'Python', 2: 'For', 3: 1}
Dictionary after adding 3 elements:
{0: 'Python', 2: 'For', 3: 1, 'Value_set': (2, 3, 4)}
Updated key value:
{0: 'Python', 2: 'Welcome', 3: 1, 'Value_set': (2, 3, 4)}
Adding a Nested Key:
{0: 'Python', 2: 'Welcome', 3: 1, 'Value_set': (2, 3, 4), 5:
{'Nested': {'1': 'Life', '2': 'Python'}}}

Complexities for Adding Elements in a Dictionary:


 Time complexity: O(1)/O(n)
 Space complexity: O(1)
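The update() method mentioned in this section is not used in the listing above, so here is a minimal sketch of it. The inventory dictionary and its keys are hypothetical.

```python
inventory = {"apples": 3}   # hypothetical sample data

# update() overwrites existing keys and adds missing ones
inventory.update({"apples": 5, "pears": 2})

# Keyword form, usable when the keys are valid identifiers
inventory.update(plums=7)

print(inventory)
```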
Accessing Elements of a Dictionary
To access the items of a dictionary refer to its key name. Key can be used inside square
brackets.
Access a Value in Python Dictionary
The code demonstrates how to access elements in a dictionary using keys. It accesses and
prints the values associated with the keys ‘name’ and 1, showcasing that keys can be of
different data types (string and integer).

Dict = {1: 'Python', 'name': 'For', 3: 'Python'}

print("Accessing an element using key:")
print(Dict['name'])
print("Accessing an element using key:")
print(Dict[1])

Output:
Accessing an element using key:
For
Accessing an element using key:
Python

There is also a method called get() that will also help in accessing the element from a
dictionary. This method accepts key as argument and returns the value.
Complexities for Accessing elements in a Dictionary:
 Time complexity: O(1)
 Space complexity: O(1)
Example: Access a Value in Dictionary using get() in Python
The code demonstrates accessing a dictionary element using the get() method. It retrieves and
prints the value associated with the key 3 in the dictionary ‘Dict’. This method provides a
safe way to access dictionary values, avoiding KeyError if the key doesn’t exist.

Dict = {1: 'Python', 'name': 'For', 3: 'Python'}

print("Accessing an element using get:")
print(Dict.get(3))

Output:
Accessing an element using get:
Python
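The difference between get() and square brackets shows up when the key is missing. In this sketch the key 'age' is a hypothetical key that is deliberately absent from the dictionary.

```python
Dict = {1: 'Python', 'name': 'For', 3: 'Python'}

missing = Dict.get('age')                 # no such key: returns None
fallback = Dict.get('age', 'unknown')     # or a default supplied by the caller

try:
    Dict['age']                           # square brackets raise KeyError instead
    raised = False
except KeyError:
    raised = True

print(missing, fallback, raised)
```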

Accessing an Element of a Nested Dictionary


To access the value of any key in the nested dictionary, use indexing [] syntax.
Example: The code works with nested dictionaries. It first accesses and prints the entire
nested dictionary associated with the key ‘Dict1’. Then, it accesses and prints a specific value
by navigating through the nested dictionaries. Finally, it retrieves and prints the value
associated with the key ‘Name’ within the nested dictionary under ‘Dict2’.

Dict = {'Dict1': {1: 'Python'},
        'Dict2': {'Name': 'For'}}

print(Dict['Dict1'])
print(Dict['Dict1'][1])
print(Dict['Dict2']['Name'])

Output:
{1: 'Python'}
Python
For

Deleting Elements using ‘del’ Keyword


The items of the dictionary can be deleted by using the del keyword as given below.
Example: The code defines a dictionary, prints its original content, and then uses
the ‘del’ statement to delete the element associated with key 1. After deletion, it prints the
updated dictionary, showing that the specified element has been removed.

Dict = {1: 'Python', 'name': 'For', 3: 'Python'}

print("Dictionary =")
print(Dict)
del Dict[1]
print("Data after deletion Dictionary=")
print(Dict)

Output
Dictionary =
{1: 'Python', 'name': 'For', 3: 'Python'}
Data after deletion Dictionary=
{'name': 'For', 3: 'Python'}

Dictionary Methods
Here is a list of built-in dictionary methods with their descriptions. You can use these
methods to operate on a dictionary.
Method                                Description
dict.clear()                          Removes all the elements from the dictionary
dict.copy()                           Returns a copy of the dictionary
dict.get(key, default=None)           Returns the value of the specified key, or default if the key is not present
dict.items()                          Returns a view containing a tuple for each key-value pair
dict.keys()                           Returns a view containing the dictionary’s keys
dict.update(dict2)                    Updates the dictionary with the specified key-value pairs
dict.values()                         Returns a view of all the values of the dictionary
dict.pop(key)                         Removes the element with the specified key and returns its value
dict.popitem()                        Removes and returns the last inserted key-value pair
dict.setdefault(key, default=None)    Returns the value of the key; if the key is not in the dictionary, inserts it with the default value
dict.has_key(key)                     Returns True if the dictionary contains the specified key (Python 2 only; in Python 3 use key in dict)
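Two entries in the table above benefit from a short illustration: the membership test that replaces has_key() in Python 3, and setdefault(). The counts dictionary is a hypothetical example.

```python
counts = {"a": 1}   # hypothetical sample data

# The in operator replaces has_key() in Python 3
print("a" in counts, "b" in counts)

counts.setdefault("b", 0)      # "b" is missing, so it is inserted with 0
counts.setdefault("a", 100)    # "a" exists, so its value is left untouched

removed = counts.pop("a")      # pop() returns the removed value
print(counts, removed)
```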



Multiple Dictionary Operations in Python
The code begins with a dictionary ‘dict1’ and creates a copy ‘dict2’. It then demonstrates
several dictionary operations: clearing ‘dict1’, accessing values, retrieving key-value pairs
and keys, removing specific key-value pairs, updating a value, and retrieving values. These
operations showcase how to work with dictionaries in Python.


dict1 = {1: "Python", 2: "Java", 3: "Ruby", 4: "Scala"}


dict2 = dict1.copy()
print(dict2)
dict1.clear()
print(dict1)
print(dict2.get(1))
print(dict2.items())
print(dict2.keys())
dict2.pop(4)
print(dict2)
dict2.popitem()
print(dict2)
dict2.update({3: "Scala"})
print(dict2)
print(dict2.values())

Output:
{1: 'Python', 2: 'Java', 3: 'Ruby', 4: 'Scala'}
{}
Python
dict_items([(1, 'Python'), (2, 'Java'), (3, 'Ruby'), (4, 'Scala')])
dict_keys([1, 2, 3, 4])
{1: 'Python', 2: 'Java', 3: 'Ruby'}
{1: 'Python', 2: 'Java'}
{1: 'Python', 2: 'Java', 3: 'Scala'}
dict_values(['Python', 'Java', 'Scala'])

Python Functions
A Python function is a block of statements that performs a specific task. The idea is to put
some commonly or repeatedly done tasks together in a function, so that instead of
writing the same code again and again for different inputs, we can call the function and
reuse the code it contains.
Some Benefits of Using Functions
 Increase Code Readability
 Increase Code Reusability
Python Function Declaration
The syntax to declare a function is:

[Figure: Syntax of Python Function Declaration]
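In text form, a declaration consists of the def keyword, the function name, a parenthesized parameter list, a colon, and an indented body with an optional return statement. The sketch below labels these parts; the function greet and its parameter are hypothetical names chosen for illustration.

```python
def greet(name):                        # def keyword, function name, parameters, colon
    """Return a greeting for name."""   # optional docstring
    message = "Hello, " + name          # indented function body
    return message                      # optional return statement

print(greet("Python"))
```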

Types of Functions in Python


Below are the different types of functions in Python:
 Built-in library function: These are Standard functions in Python that are
available to use.
 User-defined function: We can create our own functions based on our
requirements.
Creating a Function in Python
We can define a function in Python using the def keyword, and add whatever
functionality we require to its body. The following example shows how to write
a function in Python.

# A simple Python function
def fun():
    print("Welcome to GFG")
Calling a Function in Python
After creating a function in Python, we can call it by writing the name of the function
followed by parentheses containing the arguments of that particular function. Below is an
example of calling a function defined with def.

# A simple Python function
def fun():
    print("Welcome to GFG")

# Driver code to call a function
fun()
Output:
Welcome to GFG
Python Function with Parameters

If you have experience in C/C++ or Java, you may be wondering about the return type of
the function and the data types of its arguments. These can be declared in Python as well
(specifically in Python 3.5 and above), as optional annotations.
Python Function Syntax with Parameters
def function_name(parameter: data_type) -> return_type:
    """Docstring"""
    # body of the function
    return expression
The following example uses arguments and parameters, which are covered later in this chapter,
so you can come back to it if anything is unclear.

def add(num1: int, num2: int) -> int:
    """Add two numbers"""
    num3 = num1 + num2

    return num3

# Driver code
num1, num2 = 5, 15
ans = add(num1, num2)
print(f"The addition of {num1} and {num2} results {ans}.")
Output:
The addition of 5 and 15 results 20.
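It may help to know that Python stores these annotations but does not check them when the function is called. The sketch below reuses the add function from the example above and passes strings to it, purely to illustrate that point.

```python
def add(num1: int, num2: int) -> int:
    """Add two numbers"""
    return num1 + num2

# Annotations are stored on the function object
print(add.__annotations__)

# They are hints only: strings are accepted and concatenated at run time
print(add("Data ", "Science"))
```

Enforcing the declared types requires an external static type checker or explicit checks inside the function.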
Note: The following examples are written without type annotations (the first syntax); try
converting them to the annotated syntax for practice.

# some more functions
def is_prime(n):
    if n in [2, 3]:
        return True
    if (n == 1) or (n % 2 == 0):
        return False
    r = 3
    while r * r <= n:
        if n % r == 0:
            return False
        r += 2
    return True

print(is_prime(78), is_prime(79))
Output:
False True
Python Function Arguments
Arguments are the values passed inside the parentheses of the function. A function can have
any number of arguments, separated by commas.
In this example, we will create a simple function in Python to check whether the number
passed as an argument to the function is even or odd.

# A simple Python function to check
# whether x is even or odd
def evenOdd(x):
    if (x % 2 == 0):
        print("even")
    else:
        print("odd")

# Driver code to call the function
evenOdd(2)
evenOdd(3)
Output:
even
odd
Types of Python Function Arguments
Python supports various types of arguments that can be passed at the time of the function call.
We have the following function argument types in Python:
 Default argument
 Keyword arguments (named arguments)
 Positional arguments
 Arbitrary arguments (variable-length arguments *args and **kwargs)
Let’s discuss each type in detail.
Default Arguments
A default argument is a parameter that assumes a default value if a value is not provided in
the function call for that argument. The following example illustrates Default arguments to
write functions in Python.

# Python program to demonstrate
# default arguments
def myFun(x, y=50):
    print("x: ", x)
    print("y: ", y)

# Driver code (We call myFun() with only
# one argument)
myFun(10)
Output:
x: 10
y: 50
Like C++ default arguments, any number of arguments in a function can have a default value.
But once we have a default argument, all the arguments to its right must also have default
values.
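The ordering rule above can be checked directly. The sketch below is only an illustration: the function power and the source string bad_source are made-up names, and compile() is used so that the syntax error can be caught instead of stopping the script.

```python
# Sketch: parameters with defaults must come after parameters without them.
def power(base, exponent=2):
    """Raise base to exponent; exponent defaults to 2 (hypothetical example)."""
    return base ** exponent

print(power(5))      # uses the default exponent
print(power(5, 3))   # overrides the default

# A parameter without a default placed after one with a default is rejected
# when the code is compiled, before anything runs.
bad_source = "def broken(x=1, y):\n    pass\n"
try:
    compile(bad_source, "<example>", "exec")
    order_error = None
except SyntaxError as err:
    order_error = err

print("SyntaxError raised:", order_error is not None)
```

Compiling the string rather than writing the faulty def directly keeps the rest of the example runnable.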
Keyword Arguments
The idea is to allow the caller to specify the argument name with values so that the caller
does not need to remember the order of parameters.

# Python program to demonstrate keyword arguments
def student(firstname, lastname):
    print(firstname, lastname)

# Keyword arguments
student(firstname='Python', lastname='Practice')
student(lastname='Practice', firstname='Python')
Output:
Python Practice
Python Practice
Positional Arguments
Positional arguments are matched to parameters by their order in the function call: the first
argument (or value) is assigned to name and the second argument (or value) is assigned to age.
If you change the order, or forget it, the values end up in the wrong places, as shown in the
Case-2 example below, where 27 is assigned to name and Suraj is assigned to age.

def nameAge(name, age):
    print("Hi, I am", name)
    print("My age is ", age)

# You will get correct output because
# the arguments are given in order
print("Case-1:")
nameAge("Suraj", 27)
# You will get incorrect output because
# the arguments are not in order
print("\nCase-2:")
nameAge(27, "Suraj")
Output:
Case-1:
Hi, I am Suraj
My age is 27
Case-2:
Hi, I am 27
My age is Suraj
Arbitrary Arguments
In Python, the special symbols *args and **kwargs let a function accept a variable number of
arguments. There are two special symbols:
• *args in Python (Non-Keyword Arguments)
• **kwargs in Python (Keyword Arguments)
Example 1: Variable length non-keywords argument

# Python program to illustrate
# *args for variable number of arguments
def myFun(*argv):
    for arg in argv:
        print(arg)

myFun('Hello', 'Welcome', 'to', 'Python')

Output:
Hello
Welcome
to
Python
Example 2: Variable length keyword arguments

# Python program to illustrate
# **kwargs for variable number of keyword arguments
def myFun(**kwargs):
    for key, value in kwargs.items():
        print("%s == %s" % (key, value))

# Driver code
myFun(first='Python', mid='for', last='Data')
Output:
first == Python
mid == for
last == Data
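Both symbols can appear in the same function. The sketch below is a hypothetical example (describe_call and the sample values are made up) showing how positional and keyword arguments are collected, and how * and ** unpack a list and a dictionary at the call site.

```python
def describe_call(first, *args, **kwargs):
    """Return a summary of how the arguments were received."""
    return {"first": first, "extra_positional": args, "keywords": kwargs}

# Extra positional values land in the tuple args,
# extra keyword values land in the dictionary kwargs.
summary = describe_call("Python", "for", "Data", course="Unit II", year=2024)
print(summary)

# Unpacking works in the other direction when calling a function.
words = ["Python", "for", "Data"]
options = {"course": "Unit II"}
same_shape = describe_call(*words, **options)
print(same_shape)
```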
Docstring
The first string literal in the body of a function is called the documentation string, or
docstring for short. It describes what the function does. Docstrings are optional, but writing
them is considered good practice.
The syntax below can be used to print out the docstring of a function.
Syntax: print(function_name.__doc__)
Example: Adding Docstring to the function

# A simple Python function to check
# whether x is even or odd
def evenOdd(x):
    """Function to check if the number is even or odd"""

    if x % 2 == 0:
        print("even")
    else:
        print("odd")

# Driver code to call the function
print(evenOdd.__doc__)
Output:
Function to check if the number is even or odd
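Besides the __doc__ attribute, the standard library offers inspect.getdoc(), which returns the docstring with its indentation cleaned up. The sketch below uses a made-up function with a multi-line docstring to show both.

```python
import inspect

def square_value(num):
    """Return the square of num.

    Works for ints and floats.
    """
    return num ** 2

# __doc__ holds the docstring as stored on the function object.
raw_doc = square_value.__doc__

# inspect.getdoc() strips surrounding blank lines and uniform indentation.
clean_doc = inspect.getdoc(square_value)
print(clean_doc)
```

In an interactive session, help(square_value) displays the same text together with the function signature.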
Python Function within Functions
A function that is defined inside another function is known as an inner function or nested
function. Nested functions can access variables of the enclosing scope. Defining a helper as an
inner function hides it from the rest of the program: it can be called only from inside the
enclosing function.

# Python program to
# demonstrate accessing of
# variables of nested functions

def f1():
    s = 'I love Python'

    def f2():
        print(s)

    f2()

# Driver's code
f1()
Output:
I love Python
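Reading an enclosing variable works as in the example above, but assigning to it needs the nonlocal keyword. A minimal sketch, using a hypothetical counter factory:

```python
def make_counter():
    count = 0  # lives in the enclosing scope

    def increment():
        # nonlocal rebinds the enclosing variable instead of
        # creating a new local variable named count
        nonlocal count
        count += 1
        return count

    return increment

counter = make_counter()
print(counter(), counter(), counter())
```

Each call to make_counter() creates its own count, so two counters do not interfere with each other.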
Anonymous Functions in Python
In Python, an anonymous function is a function defined without a name. As we already
know, the def keyword is used to define normal functions, while the lambda keyword is used
to create anonymous functions.

# Python code to illustrate the cube of a number
# using lambda function
def cube(x): return x*x*x

cube_v2 = lambda x : x*x*x

print(cube(7))
print(cube_v2(7))
Output:
343
343
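Lambda functions are most often passed directly to another function rather than stored in a variable. The sketch below, with made-up sample data, uses a lambda as the sort key for sorted() and as the function given to map().

```python
# Made-up (name, marks) pairs for illustration
marks = [("Asha", 17), ("Ravi", 19), ("Meena", 14)]

# Sort by the second item of each pair, highest first
by_marks = sorted(marks, key=lambda pair: pair[1], reverse=True)
print(by_marks)

# Apply a small function to every item of a list
squares = list(map(lambda x: x * x, [1, 2, 3]))
print(squares)
```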
Recursive Functions in Python
Recursion in Python refers to a function calling itself. Recursive functions suit problems
that break down into smaller instances of the same problem, such as many mathematical
definitions. Use recursion with caution: without a reachable exit condition, a recursive
function behaves like a non-terminating loop, and Python stops it with a RecursionError once
the recursion limit is exceeded. Always check the exit condition when writing a recursive
function.


def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)

print(factorial(4))
Output
24
Here we have created a recursive function to calculate the factorial of a number. The exit
condition (base case) of this function is n equal to 0.
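The factorial above never reaches its exit condition for a negative argument. The sketch below is one possible way to guard against that, with an iterative variant for comparison; the function names and the error message are our own choices, not part of the original example.

```python
import sys

def factorial(n):
    """Recursive factorial with a guard against invalid input."""
    if n < 0:
        raise ValueError("factorial is undefined for negative numbers")
    if n == 0:  # exit condition stops the recursion
        return 1
    return n * factorial(n - 1)

def factorial_iterative(n):
    """Same result with a loop, so the recursion limit does not apply."""
    if n < 0:
        raise ValueError("factorial is undefined for negative numbers")
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

print(factorial(5), factorial_iterative(5))
print("recursion limit:", sys.getrecursionlimit())
```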
Return Statement in Python Function
The function return statement is used to exit from a function and go back to the function
caller and return the specified value or data item to the caller. The syntax for the return
statement is:
return [expression_list]
The return statement can consist of a variable, an expression, or a constant, which is returned
at the end of the function execution. If the return statement has no expression, or the
function ends without reaching a return statement, None is returned.
Example: Python Function Return Statement

def square_value(num):
    """This function returns the square
    value of the entered number"""
    return num**2

print(square_value(2))
print(square_value(-4))
Output:
4
16
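Two further behaviours of return are worth seeing in code: several values can be returned at once as a tuple, and a function without a return statement gives back None. The functions below are hypothetical examples.

```python
def min_max(values):
    """Return the smallest and largest item as a tuple."""
    return min(values), max(values)

def log_only(message):
    """Print a message; there is no return statement."""
    print(message)

# The returned tuple can be unpacked into separate names
low, high = min_max([13, 20, 18, 14])

# A function that never executes return evaluates to None
result = log_only("done")
print(low, high, result)
```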
Pass by Reference and Pass by Value
One important thing to note is that in Python every variable name is a reference. When we pass
a variable to a function, a new reference to the same object is created. Parameter passing in
Python works the same way as passing object references in Java.

# Here x is a new reference to same list lst
def myFun(x):
    x[0] = 20

# Driver code (note that lst is modified
# after function call)
lst = [10, 11, 12, 13, 14, 15]
myFun(lst)
print(lst)
Output:
[20, 11, 12, 13, 14, 15]
When we pass a reference and change the received reference to something else, the
connection between the passed and received parameters is broken. For example, consider the
below program as follows:

def myFun(x):
    # After the line below, the link of x with the previous
    # object is broken. A new object is assigned
    # to x.
    x = [20, 30, 40]

# Driver code (note that lst is not modified
# after function call)
lst = [10, 11, 12, 13, 14, 15]
myFun(lst)
print(lst)
Output:
[10, 11, 12, 13, 14, 15]
Another example demonstrates that the reference link is broken if we assign a new value
(inside the function).

def myFun(x):
    # After the line below, the link of x with the previous
    # object is broken. A new object is assigned
    # to x.
    x = 20

# Driver code (note that x is not modified
# after function call)
x = 10
myFun(x)
print(x)
Output:
10
Exercise: Try to guess the output of the following code.

def swap(x, y):
    temp = x
    x = y
    y = temp

# Driver code
x = 2
y = 3
swap(x, y)
print(x)
print(y)
Output:
2
3
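The swap in the exercise has no effect because it only rebinds the local names inside the function. One common way to make the swap visible to the caller, sketched here under our own naming, is to return the values and reassign them.

```python
def swap(x, y):
    """Return the two arguments in reverse order."""
    return y, x

x, y = 2, 3
x, y = swap(x, y)   # the caller rebinds its own names
print(x, y)

# For a simple swap, tuple assignment does the same without a function
a, b = 10, 20
a, b = b, a
print(a, b)
```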

Working with Excel files using Pandas




Excel sheets are intuitive and user-friendly, which makes them a common choice for storing and
manipulating datasets, even for less technical users. This section shows how to use Pandas to
read, manipulate, and write Excel spreadsheets. It covers:
• Installing and importing Pandas
• Reading an Excel file using Pandas in Python
• Reading multiple Excel sheets using Pandas
• Applying different Pandas functions
Reading Excel File using Pandas in Python
Installing Pandas
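To install Pandas with pip, we can use the following command in the command prompt:
pip install pandas
To install Pandas in Anaconda, we can use the following command in Anaconda Terminal:
conda install pandas
Pandas reads .xlsx files through the openpyxl package, so install it as well
(pip install openpyxl) if it is not already present.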
Importing Pandas
First of all, we need to import the Pandas module which can be done by running the
command:

import pandas as pd

Input file: suppose the Excel workbook Example.xlsx has two sheets.
[Image: Sheet 1]
[Image: Sheet 2]

Now we can load the workbook using the read_excel function in Pandas. The statement below
reads the data from the first sheet and stores it in a Pandas DataFrame, which is represented
by the variable df.

df = pd.read_excel('Example.xlsx')
print(df)

Output:
   Roll No.  English  Maths  Science
0         1       19     13       17
1         2       14     20       18
2         3       15     18       19
3         4       13     14       14
4         5       17     16       20
5         6       19     13       17
6         7       14     20       18
7         8       15     18       19
8         9       13     14       14
9        10       17     16       20
Loading multiple sheets using concat() method
If the Excel workbook has multiple sheets, read_excel imports data from the first sheet by
default. To make a data frame with all the sheets in the workbook, a simple method is to
read each sheet into its own data frame and then concatenate them. The read_excel method
takes the arguments sheet_name and index_col: sheet_name specifies which sheet to read, and
index_col specifies the column to use as the row index, as shown below.
Example:
The first two statements read the two sheets, and the third statement concatenates them.
Printing newData then shows the whole data frame.

file = 'Example.xlsx'
sheet1 = pd.read_excel(file,
                       sheet_name=0,
                       index_col=0)

sheet2 = pd.read_excel(file,
                       sheet_name=1,
                       index_col=0)

# concatenating both the sheets
newData = pd.concat([sheet1, sheet2])
print(newData)

Output:
          English  Maths  Science
Roll No.
1              19     13       17
2              14     20       18
3              15     18       19
4              13     14       14
5              17     16       20
6              19     13       17
7              14     20       18
8              15     18       19
9              13     14       14
10             17     16       20
1              14     18       20
2              11     19       18
3              12     18       16
4              15     18       19
5              13     14       14
6              14     18       20
7              11     19       18
8              12     18       16
9              15     18       19
10             13     14       14
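In the concatenated output the roll numbers repeat, so a row can no longer be identified by its index alone. The sketch below shows one way to keep track of the source sheet, using the keys argument of pd.concat. The two small frames are made-up stand-ins for the sheets, built in memory so the example runs without an Excel file. (When reading a real workbook, pd.read_excel(file, sheet_name=None) returns a dictionary of data frames, one per sheet, which can be passed to pd.concat in the same way.)

```python
import pandas as pd

# Made-up stand-ins for two sheets that share the same roll numbers
sheet1 = pd.DataFrame({"English": [19, 14], "Maths": [13, 20]},
                      index=pd.Index([1, 2], name="Roll No."))
sheet2 = pd.DataFrame({"English": [14, 11], "Maths": [18, 19]},
                      index=pd.Index([1, 2], name="Roll No."))

# Plain concatenation keeps the repeated roll numbers
stacked = pd.concat([sheet1, sheet2])
print(stacked.index.is_unique)

# keys adds an outer index level naming the source of each row
labelled = pd.concat([sheet1, sheet2],
                     keys=["Sheet1", "Sheet2"],
                     names=["Sheet", "Roll No."])
print(labelled)
print(labelled.loc[("Sheet2", 1), "English"])
```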

head() and tail() methods in Pandas

To view the first 5 and the last 5 rows of the data frame, we can use the head() and tail()
methods. Both methods also take a number as an argument, giving the number of
rows to show.

print(newData.head())
print(newData.tail())

Output:
English Maths Science
Roll No.
1 19 13 17
2 14 20 18
3 15 18 19
4 13 14 14
5 17 16 20
English Maths Science
Roll No.
6 14 18 20
7 11 19 18
8 12 18 16
9 15 18 19
10 13 14 14
shape attribute
The shape attribute (written without parentheses) gives the number of rows and columns in the
data frame as follows:

newData.shape

Output:
(20, 3)
sort_values() method in Pandas
To sort the rows of the data frame by the values of a column, we can use
the sort_values() method in pandas as follows:

sorted_column = newData.sort_values(['English'], ascending=False)

To get the top 5 rows of the sorted data frame, we can use the head()
method here:

sorted_column.head(5)

Output:
English Maths Science
Roll No.
1 19 13 17
6 19 13 17
5 17 16 20
10 17 16 20
3 15 18 19
We can also select any single column of the data frame by its name, as shown below:

newData['Maths'].head()

Output:
Roll No.
1 13
2 20
3 18
4 14
5 16
Name: Maths, dtype: int64
Pandas Describe() method
Now, suppose our data is mostly numerical. We can get the statistical information like mean,
max, min, etc. about the data frame using the describe() method as shown below:

newData.describe()

Output:
        English      Maths    Science
count  20.00000  20.000000  20.000000
mean   14.30000  16.800000  17.500000
std     2.29645   2.330575   2.164304
min    11.00000  13.000000  14.000000
25%    13.00000  14.000000  16.000000
50%    14.00000  18.000000  18.000000
75%    15.00000  18.000000  19.000000
max    19.00000  20.000000  20.000000
The same statistics can also be computed separately for any numerical column, for example:

newData['English'].mean()

Output:
14.3
Other statistical information can also be calculated using the respective methods. Like in
Excel, formulas can also be applied, and calculated columns can be created as follows:

newData['Total Marks'] = (newData["English"] + newData["Maths"]
                          + newData["Science"])
newData['Total Marks'].head()

Output:
Roll No.
1 49
2 52
3 52
4 41
5 53
Name: Total Marks, dtype: int64
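When many columns are added together, summing across the row with sum(axis=1) avoids spelling out each term. The sketch below uses a small made-up frame; the Percentage column assumes each subject is marked out of 20, which is our assumption and is not stated in the data.

```python
import pandas as pd

# Made-up marks for two students
marks = pd.DataFrame({"English": [19, 14],
                      "Maths": [13, 20],
                      "Science": [17, 18]},
                     index=pd.Index([1, 2], name="Roll No."))

subjects = ["English", "Maths", "Science"]

# axis=1 sums across the columns of each row
marks["Total Marks"] = marks[subjects].sum(axis=1)

# Assumption: every subject is marked out of 20
max_per_subject = 20
marks["Percentage"] = marks["Total Marks"] / (max_per_subject * len(subjects)) * 100
print(marks)
```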
After operating on the data in the data frame, we can export the data back to an Excel file
using the method to_excel. For this, we need to specify an output Excel file where the
transformed data is to be written, as shown below:

newData.to_excel('Output File.xlsx')


Pandas Read CSV in Python


CSV files are the Comma Separated Files. To access data from the CSV file, we require a
function read_csv() from Pandas that retrieves data in the form of the data frame.
Syntax of read_csv()
Here is the Pandas read CSV syntax with its parameters.
Syntax: pd.read_csv(filepath_or_buffer, sep=’ ,’ , header=’infer’, index_col=None,
usecols=None, engine=None, skiprows=None, nrows=None)
Parameters:
 filepath_or_buffer: Location of the csv file. It accepts any string path or URL of
the file.
 sep: It stands for separator, default is ‘, ‘.
 header: It accepts int, a list of int, row numbers to use as the column names, and
the start of the data. If no names are passed, i.e., header=None, then, it will
display the first column as 0, the second as 1, and so on.
 usecols: Retrieves only selected columns from the CSV file.
 nrows: Number of rows to be displayed from the dataset.
 index_col: If None, there are no index numbers displayed along with records.
 skiprows: Skips passed rows in the new data frame.
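The effect of these parameters can be tried without a file on disk, because read_csv also accepts a file-like object. The sketch below wraps a small made-up CSV text in io.StringIO and reads it three ways; the column names and values are invented for illustration.

```python
import io
import pandas as pd

# Made-up CSV content held in memory instead of a file
csv_text = "id,name,score\n1,Asha,17\n2,Ravi,19\n3,Meena,14\n"

# Default call: first line becomes the header, default integer index
full = pd.read_csv(io.StringIO(csv_text))

# Only two columns, id as the row index, first two data rows
subset = pd.read_csv(io.StringIO(csv_text),
                     usecols=["id", "score"],
                     index_col="id",
                     nrows=2)

# header=None: the first line is treated as data, columns are 0, 1, 2
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(full.shape, subset.shape, no_header.shape)
print(subset)
```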

Read CSV File using Pandas read_csv


Before using this function, we must import the Pandas library. We then load the CSV file
using read_csv().

# Import pandas
import pandas as pd

# reading csv file
df = pd.read_csv("people.csv")
print(df.head())

Output:
First Name Last Name Sex Email Date of birth Job Title
0 Shelby Terrell Male [email protected] 1945-10-26 Games developer
1 Phillip Summers Female [email protected] 1910-03-24 Phytotherapist
2 Kristine Travis Male [email protected] 1992-07-02 Homeopath
3 Yesenia Martinez Male [email protected] 2017-08-03 Market researcher
4 Lori Todd Male [email protected] 1938-12-01 Veterinary surgeon
Using sep in read_csv()
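In this example, the fields of the CSV file are separated by a mix of special characters.
Passing a regular expression as sep splits each line on any of them; regular-expression
separators are handled by the Python parsing engine, hence engine='python'. Because ', '
contains both a comma and a space, each occurrence produces an empty field, which shows up
as an Unnamed column filled with NaN in the output.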

# sample = "totalbill_tip, sex:smoker, day_time, size
# 16.99, 1.01:Female|No, Sun, Dinner, 2
# 10.34, 1.66, Male, No|Sun:Dinner, 3
# 21.01:3.5_Male, No:Sun, Dinner, 3
# 23.68, 3.31, Male|No, Sun_Dinner, 2
# 24.59:3.61, Female_No, Sun, Dinner, 4
# 25.29, 4.71|Male, No:Sun, Dinner, 4"

# Importing pandas library
import pandas as pd

# Load the data of csv
df = pd.read_csv('sample.csv',
                 sep='[:, |_]',
                 engine='python')

# Print the Dataframe
print(df)

Output:
totalbill tip Unnamed: 2 sex smoker Unnamed: 5 day time Unnamed: 8 size
16.99 NaN 1.01 Female No NaN Sun NaN Dinner NaN 2
10.34 NaN 1.66 NaN Male NaN No Sun Dinner NaN 3
21.01 3.50 Male NaN No Sun NaN Dinner NaN 3.0 None
23.68 NaN 3.31 NaN Male No NaN Sun Dinner NaN 2
24.59 3.61 NaN Female No NaN Sun NaN Dinner NaN 2
25.29 NaN 4.71 Male NaN No Sun NaN Dinner NaN 4


Using usecols in read_csv()

Here, we load only 3 columns, ["First Name", "Sex", "Email"], and use row 0 of the file as
the header.

df = pd.read_csv('people.csv',
header=0,
usecols=["First Name", "Sex", "Email"])
# printing dataframe
print(df.head())

Output:
First Name Sex Email
0 Shelby Male [email protected]
1 Phillip Female [email protected]
2 Kristine Male [email protected]
3 Yesenia Male [email protected]
4 Lori Male [email protected]

Using index_col in read_csv()


Here, we build the row index from the "Sex" column first and then the "Job Title" column,
using the index_col parameter.

df = pd.read_csv('people.csv',
header=0,
index_col=["Sex", "Job Title"],
usecols=["Sex", "Job Title", "Email"])

print(df.head())

Output:
Email
Sex Job Title
Male Games developer [email protected]
Female Phytotherapist [email protected]
Male Homeopath [email protected]
Market researcher [email protected]
Veterinary surgeon [email protected]

Using nrows in read_csv()


Here, we read only the first 3 rows of the data using the nrows parameter.

df = pd.read_csv('people.csv',
header=0,
index_col=["Sex", "Job Title"],
usecols=["Sex", "Job Title", "Email"],
nrows=3)

print(df)

Output:
Email
Sex Job Title
Male Games developer [email protected]
Female Phytotherapist [email protected]
Male Homeopath [email protected]

Using skiprows in read_csv()


The skiprows parameter skips the given lines of the CSV file while reading it. Line numbers
are 0-indexed and line 0 is the header, so skiprows=[1, 5] skips the first and fifth data
rows, as the output below shows.

df = pd.read_csv("people.csv")
print("Previous Dataset: ")
print(df)
# using skiprows
df = pd.read_csv("people.csv", skiprows=[1, 5])
print("Dataset After skipping rows: ")
print(df)

Output:
Previous Dataset:
First Name Last Name Sex Email Date of birth Job Title
0 Shelby Terrell Male [email protected] 1945-10-26 Games developer
1 Phillip Summers Female [email protected] 1910-03-24 Phytotherapist
2 Kristine Travis Male [email protected] 1992-07-02 Homeopath
3 Yesenia Martinez Male [email protected] 2017-08-03 Market researcher
4 Lori Todd Male [email protected] 1938-12-01 Veterinary surgeon
5 Erin Day Male [email protected] 2015-10-28 Management officer
6 Katherine Buck Female [email protected] 1989-01-22 Analyst
7 Ricardo Hinton Male [email protected] 1924-03-26 Hydrogeologist
Dataset After skipping rows:
First Name Last Name Sex Email Date of birth Job Title
0 Phillip Summers Female [email protected] 1910-03-24 Phytotherapist
1 Kristine Travis Male [email protected] 1992-07-02 Homeopath
2 Yesenia Martinez Male [email protected] 2017-08-03 Market researcher
3 Erin Day Male [email protected] 2015-10-28 Management officer
4 Katherine Buck Female [email protected] 1989-01-22 Analyst
5 Ricardo Hinton Male [email protected] 1924-03-26 Hydrogeologist
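How skiprows counts lines is easy to check on a small in-memory CSV. The sketch below uses made-up names and job titles; with a list, the numbers refer to lines of the file (line 0 being the header), while a single integer skips that many lines from the top.

```python
import io
import pandas as pd

# Made-up data: line 0 is the header, lines 1 to 6 are data rows
csv_text = (
    "name,job\n"
    "Shelby,Developer\n"
    "Phillip,Therapist\n"
    "Kristine,Homeopath\n"
    "Yesenia,Researcher\n"
    "Lori,Surgeon\n"
    "Erin,Officer\n"
)

# A list skips exactly those file lines: here line 1 (Shelby) and line 5 (Lori)
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 5])
print(skipped["name"].tolist())

# An integer skips that many lines from the top, including the header line,
# so header=None is passed to treat what remains as data only
tail_only = pd.read_csv(io.StringIO(csv_text), skiprows=2, header=None)
print(tail_only[0].tolist())
```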

Understanding Data Processing


Data Processing is the task of converting data from a given form to a much more usable and
desired form i.e. making it more meaningful and informative. Using Machine Learning
algorithms, mathematical modeling, and statistical knowledge, this entire process can be
automated. The output of this complete process can be in any desired form like graphs,
videos, charts, tables, images, and many more, depending on the task we are performing and
the requirements of the machine. This might seem to be simple but when it comes to massive
organizations like Twitter, Facebook, Administrative bodies like Parliament, UNESCO, and
health sector organizations, this entire process needs to be performed in a very structured
manner. So, the steps to perform are as follows:
Data processing is a crucial step in the machine learning (ML) pipeline, as it prepares the data
for use in building and training ML models. The goal of data processing is to clean,
transform, and prepare the data in a format that is suitable for modeling.
The main steps involved in data processing typically include:
1. Data collection: This is the process of gathering data from various sources, such as sensors,
databases, or other systems. The data may be structured or unstructured, and may come in
various formats such as text, images, or audio.
2. Data preprocessing: This step involves cleaning, filtering, and transforming the data to
make it suitable for further analysis. This may include removing missing values, scaling or
normalizing the data, or converting it to a different format.
3. Data analysis: In this step, the data is analyzed using various techniques such as statistical
analysis, machine learning algorithms, or data visualization. The goal of this step is to derive
insights or knowledge from the data.
4. Data interpretation: This step involves interpreting the results of the data analysis and
drawing conclusions based on the insights gained. It may also involve presenting the findings
in a clear and concise manner, such as through reports, dashboards, or other visualizations.
5. Data storage and management: Once the data has been processed and analyzed, it must be
stored and managed in a way that is secure and easily accessible. This may involve storing
the data in a database, cloud storage, or other systems, and implementing backup and
recovery strategies to protect against data loss.
6. Data visualization and reporting: Finally, the results of the data analysis are presented to
stakeholders in a format that is easily understandable and actionable. This may involve
creating visualizations, reports, or dashboards that highlight key findings and trends in the
data.
There are many tools and libraries available for data processing in ML, including pandas for
Python, and the Data Transformation and Cleansing tool in RapidMiner. The choice of tools
will depend on the specific requirements of the project, including the size and complexity of
the data and the desired outcome.
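The steps listed above can be sketched on a tiny scale with pandas. Everything in the example below is made up for illustration (the city names, the temperatures, and the choice to fill a missing value with the column mean); it is one possible arrangement of the steps, not a fixed recipe.

```python
import io
import pandas as pd

# 1. Data collection: a made-up CSV stands in for a real source
raw = "city,temp_c\nKochi,31\nKochi,31\nDelhi,\nMumbai,29\n"
df = pd.read_csv(io.StringIO(raw))

# 2. Data preprocessing: drop exact duplicates, then fill the
#    missing temperature with the mean of the remaining values
clean = df.drop_duplicates().copy()
clean["temp_c"] = clean["temp_c"].fillna(clean["temp_c"].mean())

# 3. Data analysis: summary statistics of the cleaned column
summary = clean["temp_c"].describe()

# 4-6. Interpretation, storage and reporting: here the cleaned
#      table is simply serialised back to CSV text
report = clean.to_csv(index=False)

print(summary["mean"])
print(report)
```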


• Collection:
The most crucial step when starting with ML is to have data of good quality and
accuracy. Data can be collected from any authenticated source
like data.gov.in, Kaggle or the UCI dataset repository. For example, while preparing
for a competitive exam, students study from the best material they can
access in order to obtain the best results. In the same way, high-quality
and accurate data makes the learning process of the model easier and
better, and at the time of testing the model is more likely to yield good results.
A huge amount of capital, time and resources is consumed in collecting data.
Organizations or researchers have to decide what kind of data they need to
execute their tasks or research.
Example: Building a facial expression recognizer requires numerous images
showing a variety of human expressions. Good data ensures that the results of the
model are valid and can be trusted.

• Preparation:
The collected data may be in a raw form that can't be fed directly to the
machine. Preparation is the process of gathering datasets from different sources,
analyzing them, and then constructing a new dataset for further processing
and exploration. It can be performed either manually or
automatically. Data can also be converted to numeric form, which
speeds up the model's learning.
Example: An image can be converted to a matrix of N x N dimensions, where the value
of each cell represents an image pixel.
• Input:
The prepared data may still be in a form that is not machine-readable, so
conversion algorithms are needed to convert it into a readable form.
This task requires high computation and accuracy. Example:
Data can be collected from sources like the MNIST digit data (images), Twitter
comments, audio files, and video clips.
• Processing:
This is the stage where algorithms and ML techniques are applied to carry out the
given instructions over a large volume of data with accuracy and optimal
computation.
• Output:
In this stage, the machine produces results in a meaningful form that
the user can easily interpret. Output can be in the form of reports, graphs,
videos, etc.
• Storage:
This is the final step, in which the obtained output, the data model, and all
the useful information are saved for future use.

Overview of Data Cleaning


Data cleaning is one of the important parts of machine learning and plays a significant part
in building a model. This section covers data cleaning, its significance, and a Python
implementation.
What is Data Cleaning?
Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves
identifying and removing any missing, duplicate, or irrelevant data. The goal of data cleaning
is to ensure that the data is accurate, consistent, and free of errors, as incorrect or inconsistent
data can negatively impact the performance of the ML model. Professional data scientists
usually invest a very large portion of their time in this step because of the belief that “Better
data beats fancier algorithms”.
Data cleaning, also known as data cleansing or data preprocessing, is a crucial step in the
data science pipeline that involves identifying and correcting or removing errors,
inconsistencies, and inaccuracies in the data to improve its quality and usability. Data
cleaning is essential because raw data is often noisy, incomplete, and inconsistent, which can
negatively impact the accuracy and reliability of the insights derived from it.
Why is Data Cleaning Important?
Data cleansing is a crucial step in the data preparation process, playing an important role in
ensuring the accuracy, reliability, and overall quality of a dataset.
For decision-making, the integrity of the conclusions drawn heavily relies on the cleanliness
of the underlying data. Without proper data cleaning, inaccuracies, outliers, missing values,
and inconsistencies can compromise the validity of analytical results. Moreover, clean data
facilitates more effective modeling and pattern recognition, as algorithms perform optimally
when fed high-quality, error-free input.

Additionally, clean datasets enhance the interpretability of findings, aiding in the formulation
of actionable insights.
Data Cleaning in Data Science
Data clean-up is an integral component of data science, playing a fundamental role in
ensuring the accuracy and reliability of datasets. In the field of data science, where insights
and predictions are drawn from vast and complex datasets, the quality of the input data
significantly influences the validity of analytical results. Data cleaning involves the
systematic identification and correction of errors, inconsistencies, and inaccuracies within a
dataset, encompassing tasks such as handling missing values, removing duplicates, and
addressing outliers. This meticulous process is essential for enhancing the integrity of
analyses, promoting more accurate modeling, and ultimately facilitating informed decision-
making based on trustworthy and high-quality data.
Steps to Perform Data Cleaning
Performing data cleaning involves a systematic process to identify and rectify errors,
inconsistencies, and inaccuracies in a dataset. The following are essential steps to perform
data cleaning.

[Figure: Data Cleaning]

• Removal of Unwanted Observations: Identify and eliminate irrelevant or
redundant observations from the dataset. The step involves scrutinizing data
entries for duplicate records, irrelevant information, or data points that do not
contribute meaningfully to the analysis. Removing unwanted observations
streamlines the dataset, reducing noise and improving the overall quality.
• Fixing Structure errors: Address structural issues in the dataset, such as
inconsistencies in data formats, naming conventions, or variable types.
Standardize formats, correct naming discrepancies, and ensure uniformity in data
representation. Fixing structure errors enhances data consistency and facilitates
accurate analysis and interpretation.
• Managing Unwanted outliers: Identify and manage outliers, which are data
points significantly deviating from the norm. Depending on the context, decide
whether to remove outliers or transform them to minimize their impact on
analysis. Managing outliers is crucial for obtaining more accurate and reliable
insights from the data.
• Handling Missing Data: Devise strategies to handle missing data effectively.
This may involve imputing missing values based on statistical methods, removing
records with missing values, or employing advanced imputation techniques.
Handling missing data ensures a more complete dataset, preventing biases and
maintaining the integrity of analyses.
How to Perform Data Cleaning
Performing data cleansing involves a systematic approach to enhance the quality and
reliability of a dataset. The process begins with a thorough understanding of the data,
inspecting its structure and identifying issues such as missing values, duplicates, and outliers.
Addressing missing data involves strategic decisions on imputation or removal, while
duplicates are systematically eliminated to reduce redundancy. Managing outliers ensures
that extreme values do not unduly influence analysis. Structural errors are corrected to
standardize formats and variable types, promoting consistency.
Throughout the process, documentation of changes is crucial for transparency and
reproducibility. Iterative validation and testing confirm the effectiveness of the data cleansing
steps, ultimately resulting in a refined dataset ready for meaningful analysis and insights.
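The four steps described above can be sketched on a small made-up table. The column names, the values, the 1.5 x IQR rule for outliers, and the median fill for missing ages are all illustrative choices on our part; a real dataset may call for different ones.

```python
import numpy as np
import pandas as pd

# Made-up records with a duplicate, inconsistent labels,
# a missing age and an implausible age
df = pd.DataFrame({
    "name": ["Asha", "Asha", "Ravi", "Meena", "John", "Sara"],
    "sex": ["female", "female", "Male ", "FEMALE", "male", "female"],
    "age": [22.0, 22.0, np.nan, 35.0, 28.0, 200.0],
})

# 1. Remove unwanted observations: drop exact duplicate rows
df = df.drop_duplicates().reset_index(drop=True)

# 2. Fix structure errors: standardise the text labels
df["sex"] = df["sex"].str.strip().str.lower()

# 3. Manage outliers: keep ages within 1.5 * IQR of the quartiles
#    (rows with a missing age are kept for the next step)
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr) | df["age"].isna()
df = df[in_range].copy()

# 4. Handle missing data: fill the missing age with the median
df["age"] = df["age"].fillna(df["age"].median())

print(df)
```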
Python Implementation for Data Cleaning
Let's understand each step of data cleaning using the Titanic dataset. Below are the
necessary steps:
• Import the necessary libraries
• Load the dataset
• Check the data information using df.info()


import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('titanic.csv')
df.head()

Output:
   PassengerId  Survived  Pclass  Name                                               Sex      Age  SibSp  Parch  Ticket               Fare  Cabin  Embarked
0            1         0       3  Braund, Mr. Owen Harris                            male    22.0      1      0  A/5 21171          7.2500  NaN    S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  C85    C
2            3         1       3  Heikkinen, Miss. Laina                             female  26.0      0      0  STON/O2. 3101282   7.9250  NaN    S
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  113803            53.1000  C123   S
4            5         0       3  Allen, Mr. William Henry                           male    35.0      0      0  373450             8.0500  NaN    S
Data Inspection and Exploration
Let's first understand the data by inspecting its structure and identifying missing values,
outliers, and inconsistencies. We start by checking for duplicate rows with the Python code
below:

df.duplicated()

Output:
0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Length: 891, dtype: bool
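Scanning a long Boolean column by eye is impractical, so the result of duplicated() is usually summarised. The sketch below uses a small made-up frame (not the Titanic data) to count duplicate rows and to display them.

```python
import pandas as pd

# Made-up frame in which the second and third rows are identical
sample = pd.DataFrame({"PassengerId": [1, 2, 2, 3],
                       "Sex": ["male", "female", "female", "male"]})

# True counts as 1, so the sum is the number of repeated rows
n_dupes = int(sample.duplicated().sum())
print("duplicate rows:", n_dupes)

# keep=False marks every member of a duplicate group, not only the repeats
dupe_rows = sample[sample.duplicated(keep=False)]
print(dupe_rows)
```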
Check the data information using df.info()

df.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
From the above data info, we can see that Age, Cabin, and Embarked have fewer non-null values
than the 891 entries, so they contain missing values. Some of the columns are categorical and
have the object data type, while others hold integer or float values.
Check the Categorical and Numerical Columns.

# Categorical columns
cat_col = [col for col in df.columns if df[col].dtype == 'object']
print('Categorical columns :',cat_col)
# Numerical columns
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Numerical columns :',num_col)

Output:
Categorical columns : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
Numerical columns : ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
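The same split can be written with select_dtypes, which chooses columns by data type. The sketch below uses a small made-up frame; selecting the non-numeric columns with exclude="number" is our choice here, so that the result does not depend on how a given pandas version labels text columns.

```python
import pandas as pd

# Made-up frame with one text column and two numeric columns
sample = pd.DataFrame({"Name": ["A", "B"],
                       "Age": [22.0, 38.0],
                       "Pclass": [3, 1]})

# Numeric columns (integers and floats)
num_col = sample.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as categorical here
cat_col = sample.select_dtypes(exclude="number").columns.tolist()

print("Categorical columns :", cat_col)
print("Numerical columns :", num_col)
```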
Check the total number of Unique Values in the Categorical Columns

df[cat_col].nunique()

Output:
Name 891
Sex 2
Ticket 681
Cabin 147
Embarked 3
dtype: int64
Steps to Perform Data Cleansing
Removal of Unwanted Observations
This includes deleting duplicate, redundant, or irrelevant values from your dataset. Duplicate
observations most frequently arise during data collection, and irrelevant observations are
those that don’t actually fit the specific problem that you’re trying to solve.
 Redundant observations distort the analysis to a great extent: because the data repeats,
it can push the results towards the correct side or towards the incorrect side, thereby
producing unfaithful results.
 Irrelevant observations are any type of data that is of no use to us and can be
removed directly.
Which columns matter depends on the subject of the analysis, so we have to decide which
features are relevant to the problem.
Most machine learning models cannot work with raw text, so categorical columns must be either
dropped or converted to numerical values. Here we drop the Name column, because a name is
unique to each passenger and has little influence on the target variable. For Ticket, let’s
first print the first 50 unique values.
 Python3

df['Ticket'].unique()[:50]

Output:
array(['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450',
'330877', '17463', '349909', '347742', '237736', 'PP 9549',
'113783', 'A/5. 2151', '347082', '350406', '248706', '382652',
'244373', '345763', '2649', '239865', '248698', '330923', '113788',
'347077', '2631', '19950', '330959', '349216', 'PC 17601',
'PC 17569', '335677', 'C.A. 24579', 'PC 17604', '113789', '2677',
'A./5. 2152', '345764', '2651', '7546', '11668', '349253',
'SC/Paris 2123', '330958', 'S.C./A.4. 23567', '370371', '14311',
'2662', '349237', '3101295'], dtype=object)
From the tickets above, we can observe that some values are made up of two parts. For example,
the first value ‘A/5 21171’ is a combination of ‘A/5’ and ‘21171’, and these parts may
influence our target variable. Extracting them would be a case of Feature Engineering, where
we derive new features from a column or a group of columns. In the current case, we simply
drop the “Name” and “Ticket” columns.
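Although the columns are dropped here, the feature-engineering idea mentioned above can be
sketched as follows. This is only an illustration on a few ticket strings taken from the
output above; the names ticket_prefix and ticket_number and the placeholder label "NONE"
are assumptions, not part of the original workflow.

```python
import pandas as pd

# A few ticket values copied from the unique() output above
tickets = pd.Series(["A/5 21171", "PC 17599", "113803", "STON/O2. 3101282"])

# Split each ticket once, at its last space: [prefix, number] or just [number]
parts = tickets.str.rsplit(" ", n=1)

# Hypothetical new features: the text prefix (or "NONE") and the numeric part
ticket_prefix = parts.apply(lambda p: p[0] if len(p) == 2 else "NONE")
ticket_number = parts.apply(lambda p: p[-1])

print(pd.DataFrame({"prefix": ticket_prefix, "number": ticket_number}))
```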
Drop Name and Ticket Columns
 Python3

df1 = df.drop(columns=['Name','Ticket'])
df1.shape

Output:
(891, 10)
Handling Missing Data
Missing data is a common issue in real-world datasets, and it can occur due to various
reasons such as human errors, system failures, or data collection issues. Various techniques
can be used to handle missing data, such as imputation, deletion, or substitution.
Let’s check the percentage of missing values in each column. df.isnull() checks whether each
value is null and returns Boolean values, and .sum() then counts the null values in each
column. Dividing that count by the total number of rows in the dataset and multiplying by 100
gives the percentage, i.e. how many values out of every 100 are null.
 Python3

round((df1.isnull().sum()/df1.shape[0])*100,2)

Output:
PassengerId 0.00
Survived 0.00
Pclass 0.00
Sex 0.00
Age 19.87
SibSp 0.00

Parch 0.00
Fare 0.00
Cabin 77.10
Embarked 0.22
dtype: float64
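As a cross-check of the calculation above, the same percentages can be obtained from the
mean of the Boolean mask, because the mean of True/False values is the fraction of True
values. The frame below is a small made-up example, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Made-up frame: Age is missing in 2 of 4 rows, Fare is complete
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Fraction of null values per column, expressed as a percentage
missing_pct = (toy.isnull().mean() * 100).round(2)
print(missing_pct)
```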
We should not simply ignore missing observations or remove them without thought. They must be
handled carefully, because the fact that a value is missing can itself indicate something
important.
The two most common ways to deal with missing data are:
 Dropping observations with missing values.
 The fact that the value was missing may be informative in itself.
 Plus, in the real world, you often need to make predictions on new data
even if some of the features are missing!
The result above shows that Cabin has 77.10% null values, Age has 19.87%, and Embarked has
0.22%.
Filling in 77% of a column would mean that most of its values are guesses, so we drop the
Cabin column entirely. Embarked has only 0.22% null values, so we drop just the rows in which
Embarked is null.
 Python3

df2 = df1.drop(columns='Cabin')
df2.dropna(subset=['Embarked'], axis=0, inplace=True)
df2.shape

Output:
(889, 9)
 Imputing the missing values from past observations.
 Again, “missingness” is almost always informative in itself, and you
should tell your algorithm if a value was missing.
 Even if you build a model to impute your values, you’re not adding
any real information. You’re just reinforcing the patterns already
provided by other features.
For the Age column, we can use either mean imputation or median imputation.
Note:
 Mean imputation is suitable when the data is normally distributed and has no
extreme outliers.
 Median imputation is preferable when the data contains outliers or is skewed.

 Python3

# Mean imputation: fill the missing Age values with the mean age
df3 = df2.fillna({'Age': df2['Age'].mean()})
# Let's check the null values again
df3.isnull().sum()

Output:
PassengerId 0
Survived 0
Pclass 0

Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 0
dtype: int64
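The note above also mentions median imputation. A minimal sketch of it, on a made-up Age
column with one extreme value, shows why the median is the safer choice for skewed data.
The values are assumptions chosen only to make the difference visible.

```python
import numpy as np
import pandas as pd

# Made-up ages: one missing value and one extreme value (80)
ages = pd.DataFrame({"Age": [22.0, 24.0, np.nan, 26.0, 80.0]})

mean_age = ages["Age"].mean()      # pulled upwards by the extreme value
median_age = ages["Age"].median()  # middle of the observed values

# Median imputation: fill the missing Age with the median
imputed = ages.copy()
imputed["Age"] = imputed["Age"].fillna(median_age)

print("Mean:", mean_age, "Median:", median_age)
print(imputed)
```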
Handling Outliers
Outliers are extreme values that deviate significantly from the majority of the data. They can
negatively impact the analysis and model performance. Techniques such as clustering,
interpolation, or transformation can be used to handle outliers.
To check for outliers, we generally use a box plot. A box plot, also referred to as a
box-and-whisker plot, is a graphical representation of a dataset’s distribution. It shows a
variable’s median, quartiles, and potential outliers. The line inside the box denotes the
median, while the box itself spans the interquartile range (IQR), from the first quartile (Q1)
to the third quartile (Q3). The whiskers extend to the most extreme values that lie within
1.5 times the IQR of the box edges. Individual points beyond the whiskers are considered
potential outliers. A box plot gives an easy-to-read overview of the range of the data and
makes it possible to spot outliers or skewness in the distribution.
Let’s plot the box plot for Age column data.
 Python3

import matplotlib.pyplot as plt

plt.boxplot(df3['Age'], vert=False)
plt.ylabel('Variable')
plt.xlabel('Age')
plt.title('Box Plot')
plt.show()

Output:

[Figure: box plot of the Age column, titled "Box Plot"]

As the box-and-whisker plot above shows, the Age column contains outliers: roughly, the values
below 5 and above 55. The code below removes them using bounds of two standard deviations on
either side of the mean.
 Python3

# Calculate summary statistics
mean = df3['Age'].mean()
std = df3['Age'].std()

# Calculate the lower and upper bounds (mean ± 2 standard deviations)
lower_bound = mean - std*2
upper_bound = mean + std*2

print('Lower Bound :',lower_bound)
print('Upper Bound :',upper_bound)

# Drop the outliers
df4 = df3[(df3['Age'] >= lower_bound)
          & (df3['Age'] <= upper_bound)]

Output:

Lower Bound : 3.705400107925648
Upper Bound : 55.578785285332785
Similarly, we can remove the outliers of the remaining columns.
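The code above uses the mean and standard deviation, whereas the box plot flags points using
the 1.5 × IQR rule. A minimal sketch of that alternative rule is given below on a made-up
series of ages; the data and variable names are assumptions used only for illustration.

```python
import pandas as pd

# Made-up ages with one obvious extreme value
ages = pd.Series([20, 22, 24, 26, 28, 30, 90], name="Age")

# Quartiles and interquartile range
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Box-plot style bounds: 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values inside the bounds
kept = ages[(ages >= lower) & (ages <= upper)]
print("Bounds:", lower, upper)
print(kept.tolist())
```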
Data Transformation
Data transformation involves converting the data from one form to another to make it more
suitable for analysis. Techniques such as normalization, scaling, or encoding can be used to
transform the data.
Data validation and verification
Data validation and verification involve ensuring that the data is accurate and consistent by
comparing it with external sources or expert knowledge.
For machine learning prediction, we first separate the independent features from the target.
Here we use ‘Pclass’, ‘Sex’, ‘Age’, ‘SibSp’, ‘Parch’, ‘Fare’ and ‘Embarked’ as the
independent features and Survived as the target variable. PassengerId is left out because it
is only an identifier and does not affect survival.
 Python3

X = df3[['Pclass','Sex','Age', 'SibSp','Parch','Fare','Embarked']]
Y = df3['Survived']

Data formatting
Data formatting involves converting the data into a standard format or structure that can be
easily processed by the algorithms or models used for analysis. Here we will discuss
commonly used data formatting techniques i.e. Scaling and Normalization.
Scaling
 Scaling involves transforming the values of features to a specific range. It
maintains the shape of the original distribution while changing the scale.
 Particularly useful when features have different scales, and certain algorithms are
sensitive to the magnitude of the features.
 Common scaling methods include Min-Max scaling and Standardization (Z-score
scaling).
Min-Max Scaling: Min-Max scaling rescales the values to a specified range, typically
between 0 and 1. It preserves the original distribution and ensures that the minimum value
maps to 0 and the maximum value maps to 1.
 Python3

from sklearn.preprocessing import MinMaxScaler

# initialising the MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))

# Numerical columns
num_col_ = [col for col in X.columns if X[col].dtype != 'object']
# work on a copy so that X itself is left unchanged
x1 = X.copy()
# learning the statistical parameters of each numerical column and transforming it
x1[num_col_] = scaler.fit_transform(x1[num_col_])
x1.head()

Output:
Pclass Sex Age SibSp Parch Fare Embarked
0 1.0 male 0.271174 0.125 0.0 0.014151 S
1 0.0 female 0.472229 0.125 0.0 0.139136 C
2 1.0 female 0.321438 0.000 0.0 0.015469 S
3 0.0 female 0.434531 0.125 0.0 0.103644 S
4 1.0 male 0.434531 0.000 0.0 0.015713 S
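For the default range of 0 to 1, Min-Max scaling amounts to (x - min) / (max - min). The
sketch below applies that formula by hand with pandas to a made-up Fare column, as a way to
see what the scaler computes; it is not a replacement for the MinMaxScaler code above.

```python
import pandas as pd

# Made-up fares chosen so that the scaled values are easy to read
fares = pd.DataFrame({"Fare": [10.0, 20.0, 40.0, 50.0]})

col = fares["Fare"]
# Minimum maps to 0, maximum maps to 1, everything else falls in between
fares["Fare_scaled"] = (col - col.min()) / (col.max() - col.min())

print(fares)
```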
Standardization (Z-score scaling): Standardization transforms the values to have a mean of
0 and a standard deviation of 1. It centers the data around the mean and scales it based on the
standard deviation. Standardization makes the data more suitable for algorithms that assume a
Gaussian distribution or require features to have zero mean and unit variance.
Z = (X - μ) / σ
Where,
 X = Data
 μ = Mean value of X
 σ = Standard deviation of X
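The formula can be sketched directly in pandas. The example below assumes the population
standard deviation (ddof=0), which is also what scikit-learn's StandardScaler uses; the
numbers are made up so that μ and σ come out as round values.

```python
import pandas as pd

# Made-up data with mean 5 and population standard deviation 2
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()          # μ, the mean of X
sigma = x.std(ddof=0)  # σ, the population standard deviation of X

# Z = (X - μ) / σ
z = (x - mu) / sigma

print(z.tolist())
print("mean of Z:", z.mean(), "std of Z:", z.std(ddof=0))
```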

Working with Missing Data in Pandas


Missing data occurs when no information is provided for one or more items, or for a whole
unit. It is a very common problem in real-life datasets. In pandas, missing values are also
referred to as NA (Not Available) values. Many datasets simply arrive with missing data,
either because the information exists but was not collected or because it never existed. For
example, some users in a survey may choose not to share their income and others may choose
not to share their address, so those fields end up missing.
In pandas, missing data is represented by two values:
 None: None is a Python singleton object that is often used for missing data
in Python code.
 NaN: NaN (an acronym for Not a Number) is a special floating-point value
recognized by all systems that use the standard IEEE floating-point representation.
Pandas treats None and NaN as essentially interchangeable for indicating missing or null
values. To support this convention, pandas provides several functions for detecting,
removing, and replacing null values in a DataFrame:
 isnull()
 notnull()
 dropna()
 fillna()
 replace()
 interpolate()
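A small sketch of how None and NaN are treated alike is given below. The series are made-up
examples; the point is only that isnull() and notnull() flag both kinds of missing value.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is stored as NaN
s = pd.Series([1.0, None, np.nan, 4.0])
mask = s.isnull()

# In a text Series, None is still detected as missing
names = pd.Series(["Asha", None, "Ravi"])
name_mask = names.isnull()

print(s.dtype)
print(mask.tolist())
print(name_mask.tolist())
```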
The CSV-based examples below read a file named employees.csv.

Checking for missing values using isnull() and notnull()

To check for missing values in a pandas DataFrame, we use the functions isnull() and
notnull(). Both report, element by element, whether a value is missing. They can also be
applied to a pandas Series.
Checking for missing values using isnull()
The isnull() function returns a DataFrame of Boolean values that are True wherever the
original value is NaN.
Code #1:

# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from the dictionary
df = pd.DataFrame(scores)

# using isnull() function
df.isnull()

Output:
Code #2:

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# creating bool series True for NaN values
bool_series = pd.isnull(data["Gender"])

# filtering data
# displaying data only with Gender = NaN
data[bool_series]

Output: As shown in the output image, only the rows having Gender = NULL are displayed.


Checking for missing values using notnull()


To find the values that are not null in a pandas DataFrame, we use the notnull() function.
It returns a DataFrame of Boolean values that are True for non-missing values and False for
NaN values.
Code #3:

# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe using dictionary
df = pd.DataFrame(scores)

# using notnull() function
df.notnull()

Output:

Code #4:

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# creating bool series True for non-NaN values
bool_series = pd.notnull(data["Gender"])

# filtering data
# displaying data only with Gender = Not NaN
data[bool_series]

Output: As shown in the output image, only the rows having Gender = NOT NULL are
displayed.


Filling missing values using fillna(), replace() and interpolate()

To fill null values in a dataset, we use the fillna(), replace() and interpolate() functions.
Each of them replaces the NaN values of a DataFrame with some other value. The interpolate()
function also fills NA values, but instead of a hard-coded value it uses one of several
interpolation techniques to estimate the missing values.
Code #1: Filling null values with a single value

# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from dictionary
df = pd.DataFrame(scores)

# filling missing value using fillna()
df.fillna(0)

Output:


Code #2: Filling null values with the previous ones

# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from dictionary
df = pd.DataFrame(scores)

# filling each missing value with the previous one in its column
# (forward fill; fillna(method='pad') is deprecated in recent pandas)
df.ffill()

Output:

Code #3: Filling null value with the next ones


 Python

# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from dictionary
df = pd.DataFrame(scores)

# filling each missing value with the next one in its column
# (backward fill; fillna(method='bfill') is deprecated in recent pandas)
df.bfill()

Output:

Code #4: Filling null values in CSV File

# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("employees.csv")
# Printing rows 10 to 24 of
# the data frame for visualization
data[10:25]

Output

Now we are going to fill all the null values in Gender column with “No Gender”

# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("employees.csv")
# filling the null values in Gender using fillna()
# (assigning the result back avoids relying on inplace=True on a column selection)
data["Gender"] = data["Gender"].fillna("No Gender")
data

Output:

Code #5: Filling a null values using replace() method

# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("employees.csv")
# Printing rows 10 to 24 of
# the data frame for visualization
data[10:25]

Output:


Now we are going to replace all the NaN values in the data frame with the value -99.

# importing pandas package
import pandas as pd
# importing numpy, needed for np.nan
import numpy as np
# making data frame from csv file
data = pd.read_csv("employees.csv")
# will replace NaN values in dataframe with value -99
data.replace(to_replace = np.nan, value = -99)

Output:

Code #6: Using interpolate() function to fill the missing values using linear method.


# importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.DataFrame({"A":[12, 4, 5, None, 1],
                   "B":[None, 2, 54, 3, None],
                   "C":[20, 16, None, 3, 8],
                   "D":[14, 3, None, None, 6]})

# Print the dataframe
df

Output:

Let’s interpolate the missing values using the linear method. Note that the linear method
ignores the index and treats the values as equally spaced.

# to interpolate the missing values


df.interpolate(method ='linear', limit_direction ='forward')

Output:

As we can see in the output, the missing value in the first row could not be filled, because
the direction of filling is forward and there is no previous value that could be used in the
interpolation.
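The effect of the fill direction can be sketched on a short made-up series. With
limit_direction='forward' a leading NaN stays missing, while limit_direction='both' also
fills it, using the nearest valid value. The series below is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Made-up series with a missing value at the start and one in the middle
s = pd.Series([np.nan, 2.0, np.nan, 4.0])

# Forward only: the gap between 2 and 4 is filled, the leading NaN is not
forward = s.interpolate(method="linear", limit_direction="forward")

# Both directions: the leading NaN is filled as well
both = s.interpolate(method="linear", limit_direction="both")

print(forward.tolist())
print(both.tolist())
```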

Dropping missing values using dropna()

To drop null values from a DataFrame, we use the dropna() function. It can drop the rows or
the columns that contain null values, in several different ways.

Code #1: Dropping rows with at least 1 null value.

# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score':[52, 40, 80, 98],
          'Fourth Score':[np.nan, np.nan, np.nan, 65]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)

df

Output

Now we drop the rows with at least one NaN value (null value).

# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score':[52, 40, 80, 98],
          'Fourth Score':[np.nan, np.nan, np.nan, 65]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)

# using dropna() function
df.dropna()


Output:
Code #2: Dropping rows if all values in that row are missing.

# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score':[52, np.nan, 80, 98],
          'Fourth Score':[np.nan, np.nan, np.nan, 65]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)

df

Output

Now we drop the rows in which every value is missing (NaN).

# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score':[52, np.nan, 80, 98],
          'Fourth Score':[np.nan, np.nan, np.nan, 65]}

df = pd.DataFrame(scores)

# using dropna() function
df.dropna(how = 'all')

Output:

Code #3: Dropping columns with at least 1 null value.

# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score':[52, np.nan, 80, 98],
          'Fourth Score':[60, 67, 68, 65]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)

df

Output

Now we drop the columns that have at least 1 missing value.

# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score':[52, np.nan, 80, 98],
          'Fourth Score':[60, 67, 68, 65]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)

# using dropna() function
df.dropna(axis = 1)

Output :

Code #4: Dropping Rows with at least 1 null value in CSV file

# importing pandas module
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# making new data frame with dropped NA values
new_data = data.dropna(axis = 0, how ='any')

new_data

Output:


Now we compare sizes of data frames so that we can come to know how many rows had at
least 1 Null value

 Python

print("Old data frame length:", len(data))
print("New data frame length:", len(new_data))
print("Number of rows with at least 1 NA value: ", (len(data) - len(new_data)))

Output :
Old data frame length: 1000
New data frame length: 764
Number of rows with at least 1 NA value: 236
Since the difference is 236, there were 236 rows which had at least 1 Null value in any
column.
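Besides how and axis, dropna() also accepts subset and thresh, which give finer control over
which rows are dropped. The sketch below uses a made-up score table similar to the ones above;
the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up scores: row 1 is entirely missing, row 2 is missing one value
df = pd.DataFrame({
    "First Score": [100, np.nan, np.nan, 95],
    "Second Score": [30, np.nan, 45, 56],
    "Third Score": [52, np.nan, 80, 98],
})

# subset: only look at the listed columns when deciding what to drop
only_first = df.dropna(subset=["First Score"])

# thresh: keep rows that have at least this many non-null values
at_least_two = df.dropna(thresh=2)

print(only_first)
print(at_least_two)
```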
