Python For Data Science - Mr. Vinumon V S - Unit II 2
Lists
Python lists are dynamically sized arrays, comparable to vector in C++ and ArrayList in Java.
In simple terms, a Python list is a collection of items enclosed in [ ] and separated by commas.
The list is a sequence data type used to store a collection of data. Tuples and strings are
other sequence data types.
Example of list in Python
Here we create a Python list using [].
Var = ["Python", "for", "Python"]
print(Var)
Output:
["Python", "for", "Python"]
Lists are among the simplest containers and are an integral part of the Python language. A list
need not be homogeneous: a single list may contain integers, strings, and other objects, which
makes lists very flexible. Lists are mutable, so they can be altered after their creation.
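The two properties described above (mixed element types and mutability) can be illustrated with a minimal sketch. The name mixed and its values are hypothetical and chosen only for this illustration.

```python
# A list may mix types; the values here are illustrative only.
mixed = [10, "Python", 3.5, [1, 2]]

# Lists are mutable: items can be replaced or changed in place.
mixed[0] = 99
mixed[3].append(3)

print(mixed)   # [99, 'Python', 3.5, [1, 2, 3]]
```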
Creating a List in Python
Lists in Python can be created by placing a comma-separated sequence of items inside square
brackets []. Unlike an empty set, which requires the built-in set() function, a list needs no
built-in function for its creation.
Note: Unlike sets, a list may contain mutable elements.
# Creating a List
List = []
print("Blank List: ")
print(List)
# Creating a List of numbers
List = [10, 20, 14]
print("\nList of numbers: ")
print(List)
Output
Blank List:
[]
List of numbers:
[10, 20, 14]
List Items:
Python
Data Science
Complexities for Creating Lists
Time Complexity: O(1)
Space Complexity: O(n)
A list may contain duplicate values with their distinct positions and hence, multiple distinct
or duplicate values can be passed as a sequence at the time of list creation.
Output
List with the use of Numbers:
[1, 2, 4, 4, 3, 3, 3, 6, 5]
Output
Accessing a element from the list
Python
Python
Example 2: Accessing elements from a multi-dimensional list
Output
Accessing a element from a Multi-Dimensional list
For
DATA
Negative indexing
In Python, negative sequence indexes represent positions counted from the end of the list.
Instead of computing the offset as in List[len(List)-3], it is enough to write List[-3].
Here -1 refers to the last item, -2 refers to the second-last item, and so on.
Output
Accessing element using negative indexing
Python
For
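The negative-indexing rule described above can be checked with a short sketch. The list items and its values are assumptions made for illustration and are not the data used in the original example.

```python
# Illustrative list; the values are assumptions, not the notes' original data.
items = ["Python", "For", "Data", "Science"]

print(items[-1])                           # last item: Science
print(items[-3])                           # third item from the end: For
print(items[len(items) - 3] == items[-3])  # True: both spellings name the same element
```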
# Creating a List
List1 = []
print(len(List1))
# Creating a List of numbers
List2 = [10, 20, 14]
print(len(List2))
Output
0
3
Example 1:
Output:
Enter elements: PYTHON FOR DATA
The list is: ['PYTHON', 'FOR', 'DATA']
Example 2:
Output:
Enter the size of list : 4
Enter the integer elements: 6 3 9 10
The list is: [6, 3, 9, 10]
Elements can be added to a list with the built-in append() method. append() adds only one
element at a time; to add multiple elements with append(), use a loop. Because a list accepts
elements of any type, a tuple can be appended as a single element, and, unlike with sets, another
list can also be appended to an existing list.
# Creating a List
List = []
print("Initial blank List: ")
print(List)
# Addition of Elements
# in the List
List.append(1)
List.append(2)
List.append(4)
print("\nList after Addition of Three elements: ")
print(List)
Output
Initial blank List:
[]
The append() method only adds elements at the end of the list; to add an element at a desired
position, use the insert() method. Unlike append(), which takes only one argument, insert()
requires two arguments (position, value).
# Creating a List
List = [1, 2, 3, 4]
print("Initial List: ")
print(List)
# Addition of Element at
# specific Position
# (using Insert Method)
List.insert(3, 12)
List.insert(0, 'PDS')
print("\nList after performing Insert Operation: ")
print(List)
Output
Initial List:
[1, 2, 3, 4]
Besides append() and insert(), there is one more method for adding elements: extend(), which
adds multiple elements at once to the end of the list.
Note: append() and extend() can only add elements at the end.
# Creating a List
List = [1, 2, 3, 4]
print("Initial List: ")
print(List)
Output
Initial List:
[1, 2, 3, 4]
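As a minimal sketch of how append(), insert(), and extend() differ, the following block applies all three to one list. The name numbers and the inserted values are chosen only for illustration.

```python
numbers = [1, 2, 3, 4]

numbers.append(5)          # one element, added at the end
numbers.insert(0, "PDS")   # (position, value): added at index 0
numbers.extend([8, 9])     # several elements, added at the end
numbers.append((6, 7))     # a tuple is appended as a single element

print(numbers)   # ['PDS', 1, 2, 3, 4, 5, 8, 9, (6, 7)]
```

Note that extend() unpacks its argument, while append() adds the argument as one element.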
Reversing a List
Method 1: A list can be reversed by using the reverse() method in Python.
# Reversing a list
mylist = [1, 2, 3, 4, 5, 'Data', 'Python']
mylist.reverse()
print(mylist)
Output
['Python', 'Data', 5, 4, 3, 2, 1]
Method 2: The reversed() function returns a reverse iterator, which can be converted to a list
with list().
my_list = [1, 2, 3, 4, 5]
reversed_list = list(reversed(my_list))
print(reversed_list)
Output
[5, 4, 3, 2, 1]
Removing Elements from the List
Elements can be removed from a list with the built-in remove() method, but a ValueError is
raised if the element does not exist in the list. remove() removes only one element at a
time; to remove a range of elements, use a loop.
Note: remove() removes only the first occurrence of the searched element.
Example 1:
# Creating a List
List = [1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12]
print("Initial List: ")
print(List)
Output
Initial List:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
# Creating a List
List = [1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12]
# Removing elements from List
# using a loop
for i in range(1, 5):
    List.remove(i)
print("\nList after Removing a range of elements: ")
print(List)
Output
List after Removing a range of elements:
[5, 6, 7, 8, 9, 10, 11, 12]
Complexities for Deleting elements in a Lists(remove() method):
Time Complexity: O(n)
Space Complexity: O(1)
The pop() method can also be used to remove and return an element from the list. By default
it removes only the last element; to remove an element from a specific position, pass the
index of the element as an argument to pop().
List = [1, 2, 3, 4, 5]
# Removing the last element
# using the pop() method
List.pop()
print("\nList after popping an element: ")
print(List)
# Removing element at a
# specific location from the
# List using the pop() method
List.pop(2)
print("\nList after popping a specific element: ")
print(List)
Output
List after popping an element:
[1, 2, 3, 4]
List after popping a specific element:
[1, 2, 4]
Slicing of a List
A slice extracts a sublist (or, for strings, a substring). There are multiple ways to print a
whole list, but to print a specific range of elements from the list we use the slice
operation, which is written with a colon (:).
To print elements from the beginning up to an index, use:
[:Index]
To print all elements except the last Index elements, use:
[:-Index]
To print elements from a specific index to the end, use:
[Index:]
To print the whole list in reverse order, use:
[::-1]
Note: To count positions from the rear end of the list, use negative indexes.
# Creating a List
List = ['P', 'Y', 'T', 'H', 'O', 'N',
'F', 'O', 'R', 'D', 'A', 'T', 'A']
print("Initial List: ")
print(List)
Output
Initial List:
['P', 'Y', 'T', 'H', 'O', 'N', 'F', 'O', 'R', 'D', 'A', 'T', 'A']
Slicing elements in a range 3-8:
['H', 'O', 'N', 'F', 'O']
Elements sliced from 5th element till the end:
['N', 'F', 'O', 'R', 'D', 'A', 'T', 'A']
Printing all elements using slice operation:
['P', 'Y', 'T', 'H', 'O', 'N', 'F', 'O', 'R', 'D', 'A', 'T', 'A']
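The slice forms listed earlier can be tried together in one short sketch. The variable name letters is an assumption; the data is the same character list used in the example above.

```python
letters = list("PYTHONFORDATA")

print(letters[3:8])     # ['H', 'O', 'N', 'F', 'O']
print(letters[5:])      # from index 5 to the end
print(letters[:-6])     # everything except the last six elements
print(letters[-6:-1])   # ['O', 'R', 'D', 'A', 'T']
print(letters[::-1])    # the whole list in reverse order
```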
# Creating a List
List = ['P', 'Y', 'T', 'H', 'O', 'N',
'F', 'O', 'R', 'D', 'A', 'T', 'A']
print("Initial List: ")
print(List)
Sliced_List = List[:-6]
print("\nElements sliced till 6th element from last: ")
print(Sliced_List)
Output
Initial List:
['P', 'Y', 'T', 'H', 'O', 'N', 'F', 'O', 'R', 'D', 'A', 'T', 'A']
Elements sliced till 6th element from last:
['P', 'Y', 'T', 'H', 'O', 'N', 'F']
Elements sliced from index -6 to -1
['O', 'R', 'D', 'A', 'T']
Printing List in reverse:
['A', 'T', 'A', 'D', 'R', 'O', 'F', 'N', 'O', 'H', 'T', 'Y', 'P']
List Comprehension
List comprehensions create new lists from other iterables such as tuples, strings, arrays, and
lists. A list comprehension consists of brackets containing an expression that is evaluated for
each element, a for loop that iterates over the elements, and an optional condition that
filters them.
Syntax:
newList = [ expression(element) for element in oldList if condition ]
Example:
Output
[1, 9, 25, 49, 81]
For better understanding, the above code is similar to as follows:
print(odd_square)
Output
[1, 9, 25, 49, 81]
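A sketch that would produce the output shown above, written both as a list comprehension and as the equivalent loop. The name odd_square appears in the notes; odd_square_loop and the range used are assumptions.

```python
# Squares of the odd numbers from 1 to 9, written as a comprehension.
odd_square = [x ** 2 for x in range(1, 11) if x % 2 == 1]

# The equivalent explicit loop.
odd_square_loop = []
for x in range(1, 11):
    if x % 2 == 1:
        odd_square_loop.append(x ** 2)

print(odd_square)   # [1, 9, 25, 49, 81]
```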
List Methods
Function       Description
pop()          Removes and returns the item at the specified index. If no index is provided, it removes and returns the last item.
reduce()       Applies the function passed as its argument cumulatively to the list elements, keeping the intermediate result, and returns only the final value (available as functools.reduce in Python 3).
ord()          Returns an integer representing the Unicode code point of the given Unicode character.
cmp()          Returns 1 if the first list is "greater" than the second list (Python 2 only; removed in Python 3).
any()          Returns True if any element of the list is true. If the list is empty, returns False.
accumulate()   Applies the function passed as its argument to the list elements and yields the intermediate results (itertools.accumulate returns an iterator).
map()          Returns the results of applying the given function to each item of a given iterable (an iterator in Python 3).
lambda         A keyword, not a function, that creates an anonymous function with any number of arguments but only one expression, which is evaluated and returned.
Python String
A string is a data structure in Python that represents a sequence of characters. It is an
immutable data type, meaning that once you have created a string, you cannot change it.
Strings are used widely in many different applications, such as storing and manipulating text
data and representing names, addresses, and other kinds of data that can be represented as text.
What is a String in Python?
Python does not have a character data type; a single character is simply a string with a
length of 1. Let's see the Python string syntax:
Syntax of String Data Type in Python
string_variable = 'Hello, world!'
Example of string data type in Python
# Creating a String
# with single Quotes
String1 = 'Welcome to the Python World'
print("String with the use of Single Quotes: ")
print(String1)
# Creating a String
# with double Quotes
String1 = "I'm a Geek"
print("\nString with the use of Double Quotes: ")
print(String1)
# Creating a String
# with triple Quotes
String1 = '''I'm a Geek and I live in a world of "Python"'''
print("\nString with the use of Triple Quotes: ")
print(String1)
Example:
In this example, we define a string in Python and access its characters using positive and
negative indexing. Index 0 refers to the first character of the string, whereas index -1
refers to the last character of the string.
# Python Program to Access
# characters of String
String1 = "PythonForDataScience"
print("Initial String: ")
print(String1)
# Python Program to
# demonstrate String slicing
# Creating a String
String1 = "PythonForDataScience"
print("Initial String: ")
print(String1)
Output:
Initial String:
PythonForDataScience
Slicing characters from 3-12:
honForDat
Slicing characters between 3rd and 2nd last character:
honForDataScien
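The indexing and slicing behaviour described above can be reproduced with a short sketch. The variable name text is an assumption; the string is the one used in the notes.

```python
text = "PythonForDataScience"

print(text[0])      # first character: P
print(text[-1])     # last character: e
print(text[3:12])   # honForDat
print(text[3:-2])   # honForDataScien
```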
gfg = "PythonForDataScience"
print(gfg[::-1])
Output:
ecneicSataDroFnohtyP
Another method is string slicing. Slice the string before the character you want to update,
then add the new character, and finally add the remaining part of the string, again by
slicing.
Example:
In this example, we use both the list method and the string slicing method to update a
character. In the list method, we convert String1 to a list, change the value of a particular
element, and then convert it back to a string using the string join() method.
In the string-slicing method, we slice the string up to the character we want to update,
concatenate the new character, and finally concatenate the remaining part of the string.
#2
String3 = String1[0:2] + 'p' + String1[3:]
print(String3)
Output:
Initial String:
Hello, I'm a Geek
Updating character at 2nd Index:
Heplo, I'm a Geek
Heplo, I'm a Geek
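Both update methods described above can be sketched as follows. The names chars and String2 are assumptions; String1 and String3 follow the notes.

```python
String1 = "Hello, I'm a Geek"

# Method 1: convert to a list, change one element, join back into a string.
chars = list(String1)
chars[2] = "p"
String2 = "".join(chars)

# Method 2: slice around the character and concatenate the pieces.
String3 = String1[0:2] + "p" + String1[3:]

print(String2)   # Heplo, I'm a Geek
print(String3)   # Heplo, I'm a Geek
```

In both cases a new string object is created; String1 itself is unchanged.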
Updating Entire String
Because Python strings are immutable, we cannot update an existing string. We can only assign
a completely new value to the variable with the same name.
Example:
In this example, we first assign a value to 'String1' and then update it by assigning a
completely different value to it. We simply change its reference.
String1 = "Hello, I'm a Geek"
print("Initial String: ")
print(String1)
# Updating a String
String1 = "Welcome to the Python World"
print("\nUpdated String: ")
print(String1)
Output:
Initial String:
Hello, I'm a Geek
Updated String:
Welcome to the Python World
Deleting a character
Python strings are immutable, which means we cannot delete a character from a string. Trying
to delete a character with the del keyword raises an error. To obtain a string without a
given character, build a new string by slicing around it.
# Deleting a character
# of the String
String1 = "Hello, I'm a Geek"
print("Initial String: ")
print(String1)
String2 = String1[0:2] + String1[3:]
print("\nDeleting character at 2nd Index: ")
print(String2)
Output:
Initial String:
Hello, I'm a Geek
Deleting character at 2nd Index:
Helo, I'm a Geek
Deleting Entire String
Deletion of the entire string is possible with the del keyword. If we then try to print the
string, a NameError is raised because the name no longer exists.
# Deleting a String
# with the use of del
del String1
print("\nDeleting entire String: ")
print(String1)
Error:
Traceback (most recent call last):
File "/home/e4b8f2170f140da99d2fe57d9d8c6a94.py", line 12, in <module>
print(String1)
NameError: name 'String1' is not defined
Escape Sequencing in Python
A string that contains both single and double quotes cannot be written directly inside single
or double quotes, because the quote inside the text ends the literal early and causes a
SyntaxError. To print such a string, either use triple quotes or use escape sequences.
Escape sequences start with a backslash and are interpreted specially. If single quotes are
used to delimit a string, then all the single quotes inside the string must be escaped, and
the same applies to double quotes.
Example:
# Initial String
String1 = '''I'm a "Geek"'''
print("Initial String with use of Triple Quotes: ")
print(String1)
print(String1)
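A minimal sketch of the escaping rules described above. All variable names and the example path are hypothetical and chosen only for illustration.

```python
# Escaping quotes inside a string delimited by the same kind of quote.
single = 'I\'m a "Geek"'
double = "I'm a \"Geek\""
triple = '''I'm a "Geek"'''

# All three spellings produce the same text.
print(single)
print(single == double == triple)   # True

# A backslash itself is written as \\; a raw string avoids the escaping.
path = "C:\\Python\\Geeks"
raw_path = r"C:\Python\Geeks"
print(path == raw_path)             # True
```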
# Default order
String1 = "{} {} {}".format('Python', 'For', 'Life')
print("Print String in default order: ")
print(String1)
# Positional Formatting
String1 = "{1} {0} {2}".format('Python', 'For', 'Life')
print("\nPrint String in Positional order: ")
print(String1)
# Keyword Formatting
String1 = "{l} {f} {g}".format(g='Python', f='For', l='Life')
print("\nPrint String in order of Keywords: ")
print(String1)
Output:
Print String in default order:
Python For Life
Print String in Positional order:
For Python Life
Print String in order of Keywords:
Life For Python
Example 2:
Integers can be displayed in binary, hexadecimal, and other bases, and floats can be rounded
or displayed in exponent form, with the use of format specifiers.
# Formatting of Integers
String1 = "{0:b}".format(16)
print("\nBinary representation of 16 is ")
print(String1)
# Formatting of Floats
String1 = "{0:e}".format(165.6458)
print("\nExponent representation of 165.6458 is ")
print(String1)
# String alignment
String1 = "|{:<10}|{:^10}|{:>10}|".format('Python', 'for', 'Python')
print("\nLeft, center and right alignment with Formatting: ")
print(String1)
Integer1 = 12.3456789
print("Formatting in 3.2f format: ")
print('The value of Integer1 is %3.2f' % Integer1)
print("\nFormatting in 3.4f format: ")
print('The value of Integer1 is %3.4f' % Integer1)
Output:
Formatting in 3.2f format:
The value of Integer1 is 12.35
Formatting in 3.4f format:
The value of Integer1 is 12.3457
Useful Python String Operations
Logical Operators on String
String Formatting using %
String Template Class
Split a string
Python Docstrings
String slicing
Find all duplicate characters in string
Reverse string in Python (5 different ways)
Python program to check if a string is palindrome or not
Python String constants
Built-In Function    Description
string.isalnum       Returns True if all the characters in a given string are alphanumeric.
string.partition     Splits the string at the first occurrence of the separator and returns a tuple.
string.rindex        Returns the highest index of the substring inside the string if the substring is found.
string.lower         Returns a copy of the string with uppercase letters converted to lowercase.
string.split         Returns a list of the words of the string. If the optional second argument sep is absent or None, the words are separated by whitespace.
string.rsplit()      Returns a list of the words of the string, scanning the string from the end.
string.splitfields   Returns a list of the words of the string when only used with two arguments.
string.strip()       Returns a copy of the string with both leading and trailing white spaces removed.
string.lstrip        Returns a copy of the string with leading white spaces removed.
string.rstrip        Returns a copy of the string with trailing white spaces removed.
string.swapcase      Converts lowercase letters to uppercase and vice versa.
string.zfill         Pads a numeric string on the left with zero digits until the given width is reached.
string.encode        Encodes the string into any encoding supported by Python. The default encoding is utf-8.
Advantages of String in Python:
Python has built-in support for strings, which means that you do not need to
import any additional libraries or modules to work with strings. This makes it easy
to get started with strings and reduces the complexity of your code.
Python has a concise syntax for creating and manipulating strings, which makes it
easy to write and read code that works with strings.
Drawbacks of String in Python:
When we are dealing with large text data, strings can be inefficient. For instance,
if you need to perform a large number of operations on a string, such as replacing
substrings or splitting the string into multiple substrings, it can be slow and
consume a lot of resources, because each operation creates a new string.
Strings can be difficult to work with when you need to represent complex data
structures, such as lists or dictionaries. In these cases, it may be more efficient to
use a different data type, such as a list or a dictionary, to represent the data.
Python Tuples
A Python tuple is a collection of objects, much like a list. The values stored in a tuple can
be of any type, and they are indexed by integers. The values of a tuple are separated by
commas. Although it is not necessary, it is more common to define a tuple by enclosing the
sequence of values in parentheses, which makes tuples easier to recognize.
Creating a Tuple
Tuples are created by placing a sequence of values separated by commas, with or without
parentheses for grouping the data sequence.
Note: Creation of a Python tuple without the use of parentheses is known as tuple packing.
Python Program to Demonstrate the Addition of Elements in a Tuple.
# Creating an empty Tuple
Tuple1 = ()
print("Initial empty Tuple: ")
print(Tuple1)
# Creating a Tuple
# with the use of string
Tuple1 = ('Python', 'For')
print("\nTuple with the use of String: ")
print(Tuple1)
# Creating a Tuple
# with the use of built-in function
Tuple1 = tuple('Python')
print("\nTuple with the use of function: ")
print(Tuple1)
Output:
Initial empty Tuple:
()
# Creating a Tuple
# with Mixed Datatype
Tuple1 = (5, 'Welcome', 7, 'Python')
print("\nTuple with Mixed Datatypes: ")
print(Tuple1)
# Creating a Tuple
# with nested tuples
Tuple1 = (0, 1, 2, 3)
Tuple2 = ('python', 'geek')
Tuple3 = (Tuple1, Tuple2)
print("\nTuple with nested tuples: ")
print(Tuple3)
# Creating a Tuple
# with repetition
Tuple1 = ('Python',) * 3
print("\nTuple with repetition: ")
print(Tuple1)
# Creating a Tuple
# with the use of loop
Tuple1 = ('Python')
n = 5
print("\nTuple with a loop")
for i in range(int(n)):
    Tuple1 = (Tuple1,)
    print(Tuple1)
Output:
Tuple with Mixed Datatypes:
(5, 'Welcome', 7, 'Python')
# Accessing Tuple
# with Indexing
Tuple1 = tuple("Python")
print("\nFirst element of Tuple: ")
print(Tuple1[0])
# Tuple unpacking
Tuple1 = ("Python", "For", "Data Science")
Output:
First element of Tuple:
P
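Tuple packing and unpacking can be sketched as follows. The names packed, a, b, and c are assumptions; the values are those of the tuple in the example above.

```python
# Tuple packing: parentheses are optional.
packed = "Python", "For", "Data Science"

# Tuple unpacking: one name per element.
a, b, c = packed

print(type(packed).__name__)   # tuple
print(a, b, c)                 # Python For Data Science
```

The number of names on the left must match the number of elements in the tuple; otherwise a ValueError is raised.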
# Concatenation of tuples
Tuple1 = (0, 1, 2, 3)
Tuple2 = ('Python', 'For', 'Python')
Tuple2:
('Python', 'For', 'Python')
# Slicing of a Tuple
# with Numbers
Tuple1 = tuple('GEEKSFORGEEKS')
Printing elements between Range 4-9:
('S', 'F', 'O', 'R', 'G')
Time complexity: O(1)
Space complexity: O(1)
Deleting a Tuple
Tuples are immutable, so they do not allow deletion of individual elements. The entire tuple
can be deleted with the del statement.
Note: Printing the tuple after deletion results in an error.
# Deleting a Tuple
Tuple1 = (0, 1, 2, 3, 4)
del Tuple1
print(Tuple1)
Output
Traceback (most recent call last):
File "/home/efa50fd0709dec08434191f32275928a.py", line 7, in <module>
print(Tuple1)
NameError: name 'Tuple1' is not defined
Built-In Methods
Built-in Method     Description
index()             Finds the given value in the tuple and returns the index where it first occurs.
Built-In Functions
Built-in Function   Description
any()               Returns True if any element of the tuple is true. If the tuple is empty, returns False.
sorted()            Takes the elements of the tuple and returns a new sorted list.
Tuples VS Lists:
Similarities:
Tuples can be stored in lists.
Lists can be stored in tuples.
Both tuples and lists can be nested.
Differences:
Iterating through a tuple is faster than through a list.
Lists are mutable whereas tuples are immutable.
Tuples that contain immutable elements can be used as a key for a dictionary.
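The last difference, that a tuple of immutable elements can serve as a dictionary key while a list cannot, can be sketched as follows. The dictionary grid and its contents are hypothetical.

```python
# A tuple of immutable elements is hashable, so it can be a dictionary key.
grid = {(0, 0): "origin", (1, 2): "point A"}
print(grid[(1, 2)])   # point A

# A list is mutable and therefore unhashable, so it cannot be a key.
try:
    bad = {[1, 2]: "value"}
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)
```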
Dictionaries in Python
A Python dictionary is a data structure that stores values in key:value pairs.
Example:
Dict = {1: 'Python', 2: 'For', 3: 'Python'}
print(Dict)
As you can see from the example, data is stored in key:value pairs in dictionaries, which
makes it easier to find values.
Output:
{1: 'Python', 2: 'For', 3: 'Python'}
Output
Dictionary with the use of Integer Keys:
{1: 'Python', 2: 'For', 3: 'Python'}
Dictionary with the use of Mixed Keys:
{'Name': 'Python', 1: [1, 2, 3, 4]}
Dictionary Example
A dictionary can also be created by the built-in function dict(). An empty dictionary can be
created by just placing curly braces{}.
Different Ways to Create a Python Dictionary
The code demonstrates different ways to create dictionaries in Python. It first creates an
empty dictionary, and then shows how to create dictionaries using the dict() constructor with
key-value pairs specified within curly braces and as a list of tuples.
Dict = {}
print("Empty Dictionary: ")
print(Dict)
Output:
Empty Dictionary:
{}
Dictionary with the use of dict():
{1: 'Python', 2: 'For', 3: 'Python'}
Dictionary with each item as a pair:
{1: 'Python', 2: 'For'}
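The three ways of creating a dictionary described above can be sketched as follows. The variable names are assumptions; the values mirror the output shown in the notes.

```python
# Empty dictionary created with curly braces.
empty = {}

# dict() called on a dictionary literal.
from_literal = dict({1: "Python", 2: "For", 3: "Python"})

# dict() called on a list of (key, value) tuples.
from_pairs = dict([(1, "Python"), (2, "For")])

print(empty)          # {}
print(from_literal)   # {1: 'Python', 2: 'For', 3: 'Python'}
print(from_pairs)     # {1: 'Python', 2: 'For'}
```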
Example: The code defines a nested dictionary named 'Dict' with multiple levels of key-value
pairs. It includes a top-level dictionary with keys 1, 2, and 3. The value associated with
key 3 is another dictionary with keys 'A', 'B', and 'C'. This showcases how Python
dictionaries can be nested to create hierarchical data structures.
Dict = {1: 'Python', 2: 'For',
        3: {'A': 'Welcome', 'B': 'To', 'C': 'Python'}}
print(Dict)
Output:
{1: 'Python', 2: 'For', 3: {'A': 'Welcome', 'B': 'To', 'C': 'Python'}}
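Dict = {}
print("Empty Dictionary: ")
print(Dict)
Dict[0] = 'Python'
Dict[2] = 'For'
Dict[3] = 1
print("\nDictionary after adding 3 elements: ")
print(Dict)
Dict['Value_set'] = 2, 3, 4
print("\nDictionary after adding 3 elements: ")
print(Dict)
# Updating the value of an existing key
Dict[2] = 'Welcome'
print("\nUpdated key value: ")
print(Dict)
# Adding a nested dictionary as a value
Dict[5] = {'Nested': {'1': 'Life', '2': 'Python'}}
print("\nAdding a Nested Key: ")
print(Dict)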
Output:
Empty Dictionary:
{}
Dictionary after adding 3 elements:
{0: 'Python', 2: 'For', 3: 1}
Dictionary after adding 3 elements:
{0: 'Python', 2: 'For', 3: 1, 'Value_set': (2, 3, 4)}
Updated key value:
{0: 'Python', 2: 'Welcome', 3: 1, 'Value_set': (2, 3, 4)}
Adding a Nested Key:
{0: 'Python', 2: 'Welcome', 3: 1, 'Value_set': (2, 3, 4), 5:
{'Nested': {'1': 'Life', '2': 'Python'}}}
Output:
Accessing a element using key:
For
Accessing a element using key:
Python
There is also a method called get() for accessing an element of a dictionary. It accepts a
key as its argument and returns the corresponding value, or None (or a supplied default) if
the key is absent.
Complexities for Accessing elements in a Dictionary:
Time complexity: O(1)
Space complexity: O(1)
Example: Access a Value in Dictionary using get() in Python
The code demonstrates accessing a dictionary element using the get() method. It retrieves and
prints the value associated with the key 3 in the dictionary ‘Dict’. This method provides a
safe way to access dictionary values, avoiding KeyError if the key doesn’t exist.
Output:
Accessing a element using get:
Geeks
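A minimal sketch of get() compared with a missing key. The dictionary contents here are hypothetical and are not the data that produced the output above.

```python
# Hypothetical dictionary used only for this illustration.
Dict = {1: "Python", "name": "For", 3: "Data"}

print(Dict.get(3))                # Data
print(Dict.get(4))                # None: the key is missing, but no KeyError is raised
print(Dict.get(4, "not found"))   # a default value may be supplied
```

By contrast, Dict[4] would raise a KeyError because the key 4 does not exist.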
Dict = {'Dict1': {1: 'Python'},
        'Dict2': {'Name': 'For'}}
print(Dict['Dict1'])
print(Dict['Dict1'][1])
print(Dict['Dict2']['Name'])
Output:
{1: 'Python'}
Python
For
print("Dictionary =")
print(Dict)
del(Dict[1])
print("Data after deletion Dictionary=")
print(Dict)
Output
Dictionary Methods
Here is a list of in-built dictionary functions with their description. You can use these
functions to operate on a dictionary.
Method Description
Output:
{1: 'Python', 2: 'Java', 3: 'Ruby', 4: 'Scala'}
{}
Python
dict_items([(1, 'Python'), (2, 'Java'), (3, 'Ruby'), (4, 'Scala')])
dict_keys([1, 2, 3, 4])
{1: 'Python', 2: 'Java', 3: 'Ruby'}
{1: 'Python', 2: 'Java'}
{1: 'Python', 2: 'Java', 3: 'Scala'}
dict_values(['Python', 'Java', 'Scala'])
Python Functions
A Python function is a block of statements that performs a specific task. The idea is to put
commonly or repeatedly done tasks together in a function so that, instead of writing the same
code again and again for different inputs, we can call the function and reuse the code
contained in it.
Some Benefits of Using Functions
Increase Code Readability
Increase Code Reusability
Python Function Declaration
The syntax to declare a function is:
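As a minimal sketch of the declaration syntax, a function is introduced with the def keyword, followed by its name, a parenthesized parameter list, a colon, and an indented body. The function greet and its argument are hypothetical examples.

```python
def greet(name):
    """Return a greeting for the given name."""
    return "Hello, " + name

# Calling the function
message = greet("Python")
print(message)   # Hello, Python
```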
If you have experience in C/C++ or Java, you may be wondering about the return type of the
function and the data types of its arguments. These can be stated in Python as well, as
optional annotations (specifically in Python 3.5 and above).
Python Function Syntax with Parameters
def function_name(parameter: data_type) -> return_type:
"""Docstring"""
# body of the function
return expression
The following example uses arguments and parameters that you will learn later in this chapter
so you can come back to it again if not understood.
def add(num1: int, num2: int) -> int:
    """Add two numbers"""
    num3 = num1 + num2
    return num3
# Driver code
num1, num2 = 5, 15
ans = add(num1, num2)
print(f"The addition of {num1} and {num2} results {ans}.")
Output:
The addition of 5 and 15 results 20.
Note: The following examples are defined using syntax 1. Try to convert them to syntax 2 for
practice.
# A simple Python function to check
# whether x is even or odd
def evenOdd(x):
    if (x % 2 == 0):
        print("even")
    else:
        print("odd")
# Python program to demonstrate Keyword Arguments
def student(firstname, lastname):
    print(firstname, lastname)
# Keyword arguments
student(firstname='Python', lastname='Practice')
student(lastname='Practice', firstname='Python')
Output:
Python Practice
Python Practice
Positional Arguments
We used positional arguments during the function call so that the first argument (or value)
is assigned to name and the second argument (or value) is assigned to age. If the positions
are changed, or the order is forgotten, the values can end up in the wrong places, as in the
Case-2 example, where 27 is assigned to name and Suraj is assigned to age.
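The two cases described above can be sketched as follows. The function name nameAge and the message format are assumptions; only the values 27 and Suraj come from the text.

```python
def nameAge(name, age):
    return f"Hi, I am {name}. My age is {age}."

# Case-1: arguments passed in the intended order
case1 = nameAge("Suraj", 27)

# Case-2: order swapped, so 27 becomes the name and "Suraj" the age
case2 = nameAge(27, "Suraj")

print(case1)   # Hi, I am Suraj. My age is 27.
print(case2)   # Hi, I am 27. My age is Suraj.
```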
# *args: variable number of positional arguments
def myFun(*argv):
    for arg in argv:
        print(arg)
# **kwargs: variable number of keyword arguments
def myFun(**kwargs):
    for key, value in kwargs.items():
        print("%s == %s" % (key, value))
# Driver code
myFun(first='Python', mid='for', last='Data')
Output:
first == Python
mid == for
last == Data
Docstring
The first string literal in the body of a function is called the documentation string, or
docstring for short. It describes the functionality of the function. Using docstrings in
functions is optional, but it is considered good practice.
The syntax below can be used to print the docstring of a function.
Syntax: print(function_name.__doc__)
Example: Adding Docstring to the function
def evenOdd(x):
    """Function to check if the number is even or odd"""
    if (x % 2 == 0):
        print("even")
    else:
        print("odd")
# Driver code to call the function
print(evenOdd.__doc__)
Output:
Function to check if the number is even or odd
Python Function within Functions
A function that is defined inside another function is known as the inner function or nested
function. Nested functions can access variables of the enclosing scope. Inner functions are
used so that they can be protected from everything happening outside the function.
# Python program to
# demonstrate accessing of
# variables of nested functions
def f1():
    s = 'I love Python'
    def f2():
        print(s)
    f2()
# Driver's code
f1()
Output:
I love Python
Anonymous Functions in Python
In Python, an anonymous function means that a function is without a name. As we already
know the def keyword is used to define the normal functions and the lambda keyword is used
to create anonymous functions.
print(cube(7))
print(cube_v2(7))
Output:
343
343
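A sketch that would produce the output above, defining the same computation once with def and once with lambda. The names cube and cube_v2 appear in the notes; their bodies are reconstructed here as an assumption.

```python
# Ordinary function defined with def
def cube(x):
    return x * x * x

# Equivalent anonymous function, bound to a name for reuse
cube_v2 = lambda x: x * x * x

print(cube(7))      # 343
print(cube_v2(7))   # 343
```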
Recursive Functions in Python
Recursion in Python refers to a function calling itself. Recursive functions are often used to
solve mathematical problems and other problems that are naturally defined in terms of smaller
instances of themselves. A recursive function should be written with caution, because without
a reachable exit condition it behaves like a non-terminating loop. Always check the exit
condition when creating a recursive function.
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)
print(factorial(4))
Output
24
Here we have created a recursive function to calculate the factorial of a number. The exit
condition (base case) of this function is reached when n is equal to 0.
Return Statement in Python Function
The function return statement is used to exit from a function and go back to the function
caller and return the specified value or data item to the caller. The syntax for the return
statement is:
return [expression_list]
The return statement can consist of a variable, an expression, or a constant which is returned
at the end of the function execution. If none of the above is present with the return statement
a None object is returned.
Example: Python Function Return Statement
def square_value(num):
    """This function returns the square
    value of the entered number"""
    return num**2
print(square_value(2))
print(square_value(-4))
Output:
4
16
Pass by Reference and Pass by Value
One important thing to note is that in Python every variable name is a reference. When we
pass a variable to a function, a new reference to the same object is created. Parameter passing
in Python is the same as reference passing in Java.
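The listing for this point is not reproduced in these notes. As a minimal sketch (the names my_fun and lst are illustrative assumptions), changing a list in place inside a function is visible to the caller, because both names refer to the same list object:

```python
def my_fun(x):
    # x refers to the same list object as the caller's variable,
    # so an in-place change is visible outside the function
    x[0] = 20


lst = [10, 11, 12, 13, 14, 15]
my_fun(lst)
print(lst)
```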
Output:
[20, 11, 12, 13, 14, 15]
When we pass a reference and change the received reference to something else, the
connection between the passed and received parameters is broken. For example, consider the
below program as follows:
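def myFun(x):
    # Rebinding x to a new list breaks the link with the caller's list,
    # so the caller's list is left unchanged
    x = [20, 30, 40]
The same applies when a function swaps its parameters: only the local names are swapped,
and the caller's variables keep their values.
def swap(x, y):
    # swaps only the local names x and y
    x, y = y, x
# Driver code
x = 2
y = 3
swap(x, y)
print(x)
print(y)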
Output:
2
3
import pandas as pd
Input File: Let’s suppose the Excel file looks like this
[Figure: Sheet 1]
[Figure: Sheet 2]
Now we can import the Excel file using the read_excel function in Pandas. The second
statement reads the data from Excel and stores it in a pandas DataFrame, which is
represented by the variable df.
df = pd.read_excel('Example.xlsx')
print(df)
Output:
Roll No. English Maths Science
0 1 19 13 17
1 2 14 20 18
2 3 15 18 19
3 4 13 14 14
4 5 17 16 20
5 6 19 13 17
6 7 14 20 18
7 8 15 18 19
8 9 13 14 14
9 10 17 16 20
Loading multiple sheets using concat() method
If there are multiple sheets in the Excel workbook, the command will import data from the
first sheet only. To make a data frame with all the sheets in the workbook, the easiest method
is to create a separate data frame for each sheet and then concatenate them. The read_excel
method takes the arguments sheet_name, which specifies the sheet from which the frame
should be made, and index_col, which specifies the column to use as the row labels, as
shown below:
Example:
file = 'Example.xlsx'
sheet1 = pd.read_excel(file,
                       sheet_name = 0,
                       index_col = 0)
sheet2 = pd.read_excel(file,
                       sheet_name = 1,
                       index_col = 0)
# concatenating both the sheets
newData = pd.concat([sheet1, sheet2])
The third statement concatenates both sheets. Now to check the whole data frame, we can
simply run the following command:
print(newData)
Output:
Roll No. English Maths Science
1 19 13 17
2 14 20 18
3 15 18 19
4 13 14 14
5 17 16 20
6 19 13 17
7 14 20 18
8 15 18 19
9 13 14 14
10 17 16 20
1 14 18 20
2 11 19 18
3 12 18 16
4 15 18 19
5 13 14 14
6 14 18 20
7 11 19 18
8 12 18 16
9 15 18 19
10 13 14 14
print(newData.head())
print(newData.tail())
Output:
English Maths Science
Roll No.
1 19 13 17
2 14 20 18
3 15 18 19
4 13 14 14
5 17 16 20
English Maths Science
Roll No.
6 14 18 20
7 11 19 18
8 12 18 16
9 15 18 19
10 13 14 14
shape attribute
The shape attribute can be used to view the number of rows and columns in the data frame as
follows:
newData.shape
Output:
(20, 3)
sort_values() method in Pandas
If any column contains numerical data, we can sort the data frame by that column using
the sort_values() method in pandas as follows:
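The sorting statement itself is not reproduced in these notes. A minimal sketch, assuming the result is stored in a variable named sorted_column and that the sort is on the English column in descending order, using a small stand-in frame in place of newData:

```python
import pandas as pd

# Small stand-in frame; in the notes this would be newData
marks = pd.DataFrame(
    {"English": [19, 14, 15, 13, 17], "Maths": [13, 20, 18, 14, 16]},
    index=pd.Index([1, 2, 3, 4, 5], name="Roll No."),
)

# Sort the rows by the English column, highest marks first
sorted_column = marks.sort_values(["English"], ascending=False)
print(sorted_column)
```

Passing ascending=True (the default) would instead put the lowest marks first.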
Now, let’s suppose we want the top 5 values of the sorted column, we can use the head()
method here:
sorted_column.head(5)
Output:
English Maths Science
Roll No.
1 19 13 17
6 19 13 17
5 17 16 20
10 17 16 20
3 15 18 19
We can do that with any numerical column of the data frame as shown below:
newData['Maths'].head()
Output:
Roll No.
1 13
2 20
3 18
4 14
5 16
Name: Maths, dtype: int64
Pandas Describe() method
Now, suppose our data is mostly numerical. We can get the statistical information like mean,
max, min, etc. about the data frame using the describe() method as shown below:
newData.describe()
Output:
English Maths Science
count 20.00000 20.000000 20.000000
mean 14.30000 16.800000 17.500000
std 2.29645 2.330575 2.164304
min 11.00000 13.000000 14.000000
25% 13.00000 14.000000 16.000000
50% 14.00000 18.000000 18.000000
75% 15.00000 18.000000 19.000000
max 19.00000 20.000000 20.000000
This can also be done separately for all the numerical columns using the following
command:
newData['English'].mean()
Output:
14.3
Other statistical information can also be calculated using the respective methods. Like in
Excel, formulas can also be applied, and calculated columns can be created as follows:
newData['Total Marks'] = (newData["English"] + newData["Maths"] +
                          newData["Science"])
newData['Total Marks'].head()
Output:
Roll No.
1 49
2 52
3 52
4 41
5 53
Name: Total Marks, dtype: int64
After operating on the data in the data frame, we can export the data back to an Excel file
using the method to_excel. For this, we need to specify an output Excel file where the
transformed data is to be written, as shown below:
newData.to_excel('Output File.xlsx')
# Import pandas
import pandas as pd
# reading the csv file
df = pd.read_csv("people.csv")
print(df.head())
Output:
First Name Last Name Sex Email Date of birth Job Title
0 Shelby Terrell Male [email protected] 1945-10-26 Games developer
1 Phillip Summers Female [email protected] 1910-03-24 Phytotherapist
2 Kristine Travis Male [email protected] 1992-07-02 Homeopath
3 Yesenia Martinez Male [email protected] 2017-08-03 Market researcher
4 Lori Todd Male [email protected] 1938-12-01 Veterinary surgeon
Using sep in read_csv()
In this example, we will take a CSV file and then add some special characters to see how
the sep parameter works.
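The listing for this example is not reproduced in these notes. As a small independent sketch of what sep does (the semicolon-delimited text and the name df_sep are made up for illustration and are not the file used for the output below), read_csv can be told which character separates the fields:

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data held in memory instead of a file
raw = "total_bill;tip;sex\n16.99;1.01;Female\n10.34;1.66;Male\n"

# sep tells read_csv which delimiter separates the fields
df_sep = pd.read_csv(io.StringIO(raw), sep=";")
print(df_sep)
```

With the default sep=",", the same text would be read as a single column, because no commas are present.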
Output:
totalbill tip Unnamed: 2 sex smoker Unnamed: 5 day time Unnamed: 8 size
16.99 NaN 1.01 Female No NaN Sun NaN Dinner NaN 2
10.34 NaN 1.66 NaN Male NaN No Sun Dinner NaN 3
21.01 3.50 Male NaN No Sun NaN Dinner NaN 3.0 None
23.68 NaN 3.31 NaN Male No NaN Sun Dinner NaN 2
24.59 3.61 NaN Female No NaN Sun NaN Dinner NaN 2
25.29 NaN 4.71 Male NaN No Sun NaN Dinner NaN 4
df = pd.read_csv('people.csv',
header=0,
usecols=["First Name", "Sex", "Email"])
# printing dataframe
print(df.head())
Output:
First Name Sex Email
0 Shelby Male [email protected]
1 Phillip Female [email protected]
2 Kristine Male [email protected]
3 Yesenia Male [email protected]
4 Lori Male [email protected]
df = pd.read_csv('people.csv',
header=0,
index_col=["Sex", "Job Title"],
usecols=["Sex", "Job Title", "Email"])
print(df.head())
Output:
Email
Sex Job Title
Male Games developer [email protected]
Female Phytotherapist [email protected]
Male Homeopath [email protected]
Market researcher [email protected]
Veterinary surgeon [email protected]
df = pd.read_csv('people.csv',
header=0,
index_col=["Sex", "Job Title"],
usecols=["Sex", "Job Title", "Email"],
nrows=3)
print(df)
Output:
Email
Sex Job Title
Male Games developer [email protected]
Female Phytotherapist [email protected]
Male Homeopath [email protected]
df= pd.read_csv("people.csv")
print("Previous Dataset: ")
print(df)
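# using skiprows: line numbers are 0-indexed and line 0 is the header,
# so [2, 6] skips the data rows shown above with index 1 and 5
df = pd.read_csv("people.csv", skiprows = [2, 6])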
print("Dataset After skipping rows: ")
print(df)
Output:
Previous Dataset:
First Name Last Name Sex Email Date of birth Job Title
0 Shelby Terrell Male [email protected] 1945-10-26 Games developer
1 Phillip Summers Female [email protected] 1910-03-24 Phytotherapist
2 Kristine Travis Male [email protected] 1992-07-02 Homeopath
3 Yesenia Martinez Male [email protected] 2017-08-03 Market researcher
4 Lori Todd Male [email protected] 1938-12-01 Veterinary surgeon
5 Erin Day Male [email protected] 2015-10-28 Management officer
6 Katherine Buck Female [email protected] 1989-01-22 Analyst
7 Ricardo Hinton Male [email protected] 1924-03-26 Hydrogeologist
Dataset After skipping rows:
First Name Last Name Sex Email Date of birth Job Title
0 Shelby Terrell Male [email protected] 1945-10-26 Games developer
1 Kristine Travis [email protected] 1992-07-02 Homeopath
2 Yesenia Martinez Male [email protected] 2017-08-03 Market researcher
3 Lori Todd Male [email protected] 1938-12-01 Veterinary surgeon
4 Katherine Buck Female [email protected] 1989-01-22 Analyst
5 Ricardo Hinton Male [email protected] 1924-03-26 Hydrogeologist
Collection :
The most crucial step when starting with ML is to have data of good quality and
accuracy. Data can be collected from any authenticated source
like data.gov.in, Kaggle or UCI dataset repository. For example, while preparing
for a competitive exam, students study from the best study material that they can
access so that they learn the best to obtain the best results. In the same way, high-
quality and accurate data will make the learning process of the model easier and
better and at the time of testing, the model would yield state-of-the-art results.
A huge amount of capital, time and resources are consumed in collecting data.
Organizations or researchers have to decide what kind of data they need to
execute their tasks or research.
Example: Working on a facial expression recognizer requires numerous images
showing a variety of human expressions. Good data ensures that the results of the
model are valid and can be trusted.
Preparation :
The collected data may be in a raw form that cannot be fed directly to the
machine. Preparation is therefore the process of collecting datasets from different
sources, analyzing them, and constructing a new dataset for further processing
and exploration. This preparation can be performed either manually or
automatically. Data can also be converted to numeric form, which speeds up the
model’s learning.
Example: An image can be converted to a matrix of N x N dimensions, where the
value of each cell represents an image pixel.
Input :
The prepared data may still not be in a machine-readable form, so conversion
algorithms are needed to convert it into a readable form. This task requires high
computational power and accuracy. Example: Data can be collected from sources
such as MNIST digit data (images), Twitter comments, audio files, and video
clips.
Processing :
This is the stage where algorithms and ML techniques are required to perform the
instructions provided over a large volume of data with accuracy and optimal
computation.
Output :
In this stage, the machine produces results in a meaningful form that the user can
interpret easily. Output can be in the form of reports, graphs, videos, etc.
Storage :
This is the final step, in which the obtained output, the data model, and all other
useful information are saved for future use.
Additionally, clean datasets enhance the interpretability of findings, aiding in the formulation
of actionable insights.
Data Cleaning in Data Science
Data clean-up is an integral component of data science, playing a fundamental role in
ensuring the accuracy and reliability of datasets. In the field of data science, where insights
and predictions are drawn from vast and complex datasets, the quality of the input data
significantly influences the validity of analytical results. Data cleaning involves the
systematic identification and correction of errors, inconsistencies, and inaccuracies within a
dataset, encompassing tasks such as handling missing values, removing duplicates, and
addressing outliers. This meticulous process is essential for enhancing the integrity of
analyses, promoting more accurate modeling, and ultimately facilitating informed decision-
making based on trustworthy and high-quality data.
Steps to Perform Data Cleaning
Performing data cleaning involves a systematic process to identify and rectify errors,
inconsistencies, and inaccuracies in a dataset. The following are essential steps to perform
data cleaning.
[Figure: Data Cleaning]
contribute meaningfully to the analysis. Removing unwanted observations
streamlines the dataset, reducing noise and improving the overall quality.
Fixing Structure errors: Address structural issues in the dataset, such as
inconsistencies in data formats, naming conventions, or variable types.
Standardize formats, correct naming discrepancies, and ensure uniformity in data
representation. Fixing structure errors enhances data consistency and facilitates
accurate analysis and interpretation.
Managing Unwanted outliers: Identify and manage outliers, which are data
points significantly deviating from the norm. Depending on the context, decide
whether to remove outliers or transform them to minimize their impact on
analysis. Managing outliers is crucial for obtaining more accurate and reliable
insights from the data.
Handling Missing Data: Devise strategies to handle missing data effectively.
This may involve imputing missing values based on statistical methods, removing
records with missing values, or employing advanced imputation techniques.
Handling missing data ensures a more complete dataset, preventing biases and
maintaining the integrity of analyses.
How to Perform Data Cleaning
Performing data cleansing involves a systematic approach to enhance the quality and
reliability of a dataset. The process begins with a thorough understanding of the data,
inspecting its structure and identifying issues such as missing values, duplicates, and outliers.
Addressing missing data involves strategic decisions on imputation or removal, while
duplicates are systematically eliminated to reduce redundancy. Managing outliers ensures
that extreme values do not unduly influence analysis. Structural errors are corrected to
standardize formats and variable types, promoting consistency.
Throughout the process, documentation of changes is crucial for transparency and
reproducibility. Iterative validation and testing confirm the effectiveness of the data cleansing
steps, ultimately resulting in a refined dataset ready for meaningful analysis and insights.
Python Implementation for Data Cleaning
Let’s understand each step of data cleaning using the Titanic dataset. Below are the
necessary steps:
Import the necessary libraries
Load the dataset
Check the data information using df.info()
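import pandas as pd
import numpy as np
# Load the dataset (the file name 'titanic.csv' is an assumption)
df = pd.read_csv('titanic.csv')
df.head()
Output: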
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
Data Inspection and Exploration
Let’s first understand the data by inspecting its structure and identifying missing values,
outliers, and inconsistencies. We check for duplicate rows with the Python code below:
Python3
df.duplicated()
Output:
0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Length: 891, dtype: bool
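Reading a long Boolean Series by eye is impractical, so the duplicates are usually counted and then dropped. A minimal sketch on a small made-up frame (the names people and deduplicated are illustrative, not from the notes):

```python
import pandas as pd

# Hypothetical frame in which the third row repeats the first
people = pd.DataFrame({"Name": ["Ann", "Bob", "Ann"], "Age": [30, 25, 30]})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = people.duplicated().sum()
print("Duplicate rows:", n_duplicates)

# drop_duplicates() keeps the first occurrence of each row
deduplicated = people.drop_duplicates()
print(deduplicated)
```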
Check the data information using df.info()
Python3
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
From the above data info, we can see that Age and Cabin have fewer non-null values than
the total of 891 rows, so they contain missing values. Some of the columns are categorical
(data type object), while others hold integer or float values.
Check the Categorical and Numerical Columns.
Python3
# Categorical columns
cat_col = [col for col in df.columns if df[col].dtype == 'object']
print('Categorical columns :',cat_col)
# Numerical columns
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Numerical columns :',num_col)
Output:
Categorical columns : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
Numerical columns : ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
Check the total number of Unique Values in the Categorical Columns
Python3
df[cat_col].nunique()
Output:
Name 891
Sex 2
Ticket 681
Cabin 147
Embarked 3
dtype: int64
Steps to Perform Data Cleansing
Removal of Unwanted Observations
This includes deleting duplicate, redundant, or irrelevant values from your dataset. Duplicate
observations most frequently arise during data collection, and irrelevant observations are
those that don’t actually fit the specific problem that you’re trying to solve.
Redundant observations reduce efficiency to a great extent: because the data
repeats, it may add weight towards the correct side or towards the incorrect side,
thereby producing unreliable results.
Irrelevant observations are any type of data that is of no use to us and can be
removed directly.
Now we have to decide, according to the subject of analysis, which factors are
important for our discussion.
As we know, machines don’t understand text data, so we have to either drop the categorical
columns or convert their values into numerical types. Here we are dropping the Name
column because the Name will always be unique and it doesn’t have a great influence on the
target variable. For the Ticket column, let’s first print the first 50 unique tickets.
Python3
df['Ticket'].unique()[:50]
Output:
array(['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450',
'330877', '17463', '349909', '347742', '237736', 'PP 9549',
'113783', 'A/5. 2151', '347082', '350406', '248706', '382652',
'244373', '345763', '2649', '239865', '248698', '330923', '113788',
'347077', '2631', '19950', '330959', '349216', 'PC 17601',
'PC 17569', '335677', 'C.A. 24579', 'PC 17604', '113789', '2677',
'A./5. 2152', '345764', '2651', '7546', '11668', '349253',
'SC/Paris 2123', '330958', 'S.C./A.4. 23567', '370371', '14311',
'2662', '349237', '3101295'], dtype=object)
From the above tickets, we can observe that some values are made of two parts; for example,
the first value ‘A/5 21171’ is a joint form of ‘A/5’ and ‘21171’. This may influence our target
variable. Extracting such parts would be a case of Feature Engineering, where we derive new
features from a column or a group of columns. In the current case, we are dropping the
“Name” and “Ticket” columns.
Drop Name and Ticket Columns
Python3
df1 = df.drop(columns=['Name','Ticket'])
df1.shape
Output:
(891, 10)
Handling Missing Data
Missing data is a common issue in real-world datasets, and it can occur due to various
reasons such as human errors, system failures, or data collection issues. Various techniques
can be used to handle missing data, such as imputation, deletion, or substitution.
Let’s check the percentage of missing values column-wise. df.isnull() checks whether each
value is null and returns Boolean values, and .sum() then counts the null values in each
column. We divide this count by the total number of rows in the dataset and multiply by 100
to get the percentage, i.e. how many values out of every 100 are null.
Python3
round((df1.isnull().sum()/df1.shape[0])*100,2)
Output:
PassengerId 0.00
Survived 0.00
Pclass 0.00
Sex 0.00
Age 19.87
SibSp 0.00
Parch 0.00
Fare 0.00
Cabin 77.10
Embarked 0.22
dtype: float64
We cannot just ignore or remove missing observations. They must be handled carefully, as
they can be an indication of something important.
The two most common ways to deal with missing data are dropping the observations with
missing values and imputing the missing values from past observations.
Dropping observations with missing values.
The fact that the value was missing may be informative in itself.
Plus, in the real world, you often need to make predictions on new data
even if some of the features are missing!
As we can see from the above result, Cabin has 77.10% null values, Age has 19.87%, and
Embarked has 0.22%.
It’s not a good idea to fill 77% null values, so we will drop the Cabin column. The
Embarked column has only 0.22% null values, so we drop the rows where Embarked is
null.
Python3
df2 = df1.drop(columns='Cabin')
df2.dropna(subset=['Embarked'], axis=0, inplace=True)
df2.shape
Output:
(889, 9)
Imputing the missing values from past observations.
Again, “missingness” is almost always informative in itself, and you
should tell your algorithm if a value was missing.
Even if you build a model to impute your values, you’re not adding
any real information. You’re just reinforcing the patterns already
provided by other features.
We can use Mean imputation or Median imputations for the case.
Note:
Mean imputation is suitable when the data is normally distributed and has no
extreme outliers.
Median imputation is preferable when the data contains outliers or is skewed.
Python3
# Mean imputation of the Age column
df3 = df2.fillna({'Age': df2['Age'].mean()})
# Let's check the null values again
df3.isnull().sum()
Output:
PassengerId 0
Survived 0
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 0
dtype: int64
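To illustrate the note above on choosing between mean and median imputation, here is a minimal sketch on a made-up Age series containing one extreme value (the data and the names mean_filled and median_filled are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value and one extreme value (80)
ages = pd.Series([22, 38, np.nan, 35, 80], name="Age")

# The mean is pulled upward by the extreme value
mean_filled = ages.fillna(ages.mean())

# The median is less affected by the extreme value
median_filled = ages.fillna(ages.median())

print(mean_filled.tolist())
print(median_filled.tolist())
```

Here the mean of the known values is 43.75 while the median is 36.5, which is why the median is often preferred for skewed data.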
Handling Outliers
Outliers are extreme values that deviate significantly from the majority of the data. They can
negatively impact the analysis and model performance. Techniques such as clustering,
interpolation, or transformation can be used to handle outliers.
To check the outliers, We generally use a box plot. A box plot, also referred to as a box-and-
whisker plot, is a graphical representation of a dataset’s distribution. It shows a variable’s
median, quartiles, and potential outliers. The line inside the box denotes the median, while
the box itself denotes the interquartile range (IQR). The whiskers extend to the most extreme
non-outlier values within 1.5 times the IQR. Individual points beyond the whiskers are
considered potential outliers. A box plot offers an easy-to-understand overview of the range
of the data and makes it possible to identify outliers or skewness in the distribution.
Let’s plot the box plot for Age column data.
Python3
import matplotlib.pyplot as plt
plt.boxplot(df3['Age'], vert=False)
plt.ylabel('Variable')
plt.xlabel('Age')
plt.title('Box Plot')
plt.show()
Output:
[Figure: Box Plot]
As we can see from the above box-and-whisker plot, our Age data has outlier values. The
values less than about 5 and more than about 55 are outliers.
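The code that computes the bounds is not reproduced in these notes. One common rule, assumed here, places the bounds two standard deviations on either side of the mean. The sketch below applies that rule to made-up ages, so its numbers differ from the bounds listed in the notes:

```python
import pandas as pd

# Hypothetical ages standing in for df3['Age']
ages = pd.Series([2, 18, 22, 25, 28, 30, 31, 35, 40, 80], name="Age")

mean = ages.mean()
std = ages.std()

# Bounds at two standard deviations on either side of the mean (assumed rule)
lower_bound = mean - 2 * std
upper_bound = mean + 2 * std
print("Lower Bound :", lower_bound)
print("Upper Bound :", upper_bound)

# Keep only the values that fall inside the bounds
filtered = ages[(ages >= lower_bound) & (ages <= upper_bound)]
print(filtered.tolist())
```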
Python3
Output:
Lower Bound : 3.705400107925648
Upper Bound : 55.578785285332785
Similarly, we can remove the outliers of the remaining columns.
Data Transformation
Data transformation involves converting the data from one form to another to make it more
suitable for analysis. Techniques such as normalization, scaling, or encoding can be used to
transform the data.
Data validation and verification
Data validation and verification involve ensuring that the data is accurate and consistent by
comparing it with external sources or expert knowledge.
For the machine learning prediction, we first separate the independent and target features.
Here we will consider only ‘Pclass’, ‘Sex’, ‘Age’, ‘SibSp’, ‘Parch’, ‘Fare’, and ‘Embarked’
as the independent features and Survived as the target variable, because PassengerId will not
affect the survival rate.
Python3
X = df3[['Pclass','Sex','Age', 'SibSp','Parch','Fare','Embarked']]
Y = df3['Survived']
Data formatting
Data formatting involves converting the data into a standard format or structure that can be
easily processed by the algorithms or models used for analysis. Here we will discuss
commonly used data formatting techniques i.e. Scaling and Normalization.
Scaling
Scaling involves transforming the values of features to a specific range. It
maintains the shape of the original distribution while changing the scale.
Particularly useful when features have different scales, and certain algorithms are
sensitive to the magnitude of the features.
Common scaling methods include Min-Max scaling and Standardization (Z-score
scaling).
Min-Max Scaling: Min-Max scaling rescales the values to a specified range, typically
between 0 and 1. It preserves the original distribution and ensures that the minimum value
maps to 0 and the maximum value maps to 1.
Python3
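from sklearn.preprocessing import MinMaxScaler
# initialising the MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
# Numerical columns
num_col_ = [col for col in X.columns if X[col].dtype != 'object']
# work on a copy so that X itself is left unchanged
x1 = X.copy()
# learning the statistical parameters for each of the data and transforming
x1[num_col_] = scaler.fit_transform(x1[num_col_])
x1.head()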
Output:
Pclass Sex Age SibSp Parch Fare Embarked
0 1.0 male 0.271174 0.125 0.0 0.014151 S
1 0.0 female 0.472229 0.125 0.0 0.139136 C
2 1.0 female 0.321438 0.000 0.0 0.015469 S
3 0.0 female 0.434531 0.125 0.0 0.103644 S
4 1.0 male 0.434531 0.000 0.0 0.015713 S
Standardization (Z-score scaling): Standardization transforms the values to have a mean of
0 and a standard deviation of 1. It centers the data around the mean and scales it based on the
standard deviation. Standardization makes the data more suitable for algorithms that assume a
Gaussian distribution or require features to have zero mean and unit variance.
Z = (X - μ) / σ
Where,
X = Data
μ = Mean value of X
σ = Standard deviation of X
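The formula above can be applied directly with pandas. A minimal sketch on a few illustrative fare values (the series and the names mu, sigma and z are assumptions; the population standard deviation, ddof=0, is used here):

```python
import pandas as pd

# Illustrative fare values
fares = pd.Series([7.25, 71.28, 7.93, 53.10, 8.05], name="Fare")

mu = fares.mean()
sigma = fares.std(ddof=0)  # population standard deviation

# Z = (X - mu) / sigma, applied element-wise
z = (fares - mu) / sigma
print(z.round(3))
```

After the transformation the values have a mean of 0 and a standard deviation of 1, up to floating-point error.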
In order to check for missing values in a Pandas DataFrame, we use the functions isnull() and
notnull(). Both functions help in checking whether a value is NaN or not. These functions can
also be used on a Pandas Series in order to find null values in the series.
Checking for missing values using isnull()
In order to check for null values in a Pandas DataFrame, we use the isnull() function. This
function returns a DataFrame of Boolean values which are True for NaN values.
Code #1:
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
# using isnull() function
df.isnull()
Output:
Code #2:
data = pd.read_csv("employees.csv")
# True where Gender is NaN
bool_series = pd.isnull(data["Gender"])
# displaying only the rows with Gender = NaN
data[bool_series]
Output: As shown in the output image, only the rows having Gender = NULL are displayed.
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe using dictionary
df = pd.DataFrame(dict)
# using notnull() function
df.notnull()
Output:
Code #4:
data = pd.read_csv("employees.csv")
# True where Gender is not NaN
bool_series = pd.notnull(data["Gender"])
# displaying only the rows with Gender = Not NaN
data[bool_series]
Output: As shown in the output image, only the rows having Gender = NOT NULL are
displayed.
In order to fill null values in a dataset, we use the fillna(), replace() and interpolate()
functions. These functions replace NaN values with some other value, and all of them help in
filling null values in a DataFrame. The interpolate() function is also used to fill NA values in
the DataFrame, but it uses various interpolation techniques to fill the missing values rather
than hard-coding the value.
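As a compact sketch of the filling options, the frame below reuses the score values from these notes (the variable names scores, filled_zero, filled_prev and filled_next are illustrative). It fills the gaps with a constant, with the previous row's value, and with the next row's value:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "First Score": [100, 90, np.nan, 95],
    "Second Score": [30, 45, 56, np.nan],
    "Third Score": [np.nan, 40, 80, 98],
})

filled_zero = scores.fillna(0)  # every NaN becomes 0
filled_prev = scores.ffill()    # copy the previous row's value downward
filled_next = scores.bfill()    # copy the next row's value upward

print(filled_zero)
print(filled_prev)
print(filled_next)
```

Forward filling cannot fill a gap in the first row, and backward filling cannot fill a gap in the last row, because there is no neighbouring value to copy.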
Code #1: Filling null values with a single value
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
# filling missing values with 0 using fillna()
df.fillna(0)
Output:
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
Output:
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
Output:
Now we are going to fill all the null values in Gender column with “No Gender”
import pandas as pd
# making data frame from csv file
data = pd.read_csv("employees.csv")
# filling null values using fillna(); assigning the result back avoids
# the chained in-place call that newer pandas versions warn about
data["Gender"] = data["Gender"].fillna("No Gender")
data
Output:
Now we are going to replace all the NaN values in the data frame with the value -99.
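The listing for this step is not reproduced in these notes. A minimal sketch using replace() on a small made-up frame (the column names and values are assumptions standing in for the employees data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the employees data
data = pd.DataFrame({
    "Salary": [97308.0, np.nan, 130590.0],
    "Bonus %": [6.9, 13.7, np.nan],
})

# Replace every NaN in the frame with -99
replaced = data.replace(to_replace=np.nan, value=-99)
print(replaced)
```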
Output:
Code #6: Using interpolate() function to fill the missing values using linear method.
# importing pandas as pd
import pandas as pd
Output:
Let’s interpolate the missing values using the linear method. Note that the linear method
ignores the index and treats the values as equally spaced.
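The listing for this step is not reproduced in these notes. A minimal sketch with a made-up frame (the column names, the values and the names df_interp and result are assumptions), interpolating linearly in the forward direction:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in every column
df_interp = pd.DataFrame({
    "A": [12, 4, 5, np.nan, 1],
    "B": [np.nan, 2, 54, 3, np.nan],
    "C": [20, 16, np.nan, 3, 8],
    "D": [14, 3, np.nan, np.nan, 6],
})

# Fill each gap on a straight line between its neighbours, moving forward
result = df_interp.interpolate(method="linear", limit_direction="forward")
print(result)
```

In column D, for example, the two gaps between 3 and 6 become 4.0 and 5.0, while the gap at the top of column B stays NaN because no earlier value exists.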
Output:
As we can see in the output, the values in the first row could not be filled, because the
direction of filling is forward and there is no previous value that could be used in the
interpolation.
In order to drop null values from a DataFrame, we use the dropna() function. This function
drops rows or columns of a dataset that contain null values, in different ways.
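The main variants of dropna() can be compared side by side. This sketch reuses score values from these notes; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "First Score": [100, np.nan, np.nan, 95],
    "Second Score": [30, np.nan, 45, 56],
    "Third Score": [52, np.nan, 80, 98],
    "Fourth Score": [np.nan, np.nan, np.nan, 65],
})

any_null_dropped = scores.dropna()           # drop rows with at least one NaN
all_null_dropped = scores.dropna(how="all")  # drop rows where every value is NaN
cols_dropped = scores.dropna(axis=1)         # drop columns with at least one NaN

print(any_null_dropped)
print(all_null_dropped)
print(cols_dropped.shape)
```

Because the second row is entirely NaN, every column contains at least one null value, so dropping by column leaves no columns in this particular frame.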
Code #1: Dropping rows with at least 1 null value.
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, 40, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan, 65]}
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
df
Output
Now we drop rows with at least one Nan value (Null value)
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, 40, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan, 65]}
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
# using dropna() function
df.dropna()
Output:
Code #2: Dropping rows if all values in that row are missing.
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, np.nan, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, np.nan, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan, 65]}
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
df
Output
Now we drop the rows in which all the data is missing or contains null values (NaN).
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, np.nan, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, np.nan, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan, 65]}
df = pd.DataFrame(dict)
# using dropna() function with how = 'all'
df.dropna(how = 'all')
Output:
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, np.nan, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, np.nan, 80, 98],
'Fourth Score':[60, 67, 68, 65]}
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
df
Output
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
Output :
Code #4: Dropping Rows with at least 1 null value in CSV file
new_data
Output:
Now we compare sizes of data frames so that we can come to know how many rows had at
least 1 Null value
Python
Output :
Old data frame length: 1000
New data frame length: 764
Number of rows with at least 1 NA value: 236
Since the difference is 236, there were 236 rows which had at least 1 Null value in any
column.
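The comparison described above can be sketched as follows. The small frame is a made-up stand-in for the CSV data, so its counts differ from those listed in the notes:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data read from the CSV file
data = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female"],
    "Team": ["Marketing", "Finance", np.nan],
})

# Drop every row that has at least one null value
new_data = data.dropna(axis=0, how="any")

print("Old data frame length:", len(data))
print("New data frame length:", len(new_data))
print("Number of rows with at least 1 NA value:", len(data) - len(new_data))
```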