Ln. 1 - Data handling using Pandas - Series & Dataframe

Uploaded by

t5pj99x788
Copyright
© © All Rights Reserved

Ln. 1 – Data Handling using Pandas


INTRODUCTION
• Data science is a large field covering everything from data collection, cleaning and
standardization to analysis, visualization and reporting.
• Data processing is an important part of analyzing the data because the data is not always
available in the desired format.
• Various processing is required before analyzing the data such as cleaning, restructuring or
merging etc.
• NumPy, SciPy, Cython and Pandas are tools available in Python which can be used for fast
processing of data.

Modules and Libraries


• Python libraries contain a collection of built-in modules that allow us to perform many
actions without writing detailed programs for them.
• Each library in Python contains a large number of modules that one can import and use.
• E.g. NumPy, Pandas, Matplotlib
• A module is a file that contains various Python functions and global variables.

Pandas-
• Pandas is a high-performance open-source library for data analysis in Python developed by
Wes McKinney in 2008.
• The term ‘Pandas’ is derived from ‘panel data’, a term used for multidimensional,
structured data sets.
• Pandas is built on top of NumPy (Numerical Python), which it uses for mathematical
operations, and it relies on Matplotlib for data visualization.
• It is one of the most popular Python packages for data science, offering powerful and
flexible data structures that make data analysis and manipulation easy.

Key Features of Pandas


• Quick and efficient data manipulation and analysis.
• Merges and joins two datasets easily.
• Easy handling of missing data.
• Represents the data in tabular form.
• Support for multiple file formats.
• Easy sorting of data.
• Flexible reshaping and organising of data sets.
• Time Series functionality.
• Summarising data by classification variable.
• Handles large data efficiently.

Note- Installing Pandas


• Open the command prompt (terminal) to install Pandas.
• Type: pip install pandas
• pip is the standard package manager for Python. It allows you to install and manage
additional packages that are not part of the Python standard library.

Numpy vs Pandas-
Pandas | NumPy
When we have to work on tabular data, we prefer the pandas module. | When we have to work on numerical data, we prefer the numpy module.
The powerful tools of pandas are DataFrame and Series. | The powerful tool of numpy is arrays.
Pandas consumes more memory. | NumPy is memory efficient.
Indexing of a pandas Series is very slow as compared to NumPy arrays. | Indexing of NumPy arrays is very fast.
Pandas offers a 2D table object called DataFrame. | NumPy is capable of providing multi-dimensional arrays.

Pandas Datatypes

Pandas Data structures :


• A data structure is a collection of data values and the operations that can be applied to that
data. It enables efficient storage, retrieval and modification of the data.
• Pandas has traditionally offered the following three data structures:
✓ Series : It is a one-dimensional structure storing homogeneous data.
✓ DataFrame : It is a two-dimensional structure storing heterogeneous data.
✓ Panel : It is a three-dimensional way of storing items. (Panel was deprecated and then
removed in pandas 0.25, so current versions provide only Series and DataFrame.)
• These data structures are built on top of NumPy arrays, which means they are fast.

Series
• The Series is the primary building block of Pandas.
• It is a one-dimensional labelled array capable of holding data of any type (integer, string,
float etc.), with all elements sharing a single dtype (homogeneous data).
• The data values are mutable (can be changed) but the size of Series data is immutable.
• It contains a sequence of values and an associated array of data labels, called its index.
• If we mix different data types, all of the data is upcast to a common dtype, typically
dtype=object.
• We can imagine a Pandas Series as a column in a spreadsheet.
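
The upcasting rule can be checked with a small sketch. The values below are made up for illustration; only the resulting dtypes matter.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])              # only integers
mixed_num = pd.Series([1, 2.5, 3])       # integers + float are upcast to float64
mixed_obj = pd.Series([1, "two", 3.0])   # numbers + string are upcast to object

print(ints.dtype)
print(mixed_num.dtype)   # float64
print(mixed_obj.dtype)   # object
```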

Creation of Series
• A Series in Pandas is created using the ‘Series’ constructor.
• It can be created from various input data such as an array, a dict, a scalar value (constant) or a list.
• Syntax-
import pandas as pd
pd.Series(data, index, dtype, copy)
• The import statement loads the Pandas module into memory so that it can be used.
• pd is an alias (alternate name) given to the Pandas module. Its significance is that we can
type ‘pd’ instead of ‘pandas’ every time we need to use it.

Creation of Empty Series

Note –
• Series() creates an empty Series; printing it displays an empty list along with its data type.
• Here ‘s’ is the Series object.
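
A minimal sketch of creating an empty Series follows. The dtype is passed explicitly here as an assumption of this sketch, because the default dtype of an empty Series differs between pandas versions.

```python
import pandas as pd

# An empty Series; dtype is given explicitly so the result is the same
# across pandas versions.
s = pd.Series(dtype="float64")

print(s)
print(len(s))   # 0
```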

Create a Series from Scalar


• When a scalar is passed, all the elements of the series are initialized to the same value.
• The value is repeated to match the length of the index.
• If we do not explicitly specify an index for the data values while creating a series, then by
default the indices range from 0 through N – 1, where N is the number of data elements.

Alternatively, the index can be generated using the range() function.
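
As a sketch of both approaches, the scalar 5 (an arbitrary value chosen for illustration) is repeated once for each index label:

```python
import pandas as pd

# Scalar repeated for an explicitly listed index
s1 = pd.Series(5, index=[0, 1, 2, 3])

# Same result with the index generated by range()
s2 = pd.Series(5, index=range(4))

print(s1)
print(s2)
```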

Creating a Series from a list


• Syntax:
<Series Object> = pandas.Series([data], index=[index])
Note- To give a name to the index column and to the values:
st.index.name = 'Animals'   # shown at the top of the index column
st.name = 'Pets'            # shown at the bottom of the Series
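
A short sketch of the note above. The Series `st` and its values are hypothetical; only the `index.name` and `name` assignments come from the notes.

```python
import pandas as pd

# Hypothetical data: number of pets of each kind
st = pd.Series([2, 1, 3], index=["Dog", "Cat", "Parrot"])

st.index.name = "Animals"   # label printed above the index column
st.name = "Pets"            # label printed in the footer of the Series

print(st)
```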

Program - Print the output as shown below-


1 Jan
2 Feb
3 Mar
4 Apr
5 June
6 July
dtype: object

The same series can also be created with its index generated by the range() function.
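
One possible solution to the program above, shown both with an explicit index list and with range(). The month labels are copied from the expected output; the variable names are assumptions of this sketch.

```python
import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr", "June", "July"]

# Index written out by hand
by_list = pd.Series(months, index=[1, 2, 3, 4, 5, 6])

# Index generated with range(): 1 up to (but not including) 7
by_range = pd.Series(months, index=range(1, 7))

print(by_list)
```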

Create a series using 2 different lists


>>> import pandas as pd
>>> m=['jan','feb']
>>> n=[23,34]
>>> s=pd.Series(m,index=n)
>>> s

Note-
• type(s) returns the type of the object, which here is a Pandas Series.
• s.tolist() converts the series back to a Python list.
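
The two lists and the note can be tried together in a short sketch:

```python
import pandas as pd

m = ["jan", "feb"]
n = [23, 34]
s = pd.Series(m, index=n)   # values come from m, index labels from n

print(type(s))       # the Series class
print(s.tolist())    # ['jan', 'feb']
print(s[23])         # value stored under label 23
```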

Create a Series from dictionary


• A dictionary can be passed as input to a Series.
• Dictionary keys are used to construct index.

Program
Write a program to convert a dictionary to a Pandas series. The dictionary named Students must contain-
Key : Name, RollNo, Class ,Marks , Grade
Value : Your name, rollNo, class,marks and grade
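
A possible sketch for this program. The values are placeholders invented for illustration and should be replaced with your own details.

```python
import pandas as pd

# Placeholder values (hypothetical): replace with your own details
Students = {"Name": "Asha", "RollNo": 12, "Class": "XII", "Marks": 91, "Grade": "A"}

# Dictionary keys become the index labels of the Series
s = pd.Series(Students)
print(s)
```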
Arrays-
• An array is a data structure that contains a group of elements.
• Arrays are commonly used in computer programs to organize data so that a related set of
values can be easily sorted or searched.
• Each element can be uniquely identified by its index in the array.

Array | Series
Indexing is by default from 0. | Indexing can be given manually to the elements.
Elements are arranged horizontally. | Elements are arranged vertically.
Indexes are not visible in the array. | Indexes are shown along with the elements.

Create series from ndarray


✓ An array of values can be passed to a Series.
✓ If data is an ndarray, index must be the same length as data.
✓ If no index is passed, one will be created having values [0, ..., len(data) - 1].
✓ When index labels are passed with the array, then the length of the index and array must be
of the same size, else it will result in a ValueError.
✓ Example- array1 contains 4 values whereas there are only 3 indices, hence ValueError is
displayed.
>>> array1 = np.array([1, 2, 3, 4])
>>> series5 = pd.Series(array1, index = ["Jan", "Feb", "Mar"])
ValueError: Length of passed values is 4, index implies 3

import pandas as pd
import numpy as np
a = np.array(['J','F','M','A'])   # ndarray holding the data values
s = pd.Series(a, index = ["Jan", "Feb", "Mar", "Apr"])
print(s)
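
The default index and the length check can both be seen in a small sketch. The array values and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

array1 = np.array([10, 20, 30, 40])   # hypothetical values

# No index passed: labels 0..len(data)-1 are created automatically
default_idx = pd.Series(array1)
print(default_idx)

# 4 values but only 3 labels: pandas raises ValueError
try:
    pd.Series(array1, index=["Jan", "Feb", "Mar"])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

print(mismatch_error)
```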

NaN
Any item that has no value (for example, a label present in one series but not in the other
during an operation) is marked by NaN, or “Not a Number”, which is how Pandas marks missing data.
>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series([1,2,3,4,np.nan,5,np.nan])
>>> s

import pandas as pd
import numpy as np
s = pd.Series([2,3,np.nan,7,"The Hobbit"])

To test for missing values, use s.isnull()


0    False
1    False
2     True
3    False
4    False
dtype: bool

Accessing Elements of a Series


• There are two common ways of accessing the elements of a series: indexing and slicing.
• Indexing is used to access individual elements in a series.
• Indexes are of two types: positional index and labelled index.
• A positional index takes an integer value that corresponds to the element's position in the
series, starting from 0, whereas a labelled index takes any user-defined label as index.
>>> import pandas as pd
>>> a = pd.Series([2,3,4],index=["Feb","Mar","Apr"])

Note-
The index values associated with the series can be altered by assigning new index values.
Eg:- a.index=['May','June','July']
To extract part of a series, slicing is done.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
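
A sketch of indexing and slicing on the two series defined above. It uses .iloc for positional access, which is a choice made for this sketch because plain integer keys on a labelled Series behave differently across pandas versions.

```python
import pandas as pd

a = pd.Series([2, 3, 4], index=["Feb", "Mar", "Apr"])
first_by_position = a.iloc[0]   # positional index 0 -> 2
mar_by_label = a["Mar"]         # labelled index -> 3

# Replace the index labels in one assignment
a.index = ["May", "June", "July"]

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])
by_position = s.iloc[1:3]   # positions 1 and 2; the end position is excluded
by_label = s["b":"d"]       # labels b, c and d; the end label is included

print(by_position)
print(by_label)
```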

Write a Python program to create a series of odd numbers.


>>> import pandas as pd
>>> odd=pd.Series(range(1, 10, 2))
>>> odd
0 1
1 3
2 5
3 7
4 9
dtype: int64

Modifying Series data with slicing


>>> import pandas as pd
>>> import numpy as np
>>> abc = pd.Series(np.arange(10,16,1), index = ['a', 'b', 'c', 'd', 'e', 'f'])
>>> abc
>>> abc[1:3] = 50
>>> abc

Observe that updating the values in a series using positional slicing excludes the value at the
end index position.
However, the value at the end index label is also changed when slicing is done using labels.
>>> abc['c':'e'] = 500
>>> abc
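
The effect of both assignments can be traced in the sketch below. It uses .iloc for the positional slice (an assumption of this sketch, chosen to make the positional intent explicit) and the name abc for the series throughout.

```python
import numpy as np
import pandas as pd

abc = pd.Series(np.arange(10, 16, 1), index=["a", "b", "c", "d", "e", "f"])

# Positional slice: positions 1 and 2 are updated, position 3 is excluded
abc.iloc[1:3] = 50
after_positional = abc.tolist()   # [10, 50, 50, 13, 14, 15]

# Label slice: 'c', 'd' and the end label 'e' are all updated
abc["c":"e"] = 500
after_label = abc.tolist()        # [10, 50, 500, 500, 500, 15]

print(after_positional)
print(after_label)
```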

Accessing Data from Series with indexing and slicing


• In a series we can access the value at any position using its index number.
• Slicing is used to retrieve subsets of data by position.
• A slice is written as start:end:step, where start is the position of the first item, end is the
position at which the slice stops (the item at end is not included), and step is the increment
between items.
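
A short sketch of the start:end:step syntax on a hypothetical series, again using .iloc for positional access:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60])   # hypothetical values

one_value = s.iloc[3]        # value at position 3 -> 40
every_second = s.iloc[0:5:2] # positions 0, 2, 4 -> 10, 30, 50
last_two = s.iloc[-2:]       # last two items -> 50, 60
reversed_s = s.iloc[::-1]    # negative step reverses the series

print(every_second)
```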

Vector operations in Series


• Series support vector operations.
• Any operation performed on a series is applied to every single element.
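
For example, the following sketch (with made-up values) applies arithmetic and comparison operators to a whole series at once:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

plus_two = s + 2           # 3, 4, 5, 6
doubled = s * 2            # 2, 4, 6, 8
squared = s ** 2           # 1, 4, 9, 16
greater_than_two = s > 2   # False, False, True, True

print(squared)
```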

Binary operations in Series


• We can perform binary operations on series using mathematical operators.
• While performing operations on series, index matching is implemented and all missing values
are filled in with NaN by default.
• The output of an operation is NaN if one or both of the elements have no value.
• When we do not want NaN values in the output, we can use the series methods add(),
sub() etc. with the parameter fill_value to replace a missing value with a specified value.
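
The index matching described above can be sketched with two hypothetical series whose labels only partly overlap:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Operator form: labels found in only one series ('x', 'w') give NaN
plain_sum = a + b

# Method form: a missing value is treated as 0 before adding
filled_sum = a.add(b, fill_value=0)

print(plain_sum)
print(filled_sum)
```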
Binary operations in Series [ Using other functions ] -
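
A sketch of the method forms sub(), mul() and div() with fill_value. The data and the chosen fill values are hypothetical.

```python
import pandas as pd

a = pd.Series([10, 20, 30], index=["a", "b", "c"])
b = pd.Series([1, 2, 4], index=["a", "b", "d"])

difference = a.sub(b, fill_value=0)   # a: 9, b: 18, c: 30, d: -4
product = a.mul(b, fill_value=1)      # a: 10, b: 40, c: 30, d: 4
quotient = a.div(b, fill_value=1)     # a: 10, b: 10, c: 30, d: 0.25

print(difference)
```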

Program

Write a Pandas program to compare the elements of the two Pandas Series.
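
One possible sketch for this program, using two hypothetical series of equal length. Comparison operators work element by element and return a series of True/False values.

```python
import pandas as pd

s1 = pd.Series([2, 4, 6, 8, 10])
s2 = pd.Series([1, 3, 5, 7, 10])

equal = s1 == s2     # True only where the elements match
greater = s1 > s2    # True where the element of s1 is larger
less = s1 < s2       # True where the element of s1 is smaller

print(equal)
print(greater)
print(less)
```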
Attributes in Series
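
Some commonly used Series attributes are sketched below on a hypothetical series. This is a selection for illustration, not a complete list.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="Marks")

print(s.index)     # the index labels
print(s.values)    # the data as a NumPy array
print(s.dtype)     # data type of the values
print(s.shape)     # (3,)
print(s.size)      # 3  (number of elements)
print(s.ndim)      # 1  (a Series is one-dimensional)
print(s.empty)     # False
print(s.hasnans)   # False (no NaN values present)
print(s.name)      # Marks
```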

Program – To sort values

Note- A series built with np.arange() and one built with range() give the same output. Either
function can be used to generate a set of numbers automatically.
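
A minimal sketch of sorting, with the numbers generated once by np.arange() and once by range(). The number range is chosen for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(5, 0, -1))   # 5, 4, 3, 2, 1
s2 = pd.Series(range(5, 0, -1))       # same values via range()

ascending = s1.sort_values()                   # 1, 2, 3, 4, 5
descending = s2.sort_values(ascending=False)   # 5, 4, 3, 2, 1

print(ascending)
```

sort_values() returns a new sorted series and keeps each value attached to its original index label.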
Accessing rows using head () and tail() function
✓ Series.head() returns the first 5 rows of the series by default; Series.head(n) returns the first n rows.
✓ Series.tail() returns the last 5 rows of the series by default; Series.tail(n) returns the last n rows.
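
For example, on a hypothetical series of the numbers 1 to 10:

```python
import pandas as pd

s = pd.Series(range(1, 11))   # 1, 2, ..., 10

first_five = s.head()    # 1, 2, 3, 4, 5
last_five = s.tail()     # 6, 7, 8, 9, 10
first_three = s.head(3)  # 1, 2, 3
last_two = s.tail(2)     # 9, 10

print(first_five)
print(last_two)
```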

RETRIEVING VALUES USING CONDITIONS


>>> import pandas as pd
>>> S=pd.Series([1.0,1.4,1.7,2.0])
>>> S

Displaying the data using Boolean indexing
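
A sketch of Boolean indexing on the series S defined above. The threshold 1.5 is an arbitrary value chosen for illustration.

```python
import pandas as pd

S = pd.Series([1.0, 1.4, 1.7, 2.0])

mask = S > 1.5       # a series of True/False values, one per element
selected = S[mask]   # keeps only the elements where the mask is True

print(mask)
print(selected)
```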

Deleting elements from a Series


• The drop() function returns a new series with the specified index labels removed.
• The del statement (del s) can be used to remove a series completely.
>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series(data=np.arange(3), index=['A', 'B', 'C'])
>>> s
A 0
B 1
C 2
dtype: int32
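
Continuing with the same series, drop() and del can be sketched as follows. The variable names for the results are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

s = pd.Series(data=np.arange(3), index=["A", "B", "C"])

# drop() returns a new series; s itself is left unchanged
without_b = s.drop("B")
without_a_c = s.drop(["A", "C"])
original_size = s.size   # still 3

print(without_b)
print(without_a_c)

# del removes the name s entirely; using s after this raises NameError
del s
```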
Dataframes
• A DataFrame is a two-dimensional data structure, with rows & columns.
• It consists of three principal components: data, rows, & columns.
• A DataFrame can be created from the following: lists, dict, Series, NumPy arrays, another
DataFrame
Syntax: pd.DataFrame( data, index, columns, dtype, copy)

Series vs Dataframe
• A Series is essentially a column, and a DataFrame is a two-dimensional table made
up of a collection of Series.
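
This relationship can be checked with a small sketch on hypothetical data: selecting one column of a DataFrame gives back a Series.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"Name": ["Tom", "Jack"], "Age": [28, 34]})

age = df["Age"]   # a single column of the DataFrame is a Series

print(type(age))
print(df.ndim, age.ndim)   # 2 1
```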

Basic Features of DataFrame


• Columns may be of different types
• Size can be changed (Mutable)
• Labelled axes (rows / columns)
• Can perform arithmetic operations on rows and columns

Create DataFrame
It can be created using- Lists , dict , Series , Numpy arrays , Another DataFrame

Creating an empty Dataframe
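
A minimal sketch: calling the constructor with no arguments gives a DataFrame with no rows and no columns.

```python
import pandas as pd

df = pd.DataFrame()

print(df)
print(df.empty)   # True
```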

Creating a dataframe from single list
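
A sketch with a hypothetical list of marks. Each list element becomes one row; without a column name the single column is labelled 0.

```python
import pandas as pd

marks = [10, 20, 30, 40]   # hypothetical values

df_default = pd.DataFrame(marks)                   # column label is 0
df_named = pd.DataFrame(marks, columns=["Marks"])  # column label given explicitly

print(df_default)
print(df_named)
```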

Creating a dataframe from list of lists
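
A sketch with hypothetical data: each inner list becomes one row, and the column names are passed separately.

```python
import pandas as pd

data = [["Tom", 10], ["Jack", 12], ["Steve", 13]]   # hypothetical rows

df = pd.DataFrame(data, columns=["Name", "Age"])

print(df)
```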


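Creating a DataFrame from a list of lists (multidimensional list)
• Using a multi-dimensional list with column names specified and the Age column converted to float.
import pandas as pd
lst = [['tom', 'reacher', 25], ['krish', 'pete', 30], ['nick', 'wilson', 26],
['juli', 'williams', 22]]
df = pd.DataFrame(lst, columns=['FName', 'LName', 'Age'])
# dtype=float cannot be applied to the whole frame because the name columns hold
# strings (recent pandas versions raise an error), so only the Age column is cast
df['Age'] = df['Age'].astype(float)
print(df)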

Displaying index and columns


>>> import pandas as pd
>>> df = pd.DataFrame([[0, 1, 2], [3, 4, 5]], index=['row1', 'row2'], columns=['col1', 'col2', 'col3'])
>>> df
      col1  col2  col3
row1     0     1     2
row2     3     4     5
>>> print(df.index)
Index(['row1', 'row2'], dtype='object')
>>> print(df.columns)
Index(['col1', 'col2', 'col3'], dtype='object')

Creating DataFrames from Numpy Array
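
A sketch with a hypothetical 2 x 3 array. The row and column labels are optional and invented for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3], [4, 5, 6]])   # 2 rows, 3 columns

df = pd.DataFrame(arr, index=["row1", "row2"], columns=["A", "B", "C"])

print(df)
```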


Create a DataFrame from List of Dictionaries
import pandas as pd
data1 = [{'x': 1, 'y': 2},{'x': 5, 'y': 4, 'z':5}]
df1 =pd.DataFrame(data1)

✓ Here, the dictionary keys are taken as column labels, and each dictionary supplies the
values of one row.
✓ There will be as many rows as the number of dictionaries present in the list.
✓ A key that is missing from a dictionary is filled with NaN in that row.
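
The points above can be checked on the same data1 list. The first dictionary has no 'z' key, so that cell is NaN.

```python
import pandas as pd

data1 = [{"x": 1, "y": 2}, {"x": 5, "y": 4, "z": 5}]
df1 = pd.DataFrame(data1)

print(df1)
print(df1.shape)   # (2, 3): two dictionaries, three distinct keys
```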

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b', 'c'])
print(df1)

import pandas as pd
ab=[{'Name': 'Shaun' , 'Age': 35, 'Marks': 91},{'Name': 'Ritika', 'Age':
31, 'Marks': 87},{'Name': 'Smriti', 'Age': 33, 'Marks': 78},{'Name':
'Jacob' , 'Age': 23, 'Marks': 93}]
ab1=pd.DataFrame(ab,index=['a','b','c','d'])
ab1

Creating DataFrame from Dictionary of Lists


Dictionary keys become column labels by default in a DataFrame, and each list supplies the
values of that column, one value per row.
>>> import pandas as pd
>>> data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
>>> df = pd.DataFrame(data)
>>> df

>>> dForest = {'State': ['Assam', 'Delhi','Kerala'],
...            'GArea': [78438, 1483, 38852],
...            'TArea': [2797, 6.72, 1663]}
>>> dfForest= pd.DataFrame(dForest)
>>> dfForest

Creating DataFrames from Series


>>> p=pd.Series([10,20,30],index=['a','b','c'])
>>> q=pd.Series([40,50,60],index=['a','b','c'])
>>> r=pd.DataFrame([p,q])
>>> r

import pandas as pd
a=["Jitender","Purnima","Arpit","Jyoti"]
b=[210,211,114,178]
s = pd.Series(a)
s1= pd.Series(b)
df=pd.DataFrame({"Author":s,"Article":s1})
df
>>> p={'one':pd.Series([1,2,3], index=['a','b','c']), 'two':pd.Series([11,22,33,44],
index=['a','b','c','d'])}
>>> q=pd.DataFrame(p)
>>> q
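
The examples above behave differently depending on how the Series are passed. The sketch below reuses the same data with different variable names (chosen for this sketch) to show the two cases: a list of Series gives one row per Series, while a dictionary of Series gives one column per Series, with NaN where a label is missing.

```python
import pandas as pd

p = pd.Series([10, 20, 30], index=["a", "b", "c"])
q = pd.Series([40, 50, 60], index=["a", "b", "c"])

# List of Series: each Series becomes a row, index labels become columns
rows = pd.DataFrame([p, q])

# Dictionary of Series: each Series becomes a column; the row index is the
# union of the labels, and 'one' has no value for label 'd'
cols = pd.DataFrame({
    "one": pd.Series([1, 2, 3], index=["a", "b", "c"]),
    "two": pd.Series([11, 22, 33, 44], index=["a", "b", "c", "d"]),
})

print(rows)
print(cols)
```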

Creation of DataFrame from Dictionary of Series


# To create a DataFrame from 2 series of student data
import pandas as pd
stud_marks = pd.Series([89,94,93,83,89], index=['Anuj','Deepak','Sohail','Tresa','Hima'])
stud_age = pd.Series([18,17,19,16,18], index=['Anuj','Deepak','Sohail','Tresa','Hima'])
stud = pd.DataFrame({'Marks':stud_marks,'Age':stud_age})
print(stud)

>>> ResultSheet = {
...     'Arnab': pd.Series([90, 91, 97], index=['Maths','Science','Hindi']),
...     'Ramit': pd.Series([92, 81, 96], index=['Maths','Science','Hindi']),
...     'Samridhi': pd.Series([89, 91, 88], index=['Maths','Science','Hindi']),
...     'Riya': pd.Series([81, 71, 67], index=['Maths','Science','Hindi']),
...     'Mallika': pd.Series([94, 95, 99], index=['Maths','Science','Hindi'])}
>>> ResultDF = pd.DataFrame(ResultSheet)
>>> ResultDF
