Data Handling Using Pandas I - Series
Data analytics is needed to make sense of large volumes of data. Before data can be analyzed it usually has to be
processed, because it is rarely available in a form that is ready for analysis. Data generally arrives in different
formats such as CSV files, Excel files and HTML files, and these have to be converted into a single format.
Data analysis follows a sequence of steps: converting data of different types into one type, storing it,
performing operations such as join, merge and search, and plotting the data in the form of a graph. Python
provides libraries that support each of these steps.
Pandas is a Python library that enables data analysis through the various methods available in it.
PANDAS: It is a high-level data manipulation tool developed by Wes McKinney for data analysis and
visualization work. It offers powerful and flexible data structures that make data analysis and manipulation easy.
The term ‘Pandas’ is derived from ‘panel data’, a term used for multidimensional, structured
data sets. Pandas provides easy-to-use data structures and data analysis tools.
Features of Pandas: Pandas is the most popular library in the scientific Python ecosystem for doing data analysis.
Pandas can handle several tasks related to data processing and offers the following features:
It can read and write data in many different file formats and can hold values of different data types such as integer and float
Columns from a Pandas data structure can be deleted or inserted
It supports group by operation for data aggregation and transformations, and allows high performance
merging and joining of data
It offers good I/O capabilities, as it can easily load data from a MySQL database directly into a DataFrame
It can easily select subsets of data from bulky datasets and can even combine multiple data sets together
It has the functionality to find and fill missing data
It allows operations to be applied to independent groups within the data
It supports reshaping of data into different forms
It supports advanced time-series functionality for data indexed by dates and times, which is useful when a
model is used to predict future values based on previously observed values
It supports visualization by integrating with libraries such as matplotlib and seaborn. Pandas is best at
handling huge tabular datasets comprising different data formats
DATA STRUCTURES IN PANDAS: A data structure is a specialized format for organizing, processing, retrieving
and storing data. Python Pandas provides three data structures, namely Series, DataFrame and Panel (Panel has
been removed from recent versions of pandas)
Series: It is a one-dimensional structure storing homogeneous (all data elements of the same type) mutable
data
SERIES: A series is a one-dimensional array-like structure with homogeneous data, i.e. all the data elements in
the series are of the same type. However, the data elements may be of any type such as integer, string, float or object.
Ex1: 10 23 56 17 52 61 73 26
A series can also be described as an ordered dictionary with mapping of index values to data values
All data elements in a series are homogeneous i.e. of same data type
The size of a series is treated as immutable: methods that add or remove elements, such as drop( ), return a
new series and leave the original series unchanged
The values of data are mutable i.e. the values of data elements can be changed in a series
Creating a Series: A series can be created by using Series( ) method with various inputs like (i) List (ii) Scalar
Value or Constant (iii) Dictionary (iv) Array etc.
To use the Series( ) method to create a series, the library “pandas” must first be imported using the import
statement.
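The import statement referred to above is conventionally written as in the sketch below. The alias pd is a widely used convention rather than a requirement, and the version check is added here only to confirm that the library is installed.

```python
import pandas as pd  # 'pd' is the conventional short name used in the examples of these notes

# Quick check that the library is available (the version number depends on the installation)
print(pd.__version__)
```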
1. Creating an Empty Series: An empty series can be created by using the Series( ) function, without any
parameters.
Ex : >>> mtsrs = pd.Series( )
>>> mtsrs
Series([ ], dtype: float64)
Here,
mtsrs is the series variable
Series( ) creates an empty series, with the default data type
The dtype indicates the data type of the elements of the series; recent pandas versions report object
instead of float64 for an empty series
pd is an alternate name given to the pandas module. Hence, instead of the module name ‘pandas’
the short name ‘pd’ can be used
2. Creating a Series using List: A list can be passed as an argument to the Series( ) function to create a series.
The list supplies the data values, and an index can be supplied along with it.
The index is the label displayed alongside each value. Providing an index is
optional, and the default index starts from 0
Ex : >>> daysinmonths=pd.Series([31,28,31,30,31,30,31,31,30,31,30,31])
>>> daysinmonths
0 31
1 28
2 31
3 30
4 31
5 30
6 31
7 31
8 30
9 31
10 30
11 31
dtype: int64
Jawahar Navodaya Vidyalaya, Chittoor 3
When the index is not provided, the default index starts from 0 and ranges up to len-1. However, an index
can also be provided while creating a series, using the argument index
An index can be assigned to a series at the time of creating the series or even after creating it
Ex : >>> srs_nat=pd.Series([1,2,3,4,5])
>>> srs_nat
0 1
1 2
2 3
3 4
4 5
dtype: int64
>>> srs_nat.index=["First","Second","Third","Fourth","Fifth"]
>>> srs_nat
First 1
Second 2
Third 3
Fourth 4
Fifth 5
dtype: int64
If even a single value in a series is a float, the remaining integer values are converted into floats and
hence, when the series is displayed, it is displayed as a float series
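The conversion described above can be seen in a minimal sketch like the following. The name srs_mixed and its values are illustrative only and do not come from the original notes.

```python
import pandas as pd

# One float (3.5) among integers: pandas stores the whole series as float64
srs_mixed = pd.Series([1, 2, 3.5, 4])

print(srs_mixed)        # every value is shown with a decimal point
print(srs_mixed.dtype)  # float64
```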
3. Creating Series by providing data with range( ) function: The sequence of values generated using
range( ) function can be used to create a Series
Ex: >>> srs_data=pd.Series(range(3,20,4))
>>> srs_data
0 3
1 7
2 11
3 15
4 19
dtype: int64
4. Create Series from Scalar or Constant Value: A series can be created for a scalar or constant value. In
this case, it is possible to provide only one scalar value
Ex: >>> srs_const=pd.Series(18)
>>> srs_const
0 18
dtype: int64
If an index is provided, it is applied to the scalar value, and if several indices are provided, every
index gets the same scalar value
Ex: >>> srs_const=pd.Series(18,['h','i','j','k'])
>>> srs_const
h 18
i 18
j 18
k 18
dtype: int64
The range( ) function can also be applied to provide indices while creating series
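As a small illustration of the statement above, the sketch below passes range( ) as the index argument. The name srs_sq and its data are assumed for the example.

```python
import pandas as pd

# Index labels 1 to 5 are generated by range(1, 6); the data is a plain list
srs_sq = pd.Series([1, 4, 9, 16, 25], index=range(1, 6))

print(srs_sq)
print(srs_sq.loc[3])  # value stored against the label 3
```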
5. Creating Series with Index of String (Text) Type: A string can also be specified as an index to an element
of series.
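A short sketch of a series with string labels is given below. The subject names and marks are hypothetical values chosen only for illustration.

```python
import pandas as pd

# Each value is labelled by a string instead of a number
srs_marks = pd.Series([78, 85, 91], index=["Maths", "Physics", "Chemistry"])

print(srs_marks)
print(srs_marks["Physics"])  # access a value through its string label
```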
6. Creating a Series with range( ) and for loop: The data and indices can be generated using the range( ) function
and a for loop as well.
To generate numeric values for either data or indices, the range( ) function alone can be used, without
a for statement.
But to generate characters as data or index, the range( ) function has to be used along with for.
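One possible way of combining range( ) with for to produce character labels is sketched below. It assumes the built-in chr( ) function is used to turn character codes into letters; the original notes may have used a different expression.

```python
import pandas as pd

# chr(65) is 'A', so the loop over range(65, 70) gives ['A', 'B', 'C', 'D', 'E']
letters = [chr(code) for code in range(65, 70)]

# Numeric data needs only range(); the character index needs range() together with for
srs_alpha = pd.Series(range(1, 6), index=letters)

print(srs_alpha)
```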
7. Creating a Series using two different lists: A series can be created by providing data as one list and the
indices as the other list
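The two-list approach can be sketched as follows. The list names months and days are illustrative; the two lists must have the same length.

```python
import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr"]  # list used as the index
days = [31, 28, 31, 30]                # list used as the data

srs_days = pd.Series(days, index=months)
print(srs_days)
```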
8. Creating a Series by using NaN for missing values: A series having missing numbers can be created. For
this purpose the constant NaN of the NumPy library can be used in place of the missing numbers. It is
accessed as np.nan (also spelled np.NaN in NumPy versions before 2.0), where np is the alias created by import numpy as np
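A minimal sketch of a series with missing values is shown below. The readings are made-up numbers, and the lowercase spelling np.nan is used because it is accepted by all NumPy versions.

```python
import numpy as np
import pandas as pd

# Two readings are missing and are marked with NaN
srs_temp = pd.Series([36.5, np.nan, 37.1, np.nan])

print(srs_temp)
print(srs_temp.isna().sum())  # number of missing values
```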
9. Creating a Series from Dictionary: A series can also be created using a dictionary. A dictionary
is a collection of elements, where each element is a combination of a key and a value. As every element of the
dictionary already has a key, a separate index need not be given while creating the series; the keys become the index.
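The following sketch shows dictionary keys turning into index labels. The dictionary capitals is an illustrative example, not taken from the original notes.

```python
import pandas as pd

capitals = {"India": "New Delhi", "Japan": "Tokyo", "France": "Paris"}

# Keys become the index and values become the data
srs_cap = pd.Series(capitals)

print(srs_cap)
print(srs_cap["Japan"])
```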
10. Creating a Series using Mathematics Expression / Function: The data values or index values for a series
object can also be provided, from a result of expression or function.
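As an illustration, the sketch below builds the data from an expression and from a function. It assumes NumPy is used for the calculation; the names nums, srs_sqr and srs_root are hypothetical.

```python
import numpy as np
import pandas as pd

nums = np.arange(1, 6)  # array holding 1, 2, 3, 4, 5

# Data produced by an expression: the square of each index value
srs_sqr = pd.Series(nums ** 2, index=nums)

# Data produced by a function: the square root of each index value
srs_root = pd.Series(np.sqrt(nums), index=nums)

print(srs_sqr)
print(srs_root)
```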
>>> srs_prime[3]
7
>>> srs_prime[[2,4,7]]
2 5
4 11
7 19
dtype: int64
>>> srs_odd[:3]
a 1
b 3
c 5
dtype: int64
>>> srs_odd[2:8]
c 5
d 7
e 9
f 11
g 13
h 15
dtype: int64
>>> srs_odd[4:10:3]
e 9
h 15
dtype: int64
>>> srs_odd[-3:]
h 15
i 17
j 19
dtype: int64
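The definitions of srs_prime and srs_odd are not shown with the outputs above. The sketch below assumes definitions that are consistent with those outputs, and uses iloc for the position-based slices so that the intent is explicit.

```python
import pandas as pd

# Assumed definitions, chosen so that they reproduce the outputs shown above
srs_prime = pd.Series([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
srs_odd = pd.Series(range(1, 20, 2), index=list("abcdefghij"))

print(srs_prime[3])           # element with label 3 of the default index
print(srs_prime[[2, 4, 7]])   # several labels at once, given as a list
print(srs_odd.iloc[:3])       # first three elements by position
print(srs_odd.iloc[4:10:3])   # positions 4 and 7 (step of 3)
print(srs_odd.iloc[-3:])      # last three elements
```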
loc: It is used for indexing or selecting based on name, i.e., by the index label. It
refers to name-based indexing. Its counterpart iloc selects by integer position instead, as in the example that follows.
Ex:
>>> weeksrs = pd.Series (index = ["S", "M", "T", "W", "Th", "F", "Sa"],
data = ["Sunday", "Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday"])
>>> weeksrs
S Sunday
M Monday
T Tuesday
W Wednesday
Th Thursday
F Friday
Sa Saturday
dtype: object
>>> weeksrs.iloc[2 : 5]
T Tuesday
W Wednesday
Th Thursday
dtype: object
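To contrast the two indexers, the sketch below selects the same three elements of weeksrs once by label with loc and once by position with iloc. Note that a label slice includes both end labels, whereas a position slice excludes the stop position.

```python
import pandas as pd

weeksrs = pd.Series(
    index=["S", "M", "T", "W", "Th", "F", "Sa"],
    data=["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"],
)

by_label = weeksrs.loc["T":"Th"]  # label slice: "T" and "Th" are both included
by_position = weeksrs.iloc[2:5]   # position slice: positions 2, 3 and 4

print(by_label)
print(by_label.equals(by_position))  # both selections hold the same elements
```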
Naming a Series: To name the values and index of a series, the name property can be used. The name assigned
to the index will be displayed above the index and the name assigned to values will be displayed at the bottom of
the series
Ex: >>> srs=pd.Series(["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"],index=[1, 2, 3, 4, 5, 6, 7])
>>> srs
1 Sun
2 Mon
3 Tue
4 Wed
5 Thu
6 Fri
7 Sat
dtype: object
>>> srs.name="Day"
>>> srs.index.name="S.No."
>>> srs
S.No.
1 Sun
2 Mon
3 Tue
4 Wed
5 Thu
6 Fri
7 Sat
Name: Day, dtype: object
Series Object Attributes: The various properties of a series can be accessed by using its attributes. The syntax
for accessing an attribute with Series Object is,
<Series_Object>.<Attribute_Name>
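Attribute          Description
Series.index       Returns the index of the series
Series.values      Returns the data values as a NumPy array
Series.dtype       Returns the data type of the elements
Series.shape       Returns a tuple giving the shape of the series
Series.nbytes      Returns the number of bytes occupied by the data
Series.ndim        Returns the number of dimensions, which is 1 for a series
Series.size        Returns the number of elements
Series.hasnans     Returns True if the series contains any NaN values
Series.empty       Returns True if the series has no elements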
Ex: >>> sales = pd.Series ([536, 486, np.nan, 472, 86, np.nan, 145], index = ["Sun", "Mon","Tue",
"Wed", "Thu", "Fri", "Sat"])
>>> sales.index
Index(['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'], dtype='object')
>>> sales.values
array([536., 486., nan, 472., 86., nan, 145.])
>>> sales.dtype
dtype('float64')
>>> sales.shape
(7,)
>>> sales.nbytes
56
>>> sales.ndim
1
>>> sales.size
7
>>> sales.hasnans
True
>>> sales.empty
False
The head( ) function, when invoked with a series object, returns the specified number of rows from top.
By default, this function fetches 5 rows
Ex: >>> srs=pd.Series(data=range(1,100,10),index=range(0,10))
>>> srs.head( )
0 1
1 11
2 21
3 31
4 41
dtype: int64
>>> srs.head(3)
0 1
1 11
2 21
dtype: int64
The tail( ) function, when invoked with a series object, returns the specified number of rows from
bottom. By default, this function fetches 5 rows from bottom
Ex: >>> srs=pd.Series(data=range(1,100,10),index=range(0,10))
>>> srs
0 1
1 11
2 21
3 31
4 41
5 51
6 61
7 71
8 81
9 91
dtype: int64
>>> srs.tail( )
5 51
6 61
7 71
8 81
9 91
dtype: int64
>>> srs.tail(7)
3 31
4 41
5 51
6 61
7 71
8 81
9 91
dtype: int64
Arithmetic operations between two series are carried out on the elements that have matching index labels.
Where a label is present in only one of the series, the result for that label is NaN.
Now,
>>> srs1+srs2 >>> srs2-srs1 >>> srs1*srs2 >>> srs2/srs1
1 32 1 10 1 231 1 1.909091
2 34 2 10 2 264 2 1.833333
3 36 3 10 3 299 3 1.769231
4 38 4 10 4 336 4 1.714286
dtype: int64 dtype: int64 dtype: int64 dtype: float64
But,
>>> srs1+srs3 >>> srs3-srs2 >>> srs3*srs1 >>> srs2/srs3
1 NaN 1 NaN 1 NaN 1 NaN
2 NaN 2 NaN 2 NaN 2 NaN
3 NaN 3 NaN 3 NaN 3 NaN
4 NaN 4 NaN 4 NaN 4 NaN
7 NaN 7 NaN 7 NaN 7 NaN
8 NaN 8 NaN 8 NaN 8 NaN
9 NaN 9 NaN 9 NaN 9 NaN
10 NaN 10 NaN 10 NaN 10 NaN
dtype: float64 dtype: float64 dtype: float64 dtype: float64
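The definitions of srs1, srs2 and srs3 are not shown with the outputs above. In the sketch below, srs1 and srs2 are reconstructed from the displayed results, while the values of srs3 are hypothetical (only its index 7 to 10 can be read from the output). The add( ) method with fill_value is shown as one way to avoid NaN when the indices do not match.

```python
import pandas as pd

srs1 = pd.Series([11, 12, 13, 14], index=[1, 2, 3, 4])
srs2 = pd.Series([21, 22, 23, 24], index=[1, 2, 3, 4])
srs3 = pd.Series([31, 32, 33, 34], index=[7, 8, 9, 10])  # values are hypothetical

total = srs1 + srs2      # indices match, so values are added label by label
mismatch = srs1 + srs3   # no common labels, so every result is NaN

# Treat a missing label as 0 instead of producing NaN
filled = srs1.add(srs3, fill_value=0)

print(total)
print(mismatch)
print(filled)
```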
Vector Operations on Series: It is possible to perform vector operations on a series, i.e. arithmetic operations
such as addition (+), subtraction (-), multiplication (*) and division (/) can be performed between a series and a
scalar value (constant). The operation is applied to every element of the series
Ex:
>>> srs
1 11
2 12
3 13
4 14
dtype: int64
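Continuing with the series srs displayed above, the following sketch applies a scalar to every element. The particular constants 2 and 3 are chosen only for illustration.

```python
import pandas as pd

srs = pd.Series([11, 12, 13, 14], index=[1, 2, 3, 4])

added = srs + 2    # 2 is added to every element
scaled = srs * 3   # every element is multiplied by 3
halved = srs / 2   # division always gives a float series

print(added)
print(scaled)
print(halved)
```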
Retrieving Values using Conditions: While displaying a series, a condition can be applied using relational
operators to retrieve only the matching elements, like below
Ex: >>> numsrs=pd.Series([1, 2, 3, 4, 5, 6], [11, 22, 33, 44, 55, 66])
>>> numsrs[numsrs<3]
11 1
22 2
dtype: int64
>>> numsrs[numsrs>=4]
44 4
55 5
66 6
dtype: int64
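The condition inside the brackets is itself a series of True and False values, and only the elements marked True are returned. The sketch below makes this intermediate step visible; combining two conditions with & is an addition for illustration and is not part of the original example.

```python
import pandas as pd

numsrs = pd.Series([1, 2, 3, 4, 5, 6], [11, 22, 33, 44, 55, 66])

mask = numsrs < 3   # boolean series: True where the condition holds
print(mask)
print(numsrs[mask])  # same result as numsrs[numsrs<3]

# Two conditions can be combined with &; each one needs its own parentheses
both = numsrs[(numsrs >= 2) & (numsrs <= 4)]
print(both)
```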
Deleting Elements from a Series: An element in a series can be deleted by passing the index of the element to be
deleted to the method drop( ). This method does not change the original Series Object; instead it creates
and returns a new Series Object, which the interpreter displays.
>>> primesrs
0 2
1 3
2 5
3 7
4 9
5 11
6 13
dtype: int64
>>> primesrs.drop(4)
0 2
1 3
2 5
3 7
5 11
6 13
dtype: int64
>>> primesrs
0 2
1 3
2 5
3 7
4 9
5 11
6 13
dtype: int64
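Since drop( ) leaves the original untouched, the result has to be stored if it is needed later. A minimal sketch, using the primesrs values shown above and a hypothetical variable name cleaned:

```python
import pandas as pd

primesrs = pd.Series([2, 3, 5, 7, 9, 11, 13])

# drop() returns a new series; keep it in a variable to use it afterwards
cleaned = primesrs.drop(4)

print(cleaned)
print(len(primesrs), len(cleaned))  # the original still has all 7 elements
```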
Sorting Series Values: The sort_values( ) function can be used to display the sorted Series Object. This function
returns the data items in sorted order, each with its original index, and by default does not change the Series Object.
Ex: >>> srs=pd.Series([18, 25, 13, 90, 35])
>>> srs.sort_values()
2 13
0 18
1 25
4 35
3 90
dtype: int64
>>> srs
0 18
1 25
2 13
3 90
4 35
dtype: int64
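As a further sketch with the same data, the ascending parameter of sort_values( ) reverses the order, and sort_index( ) arranges a series by its index labels. The variable names are illustrative.

```python
import pandas as pd

srs = pd.Series([18, 25, 13, 90, 35])

# Largest value first; each value keeps its original index label
descending = srs.sort_values(ascending=False)
print(descending)

# Sorting by index brings the elements back to their original order
by_index = descending.sort_index()
print(by_index)
```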