18_Pandas

Pandas is a Python library developed for data manipulation and analysis, featuring key data structures like Series and DataFrame. It allows users to perform various data processing steps including loading, preparing, and analyzing data, while providing tools for handling missing data and reshaping datasets. The library is essential for data scientists and analysts, supporting multiple file formats and enabling efficient data operations.

Uploaded by

Arif Ahmad


Pandas

Python Library
Learning objective
• What is pandas?
• Key features of pandas
• Working with Pandas
• Pandas – data structure
• Series and DataFrame
• Data analysis
• Data manipulation
What is Pandas?
• In 2008, developer Wes McKinney started developing pandas
when he needed a high-performance, flexible tool for data
analysis.

• Pandas is basically used for data manipulation and analysis.

• Using pandas, we can carry out five typical steps in the
processing and analysis of data, regardless of the origin of
the data:
• Load, prepare, manipulate, model, and analyse.
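The steps above can be sketched on a tiny data set. This is a minimal illustration, not a full workflow: the inline CSV text and its column names are made up for the example, and the modelling step is left out because it usually relies on other libraries.

```python
import io
import pandas as pd

# Hypothetical inline data standing in for a real file.
raw = io.StringIO("name,score\nAli,80\nSara,\nOmar,95\n")

# Load: read the CSV text into a DataFrame.
df = pd.read_csv(raw)

# Prepare: drop the row whose score is missing.
df = df.dropna(subset=["score"])

# Manipulate: derive a new column from an existing one.
df["passed"] = df["score"] >= 50

# Analyse: compute a simple summary statistic.
mean_score = df["score"].mean()

print(df)
print(mean_score)
```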
What is Pandas?
• The pandas package is one of the most important tools at the disposal of
Data Scientists and Analysts working in Python today.

• The powerful machine learning and glamorous visualization tools
may get all the attention, but pandas is the backbone of most data
projects.

• Not only is the pandas library a central component of the data
science toolkit, but it is also used in conjunction with other libraries in
that collection.
Key features of pandas
• Fast and efficient DataFrame object with default and customized
indexing.

• Tools for loading data into in-memory data objects from different
file formats.

• Data alignment and integrated handling of missing data.

• Reshaping and pivoting of data sets.
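As a rough illustration of the alignment, missing-data, and pivoting features listed above, the sketch below uses small made-up Series and a made-up sales table; the names and values are assumptions for the example only.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])

# Arithmetic aligns on index labels; "x" has no partner in b,
# so the result is marked as missing (NaN) at that label.
total = a + b

# Missing values can then be handled explicitly, here by filling with 0.
filled = total.fillna(0)

# Pivoting: reshape long data (one row per city/year pair) into a wide table.
long_df = pd.DataFrame({
    "city": ["Muscat", "Muscat", "Sohar", "Sohar"],
    "year": [2023, 2024, 2023, 2024],
    "sales": [5, 7, 3, 4],
})
wide = long_df.pivot(index="city", columns="year", values="sales")

print(total)
print(filled)
print(wide)
```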

Key features of pandas
• Columns from a data structure can be deleted or inserted.

• Group by data for aggregation and transformations.

• High performance merging and joining of data.

• Time series functionality for data recorded at consistent time
intervals (for example, sensor data).
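The features on this slide can be tried out with a few lines each. The following is only a sketch: the employee table, the location table, and the sensor readings are invented for illustration.

```python
import pandas as pd

emp = pd.DataFrame({
    "dept": ["IT", "IT", "HR"],
    "name": ["Ali", "Sara", "Omar"],
    "salary": [2000, 3000, 4000],
})

# Insert a column, then delete it again.
emp["bonus"] = emp["salary"] * 0.1
emp = emp.drop(columns=["bonus"])

# Group by department and aggregate the salaries.
avg_salary = emp.groupby("dept")["salary"].mean()

# Merge (join) with a second table on the shared "dept" column.
locations = pd.DataFrame({"dept": ["IT", "HR"], "city": ["Muscat", "Sohar"]})
merged = emp.merge(locations, on="dept", how="left")

# Time series: readings taken every 12 hours, resampled to daily means.
readings = pd.Series(
    [10.0, 14.0, 20.0, 22.0],
    index=pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 12:00",
        "2024-01-02 00:00", "2024-01-02 12:00",
    ]),
)
daily = readings.resample("D").mean()

print(avg_salary)
print(merged)
print(daily)
```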
Working with Pandas
• The name pandas is derived from "panel data", a term for
multi-dimensional structured data sets.
• Can store data of different data types, such as string, float, and boolean.
• DataFrame: a two-dimensional, tabular data structure.
• Series: a one-dimensional, array-like structure.
• pandas supports reading and writing data in various file formats such as
.xlsx, .csv, .xml, .json, and .html, as well as SQL databases. YAML has no
built-in reader and needs a separate parser.
• Provides tools for data cleaning, data transformation, and data reshaping.
• In practice, a commonly quoted estimate is that 70-80% of project time is
spent on data pre-processing such as data cleaning.
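A small sketch of reading, writing, and a basic cleaning step follows. To keep it self-contained it writes to in-memory text instead of files; a real project would pass file paths to the same functions. The table contents are made up.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Ali", "Sara", None], "age": [30, None, 25]})

# Write the table to CSV text and read it back.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# The same table as JSON, one record per row.
json_text = df.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text))

# A basic cleaning step: drop rows that contain any missing value.
clean = from_csv.dropna()

print(from_csv)
print(clean)
```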
Pandas – Data structure
• Pandas deals with the following two primary data structures:
1. Series
2. DataFrame

• These data structures are built on top of NumPy arrays, which
means they are fast.
• The best way to think of these data structures is that the higher
dimensional data structure is a container of its lower dimensional
data structure.
• For example, DataFrame is a container of Series.
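The container idea can be checked directly: selecting one column of a DataFrame gives back a Series, and the values of that Series can be viewed as a NumPy array. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "Jack"], "Age": [28, 34]})

# Selecting a single column returns a Series.
ages = df["Age"]
print(type(ages))

# The values of the Series can be viewed as a NumPy array.
values = ages.to_numpy()
print(type(values))
```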
Series:
• Series is a one-dimensional array-like structure with homogeneous
data.
• For example, a Series can hold the integers 10, 23, 56, 17, 52.

• Key Points:
• Homogeneous data
• Size Immutable
• Values of Data Mutable
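The key points above can be seen with the same integers. This is only a sketch: it treats "size immutable" as meaning that growing a Series is done by building a new one, here with pd.concat.

```python
import pandas as pd

s = pd.Series([10, 23, 56, 17, 52])

# Homogeneous data: the whole Series has a single dtype.
print(s.dtype)

# Values are mutable: an existing element can be changed in place.
s.iloc[0] = 99

# Growing the Series creates a new object; s keeps its length.
longer = pd.concat([s, pd.Series([7])], ignore_index=True)

print(s)
print(longer)
```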
Series:
• A pandas Series can be created as follows:

pandas.Series(data, index, dtype, copy)

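• data: the data can take various forms, such as an ndarray, a list, or a constant.

• index: Index values must be hashable and the same length as data; they
do not have to be unique. If no index is passed, the default is the
integers 0 to n-1, like np.arange(n).
• dtype: the data type of the values. If not given, it is inferred from data.
• copy: whether to copy the input data. Default False.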
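The parameters described above can be tried with a few different inputs. A short sketch with arbitrary example values:

```python
import numpy as np
import pandas as pd

# data from a list, with the default index 0..n-1
s1 = pd.Series([1.5, 2.5, 3.5])

# data from an ndarray, with an explicit index and dtype
s2 = pd.Series(np.array([1, 2, 3]), index=["a", "b", "c"], dtype="float64")

# a constant repeated for every index label
s3 = pd.Series(5, index=["x", "y", "z"])

print(s1)
print(s2)
print(s3)
```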
Series: example
import pandas as pd
data = ['a','b','c','d']
s = pd.Series(data,index=[100,101,102,103])
print(s)
print(type(s))
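With a custom index like the one in this example, elements can be read either by label or by position. A minimal sketch that rebuilds the same Series:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'], index=[100, 101, 102, 103])

# .loc selects by index label, .iloc by integer position.
by_label = s.loc[101]
by_position = s.iloc[1]

# A list of labels selects several elements at once.
subset = s.loc[[100, 103]]

print(by_label, by_position)
print(subset)
```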
DataFrame
• First, a note on where the idea of a data frame comes from.

• The data frame concept came out of research on statistical software.

• A data frame represents tabular data.

• Specifically, a data frame is a data structure for cases in
which there are several observations (rows), each with several
measurements (columns).
DataFrame
• DataFrame is a two-dimensional, table-like structure that can hold heterogeneous data.
• Each column represents an attribute, and each row represents one
record (for example, a person).
DataFrame – Key points
• Heterogeneous data
• Size mutable
• Data mutable
• Columns can potentially be of different types
• Labelled axes (rows and columns)
• Can perform arithmetic operations on rows and columns
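A quick way to see these points is to build a small table and change it. The sketch below uses made-up names and values.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tom", "Jack"],
    "Age": [28, 34],
    "Height": [1.75, 1.80],
})

# Heterogeneous data: each column has its own dtype.
print(df.dtypes)

# Arithmetic on a column, stored as a new column (size is mutable).
df["AgeNextYear"] = df["Age"] + 1

# A new row can be added by assigning to a new index label.
df.loc[2] = ["Steve", 29, 1.70, 30]

# Arithmetic across columns, computed row by row.
row_sums = df[["Age", "AgeNextYear"]].sum(axis=1)

print(df)
print(row_sums)
```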
DataFrame
• It can be created from various data sources, such as lists,
dictionaries, NumPy arrays, or other DataFrames.

• We can create DataFrame from different structures.
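Lists and dictionaries are covered by the examples in this deck; the sketch below adds the two remaining sources named above, a NumPy array and another DataFrame. The column names are arbitrary.

```python
import numpy as np
import pandas as pd

# From a two-dimensional NumPy array, with column names supplied.
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(arr, columns=["a", "b"])

# From another DataFrame, keeping only a subset of its columns.
df_subset = pd.DataFrame(df_from_array, columns=["b"])

print(df_from_array)
print(df_subset)
```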

DataFrame - example
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)  # a flat list becomes a single column labelled 0
print(df)
print(type(df))
DataFrame - example
import pandas as pd
data2 = [['Alex',10],['Bob',12],['Clarke',13]]
df1 = pd.DataFrame(data2,columns=['Name','Age'])
print(df1)
Create a DataFrame from dictionary
import pandas as pd
data3 = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],
'Age':[28,34,29,42]}
df2 = pd.DataFrame(data3)
print(df2)
Data Analysis using Pandas
Reading CSV files with Pandas using DataFrame

import pandas as pd
df5 = pd.read_csv('heart.csv')
print(df5)
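After loading a CSV file, a few calls give a quick first look at the data. Since the contents of heart.csv are not shown here, the sketch below reads a small inline CSV instead; its column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file; the columns are made up for illustration.
csv_text = "age,chol,target\n63,233,1\n37,250,1\n57,354,0\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))      # first two rows
print(df.shape)        # (number of rows, number of columns)
print(df.describe())   # summary statistics for the numeric columns
```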
Create a DataFrame from dictionary - example
import pandas as pd

# Create a dictionary with data
data = {
    'Name': ['Ali', 'Ahmed', 'Yousuf'],
    'Age': [5, 30, 35],
    'City': ['Muscat', 'Nizwa', 'Sur']
}

# Create the DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
Create DataFrames from Lists
import pandas as pd

# Data as separate lists

names = ['Arif', 'Ahmed', 'Vijay']
ages = [45, 40, 41]
cities = ['Muscat', 'Sohar', 'Salalah']

# Create DataFrame
df = pd.DataFrame({'Name': names, 'Age': ages, 'City': cities})
print(df)
Creating a DataFrame with Index Labels
import pandas as pd

data = {
'Name': ['arif','vijay'],
'Age': [44,42],
'City': ['muscat','salalah']
}

# Create DataFrame
df = pd.DataFrame(data, index=['A','B'])
print(df)
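Once rows have index labels, they can be selected by label with .loc or by position with .iloc. A short sketch that rebuilds the same table:

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['arif', 'vijay'], 'Age': [44, 42], 'City': ['muscat', 'salalah']},
    index=['A', 'B'],
)

row_b = df.loc['B']          # row selected by its label
first_row = df.iloc[0]       # row selected by its position
age_a = df.loc['A', 'Age']   # single value by row label and column name

print(row_b)
print(age_a)
```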
Creating a DataFrame and then adding columns and data
import pandas as pd

df = pd.DataFrame()

df['Empno'] = [101,102,103,104]
df['Ename'] = ['Ali','Mohammed','Nasser','Abdullah']
df['Salary']= [2000,3000,5000,7000]

print(df)
Data Manipulation using pandas
Filtering, sorting, selecting specific columns, grouping, etc.
Filtering Data
import pandas as pd
df = pd.read_csv('Admission.csv')

admit_filter = df[df['admitted'] == 1]
print(admit_filter)
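Conditions can be combined when filtering. The sketch below assumes the two column names used in this deck, admitted and gpa, but reads made-up inline rows instead of Admission.csv, whose contents are not shown here.

```python
import io
import pandas as pd

# Hypothetical rows standing in for Admission.csv.
csv_text = "gpa,admitted\n3.9,1\n2.8,0\n3.4,1\n3.1,0\n"
df = pd.read_csv(io.StringIO(csv_text))

# Combine conditions with & (and) or | (or); each condition needs parentheses.
high_gpa_admits = df[(df["admitted"] == 1) & (df["gpa"] >= 3.5)]

# The same filter written as a query string.
same_with_query = df.query("admitted == 1 and gpa >= 3.5")

print(high_gpa_admits)
```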
Sorting Data
import pandas as pd
df = pd.read_csv('Admission.csv')

sort_on_gpa = df.sort_values(by='gpa')
print(sort_on_gpa)
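sort_values sorts in ascending order by default and returns a new DataFrame. The sketch below shows descending order and sorting on two columns; it again uses made-up inline rows with the admitted and gpa columns instead of Admission.csv.

```python
import io
import pandas as pd

# Hypothetical rows standing in for Admission.csv.
csv_text = "gpa,admitted\n3.9,1\n2.8,0\n3.4,1\n3.1,0\n"
df = pd.read_csv(io.StringIO(csv_text))

# Highest GPA first.
by_gpa_desc = df.sort_values(by="gpa", ascending=False)

# Admitted rows first, then by GPA from low to high within each group.
by_two_keys = df.sort_values(by=["admitted", "gpa"], ascending=[False, True])

print(by_gpa_desc)
print(by_two_keys)
```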
Display single column
import pandas as pd