18_Pandas

Pandas is a Python library developed for data manipulation and analysis, featuring key data structures like Series and DataFrame. It allows users to perform various data processing steps including loading, preparing, and analyzing data, while providing tools for handling missing data and reshaping datasets. The library is essential for data scientists and analysts, supporting multiple file formats and enabling efficient data operations.

Uploaded by

Arif Ahmad
Copyright
© All Rights Reserved

Pandas

Python Library
Learning objective
• What is pandas?
• Key features of pandas
• Working with Pandas
• Pandas – data structure
• Series and DataFrame
• Data analysis
• Data manipulation
What is Pandas?
• In 2008, developer Wes McKinney started developing pandas
because he needed a high-performance, flexible tool for data
analysis.

• Pandas is basically used for data manipulation and analysis.

• With pandas, we can carry out the five typical steps of data
processing and analysis, regardless of where the data comes from:
• Load, prepare, manipulate, model, and analyse.
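
The five steps above can be sketched on a tiny data set. This is only an illustration: the column names hours and score, the values, and the use of NumPy's polyfit as a stand-in for the "model" step are all assumptions, not part of the original slides.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data standing in for a real file on disk.
raw = io.StringIO(
    "hours,score\n"
    "1,52\n"
    "2,\n"
    "3,61\n"
    "4,68\n"
    "5,71\n"
)

# 1. Load: read the CSV text into a DataFrame
df = pd.read_csv(raw)

# 2. Prepare: drop the row whose score is missing
df = df.dropna(subset=["score"])

# 3. Manipulate: add a derived column
df["passed"] = df["score"] >= 60

# 4. Model: fit a straight line, score = slope * hours + intercept
slope, intercept = np.polyfit(df["hours"], df["score"], deg=1)

# 5. Analyse: summarise the result
print(df)
print("pass rate:", df["passed"].mean())
print("estimated points per extra hour:", round(slope, 2))
```

In a real project each step is usually much larger, but the order of the steps stays the same.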
What is Pandas?
• The pandas package is the most important tool at the disposal of
Data Scientists and Analysts working in Python today.

• The powerful machine learning and glamorous visualization tools
may get all the attention, but pandas is the backbone of most data
projects.

• Not only is the pandas library a central component of the data
science toolkit, it is also used in conjunction with the other
libraries in that toolkit.
Key features of pandas
• Fast and efficient DataFrame object with default and customized
indexing.

• Tools for loading data into in-memory data objects from different
file formats.

• Data alignment and integrated handling of missing data.

• Reshaping and pivoting of data sets.
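
A minimal sketch of two of these features, index alignment with missing-data handling and pivoting. The labels and values are made up for illustration.

```python
import pandas as pd

# Two Series with partly different labels (hypothetical values).
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])

# Addition aligns on the index labels; "x" has no partner in b,
# so the result is NaN (missing) at "x".
total = a + b
print(total)

# Missing values can be filled in afterwards ...
print(total.fillna(0))
# ... or avoided by supplying a fill value during the operation.
print(a.add(b, fill_value=0))

# Reshaping: turn "long" records into a city-by-year table.
long_df = pd.DataFrame({
    "city": ["Muscat", "Muscat", "Sur", "Sur"],
    "year": [2023, 2024, 2023, 2024],
    "sales": [5, 7, 3, 4],
})
wide = long_df.pivot(index="city", columns="year", values="sales")
print(wide)
```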


Key features of pandas
• Columns from a data structure can be deleted or inserted.

• Group by data for aggregation and transformations.

• High performance merging and joining of data.

• Time series functionality, for data recorded at consistent
intervals of time (for example, sensor data).
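
The features on this slide can be sketched together as follows. The tables, column names, and sensor readings are hypothetical and exist only to show the calls.

```python
import pandas as pd

employees = pd.DataFrame({
    "empno": [101, 102, 103, 104],
    "dept_id": [1, 1, 2, 2],
    "salary": [2000, 3000, 5000, 7000],
})
departments = pd.DataFrame({"dept_id": [1, 2], "dept": ["Sales", "IT"]})

# Insert a column, then delete it again
employees["bonus"] = employees["salary"] * 0.1
employees = employees.drop(columns=["bonus"])

# Merge (join) the two tables on their shared key
merged = employees.merge(departments, on="dept_id", how="left")

# Group by department and aggregate
avg_salary = merged.groupby("dept")["salary"].mean()
print(avg_salary)

# Time series: readings at a fixed 1-hour interval,
# resampled to 2-hour averages
readings = pd.Series(
    [20.0, 21.0, 23.0, 22.0],
    index=pd.date_range("2024-01-01", periods=4, freq=pd.Timedelta(hours=1)),
)
two_hourly = readings.resample(pd.Timedelta(hours=2)).mean()
print(two_hourly)
```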
Working with Pandas
• The name pandas is derived from "panel data", a term for
multi-dimensional structured data sets.
• Can store data of different data types like string, float, boolean
• DataFrame: Two-dimensional, tabular data structure
• Series: A one-dimensional, array-like structure
• pandas supports reading and writing data in many formats, such as
CSV, Excel (.xlsx), JSON, XML, and HTML, as well as SQL databases.
(YAML has no built-in pandas reader; it needs a separate library.)
• Provides data cleaning, data transformation, and data reshaping.
• In practice, data pre-processing such as data cleaning is often
said to take 70-80% of the time in a data project.
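
As a rough sketch of reading and writing, the example below round-trips a small DataFrame through CSV and JSON. It uses in-memory text instead of real files so it runs anywhere; with real files you would pass a file path instead. The data is made up.

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ["Ali", "Ahmed"], "Age": [25, 30]})

# CSV: with no path given, to_csv returns the CSV text as a string
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: one object per row
json_text = df.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text))

print(csv_text)
print(json_text)
print(from_csv)
print(from_json)
```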
Pandas – Data structure
• Pandas deals with the following two primary data structures:
1. Series
2. DataFrame

• These data structures are built on top of NumPy arrays, which
means they are fast.
• The best way to think of these data structures is that the higher
dimensional data structure is a container of its lower dimensional
data structure.
• For example, DataFrame is a container of Series.
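
A small sketch of the "container" idea: pulling one column or one row out of a DataFrame gives a Series, and Series can be combined back into a DataFrame. The names and ages are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "Jack"], "Age": [28, 34]})

name_col = df["Name"]     # one column -> Series
first_row = df.iloc[0]    # one row -> Series

print(type(name_col))
print(type(first_row))

# Going the other way: Series that share an index combine into a DataFrame
names = pd.Series(["Tom", "Jack"], name="Name")
ages = pd.Series([28, 34], name="Age")
rebuilt = pd.concat([names, ages], axis=1)
print(rebuilt)
```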
Series:
• Series is a one-dimensional, array-like structure with homogeneous
data.
• For example, the following series is a collection of integers 10, 23, 56,
17, 52

• Key Points:
• Homogeneous data
• Size Immutable
• Values of Data Mutable
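
The series of integers mentioned above could be built as in the sketch below, which also shows that the whole Series shares one data type and that individual values can be changed.

```python
import pandas as pd

s = pd.Series([10, 23, 56, 17, 52])
print(s)
print(s.dtype)    # a single dtype for the whole Series

# Values are mutable: change the first element in place
s.iloc[0] = 99
print(s)

# Mixed values are still stored under one dtype: the generic "object"
mixed = pd.Series([1, "a"])
print(mixed.dtype)
```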
Series:
• A pandas Series can be created as follows:

pandas.Series(data, index, dtype, copy)

• data: data takes various forms like ndarray, list, constants
• index: Index values must be hashable and the same length as
data; duplicates are allowed. Defaults to 0, 1, ..., n-1 (like
np.arange(n)) if no index is passed.
• dtype: dtype is for data type.
• copy: Copy data. Default False
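
A short sketch of these parameters in use, with made-up values: data given as a list, as an ndarray, and as a constant.

```python
import numpy as np
import pandas as pd

# data from a list; index omitted, so labels default to 0, 1, 2
s1 = pd.Series([1.5, 2.5, 3.5])

# data from an ndarray, with an explicit index and dtype
s2 = pd.Series(np.array([1, 2, 3]), index=["a", "b", "c"], dtype="float64")

# data as a constant: the value is repeated for every index label
s3 = pd.Series(7, index=["x", "y", "z"])

print(list(s1.index))
print(s2)
print(s3)
```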
Series: example
import pandas as pd
data = ['a','b','c','d']
s = pd.Series(data,index=[100,101,102,103])
print(s)
print(type(s))
DataFrame
• First, a brief word on where the idea of a data frame came from.

• The data frame concept grew out of work on statistical software
for experimental research.

• A data frame holds tabular data.

• Specifically, a data frame is a data structure for cases in which
there are multiple observations (rows), each with several
measurements (columns).
DataFrame
• DataFrame is a two-dimensional, tabular structure with
heterogeneous data.
• Each column represents an attribute and each row represents one
record (for example, a person).
DataFrame – Key points
• Heterogeneous data
• Size mutable
• Data mutable
• Columns can be of different types
• Labelled axes (rows and columns)
• Can perform arithmetic operations on rows and columns
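
These key points can be seen in a small sketch. The subject columns, marks, and row labels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Tom", "Jack", "Steve"],
        "Maths": [70, 85, 90],
        "Science": [60, 75, 95],
    },
    index=["r1", "r2", "r3"],   # labelled rows
)
print(df.dtypes)   # columns have different types

# Size mutable: a new column built by arithmetic on existing columns
df["Total"] = df["Maths"] + df["Science"]

# Arithmetic down each column and across each row
col_means = df[["Maths", "Science"]].mean()
row_means = df[["Maths", "Science"]].mean(axis=1)

# Labelled axes: pick one cell by row label and column label
print(df.loc["r2", "Total"])
print(col_means)
print(row_means)
```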
DataFrame
• It can be created from various data sources, such as lists,
dictionaries, NumPy arrays, or other DataFrames.

• We can create DataFrame from different structures.
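
A minimal sketch of some of the sources listed above: a NumPy array, another DataFrame, and a list of dictionaries. The column names A and B and all values are illustrative.

```python
import numpy as np
import pandas as pd

# From a 2-D NumPy array, naming the columns explicitly
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(arr, columns=["A", "B"])

# From another DataFrame: copy() gives an independent object,
# so changing the copy leaves the original untouched
df_copy = df_from_array.copy()
df_copy["A"] = 0

# From a list of dictionaries, one dictionary per row
records = [{"Name": "Alex", "Age": 10}, {"Name": "Bob", "Age": 12}]
df_from_records = pd.DataFrame(records)

print(df_from_array)
print(df_copy)
print(df_from_records)
```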


DataFrame - example
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)
print(type(df))
DataFrame - example
import pandas as pd
data2 = [['Alex',10],['Bob',12],['Clarke',13]]
df1 = pd.DataFrame(data2,columns=['Name','Age'])
print(df1)
Create a DataFrame from dictionary
import pandas as pd
data3 = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
         'Age': [28, 34, 29, 42]}
df2 = pd.DataFrame(data3)
print(df2)
Data Analysis using Pandas
Reading a CSV file into a DataFrame with pandas

import pandas as pd
df5 = pd.read_csv('heart.csv')
print(df5)
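
The example above needs a heart.csv file in the working directory. As a sketch that runs without any file, the same call can read CSV text from memory. The column names id, age, and chol here are assumptions for illustration and are not taken from the real heart.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "id,age,chol\n1,63,233\n2,37,250\n3,41,204\n"

# index_col uses the "id" column as the row labels
df = pd.read_csv(io.StringIO(csv_text), index_col="id")

print(df.head(2))   # first two rows only
print(df.shape)     # (number of rows, number of columns)
```

For large files, head() is usually a better first look than printing the whole DataFrame.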

import pandas as pd

# Create a dictionary with data
data = {
    'Name': ['Ali', 'Ahmed', 'Yousuf'],
    'Age': [5, 30, 35],
    'City': ['Muscat', 'Nizwa', 'Sur']
}

# Create the DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
Create DataFrames from Lists
import pandas as pd

# Data as separate lists
names = ['Arif', 'Ahmed', 'Vijay']
ages = [45, 40, 41]
cities = ['Muscat', 'Sohar', 'Salalah']

# Create DataFrame
df = pd.DataFrame({'Name': names, 'Age': ages, 'City': cities})
print(df)
Creating a DataFrame with Index Labels
import pandas as pd

data = {
    'Name': ['arif', 'vijay'],
    'Age': [44, 42],
    'City': ['muscat', 'salalah']
}

# Create DataFrame
df = pd.DataFrame(data, index=['A','B'])
print(df)
Creating a DataFrame and then adding columns and data
import pandas as pd

df = pd.DataFrame()

df['Empno'] = [101,102,103,104]
df['Ename'] = ['Ali','Mohammed','Nasser','Abdullah']
df['Salary']= [2000,3000,5000,7000]

print(df)
Data Manipulation using pandas
Filtering, Sorting, Specific columns, Grouping etc.
Filtering Data
import pandas as pd
df = pd.read_csv('Admission.csv')

admit_filter = df[df['admitted'] == 1]
print(admit_filter)
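
Filters can also combine several conditions. The sketch below uses the column names from the examples in this deck (gmat, gpa, work_experience, admitted), but the values are made up because Admission.csv itself is not shown.

```python
import pandas as pd

# Hypothetical rows with the same column names as Admission.csv
df = pd.DataFrame({
    "gmat": [780, 690, 710, 660, 620],
    "gpa": [4.0, 3.3, 3.7, 3.0, 2.7],
    "work_experience": [3, 3, 5, 2, 1],
    "admitted": [1, 0, 1, 0, 0],
})

# Combine conditions with & (and) or | (or);
# each condition needs its own parentheses
strong = df[(df["admitted"] == 1) & (df["gpa"] >= 3.8)]
either = df[(df["gmat"] > 700) | (df["work_experience"] >= 3)]

# query() expresses the same filter as a string
same_as_strong = df.query("admitted == 1 and gpa >= 3.8")

print(strong)
print(either)
```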
Sorting Data
import pandas as pd
df = pd.read_csv('Admission.csv')

sort_on_gpa = df.sort_values(by='gpa')
print(sort_on_gpa)
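
sort_values sorts in ascending order by default. A sketch of descending and multi-column sorting, again with made-up rows that reuse the Admission.csv column names:

```python
import pandas as pd

# Hypothetical rows with the same column names as Admission.csv
df = pd.DataFrame({
    "gmat": [780, 690, 710, 660, 620],
    "gpa": [4.0, 3.3, 3.7, 3.0, 2.7],
    "work_experience": [3, 3, 5, 2, 1],
})

# Highest gpa first
by_gpa_desc = df.sort_values(by="gpa", ascending=False)

# Sort by work experience (ascending), then by gmat (descending)
by_exp_then_gmat = df.sort_values(
    by=["work_experience", "gmat"], ascending=[True, False]
)

# Top two rows by gpa, with a fresh 0-based index
top2 = by_gpa_desc.head(2).reset_index(drop=True)

print(by_gpa_desc)
print(by_exp_then_gmat)
print(top2)
```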
Display single column
import pandas as pd

df = pd.read_csv('Admission.csv')

select_gmat = df['gmat']
print(select_gmat)
Display Specific columns only
import pandas as pd

df = pd.read_csv('Admission.csv')

selected_columns = df[['gmat','gpa','admitted']]
print(selected_columns)
Group data by work experience and get the average gpa score
import pandas as pd

df = pd.read_csv('Admission.csv')

group_by_exp = df.groupby('work_experience')['gpa'].mean()

print(group_by_exp)
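
A group can be summarised with more than one statistic at a time. A sketch with hypothetical rows that reuse the Admission.csv column names:

```python
import pandas as pd

# Hypothetical rows with the same column names as Admission.csv
df = pd.DataFrame({
    "gpa": [4.0, 3.3, 3.7, 3.0, 2.7],
    "work_experience": [3, 3, 5, 2, 1],
    "admitted": [1, 0, 1, 0, 0],
})

# Several statistics of gpa for each work_experience group
summary = df.groupby("work_experience")["gpa"].agg(["mean", "min", "max", "count"])

# The mean of a 0/1 column is the share of 1s: here, the admission rate
admit_rate = df.groupby("work_experience")["admitted"].mean()

print(summary)
print(admit_rate)
```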
Get the statistics
import pandas as pd

df = pd.read_csv('Admission.csv')

print(df.describe())
Display information about DataFrame
import pandas as pd

df = pd.read_csv('Admission.csv')

# info() prints its report itself and returns None,
# so it does not need to be wrapped in print()
df.info()
Drop rows with any missing values
import pandas as pd
df = pd.read_csv('Admission.csv')

df_cleaned = df.dropna()
print(df_cleaned)
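
dropna() with no arguments removes every row that has at least one missing value, which can discard a lot of data. The sketch below shows some alternatives; the values are made up and filling with the column mean is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with some missing values
df = pd.DataFrame({
    "gmat": [780, np.nan, 710],
    "gpa": [4.0, 3.3, np.nan],
    "admitted": [1, 0, 1],
})

# Count the missing values in each column first
missing_counts = df.isna().sum()
print(missing_counts)

# Drop rows only when a specific column is missing
only_gpa = df.dropna(subset=["gpa"])

# Or keep every row and fill the gaps, here with each column's mean
filled = df.fillna({"gmat": df["gmat"].mean(), "gpa": df["gpa"].mean()})

print(only_gpa)
print(filled)
```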
You should now have learnt:
• What is pandas?
• Key features of pandas
• Working with Pandas
• Pandas – data structure
• Series and DataFrame
• Data analysis
• Data manipulation
