
Chapter - 4 Data Analysis With Pandas

Chapter 4 provides an overview of the Pandas library for data analysis in Python, detailing its key data structures, including Series and DataFrames, and essential functionalities such as data manipulation, cleaning, and exploratory data analysis. It covers installation, data loading, and methods for handling missing data, as well as advanced features like hierarchical indexing. The chapter emphasizes best practices in data analysis and the importance of visualizing data for better understanding.

Uploaded by

anelesikhondze28
Copyright
© All Rights Reserved

Chapter – 4

Data Analysis with Pandas


An overview of the Pandas package, The Pandas data structure-Series, The
DataFrame, The Essential Basic Functionality: Reindexing and altering labels, Head and
tail, Binary operations, Functional statistics, Function application, Sorting, Indexing and
selecting data, Computational tools, Working with Missing Data, Advanced Uses of
Pandas for Data Analysis: Hierarchical indexing, The Panel data
Data Analysis
• Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful
information, draw conclusions, and support decision-making.

What is Pandas
• Pandas is an open-source data manipulation and analysis library for Python.
• It provides data structures for efficiently storing and manipulating large datasets.

Key Data Structures:


• Series:
• A one-dimensional labeled array capable of holding any data type.
• Created using ‘pd.Series(data, index)’.

• DataFrame:
• A two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled
axes (rows and columns).
• Created using ‘pd.DataFrame(data, index=index, columns=columns)’
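A minimal sketch of the two constructors described above. The labels and values are made up for illustration only.

```python
import pandas as pd

# A Series: one-dimensional values with an explicit label index
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame: built from a dict of columns; each column can have its own dtype
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cara"], "age": [28, 35, 41]},
    index=["r1", "r2", "r3"],
)

print(s)
print(df)
```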
Installing Pandas:
• Install Pandas using: ‘pip install pandas’

Importing Pandas:
• Import Pandas at the beginning of your script/notebook:
‘import pandas as pd’

Loading Data:
• Pandas can read data from various sources, such as CSV files, Excel workbooks, and SQL
databases, using functions like ‘pd.read_csv()’, ‘pd.read_excel()’, and ‘pd.read_sql()’.
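As a rough illustration of loading data, the sketch below reads CSV text from an in-memory buffer so that it runs without a file on disk. The column names and values are hypothetical; with a real file you would pass its path to ‘pd.read_csv()’ instead.

```python
import io
import pandas as pd

# In-memory text standing in for a CSV file (illustrative data)
csv_text = "product,price\npen,1.5\nbook,12.0\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```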
Exploratory Data Analysis (EDA)
• head() and tail() to view the first and last rows.
• info() to get a concise summary of the DataFrame.
• describe() to get statistical information.

Indexing and Selection:


• ‘[]’, ‘.loc[]’ and ‘.iloc[]’ for selecting specific rows or columns.

Data Cleaning:
• Handling missing values using ‘isnull()’, ‘notnull()’, ‘dropna()’, and ‘fillna()’
• Removing duplicates with ‘duplicated()’ and ‘drop_duplicates()’
• Data type conversion using ‘astype()’
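The cleaning steps listed above could be combined roughly as follows. This is only a sketch on a small made-up table; the column names ‘id’ and ‘score’ are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"id": ["1", "2", "2", "3"], "score": [9.0, 7.5, 7.5, None]})

dupes = df.duplicated()          # True for rows that repeat an earlier row
df = df.drop_duplicates()        # keep the first copy of each row
df = df.dropna()                 # drop rows that still contain missing values
df["id"] = df["id"].astype(int)  # convert the id column from text to integers

print(df)
```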
Data Manipulation:
• Adding and removing columns using ‘df['new_column'] = …’ and ‘drop()’
• Applying functions to data using ‘apply()’

Grouping and Aggregation:


• Grouping data using ‘groupby()’
• Aggregating data using functions like ‘sum()’, ‘mean()’
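A short sketch of grouping followed by aggregation, using an invented ‘team’/‘points’ table:

```python
import pandas as pd

df = pd.DataFrame({"team": ["A", "A", "B", "B"], "points": [10, 14, 7, 9]})

# Split the rows by team, then aggregate the points within each group
totals = df.groupby("team")["points"].sum()
means = df.groupby("team")["points"].mean()

print(totals)
print(means)
```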

Plotting
• Pandas provides a convenient interface for basic plotting using ‘plot()’

Best Practices in Data Analysis:


• Clean and preprocess data before analysis.
• Document your code and analysis steps.
• Use meaningful variable and column names.
• Visualize data for better understanding.
An overview of the Pandas package
• What is Pandas
• Pandas is an open-source data manipulation and analysis library for Python. It
provides data structures like Series and DataFrame for efficient data
manipulation with integrated indexing.

• Key Features:
• Data structures: Series (1D labeled array) and DataFrame (2D labeled table).
• Handling missing data.
• Merging and joining datasets.
• Reshaping and pivoting data.
• Data alignment and indexing.
• Time series functionality.
Key Data Structures:
• Series:
• A one-dimensional labeled array that can hold any data type. It is similar to a column in a
spreadsheet or a single column in a SQL table.

• DataFrame:
• A two-dimensional labeled data structure with columns that can be of different types. It is similar
to a spreadsheet or SQL table.
Data Manipulation:
• Reading Data:
• Pandas supports reading data from various file formats such as CSV, Excel, SQL,
and more.

Data Selection and Filtering:


• Selecting specific rows or columns based on conditions.
Missing Data:
• Handling missing data is crucial in data analysis. Pandas provides methods to
handle NaN values.

Data Analysis:
Grouping and Aggregation:
• Grouping data based on a column and applying aggregate functions.
Merging and Joining:
• Combining multiple DataFrames based on common columns.
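Combining two DataFrames on a common column might look like this. The frames and the ‘key’ column are hypothetical; ‘how="inner"’ keeps only keys present in both.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "name": ["x", "y", "z"]})
right = pd.DataFrame({"key": [2, 3, 4], "value": [20, 30, 40]})

# Inner join: only rows whose key appears in both frames survive
merged = pd.merge(left, right, on="key", how="inner")

print(merged)
```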

Time Series:
• Pandas provides functionality for working with time series data.
The Pandas data structure-Series
Pandas Series:

Introduction:
• Definition: A Pandas Series is a one-dimensional labeled array capable
of holding any data type.

Key Characteristics:
• Single dtype: A Series has one data type (dtype) for all of its elements; values of mixed types are stored with the generic ‘object’ dtype.
• Labeled: Each element in a Series has a label or index.
Creating a Series:

Syntax: ‘pd.Series(data, index=index)’


• ‘data’: The data to be stored in the Series (can be a list, NumPy array, dictionary,
etc.).
• ‘index’: (Optional) A label or list of labels for the Series. If not provided, a default
integer index is used.

Accessing Elements:
• By Index: Elements can be accessed by label or by integer position, much like indexing a dictionary, list, or array.
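A brief sketch of the common ways to read elements from a Series (labels and values are illustrative):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = s["b"]         # lookup by label
by_loc = s.loc["c"]       # explicit label-based access
by_position = s.iloc[0]   # explicit position-based access
subset = s[["a", "c"]]    # a list of labels returns a smaller Series

print(by_label, by_loc, by_position)
print(subset)
```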
Attributes and Methods:
• Attributes:
• ‘.values’: Returns the values of the Series as a NumPy array.
• ‘.index’: Returns the index of the Series.

• Methods:
• ‘.head(n)’: Returns the first n elements of the Series.
• ‘.tail(n)’: Returns the last n elements of the Series.
• ‘.describe()’: Generates descriptive statistics of the Series.

Use Cases:
• Data Cleaning:
• Series are often used in data cleaning processes to handle missing or inconsistent data.
• Data Analysis:
• Series form the foundation for more complex data structures in Pandas and are extensively used in data
analysis tasks.
• Visualization:
• Series can be visualized using various plotting libraries like Matplotlib or Seaborn.
The Essential Basic Functionality: Reindexing and altering labels
Reindexing and Altering Labels in Python with Pandas
Introduction:
Data Structures in Pandas:
• Pandas is a powerful library in Python for data manipulation and
analysis.
• It provides two primary data structures: Series and DataFrame.
• These structures come with flexible indexing capabilities,
allowing users to manipulate and reorganize their data
efficiently.
Understanding Indexing:

Index:
• The index is a fundamental concept in pandas that labels the rows or
elements in a data structure.
• It helps in identifying, selecting, and manipulating data in a structured
manner.

Series Indexing:
• In a Pandas Series, if no index is given, a default integer index from 0 to
N-1 is assigned, where N is the number of elements in the Series.
• You can explicitly set the index while creating a Series or change it later.
DataFrame Indexing:
• DataFrames, being two-dimensional structures, have both row and
column indices.
• By default, row indices are assigned similarly to Series, while column
names serve as column indices.

Reindexing:
Definition:
• Reindexing is the process of changing the index of a Series or DataFrame
to a new set of labels.
• It's useful when you want to align two data structures with different
indices or when you need to reshape your data.
The ‘reindex()’ method:
• It is used to conform the data to a new index. It returns a new
object with the data aligned to the new index.
• Missing values (NaN) are introduced for labels that were not
present in the original index.
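The behaviour described above can be sketched on a small Series. The labels are invented; note that ‘d’ does not exist in the original index, so it comes back as NaN.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Conform the data to a new set of labels; the original Series is unchanged
r = s.reindex(["c", "b", "a", "d"])

print(r)
```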
• Altering Labels:
• Changing Labels in Series:
• You can alter the labels of a Series using the ‘rename()’ method.
Changing Labels in DataFrame:
• For DataFrames, ‘rename()’ can alter both row labels (the ‘index’ argument) and column labels (the ‘columns’ argument).
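A small sketch of altering labels with ‘rename()’ on both structures. All labels here are illustrative; a dict maps each old label to its new one, and labels not in the dict are left as they are.

```python
import pandas as pd

s = pd.Series([1, 2], index=["a", "b"])
s2 = s.rename({"a": "alpha"})  # relabel one entry of the Series index

df = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
df2 = df.rename(index={"r1": "row1"}, columns={"x": "value"})

print(s2)
print(df2)
```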
Head and Tail in Pandas
Head:
• Definition:
• In pandas, the ‘head()’ method is used to display the first few rows of a DataFrame or Series.
• Syntax:
• For a DataFrame: ‘df.head(n)’, where n is the number of rows to display (default is 5).
• For a Series: ‘series.head(n)’
Tail:
• Definition:
• In pandas, the ‘tail()’ method is used to display the last few rows of a DataFrame or Series.
• Syntax:
• For a DataFrame: ‘df.tail(n)’, where n is the number of rows to display (default is 5).
• For a Series: ‘series.tail(n)’
Example:
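The original slide example is not recoverable from the text, so the following is only a stand-in sketch using a ten-row DataFrame with a made-up column ‘n’:

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})

first3 = df.head(3)   # first three rows
last2 = df.tail(2)    # last two rows
default = df.head()   # no argument: first five rows

print(first3)
print(last2)
```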
Binary Operations in Pandas:
• Binary operations involve operations between two objects. In the context of Pandas, these operations are
performed element-wise, aligning indices.
• a. Series Operations:
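A minimal sketch of index alignment in a Series operation, with invented labels. Only ‘b’ and ‘c’ exist in both operands, so the other positions become NaN.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Values are matched by label, not by position
total = s1 + s2

print(total)
```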
b. DataFrame Operations:
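Element-wise arithmetic between DataFrames works the same way, aligning on both row and column labels. A sketch with two small illustrative frames:

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
df2 = pd.DataFrame({"x": [10, 20], "y": [30, 40]})

summed = df1 + df2   # cell-by-cell addition of matching labels
scaled = df1 * 2     # a scalar is applied to every cell

print(summed)
print(scaled)
```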
Handling Missing Values:
• Pandas provides ways to handle missing or NaN values during binary operations.
• ‘fillna’: Fills NaN values with a specified value or a value derived from other DataFrame/Series.
• ‘dropna’: Drops NaN values, allowing for cleaner data handling during operations.

• Broadcasting:
• Pandas supports broadcasting, which means performing operations on data with different shapes.
The smaller object (for example a scalar or a Series) is broadcast to match the shape of the larger one.
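The two ideas above can be sketched together: filling the NaN values produced by misaligned labels, and broadcasting a Series of column means across every row of a DataFrame. The data is illustrative.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=["a", "b"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])

# Labels 'a' and 'c' have no partner, so the sum is NaN there; replace with 0
filled = (s1 + s2).fillna(0)

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# df.mean() is a Series indexed by column name; it is broadcast over the rows
centered = df - df.mean()

print(filled)
print(centered)
```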
Comparison Operations:
• Binary operations also include comparison operations that return boolean values.
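For instance, comparing a Series with a scalar yields a boolean Series that can then be used as a filter (values are made up):

```python
import pandas as pd

s = pd.Series([5, 12, 8])

mask = s > 7        # element-wise comparison, returns booleans
selected = s[mask]  # keep only the elements where the mask is True

print(mask)
print(selected)
```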
Functional statistics, Function application, Sorting
Functional Statistics:
• In the context of Pandas, functional statistics means computing summary statistics by calling
methods such as ‘mean()’, ‘median()’, and ‘sum()’ on DataFrame columns or on a whole DataFrame.
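A short sketch of these statistical methods on an illustrative two-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

mean_a = df["a"].mean()      # statistic of a single column
median_a = df["a"].median()
column_sums = df.sum()       # one value per column

print(mean_a, median_a)
print(column_sums)
```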
Function Application:
• Pandas provides the ‘apply()’ method, which applies a function along an axis of a
DataFrame: to each column (‘axis=0’, the default) or to each row (‘axis=1’).
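A minimal sketch of ‘apply()’ along each axis; the lambda functions and data are only examples.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# axis=0 (default): the function receives one column at a time
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives one row at a time
row_sum = df.apply(lambda row: row.sum(), axis=1)

print(col_range)
print(row_sum)
```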
Sorting in Pandas:
• Sorting is a fundamental operation in data analysis. In Pandas, you can use the ‘sort_values()’
method to sort a DataFrame by one or more columns.
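A sketch of sorting by one column and by several columns, on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["c", "a", "b"], "score": [2, 3, 2]})

# Highest score first
by_score = df.sort_values("score", ascending=False)

# Sort by score, then break ties alphabetically by name
by_both = df.sort_values(["score", "name"])

print(by_score)
print(by_both)
```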
Indexing and selecting data
Indexing in Pandas:
• Index:
• The label or key assigned to each row or column in a Pandas DataFrame or
Series.
• Default Index:
• If an index is not specified, Pandas assigns a default numerical index starting
from 0.
• Custom Index:
• Users can set custom labels for indexing, making it more meaningful and
user-friendly.
Selecting Data by Label:
• ‘.loc[]’ accessor:
• Used for label-based indexing.
• Enables selecting data by specifying row and column labels.
• Example:
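A possible illustration of label-based selection with ‘.loc[]’. The row labels and columns are invented; note that label slices include both endpoints.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [28, 35, 41], "city": ["X", "Y", "Z"]},
    index=["ann", "bob", "cara"],
)

one_value = df.loc["bob", "age"]            # single cell by row and column label
some_rows = df.loc[["ann", "cara"], ["city"]]  # lists of labels return a DataFrame
label_slice = df.loc["ann":"bob"]           # slice by label, end label included

print(one_value)
print(some_rows)
print(label_slice)
```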

Selecting Data by Position:


• ‘.iloc[]’ accessor:
• Used for integer-location based indexing.
• Enables selecting data by specifying row and column positions.
• Example:
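A possible illustration of position-based selection with ‘.iloc[]’, using the same kind of invented table. Positions start at 0 and slice ends are excluded, as in plain Python.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [28, 35, 41], "city": ["X", "Y", "Z"]},
    index=["ann", "bob", "cara"],
)

one_value = df.iloc[0, 1]    # first row, second column
first_two = df.iloc[0:2]     # rows at positions 0 and 1
first_column = df.iloc[:, 0] # every row of the first column

print(one_value)
print(first_two)
print(first_column)
```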
Computational Tools
1. Introduction to Pandas:
• Definition: Pandas is an open-source data manipulation and analysis library for
Python.
• Key Components: Series and DataFrame are the primary data structures in Pandas.
• Installation: Use ‘pip install pandas’ to install Pandas.

2. Data Structures in Pandas:


• Series:
• One-dimensional array-like object.
• Contains homogeneous data.
• DataFrame:
• Two-dimensional table with rows and columns.
• Can store heterogeneous data types.
3. Importing Pandas and Loading Data:
• Import Pandas using ‘import pandas as pd’.
• Loading data into Pandas:
• ‘pd.read_csv()’ for CSV files.
• ‘pd.read_excel()’ for Excel files.
• ‘pd.read_sql()’ for SQL queries or database tables.

4. Data Exploration:
• Head and Tail: Use ‘df.head()’ and ‘df.tail()’ to view the top and bottom rows of the
DataFrame.
• Info and Describe: ‘df.info()’ provides information about the DataFrame, while
‘df.describe()’ gives statistical summaries.
5. Data Cleaning:
• Handling Missing Values: Use ‘df.dropna()’ or ‘df.fillna()’ to handle
missing data.
• Dropping Columns or Rows: ‘df.drop()’ allows removal of specified
columns or rows.

6. Data Selection and Indexing:


• Column Selection: Use ‘df['column_name']’ or ‘df.column_name’ to
select a single column.
• Row Selection: Use ‘df.loc[]’ for label-based selection or ‘df.iloc[]’ for integer
position-based selection.
7. Data Filtering and Sorting:
• Filtering: Apply a boolean condition to filter rows, for example ‘df[df['column_name'] > value]’.
• Sorting: Use ‘df.sort_values()’ to sort DataFrame based on column values.

8. Grouping and Aggregation:


• GroupBy: Group data using ‘df.groupby(‘column_name’)’.
• Aggregation Functions: Apply functions like ‘sum()’, ‘mean()’ and ‘count()’ to grouped
data.

9. Data Visualization with Pandas:


• Basic Plotting: Use ‘df.plot()’ for basic plots.
• Customization: Pandas integrates with Matplotlib for advanced customization.
Working with Missing Data

1. Introduction to Missing Data:


• Missing data is a common occurrence in datasets and can arise due to
various reasons such as data entry errors, sensor malfunctions, or
incomplete records.

2. Identifying Missing Data:


• Pandas provides methods to identify missing data in a DataFrame:
• ‘isnull()’: Returns a DataFrame of the same shape as the input with
‘True’ for missing values and ‘False’ for non-missing values.
• ‘notnull()’: Opposite of ‘isnull()’; returns ‘True’ for non-missing values.
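A small sketch of detecting missing values; the DataFrame is illustrative and contains one missing entry per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})

missing = df.isnull()            # True where a value is missing
present = df.notnull()           # the opposite mask
per_column = df.isnull().sum()   # count of missing values in each column

print(missing)
print(per_column)
```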
3. Handling Missing Data:
• There are several strategies to deal with missing data:
Dropping Missing Values:
• ‘dropna()’: Drops rows or columns containing missing values.
Filling Missing Values:
• ‘fillna(value)’: Fills missing values with a specified constant.
• ‘ffill()’ or ‘bfill()’: Forward or backward fill to propagate the previous or next value to fill
missing values.

4. Imputation:
• Imputation involves replacing missing values with estimated values based on the available
data. Common imputation methods include mean, median, or mode imputation.
• ‘fillna(df.mean(numeric_only=True))’: Fill missing values in numeric columns with the mean of each column.
• ‘fillna(df.median(numeric_only=True))’: Fill missing values in numeric columns with the median of each column.
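The dropping, filling, and imputation strategies above can be compared side by side on one small numeric DataFrame (values invented for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

dropped = df.dropna()               # keep only rows with no missing values
zero_filled = df.fillna(0)          # replace missing values with a constant
forward = df.ffill()                # carry the previous value forward
backward = df.bfill()               # pull the next value backward
mean_filled = df.fillna(df.mean())  # all columns are numeric here

print(mean_filled)
```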
5. Interpolation:
• Interpolation is another technique for estimating missing values based on the
surrounding data points. Pandas provides the ‘interpolate()’ method for this
purpose.
• ‘interpolate()’: Estimates missing values using various interpolation methods.
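With its default (linear) method, ‘interpolate()’ fills gaps by drawing a straight line between the neighbouring known values. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Default method is linear interpolation between known points
filled = s.interpolate()

print(filled)
```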

6. Handling Missing Data in Time Series:


• Time series data often involves missing values due to irregular time intervals.
Pandas provides specific methods to handle missing data in time series.
• ‘asfreq()’: Adjusts the frequency of the time series data.
• ‘reindex()’: Adjusts the index of the time series data.
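As a rough sketch of the time series case, ‘asfreq()’ can turn an irregular daily series into a regular one, exposing the missing day as NaN so it can then be filled. The dates and values are illustrative.

```python
import pandas as pd

# 2023-01-03 is absent from the original observations
idx = pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-04"])
ts = pd.Series([1.0, 2.0, 4.0], index=idx)

daily = ts.asfreq("D")   # regular daily index; the new date holds NaN
filled = daily.ffill()   # one possible way to fill the gap

print(daily)
print(filled)
```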
Advanced Uses of Pandas for Data Analysis: Hierarchical Indexing

Hierarchical Indexing in Python with Pandas

Introduction:
• Hierarchical indexing, also known as MultiIndex, is a Pandas feature that allows
multiple levels of indices on a single axis of a Series or DataFrame. It is particularly
useful for complex datasets that must be organized and analyzed across multiple
dimensions or levels of categorization.
Creating a MultiIndex:
• You can create a MultiIndex in Pandas using the constructors of the ‘pd.MultiIndex’ class, such as
‘from_tuples()’, ‘from_arrays()’, and ‘from_product()’. It can be applied to both rows and columns.
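A minimal sketch of building a two-level row index. The level names ‘group’ and ‘num’ and the values are assumptions made for the example.

```python
import pandas as pd

# Each tuple is one row label: (outer level, inner level)
index = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2), ("B", 1), ("B", 2)],
    names=["group", "num"],
)

df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=index)

print(df)
```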
Accessing Data with MultiIndex:
• Locating Data:
• You can use the ‘.loc’ accessor to access data using a MultiIndex.

• Cross-section (xs):
• The ‘xs’ method allows you to get a cross-section from a DataFrame
with a MultiIndex.
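The two access styles above might be used as follows, on a small DataFrame with hypothetical ‘group’ and ‘num’ levels:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2), ("B", 1), ("B", 2)],
    names=["group", "num"],
)
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=index)

a_rows = df.loc["A"]                  # all rows under outer label 'A'
one_cell = df.loc[("B", 2), "value"]  # a full tuple identifies a single row
num_1 = df.xs(1, level="num")         # cross-section on the inner level

print(a_rows)
print(one_cell)
print(num_1)
```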
Indexing and Slicing:
• Indexing with ‘slice’
• You can pass ‘slice’ objects inside ‘.loc’ to select a range of labels on each level of a MultiIndex.
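A sketch of slicing across levels. ‘pd.IndexSlice’ is not mentioned in the slides; it is shown here as a shorthand for writing the same ‘slice’ objects. The index must be sorted for label slicing to work, which ‘from_product()’ gives us here.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["A", "B", "C"], [1, 2]], names=["group", "num"]
)
df = pd.DataFrame({"value": range(6)}, index=index)

# Explicit slice objects: groups 'A' through 'B', every inner label
a_to_b = df.loc[(slice("A", "B"), slice(None)), :]

# Same idea with the IndexSlice helper: groups 'A' through 'B', inner label 2
idx = pd.IndexSlice
a_to_b_num2 = df.loc[idx["A":"B", 2], :]

print(a_to_b)
print(a_to_b_num2)
```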

Stacking and Unstacking:


• Stacking:
• The ‘stack’ method pivots a level of the column labels into the row index; for a DataFrame
with single-level columns the result is a Series with a MultiIndex.
Unstacking:
• The ‘unstack’ method does the reverse, pivoting a level of the row index into the columns,
for example converting a Series with a MultiIndex to a DataFrame.
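A round trip through ‘stack’ and ‘unstack’ on a small illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r1", "r2"])

# Column labels move into an inner row-index level: result is a Series
stacked = df.stack()

# The inner level moves back out to the columns: result is a DataFrame again
restored = stacked.unstack()

print(stacked)
print(restored)
```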

Aggregation and Grouping:


• Grouping by Level:
• You can perform grouping and aggregation based on one or more
levels of the MultiIndex.
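Grouping by an index level could be sketched like this, again with hypothetical ‘group’ and ‘num’ levels:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2), ("B", 1), ("B", 2)],
    names=["group", "num"],
)
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=index)

# Aggregate over the inner level by grouping on the outer one
per_group = df.groupby(level="group").sum()

print(per_group)
```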
Reshaping with ‘stack’ and ‘unstack’:
• Hierarchical Column Index:
• MultiIndex can also be applied to columns, creating a hierarchical column
index.

import pandas as pd

# Creating a MultiIndex for columns
columns = pd.MultiIndex.from_tuples([('Value', 'First'), ('Value', 'Second')], names=['Type', 'Order'])

# Creating a DataFrame with MultiIndex columns.
# `index` is assumed to be a four-entry row index defined earlier.
# The data is given row by row, because plain dict keys such as 'First'
# would not match the ('Value', 'First') column tuples and would yield NaN.
df_columns = pd.DataFrame([[10, 50], [20, 60], [30, 70], [40, 80]], index=index, columns=columns)
Stacking and Unstacking Columns:
• You can use ‘stack’ and ‘unstack’ with columns as well.
The Panel Data
In the context of Pandas, the ‘Panel’ data structure has been
deprecated and removed from the library. It was designed to handle
three-dimensional data but was rarely used and considered unintuitive;
it was deprecated in Pandas version 0.20.0 and removed in version 0.25.0.

Pandas primarily focuses on two main data structures:


‘Series’ (1D)
‘DataFrame’ (2D)
These two structures are powerful and cover a wide range of data
manipulation and analysis tasks.
The recommended approach is to use a ‘MultiIndex’ with a ‘DataFrame’.
Here's a basic example of working with panel data using pandas:

import pandas as pd
import numpy as np

# Create a sample panel data using MultiIndex
index = pd.MultiIndex.from_product(
    [['Entity1', 'Entity2'], pd.date_range('2023-01-01', '2023-01-03')],
    names=['Entity', 'Date'])

# Create a DataFrame with the MultiIndex
data = pd.DataFrame(np.random.randn(len(index), 3), index=index,
                    columns=['Variable1', 'Variable2', 'Variable3'])

# Display the sample panel data
print(data)
