How to Learn Pandas?

Last Updated : 21 Aug, 2024

Pandas simplifies complex data operations, allowing you to handle large datasets, transform raw data into meaningful insights, and perform a wide range of data analysis tasks with minimal code. Whether you're new to programming or an experienced developer looking to enhance your data analysis skills, learning Pandas is a crucial step in your journey.

Table of Contents

Understand the Basics of Python
Introduction to Pandas
Getting Started with Pandas

Basic Pandas Structures
Reading Data
Basic Operations with DataFrames
Handling Missing Data
Grouping and Aggregating Data
Merging and Joining DataFrames
Saving Data
Best Practices

By the end of this article, you'll know how to load, inspect, clean, group, merge, and save data with Pandas, which covers most everyday data tasks. So, let's dive in!

Understand the Basics of Python

Before diving into Pandas, ensure you have a solid understanding of Python. Pandas is a Python library, so familiarity with Python's syntax, data types, and basic operations is crucial. If you're not yet comfortable with Python, consider starting with introductory courses or tutorials.

You can refer to these articles:
Python Tutorial | Learn Python Programming
Pandas Tutorial

Introduction to Pandas

Pandas is an open-source library that makes working with data easier. It provides two main data structures: Series and DataFrame. A Series is a one-dimensional labeled array, like a list or a column in a table, while a DataFrame is a two-dimensional table, like an Excel sheet or a SQL table. It can be used for several purposes, such as:

Data Cleaning: Pandas helps clean messy data by handling missing values, removing duplicates, and correcting data types.
Data Transformation: You can easily transform data by adding, removing, or modifying columns and rows.
Data Analysis: Pandas allows you to group, filter, and aggregate data to perform quick and efficient analyses.
Data Visualization: You can create basic plots and graphs directly from your data using Pandas (its plotting methods rely on Matplotlib), helping you visualize trends and patterns.
Data Import/Export: Pandas makes it easy to read data from different file formats (like CSV and Excel) and save it back in the same or other formats.
Time Series Analysis: Pandas offers powerful tools to work with time series data, making it ideal for financial and stock market analysis.
Merging and Joining Data: You can combine multiple datasets using merge or join operations, similar to SQL.
Efficient Data Handling: Pandas is built on NumPy and uses vectorized operations, so it handles datasets that fit in memory efficiently.
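To make the data cleaning item above concrete, here is a minimal sketch. The `raw` DataFrame and its values are made up for illustration; it shows removing duplicate rows and correcting a column's data type.

```python
import pandas as pd

# Hypothetical messy data: one duplicated row and ages stored as strings
raw = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Bob', 'Charlie'],
    'age': ['25', '30', '30', '35'],
})

# Remove exact duplicate rows and renumber the index
clean = raw.drop_duplicates().reset_index(drop=True)

# Correct the data type of the 'age' column from string to integer
clean['age'] = clean['age'].astype(int)

print(clean)
print(clean.dtypes)
```

Each of the other items in the list follows the same pattern: a short method call on a DataFrame or Series.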

Getting Started with Pandas

Install Pandas using pip: Before writing any code, install the Pandas library:

pip install pandas
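After installing, a quick way to check that the library is available is to import it and print its version. This is only a sanity check; the version string you see depends on your environment.

```python
import pandas as pd

# If the import succeeds, Pandas is installed; the version varies by machine
print(pd.__version__)
```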

Basic Pandas Structures

Series: A Series is like a column in a spreadsheet. It has an index and values. You can create a Series by passing a list to the pd.Series() function.

Python

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
print(s)

Output

0    1
1    2
2    3
3    4
4    5
dtype: int64
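The default index is 0, 1, 2, and so on, but a Series can also carry its own labels. The sketch below uses arbitrary labels 'a', 'b', and 'c' (chosen only for illustration) to show the index and the values separately.

```python
import pandas as pd

# Labels 'a', 'b', 'c' are arbitrary, chosen only for illustration
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s.index.tolist())   # the index labels
print(s.values)           # the underlying values
print(s['b'])             # look up a value by its label
```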

DataFrame: A DataFrame is like a full table. It has rows and columns, where each column can have different types of data. You can create a DataFrame by passing a dictionary to the pd.DataFrame() function.

Python

import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)

Output

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
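Since each column can hold a different type of data, it helps to inspect a DataFrame right after creating it. A small sketch using the same data as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
df.info()          # summary: column names, non-null counts, dtypes
```

Here `Age` is stored as an integer column while `Name` and `City` hold text.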

Reading Data

Pandas can read data from various sources, such as CSV files, Excel files, and SQL databases. The most common method is reading data from a CSV file using the pd.read_csv() function.

Python

import pandas as pd
df = pd.read_csv('https://media.geeksforgeeks.org/wp-content/uploads/20240819155616/mtcars.csv')
print(df.head())

Output:

               model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
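Reading from a SQL database works in a similar way. The sketch below is an illustration under stated assumptions: it builds a throwaway in-memory SQLite database with a made-up table named `cars`, then loads a query result with `pd.read_sql_query()`.

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory SQLite database; the table name 'cars'
# and its rows are made up for this illustration
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE cars (model TEXT, mpg REAL)')
conn.executemany(
    'INSERT INTO cars VALUES (?, ?)',
    [('Mazda RX4', 21.0), ('Datsun 710', 22.8)],
)
conn.commit()

# Run a query and load the result into a DataFrame
sql_df = pd.read_sql_query('SELECT model, mpg FROM cars', conn)
conn.close()

print(sql_df)
```

Excel files can be read with `pd.read_excel()`, which needs an extra engine such as openpyxl to be installed.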

Basic Operations with DataFrames

Selecting Columns

You can select a single column by using the column name.

Python

mpg = df['mpg']
print(mpg)

Output:

0     21.0
1     21.0
2     22.8
3     21.4
4     18.7
5     18.1
6     14.3
7     24.4
8     22.8
9     19.2
10    17.8
11    16.4
12    17.3
13    15.2
14    10.4
15    10.4
16    14.7
17    32.4
18    30.4
19    33.9
20    21.5
21    15.5
22    15.2
23    13.3
24    19.2
25    27.3
26    26.0
27    30.4
28    15.8
29    19.7
30    15.0
31    21.4
Name: mpg, dtype: float64
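To select several columns at once, pass a list of column names. The sketch below uses a small stand-in DataFrame with three rows taken from the output above, so it runs without downloading the full dataset.

```python
import pandas as pd

# Small stand-in for the mtcars data, using three rows from the output above
df = pd.DataFrame({
    'model': ['Mazda RX4', 'Datsun 710', 'Hornet Sportabout'],
    'mpg': [21.0, 22.8, 18.7],
    'hp': [110, 93, 175],
})

# A list of names inside the brackets returns a DataFrame
subset = df[['model', 'mpg']]
print(subset)

# A single name returns a Series instead
print(type(df['mpg']).__name__)
print(type(subset).__name__)
```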

Filtering Rows

Filter rows where the mpg is greater than 20.

Python

high_mpg = df[df['mpg'] > 20]
print(high_mpg)

Output:

             model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0        Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1    Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2       Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3   Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
7        Merc 240D  24.4    4  146.7   62  3.69  3.190  20.00   1   0     4     2
8         Merc 230  22.8    4  140.8   95  3.92  3.150  22.90   1   0     4     2
17        Fiat 128  32.4    4   78.7   66  4.08  2.200  19.47   1   1     4     1
18     Honda Civic  30.4    4   75.7   52  4.93  1.615  18.52   1   1     4     2
19  Toyota Corolla  33.9    4   71.1   65  4.22  1.835  19.90   1   1     4     1
20   Toyota Corona  21.5    4  120.1   97  3.70  2.465  20.01   1   0     3     1
25       Fiat X1-9  27.3    4   79.0   66  4.08  1.935  18.90   1   1     4     1
26   Porsche 914-2  26.0    4  120.3   91  4.43  2.140  16.70   0   1     5     2
27    Lotus Europa  30.4    4   95.1  113  3.77  1.513  16.90   1   1     5     2
31      Volvo 142E  21.4    4  121.0  109  4.11  2.780  18.60   1   1     4     2
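Filters can also combine several conditions. The sketch below uses a four-row stand-in for the dataset; it shows `&` for combining conditions and `isin()` for matching against a list of values.

```python
import pandas as pd

# Four-row stand-in for the mtcars data
df = pd.DataFrame({
    'model': ['Mazda RX4', 'Datsun 710', 'Hornet 4 Drive', 'Hornet Sportabout'],
    'mpg': [21.0, 22.8, 21.4, 18.7],
    'cyl': [6, 4, 6, 8],
})

# Each condition needs its own parentheses; & means AND, | means OR
six_cyl_high_mpg = df[(df['mpg'] > 20) & (df['cyl'] == 6)]
print(six_cyl_high_mpg)

# isin() keeps rows whose value appears in the given list
four_or_eight = df[df['cyl'].isin([4, 8])]
print(four_or_eight)
```

Python's `and` and `or` keywords do not work element-wise on Series, which is why `&` and `|` are used here.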

Adding New Columns

Add a new column named performance, which is the ratio of hp (horsepower) to wt (weight).

Python

df['performance'] = df['hp'] / df['wt']
print(df.head())

Output:

               model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb  performance
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4    41.984733
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4    38.260870
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1    40.086207
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1    34.214619
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2    50.872093

Removing Columns

Remove the performance column that we just added.

Python

df = df.drop('performance', axis=1)
print(df.head())

Output:

               model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
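As a variation, `drop()` also accepts a `columns=` keyword, which reads more clearly than `axis=1` and can remove several columns in one call. A short sketch on a two-row stand-in DataFrame:

```python
import pandas as pd

# Two-row stand-in for the mtcars data with the extra 'performance' column
df = pd.DataFrame({
    'model': ['Mazda RX4', 'Datsun 710'],
    'hp': [110, 93],
    'wt': [2.620, 2.320],
    'performance': [110 / 2.620, 93 / 2.320],
})

# columns= is equivalent to passing the labels with axis=1
trimmed = df.drop(columns=['wt', 'performance'])
print(trimmed)

# drop() returns a new DataFrame; the original is unchanged
print(df.columns.tolist())
```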

Handling Missing Data

For demonstration, let's assume we have some missing data. We will first introduce some missing values, and then handle them.

Python

import pandas as pd
# Load the same mtcars dataset used in the Reading Data section
df = pd.read_csv('https://media.geeksforgeeks.org/wp-content/uploads/20240819155616/mtcars.csv')
# Introduce some missing values in the 'hp' column
df.loc[5:10, 'hp'] = None

# Fill missing values in 'hp' with the mean value
df['hp'] = df['hp'].fillna(df['hp'].mean())

# Drop rows with any missing values (though after filling, there shouldn't be any)
df = df.dropna()
print(df.head())

Output:

               model   mpg  cyl   disp     hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110.0  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110.0  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93.0  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110.0  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175.0  3.15  3.440  17.02   0   0     3     2
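Before filling or dropping anything, it is usually worth counting how many values are missing. The sketch below uses a small made-up DataFrame; it also fills with the median rather than the mean, which is one possible alternative when a column has outliers.

```python
import numpy as np
import pandas as pd

# Made-up data with two missing 'hp' values
df = pd.DataFrame({
    'model': ['Mazda RX4', 'Datsun 710', 'Valiant', 'Duster 360'],
    'hp': [110.0, 93.0, np.nan, np.nan],
})

# Count missing values per column before deciding how to handle them
print(df.isna().sum())

# Fill with the median of the known values (here 101.5)
filled = df['hp'].fillna(df['hp'].median())
print(filled)
```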

Grouping and Aggregating Data

Group the data by the number of cylinders (cyl) and calculate the average miles per gallon (mpg) for each group.

Python

import pandas as pd
# Load the same mtcars dataset used in the Reading Data section
df = pd.read_csv('https://media.geeksforgeeks.org/wp-content/uploads/20240819155616/mtcars.csv')
grouped_mpg = df.groupby('cyl')['mpg'].mean()
print(grouped_mpg)

Output:

cyl
4    26.663636
6    19.742857
8    15.100000
Name: mpg, dtype: float64
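A group can be summarized with more than one statistic at a time. The following sketch, run on a five-row stand-in DataFrame, uses named aggregation with `agg()`; the output column names (`mean_mpg`, `max_hp`, `n_cars`) are chosen here for illustration.

```python
import pandas as pd

# Five-row stand-in for the mtcars data
df = pd.DataFrame({
    'cyl': [6, 6, 4, 6, 8],
    'mpg': [21.0, 21.0, 22.8, 21.4, 18.7],
    'hp': [110, 110, 93, 110, 175],
})

# Named aggregation: each keyword becomes an output column,
# defined as (source column, aggregation function)
summary = df.groupby('cyl').agg(
    mean_mpg=('mpg', 'mean'),
    max_hp=('hp', 'max'),
    n_cars=('mpg', 'count'),
)
print(summary)
```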

Merging and Joining DataFrames

Let's create two small DataFrames and merge them.

Python

import pandas as pd
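# Load mtcars again; the merge below does not need it, but it is saved in the next section
df = pd.read_csv('https://media.geeksforgeeks.org/wp-content/uploads/20240819155616/mtcars.csv')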
df1 = pd.DataFrame({'car': ['Mazda RX4', 'Datsun 710'], 'hp': [110, 93]})
df2 = pd.DataFrame({'car': ['Mazda RX4', 'Datsun 710'], 'mpg': [21.0, 22.8]})

merged_df = pd.merge(df1, df2, on='car')
print(merged_df)

Output:

          car   hp   mpg
0   Mazda RX4  110  21.0
1  Datsun 710   93  22.8
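In the example above every car appears in both DataFrames. When keys do not match on both sides, the `how` parameter decides which rows are kept. The sketch below adds one unmatched car to each side for illustration.

```python
import pandas as pd

# 'Valiant' exists only on the left, 'Fiat 128' only on the right
df1 = pd.DataFrame({'car': ['Mazda RX4', 'Datsun 710', 'Valiant'], 'hp': [110, 93, 105]})
df2 = pd.DataFrame({'car': ['Mazda RX4', 'Datsun 710', 'Fiat 128'], 'mpg': [21.0, 22.8, 32.4]})

# Default is how='inner': only cars present in both DataFrames
inner = pd.merge(df1, df2, on='car')

# how='left': keep every row of df1; missing mpg values become NaN
left = pd.merge(df1, df2, on='car', how='left')

# how='outer': keep everything; indicator=True adds a '_merge' column
# showing where each row came from
outer = pd.merge(df1, df2, on='car', how='outer', indicator=True)

print(inner)
print(left)
print(outer)
```

These options correspond to INNER, LEFT, and FULL OUTER joins in SQL.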

Saving Data

Finally, save the modified mtcars dataset to a CSV file.

Python

df.to_csv('mtcars_modified.csv', index=False)

Output:

The call prints nothing. It writes the file mtcars_modified.csv to the current working directory.
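To see what `index=False` changes, the sketch below writes a small made-up DataFrame twice into a temporary directory (so nothing is left on disk) and reads both files back. The file names are arbitrary.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'model': ['Mazda RX4', 'Datsun 710'], 'mpg': [21.0, 22.8]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Without the index: the file contains only the data columns
    no_index_path = os.path.join(tmp_dir, 'no_index.csv')
    df.to_csv(no_index_path, index=False)
    restored = pd.read_csv(no_index_path)

    # With the default index=True: the row labels are written as an extra column
    with_index_path = os.path.join(tmp_dir, 'with_index.csv')
    df.to_csv(with_index_path)
    with_index = pd.read_csv(with_index_path)

print(restored.columns.tolist())
print(with_index.columns.tolist())
```

The second file gains an unnamed first column holding the old row labels, which is why `index=False` is the common choice when the index carries no information.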

Best Practices

Begin by practicing with small datasets to understand the basics.
Use descriptive variable names and comments to make your code easy to understand.
Pandas is optimized for vectorized operations, so avoid using loops when possible.
Pandas is continuously evolving, so keep up with the latest updates and best practices.
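To illustrate the advice about avoiding loops, here is a small sketch that computes the same hp-to-wt ratio in two ways on a three-row stand-in DataFrame. Both give the same numbers; the vectorized form is shorter and is typically much faster on large data.

```python
import pandas as pd

df = pd.DataFrame({'hp': [110, 93, 175], 'wt': [2.620, 2.320, 3.440]})

# Loop version: works, but handles one row at a time in Python
ratios = []
for _, row in df.iterrows():
    ratios.append(row['hp'] / row['wt'])
df['performance_loop'] = ratios

# Vectorized version: one expression applied to whole columns
df['performance'] = df['hp'] / df['wt']

print(df)
```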

Conclusion

Pandas is an indispensable tool for anyone working with data in Python. By learning the basics and gradually exploring advanced features, you can efficiently manipulate, analyze, and visualize data. Start practicing today, and soon you'll be proficient in using pandas for all your data-related tasks.

mrmishraoofc