How to Learn Pandas?

Last Updated : 21 Aug, 2024

Pandas simplifies complex data operations: it lets you handle large datasets, transform raw data into meaningful insights, and perform a wide range of data analysis tasks with minimal code. Whether you're new to programming or an experienced developer looking to enhance your data analysis skills, learning Pandas is a crucial step in your journey.

By the end of this article, you'll have the knowledge and confidence to tackle real-world data challenges with ease. So, let's dive in and start your journey to becoming a Pandas expert!

Understand the Basics of Python

Before diving into Pandas, make sure you have a solid understanding of Python. Pandas is a Python library, so familiarity with Python's syntax, data types, and basic operations is crucial. If you're not yet comfortable with Python, consider starting with introductory courses or tutorials.

Introduction to Pandas

Pandas is an open-source library that makes working with data easier. It provides two main data structures: Series and DataFrame. A Series is a one-dimensional labeled array, like a list or a single column in a table, while a DataFrame is a two-dimensional table, like an Excel sheet or a SQL table. It can be used for several purposes, such as:

  • Data Cleaning: Pandas helps clean messy data by handling missing values, removing duplicates, and correcting data types.
  • Data Transformation: You can easily transform data by adding, removing, or modifying columns and rows.
  • Data Analysis: Pandas allows you to group, filter, and aggregate data to perform quick and efficient analyses.
  • Data Visualization: You can create basic plots and graphs directly from your data using pandas (its plotting methods rely on Matplotlib), helping you visualize trends and patterns.
  • Data Import/Export: Pandas makes it easy to read data from different file formats (such as CSV and Excel) and save it back in the same or other formats.
  • Time Series Analysis: Pandas offers powerful tools to work with time series data, making it ideal for financial and stock market analysis.
  • Merging and Joining Data: You can combine multiple datasets using merge or join operations, similar to SQL.
  • Efficient Data Handling: Pandas runs most operations on whole columns at once rather than row by row, which keeps it fast on large datasets that fit in your machine's memory.
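To make the first item in the list concrete, here is a minimal sketch of a cleaning pass. The data and the column names (name, age, score) are made up for illustration; they are not part of the mtcars dataset used later in this article.

```python
import pandas as pd

# Hypothetical messy data: a repeated row, ages stored as text, and a missing score.
raw = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Bob', 'Charlie'],
    'age': ['25', '30', '30', '35'],
    'score': [88.0, None, None, 92.0],
})

clean = raw.drop_duplicates().copy()       # remove the repeated 'Bob' row
clean['age'] = clean['age'].astype(int)    # correct the data type: text -> integers
clean['score'] = clean['score'].fillna(clean['score'].mean())  # fill the gap with the mean score
print(clean)
```

Each step returns a new object, so the original raw DataFrame stays untouched and you can compare before and after.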

Getting Started with Pandas

Install Pandas using pip: Before writing any code, install the Pandas library:

pip install pandas

Basic Pandas Structures

Series: A Series is like a column in a spreadsheet. It has an index and values. You can create a Series by passing a list to the pd.Series() function.

Python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
print(s)

Output
0    1
1    2
2    3
3    4
4    5
dtype: int64
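The index does not have to be 0, 1, 2, and so on. As a small sketch (the labels 'a', 'b', 'c' are arbitrary choices for this example), you can pass your own labels and then look values up either by label or by position:

```python
import pandas as pd

# Illustrative labels; any hashable values can serve as an index.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s['b'])            # look up by label    -> 20
print(s.iloc[0])         # look up by position -> 10
print(s.index.tolist())  # ['a', 'b', 'c']
```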

DataFrame: A DataFrame is like a full table. It has rows and columns, where each column can have different types of data. You can create a DataFrame by passing a dictionary to the pd.DataFrame() function.

Python
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)

Output
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
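To see that each column carries its own data type, you can inspect the DataFrame after creating it. The sketch below rebuilds the same table so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # one data type per column
df.info()          # column names, non-null counts, and memory usage
```

Checking shape and dtypes right after loading or building a DataFrame is a cheap way to catch columns that were read with an unexpected type.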

Reading Data

Pandas can read data from various sources, such as CSV files, Excel files, and SQL databases. The most common method is reading data from a CSV file using the pd.read_csv() function.

Python
import pandas as pd
df = pd.read_csv('https://media.geeksforgeeks.org/wp-content/uploads/20240819155616/mtcars.csv')
print(df.head())

Output:

               model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
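pd.read_csv() accepts any file-like object as well as a path or URL, and it has parameters for loading only part of a file. The sketch below uses an in-memory string as a stand-in for a file on disk (three rows from the output above, cut down to four columns) so that it runs without a download:

```python
import io
import pandas as pd

# Stand-in for a CSV file: a hand-typed excerpt of the mtcars data.
csv_text = """model,mpg,cyl,hp
Mazda RX4,21.0,6,110
Mazda RX4 Wag,21.0,6,110
Datsun 710,22.8,4,93
"""

# usecols keeps only the listed columns; index_col turns 'model' into the row labels.
cars = pd.read_csv(io.StringIO(csv_text), usecols=['model', 'mpg', 'hp'], index_col='model')
print(cars)
```

Restricting columns at read time is usually cheaper than loading everything and dropping columns afterwards.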

Basic Operations with DataFrames

Selecting Columns

You can select a single column by using the column name.

Python
mpg = df['mpg']
print(mpg)

Output:

0     21.0
1     21.0
2     22.8
3     21.4
4     18.7
5     18.1
6     14.3
7     24.4
8     22.8
9     19.2
10    17.8
11    16.4
12    17.3
13    15.2
14    10.4
15    10.4
16    14.7
17    32.4
18    30.4
19    33.9
20    21.5
21    15.5
22    15.2
23    13.3
24    19.2
25    27.3
26    26.0
27    30.4
28    15.8
29    19.7
30    15.0
31    21.4
Name: mpg, dtype: float64
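A single column name returns a Series. To keep several columns, or to pick out specific cells, you can pass a list of names or use .loc and .iloc. The sketch below uses a three-row excerpt of mtcars typed in by hand so that it runs on its own:

```python
import pandas as pd

# Hand-typed excerpt of mtcars, for illustration only.
df = pd.DataFrame({
    'model': ['Mazda RX4', 'Datsun 710', 'Hornet Sportabout'],
    'mpg': [21.0, 22.8, 18.7],
    'cyl': [6, 4, 8],
})

subset = df[['model', 'mpg']]   # a list of names returns a DataFrame
first_mpg = df.loc[0, 'mpg']    # row label 0, column 'mpg'
first_two = df.iloc[:2, :2]     # first two rows and first two columns, by position

print(type(df['mpg']).__name__, type(subset).__name__)  # Series DataFrame
```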

Filtering Rows

Filter rows where the mpg is greater than 20.

Python
high_mpg = df[df['mpg'] > 20]
print(high_mpg)

Output:

             model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0        Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1    Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2       Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3   Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
7        Merc 240D  24.4    4  146.7   62  3.69  3.190  20.00   1   0     4     2
8         Merc 230  22.8    4  140.8   95  3.92  3.150  22.90   1   0     4     2
17        Fiat 128  32.4    4   78.7   66  4.08  2.200  19.47   1   1     4     1
18     Honda Civic  30.4    4   75.7   52  4.93  1.615  18.52   1   1     4     2
19  Toyota Corolla  33.9    4   71.1   65  4.22  1.835  19.90   1   1     4     1
20   Toyota Corona  21.5    4  120.1   97  3.70  2.465  20.01   1   0     3     1
25       Fiat X1-9  27.3    4   79.0   66  4.08  1.935  18.90   1   1     4     1
26   Porsche 914-2  26.0    4  120.3   91  4.43  2.140  16.70   0   1     5     2
27    Lotus Europa  30.4    4   95.1  113  3.77  1.513  16.90   1   1     5     2
31      Volvo 142E  21.4    4  121.0  109  4.11  2.780  18.60   1   1     4     2
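Filters can combine several conditions. As a sketch on a hand-typed excerpt of mtcars, the example below keeps cars that have both mpg above 20 and four cylinders. Each condition needs its own parentheses, and the operators are & (and) and | (or) rather than the Python keywords:

```python
import pandas as pd

# Hand-typed excerpt of mtcars, for illustration only.
df = pd.DataFrame({
    'model': ['Mazda RX4', 'Datsun 710', 'Hornet Sportabout', 'Merc 240D', 'Hornet 4 Drive'],
    'mpg': [21.0, 22.8, 18.7, 24.4, 21.4],
    'cyl': [6, 4, 8, 4, 6],
})

# Boolean masks combined with & ; parentheses around each comparison are required.
efficient_small = df[(df['mpg'] > 20) & (df['cyl'] == 4)]

# The same filter written as a query string.
same_result = df.query('mpg > 20 and cyl == 4')

print(efficient_small)
```

Both forms select the same rows; query() tends to read better once a filter has more than two or three conditions.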

Adding New Columns

Add a new column named performance, which is the ratio of hp (horsepower) to wt (weight).

Python
df['performance'] = df['hp'] / df['wt']
print(df.head())

Output:

               model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb  performance
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4    41.984733
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4    38.260870
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1    40.086207
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1    34.214619
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2    50.872093
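Assigning to df['performance'] changes the DataFrame in place. If you would rather leave the original untouched, assign() returns a new DataFrame with the extra column. A minimal sketch on a hand-typed excerpt of mtcars:

```python
import pandas as pd

# Hand-typed excerpt of mtcars, for illustration only.
df = pd.DataFrame({
    'model': ['Mazda RX4', 'Datsun 710', 'Hornet Sportabout'],
    'hp': [110, 93, 175],
    'wt': [2.620, 2.320, 3.440],
})

# assign() builds a copy with the new column; df itself is not modified.
with_ratio = df.assign(performance=df['hp'] / df['wt'])

print(with_ratio.round(2))
print('performance' in df.columns)  # False
```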

Removing Columns

Remove the performance column that we just added.

Python
df = df.drop('performance', axis=1)
print(df.head())

Output:

               model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
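drop() also takes a columns= argument, which avoids having to remember axis=1, and it can remove several columns in one call. The sketch below uses a small hand-typed excerpt of mtcars:

```python
import pandas as pd

# Hand-typed excerpt of mtcars, for illustration only.
df = pd.DataFrame({
    'model': ['Mazda RX4', 'Datsun 710'],
    'mpg': [21.0, 22.8],
    'vs': [0, 1],
    'am': [1, 1],
})

# Remove two columns at once; drop() returns a new DataFrame.
trimmed = df.drop(columns=['vs', 'am'])

# errors='ignore' skips labels that are not present instead of raising KeyError.
safe = df.drop(columns=['performance'], errors='ignore')

print(trimmed.columns.tolist())  # ['model', 'mpg']
```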

Handling Missing Data

For demonstration, let's assume we have some missing data. We will first introduce some missing values, and then handle them.

Python
import pandas as pd
df = pd.read_csv('https://media.geeksforgeeks.org/wp-content/uploads/20240819155616/mtcars.csv')
# Introduce some missing values in the 'hp' column
# (.loc slices include both ends, so this covers rows 5 through 10)
df.loc[5:10, 'hp'] = None

# Fill missing values in 'hp' with the mean value
df['hp'] = df['hp'].fillna(df['hp'].mean())

# Drop rows with any missing values (though after filling, there shouldn't be any)
df = df.dropna()
print(df.head())

Output:

               model   mpg  cyl   disp     hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110.0  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110.0  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93.0  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110.0  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175.0  3.15  3.440  17.02   0   0     3     2
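Before filling or dropping anything, it helps to count how many values are actually missing. The sketch below does that on a tiny hand-made DataFrame, then shows two alternatives to a mean fill: dropping only the rows that lack hp, and filling with the median. Which strategy is appropriate depends on your data; this only illustrates the calls.

```python
import numpy as np
import pandas as pd

# Hand-made data with one missing horsepower value, for illustration only.
df = pd.DataFrame({
    'model': ['Mazda RX4', 'Datsun 710', 'Hornet Sportabout'],
    'hp': [110.0, np.nan, 175.0],
})

missing_per_column = df.isna().sum()            # count of missing values in each column
print(missing_per_column)

dropped = df.dropna(subset=['hp'])              # drop only rows where 'hp' is missing
filled = df.fillna({'hp': df['hp'].median()})   # fill 'hp' with its median (142.5 here)
print(filled)
```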

Grouping and Aggregating Data

Group the data by the number of cylinders (cyl) and calculate the average miles per gallon (mpg) for each group.

Python
import pandas as pd
df = pd.read_csv('https://media.geeksforgeeks.org/wp-content/uploads/20240819155616/mtcars.csv')
grouped_mpg = df.groupby('cyl')['mpg'].mean()
print(grouped_mpg)

Output:

cyl
4    26.663636
6    19.742857
8    15.100000
Name: mpg, dtype: float64
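A group can be summarized with more than one statistic at a time. As a sketch on a five-row excerpt of mtcars, the example below uses agg() with named aggregations; the output column names avg_mpg, max_hp, and cars are made up for this example:

```python
import pandas as pd

# Hand-typed excerpt of mtcars, for illustration only.
df = pd.DataFrame({
    'cyl': [6, 6, 4, 4, 8],
    'mpg': [21.0, 21.4, 22.8, 24.4, 18.7],
    'hp': [110, 110, 93, 62, 175],
})

# Each keyword is an output column: (input column, aggregation function).
summary = df.groupby('cyl').agg(
    avg_mpg=('mpg', 'mean'),
    max_hp=('hp', 'max'),
    cars=('mpg', 'count'),
)
print(summary)
```

The result is a DataFrame indexed by cyl, with one row per group and one column per requested statistic.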

Merging and Joining DataFrames

Let's create two small DataFrames and merge them.

Python
import pandas as pd
df = pd.read_csv('https://media.geeksforgeeks.org/wp-content/uploads/20240819155616/mtcars.csv')
df1 = pd.DataFrame({'car': ['Mazda RX4', 'Datsun 710'], 'hp': [110, 93]})
df2 = pd.DataFrame({'car': ['Mazda RX4', 'Datsun 710'], 'mpg': [21.0, 22.8]})

merged_df = pd.merge(df1, df2, on='car')
print(merged_df)

Output:

          car   hp   mpg
0   Mazda RX4  110  21.0
1  Datsun 710   93  22.8
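In the example above, both DataFrames contain the same cars, so nothing is lost. When the keys only partly overlap, the how parameter decides which rows survive. The sketch below adds one unmatched car to each side (Merc 230 and Fiat 128, with values taken from the filtered output earlier) to show the difference:

```python
import pandas as pd

df1 = pd.DataFrame({'car': ['Mazda RX4', 'Datsun 710', 'Merc 230'], 'hp': [110, 93, 95]})
df2 = pd.DataFrame({'car': ['Mazda RX4', 'Datsun 710', 'Fiat 128'], 'mpg': [21.0, 22.8, 32.4]})

inner = pd.merge(df1, df2, on='car')              # default how='inner': only cars in both
left = pd.merge(df1, df2, on='car', how='left')   # every car from df1; missing mpg becomes NaN
outer = pd.merge(df1, df2, on='car', how='outer', indicator=True)  # every car from either side

print(outer)
```

With indicator=True, the extra _merge column records whether each row came from the left table, the right table, or both, which is handy for checking a join.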

Saving Data

Finally, save the modified mtcars dataset to a CSV file.

Python
df.to_csv('mtcars_modified.csv', index=False)

Output:

The call prints nothing. After it runs, the file mtcars_modified.csv is written to the current working directory.
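One way to confirm that a file was written correctly is to read it back and compare. The sketch below does this in a temporary directory so it leaves nothing behind; the file name cars.csv and the two-row DataFrame are placeholders for this example.

```python
import os
import tempfile
import pandas as pd

# Small hand-typed DataFrame, for illustration only.
df = pd.DataFrame({'model': ['Mazda RX4', 'Datsun 710'], 'mpg': [21.0, 22.8]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cars.csv')   # placeholder file name
    df.to_csv(path, index=False)           # index=False leaves out the 0, 1, ... row labels
    restored = pd.read_csv(path)           # read the file back before the directory is removed

print(restored.equals(df))
```

Without index=False, the row labels are saved as an extra unnamed column, which then shows up as a stray column when the file is read again.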

Best Practices

  • Begin by practicing with small datasets to understand the basics.
  • Use descriptive variable names and comments to make your code easy to understand.
  • Pandas is optimized for vectorized operations, so avoid using loops when possible.
  • Pandas is continuously evolving, so keep up with the latest updates and best practices.
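To illustrate the point about vectorized operations, here is a small sketch that computes the same horsepower-to-weight ratio twice: once with an explicit loop and once as a single column expression. The three rows are a hand-typed excerpt of mtcars.

```python
import pandas as pd

# Hand-typed excerpt of mtcars, for illustration only.
df = pd.DataFrame({'hp': [110, 93, 175], 'wt': [2.620, 2.320, 3.440]})

# Loop version: works, but runs Python code once per row.
ratios_loop = []
for _, row in df.iterrows():
    ratios_loop.append(row['hp'] / row['wt'])

# Vectorized version: one expression applied to whole columns.
ratios_vec = df['hp'] / df['wt']

print(ratios_vec.round(2).tolist())
```

Both versions give the same numbers, but the vectorized form is shorter and, on large DataFrames, typically much faster.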

Conclusion

Pandas is an indispensable tool for anyone working with data in Python. By learning the basics and gradually exploring advanced features, you can efficiently manipulate, analyze, and visualize data. Start practicing today, and soon you'll be proficient in using pandas for all your data-related tasks.

