Pandas simplifies complex data operations, allowing you to handle large datasets, transform raw data into meaningful insights, and perform a wide range of data analysis tasks with minimal code. Whether you're new to programming or an experienced developer looking to enhance your data analysis skills, learning Pandas is a crucial step in your journey.
By the end of this article, you'll have the knowledge and confidence to tackle real-world data challenges with ease. So, let's dive in and start your journey to becoming a Pandas expert!
Understand the Basics of Python
Before diving into Pandas, ensure you have a solid understanding of Python. Pandas is built on top of Python, so familiarity with Python's syntax, data types, and basic operations is crucial. If you're not yet comfortable with Python, consider starting with introductory courses or tutorials.
Introduction to Pandas
Pandas is an open-source library that makes working with data easier. It provides two main data structures: Series and DataFrame. A Series is a one-dimensional labeled array, like a list or a column in a table, while a DataFrame is a two-dimensional table, like an Excel sheet or a SQL table. It can be used for several purposes, such as:
- Data Cleaning: Pandas helps clean messy data by handling missing values, removing duplicates, and correcting data types.
- Data Transformation: You can easily transform data by adding, removing, or modifying columns and rows.
- Data Analysis: Pandas allows you to group, filter, and aggregate data to perform quick and efficient analyses.
- Data Visualization: You can create basic plots and graphs directly from your data using Pandas (its plotting methods rely on Matplotlib), helping you visualize trends and patterns.
- Data Import/Export: Pandas makes it easy to read data from different file formats (like CSV and Excel) and save it back in the same or other formats.
- Time Series Analysis: Pandas offers powerful tools to work with time series data, making it ideal for financial and stock market analysis.
- Merging and Joining Data: You can combine multiple datasets using merge or join operations, similar to SQL.
- Efficient Data Handling: Pandas is built on NumPy and relies on vectorized operations, so it handles datasets that fit in memory efficiently.
Getting Started with Pandas
Install Pandas using pip: before running the examples, install the Pandas library:
pip install pandas
Basic Pandas Structures
Series: A Series is like a column in a spreadsheet. It has an index and values. You can create a Series by passing a list to the pd.Series() function.
Python
import pandas as pd
s = pd.Series([1, 2, 3, 4, 5])
print(s)
Output:
0    1
1    2
2    3
3    4
4    5
dtype: int64
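The index of a Series does not have to be the default 0, 1, 2, and so on. The sketch below assumes a custom string index (the labels 'a', 'b', 'c' are made up for illustration) and shows label-based access, position-based access, and element-wise arithmetic.

```python
import pandas as pd

# Hypothetical labels chosen for illustration; any hashable values can serve as an index.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s['b'])      # label-based access
print(s.iloc[0])   # position-based access
print(s * 2)       # arithmetic is applied element-wise and keeps the index
```

Using `.iloc` for positions and plain brackets (or `.loc`) for labels keeps the two kinds of lookup clearly separated.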
DataFrame: A DataFrame is like a full table. It has rows and columns, where each column can have different types of data. You can create a DataFrame by passing a dictionary to the pd.DataFrame() function.
Python
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
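Once a DataFrame exists, it is often useful to check its size, column names, and column types before doing anything else. A minimal sketch, reusing the same three-row table (the variable name df_people is only illustrative):

```python
import pandas as pd

df_people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

print(df_people.shape)             # (number of rows, number of columns)
print(df_people.columns.tolist())  # column names as a plain list
print(df_people.dtypes)            # one data type per column
print(df_people.loc[1, 'City'])    # single value: row label 1, column 'City'
```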
Reading Data
Pandas can read data from various sources, such as CSV files, Excel files, and SQL databases. The most common method is reading data from a CSV file using the pd.read_csv() function.
Python
import pandas as pd
df = pd.read_csv('https://media.geeksforgeeks.org/wp-content/uploads/20240819155616/mtcars.csv')
print(df.head())
Output:
model mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
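pd.read_csv() also accepts file-like objects, which makes it easy to experiment without a network connection or a file on disk. The sketch below assumes a small in-memory CSV (three rows copied from the output above, with only some of the columns) and uses the usecols parameter to load a subset of columns.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a real file; values are taken from the rows shown above.
csv_text = """model,mpg,cyl,hp
Mazda RX4,21.0,6,110
Datsun 710,22.8,4,93
Hornet Sportabout,18.7,8,175
"""

# usecols limits which columns are loaded; 'cyl' is skipped here.
cars = pd.read_csv(io.StringIO(csv_text), usecols=['model', 'mpg', 'hp'])
print(cars)
```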
Basic Operations with DataFrames
Selecting Columns
You can select a single column by using the column name.
Python
mpg = df['mpg']
print(mpg)
Output:
0 21.0
1 21.0
2 22.8
3 21.4
4 18.7
5 18.1
6 14.3
7 24.4
8 22.8
9 19.2
10 17.8
11 16.4
12 17.3
13 15.2
14 10.4
15 10.4
16 14.7
17 32.4
18 30.4
19 33.9
20 21.5
21 15.5
22 15.2
23 13.3
24 19.2
25 27.3
26 26.0
27 30.4
28 15.8
29 19.7
30 15.0
31 21.4
Name: mpg, dtype: float64
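To select several columns at once, pass a list of column names inside the brackets. A small sketch on a hand-built table (the three rows are illustrative, taken from the mtcars values shown earlier):

```python
import pandas as pd

cars = pd.DataFrame({
    'model': ['Mazda RX4', 'Datsun 710', 'Hornet Sportabout'],
    'mpg': [21.0, 22.8, 18.7],
    'hp': [110, 93, 175],
})

subset = cars[['model', 'mpg']]  # a list of names returns a DataFrame
single = cars['mpg']             # a single name returns a Series
print(type(subset).__name__, type(single).__name__)
print(subset)
```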
Filtering Rows
Filter rows where the mpg is greater than 20.
Python
high_mpg = df[df['mpg'] > 20]
print(high_mpg)
Output:
model mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
7 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
8 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
17 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
18 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
19 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
20 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
25 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
27 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
31 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
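Several conditions can be combined with & (and) and | (or). Each condition needs its own parentheses because these operators bind more tightly than the comparisons. A minimal sketch on a hand-built four-row table (values taken from the output above):

```python
import pandas as pd

cars = pd.DataFrame({
    'model': ['Mazda RX4', 'Datsun 710', 'Hornet Sportabout', 'Merc 240D'],
    'mpg': [21.0, 22.8, 18.7, 24.4],
    'cyl': [6, 4, 8, 4],
})

# Keep rows where mpg is above 20 AND the car has 4 cylinders.
efficient_small = cars[(cars['mpg'] > 20) & (cars['cyl'] == 4)]
print(efficient_small)
```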
Adding New Columns
Add a new column named performance, which is the ratio of hp (horsepower) to wt (weight).
Python
df['performance'] = df['hp'] / df['wt']
print(df.head())
Output:
model mpg cyl disp hp drat wt qsec vs am gear carb performance
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 41.984733
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 38.260870
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 40.086207
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 34.214619
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 50.872093
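If you would rather not modify the original DataFrame, assign() is one alternative: it returns a new DataFrame with the extra column. The sketch below assumes a two-row table with the hp and wt values shown above.

```python
import pandas as pd

cars = pd.DataFrame({
    'model': ['Mazda RX4', 'Datsun 710'],
    'hp': [110, 93],
    'wt': [2.620, 2.320],
})

# assign() returns a new DataFrame; `cars` itself is left unchanged.
with_ratio = cars.assign(performance=lambda d: d['hp'] / d['wt'])
print(with_ratio)
print('performance' in cars.columns)
```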
Removing Columns
Remove the performance column that we just added.
Python
df = df.drop('performance', axis=1)
print(df.head())
Output:
model mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
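drop() can also take a columns= argument, which avoids having to remember the axis number and lets you remove several columns in one call. A short sketch on an illustrative two-row table:

```python
import pandas as pd

cars = pd.DataFrame({
    'model': ['Mazda RX4', 'Datsun 710'],
    'mpg': [21.0, 22.8],
    'vs': [0, 1],
    'am': [1, 1],
})

# Remove two columns at once; the result is a new DataFrame.
trimmed = cars.drop(columns=['vs', 'am'])

# errors='ignore' skips names that are not present instead of raising an error.
same = trimmed.drop(columns=['not_a_column'], errors='ignore')
print(trimmed.columns.tolist())
```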
Handling Missing Data
For demonstration, let's assume we have some missing data. We will first introduce some missing values, and then handle them.
Python
import pandas as pd
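# Load the same mtcars dataset used earlier
df = pd.read_csv('https://media.geeksforgeeks.org/wp-content/uploads/20240819155616/mtcars.csv')
# Cast 'hp' to float so the column can hold missing values (NaN)
df['hp'] = df['hp'].astype(float)
# Introduce some missing values in the 'hp' column
df.loc[5:10, 'hp'] = None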
# Fill missing values in 'hp' with the mean value
df['hp'] = df['hp'].fillna(df['hp'].mean())
# Drop rows with any missing values (though after filling, there shouldn't be any)
df = df.dropna()
print(df.head())
Output:
model mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110.0 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110.0 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93.0 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110.0 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175.0 3.15 3.440 17.02 0 0 3 2
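Before filling or dropping anything, it usually helps to count how many values are missing. The sketch below uses a made-up four-row table (the model names 'A' to 'D' are placeholders) to show isna().sum(), filling with the median as an alternative to the mean, and dropping rows based on one column only.

```python
import numpy as np
import pandas as pd

# Placeholder data with two missing horsepower values.
cars = pd.DataFrame({
    'model': ['A', 'B', 'C', 'D'],
    'hp': [110.0, np.nan, 93.0, np.nan],
})

print(cars['hp'].isna().sum())                   # number of missing values in 'hp'

filled = cars['hp'].fillna(cars['hp'].median())  # median ignores NaN by default
dropped = cars.dropna(subset=['hp'])             # drop rows only where 'hp' is missing
print(filled.tolist())
print(len(dropped))
```

Whether to fill or drop depends on the analysis; filling keeps every row but adds estimated values, while dropping keeps only observed data.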
Grouping and Aggregating Data
Group the data by the number of cylinders (cyl) and calculate the average miles per gallon (mpg) for each group.
Python
import pandas as pd
df = pd.read_csv('https://media.geeksforgeeks.org/wp-content/uploads/20240819155616/mtcars.csv')
grouped_mpg = df.groupby('cyl')['mpg'].mean()
print(grouped_mpg)
Output:
cyl
4 26.663636
6 19.742857
8 15.100000
Name: mpg, dtype: float64
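A group can be summarized with more than one statistic at a time. The sketch below uses named aggregation on a hand-built five-row table (values taken from the mtcars rows shown earlier); the output column names mean_mpg, max_hp, and n_cars are arbitrary choices.

```python
import pandas as pd

cars = pd.DataFrame({
    'cyl': [6, 6, 4, 4, 8],
    'mpg': [21.0, 21.0, 22.8, 24.4, 18.7],
    'hp': [110, 110, 93, 62, 175],
})

# Each keyword is an output column: (input column, aggregation function).
summary = cars.groupby('cyl').agg(
    mean_mpg=('mpg', 'mean'),
    max_hp=('hp', 'max'),
    n_cars=('mpg', 'count'),
)
print(summary)
```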
Merging and Joining DataFrames
Let's create two small DataFrames and merge them.
Python
import pandas as pd
df = pd.read_csv('https://media.geeksforgeeks.org/wp-content/uploads/20240819155616/mtcars.csv')
df1 = pd.DataFrame({'car': ['Mazda RX4', 'Datsun 710'], 'hp': [110, 93]})
df2 = pd.DataFrame({'car': ['Mazda RX4', 'Datsun 710'], 'mpg': [21.0, 22.8]})
merged_df = pd.merge(df1, df2, on='car')
print(merged_df)
Output:
car hp mpg
0 Mazda RX4 110 21.0
1 Datsun 710 93 22.8
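By default, pd.merge() performs an inner join, so rows without a match in both DataFrames are dropped. The sketch below adds a third car to the left table (an illustrative row with no match on the right) and compares the default with how='left'; indicator=True adds a _merge column showing where each row came from.

```python
import pandas as pd

left = pd.DataFrame({'car': ['Mazda RX4', 'Datsun 710', 'Fiat 128'],
                     'hp': [110, 93, 66]})
right = pd.DataFrame({'car': ['Mazda RX4', 'Datsun 710'],
                      'mpg': [21.0, 22.8]})

inner = pd.merge(left, right, on='car')  # default how='inner'
left_join = pd.merge(left, right, on='car', how='left', indicator=True)

print(inner)
print(left_join)  # 'Fiat 128' is kept, with NaN in 'mpg'
```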
Saving Data
Finally, save the modified mtcars dataset to a CSV file.
Python
df.to_csv('mtcars_modified.csv', index=False)
Output:
No output is printed. The file mtcars_modified.csv is written to the current working directory.
Best Practices
- Begin by practicing with small datasets to understand the basics.
- Use descriptive variable names and comments to make your code easy to understand.
- Pandas is optimized for vectorized operations, so avoid using loops when possible.
- Pandas is continuously evolving, so keep up with the latest updates and best practices.
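To illustrate the tip about vectorized operations, the sketch below computes the same horsepower-to-weight ratio twice on a small hand-built table: once with an explicit loop and once with a single column-wise expression. Both give the same numbers; the vectorized form is shorter and generally much faster on large tables.

```python
import pandas as pd

cars = pd.DataFrame({
    'hp': [110, 93, 175],
    'wt': [2.620, 2.320, 3.440],
})

# Loop version: visits one row at a time.
ratios_loop = []
for _, row in cars.iterrows():
    ratios_loop.append(row['hp'] / row['wt'])

# Vectorized version: one expression over whole columns.
ratios_vec = cars['hp'] / cars['wt']

print(ratios_loop)
print(ratios_vec.tolist())
```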
Conclusion
Pandas is an indispensable tool for anyone working with data in Python. By learning the basics and gradually exploring advanced features, you can efficiently manipulate, analyze, and visualize data. Start practicing today, and soon you'll be proficient in using pandas for all your data-related tasks.