Pandas_Notes
Interviews
Introduction to Pandas
Pandas is an open-source Python library providing high-performance, easy-to-use data
structures and data analysis tools. It is a fundamental library for data manipulation and
preprocessing in machine learning.
Key Features:
• Data Structures: Offers Series and DataFrame for handling labeled and tabular
data.
• Data Manipulation: Provides tools for reshaping, merging, sorting, and filtering
data.
• Data Cleaning: Supports handling missing values, removing duplicates, and
applying transformations.
import pandas as pd
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
2. DataFrame
A two-dimensional labeled data structure, similar to a spreadsheet or SQL table.
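A minimal sketch of building a DataFrame from a dictionary of columns. The column names and values here are made up for illustration and are not from the original notes.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column.
df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [30, 25]})

shape = df.shape          # (number of rows, number of columns)
column_types = df.dtypes  # data type of each column
```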
df = pd.read_csv('file.csv')
df.to_csv('output.csv', index=False)
• Excel Files:
df = pd.read_excel('file.xlsx')
df.to_excel('output.xlsx', index=False)
• JSON Files:
df = pd.read_json('file.json')
df.to_json('output.json')
df['column_name']
df[['col1', 'col2']]
• Accessing Rows:
df.loc[0] # By label
df.iloc[0] # By position
• Slicing:
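The slicing bullet can be illustrated with a short sketch. The DataFrame below is hypothetical. Note that position-based slices exclude the end, while label-based slices with loc include it.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1, 2, 3, 4]})

first_two = df[0:2]           # rows at positions 0 and 1
by_position = df.iloc[1:3]    # positions 1 and 2 (end excluded)
by_label = df.loc[1:3, "a"]   # labels 1 through 3 (end included)
```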
3. Data Cleaning
• Handling Missing Values:
• Renaming Columns:
• Removing Duplicates:
df.drop_duplicates(inplace=True)
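For the missing-value and renaming bullets above, one possible sketch is shown here. The column names and fill values are assumptions chosen for illustration; the right fill strategy depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": ["x", None, "z"]})

missing_per_column = df.isna().sum()             # count missing values per column
dropped = df.dropna()                            # drop rows with any missing value
filled = df.fillna({"A": 0.0, "B": "unknown"})   # fill with per-column defaults

# Rename columns with a mapping of old name to new name.
renamed = df.rename(columns={"A": "amount", "B": "label"})
```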
4. Data Transformation
• Apply Functions:
df['col'] = df['col'].apply(lambda x: x * 2)
• Mapping Values:
• Replacing Values:
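Mapping and replacing can be sketched as follows, using a hypothetical size column. The key difference is that map turns unmapped values into NaN, while replace leaves unlisted values unchanged.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"size": ["S", "M", "L", "M"]})

# map: every value is looked up; values missing from the dict become NaN.
df["size_code"] = df["size"].map({"S": 1, "M": 2, "L": 3})

# replace: only the listed values change; "M" is kept as-is.
df["size_full"] = df["size"].replace({"S": "small", "L": "large"})
```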
5. Merging and Joining Data
• Concatenation:
• Merging:
• Joining:
df1.join(df2, how='left')
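A minimal sketch of concatenation and merging, assuming two small hypothetical frames that share a key column. Unlike join, which aligns on the index by default, merge aligns on the named column.

```python
import pandas as pd

# Hypothetical frames sharing a "key" column.
df1 = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
df2 = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Concatenation: stack rows and rebuild the index.
stacked = pd.concat([df1, df1], ignore_index=True)

# Merging: SQL-style joins on a column.
inner = pd.merge(df1, df2, on="key", how="inner")  # keys present in both
left = pd.merge(df1, df2, on="key", how="left")    # all keys from df1
```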
grouped = df.groupby('column_name')
grouped['col'].mean()
• Aggregations:
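Aggregations on grouped data might look like the sketch below. The team and score columns are hypothetical; agg accepts either a list of function names or named aggregations.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"team": ["a", "a", "b", "b"], "score": [1, 3, 10, 20]})

# Several aggregations on one column.
summary = df.groupby("team")["score"].agg(["mean", "sum", "max"])

# Named aggregations: output column = (input column, function).
named = df.groupby("team").agg(total=("score", "sum"), best=("score", "max"))
```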
df['date'] = pd.to_datetime(df['date'])
• Setting Index:
df.set_index('date', inplace=True)
• Resampling:
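Resampling requires a datetime index. The sketch below assumes six days of made-up daily values and sums them into two-day bins; the frequency and aggregation are illustrative choices.

```python
import pandas as pd

# Hypothetical daily data for illustration only.
dates = pd.date_range("2024-01-01", periods=6, freq="D")
df = pd.DataFrame({"date": dates, "value": [1, 2, 3, 4, 5, 6]})
df = df.set_index("date")

# Group consecutive two-day windows and sum the values in each.
two_day_sum = df["value"].resample("2D").sum()
```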
2. Handling Categorical Data
• Converting to Categorical:
df['category'] = df['category'].astype('category')
• Creating Dummies:
pd.get_dummies(df['category'])
3. Pivot Tables
• Creating Pivot Tables:
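A pivot table can be sketched with pd.pivot_table. The region, product, and sales columns are hypothetical, and summing is just one possible aggregation.

```python
import pandas as pd

# Hypothetical sales data for illustration only.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "product": ["x", "y", "x", "y"],
    "sales": [10, 20, 30, 40],
})

# Rows are regions, columns are products, cells hold summed sales.
table = pd.pivot_table(
    df, values="sales", index="region", columns="product", aggfunc="sum"
)
```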
df.fillna(df.mean(numeric_only=True), inplace=True) # Fill numeric columns with their means
df = pd.get_dummies(df, columns=['category'], drop_first=True) # One-hot encode 'category'
2. Feature Engineering
• Creating new features using existing columns.
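As a rough illustration of deriving features from existing columns, the sketch below uses made-up price, quantity, and date columns. The bulk-order threshold is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical order data for illustration only.
df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "quantity": [1, 2, 3],
    "date": pd.to_datetime(["2024-01-15", "2024-02-20", "2024-03-25"]),
})

df["revenue"] = df["price"] * df["quantity"]  # arithmetic on two columns
df["month"] = df["date"].dt.month             # component of a datetime column
df["is_bulk"] = df["quantity"] >= 2           # boolean flag (threshold is arbitrary)
```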
df.describe()
df.corr(numeric_only=True)
4. Integration with Other Libraries
• Scikit-learn: Used for feature extraction and model training.
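One way the scikit-learn integration might look is sketched below, assuming scikit-learn is installed. The data is made up so that y = 2x + 1, and LinearRegression is only an example estimator, not one named in these notes.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data following y = 2x + 1.
df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 3.0, 5.0, 7.0]})

X = df[["x"]]  # features as a DataFrame (2-D)
y = df["y"]    # target as a Series (1-D)

model = LinearRegression().fit(X, y)
predictions = model.predict(pd.DataFrame({"x": [4.0]}))
```

Passing a DataFrame with the same column names at prediction time keeps the feature names consistent with those seen during fitting.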
7. Explain how you would preprocess categorical data for machine learning.
Summary
Pandas is a powerful library essential for data manipulation and preprocessing in
machine learning workflows. Its wide range of functionalities, from data cleaning to feature
engineering, makes it an indispensable tool in any data scientist’s toolkit.