
Pandas_Notes

Pandas is an open-source Python library crucial for data manipulation and preprocessing in machine learning, offering features like Series and DataFrame for data handling, and tools for data cleaning and transformation. It supports various data operations including reading/writing files, indexing, merging, and grouping, while also integrating seamlessly with other libraries like NumPy and Matplotlib. The library is essential for tasks such as data preprocessing, feature engineering, and exploratory data analysis.

Detailed Notes on Pandas for Machine Learning Interviews

Introduction to Pandas
Pandas is an open-source Python library providing high-performance, easy-to-use data
structures and data analysis tools. It is a fundamental library for data manipulation and
preprocessing in machine learning.

Key Features:
• Data Structures: Offers Series and DataFrame for handling labeled and tabular
data.

• Data Manipulation: Provides tools for reshaping, merging, sorting, and filtering
data.

• Data Cleaning: Supports handling missing values, duplicates, and applying
transformations.

• Integration: Works seamlessly with NumPy, Matplotlib, and other ML libraries.

Core Data Structures


1. Series
A one-dimensional labeled array capable of holding any data type.

import pandas as pd
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

Key Attributes and Methods:

• s.index: Returns the index of the Series.

• s.values: Returns the values of the Series.

• s.head(n): Returns the first n elements.

• s.tail(n): Returns the last n elements.
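
The attributes listed above can be tried on the same Series. This is a minimal sketch; the variable names `labels`, `values`, `first_two`, and `last_two` are illustrative and not part of the original notes.

```python
import pandas as pd

# Build a small labeled Series (same data as the example above).
s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

labels = list(s.index)     # index labels as a plain list
values = s.values          # underlying values as a NumPy array
first_two = s.head(2)      # first two elements, labels kept
last_two = s.tail(2)       # last two elements, labels kept

print(labels)
print(first_two.tolist(), last_two.tolist())
print(s["c"])              # label-based lookup of a single element
```

Note that `head` and `tail` return Series, so the labels travel with the values.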

2. DataFrame
A two-dimensional labeled data structure, similar to a spreadsheet or SQL table.

data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)

Key Attributes and Methods:

• df.shape: Returns the dimensions of the DataFrame.

• df.columns: Lists column labels.

• df.dtypes: Displays data types of each column.

• df.info(): Provides a summary of the DataFrame.

• df.describe(): Generates descriptive statistics for numerical columns.
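
As a rough illustration of these attributes on the two-row DataFrame above (the variable names here are assumptions made for the sketch):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

n_rows, n_cols = df.shape          # (rows, columns)
column_names = list(df.columns)    # column labels as a plain list
age_dtype = df["Age"].dtype        # an integer dtype for this data
summary = df.describe()            # by default covers numeric columns only

print(n_rows, n_cols, column_names)
print(summary.loc["mean", "Age"])  # mean of 25 and 30
```

Because `describe()` skips non-numeric columns by default, `Name` does not appear in `summary`.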

Essential Pandas Operations


1. Reading and Writing Data
• CSV Files:

df = pd.read_csv('file.csv')
df.to_csv('output.csv', index=False)

• Excel Files:

df = pd.read_excel('file.xlsx')
df.to_excel('output.xlsx', index=False)

• JSON Files:

df = pd.read_json('file.json')
df.to_json('output.json')
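
To try the read/write calls without creating files, a round trip through in-memory text buffers may help. This sketch assumes `io.StringIO` as a stand-in for a file path; the `orient="records"` choice for JSON is also an assumption, not something the notes specify.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# With no path, to_csv returns the CSV text; index=False omits the row index.
csv_text = df.to_csv(index=False)
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# Same round trip through JSON, written as one object per row.
json_text = df.to_json(orient="records")
df_from_json = pd.read_json(io.StringIO(json_text))

print(csv_text)
print(df_from_csv)
print(df_from_json)
```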

2. Indexing and Selecting Data


• Accessing Columns:

df['column_name']
df[['col1', 'col2']]

• Accessing Rows:

df.loc[0] # By label
df.iloc[0] # By position

• Slicing:

df.loc[1:3, ['col1', 'col2']] # By label; the end label 3 is included
df.iloc[1:3, 0:2] # By position; the end position 3 is excluded
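
The difference between `loc` and `iloc` slicing is easy to miss, so here is a small sketch on made-up data. It also shows boolean filtering, which the notes do not cover explicitly but which uses the same bracket syntax.

```python
import pandas as pd

# Default integer labels 0..4, so labels and positions happen to coincide.
df = pd.DataFrame({"col1": [10, 20, 30, 40, 50], "col2": [1, 2, 3, 4, 5]})

by_label = df.loc[1:3, ["col1", "col2"]]   # labels 1, 2, 3 (end included)
by_position = df.iloc[1:3, 0:2]            # positions 1, 2 (end excluded)

# Boolean mask: keep rows where col1 exceeds 25.
filtered = df[df["col1"] > 25]

print(len(by_label), len(by_position))     # 3 2
print(filtered["col1"].tolist())           # [30, 40, 50]
```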

3. Data Cleaning
• Handling Missing Values:

df.isnull().sum() # Count missing values
df.fillna(value) # Fill missing values
df.dropna() # Remove rows with missing values

• Renaming Columns:

df.rename(columns={'old_name': 'new_name'}, inplace=True)

• Removing Duplicates:

df.drop_duplicates(inplace=True)
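
The cleaning steps above can be chained. The sketch below uses invented data and avoids `inplace=True`, so each call returns a new DataFrame and the original stays unchanged; that is a style choice for the example, not a requirement.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"old_name": [1.0, np.nan, 3.0, 3.0], "city": ["A", "B", "C", "C"]}
)

missing_per_column = df.isnull().sum()    # count of NaN per column

filled = df.fillna({"old_name": 0.0})     # fill only the chosen column

cleaned = (
    df.dropna()                                   # drops the row with NaN
      .rename(columns={"old_name": "new_name"})   # rename one column
      .drop_duplicates()                          # drops the repeated (3.0, "C") row
)

print(missing_per_column)
print(cleaned)
```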

4. Data Transformation
• Apply Functions:

df['col'] = df['col'].apply(lambda x: x * 2)

• Mapping Values:

df['col'] = df['col'].map({'A': 1, 'B': 2})

• Replacing Values:

df.replace({'old_val': 'new_val'}, inplace=True)
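
One behavior worth seeing in practice is that `map` with a dictionary turns unmapped values into NaN, while `replace` leaves them alone. A minimal sketch with hypothetical column names `num` and `grade`:

```python
import pandas as pd

df = pd.DataFrame({"num": [1, 2, 3], "grade": ["A", "B", "C"]})

doubled = df["num"].apply(lambda x: x * 2)    # element-wise Python function
vectorized = df["num"] * 2                    # same result without a Python-level loop

mapped = df["grade"].map({"A": 1, "B": 2})    # "C" has no entry, so it becomes NaN
replaced = df["grade"].replace({"C": "B"})    # only "C" changes; others are kept

print(doubled.tolist(), vectorized.tolist())
print(mapped.tolist())
print(replaced.tolist())
```

For simple arithmetic, the vectorized form is usually preferable to `apply`.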

5. Merging and Joining Data
• Concatenation:

pd.concat([df1, df2], axis=0)

• Merging:

pd.merge(df1, df2, on='key', how='inner')

• Joining:

df1.join(df2, how='left') # Aligns on the index by default
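
The three combining operations behave differently enough that a side-by-side sketch may be useful. The frames and column names below are made up; the merge uses `left_on`/`right_on` to show the case where the key columns have different names.

```python
import pandas as pd

df1 = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Stack rows; ignore_index=True renumbers the result 0..n-1.
stacked = pd.concat([df1, df1], axis=0, ignore_index=True)

# Key columns have different names, so use left_on / right_on.
inner = pd.merge(df1, df2, left_on="key", right_on="id", how="inner")

# join aligns on the index, so move the keys into the index first.
joined = df1.set_index("key").join(df2.set_index("id"), how="left")

print(len(stacked))
print(inner)
print(joined)
```

With `how="left"`, every row of the left frame is kept and unmatched rows get NaN in the right-hand columns.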

6. Grouping and Aggregation


• Group By:

grouped = df.groupby('column_name')
grouped['col'].mean()

• Aggregations:

df.agg({'col1': 'mean', 'col2': 'sum'})
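
A short sketch of the difference between aggregating per group and aggregating a whole DataFrame, using invented `team`, `score`, and `games` columns:

```python
import pandas as pd

df = pd.DataFrame(
    {"team": ["x", "x", "y", "y"], "score": [1, 3, 10, 20], "games": [1, 1, 2, 2]}
)

# Mean of one column within each group.
mean_by_team = df.groupby("team")["score"].mean()

# A different aggregation per column, still computed per group.
per_team = df.groupby("team").agg({"score": "mean", "games": "sum"})

# Without groupby, agg reduces each whole column to a single value.
overall = df.agg({"score": "mean", "games": "sum"})

print(mean_by_team)
print(per_team)
print(overall)
```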

Advanced Topics in Pandas


1. Working with Time Series
• Converting to Datetime:

df['date'] = pd.to_datetime(df['date'])

• Setting Index:

df.set_index('date', inplace=True)

• Resampling:

df.resample('M').mean() # Monthly average (pandas 2.2+ prefers the alias 'ME')
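
The three time-series steps can be run end to end on a few made-up rows. This sketch uses the `"MS"` (month start) alias rather than the month-end alias shown above, as an assumption made so that the same code runs unchanged across pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": ["2024-01-10", "2024-01-20", "2024-02-05"],
        "value": [10.0, 20.0, 30.0],
    }
)

df["date"] = pd.to_datetime(df["date"])   # strings -> datetime64 values
df = df.set_index("date")                 # resample needs a datetime-like index

# Group rows by calendar month; each bin is labeled with the month's first day.
monthly = df.resample("MS").mean()

print(monthly)
```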

2. Handling Categorical Data
• Converting to Categorical:

df['category'] = df['category'].astype('category')

• Creating Dummies:

pd.get_dummies(df['category'])
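
To see what the categorical dtype and the dummy columns look like, here is a small sketch on invented color data. Passing `dtype=int` is an optional choice made here to get 0/1 columns instead of booleans.

```python
import pandas as pd

df = pd.DataFrame({"category": ["red", "green", "red", "blue"]})
df["category"] = df["category"].astype("category")

categories = list(df["category"].cat.categories)   # distinct levels, sorted
codes = df["category"].cat.codes.tolist()          # integer code for each row

# One indicator column per category level.
dummies = pd.get_dummies(df["category"], dtype=int)

print(categories)
print(codes)
print(dummies)
```

Each row of `dummies` has exactly one 1, in the column matching that row's category.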

3. Pivot Tables
• Creating Pivot Tables:

df.pivot_table(values='value_col', index='row_col', columns='col_col', aggfun
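
A complete pivot-table call might look like the sketch below. The data is invented, and `aggfunc="mean"` is an assumed choice for illustration; other aggregations such as `"sum"` or `"count"` are passed the same way.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "row_col": ["r1", "r1", "r2", "r2"],
        "col_col": ["c1", "c2", "c1", "c1"],
        "value_col": [1, 2, 3, 5],
    }
)

# aggfunc="mean" is an assumption made for this sketch.
table = df.pivot_table(
    values="value_col", index="row_col", columns="col_col", aggfunc="mean"
)

print(table)
```

Cells with no matching rows (here `r2`/`c2`) come out as NaN, and cells with several rows (here `r2`/`c1`) hold the aggregated value.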

Applications in Machine Learning


1. Data Preprocessing
• Handling missing values, normalization, and encoding.

df.fillna(df.mean(numeric_only=True), inplace=True) # Impute numeric columns with their means
df = pd.get_dummies(df, columns=['category'], drop_first=True) # One-hot encode 'category'
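
The three preprocessing steps named above (missing values, normalization, encoding) can be sketched together. The data and the `age_scaled` column are hypothetical, and min-max scaling is just one possible normalization.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [20.0, np.nan, 40.0], "category": ["a", "b", "c"]})

# Impute the numeric gap with the column mean: (20 + 40) / 2 = 30.
df["age"] = df["age"].fillna(df["age"].mean())

# Min-max normalization to the range [0, 1].
age_min, age_max = df["age"].min(), df["age"].max()
df["age_scaled"] = (df["age"] - age_min) / (age_max - age_min)

# One-hot encode; drop_first=True drops the first level ("a") as the baseline.
encoded = pd.get_dummies(df, columns=["category"], drop_first=True, dtype=int)

print(encoded)
```

Passing the whole DataFrame with `columns=[...]` returns a new frame with the dummy columns appended, which avoids assigning a multi-column result to a single column.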

2. Feature Engineering
• Creating new features using existing columns.

df['new_feature'] = df['col1'] / df['col2']
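
Ratio features like the one above produce `inf` when the denominator is zero. A minimal sketch of one way to handle that, assuming it is acceptable to treat such rows as missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [10.0, 8.0, 6.0], "col2": [2.0, 0.0, 3.0]})

ratio = df["col1"] / df["col2"]    # 8.0 / 0.0 yields inf rather than an error

# Treat infinite ratios as missing so later steps can impute or drop them.
df["new_feature"] = ratio.replace([np.inf, -np.inf], np.nan)

print(df)
```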

3. Exploratory Data Analysis (EDA)


• Summarizing data using descriptive statistics.

df.describe()
df.corr(numeric_only=True) # Pairwise correlation of numeric columns
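
As a quick check of what the correlation matrix contains, the sketch below uses invented columns where the relationships are known in advance: `y` rises with `x` and `z` falls with it.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4],
        "y": [2, 4, 6, 8],       # perfectly increasing with x
        "z": [4, 3, 2, 1],       # perfectly decreasing with x
        "label": list("abcd"),   # non-numeric, skipped by numeric_only=True
    }
)

corr = df.corr(numeric_only=True)   # Pearson correlation by default

print(corr)
```

The result is a square DataFrame indexed by column name, so a single pair can be read with `corr.loc["x", "y"]`.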

4. Integration with Other Libraries
• Scikit-learn: Used for feature extraction and model training.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df) # Returns a NumPy array by default; df must hold only numeric columns

• Matplotlib and Seaborn: Used for visualization.

import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()

Practice Questions for Interviews


1. How would you handle missing data in Pandas?

2. Explain the difference between loc and iloc.

3. How do you perform one-hot encoding in Pandas?

4. What is the use of pivot tables in Pandas?

5. Demonstrate how to merge two DataFrames with different keys.

6. How do you group data and calculate the mean in Pandas?

7. Explain how you would preprocess categorical data for machine learning.

8. Write code to calculate the correlation between numerical columns in a DataFrame.

9. How do you filter rows based on a condition in Pandas?

10. Describe how Pandas can be used for feature engineering.

Summary
Pandas is a powerful library essential for data manipulation and preprocessing in
machine learning workflows. Its wide range of functionalities, from data cleaning to feature
engineering, makes it an indispensable tool in any data scientist’s toolkit.
