NumPy Basics for AI/ML Beginners

The document provides teaching notes for a 2-hour beginner session on NumPy, focusing on its significance in AI/ML and covering both basic and advanced array operations. Students will learn to create and manipulate NumPy arrays, perform mathematical operations, and understand concepts like broadcasting and reshaping. The session includes practical exercises and real-world applications to reinforce learning.


NumPy Teaching Notes - Gen AI Course

Duration: 2 Hours | Level: Beginner

Session Overview
This session introduces students to NumPy, the fundamental package for scientific computing in Python and a cornerstone of AI/ML development.

Learning Objectives

By the end of this session, students will be able to:

Understand what NumPy is and its importance in AI/ML
Create and manipulate NumPy arrays
Perform basic and advanced array operations
Apply NumPy functions commonly used in AI/ML workflows

Part 1: NumPy Basics (45 minutes)


What is NumPy? (10 minutes)
NumPy (Numerical Python) is a fundamental library for scientific computing in Python.

Key Points to Emphasize:

Foundation of AI/ML Stack: NumPy is the building block for libraries like Pandas, Scikit-learn, TensorFlow, and PyTorch
Performance: Operations are 10-100x faster than pure Python due to C implementation
Memory Efficiency: Uses less memory than Python lists
Vectorization: Allows mathematical operations on entire arrays without writing loops

Real-world AI/ML Examples:

Image processing: Images are stored as NumPy arrays (height × width × channels)
Neural networks: Weights and activations are NumPy arrays
Data preprocessing: Feature scaling, normalization
Mathematical computations: Matrix multiplication, statistical operations

NumPy Installation (5 minutes)


# Standard installation
pip install numpy

# With Anaconda/Miniconda
conda install numpy

# Import convention
import numpy as np

Teaching Tip: Explain why we use np alias - it's the standard convention in the community.

Arrays: The Heart of NumPy (15 minutes)

What are NumPy Arrays?

NumPy arrays (ndarray) are homogeneous collections of elements with a fixed size.

Key Characteristics:

Homogeneous: All elements must be the same data type
Fixed size: Size is determined at creation
N-dimensional: Can have 1, 2, 3, or more dimensions

Creating Arrays

1D Arrays (Vectors)
import numpy as np

# From Python list
arr_1d = np.array([1, 2, 3, 4, 5])
print(f"1D Array: {arr_1d}")
print(f"Shape: {arr_1d.shape}")       # (5,)
print(f"Dimensions: {arr_1d.ndim}")   # 1

2D Arrays (Matrices)
# From nested Python list
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
print(f"2D Array:\n{arr_2d}")
print(f"Shape: {arr_2d.shape}") # (2, 3) - 2 rows, 3 columns
print(f"Dimensions: {arr_2d.ndim}") # 2

Higher Dimensional Arrays


# 3D array (common in image processing)
arr_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(f"3D Array:\n{arr_3d}")
print(f"Shape: {arr_3d.shape}") # (2, 2, 2)
print(f"Dimensions: {arr_3d.ndim}") # 3

Python List vs NumPy Array (10 minutes)


Create a comparison table on the board:

| Feature | Python List | NumPy Array |
|---------|-------------|-------------|
| Data types | Mixed (heterogeneous) | Same type (homogeneous) |
| Memory | More memory overhead | Memory efficient |
| Speed | Slower for numerical operations | Much faster |
| Functionality | Basic operations | Rich mathematical functions |
| Dimensions | 1D only (nested for 2D+) | True multidimensional |
| Broadcasting | Not supported | Supported |

Demonstration Code:
# Memory comparison
import sys

python_list = [1, 2, 3, 4, 5] * 1000
numpy_array = np.array(python_list)

# Note: sys.getsizeof reports the list object itself, not its element objects
print(f"Python list memory: {sys.getsizeof(python_list)} bytes")
print(f"NumPy array memory: {numpy_array.nbytes} bytes")

# Speed comparison (simple demo)
import time

# Python list operation
start = time.time()
result_list = [x * 2 for x in python_list]
list_time = time.time() - start

# NumPy array operation
start = time.time()
result_numpy = numpy_array * 2
numpy_time = time.time() - start

print(f"Python list time: {list_time:.6f} seconds")
print(f"NumPy array time: {numpy_time:.6f} seconds")
print(f"NumPy is {list_time/numpy_time:.1f}x faster!")

Code Examples - Hands-on Practice (5 minutes)

Example 1: Create a 2D array and print its shape


# Create a 2D array representing student grades
grades = np.array([[85, 92, 78],
                   [90, 88, 95],
                   [76, 84, 89]])

print("Student Grades Array:")
print(grades)
print(f"Shape: {grades.shape}")          # (3, 3) - 3 students, 3 subjects
print(f"Total elements: {grades.size}")  # 9
print(f"Data type: {grades.dtype}")      # int64 (or int32)

Example 2: Convert Python list to NumPy array and find mean


# Python list of temperatures
temperatures = [23.5, 25.1, 22.8, 26.2, 24.7, 21.9, 25.5]

# Convert to NumPy array
temp_array = np.array(temperatures)

print(f"Original list: {temperatures}")


print(f"NumPy array: {temp_array}")
print(f"Mean temperature: {temp_array.mean():.2f}°C")
print(f"Max temperature: {temp_array.max():.2f}°C")
print(f"Min temperature: {temp_array.min():.2f}°C")
print(f"Standard deviation: {temp_array.std():.2f}°C")

Part 2: NumPy Advanced (60 minutes)


Array Indexing and Slicing (15 minutes)
1D Array Indexing
arr = np.array([10, 20, 30, 40, 50])

# Basic indexing (same as Python lists)


print(f"First element: {arr[0]}") # 10
print(f"Last element: {arr[-1]}") # 50
print(f"Third element: {arr[2]}") # 30

# Slicing
print(f"First three: {arr[:3]}") # [10 20 30]
print(f"From index 2: {arr[2:]}") # [30 40 50]
print(f"Every second: {arr[::2]}") # [10 30 50]

2D Array Indexing
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Access single element


print(f"Element at row 1, col 2: {matrix[1, 2]}") # 6

# Access entire row


print(f"Second row: {matrix[1, :]}") # [4 5 6]

# Access entire column


print(f"Third column: {matrix[:, 2]}") # [3 6 9]

# Subarray
print(f"Top-left 2x2:\n{matrix[:2, :2]}")
# [[1 2]
# [4 5]]

Boolean Indexing (Very Important for AI/ML)


data = np.array([1, 5, 3, 8, 2, 7, 6])

# Create boolean mask


mask = data > 4
print(f"Mask: {mask}") # [False True False True False True True]

# Apply mask
filtered = data[mask]
print(f"Values > 4: {filtered}") # [5 8 7 6]

# Direct boolean indexing


print(f"Values <= 3: {data[data <= 3]}") # [1 3 2]
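Boolean masks pair naturally with `np.where`, which many preprocessing pipelines use to map a condition onto labels or to recover matching indices. A minimal sketch (the threshold of 4 is an arbitrary example):

```python
import numpy as np

data = np.array([1, 5, 3, 8, 2, 7, 6])

# np.where(condition, value_if_true, value_if_false) acts as a vectorized if/else
labels = np.where(data > 4, 1, 0)  # 1 where the value exceeds the threshold
print(labels)  # [0 1 0 1 0 1 1]

# With a single argument, np.where returns the indices that satisfy the condition
indices = np.where(data > 4)[0]
print(indices)  # [1 3 5 6]
```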

Mathematical Operations (15 minutes)


Element-wise Operations
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Basic arithmetic
print(f"Addition: {a + b}") # [11 22 33 44]
print(f"Subtraction: {b - a}") # [ 9 18 27 36]
print(f"Multiplication: {a * b}") # [10 40 90 160]
print(f"Division: {b / a}") # [10. 10. 10. 10.]
print(f"Power: {a ** 2}") # [ 1 4 9 16]

# With scalars
print(f"Add 10: {a + 10}") # [11 12 13 14]
print(f"Multiply by 3: {a * 3}") # [3 6 9 12]

Mathematical Functions
angles = np.array([0, np.pi/4, np.pi/2, np.pi])

print(f"Sin: {np.sin(angles)}")
print(f"Cos: {np.cos(angles)}")
print(f"Square root: {np.sqrt([1, 4, 9, 16])}")
print(f"Exponential: {np.exp([1, 2, 3])}")
print(f"Natural log: {np.log([1, np.e, np.e**2])}")

Matrix Operations (Important for AI/ML)


# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Element-wise multiplication
print(f"Element-wise: \n{A * B}")

# Matrix multiplication (dot product)
print(f"Matrix multiplication: \n{np.dot(A, B)}")
# or
print(f"Using @ operator: \n{A @ B}")
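To make the AI/ML connection concrete: the forward pass of a dense neural-network layer is exactly this matrix multiplication plus a broadcast bias. A minimal sketch (the layer sizes and random inputs are illustrative, not from the session):

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.random((4, 3))   # batch of 4 samples, 3 features each
W = rng.random((3, 2))   # weight matrix: 3 inputs -> 2 neurons
b = np.zeros(2)          # bias vector, broadcast across the batch

# Dense layer forward pass: (4, 3) @ (3, 2) -> (4, 2)
activations = X @ W + b
print(activations.shape)  # (4, 2)
```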

Reshape and Flatten (10 minutes)


Reshape: Change array dimensions without changing data
# Original 1D array
original = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
print(f"Original shape: {original.shape}")  # (12,)

# Reshape to 2D
reshaped_2d = original.reshape(3, 4)  # 3 rows, 4 columns
print(f"Reshaped to 3x4:\n{reshaped_2d}")

# Reshape to 3D
reshaped_3d = original.reshape(2, 2, 3)  # 2 blocks, 2 rows, 3 columns
print(f"Reshaped to 2x2x3:\n{reshaped_3d}")

# Auto-calculate dimension with -1
auto_reshape = original.reshape(4, -1)  # 4 rows, auto-calculate columns
print(f"Auto-reshaped to 4x?: \n{auto_reshape}")

Flatten: Convert to 1D
matrix_2d = np.array([[1, 2, 3], [4, 5, 6]])

# Flatten to 1D
flattened = matrix_2d.flatten()
print(f"Flattened: {flattened}") # [1 2 3 4 5 6]

# Alternative: ravel() - returns view if possible


raveled = matrix_2d.ravel()
print(f"Raveled: {raveled}") # [1 2 3 4 5 6]

AI/ML Context: Reshaping is crucial when preparing data for neural networks, converting images to vectors, etc.
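As a concrete illustration of that point, a batch of grayscale images can be flattened into one feature vector per image before feeding a fully connected network. A sketch (the 28x28 size mirrors MNIST; the pixel data here is random):

```python
import numpy as np

# Pretend batch of 10 grayscale images, each 28x28 pixels
images = np.random.rand(10, 28, 28)

# Flatten each image into a 784-element vector; -1 preserves the batch size
flat = images.reshape(10, -1)
print(flat.shape)  # (10, 784)
```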

Broadcasting (10 minutes)


What is Broadcasting? Broadcasting allows NumPy to perform operations on arrays with different shapes without explicitly reshaping them.

Rules of Broadcasting:

1. Start from the trailing dimension
2. Dimensions are compatible if they are equal, or one of them is 1
3. Missing dimensions are assumed to be 1

Examples:
# Scalar with array
arr = np.array([1, 2, 3, 4])
result = arr + 10  # 10 is broadcast to [10, 10, 10, 10]
print(f"Array + scalar: {result}")

# 1D with 2D
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
vector = np.array([10, 20, 30])

result = matrix + vector  # vector is broadcast to each row
print(f"Matrix + vector:\n{result}")
# [[11 22 33]
#  [14 25 36]]

# Column vector with row vector
col_vector = np.array([[1], [2], [3]])  # shape (3, 1)
row_vector = np.array([10, 20])         # shape (2,)

result = col_vector + row_vector  # broadcasts to (3, 2)
print(f"Column + row broadcasting:\n{result}")
# [[11 21]
#  [12 22]
#  [13 23]]
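A broadcasting pattern students will meet constantly in ML work is per-column feature standardization: an (n, d) data matrix minus a (d,) mean vector broadcasts across the rows. A sketch with made-up data:

```python
import numpy as np

# 4 samples x 2 features (hypothetical data on very different scales)
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Column-wise mean and std have shape (2,), so they broadcast over the rows
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # approximately [0. 0.]
print(X_scaled.std(axis=0))   # [1. 1.]
```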

Useful Functions (10 minutes)

Array Creation Functions


# arange(): Like Python's range but returns a NumPy array
print("arange() - evenly spaced values in a range:")
print(f"0 to 10: {np.arange(10)}")             # [0 1 2 3 4 5 6 7 8 9]
print(f"2 to 20 by 3: {np.arange(2, 20, 3)}")  # [ 2  5  8 11 14 17]
print(f"Float steps: {np.arange(0, 1, 0.2)}")  # [0.  0.2 0.4 0.6 0.8]

# linspace(): Evenly spaced numbers over a specified interval
print("\nlinspace() - linearly spaced values:")
print(f"5 points from 0 to 10: {np.linspace(0, 10, 5)}")
# [ 0.   2.5  5.   7.5 10. ]
print(f"Including endpoint: {np.linspace(0, 10, 5, endpoint=True)}")
print(f"Excluding endpoint: {np.linspace(0, 10, 5, endpoint=False)}")

# eye(): Identity matrix (important for linear algebra)
print(f"\neye() - identity matrix:")
print(f"3x3 identity:\n{np.eye(3)}")
print(f"4x4 identity:\n{np.eye(4)}")

# ones(): Array filled with ones
print(f"\nones() - array of ones:")
print(f"1D: {np.ones(5)}")
print(f"2D:\n{np.ones((3, 4))}")  # Note: shape as tuple

# zeros(): Array filled with zeros
print(f"\nzeros() - array of zeros:")
print(f"1D: {np.zeros(4)}")
print(f"2D:\n{np.zeros((2, 3))}")

# Additional useful functions
print(f"\nBonus functions:")
print(f"Random array: {np.random.rand(5)}")
print(f"Array like another: {np.ones_like(np.array([1, 2, 3]))}")

AI/ML Applications of These Functions:

arange(): Creating epoch numbers, batch indices


linspace(): Creating smooth curves for visualization, learning rate schedules
eye(): Identity matrices for linear algebra operations
ones()/zeros(): Initializing weights, creating masks, padding
random(): Data augmentation, weight initialization
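Putting several of those functions together, a toy weight initialization for one dense layer might look like this (the shapes and the 0.01 scale factor are illustrative choices, not a prescribed initialization scheme):

```python
import numpy as np

n_inputs, n_neurons = 4, 3

# Small random weights and zero biases - a common simple starting point
weights = 0.01 * np.random.randn(n_inputs, n_neurons)
biases = np.zeros(n_neurons)

print(weights.shape)  # (4, 3)
print(biases)         # [0. 0. 0.]
```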

Practical Exercise (15 minutes)


Mini-Project: Student Grade Analysis

Give students this exercise to practice all concepts:


# Student grade analysis using NumPy
import numpy as np

# Sample data: 5 students, 4 subjects (Math, Science, English, History)


grades = np.array([
    [85, 92, 78, 88],  # Student 1
    [90, 88, 95, 85],  # Student 2
    [76, 84, 89, 91],  # Student 3
    [88, 91, 82, 87],  # Student 4
    [92, 87, 94, 89]   # Student 5
])

# Tasks for students:
print("1. Array information:")
print(f"Shape: {grades.shape}")
print(f"Total grades: {grades.size}")

print("\n2. Student averages:")
student_averages = grades.mean(axis=1)  # axis=1 for row-wise mean
for i, avg in enumerate(student_averages):
    print(f"Student {i+1}: {avg:.2f}")

print("\n3. Subject averages:")
subject_names = ['Math', 'Science', 'English', 'History']
subject_averages = grades.mean(axis=0)  # axis=0 for column-wise mean
for subject, avg in zip(subject_names, subject_averages):
    print(f"{subject}: {avg:.2f}")

print("\n4. Students with overall average > 87:")
high_performers = grades[student_averages > 87]
print(f"Count: {len(high_performers)}")

print("\n5. Best subject (highest class average):")
best_subject_idx = subject_averages.argmax()
print(f"Best subject: {subject_names[best_subject_idx]} ({subject_averages[best_subject_idx]:.2f})")

Summary and Next Steps (15 minutes)


Key Takeaways
NumPy is the foundation of the Python data science/AI ecosystem
Arrays are more efficient than Python lists for numerical computations
Understanding shapes and dimensions is crucial for AI/ML work
Broadcasting makes operations between different-shaped arrays possible
Array creation functions are essential tools for data preparation

Connection to AI/ML

Data Preprocessing: NumPy arrays are used to clean and transform data
Feature Engineering: Mathematical operations create new features
Model Implementation: Neural networks operate on NumPy arrays
Performance: Vectorized operations are essential for handling large datasets

What's Next?
Pandas: Built on NumPy, handles structured data (DataFrames)
Matplotlib: Uses NumPy arrays for plotting
Scikit-learn: NumPy arrays for machine learning algorithms
TensorFlow/PyTorch: Deep learning frameworks built on NumPy concepts

Common Beginner Mistakes to Avoid


1. Shape mismatches: Always check array shapes before operations
2. Copy vs View: Understand when NumPy creates copies vs views
3. Data types: Be aware of integer vs float operations
4. Broadcasting confusion: Practice with simple examples first
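Mistake 2 is worth a live demo: basic slices share memory with the original array, while `.copy()` does not. A minimal sketch:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

view = arr[:3]   # a slice is a view into the same memory
view[0] = 99
print(arr)       # [99  2  3  4  5] - the original changed too!

arr2 = np.array([1, 2, 3, 4, 5])
copy = arr2[:3].copy()  # an explicit copy is independent
copy[0] = 99
print(arr2)      # [1 2 3 4 5] - untouched
```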

Additional Resources for Students


NumPy Official Documentation
NumPy Quickstart Tutorial
Practice Problems: HackerRank, LeetCode NumPy problems
Cheat Sheet: DataCamp NumPy cheat sheet

Teaching Tips for Instructors


Interactive Elements

Use live coding demonstrations


Encourage students to predict outputs before running code
Use real-world examples (image data, sensor readings, grades)
Show performance comparisons between NumPy and pure Python

Common Student Questions


"Why not just use Python lists?" - Show performance and memory comparisons
"What's the difference between shape and size?" - Draw visual diagrams
"When do I use reshape vs flatten?" - Give practical AI/ML examples
"How does broadcasting work?" - Use step-by-step visual examples

Assessment Ideas

Create arrays with specific shapes


Perform mathematical operations on real datasets
Debug shape mismatch errors
Optimize Python list operations using NumPy

Remember to relate every concept back to practical AI/ML applications to maintain student engagement and show relevance!

Pandas Teaching Notes - Gen AI Course


Duration: 2 Hours | Level: Beginner

Session Overview
This session introduces students to Pandas, the essential data manipulation library for AI/ML workflows. Students will learn how to prepare, clean,
and analyze datasets - crucial skills for training and fine-tuning AI models.

Learning Objectives

By the end of this session, students will be able to:

Understand Pandas' role in the AI/ML pipeline
Create and manipulate DataFrames and Series
Import and export data in various formats (CSV, JSON)
Clean and preprocess datasets for AI model training
Perform data analysis and aggregations
Handle missing data effectively

Part 1: Pandas Basics (55 minutes)


What is Pandas and Its Use Cases? (15 minutes)
What is Pandas? Pandas is Python's premier data analysis and manipulation library, built on top of NumPy. Think of it as "Excel on steroids" for
data scientists and AI engineers.

Key Features:
Data Structures: Series (1D) and DataFrame (2D)
Data Import/Export: CSV, JSON, Excel, SQL databases
Data Cleaning: Handle missing values, duplicates, data types
Data Analysis: Grouping, aggregation, statistical operations
Data Transformation: Merge, join, pivot, reshape

Why Pandas is Critical for Gen AI:

1. Dataset Preparation for LLM Training


# Example: Preparing conversational data for chatbot training
import pandas as pd

# Sample chatbot training data
chat_data = pd.DataFrame({
    'user_input': ['Hello', 'How are you?', 'Tell me a joke', 'Goodbye'],
    'bot_response': ['Hi there!', 'I am fine, thank you!', 'Why did the chicken cross the road?', 'See you later!'],
    'intent': ['greeting', 'small_talk', 'entertainment', 'farewell'],
    'confidence': [0.95, 0.88, 0.92, 0.97]
})

2. Data Preprocessing for Model Fine-tuning

Text cleaning and tokenization preparation


Feature engineering for training data
Dataset splitting (train/validation/test)
Label encoding and data normalization
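The dataset-splitting step above can be done with pandas alone via `sample`. A sketch (the 80/10/10 ratios are an example choice, not a rule):

```python
import pandas as pd

# Hypothetical labeled dataset of 100 rows
df = pd.DataFrame({'text': [f'example {i}' for i in range(100)],
                   'label': [i % 2 for i in range(100)]})

# Shuffle once with a fixed seed, then carve out train/validation/test splits
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
train = shuffled.iloc[:80]
val = shuffled.iloc[80:90]
test = shuffled.iloc[90:]

print(len(train), len(val), len(test))  # 80 10 10
```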

3. Model Performance Analysis

Analyzing model predictions and accuracy


A/B testing results analysis
Error analysis and model debugging

4. Real-world AI Applications:

Content Generation: Preparing training datasets for GPT models


Computer Vision: Organizing image metadata and labels
Recommendation Systems: User behavior and preference analysis
Sentiment Analysis: Social media data preprocessing
Time Series AI: Financial or sensor data preparation

Introduction to Series and DataFrames (15 minutes)


Series: 1D Labeled Array Think of a Series as a single column of data with an index (like a labeled list).
import pandas as pd
import numpy as np

# Creating a Series - like a single feature column


user_ratings = pd.Series([4.5, 3.2, 5.0, 2.8, 4.1],
                         index=['user1', 'user2', 'user3', 'user4', 'user5'],
                         name='movie_rating')

print("Series Example:")
print(user_ratings)
print(f"\nSeries info:")
print(f"Data type: {user_ratings.dtype}")
print(f"Index: {user_ratings.index.tolist()}")
print(f"Values: {user_ratings.values}")

DataFrame: 2D Labeled Data Structure Think of a DataFrame as a spreadsheet or SQL table - rows and columns with labels.
# Creating a DataFrame - like a complete dataset
user_data = pd.DataFrame({
    'user_id': [101, 102, 103, 104, 105],
    'age': [25, 34, 28, 45, 31],
    'subscription': ['premium', 'basic', 'premium', 'basic', 'premium'],
    'avg_rating': [4.5, 3.2, 5.0, 2.8, 4.1],
    'total_reviews': [23, 45, 12, 67, 34]
})

print("DataFrame Example:")
print(user_data)
print(f"\nDataFrame info:")
print(f"Shape: {user_data.shape}")  # (rows, columns)
print(f"Columns: {user_data.columns.tolist()}")
print(f"Index: {user_data.index.tolist()}")

Key Differences:

| Feature | Series | DataFrame |
|---------|--------|-----------|
| Dimensions | 1D | 2D |
| Structure | Single column | Multiple columns |
| Use Case | One feature/variable | Complete dataset |
| Index | Single index | Row index + column names |

Creating DataFrames from Different Sources (15 minutes)

Method 1: From Dictionary


# Method 1: Dictionary (most common for small datasets)
ai_models = {
    'model_name': ['GPT-4', 'Claude-3', 'Gemini', 'LLaMA-2'],
    'parameters': ['1.76T', '175B', '540B', '70B'],
    'release_year': [2023, 2024, 2023, 2023],
    'accuracy_score': [95.3, 94.8, 93.2, 91.7],
    'use_case': ['general', 'reasoning', 'multimodal', 'open_source']
}

ai_df = pd.DataFrame(ai_models)
print("DataFrame from Dictionary:")
print(ai_df)

Method 2: From Lists


# Method 2: From lists (when you have separate lists)
models = ['BERT', 'RoBERTa', 'DistilBERT', 'ALBERT']
f1_scores = [0.88, 0.91, 0.85, 0.89]
training_time = [120, 150, 80, 100] # in minutes

model_performance = pd.DataFrame({
    'model': models,
    'f1_score': f1_scores,
    'training_time_min': training_time
})

print("DataFrame from Lists:")


print(model_performance)

Method 3: From CSV File (Most Common in Real Projects)


# Method 3: From CSV (real-world scenario)
# First, let's create a sample CSV for demonstration
sample_data = pd.DataFrame({
    'text': ['I love this product!', 'Terrible service', 'Great quality', 'Not satisfied'],
    'sentiment': ['positive', 'negative', 'positive', 'negative'],
    'confidence': [0.92, 0.88, 0.85, 0.79]
})

# Save to CSV
sample_data.to_csv('sentiment_data.csv', index=False)

# Read from CSV


sentiment_df = pd.read_csv('sentiment_data.csv')
print("DataFrame from CSV:")
print(sentiment_df)

Teaching Tip: Emphasize that CSV is the most common format in real AI projects.

Viewing and Understanding Data (10 minutes)


Essential Data Exploration Commands
# Create a larger dataset for demonstration
np.random.seed(42)  # For reproducible results

# Note: the loss/accuracy columns are synthetic, drawn from plausible
# random distributions just to have something to explore
training_data = pd.DataFrame({
    'epoch': range(1, 101),
    'train_loss': np.random.exponential(0.5, 100) + 0.1,
    'val_loss': np.random.exponential(0.6, 100) + 0.15,
    'accuracy': np.random.beta(8, 2, 100),
    'learning_rate': [0.001 * (0.95 ** (i//10)) for i in range(100)],
    'batch_size': np.random.choice([16, 32, 64], 100)
})

print("=== DATA EXPLORATION COMMANDS ===")

# 1. head() - First few rows (default 5)


print("1. First 5 rows:")
print(training_data.head())

print("\n2. First 3 rows:")


print(training_data.head(3))

# 3. tail() - Last few rows


print("\n3. Last 5 rows:")
print(training_data.tail())

# 4. info() - Data types and memory usage


print("\n4. Data types and info:")
print(training_data.info())

# 5. describe() - Statistical summary


print("\n5. Statistical summary:")
print(training_data.describe())

# 6. Additional useful methods


print(f"\n6. Additional info:")
print(f"Shape: {training_data.shape}")
print(f"Columns: {training_data.columns.tolist()}")
print(f"Memory usage: {training_data.memory_usage().sum()} bytes")

AI/ML Context: These commands are essential for:

Understanding dataset structure before training


Identifying data quality issues early
Choosing appropriate preprocessing steps
Debugging model performance issues

Part 2: Pandas Advanced (55 minutes)


Indexing, Slicing, and Filtering (15 minutes)

Column Selection
# Sample AI experiment tracking data
experiment_data = pd.DataFrame({
    'experiment_id': ['exp_001', 'exp_002', 'exp_003', 'exp_004', 'exp_005'],
    'model_type': ['CNN', 'RNN', 'Transformer', 'CNN', 'Transformer'],
    'accuracy': [0.85, 0.78, 0.92, 0.88, 0.94],
    'f1_score': [0.83, 0.76, 0.90, 0.86, 0.93],
    'training_time': [120, 200, 300, 150, 350],
    'dataset_size': [10000, 15000, 25000, 12000, 30000]
})

print("Original Dataset:")
print(experiment_data)

# Select single column


print(f"\n1. Single column (accuracy):")
print(experiment_data['accuracy'])

# Select multiple columns


print(f"\n2. Multiple columns:")
print(experiment_data[['model_type', 'accuracy', 'f1_score']])

# Select columns by position (iloc)


print(f"\n3. First 3 columns:")
print(experiment_data.iloc[:, :3])

Row Selection and Filtering


# Row selection by position
print(f"\n4. First 3 rows:")
print(experiment_data.head(3))

print(f"\n5. Specific rows by index:")


print(experiment_data.iloc[1:4]) # Rows 1, 2, 3

# Boolean filtering (VERY IMPORTANT for AI/ML)


print(f"\n6. High-performance models (accuracy > 0.9):")
high_performance = experiment_data[experiment_data['accuracy'] > 0.9]
print(high_performance)

print(f"\n7. Transformer models only:")


transformers = experiment_data[experiment_data['model_type'] == 'Transformer']
print(transformers)

print(f"\n8. Complex filtering (high accuracy AND fast training):")


efficient_models = experiment_data[
(experiment_data['accuracy'] > 0.85) &
(experiment_data['training_time'] < 250)
]
print(efficient_models)

AI/ML Applications:

Filter experiments by performance thresholds


Select specific model types for comparison
Identify best-performing configurations
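Two complementary filtering tools worth mentioning alongside boolean masks are `.isin()` for membership tests and `.query()` for readable compound conditions. A sketch on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    'model_type': ['CNN', 'RNN', 'Transformer', 'CNN'],
    'accuracy': [0.85, 0.78, 0.92, 0.88],
})

# isin(): keep rows whose model_type is in a list of candidates
subset = df[df['model_type'].isin(['CNN', 'Transformer'])]
print(len(subset))  # 3

# query(): express the same kind of filter as a string expression
good = df.query("accuracy > 0.85 and model_type == 'Transformer'")
print(len(good))  # 1
```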

Adding and Deleting Columns (10 minutes)

Adding Columns
print("=== ADDING COLUMNS ===")

# Add calculated columns (common in feature engineering)


experiment_data['efficiency_score'] = experiment_data['accuracy'] / (experiment_data['training_time'] / 100)
print(f"\n1. Added efficiency score:")
print(experiment_data[['experiment_id', 'accuracy', 'training_time', 'efficiency_score']])

# Add categorical column based on conditions


experiment_data['performance_category'] = experiment_data['accuracy'].apply(
lambda x: 'Excellent' if x >= 0.9 else 'Good' if x >= 0.8 else 'Needs Improvement'
)
print(f"\n2. Added performance category:")
print(experiment_data[['experiment_id', 'accuracy', 'performance_category']])

# Add constant column


experiment_data['researcher'] = 'AI_Team_2024'
print(f"\n3. Added researcher column:")
print(experiment_data[['experiment_id', 'researcher']].head(3))
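The `apply` + nested-lambda pattern used for the performance category works, but `np.select` does the same banding vectorized and reads more clearly as the number of categories grows. A sketch using the same thresholds:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'accuracy': [0.85, 0.78, 0.92, 0.88, 0.94]})

conditions = [df['accuracy'] >= 0.9, df['accuracy'] >= 0.8]
choices = ['Excellent', 'Good']

# np.select checks conditions in order; the default covers everything else
df['performance_category'] = np.select(conditions, choices, default='Needs Improvement')
print(df['performance_category'].tolist())
# ['Good', 'Needs Improvement', 'Excellent', 'Good', 'Excellent']
```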

Deleting Columns
print("\n=== DELETING COLUMNS ===")

# Method 1: Using drop()


experiment_clean = experiment_data.drop('researcher', axis=1)
print(f"\n1. Dropped 'researcher' column:")
print(f"Original columns: {experiment_data.columns.tolist()}")
print(f"New columns: {experiment_clean.columns.tolist()}")

# Method 2: Using del (permanent)


experiment_copy = experiment_data.copy()
del experiment_copy['efficiency_score']
print(f"\n2. Deleted 'efficiency_score':")
print(f"Remaining columns: {experiment_copy.columns.tolist()}")

# Method 3: Drop multiple columns


minimal_data = experiment_data.drop(['researcher', 'efficiency_score'], axis=1)
print(f"\n3. Dropped multiple columns:")
print(minimal_data.head(3))

Aggregations and Group Operations (15 minutes)


Basic Aggregations
print("=== BASIC AGGREGATIONS ===")

# Sample model comparison data


model_results = pd.DataFrame({
    'model_family': ['CNN', 'CNN', 'RNN', 'RNN', 'Transformer', 'Transformer', 'CNN', 'RNN'],
    'accuracy': [0.85, 0.88, 0.78, 0.82, 0.92, 0.94, 0.86, 0.80],
    'training_time': [120, 150, 200, 180, 300, 350, 140, 190],
    'memory_usage': [2.1, 2.3, 1.8, 1.9, 4.2, 4.5, 2.2, 1.7],
    'dataset': ['small', 'large', 'small', 'large', 'small', 'large', 'medium', 'medium']
})

print("Model Results Dataset:")


print(model_results)

# Basic statistics
print(f"\n1. Overall Statistics:")
print(f"Average accuracy: {model_results['accuracy'].mean():.3f}")
print(f"Best accuracy: {model_results['accuracy'].max():.3f}")
print(f"Total experiments: {model_results['accuracy'].count()}")
print(f"Standard deviation: {model_results['accuracy'].std():.3f}")

GroupBy Operations (Critical for AI Analysis)


print(f"\n=== GROUPBY OPERATIONS ===")

# Group by model family


print(f"\n2. Performance by Model Family:")
model_performance = model_results.groupby('model_family').agg({
'accuracy': ['mean', 'max', 'count'],
'training_time': 'mean',
'memory_usage': 'mean'
}).round(3)
print(model_performance)

# Group by dataset size


print(f"\n3. Performance by Dataset Size:")
dataset_analysis = model_results.groupby('dataset')['accuracy'].agg([
'mean', 'std', 'min', 'max'
]).round(3)
print(dataset_analysis)

# Multiple grouping
print(f"\n4. Model Family vs Dataset Size:")
detailed_analysis = model_results.groupby(['model_family', 'dataset'])['accuracy'].mean().round(3)
print(detailed_analysis)

# Custom aggregations
print(f"\n5. Custom Analysis:")
custom_agg = model_results.groupby('model_family').agg({
    'accuracy': lambda x: f"{x.mean():.3f} ± {x.std():.3f}",
    'training_time': lambda x: f"{x.min()}-{x.max()} min"
})
print(custom_agg)

Handling Missing Values (15 minutes)

Understanding Missing Data


print("=== HANDLING MISSING VALUES ===")

# Create dataset with missing values (common in real AI projects)


messy_data = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'age': [25, np.nan, 30, 35, np.nan, 28, 32, 29],
    'rating': [4.5, 3.2, np.nan, 4.8, 2.1, np.nan, 4.0, 3.8],
    'review_text': ['Great!', None, 'Good product', 'Excellent', 'Poor', None, 'Amazing', 'OK'],
    'purchase_date': ['2024-01-15', '2024-02-20', None, '2024-03-10', '2024-01-25', None, '2024-02-14', '2024-03-05']
})

print("Dataset with Missing Values:")


print(messy_data)
# Detecting missing values
print(f"\n1. Missing Value Detection:")
print(f"Missing values per column:")
print(messy_data.isna().sum())

print(f"\nPercentage of missing values:")


print((messy_data.isna().sum() / len(messy_data) * 100).round(2))

# Visual representation
print(f"\n2. Missing Value Pattern:")
print(messy_data.isna())

Handling Missing Values Strategies


print(f"\n=== MISSING VALUE STRATEGIES ===")

# Strategy 1: Drop rows with any missing values


print(f"\n1. Drop rows with ANY missing values:")
clean_data = messy_data.dropna()
print(f"Original shape: {messy_data.shape}")
print(f"After dropping: {clean_data.shape}")
print(clean_data)

# Strategy 2: Drop rows only if specific columns have missing values


print(f"\n2. Drop rows only if 'rating' is missing:")
rating_clean = messy_data.dropna(subset=['rating'])
print(f"Shape after dropping missing ratings: {rating_clean.shape}")

# Strategy 3: Fill missing values


print(f"\n3. Fill Missing Values:")

# Fill with mean (for numerical data)


filled_data = messy_data.copy()
filled_data['age'] = filled_data['age'].fillna(filled_data['age'].mean())
filled_data['rating'] = filled_data['rating'].fillna(filled_data['rating'].mean())

# Fill with specific values
filled_data['review_text'] = filled_data['review_text'].fillna('No review provided')
filled_data['purchase_date'] = filled_data['purchase_date'].fillna('Unknown')

print("After filling missing values:")


print(filled_data)

# Strategy 4: Forward fill (useful for time series)


time_series_example = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=8, freq='D'),
    'sensor_value': [100, np.nan, np.nan, 150, np.nan, 200, np.nan, 180]
})

print(f"\n4. Forward Fill Example (Time Series):")


print("Original:")
print(time_series_example)

time_series_example['sensor_value_ffill'] = time_series_example['sensor_value'].ffill()
print("After forward fill:")
print(time_series_example)

AI/ML Context for Missing Values:

Training Data: Missing values can break model training


Feature Engineering: Different strategies affect model performance
Real-time Inference: Must handle missing values in production
Data Quality: Understanding missing patterns reveals data collection issues
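One more strategy common in ML feature engineering: impute with the median (robust to outliers) while keeping a binary "was missing" indicator column, so the model can still see the missingness pattern. A sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 30, 35, np.nan, 28]})

# Record which rows were missing before filling them in
df['age_missing'] = df['age'].isna().astype(int)
df['age'] = df['age'].fillna(df['age'].median())

print(df['age'].tolist())          # [25.0, 29.0, 30.0, 35.0, 29.0, 28.0]
print(df['age_missing'].tolist())  # [0, 1, 0, 0, 1, 0]
```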

Part 3: File Operations and Advanced Data Processing (10 minutes)


Reading JSON and CSV Files

Working with CSV Files


print("=== WORKING WITH CSV FILES ===")

# Create sample training dataset


training_dataset = pd.DataFrame({
    'text': [
        'I love this product!',
        'Terrible customer service',
        'Great quality and fast delivery',
        'Not worth the money',
        'Excellent experience overall',
        'Could be better'
    ],
    'sentiment': ['positive', 'negative', 'positive', 'negative', 'positive', 'neutral'],
    'confidence': [0.92, 0.88, 0.95, 0.85, 0.91, 0.67],
    'language': ['en', 'en', 'en', 'en', 'en', 'en']
})

# Save to CSV
training_dataset.to_csv('sentiment_training.csv', index=False)
print("Saved training dataset to CSV")
# Read from CSV with various options
df_csv = pd.read_csv('sentiment_training.csv')
print("\nRead from CSV:")
print(df_csv)

# Advanced CSV reading options


df_subset = pd.read_csv('sentiment_training.csv',
                        usecols=['text', 'sentiment'],  # Only specific columns
                        nrows=3)                        # Only first 3 rows
print("\nSubset reading (first 3 rows, specific columns):")
print(df_subset)

Working with JSON Files


print(f"\n=== WORKING WITH JSON FILES ===")

# Create sample API response data (common in AI applications)


api_responses = [
{
"id": 1,
"user_query": "What is machine learning?",
"ai_response": "Machine learning is a subset of AI...",
"response_time": 0.5,
"metadata": {"model": "GPT-4", "tokens": 150}
},
{
"id": 2,
"user_query": "Explain neural networks",
"ai_response": "Neural networks are computing systems...",
"response_time": 0.8,
"metadata": {"model": "Claude-3", "tokens": 200}
}
]

# Save to JSON
import json
with open('ai_responses.json', 'w') as f:
    json.dump(api_responses, f, indent=2)

# Read JSON into DataFrame


df_json = pd.read_json('ai_responses.json')
print("Read from JSON:")
print(df_json)

Converting Nested JSON to Flat Table


print(f"\n=== FLATTENING NESTED JSON ===")

# Handle nested JSON (common in API responses)


from pandas import json_normalize

# Flatten the nested structure


df_flat = json_normalize(api_responses)
print("Flattened JSON:")
print(df_flat)

# More complex nested example


complex_data = [
{
"experiment_id": "exp_001",
"model_config": {
"type": "transformer",
"layers": 12,
"attention_heads": 8
},
"results": {
"accuracy": 0.92,
"f1_score": 0.89,
"precision": 0.91
},
"training_details": {
"epochs": 100,
"batch_size": 32,
"learning_rate": 0.001
}
}
]

# Flatten complex nested structure


df_complex = json_normalize(complex_data)
print("\nComplex flattened structure:")
print(df_complex.columns.tolist()) # Show all flattened column names
print(df_complex)

Summary and Real-World Applications (10 minutes)


Key Pandas Operations for AI/ML Pipeline
print("=== COMPLETE AI/ML PREPROCESSING PIPELINE ===")

# Simulate a complete preprocessing workflow


import pandas as pd
import numpy as np

raw_data = pd.DataFrame({
    'text': ['Great product!', 'Bad service', None, 'Amazing quality', 'Poor value'],
    'rating': [5, 2, np.nan, 5, 1],
    'user_id': [101, 102, 103, 104, 105],
    'category': ['electronics', 'service', 'electronics', 'electronics', 'service'],
    'timestamp': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18', '2024-01-19']
})

print("1. Raw Data:")


print(raw_data)

# Step 1: Handle missing values


cleaned_data = raw_data.copy()
cleaned_data['text'] = cleaned_data['text'].fillna('No comment')
cleaned_data['rating'] = cleaned_data['rating'].fillna(cleaned_data['rating'].mean())

print("\n2. After handling missing values:")


print(cleaned_data)

# Step 2: Feature engineering


cleaned_data['text_length'] = cleaned_data['text'].str.len()
cleaned_data['is_positive'] = cleaned_data['rating'] >= 4
cleaned_data['timestamp'] = pd.to_datetime(cleaned_data['timestamp'])

print("\n3. After feature engineering:")


print(cleaned_data)

# Step 3: Filtering and selection


final_dataset = cleaned_data[cleaned_data['text_length'] > 5][['text', 'rating', 'category', 'is_positive']]

print("\n4. Final processed dataset:")


print(final_dataset)

# Step 4: Export for model training


final_dataset.to_csv('processed_training_data.csv', index=False)
print("\n5. Exported processed data for model training!")

Connection to Gen AI Applications


1. LLM Training Data Preparation

Clean and format conversational datasets


Handle multilingual text data
Remove duplicates and inappropriate content
Create training/validation splits
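The steps above can be sketched with plain pandas. This is a minimal, hedged example using a tiny made-up conversational dataset (the column names and the 80/20 ratio are illustrative choices, not a standard):

```python
import pandas as pd

# Hypothetical conversational rows (one exact duplicate on purpose)
df = pd.DataFrame({
    'prompt': ['Hi', 'Hi', 'What is AI?', 'Tell me a joke', 'Define ML', 'Summarize this'],
    'response': ['Hello!', 'Hello!', 'AI is...', 'Why did...', 'ML is...', 'Summary...'],
})

# Remove exact duplicates, then shuffle and split 80/20
df = df.drop_duplicates().sample(frac=1, random_state=42).reset_index(drop=True)
split = int(len(df) * 0.8)
train_df, val_df = df.iloc[:split], df.iloc[split:]

print(len(train_df), len(val_df))  # 4 1
```

In real pipelines the shuffle seed is recorded so the split is reproducible across runs.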

2. Fine-tuning Datasets

Prepare instruction-following datasets


Format prompt-response pairs
Balance dataset categories
Quality filtering based on metrics
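Formatting prompt-response pairs usually means rendering each row into one training string. A minimal sketch (the template and column names here are illustrative, not a fixed standard):

```python
import pandas as pd

# Hypothetical instruction-tuning rows
pairs = pd.DataFrame({
    'instruction': ['Translate to French: hello', 'Summarize: long text ...'],
    'output': ['bonjour', 'short summary'],
})

# Render each row into a single string using a simple prompt template
pairs['formatted'] = (
    '### Instruction:\n' + pairs['instruction'] +
    '\n### Response:\n' + pairs['output']
)

print(pairs['formatted'].iloc[0])
```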

3. Model Evaluation and Analysis

Analyze model performance across different categories


Track experiments and hyperparameters
Compare model versions
Identify failure patterns
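Per-category performance analysis is typically a one-line groupby. A small sketch with made-up evaluation rows (the category names are illustrative):

```python
import pandas as pd

# Hypothetical evaluation log: one row per test example
evals = pd.DataFrame({
    'category': ['qa', 'qa', 'summarization', 'summarization', 'qa'],
    'correct':  [True, False, True, True, True],
})

# Mean of a boolean column = accuracy per task category
per_category = evals.groupby('category')['correct'].mean()
print(per_category)
```

Sorting this Series ascending immediately surfaces the worst-performing categories, i.e. the failure patterns.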

4. Production Data Processing

Handle real-time user inputs


Process API response logs
Monitor model performance metrics
A/B testing analysis

Best Practices for AI/ML with Pandas

1. Always Explore First: Use head(), info(), describe() before processing


2. Handle Missing Values Thoughtfully: Choose strategy based on data context
3. Validate Data Quality: Check for duplicates, outliers, inconsistencies
4. Document Preprocessing Steps: Keep track of transformations for reproducibility
5. Save Intermediate Results: Don't lose preprocessing work
6. Use Meaningful Column Names: Make code self-documenting
7. Consider Memory Usage: Use appropriate data types for large datasets
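Point 7 is easy to demonstrate: converting a repetitive string column to the `category` dtype can cut memory substantially. Exact savings vary by pandas version, so this sketch only checks the direction of the change:

```python
import pandas as pd

# A column with many rows but only two distinct string values
df = pd.DataFrame({'category': ['electronics', 'service'] * 50_000})

before = df.memory_usage(deep=True).sum()
df['category'] = df['category'].astype('category')  # stores each label once, plus small codes
after = df.memory_usage(deep=True).sum()

print(after < before)  # True
```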

What's Next?
Building on Pandas Foundation:

Matplotlib/Seaborn: Visualizing your Pandas data


Scikit-learn: Using Pandas DataFrames directly in ML models
Streamlit: Building interactive web apps with Pandas
Dask: Scaling Pandas to larger-than-memory datasets

Advanced Pandas for AI:


Time Series Analysis: For sequential data and forecasting
Text Processing: String operations for NLP preprocessing
Multi-indexing: Complex hierarchical data structures
Performance Optimization: Vectorization and memory management
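As a preview of the text-processing point, pandas `.str` accessors chain cleanly for NLP-style normalization. The regex choices below are just one plausible cleaning recipe, not a canonical one:

```python
import pandas as pd

texts = pd.Series(['  Great PRODUCT!! ', 'bad   service', None])

# Typical normalization chain: fill missing, trim, lowercase, strip punctuation, collapse spaces
clean = (texts.fillna('')
              .str.strip()
              .str.lower()
              .str.replace(r'[^a-z\s]', '', regex=True)
              .str.replace(r'\s+', ' ', regex=True))

print(clean.tolist())  # ['great product', 'bad service', '']
```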

Hands-On Exercise (Remaining Time)


Complete Preprocessing Pipeline Exercise

Give students this practical exercise:


# Exercise: Prepare customer feedback data for sentiment analysis model
import pandas as pd
import numpy as np

# Sample messy customer feedback data


messy_feedback = pd.DataFrame({
'review_id': range(1, 11),
'customer_text': [
'Love this product!',
'terrible quality',
None,
'AMAZING SERVICE!!!',
'not worth it',
'pretty good',
None,
'Excellent experience',
'could be better',
'WORST PURCHASE EVER'
],
'star_rating': [5, 1, np.nan, 5, 2, 4, np.nan, 5, 3, 1],
'purchase_verified': [True, True, False, True, True, None, True, True, False, True],
'review_date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18',
'2024-01-19', '2024-01-20', '2024-01-21', '2024-01-22',
'2024-01-23', '2024-01-24']
})

# Student tasks:
print("Student Tasks:")
print("1. Explore the dataset using head(), info(), describe()")
print("2. Identify and handle missing values appropriately")
print("3. Create new features: text_length, sentiment_label (positive/negative/neutral)")
print("4. Filter out unverified purchases")
print("5. Group by sentiment_label and calculate statistics")
print("6. Export the cleaned dataset for model training")

# This exercise combines all concepts learned in the session!

Assessment and Troubleshooting


Common Student Mistakes and Solutions
1. Forgetting to handle missing values

Solution: Always check with df.isnull().sum() first

2. Modifying original DataFrames accidentally

Solution: Use df.copy() when experimenting

3. Confusing iloc and loc

Solution: iloc = position-based, loc = label-based
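A two-line demonstration of the difference, using a frame whose labels differ from positions:

```python
import pandas as pd

df = pd.DataFrame({'score': [10, 20, 30]}, index=['a', 'b', 'c'])

print(df.iloc[0]['score'])   # 10 -> position 0, regardless of label
print(df.loc['c', 'score'])  # 30 -> label 'c', regardless of position
```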

4. Not understanding boolean indexing

Solution: Break complex filters into steps
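For example, a compound filter becomes much easier to debug when each condition gets its own name:

```python
import pandas as pd

df = pd.DataFrame({'rating': [5, 2, 4, 1], 'verified': [True, True, False, True]})

# Build each condition separately, then combine
high_rating = df['rating'] >= 4          # boolean Series
is_verified = df['verified']             # boolean Series
mask = high_rating & is_verified         # element-wise AND (use &, not 'and')

print(df[mask])  # only the first row qualifies
```

Naming the intermediate Series lets students inspect each one with `.sum()` to see how many rows every condition keeps.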

5. Inefficient loops instead of vectorization

Solution: Use Pandas built-in methods and .apply()
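A quick contrast of the loop pattern with its vectorized equivalents:

```python
import pandas as pd

df = pd.DataFrame({'text': ['good', 'terrible', 'fine']})

# Slow pattern: a Python loop building a list element by element
lengths_loop = []
for t in df['text']:
    lengths_loop.append(len(t))

# Fast, idiomatic equivalents
lengths_vec = df['text'].str.len()     # vectorized string operation
lengths_apply = df['text'].apply(len)  # .apply for custom functions

print(lengths_vec.tolist())  # [4, 8, 4]
```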

Quick Reference Cheat Sheet


# Essential Pandas Operations for AI/ML
import pandas as pd

# Data Loading
df = pd.read_csv('data.csv')
df = pd.read_json('data.json')

# Data Exploration
df.head(), df.tail(), df.info(), df.describe()
df.shape, df.columns, df.dtypes

# Data Selection
df['column'], df[['col1', 'col2']]
df.iloc[0:5], df.loc[df['col'] > 5]

# Data Cleaning
df.isnull().sum(), df.dropna(), df.fillna(value)
df.drop_duplicates(), df.drop('column', axis=1)

# Data Analysis
df.groupby('column').mean()
df['column'].value_counts()
df.corr(), df.sort_values('column')

# Data Export
df.to_csv('output.csv', index=False)
df.to_json('output.json')

This comprehensive foundation in Pandas will enable students to handle any dataset they encounter in their AI/ML journey!

Matplotlib Teaching Notes - Gen AI Course


Duration: 2 Hours | Level: Beginner

Session Overview
This session introduces students to Matplotlib, the foundational plotting library for data visualization in Python. Students will learn to create
compelling visualizations essential for AI/ML model analysis, data exploration, and result presentation.

Learning Objectives

By the end of this session, students will be able to:

Understand Matplotlib's role in the AI/ML workflow


Create essential plot types for AI data analysis
Customize visualizations for professional presentation
Use subplots for comprehensive model comparison
Apply appropriate chart types for different AI/ML scenarios

Part 1: Matplotlib Basics (60 minutes)


Introduction to Matplotlib and Its AI/ML Context (15 minutes)

What is Matplotlib? Matplotlib is Python's foundational plotting library, providing the building blocks for data visualization. Think of it as the
"Photoshop" for data: it creates the visual stories that make complex AI insights understandable.

Why Matplotlib is Critical for Gen AI:

1. Model Performance Visualization

Training/validation loss curves


Accuracy progression over epochs
Learning rate schedules
Confusion matrices and ROC curves

2. Data Exploration and EDA

Dataset distribution analysis


Feature correlation visualization
Outlier detection
Data quality assessment

3. Research and Communication

Academic paper figures


Business presentation charts
Model comparison visualizations
A/B testing results

4. Real-world AI Applications:

LLM Training: Visualizing loss curves and perplexity scores


Computer Vision: Plotting accuracy across different image categories
Recommendation Systems: Showing user engagement patterns
Time Series AI: Forecasting visualizations
Reinforcement Learning: Reward progression charts

AI Industry Context: Every major AI breakthrough is accompanied by compelling visualizations:

OpenAI's GPT papers feature training loss curves


DeepMind's papers show performance comparisons
Tesla's Autopilot progress is tracked through visual metrics
Google's AI research relies heavily on matplotlib-generated figures

Understanding pyplot and Plotting Syntax (10 minutes)


The pyplot Interface
import matplotlib.pyplot as plt
import numpy as np

# The magic command for Jupyter notebooks


%matplotlib inline

# Basic plotting syntax follows a simple pattern:


# 1. Create data
# 2. Plot data
# 3. Customize plot
# 4. Display plot

print("Matplotlib: Where data becomes insight!")

The Anatomy of a Plot


# Simple example: AI model accuracy over training epochs
epochs = np.arange(1, 11) # Training epochs 1-10
accuracy = [0.45, 0.62, 0.71, 0.78, 0.83, 0.87, 0.89, 0.91, 0.92, 0.93]

# Create the plot
plt.figure(figsize=(8, 6)) # Set figure size
plt.plot(epochs, accuracy) # Create line plot
plt.title('AI Model Training Progress')
plt.xlabel('Training Epochs')
plt.ylabel('Accuracy')
plt.grid(True) # Add grid for better readability
plt.show() # Display the plot

print("This is exactly how AI engineers track model training!")

Key Concepts:

Figure: The entire plotting window


Axes: The area where data is plotted
Plot Elements: Lines, markers, text, legends
Customization: Colors, styles, labels, titles
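These concepts map directly onto Matplotlib's object-oriented interface, which is worth showing alongside the pyplot style used in this session (the `Agg` backend here is only so the snippet runs without a display):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs anywhere
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))       # Figure = the window, Axes = the plotting area
line, = ax.plot([1, 2, 3], [0.5, 0.7, 0.9])  # plot element: a Line2D object
ax.set_title('Accuracy')                     # customization lives on the Axes
ax.set_xlabel('Epoch')

print(ax.get_title())  # Accuracy
```

The pyplot calls in the rest of this session (`plt.title`, `plt.plot`, …) simply forward to the current Axes behind the scenes.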

Line Plots: The Foundation of AI Visualization (10 minutes)

Basic Line Plots for AI Applications


# Example 1: Training vs Validation Loss (Most Important AI Plot!)
epochs = np.arange(1, 21)
train_loss = np.exp(-epochs/8) + 0.1 + np.random.normal(0, 0.02, 20)
val_loss = np.exp(-epochs/6) + 0.15 + np.random.normal(0, 0.03, 20)

plt.figure(figsize=(10, 6))
plt.plot(epochs, train_loss, label='Training Loss', color='blue')
plt.plot(epochs, val_loss, label='Validation Loss', color='red')
plt.title('Model Training Progress: Loss Curves')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("This plot can save weeks of debugging in AI projects!")

Multiple Line Plots: Model Comparison


# Example 2: Comparing different AI models
epochs = np.arange(1, 16)
gpt_accuracy = [0.5 + 0.4*(1 - np.exp(-i/5)) + np.random.normal(0, 0.01) for i in epochs]
bert_accuracy = [0.4 + 0.45*(1 - np.exp(-i/4)) + np.random.normal(0, 0.01) for i in epochs]
transformer_accuracy = [0.45 + 0.5*(1 - np.exp(-i/6)) + np.random.normal(0, 0.01) for i in epochs]

plt.figure(figsize=(12, 7))
plt.plot(epochs, gpt_accuracy, 'b-', linewidth=2, label='GPT Model')
plt.plot(epochs, bert_accuracy, 'r--', linewidth=2, label='BERT Model')
plt.plot(epochs, transformer_accuracy, 'g:', linewidth=2, label='Custom Transformer')

plt.title('AI Model Performance Comparison', fontsize=16)
plt.xlabel('Training Epochs', fontsize=12)
plt.ylabel('Accuracy Score', fontsize=12)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.ylim(0, 1) # Set y-axis limits
plt.show()

print("This visualization helps choose the best AI architecture!")

Line Plot Variations


# Example 3: Learning Rate Scheduling
epochs = np.arange(1, 101)
learning_rate = 0.001 * (0.95 ** (epochs // 10))

plt.figure(figsize=(10, 6))
plt.plot(epochs, learning_rate, 'purple', linewidth=2, marker='o', markersize=3)
plt.title('Learning Rate Schedule for Neural Network Training')
plt.xlabel('Training Epochs')
plt.ylabel('Learning Rate')
plt.yscale('log') # Logarithmic scale for learning rates
plt.grid(True, alpha=0.3)
plt.show()

print("Learning rate schedules are crucial for stable AI training!")

Bar Plots: Categorical AI Data Visualization (10 minutes)


Basic Bar Plots for AI Metrics
# Example 1: Model Performance Across Categories
categories = ['Text Classification', 'Image Recognition', 'Speech Processing', 'Translation', 'Summarization']
accuracy_scores = [0.92, 0.89, 0.85, 0.94, 0.87]

plt.figure(figsize=(12, 7))
bars = plt.bar(categories, accuracy_scores, color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7'])
plt.title('AI Model Performance Across Different Tasks', fontsize=16)
plt.xlabel('AI Task Categories', fontsize=12)
plt.ylabel('Accuracy Score', fontsize=12)
plt.ylim(0, 1)

# Add value labels on bars
for bar, score in zip(bars, accuracy_scores):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{score:.2f}', ha='center', va='bottom', fontweight='bold')

plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("This shows which AI tasks your model excels at!")

Horizontal Bar Plots: Feature Importance


# Example 2: Feature Importance in ML Model
features = ['Word Frequency', 'Sentence Length', 'Sentiment Score', 'Topic Category', 'Grammar Score']
importance = [0.35, 0.15, 0.25, 0.18, 0.07]

plt.figure(figsize=(10, 6))
bars = plt.barh(features, importance, color='skyblue', edgecolor='navy', alpha=0.7)
plt.title('Feature Importance in Text Classification Model', fontsize=14)
plt.xlabel('Importance Score', fontsize=12)

# Add value labels
for i, bar in enumerate(bars):
    plt.text(bar.get_width() + 0.01, bar.get_y() + bar.get_height()/2,
             f'{importance[i]:.2f}', va='center', fontweight='bold')

plt.tight_layout()
plt.show()

print("Feature importance helps understand what your AI model focuses on!")

Grouped Bar Charts: Model Comparison


# Example 3: Comparing metrics across different models
models = ['BERT', 'GPT-3', 'T5', 'RoBERTa']
precision = [0.89, 0.91, 0.87, 0.90]
recall = [0.86, 0.88, 0.85, 0.89]
f1_score = [0.87, 0.89, 0.86, 0.89]

x = np.arange(len(models))
width = 0.25

plt.figure(figsize=(12, 7))
plt.bar(x - width, precision, width, label='Precision', color='lightcoral', alpha=0.8)
plt.bar(x, recall, width, label='Recall', color='lightblue', alpha=0.8)
plt.bar(x + width, f1_score, width, label='F1-Score', color='lightgreen', alpha=0.8)

plt.title('NLP Model Performance Comparison', fontsize=16)
plt.xlabel('Models', fontsize=12)
plt.ylabel('Score', fontsize=12)
plt.xticks(x, models)
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.ylim(0, 1)
plt.tight_layout()
plt.show()

print("This helps choose the best NLP model for your specific needs!")

Scatter Plots: Relationship Analysis in AI (10 minutes)

Basic Scatter Plots for AI Data


# Example 1: Model Size vs Performance
model_size = [100, 250, 500, 1000, 1750, 3500, 7000, 15000] # Million parameters
accuracy = [0.78, 0.83, 0.87, 0.91, 0.93, 0.94, 0.95, 0.96]

plt.figure(figsize=(10, 7))
plt.scatter(model_size, accuracy, c='red', s=100, alpha=0.7, edgecolors='black')
plt.title('AI Model Size vs Performance Trade-off', fontsize=16)
plt.xlabel('Model Size (Million Parameters)', fontsize=12)
plt.ylabel('Accuracy Score', fontsize=12)
plt.grid(True, alpha=0.3)

# Add trend line
z = np.polyfit(model_size, accuracy, 2)
p = np.poly1d(z)
plt.plot(model_size, p(model_size), "r--", alpha=0.8, linewidth=2)

plt.tight_layout()
plt.show()

print("This reveals the diminishing returns of larger AI models!")

Color-coded Scatter Plots: Multi-dimensional Analysis


# Example 2: Training Data Analysis
data_size = np.random.randint(1000, 50000, 50)
training_time = data_size * 0.02 + np.random.normal(0, 100, 50)
model_type = np.random.choice(['CNN', 'RNN', 'Transformer'], 50)

# Create color map for model types
colors = {'CNN': 'red', 'RNN': 'blue', 'Transformer': 'green'}
scatter_colors = [colors[mt] for mt in model_type]

plt.figure(figsize=(12, 8))
for model in ['CNN', 'RNN', 'Transformer']:
    mask = model_type == model
    plt.scatter(data_size[mask], training_time[mask],
                c=colors[model], label=model, s=80, alpha=0.7)

plt.title('Training Time vs Dataset Size by Model Type', fontsize=16)
plt.xlabel('Dataset Size (samples)', fontsize=12)
plt.ylabel('Training Time (minutes)', fontsize=12)
plt.legend(title='Model Type')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("This helps predict training costs for different AI architectures!")

Bubble Charts: Three-dimensional Data


# Example 3: AI Model Evaluation Dashboard
models = ['BERT-Base', 'BERT-Large', 'GPT-2', 'GPT-3', 'T5-Small', 'T5-Large']
accuracy = [0.89, 0.92, 0.87, 0.94, 0.88, 0.91]
inference_speed = [150, 80, 200, 60, 180, 90] # tokens per second
model_size = [110, 340, 117, 175000, 60, 770] # million parameters

plt.figure(figsize=(12, 8))
# Bubble size represents model size
bubble_sizes = [size/100 for size in model_size] # Scale down for visualization

scatter = plt.scatter(accuracy, inference_speed, s=bubble_sizes,
                      c=model_size, cmap='viridis', alpha=0.7, edgecolors='black')

plt.title('AI Model Performance Dashboard\n(Bubble size = Model Size)', fontsize=16)
plt.xlabel('Accuracy Score', fontsize=12)
plt.ylabel('Inference Speed (tokens/sec)', fontsize=12)

# Add colorbar for model size
cbar = plt.colorbar(scatter)
cbar.set_label('Model Size (Million Parameters)', fontsize=11)

# Add model labels
for i, model in enumerate(models):
    plt.annotate(model, (accuracy[i], inference_speed[i]),
                 xytext=(5, 5), textcoords='offset points', fontsize=9)

plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("This dashboard view helps select the optimal AI model!")

Titles, Labels, and Legends: Professional Presentation (5 minutes)


Complete Plot Customization
# Example: Professional AI Research Figure
epochs = np.arange(1, 26)
baseline_loss = 2.5 * np.exp(-epochs/10) + 0.3
our_model_loss = 2.0 * np.exp(-epochs/8) + 0.25
sota_model_loss = 1.8 * np.exp(-epochs/12) + 0.2

plt.figure(figsize=(12, 8))

# Plot with custom styling
plt.plot(epochs, baseline_loss, 'b--', linewidth=2.5,
         label='Baseline Model', marker='o', markersize=4)
plt.plot(epochs, our_model_loss, 'r-', linewidth=2.5,
         label='Our Proposed Model', marker='s', markersize=4)
plt.plot(epochs, sota_model_loss, 'g-.', linewidth=2.5,
         label='State-of-the-Art Model', marker='^', markersize=4)

# Professional customization
plt.title('Training Loss Comparison: Novel Architecture vs Baselines',
          fontsize=18, fontweight='bold', pad=20)
plt.xlabel('Training Epochs', fontsize=14, fontweight='bold')
plt.ylabel('Cross-Entropy Loss', fontsize=14, fontweight='bold')

# Enhanced legend
plt.legend(loc='upper right', fontsize=12, frameon=True,
           shadow=True, fancybox=True, framealpha=0.9)

# Professional grid and styling
plt.grid(True, alpha=0.3, linestyle='-', linewidth=0.5)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Add annotations
min_epoch = np.argmin(our_model_loss) + 1
min_loss = our_model_loss[np.argmin(our_model_loss)]
plt.annotate(f'Best Performance\nEpoch {min_epoch}: {min_loss:.3f}',
             xy=(min_epoch, min_loss), xytext=(min_epoch+5, min_loss+0.3),
             arrowprops=dict(arrowstyle='->', color='red', lw=1.5),
             fontsize=11, ha='center',
             bbox=dict(boxstyle="round,pad=0.3", facecolor="yellow", alpha=0.7))

plt.tight_layout()
plt.show()

print("This is publication-ready AI research visualization!")
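One step a publication workflow also needs is saving the figure to disk rather than showing it. A minimal sketch (the filename and DPI are arbitrary example choices; `Agg` is used so it renders off-screen):

```python
import os
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([1, 2, 3], [3, 2, 1], label='loss')
ax.legend()

# dpi=300 suits print quality; bbox_inches='tight' trims surrounding whitespace
fig.savefig('loss_curve.png', dpi=300, bbox_inches='tight')
print(os.path.exists('loss_curve.png'))  # True
```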

Part 2: Matplotlib Advanced (60 minutes)


Subplots: Comprehensive AI Analysis (15 minutes)

Basic Subplots for Model Analysis


# Example 1: Complete Training Dashboard
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Data for different metrics
epochs = np.arange(1, 21)
train_loss = np.exp(-epochs/8) + 0.1 + np.random.normal(0, 0.02, 20)
val_loss = np.exp(-epochs/6) + 0.15 + np.random.normal(0, 0.03, 20)
train_acc = 1 - np.exp(-epochs/5) * 0.8 + np.random.normal(0, 0.01, 20)
val_acc = 1 - np.exp(-epochs/4) * 0.85 + np.random.normal(0, 0.015, 20)

# Subplot 1: Loss curves


axes[0, 0].plot(epochs, train_loss, 'b-', label='Training Loss', linewidth=2)
axes[0, 0].plot(epochs, val_loss, 'r-', label='Validation Loss', linewidth=2)
axes[0, 0].set_title('Model Loss Progression', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Epochs')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Subplot 2: Accuracy curves


axes[0, 1].plot(epochs, train_acc, 'b-', label='Training Accuracy', linewidth=2)
axes[0, 1].plot(epochs, val_acc, 'r-', label='Validation Accuracy', linewidth=2)
axes[0, 1].set_title('Model Accuracy Progression', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Epochs')
axes[0, 1].set_ylabel('Accuracy')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Subplot 3: Learning rate schedule


learning_rates = [0.001 * (0.9 ** (i//5)) for i in epochs]
axes[1, 0].plot(epochs, learning_rates, 'g-', linewidth=2, marker='o')
axes[1, 0].set_title('Learning Rate Schedule', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Epochs')
axes[1, 0].set_ylabel('Learning Rate')
axes[1, 0].set_yscale('log')
axes[1, 0].grid(True, alpha=0.3)

# Subplot 4: Model complexity


layers = [2, 4, 6, 8, 10, 12]
performance = [0.75, 0.83, 0.89, 0.92, 0.94, 0.93]
axes[1, 1].bar(layers, performance, color='purple', alpha=0.7)
axes[1, 1].set_title('Model Depth vs Performance', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Number of Layers')
axes[1, 1].set_ylabel('Test Accuracy')
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.suptitle('Complete AI Model Training Dashboard', fontsize=18, fontweight='bold', y=1.02)
plt.show()

print("This dashboard gives you complete visibility into AI model training!")


Advanced Subplot Configurations
# Example 2: AI Model Comparison Grid
models = ['BERT', 'GPT-2', 'T5', 'RoBERTa', 'ELECTRA', 'DeBERTa']
tasks = ['Classification', 'Generation', 'Q&A', 'Summarization']

# Create performance matrix
np.random.seed(42)
performance_matrix = np.random.rand(len(tasks), len(models)) * 0.3 + 0.65

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

for i, model in enumerate(models):
    task_performance = performance_matrix[:, i]
    bars = axes[i].bar(tasks, task_performance,
                       color=plt.cm.Set3(i), alpha=0.8, edgecolor='black')
    axes[i].set_title(f'{model} Performance', fontsize=12, fontweight='bold')
    axes[i].set_ylabel('Accuracy Score')
    axes[i].set_ylim(0.6, 1.0)
    axes[i].tick_params(axis='x', rotation=45)

    # Add value labels
    for bar, score in zip(bars, task_performance):
        axes[i].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                     f'{score:.2f}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.suptitle('AI Model Performance Across Different Tasks', fontsize=20, fontweight='bold', y=1.02)
plt.show()

print("This comparison helps choose the right model for each AI task!")

Histograms: Distribution Analysis for AI (15 minutes)


Basic Histograms for AI Data
# Example 1: Model Prediction Confidence Distribution
np.random.seed(42)
high_conf_predictions = np.random.beta(8, 2, 1000) # Skewed towards high confidence
low_conf_predictions = np.random.beta(2, 5, 500) # Skewed towards low confidence
all_predictions = np.concatenate([high_conf_predictions, low_conf_predictions])

plt.figure(figsize=(12, 8))
plt.hist(all_predictions, bins=30, alpha=0.7, color='skyblue',
         edgecolor='black', density=True)
plt.title('AI Model Prediction Confidence Distribution', fontsize=16, fontweight='bold')
plt.xlabel('Confidence Score', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.axvline(all_predictions.mean(), color='red', linestyle='--',
            linewidth=2, label=f'Mean: {all_predictions.mean():.3f}')
plt.axvline(np.median(all_predictions), color='green', linestyle='--',
            linewidth=2, label=f'Median: {np.median(all_predictions):.3f}')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("This helps assess whether your AI model is over-confident or under-confident!")

Comparative Histograms
# Example 2: Training Data Quality Analysis
np.random.seed(42)
clean_data_scores = np.random.normal(0.8, 0.1, 1000)
noisy_data_scores = np.random.normal(0.6, 0.2, 800)

plt.figure(figsize=(12, 8))
plt.hist(clean_data_scores, bins=25, alpha=0.6, label='Clean Training Data',
         color='green', density=True, edgecolor='black')
plt.hist(noisy_data_scores, bins=25, alpha=0.6, label='Noisy Training Data',
         color='red', density=True, edgecolor='black')

plt.title('Training Data Quality Distribution Comparison', fontsize=16, fontweight='bold')
plt.xlabel('Data Quality Score', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)

# Add statistical annotations
plt.axvline(clean_data_scores.mean(), color='green', linestyle=':',
            linewidth=2, alpha=0.8)
plt.axvline(noisy_data_scores.mean(), color='red', linestyle=':',
            linewidth=2, alpha=0.8)

plt.text(0.85, 3, f'Clean Data\nMean: {clean_data_scores.mean():.3f}\nStd: {clean_data_scores.std():.3f}',
         bbox=dict(boxstyle="round,pad=0.3", facecolor="lightgreen", alpha=0.7))
plt.text(0.3, 2, f'Noisy Data\nMean: {noisy_data_scores.mean():.3f}\nStd: {noisy_data_scores.std():.3f}',
         bbox=dict(boxstyle="round,pad=0.3", facecolor="lightcoral", alpha=0.7))

plt.tight_layout()
plt.show()

print("This analysis helps decide if you need data cleaning before AI training!")

2D Histograms: Feature Correlation

# Example 3: Feature Correlation Analysis
np.random.seed(42)
feature1 = np.random.multivariate_normal([0.7, 0.6], [[0.02, 0.01], [0.01, 0.03]], 500)
feature2 = np.random.multivariate_normal([0.3, 0.8], [[0.03, -0.01], [-0.01, 0.02]], 300)

all_features = np.vstack([feature1, feature2])
x_features = all_features[:, 0]
y_features = all_features[:, 1]

plt.figure(figsize=(12, 10))

# Create 2D histogram
plt.hist2d(x_features, y_features, bins=25, cmap='Blues', alpha=0.8)
plt.colorbar(label='Frequency')
plt.title('Feature Correlation Heatmap for AI Model Input', fontsize=16, fontweight='bold')
plt.xlabel('Feature 1 (e.g., Text Sentiment)', fontsize=12)
plt.ylabel('Feature 2 (e.g., Text Length)', fontsize=12)

# Add marginal histograms

# Top histogram (marginal for x)
ax_top = plt.gca().inset_axes([0, 1.02, 1, 0.2])
ax_top.hist(x_features, bins=25, color='lightblue', alpha=0.7, edgecolor='black')
ax_top.set_xlim(plt.gca().get_xlim())
ax_top.tick_params(labelbottom=False)

# Right histogram (marginal for y)
ax_right = plt.gca().inset_axes([1.02, 0, 0.2, 1])
ax_right.hist(y_features, bins=25, orientation='horizontal',
              color='lightblue', alpha=0.7, edgecolor='black')
ax_right.set_ylim(plt.gca().get_ylim())
ax_right.tick_params(labelleft=False)

plt.tight_layout()
plt.show()

print("This reveals feature correlations that affect AI model performance!")

Pie Charts: Categorical Analysis for AI (10 minutes)


Basic Pie Charts for AI Metrics
# Example 1: AI Model Error Analysis
error_types = ['False Positives', 'False Negatives', 'True Positives', 'True Negatives']
error_counts = [120, 80, 1650, 1850]
colors = ['#FF6B6B', '#FFE66D', '#4ECDC4', '#95E1D3']

plt.figure(figsize=(10, 8))
wedges, texts, autotexts = plt.pie(error_counts, labels=error_types, colors=colors,
                                   autopct='%1.1f%%', startangle=90, explode=(0.1, 0.1, 0, 0))

plt.title('AI Model Prediction Analysis\n(Confusion Matrix Distribution)',
          fontsize=16, fontweight='bold', pad=20)

# Enhance text
for autotext in autotexts:
    autotext.set_color('white')
    autotext.set_fontweight('bold')
    autotext.set_fontsize(12)

# Add legend with counts
legend_labels = [f'{label}: {count}' for label, count in zip(error_types, error_counts)]
plt.legend(wedges, legend_labels, title="Prediction Counts",
           loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))

plt.tight_layout()
plt.show()

print("This pie chart helps understand your AI model's error patterns!")

Advanced Pie Charts with Subplots


# Example 2: AI Model Resource Usage Comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Model A: Lightweight model


model_a_resources = [20, 15, 10, 55] # Training, Inference, Storage, Available
labels = ['Training', 'Inference', 'Storage', 'Available']
colors = ['#FF9999', '#66B2FF', '#99FF99', '#FFCC99']

axes[0].pie(model_a_resources, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)


axes[0].set_title('Lightweight AI Model\nResource Usage', fontsize=14, fontweight='bold')

# Model B: Heavy model


model_b_resources = [45, 25, 20, 10]
axes[1].pie(model_b_resources, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
axes[1].set_title('Heavy AI Model\nResource Usage', fontsize=14, fontweight='bold')

# Model C: Optimized model


model_c_resources = [30, 18, 12, 40]
axes[2].pie(model_c_resources, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
axes[2].set_title('Optimized AI Model\nResource Usage', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

Common questions

Powered by AI

NumPy is foundational to the Python AI/ML ecosystem because it provides the core array operations used by other libraries. It acts as a base for libraries such as Pandas for data manipulation, TensorFlow and PyTorch for deep learning, and Scikit-learn for machine learning algorithms. NumPy's efficient array management and mathematical operations enable these libraries to perform large-scale computations and data processing essential in AI/ML workflows .

NumPy's vectorization capability allows for operations to be performed on entire arrays rather than using loops to iterate over elements individually. This is achieved through low-level C implementations, which provide performance enhancements of 10-100 times faster than pure Python loops. Vectorization also facilitates more concise and readable code, reducing errors associated with loop constructs .

NumPy’s ndarray is utilized in various real-world AI/ML applications due to its efficient handling of multi-dimensional data. In image processing, images are treated as ndarrays where dimensions represent height, width, and channels. In neural networks, ndarrays store weights and activations. They are also used in data preprocessing tasks like feature scaling and normalization, and in mathematical operations such as matrix multiplication and statistical evaluations .

Feature engineering with NumPy arrays involves creating new input features or transforming existing ones to improve model performance. This process includes scaling features, generating polynomial features, and encoding categorical variables. NumPy arrays efficiently manage and process such transformations due to their speed and efficient memory usage, enabling faster experimentation with feature combinations and dimensionality reduction techniques, which are essential for building robust AI/ML models .

The 'linspace' function in NumPy can be applied to design learning rate schedules by generating evenly spaced learning rates over a specified interval. This is critical for slowly adjusting the learning rate to ensure stable convergence during neural network training. For instance, a 'linspace' could be used to reduce the learning rate from 0.1 to 0.01 over 100 epochs, ensuring that the model's weight updates become more precise as training progresses .

Beginners might face issues such as shape mismatches when performing operations, confusion between copies and views, unintended data type conversions, and misunderstanding broadcasting mechanics. These can be mitigated by checking array shapes before operations, understanding how NumPy behaves when arrays are modified, explicitly converting data types as needed, and practicing broadcasting with simple examples to build intuition.
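The copy-versus-view pitfall in particular is worth seeing once:

```python
import numpy as np

a = np.arange(5)

# Slicing returns a VIEW: modifying it changes the original array
view = a[1:4]
view[0] = 99
print(a)            # [ 0 99  2  3  4] — the original changed!

# .copy() returns an independent array
b = np.arange(5)
c = b[1:4].copy()
c[0] = 99
print(b)            # [0 1 2 3 4] — the original is untouched

# Checking shapes before an operation catches mismatches early
x = np.ones((3, 2))
y = np.ones((2, 3))
print(x.shape == y.shape)   # False — x + y would raise a ValueError
```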

Broadcasting in NumPy allows operations to be performed on arrays of different shapes, enabling element-wise calculations without explicitly replicating data. This is particularly useful in AI/ML workflows, where operations on multi-dimensional data are common. For example, in neural network training, broadcasting can be used to apply per-layer scaling factors without resizing weight matrices, facilitating efficient manipulation of data dimensions.
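A minimal sketch of both broadcasting directions on a small matrix:

```python
import numpy as np

# A (3, 4) matrix and a length-4 row vector: shapes (3, 4) and (4,)
W = np.ones((3, 4))
scale = np.array([0.1, 0.2, 0.3, 0.4])

# Broadcasting stretches `scale` across the 3 rows — no explicit tiling
out = W * scale
print(out.shape)    # (3, 4)
print(out[0])       # [0.1 0.2 0.3 0.4]

# A column vector of shape (3, 1) broadcasts across the 4 columns instead
col = np.array([[1.0], [2.0], [3.0]])
print((W * col)[2]) # [3. 3. 3. 3.]
```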

NumPy's 'eye' function creates identity matrices, which are central to the linear algebra used in AI: multiplying by the identity leaves data unchanged, making it the natural starting point for transformations that must preserve the original data. Identity matrices also appear when inverting matrices and verifying numerical results in various AI algorithms, helping ensure accuracy and efficiency in computation.
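A brief sketch of both roles, using an arbitrary invertible 3x3 matrix:

```python
import numpy as np

I = np.eye(3)              # 3x3 identity matrix
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 2.0]])

# Multiplying by the identity leaves A unchanged
print(np.array_equal(A @ I, A))        # True

# A matrix times its inverse recovers the identity (up to float error)
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, I))       # True
```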

NumPy arrays are more memory-efficient than Python lists because they store elements of the same data type contiguously, allowing for dense storage and reduced memory overhead compared to the heterogeneous, pointer-based storage of Python lists. This efficiency is critical in data science, where large datasets are common: it makes better use of system memory, leading to faster processing times and the ability to handle larger datasets.
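The difference can be measured directly with `ndarray.nbytes` and `sys.getsizeof` (a rough comparison, since `getsizeof` does not capture every allocation detail):

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# The array stores 1000 fixed-size 8-byte integers contiguously
print(np_arr.nbytes)                       # 8000 bytes of element data

# The list stores 1000 pointers, each to a full Python int object
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(i) for i in py_list)
print(list_bytes > np_arr.nbytes)          # True — lists carry more overhead
```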

NumPy contributes significantly to data preprocessing by providing efficient tools for transforming and manipulating data. This includes scaling and normalizing features, handling missing data through imputation, and converting categorical data into numerical format via one-hot encoding. These preprocessing steps are crucial for ensuring that AI models receive well-structured and correctly scaled inputs, which directly impacts their learning efficiency and overall performance.
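The three preprocessing steps named above can each be sketched in a line or two, using hypothetical toy data:

```python
import numpy as np

# Standardize (z-score) a feature column: zero mean, unit variance
x = np.array([2.0, 4.0, 6.0, 8.0])
x_std = (x - x.mean()) / x.std()

# Impute missing values (NaN) with the column mean
col = np.array([1.0, np.nan, 3.0])
col[np.isnan(col)] = np.nanmean(col)       # nanmean ignores the NaN
print(col)                                 # [1. 2. 3.]

# One-hot encode integer category labels via an identity-matrix lookup
labels = np.array([0, 2, 1])
one_hot = np.eye(3)[labels]
print(one_hot.shape)                       # (3, 3) — one row per label
```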
