NumPy Teaching Notes - Gen AI Course
Duration: 2 Hours | Level: Beginner
Session Overview
This session introduces students to NumPy, the fundamental package for scientific computing in Python and a cornerstone of AI/ML development.
Learning Objectives
By the end of this session, students will be able to:
Understand what NumPy is and its importance in AI/ML
Create and manipulate NumPy arrays
Perform basic and advanced array operations
Apply NumPy functions commonly used in AI/ML workflows
Part 1: NumPy Basics (45 minutes)
What is NumPy? (10 minutes)
NumPy (Numerical Python) is a fundamental library for scientific computing in Python.
Key Points to Emphasize:
Foundation of AI/ML Stack: NumPy is the building block for libraries like Pandas, Scikit-learn, TensorFlow, and PyTorch
Performance: Operations are 10-100x faster than pure Python due to C implementation
Memory Efficiency: Uses less memory than Python lists
Vectorization: Allows mathematical operations on entire arrays without writing loops
Real-world AI/ML Examples:
Image processing: Images are stored as NumPy arrays (height × width × channels)
Neural networks: Weights and activations are NumPy arrays
Data preprocessing: Feature scaling, normalization
Mathematical computations: Matrix multiplication, statistical operations
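A minimal sketch of the vectorization point above: the same scaling done with a Python loop and with one NumPy expression (values here are made up for illustration).

```python
import numpy as np

# Pure-Python loop: processes one element at a time
values = [1.0, 2.0, 3.0, 4.0]
scaled_loop = [v * 0.5 for v in values]

# Vectorized: same arithmetic applied to the whole array; the loop runs in C
arr = np.array(values)
scaled_vec = arr * 0.5

print(scaled_loop)  # [0.5, 1.0, 1.5, 2.0]
print(scaled_vec)   # [0.5 1.  1.5 2. ]
```

Both produce identical numbers; the vectorized form is what makes NumPy fast at scale.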
NumPy Installation (5 minutes)
# Standard installation
pip install numpy
# With Anaconda/Miniconda
conda install numpy
# Import convention
import numpy as np
Teaching Tip: Explain why we use np alias - it's the standard convention in the community.
Arrays: The Heart of NumPy (15 minutes)
What are NumPy Arrays?
NumPy arrays (ndarray) are homogeneous collections of elements with a fixed size.
Key Characteristics:
Homogeneous: All elements must be the same data type
Fixed size: Size is determined at creation
N-dimensional: Can have 1, 2, 3, or more dimensions
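The homogeneity rule is worth demonstrating live: when input types are mixed, NumPy silently upcasts everything to one common dtype.

```python
import numpy as np

# Mixing ints with one float: NumPy upcasts the whole array to float64
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # float64

# Requesting an integer dtype instead truncates the float
forced = np.array([1, 2, 3.5], dtype=np.int64)
print(forced)  # [1 2 3]
```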
Creating Arrays
1D Arrays (Vectors)
import numpy as np
# From Python list
arr_1d = np.array([1, 2, 3, 4, 5])
print(f"1D Array: {arr_1d}")
print(f"Shape: {arr_1d.shape}") # (5,)
print(f"Dimensions: {arr_1d.ndim}") # 1
2D Arrays (Matrices)
# From nested Python list
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
print(f"2D Array:\n{arr_2d}")
print(f"Shape: {arr_2d.shape}") # (2, 3) - 2 rows, 3 columns
print(f"Dimensions: {arr_2d.ndim}") # 2
Higher Dimensional Arrays
# 3D array (common in image processing)
arr_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(f"3D Array:\n{arr_3d}")
print(f"Shape: {arr_3d.shape}") # (2, 2, 2)
print(f"Dimensions: {arr_3d.ndim}") # 3
Python List vs NumPy Array (10 minutes)
Create a comparison table on the board:
Feature Python List NumPy Array
Data Types Mixed (heterogeneous) Same type (homogeneous)
Memory More memory overhead Memory efficient
Speed Slower for numerical operations Much faster
Functionality Basic operations Rich mathematical functions
Dimensions 1D only (nested for 2D+) True multidimensional
Broadcasting Not supported Supported
Demonstration Code:
# Memory comparison
import sys
python_list = [1, 2, 3, 4, 5] * 1000
numpy_array = np.array(python_list)
print(f"Python list memory: {sys.getsizeof(python_list)} bytes")
print(f"NumPy array memory: {numpy_array.nbytes} bytes")
# Speed comparison (simple demo)
import time
# Python list operation
start = time.time()
result_list = [x * 2 for x in python_list]
list_time = time.time() - start
# NumPy array operation
start = time.time()
result_numpy = numpy_array * 2
numpy_time = time.time() - start
print(f"Python list time: {list_time:.6f} seconds")
print(f"NumPy array time: {numpy_time:.6f} seconds")
print(f"NumPy is {list_time/numpy_time:.1f}x faster!")
Code Examples - Hands-on Practice (5 minutes)
Example 1: Create a 2D array and print its shape
# Create a 2D array representing student grades
grades = np.array([[85, 92, 78],
                   [90, 88, 95],
                   [76, 84, 89]])
print("Student Grades Array:")
print(grades)
print(f"Shape: {grades.shape}") # (3, 3) - 3 students, 3 subjects
print(f"Total elements: {grades.size}") # 9
print(f"Data type: {grades.dtype}") # int64 (or int32)
Example 2: Convert Python list to NumPy array and find mean
# Python list of temperatures
temperatures = [23.5, 25.1, 22.8, 26.2, 24.7, 21.9, 25.5]
# Convert to NumPy array
temp_array = np.array(temperatures)
print(f"Original list: {temperatures}")
print(f"NumPy array: {temp_array}")
print(f"Mean temperature: {temp_array.mean():.2f}°C")
print(f"Max temperature: {temp_array.max():.2f}°C")
print(f"Min temperature: {temp_array.min():.2f}°C")
print(f"Standard deviation: {temp_array.std():.2f}°C")
Part 2: NumPy Advanced (60 minutes)
Array Indexing and Slicing (15 minutes)
1D Array Indexing
arr = np.array([10, 20, 30, 40, 50])
# Basic indexing (same as Python lists)
print(f"First element: {arr[0]}") # 10
print(f"Last element: {arr[-1]}") # 50
print(f"Third element: {arr[2]}") # 30
# Slicing
print(f"First three: {arr[:3]}") # [10 20 30]
print(f"From index 2: {arr[2:]}") # [30 40 50]
print(f"Every second: {arr[::2]}") # [10 30 50]
2D Array Indexing
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
# Access single element
print(f"Element at row 1, col 2: {matrix[1, 2]}") # 6
# Access entire row
print(f"Second row: {matrix[1, :]}") # [4 5 6]
# Access entire column
print(f"Third column: {matrix[:, 2]}") # [3 6 9]
# Subarray
print(f"Top-left 2x2:\n{matrix[:2, :2]}")
# [[1 2]
# [4 5]]
Boolean Indexing (Very Important for AI/ML)
data = np.array([1, 5, 3, 8, 2, 7, 6])
# Create boolean mask
mask = data > 4
print(f"Mask: {mask}") # [False True False True False True True]
# Apply mask
filtered = data[mask]
print(f"Values > 4: {filtered}") # [5 8 7 6]
# Direct boolean indexing
print(f"Values <= 3: {data[data <= 3]}") # [1 3 2]
Mathematical Operations (15 minutes)
Element-wise Operations
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
# Basic arithmetic
print(f"Addition: {a + b}") # [11 22 33 44]
print(f"Subtraction: {b - a}") # [ 9 18 27 36]
print(f"Multiplication: {a * b}") # [10 40 90 160]
print(f"Division: {b / a}") # [10. 10. 10. 10.]
print(f"Power: {a ** 2}") # [ 1 4 9 16]
# With scalars
print(f"Add 10: {a + 10}") # [11 12 13 14]
print(f"Multiply by 3: {a * 3}") # [3 6 9 12]
Mathematical Functions
angles = np.array([0, np.pi/4, np.pi/2, np.pi])
print(f"Sin: {np.sin(angles)}")
print(f"Cos: {np.cos(angles)}")
print(f"Square root: {np.sqrt([1, 4, 9, 16])}")
print(f"Exponential: {np.exp([1, 2, 3])}")
print(f"Natural log: {np.log([1, np.e, np.e**2])}")
Matrix Operations (Important for AI/ML)
# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
# Element-wise multiplication
print(f"Element-wise: \n{A * B}")
# Matrix multiplication (dot product)
print(f"Matrix multiplication: \n{np.dot(A, B)}")
# or
print(f"Using @ operator: \n{A @ B}")
Reshape and Flatten (10 minutes)
Reshape: Change array dimensions without changing data
# Original 1D array
original = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
print(f"Original shape: {original.shape}") # (12,)
# Reshape to 2D
reshaped_2d = original.reshape(3, 4) # 3 rows, 4 columns
print(f"Reshaped to 3x4:\n{reshaped_2d}")
# Reshape to 3D
reshaped_3d = original.reshape(2, 2, 3) # 2 blocks, 2 rows, 3 columns
print(f"Reshaped to 2x2x3:\n{reshaped_3d}")
# Auto-calculate dimension with -1
auto_reshape = original.reshape(4, -1) # 4 rows, auto-calculate columns
print(f"Auto-reshaped to 4x?: \n{auto_reshape}")
Flatten: Convert to 1D
matrix_2d = np.array([[1, 2, 3], [4, 5, 6]])
# Flatten to 1D
flattened = matrix_2d.flatten()
print(f"Flattened: {flattened}") # [1 2 3 4 5 6]
# Alternative: ravel() - returns view if possible
raveled = matrix_2d.ravel()
print(f"Raveled: {raveled}") # [1 2 3 4 5 6]
AI/ML Context: Reshaping is crucial when preparing data for neural networks, converting images to vectors, etc.
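A small sketch of that context, using a hypothetical 4x4 single-channel "image" (sizes chosen only for illustration):

```python
import numpy as np

# A hypothetical 4x4 single-channel image
image = np.arange(16, dtype=np.float32).reshape(4, 4, 1)

# Flatten it to a feature vector, as done before feeding a dense layer
features = image.reshape(-1)
print(features.shape)  # (16,)

# A batch of 8 such images: keep the batch axis, flatten the rest with -1
batch = np.zeros((8, 4, 4, 1))
flat_batch = batch.reshape(8, -1)
print(flat_batch.shape)  # (8, 16)
```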
Broadcasting (10 minutes)
What is Broadcasting? Broadcasting allows NumPy to perform operations on arrays with different shapes without explicitly reshaping them.
Rules of Broadcasting:
1. Start from the trailing dimension
2. Dimensions are compatible if they are equal, or one of them is 1
3. Missing dimensions are assumed to be 1
Examples:
# Scalar with array
arr = np.array([1, 2, 3, 4])
result = arr + 10 # 10 is broadcast to [10, 10, 10, 10]
print(f"Array + scalar: {result}")
# 1D with 2D
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
vector = np.array([10, 20, 30])
result = matrix + vector # vector is broadcast to each row
print(f"Matrix + vector:\n{result}")
# [[11 22 33]
# [14 25 36]]
# Column vector with row vector
col_vector = np.array([[1], [2], [3]]) # shape (3, 1)
row_vector = np.array([10, 20]) # shape (2,) - broadcasts like (1, 2)
result = col_vector + row_vector # Broadcasts to 3x2
print(f"Column + row broadcasting:\n{result}")
# [[11 21]
# [12 22]
# [13 23]]
Useful Functions (10 minutes)
Array Creation Functions
# arange(): Like Python's range but returns NumPy array
print("arange() - evenly spaced values in a range:")
print(f"0 to 10: {np.arange(10)}") # [0 1 2 3 4 5 6 7 8 9]
print(f"2 to 20 by 3: {np.arange(2, 20, 3)}") # [2 5 8 11 14 17]
print(f"Float steps: {np.arange(0, 1, 0.2)}") # [0. 0.2 0.4 0.6 0.8]
# linspace(): Evenly spaced numbers over specified interval
print("\nlinspace() - linearly spaced values:")
print(f"5 points from 0 to 10: {np.linspace(0, 10, 5)}")
# [0. 2.5 5. 7.5 10.]
print(f"Including endpoint: {np.linspace(0, 10, 5, endpoint=True)}")
print(f"Excluding endpoint: {np.linspace(0, 10, 5, endpoint=False)}")
# eye(): Identity matrix (important for linear algebra)
print(f"\neye() - identity matrix:")
print(f"3x3 identity:\n{np.eye(3)}")
print(f"4x4 identity:\n{np.eye(4)}")
# ones(): Array filled with ones
print(f"\nones() - array of ones:")
print(f"1D: {np.ones(5)}")
print(f"2D:\n{np.ones((3, 4))}") # Note: shape as tuple
# zeros(): Array filled with zeros
print(f"\nzeros() - array of zeros:")
print(f"1D: {np.zeros(4)}")
print(f"2D:\n{np.zeros((2, 3))}")
# Additional useful functions
print(f"\nBonus functions:")
print(f"Random array: {np.random.rand(5)}")
print(f"Array like another: {np.ones_like(np.array([1, 2, 3]))}")
AI/ML Applications of These Functions:
arange(): Creating epoch numbers, batch indices
linspace(): Creating smooth curves for visualization, learning rate schedules
eye(): Identity matrices for linear algebra operations
ones()/zeros(): Initializing weights, creating masks, padding
random(): Data augmentation, weight initialization
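Two of the bullets above combined into one hypothetical sketch: a warmup-then-decay learning-rate schedule built from linspace() and arange(), and a zeros()-based mask (the schedule numbers are invented for illustration, not a recommended recipe).

```python
import numpy as np

# Hypothetical schedule: linear warmup over 5 steps, then geometric decay
warmup = np.linspace(0.0, 0.01, 5, endpoint=False)  # ramp up to 0.01
decay = 0.01 * (0.9 ** np.arange(10))               # decay from 0.01
schedule = np.concatenate([warmup, decay])
print(schedule.shape)  # (15,)

# zeros() as a mask: flag the first 3 of 10 samples
mask = np.zeros(10)
mask[:3] = 1
print(int(mask.sum()))  # 3
```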
Practical Exercise (15 minutes)
Mini-Project: Student Grade Analysis
Give students this exercise to practice all concepts:
# Student grade analysis using NumPy
import numpy as np
# Sample data: 5 students, 4 subjects (Math, Science, English, History)
grades = np.array([
    [85, 92, 78, 88], # Student 1
    [90, 88, 95, 85], # Student 2
    [76, 84, 89, 91], # Student 3
    [88, 91, 82, 87], # Student 4
    [92, 87, 94, 89]  # Student 5
])
# Tasks for students:
print("1. Array information:")
print(f"Shape: {grades.shape}")
print(f"Total grades: {grades.size}")
print("\n2. Student averages:")
student_averages = grades.mean(axis=1) # axis=1 for row-wise mean
for i, avg in enumerate(student_averages):
    print(f"Student {i+1}: {avg:.2f}")
print("\n3. Subject averages:")
subject_names = ['Math', 'Science', 'English', 'History']
subject_averages = grades.mean(axis=0) # axis=0 for column-wise mean
for subject, avg in zip(subject_names, subject_averages):
    print(f"{subject}: {avg:.2f}")
print("\n4. Students with overall average > 87:")
high_performers = grades[student_averages > 87]
print(f"Count: {len(high_performers)}")
print("\n5. Best subject (highest class average):")
best_subject_idx = subject_averages.argmax()
print(f"Best subject: {subject_names[best_subject_idx]} ({subject_averages[best_subject_idx]:.2f})")
Summary and Next Steps (15 minutes)
Key Takeaways
NumPy is the foundation of the Python data science/AI ecosystem
Arrays are more efficient than Python lists for numerical computations
Understanding shapes and dimensions is crucial for AI/ML work
Broadcasting makes operations between different-shaped arrays possible
Array creation functions are essential tools for data preparation
Connection to AI/ML
Data Preprocessing: NumPy arrays are used to clean and transform data
Feature Engineering: Mathematical operations create new features
Model Implementation: Neural networks operate on NumPy arrays
Performance: Vectorized operations are essential for handling large datasets
What's Next?
Pandas: Built on NumPy, handles structured data (DataFrames)
Matplotlib: Uses NumPy arrays for plotting
Scikit-learn: NumPy arrays for machine learning algorithms
TensorFlow/PyTorch: Deep learning frameworks built on NumPy concepts
Common Beginner Mistakes to Avoid
1. Shape mismatches: Always check array shapes before operations
2. Copy vs View: Understand when NumPy creates copies vs views
3. Data types: Be aware of integer vs float operations
4. Broadcasting confusion: Practice with simple examples first
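Mistake 2 (copy vs view) deserves a live demo, since it silently corrupts data:

```python
import numpy as np

a = np.arange(6)
view = a[:3]       # slicing returns a VIEW of the same memory
view[0] = 99
print(a)           # [99  1  2  3  4  5] - the original changed!

b = np.arange(6)
safe = b[:3].copy()  # .copy() returns an independent array
safe[0] = 99
print(b)           # [0 1 2 3 4 5] - the original is untouched
```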
Additional Resources for Students
NumPy Official Documentation: https://numpy.org/doc/stable/
NumPy Quickstart Tutorial: https://numpy.org/doc/stable/user/quickstart.html
Practice Problems: HackerRank, LeetCode NumPy problems
Cheat Sheet: DataCamp NumPy cheat sheet
Teaching Tips for Instructors
Interactive Elements
Use live coding demonstrations
Encourage students to predict outputs before running code
Use real-world examples (image data, sensor readings, grades)
Show performance comparisons between NumPy and pure Python
Common Student Questions
"Why not just use Python lists?" - Show performance and memory comparisons
"What's the difference between shape and size?" - Draw visual diagrams
"When do I use reshape vs flatten?" - Give practical AI/ML examples
"How does broadcasting work?" - Use step-by-step visual examples
Assessment Ideas
Create arrays with specific shapes
Perform mathematical operations on real datasets
Debug shape mismatch errors
Optimize Python list operations using NumPy
Remember to relate every concept back to practical AI/ML applications to maintain student engagement and show relevance!
Pandas Teaching Notes - Gen AI Course
Duration: 2 Hours | Level: Beginner
Session Overview
This session introduces students to Pandas, the essential data manipulation library for AI/ML workflows. Students will learn how to prepare, clean,
and analyze datasets - crucial skills for training and fine-tuning AI models.
Learning Objectives
By the end of this session, students will be able to:
Understand Pandas' role in the AI/ML pipeline
Create and manipulate DataFrames and Series
Import and export data in various formats (CSV, JSON)
Clean and preprocess datasets for AI model training
Perform data analysis and aggregations
Handle missing data effectively
Part 1: Pandas Basics (55 minutes)
What is Pandas and Its Use Cases? (15 minutes)
What is Pandas? Pandas is Python's premier data analysis and manipulation library, built on top of NumPy. Think of it as "Excel on steroids" for
data scientists and AI engineers.
Key Features:
Data Structures: Series (1D) and DataFrame (2D)
Data Import/Export: CSV, JSON, Excel, SQL databases
Data Cleaning: Handle missing values, duplicates, data types
Data Analysis: Grouping, aggregation, statistical operations
Data Transformation: Merge, join, pivot, reshape
Why Pandas is Critical for Gen AI:
1. Dataset Preparation for LLM Training
# Example: Preparing conversational data for chatbot training
import pandas as pd
# Sample chatbot training data
chat_data = pd.DataFrame({
    'user_input': ['Hello', 'How are you?', 'Tell me a joke', 'Goodbye'],
    'bot_response': ['Hi there!', 'I am fine, thank you!', 'Why did the chicken cross the road?', 'See you later!'],
    'intent': ['greeting', 'small_talk', 'entertainment', 'farewell'],
    'confidence': [0.95, 0.88, 0.92, 0.97]
})
2. Data Preprocessing for Model Fine-tuning
Text cleaning and tokenization preparation
Feature engineering for training data
Dataset splitting (train/validation/test)
Label encoding and data normalization
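A minimal sketch of dataset splitting with pandas alone (scikit-learn's train_test_split is the usual tool; the data here is hypothetical):

```python
import pandas as pd

# Hypothetical labeled dataset; a plain shuffled 80/20 split
df = pd.DataFrame({'text': [f'sample {i}' for i in range(10)],
                   'label': [i % 2 for i in range(10)]})

train = df.sample(frac=0.8, random_state=42)  # 80% of rows, shuffled
test = df.drop(train.index)                   # the remaining 20%
print(len(train), len(test))  # 8 2
```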
3. Model Performance Analysis
Analyzing model predictions and accuracy
A/B testing results analysis
Error analysis and model debugging
4. Real-world AI Applications:
Content Generation: Preparing training datasets for GPT models
Computer Vision: Organizing image metadata and labels
Recommendation Systems: User behavior and preference analysis
Sentiment Analysis: Social media data preprocessing
Time Series AI: Financial or sensor data preparation
Introduction to Series and DataFrames (15 minutes)
Series: 1D Labeled Array Think of a Series as a single column of data with an index (like a labeled list).
import pandas as pd
import numpy as np
# Creating a Series - like a single feature column
user_ratings = pd.Series([4.5, 3.2, 5.0, 2.8, 4.1],
                         index=['user1', 'user2', 'user3', 'user4', 'user5'],
                         name='movie_rating')
print("Series Example:")
print(user_ratings)
print(f"\nSeries info:")
print(f"Data type: {user_ratings.dtype}")
print(f"Index: {user_ratings.index.tolist()}")
print(f"Values: {user_ratings.values}")
DataFrame: 2D Labeled Data Structure Think of a DataFrame as a spreadsheet or SQL table - rows and columns with labels.
# Creating a DataFrame - like a complete dataset
user_data = pd.DataFrame({
    'user_id': [101, 102, 103, 104, 105],
    'age': [25, 34, 28, 45, 31],
    'subscription': ['premium', 'basic', 'premium', 'basic', 'premium'],
    'avg_rating': [4.5, 3.2, 5.0, 2.8, 4.1],
    'total_reviews': [23, 45, 12, 67, 34]
})
print("DataFrame Example:")
print(user_data)
print(f"\nDataFrame info:")
print(f"Shape: {user_data.shape}") # (rows, columns)
print(f"Columns: {user_data.columns.tolist()}")
print(f"Index: {user_data.index.tolist()}")
Key Differences:
Feature Series DataFrame
Dimensions 1D 2D
Structure Single column Multiple columns
Use Case One feature/variable Complete dataset
Index Single index Row index + column names
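The relationship between the two structures can be shown in one line: selecting a single DataFrame column hands back a Series sharing the same row index.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

# A DataFrame is essentially a dict-like collection of Series
# that all share one row index
col = df['a']
print(type(col).__name__)          # Series
print(col.index.equals(df.index))  # True
```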
Creating DataFrames from Different Sources (15 minutes)
Method 1: From Dictionary
# Method 1: Dictionary (most common for small datasets)
ai_models = {
    'model_name': ['GPT-4', 'Claude-3', 'Gemini', 'LLaMA-2'],
    'parameters': ['1.76T', '175B', '540B', '70B'],
    'release_year': [2023, 2024, 2023, 2023],
    'accuracy_score': [95.3, 94.8, 93.2, 91.7],
    'use_case': ['general', 'reasoning', 'multimodal', 'open_source']
}
ai_df = pd.DataFrame(ai_models)
print("DataFrame from Dictionary:")
print(ai_df)
Method 2: From Lists
# Method 2: From lists (when you have separate lists)
models = ['BERT', 'RoBERTa', 'DistilBERT', 'ALBERT']
f1_scores = [0.88, 0.91, 0.85, 0.89]
training_time = [120, 150, 80, 100] # in minutes
model_performance = pd.DataFrame({
    'model': models,
    'f1_score': f1_scores,
    'training_time_min': training_time
})
print("DataFrame from Lists:")
print(model_performance)
Method 3: From CSV File (Most Common in Real Projects)
# Method 3: From CSV (real-world scenario)
# First, let's create a sample CSV for demonstration
sample_data = pd.DataFrame({
    'text': ['I love this product!', 'Terrible service', 'Great quality', 'Not satisfied'],
    'sentiment': ['positive', 'negative', 'positive', 'negative'],
    'confidence': [0.92, 0.88, 0.85, 0.79]
})
# Save to CSV
sample_data.to_csv('sentiment_data.csv', index=False)
# Read from CSV
sentiment_df = pd.read_csv('sentiment_data.csv')
print("DataFrame from CSV:")
print(sentiment_df)
Teaching Tip: Emphasize that CSV is the most common format in real AI projects.
Viewing and Understanding Data (10 minutes)
Essential Data Exploration Commands
# Create a larger dataset for demonstration
np.random.seed(42) # For reproducible results
training_data = pd.DataFrame({
    'epoch': range(1, 101),
    'train_loss': np.random.exponential(0.5, 100) + 0.1,
    'val_loss': np.random.exponential(0.6, 100) + 0.15,
    'accuracy': np.random.beta(8, 2, 100),  # values concentrated near 0.8
    'learning_rate': [0.001 * (0.95 ** (i//10)) for i in range(100)],
    'batch_size': np.random.choice([16, 32, 64], 100)
})
print("=== DATA EXPLORATION COMMANDS ===")
# 1. head() - First few rows (default 5)
print("1. First 5 rows:")
print(training_data.head())
print("\n2. First 3 rows:")
print(training_data.head(3))
# 3. tail() - Last few rows
print("\n3. Last 5 rows:")
print(training_data.tail())
# 4. info() - Data types and memory usage
print("\n4. Data types and info:")
print(training_data.info())
# 5. describe() - Statistical summary
print("\n5. Statistical summary:")
print(training_data.describe())
# 6. Additional useful methods
print(f"\n6. Additional info:")
print(f"Shape: {training_data.shape}")
print(f"Columns: {training_data.columns.tolist()}")
print(f"Memory usage: {training_data.memory_usage().sum()} bytes")
AI/ML Context: These commands are essential for:
Understanding dataset structure before training
Identifying data quality issues early
Choosing appropriate preprocessing steps
Debugging model performance issues
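Two more exploration one-liners worth showing alongside head()/info()/describe(), on a hypothetical labeled text dataset: class balance and duplicate detection.

```python
import pandas as pd

# Hypothetical labels and texts: quick quality checks before training
df = pd.DataFrame({'label': ['pos', 'neg', 'pos', 'pos', 'neg', 'pos'],
                   'text': ['a', 'b', 'a', 'c', 'd', 'e']})

print(df['label'].value_counts())          # class balance at a glance
print(df.duplicated(subset='text').sum())  # duplicate texts to investigate
```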
Part 2: Pandas Advanced (55 minutes)
Indexing, Slicing, and Filtering (15 minutes)
Column Selection
# Sample AI experiment tracking data
experiment_data = pd.DataFrame({
    'experiment_id': ['exp_001', 'exp_002', 'exp_003', 'exp_004', 'exp_005'],
    'model_type': ['CNN', 'RNN', 'Transformer', 'CNN', 'Transformer'],
    'accuracy': [0.85, 0.78, 0.92, 0.88, 0.94],
    'f1_score': [0.83, 0.76, 0.90, 0.86, 0.93],
    'training_time': [120, 200, 300, 150, 350],
    'dataset_size': [10000, 15000, 25000, 12000, 30000]
})
print("Original Dataset:")
print(experiment_data)
# Select single column
print(f"\n1. Single column (accuracy):")
print(experiment_data['accuracy'])
# Select multiple columns
print(f"\n2. Multiple columns:")
print(experiment_data[['model_type', 'accuracy', 'f1_score']])
# Select columns by position (iloc)
print(f"\n3. First 3 columns:")
print(experiment_data.iloc[:, :3])
Row Selection and Filtering
# Row selection by position
print(f"\n4. First 3 rows:")
print(experiment_data.head(3))
print(f"\n5. Specific rows by index:")
print(experiment_data.iloc[1:4]) # Rows 1, 2, 3
# Boolean filtering (VERY IMPORTANT for AI/ML)
print(f"\n6. High-performance models (accuracy > 0.9):")
high_performance = experiment_data[experiment_data['accuracy'] > 0.9]
print(high_performance)
print(f"\n7. Transformer models only:")
transformers = experiment_data[experiment_data['model_type'] == 'Transformer']
print(transformers)
print(f"\n8. Complex filtering (high accuracy AND fast training):")
efficient_models = experiment_data[
    (experiment_data['accuracy'] > 0.85) &
    (experiment_data['training_time'] < 250)
]
print(efficient_models)
AI/ML Applications:
Filter experiments by performance thresholds
Select specific model types for comparison
Identify best-performing configurations
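Two filtering idioms not shown above that students meet quickly in practice: isin() for multi-category filters and query() for readable string expressions (data here is a small made-up subset).

```python
import pandas as pd

df = pd.DataFrame({'model_type': ['CNN', 'RNN', 'Transformer', 'CNN'],
                   'accuracy': [0.85, 0.78, 0.92, 0.88]})

# isin(): keep several categories in one filter
subset = df[df['model_type'].isin(['CNN', 'Transformer'])]
print(len(subset))  # 3

# query(): the same kind of filter as a string expression
strong_cnns = df.query("accuracy > 0.8 and model_type == 'CNN'")
print(len(strong_cnns))  # 2
```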
Adding and Deleting Columns (10 minutes)
Adding Columns
print("=== ADDING COLUMNS ===")
# Add calculated columns (common in feature engineering)
experiment_data['efficiency_score'] = experiment_data['accuracy'] / (experiment_data['training_time'] / 100)
print(f"\n1. Added efficiency score:")
print(experiment_data[['experiment_id', 'accuracy', 'training_time', 'efficiency_score']])
# Add categorical column based on conditions
experiment_data['performance_category'] = experiment_data['accuracy'].apply(
    lambda x: 'Excellent' if x >= 0.9 else 'Good' if x >= 0.8 else 'Needs Improvement'
)
print(f"\n2. Added performance category:")
print(experiment_data[['experiment_id', 'accuracy', 'performance_category']])
# Add constant column
experiment_data['researcher'] = 'AI_Team_2024'
print(f"\n3. Added researcher column:")
print(experiment_data[['experiment_id', 'researcher']].head(3))
Deleting Columns
print("\n=== DELETING COLUMNS ===")
# Method 1: Using drop()
experiment_clean = experiment_data.drop('researcher', axis=1)
print(f"\n1. Dropped 'researcher' column:")
print(f"Original columns: {experiment_data.columns.tolist()}")
print(f"New columns: {experiment_clean.columns.tolist()}")
# Method 2: Using del (permanent)
experiment_copy = experiment_data.copy()
del experiment_copy['efficiency_score']
print(f"\n2. Deleted 'efficiency_score':")
print(f"Remaining columns: {experiment_copy.columns.tolist()}")
# Method 3: Drop multiple columns
minimal_data = experiment_data.drop(['researcher', 'efficiency_score'], axis=1)
print(f"\n3. Dropped multiple columns:")
print(minimal_data.head(3))
Aggregations and Group Operations (15 minutes)
Basic Aggregations
print("=== BASIC AGGREGATIONS ===")
# Sample model comparison data
model_results = pd.DataFrame({
    'model_family': ['CNN', 'CNN', 'RNN', 'RNN', 'Transformer', 'Transformer', 'CNN', 'RNN'],
    'accuracy': [0.85, 0.88, 0.78, 0.82, 0.92, 0.94, 0.86, 0.80],
    'training_time': [120, 150, 200, 180, 300, 350, 140, 190],
    'memory_usage': [2.1, 2.3, 1.8, 1.9, 4.2, 4.5, 2.2, 1.7],
    'dataset': ['small', 'large', 'small', 'large', 'small', 'large', 'medium', 'medium']
})
print("Model Results Dataset:")
print(model_results)
# Basic statistics
print(f"\n1. Overall Statistics:")
print(f"Average accuracy: {model_results['accuracy'].mean():.3f}")
print(f"Best accuracy: {model_results['accuracy'].max():.3f}")
print(f"Total experiments: {model_results['accuracy'].count()}")
print(f"Standard deviation: {model_results['accuracy'].std():.3f}")
GroupBy Operations (Critical for AI Analysis)
print(f"\n=== GROUPBY OPERATIONS ===")
# Group by model family
print(f"\n2. Performance by Model Family:")
model_performance = model_results.groupby('model_family').agg({
    'accuracy': ['mean', 'max', 'count'],
    'training_time': 'mean',
    'memory_usage': 'mean'
}).round(3)
print(model_performance)
# Group by dataset size
print(f"\n3. Performance by Dataset Size:")
dataset_analysis = model_results.groupby('dataset')['accuracy'].agg([
    'mean', 'std', 'min', 'max'
]).round(3)
print(dataset_analysis)
# Multiple grouping
print(f"\n4. Model Family vs Dataset Size:")
detailed_analysis = model_results.groupby(['model_family', 'dataset'])['accuracy'].mean().round(3)
print(detailed_analysis)
# Custom aggregations
print(f"\n5. Custom Analysis:")
custom_agg = model_results.groupby('model_family').agg({
    'accuracy': lambda x: f"{x.mean():.3f} ± {x.std():.3f}",
    'training_time': lambda x: f"{x.min()}-{x.max()} min"
})
print(custom_agg)
Handling Missing Values (15 minutes)
Understanding Missing Data
print("=== HANDLING MISSING VALUES ===")
# Create dataset with missing values (common in real AI projects)
messy_data = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'age': [25, np.nan, 30, 35, np.nan, 28, 32, 29],
    'rating': [4.5, 3.2, np.nan, 4.8, 2.1, np.nan, 4.0, 3.8],
    'review_text': ['Great!', None, 'Good product', 'Excellent', 'Poor', None, 'Amazing', 'OK'],
    'purchase_date': ['2024-01-15', '2024-02-20', None, '2024-03-10', '2024-01-25', None, '2024-02-14', '2024-03-05']
})
print("Dataset with Missing Values:")
print(messy_data)
# Detecting missing values
print(f"\n1. Missing Value Detection:")
print(f"Missing values per column:")
print(messy_data.isna().sum())
print(f"\nPercentage of missing values:")
print((messy_data.isna().sum() / len(messy_data) * 100).round(2))
# Visual representation
print(f"\n2. Missing Value Pattern:")
print(messy_data.isna())
Handling Missing Values Strategies
print(f"\n=== MISSING VALUE STRATEGIES ===")
# Strategy 1: Drop rows with any missing values
print(f"\n1. Drop rows with ANY missing values:")
clean_data = messy_data.dropna()
print(f"Original shape: {messy_data.shape}")
print(f"After dropping: {clean_data.shape}")
print(clean_data)
# Strategy 2: Drop rows only if specific columns have missing values
print(f"\n2. Drop rows only if 'rating' is missing:")
rating_clean = messy_data.dropna(subset=['rating'])
print(f"Shape after dropping missing ratings: {rating_clean.shape}")
# Strategy 3: Fill missing values
print(f"\n3. Fill Missing Values:")
# Fill with mean (for numerical data)
filled_data = messy_data.copy()
filled_data['age'] = filled_data['age'].fillna(filled_data['age'].mean())
filled_data['rating'] = filled_data['rating'].fillna(filled_data['rating'].mean())
# Fill with specific values
filled_data['review_text'] = filled_data['review_text'].fillna('No review provided')
filled_data['purchase_date'] = filled_data['purchase_date'].fillna('Unknown')
print("After filling missing values:")
print(filled_data)
# Strategy 4: Forward fill (useful for time series)
time_series_example = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=8, freq='D'),
    'sensor_value': [100, np.nan, np.nan, 150, np.nan, 200, np.nan, 180]
})
print(f"\n4. Forward Fill Example (Time Series):")
print("Original:")
print(time_series_example)
time_series_example['sensor_value_ffill'] = time_series_example['sensor_value'].ffill()
print("After forward fill:")
print(time_series_example)
AI/ML Context for Missing Values:
Training Data: Missing values can break model training
Feature Engineering: Different strategies affect model performance
Real-time Inference: Must handle missing values in production
Data Quality: Understanding missing patterns reveals data collection issues
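One more strategy worth mentioning for ordered data: interpolate() estimates each gap from its neighbors, which is often better than a global mean fill for time series (the readings below are invented).

```python
import pandas as pd
import numpy as np

# Hypothetical sensor readings with gaps
s = pd.Series([100.0, np.nan, 150.0, np.nan, 200.0])
filled = s.interpolate()  # linear interpolation between neighbors
print(filled.tolist())  # [100.0, 125.0, 150.0, 175.0, 200.0]
```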
Part 3: File Operations and Advanced Data Processing (10 minutes)
Reading JSON and CSV Files
Working with CSV Files
print("=== WORKING WITH CSV FILES ===")
# Create sample training dataset
training_dataset = pd.DataFrame({
    'text': [
        'I love this product!',
        'Terrible customer service',
        'Great quality and fast delivery',
        'Not worth the money',
        'Excellent experience overall',
        'Could be better'
    ],
    'sentiment': ['positive', 'negative', 'positive', 'negative', 'positive', 'neutral'],
    'confidence': [0.92, 0.88, 0.95, 0.85, 0.91, 0.67],
    'language': ['en', 'en', 'en', 'en', 'en', 'en']
})
# Save to CSV
training_dataset.to_csv('sentiment_training.csv', index=False)
print("Saved training dataset to CSV")
# Read from CSV with various options
df_csv = pd.read_csv('sentiment_training.csv')
print("\nRead from CSV:")
print(df_csv)
# Advanced CSV reading options
df_subset = pd.read_csv('sentiment_training.csv',
                        usecols=['text', 'sentiment'], # Only specific columns
                        nrows=3) # Only first 3 rows
print("\nSubset reading (first 3 rows, specific columns):")
print(df_subset)
Working with JSON Files
print(f"\n=== WORKING WITH JSON FILES ===")
# Create sample API response data (common in AI applications)
api_responses = [
    {
        "id": 1,
        "user_query": "What is machine learning?",
        "ai_response": "Machine learning is a subset of AI...",
        "response_time": 0.5,
        "metadata": {"model": "GPT-4", "tokens": 150}
    },
    {
        "id": 2,
        "user_query": "Explain neural networks",
        "ai_response": "Neural networks are computing systems...",
        "response_time": 0.8,
        "metadata": {"model": "Claude-3", "tokens": 200}
    }
]
# Save to JSON
import json
with open('ai_responses.json', 'w') as f:
    json.dump(api_responses, f, indent=2)
# Read JSON into DataFrame
df_json = pd.read_json('ai_responses.json')
print("Read from JSON:")
print(df_json)
Converting Nested JSON to Flat Table
print(f"\n=== FLATTENING NESTED JSON ===")
# Handle nested JSON (common in API responses)
from pandas import json_normalize
# Flatten the nested structure
df_flat = json_normalize(api_responses)
print("Flattened JSON:")
print(df_flat)
# More complex nested example
complex_data = [
    {
        "experiment_id": "exp_001",
        "model_config": {
            "type": "transformer",
            "layers": 12,
            "attention_heads": 8
        },
        "results": {
            "accuracy": 0.92,
            "f1_score": 0.89,
            "precision": 0.91
        },
        "training_details": {
            "epochs": 100,
            "batch_size": 32,
            "learning_rate": 0.001
        }
    }
]
# Flatten complex nested structure
df_complex = json_normalize(complex_data)
print("\nComplex flattened structure:")
print(df_complex.columns.tolist()) # Show all flattened column names
print(df_complex)
Summary and Real-World Applications (10 minutes)
Key Pandas Operations for AI/ML Pipeline
print("=== COMPLETE AI/ML PREPROCESSING PIPELINE ===")
# Simulate a complete preprocessing workflow
raw_data = pd.DataFrame({
    'text': ['Great product!', 'Bad service', None, 'Amazing quality', 'Poor value'],
    'rating': [5, 2, np.nan, 5, 1],
    'user_id': [101, 102, 103, 104, 105],
    'category': ['electronics', 'service', 'electronics', 'electronics', 'service'],
    'timestamp': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18', '2024-01-19']
})
print("1. Raw Data:")
print(raw_data)
# Step 1: Handle missing values
cleaned_data = raw_data.copy()
cleaned_data['text'] = cleaned_data['text'].fillna('No comment')
cleaned_data['rating'] = cleaned_data['rating'].fillna(cleaned_data['rating'].mean())
print("\n2. After handling missing values:")
print(cleaned_data)
# Step 2: Feature engineering
cleaned_data['text_length'] = cleaned_data['text'].str.len()
cleaned_data['is_positive'] = cleaned_data['rating'] >= 4
cleaned_data['timestamp'] = pd.to_datetime(cleaned_data['timestamp'])
print("\n3. After feature engineering:")
print(cleaned_data)
# Step 3: Filtering and selection
final_dataset = cleaned_data[cleaned_data['text_length'] > 5][['text', 'rating', 'category', 'is_positive']]
print("\n4. Final processed dataset:")
print(final_dataset)
# Step 4: Export for model training
final_dataset.to_csv('processed_training_data.csv', index=False)
print("\n5. Exported processed data for model training!")
Connection to Gen AI Applications
1. LLM Training Data Preparation
Clean and format conversational datasets
Handle multilingual text data
Remove duplicates and inappropriate content
Create training/validation splits
2. Fine-tuning Datasets
Prepare instruction-following datasets
Format prompt-response pairs
Balance dataset categories
Quality filtering based on metrics
3. Model Evaluation and Analysis
Analyze model performance across different categories
Track experiments and hyperparameters
Compare model versions
Identify failure patterns
4. Production Data Processing
Handle real-time user inputs
Process API response logs
Monitor model performance metrics
A/B testing analysis
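The training/validation split mentioned above takes only a couple of lines of plain Pandas. A minimal sketch — the column names and rows here are made up for illustration:

```python
import pandas as pd

# Hypothetical prompt-response pairs (illustrative data only)
pairs = pd.DataFrame({
    'prompt': ['Summarize this article', 'Translate to French',
               'Explain transformers', 'Classify this review'],
    'response': ['...', '...', '...', '...'],
})

# Shuffle, then split 80/20 into training and validation sets
train_df = pairs.sample(frac=0.8, random_state=42)
val_df = pairs.drop(train_df.index)
print(len(train_df), len(val_df))  # 3 1
```

Because `val_df` is built by dropping the sampled index, the two sets are guaranteed to be disjoint.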
Best Practices for AI/ML with Pandas
1. Always Explore First: Use head(), info(), describe() before processing
2. Handle Missing Values Thoughtfully: Choose strategy based on data context
3. Validate Data Quality: Check for duplicates, outliers, inconsistencies
4. Document Preprocessing Steps: Keep track of transformations for reproducibility
5. Save Intermediate Results: Don't lose preprocessing work
6. Use Meaningful Column Names: Make code self-documenting
7. Consider Memory Usage: Use appropriate data types for large datasets
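Practice 7 is easy to demonstrate concretely. A sketch — the exact byte counts depend on your pandas version, so treat the printed numbers as illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['electronics', 'service'] * 5000,
    'rating': [5, 2] * 5000,
})
before = df.memory_usage(deep=True).sum()

# Repeated strings compress well as 'category'; small ints fit in int8
df['category'] = df['category'].astype('category')
df['rating'] = df['rating'].astype('int8')
after = df.memory_usage(deep=True).sum()

print(f"Memory: {before:,} -> {after:,} bytes")
```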
What's Next?
Building on Pandas Foundation:
Matplotlib/Seaborn: Visualizing your Pandas data
Scikit-learn: Using Pandas DataFrames directly in ML models
Streamlit: Building interactive web apps with Pandas
Dask: Scaling Pandas to larger-than-memory datasets
Advanced Pandas for AI:
Time Series Analysis: For sequential data and forecasting
Text Processing: String operations for NLP preprocessing
Multi-indexing: Complex hierarchical data structures
Performance Optimization: Vectorization and memory management
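As a small preview of the text-processing operations listed above, the `.str` accessor chains string methods across a whole Series. A sketch with made-up review strings:

```python
import pandas as pd

reviews = pd.Series(['  Love this product!  ', 'TERRIBLE quality', None])

# Chained .str operations: trim whitespace, lowercase, count words
cleaned = reviews.str.strip().str.lower()
word_counts = cleaned.str.split().str.len()

print(cleaned.tolist())
print(word_counts.tolist())
```

Note that missing values (`None`) pass through as `NaN` rather than raising, which is what makes `.str` safe for messy text columns.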
Hands-On Exercise (Remaining Time)
Complete Preprocessing Pipeline Exercise
Give students this practical exercise:
# Exercise: Prepare customer feedback data for sentiment analysis model
import pandas as pd
import numpy as np
# Sample messy customer feedback data
messy_feedback = pd.DataFrame({
    'review_id': range(1, 11),
    'customer_text': [
        'Love this product!',
        'terrible quality',
        None,
        'AMAZING SERVICE!!!',
        'not worth it',
        'pretty good',
        None,
        'Excellent experience',
        'could be better',
        'WORST PURCHASE EVER'
    ],
    'star_rating': [5, 1, np.nan, 5, 2, 4, np.nan, 5, 3, 1],
    'purchase_verified': [True, True, False, True, True, None, True, True, False, True],
    'review_date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18',
                    '2024-01-19', '2024-01-20', '2024-01-21', '2024-01-22',
                    '2024-01-23', '2024-01-24']
})
# Student tasks:
print("Student Tasks:")
print("1. Explore the dataset using head(), info(), describe()")
print("2. Identify and handle missing values appropriately")
print("3. Create new features: text_length, sentiment_label (positive/negative/neutral)")
print("4. Filter out unverified purchases")
print("5. Group by sentiment_label and calculate statistics")
print("6. Export the cleaned dataset for model training")
# This exercise combines all concepts learned in the session!
Assessment and Troubleshooting
Common Student Mistakes and Solutions
1. Forgetting to handle missing values
Solution: Always check with df.isnull().sum() first
2. Modifying original DataFrames accidentally
Solution: Use df.copy() when experimenting
3. Confusing iloc and loc
Solution: iloc = position-based, loc = label-based
4. Not understanding boolean indexing
Solution: Break complex filters into steps
5. Inefficient loops instead of vectorization
Solution: Use Pandas built-in methods and .apply()
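Mistake 5 is worth showing side by side. A minimal sketch with made-up ratings — the loop and the vectorized expression produce the same labels, but the vectorized version avoids Python-level iteration entirely:

```python
import pandas as pd

df = pd.DataFrame({'rating': [5, 2, 4, 1, 3]})

# Anti-pattern: explicit Python loop over rows
labels_loop = []
for r in df['rating']:
    labels_loop.append('positive' if r >= 4 else 'negative')

# Vectorized equivalent: one expression, no loop, much faster on large frames
labels_vec = (df['rating'] >= 4).map({True: 'positive', False: 'negative'})

assert labels_loop == labels_vec.tolist()
print(labels_vec.tolist())
```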
Quick Reference Cheat Sheet
# Essential Pandas Operations for AI/ML
import pandas as pd
# Data Loading
df = pd.read_csv('data.csv')
df = pd.read_json('data.json')
# Data Exploration
df.head(), df.tail(), df.info(), df.describe()
df.shape, df.columns, df.dtypes
# Data Selection
df['column'], df[['col1', 'col2']]
df.iloc[0:5], df.loc[df['col'] > 5]
# Data Cleaning
df.isnull().sum(), df.dropna(), df.fillna(value)
df.drop_duplicates(), df.drop('column', axis=1)
# Data Analysis
df.groupby('column').mean()
df['column'].value_counts()
df.corr(), df.sort_values('column')
# Data Export
df.to_csv('output.csv', index=False)
df.to_json('output.json')
This comprehensive foundation in Pandas will enable students to handle any dataset they encounter in their AI/ML journey!
Matplotlib Teaching Notes - Gen AI Course
Duration: 2 Hours | Level: Beginner
Session Overview
This session introduces students to Matplotlib, the foundational plotting library for data visualization in Python. Students will learn to create
compelling visualizations essential for AI/ML model analysis, data exploration, and result presentation.
Learning Objectives
By the end of this session, students will be able to:
Understand Matplotlib's role in the AI/ML workflow
Create essential plot types for AI data analysis
Customize visualizations for professional presentation
Use subplots for comprehensive model comparison
Apply appropriate chart types for different AI/ML scenarios
Part 1: Matplotlib Basics (60 minutes)
Introduction to Matplotlib and Its AI/ML Context (15 minutes)
What is Matplotlib? Matplotlib is Python's foundational plotting library, providing the building blocks for data visualization. Think of it as the
"photoshop" for data - it creates the visual stories that make complex AI insights understandable.
Why Matplotlib is Critical for Gen AI:
1. Model Performance Visualization
Training/validation loss curves
Accuracy progression over epochs
Learning rate schedules
Confusion matrices and ROC curves
2. Data Exploration and EDA
Dataset distribution analysis
Feature correlation visualization
Outlier detection
Data quality assessment
3. Research and Communication
Academic paper figures
Business presentation charts
Model comparison visualizations
A/B testing results
4. Real-world AI Applications:
LLM Training: Visualizing loss curves and perplexity scores
Computer Vision: Plotting accuracy across different image categories
Recommendation Systems: Showing user engagement patterns
Time Series AI: Forecasting visualizations
Reinforcement Learning: Reward progression charts
AI Industry Context: Every major AI breakthrough is accompanied by compelling visualizations:
OpenAI's GPT papers feature training loss curves
DeepMind's papers show performance comparisons
Tesla's Autopilot progress is tracked through visual metrics
Google's AI research relies heavily on matplotlib-generated figures
Understanding pyplot and Plotting Syntax (10 minutes)
The pyplot Interface
import matplotlib.pyplot as plt
import numpy as np
# The magic command for Jupyter notebooks
%matplotlib inline
# Basic plotting syntax follows a simple pattern:
# 1. Create data
# 2. Plot data
# 3. Customize plot
# 4. Display plot
print("Matplotlib: Where data becomes insight!")
The Anatomy of a Plot
# Simple example: AI model accuracy over training epochs
epochs = np.arange(1, 11) # Training epochs 1-10
accuracy = [0.45, 0.62, 0.71, 0.78, 0.83, 0.87, 0.89, 0.91, 0.92, 0.93]
# Create the plot
plt.figure(figsize=(8, 6)) # Set figure size
plt.plot(epochs, accuracy) # Create line plot
plt.title('AI Model Training Progress')
plt.xlabel('Training Epochs')
plt.ylabel('Accuracy')
plt.grid(True) # Add grid for better readability
plt.show() # Display the plot
print("This is exactly how AI engineers track model training!")
Key Concepts:
Figure: The entire plotting window
Axes: The area where data is plotted
Plot Elements: Lines, markers, text, legends
Customization: Colors, styles, labels, titles
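These pieces map directly onto Matplotlib's object-oriented interface, which the subplot examples in Part 2 also use. A minimal sketch (the Agg backend line only keeps it runnable without a display; in Jupyter, `%matplotlib inline` replaces it):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; omit in Jupyter notebooks
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))           # Figure (window) + Axes (plot area)
ax.plot([1, 2, 3], [0.5, 0.7, 0.9], marker='o')  # a plot element on the Axes
ax.set_title('Accuracy over Epochs')             # customization via Axes methods
ax.set_xlabel('Epoch')
ax.set_ylabel('Accuracy')
fig.savefig('anatomy_demo.png')                  # a Figure-level operation
```

Keeping the Figure and Axes in explicit variables pays off as soon as a plot has more than one panel.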
Line Plots: The Foundation of AI Visualization (10 minutes)
Basic Line Plots for AI Applications
# Example 1: Training vs Validation Loss (Most Important AI Plot!)
epochs = np.arange(1, 21)
train_loss = np.exp(-epochs/8) + 0.1 + np.random.normal(0, 0.02, 20)
val_loss = np.exp(-epochs/6) + 0.15 + np.random.normal(0, 0.03, 20)
plt.figure(figsize=(10, 6))
plt.plot(epochs, train_loss, label='Training Loss', color='blue')
plt.plot(epochs, val_loss, label='Validation Loss', color='red')
plt.title('Model Training Progress: Loss Curves')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print("This plot can save weeks of debugging in AI projects!")
Multiple Line Plots: Model Comparison
# Example 2: Comparing different AI models
epochs = np.arange(1, 16)
gpt_accuracy = [0.5 + 0.4*(1 - np.exp(-i/5)) + np.random.normal(0, 0.01) for i in epochs]
bert_accuracy = [0.4 + 0.45*(1 - np.exp(-i/4)) + np.random.normal(0, 0.01) for i in epochs]
transformer_accuracy = [0.45 + 0.5*(1 - np.exp(-i/6)) + np.random.normal(0, 0.01) for i in epochs]
plt.figure(figsize=(12, 7))
plt.plot(epochs, gpt_accuracy, 'b-', linewidth=2, label='GPT Model')
plt.plot(epochs, bert_accuracy, 'r--', linewidth=2, label='BERT Model')
plt.plot(epochs, transformer_accuracy, 'g:', linewidth=2, label='Custom Transformer')
plt.title('AI Model Performance Comparison', fontsize=16)
plt.xlabel('Training Epochs', fontsize=12)
plt.ylabel('Accuracy Score', fontsize=12)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.ylim(0, 1) # Set y-axis limits
plt.show()
print("This visualization helps choose the best AI architecture!")
Line Plot Variations
# Example 3: Learning Rate Scheduling
epochs = np.arange(1, 101)
learning_rate = 0.001 * (0.95 ** (epochs // 10))
plt.figure(figsize=(10, 6))
plt.plot(epochs, learning_rate, 'purple', linewidth=2, marker='o', markersize=3)
plt.title('Learning Rate Schedule for Neural Network Training')
plt.xlabel('Training Epochs')
plt.ylabel('Learning Rate')
plt.yscale('log') # Logarithmic scale for learning rates
plt.grid(True, alpha=0.3)
plt.show()
print("Learning rate schedules are crucial for stable AI training!")
Bar Plots: Categorical AI Data Visualization (10 minutes)
Basic Bar Plots for AI Metrics
# Example 1: Model Performance Across Categories
categories = ['Text Classification', 'Image Recognition', 'Speech Processing', 'Translation', 'Summarization']
accuracy_scores = [0.92, 0.89, 0.85, 0.94, 0.87]
plt.figure(figsize=(12, 7))
bars = plt.bar(categories, accuracy_scores, color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7'])
plt.title('AI Model Performance Across Different Tasks', fontsize=16)
plt.xlabel('AI Task Categories', fontsize=12)
plt.ylabel('Accuracy Score', fontsize=12)
plt.ylim(0, 1)
# Add value labels on bars
for bar, score in zip(bars, accuracy_scores):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{score:.2f}', ha='center', va='bottom', fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
print("This shows which AI tasks your model excels at!")
Horizontal Bar Plots: Feature Importance
# Example 2: Feature Importance in ML Model
features = ['Word Frequency', 'Sentence Length', 'Sentiment Score', 'Topic Category', 'Grammar Score']
importance = [0.35, 0.15, 0.25, 0.18, 0.07]
plt.figure(figsize=(10, 6))
bars = plt.barh(features, importance, color='skyblue', edgecolor='navy', alpha=0.7)
plt.title('Feature Importance in Text Classification Model', fontsize=14)
plt.xlabel('Importance Score', fontsize=12)
# Add value labels
for i, bar in enumerate(bars):
    plt.text(bar.get_width() + 0.01, bar.get_y() + bar.get_height()/2,
             f'{importance[i]:.2f}', va='center', fontweight='bold')
plt.tight_layout()
plt.show()
print("Feature importance helps understand what your AI model focuses on!")
Grouped Bar Charts: Model Comparison
# Example 3: Comparing metrics across different models
models = ['BERT', 'GPT-3', 'T5', 'RoBERTa']
precision = [0.89, 0.91, 0.87, 0.90]
recall = [0.86, 0.88, 0.85, 0.89]
f1_score = [0.87, 0.89, 0.86, 0.89]
x = np.arange(len(models))
width = 0.25
plt.figure(figsize=(12, 7))
plt.bar(x - width, precision, width, label='Precision', color='lightcoral', alpha=0.8)
plt.bar(x, recall, width, label='Recall', color='lightblue', alpha=0.8)
plt.bar(x + width, f1_score, width, label='F1-Score', color='lightgreen', alpha=0.8)
plt.title('NLP Model Performance Comparison', fontsize=16)
plt.xlabel('Models', fontsize=12)
plt.ylabel('Score', fontsize=12)
plt.xticks(x, models)
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.ylim(0, 1)
plt.tight_layout()
plt.show()
print("This helps choose the best NLP model for your specific needs!")
Scatter Plots: Relationship Analysis in AI (10 minutes)
Basic Scatter Plots for AI Data
# Example 1: Model Size vs Performance
model_size = [100, 250, 500, 1000, 1750, 3500, 7000, 15000] # Million parameters
accuracy = [0.78, 0.83, 0.87, 0.91, 0.93, 0.94, 0.95, 0.96]
plt.figure(figsize=(10, 7))
plt.scatter(model_size, accuracy, c='red', s=100, alpha=0.7, edgecolors='black')
plt.title('AI Model Size vs Performance Trade-off', fontsize=16)
plt.xlabel('Model Size (Million Parameters)', fontsize=12)
plt.ylabel('Accuracy Score', fontsize=12)
plt.grid(True, alpha=0.3)
# Add trend line
z = np.polyfit(model_size, accuracy, 2)
p = np.poly1d(z)
plt.plot(model_size, p(model_size), "r--", alpha=0.8, linewidth=2)
plt.tight_layout()
plt.show()
print("This reveals the diminishing returns of larger AI models!")
Color-coded Scatter Plots: Multi-dimensional Analysis
# Example 2: Training Data Analysis
data_size = np.random.randint(1000, 50000, 50)
training_time = data_size * 0.02 + np.random.normal(0, 100, 50)
model_type = np.random.choice(['CNN', 'RNN', 'Transformer'], 50)
# Create color map for model types
colors = {'CNN': 'red', 'RNN': 'blue', 'Transformer': 'green'}
scatter_colors = [colors[mt] for mt in model_type]
plt.figure(figsize=(12, 8))
for model in ['CNN', 'RNN', 'Transformer']:
    mask = model_type == model
    plt.scatter(data_size[mask], training_time[mask],
                c=colors[model], label=model, s=80, alpha=0.7)
plt.title('Training Time vs Dataset Size by Model Type', fontsize=16)
plt.xlabel('Dataset Size (samples)', fontsize=12)
plt.ylabel('Training Time (minutes)', fontsize=12)
plt.legend(title='Model Type')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("This helps predict training costs for different AI architectures!")
Bubble Charts: Three-dimensional Data
# Example 3: AI Model Evaluation Dashboard
models = ['BERT-Base', 'BERT-Large', 'GPT-2', 'GPT-3', 'T5-Small', 'T5-Large']
accuracy = [0.89, 0.92, 0.87, 0.94, 0.88, 0.91]
inference_speed = [150, 80, 200, 60, 180, 90] # tokens per second
model_size = [110, 340, 117, 175000, 60, 770] # million parameters
plt.figure(figsize=(12, 8))
# Bubble size represents model size
bubble_sizes = [size/100 for size in model_size] # Scale down for visualization
scatter = plt.scatter(accuracy, inference_speed, s=bubble_sizes,
                      c=model_size, cmap='viridis', alpha=0.7, edgecolors='black')
plt.title('AI Model Performance Dashboard\n(Bubble size = Model Size)', fontsize=16)
plt.xlabel('Accuracy Score', fontsize=12)
plt.ylabel('Inference Speed (tokens/sec)', fontsize=12)
# Add colorbar for model size
cbar = plt.colorbar(scatter)
cbar.set_label('Model Size (Million Parameters)', fontsize=11)
# Add model labels
for i, model in enumerate(models):
    plt.annotate(model, (accuracy[i], inference_speed[i]),
                 xytext=(5, 5), textcoords='offset points', fontsize=9)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("This dashboard view helps select the optimal AI model!")
Titles, Labels, and Legends: Professional Presentation (5 minutes)
Complete Plot Customization
# Example: Professional AI Research Figure
epochs = np.arange(1, 26)
baseline_loss = 2.5 * np.exp(-epochs/10) + 0.3
our_model_loss = 2.0 * np.exp(-epochs/8) + 0.25
sota_model_loss = 1.8 * np.exp(-epochs/12) + 0.2
plt.figure(figsize=(12, 8))
# Plot with custom styling
plt.plot(epochs, baseline_loss, 'b--', linewidth=2.5,
         label='Baseline Model', marker='o', markersize=4)
plt.plot(epochs, our_model_loss, 'r-', linewidth=2.5,
         label='Our Proposed Model', marker='s', markersize=4)
plt.plot(epochs, sota_model_loss, 'g-.', linewidth=2.5,
         label='State-of-the-Art Model', marker='^', markersize=4)
# Professional customization
plt.title('Training Loss Comparison: Novel Architecture vs Baselines',
          fontsize=18, fontweight='bold', pad=20)
plt.xlabel('Training Epochs', fontsize=14, fontweight='bold')
plt.ylabel('Cross-Entropy Loss', fontsize=14, fontweight='bold')
# Enhanced legend
plt.legend(loc='upper right', fontsize=12, frameon=True,
           shadow=True, fancybox=True, framealpha=0.9)
# Professional grid and styling
plt.grid(True, alpha=0.3, linestyle='-', linewidth=0.5)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add annotations
min_epoch = np.argmin(our_model_loss) + 1
min_loss = our_model_loss[np.argmin(our_model_loss)]
plt.annotate(f'Best Performance\nEpoch {min_epoch}: {min_loss:.3f}',
             xy=(min_epoch, min_loss), xytext=(min_epoch+5, min_loss+0.3),
             arrowprops=dict(arrowstyle='->', color='red', lw=1.5),
             fontsize=11, ha='center',
             bbox=dict(boxstyle="round,pad=0.3", facecolor="yellow", alpha=0.7))
plt.tight_layout()
plt.show()
print("This is publication-ready AI research visualization!")
Part 2: Matplotlib Advanced (60 minutes)
Subplots: Comprehensive AI Analysis (15 minutes)
Basic Subplots for Model Analysis
# Example 1: Complete Training Dashboard
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# Data for different metrics
epochs = np.arange(1, 21)
train_loss = np.exp(-epochs/8) + 0.1 + np.random.normal(0, 0.02, 20)
val_loss = np.exp(-epochs/6) + 0.15 + np.random.normal(0, 0.03, 20)
train_acc = 1 - np.exp(-epochs/5) * 0.8 + np.random.normal(0, 0.01, 20)
val_acc = 1 - np.exp(-epochs/4) * 0.85 + np.random.normal(0, 0.015, 20)
# Subplot 1: Loss curves
axes[0, 0].plot(epochs, train_loss, 'b-', label='Training Loss', linewidth=2)
axes[0, 0].plot(epochs, val_loss, 'r-', label='Validation Loss', linewidth=2)
axes[0, 0].set_title('Model Loss Progression', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Epochs')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Subplot 2: Accuracy curves
axes[0, 1].plot(epochs, train_acc, 'b-', label='Training Accuracy', linewidth=2)
axes[0, 1].plot(epochs, val_acc, 'r-', label='Validation Accuracy', linewidth=2)
axes[0, 1].set_title('Model Accuracy Progression', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Epochs')
axes[0, 1].set_ylabel('Accuracy')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Subplot 3: Learning rate schedule
learning_rates = [0.001 * (0.9 ** (i//5)) for i in epochs]
axes[1, 0].plot(epochs, learning_rates, 'g-', linewidth=2, marker='o')
axes[1, 0].set_title('Learning Rate Schedule', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Epochs')
axes[1, 0].set_ylabel('Learning Rate')
axes[1, 0].set_yscale('log')
axes[1, 0].grid(True, alpha=0.3)
# Subplot 4: Model complexity
layers = [2, 4, 6, 8, 10, 12]
performance = [0.75, 0.83, 0.89, 0.92, 0.94, 0.93]
axes[1, 1].bar(layers, performance, color='purple', alpha=0.7)
axes[1, 1].set_title('Model Depth vs Performance', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Number of Layers')
axes[1, 1].set_ylabel('Test Accuracy')
axes[1, 1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.suptitle('Complete AI Model Training Dashboard', fontsize=18, fontweight='bold', y=1.02)
plt.show()
print("This dashboard gives you complete visibility into AI model training!")
Advanced Subplot Configurations
# Example 2: AI Model Comparison Grid
models = ['BERT', 'GPT-2', 'T5', 'RoBERTa', 'ELECTRA', 'DeBERTa']
tasks = ['Classification', 'Generation', 'Q&A', 'Summarization']
# Create performance matrix
np.random.seed(42)
performance_matrix = np.random.rand(len(tasks), len(models)) * 0.3 + 0.65
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()
for i, model in enumerate(models):
    task_performance = performance_matrix[:, i]
    bars = axes[i].bar(tasks, task_performance,
                       color=plt.cm.Set3(i), alpha=0.8, edgecolor='black')
    axes[i].set_title(f'{model} Performance', fontsize=12, fontweight='bold')
    axes[i].set_ylabel('Accuracy Score')
    axes[i].set_ylim(0.6, 1.0)
    axes[i].tick_params(axis='x', rotation=45)
    # Add value labels
    for bar, score in zip(bars, task_performance):
        axes[i].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                     f'{score:.2f}', ha='center', va='bottom', fontsize=10)
plt.tight_layout()
plt.suptitle('AI Model Performance Across Different Tasks', fontsize=20, fontweight='bold', y=1.02)
plt.show()
print("This comparison helps choose the right model for each AI task!")
Histograms: Distribution Analysis for AI (15 minutes)
Basic Histograms for AI Data
# Example 1: Model Prediction Confidence Distribution
np.random.seed(42)
high_conf_predictions = np.random.beta(8, 2, 1000) # Skewed towards high confidence
low_conf_predictions = np.random.beta(2, 5, 500) # Skewed towards low confidence
all_predictions = np.concatenate([high_conf_predictions, low_conf_predictions])
plt.figure(figsize=(12, 8))
plt.hist(all_predictions, bins=30, alpha=0.7, color='skyblue',
         edgecolor='black', density=True)
plt.title('AI Model Prediction Confidence Distribution', fontsize=16, fontweight='bold')
plt.xlabel('Confidence Score', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.axvline(all_predictions.mean(), color='red', linestyle='--',
            linewidth=2, label=f'Mean: {all_predictions.mean():.3f}')
plt.axvline(np.median(all_predictions), color='green', linestyle='--',
            linewidth=2, label=f'Median: {np.median(all_predictions):.3f}')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("This helps assess whether your AI model is over-confident or under-confident!")
Comparative Histograms
# Example 2: Training Data Quality Analysis
np.random.seed(42)
clean_data_scores = np.random.normal(0.8, 0.1, 1000)
noisy_data_scores = np.random.normal(0.6, 0.2, 800)
plt.figure(figsize=(12, 8))
plt.hist(clean_data_scores, bins=25, alpha=0.6, label='Clean Training Data',
         color='green', density=True, edgecolor='black')
plt.hist(noisy_data_scores, bins=25, alpha=0.6, label='Noisy Training Data',
         color='red', density=True, edgecolor='black')
plt.title('Training Data Quality Distribution Comparison', fontsize=16, fontweight='bold')
plt.xlabel('Data Quality Score', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
# Add statistical annotations
plt.axvline(clean_data_scores.mean(), color='green', linestyle=':',
            linewidth=2, alpha=0.8)
plt.axvline(noisy_data_scores.mean(), color='red', linestyle=':',
            linewidth=2, alpha=0.8)
plt.text(0.85, 3, f'Clean Data\nMean: {clean_data_scores.mean():.3f}\nStd: {clean_data_scores.std():.3f}',
         bbox=dict(boxstyle="round,pad=0.3", facecolor="lightgreen", alpha=0.7))
plt.text(0.3, 2, f'Noisy Data\nMean: {noisy_data_scores.mean():.3f}\nStd: {noisy_data_scores.std():.3f}',
         bbox=dict(boxstyle="round,pad=0.3", facecolor="lightcoral", alpha=0.7))
plt.tight_layout()
plt.show()
print("This analysis helps decide if you need data cleaning before AI training!")
2D Histograms: Feature Correlation
# Example 3: Feature Correlation Analysis
np.random.seed(42)
feature1 = np.random.multivariate_normal([0.7, 0.6], [[0.02, 0.01], [0.01, 0.03]], 500)
feature2 = np.random.multivariate_normal([0.3, 0.8], [[0.03, -0.01], [-0.01, 0.02]], 300)
all_features = np.vstack([feature1, feature2])
x_features = all_features[:, 0]
y_features = all_features[:, 1]
plt.figure(figsize=(12, 10))
# Create 2D histogram
plt.hist2d(x_features, y_features, bins=25, cmap='Blues', alpha=0.8)
plt.colorbar(label='Frequency')
plt.title('Feature Correlation Heatmap for AI Model Input', fontsize=16, fontweight='bold')
plt.xlabel('Feature 1 (e.g., Text Sentiment)', fontsize=12)
plt.ylabel('Feature 2 (e.g., Text Length)', fontsize=12)
# Add marginal histograms
# Top histogram (marginal for x)
ax_top = plt.gca().inset_axes([0, 1.02, 1, 0.2])
ax_top.hist(x_features, bins=25, color='lightblue', alpha=0.7, edgecolor='black')
ax_top.set_xlim(plt.gca().get_xlim())
ax_top.tick_params(labelbottom=False)
# Right histogram (marginal for y)
ax_right = plt.gca().inset_axes([1.02, 0, 0.2, 1])
ax_right.hist(y_features, bins=25, orientation='horizontal',
              color='lightblue', alpha=0.7, edgecolor='black')
ax_right.set_ylim(plt.gca().get_ylim())
ax_right.tick_params(labelleft=False)
plt.tight_layout()
plt.show()
print("This reveals feature correlations that affect AI model performance!")
Pie Charts: Categorical Analysis for AI (10 minutes)
Basic Pie Charts for AI Metrics
# Example 1: AI Model Error Analysis
error_types = ['False Positives', 'False Negatives', 'True Positives', 'True Negatives']
error_counts = [120, 80, 1650, 1850]
colors = ['#FF6B6B', '#FFE66D', '#4ECDC4', '#95E1D3']
plt.figure(figsize=(10, 8))
wedges, texts, autotexts = plt.pie(error_counts, labels=error_types, colors=colors,
                                   autopct='%1.1f%%', startangle=90, explode=(0.1, 0.1, 0, 0))
plt.title('AI Model Prediction Analysis\n(Confusion Matrix Distribution)',
          fontsize=16, fontweight='bold', pad=20)
# Enhance text
for autotext in autotexts:
    autotext.set_color('white')
    autotext.set_fontweight('bold')
    autotext.set_fontsize(12)
# Add legend with counts
legend_labels = [f'{label}: {count}' for label, count in zip(error_types, error_counts)]
plt.legend(wedges, legend_labels, title="Prediction Counts",
           loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))
plt.tight_layout()
plt.show()
print("This pie chart helps understand your AI model's error patterns!")
Advanced Pie Charts with Subplots
# Example 2: AI Model Resource Usage Comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# Model A: Lightweight model
model_a_resources = [20, 15, 10, 55] # Training, Inference, Storage, Available
labels = ['Training', 'Inference', 'Storage', 'Available']
colors = ['#FF9999', '#66B2FF', '#99FF99', '#FFCC99']
axes[0].pie(model_a_resources, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
axes[0].set_title('Lightweight AI Model\nResource Usage', fontsize=14, fontweight='bold')
# Model B: Heavy model
model_b_resources = [45, 25, 20, 10]
axes[1].pie(model_b_resources, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
axes[1].set_title('Heavy AI Model\nResource Usage', fontsize=14, fontweight='bold')
# Model C: Optimized model
model_c_resources = [30, 18, 12, 40]
axes[2].pie(model_c_resources, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
axes[2].set_title('Optimized AI Model\nResource Usage', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()