Generative AI

Name: Generative AI Nanodegree Program
Rating: 4.9 (91 reviews)

Ready to build production-grade AI? This program equips developers to deploy reliable generative AI solutions. We'll move past theory and focus on the proven implementation patterns you need. You'll master production essentials like model selection, cost estimation, and reliable prompt engineering to build efficient apps. You'll also implement lightweight model adaptation using PEFT. Then, you'll build end-to-end RAG systems, using vector databases to connect LLMs to your data and evaluate quality with frameworks like RAGAs. Finally, you'll dive into advanced multimodal applications that process text, images, and audio. You'll enforce structured outputs with Pydantic and implement system observability to build, trace, and debug modern AI apps.

Nanodegree Program

Intermediate
56 hours
4.9 (91)
Updated: Dec 17, 2025

Subscription · Monthly

Cancel Anytime
Unlimited access to hundreds of top-rated courses
Hands-on projects with expert feedback
Personalized career coaching and interview prep
Program Certificates

Skills you'll learn

53 skills

Image pre-processing
Word embeddings
Ethical AI
AI Audio and Speech Analysis
Diffusion Models

Prerequisites

12 prerequisites

Prior to enrolling, you should have the following knowledge:

You will also need to be able to communicate fluently and professionally in written and spoken English.

Program Outline

3 courses
82 lessons
3 projects

Course 1: Generative AI Fundamentals

Employ the abilities of Generative AI with a deep dive into fundamentals. This course examines how various models are developed, how they work, and how to use them to their full potential.

21 hours

Introduction to Generative AI Foundations
Explore core principles, tools, and ethical use of Generative AI, and discover its real-world impact and foundational models powering creative applications.
Generative AI Overview
Explore the fundamentals of generative AI, its key modalities, advanced capabilities, and essential ethical considerations shaping responsible AI development.
Applications of Generative AI
Explore real-world applications of Generative AI, including LLM-assisted coding, and learn to prompt, validate, and improve AI-generated code and tests.
Introduction to Foundation Models
Discover foundation models: large, versatile AI systems trained on massive datasets that generalize across tasks, surpassing traditional models in scalability and adaptability.
Building Applications using Foundation Models
Learn to build text classifiers with foundation models, using zero-shot and few-shot prompt engineering for tasks like sentiment and spam detection, and evaluate classifier accuracy.
How Generative AI Works
Learn how generative AI creates new data with architectures like Transformers and diffusion models, and how training enables creativity, reasoning, and task-specific abilities.
Evaluating Generative AI Models
Learn how to assess generative AI using human evaluation, exact metrics, AI judges, and benchmarks, ensuring robust performance for open-ended, probabilistic model outputs.
Implementing Evaluations for Generative AI Models
Learn practical techniques to evaluate generative AI models, from Exact Match to ROUGE, semantic similarity, code correctness, Pass@k, and LLM-as-a-Judge scoring.
Neural Networks and Multilayer Perceptrons
Explore neural networks from perceptrons to multilayer perceptrons, learning how they adapt via training, gradient descent, and backpropagation to solve complex AI tasks.
Implementing Neural Networks using Pytorch
Learn to implement neural networks in PyTorch by mastering tensors, model building, loss functions, optimizers, data loading, and complete training loops for practical machine learning.
Model Interpretability and Ethics
Explore AI model interpretability and ethics, including bias, misinformation, environmental impact, and fairness for responsible development and deployment of AI technologies.
Generating Text using LLMs
Discover how LLMs generate text token by token using Hugging Face's Transformers, from tokenization to model use, and explore hands-on demos with efficient generation methods.
Role-Based Prompting
Explains the theory of using roles or personas to control the tone, style, and expertise of an LLM's output.
Implementing Role-Based Prompting with Python
Provides hands-on practice in iteratively developing a role-based prompt to create a believable historical figure persona.
Adapting Foundation Models
Learn to adapt foundation models for specialized tasks using prompt engineering, RAG, fine-tuning, model compression, and agentic AI tools for efficient, tailored AI solutions.
Applying PEFT on Foundation Models
Learn to efficiently customize foundation models with PEFT and SFT, using LoRA to teach LLMs new skills like spelling via hands-on data preparation and fine-tuning.
Post-Training Foundation Models
Explore post-training for foundation models, including supervised and preference fine-tuning, to align AI with human values, improve usability, and ensure responsible interactions.
Reinforcement Fine-tuning on Foundation Models
Learn to fine-tune LLMs for structured tasks like counting and spelling using GRPO and LoRA, applying reinforcement-based reward functions for targeted skill improvements.

Teaching an LLM to Count!
Teaching an LLM to count the number of letters in a word using GRPO.

Course 2: Large Language Models (LLMs) and Retrieval Augmented Generation (RAG)

Master Large Language Models (LLMs) and build sophisticated text generation applications in this hands-on course. You’ll master prompt engineering techniques, optimize model selection and costs, and dive deep into Retrieval-Augmented Generation (RAG), using vector databases to ground AI responses in external data and eliminate hallucinations. Finally, you’ll evaluate system performance with RAGAS and showcase your skills by building an end-to-end RAG application.

17 hours

Introduction to LLMs and Retrieval Augmented Generation (RAG)
Introduces Large Language Models (LLMs), their core concepts, and the course structure. Covers prerequisites, environment setup, and defines Retrieval-Augmented Generation (RAG).
The Large Language Model Landscape
Explore the four core capabilities of LLMs: generation, summarization, classification, and reasoning. Real-world applications and the importance of RAG for building trust.
Implementing a Chatbot with an LLM
Learn to build a stateful chatbot using an LLM. Covers managing conversation history, using system prompts to define behavior, and understanding message roles (system, user, assistant).
LLM Prompting & Inference
Defines prompt engineering and its components. Explains how to control LLM outputs using inference parameters like temperature, top-p, max tokens, and stop sequences.
Applied Prompting and Inference
Apply prompting techniques hands-on. Implement Chain of Thought (CoT) prompting to improve reasoning and test how different inference parameters change model behavior.
Prompt Instruction Refinement
Explains the theory of systematically refining prompt instructions by modifying components like Role, Task, Context, Examples, and Output Format.
Applying Prompt Instruction Refinement with Python
Provides hands-on practice iteratively refining a prompt to transform a generic recipe analyzer into a precise dietary consultant that produces structured JSON.
Tokens, Embeddings, and Vector Search
Covers the foundations of NLP for LLMs. Defines tokenization, embeddings as semantic vectors, and vector search (similarity search) as the basis for finding relevant information.
Implementing Tokens, Embeddings, and Vector Search
Hands-on practice with tokenization. Implementing embedding generation and vector search to build a semantic search system from scratch.
Strategic Model Selection & Economics
Learn the business trade-offs of model selection. Covers performance, cost, speed, and control (TCO). Compares general-purpose (generation) vs. specialized (reasoning) models.
Applying Model Selection and Economics
Apply model selection theory. Calculate Total Cost of Ownership (TCO) including error costs. Implement a hybrid model routing system to balance cost and quality.
Retrieval Augmented Generation (RAG) Workflow
Introduces the RAG architecture. Compares naive vs. advanced modular RAG. Covers the data ingestion pipeline, focusing on data formats and intelligent chunking strategies.
Semantic Search with Vector Databases for RAG
Explains semantic search and the role of vector databases. Covers indexing algorithms (HNSW) for speed and advanced retrieval techniques like HyDE and re-ranking.
Prompt Engineering for RAG Synthesis
Learn to write prompts for RAG. Covers grounding answers in context, handling conflicts, managing uncertainty, and enforcing verifiability by generating inline citations.
Implementing RAG with Vector Databases
Build a complete RAG system. Practice vector database operations in ChromaDB, including adding documents, applying metadata filters, and implementing a retrieval and generation pipeline.
Evaluating RAG Systems
Learn to evaluate RAG system quality. Introduces key metrics: Context Precision, Context Recall, Faithfulness, and Answer Relevancy. Covers frameworks like RAGAS.
RAG Evaluation Implementation
Implement a RAG evaluation pipeline using the RAGAS framework. Learn to calculate and interpret quality metrics to diagnose and improve your RAG system's performance.
Project: NASA Mission Intelligence: Developing a RAG-Based Chat System
Build an end-to-end RAG chatbot. Ingest NASA mission data, build a vector search pipeline, generate answers, and evaluate the system's quality.

Course 3: Multimodal AI Applications

Learn how computers process and understand image data, then harness the power of the latest Generative AI models to create new images.

18 hours

Introduction
Multimodal AI Fundamentals
Discover multimodal AI fundamentals and technologies, including models and use cases that process and generate text, images, audio, and video for richer, real-world applications.
Using Multimodal AI Technologies
Explore practical applications of multimodal AI by using APIs and open-source models for image captioning and audio transcription, with hands-on exercises and secure credential handling.
Transformers & Multimodal Processing
Explore how transformers unify text, images, audio, and video through attention, embeddings, and fusion strategies, powering state-of-the-art multimodal understanding and generation.
Multimodal AI Tooling
Explore practical tools for building multimodal AI apps, compare commercial and open-source options, and use Pydantic AI to create reliable, structured, vendor-agnostic workflows.
Introduction to Enterprise Visual Content Processing
Explore enterprise visual content processing: core computer vision tasks, digital image representation, and real-world applications for efficiency, safety, and automation.
Vision Pre-processing Pipelines with HuggingFace
Explore vision data pipelines using HuggingFace, from dataset loading to resizing and normalization, with demos and hands-on exercises for effective image pre-processing.
Understanding Embeddings in Computer Vision
Learn how embeddings convert images into compact vectors for efficient search, enable cross-modal tasks with models like CLIP, and power large-scale, robust computer vision systems.
Image Search Using CLIP Embeddings
Explore how to build text-to-image and image-to-image search using CLIP embeddings, combining theory, real-world demos, hands-on practice, and solution walkthroughs.
Using Multimodal Model APIs for Vision
Explore multimodal vision APIs: prompt design, parameter tuning, structured outputs, cost control, integration, and best practices for robust, efficient image analysis.
Gemini Vision API Basics
Explore Gemini Vision API basics by practicing image moderation, learning to analyze images and implement moderation workflows using real-world examples and guided hands-on exercises.
Vision Transformer Models & Architectures
Explore Vision Transformer models: core architecture, image tokenization, self- and cross-attention, and top models (SAM, RT-DETR, DINOv2) for segmentation, detection, and enterprise use.
Using Vision Transformers
Explore vision transformers with hands-on demos: extract image embeddings using DINOv2 and perform object detection and segmentation using RT-DETR and SAM2.1 models.
Vision-Language Models
Learn how vision-language models align images and text for tasks like search, captioning, and VQA, with focus on architectures, applications, data needs, and deploying for enterprise use.
Multimodal Vision Applications with CLIP
Explore zero-shot image classification and auto-labeling for driving scenes using CLIP, enabling efficient, scalable multimodal vision applications.
Diffusion Models & Image Generation
Explore how diffusion models generate images by reversing noise through iterative denoising, inspired by physical diffusion processes and key to modern generative AI developments.
Introduction to Enterprise Audio Processing
Discover enterprise audio processing, core speech tasks (transcription, diarization, sentiment, TTS), key use cases, and strategies for value and integration in modern businesses.
Audio Data Representation
Explore how audio is digitized for AI: sample rate, bit depth, channels, formats, and mel spectrograms for speech, plus challenges and best practices in audio preprocessing and analysis.
Audio Processing with librosa
Explore audio processing with librosa: load, resample, convert, and analyze audio files; visualize with mel spectrograms and apply techniques through hands-on exercises.
Sound Retrieval and Classification
Explore audio embeddings for efficient sound classification and retrieval, using models like CLAP to enable semantic search and robust text-based audio analysis at scale.
Sound Retrieval and Classification with CLAP
Explore using CLAP for sound retrieval, similarity, and zero-shot classification, then apply these skills to detect fan on/off states in real audio data.
Speech Processing
Discover automatic speech recognition with Whisper: a robust, multilingual, open-source model for accurate transcription, translation, and speech processing in real-world audio.
Implementing Speech Processing with Whisper & Gemini
Explore real-world speech transcription and translation with Whisper and Gemini, using Python to process, segment, and align audio with text, including multilingual support.
Audio Intelligence
Explore advances in Audio Intelligence: multimodal systems, speech recognition, TTS, enterprise controls, creative workflows, and ethics for robust, secure, and accessible audio solutions.
Audio Sentiment Analysis with Gemini
Explore audio sentiment and command analysis using Pydantic AI and Gemini; learn to extract emotions and recognize spoken commands from audio with real-world datasets and hands-on exercises.
Audio Classification and Moderation
Explore voice content moderation: real-time and batch pipelines, compliance, privacy, layered detection, and operational excellence for secure and fair audio classification.
Building a Basic Voice Moderation System with Gemini
Learn to build a voice moderation system using Gemini to transcribe audio, detect personal data disclosures, and flag policy violations in customer service recordings.
Introduction to Enterprise Video Processing
Discover how enterprise video AI overcomes temporal complexity using smart frame selection for efficient understanding, search, classification, moderation, and generation at scale.
AI Models for Video Understanding
Explore key AI models like YOLO for real-time detection, CoTracker and TimeSformer for motion and temporal understanding, enabling advanced, scalable enterprise video analytics.
Implementing Object Recognition & Tracking
Learn how to detect and track objects in videos using YOLOv9, apply multi-object tracking, handle small objects, and count items crossing boundaries in practical scenarios.
Video Understanding & Search
Explore methods for video analysis and search using foundation models and CLIP4Clip, balancing temporal understanding, cost, and retrieval accuracy for enterprise applications.
Video Understanding & Search with Gemini & Clip4Clip
Explore video understanding with Gemini and Clip4Clip: learn automated video description, key moment detection, and natural language video search using AI models and structured outputs.
Video Classification & Moderation
Learn to classify and moderate video by modeling temporal patterns, handling real-world challenges, and combining automation with human oversight for scale, accuracy, and compliance.
Video Classification & Moderation with Gemini
Learn to build automated systems for video classification and moderation with Gemini and Pydantic AI, including action recognition and safety compliance in real-world scenarios.
Video Generation
Explore generative video AI tools and workflows that turn text, images, or footage into dynamic content for marketing, training, and creative use while ensuring quality and compliance.
Video Generation with Veo 3
Learn to generate marketing videos with Veo 3 using both text-to-video and image-to-video workflows, and understand their strengths, limitations, and real-world applications.
Multimodal AI Deployment
Explore deployment of multimodal AI systems for text, images, audio, video via unified APIs, multi-API orchestration, and custom solutions, balancing speed, cost, and control.
Implementation Tools and Serving Strategies
Explore tools and strategies for implementing, serving, and monitoring AI solutions, from rapid prototyping to production, including unified APIs, orchestration, and managed platforms.
Using Gradio and Pydantic AI
Learn to build multimodal chatbots and analysis apps using Gradio and Pydantic AI, covering async programming, media inputs, rate limiting, and interface customization.
Multimodal AI Performance Monitoring and Logging
Learn to monitor and log multimodal AI systems, tracking performance, costs, and failures across modalities for optimized, reliable, and coherent production deployments.
Logging and Performance Monitoring with Gradio and Arize Phoenix
Learn to implement logging and performance monitoring for multimodal AI chatbots using Gradio and Arize Phoenix, enabling robust analytics, debugging, and cost tracking.
Evaluating Multimodal Applications
Learn how to evaluate multimodal AI apps using user feedback systems and testing methods, blending human review, automated metrics, and continuous monitoring for quality improvement.
Testing Multimodal Apps with Pydantic AI Evals
Learn to build robust testing frameworks for multimodal AI apps using Pydantic Evals, covering structured outputs, semantic evaluation, custom evaluators, and hands-on exercises.
Scaling Multimodal AI Architecture
Learn strategies to scale multimodal AI: unified APIs, multi-API pipelines, and custom deployments, focusing on performance, cost, reliability, and architectural trade-offs.
OmniTrainer: Multimodal Customer Service Trainer
In this project, students will create an AI agent that simulates customer service scenarios and specialized monitoring agents that analyze communications across text, images, videos, and audio.

Skills that demand high salaries

Jobseekers with generative AI skills can expect a nearly 50% salary bump compared to competitors who lack them.*

Generative AI Engineer

Salary info from Talent.com

Low
$157,367
Average
$187,306
High
$228,346

Program Instructors

3 instructors

Unlike typical professors, our instructors come from Fortune 500 and Global 2000 companies and have demonstrated leadership and expertise in their professions:

Brian Cruz

Head of AI Engineering, Advocate

Eduardo Mota

Sr. Cloud Data Architect

Giacomo Vianello

Director, Machine Learning Engineer

Brian Cruz

Head of AI Engineering, Advocate

Eduardo Mota

Sr. Cloud Data Architect

Giacomo Vianello

Director, Machine Learning Engineer

Reviews

Average Rating: 4.9 (91 Reviews)

Completed course

Jheriko De Asis

Jan 23, 2026

Good learning

Soumik Roy Chowdhury

Jan 16, 2026

very useful.

Mediboina Dhanalakshmi

Jan 16, 2026

All good for me!

Maria Minilea Rubin

Jan 15, 2026

very helpful

Mildred San Andres

Jan 14, 2026

About this program

Move beyond basics to build reliable enterprise-grade AI. Master foundation model-fine tuning, RAG, cost-effective prompt engineering, and multimodal solutions.