
Advanced Programming
Lecture 3
Lifecycle of a Machine Learning product
• First, define the problem or state the situation.
• Then: data collection,
• data preparation,
• model development and evaluation,
• and finally, model deployment.
• In reality, the model lifecycle is iterative, which means that we tend to go back and forth between these processes.
ETL process
Together, data collection and preparation are known as the Extract, Transform, and
Load, or ETL, process.
The ETL process involves collecting data from various sources,
then cleaning, transforming, and storing it in a single new place.
The data is then accessible to the machine learning engineer, allowing
them to perform tasks like building a machine learning model.
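A minimal sketch of what such an ETL flow could look like in Python with pandas; the file names, database name, table name, and column names below are made up for illustration and are not from the lecture.

```python
import pandas as pd
import sqlite3

# Extract: collect data from several (hypothetical) sources.
users = pd.read_csv("users.csv")            # demographics, sign-up dates
purchases = pd.read_json("purchases.json")  # purchase history per user

# Transform: clean and reshape into one table.
purchases["purchase_date"] = pd.to_datetime(purchases["purchase_date"])
merged = purchases.merge(users, on="user_id", how="left").drop_duplicates()

# Load: store the result in a single place the ML engineer can query.
with sqlite3.connect("beauty_products.db") as conn:
    merged.to_sql("purchase_history", conn, if_exists="replace", index=False)
```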
Data collection
• User data (demographics, purchase history, etc.)
• Product data (inventory of products and what they do, their ingredients, how popular they are, their customer ratings)
• Other data (user's saved products, liked products, search history, most visited products, and so on)
• To help increase business revenue, we need to create and deploy a model that recommends products similar to what the customer has already bought.
• End-user's pain point: "As a beauty product customer, I would like to receive recommendations for other products based on my purchase history so that I will be able to address my skincare needs and improve the overall health of my skin."
Data preparation
• Cleaning and structuring the data:
• Remove irrelevant/extreme values.
• Handle missing values appropriately.
• Ensure correct data formats (for example, dates should be in date formats and strings should be properly identified).
• Feature Engineering: creating new variables such as average transaction time (see the sketch below).
• Calculate the average duration between transactions for each user and find which products they buy the most.
• We also need a feature that identifies what kind of skin issues each product targets and assigns them to each user.
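A small pandas sketch of the feature-engineering step above (average time between transactions and each user's most-bought product); the purchase history and its columns are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical purchase history (stand-in for the real beauty-product data).
purchases = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "product_id": ["cleanser", "moisturizer", "cleanser", "serum", "serum"],
    "purchase_date": pd.to_datetime(
        ["2024-01-01", "2024-01-15", "2024-02-01", "2024-03-01", "2024-03-20"]
    ),
})
purchases = purchases.sort_values(["user_id", "purchase_date"])

# Average number of days between consecutive transactions for each user.
gaps = purchases.groupby("user_id")["purchase_date"].diff()
avg_gap_days = (
    gaps.dt.days.groupby(purchases["user_id"]).mean()
    .rename("avg_days_between_purchases")
)

# Most frequently purchased product for each user.
favorite_product = (
    purchases.groupby(["user_id", "product_id"]).size()
    .groupby(level="user_id").idxmax()
    .apply(lambda idx: idx[1])  # keep only product_id from the (user_id, product_id) label
    .rename("favorite_product")
)

user_features = pd.concat([avg_gap_days, favorite_product], axis=1)
print(user_features)
```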
Data preparation
• Exploratory Data Analysis (EDA): identify patterns and correlations.
• Create plots to visually identify patterns, validate the data based on information that the beauty product subject matter expert has given me, and do some correlation analysis to identify which variables or features are most important to the users' buying habits and needs.
• Decide on a data splitting strategy (e.g., holding out each user's last transaction as the test set), as sketched below.
• This is a time-consuming process due to data inconsistencies.
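Continuing with the hypothetical purchases DataFrame from the sketch above, one way the "last transaction as test set" split could be implemented:

```python
# Hold out each user's most recent purchase as the test set and train on the rest.
last_purchase_idx = purchases.groupby("user_id")["purchase_date"].idxmax()

test_set = purchases.loc[last_purchase_idx]
train_set = purchases.drop(index=last_purchase_idx)
```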
Model Development
• Choose the right model and framework.
• Use Content-Based Filtering to recommend similar products (a minimal similarity-score sketch follows this list).
• For example, if someone is using a cleanser with lots of water, it is likely that the user has dry skin and will want a moisturizer that is highly moisturizing as well. One step I might take here is to create a similarity score between the products a user has purchased and other products, and rank them. I might recommend the most similar product while bearing in mind that other factors could come into play. For example, I might notice that the user has searched for products without particular ingredients, so I want to make sure that we are not recommending a product that they absolutely won't use.
• Use Collaborative Filtering to make recommendations based on user similarities.
• This means creating similarities between two users based on how they view a product, for example, based on how two users rate the same product. First, I group users into a bucket based on their characteristics, such as age, region, skin type, and the products the users rated and/or purchased. Then, I can take the average ratings of existing members, assume that the new user will be somewhere around that average, and recommend a product based on what others have rated highly.
• Combine both techniques for improved accuracy.
• This stage is time-consuming due to model experimentation and optimization.
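A minimal sketch of the content-based similarity-score idea above, using TF-IDF vectors over ingredient text and cosine similarity; the product catalogue and ingredient strings are illustrative assumptions, not real data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical product catalogue with free-text ingredient lists.
products = pd.DataFrame({
    "product_id": ["gentle_cleanser", "hydra_moisturizer", "clay_mask"],
    "ingredients": [
        "water glycerin aloe chamomile",
        "water hyaluronic acid glycerin shea butter",
        "kaolin clay charcoal tea tree oil",
    ],
})

# Turn ingredient text into vectors and score every product against every other.
ingredient_vectors = TfidfVectorizer().fit_transform(products["ingredients"])
similarity = cosine_similarity(ingredient_vectors)

def recommend_similar(product_idx, top_n=2):
    """Rank other products by similarity to the purchased one and return the closest."""
    ranked = similarity[product_idx].argsort()[::-1]
    return products["product_id"].iloc[[i for i in ranked if i != product_idx][:top_n]]

print(recommend_similar(0))  # products most similar to the gentle cleanser
```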
Model Evaluation
• Tune the model using test data.
• Conduct A/B testing with real users.
• Gather feedback (e.g., user ratings, click-through rates, purchase
rates).
• This involves tuning the model and doing some testing on the data set I had kept aside earlier for testing. Once satisfied with the results, I will further evaluate the model by experimenting with the recommendations on a group of users and asking for their feedback. The feedback will include asking the group of users to rate the recommendations and collecting data on the number of people who clicked on and bought the recommended products, along with any other necessary metrics (a small metric sketch follows this list).
• Modify the model based on evaluation results.
• This is an iterative process that can take significant time.
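A small sketch of how the click-through and purchase-rate feedback might be computed; the `ab_results` log and its columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical A/B-test log: one row per recommendation shown to a user.
ab_results = pd.DataFrame({
    "clicked":   [True, False, True, True, False],
    "purchased": [False, False, True, False, False],
    "rating":    [4, 3, 5, 4, 2],
})

click_through_rate = ab_results["clicked"].mean()
purchase_rate = ab_results["purchased"].mean()
average_rating = ab_results["rating"].mean()

print(f"CTR: {click_through_rate:.0%}, purchase rate: {purchase_rate:.0%}, "
      f"avg rating: {average_rating:.1f}/5")
```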
Model Deployment
• Deploy the model in a real-world environment (e.g.,
app, website).
• Ensure seamless integration with existing systems.
• Monitor model performance to detect issues early.
• Now that I am done with building and testing, the model is ready to go to
production. For this project, it will be a part of the beauty product app and
website.
• While this is the last step, I still need to track the deployed
model’s performance to make sure it continues to do the job that the
business requires. Future iterations may include retraining the model based
on new information in order to expand its capabilities.
Continuous Monitoring &
Improvement
• Track performance metrics post-deployment.
• Retrain the model periodically with new data.
• Adapt to changing user behavior and business needs.

• Each step in the ML lifecycle is essential for success.
• Data collection, preparation, and model evaluation are the most time-consuming stages.
• Continuous monitoring is crucial for long-term
effectiveness.
Data Scientist and AI engineer
• Traditionally, data scientists have always used AI models to do their analysis.
• Generative AI breakthroughs have been so groundbreaking that generative AI has split off into its own distinct field, which we call AI engineering.
Data scientist: a data storyteller
• They take massive amounts of messy real-world data, and they use mathematical models to translate this data into insights.
• Use a lot of descriptive analytics to describe the past, as well as predictive analytics (EDA, clustering, regression, classification).
• Work mostly with structured (tabular) data.
• Clean & preprocess datasets (e.g., remove outliers, feature engineering).
• Use a variety of models, each trained for specific datasets.
• Models are small, requiring less computation and time (seconds to hours).

AI engineer: an AI system builder
• They use foundation models to build generative AI systems that help to transform business processes.
• Focus on prescriptive (decision optimization, recommendation systems) and generative (LLMs, chatbots) use cases.
• Work mainly with unstructured data (text, images, audio, video).
• Require massive-scale datasets (billions to trillions of tokens for LLMs, large language models).
• Use Foundation Models that generalize across tasks. Models are huge, requiring massive computational resources (weeks to months to train).
A typical data science process
• Start off with a use case, and then from that use case, you pick the right data.
• After that data is prepared, you use it to train and validate a model using techniques such as feature engineering, cross-validation, or hyperparameter tuning, as an example. This model is then deployed at some endpoint, for example in the cloud, to do real-time prediction and inference.

The AI engineering process
• Starts off with a use case, but then we can skip directly to working with a pre-trained model. What makes this possible is a phenomenon called AI democratization, which is a big fancy word that simply means making AI more widely accessible to everyday users.
• AI engineers interact with these foundation models via natural language instructions to prompt them to do various tasks, and this process is known as prompt engineering.
Data science steps:
• Define use case
• Collect & prepare structured data
• Train & validate a specific ML model
• Deploy model for real-time inference

AI engineering steps:
• Define use case
• Use a pre-trained Foundation Model
• Apply Prompt Engineering, Fine-Tuning (PEFT), or RAG
• Build end-to-end AI applications
Key Techniques in AI Engineering
• These are three major techniques used to adapt large pre-trained AI models
(such as GPT-4, LLaMA, or Claude) for specific tasks without training from
scratch.
Prompt Engineering
• The process of carefully crafting inputs (prompts) to guide the AI model’s
response.
• Bad Prompt: "Summarize this article."
• Good Prompt: "Summarize this article in 3 bullet points, highlighting the key takeaways."
Fine-Tuning (PEFT - Parameter-Efficient Fine-Tuning)
• Instead of training a whole model from scratch, PEFT adapts only a small
number of model parameters.
• A hospital wants an AI assistant fine-tuned on medical research papers to provide better
healthcare insights.
Retrieval-Augmented Generation (RAG)
• Enhances LLMs by retrieving relevant external documents before generating
a response.
• If an AI chatbot is helping lawyers, it can search legal databases before answering questions,
so its answers are always up to date and accurate.
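A toy sketch of the RAG flow described above: retrieve relevant documents first, then fold them into the prompt sent to the model. The documents, the TF-IDF retriever, and the prompt template are all illustrative assumptions; production systems typically use embedding models, vector databases, and an actual LLM call in place of the printed prompt.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical document store; in the lawyer example this would be a legal database.
documents = [
    "Placeholder text about contract termination clauses.",
    "Placeholder text about patent filing deadlines.",
    "Placeholder text about employment dispute procedures.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(query, k=2):
    """Return the k documents most relevant to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
    return [documents[i] for i in scores.argsort()[::-1][:k]]

def build_prompt(query):
    """Retrieve context first, then hand the augmented prompt to whatever LLM you use."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What is the deadline for filing a patent?"))
```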
What is data?
• Data is a collection of raw facts, figures, or information.
• Used to draw insights, inform decisions, and fuel AI and
machine learning.
• Essential for all machine learning models as it provides the
necessary information for pattern discovery and prediction.

• Machine learning tools help in:
• Data preprocessing
• Building, evaluating, and optimizing models
• Implementing machine learning solutions
• These tools simplify complex tasks like handling big data,
statistical analysis, and making predictions.
Popular Machine Learning
Libraries
• Pandas: Data manipulation and analysis
• Scikit-learn: Supervised and unsupervised learning
algorithms
• NumPy: Numerical computations for large datasets
• SciPy: Scientific computing, including optimization and
regression
Machine Learning Programming
Languages
• Python: Most widely used, extensive libraries for ML
and AI.
• R: Popular for statistical analysis and data exploration.
• Julia: High-performance, used in research and
numerical computing.
• Scala: Ideal for big data processing and ML pipelines.
• Java: Scalable ML applications for production.
• JavaScript: Runs ML models in web browsers for client-
side applications
Categories of Machine Learning
Tools
• Data Processing & Analytics
• Data Visualization
• Machine Learning Model Development
• Deep Learning Frameworks
• Computer Vision Tools
• Natural Language Processing (NLP) Tools
• Generative AI Tools
Data Processing & Analytics
Tools
• PostgreSQL: SQL-based database system.
• Hadoop: Scalable disk-based big data storage and
processing.
• Spark: In-memory data processing, faster than Hadoop.
• Apache Kafka: Real-time data streaming and analytics.
• Pandas: Data wrangling and transformation.
• NumPy: Mathematical functions and linear algebra
operations.
Data Visualization Tools
• Matplotlib: Customizable plots and visualizations.
• Seaborn: Statistical graphics built on Matplotlib.
• ggplot2: R-based visualization package for layered
graphics.
• Tableau: Business intelligence tool for interactive
dashboards
Machine Learning Model
Development Tools
• Scikit-learn: Classic ML algorithms (classification,
regression, clustering).
• Pandas: Prepares data for ML models.
• SciPy: Supports linear regression and optimization
Deep Learning Frameworks
• TensorFlow: Large-scale ML and deep learning.
• Keras: User-friendly deep learning library.
• Theano: Efficient mathematical computations for
neural networks.
• PyTorch: Deep learning with support for NLP and
computer vision
Computer Vision Tools
• OpenCV: Real-time image processing and object detection.
• Scikit-Image: Image segmentation and feature extraction.
• TorchVision: Pre-trained models and image transformation functions.
Natural Language Processing
(NLP) Tools
• NLTK: Text processing, tokenization, and stemming.
• TextBlob: Sentiment analysis and part-of-speech
tagging.
• Stanza: Pre-trained NLP models for multiple languages
Generative AI Tools
• Hugging Face Transformers: Transformer models for
NLP.
• ChatGPT: AI chatbot and text generation.
• DALL-E: AI-generated images from text descriptions.
• GANs (Generative Adversarial Networks): Deep
learning models for generating images and videos
Scikit-learn
• Scikit-learn is a free machine learning library for
Python.
• Provides algorithms for classification, regression,
clustering, and dimensionality reduction.
• Works seamlessly with NumPy, SciPy, and
Matplotlib.
• Offers extensive documentation and a large support
community
• Scikit-learn can scale your data by standardizing it.
• Scikit-learn can also split arrays and matrices into random train and test subsets for you in one line of code. Here, 33% of the data is reserved for testing (see the sketch below).
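A sketch of these two steps with StandardScaler and train_test_split; the digits dataset is just an assumed stand-in for whatever feature matrix X and labels y you are working with.

```python
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Any feature matrix X and label vector y will do; digits is only a stand-in dataset.
X, y = load_digits(return_X_y=True)

# Scale the data by standardizing it (zero mean, unit variance per feature).
X = StandardScaler().fit_transform(X)

# Split arrays into random train and test subsets; 33% is reserved for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```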

• Next, instantiate a classifier model using a support vector classification algorithm. This line of code generates a classification model object, called clf, and initializes its parameters, gamma and C.
• The clf model learns to predict the classes for unknown cases by passing the training set to the fit method.
• Then you can use the test data to generate predictions. The result tells you the predicted class for each observation in the test set (sketched below).
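A sketch of those classifier steps, continuing from the split in the previous snippet; the specific gamma and C values are arbitrary assumptions.

```python
from sklearn import svm

# Instantiate a support vector classification model, initializing gamma and C.
clf = svm.SVC(gamma=0.001, C=100.0)

# The clf model learns from the training set via the fit method.
clf.fit(X_train, y_train)

# Use the test data to generate predictions: one predicted class per observation.
y_pred = clf.predict(X_test)
```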

• You can also use different metrics to evaluate your model accuracy, such as a confusion matrix to compare the predicted and actual labels for the test set. And finally, you can save your model as a pickle file and retrieve it later (sketched below).
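A sketch of the evaluation and persistence steps, continuing from the snippets above; the pickle file name is arbitrary.

```python
import pickle
from sklearn.metrics import confusion_matrix, accuracy_score

# Compare predicted and actual labels for the test set.
print(confusion_matrix(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))

# Save the trained model as a pickle file, then retrieve it.
with open("clf_model.pkl", "wb") as f:
    pickle.dump(clf, f)

with open("clf_model.pkl", "rb") as f:
    clf_restored = pickle.load(f)
```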
Confusion Matrix, Accuracy,
predict
• This is a table that shows how many predictions were
correct and incorrect.
• It compares the model's predictions (y_pred) with the
actual values (y_true).
• Accuracy measures how many predictions were
correct overall.
• The model predicts values for new data based on what
it learned from training
Which one of the following best describes
machine learning?
Which one of the following tasks is a machine learning
engineer more likely to perform than a data scientist?
Which library is at the core of an open-source Python machine learning ecosystem that enables you to develop machine learning models?

• Pandas
• Scikit-learn
Which library is a tool for data analysis, visualization, cleaning, and preparing data for machine learning?

• NumPy
• SciPy
Next slides
• Vanishing gradients → errors fade, network learns
poorly.
• Exploding gradients → errors become too large, network
behaves chaotically.
• Backpropagation → a method that helps the network
learn and correct errors.
• Batch normalization → stabilizes the network so that
learning is faster and more reliable.
• Data is the foundation of machine learning.
• Machine Learning Tools help simplify tasks and
enhance efficiency.
• Programming Languages like Python and R are
widely used.
• Specialized tools exist for data processing,
visualization, ML, deep learning, computer vision, NLP,
and generative AI.
