
Practical Problems & Projects

11-767: On-Device Machine Learning

Prof. Emma Strubell


Recognizing People in Photos Through Private On-Device Machine Learning
Floris Chabert, Jingwen Zhu, Brett Keating, and Vinay Sharma

Apple Research Blog July 2021


https://machinelearning.apple.com/research/recognizing-people-photos

Yonatan Bisk & Emma Strubell


Task
Find faces of contacts

Why is this hard?


Lighting, perspective,
skin color, age, gender, …

Why is this hard On-Device?


Motivation
On-Device face recognition is privacy preserving

Context:

• Competitors in the market (e.g. Google) use cloud-based services, so your data is shared.

• Apple has their own neural engine for acceleration.

• Quality vs battery.

What is a naive algorithm we might use?


Inference Pipeline

Notes:
1. Two feature representations (2 models)
2. Agglomerative clustering (naively expensive)
3. Use of external metadata (can correct for a weak model)

Clustering
1. Conservative embedding clusters (very few merges - within moments?)
   Relies on hand-tuned weighting for face (vs. mean face) and body:
   D_ij = min(F_ij, α·F_ij + β·T_ij), where F and T are the face and body distances, respectively
2. Agglomerative clustering (faces only)
   First pass (ideal): “median distance between the members of two HAC clusters”
   After threshold: “random sampling” ← maintains linear runtime (no guarantees)
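The conservative merge distance above can be sketched in a few lines. This is a hypothetical illustration: the α and β values below are made up, since the slide only says the face/body weighting is hand-tuned.

```python
import numpy as np

# Sketch of the slide's combined cluster distance:
#   D_ij = min(F_ij, alpha * F_ij + beta * T_ij)
# F = pairwise face-embedding distances, T = pairwise body distances.
# alpha and beta are illustrative, not the blog's tuned values.

def combined_distance(F, T, alpha=0.6, beta=0.4):
    """Element-wise conservative merge distance between clusters i and j."""
    return np.minimum(F, alpha * F + beta * T)

# Toy 2-cluster example: faces look different (0.9) but bodies match (0.2)
F = np.array([[0.0, 0.9], [0.9, 0.0]])
T = np.array([[0.0, 0.2], [0.2, 0.0]])
D = combined_distance(F, T)
```

Taking the minimum keeps the distance conservative: body evidence can only help a merge when the weighted combination is closer than the face distance alone.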

Note: clustering runs periodically, typically overnight while the device charges
Assigning Identity
• Every cluster has c “canonical” exemplars:
  D = [X_0^1, X_1^1, …, X_c^1, X_0^2, X_1^2, …, X_c^2, …, X_1^K, …, X_c^K]

• Construct a representation for the input as a function of the dictionary (existing clusters):
  min_x ‖y − D·x‖_2^2 + λ·‖x‖_1
  This reduces to a convex optimization (ℓ1-regularized least squares) for the values in x

• So this is quickly learnable (optimally)

• Now the values x_j for a given X_*^i define the cluster
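The assignment objective is a standard lasso, so it can be sketched with plain ISTA (proximal gradient descent). Everything here is illustrative, not Apple's implementation: the dictionary, λ, and the solver choice are assumptions; only the objective min_x ‖y − D·x‖² + λ‖x‖₁ comes from the slide.

```python
import numpy as np

# Sketch of the identity-assignment step: solve
#   min_x ||y - D x||_2^2 + lam * ||x||_1
# with ISTA. D's columns are canonical exemplars; the largest
# coefficients of x indicate which cluster the query y belongs to.

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(D, y, lam=0.1, n_iter=300):
    L = 2.0 * np.linalg.norm(D, 2) ** 2   # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ x - y)    # gradient of the squared error
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Toy dictionary: 3 exemplars (columns); the query matches exemplar 0,
# so the sparse code concentrates on that coefficient.
D = np.eye(3)
y = np.array([1.0, 0.0, 0.0])
x = lasso_ista(D, y, lam=0.1)
```

Because the problem is convex, any reasonable solver reaches the same optimum, which is what the slide means by “quickly learnable (optimally)”.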
Network Design
“highest accuracy possible while running efficiently on-device, with low latency and a thin memory profile”

• Skipping important details here because the model is largely based on MobileNet, which will be discussed later, BUT

• Double channels “within limits of computation”

• Bottleneck expansions are smaller, and attention is added at every layer

• PReLU activations
Network Design
“highest accuracy possible while running efficiently on-device, with low latency and a thin memory profile”

• Wider gives roughly the same performance as deeper, but runs faster
  Zagoruyko, S., Komodakis, N.: Wide Residual Networks

• Attention adds performance with little to no new parameters
Performance of Attention
Training (Focus on normalization and cos)

Margin ensures weighting on hard examples


Training (Focus on normalization and cos)
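The margin idea can be illustrated with an ArcFace-style additive angular margin. This is an assumption: the slide only says “normalization and cos” plus a margin, and Apple's exact loss may differ. With embeddings and class weights L2-normalized, the logit is cos(θ); adding a margin m to the target class's angle shrinks its logit, so hard (small angular gap) examples keep contributing gradient.

```python
import math

# Sketch of a cosine-margin logit for the ground-truth class.
# cos_theta: cosine between the normalized embedding and class weight.
# m: additive angular margin; s: logit scale. Values are illustrative.

def margin_logit(cos_theta, m=0.5, s=64.0):
    """Scaled ground-truth logit: s * cos(theta + m)."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    return s * math.cos(theta + m)
```

The margin strictly lowers the ground-truth logit relative to s·cos(θ), which forces the network to separate identities by at least the margin in angle space.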
Other Considerations

1. Filtering Unclear Faces (no details)

2. Augmentations: “pixel-level changes such as color jitter or grayscale conversion, structural changes like left-right flipping or distortion, Gaussian blur, random compression artifacts and cutout regularization”

3. COVID-19: “we designed a synthetic mask augmentation. We used face landmarks to generate a realistic shape corresponding to a face mask. We then overlaid random samples from clothing and other textures in the inferred mask area over the input face”
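Two of the quoted augmentations (grayscale conversion and left-right flipping) can be sketched directly on a raw H×W×3 array. A real pipeline would use a library such as torchvision; the luma weights below are the standard BT.601 ones, an assumption about how grayscale is computed.

```python
import numpy as np

# Minimal sketches of two pixel/structural augmentations on an
# H x W x 3 float image in [0, 1].

def to_grayscale(img):
    """Grayscale conversion using ITU-R BT.601 luma weights."""
    g = img @ np.array([0.299, 0.587, 0.114])
    return np.repeat(g[..., None], 3, axis=-1)  # keep 3 channels

def hflip(img):
    """Left-right (horizontal) flip."""
    return img[:, ::-1, :]

# Toy 2x2 image with a red pixel in the top-left corner
img = np.zeros((2, 2, 3))
img[0, 0] = [1.0, 0.0, 0.0]
flipped = hflip(img)          # red pixel moves to the top-right
gray = to_grayscale(img)      # red pixel becomes luma 0.299
```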
Qualitative
Key Components
• Optimized clustering (constant time)

• Assignment via Convex coding (minimal updates)

• Wider (shallower) networks

Questions:
1. Was the attention worth it?
2. Was this only possible because of the neural engine?
Course Project
Anatomy of the Course Project

We provide:
• Lab 2: Benchmarking
• Lab 3: Quantization
• Lab 4: Pruning

You decide:
• Hardware: Laptop, robot, RPi…
• Model: ResNet, Transformer, encoder vs. decoder…
• Data: Language, vision, … Same as training data, or a transfer/adaptation setting?
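As a preview of Lab 4's topic, global magnitude pruning can be sketched in a few lines. This is a toy illustration, not the lab's actual code: it zeroes the smallest-magnitude fraction of weights in a single tensor.

```python
import numpy as np

# Toy global magnitude pruning: zero out the `sparsity` fraction of
# weights with the smallest absolute value.

def magnitude_prune(w, sparsity=0.5):
    k = int(sparsity * w.size)          # number of weights to remove
    if k == 0:
        return w.copy()
    thresh = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)

w = np.array([0.1, -0.2, 0.3, -0.4])
pruned = magnitude_prune(w, sparsity=0.5)   # drops 0.1 and -0.2
```

Note that zeroing weights alone only saves compute or memory if the hardware and kernels exploit the sparsity, which is exactly the theory-versus-practice tension discussed below.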
Example Projects
• AnySurface: Converting any surface into a controller by compressing UNet, run on RPi4.

• Speech-to-text translation: Automatic speech recognition and translation on RPi4.

• Im2Cal: Estimating food calories from images by compressing Segformer, on RPi4.

• Hey where’s that thing: Temporal localization in videos by compressing 2D-TAN on laptop.

• Shazaam: On-device music recognition w/ FAISS, separable convolutions.
Example Projects
• Plant Jones: Smart assistant, who is also a plant.

• v1.0 (2015): Find tweets with positive/negative sentiment about water; post positive-sentiment ones when well watered, negative ones when thirsty (dry).

• v2.0 (2023): Use an LLM to generate thirst-related conversation. Also:
  — Custom wake-word detection (“hey plant!”)
  — Text-to-speech
  — Speech-to-text
  — Tiny LCD screen mouth

• ^This is an example baseline using open-source software and libraries — implemented using out-of-the-box tools over about a week.
Axes to Consider
• Theory or practice? Resource optimized vs resource constrained?

• Target hardware:
CPU + RAM vs GPU/M1 + Shared RAM vs GPU+CPU + Separate RAM

• Hardware support: Logic, quantization, sparse ops, batching…

• Novelty: Reproduction vs transfer (new data/hardware) vs novel?

• In-distribution or transfer: Fitting to in-distribution data, vs. adapting to a new task or domain?
Efficiency in Theory versus Practice

Resource Optimized:
• Magnitude pruning
• Server
• Quantization (3-bit)

Resource Constrained:
• Structured pruning / layer pruning
• Edge device
• Quantization (8-bit), if hardware supports it
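The quantization bullets can be made concrete with a toy symmetric per-tensor scheme; the bit-width b is the only knob that changes between the 8-bit and 3-bit cases. This is a sketch for intuition: real low-bit schemes add per-channel scales, clipping, and calibration, and 3-bit rarely has hardware support.

```python
import numpy as np

# Toy symmetric per-tensor quantization: map floats to signed
# b-bit integers with a single scale factor.

def quantize(w, b=8):
    qmax = 2 ** (b - 1) - 1                 # e.g. 127 for int8, 3 for 3-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q * scale

w = np.array([-1.0, 0.5, 1.0])
q, scale = quantize(w, b=8)                 # round-trip error is at most scale/2
```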


Target hardware considerations
In addition to devices, we can provide: $100 AWS and $50 OpenAI credits per student.

• Where do you store model weights, activations, gradients?
  How does this impact latency?

• Trade-off between storage size, speed, and on-the-fly computation

• Do I want on-device training? Fine-tuning?

• How heavy is the OS? How heavy are USB vs GPIO?

• Does your hardware support efficient batched computation? Efficient low-bitwidth computation? Efficient control flow?
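When answering the latency questions above, a simple, careful micro-benchmark pattern goes a long way: warm up first (caches, JIT, frequency scaling), then report the median over repeated runs rather than the mean. This is a generic sketch, not a course-provided harness.

```python
import time

# Minimal latency micro-benchmark: warm up, time repeated calls,
# report the median (robust to scheduler noise and outliers).

def benchmark(fn, n_warmup=3, n_runs=20):
    for _ in range(n_warmup):
        fn()
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    times.sort()
    return times[len(times) // 2]   # median latency in seconds

# Example: time a stand-in workload (replace with a model forward pass)
median_s = benchmark(lambda: sum(range(10_000)))
```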
Project Ideas
Resource Optimized
• Does efficient method X published in a CV venue apply to NLP, or vice versa?

• Does theoretically proven idea Y published in an ML venue apply to larger, more complex models and datasets?

Resource Constrained
• Does “efficient” method Z, evaluated on GPU/TPU, work on CPU/edge? Under memory constraints? Power constraints?

• Can you further optimize an already-efficient model?
  Can you compress a huge model enough to fit it on device?

All of the above
• Compare existing methods across different metrics: Pareto optimality, generalization, fairness, …
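Comparing methods by Pareto optimality (the last bullet) boils down to extracting a Pareto frontier. Here is a minimal sketch over hypothetical (accuracy, latency) points: a method survives only if no other method is at least as accurate and at least as fast.

```python
# Sketch of Pareto-frontier extraction for method comparison.
# points: list of (accuracy, latency_ms) pairs; higher accuracy and
# lower latency are both better. O(n^2), fine for a handful of methods.

def pareto_front(points):
    front = []
    for a, l in points:
        dominated = any(
            a2 >= a and l2 <= l and (a2, l2) != (a, l)
            for a2, l2 in points
        )
        if not dominated:
            front.append((a, l))
    return front

# Hypothetical methods: the third is dominated by the first
points = [(0.9, 10.0), (0.8, 5.0), (0.7, 20.0)]
front = pareto_front(points)
```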
Learning Goals

Project is not:
• Entrepreneurship 101
• Multimodal Machine Learning (amazing class)
• Graded based on model performance
• Real-world robotics

Project is:
• Measuring Efficiency and Power
• Adjusting data for 👆
• Changing architectures for 👆
• Producing Pareto curves for 👆
Plotting Goals
Where to start?
• What pre-trained models exist for my task?

• What is a baseline I can feasibly train/evaluate in a few hours?

• How can I sub-sample my data to create a feasible train/test set?

• Single domain? Limited label space? Simplified task?

• Goal: Performance that’s non-trivial, but it need not be competitive

• What is unique about my data/task/… that makes me think I can compress my models?
