Data 102 Fall 2023 Lecture 24 - Privacy in Machine Learning
Please pick up a penny from the front of the room for an activity later in class. At the end
of class, please take it with you or return it to the front.
Announcements
Special thanks to Prof. Moritz Hardt and Prof. Jacob Steinhardt for most of the content in today’s slides.
Weekly Outline
● Last time: High-Dimensional Regression
● Related to (but not the same as) Personal Health Information (PHI)
○ Protected under HIPAA (US medical privacy law)
○ Includes medical information like doctors’ notes, lab results, etc.
○ Only applies to data in a medical setting, and is specific to health data
Linkage Attack: Sweeney’s Surprise
● 1997: MA’s state insurance org released data on hospital visits by state employees
● Governor: “Your data is safely anonymized!” (name, address, SSN etc., removed)
[Figure (Latanya Sweeney): Venn diagram linking the Medical data (ethnicity, visit data, diagnosis, procedure, medication, charge, ...) and the Voter list (name, address, date registered, party affiliation, date last voted) through the shared attributes ZIP, date of birth, and sex.]
Linkage Attack: Sweeney’s Surprise
● 1997: MA’s state insurance org released data on hospital visits by state employees
● Governor: “Your data is safely anonymized!” (name, address, SSN, etc., removed)
● Lesson: removing private identifiers might not be enough!
[Figure: same linkage diagram as above, matching the Medical data to the Voter list via ZIP, date of birth, and sex.]
Linkage Attack: Netflix Prize
[Figure: anonymized Netflix ratings (thumbs up/down on several movies) for one user, matched against star ratings (8/10, 2/10, 3/10, 9/10) posted publicly under the username “moviefan777”.]
Linkage Attack: Netflix Prize
● Lesson: even without anything obviously “identifiable”, individuals in “anonymized” data releases can still be identified
[Figure: same matching of thumbs up/down ratings to public star ratings as above.]
Another attempt: k-Anonymity
● Observation: if only one person has some combination of traits, then we can
re-identify that person
● Idea: divide columns into quasi-identifiers (identifiable stuff) and sensitive
attributes (other stuff)
○ e.g., ZIP, DOB, and sex could be quasi-identifiers
● Requirement: for any combination of quasi-identifiers there must be at least k
rows in the dataset with that combination
○ e.g., if k=3, then we need at least 3 people for every combination of ZIP/DOB/sex
[Example table: five rows, two columns: (3563, 40), (2400, 32), (4029, 27), (3309, 20), (9214, 36).]
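Not from the slides, but as a concrete illustration: the k-anonymity requirement above can be checked with a single groupby. A minimal pandas sketch, with made-up column names (zip, dob, sex as quasi-identifiers; diagnosis as the sensitive attribute):

```python
import pandas as pd

# Hypothetical toy dataset: ZIP, DOB, and sex are the quasi-identifiers,
# diagnosis is the sensitive attribute. (Column names and values are made up.)
df = pd.DataFrame({
    "zip":       ["94720", "94720", "94720", "94704", "94704", "94704"],
    "dob":       ["1990-01-01"] * 3 + ["1985-06-15"] * 3,
    "sex":       ["F", "F", "F", "M", "M", "M"],
    "diagnosis": ["flu", "asthma", "flu", "flu", "diabetes", "asthma"],
})

def is_k_anonymous(df, quasi_identifiers, k):
    """Check that every observed combination of quasi-identifiers
    appears in at least k rows of the dataset."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

print(is_k_anonymous(df, ["zip", "dob", "sex"], k=3))  # True for this toy table
```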
Basics of Genome-Wide Association Studies (GWAS)
● Recall: DNA is made up of A/C/T/G
● Humans have identical values in most of the genome (~99.6%)
● Focus on locations where >1% of people have “unusual” values
○ e.g., “most people have G here, but 3.2% of people have C”
○ These are called single nucleotide polymorphisms (SNPs)
○ 3.2% is called the minor allele frequency (MAF): how common is the
mutation?
● Goal: find associations between SNPs and disease
● Example Dataset
○ 1000 people with a disease of interest
○ For 100,000 locations, release MAFs at each one: “what percentage of
this population has a mutation in this spot?”
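To make the released statistic concrete, here is a small sketch (not from the slides; shapes and values made up, and far smaller than the slide’s 1,000 people × 100,000 locations) of computing MAFs from a genotype matrix whose entries count how many of a person’s two copies at a location carry the minor allele:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy genotype matrix: 1,000 people x 10 locations (a real release would cover
# ~100,000 locations). Entry (i, j) = number of minor-allele copies (0, 1, or 2)
# that person i carries at location j.
genotypes = rng.binomial(n=2, p=0.05, size=(1000, 10))

# Minor allele frequency at each location: the fraction of all 2 * n_people
# copies that carry the mutation. This vector is what the study releases.
maf = genotypes.sum(axis=0) / (2 * genotypes.shape[0])
print(maf)
```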
GWAS Attack: Can I find out if someone is in the dataset?
● Answer: yes! (using public data and that person’s DNA)
● Dataset: 100,000 locations, MAFs at each one
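The slides don’t show the mechanics, but the flavor of the attack (in the spirit of Homer et al., 2008) is: a participant’s genome pulls the released MAFs slightly toward itself, so a member’s genome is systematically a bit closer to the study’s MAFs than to a public reference population’s. A minimal simulation sketch, with every input made up and the sizes scaled down from the slide’s numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
n_people, n_snps = 200, 20_000   # scaled down from the slide's 1,000 x 100,000

# Hypothetical inputs (all made up for illustration):
pop_maf = rng.uniform(0.05, 0.5, n_snps)              # public reference-population MAFs
study = rng.binomial(2, pop_maf, (n_people, n_snps))  # genotypes of study participants
study_maf = study.mean(axis=0) / 2                    # the released "safe" statistics

target_in  = study[0] / 2                # allele frequencies of someone IN the study
target_out = rng.binomial(2, pop_maf) / 2  # someone NOT in the study

def membership_score(person, study_maf, pop_maf):
    """Sum over SNPs of |person - pop| - |person - study|: positive values mean
    the person's genome sits closer to the released study MAFs than to the
    general population, suggesting the person is in the study."""
    return np.sum(np.abs(person - pop_maf) - np.abs(person - study_maf))

print(membership_score(target_in,  study_maf, pop_maf))  # typically clearly positive
print(membership_score(target_out, study_maf, pop_maf))  # typically near zero
```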
● Two models for keeping data private:
○ Trust nobody: you keep your private data and nobody else gets to see it
○ Trusted curator (e.g., NIH, Census, etc.): we trust the curator to hold on to private information,
but only release “safe” de-identified information to the public
Keeping data private: randomized response
● Setup: I’m conducting a yes/no survey, but you don’t trust me enough to give me your (private) answer
○ Concern: response bias if I ask people anyway
● Exercise: Let b be your true answer, and b’ be the answer you gave me. If
E[b] = q, what is E[b’]?
○ See the answer in next week’s discussion!
Randomized response example
● Have you ever eaten food that fell on the ground outside? (clean floors
don’t count)
● Why is it a good idea to always flip twice regardless of the first flip outcome?
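The protocol itself isn’t spelled out in these notes, so the sketch below assumes the standard two-coin version: flip a fair coin; on heads report your true answer, on tails report the outcome of a second fair coin (“yes” on heads, “no” on tails), always flipping both coins. A quick simulation (values made up) compares the reported answers to the truth:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Assumed (standard) protocol, since the notes don't spell it out:
#   flip coin 1; if heads, report your true answer;
#   if tails, report the result of coin 2 (heads -> "yes", tails -> "no").
# Everyone flips both coins, so the interviewer can't tell which case occurred.
q = 0.6                                 # true fraction of "yes" answers (made up)
b = rng.random(n) < q                   # true answers
coin1 = rng.random(n) < 0.5
coin2 = rng.random(n) < 0.5
b_reported = np.where(coin1, b, coin2)  # what each person actually tells me

print("true mean E[b]:      ", b.mean())
print("reported mean E[b']: ", b_reported.mean())
# The exercise above asks exactly how these two quantities relate.
```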
Definitions
● Two datasets D, D’ are called neighboring if they differ by at most one row
○ One row can be deleted, added, or modified
● An algorithm A is ε-differentially private if, for all neighboring datasets D, D′ and all sets S of possible outputs: P[A(D) ∈ S] ≤ e^ε · P[A(D′) ∈ S]
It follows that: e^(−ε) ≤ P[A(D) ∈ S] / P[A(D′) ∈ S] ≤ e^ε (the output distribution can change by at most a factor of e^ε in either direction when one row changes)
Discussion Question
● Consider a deterministic algorithm that always returns 0.7 (regardless of the
data). Is this algorithm ε-differentially private? If so, for what value of ε?
Discussion Question
● Do larger values of ε correspond to stronger or weaker privacy guarantees?
Why?
Differential Privacy: Summary
● Differential privacy is a framework (not an algorithm) for describing how a
randomized algorithm can provide privacy guarantees
● Informally: An algorithm A(D) is called differentially private if A(D) is not too
different from A(D’) (when D and D’ are neighbors)
○ Neighboring datasets differ by at most one element
● Formally: A(D) is ε-differentially private if, for every pair of neighboring datasets D, D′ and every set S of possible outputs, P[A(D) ∈ S] ≤ e^ε · P[A(D′) ∈ S]
● ε describes how strong our algorithm’s privacy guarantee is
○ Small values of ε correspond to strong privacy guarantees (accuracy may suffer)
○ Large values of ε correspond to weak privacy guarantees (can achieve higher accuracy)
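As a concrete worked example (assuming the two-fair-coin randomized response protocol sketched earlier, which these notes don’t fully specify): flipping one person’s true answer changes the probability of either reported output by at most a factor of 3, so that mechanism is ε-differentially private with ε = ln 3 ≈ 1.1. A tiny numeric check:

```python
import math

# Two-fair-coin randomized response, applied to one person's true bit b:
#   report the truth w.p. 1/2; otherwise report a fresh fair coin flip.
p_yes_given_yes = 0.5 * 1 + 0.5 * 0.5   # P[report yes | true answer yes] = 0.75
p_yes_given_no  = 0.5 * 0 + 0.5 * 0.5   # P[report yes | true answer no]  = 0.25

# Changing one person's true answer is a change between neighboring "datasets"
# of size one, so the worst-case output-probability ratio bounds e^epsilon.
ratio = max(p_yes_given_yes / p_yes_given_no,
            (1 - p_yes_given_no) / (1 - p_yes_given_yes))
print("epsilon =", math.log(ratio))   # ln(3) ≈ 1.10
```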
Trusted curator model
[Figure: an analyst asks a trusted curator, who holds the private data, questions such as “How many … live in Beechum County?” (answer: 3,487) and “How many Black men live in Faulconer County?” (answer: 1,792).]
Trusted curator model
[Figure: the analyst sends a query q to the trusted curator, who holds private data D and returns the answer q(D) + noise.]
Goals:
● q(D) + noise should be ε-differentially private
● |noise| should be as small as possible while guaranteeing privacy
Trusted curator model: examples
US Census data
Trusted curator model: protecting individuals
[Figure: the analyst asks “How many … live in Yakutat County, Alaska?” and then “How many Asian men over 70 live in Yakutat County, Alaska?”; the answer to the second query is 1, so an exact answer reveals information about a single individual.]
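Not from the slides, but to make the risk concrete: exact answers to very specific counting queries enable a classic differencing attack. A small sketch with a made-up table and attributes:

```python
import pandas as pd

# Made-up private data held by the curator.
D = pd.DataFrame({
    "county": ["Yakutat"] * 4,
    "race":   ["Asian", "White", "Asian", "Asian"],
    "sex":    ["M", "M", "F", "M"],
    "age":    [72, 68, 75, 34],
    "income_over_100k": [True, False, False, True],
})

# Two innocent-looking exact counting queries:
q1 = len(D.query("county == 'Yakutat' and income_over_100k"))
q2 = len(D.query("county == 'Yakutat' and income_over_100k "
                 "and not (race == 'Asian' and sex == 'M' and age > 70)"))

# If I happen to know my neighbor is the only Asian man over 70 in Yakutat,
# the difference of the two exact counts tells me whether HIS income is high.
print(q1 - q2)   # 1 -> his income is over $100k
```

Adding noise to each released count, as discussed next, blunts exactly this kind of subtraction.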
How do we achieve differential privacy?
● Definition: The sensitivity of an algorithm A is how much it can change for neighboring datasets:
Δ(A) = max over neighboring datasets D, D′ of |A(D) − A(D′)|
○ If an algorithm has high sensitivity, we’ll need a lot of noise to “hide” the change from D to D’
○ If an algorithm has low sensitivity, we don’t need as much noise
● For a deterministic query q, the curator returns q(D) + random noise
○ Random noise should depend on sensitivity and on how strong a privacy guarantee we want
[Example: the querier asks the curator “What was the average score on MT2?” over a private table with rows (3563, 40), (2400, 32), (4029, 27), (3309, 20), (9214, 36).]
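The slides don’t commit to a particular noise distribution here, but one standard recipe (not necessarily the one used in lecture) is the Laplace mechanism: release q(D) plus Laplace noise with scale sensitivity/ε. A minimal sketch for the average-score query above, assuming scores are bounded in [0, 100] and the number of students is public:

```python
import numpy as np

rng = np.random.default_rng(0)

scores = np.array([40, 32, 27, 20, 36])   # the private MT2 scores from the table above
epsilon = 1.0                             # privacy parameter (smaller = more private)

# If one student's score is modified and scores lie in [0, 100], the average of
# n scores can change by at most 100 / n: that's the sensitivity of this query.
n = len(scores)
sensitivity = 100 / n

def private_average(scores, epsilon):
    """Release the average plus Laplace(sensitivity / epsilon) noise,
    a standard recipe for an epsilon-differentially-private numeric query."""
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return scores.mean() + noise

print("true average:   ", scores.mean())
print("private release:", private_average(scores, epsilon))
```

Smaller ε means a larger noise scale, matching the earlier point that stronger privacy guarantees cost accuracy.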
Differential privacy in the wild
https://www.microsoft.com/en-us/research/project/privacy-integrated-queries-pinq/
Source: https://www.theverge.com/2019/9/5/20850465/google-differential-privacy-open-source-tool-privacy-data-sharing
The Verge: “It was probably the most bewildering part of Apple’s [2016] WWDC Keynote: in the middle of a
rundown of fancy new products arriving with iOS 10, Craig Federighi stopped to talk about abstract
mathematics. He was touting differential privacy, a statistical method that’s become a valuable tool for
protecting user data.” https://www.theverge.com/2016/6/17/11957782/apple-differential-privacy-ios-10-wwdc-2016
From: https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf
“In June 2016, Apple announced that it will deploy differential privacy for some user
data collection in order to ensure privacy of user data, even from Apple. The details of
Apple's approach remained sparse. Although several patents have since appeared
hinting at the algorithms that may be used to achieve differential privacy, they did not
include a precise explanation of the approach taken to privacy parameter choice.
Such choice and the overall approach to privacy budget use and management are
key questions for understanding the privacy protections provided by any deployment
of differential privacy.”
https://arxiv.org/abs/1709.02753
Stepping back
There are many challenges with putting differential privacy into practice:
● Computational challenges
● Implementation pitfalls
● Political struggles