
Privacy in Machine Learning

Data 102 Fall 2023 Lecture 24

Please pick up a penny from the front of the room for an activity later in class. At the end
of class, please take it with you or return it to the front.

Announcements

● Project checkpoint deadlines extended

Special thanks to Prof. Moritz Hardt and Prof. Jacob Steinhardt for most of the content in today’s slides.
Weekly Outline
● Last time: High-Dimensional Regression

● Today: Privacy in Machine Learning


○ Why it’s hard, and how to get it wrong
○ Randomized response: a simple way of getting it right without trust
○ Trusted curator model and differential privacy
○ Current events: 2020 Census and differential privacy

● Next time: Robustness and Generalization


○ Case studies: applying course ideas to real-world scenarios

Why does privacy matter?
● Many valuable applications of ML require use of sensitive data
○ Health, census, phone/location, wearables, etc.

● We want to perform useful analysis while protecting individuals’ privacy


● This is harder than it sounds: many seemingly “private” schemes are
vulnerable to attack
● Key idea: adopt security mindset
○ Think about potential attacks and how to guard against them
Personally Identifiable Information (PII)
● Anything that can identify an individual
○ Name, SSN, email, fingerprint, drivers license #, etc.
● Many security & privacy standards have requirements on how to store PII
securely
● Common approach: remove all PII from a dataset and call it “anonymized”
○ This isn’t always good enough

● Related to (but not the same as) Protected Health Information (PHI)
○ Protected under HIPAA (US medical privacy law)
○ Includes medical information like doctors’ notes, lab results, etc.
○ Only applies to data in a medical setting, and is specific to health data
Linkage Attack: Sweeney’s Surprise
● 1997: MA’s state insurance org released data on hospital visits by state employees
● Governor: “Your data is safely anonymized!” (name, address, SSN etc., removed)

[Figure: Venn diagram. Medical data (ethnicity, visit data, diagnosis, procedure, medication, charge, ...) and the public voter list (name, address, date registered, party affiliation, date last voted) overlap on three shared attributes: ZIP, date of birth, and sex. Pictured: Latanya Sweeney.]
● Lesson: removing private identifiers might not be enough!
Linkage Attack: Netflix Prize

[Figure: several users’ anonymized Netflix thumbs-up/thumbs-down rating patterns matched against public star ratings (8/10, 2/10, 3/10, 9/10) posted under the username “moviefan777”; the matching pattern links the anonymous records to the public profile.]
● Lesson: even without anything obviously “identifiable”, individuals can still be identified in “anonymized” data releases
Another attempt: k-Anonymity
● Observation: if only one person has some combination of traits, then we can
re-identify that person
● Idea: divide columns into quasi-identifiers (identifiable stuff) and sensitive
attributes (other stuff)
○ e.g., ZIP, DOB, and sex could be quasi-identifiers
● Requirement: for any combination of quasi-identifiers there must be at least k
rows in the dataset with that combination
○ e.g., if k=3, then we need at least 3 people for every combination of ZIP/DOB/sex

● Does not guarantee privacy: why?


○ Any column could be a (quasi-)identifier given enough public data!
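As an illustration (my own sketch, not from the slides), here is a minimal check of the k-anonymity requirement for a table with hypothetical columns; zip, dob, and sex play the role of quasi-identifiers and diagnosis is the sensitive attribute.

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list, k: int) -> bool:
    """True if every combination of quasi-identifier values appears in >= k rows."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Hypothetical records: three people share one quasi-identifier combination,
# but the last row is unique, so the table is not 3-anonymous.
df = pd.DataFrame({
    "zip": ["94704", "94704", "94704", "94110"],
    "dob": ["1990-01-01", "1990-01-01", "1990-01-01", "1985-05-05"],
    "sex": ["F", "F", "F", "M"],
    "diagnosis": ["flu", "asthma", "flu", "diabetes"],  # sensitive attribute
})

print(is_k_anonymous(df, ["zip", "dob", "sex"], k=3))  # False
```

Even when a check like this passes, the caveat above still applies: with enough public data, any column can act as a quasi-identifier.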
De-identified data vs aggregate data
● Releasing anonymized data is hard because many columns and attributes
can identify individuals
● Re-identification attacks: discover the identity of individuals from
“anonymized” data
○ Common, and come in many different flavors

● Idea: only release aggregate data, with no individual-level data


○ Average across an entire population, etc.
○ Still not good enough!
Aggregate Data Example: Exam Grades
SID    Score
3563   40
2400   32
4029   27
3309   20
9214   36
Basics of Genome-Wide Association Studies (GWAS)
● Recall: DNA is made up of A/C/T/G
● Humans have identical values in most of the genome (~99.6%)
● Focus on locations where >1% of people have “unusual” values
○ e.g., “most people have G here, but 3.2% of people have C”
○ These are called single nucleotide polymorphisms (SNPs)
○ 3.2% is called the minor allele frequency (MAF): how common is the
mutation?
● Goal: find associations between SNPs and disease
● Example Dataset
○ 1000 people with a disease of interest
○ For 100,000 locations, release MAFs at each one: “what percentage of
this population has a mutation in this spot?”
GWAS Attack: Can I find out if someone is in the dataset?
● Answer: yes! (using public data and that person’s DNA)
● Dataset: 100,000 locations, MAFs at each one

Test population:
SNP   1      2      3      …   100000
MAF   0.02   0.03   0.05   …   0.02

Ramesh’s DNA (minor allele present?):
SNP   1      2      3      …   100000
      NO     NO     YES    …   YES

Reference population (HapMap data, public):
SNP   1      2      3      …   100000
MAF   0.01   0.04   0.04   …   0.01

Homer et al. 2008

● Conclusion: comparing Ramesh’s DNA to the test-population MAFs versus the public reference MAFs suggests the test dataset probably contains Ramesh’s DNA.
Interesting but typical characteristics
● Only innocuous looking data was released
○ Data was HIPAA compliant
● Data curator is trusted (NIH)
● Attack uses background knowledge (HapMap data set)
available in public domain
● Attack uses unanticipated algorithm
● Curator pulled data sets (now hard to get)
● Technical principle: Many weak signals combine into one
strong signal
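To see how “many weak signals combine into one strong signal”, here is a hedged simulation sketch in the spirit of the Homer et al. attack (made-up frequencies, not the study’s actual statistic or data): at every SNP we ask whether the target’s genotype sits closer to the released test-population MAFs or to the public reference MAFs, and sum those tiny per-SNP differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n_snps = 100_000

# Hypothetical minor allele frequencies: a public reference population and a
# test (study) population whose frequencies differ slightly at each SNP.
ref_maf = rng.uniform(0.01, 0.10, n_snps)
test_maf = np.clip(ref_maf + rng.normal(0.0, 0.01, n_snps), 0.001, 0.999)

# One person drawn from the test population (really in the study)
# and one drawn from the reference population (not in the study).
in_study = rng.binomial(1, test_maf).astype(float)
not_in_study = rng.binomial(1, ref_maf).astype(float)

def membership_score(person, test_maf, ref_maf):
    # Positive per-SNP contribution when the person's genotype is closer to the
    # test-population frequency than to the public reference frequency.
    return np.sum(np.abs(person - ref_maf) - np.abs(person - test_maf))

print("score, person in the study:    ", membership_score(in_study, test_maf, ref_maf))
print("score, person not in the study:", membership_score(not_in_study, test_maf, ref_maf))
```

Each per-SNP term is almost pure noise, but summed over 100,000 SNPs the score for someone who really is in the study separates cleanly from the score for someone who is not.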
Fundamental Law of Information Recovery
“Overly accurate information about too many queries to a data source allows for
partial or full reconstruction of data (i.e., blatant non-privacy).”
- Cynthia Dwork, 2014

● With enough information, anyone can be identified


● World population: ~8 billion people
○ 33 bits (2^33 ≈ 8.6 billion)
○ Rule of thumb: 33 bits of information is enough to uniquely identify any individual (Arvind
Narayanan)
● Examples
○ Netflix challenge: movie preferences
○ Browsing history or even just browser settings
○ Many more
Privacy is Hard
● Releasing information on individuals has a high risk of revealing private info
● Many seemingly reasonable attempts at anonymization/privatization fail
○ Removing “identifiable” info (Sweeney’s Surprise)
○ Only releasing anonymized movie tastes (Netflix Prize)
○ Releasing aggregate information only (exam scores, NIH GWAS)

● We need a more formal way of guaranteeing privacy


● Key ideas
○ Introduce noise: lose some accuracy to guarantee privacy
○ Quantify how much privacy we want and how much accuracy loss we’re willing to accept
Part II: Ensuring Privacy
Please pick up a penny from the front of the room for an activity later in class. At the end of
class, please take it with you or return it to the front.
Who is responsible for keeping data private?
● What we can do depends on who we trust to preserve our privacy and hold on to our private data

● Two models
○ Trust nobody: you keep your private data and nobody else gets to see it
○ Trusted curator (e.g., NIH, Census, etc.): we trust the curator to hold on to private information,
but only release “safe” de-identified information to the public
Keeping data private: randomized response
● Setup: I’m conducting a yes/no survey, but you don’t trust me enough to give me your
(private) answer
○ Concern: response bias if I ask people anyway

● Idea: randomize your response (see the sketch below)

○ With probability p (≥ 0.5), give the true answer
○ With probability (1-p), give an answer chosen uniformly at random

● Exercise: Let b be your true answer, and b’ be the answer you gave me. If
E[b] = q, what is E[b’]?
○ See the answer in next week’s discussion!
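A minimal sketch of the respondent-side randomization described above (p = 0.75 here is just an illustrative choice; the de-biasing of the aggregate answers is exactly the exercise above, so it is left out):

```python
import random

def randomized_response(true_answer: bool, p: float = 0.75) -> bool:
    """With probability p, report the true answer; otherwise report a
    uniformly random yes/no. Only the randomized answer is ever shared."""
    assert p >= 0.5
    if random.random() < p:
        return true_answer               # tell the truth
    return random.random() < 0.5         # answer uniformly at random

# A respondent whose true answer is "yes" (True): any single reported answer
# is deniable, because it may have been the random one.
print(randomized_response(True))
```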
Randomized response example
● Have you ever eaten food that fell on the ground outside? (clean floors
don’t count)

● Flip your coin: if heads, answer randomly, otherwise answer truthfully

● Why is it a good idea to always flip twice regardless of the first flip outcome?
Definitions
● Two datasets D, D’ are called neighboring if they differ by at most one row
○ One row can be deleted, added, or modified

● Informally: An algorithm A(D) is called differentially private if A(D) is not too different from A(D’) (when D and D’ are neighbors)
○ Example: GWAS aggregate statistics computed with my data should be similar to those computed without my data
○ Intuition: the algorithm’s result shouldn’t change (much) whether I’m in or out of the dataset

● Provides a strong or “worst-case” guarantee: even if an attacker knows everything else about me and everyone else in the dataset, they (probably) can’t tell whether I’m in it
Differential Privacy: Formal Definition
● For all events S and for all neighboring datasets D, D’, a randomized algorithm A is called
ε-differentially private if

P[A(D) ∈ S] ≤ e^ε · P[A(D’) ∈ S]

It follows that:

e^(−ε) ≤ P[A(D) ∈ S] / P[A(D’) ∈ S] ≤ e^ε
Discussion Question
● Consider a deterministic algorithm that always returns 0.7 (regardless of the
data). Is this algorithm ε-differentially private? If so, for what value of ε?
Discussion Question
● Do larger values of ε correspond to stronger or weaker privacy guarantees?
Why?
Differential Privacy: Summary
● Differential privacy is a framework (not an algorithm) for describing how a
randomized algorithm can provide privacy guarantees
● Informally: An algorithm A(D) is called differentially private if A(D) is not too
different from A(D’) (when D and D’ are neighbors)
○ Neighboring datasets differ by at most one element
● Formally: A(D) is ε-differentially private if P[A(D) ∈ S] ≤ e^ε · P[A(D’) ∈ S] for all events S and all neighboring datasets D, D’
● ε describes how strong our algorithm’s privacy guarantee is
○ Small values of ε correspond to strong privacy guarantees (accuracy may suffer)
○ Large values of ε correspond to weak privacy guarantees (can achieve higher accuracy)
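As a worked example of the definition (my own calculation, not from the slides): the coin-flipping randomized response scheme from earlier, where you answer truthfully on tails and uniformly at random on heads, satisfies ε-differential privacy with ε = ln 3. Here a “dataset” is a single respondent’s true answer, so the neighboring datasets are “truth = yes” and “truth = no”.

```latex
% Fair coin: answer truthfully with prob. 1/2, otherwise uniformly at random.
\Pr[\text{report yes} \mid \text{truth yes}] = \tfrac{1}{2} + \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{3}{4},
\qquad
\Pr[\text{report yes} \mid \text{truth no}]  = \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{1}{4}.

% Worst-case ratio over outputs (the "report no" case gives the same ratio):
\frac{\Pr[A(D) \in S]}{\Pr[A(D') \in S]} \le \frac{3/4}{1/4} = 3 = e^{\varepsilon}
\quad\Longrightarrow\quad
\varepsilon = \ln 3 \approx 1.1.
```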
Trusted curator model

Query: How many Hispanic women live in Beechum County?
Answer: 3,487

Query: How many Black men live in Faulconer County?
Answer: 1,792
Trusted curator model

The analyst sends a query q to the trusted curator, who holds the private data D and responds with:

Answer: q(D) + noise

Goals:
● q(D) + noise should be ε-differentially private
● |noise| should be as small as possible while guaranteeing privacy
Trusted curator model: examples

NIH GWAS data (example from earlier)


Google Maps “popular times” feature

US Census data
Trusted curator model: protecting individuals

Query: How many men over 70 live in Yakutat County, Alaska?

Query: How many Asian men over 70 live in Yakutat County, Alaska?
Answer: 1

With counts this small, exact answers to overlapping queries can single out an individual, so even a trusted curator needs to add noise.
How do we achieve differential privacy?
● Definition: The sensitivity of an algorithm A is how much it can change for neighboring datasets:
s = max over all neighboring D, D’ of |A(D) − A(D’)|
○ If an algorithm has high sensitivity, we’ll need a lot of noise to “hide” the change from D to D’
○ If an algorithm has low sensitivity, we don’t need as much noise
● For a deterministic query q, the curator returns q(D) + random noise
○ Random noise should depend on sensitivity and on how strong a privacy guarantee we want

● Laplace distribution: p(z|µ, b) ∝ exp(-|z-µ|/b)


○ Like Gaussian, but with heavier tails
○ Like a symmetric exponential distribution
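A quick sketch (my own, not from the slides) that makes the “heavier tails” point concrete by sampling Laplace and Gaussian noise with matched standard deviations and comparing how often each lands far from zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Laplace(b) has standard deviation b*sqrt(2), so b = 1/sqrt(2) matches a
# standard Gaussian's standard deviation of 1.
laplace = rng.laplace(loc=0.0, scale=1 / np.sqrt(2), size=n)
gaussian = rng.normal(loc=0.0, scale=1.0, size=n)

# Heavier tails: large deviations are far more common under the Laplace.
print("P(|Laplace|  > 4) ~", np.mean(np.abs(laplace) > 4))   # roughly 3e-3
print("P(|Gaussian| > 4) ~", np.mean(np.abs(gaussian) > 4))  # roughly 6e-5
```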
An algorithm for ε-Differential Privacy: Laplace Mechanism
● Suppose algorithm A has sensitivity s
○ This means that for any neighboring D, D’, |A(D) - A(D’)| ≤ s
● For ε-differentially private response to algorithm A on dataset D:
○ Trusted curator returns A(D) + Laplace(s/ε)
○ Private: guaranteed ε-differentially private
○ Accurate: with high probability, returns an answer within a small multiple of s/ε of A(D) (a code sketch follows the example below)
Example: Laplace Mechanism for Grade Stats
Query: “What was the average score on MT2?”

SID    Score
3563   40
2400   32
4029   27
3309   20
9214   36
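A minimal sketch of this example (my own code, not the course staff’s): releasing the average MT2 score via the Laplace mechanism. The sensitivity calculation assumes scores are bounded in [0, 40] and that neighboring datasets differ by modifying one student’s row, so the average can change by at most (40 − 0)/n.

```python
import numpy as np

rng = np.random.default_rng()

def dp_average(scores, lower, upper, epsilon):
    """Release the average of bounded scores with the Laplace mechanism.
    Modifying one row changes the average by at most (upper - lower) / n,
    so that is the sensitivity s, and the noise scale is s / epsilon."""
    scores = np.clip(np.asarray(scores, dtype=float), lower, upper)
    sensitivity = (upper - lower) / len(scores)
    return scores.mean() + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

scores = [40, 32, 27, 20, 36]   # the five scores from the table above
print("true average:        ", np.mean(scores))                         # 31.0
print("DP average (eps=1.0):", dp_average(scores, 0, 40, epsilon=1.0))
print("DP average (eps=0.1):", dp_average(scores, 0, 40, epsilon=0.1))  # noisier
```

Smaller ε means a larger noise scale s/ε and a less accurate answer, which is exactly the privacy/accuracy trade-off from the summary slide.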
Differential privacy in the wild

https://www.microsoft.com/en-us/research/project/privacy-integrated-queries-pinq/
Early implementation of differential privacy spearheaded by Frank McSherry

https://github.com/google/rappor
Large-scale system implemented as part of Google Chrome. First major industry product feature involving differential privacy.

https://github.com/google/differential-privacy
Source: https://www.theverge.com/2019/9/5/20850465/google-differential-privacy-open-source-tool-privacy-data-sharing
The Verge: “It was probably the most bewildering part of Apple’s [2016] WWDC Keynote: in the middle of a
rundown of fancy new products arriving with iOS 10, Craig Federighi stopped to talk about abstract
mathematics. He was touting differential privacy, a statistical method that’s become a valuable tool for
protecting user data.” https://www.theverge.com/2016/6/17/11957782/apple-differential-privacy-ios-10-wwdc-2016
From: https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf
“In June 2016, Apple announced that it will deploy differential privacy for some user
data collection in order to ensure privacy of user data, even from Apple. The details of
Apple's approach remained sparse. Although several patents have since appeared
hinting at the algorithms that may be used to achieve differential privacy, they did not
include a precise explanation of the approach taken to privacy parameter choice.
Such choice and the overall approach to privacy budget use and management are
key questions for understanding the privacy protections provided by any deployment
of differential privacy.”
https://arxiv.org/abs/1709.02753
Stepping back
There are many challenges with putting differential privacy into practice:

Computational challenges

Implementation pitfalls

Political struggles

Legal and policy hurdles


2020 Census Controversy
● How does differentially private data affect redistricting? (Jessica Hullman)
○ Several other good articles here
● For the U.S. Census, keeping your data anonymous and useful is a tricky
balance (NPR)
● Data Scientists Square Off Over Trust and Privacy (Bloomberg)
● They Deliberately Put Errors in the Census (Matt Yglesias)
Biases Introduced by Post-Processing
What we saw today
Many seemingly private data schemes are not private at all

Need formal analysis as a guide: differential privacy

Methods: randomized response, Laplace mechanism for statistical queries

Real-world applications with active discussion
