
Privacy in Machine Learning

Data 102 Fall 2023 Lecture 24

Please pick up a penny from the front of the room for an activity later in class. At the end
of class, please take it with you or return it to the front.

Announcements

● Project checkpoint deadlines extended

Special thanks to Prof. Moritz Hardt and Prof. Jacob Steinhardt for most of the content in today’s slides.
Weekly Outline
● Last time: High-Dimensional Regression

● Today: Privacy in Machine Learning


○ Why it’s hard, and how to get it wrong
○ Randomized response: a simple way of getting it right without trust
○ Trusted curator model and differential privacy
○ Current events: 2020 Census and differential privacy

● Next time: Robustness and Generalization


○ Case studies: applying course ideas to real-world scenarios

Why does privacy matter?
● Many valuable applications of ML require use of sensitive data
○ Health, census, phone/location, wearables, etc.

● We want to perform useful analysis while protecting individuals’ privacy


● This is harder than it sounds: many seemingly “private” schemes are
vulnerable to attack
● Key idea: adopt security mindset
○ Think about potential attacks and how to guard against them
Personally Identifiable Information (PII)
● Anything that can identify an individual
○ Name, SSN, email, fingerprint, drivers license #, etc.
● Many security & privacy standards have requirements on how to store PII
securely
● Common approach: remove all PII from a dataset and call it “anonymized”
○ This isn’t always good enough

● Related to (but not the same as) Protected Health Information (PHI)
○ Protected under HIPAA (US medical privacy law)
○ Includes medical information like doctors’ notes, lab results, etc.
○ Only applies to data in a medical setting, and is specific to health data
Linkage Attack: Sweeney’s Surprise
● 1997: MA’s state insurance org released data on hospital visits by state employees
● Governor: “Your data is safely anonymized!” (name, address, SSN etc., removed)

[Figure: Venn diagram. Medical data (ethnicity, visit data, diagnosis, procedure, medication, charge, ...) and the public voter list (name, address, date registered, party affiliation, date last voted) overlap on three shared attributes: ZIP, date of birth, and sex. Pictured: Latanya Sweeney.]
● Lesson: removing private identifiers might not be enough!
Linkage Attack: Netflix Prize

[Figure: several users’ anonymized Netflix thumbs-up/thumbs-down rating patterns matched against public star ratings (8/10, 2/10, 3/10, 9/10) posted under the username “moviefan777”; the matching pattern links the anonymous records to the public profile.]
● Lesson: even without anything obviously “identifiable”, individuals can still be identified in “anonymized” data releases
Another attempt: k-Anonymity
● Observation: if only one person has some combination of traits, then we can
re-identify that person
● Idea: divide columns into quasi-identifiers (identifiable stuff) and sensitive
attributes (other stuff)
○ e.g., ZIP, DOB, and sex could be quasi-identifiers
● Requirement: for any combination of quasi-identifiers there must be at least k
rows in the dataset with that combination
○ e.g., if k=3, then we need at least 3 people for every combination of ZIP/DOB/sex

● Does not guarantee privacy: why?


○ Any column could be a (quasi-)identifier given enough public data!
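As an illustration (my own sketch, not from the slides), here is a minimal check of the k-anonymity requirement for a table with hypothetical columns; zip, dob, and sex play the role of quasi-identifiers and diagnosis is the sensitive attribute.

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list, k: int) -> bool:
    """True if every combination of quasi-identifier values appears in >= k rows."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Hypothetical records: three people share one quasi-identifier combination,
# but the last row is unique, so the table is not 3-anonymous.
df = pd.DataFrame({
    "zip": ["94704", "94704", "94704", "94110"],
    "dob": ["1990-01-01", "1990-01-01", "1990-01-01", "1985-05-05"],
    "sex": ["F", "F", "F", "M"],
    "diagnosis": ["flu", "asthma", "flu", "diabetes"],  # sensitive attribute
})

print(is_k_anonymous(df, ["zip", "dob", "sex"], k=3))  # False
```

Even when a check like this passes, the caveat above still applies: with enough public data, any column can act as a quasi-identifier.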
De-identified data vs aggregate data
● Releasing anonymized data is hard because many columns and attributes
can identify individuals
● Re-identification attacks: discover the identity of individuals from
“anonymized” data
○ Common, and come in many different flavors

● Idea: only release aggregate data, with no individual-level data


○ Average across an entire population, etc.
○ Still not good enough!
Aggregate Data Example: Exam Grades
SID    Score
3563   40
2400   32
4029   27
3309   20
9214   36
Basics of Genome-Wide Association Studies (GWAS)
● Recall: DNA is made up of A/C/T/G
● Humans have identical values in most of the genome (~99.6%)
● Focus on locations where >1% of people have “unusual” values
○ e.g., “most people have G here, but 3.2% of people have C”
○ These are called single nucleotide polymorphisms (SNPs)
○ 3.2% is called the minor allele frequency (MAF): how common is the
mutation?
● Goal: find associations between SNPs and disease
● Example Dataset
○ 1000 people with a disease of interest
○ For 100,000 locations, release MAFs at each one: “what percentage of
this population has a mutation in this spot?”
GWAS Attack: Can I find out if someone is in the dataset?
● Answer: yes! (using public data and that person’s DNA)
● Dataset: 100,000 locations, MAFs at each one

Test population:
SNP   1      2      3      …   100000
MAF   0.02   0.03   0.05   …   0.02

Ramesh’s DNA (minor allele present?):
SNP   1      2      3      …   100000
      NO     NO     YES    …   YES

Reference population (HapMap data, public):
SNP   1      2      3      …   100000
MAF   0.01   0.04   0.04   …   0.01

Homer et al. 2008

● Conclusion: comparing Ramesh’s DNA to the test-population MAFs versus the public reference MAFs suggests the test dataset probably contains Ramesh’s DNA.
Interesting but typical characteristics
● Only innocuous looking data was released
○ Data was HIPAA compliant
● Data curator is trusted (NIH)
● Attack uses background knowledge (HapMap data set)
available in public domain
● Attack uses unanticipated algorithm
● Curator pulled data sets (now hard to get)
● Technical principle: Many weak signals combine into one
strong signal
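To see how “many weak signals combine into one strong signal”, here is a hedged simulation sketch in the spirit of the Homer et al. attack (made-up frequencies, not the study’s actual statistic or data): at every SNP we ask whether the target’s genotype sits closer to the released test-population MAFs or to the public reference MAFs, and sum those tiny per-SNP differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n_snps = 100_000

# Hypothetical minor allele frequencies: a public reference population and a
# test (study) population whose frequencies differ slightly at each SNP.
ref_maf = rng.uniform(0.01, 0.10, n_snps)
test_maf = np.clip(ref_maf + rng.normal(0.0, 0.01, n_snps), 0.001, 0.999)

# One person drawn from the test population (really in the study)
# and one drawn from the reference population (not in the study).
in_study = rng.binomial(1, test_maf).astype(float)
not_in_study = rng.binomial(1, ref_maf).astype(float)

def membership_score(person, test_maf, ref_maf):
    # Positive per-SNP contribution when the person's genotype is closer to the
    # test-population frequency than to the public reference frequency.
    return np.sum(np.abs(person - ref_maf) - np.abs(person - test_maf))

print("score, person in the study:    ", membership_score(in_study, test_maf, ref_maf))
print("score, person not in the study:", membership_score(not_in_study, test_maf, ref_maf))
```

Each per-SNP term is almost pure noise, but summed over 100,000 SNPs the score for someone who really is in the study separates cleanly from the score for someone who is not.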
Fundamental Law of Information Recovery
“Overly accurate information about too many queries to a data source allows for
partial or full reconstruction of data (i.e., blatant non-privacy).”
- Cynthia Dwork, 2014

● With enough information, anyone can be identified


● World population: ~8 billion people
○ 33 bits (2^33 ≈ 8.6 billion)
○ Rule of thumb: 33 bits of information is enough to uniquely identify any individual (Arvind
Narayanan)
● Examples
○ Netflix challenge: movie preferences
○ Browsing history or even just browser settings
○ Many more
Privacy is Hard
● Releasing information on individuals has a high risk of revealing private info
● Many seemingly reasonable attempts at anonymization/privatization fail
○ Removing “identifiable” info (Sweeney’s Surprise)
○ Only releasing anonymized movie tastes (Netflix Prize)
○ Releasing aggregate information only (exam scores, NIH GWAS)

● We need a more formal way of guaranteeing privacy


● Key ideas
○ Introduce noise: lose some accuracy to guarantee privacy
○ Quantify how much privacy we want and how much accuracy loss we’re willing to accept
Part II: Ensuring Privacy
Please pick up a penny from the front of the room for an activity later in class. At the end of
class, please take it with you or return it to the front.
Who is responsible for keeping data private?
● What we can do depends on who we trust to preserve our privacy and hold on to our private data

● Two models
○ Trust nobody: you keep your private data and nobody else gets to see it
○ Trusted curator (e.g., NIH, Census, etc.): we trust the curator to hold on to private information,
but only release “safe” de-identified information to the public
Keeping data private: randomized response
● Setup: I’m conducting a yes/no survey, but you don’t trust me enough to give me your
(private) answer
○ Concern: response bias if I ask people anyway

● Idea: randomize your response (see the sketch below)

○ With probability p (≥ 0.5), give the true answer
○ With probability (1-p), give an answer chosen uniformly at random

● Exercise: Let b be your true answer, and b’ be the answer you gave me. If
E[b] = q, what is E[b’]?
○ See the answer in next week’s discussion!
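A minimal sketch of the respondent-side randomization described above (p = 0.75 here is just an illustrative choice; the de-biasing of the aggregate answers is exactly the exercise above, so it is left out):

```python
import random

def randomized_response(true_answer: bool, p: float = 0.75) -> bool:
    """With probability p, report the true answer; otherwise report a
    uniformly random yes/no. Only the randomized answer is ever shared."""
    assert p >= 0.5
    if random.random() < p:
        return true_answer               # tell the truth
    return random.random() < 0.5         # answer uniformly at random

# A respondent whose true answer is "yes" (True): any single reported answer
# is deniable, because it may have been the random one.
print(randomized_response(True))
```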
Randomized response example
● Have you ever eaten food that fell on the ground outside? (clean floors
don’t count)

● Flip your coin: if heads, answer randomly, otherwise answer truthfully

● Why is it a good idea to always flip twice regardless of the first flip outcome?
Definitions
● Two datasets D, D’ are called neighboring if they differ by at most one row
○ One row can be deleted, added, or modified

● Informally: An algorithm A(D) is called differentially private if A(D) is not too different from A(D’) (when D and D’ are neighbors)
○ Example: GWAS aggregate statistics computed with my data should be similar to those computed without my data
○ Intuition: the algorithm’s result shouldn’t change (much) whether I’m in or out of the dataset

● Provides a strong or “worst-case” guarantee: even if an attacker knows everything else about me and everyone else in the dataset, they (probably) can’t tell whether I’m in it
Differential Privacy: Formal Definition
● For all events S and for all neighboring datasets D, D’, a randomized algorithm A is called
ε-differentially private if

P[A(D) ∈ S] ≤ e^ε · P[A(D’) ∈ S]

It follows that:

e^(−ε) ≤ P[A(D) ∈ S] / P[A(D’) ∈ S] ≤ e^ε
Discussion Question
● Consider a deterministic algorithm that always returns 0.7 (regardless of the
data). Is this algorithm ε-differentially private? If so, for what value of ε?
Discussion Question
● Do larger values of ε correspond to stronger or weaker privacy guarantees?
Why?
Differential Privacy: Summary
● Differential privacy is a framework (not an algorithm) for describing how a
randomized algorithm can provide privacy guarantees
● Informally: An algorithm A(D) is called differentially private if A(D) is not too
different from A(D’) (when D and D’ are neighbors)
○ Neighboring datasets differ by at most one element
● Formally: A(D) is ε-differentially private if P[A(D) ∈ S] ≤ e^ε · P[A(D’) ∈ S] for all events S and all neighboring datasets D, D’
● ε describes how strong our algorithm’s privacy guarantee is
○ Small values of ε correspond to strong privacy guarantees (accuracy may suffer)
○ Large values of ε correspond to weak privacy guarantees (can achieve higher accuracy)
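As a worked example of the definition (my own calculation, not from the slides): the coin-flipping randomized response scheme from earlier, where you answer truthfully on tails and uniformly at random on heads, satisfies ε-differential privacy with ε = ln 3. Here a “dataset” is a single respondent’s true answer, so the neighboring datasets are “truth = yes” and “truth = no”.

```latex
% Fair coin: answer truthfully with prob. 1/2, otherwise uniformly at random.
\Pr[\text{report yes} \mid \text{truth yes}] = \tfrac{1}{2} + \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{3}{4},
\qquad
\Pr[\text{report yes} \mid \text{truth no}]  = \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{1}{4}.

% Worst-case ratio over outputs (the "report no" case gives the same ratio):
\frac{\Pr[A(D) \in S]}{\Pr[A(D') \in S]} \le \frac{3/4}{1/4} = 3 = e^{\varepsilon}
\quad\Longrightarrow\quad
\varepsilon = \ln 3 \approx 1.1.
```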
Trusted curator model

Query: How many Hispanic women live in Beechum County?
Answer: 3,487

Query: How many Black men live in Faulconer County?
Answer: 1,792
Trusted curator model

The analyst sends a query q to the trusted curator, who holds the private data D and responds with:

Answer: q(D) + noise

Goals:
● q(D) + noise should be ε-differentially private
● |noise| should be as small as possible while guaranteeing privacy
Trusted curator model: examples

NIH GWAS data (example from earlier)


Google Maps “popular times” feature

US Census data
Trusted curator model: protecting individuals

Query: How many men over 70 live in Yakutat County, Alaska?

Query: How many Asian men over 70 live in Yakutat County, Alaska?
Answer: 1

With counts this small, exact answers to overlapping queries can single out an individual, so even a trusted curator needs to add noise.
How do we achieve differential privacy?
● Definition: The sensitivity of an algorithm A is how much it can change for neighboring datasets:
s = max over all neighboring D, D’ of |A(D) − A(D’)|
○ If an algorithm has high sensitivity, we’ll need a lot of noise to “hide” the change from D to D’
○ If an algorithm has low sensitivity, we don’t need as much noise
● For a deterministic query q, the curator returns q(D) + random noise
○ Random noise should depend on sensitivity and on how strong a privacy guarantee we want

● Laplace distribution: p(z|µ, b) ∝ exp(-|z-µ|/b)


○ Like Gaussian, but with heavier tails
○ Like a symmetric exponential distribution
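A quick sketch (my own, not from the slides) that makes the “heavier tails” point concrete by sampling Laplace and Gaussian noise with matched standard deviations and comparing how often each lands far from zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Laplace(b) has standard deviation b*sqrt(2), so b = 1/sqrt(2) matches a
# standard Gaussian's standard deviation of 1.
laplace = rng.laplace(loc=0.0, scale=1 / np.sqrt(2), size=n)
gaussian = rng.normal(loc=0.0, scale=1.0, size=n)

# Heavier tails: large deviations are far more common under the Laplace.
print("P(|Laplace|  > 4) ~", np.mean(np.abs(laplace) > 4))   # roughly 3e-3
print("P(|Gaussian| > 4) ~", np.mean(np.abs(gaussian) > 4))  # roughly 6e-5
```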
An algorithm for ε-Differential Privacy: Laplace Mechanism
● Suppose algorithm A has sensitivity s
○ This means that for any neighboring D, D’, |A(D) - A(D’)| ≤ s
● For ε-differentially private response to algorithm A on dataset D:
○ Trusted curator returns A(D) + Laplace(s/ε)
○ Private: guaranteed ε-differentially private
○ Accurate: with high probability, returns an answer within a small multiple of s/ε of A(D) (a code sketch follows the example below)
Example: Laplace Mechanism for Grade Stats
Query: “What was the average score on MT2?”

SID    Score
3563   40
2400   32
4029   27
3309   20
9214   36
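A minimal sketch of this example (my own code, not the course staff’s): releasing the average MT2 score via the Laplace mechanism. The sensitivity calculation assumes scores are bounded in [0, 40] and that neighboring datasets differ by modifying one student’s row, so the average can change by at most (40 − 0)/n.

```python
import numpy as np

rng = np.random.default_rng()

def dp_average(scores, lower, upper, epsilon):
    """Release the average of bounded scores with the Laplace mechanism.
    Modifying one row changes the average by at most (upper - lower) / n,
    so that is the sensitivity s, and the noise scale is s / epsilon."""
    scores = np.clip(np.asarray(scores, dtype=float), lower, upper)
    sensitivity = (upper - lower) / len(scores)
    return scores.mean() + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

scores = [40, 32, 27, 20, 36]   # the five scores from the table above
print("true average:        ", np.mean(scores))                         # 31.0
print("DP average (eps=1.0):", dp_average(scores, 0, 40, epsilon=1.0))
print("DP average (eps=0.1):", dp_average(scores, 0, 40, epsilon=0.1))  # noisier
```

Smaller ε means a larger noise scale s/ε and a less accurate answer, which is exactly the privacy/accuracy trade-off from the summary slide.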
Differential privacy in the wild

https://www.microsoft.com/en-us/research/project/privacy-integrated-queries-pinq/
Early implementation of differential privacy spearheaded by Frank McSherry

https://github.com/google/rappor
Large-scale system implemented as part of Google Chrome. First major industry product feature involving differential privacy.

https://github.com/google/differential-privacy
Source: https://www.theverge.com/2019/9/5/20850465/google-differential-privacy-open-source-tool-privacy-data-sharing
The Verge: “It was probably the most bewildering part of Apple’s [2016] WWDC Keynote: in the middle of a
rundown of fancy new products arriving with iOS 10, Craig Federighi stopped to talk about abstract
mathematics. He was touting differential privacy, a statistical method that’s become a valuable tool for
protecting user data.” https://www.theverge.com/2016/6/17/11957782/apple-differential-privacy-ios-10-wwdc-2016
From: https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf
“In June 2016, Apple announced that it will deploy differential privacy for some user
data collection in order to ensure privacy of user data, even from Apple. The details of
Apple's approach remained sparse. Although several patents have since appeared
hinting at the algorithms that may be used to achieve differential privacy, they did not
include a precise explanation of the approach taken to privacy parameter choice.
Such choice and the overall approach to privacy budget use and management are
key questions for understanding the privacy protections provided by any deployment
of differential privacy.”
https://arxiv.org/abs/1709.02753
Stepping back
There are many challenges with putting differential privacy into practice:

Computational challenges

Implementation pitfalls

Political struggles

Legal and policy hurdles


2020 Census Controversy
● How does differentially private data affect redistricting? (Jessica Hullman)
○ Several other good articles here
● For the U.S. Census, keeping your data anonymous and useful is a tricky
balance (NPR)
● Data Scientists Square Off Over Trust and Privacy (Bloomberg)
● They Deliberately Put Errors in the Census (Matt Yglesias)
Biases Introduced by Post-Processing
What we saw today
Many seemingly private data schemes are not private at all

Need formal analysis as a guide: differential privacy

Methods: randomized response, Laplace mechanism for statistical queries

Real-world applications with active discussion
