Database Privacy. Buffer Overflow Attacks
Database privacy
Two general methods to deal with database
privacy
– Query restriction: Limit which queries are allowed.
Allowed queries are answered correctly, while
disallowed queries are simply not answered
– Perturbation: Queries are answered “noisily”. Also includes
“scrubbing” (or suppressing) some of the data
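A minimal sketch of query restriction, assuming a query-set-size rule (one common kind of restriction; the slides do not say which rule is used). The threshold MIN_QUERY_SET_SIZE, the function name, and the salary data are hypothetical.

```python
from typing import Optional, Sequence

MIN_QUERY_SET_SIZE = 3  # hypothetical threshold, not from the slides


def restricted_sum(salaries: Sequence[float], rows: Sequence[int]) -> Optional[float]:
    """Answer SUM over the selected rows exactly, or refuse (None) if the set is too small."""
    selected = set(rows)
    if len(selected) < MIN_QUERY_SET_SIZE:
        return None  # disallowed query: simply not answered
    return sum(salaries[i] for i in selected)


salaries = [50_000, 62_000, 48_000, 75_000, 58_000]  # made-up data
allowed = restricted_sum(salaries, [0, 1, 2, 3])  # large enough set: exact answer
refused = restricted_sum(salaries, [4])           # single row: refused
```

Allowed queries get an exact answer and refused ones get nothing, which is the contrast with perturbation, where every query is answered but only approximately.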
Perturbation
Data perturbation: Add noise to entire table, then
answer queries accordingly (or release entire
perturbed dataset)
Output perturbation: Keep table intact, but add
noise to answers
(From: “Computer Security,” by Stallings)
Perturbation
Trade-off between privacy and utility!
No randomization – bad privacy but perfect utility
Complete randomization – perfect privacy but no
utility
Data perturbation
One technique: data swapping
– Substitute and/or swap any values, while maintaining
low-order statistics
– Restriction to any two columns is identical
Original          Swapped
F  Bio    4.0     F  Bio    3.0
F  CS     3.0     F  CS     4.0
F  EE     3.0     F  EE     4.0
F  Psych  4.0     F  Psych  3.0
M  Bio    3.0     M  Bio    4.0
M  CS     4.0     M  CS     3.0
M  EE     4.0     M  EE     3.0
M  Psych  3.0     M  Psych  4.0
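The claim about two-column restrictions can be checked mechanically. The sketch below compares the two tables from the slide as multisets of rows; the helper name restriction is hypothetical, and the three columns are assumed to be sex, major, and GPA.

```python
from collections import Counter
from itertools import combinations

# Rows copied from the slide; columns assumed to be (sex, major, GPA).
original = [
    ("F", "Bio", 4.0), ("F", "CS", 3.0), ("F", "EE", 3.0), ("F", "Psych", 4.0),
    ("M", "Bio", 3.0), ("M", "CS", 4.0), ("M", "EE", 4.0), ("M", "Psych", 3.0),
]
swapped = [
    ("F", "Bio", 3.0), ("F", "CS", 4.0), ("F", "EE", 4.0), ("F", "Psych", 3.0),
    ("M", "Bio", 4.0), ("M", "CS", 3.0), ("M", "EE", 3.0), ("M", "Psych", 4.0),
]


def restriction(table, cols):
    """Multiset of rows after keeping only the given column indices."""
    return Counter(tuple(row[c] for c in cols) for row in table)


# Every pair of columns yields the same multiset in both tables ...
pairs_match = all(
    restriction(original, cols) == restriction(swapped, cols)
    for cols in combinations(range(3), 2)
)
# ... even though the full three-column tables differ.
full_tables_match = restriction(original, (0, 1, 2)) == restriction(swapped, (0, 1, 2))
```

Here pairs_match is True while full_tables_match is False: statistics over any two columns survive the swap, but no individual row is trustworthy.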
Data perturbation
Second technique: (re)generate the table based on
derived distribution
– For each sensitive attribute, determine a probability
distribution that best matches the recorded data
– Generate fresh data according to the determined
distribution
– Populate the table with this fresh data
Queries on the database can never “learn” more
than what was learned initially
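A minimal sketch of the three steps above for a single sensitive column. It assumes the "best matching" distribution is simply the empirical one (value frequencies); the slides do not fix a family of distributions, and the function name and data are hypothetical.

```python
import random
from collections import Counter


def regenerate(column, rng):
    """Fit an empirical distribution to one column, then sample a fresh column from it."""
    counts = Counter(column)                  # step 1: determine the distribution
    values = list(counts)
    weights = [counts[v] for v in values]
    return rng.choices(values, weights=weights, k=len(column))  # step 2: fresh data


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
recorded_gpas = [4.0, 3.0, 3.0, 4.0, 3.0, 4.0, 4.0, 3.0]
fresh_gpas = regenerate(recorded_gpas, rng)   # step 3: populate the table with this
```

Queries are answered from fresh_gpas only, so they reveal nothing beyond the fitted distribution.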
Data perturbation
Data cleaning/scrubbing: remove sensitive data, or
data that can be used to breach anonymity
k-anonymity: ensure that any “identifying
information” is shared by at least k members of
the database
Example…
Example: 2-anonymity
(“-” marks a suppressed value; “x” marks a generalized ZIP digit)
Original                        Scrubbed
Race   ZIP    Smoke?  Cancer?   Race suppressed   ZIP generalized
Asian  02138  Y       Y         -                 Asian 0213x
Asian  02139  Y       N         -                 Asian 0213x
Asian  02141  N       Y         -                 Asian 0214x
Asian  02142  Y       Y         -                 Asian 0214x
Black  02138  N       N         -                 Black 0213x
Black  02139  N       Y         -                 Black 0213x
Black  02141  Y       Y         -                 Black 0214x
Black  02142  N       N         -                 Black 0214x
White  02138  Y       Y         -                 White 0213x
White  02139  N       N         -                 White 0213x
White  02141  Y       Y         -                 White 0214x
White  02142  Y       Y         -                 White 0214x
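A short check of the example, treating (Race, ZIP) as the identifying information. The function name anonymity_level is hypothetical; the rows are the slide's data.

```python
from collections import Counter

# (race, ZIP, smoke, cancer) as recorded, taken from the slide's example.
original = [
    ("Asian", "02138", "Y", "Y"), ("Asian", "02139", "Y", "N"),
    ("Asian", "02141", "N", "Y"), ("Asian", "02142", "Y", "Y"),
    ("Black", "02138", "N", "N"), ("Black", "02139", "N", "Y"),
    ("Black", "02141", "Y", "Y"), ("Black", "02142", "N", "N"),
    ("White", "02138", "Y", "Y"), ("White", "02139", "N", "N"),
    ("White", "02141", "Y", "Y"), ("White", "02142", "Y", "Y"),
]
# Released table: keep race, replace the last ZIP digit with "x".
released = [(race, zip_code[:4] + "x", smoke, cancer)
            for race, zip_code, smoke, cancer in original]


def anonymity_level(rows, quasi_identifier_cols):
    """Largest k such that every quasi-identifier combination is shared by >= k rows."""
    groups = Counter(tuple(row[c] for c in quasi_identifier_cols) for row in rows)
    return min(groups.values())


k_original = anonymity_level(original, (0, 1))
k_released = anonymity_level(released, (0, 1))
```

The recorded table has k_original equal to 1 (each race and ZIP pair is unique), while generalizing the ZIP raises k_released to 2.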
Problems with k-anonymity
Hard to find the right balance between what is
“scrubbed” and utility of the data
Not clear what security guarantees it provides
– For example, what if I know that the Asian person in
ZIP code 0214x smokes?
• Does not deal with out-of-band information
– What if all people who share some identifying
information share the same sensitive attribute?
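The last problem can be seen in the 2-anonymity example itself. The sketch below lists the (Race, ZIP) groups in the released table whose members all share the same Cancer? value; the function name homogeneous_groups is hypothetical.

```python
from collections import defaultdict

# (race, generalized ZIP, cancer) from the 2-anonymous example table.
released = [
    ("Asian", "0213x", "Y"), ("Asian", "0213x", "N"),
    ("Asian", "0214x", "Y"), ("Asian", "0214x", "Y"),
    ("Black", "0213x", "N"), ("Black", "0213x", "Y"),
    ("Black", "0214x", "Y"), ("Black", "0214x", "N"),
    ("White", "0213x", "Y"), ("White", "0213x", "N"),
    ("White", "0214x", "Y"), ("White", "0214x", "Y"),
]


def homogeneous_groups(rows):
    """Groups whose members all have the same sensitive value."""
    sensitive_by_group = defaultdict(set)
    for race, zip_prefix, cancer in rows:
        sensitive_by_group[(race, zip_prefix)].add(cancer)
    return sorted(g for g, values in sensitive_by_group.items() if len(values) == 1)


leaky = homogeneous_groups(released)
```

Two groups come back, (Asian, 0214x) and (White, 0214x): anyone known to belong to either group is revealed to have Cancer? = Y, even though the table is 2-anonymous.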
Output perturbation
One approach: replace the query with a perturbed
query, then return an exact answer to that
– E.g., a query over some set of entries C is answered
using some (randomly-determined) subset C’ ⊆ C
– User only learns the answer, not C’
Second approach: add noise to the exact answer
(to the original query)
– E.g., answer SUM(salary, S) with
SUM(salary, S) + noise
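Both approaches can be sketched for a SUM query. The sampling probability keep_prob, the uniform noise with bound noise_bound, the function names, and the salary data are all hypothetical choices; the slides do not specify how C’ or the noise is drawn.

```python
import random


def subset_sum(salaries, query_rows, rng, keep_prob=0.8):
    """First approach: exact answer, but over a random subset C' of the queried set C."""
    kept = [i for i in query_rows if rng.random() < keep_prob]  # C' is never revealed
    return sum(salaries[i] for i in kept)


def noisy_sum(salaries, query_rows, rng, noise_bound=1000.0):
    """Second approach: compute the exact answer, then add bounded random noise."""
    exact_answer = sum(salaries[i] for i in query_rows)
    return exact_answer + rng.uniform(-noise_bound, noise_bound)


rng = random.Random(1)  # fixed seed only to make the sketch reproducible
salaries = [50_000, 62_000, 48_000, 75_000, 58_000]  # made-up data
S = [0, 2, 4]
exact = sum(salaries[i] for i in S)
answer_a = subset_sum(salaries, S, rng)
answer_b = noisy_sum(salaries, S, rng)
```

In both cases the user sees only the returned number, not the subset that was used or the noise that was added.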
A negative result [Dinur-Nissim]
Heavily paraphrased:
Given a database with n rows, if roughly n queries
are made to the database then essentially the entire
database can be reconstructed even if O(n^(1/2)) noise
is added to each answer
On the positive side, it is known that very small
error can be used when the total number of queries
is kept small
Formally defining privacy
A problem inherent in all the approaches we have
discussed so far (and the source of many of the
problems we have seen) is that no definition of
“privacy” is offered
Recently, there has been work addressing exactly
this point
– Developing definitions
– Provably secure schemes!
A definition of privacy
Differential privacy [Dwork et al.]
Roughly speaking:
– For each row r of the database (representing, say, an
individual), the distribution of answers when r is
included in the database is “close” to the distribution of
answers when r is not included in the database
• No reason for r not to include themselves in the database!
– Note: can’t hope for “closeness” better than 1/|DB|
Further refining/extending this definition, and
determining when it can be applied, is an active
area of research
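One standard way to make “close” precise, given here as background rather than taken from the slides, is ε-differential privacy. A randomized mechanism M is ε-differentially private if, for every pair of databases D and D' that differ in a single row r, and every set S of possible answers:

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

Smaller ε means the two answer distributions are closer, so including r changes what an observer can learn by less.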
Achieving privacy
A “converse” to the Dinur-Nissim result is that
adding some (carefully-generated) noise, and
limiting the number of queries, can be proven to
achieve privacy
An active area of research
Achieving privacy
E.g., answer SUM(salary, S) with
SUM(salary, S) + noise,
where the magnitude of the noise depends on the
range of plausible salaries (but not on |S|!)
Automatically handles multiple (arbitrary) queries,
though privacy degrades as more queries are made
Gives formal guarantees
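A minimal sketch of such a noisy SUM, assuming Laplace noise (the usual choice for differential privacy, though the slides do not name a distribution). The parameters max_salary and epsilon and all function names are hypothetical; salaries are assumed to lie in [0, max_salary].

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_sum(salaries, query_rows, max_salary, epsilon, rng):
    """SUM with noise scaled to the salary range, independent of len(query_rows)."""
    # Clamp each salary so one row can change the sum by at most max_salary.
    exact = sum(min(max(salaries[i], 0.0), max_salary) for i in query_rows)
    return exact + laplace_noise(max_salary / epsilon, rng)


rng = random.Random(42)  # fixed seed only to make the sketch reproducible
salaries = [50_000.0, 62_000.0, 48_000.0, 75_000.0, 58_000.0]  # made-up data
answer = private_sum(salaries, [0, 2, 4], max_salary=200_000.0, epsilon=0.5, rng=rng)
```

The noise scale max_salary / epsilon does not depend on |S|, so the relative error shrinks as the queried set grows. Each further query spends more of the privacy budget, which is why privacy degrades as more queries are made.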
Buffer overflows
Previous focus in this class has been on secure
protocols and algorithms
For real-world security, it is not enough for the
protocol/algorithm to be secure -- the
implementation must also be secure
– We have seen this already when we talked about
side-channel attacks
– Here, the attacks are active rather than passive
– Also, here the attacks exploit the way programs are run
by the machine/OS
Importance of the problem
Most common cause of Internet attacks
– Over 50% of CERT advisories related to buffer
overflow vulnerabilities
Morris worm (1988)
– 6,000 machines infected
CodeRed (2001)
– 300,000 machines infected in 14 hours
Etc.
Buffer overflows
Fixed-sized buffer that is to be filled with
unknown data, usually provided directly by user
If more data “stuffed” into the buffer than it can
hold, that data spills over into adjacent memory
If this data is executable code, the victim’s
machine may be tricked into running it
Can overflow on the stack or the heap…
A glimpse into memory
[Figure: registers (ebp, esp, eip) and memory regions: stack (holding the function frame), heap, code]
Stack overview
Each function that is executed is allocated its own
frame on the stack
When one function calls another, a new frame is
initialized and placed (pushed) on the stack
When a function is finished executing, its frame is
taken off (popped) the stack
Function calls
[Figure: stack frame, with its local vars]
gets(color)
– What if I type “blue 1”?
– (Actually, need to be more clever than this)
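A toy model of the gets(color) example, written in Python so that it runs safely. The frame layout (a 4-byte color buffer followed directly by a 1-byte local called price) is a hypothetical assumption, as are all names; real compilers may order locals differently.

```python
COLOR_OFFSET, COLOR_SIZE = 0, 4   # hypothetical: char color[4]
PRICE_OFFSET = COLOR_OFFSET + COLOR_SIZE  # hypothetical adjacent local `price`


def unsafe_gets(memory, start, user_input):
    """Copy the input plus a terminating NUL into memory with no bounds check, like gets()."""
    data = user_input + b"\x00"
    memory[start:start + len(data)] = data


def run(user_input):
    """Build a fresh toy frame, read input into `color`, and report `price` afterwards."""
    frame = bytearray(8)
    frame[PRICE_OFFSET] = 9  # price starts out as 9
    unsafe_gets(frame, COLOR_OFFSET, user_input)
    return frame[PRICE_OFFSET]


safe_price = run(b"red")          # fits in the buffer: price untouched
naive_price = run(b"blue 1")      # overflows, but writes the space character (0x20)
crafted_price = run(b"blue\x01")  # overflows with the raw byte value 1
```

Typing “blue 1” does overwrite price, but with the character code 32 rather than the number 1; setting it to 1 needs the raw byte 0x01, which is one reading of why the attacker has to be more clever.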
More devious examples…
strcpy(buf, str)
[Figure: stack layout: buf, saved ebp, ret addr, frame of the calling function; the overflow runs from buf toward ebp and ret addr]
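The same kind of toy model shows why strcpy(buf, str) is more dangerous. The layout below (an 8-byte buf, then a 4-byte saved ebp, then a 4-byte return address) and all addresses are hypothetical; this simulates the memory effect only and executes nothing.

```python
import struct

BUF_SIZE = 8
EBP_OFFSET = BUF_SIZE       # saved ebp sits right after buf in this toy layout
RET_OFFSET = BUF_SIZE + 4   # return address follows the saved ebp


def make_frame(saved_ebp, ret_addr):
    """Toy frame: zeroed buf, then little-endian saved ebp and return address."""
    return bytearray(BUF_SIZE) + struct.pack("<II", saved_ebp, ret_addr)


def unsafe_strcpy(frame, src):
    """Copy src and a NUL terminator starting at buf, with no length check (like strcpy)."""
    data = src + b"\x00"
    frame[0:len(data)] = data


def return_address(frame):
    return struct.unpack_from("<I", frame, RET_OFFSET)[0]


frame = make_frame(saved_ebp=0xBFFFF000, ret_addr=0x08048400)  # made-up addresses
original_ret = return_address(frame)

# Fill buf and the saved ebp with padding, then supply a chosen return address.
payload = b"A" * (BUF_SIZE + 4) + struct.pack("<I", 0x41424344)
unsafe_strcpy(frame, payload)
hijacked_ret = return_address(frame)
```

After the copy, the return address holds the attacker-chosen value 0x41424344 instead of 0x08048400, so when the function returns, control goes wherever the attacker pointed it.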