Differential privacy


Differential privacy is a mathematically rigorous framework for releasing statistical information about datasets while protecting the privacy of individual data subjects. It enables a data holder to share aggregate patterns of the group while limiting information that is leaked about specific individuals. This is done by injecting carefully calibrated noise into statistical computations such that the utility of the statistic is preserved while provably limiting what can be inferred about any individual in the dataset.
Another way to describe differential privacy is as a constraint on the algorithms used to publish aggregate information about a statistical database which limits the disclosure of private information of records in the database. For example, differentially private algorithms are used by some government agencies to publish demographic information or other statistical aggregates while ensuring confidentiality of survey responses, and by companies to collect information about user behavior while controlling what is visible even to internal analysts.
Roughly, an algorithm is differentially private if an observer seeing its output cannot tell whether a particular individual's information was used in the computation. Differential privacy is often discussed in the context of identifying individuals whose information may be in a database. Although it does not directly refer to identification and reidentification attacks, differentially private algorithms provably resist such attacks.

ε-differential privacy

The 2006 article by Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith introduced the concept of ε-differential privacy, a mathematical definition for the privacy loss associated with any data release drawn from a statistical database.
The definition of ε-differential privacy requires that a change to one entry in a database only creates a small change in the probability distribution of the outputs of measurements, as seen by the attacker. The intuition for the definition of ε-differential privacy is that a person's privacy cannot be compromised by a statistical release if their data are not in the database. In differential privacy, each individual is given roughly the same privacy that would result from having their data removed. That is, the statistical functions run on the database should not be substantially affected by the removal, addition, or change of any individual in the data.
How much any individual contributes to the result of a database query depends in part on how many people's data are involved in the query. If the database contains data from a single person, that person's data contributes 100%. If the database contains data from a hundred people, each person's data contributes just 1%. The key insight of differential privacy is that as the query is made on the data of fewer and fewer people, more noise needs to be added to the query result to produce the same amount of privacy. Hence the name of the 2006 paper, "Calibrating noise to sensitivity in private data analysis."
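As an illustration of this calibration (a worked example not taken from the article, using the sensitivity and Laplace-mechanism notions defined later), consider the mean of $n$ records $x_1, \dots, x_n$, each bounded in $[0, 1]$:
\[
  f(D) = \frac{1}{n} \sum_{i=1}^{n} x_i,
  \qquad
  \Delta f = \max_{D_1, D_2 \text{ differ in one record}} \bigl| f(D_1) - f(D_2) \bigr| = \frac{1}{n}.
\]
Laplace noise of scale $\Delta f / \varepsilon = 1/(n\varepsilon)$ then suffices for ε-differential privacy, so with $n = 1$ the noise is as large as the value being released, while with $n = 100$ it is a hundred times smaller for the same ε.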

Definition

Let ε be a positive real number and δ a non-negative real number, and let $\mathcal{A}$ be a randomized algorithm that takes a dataset as input. Let $\textrm{im}\,\mathcal{A}$ denote the image of $\mathcal{A}$.
The algorithm $\mathcal{A}$ is said to provide $(\varepsilon, \delta)$-differential privacy if, for all datasets $D_1$ and $D_2$ that differ on a single element, and all subsets $S$ of $\textrm{im}\,\mathcal{A}$:
\[
  \Pr[\mathcal{A}(D_1) \in S] \leq e^{\varepsilon} \cdot \Pr[\mathcal{A}(D_2) \in S] + \delta,
\]
where the probability is taken over the randomness used by the algorithm. This definition is sometimes called "approximate differential privacy", with "pure differential privacy" being a special case when $\delta = 0$. In the latter case, the algorithm is commonly said to satisfy ε-differential privacy.
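As a concrete illustration of the definition, the sketch below (hypothetical code, not from the article) takes the output distributions of a mechanism on two neighboring datasets and computes the smallest ε for which the pure ($\delta = 0$) inequality holds; for pure differential privacy it suffices to compare probabilities outcome by outcome, since the bound for any subset $S$ follows by summing.

    import math

    def pure_dp_epsilon(p_d1, p_d2):
        """Smallest epsilon such that Pr[A(D1) = o] <= exp(eps) * Pr[A(D2) = o]
        and vice versa for every outcome o.  The distributions are given as
        dicts mapping outcome -> probability."""
        eps = 0.0
        for outcome in set(p_d1) | set(p_d2):
            a, b = p_d1.get(outcome, 0.0), p_d2.get(outcome, 0.0)
            if (a > 0) != (b > 0):
                return math.inf  # an outcome possible on only one dataset violates pure DP
            if a > 0:
                eps = max(eps, abs(math.log(a / b)))
        return eps

    # Hypothetical mechanism: it reports the sensitive bit truthfully with probability 0.75.
    print(pure_dp_epsilon({"yes": 0.75, "no": 0.25},
                          {"yes": 0.25, "no": 0.75}))  # ln 3, roughly 1.10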
Differential privacy offers strong and robust guarantees that facilitate modular design and analysis of differentially private mechanisms due to its composability, robustness to post-processing, and graceful degradation in the presence of correlated data.

Example

According to this definition, differential privacy is a condition on the release mechanism and not on the dataset itself. Intuitively, this means that for any two datasets that are similar, a given differentially private algorithm will behave approximately the same on both datasets. The definition gives a strong guarantee that presence or absence of an individual will not affect the final output of the algorithm significantly.
For example, assume we have a database of medical records $D_1$ where each record is a pair (Name, $X$), where $X$ is a Boolean denoting whether a person has diabetes or not. For example:
Name        Has Diabetes (X)
Ross        1
Monica      1
Joey        0
Phoebe      0
Chandler    1
Rachel      0

Now suppose a malicious user wants to find whether Chandler has diabetes or not. Suppose he also knows in which row of the database Chandler resides. Now suppose the adversary is only allowed to use a particular form of query $Q_i$ that returns the partial sum of the first $i$ rows of column $X$ in the database. In order to find Chandler's diabetes status the adversary executes $Q_5(D_1)$ and $Q_4(D_1)$, then computes their difference. In this example, $Q_5(D_1) = 3$ and $Q_4(D_1) = 2$, so their difference is 1. This indicates that the "Has Diabetes" field in Chandler's row must be 1. This example highlights how individual information can be compromised even without explicitly querying for the information of a specific individual.
Continuing this example, if we construct $D_2$ by replacing (Chandler, 1) with (Chandler, 0), then this malicious adversary will be able to distinguish $D_2$ from $D_1$ by computing $Q_5 - Q_4$ for each dataset. If the adversary were required to receive the values $Q_i$ via an ε-differentially private algorithm, for a sufficiently small ε, then he or she would be unable to distinguish between the two datasets.
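To make the attack and its mitigation concrete, the sketch below (illustrative code; the function names and the choice of ε are not from the article) runs the two partial-sum queries on the table above, first exactly and then through the kind of calibrated Laplace noise described later.

    import numpy as np

    rng = np.random.default_rng(0)

    # "Has Diabetes" column of the example table; Chandler is row 5 (index 4).
    has_diabetes = np.array([1, 1, 0, 0, 1, 0])  # Ross, Monica, Joey, Phoebe, Chandler, Rachel

    def partial_sum(column, i):
        """Q_i: the sum of the first i rows of the column."""
        return int(column[:i].sum())

    # Differencing attack on exact answers: Q_5 - Q_4 = 3 - 2 = 1 reveals Chandler's value.
    print(partial_sum(has_diabetes, 5) - partial_sum(has_diabetes, 4))

    # The same queries answered with Laplace noise.  Each partial sum has sensitivity 1,
    # so noise of scale 1/epsilon makes each answer epsilon-differentially private
    # (answering both costs 2*epsilon by the sequential composition discussed below).
    def noisy_partial_sum(column, i, epsilon):
        return partial_sum(column, i) + rng.laplace(scale=1.0 / epsilon)

    epsilon = 0.1
    print(noisy_partial_sum(has_diabetes, 5, epsilon)
          - noisy_partial_sum(has_diabetes, 4, epsilon))  # dominated by noise, no longer close to 1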

Composability and robustness to post-processing

Composability refers to the fact that the joint distribution of the outputs of differentially private mechanisms satisfies differential privacy.
  • Sequential composition. If we query an ε-differential privacy mechanism $t$ times, and the randomization of the mechanism is independent for each query, then the result would be $t\varepsilon$-differentially private. In the more general case, if there are $n$ independent mechanisms $\mathcal{M}_1, \dots, \mathcal{M}_n$, whose privacy guarantees are $\varepsilon_1, \dots, \varepsilon_n$-differential privacy, respectively, then any function $g$ of them, $g(\mathcal{M}_1, \dots, \mathcal{M}_n)$, is $\left(\sum_{i=1}^{n} \varepsilon_i\right)$-differentially private.
  • Parallel composition. If the previous mechanisms are computed on disjoint subsets of the private database, then the function $g$ would be $\left(\max_{i} \varepsilon_i\right)$-differentially private instead.
The other important property for modular use of differential privacy is robustness to post-processing. This is defined to mean that for any deterministic or randomized function $F$ defined over the image of the mechanism $\mathcal{A}$, if $\mathcal{A}$ satisfies ε-differential privacy, so does $F(\mathcal{A})$.
The property of composition permits modular construction and analysis of differentially private mechanisms and motivates the concept of the privacy loss budget. If all elements of a complex mechanism that access sensitive data are separately differentially private, so will be their combination, followed by arbitrary post-processing.
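The sketch below shows what such budget accounting can look like in practice (the class and method names are illustrative, not a real library API), applying the basic sequential and parallel composition rules above.

    class PrivacyBudget:
        """Tracks cumulative privacy loss against a total epsilon budget."""

        def __init__(self, total_epsilon):
            self.total = total_epsilon
            self.spent = 0.0

        def _charge(self, cost):
            if self.spent + cost > self.total:
                raise RuntimeError("privacy budget exceeded")
            self.spent += cost

        def spend_sequential(self, epsilons):
            """Independent queries over overlapping data: the epsilons add up."""
            self._charge(sum(epsilons))

        def spend_parallel(self, epsilons):
            """Queries over disjoint subsets of the data: only the maximum counts."""
            self._charge(max(epsilons))

    budget = PrivacyBudget(total_epsilon=1.0)
    budget.spend_sequential([0.2, 0.3])  # two queries on the same records cost 0.5
    budget.spend_parallel([0.4, 0.4])    # two queries on disjoint partitions cost 0.4
    print(budget.spent)                  # 0.9; post-processing of released outputs is free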

Group privacy

In general, ε-differential privacy is designed to protect the privacy between neighboring databases which differ only in one row. This means that no adversary with arbitrary auxiliary information can know if one particular participant submitted their information. However, this is also extendable: we may want to protect databases differing in $c$ rows, which amounts to an adversary with arbitrary auxiliary information knowing whether $c$ particular participants submitted their information. This can be achieved because if $c$ items change, the probability dilation is bounded by $e^{c\varepsilon}$ instead of $e^{\varepsilon}$; i.e., for $D_1$ and $D_2$ differing on $c$ items:
\[
  \Pr[\mathcal{A}(D_1) \in S] \leq e^{c\varepsilon} \cdot \Pr[\mathcal{A}(D_2) \in S].
\]
Thus setting ε instead to $\varepsilon / c$ achieves the desired result (protection of $c$ items). In other words, instead of each item being protected with ε-differential privacy, every group of $c$ items is now protected with ε-differential privacy (and each item is protected with $(\varepsilon / c)$-differential privacy).
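The bound can be seen by a standard chaining argument (sketched here, not spelled out in the text): move from $D_1$ to $D_2$ through intermediate databases $D_1 = E_0, E_1, \dots, E_c = D_2$ in which consecutive databases differ in a single item, and apply the definition at every step:
\[
  \Pr[\mathcal{A}(D_1) \in S]
  \leq e^{\varepsilon} \Pr[\mathcal{A}(E_1) \in S]
  \leq e^{2\varepsilon} \Pr[\mathcal{A}(E_2) \in S]
  \leq \cdots
  \leq e^{c\varepsilon} \Pr[\mathcal{A}(D_2) \in S].
\]
Running the mechanism with parameter $\varepsilon / c$ therefore bounds the dilation for any group of $c$ items by $e^{\varepsilon}$.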

Hypothesis testing interpretation

One can think of differential privacy as bounding the error rates in a hypothesis test. Consider two hypotheses:
  • : The individual's data is not in the dataset.
  • : The individual's data is in the dataset.
Then, there are two error rates:
  • Type I error (false positive): $P_{\mathrm{FP}} = \Pr[\text{reject } H_0 \mid H_0 \text{ is true}]$
  • Type II error (false negative): $P_{\mathrm{FN}} = \Pr[\text{accept } H_0 \mid H_1 \text{ is true}]$
Ideal protection would imply that both error rates are equal, but for a fixed ε setting, an attacker can achieve the following rates:
  • ε-differentially private mechanisms: $P_{\mathrm{FP}} + P_{\mathrm{FN}} \geq \frac{2}{1 + e^{\varepsilon}}$, with the attacker able to reach $P_{\mathrm{FP}} = P_{\mathrm{FN}} = \frac{1}{1 + e^{\varepsilon}}$ only in the worst case.

ε-differentially private mechanisms

Since differential privacy is a probabilistic concept, any differentially private mechanism is necessarily randomized. Some of these, like the Laplace mechanism described below, rely on adding controlled noise to the function that we want to compute. Others, like the exponential mechanism and posterior sampling, sample from a problem-dependent family of distributions instead.
An important definition with respect to ε-differentially private mechanisms is sensitivity. Let $d$ be a positive integer, $\mathcal{D}$ be a collection of datasets, and $f \colon \mathcal{D} \to \mathbb{R}^d$ be a function. The sensitivity of a function, denoted $\Delta f$, is defined by
\[
  \Delta f = \max \lVert f(D_1) - f(D_2) \rVert_1,
\]
where the maximum is over all pairs of datasets $D_1$ and $D_2$ in $\mathcal{D}$ differing in at most one element and $\lVert \cdot \rVert_1$ denotes the $\ell_1$ norm. In the example of the medical database above, if we consider $f$ to be the function $Q_i$, then the sensitivity of the function is one, since changing any one of the entries in the database causes the output of the function to change by either zero or one. This can be generalized to other metric spaces, and must be to make certain differentially private algorithms work, including adding noise from the Gaussian distribution instead of the Laplace distribution.
There are techniques by which we can create a differentially private algorithm for functions, with parameters that vary depending on their sensitivity.
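One such technique is the Laplace mechanism mentioned above; the wrapper below is a sketch under the usual assumptions (a real-valued query and a caller-supplied $\ell_1$ sensitivity), not a library interface.

    import numpy as np

    rng = np.random.default_rng(42)

    def laplace_mechanism(query, dataset, sensitivity, epsilon):
        """Perturb query(dataset) with Laplace noise of scale sensitivity / epsilon.
        The sensitivity is a property of the query, not of the particular dataset."""
        return query(dataset) + rng.laplace(scale=sensitivity / epsilon)

    # Example: a counting query over the medical table above has sensitivity 1,
    # since adding, removing, or changing one record changes the count by at most 1.
    has_diabetes = [1, 1, 0, 0, 1, 0]
    print(laplace_mechanism(sum, has_diabetes, sensitivity=1.0, epsilon=0.5))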