Differential privacy is a mathematical definition of privacy, and the term is also used for the methods and techniques that satisfy it: ways of analyzing data sets and publishing aggregated results without directly affecting the privacy of any specific individuals contained within the original data sets.
The technique is often used to train algorithms and to gain statistical information about large data sets without directly affecting data subjects.
What makes these techniques interesting is that they serve as a compromise between consumers' desire for privacy and organizations' need for data.
How does differential privacy work?
Differential privacy is a state achieved by using techniques originally devised by cryptographers for the analysis of statistical data sets using algorithms.
When data processors refer to data as 'differentially private', they are referring to various techniques used to add noise to data sets, or to the results computed from them, so that the data subjects whose records were used to obtain those outputs can no longer be pinpointed.
Techniques for arriving at differential privacy are becoming increasingly popular as a method for providing consumer privacy – while still gaining important aggregated data for the organization.
The techniques allow for useful patterns and analysis to be acquired. Thus, it is best understood as a group of techniques employed for analyzing Big Data by companies that want to promise individuals' privacy within their policies – while still being able to exploit user data in an aggregated manner.
For example, differentially private data sets are used by governments to publish demographic information and aggregated statistics collected via a national census or other surveys. This is useful because it allows for observations to be published without affecting the confidentiality and privacy of those individual citizens who took part in the survey.
Data is considered differentially private if the data observed from the algorithm's output cannot be attributed to any individual, and if it is impossible to tell from the output which individual people's data was used to reach the statistical conclusions.
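The informal description above is usually made precise with the standard definition of epsilon-differential privacy. The symbols here are not taken from the text above and are introduced only for illustration: M is the randomized algorithm, D and D' are any two data sets that differ in one individual's record, S is any set of possible outputs, and epsilon is the privacy parameter (smaller values mean stronger privacy).

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

In words: whether or not any one person's record is included, every possible output remains almost equally likely, so the output reveals very little about that person.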
Why is differential privacy important?
In a world where consumer privacy is increasingly desirable and legislated for by governments, differential privacy provides the framework for data processing without unnecessary invasion of privacy.
This not only protects each individual's privacy, but theoretically also permits organizations, businesses, and government entities to process data in such a way that they can remain compliant with privacy regulations.
It is important to understand that when your data is included in a database, this will ordinarily have a direct impact on your privacy.
When data is analyzed solely as an output of algorithms that produce differentially private outputs, your personally identifiable information is never directly accessible as an output. This ensures that the results of statistical processing cannot be re-attributed to the individuals who made up the original dataset.
A differentially private system ensures that data is always structured in such a way that it is impossible to tell whether a particular user's data was used to gain the final output.
This is an important distinction because studies have shown that even when personal data has been 'anonymized' by stripping it of identifiers, processes exist for re-identifying that data. This is why anonymized data that applies to just one data subject is far more invasive and problematic than differentially private outputs.
Global vs. Local Differential Privacy
There are two primary models for differential privacy: global and local. Each affects how data is processed and the level of privacy that is afforded to data subjects. Below, each model is explained, along with its potential drawbacks.
In the global model, a single curator (a tech company, for example) has control over the original data sets used to create secure aggregated outputs. That curator analyzes the data and adds noise before publishing the data in a differentially private state.
Under these circumstances, the data curator initially has access to individual inputs, and, for this reason, those users' privacy is impacted by the data processor itself. However, the published reports are differentially private and nobody who sees those published details can single out any individual from the data.
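As a rough illustration of the global model, the sketch below shows a curator that holds raw records and adds Laplace noise to a count before publishing it. The Laplace mechanism is one common way of adding noise and is an assumption here, since the text does not name a specific technique. The function names, the sample records, and the epsilon value are all hypothetical, and the code is not hardened for production use (for example, against floating-point attacks).

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Count matching records, then add Laplace noise before release.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical raw data held by the curator (never published directly).
raw_records = [{"age": age} for age in range(18, 90)]

rng = random.Random(42)
noisy = private_count(raw_records, lambda r: r["age"] >= 65, epsilon=0.5, rng=rng)
print(f"Published count of people aged 65+: {noisy:.1f}")
```

Only the noisy count leaves the curator. The raw records stay in the central database, which is why this model depends on trusting the curator.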
One can only consider this global model differentially private if the data processor has strong security in place to prevent all unwanted access to the identifiable data in its central database. In addition, high levels of trust must exist in the data processor/curator who has access to the data.
If these conditions can't be met, differential privacy cannot be satisfactorily achieved because:
- The company could opt to misuse your data itself or might accidentally leak it through mismanagement
- Hackers could access the raw data, breaking the differential privacy guarantee
In the local model of differential privacy, each individual adds noise to their data themselves. This lowers the potential for the data processor to know what each individual's input is.
This model assumes that it is impossible to trust any data processor or curator and that for this reason it is necessary to add noise to the data prior to letting them analyze it.
In practice, this usually involves an algorithm requesting or collecting data on the user's side, where it is automatically obfuscated with noise before being sent to the data processor.
Thus, under the locally private model, the user never sends their raw personal data to be stored in a central database, which removes the potential for hacking and for data mismanagement or abuse.
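A classic way to picture the local model is randomized response, sketched below. Each user flips coins on their own device before answering a yes/no question, so the data processor only ever sees noisy answers. This particular technique is an assumption chosen for illustration, since the text does not name one. The function names and the simulated population (100,000 users, 30% of whom would truthfully answer yes) are hypothetical.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Report the truth with probability 3/4 and the opposite with 1/4."""
    if rng.random() < 0.5:
        return truth           # first coin heads: answer honestly
    return rng.random() < 0.5  # first coin tails: answer with a second coin


def estimate_true_rate(reports) -> float:
    """Invert the noise: P(yes reported) = 0.25 + 0.5 * true_rate."""
    observed = sum(reports) / len(reports)
    return 2.0 * observed - 0.5


# Hypothetical population: 30% of users would truthfully answer "yes".
rng = random.Random(7)
true_answers = [i < 30_000 for i in range(100_000)]

# Each report is randomized on the user's side before being sent.
reports = [randomized_response(answer, rng) for answer in true_answers]
estimate = estimate_true_rate(reports)
print(f"Estimated share answering yes: {estimate:.3f}")
```

No single report can be taken at face value, yet the aggregate estimate stays close to the true rate when many users take part.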
Differential Privacy vs. Data Anonymization
Data anonymization is the process of taking an individual's data and scrubbing it of identifiers so that it is not immediately clear who the data refers to. If, for example, a database exists that states that Benny watched Jurassic Park on Tuesday, that data could be anonymized by altering it to: Anonymous 'subject A' watched Jurassic Park on Tuesday.
This appears to make the data safe, but in reality it still applies to just one data subject, and methods exist for re-identifying that data. For example, somebody who happened to walk in on Benny while he was watching Jurassic Park last Tuesday could deduce that Benny was the data subject. As a result, this method of anonymizing individual data sets cannot be considered foolproof.
In the real world, studies have revealed that anonymized data sets can often be re-attributed to their owners with only a small amount of extra data. If this additional data can easily be accessed by someone seeking to re-identify a data subject, it is possible that re-identification may occur, creating a massive vulnerability in the system for data subjects.
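The re-identification risk described above can be sketched as a simple linkage attack: an 'anonymized' table is joined with a second data set on attributes the two have in common. Everything in this example is made up for illustration, including the records, the field names (`zip`, `birth_year`), and the `reidentify` helper.

```python
def reidentify(anonymized, auxiliary, keys):
    """Return {subject: name} for anonymized rows that match exactly one person."""
    index = {}
    for person in auxiliary:
        signature = tuple(person[k] for k in keys)
        index.setdefault(signature, []).append(person["name"])

    matches = {}
    for row in anonymized:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match means the row is re-identified
            matches[row["subject"]] = names[0]
    return matches


# Hypothetical 'anonymized' viewing records: names removed, other fields kept.
anonymized = [
    {"subject": "A", "zip": "10001", "birth_year": 1985, "watched": "Jurassic Park"},
    {"subject": "B", "zip": "10002", "birth_year": 1990, "watched": "Alien"},
]

# Hypothetical auxiliary data an attacker might obtain elsewhere.
auxiliary = [
    {"name": "Benny", "zip": "10001", "birth_year": 1985},
    {"name": "Carla", "zip": "10002", "birth_year": 1990},
    {"name": "Dev", "zip": "10002", "birth_year": 1990},
]

linked = reidentify(anonymized, auxiliary, keys=("zip", "birth_year"))
print(linked)  # {'A': 'Benny'}
```

Subject A is re-identified because only one person in the auxiliary data shares that combination of attributes. Subject B stays ambiguous only because two people happen to match.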
This is why differential privacy techniques are ultimately more secure than de-identified or 'anonymized' data sets; they do not contain any data that could potentially be linked back to any single data subject.
There is no such thing as truly anonymized individual data sets, because data that applies to just one data subject can potentially always be re-identified using other data sets.
That is why differential privacy is much more fundamentally sound. When achieved responsibly, it produces a result that resists re-identification: even if the people seeking to re-identify data hold data about every data subject other than the one they are targeting, they should still be unable to determine that person's data from the output.
As a result, differential privacy techniques can successfully allow data processors to collect and process information while reducing the risk that it might be used in a way that harms an individual's privacy rights.
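The scenario of an attacker who knows everyone's data except the target's can be sketched as a differencing attack. With an exact total, subtracting the known values reveals the target's value. With a noisy total, the subtraction only yields the true value plus random noise. The data, the function names, and the use of Laplace noise with epsilon = 0.5 are assumptions made for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def exact_total(flags) -> int:
    """Release the exact number of people with the sensitive attribute."""
    return sum(flags)


def noisy_total(flags, epsilon: float, rng: random.Random) -> float:
    """Release the same total with Laplace noise (sensitivity 1)."""
    return sum(flags) + laplace_noise(1.0 / epsilon, rng)


# Hypothetical yes/no attribute (1 = yes) for a small group.
others = [1, 0, 0, 1, 0, 1, 1, 0]  # values the attacker already knows
target = 1                         # the one value the attacker wants
everyone = others + [target]

# Against an exact release, subtraction recovers the target's value.
leaked = exact_total(everyone) - sum(others)

# Against a noisy release, subtraction yields the value plus noise.
rng = random.Random(3)
guess = noisy_total(everyone, epsilon=0.5, rng=rng) - sum(others)

print(f"Exact release leaks: {leaked}")
print(f"Noisy release gives: {guess:.2f}")
```

Because the noise is of the same order as the difference one person can make, the attacker cannot tell from a single noisy release whether the target's value is 0 or 1.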
Is differential privacy safe?
Because of how differentially private data has been processed into an aggregated output, it is generally agreed to be a secure way to analyze data without impacting people's privacy.
Unlike 'anonymized' data sets, which can be subjected to a re-identification attack, differentially private data cannot, ordinarily, be used to re-identify individuals.
As a result, differentially private results can be considered a good way to display statistical analysis that is originally gleaned from databases containing identifiable data.
However, there are some conditions that must be met for the promise of differential privacy to be provided adequately to data subjects. Unless those conditions are met satisfactorily, it is possible that data believed to be differentially private is not.
For example, larger data sets require relatively less noise to achieve differential privacy. Conversely, the smaller the data set being analyzed, and the fewer data subjects involved, the more significant the amount of noise that must be added to achieve sufficient differential privacy.
Thus, if sufficient noise is not added to datasets, the potential for identification increases (meaning that differential privacy was not actually achieved).
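The relationship between data set size and noise can be made concrete for one common case, the average of bounded values. Under the assumptions stated in the comments (Laplace noise, a publicly known number of records, and values clamped to a known range), the noise scale shrinks in proportion to the number of data subjects. The function name and the example numbers are hypothetical.

```python
import math


def mean_noise_scale(n: int, lower: float, upper: float, epsilon: float) -> float:
    """Laplace noise scale for the mean of n values clamped to [lower, upper].

    Assumes n is public and that neighbouring data sets differ by replacing
    one record, so one person can shift the mean by at most (upper - lower) / n.
    """
    if n <= 0 or epsilon <= 0 or upper <= lower:
        raise ValueError("need n > 0, epsilon > 0 and upper > lower")
    sensitivity = (upper - lower) / n
    return sensitivity / epsilon


# Hypothetical example: average age, clamped to 0..100, with epsilon = 1.
for n in (10, 1_000, 100_000):
    scale = mean_noise_scale(n, lower=0.0, upper=100.0, epsilon=1.0)
    print(f"{n:>7} data subjects -> typical noise of about {scale:g} years")
```

With only ten data subjects the noise is comparable to the quantity being measured, while with a hundred thousand it is negligible, which is the pattern the paragraph above describes.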
Should we accept the use of differential privacy?
This is an important question, and is one that should be considered from two different sides:
First, it is important to acknowledge that data is a powerful and useful tool. Data can allow people, organizations, businesses, and governments to find out what is currently happening, allowing those decision-makers to make better choices in the future.
This can help with highly important endeavors such as improving efficiency, allocating resources, reducing unnecessary waste, and understanding the logistical requirements for implementing services, for example.
With this in mind, processing data – while also providing individuals with personal data privacy – can be understood to be extremely desirable.
For this reason, the use of differential privacy for analyzing data should generally be considered a positive outcome for both data subjects and decision-makers seeking to use data to better inform their actions in the future.
As long as the techniques used for adding noise to data sets and achieving differential privacy are mathematically sound, we can consider the process agreeable as a method for permitting data analysis with high levels of privacy for data subjects.
That said, differential privacy must only be accepted when it is implemented and achieved responsibly, which means that legislators must set demands for the application of differential privacy in such a way that it genuinely protects data subjects.
It is important to acknowledge that implementing differential privacy comes with certain limitations. This is because the process of adding noise can reduce the accuracy of the results obtained from the data. Depending on the kind of data needed by the data processor, this may not be an acceptable compromise, thereby rendering differential privacy (and in particular the local model, which typically requires more noise) unsuitable.
Whether the tradeoff between privacy and accuracy is acceptable depends largely on the kinds of data and data processors involved, and on whether there is a legitimate legal interest in accessing individual data sets (perhaps by employing the global differential privacy model).
These questions are central to understanding the use of differential privacy, and it is generally agreed that differential privacy can primarily be employed when aggregated, non-specific outputs are sufficient for the needs of the data processor. If, on the other hand, differential privacy renders data so inaccurate that it becomes useless, then it may not be considered fit for purpose.
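One hedged way to reason about whether the output is still fit for purpose is to ask how often the added noise would exceed an error the data processor can tolerate. The sketch below assumes Laplace noise on a query with sensitivity 1, for which the probability of the noise exceeding a tolerance t is exp(-t * epsilon / sensitivity). The function names and the example figures are hypothetical.

```python
import math


def prob_error_exceeds(tolerance: float, epsilon: float, sensitivity: float = 1.0) -> float:
    """Probability that Laplace noise of scale sensitivity/epsilon exceeds tolerance."""
    if tolerance < 0 or epsilon <= 0 or sensitivity <= 0:
        raise ValueError("tolerance must be >= 0; epsilon and sensitivity must be > 0")
    return math.exp(-tolerance * epsilon / sensitivity)


def min_epsilon_for(tolerance: float, failure_prob: float, sensitivity: float = 1.0) -> float:
    """Smallest epsilon keeping the error within tolerance, except with failure_prob."""
    if tolerance <= 0 or not 0 < failure_prob < 1 or sensitivity <= 0:
        raise ValueError("need tolerance > 0, 0 < failure_prob < 1, sensitivity > 0")
    return sensitivity * math.log(1.0 / failure_prob) / tolerance


# Hypothetical requirement: a count must be within 10 of the truth 95% of the time.
needed = min_epsilon_for(tolerance=10.0, failure_prob=0.05)
print(f"Minimum epsilon needed: {needed:.3f}")

# Stronger privacy (smaller epsilon) makes large errors more likely.
for epsilon in (0.05, 0.5, 5.0):
    risk = prob_error_exceeds(10.0, epsilon)
    print(f"epsilon={epsilon}: chance the error exceeds 10 is {risk:.3f}")
```

If the epsilon needed for useful accuracy is larger than the privacy budget the organization is willing to spend, that is a sign the tradeoff may not be acceptable for that use.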
Differential Privacy and the COVID-19 Pandemic
Differential privacy techniques have been around for many years and have become more popular recently, with companies like Apple using them to provide privacy for various services on iOS devices, for example. The technique has also gained a lot of traction due to the pandemic.
By accessing location data to analyze where people go, it is possible to employ differential privacy to spot potential hotspots for the spread of coronavirus without infringing on individuals' privacy rights.
This is an example of how differential privacy can be used to provide benefits for society without infringing on people's privacy. It is the kind of decentralized and non-invasive tracking that data privacy experts consider acceptable for preventing the spread of the virus.