In this post, I am going to briefly describe what Differential Privacy is and what it promises. You may keep hearing about privacy-preserving Deep Learning. With the advent of new privacy laws such as the European General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), data stakeholders are becoming more and more careful about, and concerned with, the privacy aspects of their work.
Differential Privacy is a new, solid definition of privacy. It equips algorithms that aggregate and employ information about data with a way to limit the disclosure of private information. The nice thing about Differential Privacy is its rigorous mathematical definition, which allows precise measurement of the privacy guarantee and the privacy loss. In this article, you will learn the basics of the Differential Privacy definition.
The Issue: Ambiguity in Defining Privacy
It has always been a matter of debate how to define privacy:
- What does it mean?
- When can we say there is a privacy breach?
- How much information leakage is there?
- How much information leakage can we tolerate?
Such questions have always been asked, and most of the time they were hard to answer! The term privacy itself is rather broad, and we should declare: the privacy of what? Let’s focus on privacy-preserving data analysis and discuss what it means.
The Problem of Privacy-Preserving Data Analysis
Well, defining privacy-preserving data analysis is complicated in itself, and many proposed definitions have failed to address it!
Let’s start with an example. Assume we train a neural network on data containing sensitive information. The network learns some information from the data and makes predictions. Now assume someone asks you the following question: how can you make sure that, while this network makes predictions (or performs any other analysis of the data), it does not jeopardize the sensitive data? This may seem a broad question, BUT no one might be willing to give you their sensitive data unless they can be sure that privacy is preserved by some strong, convincing definition.
A lot of research effort has been dedicated to exploring the potential threats to data privacy, and to discussing why the majority of definitions and approaches are NOT effective. Let’s discuss the most famous of them, which is a widely used approach.
Data Anonymization is NOT SECURE
Let’s talk about this by referring to the healthcare domain, which has always been at the center of attention regarding the privacy of patients’ sensitive data. According to the Health Insurance Portability and Accountability Act (HIPAA):
A major goal of the Privacy Rule is to ensure that individuals’ health information is properly protected while allowing the flow of health information needed to provide and promote high quality health care and to protect the public’s health and well-being. The Privacy Rule strikes a balance that permits important uses of information while protecting the privacy of people who seek care and healing. (cdc.gov)
At its very core, this federal law requires patients’ records to be anonymized before being shared with any stakeholder. Data anonymization, the process of removing personally identifiable information from data, was long believed to be a privacy-preserving approach!
Top Examples Backed By Research
One may ask: if the personally identifiable information is removed, how on earth can the privacy of a patient be jeopardized?! The answer is linkage attacks, which have been successfully conducted to match “anonymized” records with non-anonymized records in another dataset. Consider the examples below:
- In 2007, Netflix published movie-ranking data for 500,000 customers, and with a few additional inputs from IMDb, researchers de-anonymized the data!
- Using 1990 U.S. census data, researchers demonstrated that ZIP code, gender, and date of birth alone are enough to identify 87 percent of the U.S. population!
So the abundance of information in a dataset backfired on its privacy!
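A linkage attack of the kind above can be sketched in a few lines. Every record, name, and field below is invented for illustration; the point is only that joining on quasi-identifiers (ZIP, gender, date of birth) re-attaches identities to “anonymized” records:

```python
# Hypothetical sketch of a linkage attack; all data here is made up.

# "Anonymized" medical records: names removed, quasi-identifiers kept.
anonymized = [
    {"zip": "02138", "gender": "F", "dob": "1965-07-22", "diagnosis": "hypertension"},
    {"zip": "02139", "gender": "M", "dob": "1971-03-15", "diagnosis": "diabetes"},
]

# Public, identified records (e.g. a voter roll) sharing the same fields.
public = [
    {"name": "Alice Smith", "zip": "02138", "gender": "F", "dob": "1965-07-22"},
    {"name": "Bob Jones", "zip": "02139", "gender": "M", "dob": "1971-03-15"},
]

QUASI_IDENTIFIERS = ("zip", "gender", "dob")

def link(anon_rows, public_rows):
    """Join the two datasets on the quasi-identifier tuple."""
    index = {tuple(r[k] for k in QUASI_IDENTIFIERS): r["name"] for r in public_rows}
    matches = []
    for r in anon_rows:
        key = tuple(r[k] for k in QUASI_IDENTIFIERS)
        if key in index:
            # Re-identified: a name is now attached to a sensitive diagnosis.
            matches.append({"name": index[key], "diagnosis": r["diagnosis"]})
    return matches

for match in link(anonymized, public):
    print(match)
```

Removing names did nothing here: the three quasi-identifier fields together were unique enough to act as an identifier, which is exactly the 87-percent census result above.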
The Secret Sauce of Differential Privacy
Fundamentally, differential privacy is a guarantee that the data holder, or data curator, gives to the data subject. Let’s quote Cynthia Dwork, a pioneer of differential privacy:
Differential Privacy’s promise to the data subject: You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources are available. (The Algorithmic Foundations of Differential Privacy, by Cynthia Dwork and Aaron Roth)
This is a great promise, something that makes the data subject BREATHE A SIGH OF RELIEF! Differential Privacy, in an ideal sense, is learning from data while learning nothing about individuals. Note that differential privacy is a strong definition of privacy, not a technique, approach, or algorithm. But what does that mean? Let’s look at an example.
Assume poor Bob has cancer! He participates in a data-collection study, and researchers later use that data for analytical purposes. If the approach used by the researchers is differentially private, the outcome of their analysis MUST NOT be noticeably affected by removing Bob from the dataset! This may look paradoxical. One may ask:
- If Bob’s data is not helping the study, why should it be there in the first place?
- If Bob’s data is contributing to the analysis, how is it possible that removing it would not affect the outcome?
Well, there is a misunderstanding here. Bob’s data affects the analysis for sure, but the question is: which aspects of it? Assume we want to report the mean of some particular blood-test category (this is a query). Answering this query exactly is NOT differentially private. Why? Because if Bob has a result in that blood-test category, removing Bob will definitely change the reported mean. So we should carefully analyze what we want to learn from the data, and how.
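We can see the leak concretely. The blood-test values below are made up, with Bob’s hypothetical result as the last entry; an observer who knows everyone else’s values can recover Bob’s exactly from the two released means:

```python
# Sketch: releasing an exact mean is not differentially private.
# All values here are invented for illustration.

def mean(values):
    return sum(values) / len(values)

with_bob = [5.0, 4.75, 6.0, 5.5, 9.75]  # Bob's (hypothetical) result is 9.75
without_bob = [5.0, 4.75, 6.0, 5.5]

print(mean(with_bob))     # 6.2
print(mean(without_bob))  # 5.3125
# The answers differ, so the exact query "leaks" Bob:
# 5 * 6.2 - 4 * 5.3125 = 9.75, Bob's value recovered exactly.
```

The two releases differ, and simple arithmetic on them reconstructs Bob’s sensitive value, which is precisely what differential privacy forbids.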
After this, we should answer the question: how do we learn something from data that will NOT be affected too much by the presence of a particular individual? The phrase too much indicates that Differential Privacy is not something absolute. In fact, as it is defined in The Algorithmic Foundations of Differential Privacy, an algorithm can be ε-differentially private, and the value of ε determines how good the algorithm is at preserving privacy (the smaller ε, the stronger the guarantee).
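One standard way to make the mean query above differentially private is the Laplace mechanism from Dwork and Roth’s book: clip each value to a known range, compute the query’s sensitivity (how much one person can move the answer), and add noise scaled to sensitivity/ε. The sketch below is a minimal illustration, not a production implementation; the data, bounds, and ε are all assumed values:

```python
import random

def dp_mean(values, lower, upper, epsilon):
    """Release the mean with (approximately) epsilon-differential privacy.

    Each value is clipped to [lower, upper], so one person can change the
    mean by at most (upper - lower) / n -- the query's sensitivity. Adding
    Laplace noise with scale sensitivity / epsilon masks any individual.
    """
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / n
    scale = sensitivity / epsilon
    # Sample Laplace(0, scale) as the difference of two exponentials,
    # since the stdlib random module has no laplace() of its own.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return sum(clipped) / n + noise

random.seed(0)  # for reproducibility of this sketch only
blood_tests = [5.0, 4.75, 6.0, 5.5, 9.75]  # invented values, as before
print(dp_mean(blood_tests, lower=0.0, upper=10.0, epsilon=1.0))
```

With a small ε the released mean is noisy enough that Bob’s presence or absence is hidden in the noise; with a huge ε the noise vanishes and we are back to the leaky exact query. That trade-off between accuracy and privacy is exactly what ε measures.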
Who Cares? The Significance of Differential Privacy
Answering this question is critical, as it may give you some insight into whether, and how much, you should care!
The stakeholders associated with sensitive and private data collection, management, and sharing (basically anyone in the AI community dealing with private datasets) are very interested in Differential Privacy as a way to address the regulatory requirements of the European General Data Protection Regulation (GDPR). Furthermore, the enactment of the California Consumer Privacy Act (CCPA) makes the same stakeholders even more concerned and careful about practicing privacy, to the point that some experts believe Silicon Valley is terrified of California’s privacy law! Some of the basic rights that the CCPA gives consumers are as follows:
- The right to be informed about the collection of their personal data
- The right to know how their personal data is being used
- The right to prevent the selling of their personal data
- The right to access and modify their personal data, including its deletion
Differential Privacy comes to our rescue with the following potential benefits:
- It promises the customer or data subject that their data privacy is protected by a strong, formal notion.
- It creates a robust model that is NOT much affected by eliminating particular individuals. However, this claim needs further investigation regarding just how robust the model is. For example, removing half the dataset will certainly not give us the same level of accuracy or performance!
In this post, we discussed Differential Privacy as a strong definition of privacy. I deliberately did NOT go through the details of the definition; I just wanted to give you (1) a brief introduction to the notion and (2) its promises. As a final word, whoever works with privacy-protected datasets is urged to practice private Deep Learning, and Differential Privacy is the strongest notion of privacy proposed so far.