These days, companies are using more and more of our data to improve their products and services.
And it makes a lot of sense if you think about it.
It's better to measure what your users like than to guess and build products that no one wants to use.
However, this is also very dangerous.
It undermines our privacy because the collected data can be quite sensitive, causing harm if it were to leak.
So companies love data to improve their products, but we, as users, want to protect our privacy.
These conflicting needs can be satisfied with a technique called differential privacy.
It allows companies to collect information about their users without compromising the privacy of an individual.
But let's first take a look at why we would go through all this trouble.
Companies can just take our data, remove our names and call it a day, right?
Well, not quite.
First of all, this anonymization process usually happens on the servers of the companies that collect your data.
So you have to trust them to really remove the identifiable records.
And secondly, how anonymous is anonymized data really?
In 2006, Netflix started a competition called the Netflix Prize.
Competing teams had to create an algorithm that could predict how someone would rate a movie.
To help with this challenge, Netflix provided a dataset containing over 100 million ratings submitted by over 480,000 users for more than 17,000 movies.
Netflix of course anonymized this dataset by removing the names of users and by replacing some ratings with fake and random ratings.
Even though that sounds pretty anonymous, it actually wasn't.
In 2008, two computer scientists from the University of Texas published a paper showing that they had successfully identified people in this dataset by combining it with data from IMDb.
These types of attacks are called linkage attacks, and they happen when pieces of seemingly anonymous data can be combined to reveal real identities.
Another, creepier example is the case of the governor of Massachusetts.
In the mid-1990s, the state's Group Insurance Commission decided to publish the hospital visits of state employees.
They anonymized this data by removing names, addresses and other fields that could identify people.
However, computer scientist Latanya Sweeney decided to show how easy it was to reverse this.
She combined the published health records with voter registration records and simply narrowed down the list.
Only one person in the medical data lived in the same zip code and had the same gender and date of birth as the governor, thus exposing his medical records.
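To see how mechanically simple such a linkage attack is, here is a small sketch in Python using pandas with entirely made-up records (the column names and values are my own illustration, not Sweeney's actual data): the attack is essentially a join on the shared quasi-identifiers.

```python
import pandas as pd

# Entirely fictional stand-ins for the two published datasets.
medical = pd.DataFrame({
    "zip": ["02138", "02139"],
    "gender": ["M", "F"],
    "birth_date": ["1950-01-02", "1962-03-14"],
    "diagnosis": ["hypertension", "asthma"],
})
voters = pd.DataFrame({
    "name": ["John Sample", "Jane Sample"],
    "zip": ["02138", "02139"],
    "gender": ["M", "F"],
    "birth_date": ["1950-01-02", "1962-03-14"],
})

# The linkage attack: join the "anonymous" records to named ones
# on zip code, gender and date of birth.
linked = medical.merge(voters, on=["zip", "gender", "birth_date"])
print(linked)  # each diagnosis now has a name attached
```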
In a later paper, she noted that 87% of all Americans can be identified with only three pieces of information: zip code, birthday, and gender.
So much for anonymity.
Clearly, this technique isn't enough to protect our privacy.
Differential privacy, on the other hand, neutralizes these types of attacks.
To explain how it works, let's assume that we want to get a view on how many people do something embarrassing, like picking their nose.
To do that, we set up a service with the question "Do you pick your nose?" and yes and no buttons below it.
We collect all these answers on a server somewhere, but instead of sending the real answers, we're going to introduce some noise.
Let's say that Bob is a nose picker and that he clicks on the yes button.
Before we send his response to the server, our differential privacy algorithm will flip a coin.
If it's heads, the algorithm sends Bob's real answer to our server.
If it's tails, the algorithm flips a second coin and sends yes if it's tails or no if it's heads.
Back on our server, we see the data coming in, but because of the added noise, we can't really trust individual records.
Our record for Bob might say that he's a nose picker, but there is at least a 1 in 4 chance that he's actually not a nose picker, but that the answer was simply the effect of the coin toss that the algorithm performed.
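A minimal Python sketch of that two-coin scheme could look like this (the function name and the use of the random module are my own choices for illustration, not something specified in the video):

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Report an answer using the two-coin-flip scheme described above."""
    if random.random() < 0.5:       # first coin: heads -> send the real answer
        return true_answer
    return random.random() < 0.5    # second coin: tails -> "yes", heads -> "no"

# Bob really is a nose picker, yet there is a 1 in 4 chance we record "no".
reported_answer = randomized_response(True)
print(reported_answer)
```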
This is plausible deniability.
You can't be sure of people's answers, so you can't judge them on it.
This is particularly interesting if you're collecting data about illegal behavior, such as drug use for instance.
Because you know how the noise is distributed, you can compensate for it and end up with a fairly accurate view on how many people are actually nose pickers.
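With the coin-toss scheme, that compensation is simple arithmetic: if a fraction p of people truly are nose pickers, about 0.25 + 0.5 × p of the reported answers will be "yes", and you can invert that relationship. A small sketch, again with made-up numbers:

```python
def estimate_true_fraction(reported_yes: int, total: int) -> float:
    """Undo the coin-flip noise: P(reported "yes") = 0.25 + 0.5 * true_fraction."""
    observed = reported_yes / total
    return (observed - 0.25) / 0.5

# If 400 out of 1,000 noisy answers say "yes", roughly 30% really pick their nose.
print(estimate_true_fraction(400, 1000))  # ~0.3
```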
Of course, the coin toss algorithm is just an example and a bit too simple.
Real-world algorithms use noise drawn from the Laplace distribution, which spreads the data over a larger range and increases the level of anonymity.
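As a rough illustration only (the function, parameter names and the epsilon value below are my assumptions, not details given in the video), the Laplace mechanism adds noise with scale sensitivity divided by epsilon to a numeric result such as a count:

```python
import numpy as np

def laplace_count(true_count: float, sensitivity: float = 1.0, epsilon: float = 0.5) -> float:
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    A smaller epsilon means more noise and therefore stronger privacy.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Publish a noisy version of "300 out of 1,000 respondents said yes".
print(laplace_count(300))
```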
In the paper The Algorithmic Foundations of Differential Privacy, it is noted that differential privacy promises that the outcome of a survey will stay essentially the same whether or not you participate in it.
Therefore, you don't have any reason not to participate in the survey.
You don't have to fear that your data, in this case your nose picking habits, will be exposed.
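The formal version of that promise (my paraphrase of the standard definition, not wording used in the video) is that a mechanism M is ε-differentially private if, for any two datasets D and D′ that differ in only one person's data and for any set of outcomes S:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]

The smaller ε is, the closer the two probabilities must be, so the less any single person's participation can change what the survey reveals.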
Alright, so now we know what differential privacy is and how it works, but let's take a look at who is already using it.
Apple and Google are two of the biggest companies that are currently using it.
Apple started rolling out differential privacy in iOS 10 and macOS Sierra.
They use it to collect data on which websites are using a lot of power, which images are used in a certain context, and which words people are typing that aren't in the keyboard's dictionary.
Apple's implementation of differential privacy is documented, but not open source.
Google on the other hand has been developing an open source library for this.
They use it in Chrome to do studies on browser malware and in Maps to collect data about traffic in large cities.
But overall, there aren't many companies that have adopted differential privacy, and those who have only use it for a small percentage of their data collection.
So why is that?
Well, for starters, differential privacy is only usable for large datasets because of the injected noise.
Using it on a tiny dataset will likely result in inaccurate data.
And then there is also the complexity of implementing it.
It's a lot more difficult to implement differential privacy than to just report users' real data and anonymize it in the old-fashioned way.
So the bottom line is that differential privacy can help companies to learn more about a group of users without compromising the privacy of an individual within that group.
Adoption, however, is still limited, but it's clear that there is an increasing need for ways to collect data about people without compromising their privacy.
So that's it for this video.
If you want to learn more, head over to the Simply Explained playlist to watch more videos.
And as always, thank you very much for watching.