These days, companies are using more and more of our data to improve their products and services.
And it makes a lot of sense if you think about it.
It's better to measure what your users like than to guess and build products that no one wants to use.
However, this is also very dangerous.
It undermines our privacy, because the collected data can be quite sensitive and cause harm if it leaks.
So companies love data to improve their products, but we, as users, want to protect our privacy.
These conflicting needs can be satisfied with a technique called differential privacy.
It allows companies to collect information about their users without compromising the privacy of any individual.
But let's first take a look at why we would go through all this trouble.
Companies can just take our data, remove our names, and call it a day, right?
Well, not quite.
First of all, this anonymization process usually happens on the servers of the companies that collect your data.
So you have to trust them to really remove the identifiable records.
And secondly, how anonymous is anonymized data, really?
In 2006, Netflix started a competition called the Netflix Prize.
Competing teams had to create an algorithm that could predict how someone would rate a movie.
To help with this challenge, Netflix provided a dataset containing over 100 million ratings submitted by over 480,000 users for more than 17,000 movies.
Netflix of course anonymized this dataset by removing the names of users and by replacing some ratings with fake, random ratings.
Even though that sounds pretty anonymous, it actually wasn't.
Two computer scientists from the University of Texas published a paper in 2008 saying they had successfully identified people in this dataset by combining it with data from IMDb.
These types of attacks are called linkage attacks: they happen when pieces of seemingly anonymous data can be combined to reveal real identities.
Another, creepier example is the case of the governor of Massachusetts.
In the mid-1990s, the state's Group Insurance Commission decided to publish the hospital visits of state employees.
They anonymized this data by removing names, addresses, and other fields that could identify people.
However, computer scientist Latanya Sweeney decided to show how easy it was to reverse this.
She combined the published health records with voter registration records and simply narrowed down the list.
Only one person in the medical data lived in the same ZIP code and had the same gender and date of birth as the governor, thus exposing his medical records.
In a later paper, she noted that 87% of all Americans can be identified with only three pieces of information: ZIP code, birth date, and gender.
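To make the mechanics of such a linkage attack concrete, here is a minimal sketch in Python. All records, names, and field values below are made up purely for illustration; they are not the actual datasets Sweeney used:

```python
# Hypothetical illustration of a linkage attack: join two "anonymized"
# datasets on shared quasi-identifiers (ZIP code, date of birth, gender).
medical_records = [  # published without names
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "(redacted)"},
    {"zip": "02139", "dob": "1962-01-15", "sex": "F", "diagnosis": "(redacted)"},
]
voter_roll = [  # public record, with names attached
    {"name": "Jane Doe", "zip": "02139", "dob": "1962-01-15", "sex": "F"},
]

# Any medical record that matches exactly one voter on all three
# quasi-identifiers is effectively re-identified.
for med in medical_records:
    hits = [v for v in voter_roll
            if (v["zip"], v["dob"], v["sex"]) == (med["zip"], med["dob"], med["sex"])]
    if len(hits) == 1:
        print(hits[0]["name"], "->", med["diagnosis"])
```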
So much for anonymity.
Clearly, this technique isn't enough to protect our privacy.
Differential privacy, on the other hand, neutralizes these types of attacks.
To explain how it works, let's assume that we want to get a view on how many people do something embarrassing, like picking their nose.
To do that, we set up a survey with the question "Do you pick your nose?" and Yes and No buttons below it.
We collect all these answers on a server somewhere, but instead of sending the real answers, we're going to introduce some noise.
Let's say that Bob is a nose picker and that he clicks the Yes button.
Before we send his response to the server, our differential privacy algorithm will flip a coin.
If it's heads, the algorithm sends Bob's real answer to our server.
If it's tails, the algorithm flips a second coin and sends yes if it's tails, or no if it's heads.
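This coin-flip scheme is known as randomized response. A minimal sketch of the client side might look like this in Python (the function name is just illustrative):

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Report a survey answer with plausible deniability.

    First coin: heads -> send the true answer.
    Tails -> flip a second coin and send a uniformly random yes/no.
    """
    if random.random() < 0.5:      # first coin lands heads
        return true_answer         # report the real answer
    return random.random() < 0.5   # second coin: random yes or no
```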
Back on our server, we see the data coming in, but because of the added noise, we can't really trust individual records.
Our record for Bob might say that he's a nose picker, but there is at least a 1 in 4 chance that he's actually not a nose picker and that the answer was simply the effect of the coin toss the algorithm performed.
This is plausible deniability.
You can't be sure of people's answers, so you can't judge them on them.
This is particularly interesting if you're collecting data about illegal behavior, such as drug use, for instance.
Because you know how the noise is distributed, you can compensate for it and end up with a fairly accurate view of how many people are actually nose pickers.
In this scheme, the expected fraction of yes answers is one quarter plus half the true fraction of nose pickers, so the true fraction is roughly twice the observed yes rate minus one half.
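On the server side, compensating for the noise is a matter of inverting that formula. A sketch, reusing the randomized_response function from above:

```python
import random

def estimate_true_rate(reports: list[bool]) -> float:
    """Estimate the true yes-rate p from noisy randomized responses.

    Expected reported yes-rate: r = 0.5 * p + 0.25, so p = 2 * r - 0.5.
    """
    r = sum(reports) / len(reports)
    return max(0.0, min(1.0, 2 * r - 0.5))  # clamp to [0, 1] for small samples

# Simulate 100,000 users, 30% of whom really are nose pickers.
truth = [random.random() < 0.3 for _ in range(100_000)]
reports = [randomized_response(t) for t in truth]
print(estimate_true_rate(reports))  # prints a value close to 0.3
```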
Of course, the coin toss algorithm is just an example, and a bit too simple.
Real-world algorithms use the Laplace distribution to spread the noise over a larger range and increase the level of anonymity.
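As a rough sketch of how Laplace noise is typically applied — this is the textbook Laplace mechanism, not Apple's or Google's actual implementation — a count query can be released like this, where epsilon is the privacy budget:

```python
import random

def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale) as the difference of two exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1: adding or removing one person
    changes the result by at most 1, so noise of scale 1/epsilon suffices.
    """
    return true_count + laplace_noise(1.0 / epsilon)

print(private_count(4213, epsilon=0.1))  # true count plus noise of scale 10
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means less noise and weaker privacy.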
In the paper "The Algorithmic Foundations of Differential Privacy", it is noted that differential privacy promises that the outcome of a survey will stay essentially the same whether or not you participate in it.
Therefore, you don't have any reason not to participate in the survey.
You don't have to fear that your data, in this case your nose-picking habits, will be exposed.
Alright, so now we know what differential privacy is and how it works, so let's take a look at who is already using it.
Apple and Google are two of the biggest companies currently using it.
Apple started rolling out differential privacy in iOS 10 and macOS Sierra.
They use it to collect data on which websites use a lot of power, which images are used in a certain context, and which words people are typing that aren't in the keyboard's dictionary.
Apple's implementation of differential privacy is documented, but not open source.
Google, on the other hand, has been developing an open source library for this.
They use it in Chrome to do studies on browser malware and in Maps to collect data about traffic in large cities.
But overall, there aren't many companies that have adopted differential privacy, and those that have use it for only a small percentage of their data collection.
So why is that?
Well, for starters, differential privacy is only usable on large datasets because of the injected noise.
Using it on a tiny dataset will likely result in inaccurate data.
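A quick simulation makes this concrete, reusing the randomized_response and estimate_true_rate sketches from above:

```python
import random

# Estimate a true rate of 0.3 at different sample sizes.
for n in (50, 500, 50_000):
    truth = [random.random() < 0.3 for _ in range(n)]
    reports = [randomized_response(t) for t in truth]
    print(n, round(estimate_true_rate(reports), 3))
# Small samples give estimates far from the true 0.3;
# large samples land close to it.
```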
And then there is also the complexity of implementing it.
It's a lot more difficult to implement differential privacy than to just report the real data of users and anonymize it the old-fashioned way.
So the bottom line is that differential privacy can help companies learn more about a group of users without compromising the privacy of any individual within that group.
Adoption is still limited, but it's clear that there is an increasing need for ways to collect data about people without compromising their privacy.
So that's it for this video.
If you want to learn more, head over to the Simply Explained playlist to watch more videos.
And as always, thank you very much for watching.