[MUSIC PLAYING]

EMILY GLANZ: Hi, everyone. Thanks for joining us today. I'm Emily, a software engineer on Google's federated learning team.

DANIEL RAMAGE: And I'm Dan. I'm a research scientist and the team lead. We'll be talking today about federated learning-- machine learning on decentralized data. The goal of federated learning is to enable edge devices to do state-of-the-art machine learning without centralizing data, and with privacy by default. And by privacy, we mean the aspiration that app developers, centralized servers, and the models themselves learn only common patterns. That's really what we mean by privacy.

In today's talk, we'll cover decentralized data and what it means to work with decentralized data in a centralized fashion-- what we call federated computation. We'll talk a bit about learning on decentralized data. And then we'll give you an introduction to TensorFlow Federated, which is a way that you can experiment with federated computations in simulation today. Along the way, we'll introduce a few privacy principles, like ephemeral reports, and privacy technologies, like federated model averaging, that embody those principles.

All right, let's start with decentralized data. A lot of data is born at the edge, with billions of phones and IoT devices generating data. That data can enable better products and smarter models. You saw in yesterday's keynote a lot of ways that data can be used locally at the edge, with on-device inference, such as automatic captioning and the next-generation Assistant. On-device inference offers improvements to latency, lets things work offline, often has battery life advantages, and can also have some substantial privacy advantages, because a server doesn't need to be in the loop for every interaction you have with that locally generated data. But if you don't have a server in the loop, how do you answer analytics questions? How do you continue to improve models based on the data that those edge devices have? That's really what we'll be looking at in the context of federated learning.

The app we'll be focusing on today is Gboard, which is Google's mobile keyboard. People don't think much about their keyboards, but they spend hours on them each day. And typing on a mobile keyboard is 40% slower than on a physical one. It is easier to share cute stickers, though. Gboard uses machine-learned models for almost every aspect of the typing experience. Tap typing and gesture typing both depend on models, because fingers are a little bit wider than the key targets, and you can't just rely on people hitting exactly the right keystrokes. Similarly, auto-corrections and predictions are powered by learned models, as are voice-to-text and other aspects of the experience. All these models run on device, of course, because your keyboard needs to work offline and quickly.

For the last few years, our team has been working with the Gboard team to experiment with decentralized data. Gboard aims to be the best and most privacy-forward keyboard available. And one of the ways we're aiming to do that is by making use of an on-device cache of local interactions-- things like touch points, typed text, context, and more. This data is used exclusively for federated learning and computation.

EMILY GLANZ: Cool. Let's jump in to federated computation. Federated computation is basically a MapReduce for decentralized data, with privacy-preserving aggregation built in.
Let's introduce some of the key concepts of federated computation using a simpler example than Gboard. So here we have our clients. This is a set of devices-- things like cell phones, sensors, et cetera. Each device has its own data. In this case, let's imagine it's the maximum temperature that device saw that day, which gets us to our first privacy technology-- on-device data sets. Each device keeps its raw data local, and this comes with some obligations. Each device is responsible for data asset management locally, with things like expiring old data and ensuring that the data is encrypted when it's not in use.

So how do we get the average maximum temperature experienced by our devices? Let's imagine we had a way to communicate only the average of all client data items to the server. Conceptually, we'd like to compute an aggregate over the distributed data in a secure and private way, which we'll build up to throughout this talk.

So now let's walk through an example where the engineer wants to answer a specific question about the decentralized data, like what fraction of users saw a daily high over 70 degrees Fahrenheit. The first step would be for the engineer to input this threshold to the server. Next, this threshold would be broadcast to the subset of available devices the server has chosen to participate in this round of federated computation. The threshold is then compared to the local temperature data to compute a value-- a 1 or a 0, depending on whether the temperature was greater than that threshold. Cool.

These values would then be aggregated using an aggregation operator. In this case, it's a federated mean, which encodes a protocol for computing the average value over the participating devices. The server is responsible for collating device reports throughout the round and emitting the aggregate, which contains the answer to the engineer's question. This demonstrates our second privacy technology of federated aggregation: the server combines reports from multiple devices and only persists the aggregate. That leads into our first privacy principle of only-in-aggregate. Performing federated aggregation makes only the final aggregate data-- those sums and averages over the device reports-- available to the engineer, without giving them access to any individual report. And this ties into our second privacy principle of ephemeral reports. We don't need to keep those per-device messages after they've been aggregated, so what we collect only stays around for as long as we need it and can be immediately discarded.

In practice, what we've just shown is a single round of computation. The server will repeat this process multiple times to get a better estimate of the answer to the engineer's question, because some devices may not be available at the time of computation, or some devices may have dropped out during the round.
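To make that round concrete, here is a minimal, purely local Python sketch of the computation just described. The device values and function names are illustrative only; they aren't part of any production system.

```python
# One simulated round: each "device" compares its local max temperature
# to the broadcast threshold and reports a 1 or 0; the server keeps
# only the aggregate mean, never the individual reports.
def local_update(device_max_temp, threshold):
    return 1.0 if device_max_temp > threshold else 0.0

def federated_mean(reports):
    return sum(reports) / len(reports)

device_temps = [68.0, 74.5, 71.2, 65.9]  # one private reading per device
threshold = 70.0                          # broadcast by the server
reports = [local_update(t, threshold) for t in device_temps]
print(federated_mean(reports))            # 0.5: half saw a high over 70°F
```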
DANIEL RAMAGE: So what's different between federated computation and distributed computation in the data center, with things like MapReduce? Federated computation has challenges that go beyond what we usually experience in distributed computation. Edge devices like phones tend to have limited communication bandwidth, even when they're connected to a home Wi-Fi network. They're also intermittently available, because devices will generally participate only if they are idle, charging, and on an unmetered network. And because each compute node keeps the only copy of its data, the data itself has intermittent availability. Finally, devices participate only with the user's permission, depending on an app's policies.

Another difference is that the federated setting is much more distributed than a traditional data center computation. To give you a sense of the orders of magnitude: in a data center, you might be looking at thousands or maybe tens of thousands of compute nodes, whereas the federated setting might have something like a billion compute nodes. Maybe something like 10 million are available at any given time. Something like 1,000 are selected for a given round of computation, and maybe 50 drop out. That's a rough sense of the scales that we're interested in supporting. And, of course, as Emily mentioned, privacy-preserving aggregation is fundamental to the way that we think about federated computation.

So given this set of differences, what does it actually look like when you run a computation in practice? This is a graph of the round completion rate by hour over the course of three days for a Gboard model that was trained in the United States. You see a periodic structure of peaks and troughs, which represent day versus night. Because devices only participate when they're otherwise idle and charging, the peaks of round completion rate are when more devices are plugged in, which is usually when they're charging on someone's nightstand as they sleep. Rounds complete faster when more devices are available, and device availability changes over the course of the day. That, in turn, implies dynamic data availability, because the data itself might be slightly different for users who plug in their phones at night versus during the day-- something we'll get back to when we talk about federated learning in particular.

Let's take a more in-depth example of what a federated computation looks like-- the relative typing frequencies of common words in Gboard. Typing frequencies are actually useful for improving the Gboard experience in a few ways. If someone has typed the letters H-I, "hi" is much, much more likely than "hieroglyphic." And so knowing those relative word frequencies allows the Gboard team to make the product better. How would we compute these relative typing frequencies as a federated computation? Instead of the engineer specifying a single threshold, now they would specify something like a snippet of code that's going to run on each edge device. In practice, that will often be something that's actually in TensorFlow, but here I've written it as Python-like pseudocode. Think of the device data as each device's record of what was typed in recent sessions on the phone. For each word in that device data, if the word is one of the common words we're trying to count, we'll increase its count in the local device's update. That little program is what would be shipped to the edge and run locally to compute a little map that says that perhaps this phone typed the word "hello" 18 times and "world" 0 times. That update would then be encoded as a vector-- here, the first element of the vector represents the count for "hello" and the second the count for "world"-- which would then be combined and summed using the federated aggregation operators that Emily mentioned before.
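Here is a minimal Python sketch of the on-device counting step Dan just described; `device_data` and `common_words` are illustrative names, not the actual Gboard code.

```python
# Count occurrences of a fixed set of common words in this device's
# recently typed text, then encode the counts as a vector with a fixed
# word order so the server can sum reports from many devices elementwise.
def compute_update(device_data, common_words):
    counts = {word: 0 for word in common_words}
    for word in device_data:
        if word in counts:
            counts[word] += 1
    return [counts[word] for word in common_words]

# e.g. a phone that typed "hello" twice and "world" never:
print(compute_update(["hello", "hi", "hello"], ["hello", "world"]))  # [2, 0]
```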
At the server, the engineer would see the counts from all the devices that participated in that round, not from any single device, which brings up a third privacy principle of focused collection. Devices report only what is needed for this specific computation. There's a lot more richness in the on-device data set that's not being shared. And if the analyst wanted to ask a different question-- for example, counting a different set of words-- they would run a different computation. This would repeat over multiple rounds, making the aggregate counts higher and higher, which in turn gives better and better estimates of the relative frequencies of the words typed across the population.

EMILY GLANZ: Awesome. Let's talk about our third privacy technology of secure aggregation. In the previous example, we saw how the server only needs to emit the sum of the vectors reported by the devices. The server could compute this sum from the device reports directly, but we've been researching ways to provide even stronger guarantees. Can we make it so the server itself cannot inspect individual reports? That is, how do we enforce the only-in-aggregate privacy principle we saw before in our technical implementation? Secure aggregation is an optional extension to the client/server protocol that embodies this privacy principle.

Here's how it works. This is a simplified overview that demonstrates the key idea of how a server can compute a sum without being able to decrypt the individual messages. In practice, handling phones that drop out partway through is also required of this protocol-- see the paper for details. Awesome. So let's jump into this. Through coordination by the server, two devices agree on a pair of large masks that sum to 0. Each device adds its mask to its vector before reporting. All devices participating in this round of computation exchange these zero-sum pairs. Reports are completely masked by these values, so that the added pairs make each individual report look randomized. But when the reports are aggregated, the pairs cancel out, and we're left with only the sum we were looking for. In practice, again, this protocol is more complicated in order to handle dropout.
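Here is a toy numeric sketch of that zero-sum masking idea; real secure aggregation uses cryptographic key agreement and handles dropouts, as described in the paper Emily mentions.

```python
import random

MODULUS = 2**32  # arithmetic is done modulo a large number

# Two paired devices agree on masks that cancel when summed.
def make_mask_pair(rng, size):
    mask = [rng.randrange(MODULUS) for _ in range(size)]
    return mask, [(-m) % MODULUS for m in mask]

rng = random.Random(42)
report_a, report_b = [18, 0], [3, 7]       # true per-device vectors
mask_a, mask_b = make_mask_pair(rng, 2)

# Each device adds its mask before reporting; individually these
# reports look like random numbers to the server...
masked_a = [(x + m) % MODULUS for x, m in zip(report_a, mask_a)]
masked_b = [(x + m) % MODULUS for x, m in zip(report_b, mask_b)]

# ...but the masks cancel in the sum, leaving only the aggregate.
print([(a + b) % MODULUS for a, b in zip(masked_a, masked_b)])  # [21, 7]
```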
So we've shown you what you can do with federated computation. But what about the much more complex workflows associated with federated learning? Before we jump into federated learning, let's look at the typical workflow of a model engineer doing machine learning. Typically, they'll have some data in the cloud where they start training and evaluation jobs, potentially in grids to experiment with different hyperparameters, and they'll monitor how well those jobs are performing. They'll end up with a model that is a good fit for the distribution of the cloud data that's available.

So how does this translate into a federated learning workflow? Well, the model engineer might still have some data in the cloud, but now it's proxy data that's similar to the on-device data. This proxy data might be useful for training and evaluating in advance, but the main training loop now takes place on the decentralized data. The model engineer will still do things that are typical of a machine learning workflow, like starting and stopping tasks, trying out different learning rates or other hyperparameters, and monitoring performance as training occurs. If the model performs well on the decentralized data set, the model engineer has a good release candidate. They'll evaluate this release candidate using whatever validation techniques they typically use before deploying to users-- things you can do with ModelValidator in TFX. After validation, they'll distribute the final model for on-device inference with TensorFlow Lite, perhaps with a staged rollout or A/B testing. This deployment workflow is a step that comes after federated learning, once they have a model that works well. Note that the model does not continue to train after it's been deployed for inference on device, unless the model engineer is doing something more advanced, like on-device personalization.

So how does the federated learning part itself work? If a device is idle and charging, it will check in to the server. Most of the time, it's going to be told to go away and come back later. But some of the time, the server will have work to do. The initial model, as dictated by the model engineer, is sent to the phone. For the initial model, usually zeros or a random initialization is sufficient; or, if they have some of that relevant proxy data in the cloud, they can also use a pre-trained model. The client computes an update to the model using its own local training data. Only this update is sent to the server to be aggregated, not the raw data. Other devices participate in the round as well, performing their own local updates to the model. Some of the clients may drop out before reporting their update, but this is OK. The server aggregates the updates into a new model by averaging them, optionally using secure aggregation. The updates are ephemeral and are discarded after use. The engineer monitors the performance of federated training through metrics that are themselves aggregated along with the model. Training rounds continue as long as the engineer is happy with model performance: a different subset of devices is chosen by the server and given the new model parameters. This is an iterative process that continues through many training rounds.

What we've just described is our fourth privacy technology of federated model averaging. Our diagram showed federated averaging as the flavor of aggregation the server performs for distributed machine learning. Federated averaging works by computing a data-weighted average of the model updates, each produced by many steps of gradient descent on the device. Other federated optimization techniques could be used.

DANIEL RAMAGE: So what's different between federated learning and traditional distributed learning inside a data center? Well, it's all the differences that we saw with federated computation, plus some additional ones that are learning-specific. For example, the data sets in a data center are usually balanced in size-- most compute nodes have a roughly equal-size slice of the data. In the federated setting, each device has one user's data, and some users might use Gboard much more than others, so those data set sizes can be very different. Similarly, the data in federated learning is very self-correlated: each device holds only one user's data, so it's not a representative sample of all users' typing. And many distributed training algorithms in the data center assume that every compute node gets a representative sample of the full data set. And, third, there's that variable data availability I mentioned earlier: because the people whose phones are plugged in at night versus during the day might actually be different-- for example, night shift workers versus day shift workers-- we might have different kinds of data available at different times of day, which is a potential source of bias when we're training federated models and an active area of research.
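To make the averaging step concrete, here is a minimal NumPy sketch of data-weighted federated model averaging on a toy linear regression. It's an illustration of the idea, not the production algorithm; all names and values are illustrative.

```python
import numpy as np

def local_train(weights, x, y, lr=0.1, steps=5):
    """Run a few steps of gradient descent on one device's data."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2 * x.T @ (x @ w - y) / len(y)  # squared-error gradient
        w -= lr * grad
    return w - weights, len(y)                  # (model update, #examples)

def federated_average(weights, client_results):
    """Apply the data-weighted average of client updates to the model."""
    total = sum(n for _, n in client_results)
    avg_update = sum(n * upd for upd, n in client_results) / total
    return weights + avg_update

rng = np.random.default_rng(0)
w = np.zeros(3)
for round_num in range(10):                     # a few training rounds
    clients = []
    for _ in range(4):                          # 4 sampled devices
        n = int(rng.integers(5, 50))            # unbalanced data sizes
        x = rng.normal(size=(n, 3))
        y = x @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
        clients.append(local_train(w, x, y))
    w = federated_average(w, clients)
print(w)  # converges toward the true weights [1.0, -2.0, 0.5]
```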
What's exciting is that federated model averaging actually works well for a variety of state-of-the-art models despite these differences. That's an empirical result. When we started this line of research, we didn't know whether it would be true, or whether it would apply widely to the kinds of state-of-the-art models that teams like Gboard are interested in pursuing. The fact that it does work well in practice is great news.

So when does federated learning apply? It's most applicable when the on-device data is more relevant than the server-side proxy data, or when the data is privacy sensitive or large in ways that make it not make sense to upload. And, importantly, it works best when the labels for your machine-learned algorithm can be inferred naturally from user interaction.

So what does a naturally inferred label look like? Let's take a look at some examples from Gboard. Language modeling is one of the most essential models powering Gboard experiences. The key idea in language modeling is to predict the next word based on the text typed so far. This, of course, powers the prediction strip, but it also powers other aspects of the typing experience. Gboard also uses the language model to help understand, as you're tap typing or gesture typing, which words are more likely. The model input in this case is the typed-in sequence so far, and the output is whatever word the user typed next. That's what we mean by self-labeling: if you take a sequence of text, you can use every prefix of that text to predict the next word, and that gives a series of training examples as a result of people's natural use of the keyboard itself.

The Gboard team ran dozens of experiments in order to replace their prediction strip language model with a new one based on a more modern recurrent neural network architecture, described in the paper linked below. On the left, we see a server-trained recurrent neural network compared to the old Gboard model, and on the right, a federated model compared to that same baseline. These two model architectures are identical. The only difference is that one was trained in the data center using the best available server-side proxy data, and the other was trained with federated learning. Note that the newer architecture is better in both cases, but the federated model actually does even better than the server model, and that's because the decentralized data better represents what people actually type. On the x-axis for the federated model, we see the training round-- how many rounds of computation it took to hit a given accuracy on the y-axis. The model tends to converge after about 1,000 rounds, which is something like a week of wall-clock time. That's longer than in the data center, where the x-axis measures steps of SGD and we get to similar quality in about a day or two. But that week-long time frame is still practical for machine learning engineers, because they can start many models in parallel and work productively in this setting, even though it takes a little bit longer.
So what's the impact of that relatively small difference in accuracy? It's actually pretty big. Next-word prediction accuracy improves by 25% relative, and it makes the prediction strip itself more useful-- users click it about 10% more.

Another example that the Gboard team has been working on is emoji prediction. Software keyboards have a nice emoji interface, but many users don't know to look there or find it inconvenient. So Gboard has introduced the ability to predict emoji right in line on the prediction strip, just like next words. And the federated model was able to learn that the fire emoji is an appropriate completion for "this party is lit." On the bottom, you can see a histogram of the overall frequency of emoji that people tend to type, in which the laugh/cry emoji is much more represented. So this is how you know that context really matters for emoji: we wouldn't want the laugh/cry emoji to be the one we suggest all the time. This model ends up with 7% more accurate emoji predictions, and Gboard users actually click the prediction strip 4% more. And, I think most importantly, 11% more users have discovered the joy of including emoji in their texts-- and untold numbers of users are receiving those wonderfully emojiful texts.

So far, we've focused on text entry, but there are other components where federated learning can apply, such as action prediction in the UI itself. Gboard isn't really just used for typing; a key feature is enabling communication. Much of what people type is in messaging apps, and those apps can become more lively when you share the perfect GIF. So helping people discover great GIFs to search for and share from the keyboard, at the right times and without getting in the way, is one of Gboard's differentiating product features. This model was trained to predict, from the context so far, a query suggestion for a GIF, sticker, or emoji search, and whether that suggestion is actually worth showing to the user at this time. An earlier iteration of this model is described in the paper linked below. This model resulted in a 47% reduction in unhelpful suggestions, while simultaneously increasing the overall rate of emoji, GIF, and sticker shares, by better indicating when a GIF search would be appropriate. That's what you can see in the animation: as someone types "good night," the little "g" turns into a little GIF icon, which indicates that a good GIF is ready to share.

One final example I'd like to give from Gboard is the problem of discovering new words: what words are people typing that Gboard doesn't know? It can be really hard to type a word that the keyboard doesn't know, because it will often auto-correct to something that it does know. Gboard engineers can use the top typed unknown words to improve the typing experience. They might add new common words to the dictionary in the next model release after manual review, or they might find out what kinds of typos are common, suggesting possible fixes to other aspects of the typing experience. Here is a sample of words that people tend to type that Gboard doesn't know. How did we get this list of words if we're not sharing the raw data? We actually trained a recurrent network to predict the sequence of characters people type when they're typing words that the keyboard doesn't know.
And that model, just like the next-word prediction model, can be used to sample out words letter by letter. We then take that model in the data center and generate from it: we generate millions and millions of samples that are representative of the words people are typing out in the wild. If we break these down a little, there's a mix of things. There are abbreviations, like "really" and "sorry" missing their vowels. There are extra letters added to "hahah" and "ewwww," often for emphasis. There are typos that are common enough to show up even though Gboard likes to auto-correct away from them. There are new names. And we also see examples of non-English words being typed on an English-language keyboard-- this model was trained against US English. Those non-English words actually indicate another way Gboard might improve. Gboard has, of course, an experience for typing in multiple languages, and perhaps that multilingual experience, or switching languages more easily, could be improved.

This also brings us to our fourth privacy principle, which is: don't memorize individuals' data. We're careful in this case to use only models aggregated over lots of users, and trained only on out-of-vocabulary words that have a particular flavor, such as not containing a sequence of digits. We definitely don't want a model we've trained with federated learning to be able to memorize someone's credit card number. And we're looking at further techniques that can provide other kinds of even stronger and more provable privacy properties.

One of those is differential privacy-- the statistical science of learning common patterns in a data set without memorizing individual examples. This field has been around for a number of years, and it's very complementary to federated learning. The main idea is that when you're training a model, whether with federated learning or in the data center, you use appropriately calibrated noise to obscure an individual's impact on the model you learn. This is something you can experiment with today in the TensorFlow Privacy project, which I've linked here, for more traditional data center settings where you have all the data available and you'd like an optimizer that adds the right kind of noise to guarantee this property-- that individual examples aren't memorized. The combination of differential privacy and federated learning is still very fresh. Google is working to bring it to production, so I'm giving you a preview of some of these early results.

Let me give you a flavor of how this works with privacy technology number five-- differentially private model averaging, which is described in the ICLR paper linked here. The main idea is that in every round of federated learning, just like in the normal round Emily described, an initial model is sent to the device and trained on that device's data. But here's where the first difference comes in. Rather than sending the model update straight back to the server for aggregation, the device first clips the update-- that is, it makes sure the model update is limited to a maximum size, where by maximum size we mean, in a technical sense, an L2 ball in parameter space. Then the server adds noise when combining the device updates for that round. How much noise? Roughly on the same order of magnitude as the maximum size that any one user is going to send.
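Here is a toy NumPy sketch of the clip-and-noise steps just described; the real mechanism and its noise calibration are in the ICLR paper Dan references, and the parameter names here are illustrative.

```python
import numpy as np

# Clip a model update into an L2 ball of radius clip_norm, so no one
# device can contribute more than a bounded amount to the average.
def clip_update(update, clip_norm):
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / norm)

# Average the clipped updates, then add Gaussian noise on the same
# order as the maximum (clipped) size of any one contribution.
def dp_average(updates, clip_norm, noise_multiplier, rng):
    clipped = np.mean([clip_update(u, clip_norm) for u in updates], axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(updates),
                       size=clipped.shape)
    return clipped + noise

rng = np.random.default_rng(0)
updates = [rng.normal(size=4) for _ in range(1000)]
print(dp_average(updates, clip_norm=1.0, noise_multiplier=1.0, rng=rng))
```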
With those two properties combined and properly tuned, any particular aspect of the updated model from that round might be there because some user's contribution suggested the model go in that direction, or because of the random noise. That gives an intuitive notion of plausible deniability about whether any change was due to a user versus the noise. But it actually provides an even stronger formal property: the model you learn with differentially private model averaging will be approximately the same model whether or not any one user actually participated in training. And a consequence of that is that if there is something only one user has typed, this model can't learn it.

We've created a production system for federated computation here at Google, which is what the Gboard team has used in the examples I've talked about today. You can learn more about it in the paper we published at SysML this year, "Towards Federated Learning at Scale: System Design." This system is still being used internally; it's not yet something we expect external developers to be able to use, but that's something we're certainly very interested in supporting.

EMILY GLANZ: Awesome. We're excited to share our community project that allows everyone to develop the building blocks of federated computations: TensorFlow Federated. TFF offers two APIs: the Federated Learning, or FL, API and the Federated Core, or FC, API. The FL API comes with implementations of federated training and evaluation that can be applied to your existing Keras models, so you can experiment with federated learning in simulation. The FC API allows you to build your own federated computations. And TFF also comes with a local runtime for simulations.

Earlier, we showed you how federated computation works conceptually. Here's what it looks like in TFF. We're going to refer to these sensor readings collectively as a federated value. Each federated value has a federated type that includes both the placement-- here, at clients-- and the type of the data items themselves-- here, a float32. The value at the server also has a federated type; this time, we've dropped the curly braces to indicate that this is one value and not many. That gets us to our next concept: a distributed aggregation protocol that runs between the clients and the server-- in this case, tff.federated_mean. This is a federated operator that you can think of as a function, even though its inputs and outputs live in different places. A federated op represents an abstract specification of a distributed communication protocol, and TFF provides a library of these federated operators representing the common building blocks of federated protocols.

So now I'm going to run through a brief code example using TFF. I'm not going to go too in-depth, so it might look a little confusing, but at the end I'll put up a link to a site that provides more tutorials and walkthroughs of the code. The section of code I have highlighted right now declares the federated type that represents our input. You can see we're defining both the placement-- this is at tff.CLIENTS-- and that each data item is a tf.float32. Next, we pass this as an argument to a special function decorator that declares this a federated computation. And here we're invoking our federated operator-- in this case, tff.federated_mean on those sensor readings.
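The transcript doesn't include the slides, so here is a reconstruction of the two snippets being walked through (this example and the threshold example that follows), based on the public TensorFlow Federated tutorials; exact API details may vary across TFF versions.

```python
import tensorflow as tf
import tensorflow_federated as tff

# Example 1: the average-temperature computation.
READINGS_TYPE = tff.FederatedType(tf.float32, tff.CLIENTS)

@tff.federated_computation(READINGS_TYPE)
def get_average_temperature(sensor_readings):
    return tff.federated_mean(sensor_readings)

# Example 2: what fraction of readings exceed a server-held threshold?
THRESHOLD_TYPE = tff.FederatedType(tf.float32, tff.SERVER, all_equal=True)

# The local computation each device performs on its own data item.
@tff.tf_computation(tf.float32, tf.float32)
def exceeds_threshold(reading, threshold):
    return tf.cast(reading > threshold, tf.float32)

@tff.federated_computation(READINGS_TYPE, THRESHOLD_TYPE)
def fraction_over_threshold(readings, threshold):
    client_threshold = tff.federated_broadcast(threshold)
    over = tff.federated_map(exceeds_threshold,
                             [readings, client_threshold])
    return tff.federated_mean(over)
```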
So now let's jump back to the example where the model engineer had a specific question: what fraction of sensors saw readings greater than a certain threshold? Here's what that looks like in TFF. Our first federated operator is tff.federated_broadcast, which is responsible for broadcasting the threshold to the devices. Our next federated operator is tff.federated_map, which you can think of as the map step in MapReduce. It produces the 1s and 0s representing whether each device's local value is greater than the threshold. And, finally, we perform a federated aggregation, tff.federated_mean, to get the result back at the server.

So let's look at this, again, in code. We're, again, declaring our inputs. Let's say we've already declared our readings type, and now we're also defining our threshold type. This time it has a placement at the server, we're indicating that there is only one value with all_equal=True, and it's a tf.float32. We again pass that into the function decorator to declare this a federated computation, and we invoke all those federated operators in the appropriate order. We have tff.federated_broadcast working on the threshold. We perform our mapping step, which takes a local computation-- more on that in a second-- and applies it to the readings and the threshold we just broadcast. And that chunk of code represents the local computation each device performs, comparing its own data item to the threshold it received.

So I know that was a fast introduction to coding with TFF. Please visit tensorflow.org/federated to get more hands-on with the code. And if you like links, we have one more link covering all the ideas we've introduced today about federated learning. Please check out our comic book at federated.withgoogle.com. We were fortunate enough to work with two incredibly talented comic book artists to illustrate these ideas as graphic art. And it even has corgis. That's pretty cool.

DANIEL RAMAGE: All right. In today's talk, we covered decentralized data and federated computation, showed how we can use federated computation building blocks to do learning, and gave you a quick introduction to the TensorFlow Federated project, which you can use today to experiment, in simulation, with how federated learning might work on data sets you already have on the server. As you may have seen, the TF Lite team has also announced that training is a big part of their roadmap, and we're really excited about enabling external developers to run the kinds of things we're running internally sometime soon. We also introduced privacy technologies-- on-device data sets, federated aggregation, secure aggregation, federated model averaging, and its differentially private version-- which embody the privacy principles of only-in-aggregate, ephemeral reports, focused collection, and not memorizing individuals' data.

So we hope we've given you a flavor of the kinds of things that federated learning and computation can do. To learn more, check out the comic book, and play a little with TensorFlow Federated for a preview of how you can write your own federated computations. Thank you very much.

[APPLAUSE]

[MUSIC PLAYING]