[MUSIC PLAYING] JACQUELINE PAN: Hi, everyone. I'm Jackie, and I'm the Lead Program Manager on ML Fairness here at Google. So what is ML fairness? As some of you may know, Google's mission is to organize the world's information and make it universally accessible and useful. Every one of our users gives us their trust, and it's our responsibility to do right by them. And as the impact and reach of AI have grown across societies and sectors, it's critical to ethically design and deploy these systems in a fair and inclusive way. Addressing fairness in AI is an active area of research at Google, from fostering a diverse and inclusive workforce that embodies critical and diverse knowledge, to training models to remove or correct problematic biases. There is no standard definition of fairness, whether decisions are made by humans or by machines. Far from a solved problem, fairness in AI presents both an opportunity and a challenge.

Last summer, Google outlined principles to guide the responsible development and use of AI. One of them directly speaks to ML fairness and making sure that our technologies don't create or reinforce unfair bias. The principles further state that we seek to avoid unjust impacts on people related to sensitive characteristics such as race, ethnicity, gender, nationality, income, sexual orientation, ability, and political or religious belief.

Now let's take a look at how unfair bias might be created or reinforced. An important step on that path is acknowledging that humans are at the center of technology design, in addition to being impacted by it. And humans have not always made product design decisions that are in line with the needs of everyone. For example, because female body-type crash test dummies weren't required until 2011, female drivers were more likely than male drivers to be severely injured in an accident. Band-Aids have long been manufactured in a single color-- a soft pink. In this tweet, you see the personal experience of an individual using a Band-Aid that matches his skin tone for the first time. A product that's designed and intended for widespread use shouldn't fail for an individual because of something they can't change about themselves. Products and technology should just work for everyone. These choices may not have been deliberate, but they still reinforce the importance of being thoughtful about technology design and the impact it may have on humans.

Why does Google care about these problems? Well, our users are diverse, and it's important that we provide an experience that works equally well for all of them. The good news is that humans, you, have the power to approach these problems differently, and to create technology that is fairer and more inclusive for more people.

I'll give you a sense of what that means. Take a look at these images. You'll notice where the label "wedding" was applied to the images on the left, and where it wasn't: the image on the right. The labels in these photos demonstrate how one open source image classifier trained on the Open Images Dataset does not properly recognize wedding traditions from different parts of the world. Open datasets, like Open Images, are a necessary and critical part of developing useful ML models, but some open source datasets have been found to be geographically skewed based on how and where they were collected.
To bring greater geographic diversity to Open Images, last year we enabled the global community of Crowdsource app users to photograph the world around them and make their photos available to researchers and developers as part of the Open Images Extended Dataset. We know that this is just an early step on a long journey. To build inclusive ML products, training data must represent global diversity along several dimensions. These are complex sociotechnical challenges, and they need to be interrogated from many different angles. It's about problem formulation, and about how you think about these systems with human impact in mind.

Let's talk a little bit more about these challenges and where they can manifest in an ML pipeline. Unfairness can enter the system at any point in the ML pipeline, from data collection and handling to model training to end use. Rarely can you identify a single cause of, or a single solution to, these problems. Far more often, various causes interact in ML systems to produce problematic outcomes, and a range of solutions is needed. We try to disentangle these interactions to identify root causes and to find ways forward. This approach spans more than just one team or discipline. ML fairness is an initiative to help address these challenges, and it takes a lot of different individuals with different backgrounds to do this. We need to ask ourselves questions like, how do people feel about fairness when they're interacting with an ML system? How can you make systems more transparent to users? And what's the societal impact of an ML system? Bias problems run deep, and they don't always manifest in the same way. As a result, we've had to learn different techniques for addressing these challenges. Now we'll walk through some of the lessons that Google has learned in evaluating and improving our products, as well as tools and techniques that we're developing in this space. Here to tell you more about this is Tulsee.

TULSEE DOSHI: Awesome. Thanks, Jackie. Hi, everyone. My name is Tulsee, and I lead product for the ML Fairness effort here at Google. Today, I'll talk about three different angles from which we've thought about and acted on fairness concerns in our products, and the lessons that we've learned from that. We'll also walk through our next steps, tools, and techniques that we're developing. Of course, we know that the lessons we're going to talk about today are only some of the many ways of tackling the problem. In fact, as you heard in the keynote on Tuesday, we're continuing to develop new methods, such as [INAUDIBLE], to understand our models and to improve them. And we hope to keep learning with you.

So with that, let's start with data. As Jackie mentioned, datasets are a key part of the ML development process. Data trains a model and informs what a model learns from and sees. Data is also a critical part of evaluating the model. The datasets we choose to evaluate on determine what we know about how the model performs, and when it performs well or doesn't. So let's start with an example. What you see on the screen here is a screenshot from a game called Quick Draw that was developed through the Google AI Experiments program. In this game, people drew images of different objects around the world, like shoes or trees or cars. And we used those images to train an image classification model. This model could then play a game with the users, where a user would draw an image and the model would guess what that image was of. Here you see a whole bunch of drawings of shoes.
And actually, we were really excited, because what better way to get diverse input than to launch something globally, where users across the world could draw images of what they perceived an object to look like? But what we found as this model started to collect data was that most of the images users drew of shoes looked like the shoe in the top right, the blue shoe. So over time, as the model saw more and more examples, it started to learn that a shoe looked a certain way, like that top-right shoe, and it wasn't able to recognize the shoe in the bottom right, the orange shoe. Even though we were able to get data from a diverse set of users, the shoes that users chose to draw, and which users actually engaged with the product at all, were skewed, and that led to skewed training data in what we actually received. This is a social issue first, which is then exacerbated by our technical implementation. Because when we're making classification decisions that divide up the world into parts, even if those parts are just what is a shoe and what isn't a shoe, we're making fundamental judgment calls about what deserves to be in one part or the other. That's easier to deal with when we're talking about shoes, but it's harder when we're classifying images of people.

An example of this is the Google Clips camera. This camera was designed to recognize memorable moments in real-time streaming video. The idea is that it automatically captures memorable motion photos of friends, of family, or even of pets. And we designed the Google Clips camera to have equitable outcomes for all users. It, like all of our camera products, should work for all families, no matter who or where they are. It should work for people of all skin tones, of all age ranges, in all poses, and in all lighting conditions. As we started to build this system, we realized that if we only created training data that represented certain types of families, the model would also only recognize certain types of families. So we had to do a lot of work to increase our training data's coverage and to make sure that it would recognize everyone. We went global to collect these datasets, collecting datasets of different types of families, in different environments, and in different lighting conditions. And in doing so, we were able to make sure not only that we could train a model with equitable outcomes, but that we could also evaluate it across a whole bunch of different variables, like lighting or space. This is something that we're continuing to do: creating automatic fairness tests for our systems, so that we can see how they change over time and continue to ensure that they are inclusive of everyone.

The biggest lesson we've learned in this process is how important it is to build training and evaluation datasets that represent all the nuances of our target population. This means both making sure that the data we collect is diverse and representative, and making sure that the different contexts in which users provide us that data are taken into account. Even if you have a diverse set of users, that doesn't mean that the images of shoes you get will be diverse. And so thinking about those nuances and the trade-offs that might occur when you're collecting your data is super important. Additionally, it's also important to reflect on who that target population might leave out. Who might not actually have access to this product?
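One simple way to start surfacing those gaps is to check how well each group you care about is actually represented in your data. Here is a minimal sketch of that kind of coverage check; the function, the grouping, and the threshold are illustrative, not an internal Google tool.

```python
from collections import Counter

# Minimal sketch of a dataset coverage check (illustrative only): count how
# often each group appears in the data and flag groups that are missing or
# underrepresented relative to a minimum share you decide on.

def coverage_report(examples, group_fn, expected_groups, min_share=0.05):
    counts = Counter(group_fn(ex) for ex in examples)
    total = sum(counts.values())
    report = {}
    for group in expected_groups:
        share = counts.get(group, 0) / total if total else 0.0
        report[group] = {
            "count": counts.get(group, 0),
            "share": share,
            "flagged": share < min_share,
        }
    return report

# Example: group_fn might map an image's metadata to a region or a lighting
# condition; any expected group with a tiny share is a likely blind spot.
```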
Where are the blind spots in who we're reaching? And lastly, how will the data that you're collecting grow and change over time? As our users use our products, they very rarely use them in exactly the way we anticipated. And so the way that we collect data, or the data that we even need to be collecting, changes over time. It's important that our collection and maintenance methods stay just as diverse as that initial process.

But even if you have a perfectly balanced, wonderful training dataset, that doesn't necessarily imply that the output of your model will be perfectly fair. Also, it can be hard to collect completely diverse datasets at the start of a process, and you don't always know what it is that you're missing from the beginning. Where are your blind spots in what you're trying to do? Because of that, it's always important to test, to test and measure these issues at scale for individual groups, so that we can identify where our model may not be performing as well, and where we might want to think about more principled improvements. The benefit of measurement is also that you can start tracking these changes over time. You can understand how the model works. Similar to the way that you would always want to have metrics for your model as a whole, it's important to think about how you slice those metrics, and how you can give yourself a holistic understanding of how this model or system works for everybody.

What's interesting is that different fairness concerns may require different metrics, even within the same product experience. A disproportionate performance problem is when, for example, a model works well for one group but not as well for another. For example, you could have a model that doesn't recognize some subset of users, or errs more often for that subset of users. In contrast, a representational harm problem is when a model showcases an offensive stereotype or harmful association. This doesn't necessarily happen at scale, but even a single instance can be hurtful and harmful to a set of users, and it requires a different way of stress-testing the system.

Here's an example where both of those metrics may apply. The screenshot you see is from our Jigsaw Perspective API. This API is designed to detect hate and harassment in the context of online conversations. The idea is, given a particular sentence, we can classify whether or not that sentence is likely to be perceived as toxic. We offer this API externally, so our users can actually write sentences and give us feedback. And one of our users articulated the particular example that you see here. The sentence, "I am straight," is given a score of 0.04 and is classified as "unlikely to be perceived as toxic," whereas the sentence, "I am gay," was given a score of 0.86 and was classified as "likely to be perceived as toxic." Both of these are innocuous identity statements, but one was given a significantly higher score. This is something we would never want to see in our products. We not only wanted to fix this immediate example; we wanted to understand and quantify these issues to ensure that we could tackle them appropriately. The first thing we looked at was this concept of "representational harm," understanding these counterfactual differences. For a particular sentence, we would want the sentence to be classified the same way regardless of the identity referenced in it.
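As a rough illustration of that kind of counterfactual check, here is a minimal sketch. The toxicity_score function is a stand-in for whatever classifier you are testing; this is not the actual Perspective API client.

```python
# Sketch of a counterfactual fairness check: the same template sentence,
# with only the identity term swapped, should get roughly the same score.
# `toxicity_score` is a placeholder for the classifier under test.

TEMPLATES = ["I am {}.", "He is {}.", "My friends are {}."]
IDENTITY_TERMS = ["straight", "gay", "Muslim", "Jewish", "Christian"]

def counterfactual_gaps(toxicity_score, templates=TEMPLATES, terms=IDENTITY_TERMS):
    gaps = {}
    for template in templates:
        scores = {term: toxicity_score(template.format(term)) for term in terms}
        # The gap between the highest- and lowest-scored identity term is a
        # simple signal of representational harm for this template.
        gaps[template] = max(scores.values()) - min(scores.values())
    return gaps
```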
Whether that sentence is "I am Muslim," "I am Jewish," or "I am Christian," you would expect the score from the classifier to be the same. Being able to produce these scores allowed us to understand how the system performed. It allowed us to identify places where our model might be more likely to be biased, and to go in and understand those concerns more deeply.

But we also wanted to understand overall error rates for particular groups. Were there particular identities where, when referenced in comments, we were more likely to have errors than for others? This is where the disproportionate performance question comes in. We wanted to develop metrics that showcased, on average across a set of comments referencing a particular identity term, whether or not we were more likely to misclassify them. This went in both directions-- misclassifying something as toxic, but also misclassifying something as not toxic when it truly was a harmful statement. The three metrics you see here capture different ways of looking at that problem. The darker the color, the darker the purple, the higher the error rate. And you can see that in the first version of this model, there were huge disparities between different groups.

So OK, we were able to measure the problem. But then how do we improve it? How do we make sure this doesn't happen? A lot of research has been published in the last few years, both internally within Google and externally, that looks at how to train and improve our models in a way that still allows them to be stable, resource-efficient, and accurate, so that we can still deploy them in production use cases. These approaches balance simplicity of implementation with the accuracy and quality that we require.

The simplest way to think about this problem would be through the idea of removals or block lists: taking steps to ensure that your model can't access information in a way that could lead to skewed outcomes. Take, for example, the sentence, "Some people are Indian." We might remove that identity term altogether and replace it with a more generic tag, "identity," as sketched below. If you did this for every single identity term, your model wouldn't even have access to identity information. It would simply know that the sentence referenced an identity. As a result, it couldn't make different decisions for different identities or different user groups. This is a great way to make sure that your model is agnostic to any particular definition of an individual. At the same time, it can be harmful. It actually might be useful in certain cases to know when identity terms are used in a way that is offensive or harmful. If a particular term is often used in a negative or derogatory context, we would want to know that, so we could classify such comments as toxic. Sometimes this context is really important, but it's important that we capture it in a nuanced and contextual way.

Another way to think about it is to go back to that first lesson, and look back at the data. We can enable our models to sample data from areas where the model seems to be underperforming. We can do this both manually and algorithmically. On the manual side, what you see on the right is a quote collected through Google's Project Respect effort. Through Project Respect, we went around the globe to collect more and more comments with positive representations of identity.
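To picture that removal idea concretely, here is a minimal sketch of identity-term masking. The term list and the tag are illustrative, and a real list would need to be far longer and more carefully curated.

```python
import re

# Sketch of the "block list" / removal approach: replace identity terms with
# a generic tag so the model cannot condition on them. Illustrative only.
IDENTITY_TERMS = ["indian", "muslim", "jewish", "christian", "gay", "straight"]
_PATTERN = re.compile(r"\b(" + "|".join(IDENTITY_TERMS) + r")\b", re.IGNORECASE)

def mask_identity_terms(text: str, tag: str = "IDENTITY") -> str:
    return _PATTERN.sub(tag, text)

# mask_identity_terms("Some people are Indian.") -> "Some people are IDENTITY."
# The trade-off: the model also loses the context it would need to catch
# genuinely derogatory uses of those same terms.
```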
The quote is from a pride parade, where someone from Lithuania talks about their gay friends, and how they're brilliant and amazing people. Positive reflections of identity are great examples for training our model, and they support the model in developing a contextual and nuanced understanding of comments, especially when the model is usually trained on online comments that may not always have the same flavor. We can also enable the model to do this algorithmically, through active sampling. The model can identify the places where it has the least confidence in its decision making, where it might be underperforming, and it can actively go out and sample more training data of that type. We can build even more examples through synthetic data. Similar to what you saw at the beginning, we can create short sentences like "I am," "He is," and "My friends are," and these sentences give the model an understanding of how identity terms are used in natural contexts. We can even make changes directly to our models, by updating the model's loss function to minimize differences in performance between different groups of individuals. Adversarial training and MinDiff loss, two of the research methods in this space, look at how to adjust your loss function to keep the model stable and lightweight while still enforcing this kind of penalty (a rough sketch of the idea appears below).

What you saw earlier were the results of the Toxicity V1 model. As we made changes, especially by creating manual synthetic examples and augmenting the data, we were able to see real improvements. This is the Toxicity V6 model, where you can see that the colors get lighter as the performance for individual identity groups gets better. We're really excited about the progress that we've made here, but we know that there is still a long way to go. The results you see here are on synthetic data, short identity statements like the ones I talked about earlier. But the story of bias can become much more complex when you're talking about real data, comments that are actually used in the wild. We're currently working on evaluating our systems on real comments, building up these datasets, and then trying to deepen our understanding of performance and improvements in that space. While we've already seen progress and improvements from our changes on real comments, we know that looking at these real datasets will help even more. And actually, there's a Kaggle competition live now if you're interested in checking this out further.

Overall, the biggest lesson is "Test early and test often." Measuring your systems is critical to understanding where the problems exist, where our users might be facing risk, or where our products aren't working the way we intend them to. Also, bias can affect the user experience and cause issues in many different forms, so it's important to develop methods for measuring the scale of each problem. Even a single product may manifest bias in different ways, so we want to be sure to measure those different metrics as well. The other thing to note is that it's not always about quantitative metrics. Qualitative metrics, user research, and adversarial testing, where you really stress-test and poke at your product manually, can also be really, really valuable. Lastly, it is possible to take proactive steps in modeling that are aware of your production constraints.
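To give a sense of the shape of that kind of loss change, here is a minimal sketch of a MinDiff-style penalty. It is not the published implementation, which uses a more careful distribution-matching term; it simply adds a penalty on the score gap between two slices of a batch to the ordinary task loss.

```python
import tensorflow as tf

# Sketch of a MinDiff-style training objective (shape of the idea only, not
# the published method): the ordinary task loss, plus a penalty on the
# difference in average predicted score between two groups of examples in
# the batch (e.g. comments that mention an identity vs. those that don't).
# Assumes every batch contains examples from both slices; the weight is an
# illustrative value.

def min_diff_style_loss(y_true, y_pred, in_group_mask, weight=1.5):
    task_loss = tf.reduce_mean(
        tf.keras.losses.binary_crossentropy(y_true, y_pred))

    in_group = tf.boolean_mask(y_pred, in_group_mask)
    out_group = tf.boolean_mask(y_pred, tf.logical_not(in_group_mask))

    # Penalize the gap between the average scores of the two slices.
    gap = tf.abs(tf.reduce_mean(in_group) - tf.reduce_mean(out_group))
    return task_loss + weight * gap
```

In practice, the penalty weight, the choice of slices, and the distance measure all matter, which is part of why this remains an active research area.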
These techniques have been invaluable in our own internal use cases, and we will continue to publish these methods for you to use as well. You can go to mlfairness.com to learn more.

I also want to talk about design, and this is our third lesson for today, because context is really important. The way that our users interact with our results differs, and our design decisions around those results have consequences, because the experience that a user has with a product extends beyond the performance of the model. It relates to how users are actually engaging with the results. What are they seeing? What kind of information are they being given? What kind of information do they have that the model may not have?

Let's look at an example. Here you see an example from the Google Translate product: a translation from Turkish to English. Turkish is a gender-neutral language, which means that in Turkish, nouns aren't gendered, and "he," "she," and "it" are all referenced through the pronoun "O." I actually misspoke; I believe not all nouns are gendered, but some may be. Thus, while the sentences in Turkish in this case don't actually specify gender, our product translates them according to common stereotypes: "She is a nurse," while "He is a doctor." So why does that happen? Well, Google Translate learns from hundreds of millions of already translated examples from the web, and it therefore also learns the historical and social trends that come with those hundreds of millions of examples, the historical trends of how we've thought about occupations in society thus far. So it skews masculine for "doctor," whereas it skews feminine for "nurse."

As we started to look into this problem, we went back to those first two lessons. OK, how can we make the training data more diverse? How can we make it more representative of the full diversity of gender? Also, how could we better train the model? How could we measure and improve in this space, and then make modeling changes? Both of these questions are important. But what we started to realize is how important context was in this situation. Take, for example, the sentence, "Casey is my friend." Let's say we want to translate it to Spanish, in which case "friend" could be "amigo," the masculine version, or "amiga," the feminine version. Well, how do we know if Casey is a male, female, or gender non-binary friend? We don't have that context. Even a perfectly precise model trained on diverse data representing all kinds of professions would not have that context. And so we realized that even if we made our understanding of terms more neutral, and even if we built up model precision, we would want to give this choice to the user, who actually understands what they are trying to achieve with the sentence and the translation. So what we did was choose to provide that choice to our users in the form of options and selections. We translate "friend" both to "amigo" and to "amiga," so that the user can make an informed choice based on the context that they have. Currently, this solution is only available for a few languages, and only for single terms like "friend." But we're actively working on expanding it to more languages, and on being inclusive of larger sentences and longer contexts, so we can tackle the example you saw earlier.
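As a purely illustrative sketch, and not the actual Google Translate interface, the shape of that design change is simple: when a source word is gender-ambiguous, return one candidate per reading instead of silently picking a default.

```python
# Illustrative sketch only -- not the Google Translate API. The point is the
# shape of the result: surface one translation per reading and let the user
# choose, since only they have the context.

GENDERED_OPTIONS_ES = {
    "friend": [("masculine", "amigo"), ("feminine", "amiga")],
    "doctor": [("masculine", "doctor"), ("feminine", "doctora")],
}

def translation_options(word: str):
    options = GENDERED_OPTIONS_ES.get(word.lower())
    if options is None:
        return None  # unambiguous: fall through to the normal single result
    return [{"gender": gender, "translation": text} for gender, text in options]

# translation_options("friend")
# -> [{'gender': 'masculine', 'translation': 'amigo'},
#     {'gender': 'feminine', 'translation': 'amiga'}]
```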
We're excited about this line of thinking, though, because it enables us to think about fairness beyond simply the data and the model, and instead as a holistic experience that a user engages with every day, and to make sure that we build those communication lines between the product and the end consumer. The biggest lesson we learned here is that context is key. Think about the ways that your users will interact with your product, the information that they may have that the model doesn't have, and the information that the model might have that the user doesn't have. How do you enable users to communicate effectively with your product, and also get the right transparency back from it? Sometimes this is about providing user options, like you saw with Translate. Sometimes it's also about providing more context on the model's decisions, and being a bit more explainable and interpretable.

The other piece that's important is making sure that you get feedback from diverse users. In this case, that meant users who spoke different languages and who had different definitions of identity. But it's also important, as you're trying to get feedback from users, to think about the different ways in which those users provide you feedback. Not every user is equally comfortable with the same feedback mechanism, or equally likely to proactively give you feedback in, say, a feedback form in your product. So it's important to make sure, whether through user research, or through dogfooding, or through different feedback mechanisms in your product, that you identify different ways to reach different communities who might be more or less likely to provide that information. Lastly, identify ways to enable multiple experiences in your product. Identify the places where there could be more than one correct answer, for example, and find ways to let users have those different experiences. Representing human culture and all of its differences requires more than a theoretical and technical toolkit. It requires a much richer, more context-dependent experience. And that is really, at the end of the day, what we want to provide our users.

We hope that those lessons were helpful. They've been lessons that we've been really, really grateful to learn, and that we've started to execute on in our own products. But what's next? We're starting to put these lessons into practice. And while we know that product development in ML fairness is a context-dependent exercise, we do want to start building some of the fundamentals in terms of tools, resources, and best practices, because we know how important it is to at least start with those metrics, start with the ability to collect diverse data, and start with consistent communication. One of the first things we're thinking about is transparency frameworks. We want to create and leverage frameworks that drive consistent communication, both within Google and with the industry at large, about fairness and other risks that might exist with data collection and modeling. We also want to build tools and techniques, developing and socializing tools that enable evaluating and improving fairness concerns.

Let's talk about transparency first. Today, we're committing to a framework for transparency that ensures that we think about, measure, and communicate about our models and data in a consistent way.
This is not about achieving perfection in our data or models, although of course we hope to get there. It's about the context under which something is supposed to be used. What are its intended use cases? What is it not intended for? And how does it perform across various users?

We released our first Data Card last October as part of the Open Images Extended Dataset that you heard Jackie talk about earlier. This Data Card allows us to answer questions like: What are the intended use cases of this dataset? What is the nature of the content? What data was excluded, if any? Who collected the data? It also allows us to go into some of the fairness considerations. Who labeled the data, and what information did they have? How was the data sourced? And what is its distribution? For Open Images Extended, for example, while the geographic distribution is quite broad, 80% of the data comes from India. This is an important finding for anyone who wants to use this dataset, whether for training or for testing purposes. It might inform how you interpret your results, and it might inform whether or not you choose to augment this dataset with something else, for example. This kind of transparency allows for open communication about what the actual use cases of the dataset should be, and where it may have flaws.

We want to take this a step further with Model Cards. Here you see an example screenshot for the Jigsaw Perspective Toxicity API that we talked about earlier. With Model Cards, we want to be able to give you an overview of what the model is about, what metrics we use to think about it, how it was architected, how it was trained, how it was tested, what we think it should be used for, and where we believe it has limitations. We hope that the Model Card framework will work across models, so not just for something like toxicity, but also for a face detection model, or for any other use case we can think of. In each case, the framework should be consistent. We can look at metrics. We can look at use cases. We can look at the training and test data. And we can look at the limitations. Each Model Card will also have the quantitative metrics that tell you how the model performs. Here, for example, you can see an example set of metrics sliced by age. You can see the performance on all ages, on the child age bucket, on the adult age bucket, and on the senior age bucket.

So how do you create those metrics? How do you compute them? Well, we also want to provide you the tools to do this analysis, to create your own Model Cards, and to improve your models over time. The first piece of this set of tools and resources is open datasets. The Open Images Extended Dataset is one of many datasets that we have and hope to continue to open source in the coming years. In this example, the Open Images Extended Dataset collects data from crowdsourced users who are taking images of objects in their own regions of the world. You can see, for example, how a hospital or food might look different in different places, and how important it is for us to have that data. With the live Kaggle competition, we have also open sourced a dataset related to the Perspective Toxicity API. I mentioned earlier how important it is for us to look at real comments and real data. So here, the Jigsaw team has open sourced a dataset of real comments from around the web.
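At its core, a Data Card like this is structured metadata that travels with the dataset. Here is a minimal, purely illustrative sketch; the field names and values are examples, not the official template.

```python
# Illustrative sketch of a Data Card as structured metadata. Field names and
# values are examples for discussion, not the official Data Card schema.
data_card = {
    "dataset": "Open Images Extended (crowdsourced)",
    "intended_uses": ["Augmenting training data", "Fairness evaluation"],
    "collection": "Photos volunteered by Crowdsource app users worldwide",
    "labeling": "Describe who labeled the data and what information they had",
    "known_skews": {
        # From the talk: the distribution is broad, but roughly 80% of the
        # images come from India.
        "geography": {"india": 0.80, "rest_of_world": 0.20},
    },
    "excluded_data": "Describe any data that was excluded, and why",
    "fairness_considerations": [
        "Consider augmenting with other sources if geographic balance matters",
    ],
}
```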
Each comment in that dataset is annotated with the identity that the comment references, with whether or not the comment is toxic, and with other attributes of the comment as well. We hope that datasets like these continue to advance the conversation, the evaluation, and the improvement of fairness.

Once you have a dataset, the question becomes, how do you take that a step further? How do you evaluate the model? One thing you can do today is deep-dive with the What-If Tool. The What-If Tool is available as a TensorBoard plugin, as well as in Jupyter notebooks. You can deep-dive into specific examples and see how changing features actually affects your outcome. You can explore different fairness definitions, and see how modifying the threshold of your model might change the goals that you're achieving. Here's a screenshot of the What-If Tool. What you see here on the right is a whole bunch of data points that have been classified by your model; data points of a similar color have been given a similar score. You can select a particular data point and then edit its feature values to see how changing the input would potentially change the output. For example, if I changed the age defined in this example, does it actually change my classification? If it does, that might tell me something about how age is influencing my model, and where there may be biases, or where I need to deep-dive a little bit more.

We also hope to take this a step further with Fairness Indicators, which will be launched later this year. Fairness Indicators will be a tool that is built on top of TensorFlow Model Analysis, and as a result can work end to end with the TFX pipeline. TFX stands for TensorFlow Extended, and it's a platform that allows you to train, evaluate, and serve your models, all in one go. So we're hoping to build fairness into this workflow and into these processes. But Fairness Indicators will also work alone, as an independent tool that can be used with any production pipeline. We hope that with Fairness Indicators, you'll be able to look at data on a large scale and see how your model actually performs. You can compute fairness metrics for any individual group, and visualize these comparisons against a baseline slice. Here, for example, you can see the baseline slice as the overall average metric in blue, and then you can compare how individual groups or individual slices compare to that baseline. For example, some may have a higher false negative rate than average, while others may have a lower one. We'll provide guidance about the main metrics that we believe have been useful for various fairness use cases. You can then also use Fairness Indicators to evaluate at multiple thresholds, to understand how performance changes and how changes to your model could lead to different outcomes for different users. If you find a slice that doesn't seem to be performing as well as you expect it to, you can take that slice further by deep-diving immediately with the What-If Tool. We will also be providing confidence intervals, so that you can understand where the differences you're seeing are significant, and where you may actually need more data to better understand the problem. With Fairness Indicators, we'll also be launching case studies showing how we've leveraged these metrics and improvements internally in our own products.
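To make the idea behind those sliced metrics concrete, here is a plain-Python illustration. It is not the Fairness Indicators or TensorFlow Model Analysis API; it just shows the underlying comparison of a per-slice metric against the overall baseline, with a rough bootstrap confidence interval.

```python
import numpy as np

# Illustrative sliced-metric computation (not the Fairness Indicators API).
# Inputs are NumPy arrays: binary labels, binary predictions, and a group
# identifier per example. For each group we compute the false negative rate,
# compare it to the overall baseline, and bootstrap a rough 95% interval.

def false_negative_rate(y_true, y_pred):
    positives = y_true == 1
    if positives.sum() == 0:
        return float("nan")
    return float(((y_pred == 0) & positives).sum() / positives.sum())

def sliced_fnr(y_true, y_pred, groups, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    baseline = false_negative_rate(y_true, y_pred)
    report = {"overall": baseline}
    for group in np.unique(groups):
        idx = np.where(groups == group)[0]
        fnr = false_negative_rate(y_true[idx], y_pred[idx])
        # Resample the slice with replacement to get a rough interval.
        samples = [
            false_negative_rate(y_true[s], y_pred[s])
            for s in (rng.choice(idx, size=len(idx), replace=True)
                      for _ in range(n_boot))
        ]
        low, high = np.nanpercentile(samples, [2.5, 97.5])
        report[str(group)] = {
            "fnr": fnr,
            "ci95": (float(low), float(high)),
            "diff_from_overall": fnr - baseline,
        }
    return report
```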
We hope these case studies will help provide context about where we've found certain metrics useful, what kinds of insights they've given us, and where certain metrics haven't really served the full purpose. We'll also provide benchmark datasets that can be immediately used for vision and text use cases. We hope that Fairness Indicators will simply be a start to being able to ask questions of our models, understand fairness concerns, and then, over time, improve them.

Our commitment to you is that we will continue to measure, improve, and share our learnings related to fairness. It is important not only that we make our own products work for all users, but that we continue to share these best practices and learnings so that we, as an industry, can continue to develop fairer products-- products that work equitably for everybody. One thing I do want to underscore is that we know that in order to create diverse products, products that work for diverse users, it is also important to have diverse voices in the room. This not only means making sure that we have diverse voices internally working on our products, but also that we include you, the community, in this process. We want your feedback on our products, but we also want to learn from you about how you're tackling fairness and inclusion in your own work, what lessons you're learning, and what resources you're finding useful. And we want to work with you to continue to build and develop this resource toolkit, so that we can continue, as an industry, to build products that are inclusive for everyone. Thank you.

[MUSIC PLAYING]