
  • [MUSIC PLAYING]

  • JACQUELINE PAN: Hi, everyone.

  • I'm Jackie, and I'm the Lead Program Manager on ML Fairness

  • here at Google.

  • So what is ML fairness?

  • As some of you may know, Google's mission

  • is to organize the world's information

  • and make it universally accessible and useful.

  • Every one of our users gives us their trust.

  • And it's our responsibility to do right by them.

  • And as the impact and reach of AI

  • has grown across societies and sectors,

  • it's critical to ethically design and deploy

  • these systems in a fair and inclusive way.

  • Addressing fairness in AI is an active area

  • of research at Google, from fostering

  • a diverse and inclusive workforce that

  • embodies critical and diverse knowledge to training models

  • to remove or correct problematic biases.

  • There is no standard definition of fairness,

  • whether decisions are made by humans or by machines.

  • Far from a solved problem, fairness in AI

  • presents both an opportunity and a challenge.

  • Last summer, Google outlined principles

  • to guide the responsible development and use of AI.

  • One of them directly speaks to ML fairness

  • and making sure that our technologies don't create

  • or reinforce unfair bias.

  • The principles further state that we

  • seek to avoid unjust impacts on people related

  • to sensitive characteristics such as race, ethnicity,

  • gender, nationality, income, sexual orientation, ability,

  • and political or religious belief.

  • Now let's take a look at how unfair bias might be created

  • or reinforced.

  • An important step on that path is

  • acknowledging that humans are at the center of technology

  • design, in addition to being impacted by it.

  • And humans have not always made product design decisions

  • that are in line with the needs of everyone.

  • For example, because female body-type crash test dummies

  • weren't required until 2011, female drivers

  • were more likely than male drivers

  • to be severely injured in an accident.

  • Band-Aids have long been manufactured in a single

  • color--

  • a soft pink.

  • In this tweet, you see the personal experience

  • of an individual using a Band-Aid that matches his skin

  • tone for the first time.

  • A product that's designed and intended for widespread use

  • shouldn't fail for an individual because of something

  • that they can't change about themselves.

  • Products and technology should just work for everyone.

  • These choices may not have been deliberate,

  • but they still reinforce the importance

  • of being thoughtful about technology

  • design and the impact it may have on humans.

  • Why does Google care about these problems?

  • Well, our users are diverse, and it's important

  • that we provide an experience that works equally

  • well across all of our users.

  • The good news is that humans, you, have the power

  • to approach these problems differently,

  • and to create technology that is fair and more

  • inclusive for more people.

  • I'll give you a sense of what that means.

  • Take a look at these images.

  • You'll notice where the label "wedding" was applied

  • to the images on the left, and where it wasn't, the image

  • on the right.

  • The labels in these photos demonstrate

  • how one open source image classifier trained on the Open

  • Images Dataset does not properly recognize wedding traditions

  • from different parts of the world.

  • Open datasets, like Open Images, are a necessary and critical

  • part of developing useful ML models,

  • but some open source datasets have

  • been found to be geographically skewed based on how

  • and where they were collected.

  • To bring greater geographic diversity

  • to open images, last year, we enabled the global community

  • of crowdsourced app users to photograph the world

  • around them and make their photos available to researchers

  • and developers as a part of the Open Images Extended Dataset.

  • We know that this is just an early step on a long journey.

  • And to build inclusive ML products,

  • training data must represent global diversity

  • along several dimensions.

  • These are complex sociotechnical challenges,

  • and they need to be interrogated from many different angles.

  • It's about problem formation and how

  • you think about these systems with human impact in mind.

  • Let's talk a little bit more about these challenges

  • and where they can manifest in an ML pipeline.

  • Unfairness can enter the system at any point in the ML

  • pipeline, from data collection and handling to model training

  • to end use.

  • Rarely can you identify a single cause of or a single solution

  • to these problems.

  • Far more often, various causes interact in ML systems

  • to produce problematic outcomes.

  • And a range of solutions is needed.

  • We try to disentangle these interactions

  • to identify root causes and to find ways forward.

  • This approach spans more than just one team or discipline.

  • ML fairness is an initiative to help address these challenges.

  • And it takes a lot of different individuals

  • with different backgrounds to do this.

  • We need to ask ourselves questions like, how do people

  • feel about fairness when they're interacting with an ML system?

  • How can you make systems more transparent to users?

  • And what's the societal impact of an ML system?

  • Bias problems run deep, and they don't always

  • manifest in the same way.

  • As a result, we've had to learn different techniques

  • of addressing these challenges.

  • Now we'll walk through some of the lessons that Google

  • has learned in evaluating and improving our products,

  • as well as tools and techniques that we're

  • developing in this space.

  • Here to tell you more about this is Tulsee.

  • TULSEE DOSHI: Awesome.

  • Thanks, Jackie.

  • Hi, everyone.

  • My name is Tulsee, and I lead product for the ML Fairness

  • effort here at Google.

  • Today, I'll talk about three different angles in which we've

  • thought about and acted on fairness

  • concerns in our products, and the lessons

  • that we've learned from that.

  • We'll also walk through our next steps, tools, and techniques

  • that we're developing.

  • Of course, we know that the lessons

  • we're going to talk about today are only

  • some of the many ways of tackling the problem.

  • In fact, as you heard in the keynote on Tuesday,

  • we're continuing to develop new methods,

  • such as [INAUDIBLE], to understand our models

  • and to improve them.

  • And we hope to keep learning with you.

  • So with that, let's start with data.

  • As Jackie mentioned, datasets are a key part

  • of the ML development process.

  • Data trains a model and informs what

  • a model learns from and sees.

  • Data is also a critical part of evaluating the model.

  • The datasets we choose to evaluate on indicate what we

  • know about how the model performs,

  • and when it performs well or doesn't.

  • So let's start with an example.

  • What you see on the screen here is a screenshot

  • from a game called Quick Draw that

  • was developed through the Google AI Experiments program.

  • In this game, people drew images of different objects

  • around the world, like shoes or trees or cars.

  • And we used those images to train an image classification model.

  • This model could then play a game

  • with the users, where a user would draw an image

  • and the model would guess what that image was of.

  • Here you see a whole bunch of drawings of shoes.

  • And actually, we were really excited,

  • because what better way to get diverse input

  • from a whole bunch of users than to launch something globally

  • where a whole bunch of users across the world

  • could draw images for what they perceived an object

  • to look like?

  • But what we found as this model started to collect data

  • was that most of the images that users

  • drew of shoes looked like that shoe in the top right,

  • the blue shoe.

  • So over time, as the model saw more and more examples,

  • it started to learn that a shoe looked a certain way

  • like that top right shoe, and wasn't

  • able to recognize the shoe in the bottom right,

  • the orange shoe.

  • Even though we were able to get data

  • from a diverse set of users, the shoes

  • that the users chose to draw or the users

  • who actually engaged with the product at all

  • were skewed, and led to skewed training data

  • in what we actually received.

  • This is a social issue first, which

  • is then exacerbated by our technical implementation.

  • Because when we're making classification decisions that

  • divide up the world into parts, even if those parts are

  • what is a shoe and what isn't a shoe,

  • we're making fundamental judgment calls

  • about what deserves to be in one part

  • or what deserves to be in the other.

  • It's easier to deal with when we're talking about shoes,

  • but it's harder to talk about when we're

  • classifying images of people.

  • An example of this is the Google Clips camera.

  • This camera was designed to recognize memorable moments

  • in real-time streaming video.

  • The idea is that it automatically

  • captures memorable motion photos of friends, of family, or even

  • of pets.

  • And we designed the Google Clips camera

  • to have equitable outcomes for all users.

  • It, like all of our camera products,

  • should work for all families, no matter who or where they are.

  • It should work for people of all skin tones,

  • all age ranges, and in all poses,

  • and in all lighting conditions.

  • As we started to build this system,

  • we realized that if we only created training data that

  • represented certain types of families,

  • the model would also only recognize

  • certain types of families.

  • So we had to do a lot of work to increase our training data's

  • coverage and to make sure that it would recognize everyone.

  • We went global to collect these datasets, collecting datasets

  • of different types of families, in different environments,

  • and in different lighting conditions.

  • And in doing so, we were able to make sure

  • that not only could we train a model that

  • had diverse outcomes, but that we could also

  • evaluate it across a whole bunch

  • of different variables like lighting or space.

  • This is something that we're continuing to do,

  • continuing to create automatic fairness tests for our systems

  • so that we can see how they change over time

  • and to continue to ensure that they are inclusive of everyone.

  • The biggest lesson we've learned in this process

  • is how important it is to build training and evaluation

  • datasets that represent all the nuances of our target

  • population.

  • This both means making sure that the data that we collect

  • is diverse and representative, but also

  • that the different contexts of the way

  • that the users are providing us this data

  • is taken into account.

  • Even if you have a diverse set of users,

  • that doesn't mean that the images of shoes you get

  • will be diverse.

  • And so thinking about those nuances and the trade-offs that

  • might occur when you're collecting your data

  • is super important.

  • Additionally, it's also important to reflect

  • on who that target population might leave out.

  • Who might not actually have access to this product?

  • Where are the blind spots in who we're reaching?

  • And lastly, how will the data that you're collecting

  • grow and change over time?

  • As our users use our products, they very

  • rarely use them in exactly the way we anticipated them to.

  • And so what happens is the way that we collect data,

  • or the data that we even need to be collecting,

  • changes over time.

  • And it's important that our collection methods

  • and our maintenance methods are as

  • diverse as that initial process.

  • But even if you have a perfectly balanced, wonderful training

  • dataset, that doesn't necessarily

  • imply that the output of your model will be perfectly fair.

  • Also, it can be hard to collect completely diverse datasets

  • at the start of a process.

  • And you don't always know what it is that you're

  • missing from the beginning.

  • Where are your blind spots in what you're trying to do?

  • Because of that, it's always important to test,

  • to test and measure these issues at scale for individual groups,

  • so that we can actually identify where our model may not

  • be performing as well, and where we

  • might want to think about more principled improvements.

  • The benefit of measurement is also

  • that you can start tracking these changes over time.

  • You can understand how the model works.

  • Similar to the way that you would always

  • want to have metrics for your model as a whole,

  • it's important to think about how you slice those metrics,

  • and how you can provide yourself a holistic understanding of how

  • this model or system works for everybody.

  • What's interesting is that different fairness concerns

  • may require different metrics, even within the same product

  • experience.

  • A disproportionate performance problem

  • is when, for example, a model works well for one group,

  • but may not work as well for another.

  • For example, you could have a model

  • that doesn't recognize some subset of users or errs

  • more for that subset of users.

  • In contrast, a representational harm problem

  • is when a model showcases an offensive stereotype

  • or harmful association.

  • Maybe this doesn't necessarily happen at scale.

  • But even a single instance can be hurtful and harmful

  • to a set of users.

  • And this requires a different way

  • of stress-testing the system.

  • Here's an example where both of those metrics may apply.

  • The screenshot you see is from our Jigsaw Perspective API.

  • This API is designed to detect hate and harassment

  • in the context of online conversations.

  • The idea is, given a particular sentence,

  • we can classify whether or not that sentence is

  • perceived likely to be toxic.

  • We have this API externally.

  • So our users can actually write sentences and give us feedback.

  • And what we found was one of our users

  • articulated a particular example that you see here.

  • The sentence, "I am straight," is given a score of 0.04,

  • and is classified as "unlikely to be perceived as toxic."

  • Whereas the sentence, "I am gay,"

  • was given a score of 0.86, and was

  • classified as "likely to be perceived as toxic."

  • Both of these are innocuous identity statements,

  • but one was given a significantly higher score.

  • This is something we would never want to see in our products.

  • We not only wanted

  • to fix this immediate example, but we actually

  • wanted to understand and quantify these issues

  • to ensure that we could tackle this appropriately.

  • The first thing we looked at was this concept

  • of "representational harm," understanding

  • these counterfactual differences.

  • For a particular sentence, we would want the sentence

  • to be classified the same way regardless

  • of the identity referenced in the sentence.

  • Whether it's, "I am Muslim," "I am Jewish,"

  • or "I am Christian," you would expect

  • the score perceived by the classifier to be the same.

  • Being able to provide these scores

  • allowed us to understand how the system performed.

  • It allowed us to identify places where our model might

  • be more likely to be biased, and allowed

  • us to go in and actually understand those concerns more

  • deeply.
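
To make this kind of counterfactual check concrete, here is a minimal Python sketch that substitutes a handful of identity terms into the same template sentences and reports the spread of scores. The `score_toxicity` function is a hypothetical stand-in for whatever classifier or API is being evaluated, and the term and template lists are purely illustrative.

```python
# Minimal sketch of a counterfactual identity-term check.
# `score_toxicity` is a hypothetical stand-in for the classifier or API
# being evaluated; it should return a score in [0, 1].

IDENTITY_TERMS = ["straight", "gay", "Muslim", "Jewish", "Christian"]
TEMPLATES = ["I am {}.", "My friend is {}."]

def counterfactual_gaps(score_toxicity, templates=TEMPLATES, terms=IDENTITY_TERMS):
    """For each template, report the spread of scores across identity terms.

    A large spread suggests the model treats otherwise-identical sentences
    differently depending only on the identity referenced.
    """
    report = {}
    for template in templates:
        scores = {term: score_toxicity(template.format(term)) for term in terms}
        report[template] = {
            "scores": scores,
            "max_gap": max(scores.values()) - min(scores.values()),
        }
    return report
```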

  • But we also wanted to understand overall error rates

  • for particular groups.

  • Were there particular identities where,

  • when referenced in comments, we were

  • more likely to have errors versus others?

  • This is where the disproportionate performance

  • question comes in.

  • We wanted to develop metrics on average

  • for a particular identity term that

  • showcased, across a set of comments,

  • whether or not we were more likely to misclassify.

  • This was in both directions-- misclassifying something

  • as toxic, but also misclassifying something as not

  • toxic when it truly was a harmful statement.

  • The three metrics you see here capture different ways

  • of looking at that problem.

  • And the darker the color, the darker the purple,

  • the higher the error rate.

  • And you can see that in the first version of this model,

  • there were huge disparities between different groups.
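
As one way to picture how metrics like these can be computed, here is a rough sketch that slices false positive and false negative rates by identity term. It assumes each example carries the identity terms the comment references and a ground-truth label, and `predict_is_toxic` is a hypothetical stand-in for the model under evaluation; it is not the exact metric set shown on the slide.

```python
# Sketch of per-identity-group error rates, assuming each example is a
# (text, identity_terms, is_toxic) record where `identity_terms` lists the
# identities the comment references and `is_toxic` is the ground-truth label.

from collections import defaultdict

def per_group_error_rates(examples, predict_is_toxic):
    """Return false positive / false negative rates sliced by identity term."""
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for text, identity_terms, is_toxic in examples:
        predicted = predict_is_toxic(text)
        for term in identity_terms:
            c = counts[term]
            if is_toxic:
                c["pos"] += 1
                c["fn"] += int(not predicted)   # missed a truly toxic comment
            else:
                c["neg"] += 1
                c["fp"] += int(predicted)       # flagged an innocuous comment
    return {
        term: {
            "false_positive_rate": c["fp"] / c["neg"] if c["neg"] else None,
            "false_negative_rate": c["fn"] / c["pos"] if c["pos"] else None,
        }
        for term, c in counts.items()
    }
```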

  • So OK, we were able to measure the problem.

  • But then how do we improve it?

  • How do we make sure this doesn't happen?

  • A lot of research has been published in

  • the last few years, both internally

  • within Google as well as externally,

  • that looks at how to train and improve

  • our models in a way that still allows them to be stable,

  • to be resource-efficient, and to be accurate,

  • so that we can still deploy them in production use cases.

  • These approaches balance the simplicity of implementation

  • with the required accuracy and quality that we would want.

  • The simplest way to think about this problem

  • would be through the idea of removals or block lists,

  • taking steps to ensure that your model can't access information

  • in a way that could lead to skewed outcomes.

  • Take, for example, the sentence, "Some people are Indian."

  • We may actually want to remove that identity term altogether,

  • and replace it with a more generic tag, "identity."

  • If you do this for every single identity term,

  • your model wouldn't even have access to identity information.

  • It would simply know that the sentence referenced

  • an identity.

  • As a result, it couldn't make different decisions

  • for different identities or different user groups.
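
As an illustration of the removal idea (the term list and tag here are made up for the example, not an actual blocklist used in any product), a tiny sketch might look like this:

```python
# Sketch of the "replace identity terms with a generic tag" idea.
# The term list and tag are illustrative only.

import re

IDENTITY_TERMS = ["indian", "muslim", "jewish", "christian", "gay", "straight"]
_IDENTITY_RE = re.compile(r"\b(" + "|".join(IDENTITY_TERMS) + r")\b", re.IGNORECASE)

def mask_identity_terms(sentence, tag="IDENTITY"):
    """Replace known identity terms so the model only sees that *an* identity was mentioned."""
    return _IDENTITY_RE.sub(tag, sentence)

# mask_identity_terms("Some people are Indian.")  ->  "Some people are IDENTITY."
```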

  • This is a great way to make sure that your model is

  • agnostic of a particular definition of an individual.

  • At the same time, it can be harmful.

  • It actually might be useful in certain cases

  • to know when identity terms are used in a way that

  • is offensive or harmful.

  • If a particular term is often used

  • in a negative or derogatory context,

  • we would want to know that, so we

  • could classify that as toxic.

  • Sometimes, this context is actually really important.

  • But it's important that we capture it

  • in a nuanced and contextual way.

  • Another way to think about it is to go back

  • to that first lesson, and look back at the data.

  • We can enable our models to sample data

  • from areas in which the model seems to be underperforming.

  • We could do this both manually as well as algorithmically.

  • On the manual side, what you see on

  • the right is a quote collected through Google's Project

  • Respect effort.

  • Through Project Respect, we went globally

  • to collect more and more comments

  • of positive representations of identity.

  • This comment is from a pride parade,

  • where someone from Lithuania talks about their gay friends,

  • and how they're brilliant and amazing people.

  • Positive reflections of identity are great examples for us

  • to train our model, and to support the model in developing

  • a context and nuanced understanding of comments,

  • especially when the model is usually

  • trained from online comments that may not always

  • have the same flavor.

  • We can also enable the model to do this algorithmically

  • through active sampling.

  • The model can identify the places

  • where it has the least confidence in its decision

  • making, where it might be underperforming.

  • And it can actively go out and sample more

  • from the training dataset that represents that type of data.
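
A minimal sketch of that kind of confidence-based active sampling, assuming a scikit-learn-style classifier that exposes `predict_proba` and a hypothetical `vectorize` helper for turning text into features:

```python
# Sketch of confidence-based active sampling: pull more training examples
# from the regions where the current model is least certain.
# `model.predict_proba` follows the scikit-learn convention and `vectorize`
# is a hypothetical feature-extraction helper.

import numpy as np

def least_confident_indices(model, unlabeled_texts, vectorize, k=100):
    """Return the indices of the k examples the model is least confident about."""
    probs = model.predict_proba(vectorize(unlabeled_texts))   # shape: (n, num_classes)
    confidence = np.max(probs, axis=1)                        # confidence of predicted class
    return np.argsort(confidence)[:k]                         # lowest confidence first

# Those k examples would then be labeled, added to the training set, and the
# model retrained; repeating this loop focuses data collection on the areas
# where the model seems to be underperforming.
```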

  • We can continue to even build more and more examples

  • through synthetic examples.

  • Similar to what you saw at the beginning,

  • we can create these short sentences, like "I am,"

  • "He is," "My friends are."

  • And these sentences can continue to provide the model

  • an understanding of when identity terms are used in a natural context.

  • We can even make changes directly to our models

  • by updating the models' loss functions to minimize

  • difference in performance between different groups

  • of individuals.

  • Adversarial training and min diff loss,

  • two of the research methods in this space,

  • have actively looked at how to adjust your loss function

  • to keep the model stable and to keep it lightweight,

  • while still enforcing this kind of a penalty.
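
The sketch below shows only the shape of that idea: a penalty on the gap between the scores a model assigns to non-toxic comments from two groups, added on top of the usual classification loss. The published methods, such as MinDiff, compare full score distributions rather than just means, so treat this as an illustration of the technique's shape, not the technique itself.

```python
# Simplified sketch of adding a between-group penalty to the training loss.
# Real MinDiff-style losses compare score distributions more carefully.

import tensorflow as tf

def min_diff_penalty(scores_group_a, scores_group_b):
    """Penalize the gap between mean predicted toxicity for two groups of
    non-toxic comments (a crude stand-in for a distribution-matching loss)."""
    return tf.abs(tf.reduce_mean(scores_group_a) - tf.reduce_mean(scores_group_b))

def total_loss(y_true, y_pred, scores_group_a, scores_group_b, weight=1.5):
    # Usual classification loss plus the weighted fairness penalty.
    base = tf.keras.losses.binary_crossentropy(y_true, y_pred)
    return tf.reduce_mean(base) + weight * min_diff_penalty(scores_group_a, scores_group_b)
```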

  • What you saw earlier were the results

  • of the Toxicity V1 model.

  • And as we made changes, especially in terms

  • of creating manual synthetic examples

  • and augmenting the data performance,

  • we were able to see real improvements.

  • This is the toxicity V6 model, where

  • you can see that the colors get lighter

  • as the performance for individual identity groups

  • gets better.

  • We're really excited about the progress that we've made here.

  • But we know that there is still a long way to go.

  • The results you see here are on synthetic data, short identity

  • statements like I talked about earlier.

  • But the story of bias can become much more complex

  • when you're talking about real data, comments that

  • are actually used in the wild.

  • We're currently working on evaluating our systems

  • on real comments, building up these datasets,

  • and then trying to enhance our understanding of performance

  • and improvements in that space.

  • While we've still seen progress on real comments

  • and improvements from our changes,

  • we know that this will actually help more once we start

  • looking at these real datasets.

  • And actually, there's a Kaggle competition

  • live now if you're interested in checking this out more.

  • Overall, the biggest lesson is "Test early and test often."

  • Measuring your systems is critical to actually

  • understanding where the problems exist,

  • where our users might be facing risk,

  • or where our products aren't working the way

  • that we intend for them to be.

  • Also, bias can affect the user experience

  • and cause issues in many different forms.

  • So it's important to develop methods

  • for measuring the scale of each problem.

  • Even a single product may manifest bias

  • in different ways.

  • So we want to actually be sure to measure those metrics, also.

  • The other thing to note is it's not always

  • quantitative metrics.

  • Qualitative metrics, user research,

  • and adversarial testing, really

  • stress-testing and poking at your product

  • manually, can also be really, really valuable.

  • Lastly, it is possible to take proactive steps

  • in modeling that are aware of your production constraints.

  • These techniques have been invaluable

  • in our own internal use cases.

  • And we will continue to publish these methods for you

  • to use, as well.

  • You can actually go to mlfairness.com to learn more.

  • I also want to talk about design.

  • And this is our third lesson for today.

  • Because context is really important.

  • The way that our users interact with our results is different.

  • And our design decisions around the results have consequences.

  • Because the experience that a user actually has with

  • a product extends beyond the performance of the model.

  • It relates to how users are actually

  • engaging with the results.

  • What are they seeing?

  • What kind of information are they being given?

  • What kind of information do they have that maybe the model

  • may not have?

  • Let's look at an example.

  • Here you see an example from the Google Translate product.

  • And what you see here is a translation

  • from Turkish to English.

  • Turkish is a gender-neutral language,

  • which means that in Turkish, nouns aren't gendered.

  • And "he," "she," or "it" are all referenced through the pronoun,

  • "O."

  • I actually misspoke.

  • I believe not all nouns are gendered, but some may be.

  • Thus, while the sentences in Turkish, in this case,

  • don't actually specify gender, our product

  • translates it to common stereotypes.

  • "She is a nurse," while "He is a doctor."

  • So why does that happen?

  • Well, Google Translate learns from hundreds of millions

  • of already translated examples from the web.

  • And it therefore also learns the historical and social trends

  • that have come with these hundreds of millions

  • of examples, the historical trends of how

  • we've thought of occupations in society thus far.

  • So it skews masculine for doctor,

  • whereas it skews feminine for nurse.

  • As we started to look into this problem,

  • we went back to those first two lessons.

  • OK, how can we make the training data more diverse?

  • How can we make it more representative

  • of the full gender diversity?

  • Also, how could we better train a model?

  • How could we improve and measure the space,

  • and then make modeling changes?

  • Both of these questions are important.

  • But what we started to realize is how important

  • context was in this situation.

  • Take, for example, the sentence, "Casey is my friend."

  • Let's say we want to translate to Spanish, in which case

  • friend could be "amigo," the masculine version, or "amiga,"

  • the feminine version.

  • Well, how do we know if Casey is a male, a female, or a gender

  • non-binary friend?

  • We don't have that context.

  • Even a perfectly precise model trained

  • on diverse data that represents all kinds of professions

  • would not have that context.

  • And so we realized that even if we do

  • make our understandings of terms more neutral,

  • and even if we were to build up model precision,

  • we would actually want to give this choice to the user, who

  • actually understands what they were

  • trying to achieve with the sentence in the translation.

  • What we did is choose to provide that to our users

  • in the form of options and selections.

  • We translate "friend" both to "amigo"

  • and to "amiga," so that the user can

  • make a choice that is informed based on the context

  • that they have.

  • Currently, this solution is only available for a few languages.

  • And it's also only available for single terms like "friend."

  • But we're actively working on trying to expand it

  • to more languages, and also trying

  • to be inclusive of larger sentences and longer contexts,

  • so we can actually tackle the example you saw earlier.

  • We're excited about this line of thinking, though,

  • because it enables us to think about fairness beyond simply

  • the data and the model, but actually as

  • a holistic experience that a user engages with every day,

  • and trying to make sure that we actually

  • build those communication lines between the product and the end

  • consumer.

  • The biggest lesson we learned here is that context is key.

  • Think about the ways that your user

  • will be interacting with your product and the information

  • that they may have that the model doesn't have,

  • or the information that the model might have

  • that the user doesn't have.

  • How do you enable the users to communicate effectively

  • with your product, but also get back the right transparency

  • from it?

  • Sometimes, this is about providing user options,

  • like you saw with Translate.

  • Sometimes, it's also just about providing more context

  • about the model's decisions, and being a little bit more

  • explainable and interpretable.

  • The other piece that's important is making sure

  • that you get feedback from diverse users.

  • In this case, this was users who spoke different languages,

  • and who had different definitions of identity.

  • But it's also important to make sure,

  • as you're trying to get feedback from users,

  • that you think about the different ways

  • in which these users provide you feedback.

  • Not every user is equally likely to be accepting

  • of the same feedback mechanism, or equally

  • likely to proactively give you feedback in, say, a feedback

  • form on your product.

  • So it's important to actually make sure

  • that whether that be through user research,

  • or through dog fooding, or through different feedback

  • mechanisms in your product, that you identify

  • different ways to access different communities who

  • might be more or less likely to provide that information.

  • Lastly, identify ways to enable multiple experiences

  • in your product.

  • Identify the places where there could be more than one

  • correct answer, for example.

  • And find ways to enable users to have that different experience.

  • Representing human culture and all of its differences

  • requires more than a theoretical and technical toolkit.

  • It requires a much more rich and context-dependent experience.

  • And that is really, at the end of the day, what

  • we want to provide our users.

  • We hope that those lessons were helpful.

  • They've been lessons that we've been really, really grateful

  • to learn, and that we've started to execute in our own products.

  • But what's next?

  • We're starting to put these lessons into practice.

  • And while we know that product development in ML fairness

  • is a context-dependent experience,

  • we do want to start building some of the fundamentals

  • in terms of tools, resources, and best practices.

  • Because we know how important it is to at least start

  • with those metrics, start with the ability

  • to collect diverse data, start with consistent communication.

  • One of the first things we're thinking about

  • is transparency frameworks.

  • We want to create and leverage frameworks that drive

  • consistent communication-- both within Google,

  • but also with the industry at large--

  • about fairness and other risks that

  • might exist with data collection and modeling.

  • We also want to build tools and techniques,

  • develop and socialize tools that enable evaluating and improving

  • fairness concerns.

  • Let's talk about transparency first.

  • Today, we're committing to a framework for transparency

  • that ensures that we think about, measure, and communicate

  • about our models and data in a way that is consistent.

  • This is not about achieving perfection in our data

  • or models, although of course we hope to get there.

  • It's about the context under which something

  • is supposed to be used.

  • What are its intended use cases?

  • What is it not intended for?

  • And how does it perform across various users?

  • We released our first Data Card last October

  • as part of the Open Images Extended Dataset

  • that you heard Jackie talk about earlier.

  • This Data Card allows us to answer questions like,

  • what are the intended use cases of this dataset?

  • What is the nature of the content?

  • What data was excluded, if any?

  • Who collected the data?

  • It also allows us to go into some

  • of the fairness considerations.

  • Who labeled the data, and what information did they have?

  • How was the data sourced?

  • And what is the distribution of it?

  • For Open Images Extended, for example,

  • while you can see that the geographic distribution is

  • extremely diverse, 80% of the data comes from India.

  • This is an important finding for anyone

  • who wants to use this dataset, both for training

  • or for testing purposes.

  • It might inform how you interpret your results.

  • It also might inform whether or not

  • you choose to augment your dataset with something else,

  • for example.

  • This kind of transparency allows for open communication

  • about what the actual use cases of this dataset should be,

  • and where it may have flaws.

  • We want to take this a step further with Model Cards.

  • Here you see an example screenshot

  • for the Jigsaw Perspective Toxicity API

  • that we talked about earlier.

  • With Model Cards, we want to be able to give you

  • an overview of what the model is about,

  • what metrics we use to think about it,

  • how it was architected, how it was trained, how it was tested,

  • what we think it should be used for,

  • and where we believe that it has limitations.

  • We hope that the Model Card framework

  • will work across models, so not just for something

  • like toxicity, but also for a face detection model,

  • or for any other use case that we can think of.

  • In each case, the framework should be consistent.

  • We can look at metrics.

  • We can look at use cases.

  • We can look at the training and test data.

  • And we can look at the limitations.

  • Each Model Card will also have the quantitative metrics

  • that tell you how it performs.

  • Here, for example, you can see an example set

  • of metrics sliced by age.

  • You can see the performance on all ages,

  • on the child age bucket, on the adult age bucket,

  • and on the senior age bucket.
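
Purely as an illustration (the field names and slices here are hypothetical placeholders, not the released Model Card schema), the kind of information a Model Card captures could be laid out as structured data along these lines:

```python
# Illustrative sketch of the kind of structured information a Model Card captures.
# Field names and values are hypothetical placeholders, not the released schema.

model_card = {
    "model": "face_detection_example",
    "intended_uses": ["Detect faces in consumer photos"],
    "out_of_scope_uses": ["Identifying individuals"],
    "training_data": "Description of sources, collection, and known skews",
    "limitations": ["Performance may degrade in low-light images"],
    "metrics_by_slice": {
        "age:all":    {"precision": None, "recall": None},
        "age:child":  {"precision": None, "recall": None},
        "age:adult":  {"precision": None, "recall": None},
        "age:senior": {"precision": None, "recall": None},
    },
}
```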

  • So how do you create those metrics?

  • How do you compute them?

  • Well, we also want to be able to provide you the tools

  • to do this analysis, to be able to create your own model cards,

  • and also to be able to improve your models over time.

  • The first piece of the set of tools and resources

  • is open datasets.

  • The Open Images Extended Dataset is one of many datasets

  • that we have and hope to continue to open

  • source in the coming years.

  • In this example, the Open Images Extended Dataset

  • collects data from crowdsourced users

  • who are taking images of objects in their own regions

  • of the world.

  • You can see, for example, how a hospital or food might

  • look different in different places,

  • and how important it is for us to have that data.

  • With the live Kaggle competition,

  • we also have open sourced a dataset

  • related to the Perspective Toxicity API.

  • I mentioned earlier how important

  • it is for us to look at real comments and real data.

  • So here, the Jigsaw team has open

  • sourced a dataset of real comments from around the web.

  • Each of these comments is annotated with the identity

  • that the comment references, with whether or not

  • the comment is toxic, and with other factors

  • about the comment.

  • We hope that datasets like these continue

  • to be able to advance the conversation, the evaluation,

  • and the improvements of fairness.

  • Once you have a dataset, the question becomes,

  • how do you take that step further?

  • How do you evaluate the model?

  • One thing you can do today is deep-dive with the What-If

  • tool.

  • The What-If tool is available as a TensorBoard plugin, as well

  • as in a Jupyter notebook.

  • You can deep-dive into specific examples,

  • and see how changing features actually affects your outcome.

  • You can understand different fairness definitions,

  • and how modifying the threshold of your model

  • might actually change the goals that you're achieving.

  • Here's a screenshot of the What-If tool.

  • What you see here on the right is a whole bunch of data points

  • that are classified by your model.

  • Data points of a similar color have

  • been given a similar score.

  • You can select a particular data point,

  • and then with the features on the right,

  • you can actually modify the feature value

  • to see how changing the input would potentially

  • change the output.

  • For example, if I changed the age defined in this example,

  • does it actually change my classification?

  • If it does, that might tell me something

  • about how age is influencing my model,

  • and where potentially, there may be biases,

  • or where I need to deep-dive a little bit more.
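
If you want to try this in a notebook, a rough sketch of wiring up the What-If Tool might look like the following. Here `examples` is assumed to be a list of `tf.train.Example` protos and `predict_fn` is a stand-in for your model's prediction function; the exact setup can vary by version and environment.

```python
# Rough sketch of loading the What-If Tool in a Jupyter notebook.
# `examples` is assumed to be a list of tf.train.Example protos;
# `predict_fn` is a stand-in that maps a list of examples to a list of scores.

from witwidget.notebook.visualization import WitConfigBuilder, WitWidget

def show_what_if(examples, predict_fn, height=800):
    config = (WitConfigBuilder(examples)
              .set_custom_predict_fn(predict_fn))
    return WitWidget(config, height=height)
```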

  • We also hope to take this a step further

  • with Fairness Indicators, which will

  • be launched later this year.

  • Fairness Indicators will be a tool

  • that is built on top of TensorFlow Model Analysis,

  • and as a result, can work end to end with the TFX pipeline.

  • TFX stands for TensorFlow Extended.

  • And it's a platform that allows you to train, evaluate,

  • and serve your models, all in one go.

  • And so we're hoping to build fairness into this workflow

  • and into these processes.

  • But Fairness Indicators will also work alone.

  • It'll work as an independent tool

  • that can be used with any production pipeline.

  • We hope that with Fairness Indicators,

  • you'll be able to actually look at data on a large scale,

  • and see actually how your model performs.

  • You can compute fairness metrics for any individual group,

  • and visualize these comparisons to a baseline slice.

  • Here, for example, you can see the baseline slice

  • as the overall average metric in blue,

  • and then you can actually compare

  • how individual groups or individual slices

  • compare to that baseline.

  • For example, some may have a higher false negative rate

  • than average, while others may have a lower.
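
To make that comparison concrete, here is a small sketch, independent of the Fairness Indicators tooling itself, that computes a per-slice false negative rate and its difference from the overall baseline. It assumes a pandas DataFrame with hypothetical boolean `label` and `pred` columns and a `group` column used for slicing.

```python
# Sketch of the "compare each slice against a baseline" view, assuming a pandas
# DataFrame with boolean columns `label` (ground truth) and `pred` (model decision)
# plus a `group` column used for slicing.

import pandas as pd

def false_negative_rate(df):
    positives = df[df["label"]]                         # truly positive examples
    return float((~positives["pred"]).mean()) if len(positives) else float("nan")

def slice_vs_baseline(df):
    baseline = false_negative_rate(df)                  # overall average: the baseline slice
    rows = []
    for group, slice_df in df.groupby("group"):
        fnr = false_negative_rate(slice_df)
        rows.append({"group": group, "fnr": fnr, "diff_from_baseline": fnr - baseline})
    return pd.DataFrame(rows).sort_values("diff_from_baseline", ascending=False)
```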

  • We'll provide feedback about these main metrics

  • that we believe have been useful for various fairness use cases.

  • You can then use Fairness Indicators also

  • to evaluate at multiple thresholds

  • to understand how performance changes,

  • and how maybe changes to your model

  • could actually lead to different outcomes for different users.

  • If you find a slice that doesn't seem

  • to be performing as well as you expect it to,

  • you can actually take that slice further

  • by deep-diving immediately with the What-If tool.

  • We will also be providing confidence intervals,

  • so that you can understand where the differences that you're

  • seeing are significant, and where we may actually

  • need more data to better understand the problem.

  • With Fairness Indicators, we'll also

  • be launching case studies for how

  • we've leveraged these metrics and improvements

  • in the past internally in our own products.

  • We hope that this will help provide context

  • about where we found certain metrics useful, what

  • kinds of insights they've provided us,

  • and where we found that certain metrics actually haven't really

  • served the full purpose.

  • We'll also provide benchmark datasets

  • that can be immediately used for vision and text use cases.

  • We hope that Fairness Indicators will simply

  • be a start to being able to ask questions of our models,

  • understand fairness concerns, and then eventually, over time,

  • improve them.

  • Our commitment to you is that we continue

  • to measure, improve, and share our learnings

  • related to fairness.

  • It is important not only that we make our own products

  • work for all users, but that we continue

  • to share these best practices and learnings so that we,

  • as an industry, can continue to develop fairer products--

  • products that work equitably for everybody.

  • One thing I do want to underscore

  • is that we do know that in order to create

  • diverse products, products that work for diverse users,

  • it is also important to have diverse voices in the room.

  • This not only means making sure that we

  • have diverse voices internally working on our products,

  • but also means that we include you

  • as the community in this process.

  • We want your feedback on our products,

  • but we also want to learn from you

  • about how you're tackling fairness and inclusion

  • in your own work, what lessons you're learning,

  • what resources you're finding useful.

  • And we want to work with you to continue and build and develop

  • this resource toolkit, so that we can continue,

  • as an industry, to build products that

  • are inclusive for everyone.

  • Thank you.

  • [MUSIC PLAYING]
