[MUSIC PLAYING]
JACQUELINE PAN: Hi, everyone.
I'm Jackie, and I'm the Lead Program Manager on ML Fairness
here at Google.
So what is ML fairness?
As some of you may know, Google's mission
is to organize the world's information
and make it universally accessible and useful.
Every one of our users gives us their trust.
And it's our responsibility to do right by them.
And as the impact and reach of AI
has grown across societies and sectors,
it's critical to ethically design and deploy
these systems in a fair and inclusive way.
Addressing fairness in AI is an active area
of research at Google, from fostering
a diverse and inclusive workforce that
embodies critical and diverse knowledge to training models
to remove or correct problematic biases.
There is no standard definition of fairness,
whether decisions are made by humans or by machines.
Far from a solved problem, fairness in AI
presents both an opportunity and a challenge.
Last summer, Google outlined principles
to guide the responsible development and use of AI.
One of them directly speaks to ML fairness
and making sure that our technologies don't create
or reinforce unfair bias.
The principles further state that we
seek to avoid unjust impacts on people related
to sensitive characteristics such as race, ethnicity,
gender, nationality, income, sexual orientation, ability,
and political or religious belief.
Now let's take a look at how unfair bias might be created
or reinforced.
An important step on that path is
acknowledging that humans are at the center of technology
design, in addition to being impacted by it.
And humans have not always made product design decisions
that are in line with the needs of everyone.
For example, because female body-type crash test dummies
weren't required until 2011, female drivers
were more likely than male drivers
to be severely injured in an accident.
Band-Aids have long been manufactured in a single
color--
a soft pink.
In this tweet, you see the personal experience
of an individual using a Band-Aid that matches his skin
tone for the first time.
A product that's designed and intended for widespread use
shouldn't fail for an individual because of something
that they can't change about themselves.
Products and technology should just work for everyone.
These choices may not have been deliberate,
but they still reinforce the importance
of being thoughtful about technology
design and the impact it may have on humans.
Why does Google care about these problems?
Well, our users are diverse, and it's important
that we provide an experience that works equally
well across all of our users.
The good news is that humans, you, have the power
to approach these problems differently,
and to create technology that is fair and more
inclusive for more people.
I'll give you a sense of what that means.
Take a look at these images.
You'll notice that the label "wedding" was applied
to the images on the left, but not to the image
on the right.
The labels in these photos demonstrate
how one open source image classifier trained on the Open
Images Dataset does not properly recognize wedding traditions
from different parts of the world.
Open datasets, like Open Images, are a necessary and critical
part of developing useful ML models,
but some open source datasets have
been found to be geographically skewed based on how
and where they were collected.
To bring greater geographic diversity
to Open Images, last year, we enabled the global community
of Crowdsource app users to photograph the world
around them and make their photos available to researchers
and developers as a part of the Open Images Extended Dataset.
We know that this is just an early step on a long journey.
And to build inclusive ML products,
training data must represent global diversity
along several dimensions.
These are complex sociotechnical challenges,
and they need to be interrogated from many different angles.
It's about problem formation and how
you think about these systems with human impact in mind.
Let's talk a little bit more about these challenges
and where they can manifest in an ML pipeline.
Unfairness can enter the system at any point in the ML
pipeline, from data collection and handling to model training
to end use.
Rarely can you identify a single cause of or a single solution
to these problems.
Far more often, various causes interact in ML systems
to produce problematic outcomes.
And a range of solutions is needed.
We try to disentangle these interactions
to identify root causes and to find ways forward.
This approach spans more than just one team or discipline.
ML fairness is an initiative to help address these challenges.
And it takes a lot of different individuals
with different backgrounds to do this.
We need to ask ourselves questions like, how do people
feel about fairness when they're interacting with an ML system?
How can you make systems more transparent to users?
And what's the societal impact of an ML system?
Bias problems run deep, and they don't always
manifest in the same way.
As a result, we've had to learn different techniques
of addressing these challenges.
Now we'll walk through some of the lessons that Google
has learned in evaluating and improving our products,
as well as tools and techniques that we're
developing in this space.
Here to tell you more about this is Tulsee.
TULSEE DOSHI: Awesome.
Thanks, Jackie.
Hi, everyone.
My name is Tulsee, and I lead product for the ML Fairness
effort here at Google.
Today, I'll talk about three different angles from which we've
thought about and acted on fairness
concerns in our products, and the lessons
that we've learned from that.
We'll also walk through our next steps, tools, and techniques
that we're developing.
Of course, we know that the lessons
we're going to talk about today are only
some of the many ways of tackling the problem.
In fact, as you heard in the keynote on Tuesday,
we're continuing to develop new methods,
such as [INAUDIBLE], to understand our models
and to improve them.
And we hope to keep learning with you.
So with that, let's start with data.
As Jackie mentioned, datasets are a key part
of the ML development process.
Data trains a model and informs what
a model learns from and sees.
Data is also a critical part of evaluating the model.
The datasets we choose to evaluate on indicate what we
know about how the model performs,
and when it performs well or doesn't.
So let's start with an example.
What you see on the screen here is a screenshot
from a game called Quick Draw that
was developed through the Google AI Experiments program.
In this game, people drew images of different objects
around the world, like shoes or trees or cars.
And we use those images to train an image classification model.
This model could then play a game
with the users, where a user would draw an image
and the model would guess what that image was of.
Here you see a whole bunch of drawings of shoes.
And actually, we were really excited,
because what better way to get diverse input
from a whole bunch of users than to launch something globally
where a whole bunch of users across the world
could draw images for what they perceived an object
to look like?
But what we found as this model started to collect data
was that most of the images that users
drew of shoes looked like that shoe in the top right,
the blue shoe.
So over time, as the model saw more and more examples,
it started to learn that a shoe looked a certain way
like that top right shoe, and wasn't
able to recognize the shoe in the bottom right,
the orange shoe.
Even though we were able to get data
from a diverse set of users, the shoes
that the users chose to draw or the users
who actually engaged with the product at all
were skewed, and led to skewed training data
in what we actually received.
This is a social issue first, which
is then exacerbated by our technical implementation.
Because when we're making classification decisions that
divide up the world into parts, even if those parts are
what is a shoe and what isn't a shoe,
we're making fundamental judgment calls
about what deserves to be in one part
or what deserves to be in the other.
It's easier to deal with when we're talking about shoes,
but it's harder to talk about when we're
classifying images of people.
An example of this is the Google Clips camera.
This camera was designed to recognize memorable moments
in real-time streaming video.
The idea is that it automatically
captures memorable motion photos of friends, of family, or even
of pets.
And we designed the Google Clips camera
to have equitable outcomes for all users.
It, like all of our camera products,
should work for all families, no matter who or where they are.
It should work for people of all skin tones,
all age ranges, and in all poses,
and in all lighting conditions.
As we started to build this system,
we realized that if we only created training data that
represented certain types of families,
the model would also only recognize
certain types of families.
So we had to do a lot of work to increase our training data's
coverage and to make sure that it would recognize everyone.
We went global to collect these datasets, collecting datasets
of different types of families, in different environments,
and in different lighting conditions.
And in doing so, we were able to make sure
that not only could we train a model that
had diverse outcomes, but that we could also
evaluate it across a whole bunch
of different variables like lighting or space.
This is something that we're continuing to do,
continuing to create automatic fairness tests for our systems
so that we can see how they change over time
and to continue to ensure that they are inclusive of everyone.
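The talk doesn't show what these automatic fairness tests look like in code. As a rough illustration only, here is a minimal sketch of one possible check, assuming a hypothetical `predict` callable and evaluation examples tagged with a `group` attribute; neither is Google's actual test harness.

```python
# Hypothetical automated fairness check: flag the model if recall for any
# group (e.g., a lighting condition or age bucket) falls too far below
# the overall recall on a labeled evaluation set.
from collections import defaultdict

def recall(examples, predict):
    """Fraction of positive examples the model correctly detects."""
    positives = [ex for ex in examples if ex["label"] == 1]
    if not positives:
        return None
    hits = sum(1 for ex in positives if predict(ex["features"]) == 1)
    return hits / len(positives)

def fairness_test(examples, predict, max_gap=0.05):
    """Return (passed, report) comparing per-group recall to overall recall."""
    overall = recall(examples, predict)
    if overall is None:
        return True, {}
    by_group = defaultdict(list)
    for ex in examples:
        by_group[ex["group"]].append(ex)
    report, passed = {}, True
    for group, subset in by_group.items():
        r = recall(subset, predict)
        report[group] = r
        if r is not None and overall - r > max_gap:
            passed = False
    return passed, {"overall": overall, "per_group": report}
```

A check like this can run every time the model is retrained, so a regression for a particular group surfaces before launch rather than after.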
The biggest lesson we've learned in this process
is how important it is to build training and evaluation
datasets that represent all the nuances of our target
population.
This both means making sure that the data that we collect
is diverse and representative, but also
that the different contexts of the way
that the users are providing us this data
is taken into account.
Even if you have a diverse set of users,
that doesn't mean that the images of shoes you get
will be diverse.
And so thinking about those nuances and the trade-offs that
might occur when you're collecting your data
is super important.
Additionally, it's also important to reflect
on who that target population might leave out.
Who might not actually have access to this product?
Where are the blind spots in who we're reaching?
And lastly, how will the data that you're collecting
grow and change over time?
As our users use our products, they very
rarely use them in exactly the way we anticipated them to.
And so what happens is the way that we collect data,
or the data that we even need to be collecting,
changes over time.
And it's important that our collection methods
and our maintenance methods are equally
diverse as that initial process.
But even if you have a perfectly balanced, wonderful training
dataset, that doesn't necessarily
imply that the output of your model will be perfectly fair.
Also, it can be hard to collect completely diverse datasets
at the start of a process.
And you don't always know what it is that you're
missing from the beginning.
Where are your blind spots in what you're trying to do?
Because of that, it's always important to test,
to test and measure these issues at scale for individual groups,
so that we can actually identify where our model may not
be performing as well, and where we
might want to think about more principled improvements.
The benefit of measurement is also
that you can start tracking these changes over time.
You can understand how the model works.
Similar to the way that you would always
want to have metrics for your model as a whole,
it's important to think about how you slice those metrics,
and how you can provide yourself a holistic understanding of how
this model or system works for everybody.
What's interesting is that different fairness concerns
may require different metrics, even within the same product
experience.
A disproportionate performance problem
is when, for example, a model works well for one group,
but may not work as well for another.
For example, you could have a model
that doesn't recognize some subset of users or errors
more for that subset of users.
In contrast, a representational harm problem
is when a model showcases an offensive stereotype
or harmful association.
Maybe this doesn't necessarily happen at scale.
But even a single instance can be hurtful and harmful
to a set of users.
And this requires a different way
of stress-testing the system.
Here's an example where both of those metrics may apply.
The screenshot you see is from our Jigsaw Perspective API.
This API is designed to detect hate and harassment
in the context of online conversations.
The idea is, given a particular sentence,
we can classify whether or not that sentence is
perceived likely to be toxic.
We have this API externally.
So our users can actually write sentences and give us feedback.
And what we found was one of our users
articulated a particular example that you see here.
The sentence, "I am straight," is given a score of 0.04,
and is classified as "unlikely to be perceived as toxic."
Whereas the sentence, "I am gay,"
was given a score of 0.86, and was
classified as "likely to be perceived as toxic."
Both of these are innocuous identity statements,
but one was given a significantly higher score.
This is something we would never want to see in our products.
We not only wanted to fix this immediate example,
but we also wanted to understand and quantify these issues
to ensure that we could tackle them appropriately.
The first thing we looked at was this concept
of "representational harm," understanding
these counterfactual differences.
For a particular sentence, we would want the sentence
to be classified the same way regardless
of the identity referenced in the sentence.
Whether it's, "I am Muslim," "I am Jewish,"
or "I am Christian," you would expect
the score perceived by the classifier to be the same.
Being able to provide these scores
allowed us to understand how the system performed.
It allowed us to identify places where our model might
be more likely to be biased, and allowed
us to go in and actually understand those concerns more
deeply.
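The exact evaluation procedure isn't shown in the talk; the sketch below illustrates the counterfactual idea, assuming a hypothetical `score_toxicity` callable and a small, illustrative list of identity terms.

```python
# Hypothetical counterfactual check: the same template sentence should get
# (roughly) the same toxicity score regardless of the identity term used.
IDENTITY_TERMS = ["straight", "gay", "Muslim", "Jewish", "Christian"]

def counterfactual_gaps(template, score_toxicity, terms=IDENTITY_TERMS):
    """Score identity-swapped variants of `template` and report the spread.

    `template` contains one `{identity}` placeholder, e.g. "I am {identity}.".
    `score_toxicity` is any callable mapping a sentence to a score in [0, 1].
    """
    scores = {term: score_toxicity(template.format(identity=term))
              for term in terms}
    spread = max(scores.values()) - min(scores.values())
    return scores, spread

# Example usage with a stand-in scorer; a large spread flags a possible
# representational-harm problem worth investigating further:
# scores, spread = counterfactual_gaps("I am {identity}.", my_model.score)
```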
But we also wanted to understand overall error rates
for particular groups.
Were there particular identities where,
when referenced in comments, we were
more likely to have errors versus others?
This is where the disproportionate performance
question comes in.
We wanted to develop metrics, averaged over a set of comments
for a particular identity term, that showcased
whether or not we were more likely to misclassify
comments referencing that identity.
This was in both directions-- misclassifying something
as toxic, but also misclassifying something as not
toxic when it truly was a harmful statement.
The three metrics you see here capture different ways
of looking at that problem.
And the darker the color, the darker the purple,
the higher the error rate was.
And you can see that in the first version of this model,
there were huge disparities between different groups.
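The three metrics on the slide aren't named in the talk. As one illustration of measuring disproportionate performance, the sketch below computes per-identity false positive and false negative rates for a thresholded toxicity score; the dictionary keys and the `score` callable are assumptions, not the actual evaluation code.

```python
from collections import defaultdict

def per_identity_error_rates(comments, score, threshold=0.5):
    """Per-identity FPR/FNR for a toxicity classifier.

    `comments` is an iterable of dicts with keys "text", "label" (1 = toxic),
    and "identities" (identity terms referenced in the comment).
    `score` maps text to a toxicity score in [0, 1].
    """
    counts = defaultdict(lambda: {"fp": 0, "tn": 0, "fn": 0, "tp": 0})
    for c in comments:
        pred = 1 if score(c["text"]) >= threshold else 0
        for identity in c["identities"]:
            b = counts[identity]
            if c["label"] == 0 and pred == 1:
                b["fp"] += 1          # non-toxic comment flagged as toxic
            elif c["label"] == 0:
                b["tn"] += 1
            elif pred == 0:
                b["fn"] += 1          # toxic comment missed
            else:
                b["tp"] += 1
    rates = {}
    for identity, b in counts.items():
        fpr = b["fp"] / (b["fp"] + b["tn"]) if (b["fp"] + b["tn"]) else None
        fnr = b["fn"] / (b["fn"] + b["tp"]) if (b["fn"] + b["tp"]) else None
        rates[identity] = {"fpr": fpr, "fnr": fnr}
    return rates
```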
So OK, we were able to measure the problem.
But then how do we improve it?
How do we make sure this doesn't happen?
A lot of research has been published in
the last few years, both internally
within Google as well as externally,
that looks at how to train and improve
our models in a way that still allows them to be stable,
to be resource-efficient, and to be accurate,
so that we can still deploy them in production use cases.
These approaches balance the simplicity of implementation
with the required accuracy and quality that we would want.
The simplest way to think about this problem
would be through the idea of removals or block lists,
taking steps to ensure that your model can't access information
in a way that could lead to skewed outcomes.
Take, for example, the sentence, "Some people are Indian."
We may actually want to remove that identity term altogether,
and replace it with a more generic tag, "identity."
If you do this for every single identity term,
your model wouldn't even have access to identity information.
It would simply know that the sentence referenced
an identity.
As a result, it couldn't make different decisions
for different identities or different user groups.
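A minimal sketch of that masking idea, assuming a short, illustrative block list; a production system would need careful tokenization and a much larger, maintained term list.

```python
import re

# Illustrative block list of identity terms; a real list would be far longer.
IDENTITY_TERMS = ["indian", "muslim", "jewish", "christian", "gay", "straight"]

_PATTERN = re.compile(r"\b(" + "|".join(IDENTITY_TERMS) + r")\b", re.IGNORECASE)

def mask_identity_terms(sentence, tag="IDENTITY"):
    """Replace identity terms with a generic tag.

    "Some people are Indian."  ->  "Some people are IDENTITY."
    The model then only sees that *an* identity was referenced, not which one,
    at the cost of losing context that can matter for detecting slurs.
    """
    return _PATTERN.sub(tag, sentence)
```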
This is a great way to make sure that your model is
agnostic of a particular definition of an individual.
At the same time, it can be harmful.
It actually might be useful in certain cases
to know when identity terms are used in a way that
is offensive or harmful.
If a particular term is often used
in a negative or derogatory context,
we would want to know that, so we
could classify that as toxic.
Sometimes, this context is actually really important.
But it's important that we capture it
in a nuanced and contextual way.
Another way to think about it is to go back
to that first lesson, and look back at the data.
We can enable our models to sample data
from areas in which the model seems to be underperforming.
We could do this both manually as well as algorithmically.
On the manual side, what you see on
the right is a quote collected through Google's Project
Respect effort.
Through Project Respect, we went globally
to collect more and more comments
of positive representations of identity.
This comment is from a pride parade,
where someone from Lithuania talks about their gay friends,
and how they're brilliant and amazing people.
Positive reflections of identity are great examples for us
to train our model, and to support the model in developing
a context and nuanced understanding of comments,
especially when the model is usually
trained from online comments that may not always
have the same flavor.
We can also enable the model to do this algorithmically
through active sampling.
The model can identify the places
where it has the least confidence in its decision
making, where it might be underperforming.
And it can actively go out and sample more
from the training dataset that represents that type of data.
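The sampling criterion isn't specified in the talk; one simple version treats scores near 0.5 as low confidence. A sketch, with a hypothetical `score` callable:

```python
def select_uncertain_examples(pool, score, k=1000):
    """Pick the k unlabeled examples the model is least confident about.

    `pool` is a list of texts; `score` maps text to a probability in [0, 1],
    and scores near 0.5 are treated as low confidence. The selected examples
    are candidates for labeling and for adding to the training set.
    """
    ranked = sorted(pool, key=lambda text: abs(score(text) - 0.5))
    return ranked[:k]
```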
We can continue to even build more and more examples
through synthetic examples.
Similar to what you saw at the beginning,
we can create these short sentences, like "I am,"
"He is," "My friends are."
And these sentences can continue to provide the model
understandings of when identity can be used in natural context.
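A sketch of those templated synthetic examples; the template and term lists are illustrative, and the label marks them as non-toxic training examples.

```python
# Short, clearly non-toxic identity statements that teach the model that
# identity terms also appear in neutral contexts.
TEMPLATES = ["I am {}.", "He is {}.", "She is {}.", "My friends are {}."]
IDENTITY_TERMS = ["straight", "gay", "Muslim", "Jewish", "Christian"]

def synthetic_examples(templates=TEMPLATES, terms=IDENTITY_TERMS, label=0):
    """Yield (sentence, label) pairs; label 0 marks them as non-toxic."""
    for template in templates:
        for term in terms:
            yield template.format(term), label
```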
We can even make changes directly to our models
by updating the models' loss functions to minimize
difference in performance between different groups
of individuals.
Adversarial training and MinDiff loss,
two of the research methods in this space,
have actively looked at how to adjust your loss function
to keep the model stable and lightweight,
while still enforcing this kind of penalty.
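The published MinDiff method uses a kernel-based penalty and is more involved than this; the sketch below only illustrates the general shape of the idea in TensorFlow, penalizing the gap in mean predicted score between non-toxic examples that reference a sensitive group and non-toxic examples that don't.

```python
import tensorflow as tf

def min_diff_style_loss(y_true, y_pred, in_group_mask, weight=1.5):
    """Simplified sketch of a MinDiff-style penalty (not the published implementation).

    y_true:        (batch,) float 0/1 toxicity labels
    y_pred:        (batch,) predicted toxicity probabilities
    in_group_mask: (batch,) 1.0 if the example references the sensitive group
    """
    # Standard classification loss.
    base = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, y_pred))

    # Compare mean scores on non-toxic examples, in-group vs. out-of-group.
    non_toxic = tf.cast(tf.equal(y_true, 0.0), tf.float32)
    in_group = non_toxic * in_group_mask
    out_group = non_toxic * (1.0 - in_group_mask)

    eps = 1e-6
    mean_in = tf.reduce_sum(y_pred * in_group) / (tf.reduce_sum(in_group) + eps)
    mean_out = tf.reduce_sum(y_pred * out_group) / (tf.reduce_sum(out_group) + eps)
    penalty = tf.abs(mean_in - mean_out)

    return base + weight * penalty
```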
What you saw earlier were the results
of the Toxicity V1 model.
And as we made changes, especially in terms
of creating manual synthetic examples
and augmenting the data,
we were able to see real improvements.
This is the Toxicity V6 model, where
you can see that the colors get lighter
as the performance for individual identity groups
gets better.
We're really excited about the progress that we've made here.
But we know that there is still a long way to go.
The results you see here are on synthetic data, short identity
statements like I talked about earlier.
But the story of bias can become much more complex
when you're talking about real data, comments that
are actually used in the wild.
We're currently working on evaluating our systems
on real comments, building up these datasets,
and then trying to enhance our understanding of performance
and improvements in that space.
While we've already seen progress on real comments
and improvements from our changes,
we know that we'll be able to help even more once we start
evaluating directly on these real datasets.
And actually, there's a Kaggle competition
live now if you're interested in checking this out more.
Overall, the biggest lesson is "Test early and test often."
Measuring your systems is critical to actually
understanding where the problems exist,
where our users might be facing risk,
or where our products aren't working the way
that we intend them to.
Also, bias can affect the user experience
and cause issues in many different forms.
So it's important to develop methods
for measuring the scale of each problem.
Even a single product may manifest bias
in different ways.
So we want to be sure to measure each of those ways, too.
The other thing to note is that it's not always
about quantitative metrics.
Qualitative signals, user research,
and adversarial testing, where you really
stress-test and poke at your product manually,
can also be really, really valuable.
Lastly, it is possible to take proactive steps
in modeling that are aware of your production constraints.
These techniques have been invaluable
in our own internal use cases.
And we will continue to publish these methods for you
to use, as well.
You can actually go to mlfairness.com to learn more.
I also want to talk about design.
And this is our third lesson for today.
Because context is really important.
The way that our users interact with our results varies.
And our design decisions around those results have consequences.
Because the experience that a user actually has with
a product extends beyond the performance of the model.
It relates to how users are actually
engaging with the results.
What are they seeing?
What kind of information are they being given?
What kind of information do they have that maybe the model
may not have?
Let's look at an example.
Here you see an example from the Google Translate product.
And what you see here is a translation
from Turkish to English.
Turkish is a gender-neutral language,
which means that in Turkish, nouns aren't gendered.
And "he," "she," or "it" are all referenced through the pronoun,
"O."
I actually misspoke.
I believe not all nouns are gendered, but some may be.
Thus, while the sentences in Turkish, in this case,
don't actually specify gender, our product
translates it to common stereotypes.
"She is a nurse," while "He is a doctor."
So why does that happen?
Well, Google Translate learns from hundreds of millions
of already translated examples from the web.
And it therefore also learns the historical and social trends
that have come with these hundreds of millions
of examples, the historical trends of how
we've thought of occupations in society thus far.
So it skews masculine for doctor,
whereas it skews feminine for nurse.
As we started to look into this problem,
we went back to those first two lessons.
OK, how can we make the training data more diverse?
How can we make it more representative
of the full gender diversity?
Also, how could we better train a model?
How could we improve and measure the space,
and then make modeling changes?
Both of these questions are important.
But what we started to realize is how important
context was in this situation.
Take, for example, the sentence, "Casey is my friend."
Let's say we want to translate to Spanish, in which case
friend could be "amigo," the masculine version, or "amiga,"
the feminine version.
Well, how do we know if Casey is a male, a female, or a gender
non-binary friend?
We don't have that context.
Even a perfectly precise model trained
on diverse data that represents all kinds of professions
would not have that context.
And so we realized that even if we do
make our understandings of terms more neutral,
and even if we were to build up model precision,
we would actually want to give this choice to the user, who
actually understands what they were
trying to achieve with the sentence in the translation.
What we did is choose to provide that to our users
in the form of options and selections.
We translate "friend" both to "amigo"
and to "amiga," so that the user can
make a choice that is informed based on the context
that they have.
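As a rough sketch of that design choice (the real Translate system is far more involved), a hypothetical `translate` backend could simply return one candidate per grammatical gender and let the UI surface both.

```python
from typing import Callable, Dict, List

def translate_with_gender_options(
        text: str,
        translate: Callable[[str, str], str]) -> List[Dict[str, str]]:
    """Return one translation candidate per grammatical gender.

    `translate` is a hypothetical backend that accepts a target-gender hint;
    the UI can then show both options (e.g., "amiga" and "amigo") and let the
    user choose based on context the model doesn't have.
    """
    return [
        {"gender": gender, "translation": translate(text, gender)}
        for gender in ("feminine", "masculine")
    ]
```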
Currently, this solution is only available for a few languages.
And it's also only available for single terms like "friend."
But we're actively working on trying to expand it
to more languages, and also trying
to be inclusive of larger sentences and longer contexts,
so we can actually tackle the example you saw earlier.
We're excited about this line of thinking, though,
because it enables us to think about fairness beyond simply
the data and the model, but actually as
a holistic experience that a user engages with every day,
and trying to make sure that we actually
build those communication lines between the product and the end
consumer.
The biggest lesson we learned here is that context is key.
Think about the ways that your user
will be interacting with your product and the information
that they may have that the model doesn't have,
or the information that the model might have
that the user doesn't have.
How do you enable the users to communicate effectively
with your product, but also get back the right transparency
from it?
Sometimes, this is about providing user options,
like you saw with Translate.
Sometimes, it's also just about providing more context
about the model's decisions, and being a little bit more
explainable and interpretable.
The other piece that's important is making sure
that you get feedback from diverse users.
In this case, this was users who spoke different languages,
and who had different definitions of identity.
But it's also important to make sure,
as you're trying to get feedback from users,
that you think about the different ways
in which these users provide you feedback.
Not every user is equally likely to be accepting
of the same feedback mechanism, or equally
likely to proactively give you feedback in, say, a feedback
form on your product.
So it's important to actually make sure
that whether that be through user research,
or through dogfooding, or through different feedback
mechanisms in your product, that you identify
different ways to access different communities who
might be more or less likely to provide that information.
Lastly, identify ways to enable multiple experiences
in your product.
Identify the places where there could be more than one
correct answer, for example.
And find ways to enable users to have that different experience.
Representing human culture and all of its differences
requires more than a theoretical and technical toolkit.
It requires a much more rich and context-dependent experience.
And that is really, at the end of the day, what
we want to provide our users.
We hope that those lessons were helpful.
They've been lessons that we've been really, really grateful
to learn, and that we've started to execute in our own products.
But what's next?
We're starting to put these lessons into practice.
And while we know that product development in ML fairness
is a context-dependent experience,
we do want to start building some of the fundamentals
in terms of tools, resources, and best practices.
Because we know how important it is to at least start
with those metrics, start with the ability
to collect diverse data, start with consistent communication.
One of the first things we're thinking about
is transparency frameworks.
We want to create and leverage frameworks that drive
consistent communication-- both within Google,
but also with the industry at large--
about fairness and other risks that
might exist with data collection and modeling.
We also want to build tools and techniques,
develop and socialize tools that enable evaluating and improving
fairness concerns.
Let's talk about transparency first.
Today, we're committing to a framework for transparency
that ensures that we think about, measure, and communicate
about our models and data in a way that is consistent.
This is not about achieving perfection in our data
or models, although of course we hope to get there.
It's about the context under which something
is supposed to be used.
What are its intended use cases?
What is it not intended for?
And how does it perform across various users?
We released our first Data Card last October
as part of the Open Images Extended Dataset
that you heard Jackie talk about earlier.
This Data Card allows us to answer questions like,
what are the intended use cases of this dataset?
What is the nature of the content?
What data was excluded, if any?
Who collected the data?
It also allows us to go into some
of the fairness considerations.
Who labeled the data, and what information did they have?
How was the data sourced?
And what is the distribution of it?
For Open Images Extended, for example,
while you can see that the geographic distribution is
extremely diverse, 80% of the data comes from India.
This is an important finding for anyone
who wants to use this dataset, both for training
or for testing purposes.
It might inform how you interpret your results.
It also might inform whether or not
you choose to augment your dataset with something else,
for example.
This kind of transparency allows for open communication
about what the actual use cases of this dataset should be,
and where it may have flaws.
We want to take this a step further with Model Cards.
Here you see an example screenshot
for the Jigsaw Perspective Toxicity API
that we talked about earlier.
With Model Cards, we want to be able to give you
an overview of what the model is about,
what metrics we use to think about it,
how it was architected, how it was trained, how it was tested,
what we think it should be used for,
and where we believe that it has limitations.
We hope that the Model Card framework
will work across models, so not just for something
like toxicity, but also for a face detection model,
or for any other use case that we can think of.
In each case, the framework should be consistent.
We can look at metrics.
We can look at use cases.
We can look at the training and test data.
And we can look at the limitations.
Each Model Card will also have the quantitative metrics
that tell you how it performs.
Here, for example, you can see an example set
of metrics sliced by age.
You can see the performance on all ages,
on the child age bucket, on the adult age bucket,
and on the senior age bucket.
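As a loose illustration of how that content might be structured (the field names and numbers below are invented placeholders, not the schema of any released Model Card tooling):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ModelCard:
    """Hypothetical, simplified Model Card structure."""
    name: str
    overview: str
    intended_uses: List[str]
    limitations: List[str]
    training_data: str
    eval_data: str
    # Quantitative metrics sliced by group, e.g. by age bucket.
    sliced_metrics: Dict[str, float] = field(default_factory=dict)

# Illustrative instance only; every value here is a made-up placeholder.
face_detector_card = ModelCard(
    name="face_detector_v1",
    overview="Detects face bounding boxes in still images.",
    intended_uses=["photo organization"],
    limitations=["not evaluated for surveillance use cases"],
    training_data="labeled photo dataset (placeholder)",
    eval_data="held-out slice, balanced by age bucket (placeholder)",
    sliced_metrics={"all": 0.94, "child": 0.89, "adult": 0.95, "senior": 0.91},
)
```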
So how do you create those metrics?
How do you compute them?
Well, we also want to be able to provide you the tools
to do this analysis, to be able to create your own model cards,
and also to be able to improve your models over time.
The first piece of the set of tools and resources
is open datasets.
The Open Images Extended Dataset is one of many datasets
that we have and hope to continue to open
source in the coming years.
In this example, the Open Images Extended Dataset
collects data from crowdsourced users
who are taking images of objects in their own regions
of the world.
You can see, for example, how a hospital or food might
look different in different places,
and how important it is for us to have that data.
With the live Kaggle competition,
we also have open sourced a dataset
related to the Perspective Toxicity API.
I mentioned earlier how important
it is for us to look at real comments and real data.
So here, the Jigsaw team has open
sourced a dataset of real comments from around the web.
Each of these comments is annotated with the identity
that the comment references, with whether or not
the comment is toxic, and with other attributes
of the comment.
We hope that datasets like these continue
to be able to advance the conversation, the evaluation,
and the improvements of fairness.
Once you have a dataset, the question becomes,
how do you take that step further?
How do you evaluate the model?
One thing you can do today is deep-dive with the What-If
tool.
The What-If tool is available as a TensorBoard plugin, as well
as in Jupyter notebooks.
You can deep-dive into specific examples,
and see how changing features actually affects your outcome.
You can understand different fairness definitions,
and how modifying the threshold of your model
might actually change the goals that you're achieving.
Here's a screenshot of the What-If tool.
What you see here on the right is a whole bunch of data points
that are classified by your model.
Data points of a similar color have
been given a similar score.
You can select a particular data point,
and then with the features on the right,
you can actually modify the feature value
to see how changing the input would potentially
change the output.
For example, if I changed the age defined in this example,
does it actually change my classification?
If it does, that might tell me something
about how age is influencing my model,
and where potentially, there may be biases,
or where I need to deep-dive a little bit more.
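The What-If tool does this interactively; the underlying probe can be sketched with any generic `predict` callable, outside the tool's actual API.

```python
import copy

def perturb_and_compare(example, feature, values, predict):
    """Re-score a single example with one feature swapped to each candidate value.

    Mirrors the manual probe described above: if flipping `age` (or another
    feature) changes the classification, that feature is influencing the
    decision and deserves a closer look.
    """
    results = {"baseline": predict(example)}
    for value in values:
        variant = copy.deepcopy(example)
        variant[feature] = value
        results[value] = predict(variant)
    return results

# Usage (hypothetical feature names and model):
# perturb_and_compare({"age": 34, "text": "..."}, "age", [18, 34, 70], model.predict)
```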
We also hope to take this a step further
with Fairness Indicators, which will
be launched later this year.
Fairness Indicators will be a tool
that is built on top of TensorFlow Model Analysis,
and as a result, can work end to end with the TFX pipeline.
TFX stands for TensorFlow Extended.
And it's a platform that allows you to train, evaluate,
and serve your models, all in one go.
And so we're hoping to build fairness into this workflow
and into these processes.
But Fairness Indicators will also work alone.
It'll work as an independent tool
that can be used with any production pipeline.
We hope that with Fairness Indicators,
you'll be able to look at data at a large scale,
and see how your model actually performs.
You can compute fairness metrics for any individual group,
and visualize these comparisons to a baseline slice.
Here, for example, you can see the baseline slice
as the overall average metric in blue,
and then you can actually compare
how individual groups or individual slices
compare to that baseline.
For example, some may have a higher false negative rate
than average, while others may have a lower one.
We'll provide the main metrics
that we believe have been useful for various fairness use cases.
You can then use Fairness Indicators also
to evaluate at multiple thresholds
to understand how performance changes,
and how maybe changes to your model
could actually lead to different outcomes for different users.
If you find a slice that doesn't seem
to be performing as well as you expect it to,
you can actually take that slice further
by deep-diving immediately with the What-If tool.
We will also be providing confidence intervals,
so that you can understand where the differences that you're
seeing are significant, and where we may actually
need more data to better understand the problem.
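Fairness Indicators itself builds on TensorFlow Model Analysis; as a rough illustration of the underlying computation rather than the actual API, the sketch below reports false negative rate per slice at several thresholds, with a bootstrap interval at a default threshold of 0.5 (the `slice`, `label`, and `text` keys are assumptions).

```python
import random
from collections import defaultdict

def fnr(examples, score, threshold):
    """False negative rate: positives the model scores below the threshold."""
    positives = [e for e in examples if e["label"] == 1]
    if not positives:
        return None
    missed = sum(1 for e in positives if score(e["text"]) < threshold)
    return missed / len(positives)

def sliced_fnr_with_ci(examples, score, thresholds=(0.3, 0.5, 0.7),
                       n_boot=200, seed=0):
    """FNR per slice and threshold, with a bootstrap 95% interval per slice."""
    rng = random.Random(seed)
    slices = defaultdict(list)
    for e in examples:
        slices[e["slice"]].append(e)   # e.g., the identity group referenced
    report = {}
    for name, subset in slices.items():
        per_threshold = {t: fnr(subset, score, t) for t in thresholds}
        # Bootstrap the FNR at the default threshold to gauge significance.
        resampled = []
        for _ in range(n_boot):
            sample = [rng.choice(subset) for _ in subset]
            value = fnr(sample, score, 0.5)
            if value is not None:
                resampled.append(value)
        resampled.sort()
        ci = ((resampled[int(0.025 * len(resampled))],
               resampled[int(0.975 * len(resampled))])
              if resampled else None)
        report[name] = {"fnr_by_threshold": per_threshold, "ci_95": ci}
    return report
```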
With Fairness Indicators, we'll also
be launching case studies for how
we've leveraged these metrics and improvements
in the past internally in our own products.
We hope that this will help provide context
about where we found certain metrics useful, what
kinds of insights they've provided us,
and where we found that certain metrics actually haven't really
served the full purpose.
We'll also provide benchmark datasets
that can be immediately used for vision and text use cases.
We hope that Fairness Indicators will simply
be a start to being able to ask questions of our models,
understand fairness concerns, and then eventually, over time,
improve them.
Our commitment to you is that we continue
to measure, improve, and share our learnings
related to fairness.
It is important not only that we make our own products
work for all users, but that we continue
to share these best practices and learnings so that we,
as an industry, can continue to develop fairer products--
products that work equitably for everybody.
One thing I do want to underscore
is that we do know that in order to create
diverse products, products that work for diverse users,
it is also important to have diverse voices in the room.
This not only means making sure that we
have diverse voices internally working on our products,
but also means that we include you
as the community in this process.
We want your feedback on our products,
but we also want to learn from you
about how you're tackling fairness and inclusion
in your own work, what lessons you're learning,
what resources you're finding useful.
And we want to work with you to continue and build and develop
this resource toolkit, so that we can continue,
as an industry, to build products that
are inclusive for everyone.
Thank you.
[MUSIC PLAYING]