AMIT SHARMA: Hi, all.
Welcome to this session on
Causality and Machine Learning
as a part of Frontiers in
Machine Learning.
I'm Amit Sharma from Microsoft
Research and your host.
Now of course I presume you
would all agree that
distinguishing correlations from
causation is important.
Even at Microsoft, for example,
when we're deciding which
product feature to ship or when
we're making business decisions
about marketing, causality is
important.
But in recent years, what we're
also finding is that causality
is important for building
predictive machine learning
models as well.
So especially if you're interested in out-of-domain generalization and having your models not be brittle, you need causal reasoning to make them robust. And in fact there are interesting results even about adversarial robustness and privacy where causality may play a role.
This is an interesting time at
the intersection of causality
and machine learning. And we
now have a group at Microsoft as
well that is looking at these
connections.
I'll post a link in the chat.
But for now, today I thought we can all ask this question: what are the big ideas that will drive forward this conversation between causality and ML?
And I'm glad that today we have
three really exciting talks.
Our first talk is from Susan Athey, Economics of Technology Professor at Stanford. She'll talk about the challenges of and solutions for decision-making in high-dimensional settings and how generative data modeling can help.
And in fact when I started in
causality, Susan's work was one
of the first I saw that was
making connections between
causality and machine learning.
I'm looking forward to her talk.
And next we'll have Elias
Bareinboim, who will be talking
about the three kinds of
questions we typically want to
ask about data and how two of
them turn out to be causal and
they're much harder.
And he'll also talk about an
interesting emerging new field,
causal reinforcement learning.
And then finally we'll have
Cheng Zhang from Microsoft
Research Cambridge.
She'll essentially give a recipe for how to build models, neural networks, that are robust to adversarial attacks. And as you've by now guessed from this session, she'll use causal reasoning. At the end we'll have 20 minutes for open discussion, and all the speakers will be live for your questions.
Before we start, let me tell you
one quick secret.
All these talks are prerecorded.
So if you have any questions
during the talk, feel free to
just ask those questions on the
hub chat itself and our speakers
are available to engage with you
on the chat even while the talk
is going on.
With that, I'd like to hand it
over to Susan.
SUSAN ATHEY: Thanks so much for
having me here today in this
really interesting session on
machine learning and causal
inference.
Today I'm going to talk about
the application of machine
learning to the problem of
consumer choice.
And I'm going to talk about some
results from a couple of papers
I've been working on that
analyze how firms can use
machine learning to do
counterfactual inference for
questions like how should I
change prices or how should I
target coupons.
And I'll also talk a little bit
about the value of different
types of data for solving that
problem.
Doing counterfactual inferences
is substantially harder than
prediction.
There can be many data
situations where it's actually
impossible to estimate
counterfactual quantities.
It's essential to have the
availability of experimental or
quasi experimental variation in
the data to separate correlation
from causal effects.
That is, we need to see whatever
treatment it is we're studying,
that needs to vary for reasons
that are unrelated to other
unobservables in the model. We
need the treatment assignment to
be as good as random after
adjusting for other observables.
We also need to customize
machine learning optimization
for estimating the causal effects and counterfactuals of interest instead of for prediction.
And indeed, model selection and
regularization need to be quite
different if the goal is to get
valid causal estimates. That's
been a focus of research,
including a lot of research I've
done.
A second big problem in
estimating causal effects is
statistical power. In general,
historical observational data
may not be informative about
causal effects. If we're trying
to understand what's the impact
of changing prices, and if prices always changed in the past in response to demand shocks, then we're not going to be able to learn what would happen if I change the price at a time when there isn't a demand shock. I won't have data on that from the past.
I'll need to run an experiment
or I'm going to need to focus on
just a few price changes or use
statistical techniques that
focus my estimation on a small
part of the variation of the
data.
Any of those things is going to
lead to a situation where I
don't have as much statistical
power as I would like.
Another problem is effect sizes
are often small.
Firms are usually already
optimizing pretty well.
It will be surprising if making
changes leads to large effects.
And the most obvious ideas for
improving the world have often
already been implemented.
Now that's not always true, but
it's common.
And finally personalization is
hard.
If I want to get exactly the
right treatment for you, I need
to observe lots of other people
just like you, and I need to
observe them with different
values of the treatment variable
that I'm interested in.
And again that's very difficult,
and often it's not possible to
get the best personalized effect
for someone in a small dataset.
Instead, I'm averaging over
people who are really quite
different than the person of
interest.
So for all of these reasons, we
need to be quite cautious in
estimating causal effects and we
need to consider carefully what environments enable that estimation and give us enough statistical power to draw conclusions.
Now I want to introduce a model that's commonly used in economics and marketing to study consumer choice. This model was introduced by Daniel McFadden in the early 1970s; he won the Nobel Prize for this work.
The main crux of his work was to establish a connection between utility maximization, a theoretical model of economic behavior, and a statistical model, the multinomial logit.
And this modeling setup was
explicitly designed for
counterfactual inference.
The problem he was trying to solve was: what would happen if we expand BART, the public transportation system in the Bay Area? How will people change their transportation choices when they have access to this new alternative?
So the basic model is that an individual's utility depends on their mean utility, which varies by the user, the item and time, plus an idiosyncratic shock. We're going to have a more specific functional form for the mean utility, and that's going to allow us to learn from seeing the same consumer over time and also to extrapolate from one consumer to another.
We're going to assume that the
consumer maximizes utility among
items in a category by just
making this choice. So they're
going to choose the item I that
maximizes their utility.
The nice thing is that if the error has a type-one extreme value distribution and is independent across items, then we can write the probability that the user's choice at time t equals item i in the standard multinomial logit functional form. So utility maximization, where these mu's are the mean utilities, will lead to multinomial logit probabilities.
So data about individual i's purchases can be used to estimate the mean utility.
In particular, if we write their
utility, their mean utility as
something that depends on the
item and the user but that's
constant over time, so this is
just their mean utility for this
item, like how much they like a
certain transportation choice,
and then a second term which is a product of two terms: the price the user faces at time t for item i, and a price-preference parameter that's specific to the user.
If I have this form of
preferences and then the price
varies over time while the
user's preference parameters
stay constant, I'll be able to
estimate how the user feels
about prices by looking at how
their choices differ across
different price scenarios.
And if I pool data across users,
I'll then be able to understand
the distribution of consumer
price sensitivities as well as
the distribution of user
utilities for different items.
So in a paper with Rob Donnelly, David Blei and Fran Ruiz, we take a look at how we can
combine machine learning methods
and modern computational methods
with traditional approaches to
studying consumer purchase
behavior in supermarkets.
The traditional approach in economics and marketing is to
study one category like paper
towels at a time.
We then model consumer
preferences using a small number
of latent parameters.
For example, we might allow a latent parameter for how much consumers care about prices.
We might allow a latent
parameter for product quality.
But other than that, we would
typically assume that there's a
small number of observable
characteristics of items and
there's some common coefficients
which express how all consumers
feel about those
characteristics.
The traditional models also
assume that items are
substitutes within a category
and they would ignore other
categories.
So you might study consumer
purchases for paper towels
ignoring everything else in the
supermarket, just throwing all
that data away.
So what we do in our approach is
that we maintain this utility
maximization approach.
But instead of just studying one
category, we study many
categories in parallel.
We look at more than 100
categories, more than a thousand
products at the same time.
We maintain the assumption that
categories are independent and
that items are substitutes
within the categories.
And we select categories where
that's true.
So categories of items where the
consumers typically only
purchase one brand or one of the
items.
We then take the approach of a nested logit, which comes from the literature in economics and marketing, where in each category there's a shock to an individual's need to purchase in the category at all.
But then, conditional on purchasing, the errors, the idiosyncratic shocks to the consumer's utility, are independent.
So having the single shock to purchasing at all effectively introduces correlation among the probabilities of purchasing each of the items within the category.
Now, the innovation where the
machine learning comes in is
that we're going to use matrix
factorization for the user item
preference parameters.
So instead of having, for each consumer, a thousand different latent parameters, one for each product they might consider, we use matrix factorization so that there's a lower-dimensional vector of latent characteristics for the products and consumers have a lower-dimensional vector of latent preferences for those characteristics.
That allows us to improve upon
estimating a hundred different
separate category models.
We're going to learn about how
much you like organic lettuce
from whether you chose organic
tomatoes, and we'll also just
learn about whether you like
tomatoes at all from whether you
purchased lettuce in the past.
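As a rough illustration of the matrix factorization idea, here is a short Python sketch with hypothetical dimensions; the actual priors, likelihood and estimation are in the paper.

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 500, 1000, 20     # hypothetical sizes; k latent factors

# Instead of one free parameter per (user, item) pair, each item gets a
# k-dimensional latent characteristic vector and each user a k-dimensional
# latent preference vector, so the mean utility is mu[u, i] = theta[u] . beta[i].
theta = rng.normal(scale=0.1, size=(n_users, k))   # user preferences
beta = rng.normal(scale=0.1, size=(n_items, k))    # item characteristics
mu = theta @ beta.T                                # implied user-item mean utilities

# Learning about organic lettuce from organic tomatoes happens because both
# items load on shared latent characteristics in beta.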
And so I won't have time to go through it today, but this is a layout of what we call the nested factorization model, showing the nest where you decide what to purchase if you're going to purchase, and the consumer's decision of whether to purchase in the category at all.
And we have in each case vectors
of latent parameters that are
describing the consumer's
utility for categories and for
items.
One of the reasons that this type of model hasn't been done in economics and marketing in the past is that what was standard, if you were going to do a model like this, would be to use either classical methods like maximum likelihood without very many latent parameters, or Markov chain Monte Carlo Bayesian estimation, which historically had very limited scalability. What we do in our papers is use variational Bayes, where we approximate the posterior with a parameterized distribution and minimize the KL divergence to the true posterior using stochastic gradient descent.
We show we can overcome a number of challenges. In particular, introducing price and time-varying covariates slows down the computation a fair bit, and the substitutability within categories leads to nonlinearities. Despite that, we're able to overcome these challenges.
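Here is a rough Python sketch of the variational Bayes recipe just described, using a toy logit likelihood for a single user and independent Gaussian variational factors; the paper's nested factorization model is far richer, so the shapes, priors and learning rate below are assumptions for illustration only.

import torch

T, J = 200, 5                               # trips and items (hypothetical)
torch.manual_seed(0)
prices = torch.rand(T, J)
choices = torch.randint(0, J, (T,))

# Approximate the posterior over this user's latent tastes and price
# sensitivity with Gaussians, and minimize the KL divergence to the true
# posterior by maximizing the ELBO with stochastic gradient descent.
q_mean = torch.zeros(J + 1, requires_grad=True)
q_logstd = torch.full((J + 1,), -1.0, requires_grad=True)
opt = torch.optim.Adam([q_mean, q_logstd], lr=0.05)

for step in range(500):
    eps = torch.randn(J + 1)
    sample = q_mean + eps * q_logstd.exp()          # reparameterized draw
    taste, gamma = sample[:J], sample[J]
    utilities = taste - gamma * prices              # mean utilities per trip
    loglik = torch.distributions.Categorical(logits=utilities).log_prob(choices).sum()
    q = torch.distributions.Normal(q_mean, q_logstd.exp())
    kl = torch.distributions.kl_divergence(q, torch.distributions.Normal(0.0, 1.0)).sum()
    loss = kl - loglik                              # negative ELBO
    opt.zero_grad(); loss.backward(); opt.step()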
Once we have estimates of consumer preferences for products, as well as estimates of consumer sensitivity to price, we can then try to validate our model and see how well we actually do in assessing how consumer demand changes when prices change.
And in our data we see many,
many price changes. We see prices typically change, at this particular grocery store we have data from, on Tuesday nights.
And so in any particular week
there may be a change in price
from Tuesday to Wednesday.
And so in order to assess how
well our model does in
predicting the change in demand
and response to a change in
price, we held out test data
from weeks with price changes.
In those weeks we break the price changes into different buckets by the size of the price change.
We then look at what is the
change from Tuesday to Wednesday
in demand in those weeks.
Finally, we break out those
aggregations according to which
type of consumer we have for
each item.
So in particular, on a week
where we have a change in price
for a product, we can
characterize the consumers as
being very price sensitive,
medium price sensitive or not
price sensitive for that
specific product.
And then we can compare how
demand changes for each of those
three groups.
And so this figure here
illustrates what we find in the
held-out test data.
In particular, we find that the
consumers that we predict to be
the least price sensitive in
fact don't seem to respond very
much when prices change, while
the consumers who are most price
sensitive are most elastic, as
we say in economics, are the
ones whose quantity changes the
most when prices change. Once
we're confident that we have a
good model of consumer
preferences, we can then try to
do counterfactual exercises such
as evaluate what would happen if
I introduce coupons and targeted
them at individual consumers.
We'll take a simple case where
we have only two prices we
consider, the high price or the
typical price, and the low
price, which is the discounted
price.
Now what we do is look into the data and evaluate what would happen if we sent those targeted coupons out.
So for each product we look at
the two most common prices that
were charged in the data.
We then assess which consumers
would be most appropriate for
coupons.
We might look, for example, and say I want to give coupons to a third of consumers; I can see which consumers are most price sensitive, most likely to respond to those coupons.
I can then actually use held out
test data to assess whether my
coupon strategy is actually a
good one.
And that will allow me to
validate again whether my model
has done a good job in
distinguishing the more price
sensitive consumers from the
less price sensitive consumers.
So this figure illustrates that
for a particular product there
were two prices, the high price
and low price that were charged
over time.
In the actual data, different users might have come to the store sometimes on a low-price day and sometimes on a high-price day, indicated by blue or red.
What we then do is say what
would our models say about who
should get the high price and
who should get the low price.
So we can counterfactually reassign, say, the top four users to high prices, indicated by these orange squares, and we can counterfactually reassign the fifth and sixth users to the low price, indicated by the green rectangles.
Now, since the users we assigned
to high saw a mix of low and
high prices, I can actually
compare how much those users
purchased on the high priced
days and low priced days and I
can also look among the people
that I would counterfactually
assign to low prices and see
what's the impact of high prices
versus low prices for those
consumers. And I can use those
estimates to assess what would
happen if I reassigned users
according to my counterfactual
policy.
When I do this, I can compare what my model predicts would happen in the test set to what actually happened in the test set.
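A small Python sketch of this validation logic, with a made-up held-out table and a hypothetical targeting set; the real analysis works over many products, weeks and users, so this only shows the shape of the comparison.

import pandas as pd

# Hypothetical held-out data for one product: the price regime each user
# happened to face on each visit and the quantity they bought.
test = pd.DataFrame({
    "user":     [1, 1, 2, 2, 3, 3],
    "price":    ["high", "low", "high", "low", "high", "low"],
    "quantity": [0, 1, 1, 1, 0, 2],
})

low_group = {3}                             # users the model would target with coupons
high_group = set(test["user"]) - low_group  # users the model would keep at the high price

def avg_quantity(users, price):
    rows = test[test["user"].isin(users) & (test["price"] == price)]
    return rows["quantity"].mean()

# Because every user saw a mix of high- and low-price days, we can read off
# their held-out purchases under the price the policy would assign them and
# compare that with what they did under the other price regime.
print("targeted users at low price: ", avg_quantity(low_group, "low"))
print("targeted users at high price:", avg_quantity(low_group, "high"))
print("other users at high price:   ", avg_quantity(high_group, "high"))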
What I actually find somewhat
surprisingly here is that in
fact what actually happens in
the test set is even more
advantageous for the firm than
what the model predicts.
In particular, our model predicts that if I reallocate the prices to the consumers according to what our model suggests would be optimal from a profit perspective, we can get an 8% increase in revenue. That is, instead of varying prices from high to low from day to day, we always keep them high and then target the coupons to the more price sensitive consumers.
If we actually look at what happened in our held-out test data, the benefits of high versus low prices, and the difference in those benefits between the high and the low consumers, are such that in the test set we actually would have gotten a 10 or 11% increase in profits had the prices been set that way.
To conclude, the approach I've
outlined is to try to learn
parameters of consumers' utility through revealed preference.
That is, use the choices that
consumers make to learn about
their preferences about product
characteristics and prices and
then predict their responses to
alternative situations.
It's important to find a dataset
that's large enough and has
sufficient variation in price to
isolate the causal effects of
prices and also assess the
credibility of the estimation
strategy.
And it's also important to select a counterfactual to study where there's actually enough variation in the data to be able to assess and validate whether your estimates are right.
And so I illustrated two cases
where I was able to use test set
data to validate the approach.
You use the training data to
assess, for example, which
consumers are most price
sensitive and look at the test
data and see if their purchase
behavior varies with price in
the way that your model
predicts.
In ongoing work, I'm trying to
understand how the different
types of data create value for
firms.
And so in particular if firms
are using the kinds of machine
learning models that I've been
studying and they use those
estimates in order to do things
like target coupons, we can ask
how much do profits go up as
they get more data.
In particular, how does that
answer vary if it's more data
about lots more consumers, or if
we do things like retain
consumer data for a longer
period of time.
And preliminary results are
showing that retaining user data
for a longer period of time so
you really get to know an
individual consumer can be
especially valuable in this
environment.
Overall, I think there's a lot
of promise in combining tools
from machine learning like
matrix factorization but also
could be neural nets, with some
of the traditional approaches
from causal inference.
And so here we've put these things together: we used functional forms for demand, the concept of utility maximization, and approaches to counterfactual inference from economics and marketing, together with computational techniques from machine learning, in order to be able to do this type of analysis at large scale.
ELIAS BAREINBOIM: Hi, guys.
Good afternoon.
I'm glad to be here online
today.
Thank you for coming.
Also, thank you to the organizers, Amit and Amber, for inviting me to speak at the event today.
My name is Elias Bareinboim.
I'm from the Computer Science
Department and the Causal
Artificial Intelligence Lab at
Columbia University.
Check my Twitter.
I have discussions about
artificial intelligence and
machine learning.
Also apologies for my voice.
I'm a little bit sick.
But very happy to be here today.
I will be talking about what I have been thinking about regarding the foundations of artificial intelligence, how it relates to causal inference, and the notions of explainability and decision-making.
I'll start with the outline of the talk. I'll start from the beginning, defining what a causal model is. I will introduce three basic results that are somewhat intertwined. I usually say that if we understand them, we understand like 50% of what causal inference is about. There are a lot more technical results, but the conceptual part is the most important.
First I'll start with structural causal models, which is the most general definition of a causal model that we know to date; that's by Pearl himself.
Then I'll introduce the second result, which is known as the Pearl Causal Hierarchy, the PCH, named after him. This is the mathematical object used by Pearl himself and Mackenzie in the Book of Why. If you haven't read the book, I strongly recommend it. It's pretty good, since it discusses the foundations of causal inference and how it relates to the future of AI and machine learning, most prominently in the last chapter, as well as its intersection with the other sciences.
This is work partially based on a chapter that we're working on about the Pearl Causal Hierarchy and the foundations of causal inference, joint work with Juan Correa, my student at Columbia, and Duligur Ibeling and Thomas Icard, collaborators from Stanford University. This is the link here to the chapter. Take a look, because most of the things I'm talking about here are in there in some shape or form.
Then I'll move to another result that is called the causal hierarchy theorem, which was proven in the chapter; it resolves a 20-plus-year-old open problem and is used as one of the main building blocks.
And then I'll try to connect this with machine learning, more specifically supervised learning and reinforcement learning, and how it fits with the Pearl Causal Hierarchy, also called the ladder of causation in the book.
Then I'll talk a little bit about what causal inference and cross-layer inferences are. I would then move to the design of artificial intelligence systems with causal capabilities. I will come back to machine learning methods, in particular deep learning and RL, from this perspective. My focus here will be to introduce the ideas, principles and some tasks. I will not focus on implementation details.
Also, I should mention that this is essentially the outline of the course I'm teaching this semester at Columbia. Bear with me, I'll try to give you the idea; if you're interested in learning more, check the references or send me a message.
Now, without further ado, let me introduce the idea of what a causal model, a structural causal model, is. We will take a process-based approach to causality.
The idea is borrowed from physics, chemistry, sometimes economics, and other fields: we have a collection of mechanisms underlying some phenomenon that we're theorizing about. In this case, suppose you're trying to understand the effect of taking some drug on a headache. Those are observable variables, and we have the corresponding mechanisms here: f sub D for the variable drug and f sub H for the variable headache.
Each mechanism takes as arguments a set of observables, in the case of f sub D the variable age, and unobservables, in this case U sub D. U sub D stands for all the variables in the universe, other than age, that generate variation in whether someone takes the drug. And the same here with f sub H: drug and age are the observable arguments, and U sub H stands for all the variables in the universe that are not drug and age and that determine whether someone would or would not have a headache.
This is the real process; you usually have possibly complicated functions here, f sub D and f sub H, which are not instantiated.
Usually we have some coarser description, and this is the causal graph related to this collection of mechanisms. The causal graph is nothing but a partial specification of the system, in which an arrow just means that some variable participates in the mechanism of the other. I'll just use X, Y, Z to make the communication easier.
Now we have, for example, that age participates in the mechanism f sub H, so here's an arrow from Z to Y. The same with drug: this is the arrow from X to Y. And likewise, age participates in f sub D. Note that in the graph we don't have the particular instantiation of the functions; we're just preserving the arguments.
Now, this is a process that is unfolding in time, and we can sample from a process like that. This gives rise to a distribution, the observational or nonexperimental distribution over the observables, P of X, Z and Y in this case.
Usually when you're doing machine learning, supervised or unsupervised learning, we're playing on this side of the picture.
Here, when we are trying to understand causality, it's about going to the system and changing something, or overwriting, as we computer scientists like to say, some function. Here we would like to overwrite the equation, the natural way in which people take drugs, with drug is equal to yes. This is related to the do operator, in which you overwrite the original mechanism, f sub D, in this case with do X is equal to yes. Now we no longer have the original equation; you have a constant here. More general interventions are possible, but we don't have time for them in these slides. That's what we have.
This gives the semantics of the operation without necessarily having access to the mechanisms themselves; this is the meaning of the operation.
Now, here is the graphical counterpart of that. Note that f sub D here no longer has age as an argument of the function; there's just the constant. You put the constant here, and in the graph we cut the arrows coming into X. This is the mutilated graph.
Again, if we're able to contrive reality in this way, you can sample from this process, which gives rise to the distribution called the interventional or experimental distribution, P of Z, Y given do X is equal to yes. I use these variables X, Z, Y here, but X could be any decision, Y any outcome, and Z any set of covariates or features.
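Here is a minimal Python sketch contrasting the two sides of this picture for the drug, age, headache example; the mechanisms and numbers are made up purely to illustrate the semantics of sampling observationally versus under do(X).

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample(do_drug=None):
    # Made-up structural causal model:
    #   Z (age)      = U_Z
    #   X (drug)     = f_D(Z, U_D)    (older people take the drug more often)
    #   Y (headache) = f_H(X, Z, U_H)
    age = rng.integers(20, 80, n)
    drug = (rng.random(n) < 0.2 + 0.01 * (age - 20)).astype(int)
    if do_drug is not None:
        drug = np.full(n, do_drug)    # do(X = x): overwrite f_D with a constant
    headache = (rng.random(n) < 0.6 - 0.3 * drug + 0.005 * (age - 20)).astype(int)
    return drug, headache

x, y = sample()                       # observational ("seeing") world
print("P(headache | drug taken) =", y[x == 1].mean())
_, y_do = sample(do_drug=1)           # interventional ("doing") world, mutilated model
print("P(headache | do(drug))   =", y_do.mean())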
Now, what is the challenge here? The challenge is that in reality this upper floor here is almost never observed. This is usually called unobserved; that is why I put it in gray. This is one of the things we don't have in practice, or only very rarely. Another challenge is that the data we have is usually coming from the left side, from the system naturally unfolding or evolving, and we would like to understand what the effect would be if we went there and did an intervention on the system, of our own will, or deliberately as a policymaker or decision-maker, setting this variable to yes.
And we have data from the left, from seeing, and we want to infer what would happen if we do something to this system.
Now, we can try to generalize this idea and define what a structural causal model is. This is the definition from Pearl's Causality book, around 2000. I won't go through the definition step by step, but suffice it to say you have a set of observables, or endogenous variables, like age, drug or headache; exogenous, unobserved variables, which would be the U sub D and U sub H that we had before; and a collection of mechanisms, one for each of these observed variables.
Mechanisms f sub D or f sub H, in this case. The exogenous variables summarize the conditions outside the system, and we sprinkle probability mass over them: we have this probability P of U over the exogenous variables.
Now, we understand very well how these systems work; there's awesome work by Halpern at Cornell, and by Galles and Pearl, that gives this type of understanding of these systems.
Today we're interested in a different result, which is the following. Once we have an SCM, a structural causal model M, that is fixed, a particular environment or setting with particular agents, it induces the Pearl Causal Hierarchy, or PCH, which is called the ladder of causation in the Book of Why.
Let's try to understand it. Here's the PCH, with the different layers of the hierarchy. The first layer is called the associational layer, the activity of seeing: how would seeing some variable X change my belief in the variable Y; what does a symptom tell me about the disease. Syntactically it's written as P of Y given X. This layer is very related to machine learning, supervised and unsupervised learning.
Different types of models live there. Bayes classifiers are one type of model there. You have decision trees. You have support vector machines, and deep neural networks and different types of neural networks. They live in this layer here.
Quite importantly, we're able to scale up these inferences: given this X, which could be the pixels, a set of features on the order of thousands, even millions, we try to predict some label Y, for instance, given the pixels, whether it's a cat or not. It's a classic and very hard problem, and we're kind of mastering it; we understand pretty well how to do that, with recent breakthroughs in the field in the last 20 years, I should say.
Now we have a qualitatively different layer, layer two, the interventional layer. It's related to the activity of doing: what if I do action X, what if I take the aspirin, will my headache be cured? The counterpart in machine learning would be reinforcement learning. You have causal Bayesian networks, Markov decision processes, partially observable MDPs and so on. Quite important; I'll tell you more about that. Symbolically, you write P of Y given do X, comma C. That's the notation that you have.
Now we have another qualitatively different layer, layer three, which is the counterfactual layer. I'll come back to it soon, but it's related to the activity of imagining: agents having imagination, retrospection, introspection, responsibility, credit assignment. It is the layer that gave the name to the Book of Why. This is the why type of question: what if I had acted differently, was it the aspirin that stopped my headache? Syntactically, we have this common nested counterfactual here.
I took the drug, that is X prime, an instantiation of the big X, pardon my notational license here. X prime: I took the drug, and I'm cured; that is Y prime. Now you can ask: would I have the headache, that is y, the opposite of Y prime, had I not taken the drug, that is the x that is the opposite of X prime? I took the drug and I'm good, X prime and Y prime, in the actual world, in this world. And I ask: what if I hadn't taken the drug, that is x? Would I be okay, that is Y, or not okay, that is not Y?
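For readers who want to see how such a nested counterfactual is evaluated, here is a toy Python sketch of the abduction, action, prediction steps on a made-up SCM; it only illustrates the semantics of the layer-three query, not any real study.

import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Made-up SCM with binary variables:
#   X = U_X                         (whether I take the drug)
#   Y = max(U_Y, X)                 (cured if the drug works or I recover anyway)
u_x = rng.integers(0, 2, n)
u_y = (rng.random(n) < 0.3).astype(int)
x = u_x
y = np.maximum(u_y, x)

# Counterfactual P(Y_{X=0} = 1 | X = 1, Y = 1):
# "I took the drug and was cured; would I have been cured without it?"
evidence = (x == 1) & (y == 1)        # abduction: keep U's consistent with the facts
y_cf = np.maximum(u_y[evidence], 0)   # action: set X = 0; prediction: recompute Y
print("P(cured without drug | took drug and cured) =", y_cf.mean())   # about 0.3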
There's no exact counterpart in machine learning; if you have a particular instance in mind you can ask me offline, but there are all kinds of things written in the literature, and this comes from the structural causal model.
Now I would like to see what lies beyond machine learning. I just mentioned this layer three here. Specifically, I'd like to highlight a different family of inferential tasks that arise very naturally in causality, called cross-layer inferences, as I'm showing here.
Suppose as input you have some data, and most of the available data today is observational, passively collected. The numbers here: say 99 percent of the data we have is coming from layer one, and someone may complain about the exact figure, but 90 or 99 percent of the inferences we are interested in today are about doing, the interventional layer two, or about counterfactuals, layer three: about policies, treatments and decisions, just to cite a few examples.
Then the research question that we're trying to answer here cuts across layers: we have the data, and the inference one would like to do is how to use the data collected from observations, passively, that's layer one, maybe coming from a hospital, to answer questions about interventions, that's layer two.
And under what conditions can we
do that?
Why is this task different? That is usually a good question. Why is the causal problem nontrivial?
The answer is that the SCM is almost never observed, except in a few fields such as physics, chemistry, and sometimes biology, in which the very target is to learn about this collection of mechanisms. In general, we do not observe it.
In most of the fields we in AI and machine learning are interested in, there's a human in the loop, some type of interaction we cannot fully model, given that we cannot read minds and we don't isolate the environment in any precise way. You don't have a controlled environment, so usually you cannot get that kind of help.
But still, the observation here is that this collection of mechanisms underlying the system we're trying to understand does exist out there, inducing the PCH, and you still have the cross-layer task: how can you go from the data, which is a fragment that we observe from the SCM, that's layer one, observational, to answer a question from layer two? The query could be at layer three as well. How can you move across these layers?
I use this a lot in class and spend some time on it, but I like the metaphor here: there's a complicated reality, and we just observe fragments, or shadows of the fragments, of the PCH, and we want to do inference about the outside world; under what conditions can we do that? That's the flavor: inferring consequences of these mechanisms at the other layers, layer two or three. For example, I'd like to talk about the possibility and impossibility results for these cross-layer inferences. As usual, let me read the task here.
Infer the causal quantity P of Y given do X, which is layer two, from observational data, that is layer one. That's the task that I just showed. Now, the effect of X on Y is not identifiable from the observed data. The proof is that there exist collections of mechanisms, or SCMs, capable of generating the same observed behavior at layer one, P of X and Y, while disagreeing with respect to the causal query.
To witness, we show two models: this is model one, this is model two, such that they generate the same observed behavior. This is for you to go home and think about a little bit, but they're simple models; this is XOR, by the way, not X, XOR. These models, model one and model two, with distributions P1 and P2, generate the same observed behavior in layer one; however, they generate different layer two behaviors, different layer two predictions. In this case, model one says the probability of Y given do X equal to one is a half, while model two says it is one. In other words, given layer one alone, what can we say about layer two? There's not enough information there to move. That's the result.
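A small Python sketch in the spirit of that slide; these are not necessarily the exact models shown (those use an XOR), but a minimal pair with the same property: two SCMs that agree on the layer-one distribution over X and Y and still disagree on the layer-two quantity P(Y | do(X)).

import numpy as np

rng = np.random.default_rng(2)
n = 500_000
u = rng.integers(0, 2, n)                   # unobserved variable, P(U = 1) = 1/2

def model1(x=None):
    x = u if x is None else np.full(n, x)
    return x, u                             # Y = U: Y ignores X entirely

def model2(x=None):
    x = u if x is None else np.full(n, x)
    return x, x.copy()                      # Y = X: Y is driven by X

for name, model in [("model 1", model1), ("model 2", model2)]:
    x_obs, y_obs = model()                  # same layer-one behavior in both models
    _, y_do = model(x=1)                    # but different layer-two behavior
    print(name, " P(Y=1|X=1) =", y_obs[x_obs == 1].mean(),
          "  P(Y=1|do(X=1)) =", y_do.mean())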
I would now like to make a broader statement and generalize this idea. Again, this is great work with Correa, Ibeling and Icard, from the chapter I mentioned earlier, which proved the following theorem: with respect to a measure over SCMs, under some technical conditions, the subset of SCMs in which the PCH collapses has measure zero.
Let me read the informal version here; you can go home and try to parse the formal one. Informally, for almost any SCM, in other words, almost any possible environment in which your agent or your system is embedded, the PCH doesn't collapse.
In other words, the layers of the hierarchy remain distinct. You have this hierarchy here, and it will not happen that one layer determines the others. There's more knowledge in layer two than in layer one alone. There's more knowledge in layer three than in layers one and two. The situation where one layer determines another essentially does not occur.
This closed an open problem, stated by Pearl in Chapter 1 of the Book of Why, which says that to answer a question at layer i, say layer two about interventions, one needs knowledge at layer i or above.
Now, the natural question you could be asking is: Elias, how after all are causal inferences possible? Does this mean we shouldn't do causal inference at all, given that one layer doesn't determine another? And the answer is: not at all.
This motivates the following observation. If you know zero about the SCM, the causal hierarchy theorem is what you get. If you know a little bit about the SCM, inference may be possible. What is this little bit? It's what we call structural constraints, which you could have encoded in a graphical model. There are different models here: you can have a graphical model at layer one, at layer two, and so on. And then in principle it could be possible to move across layers, depending on how you encode the constraints. These are families of graphical models.
I'd like to examine for just one minute the layer one graphical model here that is very popular, the Bayesian network, that's layer one, versus a causal Bayesian network. Not all graphical models are created equal.
This is the same task from the previous theorem; it was shown that it's impossible to move from layer one data to a layer two type of statement. Now, what if you have a Bayes net that's compatible with the data? Here is a Bayes net compatible with the data, X pointing to Y, fitting whatever data we get over X and Y. And we would like to know the layer two quantity, P of Y given do X, in this case.
If you play with it a little bit, or if you know a little bit of causality: there's no unobserved confounder in this graph, so P of Y given do X is equal to P of Y given X, by ignorability or back-door admissibility, which are the names we use to say there's no unobserved confounder.
Now I pick another BN, another layer one object, that also fits the data, not with the arrow from X to Y but from Y to X, and I ask what the causal effect of X on Y would be; this graph is still compatible with the data. It turns out that by the semantics of causal intervention, the do, you'll be cutting the arrow coming into X, because we're the ones controlling this variable, which gives P of Y given do X equal to P of Y.
This highlights that the two graphs give different answers, so there's not enough information about the underlying SCM in a BN to allow causal inference. That is to say, a layer one object is not good enough; the constraints need to come from the SCM. This is not the object we're looking for.
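A short Python sketch of why the direction matters even though both Bayes nets fit the data equally well, using simulated data and the two causal readings just described; the numbers are hypothetical.

import numpy as np

rng = np.random.default_rng(3)
n = 500_000

# Data over (X, Y) with a strong association; the Bayes nets X -> Y and
# Y -> X represent this joint distribution equally well.
x = rng.integers(0, 2, n)
y = np.where(rng.random(n) < 0.8, x, 1 - x)     # Y agrees with X 80% of the time

# Reading the graph causally as X -> Y (no unobserved confounder), back-door
# admissibility gives P(Y=1 | do(X=1)) = P(Y=1 | X=1).
print("X -> Y:  P(Y=1|do(X=1)) =", y[x == 1].mean())     # about 0.8
# Reading it as Y -> X, the intervention cuts the arrow into X, so
# P(Y=1 | do(X=1)) = P(Y=1).
print("Y -> X:  P(Y=1|do(X=1)) =", y.mean())             # about 0.5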
Now I would like to consider a second object, which is a layer two kind of graphical model. In the chapter we define it more formally, and I'll just sketch it here: it's possible to encode layer two constraints coming from the SCM, the idea of the asymmetry of causal relations, and we'd like to focus on this one now.
Now, the idea is that there are positive instances where we can do cross-layer inferences. Let's consider the true graphical model. The mental picture I'd like you to construct is the following. Suppose this is the space of all structural causal models. Here are the models compatible with the graph G, the true graphical model. These are the SCMs compatible with P of V, that could generate this observed distribution. And here, in the intersection of these two sets, are the models in question, and we ask whether they all give the same P of Y given do X.
What I'm saying is that there are situations where, for any two structural models encoding this unobserved nature, let's call them N1 and N2, such that they have the same graph G, that is, G of N1 is equal to G of N2, if they generate the same P of V, the same observed distribution, then they will generate the same causal distribution. That's the notion of identifiability. It is possible to get it in some settings.
Now let me try to summarize what I've said so far about the relationship between reality, that is, the underlying mechanisms that we don't have, our model of reality, which could be a graphical model, for example, and the data. We started from a well-defined world, semantically speaking, in which an SCM, a pair of F and P of U, the mechanisms and the distribution over the exogenous variables, implies the PCH, which captures different aspects of the underlying nature and types of behavior: layers one, two, three.
We acknowledge that the collection of mechanisms is out there, but inference is limited, given that the SCM is almost never observable or observed; due to the causal hierarchy theorem, we have this constraint about how to move across the layers.
Now we move towards scenarios in which partial knowledge of the SCM is available, such as a causal graph, a layer two object. Causal inference theory helps us determine whether the targeted inference is allowed. In the prior example, the inference is from layer one to layer two: namely, trying to understand whether the graph together with P of V, the layer one distribution, allows us to answer P of Y given do X.
One observation here: sometimes this is not possible. If you have a weak model, the mental picture is that sometimes the true model is generating this green guy here, this distribution, but there's another model that has the same graph G, induces the same observational distribution, and generates a different quantity, call it P star of Y given do X. Then we're in a situation where we cannot do the inference about the higher layer just with layer one data.
Now, I'd like to spend two minutes just doing a summary of how reinforcement learning fits into this picture. I spent three hours last week at ICML talking about that; go to crl.causal.net if you want the details. I'll give you two minutes on what happened there.
This is the PCH. Now, my comment is that typical RL is usually confined to layer two, or a subset of layer two, and usually you cannot move from layer one; you cannot leverage the data that is from layer one, or only very rarely. And this RL doesn't support us in making statements about counterfactuals, the layer three type of quantities. That's the global picture.
This is the canonical picture of RL. You have an agent that's embedded in an environment. The agent is a collection of parameters. The agent observes some kind of state, commits to an action, and observes a reward. There's a lot of discussion about model-based versus model-free. I'd like to say that all the model-based RL mentioned today in the literature is not causal model-based. It's important not to get confused; you can ask me more later.
The difference from the causal reinforcement learning perspective, and I spent almost three hours discussing this in the tutorial, is that now, officially, the collection of mechanisms we just studied, the structural causal model, is the model of the environment, and on the agent side you have the graph G. Now, the two key observations: the environment and the agent are tied together through this pair, the SCM on the environment side and the causal graph on the agent side, which defines different types of actions or interactions following the PCH, meaning that observing, experimenting and imagining are the different modes.
Please check CRL.causal.ai for more details. And this slide, which we can check later, talks about different types of tasks that we weren't acknowledging before.
I'd like to move quickly and spend 30 seconds discussing how deep learning fits into this picture. Here's the same picture I had before, from about ten slides ago: on the left side the observational world, and on the right side the interventional world.
Now, this is about reality versus model. Reality is an abstraction, but we have data, and we can sample from the data. This allows us to get the hat distribution, P hat, and we have results saying that the distance between the hat distribution and the original distribution keeps decreasing as we get more samples, which makes it sensible to operate in terms of the hat distribution. For sure, we can use some kind of formalism to try to learn the hat distribution, including a deep network variant.
Now, the challenge is that usually we're interested in inference on the right side, and you have zero data points on the right side. I'm talking broadly, not about reinforcement learning; reinforcement learning has its own problems. But you have zero data here. Now how on earth can you learn the hat distribution on this side?
Some people simply connect the DNN that was learned from the left side to the right side. But there's nothing in this data, nor in the deep net, that takes into account the structural constraints that we discussed, nor the causal hierarchy theorem. It makes no sense to connect them.
I could talk for one hour if you invite me to talk about neural nets and causal inference, but this is the picture I want to use to start the conversation.
I would like to conclude, and apologies for the short time; it's a very short talk, and thanks for the opportunity.
Now, let me conclude. Causal inference and AI are fundamentally intertwined; novel learning opportunities emerge when this connection is fully understood. Most of the approaches toward general AI today are orthogonal to the causal machinery available, and we're not even touching these problems in the approaches toward general AI, including deep learning and the huge discussions we're having in reinforcement learning.
In practice, failure to acknowledge the distinct features of causality almost always leads to poor decision-making and superficial explanations.
The broader agenda we've been pursuing for almost 10 years now is developing a framework of principled algorithms and tools for designing causally sensible AI systems, integrating the three PCH layers (observational, interventional and counterfactual) in terms of data, modes of reasoning, and knowledge.
And my belief, my strong belief, is that this will lead to a natural treatment of human-like explainability, given that we're causal machines, and of rational decision-making.
I would like to thank you for listening. This is joint work with the Causal AI Lab at Columbia and collaborators; thanks to Juan, Sanghack, Kai-Zhan, Judea, Andrew, Duligur and Thomas, and all the others. It's a huge effort. Thanks. I'll be glad to take questions.
CHENG ZHANG: Hello, everyone.
I'm Cheng Zhang from Microsoft
Research UK. Today I'm going to
talk about a causal view on the robustness of neural networks.
Deep learning has been very
successful in many applications.
However, it's also vulnerable.
So let's take our favorite digit classification task, for example. Deep learning can achieve 99 percent accuracy on it. This is impressive. However, if we just shift the image a little bit, not much, with the shift range no more than 10 percent, the accuracy will drop to around 85 percent, which is already not satisfying for applications.
If we enlarge the shift range to 20 percent, the accuracy will drop to half, which is not acceptable anymore. The plot shows that the more we shift, the worse the performance. This is not desired, especially with minor shifts.
Okay.
Now we would like to be robust. Let's vertically shift the images in the training set as well. This is a type of adversarial training setting: we add vertical shifts of up to 50 percent to images in the training set. You can see that the performance is much better, with about 95 percent accuracy even when we shift up to 50 percent.
But did we solve the problem now? What if I didn't know that it would be vertical shifting at test time, and instead it had been horizontal shifts? For example, we use horizontal shifts in the training data, and then during testing time we test images with vertical shifts as before. The line here shows the performance under vertical shifts. It is actually even worse than training with clean data only.
So adversarial training does not solve the robustness problem in deep learning, because it could even harm the robustness to unseen manipulated images. And we will never know all possible attacks.
This is a real issue in deep learning. This is a simple task of classifying handwritten digits. How about healthcare or policymaking, where decision quality is critical?
But humans are very good at this task. We can still recognize the digit if it has shifted a little bit or if the background changes, because we're very good at causal reasoning. We know that a shift or a background change does not change the digit number, or the identity of the object. This is a property of the causal model, and it is also sometimes referred to as the independent mechanisms assumption.
So the causal relationship from the previous example can be summarized in this way. The final observation is an effect of three types of causes. One is the digit number, another is the writing style, et cetera, and the last one is the different manipulations, such as shift or rotation.
The same applies to the other example: the observation of a cat depends on whether it is a real cat, on its fur color and other features, and on different environments, such as different views and backgrounds.
We use Y here to denote the target of the task, Z to denote the factors that cannot be manipulated, and M to denote the factors that can be manipulated manually. So M here is the factor we'd like to be robust to. Before diving into the robustness details, let's review what a valid attack is.
We have seen the shifts with the digits, and the background change with the cat. Another common thing is to add a bit of noise, as we see here: with a very small amount of noise, we can fool a deep learning model into misclassifying the image. We can rotate the image and even add stickers sometimes. It has also been pointed out that noise can fool humans; the left image looks more like a dog than a cat to me. The question is: is this still a valid attack? What type of change, and how much change, can we consider to still form a valid attack?
We'd like to define a valid attack through a causal lens. Let's take the previous example from a causal view. We can see that a valid attack is generated from an intervention on M; together with the original Y and Z, it produces the manipulated data X.
In general, valid attacks should not change the underlying Y, because this is the target. Thus we cannot intervene on the target Y, or on the parents of Y if Y has parents. Z is also not allowed to be intervened on by our definition, such as the genetic features of a cat or the writing style of the digit itself. In this regard, recent adversarial attacks can be considered as specific types of intervention on M, such as adding noise or otherwise manipulating the image. In this way, the target of the learned predictor stays unchanged.
So the goal of robustness in deep learning is to be robust to both known manipulations and unknown manipulations. Adversarial training can help with the known manipulations but not with the unknown ones. Our question is how to make predictions that can adapt to potentially unknown manipulations, as in the shifted digit example.
In this work we propose a model named the deep causal manipulation augmented model; we call it Deep CAMA. The idea is to create a deep learning model that is consistent with the underlying causal process. In this work, we assume that the causal relationships among the variables of interest are provided. Deep CAMA is a deep generative model.
Let's quickly recall the deep generative model called the variational auto-encoder. The variational auto-encoder bridges deep learning and probabilistic modeling and has been successful in many applications. The graphical model is shown on the left. From a probabilistic modeling point of view, we can write down the model as it factorizes, shown on the right-hand side of this equation. We learn the posterior using variational inference.
In particular, we can introduce a variational distribution Q and try to minimize the divergence between the P and the Q. We can follow the standard steps and form the evidence lower bound, which we call the ELBO, and optimize the evidence lower bound to get the posterior estimate.
Different from traditional probabilistic modeling, every link in the graphical model on the left is a deep neural network. This becomes an auto-encoder where we try to reconstruct the X through the stochastic latent variable. This can be learned in a standard deep learning framework with the loss given by the evidence lower bound we just showed.
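As a minimal PyTorch sketch of a variational auto-encoder along these lines, with hypothetical sizes; it only shows the encoder, decoder and ELBO structure, not any particular published architecture.

import torch, torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h), nn.ReLU(), nn.Linear(h, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, h), nn.ReLU(), nn.Linear(h, x_dim))

    def elbo(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()       # reparameterize
        logits = self.dec(z)
        recon = -nn.functional.binary_cross_entropy_with_logits(
            logits, x, reduction="none").sum(-1)                   # E_q[log p(x|z)]
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1) # KL(q(z|x) || p(z))
        return (recon - kl).mean()

vae = VAE()
x = torch.rand(32, 784).round()          # hypothetical batch of binarized images
loss = -vae.elbo(x)                      # training minimizes the negative ELBO with SGD/Adam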
So CAMA is also a deep generative model. Instead of the simple factorization like on the left, our model is factorized in a causally consistent way; you can see it on the right-hand side. The model is consistent with the causal relationship that we saw before.
Next, let's see how we can do the inference. First, suppose we only have the clean dataset, which means the dataset without any augmentation or adversarial examples. From a causal lens, this is the same as do M equals clean. Now we translate it to the Deep CAMA model: we can use the value zero to indicate clean data. Thus we set M to be 0, and we can consider it to be observed. We only need to infer the latent variable Z in this case. For the variational distribution, instead of conditioning only on X as in the traditional variational auto-encoder, in CAMA we condition on X, Y and M together to define the variational distribution. We follow the same procedure and form the evidence lower bound, the ELBO, shown below. As M is a root node, we have do M, and the do calculation can be written as conditioning.
In the adversarial training setting, we may have manipulated data in the training set as well. In this case we may not know the manipulation, so we treat M as a latent variable, and we need to infer both M and Z. Thus we have the variational distribution Q of Z and M conditioned on X and Y. We can derive the evidence lower bound in this form. Finally, with both clean and manipulated data in the training set, the final loss is a combined form with the corresponding losses for the clean data and the manipulated data, as shown before.
Here D is the clean subset of the data and D prime is the subset of the data which is manipulated. This is the adversarial training setting using CAMA. In this way, CAMA can be used either with only clean data or with clean and manipulated data together.
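Here is a hypothetical PyTorch sketch of how the two objectives could fit together; the layer sizes, the split of the decoder into a (Y, Z) branch and an M branch, and the Gaussian variational families are assumptions for illustration rather than the paper's exact architecture.

import torch, torch.nn as nn

x_dim, y_dim, z_dim, m_dim, h = 784, 10, 32, 32, 256

# Decoder p(x | y, z, m) built from two merged branches, mirroring the causal
# factorization; q_z infers Z when M is observed (clean data, M = 0), and
# q_zm infers both Z and M when the manipulation is unknown.
dec_yz = nn.Sequential(nn.Linear(y_dim + z_dim, h), nn.ReLU(), nn.Linear(h, x_dim))
dec_m = nn.Sequential(nn.Linear(m_dim, h), nn.ReLU(), nn.Linear(h, x_dim))
q_z = nn.Sequential(nn.Linear(x_dim + y_dim + m_dim, h), nn.ReLU(), nn.Linear(h, 2 * z_dim))
q_zm = nn.Sequential(nn.Linear(x_dim + y_dim, h), nn.ReLU(), nn.Linear(h, 2 * (z_dim + m_dim)))

def kl(mu, logvar):                      # KL(q || N(0, I)) per example
    return 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1)

def draw(mu, logvar):                    # reparameterized Gaussian sample
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

def log_px(x, y, z, m):                  # log p(x | y, z, m) with Bernoulli pixels
    logits = dec_yz(torch.cat([y, z], -1)) + dec_m(m)
    return -nn.functional.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)

def elbo_clean(x, y):                    # do(M = clean): condition on M = 0
    m = torch.zeros(x.size(0), m_dim)
    mu, logvar = q_z(torch.cat([x, y, m], -1)).chunk(2, -1)
    return log_px(x, y, draw(mu, logvar), m) - kl(mu, logvar)

def elbo_manip(x, y):                    # manipulation unknown: M is latent
    mu, logvar = q_zm(torch.cat([x, y], -1)).chunk(2, -1)
    zm = draw(mu, logvar)
    return log_px(x, y, zm[:, :z_dim], zm[:, z_dim:]) - kl(mu, logvar)

# Combined loss over a clean batch and (if available) a manipulated batch:
xc = torch.rand(8, x_dim).round()
yc = torch.eye(y_dim)[torch.randint(0, y_dim, (8,))]
loss = -elbo_clean(xc, yc).mean()        # plus -elbo_manip(xm, ym).mean() for D prime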
The final neural network architecture is shown here: the encoder and decoder are shown on the right. The decoder network corresponds to the solid arrows in the graphical model on the left side, and the encoder network corresponds to the dashed lines. The inference network helps us compute the posterior distribution of M and Z.
At test time, we'd like the model to be robust to unseen manipulations, and we want to learn them at test time. We keep the network representing the generative process from Y and Z to X fixed, and we fine-tune the part of the network for the new M and how M influences X. In this way the network can learn a new, unseen manipulation. At test time the label is not known, so we don't know the Y. Thus we need to marginalize over Y and optimize the fine-tuning loss to adapt to the unseen manipulation. For prediction, we use the posterior of Y via Bayes' rule.
We can see that CAMA is designed in a causally consistent way, where we can efficiently train the model following a similar procedure to the variational auto-encoder. We can also fine-tune the model at test time to unseen manipulations and make predictions.
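Continuing the hypothetical sketch above, test-time adaptation could look roughly like this: freeze the mechanism from Y and Z to X, adapt only the parts tied to M, and marginalize over the unknown label on the unlabeled test batch; the uniform label weighting and optimizer settings are assumptions.

import torch

for p in dec_yz.parameters():
    p.requires_grad_(False)              # fix the mechanism from Y and Z to X
opt = torch.optim.Adam(list(dec_m.parameters()) + list(q_zm.parameters()), lr=1e-3)

x_test = torch.rand(8, x_dim).round()    # hypothetical unlabeled, manipulated images
labels = torch.eye(y_dim)

for step in range(50):
    # The label is unknown at test time, so average the manipulated-data ELBO
    # over all candidate labels (weighted uniformly here for simplicity).
    elbos = torch.stack([elbo_manip(x_test, labels[c].expand(x_test.size(0), -1))
                         for c in range(y_dim)])
    loss = -elbos.mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Prediction afterwards uses the posterior over Y: for each image, pick the
# class c with the highest elbo_manip(x, y = c), an approximate Bayes rule.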
Next, let's see the performance of CAMA. First, let's use only the clean data, as in the setting at the beginning of this talk.
We bring back the blue curve from the first slide, which is the regular deep neural network. Our method without fine-tuning is shown in orange. With fine-tuning on the corresponding manipulation at test time, shown as the green curve in figure A, we can see a significant improvement in performance. We can also see, with fine-tuning on a different manipulation in the middle panel, that the performance does not drop, unlike the traditional neural network.
This is thanks to the fact that we fixed the mechanism from Y and Z to X. Fine-tuning on one type of manipulation does not affect the robustness to other types of manipulation, which is desired.
In the middle panel the fine-tuning was done on the horizontal shift and the testing was on the vertical shift.
Furthermore, we use different percentages of the test data for fine-tuning. We see that the more data we use for fine-tuning, the more robust the performance on the unseen manipulation. More importantly, we can see that with only a little more than 10 percent of the data we already obtain very good performance, which means the fine-tuning procedure uses data efficiently.
We also tested our method against popular gradient-based adversarial attacks, in particular the fast gradient sign method (FGSM), shown on the left, and the projected gradient descent (PGD) attack, shown on the right; a minimal sketch of such an attack is shown below.
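For context, FGSM perturbs each input one step in the direction of the sign of the loss gradient; a minimal sketch, assuming any differentiable `classifier` and an attack budget `epsilon`, follows. PGD simply iterates such steps with a projection back into the allowed perturbation set.

```python
# Minimal FGSM sketch for context; `classifier` and `epsilon` are assumptions.
import torch
import torch.nn.functional as F

def fgsm_attack(classifier, x, y, epsilon=0.1):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(classifier(x_adv), y)
    loss.backward()
    # One step in the sign of the gradient, then clip back to the valid pixel range.
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0.0, 1.0).detach()
```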
The blue curve is the traditional deep learning method, which is very vulnerable. The orange one is the CAMA model without fine-tuning, and the green one is the one with fine-tuning.
We can see that CAMA with fine-tuning is much more robust, even to gradient-based attacks. The red line shows the clean test performance after fine-tuning, which means fine-tuning does not deteriorate the clean data performance either. So the improvement in the robustness of the CAMA model compared to the traditional model is significant under gradient-based attacks. In the adversarial training setting we obtain the same results as with the clean data we saw before. See our paper for more results; I will not repeat them here.
Moreover, I would like to point out that our method obtains a natural disentanglement, because we model Z and M separately, and we can apply the do operation to create counterfactual examples. Figure A shows some examples that are vertically shifted in the training data. After fitting the data, we can apply the do operation, setting do(M = 0), and generate new data, which is shown on the right-hand side. You can see that we can shift the image back to the centered location.
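As an illustration, generating such a counterfactual with a fitted model amounts to encoding the shifted input, intervening on M, and decoding; the sketch below reuses the hypothetical `enc_m`, `enc_z`, and `dec_x` modules from before and is not the paper's actual code.

```python
# Illustrative do(M = 0) generation with a fitted Deep CAMA-style model
# (enc_m, enc_z, dec_x are the same hypothetical modules as above).
import torch

def generate_do_m_zero(model, x, y):
    with torch.no_grad():
        mu_m, _ = model.enc_m(x)              # inferred manipulation, q(M | X)
        mu_z, _ = model.enc_z(x, y, mu_m)     # manipulation-free content, q(Z | X, Y, M)
        m_clean = torch.zeros_like(mu_m)      # intervene: do(M = 0)
        return model.dec_x(y, mu_z, m_clean)  # e.g. the image shifted back to center
```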
Now we have shown that CAMA works well in the image classification case; how does it work in the general case, for example with many variables and more causal relationships, such as the one shown in the picture? For example, with the bell ringing there can be multiple causes, and whether it is heard or not can be caused by multiple factors, for example, was it ringing loudly enough, was the device broken or not. Can we use CAMA in this case? The answer is yes.
We can have a generalized Deep CAMA in this setting. We consider the Markov blanket of the variable of interest and construct a deep neural network model that is consistent with the causal relationships. With target Y, we put all the variables in their corresponding locations, as either ancestors A, children X, or co-parents C, consistent with the causal relationships. We introduce Z in the same way, where Z represents hidden factors which cannot be intervened on, and M represents hidden manipulations. We also extend the inference and fine-tuning methods in the same way for this generalized Deep CAMA model; a sketch of the generative process is shown below.
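As a rough sketch of that generalized generative process, under the factorization p(A) p(Y | A) p(C) p(Z) p(M) p(X | Y, C, Z, M) and with hypothetical conditional networks `p_y` and `p_x` (the paper's exact parameterization may differ), sampling could look like this.

```python
# Illustrative generative process for the generalized Deep CAMA graph:
# ancestors A -> Y; then Y, co-parents C, latent Z and manipulation M -> children X.
import torch

def sample_generalized_cama(p_y, p_x, a, c, z_dim, m_dim, clean=True):
    n = a.size(0)
    z = torch.randn(n, z_dim)                                      # p(Z): non-intervenable factors
    m = torch.zeros(n, m_dim) if clean else torch.randn(n, m_dim)  # do(M = clean) or p(M)
    y = p_y(a)                                                     # p(Y | A)
    x = p_x(y, c, z, m)                                            # p(X | Y, C, Z, M)
    return y, x
```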
For the experiment we use a dataset with a complete causal relationship, and we shift the children variables for testing.
Again, the blue line is the
baseline and the orange line is
the one without fine tuning and
the green one is the one with
fine tuning.
We can see that this generalized Deep CAMA is significantly more robust. The red line shows the clean data performance after adapting to the manipulation, and we can see that clean data performance remains high even after the model adapts to unseen manipulations.
The same holds for gradient-based adversarial attacks. The attack can be on both children and co-parents, while the target Y remains the same. Comparing the green line and the orange line to the baseline, which is in blue, our method is significantly more robust to gradient-based attacks.
Last, you may ask, what if we don't have the causal relationship? Until now we have always assumed that the causal relationship is given. In general, there are many methods for causal discovery from observational data and interventional data, so given the dataset, you can use different tools to find the causal relationship. A good review paper by Clark Glymour and colleagues, published last year, summarizes different types of causal discovery methods. I have also done some research on this topic myself.
However, to be honest, causal discovery is a challenging problem, and it may not be perfect all the time. What if the causal relationship that we use is not completely correct? There may be small errors. So here we performed experiments to study this, on data with many variables.
The blue line is the baseline and the orange line is the case where the causal relationship is perfect. The different colored lines show different degrees of misspecification in the causal relationship. In this experiment we have ten children variables in total, and we give them different degrees of misspecified causal relationships. The green line is the case where two variables are mis-specified in the causal relationship, and the red line is the case where four variables are mis-specified.
We see that with a mis-specified causal relationship, the performance drops compared to the ideal scenario. However, if it is mis-specified only by a small fraction, we can still obtain more robust results compared to the baseline. So it is helpful to consider a causally consistent design even when we are not given the perfect causal relationship.
To conclude, I would like to summarize my talk. I presented a causal view on model robustness and a causally inspired deep generative model called Deep CAMA. Our model is manipulation-aware and robust to unseen manipulations, and it can be trained efficiently with or without manipulated data.
Please contact me if you have
any questions.
Thank you very much.
AMIT SHARMA: We're back live for the panel session. One of the questions that was asked a lot during the chat was about model mis-specification, and model mis-specification can happen in two ways. One is that while we're thinking about the causal assumptions, we may miss something. So there could be, for example, an unobserved confounder. And the other way could be when we build our statistical model, we might parameterize it too simply or make it too complex and so on.
So maybe this is a question for both Susan and Elias: how do you reconcile with that? Are there tools that we can use to detect which kind of error is happening, or can we somehow give some kind of confidence intervals or guarantees for when we are worried that such errors may occur?
So maybe, Susan, you can go
first.
SUSAN ATHEY: Sure. That's a great question. And it's definitely something I worry about in a lot of different aspects of my work. I think one approach is to exploit additional variation. So I guess we should start from the fact that in general, in many of these settings, these models are just identified. So there's a theorem that says that you can't detect the presence of the confounder without additional information. But sometimes we do have additional information. So if you have, say, multiple experiments, you can exploit that additional information. And so in one of my papers we do an exercise where we look at certain types of violations of our assumptions and see if we can accept or reject their presence.
So, for example, one thing that we worried about was there might be an upward trend over time in demand for a product that might coincide with an upward trend in prices. So we were already using things like week effects and throwing out products that had a lot of seasonality, but still our functional form might not capture everything.
And so we did these exercises
called placebo tests where you
put in fake price series that
are shifted up or shifted back
and then try to assess whether
we actually find a treatment
effect for that fake price
series, and then we had 100
different categories so we could
test across those hundred
categories, and we found
basically a uniform distribution
of test statistics for the
effect of a fake price series,
which sort of helped us convince
ourselves that at least like
these kind of overall time
trends were not a problem.
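As a schematic of that placebo exercise (not the actual analysis: `estimate_price_effect_pvalue` stands in for whatever per-category demand estimator is used, and the column name is hypothetical), it could look roughly like this.

```python
# Schematic placebo test: shift the price series in time to create a "fake"
# price change that should have no causal effect, then check that the
# per-category p-values look uniform. Names below are hypothetical.
import numpy as np
from scipy import stats

def placebo_pvalues(category_frames, estimate_price_effect_pvalue, shift_periods=8):
    pvals = []
    for df in category_frames:                  # one pandas DataFrame per category
        fake = df.copy()
        fake["price"] = fake["price"].shift(shift_periods).bfill()
        pvals.append(estimate_price_effect_pvalue(fake))
    return np.array(pvals)

# Under the identifying assumptions, the ~100 placebo p-values should be
# roughly uniform on [0, 1]; a Kolmogorov-Smirnov test is one way to check:
#   stats.kstest(placebo_pvalues(frames, estimator), "uniform")
```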
But that was designed to look at
a very specific type of
mis-specification.
And in another setting, there
might not be an exact analog of
that.
Another thing that I emphasize
in my talk was trying to
validate the model using test
data which again was only
possible because we had lots of
price changes in our data.
And so those types of validation
exercises can also kind of let
you know when you're on the
right track because if you have
mis-estimated price sensitivities, then your predictions about differences in behavior between high- and low-price-sensitivity people in the test set won't be right.
But broadly, this issue of
identification, the fundamental
assumptions for identification
and testing them is challenging.
One of the most common mistakes
I see from people from the
machine learning community is
sort of thinking that, oh, well,
I can just test it this way or
test it that way without
realizing actually in many cases
there's a theorem that says even
infinite data would not allow
you to distinguish things.
So you have to start with the
humbleness that there are
theorems that say that you can't
answer some of these questions
directly.
You need assumptions.
But sometimes you can be clever
and at least provide some data
that supports your assumptions.
So maybe I can come back to the functional forms later and let Elias take a crack at the first question, because there's a completely separate answer for functional forms.
Go ahead.
You're muted.
ELIAS BAREINBOIM: There we go.
Can you hear me?
AMIT SHARMA: Yes.
ELIAS BAREINBOIM: Cool.
Thanks, Susan.
Thanks, Amit.
So, the question was about model mis-specification in machine learning.
My first comment, and this is very common, is that people try to use this idea of the training and testing set paradigm, as I like to call it, to validate a causal model or a causal query, or to verify it. It makes no sense in causality. As I summarized in my talk, usually the type of data that we have in the training and testing data is layer one, that is, observational data, and we're trying to make a statement about another distribution, the experimental one. Then there's no way that training and testing on one distribution can tell you about the other, at least not naively, or not in general. And this is the first comment.
The second one: I think the interesting scenario, as I mentioned in the chat earlier, is, before reinforcement learning, in the observational setting, trying to get the testable implications of your causal model, conditional independence and equality constraints and other types of constraints, to try to validate the model. I think this would be the principled approach, using testable implications. A lot of people doing causal inference are trying to understand what these kinds of constraints are that we usually have, and then you can submit them to some type of statistical test.
Now moving to reinforcement learning, which is a more active setting, and quite interesting. In this setting we are already taking decisions; we are already randomizing and controlling the environment. The very goal of doing that, going back to Fisher perhaps 100 years ago, was to avoid the unobserved confounding that originated the question. So reinforcement learning is good for that, and if you have something wrong, although I can be super critical about that, many times the effects of having it wrong will wash away.
I think that's another nice idea in the reinforcement learning setting that we're pursuing, and I think other people should think about it too: how can you use the combination of these different datasets not only for decision-making itself but to try to validate the model, to see which parts of the model are wrong. And there are different types of tests, usually very unconventional, about how to triangulate the observational and the different types of experimental distributions in order to detect the parts of the model that have problems.
My last note here, my last idea, is just to do sensitivity analysis. We don't have so many methods for this; there are some good initial ones, but not so many tailored in particular to the causal inference problem. I think that's a very good area, with some initial work, and I think it is very promising; we'll talk about future frontiers, but for now I think more people should do sensitivity analysis. I pass the ball back to Amit.
AMIT SHARMA: Sure, yeah, right.
I think it's a fundamental
distinction between
identification and estimation,
right. And I think maybe,
Susan, maybe you can talk about
the statistical
misspecification.
SUSAN ATHEY: So the functional
forms.
Right.
So in econometrics, we often
look at nonparametric
identification and look at
things like semi-parametric
estimation.
So you might think, for example, in these choice problems I was talking about, we had behavioral
assumptions that consumers were
maximizing utility. We had
identification assumptions which
basically say that whether the
consumer arrived at the store
just before, just after the
price change was as good as
random.
And so the price was -- within a
period of two days -- was
randomly assigned to the
consumer.
That's kind of the
identification assumption.
And then there's a functional
form assumption which is
type one extreme value which allows
you to use the multinomial logit
formulation. That functional form assumption is incredibly convenient because it tells you
if one product goes out of stock
I can predict how you are going
to redistribute purchases across
substitute products. It's going
to allow you to make these
counterfactual predictions and
it's very efficient.
If I change one, if I have one
price sensitivity, I can learn
that on one product and apply it
to other products as well.
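To illustrate why that functional form is so convenient, here is a tiny multinomial logit sketch with made-up numbers (the intercepts, the shared price coefficient, and the prices are all hypothetical): dropping an out-of-stock product from the choice set mechanically redistributes its share across the substitutes, and the single price coefficient learned on one product applies to the others.

```python
# Tiny multinomial logit illustration; alpha, beta and prices are made up.
import numpy as np

def choice_probs(alpha, beta, prices, available=None):
    util = alpha + beta * prices                      # deterministic utility
    if available is not None:
        util = np.where(available, util, -np.inf)     # remove out-of-stock products
    expu = np.exp(util - util[np.isfinite(util)].max())
    return expu / expu.sum()

alpha = np.array([1.0, 0.5, 0.2])                     # product intercepts
beta = -2.0                                           # shared price sensitivity
prices = np.array([2.0, 2.2, 1.8])

p_all = choice_probs(alpha, beta, prices)
# If product 0 goes out of stock, the logit form predicts how its purchases
# get redistributed across the remaining substitutes:
p_stockout = choice_probs(alpha, beta, prices, available=np.array([False, True, True]))
```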
Those types of things are
incredibly efficient, and
they've been shown to be
incredibly useful for studying
consumer choice behavior over
many decades.
But there's still functional
form assumptions.
So there are also theorems that
say actually in principle you
can identify choice behavior
even if you don't assume the
type one extreme value, don't
assume this logit formulation.
But then you need a lot of
variation in prices in order to
trace out what the distribution
of your errors really are and to
fully uncover the joint
distribution of all of the
shocks to your preferences, you
would need lots of price
variation and lots of products
over a long period of time.
So theoretically, you can learn everything without the functional form assumptions, but in practice it's not feasible. So you're always going to be relying on some functional form assumptions in practice, even though theoretically you can identify everything nonparametrically with enough price variation.
So then it comes to sensitivity
analysis.
You want to check whether your
results are sensitive to these,
to the various assumptions
you've made, and that becomes
more of a standard exercise.
But I think it's really helpful
to frame the exercise by first
saying, is it even possible to
answer these questions; and what
would you need? And many
problems are impossible.
And just as Elias was saying, if
you have a confounder in your
training set, you're also going
to have one in your test set,
and just splitting test and train doesn't solve anything.
So you have to have a
theoretical reason why you think
that you're going to be able to
answer your question.
AMIT SHARMA: Makes sense.
I have a similar question for Cheng as well, in the sense that it would be great if we had a training method that is robust to all adversarial attacks, but obviously that'll be difficult. There are some assumptions you're making in the structure of your causal model itself, in the Deep CAMA method. So my question to you is how sensitive it is and what kinds of attacks can be... What will your model be robust to? But I'll also throw in a more ambitious question. Is it possible to formally define the class of attacks to which a causal model may be robust?
CHENG ZHANG: So I think the key here is how we can formulate the attack in a causal way. For some attacks it's very easy to formulate in a causal way; for example, shifting is a manipulation, just another cause for the input you're observing. But for some attacks it's trickier to formulate in a causal way, for example gradient-based attacks, and especially multi-step gradient-based attacks. Then it becomes a causal model over time, with cycles, as the underlying model. So I think if you can formulate it properly as a causal model and design a model that is consistent with it, then we can be robust to the attack, but not all cases are so easy, and there can be technical challenges when there are cycles over time for certain types of attacks. So in general it's always good to consider more causality, but how difficult it is, how many assumptions you have to make, and to what degree you are willing to risk violating the assumptions, I think that depends on the situation you're in.
AMIT SHARMA: Yeah, that makes
sense.
And I think there's one more question I want to ask, and maybe this will be the last question live.
So we talked about really
interesting applications of
causality.
So, Susan, you talked about sort
of the classic problem of price
sensitivity in economics.
Elias, you briefly talked about reinforcement learning, and Cheng about adversarial attacks.
These are interesting ideas that
we have seen.
I wanted to ask you to look into the future a bit, maybe a few years. What are the areas or applications that you're most excited about, or where you think this amalgamation of causality and machine learning is poised to help and may have the biggest impact? Susan, do you want to go first?
SUSAN ATHEY: That's a good
question.
So one thing that I'm working on a lot in my lab at Stanford is personalization of digitally provided services, education and training, which of course, for all the partners I'm working with, have had huge uptake in the COVID-19 crisis.
So of course you can start to
attack personalization in
digital services without
thinking about causality; you
can build sort of classic
recommendation systems without
really using a causal framework.
But as you start to get deeper
into this, you realize that you
actually can do a fair bit
better in some cases by using a
causal framework.
And so, first of all, there's using reinforcement learning. I would argue that reinforcement learning is just intrinsically causal: you're running experiments, basically. But if you're trying to do reinforcement learning in a small data setting, you do want to use ideas from causal inference and also be very careful about how you're interpreting your data and how you're extrapolating. I think that at this intersection of causal inference, reinforcement learning, and smaller data environments, where the statistics are more important, you need to worry about the biases that come up with naive reinforcement learning, where you're creating selection biases and confounding in your own data. And if the statistical machinery in the reinforcement learning model isn't actually factoring everything in, you can make mistakes.
And more broadly we're seeing a
lot of the companies that I'm
working with, Ed Tech and
training tech, are running a lot
of randomized experiments. And
so we're combining historical
observational data with their
experiments. And so you can
learn some parts of the model
using the historical
observational data and use that
to make the experimentation as
well as the analysis of the
experimentation more efficient.
And so I think this whole intersection, combining observational and experimental data when you're short on statistical power, is another super interesting area that a lot of companies will be thinking about as they try to improve their digital services.
AMIT SHARMA: Elias, what do you
think?
ELIAS BAREINBOIM: Amit, thanks for the question, by the way. I think that in terms of applications, my general goal, the goal in the lab, is to build more general types of AI, I would say. That is, AI that is more human-friendly, as people say, using this name, or that makes some type of rational decisions; you can attach the label of rational decision-making to it. We have been trying to revisit these notions, and what they could mean, for the last maybe five years or so.
Because if you go to books, AI books from 20 or 30 years ago, all of them are using the same label, and they are usually not causal. Then I would say I personally don't see any way of doing general AI, or more general types of AI, I should say, without being serious about attacking causal inference front and center.
I can count on my hands how many people are doing it today, but I cannot count the number of people who are excited about it, which is pretty good. I'm excited about the excitement at the moment.
Then my primary suggestion is: just don't go around it; just try to understand what a causal model is, what causality is about, and then just do it. There's a little bit of a learning curve, but I think this is the critical path if you want to do AI, or more general types of AI.
For the two other applications that we've been working on, go to the website causalAI.net. One is causal reinforcement learning, as you mentioned; we were chatting about it before in the internal chat here. I just gave a three-hour tutorial at ICML that tries to explain my vision of this intersection of causality and reinforcement learning, so check it out, and see how causal reinforcement learning connects to all the notions of explainability and fairness and ethics.
There are many papers and works, non-technical ones... that say causality is hard, or that it is difficult to get a causal model, and so on. But it's inevitable in some way, so there's no point in postponing. If you go to court, or talk to human beings, causality is usually required in the law and in legal circles. And as humans, we're causal machines. So there's no way to go around it. I'd like to see more people work on it, including Microsoft for sure.
Microsoft was the leader, by the way, in the Bayes net revolution of the early '90s, which ran through the '90s and into the early 2000s, I think, and which pushed the limits a lot, with descendants today including variational auto-encoders and so on. Still, I'd like to see much bolder steps from Microsoft. ...Eric Horvitz and David Heckerman, those were the two leaders. They understood it very well; they were among the developers of the theory of graphical models, of Bayes nets, in the late '80s, and they pushed that in such a good way. Now, I'm not talking about the Bayes net; that's completely different from the causal graphical model. This is my expectation for Microsoft, and I think there's huge potential... well, that's the idea.
AMIT SHARMA: Thank you, Elias.
Cheng, what sort of domains or applications are you most excited about?
CHENG ZHANG: I would like to second Susan and Elias. I think these are all interesting directions. I see great importance in considering causality in all corners of machine learning, in all directions: deep learning, reinforcement learning, fairness (and I really like your work, Amit, as well, on privacy), robustness, generalization. For a lot of current problems in machine learning, if we actually bring causality in, I really see it as the last magic ingredient to solve a lot of these drawbacks in current machine learning models. But I'd like to bring in another angle. If you think about causality as a direction of machine learning, I think in recent years a lot of the gap has been bridged, but in the early days, as I see it, the two were a little more separated. I would also say that, from the causal side, a lot of modern machine learning techniques can improve causal discovery itself, because traditionally we hear about all these theorems, proofs, identifiability and so on, and commonly we limit ourselves to a simpler class of functions.
I think in recent years there have been more advances on that front, and I also see a lot of advances in machine learning techniques that help with causal discovery. For example, a lot of the recent nonlinear ICA work from Aapo Hyvärinen actually bridges nonlinear ICA and VAEs, and self-supervised learning on time series can also help with causal discovery from observational data. I see this as a great trend. There is also recent work, for example, on how you can use active learning for causal discovery.
So I see great potential not only for causality to contribute to machine learning, but also for other machine learning methods to contribute to causality.
AMIT SHARMA: Great.
On that note, that's a wrap.
Thank you again to all the speakers for taking the time to join this session. And of course thank you to everyone in the audience for coming to the Frontiers in ML event.
We'll start again tomorrow at
9:00 a.m. Pacific.
And we'll have a session on
machine learning, reliability
and robustness.
Thank you, all.