SANDEEP PARIKH: Welcome, everybody.
Thanks for coming.
I know the day's been running a little bit long,
so I will hopefully not keep you too long.
Welcome to the session between you and happy hour.
Hopefully, we'll get you out of here just in time
to get some drinks and get some snacks.
My name is Sandeep Parikh.
I run the cloud solutions architecture
team for Americas East, for Google Cloud Platform,
and today, I want to talk to you about building
a collaborative data platform.
Effectively, you've got teams of individuals spread
across your company, that have to work together
in some meaningful way, and you need
them to share data, talk to each other,
basically build something together.
So, how do you enable that?
So, we're going to walk through some of the ways
that we tell a lot of customers, and developers, and partners
how to do that stuff in the real world.
First, we're going to meet our hypothetical team
of individuals, and we're going to learn
about the tasks they have to do, and the tools that they
typically use.
And then, we'll jump into mapping
the tools they use to the tools available in Google Cloud,
so you can start to see how they might
get a consistent set of tools and work together.
And then, we're going to talk about how
we enable collaboration.
This is how we do things, like set up teams
to do self-service and easy data discovery.
I'm probably going to trivialize a handful of details,
but stay with me, it'll make sense as we get through it.
And then, I want to cover what the new workflow's
like once you get all this stuff up and running,
and then give you a couple little things
and tips on how to get started.
Does that make sense to everybody?
Head nods?
OK, good.
All right, so let's meet the team.
All right, so first, we've got about four individuals.
We've got a data scientist, a data engineer, a business
analyst, and an app developer.
Ultimately, they all have three things in common.
They use different data sets, those data sets
are all in different formats, and they all
use different tools, right?
So, there's not a ton of overlap here,
and that's kind of the challenge.
So, let's assume for a second, or let's imagine for a second,
we've got this team, and their job
is to build a recommendation engine for a retail site.
So, what are the kinds of things that each person on the team
is going to need to be successful
in order to do their job?
So, the first thing we'll cover is the data,
and then we'll talk about the tools.
So, from a data perspective, Alice, the data scientist,
needs kind of cleansed, normalized data.
Things that are no longer log files, right,
not just raw rows from a database,
but things that actually make sense.
And those could be, for example, stuff
like purchase history, product history, product metadata,
click streams, product reviews.
All the stuff that it would take to actually craft and model
a recommendation problem.
Then, we've got Bob, the data engineer.
He's going to have to take all of the raw data
and turn it into something useful.
So, he's probably going to need things like,
log information, product transactions or purchase
history, product metadata, as well.
This is all the stuff that's coming in
straight off the application, that he's
got to turn into something useful for other parts
of the organization.
Then, there's Ned.
Ned is the business analyst, and he
needs a lot of aggregated data, things
that he can use to generate statistics
and understanding about how the business is performing.
Not just, again, rows or log files, but again,
that next level up from, sort of, the basics.
And then, finally, we have Jan, the app developer.
She's going to need users and product predictions,
or recommendations.
She needs to be able to build new microservices that
expose things like this recommendation engine
to the end user.
So, that can be things like a recommendations API,
access to wish lists, some knowledge around similar users
or similar products, how things complement each other.
But, you can see there's a little bit of overlap here,
but not a ton, right.
They all need something different
and they kind of need it in a slightly different format.
Now, if you think about sort of a--
I tried to draw a sequence diagram here,
and it got a little complicated, so I simplified it
a little bit.
But, ultimately, Alice and Bob have to agree and talk
to each other.
Bob and Ned have to agree and talk to each other.
Ned and Alice have to talk to each other,
and Jan and Alice have to talk to each other.
You know, there's a lot of communication happening.
And, listen, I love talking to my coworkers just as much
as the next person, but if you're
spending all of your time trying to explain to them what you
need, when you need it, where it came from,
what it should look like, it becomes challenging, right.
It slows down the ability for them
to make progress and work quickly.
Ultimately, this is the problem that we
see a lot of folks having out in the world, right?
They each have to find data sets.
They each have to figure out who owns it.
They have to figure out how things are related
to each other, and they have to manage all that by themselves,
which, again, takes time away from their ultimate task.
Just by a show of hands, does this sound familiar to anybody?
Oh, that's a scary number, actually.
Maybe I shouldn't have asked that question.
No, but this is great, right?
This is how we have to do it-- we have to find and understand
what the problem is, then try to solve it together.
All right, so let's talk about tools.
That's the first step.
We need to understand what everyone's
working with in order to kind of make the next jump.
So, Alice is the data scientist, huge fan of Python,
like myself.
It's like her Swiss army knife, she uses it
for just about everything.
Typically, does a lot of her work
with IPython or Jupyter notebooks,
pulls data down, does it on her workstation,
you know, slices and dices data, uses
a lot of pretty common data science frameworks,
uses things like NumPy, SciPy, scikit-learn,
Pandas, because she has a history with,
and/or likes, data frames.
That's a typical approach, from a data science perspective.
Bob is a huge fan of Java, right?
He does lots of big ETL pipeline work,
does a lot of orchestrating data from one format to another,
or multiple sources of data together into something
that looks cohesive on the other end.
So, he's typically using tools like MapReduce or Spark.
In some cases, he might be using Apache Beam, which
you'll actually hear more about over the next couple days,
as well.
Ned is like a spreadsheet guy.
He loves SQL and he loves spreadsheets.
So, what he likes to do is build custom reports
and dashboards, very SQL-driven, got to have some kind of access
to a data warehouse, so he can do giant--
you know, kind of cube rotations and understanding.
Basically, he's got to be able to prove to the business
how products are performing, why things like a recommendation
engine are even necessary.
So, it's his responsibility to kind of facilitate that.
And then, we've got Jan, the app developer.
Jan's a typical kind of polyglot app developer.
Likes writing stuff on microservices, likes delivering
features very quickly, you know, may have a language preference
one way or the other, but has to maintain a lot of the existing
infrastructure.
So, that could be things like node apps.
That could be simple things like Python apps with Flask or Ruby
apps with Sinatra, all the way up
to like kitchen sink frameworks, like Django or Rails.
Just depends, right, whatever the use case is.
So, we've gotten that far.
So, we understand their task, we know
who all the key players are, and we know the things
that they like to use.
So, the next step for us is, how do we
figure out which tools they should
be using in Google Cloud?
All right, we can skip that slide.
So, what we'll do is let's lay out
a handful of the tools that are available.
And we're not going to cover everything that's
in GCP right now, because we don't have enough time,
and I want you guys to have snacks at some point soon.
So, what we'll do is kind of focus on the core set of things
that are critical for each of those individual roles
to have access to.
So, the first part, you know, for things like applications,
we've got Compute.
So, with things like virtual machines
or virtual instances with Compute Engine,
managed Kubernetes clusters with Container Engine,
or just kind of managed container deployments,
or Docker container deployments, with App Engine.
It's a good place to start.
On the storage front, Cloud Storage for objects,
Cloud SQL for relational data, Cloud Datastore for, you know,
NoSQL, or non-relational datasets,
and Cloud Bigtable for wide column.
And then, on the data and analytics,
or rather the more processing kind of side,
Dataproc for managed Hadoop and Spark clusters,
Datalab for IPython notebooks, BigQuery for kind
of an analytical data warehouse, and Dataflow for running
managed Apache Beam pipelines.
And then, the last column, or the last kind of group,
is machine learning.
So we've got Cloud Machine Learning,
which is kind of running TensorFlow models,
and running them at scale, things
like the natural language API, the speech API, or the vision
API.
As you start to think about the team
that we just introduced, and you start
looking at this sort of bucket of tools, of this set of puzzle
pieces, you can start to see where things
are going to kind of fit together.
There's a consistent set of tools
here that we can map back on to the team.
So let's walk through each one of those individually,
really quickly.
So, for data science workloads, Alice has a couple of options.
We already talked about the fact that she's a fan of Python,
so that maps really well.
So, with Python, she has two options.
She can either use Cloud Datalab,
or she can use Cloud Dataproc with Jupyter.
With Datalab, what she gets is a complete Python dev
environment.
You know, it's bundled in with like NumPy, SciPy, Matplotlib,
so she can kind of kick off her work, and build those charts,
and build that kind of understanding
of the data set as she would.
Additionally, on top of that though,
it's also got built-in support for TensorFlow and BigQuery.
This means she's got a complete environment to go and start
maybe prototyping models that she wants
to build with TensorFlow, if she's
trying to build a very compelling recommendation
mechanism.
Or, if she needs data that lives in BigQuery,
she can actually do inline SQL statements there, and pull data
back out, or offload queries to BigQuery, as well.
So, she's got a handful of options there.
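As a rough illustration of that inline-query idea, here is a minimal sketch using the BigQuery Python client library; the project, dataset, table, and column names are all invented for the example.

```python
# Minimal sketch: pull BigQuery data into a notebook session as a DataFrame.
# The project, dataset, table, and columns here are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-retail-project")

query = """
    SELECT user_id, product_id, rating
    FROM `my-retail-project.analytics.product_reviews`
    LIMIT 10000
"""

# Run the query in BigQuery and pull the results back as a pandas DataFrame,
# which drops Alice straight into the NumPy/SciPy/Pandas tooling she already uses.
df = client.query(query).to_dataframe()
print(df.head())
```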
The nice thing about Datalab is that it's based on Jupyter, so,
it's got a little bit of background.
So, it should look very familiar to her.
But it is somewhat constrained to just Python.
If she's got more specific needs,
or wants to include additional frameworks or kernels,
then we have to look at something like Cloud Dataproc
plus Jupyter.
So, what you can do is you can spin up a Cloud Dataproc
cluster--
again, a managed YARN cluster, effectively--
and then have Jupyter pre-installed on there
and ready to go.
So, it takes about 90 seconds to fire up a Hadoop cluster,
anywhere from like three, to like
a couple of thousand nodes.
In this case, for Alice, I think just three nodes is probably
appropriate, and her goal is to just spin this up and get
Jupyter pre-installed.
Once she's got Jupyter, then she's
back to the exact same environment
she had on her laptop, support for over 80 different languages
and frameworks or kernels.
But, the nice thing also is built-in support
for PySpark and Spark MLlib.
So, you know, if you're kind of trying
to figure out where the machine learning line sort of falls,
there's definitely a handful of more sessions
you can attend around things like TensorFlow or on Dataproc.
What I would urge you to do is, if you've
got kind of an individual like this in your organization,
is have them explore both.
TensorFlow might be appropriate.
Spark MLlib might be appropriate.
And there should be a clean distinction
between what they can each do.
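To make the Spark MLlib side of that concrete, here is a rough PySpark sketch of the kind of ALS-based recommender Alice could run on a Dataproc cluster with Jupyter installed; the input path and column names are assumptions, not anything from the talk.

```python
# Rough sketch of a collaborative-filtering recommender with Spark MLlib.
# The gs:// path and column names are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("product-recs").getOrCreate()

# Ratings derived from purchase history or reviews: user, product, score.
ratings = spark.read.parquet("gs://my-retail-data/ratings/")

als = ALS(
    userCol="user_id",
    itemCol="product_id",
    ratingCol="score",
    implicitPrefs=True,        # treat purchases/clicks as implicit feedback
    coldStartStrategy="drop",  # skip users/items unseen during training
)
model = als.fit(ratings)

# Top 10 product recommendations per user.
model.recommendForAllUsers(10).show(5, truncate=False)
```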
All right, on the data processing front, same thing.
Bob's got a couple of different options.
He can either use Cloud Dataproc or he can use Cloud Dataflow.
With Dataproc, again, we just talked about this briefly,
but managed Hadoop and Spark clusters,
the nice thing about Dataproc--
and I'm only going to spend a minute on this,
so, by all means, if you're interested we can do questions
afterwards-- but the nice thing about Dataproc
is that it turns that kind of mindset of a Hadoop cluster
or a YARN cluster, on its head.
You end up going away from having a cluster-centric view
of the world, and turning it into more of a job-centric view
of the world.
So instead of firing up a thousand-node cluster
that everybody shares, you could have
every individual job have their own thousand-node cluster.
And as long as those jobs can be done in less than 24 hours,
you can actually take advantage of things
like preemptible VMs or preemptible instances.
Those instances are about 80% off list price.
So, you get the ability to do this job dramatically faster
than ever before, because you have all this extra compute
sitting around, and you get to do it really cheaply,
because it only takes a few hours anyway.
So Bob's got that option.
He can also do things like tune the cluster parameters,
or if he's got custom JAR files he needs,
easily bootstrap those across the cluster, doesn't matter.
So, he has that option.
The other approach is to use Dataflow.
Dataflow is a managed service that we
have for running Apache Beam workloads.
Apache Beam is basically a programming model,
or an approach, that unifies batch and stream
processing into one.
When you take Beam workloads or Beam pipelines
and you push them to Cloud Dataflow, we run those for you
in a totally managed, kind of automatically scalable fashion.
So, that's pretty nice.
And Apache Beam workloads on Cloud Dataflow
actually support templates for kind
of like easy parameterization and staging
and things like that.
The way I kind of think about the relationship here of which path
you go down is really a question of what
do you have existing already.
If you've got a huge investment in Spark or MapReduce jobs,
or just kind of like Oozie workflows around those things,
by all means, go down the Cloud Dataproc route.
It's a turnkey solution.
You can take your job and just push it to a new cluster,
just like that.
If it's net new and you're starting from scratch,
I think Beam is a good approach to look at.
It's relatively new, so it's a little bit young.
It's definitely not as mature as some of the other components
in the Hadoop ecosystem.
But, it does have this really, really,
really critical advantage where you can take a batch pipeline
and turn it into a streaming pipeline
just by changing a couple of lines of input.
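As a minimal sketch of that batch-to-streaming switch, here is what a small Beam pipeline in Python might look like; the project, bucket paths, and the Pub/Sub topic mentioned in the comment are placeholders.

```python
# Minimal Apache Beam pipeline sketch, runnable on Cloud Dataflow.
# Project and gs:// paths are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-retail-project",
    region="us-central1",
    temp_location="gs://my-retail-temp/beam",
    streaming=False,
)

with beam.Pipeline(options=options) as p:
    (
        p
        # Batch source: read click logs from Cloud Storage. Swapping this one
        # transform for beam.io.ReadFromPubSub(topic=...) and setting
        # streaming=True turns the same pipeline into a streaming job.
        | "Read" >> beam.io.ReadFromText("gs://my-retail-logs/clicks/*.json")
        | "Clean" >> beam.Map(lambda line: line.strip())
        | "Write" >> beam.io.WriteToText("gs://my-retail-output/clicks-clean")
    )
```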
So Bob's got a couple of options here,
to kind of match up to what he typically likes to work with.
All right, Ned's use case is actually really simple.
He needs a data warehouse.
He needs a data warehouse that supports SQL,
and that can scale with whatever size of data he needs,
or that he's got access to, and he's
got to be able to plug it into a whole host of tools,
kind of downstream.
So, BigQuery's a great fit for him,
enterprise cloud analytical data warehouse.
It's a fully managed mechanism.
We often use the word serverless,
though I hate that term, so I apologize.
It supports standard SQL and it does scale up
to a kind of petabyte scale.
So, he has the option of running something that's
anywhere from data sets that are about gigabytes all the way up
to petabytes, and still get responses back within seconds.
BigQuery is great, because it does
support kind of batch loading, or streaming inserts, as well.
And it's got built in things like security and durability
and automatic availability, and all the other good stuff.
But, ultimately, the best part is that Ned gets to use SQL,
gets to query really, really large data sets,
and visualize and explore the data as he's used to,
with the typical tools and expertise he's got.
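To ground that a little, here is the sort of standard SQL aggregation Ned might run through the BigQuery Python client; the warehouse table and columns are made up for illustration.

```python
# Sketch of a standard SQL aggregation against a BigQuery warehouse table.
# Table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-retail-project")

sql = """
    SELECT product_category,
           COUNT(*)         AS purchases,
           SUM(order_total) AS revenue
    FROM `my-retail-project.warehouse.purchases`
    WHERE purchase_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
    GROUP BY product_category
    ORDER BY revenue DESC
"""

# The same query could just as easily feed Data Studio, Tableau, or Looker.
for row in client.query(sql):
    print(row.product_category, row.purchases, row.revenue)
```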
When it comes time to create reports and dashboards
and things like that, there's a couple of options he has.
One, he can use Data Studio, which
is built into the platform.
So, it's kind of like a Google Doc style approach, where
I can create a new report, and then I
can share that with everybody in the company
without actually having to create multiple copies.
And people can edit things like parameters
and watch the difference in those reports,
and see how they look.
But effectively, he has the ability
to create those reports and dashboards.
Alternatively, he could use things like Tableau or Looker,
or other business intelligence tools
that he's a fan of.
So, he's got a handful of options there.
And the nice thing also is that because of all this approach,
like BigQuery also supports other kind of JDBC
infrastructure, so a lot of the tooling that Ned is typically
used to can plug right into BigQuery, as well.
So, the last one is Jan.
We talked about this earlier.
Jan likes to deploy and scale microservices,
so she's got two easy options there.
If she's really, really focused on complex container
orchestration, or wants to deploy things that way,
Container Engine's a great fit.
It's based on open-source Kubernetes.
We've got built-in health checking, monitoring, logging.
We've got a private Container Registry,
or she can go down the App Engine route,
and just take her Docker containers, and push them up,
and we'll auto-scale them for her.
The good thing is that along with kind of the same health
checking, monitoring, and logging,
App Engine also includes a built-in load balancer
up front, has things like version
management, traffic splitting, as well,
and automatic security scanning.
So, again, a couple of options here
that kind of make sense for her.
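As a toy sketch of the kind of microservice Jan might deploy to App Engine or Container Engine, here is a minimal Flask recommendations endpoint; the route and the stubbed-out model lookup are hypothetical, and in practice the lookup might call whatever prediction service Alice ends up publishing.

```python
# Toy Flask microservice exposing a recommendations endpoint.
# The model lookup is a stub; a real service would call a trained model.
from flask import Flask, jsonify

app = Flask(__name__)

def lookup_recommendations(user_id, limit=10):
    """Hypothetical stand-in for a call to a recommendation model."""
    return [{"product_id": f"sku-{i}", "score": round(1.0 - i * 0.05, 2)}
            for i in range(limit)]

@app.route("/users/<user_id>/recommendations")
def recommendations(user_id):
    return jsonify({
        "user_id": user_id,
        "recommendations": lookup_recommendations(user_id),
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```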
So, that's it, right?
We're all done.
Everybody's got tools, they can all work,
and everybody is off--
off to the races.
That's not true.
We've only just kind of scratched the surface.
So, they all have a consistent set of tools to work with,
but we actually haven't enabled any of the collaboration bits
yet.
So, we still have to figure out how to get them,
not only to work together, but actually
to work in a way that makes sense, and scales up.
Right.
Their team is four people today.
It could be eight, 12, 16, you know, in a few weeks.
So, we've got to come up with a better
approach to managing the data that they need access to.
So, if you're trying to enable collaboration,
there are a handful of things you're really
going to think about.
The first is, you've got to find consistency.
And it's not just consistency in the tool set, right?
That's also important, but you also need consistency
in terms of where do you expect data to be coming from,
where do you expect data to be used downstream, right?
That's really important.
If you can't agree on the sources,
if you can't agree on what the downstream workloads are,
it's hard to really understand how
everybody should work together and use the same set of tools.
The next thing you want to do, is take a really good hard look
at all the data that you're trying to get access to.
And this isn't true for every piece
of data that lives in your organization, right?
It might be things that are focused
on certain sets of the company, like certain teams,
or certain tasks, but ultimately, you've
got to figure out where all this data lives, and take
an inventory on it, right?
Take an inventory on it, and make
sure it's in the right storage medium for everyone
to use as time goes on.
The next thing you want to do is come up
with an approach around metadata, right?
This is pretty simple, but if you're
trying to figure out who owns a piece of data
or where it came from, what other data sets it's related
to, what are some of the types of data that are located
in this data set without actually having to go query it,
that's a really challenging problem to do,
when you think about having hundreds or thousands
of individual pieces of data spread out
across your infrastructure.
Then, you want to enable discovery and self-service.
You want people to be able to go find these things by themselves
and pull them down as needed, without having to go,
again, spend all that time arguing
with people about formats, and availability,
and tooling, and worrying about where you're going to store it,
as well.
And then the last thing is security,
right? Don't leave that on the table.
It's certainly important to understand security, and more
broadly, identity, right, to make sure we're tracking access
to all these things.
All right, so how do we start with this?
I throw this in there.
You have to kind of embrace the concept of a data lake.
I'm not suggesting you have to go build one.
I know it's a totally loaded buzzword and term,
but you have to build--
you have to embrace the idea of it, on some level, right?
And this is kind of what I said earlier--
if you start at the very bottom, you first
have to understand what are all the sources of data I've got,
right?
At least get everybody to agree in the room
where data is coming in from, and what it looks like.
Once you do that, you want to find some, again, consistency
on which tools are we going to use to store data?
You're not going to consume every single part
of the platform, right?
It doesn't make sense to just use every service
because it's there.
Find the ones that make the most sense for the data
that you've got.
And we'll go through that in a second, as well.
And the last thing is, what are the use cases, right?
How is the data going to get used over time, right?
It's important that you understand what those use
cases are, because they're going to drive back to the data--
or to the storage mediums.
And it's important to pick the storage mediums,
because that's going to depend pretty heavily
on how the data comes in, where it comes from,
and what it looks like.
All right, so where should data live, right?
We know the sources, we know the workloads,
but what do we do in the middle there?
How do we figure out which places to put data?
So this is kind of a simple decision tree.
And I'll go into a little bit more depth on the next slide
as well.
But some of this you can kind of simplify, right?
If it's really structured data, you're
kind of down to two options.
Is it sort of OLTP data, or is it OLAP data, right?
Is it transactional, or is it analytical?
If it's transactional, Cloud SQL's a great fit, right?
It's a typical relational database, no frills,
does what it does, and it does it really well.
On the analytical side, you have BigQuery, right?
So it depends on what the use case
is that's starting to drive it-- not just the structure,
but the use cases.
The next column is semi-structured.
So, you've got something that you might know the schema for,
but it could change.
It could adapt in flight.
People might be deploying new mobile apps
somewhere that customers are going to use.
They're going to start capturing data
they weren't capturing before.
All those things are possible.
So, if you've got semi-structured data,
then it's a question of, how do I need to query that data?
Again, we're back to what is the downstream use case?
If it's something where you need to query on any possible field
that you write in, Datastore is a good choice.
When you write a piece of JSON to Cloud Datastore,
we automatically index every key that's part
of that JSON by default.
Now, you can turn that off, if you want to,
or you can pick the keys you want, but, ultimately, you
have the ability to query anything
that you've written in there.
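Here is a small sketch of that indexing behavior with the Cloud Datastore Python client; the kind and property names are illustrative only.

```python
# Sketch of Datastore's indexing behavior. Kind and property names are made up.
from google.cloud import datastore

client = datastore.Client(project="my-retail-project")

# Every property is indexed by default, so any of them can appear in a query
# filter. Listing a property in exclude_from_indexes turns that off for it.
entity = datastore.Entity(
    key=client.key("ProductReview"),
    exclude_from_indexes=("review_text",),  # long text we never filter on
)
entity.update({
    "product_id": "sku-1234",
    "rating": 5,
    "review_text": "Great fit, arrived quickly...",
})
client.put(entity)

# Because 'rating' stayed indexed, this query works without any extra setup.
query = client.query(kind="ProductReview")
query.add_filter("rating", "=", 5)
results = list(query.fetch(limit=10))
```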
Whereas when you put data in a Bigtable,
you're basically stuck with having
to query by the row key.
And that's actually pretty powerful
for a lot of really great use cases,
especially like time series data or transactional data.
But it's not a great fit, again, if you
want to try to query some random column,
you know, somewhere buried throughout the dataset.
And the last one is objects, images, media,
that sort of thing. Those just go straight into Cloud Storage,
so it's a pretty simple approach.
So, if you break this out a little bit
and start getting in a few more examples,
you kind of end up with a chart that looks like this.
We just covered object storage really quickly a second
ago: great for binary data, great for object data,
media, backups, that sort of thing.
On the non-relational side for Cloud Datastore,
it's really good for hierarchical data.
Again, think of like JSON as a good example
to fit in there, and obviously it's
great for like mobile applications,
that sort of thing.
On the Bigtable side, really, really powerful system
for heavy reads and writes, but the row key
is your only option.
There's a little bit of filtering
you can apply outbound on a query,
but really, it's driven by the row key.
So you've got a very specific workload
you can use with Bigtable. On the relational side, you
have two options.
And I didn't mention Spanner yet,
and it's still a little bit early on Spanner,
but I did want to put it into context for kind of,
for everybody in the room.
Cloud SQL's great for web frameworks.
Typical web applications that you're going to build,
typical CRUD applications are great.
Spanner's really interesting.
It's new for us.
It's still early.
Spanner hasn't gone into general availability quite yet,
but it's an interesting thing to keep an eye on,
especially as people are building kind
of global-facing applications, where their customers could be
just about anywhere, and having a mechanism that
has a globally distributed SQL infrastructure is really
powerful.
So, it's something to keep your eye on as you kind of make
more progress with GCP.
And the last one is warehouse data.
Data warehouse, BigQuery, that's the right place to put that.
So, this is where it gets interesting.
You found the tools, you found the sources,
and you found the workloads.
How do you continue on this path of enabling self-service
and discovery?
So, you've taken all the data sets.
We've taken inventory of all of them.
They're all located in the right places,
but there might be a lot of it.
So how do we enable people to actually find
the things they're looking for over time?
And this is where metadata is really important,
because I think if you guys can kind of guess where I'm going,
ultimately, what we're going to do
is we're going to try to build a catalog,
and that catalog is what's going to drive everyone's usage
downstream, and everyone's ability to work on their own.
So, in order to build the catalog,
though, you've got to agree on metadata.
And actually, as it turns out, fortuitously,
or fortunately I should say, the Google research team actually
just published an interesting blog post
about facilitating discovery of public data sets.
And in it they cover things like ownership, provenance,
the type of data that's located--
that's contained within the data set.
The relationships between various data sets,
can you get consistent representations,
can you standardize some of the descriptive tools
that we use to do it.
It's basically just JSON.
There's nothing fancy about this,
but it forces everyone to come up and say,
I have a consistent representation
of every single data set.
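As a purely hypothetical example of what one of those catalog entries could look like, here is a small snippet; the field names are a plausible sketch, not the exact schema from the research post.

```python
# Hypothetical catalog entry for one dataset; field names are illustrative.
import json

catalog_entry = {
    "name": "purchase_history",
    "description": "One row per completed order, cleaned and deduplicated.",
    "owner": "bob@example.com",                    # who to ask about this data
    "provenance": "orders service, nightly export",
    "location": "gs://my-retail-data/purchase_history/",
    "format": "parquet",
    "fields": ["user_id", "product_id", "order_total", "purchase_date"],
    "related_datasets": ["product_metadata", "click_stream"],
    "tags": ["recommendations", "retail"],
}

print(json.dumps(catalog_entry, indent=2))
```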
So, you start to imagine, if you have 100 data sets strewn
across the company, you're only saying to everybody,
if you are an owner of a data set,
just publish this small little JSON file,
and hand it over to us, and now we
can start cataloging this data.
That's really powerful.
What's even better is, if you've got that JSON data,
we already have a place to put it.
So, you have two options here.
You can either push that JSON data into Cloud Datastore
or into BigQuery.
And I'll kind of cover the difference here.
Cloud Datastore is a great place for this data,
for a variety of reasons.
Again, JSON's really well represented.
You can query about any possible field there.
But the downside is, that Datastore doesn't really have
kind of a good user-facing UI.
It's very much application-centric.
So, if you want to go ahead and build a little CRUD
API on top of this thing, Datastore can be a good fit.
The other option is, frankly, BigQuery.
You know, same idea as Datastore around storing JSON.
In fact, this screenshot is an example
of what this JSON looks like in BigQuery,
because it does support nested columns.
BigQuery is great because you've got a UI attached to it.
There's a console right there, so you can actually
run SQL queries on it.
SQL's relatively universal across a lot of folks.
They understand it really easily.
So, this might make a little bit more sense.
To be totally fair and totally honest with you guys,
when I typically recommend this approach,
I often will push people to BigQuery,
as a place to build this data catalog around,
because it just makes sense, and it plugs
into a lot of downstream tools.
Datastore can make sense, but it's
very, very particularly dependent on how
the team wants to work.
For trying to find kind of a good generic solution
to start with, BigQuery's a great fit.
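As a rough sketch, and assuming a catalog table already exists in BigQuery with a matching schema (the project, dataset, and table names here are invented), streaming those entries in might look like this:

```python
# Sketch: stream catalog entries into an assumed BigQuery table
# my-retail-project.catalog.datasets, which must already exist.
from google.cloud import bigquery

client = bigquery.Client(project="my-retail-project")
table = client.get_table("my-retail-project.catalog.datasets")

rows = [{
    "name": "purchase_history",
    "owner": "bob@example.com",
    "location": "gs://my-retail-data/purchase_history/",
    "format": "parquet",
}]

errors = client.insert_rows_json(table, rows)
if errors:
    print("Catalog insert failed:", errors)
```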
So, once you've done that, you've pushed--
we've inventoried, we've cataloged, we've got metadata,
and we've got everything living inside of BigQuery now.
So, how does this change the workflow?
I'm only going to pick on one, because they're all
going to start looking a little bit similar, if I
keep going through them.
But, let's talk about the data science workflow.
So, for Alice, now instead of having to go talk to Bob,
or go talk to Ned, or go talk to Jan,
or talk to anybody else in the company
about what data is out there that she can use,
the first thing she can do is start to query the catalog.
So, she can just drop into the BigQuery UI,
and run a simple SQL query, and explore the data sets
that she has access to.
The next thing she can do, because part of what's
in there, one of the pieces of metadata, is what is the URL,
or what is the download location for this,
she can go ahead and pull that data into her environment.
Again, whether she's running a Cloud Datalab
notebook, or a Jupyter notebook, she
can pull that data into her environment.
And then, she can start to prototype a TensorFlow model,
for example.
She could start building a little bit of a TensorFlow
model and running some early analysis
of measuring how well her recommendation engine might
be working.
Once she does that, she might have created some new datasets.
She might have actually taken something and said,
now I actually have training data and test
data to work against.
So, now that she's created two new datasets,
the next thing she's going to do is upload those to the catalog.
Write the metadata, push that in,
so now other people that might want to build this stuff later
on, have access to the same training and testing
that she used.
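Put together, and with every table name, column, and gs:// path invented for illustration, that workflow might look roughly like this:

```python
# Sketch of the new data science workflow: discover, pull, publish.
# The catalog table, its 'tags' array column, and all paths are hypothetical;
# reading/writing gs:// paths with pandas assumes gcsfs is installed.
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(project="my-retail-project")

# 1. Discover: query the catalog instead of asking around.
catalog_sql = """
    SELECT name, location
    FROM `my-retail-project.catalog.datasets`
    WHERE 'recommendations' IN UNNEST(tags)
"""
datasets = client.query(catalog_sql).to_dataframe()

# 2. Pull: load one dataset into the notebook and split it.
ratings = pd.read_csv(datasets.loc[0, "location"] + "ratings.csv")
split = int(len(ratings) * 0.8)
train, test = ratings.iloc[:split], ratings.iloc[split:]

# 3. Publish: write the splits back out and register them in the catalog so
#    others can reuse exactly the same training and test data.
train.to_csv("gs://my-retail-data/recs/train.csv", index=False)
test.to_csv("gs://my-retail-data/recs/test.csv", index=False)
client.insert_rows_json(
    client.get_table("my-retail-project.catalog.datasets"),
    [{"name": "recs_train", "owner": "alice@example.com",
      "location": "gs://my-retail-data/recs/train.csv", "format": "csv"},
     {"name": "recs_test", "owner": "alice@example.com",
      "location": "gs://my-retail-data/recs/test.csv", "format": "csv"}],
)
```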
So, we're starting to create this kind of lifecycle around
making sure that if you create something new,
it gets shared with everybody as quickly as possible.
And then, she's going to continue on her work
like she normally does.
She'll train her machine-learning models.
She might push that TensorFlow model
to, like, the Cloud ML service.
And then, because the Cloud ML service that she created
is actually a kind of data resource,
she can actually create a catalog entry for that service.
So, now if someone says, I want to find a recommendation
service within the catalog, as long as she's
tagged and labeled everything appropriately,
they could find the URL for that service immediately.
And this is a bit of a simplification,
or oversimplification, of the amount of work
it takes to do all this.
But, if you start to extrapolate these steps out,
she's been able to work at her own pace
without having to worry about trying to find who owns what,
who has access to what she has access to, because we've
built this data catalog.
We've built this approach to metadata.
Over time, we're kind of getting to this point where
we can start building again toward self-service.
So, just like we have a SQL-- a great SQL interface
for querying that data catalog, you
might want to continue down the self-service path and say,
can we make these things even more discoverable?
Can we put like a CRUD API on top of this?
So, one option is to take a small application or small CRUD
API that sits in front of this BigQuery metadata catalog
and deploy that in, like, Compute Engine, or App
Engine, or Container Engine, and front
it with, like, cloud endpoints.
The reason you might go down this road
is particularly around good API management,
and building-- again, consistent and clear access.
Because the best way you can enable sort of self-service
is, obviously, giving everybody access to it,
but also giving everybody access to it
in a way that is most beneficial to them.
So, if you've got a lot of folks who are very API-driven,
or want to build new applications all
the time, having an API endpoint that they can
hit to learn more about the data catalog
is very, very beneficial.
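As a hedged sketch of that kind of endpoint, here is a tiny read-only Flask service in front of the same hypothetical BigQuery catalog table used above; in practice it would sit behind Cloud Endpoints on App Engine, Container Engine, or Compute Engine.

```python
# Tiny read endpoint in front of the hypothetical catalog table.
from flask import Flask, jsonify, request
from google.cloud import bigquery

app = Flask(__name__)
client = bigquery.Client(project="my-retail-project")

@app.route("/datasets")
def list_datasets():
    # Optionally filter the catalog by owner, e.g. /datasets?owner=bob@example.com
    owner = request.args.get("owner")
    sql = ("SELECT name, owner, location, format "
           "FROM `my-retail-project.catalog.datasets`")
    params = []
    if owner:
        sql += " WHERE owner = @owner"
        params.append(bigquery.ScalarQueryParameter("owner", "STRING", owner))
    job = client.query(
        sql, job_config=bigquery.QueryJobConfig(query_parameters=params))
    return jsonify({"datasets": [dict(row.items()) for row in job]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```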
And over time, you can actually start
to extrapolate this step even further,
and start fronting all of the data sources you've got.
So, not only do you have a CRUD API in front of the catalog,
but you might actually have one in front
of every single dataset.
So now, again, you're enabling this further, you know,
deeper level of access to data.
And this might be a little bit overkill
for the team of four people, but imagine
if that was a team of 400 or 500 people.
As you think about this kind of approach
permeating the entire organization,
everyone wants to have access, and you've
got to start building these things for scale over time.
So, picking the right tools up front lets you adopt that,
and again, lets new teams pick this approach up, as well.
Before we finish this, I do want to talk
a little bit about security and identity,
in kind of a broad sense.
We've got a great set of identity and access management
tools called Cloud IAM.
There's a ton of sessions about it
throughout the next couple of days,
so I urge you guys, if you're interested, to go dig into it.
What I really want to cover here though,
is this idea that there's policy inheritance that
goes from the top all the way to the bottom.
That means, at any level, if you set a policy on data access
or control, it does filter all the way down.
So, if you've got an organization
or a single project that's kind of the host
project in your project account with GCP,
and you've got individual projects for maybe dev,
staging, QA, that sort of thing, or production,
and then you have individual resources
underneath those projects, you can actually
control who has access to what, as long as you've set
your organization up correctly.
I'm not going to go too deep here,
but, basically, the idea to walk away
with is, we have the ability to control who can see what,
and who can pull things down, and that's really
what you want to control.
And, for example, if you dig into BigQuery a little bit,
you've got a handful of roles and different responsibilities,
or different access controls that they've
got based on that stuff.
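For example, granting one of those roles at the dataset level with the BigQuery Python client might look like this sketch; the dataset and the group address are made up.

```python
# Sketch: grant a group READER access on an assumed BigQuery dataset.
from google.cloud import bigquery

client = bigquery.Client(project="my-retail-project")
dataset = client.get_dataset("my-retail-project.catalog")

# Dataset-level access sits alongside the project- and organization-level
# IAM policies that get inherited from above.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="analysts@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```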
So, as you look at and think about what it's
going to take to build for the future,
I put the slide up again, because it's really important, if you
do nothing else, if you build nothing else,
you have to get agreement.
That's the most important thing.
If you can't adopt this idea that we
have to have consistency around data sources, data
workloads, and then the tools we're
going to use to work with that data,
this is going to be a long road.
As much as I talk about the products and the blue hexagons
and stuff, a lot of this is a cultural shift
in an organization.
It's kind of a lifestyle change.
You have to get everybody on the same page.
And part of doing that is saying, can we add consistency?
Or can we make this a consistent view of the world
that we can all agree upon?
And this might be a small subset of what you end up with.
It could be a much, much larger picture with a lot more pieces
to it, but it's important that everybody agrees, again,
what the data is going to look like coming in,
what it's going to get used for on the way out,
and where it's going to live while it's
sitting in your infrastructure.
As you think about--
as you kind of get through that consistency piece--
and, you know, that's a tough road,
but once you get through that, then the next step
is really getting around to building the catalog.
Can you catalog all the data sets?
Do you know exactly what lives throughout the organization?
Can you get everyone to kind of pony up
and say, all you have to do is write this little snippet
of JSON, and we'll be able to leave you alone
for the next few days?
If you can build the catalog, that's a great first step.
Then the next thing you want to do, is
make sure your teams have access to the tools they
need to do this work.
And this is not just the tools that are in GCP,
but it's also, like, the API access, or the SQL
access to the datasets.
Can they get those things?
Have you set up identity and security
correctly, so that everybody has access?
Because what you want people to be able to do
is work on their own, without having to go
bother anyone else.
If they can do everything they need
to do without interacting with somebody else,
then when they do interact, and they go have lunch together,
it's a lot friendlier conversation, as
opposed to arguing about who has access to data.
[MUSIC PLAYING]