SANDEEP PARIKH: Welcome, everybody. Thanks for coming. I know the day's been running a little bit long, so I will hopefully not keep you too long. Welcome to the session between you and happy hour. Hopefully, we'll get you out of here just in time to get some drinks and get some snacks. My name is Sandeep Parikh. I run the cloud solutions architecture team for Americas East, for Google Cloud Platform, and today, I want to talk to you about building a collaborative data platform. Effectively, you've got teams of individuals spread across your company that have to work together in some meaningful way, and you need them to share data, talk to each other, basically build something together. So, how do you enable that? We're going to walk through some of the ways that we tell a lot of customers, and developers, and partners how to do that stuff in the real world.

First, we're going to meet our hypothetical team of individuals, and we're going to learn about the tasks they have to do and the tools that they typically use. And then, we'll jump into mapping the tools they use to the tools available in Google Cloud, so you can start to see how they might get a consistent set of tools and work together. And then, we're going to talk about how we enable collaboration. This is how we do things like set up teams to do self-service and easy data discovery. I'm probably going to trivialize a handful of details, but stay with me, it'll make sense as we get through it. And then, I want to cover what the new workflow's like once you get all this stuff up and running, and then give you a couple little things and tips on how to get started. Does that make sense to everybody? Head nods? OK, good.

All right, so let's meet the team. So first, we've got four individuals. We've got a data scientist, a data engineer, a business analyst, and an app developer. Ultimately, they all have three things in common. They use different data sets, those data sets are all in different formats, and they all use different tools, right? So, there's not a ton of overlap here, and that's kind of the challenge. So, let's imagine for a second we've got this team, and their job is to build a recommendation engine for a retail site. What are the kinds of things that each person on the team is going to need in order to do their job successfully? The first thing we'll cover is the data, and then we'll talk about the tools.

So, from a data perspective, Alice, the data scientist, needs kind of cleansed, normalized data. Things that are no longer log files, right, not just raw rows from a database, but things that actually make sense. And those could be, for example, stuff like purchase history, product history, product metadata, click streams, product reviews. All the stuff that it would take to actually craft and model a recommendation problem. Then, we've got Bob, the data engineer. He's going to have to take all of the raw data and turn it into something useful. So, he's probably going to need things like log information, product transactions or purchase history, and product metadata as well. This is all the stuff that's coming in straight off the application that he's got to turn into something useful for other parts of the organization. Then, there's Ned. Ned is the business analyst, and he needs a lot of aggregated data, things that he can use to generate statistics and understanding about how the business is performing.
Not just, again, rows or log files, but that next level up from, sort of, the basics. And then, finally, we have Jan, the app developer. She's going to need user and product predictions, or recommendations. She needs to be able to build new microservices that expose things like this recommendation engine to the end user. So, that can be things like a recommendations API, access to wish lists, some knowledge around similar users or similar products, how things complement each other. You can see there's a little bit of overlap here, but not a ton, right. They all need something different, and they kind of need it in a slightly different format.

Now, I tried to draw a sequence diagram here, and it got a little complicated, so I simplified it a little bit. But, ultimately, Alice and Bob have to agree and talk to each other. Bob and Ned have to agree and talk to each other. Ned and Alice have to talk to each other, and Jan and Alice have to talk to each other. You know, there's a lot of communication happening. And, listen, I love talking to my coworkers just as much as the next person, but if you're spending all of your time trying to explain to them what you need, when you need it, where it came from, what it should look like, it becomes challenging, right. It slows down their ability to make progress and work quickly. Ultimately, this is the problem that we see a lot of folks having out in the world, right? They each have to find data sets. They each have to figure out who owns them. They have to figure out how things are related to each other, and they have to manage all that by themselves, which, again, takes time away from their ultimate task. Just by a show of hands, does this sound familiar to anybody? Oh, that's a scary number, actually. Maybe I shouldn't have asked that question. No, but this is great, right? We have to find and understand what the problem is, then try to solve it together.

All right, so let's talk about tools. That's the first step. We need to understand what everyone's working with in order to kind of make the next jump. So, Alice is the data scientist, huge fan of Python, like myself. It's like her Swiss army knife, she uses it for just about everything. Typically, she does a lot of her work with IPython or Jupyter notebooks, pulls data down, does it on her workstation, you know, slices and dices data, uses a lot of pretty common data science frameworks, things like NumPy, SciPy, scikit-learn, Pandas, because she has a history with, and/or likes, data frames. That's a typical approach, from a data science perspective. Bob is a huge fan of Java, right? He does lots of big ETL pipeline work, does a lot of orchestrating data from one format to another, or multiple sources of data together into something that looks cohesive on the other end. So, he's typically using tools like MapReduce or Spark. In some cases, he might be using Apache Beam, which you'll actually hear more about over the next couple days, as well. Ned is like a spreadsheet guy. He loves SQL and he loves spreadsheets. So, what he likes to do is build custom reports and dashboards, very SQL-driven, got to have some kind of access to a data warehouse, so he can do giant, you know, kind of cube rotations and analysis. Basically, he's got to be able to prove to the business how products are performing, why something like a recommendation engine is even necessary.
So, it's his responsibility to kind of facilitate that. And then, we've got Jan, the app developer. Jan's a typical kind of polyglot app developer. Likes writing stuff as microservices, likes delivering features very quickly, you know, may have a language preference one way or the other, but has to maintain a lot of the existing infrastructure. So, that could be things like Node apps. That could be simple things like Python apps with Flask or Ruby apps with Sinatra, all the way up to kitchen-sink frameworks like Django or Rails. Just depends, right, whatever the use case is.

So, we've gotten that far. We understand their task, we know who all the key players are, and we know the things that they like to use. So, the next step for us is, how do we figure out which tools they should be using in Google Cloud? All right, we can skip that slide. So, let's lay out a handful of the tools that are available. And we're not going to cover everything that's in GCP right now, because we don't have enough time, and I want you guys to have snacks at some point soon. So, what we'll do is kind of focus on the core set of things that are critical for each of those individual roles to have access to. The first part, you know, for things like applications, we've got Compute. So, things like virtual machines or virtual instances with Compute Engine, managed Kubernetes clusters with Container Engine, or just kind of managed container deployments-- Docker container deployments-- with App Engine. It's a good place to start. On the storage front, Cloud Storage for objects, Cloud SQL for relational data, Cloud Datastore for, you know, NoSQL, or non-relational datasets, and Cloud Bigtable for wide column. And then, on the data and analytics, or the more processing kind of side: Dataproc for managed Hadoop and Spark clusters, Datalab for IPython notebooks, BigQuery for kind of an analytical data warehouse, and Dataflow for running managed Apache Beam pipelines. And then, the last column, or the last kind of group, is machine learning. So, we've got Cloud Machine Learning, which is for running TensorFlow models, and running them at scale, plus things like the Natural Language API, the Speech API, or the Vision API. As you start to think about the team that we just introduced, and you start looking at this sort of bucket of tools, or this set of puzzle pieces, you can start to see where things are going to kind of fit together. There's a consistent set of tools here that we can map back onto the team. So let's walk through each one of those individually, really quickly.

So, for data science workloads, Alice has a couple of options. We already talked about the fact that she's a fan of Python, so that maps really well. With Python, she has two options. She can either use Cloud Datalab, or she can use Cloud Dataproc with Jupyter. With Datalab, what she gets is a complete Python dev environment. You know, it's bundled in with NumPy, SciPy, Matplotlib, so she can kind of kick off her work, build those charts, and build that kind of understanding of the data set as she normally would. Additionally, on top of that, it's also got built-in support for TensorFlow and BigQuery. This means she's got a complete environment to go and start maybe prototyping models that she wants to build with TensorFlow, if she's trying to build a very compelling recommendation mechanism.
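As a rough illustration of the kind of prototyping Alice might do in a Datalab or Jupyter notebook, here's a minimal matrix-factorization sketch in TensorFlow 1.x. Everything in it is hypothetical-- the sizes, the random toy batch, the model itself-- it's just one plausible starting point for a recommendation model, not anything shown in the talk.

```python
import numpy as np
import tensorflow as tf

# Hypothetical sizes; in practice these come from the cleansed purchase-history data.
num_users, num_items, embedding_dim = 1000, 500, 16

user_ids = tf.placeholder(tf.int32, [None])
item_ids = tf.placeholder(tf.int32, [None])
ratings = tf.placeholder(tf.float32, [None])

# One embedding vector per user and per product.
user_emb = tf.Variable(tf.random_normal([num_users, embedding_dim], stddev=0.1))
item_emb = tf.Variable(tf.random_normal([num_items, embedding_dim], stddev=0.1))

# Predicted affinity is the dot product of the user and product embeddings.
pred = tf.reduce_sum(tf.gather(user_emb, user_ids) * tf.gather(item_emb, item_ids), axis=1)
loss = tf.reduce_mean(tf.square(pred - ratings))
train_op = tf.train.AdamOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Toy batch of random interactions; real batches would come from her datasets.
    feed = {user_ids: np.random.randint(num_users, size=256).astype(np.int32),
            item_ids: np.random.randint(num_items, size=256).astype(np.int32),
            ratings: np.random.uniform(1, 5, size=256).astype(np.float32)}
    for _ in range(100):
        _, current_loss = sess.run([train_op, loss], feed_dict=feed)
    print('training loss on the toy batch:', current_loss)
```

From a toy prototype like this, she can iterate toward the real model she'll eventually train at scale.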
Or, if she needs data that lives in BigQuery, she can actually do inline SQL statements there, and pull data back out, or offload queries to BigQuery, as well. So, she's got a handful of options there. The nice thing about Datalab is that it's based on Jupyter, so it's got a little bit of that background and should look very familiar to her. But it is somewhat constrained to just Python. If she's got more specific needs, or wants to include additional frameworks or kernels, then we have to look at something like Cloud Dataproc plus Jupyter. So, what you can do is spin up a Cloud Dataproc cluster-- again, a managed YARN cluster, effectively-- and have Jupyter pre-installed on there and ready to go. It takes about 90 seconds to fire up a Hadoop cluster, anywhere from, like, three to a couple of thousand nodes. In this case, for Alice, I think just three nodes is probably appropriate, and her goal is to just spin this up and get Jupyter pre-installed. Once she's got Jupyter, then she's back to the exact same environment she had on her laptop, with support for over 80 different languages and frameworks or kernels. But, the nice thing also is built-in support for PySpark and Spark MLlib. So, you know, if you're kind of trying to figure out where the machine learning line sort of falls, there's definitely a handful more sessions you can attend around things like TensorFlow or Dataproc. What I would urge you to do, if you've got kind of an individual like this in your organization, is have them explore both. TensorFlow might be appropriate. Spark MLlib might be appropriate. And there should be a clean distinction between what they can each do.

All right, on the data processing front, same thing. Bob's got a couple of different options. He can either use Cloud Dataproc or he can use Cloud Dataflow. With Dataproc-- again, we just talked about this briefly-- you get managed Hadoop and Spark clusters. The nice thing about Dataproc-- and I'm only going to spend a minute on this, so, by all means, if you're interested we can do questions afterwards-- is that it turns that kind of mindset of a Hadoop cluster or a YARN cluster on its head. You end up going away from having a cluster-centric view of the world, and turning it into more of a job-centric view of the world. So instead of firing up a thousand-node cluster that everybody shares, you could have every individual job have its own thousand-node cluster. And as long as those jobs can be done in less than 24 hours, you can actually take advantage of things like preemptible VMs or preemptible instances. Those instances are about 80% off list price. So, you get the ability to do this job dramatically faster than ever before, because you have all this extra compute sitting around, and you get to do it really cheaply, because it only takes a few hours anyway. So Bob's got that option. He can also do things like tune the cluster parameters, or if he's got custom JAR files he needs, easily bootstrap those across the cluster, it doesn't matter. So, he has that option. The other approach is to use Dataflow. Dataflow is a managed service that we have for running Apache Beam workloads. Apache Beam is basically a programming model, or an approach, that unifies batch and stream processing into one. When you take Beam workloads or Beam pipelines and you push them to Cloud Dataflow, we run those for you in a totally managed, kind of automatically scalable fashion. So, that's pretty nice.
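To make the Dataflow option a little more concrete, here's a minimal sketch of what a Beam pipeline like Bob's could look like in Python. The bucket paths, the click-count aggregation, and the field names are all assumptions for illustration, not the pipeline from the talk.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Pass --runner=DataflowRunner (plus project/staging options) to push this to Cloud Dataflow;
# with no arguments it runs locally on the direct runner.
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadLogs' >> beam.io.ReadFromText('gs://example-bucket/raw/clicks-*.json')
     | 'ParseJson' >> beam.Map(json.loads)
     | 'KeyByProduct' >> beam.Map(lambda event: (event['product_id'], 1))
     | 'CountClicks' >> beam.CombinePerKey(sum)
     | 'Format' >> beam.Map(lambda kv: '%s,%d' % kv)
     | 'Write' >> beam.io.WriteToText('gs://example-bucket/aggregated/click_counts'))
```

The batch-versus-streaming point that comes up a little later amounts to swapping that text-file source for a streaming one (for example, Cloud Pub/Sub) and adding windowing; the transforms in the middle stay the same.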
And Apache Beam workloads on Cloud Dataflow actually support templates, for kind of easy parameterization and staging and things like that. The way I kind of think about the relationship here-- which path you go down-- is really a question of what you have existing already. If you've got a huge investment in Spark or MapReduce jobs, or just kind of Oozie workflows around those things, by all means, go down the Cloud Dataproc route. It's a turnkey solution. You can take your job and just push it to a new cluster, just like that. If it's net new and you're starting from scratch, I think Beam is a good approach to look at. It's relatively new, so it's a little bit young. It's definitely not as mature as some of the other components in the Hadoop ecosystem. But, it does have this really, really critical advantage where you can take a batch pipeline and turn it into a streaming pipeline just by changing a couple of lines around the input. So Bob's got a couple of options here, to kind of match up to what he typically likes to work with.

All right, Ned's use case is actually really simple. He needs a data warehouse. He needs a data warehouse that supports SQL, that can scale with whatever size of data he needs or has access to, and he's got to be able to plug it into a whole host of tools downstream. So, BigQuery's a great fit for him: an enterprise cloud analytical data warehouse. It's a fully managed mechanism. We often use the word serverless, though I hate that term, so I apologize. It supports standard SQL and it does scale up to petabyte scale. So, he has the option of running against anything from data sets that are gigabytes all the way up to petabytes, and still getting responses back within seconds. BigQuery is great because it does support kind of batch loading, or streaming inserts, as well. And it's got built-in things like security and durability and automatic availability, and all the other good stuff. But, ultimately, the best part is that Ned gets to use SQL, gets to query really, really large data sets, and visualize and explore the data as he's used to, with the typical tools and expertise he's got. When it comes time to create reports and dashboards and things like that, he has a couple of options. One, he can use Data Studio, which is built into the platform. It's kind of a Google Doc-style approach, where he can create a new report, and then share that with everybody in the company without actually having to create multiple copies. And people can edit things like parameters and watch the difference in those reports, and see how they look. But, effectively, he has the ability to create those reports and dashboards. Alternatively, he could use things like Tableau or Looker, or other business intelligence tools that he's a fan of. So, he's got a handful of options there. And the nice thing also is that, because BigQuery supports standard JDBC connectivity, a lot of the tooling that Ned is typically used to can plug right into BigQuery, as well.

So, the last one is Jan. We talked about this earlier. Jan likes to deploy and scale microservices, so she's got two easy options there. If she's really, really focused on complex container orchestration-- wants to deploy things that way-- Container Engine's a great fit. It's based on open-source Kubernetes. We've got built-in health checking, monitoring, logging.
We've got a private Container Registry. Or she can go down the App Engine route, and just take her Docker containers, push them up, and we'll auto-scale them for her. The good thing is that along with kind of the same health checking, monitoring, and logging, App Engine also includes a built-in load balancer up front, and has things like version management, traffic splitting, and automatic security scanning. So, again, a couple of options here that kind of make sense for her.

So, that's it, right? We're all done. Everybody's got tools, they can all work, and everybody is off-- off to the races. That's not true. We've only just kind of scratched the surface. They all have a consistent set of tools to work with, but we actually haven't enabled any of the collaboration bits yet. So, we still have to figure out how to get them, not only to work together, but actually to work in a way that makes sense and scales up. Right? Their team is four people today. It could be eight, 12, 16, you know, in a few weeks. So, we've got to come up with a better approach to managing the data that they need access to.

So, if you're trying to enable collaboration, there are a handful of things you're really going to have to think about. The first is, you've got to find consistency. And it's not just consistency in the tool set, right? That's also important, but you also need consistency in terms of where you expect data to be coming from and where you expect data to be used downstream, right? That's really important. If you can't agree on the sources, and you can't agree on what the downstream workloads are, it's hard to really understand how everybody should work together and use the same set of tools. The next thing you want to do is take a really good, hard look at all the data that you're trying to get access to. And this isn't true for every piece of data that lives in your organization, right? It might be things that are focused on certain parts of the company, like certain teams or certain tasks, but, ultimately, you've got to figure out where all this data lives and take an inventory of it, right? Take an inventory of it, and make sure it's in the right storage medium for everyone to use as time goes on. The next thing you want to do is come up with an approach around metadata, right? This is pretty simple, but if you're trying to figure out who owns a piece of data, where it came from, what other data sets it's related to, or what types of data are located in a data set without actually having to go query it, that's a really challenging problem when you think about having hundreds or thousands of individual pieces of data spread out across your infrastructure. Then, you want to enable discovery and self-service. You want people to be able to go find these things by themselves and pull them down as needed, without having to go, again, spend all that time arguing with people about formats, and availability, and tooling, and worrying about where you're going to store it, as well. And then the last thing is security, right? Don't leave that on the table. It's certainly important to understand security, and more broadly, identity, to make sure we're tracking access to all these things.

All right, so how do we start with this? I threw this in there. You have to kind of embrace the concept of a data lake. I'm not suggesting you have to go build one.
I know it's a totally loaded buzzword and term, but you have to embrace the idea of it, on some level, right? And this is kind of what I said earlier-- if you start at the very bottom, you first have to understand, what are all the sources of data I've got, right? At least get everybody in the room to agree on where data is coming in from, and what it looks like. Once you do that, you want to find some consistency, again, on which tools you're going to use to store data. You're not going to consume every single part of the platform, right? It doesn't make sense to just use every service because it's there. Find the ones that make the most sense for the data that you've got. And we'll go through that in a second, as well. And the last thing is, what are the use cases, right? How is the data going to get used over time? It's important that you understand what those use cases are, because they're going to drive back to the data-- or to the storage mediums. And it's important to pick the storage mediums, because that's going to depend pretty heavily on how the data comes in, where it comes from, and what it looks like.

All right, so where should data live, right? We know the sources, we know the workloads, but what do we do in the middle there? How do we figure out which places to put data? This is kind of a simple decision tree, and I'll go into a little bit more depth on the next slide as well, but some of this you can kind of simplify, right? If it's really structured data, you're kind of down to two options. Is it sort of OLTP data, or is it OLAP data, right? Is it transactional, or is it analytical? If it's transactional, Cloud SQL's a great fit, right? It's a typical relational database, no frills, does what it does, and it does it really well. On the analytical side, you have BigQuery, right? So it depends on the use case that's starting to drive it-- not just the structure, but also the use cases. The next column is semi-structured data. So, you've got something that you might know the schema for, but it could change. It could adapt in flight. People might be deploying new mobile apps somewhere that customers are going to use. They're going to start capturing data they weren't capturing before. All those things are possible. So, if you've got semi-structured data, then it's a question of, how do I need to query that data? Again, we're back to, what is the downstream use case? If it's something where you need to query on any possible field that you write in, Datastore is a good choice. When you write a piece of JSON to Cloud Datastore, we automatically index every key by default. That's part of that JSON. Now, you can turn that off if you want to, or you can pick the keys you want, but, ultimately, you have the ability to query anything that you've written in there. Whereas when you put data in Bigtable, you're basically stuck with having to query by the row key. And that's actually pretty powerful for a lot of really great use cases, especially things like time series data or transactional data. But it's not a great fit, again, if you want to try to query some random column, you know, somewhere buried throughout the dataset. And the last one is objects-- images, media, that sort of thing-- those just go straight into Cloud Storage, so it's a pretty simple approach. So, if you break this out a little bit and start getting into a few more examples, you kind of end up with a chart that looks like this.
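That decision tree is simple enough to write down as code. Here's a rough sketch-- the category names are mine, not an official taxonomy, but the routing follows what was just described:

```python
def suggest_storage(structure, access_pattern=None):
    """Rough encoding of the storage decision tree from the talk.

    structure: 'structured', 'semi-structured', or 'object'
    access_pattern: for structured data, 'transactional' (OLTP) vs 'analytical' (OLAP);
                    for semi-structured data, 'any-field' vs 'row-key' lookups.
    These category names are illustrative, not an official taxonomy.
    """
    if structure == 'structured':
        return 'Cloud SQL' if access_pattern == 'transactional' else 'BigQuery'
    if structure == 'semi-structured':
        return 'Cloud Datastore' if access_pattern == 'any-field' else 'Cloud Bigtable'
    if structure == 'object':
        return 'Cloud Storage'
    raise ValueError('unknown data shape: %r' % structure)


# Example: clickstream JSON that needs ad-hoc queries on arbitrary fields.
print(suggest_storage('semi-structured', 'any-field'))  # -> Cloud Datastore
```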
We just covered object storage really quickly a second ago: great for binary data, great for object data, media, backups, that sort of thing. On the non-relational side, Cloud Datastore is really good for hierarchical data. Again, think of JSON as a good example to fit in there, and it's obviously great for things like mobile applications. On the Bigtable side, it's a really, really powerful system for heavy reads and writes, but the row key is your only option. There's a little bit of filtering you can apply outbound on a query, but really, it's driven by the row key. So, there's a very specific set of workloads you can use Bigtable for. On the relational side, you have two options. And I didn't mention Spanner yet-- it's still a little bit early on Spanner-- but I did want to put it into context for everybody in the room. Cloud SQL's great for web frameworks. Typical web applications that you're going to build, typical CRUD applications are great. Spanner's really interesting. It's new for us. It's still early. Spanner hasn't gone into general availability quite yet, but it's an interesting thing to keep an eye on, especially as people are building kind of global-facing applications, where their customers could be just about anywhere, and having a globally distributed SQL infrastructure is really powerful. So, it's something to keep your eye on as you make more progress with GCP. And the last one is warehouse data. Data warehouse, BigQuery, that's the right place to put that.

So, this is where it gets interesting. You've found the tools, you've found the sources, and you've found the workloads. How do you continue on this path of enabling self-service and discovery? So, you've taken all the data sets. We've taken inventory of all of them. They're all located in the right places, but there might be a lot of it. So how do we enable people to actually find the things they're looking for over time? And this is where metadata is really important, because-- I think you guys can kind of guess where I'm going-- ultimately, what we're going to do is try to build a catalog, and that catalog is what's going to drive everyone's usage downstream, and everyone's ability to work on their own. In order to build the catalog, though, you've got to agree on metadata. And actually, as it turns out, fortuitously-- or fortunately, I should say-- the Google research team just published an interesting blog post about facilitating discovery of public data sets. And in it, they cover things like ownership, provenance, the type of data that's contained within the data set, the relationships between various data sets, whether you can get consistent representations, whether you can standardize some of the descriptive terms that are used. It's basically just JSON. There's nothing fancy about this, but it forces everyone to come up and say, I have a consistent representation of every single data set. So, you start to imagine, if you have 100 data sets strewn across the company, you're just saying to everybody: if you are an owner of a data set, publish this small little JSON file and hand it over to us, and now we can start cataloging this data. That's really powerful. What's even better is, if you've got that JSON data, we've already got a place to put it. So, you have two options here. You can either push that JSON data into Cloud Datastore or into BigQuery. And I'll kind of cover the difference here.
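To make that concrete, here's a hypothetical catalog entry sketched in Python. The field names are assumptions modeled on the attributes mentioned above (ownership, provenance, contents, relationships, location); they are not taken from the Google research post or from any official schema.

```python
import json

# Illustrative catalog entry for one curated dataset; every name and value here is made up.
catalog_entry = {
    "name": "product_purchase_history",
    "description": "Cleansed purchase transactions joined with product metadata",
    "owner": "bob@example.com",
    "provenance": "Derived nightly from raw order logs by the ETL pipeline",
    "format": "newline-delimited JSON",
    "location": "gs://example-bucket/curated/purchase_history/*.json",
    "fields": ["user_id", "product_id", "price", "purchased_at"],
    "related_datasets": ["product_metadata", "clickstream_events"],
    "updated": "2017-03-09",
}

# One line of newline-delimited JSON per dataset. A file like this can be batch-loaded
# into BigQuery, or the same dicts can be written to Cloud Datastore instead.
with open("catalog.jsonl", "a") as f:
    f.write(json.dumps(catalog_entry) + "\n")
```

Each dataset owner hands over one small record like this, and that's enough to start seeding the catalog in either of the two stores compared next.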
Cloud Datastore is a great place for this data, for a variety of reasons. Again, JSON's really well represented. You can query on just about any possible field there. But the downside is that Datastore doesn't really have kind of a good user-facing UI. It's very much application-centric. So, if you want to go ahead and build a little CRUD API on top of this thing, Datastore can be a good fit. The other option is, frankly, BigQuery. You know, same idea as Datastore around storing JSON-- in fact, this screenshot is an example of what this JSON looks like in BigQuery, because it does support nested columns. BigQuery is great because you've got a UI attached to it. There's a console right there, so you can actually run SQL queries on it. SQL's relatively universal across a lot of folks. They understand it really easily. So, this might make a little bit more sense. To be totally fair and totally honest with you guys, when I recommend this approach, I often will push people to BigQuery as the place to build this data catalog around, because it just makes sense, and it plugs into a lot of downstream tools. Datastore can make sense, but it's very particular-- it depends on how the team wants to work. If you're trying to find a good generic solution to start with, BigQuery's a great fit.

So, once you've done that-- we've inventoried, we've cataloged, we've got metadata, and we've got everything living inside of BigQuery now-- how does this change the workflow? I'm only going to pick on one, because they're all going to start looking a little bit similar if I keep going through them. But, let's talk about the data science workflow. So, for Alice, now instead of having to go talk to Bob, or go talk to Ned, or go talk to Jan, or talk to anybody else in the company about what data is out there that she can use, the first thing she can do is query the catalog. She can just drop into the BigQuery UI, run a simple SQL query, and explore the data sets that she has access to. The next thing she can do-- because one of the pieces of metadata in there is the URL, or the download location-- is go ahead and pull that data into her environment. Again, whether she's running a Cloud Datalab notebook or a Jupyter notebook, she can pull that data into her environment. And then, she can start to prototype a TensorFlow model, for example. She could start building a little bit of a TensorFlow model and running some early analysis, measuring how well her recommendation engine might be working. Once she does that, she might have created some new datasets. She might have actually taken something and said, now I actually have training data and test data to work against. So, now that she's created two new datasets, the next thing she's going to do is upload those to the catalog. Write the metadata, push that in, so now other people that might want to build this stuff later on have access to the same training and test data that she used. So, we're starting to create this kind of lifecycle around making sure that if you create something new, it gets shared with everybody as quickly as possible. And then, she's going to continue on her work like she normally does. She'll train her machine-learning models. She might push that TensorFlow model to, like, the Cloud ML service.
And then, because the Cloud ML service that she created is actually a kind of data resource itself, she can create a catalog entry for that service. So, now if someone says, I want to find a recommendation service within the catalog, as long as she's tagged and labeled everything appropriately, they could find the URL for that service immediately. And this is a bit of a simplification, or oversimplification, of the amount of work it takes to do all this. But, if you start to extrapolate these steps out, she's been able to work at her own pace without having to worry about trying to find who owns what, or who has access to what, because we've built this data catalog. We've built this approach to metadata. Over time, we're getting to this point where we can start building, again, toward self-service.

So, just like we have a great SQL interface for querying that data catalog, you might want to continue down the self-service path and say, can we make these things even more discoverable? Can we put, like, a CRUD API on top of this? So, one option is to take a small application, or small CRUD API, that sits in front of this BigQuery metadata catalog, deploy that in, like, Compute Engine, or App Engine, or Container Engine, and front it with, like, Cloud Endpoints. The reason you might go down this road is particularly around good API management, and building, again, consistent and clear access. Because the best way you can enable sort of self-service is, obviously, giving everybody access, but also giving everybody access in a way that is most beneficial to them. So, if you've got a lot of folks who are very API-driven, or who want to build new applications all the time, having an API endpoint that they can hit to learn more about the data catalog is very, very beneficial. And over time, you can actually extrapolate this even further, and start fronting all of the data sources you've got. So, not only do you have a CRUD API in front of the catalog, but you might actually have one in front of every single dataset. So now, again, you're enabling this further, you know, deeper level of access to data. And this might be a little bit overkill for a team of four people, but imagine if that was a team of 400 or 500 people. As you think about this kind of approach permeating the entire organization, everyone wants to have access, and you've got to start building these things for scale over time. So, picking the right tools up front lets you adopt that, and, again, lets new teams pick this approach up, as well.

Before we finish this, I do want to talk a little bit about security-- or really, security and identity-- in kind of a broad sense. We've got a great set of identity and access management tools called Cloud IAM. There's a ton of sessions about it throughout the next couple of days, so I urge you guys, if you're interested, to go dig into it. What I really want to cover here, though, is this idea that there's policy inheritance that goes from the top all the way to the bottom. That means, at any level, if you set a policy on data access or control, it does filter all the way down.
So, if you've got an organization, or a single project that's kind of the host project in your account with GCP, and you've got individual projects for maybe dev, staging, QA, that sort of thing, or production, and then you have individual resources underneath those projects, you can actually control who has access to what, as long as you've set your organization up correctly. I'm not going to go too deep here, but, basically, the idea to walk away with is: we have the ability to control who can see what, and who can pull things down, and that's really what you want to control. And, for example, if you dig into BigQuery a little bit, you've got a handful of roles with different responsibilities, or different access controls, based on that stuff.

So, as you look at and think about what it's going to take to build for the future-- I put the slide up again, because it's really important-- if you do nothing else, if you build nothing else, you have to get agreement. That's the most important thing. If you can't adopt this idea that we have to have consistency around data sources, data workloads, and then the tools we're going to use to work with that data, this is going to be a long road. As much as I talk about the products and the blue hexagons and stuff, a lot of this is a cultural shift in an organization. It's kind of a lifestyle change. You have to get everybody on the same page. And part of doing that is saying, can we add consistency? Can we make this a consistent view of the world that we can all agree upon? And this might be a small subset of what you end up with. It could be a much, much larger picture with a lot more pieces to it, but it's important that everybody agrees, again, on what the data is going to look like coming in, what it's going to get used for on the way out, and where it's going to live while it's sitting in your infrastructure.

As you get through that consistency piece-- and, you know, that's a tough road-- then the next step is really getting around to building the catalog. Can you catalog all the data sets? Do you know exactly what lives throughout the organization? Can you get everyone to kind of pony up and say, all you have to do is write this little snippet of JSON, and we'll be able to leave you alone for the next few days? If you can build the catalog, that's a great first step. Then the next thing you want to do is make sure your teams have access to the tools they need to do this work. And this is not just the tools that are in GCP, but also, like, the API access, or the SQL access, to the datasets. Can they get those things? Have you set up identity and security correctly, so that everybody has the access they need? Because what you want people to be able to do is work on their own, without having to go bother anyone else. If they can do everything they need to do without interacting with somebody else, then when they do interact, and they go have lunch together, it's a lot friendlier conversation, as opposed to arguing about who has access to data. [MUSIC PLAYING]