DENNIS HUO: Hi, I'm Dennis Huo. And I just want to show you how simple, smooth, and effective running open source big data software can be on Google Cloud Platform.

Now, let's just begin by considering what we might expect out of any modern programming language. It seems it's not quite enough to be Turing complete. But we probably at least expect some standard libraries for common things like, say, sorting a one-megabyte list in memory. Well, nowadays, it seems like everybody has a lot more data lying around. So what if it could be just as easy to sort a terabyte of data using hundreds or even thousands of machines? More recently, programming languages have had to incorporate much stronger support for things like threads and synchronization as we all moved on to multi-processor, multi-core systems. Nowadays, it's pretty easy to distribute work onto hundreds of threads. What if it could be just as easy to distribute work onto hundreds of machines? Nowadays, it's also pretty common to find some fairly sophisticated packages for all sorts of things like cryptography, number theory, and graphics in our standard libraries. What if we could also have state-of-the-art distributed machine learning packages as part of our standard libraries?

Well, if you've been watching the code snippets over here the last few slides, you may have noticed that none of what I just said is hypothetical. It's all already possible right now, today. Which is fortunate, because just as we spent the last decade finally getting the hang of programming for multi-threaded systems on a single machine, nowadays it seems more and more common to find yourself developing for a massively distributed system of hundreds of machines. And with these changes in the landscape, it's pretty easy to imagine that big data could be a perfect match for the cloud. All the key strengths of cloud, like on-demand provisioning, seamless scaling, the separation of storage from computation, and, of course, just having such easy access to such a wide variety of services and tools, all contribute to opening up a whole new universe of possibilities, with countless ways to assemble these building blocks into something new and unique.

And through this shared evolution of big data and distributed computing, big data has become a very core ingredient of innovation for companies big and small, old and new alike. Take, for example, how streaming media companies like Netflix, Pandora, and Last.fm have contributed to changing the way consumers expect content to be presented. They've applied large-scale distributed machine learning techniques to power personalized recommendation engines, and millions of users around the world now expect tailor-made entertainment as the norm. As another example, take a look at Google's roots in web search: alongside a multitude of other search engines, news aggregators, social media, and e-commerce sites, we've all long grappled with applying big data technologies just to tackle the sheer magnitude of ever-growing web content. Users now expect to find a needle in a haystack hundreds of thousands of times, every second of every day.

And amidst this rise of big data, it's not that the data or the tools have suddenly become magical. All these innovations ultimately come from developers just like you, always finding new ways to put it all together, so that big data can still mean something completely different to each person or entrepreneur.
On Google Cloud Platform, through a combination of cloud services and a wealth of open source technologies like Apache Hadoop and Apache Spark, you'll have everything you need to focus on making a real impact, instead of having to worry about all the grungy little details just to get started. Now, for an idea of what this all might look like, let's follow a data set in the cloud on its journey through a series of processing and analytic steps, and we'll watch as it undergoes its transformation from raw data into crucial insights.

Now, what I have here is a set of CSV files from the Centers for Disease Control containing summary birth data for the United States between 1969 and 2008. These files are just sitting here in Google Cloud Storage, and, as you can see, it's pretty easy to peek here and there just using gsutil cat. Now, suppose we want to answer some questions by analyzing this data, like, for example, whether there's a correlation between cigarette usage and birth weight. At more than 100 million rows, it turns out we have at least 100 times too much data to fit into any normal spreadsheet program. So that means it's time to bring out the heavy machinery.

With our command line tool, bdutil, and just these two simple commands, you can have your very own 100-VM cluster fully loaded up with Apache Hadoop, Spark, and Shark, ready to go just five minutes or so after kicking it off. Once it's done, bdutil will print out a command for SSHing into your cluster.

Once we're logged into the cluster, one easy way to get started here is just to spin up a Shark shell. We just type shark and this prompt will appear. Now, what we'll be doing here is creating what's known as an external table. So we'll just provide this location parameter to Shark and point it at our existing files in Google Cloud Storage. Shark will go in and list all the files that match that location, and we can immediately start querying those files in place, treating them just like a SQL database. For example, once we have this table loaded, we can select individual columns and sort on other columns to take a peek at the data. We can also use some handy built-in functions to calculate some basic statistics like averages and correlations (a rough version of this correlation check is sketched below).

Now, one thing I've noticed about this kind of manual analytics is that while it's a great way to answer some basic questions, it's an even better way to discover new questions. For instance, over here we've found a possible correlation between cigarette usage and lower birth weights. So could we, perhaps, apply what we found to build a prediction engine for underweight births? And what about other factors like the parents' age or alcohol consumption? Now, this kind of chain reaction of answers leading to new questions is part of the power of big data analytics. With the right tools at hand, at every step along the way you tend to find new paths and new possibilities.

In our case, let's go ahead and try out that idea of building a prediction engine for underweight births. Now, to do this we're going to need some distributed machine learning tools. Traditionally, building a prediction model on a 100-node cluster might have meant years of study and maybe even getting a Ph.D. Luckily for us, through the combined efforts of the open source community and, in this case, especially UC Berkeley's AMPLab, Apache Spark already comes pre-equipped with a state-of-the-art distributed machine learning library called MLlib.
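Before moving on, here is the rough sketch of that correlation check referenced above. The demo does this with SQL at the Shark prompt; this version uses the plain Spark (Scala) shell instead, and the gs:// path, the column positions, and the use of MLlib's Statistics.corr (available in Spark 1.1 and later) are illustrative assumptions rather than details from the video.

```scala
// Minimal sketch only: the gs:// path and the column indices are illustrative assumptions.
import scala.util.Try
import org.apache.spark.mllib.stat.Statistics

// In the Spark shell, `sc` (the SparkContext) already exists; a bdutil-deployed cluster
// includes the Cloud Storage connector, so gs:// paths can be read directly.
val lines = sc.textFile("gs://my-bucket/natality/*.csv")

// Keep only rows where both fields parse as numbers (header rows and blanks drop out).
// Assume cigarettes per day is column 9 and birth weight is column 5 for this sketch.
val pairs = lines.map(_.split(",")).flatMap { cols =>
  Try((cols(9).toDouble, cols(5).toDouble)).toOption
}.cache()

// Pearson correlation between cigarette use and birth weight.
val r = Statistics.corr(pairs.map(_._1), pairs.map(_._2), "pearson")
println(s"correlation = $r")
```

Deriving both series from the same cached RDD keeps them aligned row for row, which is what the correlation computation needs when it pairs the two series back up internally.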
So our Spark cluster already has everything we need to get started building our prediction engine. The first thing we'll need to do here is just to extract some of these columns as numerical feature vectors to use as part of our model. And there are a few ways to do this. But since we already have a Shark prompt open, we'll just go ahead and use a SELECT statement to create our new data set. We'll start out with a rough rule of thumb here, just defining underweight as being anything less than 5.5 pounds. And we can just select a few of these other columns that we want to use as part of our feature vectors. We'll use this WHERE clause to limit and sanitize our data. And we can also just chop off the data from the year 2008, just for now, and put it in a separate location, so that we can use that as our test data, separate from our training data. Now, since we're doing all this as a single CREATE TABLE AS SELECT statement, all we have to do is provide a location parameter and tell Shark where to put these new files once it's created them. We can also go ahead and run that second query on the data from 2008, so that we'll have our test data available in a separate location, ready to go later. Sure enough, after running these queries, we'll find a bunch of new files have appeared in Google Cloud Storage, which we can look at just using gsutil or hadoop fs, for example.

Now, with all our data already prepared, all we have to do is spin up a Spark prompt so that we have access to MLlib. It just so happens we can pretty much copy and paste the entire MLlib getting-started example for support vector machines here (a rough version is sketched below). We'll just make a few minor modifications to point it at our training data, and plug that into Spark. Spark will go ahead and kick off the distributed job, iterating over the data set 600 times to train a brand new SVM model. Now, that will take a couple minutes. And once that's done, we'll have a fully trained model ready to go to make predictions. Here, for example, we can just plug in a few examples of underweight and non-underweight births using data from our real data set. We can also go in and load that data from 2008 that we saved separately, and rerun our model against it to make sure that our error fraction is still comparable.

Now, taking a look at everything we've done here, I'll bet we've raised more questions than we've answered, and probably planted more new ideas than we've actually implemented. For instance, we could try to extend this model to predict other measures of health. Or maybe we could apply the same principles to something other than obstetrics. We could also explore the whole wealth of other machine learning tools coming out of MLlib. Indeed, as we dive deeper into any problem, we'll usually find that the possibilities are pretty much limitless. Now, sadly, since we don't quite have limitless time in this video session, we'll have to bring this leg of the journey to an end. Everything you've seen here is only a tiny peek into the ongoing voyage of data through ever-growing stacks of big data analytics technologies. Hopefully, with the help of Google Cloud Platform, your discoveries can reach farther and spread faster by always having the right tool for the right job, no matter what you might come across. Thanks for tuning in. I'm Dennis Huo. And if you want to find out more, come visit us at developers.google.com/hadoop.
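For reference, the MLlib getting-started recipe for support vector machines that the demo adapts looks roughly like the following in the Spark (Scala) shell. The gs:// paths, the assumption that each exported row is a comma-separated label followed by its feature values, and the exact layout of the Shark output are illustrative placeholders rather than the demo's literal code.

```scala
// Minimal sketch only: paths and the label-then-features row layout are assumptions.
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Each row written out by the Shark query is assumed to be: label,feature1,feature2,...
def toPoint(line: String): LabeledPoint = {
  val parts = line.split(",").map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}

// Training data (1969-2007) and the held-out 2008 rows live in separate locations.
val training = sc.textFile("gs://my-bucket/natality-features/").map(toPoint).cache()
val test     = sc.textFile("gs://my-bucket/natality-features-2008/").map(toPoint)

// Train the SVM, iterating over the training set (the video mentions 600 iterations).
val model = SVMWithSGD.train(training, 600)

// Score the held-out 2008 data and report the fraction of misclassified rows.
val labelAndPreds = test.map(p => (p.label, model.predict(p.features)))
val errFraction = labelAndPreds.filter { case (l, p) => l != p }.count.toDouble / test.count
println(s"2008 test error fraction = $errFraction")
```

As trained here, the model applies MLlib's default decision threshold, so predict returns a 0/1 label directly; calling model.clearThreshold() instead makes it return raw scores, which is useful for metrics like area under the ROC curve.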