
DENNIS HUO: Hi, I'm Dennis Huo. And I just want to show you how simple, smooth, and effective running open source big data software can be on Google Cloud Platform.

Now, let's just begin by considering what we might expect out of any modern programming language. It seems it's not quite enough to be Turing complete. But we probably at least expect some standard libraries for common things like, say, sorting a one-megabyte list in memory. Well, nowadays, it seems like everybody has a lot more data lying around. So what if it could be just as easy to sort a terabyte of data using hundreds or even thousands of machines?
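
The on-slide snippets aren't captured in this transcript, but as a minimal sketch (hypothetical bucket paths, assuming a Spark shell where sc is the provided SparkContext), a distributed sort can be this short:

    // Read text spread across many machines, sort it, write it back out
    val lines = sc.textFile("gs://my-bucket/big-input/")
    lines.sortBy(identity).saveAsTextFile("gs://my-bucket/sorted-output/")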

More recently, programming languages have had to incorporate much stronger support for things like threads and synchronization as we all moved on to multi-processor, multi-core systems. Nowadays, it's pretty easy to distribute work onto hundreds of threads. What if it could be just as easy to distribute work onto hundreds of machines?
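
As a hedged illustration, again assuming a Spark shell, fanning work out across a cluster reads much like fanning it out across a thread pool:

    // A stand-in for some real per-item work
    def expensiveCompute(x: Int): Int = x * x

    // Distribute a local range across the cluster's machines,
    // apply the function in parallel, and gather the results
    val results = sc.parallelize(1 to 1000000, 800).map(expensiveCompute).collect()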

Nowadays, it's also pretty common to find some fairly sophisticated packages for all sorts of things, like cryptography, number theory, and graphics, in our standard libraries. What if we could also have state-of-the-art distributed machine learning packages as part of our standard libraries?

Well, if you've been watching the code snippets over here the last few slides, you may have noticed that none of what I just said is hypothetical. It's all already possible right now, today. Which is fortunate, because just as we spent the last decade finally getting the hang of programming for multi-threaded systems on a single machine, nowadays it seems more and more common to find yourself developing for a massively distributed system of hundreds of machines.

And with these changes in the landscape, it's pretty easy to imagine that big data could be a perfect match for the cloud. All the key strengths of the cloud, like on-demand provisioning, seamless scaling, the separation of storage from computation, and, of course, easy access to such a wide variety of services and tools, all contribute to opening up a whole new universe of possibilities, with countless ways to assemble these building blocks into something new and unique.

And through this shared evolution of big data and distributed computing, big data has become a very core ingredient of innovation for companies big and small, old and new alike.

Take, for example, how streaming media companies like Netflix, Pandora, and Last.fm have contributed to changing the way that consumers expect content to be presented. They've applied large-scale distributed machine learning techniques to power these personalized recommendation engines. Millions of users around the world now expect tailor-made entertainment as the norm.

As another example, take a look at Google's own roots in web search: alongside a multitude of other search engines, news aggregators, social media, and e-commerce sites, we've all long grappled with applying big data technologies just to tackle the sheer magnitude of ever-growing web content. Users now expect to find a needle in a haystack hundreds of thousands of times, every second of every day.

And amidst this rise of big data, it's not that the data or the tools have suddenly become magical. All these innovations ultimately come from developers just like you, always finding new ways to put it all together, so that big data can still mean something completely different to each person or entrepreneur.

On Google Cloud Platform, through a combination of cloud services and a wealth of open source technologies like Apache Hadoop and Apache Spark, you'll have everything you need to focus on making a real impact, instead of having to worry about all the grungy little details just to get started.

Now, for an idea of what this all might look like, let's follow a data set in the cloud on its journey through a series of processing and analytic steps. And we'll watch as it undergoes its transformation from raw data into crucial insights.

Now, what I have here is a set of CSV files from the Centers for Disease Control containing summary birth data for the United States between 1969 and 2008. These files are just sitting here in Google Cloud Storage. And, as you can see, it's pretty easy to peek here and there just using gsutil cat.
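
For instance, with a hypothetical bucket and file name standing in for the real ones:

    # List the CSV files, then peek at the first few rows of one of them
    gsutil ls gs://my-bucket/natality/
    gsutil cat gs://my-bucket/natality/births-1988.csv | head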

Now, suppose we want to answer some questions by analyzing this data, like, for example, whether there's a correlation between cigarette usage and birth weight. At more than 100 million rows, it turns out we have at least 100 times too much data to fit into any normal spreadsheet program. So that means it's time to bring out the heavy machinery.

With our command-line tool, bdutil, and just these two simple commands, you can have your very own 100-VM cluster fully loaded up with Apache Hadoop, Spark, and Shark, ready to go just five minutes or so after kicking it off. Once it's done, bdutil will print out a command for SSHing into your cluster.
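
The exact commands aren't captured in the transcript, but based on bdutil's documented usage they would look roughly like this (the project, bucket, and worker count are assumptions; check the flags for your bdutil version):

    # Deploy a 100-worker cluster with the Spark (and Shark) extension
    ./bdutil -p my-project -b my-config-bucket -n 100 \
        -e extensions/spark/spark_env.sh deploy

    # SSH into the master node once deployment completes
    ./bdutil shell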

Once we're logged into the cluster, one easy way to get started here is just to spin up a Shark shell. We just type shark, and this prompt will appear.

Now, what we'll be doing here is creating what's known as an external table. So we'll just provide this location parameter to Shark and point it at our existing files in Google Cloud Storage. Shark will go in and list all the files that match that location. And we can immediately start querying those files in place, treating them just like a SQL database.
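
A sketch of what that statement can look like; the table name and column list here are illustrative, not the exact schema from the video:

    CREATE EXTERNAL TABLE natality (
      year INT,
      mother_age INT,
      gestation_weeks INT,
      cigarettes_per_day INT,
      weight_pounds DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'gs://my-bucket/natality/';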

For example, once we have this table loaded, we can select individual columns and sort on other columns to take a peek at the data. We can also use some handy built-in functions to calculate some basic statistics, like averages and correlations.
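
For example, using Hive's built-in aggregates, which Shark inherits (column names as assumed in the sketch above):

    -- Average birth weight, and its correlation with cigarette usage
    SELECT AVG(weight_pounds),
           CORR(cigarettes_per_day, weight_pounds)
    FROM natality
    WHERE cigarettes_per_day IS NOT NULL;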

Now, one thing I've noticed about this kind of manual analytics is that while it's a great way to answer some basic questions, it's an even better way to discover new questions. For instance, over here we've found a possible correlation between cigarette usage and lower birth weights. So could we, perhaps, apply what we found to build a prediction engine for underweight births? And what about other factors, like the parents' age or alcohol consumption?

Now, this kind of chain reaction of answers leading to new questions is part of the power of big data analytics. With the right tools at hand, at every step along the way you tend to find new paths and new possibilities. In our case, let's go ahead and try out that idea of building a prediction engine for underweight births. Now, to do this, we're going to need some distributed machine learning tools.

Traditionally, building a prediction model on a 100-node cluster might have meant years of study, and maybe even getting a Ph.D. Luckily for us, through the combined efforts of the open source community and, in this case, especially UC Berkeley's AMPLab, Apache Spark already comes pre-equipped with a state-of-the-art distributed machine learning library called MLlib. So our Spark cluster already has everything we need to get started building our prediction engine.

The first thing we'll need to do here is just to extract some of these columns as numerical feature vectors to use as part of our model. There are a few ways to do this, but since we already have a Shark prompt open, we'll just go ahead and use a SELECT statement to create our new data set.

We'll start out with a rough rule of thumb here, just defining underweight as anything less than 5.5 pounds. And we can just select a few of these other columns that we want to use as part of our feature vectors. We'll use this WHERE clause to limit and sanitize our data. And we'll also chop off the data from the year 2008, just for now, and put it in a separate location, so that we can use that as our test data, separate from our training data.

Now, since we're doing all this as a single CREATE TABLE AS SELECT statement, all we have to do is provide a location parameter and tell Shark where to put these new files once it's created them. We can also go ahead and run that second query on the data from 2008, so that we'll have our test data available in a separate location, ready to go later.
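
Putting those pieces together, the two queries might look something like this, with paths and columns still the assumed ones:

    -- Training data: everything before 2008, labeled 1 if underweight
    CREATE TABLE natality_train
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION 'gs://my-bucket/natality-train/' AS
    SELECT
      IF(weight_pounds < 5.5, 1.0, 0.0) AS label,
      mother_age, gestation_weeks, cigarettes_per_day
    FROM natality
    WHERE year < 2008 AND weight_pounds > 0;

    -- Test data: the held-out year 2008, in its own location
    CREATE TABLE natality_test
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION 'gs://my-bucket/natality-test/' AS
    SELECT
      IF(weight_pounds < 5.5, 1.0, 0.0) AS label,
      mother_age, gestation_weeks, cigarettes_per_day
    FROM natality
    WHERE year = 2008 AND weight_pounds > 0;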

Sure enough, after running these queries, we'll find a bunch of new files have appeared in Google Cloud Storage, which we can look at just using gsutil or hadoop fs, for example.

Now, with all our data already prepared, all we have to do is spin up a Spark prompt so that we have access to MLlib. It just so happens we can pretty much copy and paste the entire MLlib getting-started example for support vector machines here.

And we'll just make a few minor modifications to point it at our training data here, and plug that into Spark. Spark will go ahead and kick off the distributed job, iterating over the data set 600 times to train a brand-new SVM model. Now, that will take a couple minutes. And once that's done, we'll have a fully trained model ready to go to make predictions.
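
As a sketch of those modifications (Scala in the Spark shell; the paths and comma-delimited layout follow the earlier sketches, and the API shown is as documented for Spark 1.x MLlib):

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Parse each CSV row into a label (first column) plus features
    val training = sc.textFile("gs://my-bucket/natality-train/").map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts.head, Vectors.dense(parts.tail))
    }.cache()

    // Train the SVM, iterating over the data set 600 times
    val model = SVMWithSGD.train(training, 600)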

Here, for example, we can just plug in a few examples of underweight and non-underweight births from our real data set and see what the model predicts. We can also go in and load that separate data from 2008 that we saved, and rerun our model against it to make sure that our error fraction is still comparable.
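
Under the same assumptions, spot-checking a prediction and rescoring the held-out 2008 data comes down to something like:

    // Spot-check one hypothetical feature vector
    // (mother_age, gestation_weeks, cigarettes_per_day)
    model.predict(Vectors.dense(25.0, 38.0, 10.0))

    // Reload the 2008 test data and compute the error fraction
    val test = sc.textFile("gs://my-bucket/natality-test/").map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts.head, Vectors.dense(parts.tail))
    }
    val errorFraction = test.map { p =>
      if (model.predict(p.features) == p.label) 0.0 else 1.0
    }.mean()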

Now, taking a look at everything we've done here, I'll bet we've raised more questions than we've answered, and probably planted more new ideas than we've actually implemented. For instance, we could try to extend this model to predict other measures of health. Or maybe we can apply the same principles to something other than obstetrics. We could also explore a whole wealth of other possible machine learning tools coming out of MLlib.

Indeed, as we dive deeper into any problem, we'll usually find that the possibilities are pretty much limitless. Now, sadly, since we don't quite have limitless time in this video session, we'll have to come to an end of this leg of the journey.

Now, everything you've seen here is only a tiny peek into the ever-ongoing voyage of data through ever-growing stacks of big data analytics technologies. Hopefully, with the help of Google Cloud Platform, your discoveries can reach farther and spread faster by always having the right tool for the right job, no matter what you might come across.

Thanks for tuning in. I'm Dennis Huo. And if you want to find out more, come visit us at developers.google.com/hadoop.
