
  • LEE FLEMING: Good evening.

  • I am really pleased to welcome you all to "Leaders in Big

  • Data" hosted by Google and the Fung Institute of Engineering

  • Leadership at UC Berkeley.

  • I'm Lee Fleming.

  • I'm director of the Institute and this is Ikhlaq Sidhu,

  • chief scientist and co-founder.

  • The first and most important thing is to thank Google for

  • hosting the event.

  • So thank you very, very much.

  • There's a couple people in particular, Irena Coffman and

  • Gail Hernandez--

  • thank you-- and also Arnav Anant, our entrepreneur in

  • residence at the Fung Institute.

  • So here's Arnav.

  • AUDIENCE: A lot of work.

  • LEE FLEMING: Huge amount of work.

  • The Fung Institute-- we were founded about two years ago.

  • And the intent is to do research and pedagogical

  • development in topics of engineering leadership.

  • We have our degree, the Master's of Engineering--

  • professional Master's of Engineering M. Eng. program--

  • mainly around the Institute.

  • We also have ties though across the campus, as you'll

  • see shortly.

  • This is our intent to have a series of talks on topics of

  • interest to engineering leaders.

  • As it turns out, this Wednesday we

  • have our next talk.

  • It's sponsored by [? Thai ?] and the Fung Institute.

  • And the topic is entrepreneurship--

  • being an entrepreneur within your firm.

  • And fittingly, we have representatives from Google,

  • and Cisco, and SAP.

  • That's Wednesday.

  • Consult the Fung website or the [? Thai ?] website for

  • details on that.

  • So besides enjoying a good discussion tonight, we have an

  • ulterior motive, as you can probably tell.

  • We're trying to advertise all of our fantastic programs in

  • big data at Cal.

  • Now, whether you're interested in computation, or inference,

  • or application, or some combination of those things,

  • we've got the right program for you.

  • As I mentioned, the professional Master's of

  • Engineering, or M. Eng., across all the different

  • engineering departments--

  • one year degree.

  • We have another one-year degree in the stats

  • department-- a professional degree.

  • There's a two-year degree in the Information School.

  • And finally, there's the Haas MBA.

  • Tonight we've got people from all these programs.

  • You can find their tables, ask them questions, and hopefully

  • we'll see you at Cal soon.

  • And we also have an additional executive and other programs

  • associated with each of those departments

  • and schools as well.

  • Ikhlaq will now introduce our speakers.

  • IKHLAQ SIDHU: OK, thanks.

  • So let me see.

  • LEE FLEMING: Just slide this here.

  • IKHLAQ SIDHU: All right.

  • Welcome, I want to also thank a couple of people.

  • One is [? Claus Nickoli ?], who is not here at the moment,

  • but to you in the ether, he's just not at the meeting.

  • But he's our host here, and so thank you.

  • You guys can tell him that I thanked him.

  • And also, many of you I've seen here are basically

  • friends, and so thanks for coming.

  • It's good to see you again.

  • This is an event on big data.

  • And so I'm going to give you a little data on

  • who is speaking today--

  • who is here.

  • And the way I think of this is, what we've got is three

  • perspectives of big data from leading firms--

  • from people who represent leading firms in the area.

  • And so let's start with NetApp.

  • We've got Gustav Horn.

  • He is a senior consulting engineer with 25 years of

  • experience.

  • And he's built some of the largest enterprise-class

  • Hadoop systems in the world-- on the planet.

  • And from Google, Theodore Vassilakis, and he's a

  • principal engineer at Google.

  • He's the head of the team that works on data analytics.

  • And he's been responsible for numerous contributions to

  • Google in terms [? about ?] search, and the visualization

  • and representation of the results.

  • And from VMware, Charles Fan, who's senior VP of strategic

  • R&D. He co-founded Rainfinity and was CTO of the company

  • prior to its acquisition by EMC in 2005.

  • And our distinguished set of speakers is moderated by our

  • distinguished moderator, Hal Varian.

  • He is chief economist here at Google.

  • He's an emeritus professor at UC Berkeley and the founding

  • dean of the School of Information.

  • So with that, there's hardly anything more I

  • could possibly say.

  • Come on up Hal and take it away.

  • HAL VARIAN: Thank you.

  • I'm very impressed with the turnout tonight, seeing as

  • you're missing both the debate and the baseball game.

  • But at least it eliminates a difficult

  • choice for many people.

  • I will say that I'm going to follow the same rules as the

  • presidential debates.

  • So no kicking, biting, scratching, or bean balls are

  • allowed during this performance.

  • We're going to talk about foreign policy, wasn't that

  • the agreement?

  • No.

  • All right.

  • In any event, what I thought we'd do is, we'd have

  • each person talk for about five minutes, lay out their

  • theme, where they're coming from, what their perspective

  • is on big data.

  • And I will take some notes, and then ask some questions,

  • get a conversation going.

  • And I think we'll have a little time at the end for

  • some questions from the floor.

  • So, take it away.

  • THEO VASSILAKIS: Sure.

  • So, should I start, Hal?

  • HAL VARIAN: Yes.

  • THEO VASSILAKIS: All right.

  • Well, hey it's a real pleasure to be here.

  • Thank you guys also, and thank you guys for coming.

  • It's a huge, huge audience.

  • Just a couple of words.

  • As you heard, my name is Theo.

  • I lead some of our analytical systems.

  • So I'm responsible--

  • well, actually up until two weeks ago, I was responsible

  • for a stack that had parallel data warehousing components,

  • query engines, pieces like Dremel, and Tenzing systems

  • that let you query this data, and

  • visualization layers on top.

  • And that's one of the many, many systems at Google that I

  • think, outside, one would think of as

  • big-data type of systems.

  • And so I'll try to give you my perspective at least on the

  • Google view of big data.

  • And hopefully someone will cut me off when it's time.

  • I think I'll probably go for five minutes.

  • This could take a while.

  • AUDIENCE: [INAUDIBLE]

  • THEO VASSILAKIS: All right, sounds good.

  • Thank you.

  • I think, as you guys know, Google's business is primarily

  • about taking data and organizing the world's

  • information, and making it universally

  • accessible and useful.

  • So a lot of what the company does is really about sucking

  • in data-- whether it be the web, whether it be the imagery

  • from Street View, or satellite imagery, or maps information,

  • or Android pings, or you name it.

  • And then transforming it into usable forms.

  • So really, Google is kind of a big data

  • machine in some sense.

  • And I think the term big data came into

  • currency relatively recently.

  • And we all said, yeah, OK, that speaks to what we do.

  • Because we don't really have a word for it.

  • We just kind of knew that the data was large.

  • But just to try to put maybe more structure on to that, I

  • think the Google view on a lot of "what is big data

  • processing" kind of splits up into probably what I would

  • call ingestion type of processes--

  • things like the crawlers, things like all those Street

  • View cars running through all the streets of the world.

  • And then goes into transaction processing systems, where

  • perhaps we capture data through interactions on a lot

  • of our web properties, or a lot of the web properties that

  • we partner with.

  • This means people clicking on search, or people interacting

  • with docs, or people interacting with maps.

  • All generate many, many clicks and many, many interactions

  • that then become transactional big data.

  • Of course, that also includes people using let's say Google

  • Analytics on their sites to measure traffic on their

  • properties, which then generates huge volumes of

  • pings into Google--

  • many tens of thousands of QPS of pings.

  • So that's kind of the second big component.

  • And then probably the third component is the processing

  • side of all of that.

  • The processing side includes things like MapReduce,

  • analysis, generating insights from that data--

  • maybe in the form of building machine learning models.

  • Maybe in the form of building, for example, Zeitgeist top

  • queries that can then be served out to the world to

  • say, hey here is what people are searching for.

  • Maybe in the form of n-grams of all the books that Google

  • scanned over many, many years of its ingestion processes.

  • But it's really baking all of that information and then

  • presenting it in some usable form, either through a system

  • such as our ad system that takes models and decides what

  • ads to show, or in a more direct

  • form such as the n-grams.

  • Just to say, OK, here are those three broad classes--

  • ingestion, transaction processing, and analytical

  • processing.

  • To dig a little bit deeper into each of those areas, I

  • would say the ingestion processes, especially the very

  • large scale ingestion processes, are

  • highly custom systems.

  • If you think about our web crawlers, if you think about

  • the Street View cars, if you think about maps stitching, or

  • satellite imagery stitching--

  • those are very, very custom processes that I think, at

  • least to this date, don't have a clear analog

  • in the general industry.

  • And maybe this is something that you guys might address or

  • might see differently than how I see it.

  • They're still highly-specialized systems

  • that produce very large images.

  • And they're very high performance, very complex

  • systems that are run by dedicated engineering teams.

  • The transaction processing systems or the storage systems

  • are things like the Google File System.

  • These are things like Big Table.

  • These are things like Megastore.

  • Those are the ones that we've actually published papers

  • about and that are now reasonably well

  • known in the industry--

  • have evolved a little bit past the purely custom stage, where

  • they're fairly general purpose.

  • And there was a time at Google where actually most people did

  • their own storage in some form or another, until these

  • GFS-like systems evolved to the point where they were good

  • enough that more than one team could use them.

  • And actually, that evolution had many steps in which, for

  • example, everybody ran their own GFS.

  • And so maybe the ads team had their own GFS cells, and the

  • search team maybe had their own GFS cells.

  • And in time, the systems matured to the point where

  • actually we could have a centrally-managed file system.

  • And I think recently you may have seen, we've now talked

  • about this global file system called Spanner which takes

  • that to yet another level of transactions and global

  • availability.

  • And then the third step, which is I think still in a

  • relatively immature stage compared to some of the

  • storage systems, is the analysis.

  • And I think a lot of people know about MapReduce and some

  • of the systems that have been built on top of that.

  • So for example, Flume is the way of chaining MapReduces in

  • a more programmer-friendly way so that you don't end up with

  • 50 MapReduce stages that are individually managed.

  • But rather, you end up with one program that can then be

  • pushed down into many MapReduces that are

  • automatically managed.
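
The chaining idea Theo describes can be sketched in a few lines of Python. This is only an illustration of the pattern (tiny in-memory stand-ins, not Google's MapReduce or Flume APIs; the helper names are made up): each stage is a map function plus a reduce function, and a small driver wires one stage's output into the next so the stages are managed as one program.

```python
from collections import defaultdict

# Illustrative sketch of chained MapReduce stages (hypothetical helpers,
# not Google's MapReduce or Flume APIs): run_stage() applies one
# map/reduce pair, and run_pipeline() feeds each stage's output into the next.

def run_stage(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):           # map: emit (key, value) pairs
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in groups.items()]  # reduce per key

def run_pipeline(records, stages):
    for map_fn, reduce_fn in stages:                # stages are chained automatically
        records = run_stage(records, map_fn, reduce_fn)
    return records

# Stage 1: classic word count.
def split_words(line):
    for word in line.split():
        yield word.lower(), 1

def sum_counts(word, counts):
    return word, sum(counts)

# Stage 2: regroup by count, i.e. "which words occurred N times?"
def by_count(word_count):
    word, count = word_count
    yield count, word

def collect_words(count, words):
    return count, sorted(words)

if __name__ == "__main__":
    lines = ["big data big systems", "data systems at scale"]
    print(run_pipeline(lines, [(split_words, sum_counts), (by_count, collect_words)]))
```

The pipeline is written once as a single program, and the individual stages are scheduled by the driver, which is the convenience Theo attributes to Flume.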

  • The process there is still very engineering focused and

  • essentially requires engineering teams to process

  • this large data.

  • And so I think what we're seeing in that area is the

  • same maturation that we saw in the storage and transaction

  • processing systems.

  • Where little by little, systems such as Dremel, such

  • as Tenzing--

  • such as many others inside of Google that we haven't talked

  • about externally--

  • are aggregating a lot of that usage, and saying hey, we

  • really should do it in a much simpler manner.

  • And not really require people to have a full engineering

  • team to get the value out of all that big data.

  • Because at the end of the day, that's what

  • Google wants as a whole.

  • And that's what Google's customers want as a whole.

  • How do we get the value out of those big pieces of

  • information?

  • I would just leave you with those three big pieces.

  • And also this idea that, this is evolving into a

  • higher-level service that people can use without

  • necessarily being very, very

  • low-level engineering oriented.

  • And that more and more value is being derived out of that,

  • and hopefully something that you're seeing in Google's

  • properties and Google's services.

  • I don't know how much if I'm over, but I

  • can hand over here.

  • Thank you.

  • GUSTAV HORN: I'm Gus Horn, and thanks again to everybody for

  • coming tonight.

  • I know it's a big baseball night and you probably want us

  • to get done quick.

  • I come to it from a different approach in a sense and feel,

  • because Theo has--

  • Google has-- really been at the forefront of big data, big

  • data analytics, and in

  • particular Hadoop and MapReduce.

  • So I'm not going to go on the premise that everybody in this

  • room understands what MapReduce is, or what big data

  • is, and what data scientists are.

  • These are all buzz words that are really evolving.

  • I think what I found in my travels globally is that we're

  • really at the forefront right now of big data analytics.

  • I have a presentation that really characterizes it more

  • like a tsunami of data.

  • It's relentless, and it's coming at us.

  • It's coming at us from our Android

  • phones, from our iPhones.

  • It's coming at us from cameras that are everywhere, from our

  • TiVo boxes, from our PVR boxes, from everything we do

  • and touch in our world today.

  • We're generating data.

  • And the question is, do we either let the data fall on

  • the floor--

  • and we do nothing with it-- or are we going to pick that data

  • up and actually do intelligent things with it?

  • And we're finding more and more commercial applications.

  • Google I look at from a pragmatic perspective.

  • It's a commercial entity, but they are having a much more

  • philanthropic and broad approach to the world as well.

  • It was great back in 2003 that they defined GFS and gave us

  • MapReduce, which brought us back to the

  • mainframe days of old IBM.

  • But this is basically what it feels like to me, right?

  • Because it's batch-oriented processing at that time, when

  • we're talking MapReduce jobs.

  • But basically that was the genesis or the beginning of

  • what we call Hadoop as we know it--

  • the Facebooks, the Yahoos, the LinkedIns--

  • all of these companies that are embracing this technology.

  • But now we look at companies like Progressive Insurance,

  • where they're giving you these dongles to plug into your car.

  • They're generating data.

  • They're collecting data on your habits,

  • your driving habits.

  • The health care industry is looking at how often you

  • see the doctor, what are your statistics?

  • I was at the Mayo Clinic recently, and they have a

  • human genome initiative where they are looking at all of

  • their patients.

  • And they're actually doing a full genetic map of all of their

  • cancer patients.

  • And they're following these people for their entire life

  • expectancy.

  • And they want to keep their data 25 years, post mortem.

  • They want to build a repository where they can

  • understand exactly how does that one genetic mutation

  • affect your propensity to be carrying a disease.

  • Because they recognize that diseases

  • aren't just on or off.

  • There can't just be one mutation that

  • gives you that problem.

  • It's your environment and the mutation.

  • And that builds a susceptibility.

  • They're trying to really paint a huge picture, and that's a

  • big data problem.

  • So I see big data problems from health care.

  • I see big data problems in consumer-related industries,

  • whether they be the Walmarts, the Targets.

  • And not everybody is trying to be evil about this.

  • If you think about Target or Walmart, they would much

  • rather show you an advertisement that you care

  • about than to bore you to tears with something that

  • doesn't matter.

  • Just as Google doesn't want you to see a pop-up ad for

  • baby diapers if you're 60 years old and you're not going

  • to have a baby.

  • It doesn't do them any good, it doesn't do you any good.

  • There are a lot of positive things to take away from a lot

  • of this big data, and there's some negative things, too.

  • I'll focus on the positive in that I look at what companies

  • like the auto manufacturers in Europe are doing.

  • You look at BMW.

  • All of these cars are data-generating monsters.

  • And nowadays, you don't even know when you have to go for

  • an oil change, because they're predictively analyzing the

  • fluids in that car.

  • And they're determining when is it time for you to get that

  • oil change.

  • It's not like, oh, I have to do it every 4,000 miles.

  • Your car tells you when you need to get it done because of

  • viscosity changes and because of analytical testing.

  • And they're collecting all of this data.

  • I think we're very lucky that we are at this forefront.

  • And I think that big data-- big data scientists--

  • are going to become more and more important.

  • And I think that, as Theo said, that it's going to get

  • to the point where, you don't have to become a

  • MapReduce job expert.

  • You really need to become a logical thinker and be able to

  • articulate the questions you're asking against a data

  • set, where you don't even care where the data came from.

  • You just know that all the data is in there.

  • And that's the key-- is to have a repository that's able

  • to hold all the data, and be able to allow for this kind of

  • processing to take place on that data, and produce results

  • in a timely fashion.

  • And what I've done is, I'm approaching it from more of a

  • corporate perspective, where people are looking at

  • enterprise-class systems, versus what we call white box

  • or dirt cheap.

  • And there're different kinds of cut-offs for companies.

  • And I think as you go through your process at UC Berkeley,

  • and you're learning about where you want to go, you'll

  • see that you have to pick and choose your battles when it

  • comes to big data.

  • And the battle you have to choose is, am I going to be

  • setting up my data centers and my infrastructure to support

  • commodity-based platforms, and this-- do I want to own all

  • the data internally?

  • Do I want to virtualize the data in the cloud?

  • At what point do I bring that data internally?

  • Do I want to use services from Google?

  • They're all inflection points that you are going to be

  • making decisions over the next five years to

  • decide how to do that.

  • And this is what I'm dealing with all the time.

  • I think, hopefully, we all learn a lot from this

  • experience.

  • CHARLES FAN: Thank you for coming.

  • My name is Charles.

  • And unlike the presidential debate, I agree with

  • what they just said.

  • Big data is like an elephant.

  • We were told we are allowed to touch this elephant from

  • different angles, from different perspectives.

  • But before that, I'll just try to repeat what Theo and Gus

  • just mentioned.

  • First, I think the Internet is pretty big in terms of its

  • impact on our lives.

  • And not only to our lives, but also to enterprise IT.

  • And I think what we have seen in the last 20 years has been

  • the repeated tidal waves that's caused by the Internet

  • and the leaders in the Internet

  • space, including Google.

  • The advances they are making, and how those are hitting the

  • enterprise world.

  • And I think big data is the latest of such a tidal wave.

  • Essentially, the scale of data that the Internet

  • providers are dealing with on the consumer side, the

  • enterprises are now facing the same.

  • And now the challenge is, how do we adopt and massage this

  • technology so it's consumable by the various people inside

  • the enterprise worlds.

  • And that's what's behind the big data world we see.

  • And I think, like what Gus said,

  • enterprises are working in different sectors.

  • There are people doing retailing--

  • selling stuff.

  • There are people doing manufacturing--

  • building cars.

  • There are people in health care.

  • There are people doing financial trading.

  • In almost every field, they are generating

  • more and more data.

  • And almost every field has many questions they need to

  • ask based on those data.

  • And they need to make decisions based on those data.

  • And unlike the DWBI world, which has been around also for

  • 20 years, the amount of data, the variety of data, and the

  • speed of data coming at you are going beyond what the

  • existing infrastructure can take.

  • And that's why to answer these different questions in

  • different verticals, everybody is seeing a need for new

  • infrastructure, a new database, a new storage to be

  • created to support the decision making based on all

  • these data.

  • What's different in those data, besides just the size or

  • the volume of it?

  • When people typically refer to big data, they call it the

  • "three Vs," which is volume, velocity, and variety of data.

  • Some of them call them "four s." It's the source--

  • there more data sources--

  • the size, the speed, and structure of data that are

  • very different.

  • And I have another name for it, which is probably less

  • elegant, but also I think it's pretty true.

  • When we look at the old data, the small data, or the classic

  • data, they're typically record-based data, especially

  • those generated by transactional applications.

  • They usually have people generating them.

  • And they go through the whole life cycle.

  • So we typically call them CRUD data that you need to create,

  • read, update, and delete.

  • I'm sure all of you Berkeley students know

  • the CRUD acronym.

  • You manage them on the storage front.

  • You also have database design for it.

  • But with the new data, more and more of

  • them are machine generated.

  • We just have more and more devices that are connected to

  • the Internet.

  • Not all of them have a warm body sitting behind them.

  • There're both servers, as well as sensors, RFID, mobile

  • devices, cameras, and so on.

  • And they're all generating-- Google cars-- they're all

  • generating tons and tons of data, without people sitting

  • behind them.

  • You still need to create them, but you don't update

  • them that much.

  • Those are usually write once and read many type of data.

  • So there's not much update.

  • And there's not much delete.

  • You need to retain data 25 years after people die.

  • And even after 20 years--

  • 25 years-- people don't remember to delete them.

  • So there's not much delete, not much update.

  • There is a lot of replication.

  • So instead of CRUD, now it's like

  • create, replicate, append.

  • There's more and more append.

  • All the data is in append-only mode.

  • And process--

  • there's a constant need to process them in real-time,

  • during ingestion, or interactive.

  • So it's just crap data, is what big data is.

  • It's C-R-A-P--

  • create, replicate, append, process.
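
To make the CRUD-versus-CRAP contrast concrete, here is a small, purely illustrative Python sketch (the class and method names are made up, not any product's API): the classic store updates and deletes records in place, while the big-data store only creates, replicates, appends, and processes.

```python
import copy

# Illustrative contrast between CRUD-style and "CRAP"-style data handling
# (made-up classes, not any particular product's API).

class CrudStore:
    """Classic record store: create, read, update, delete in place."""
    def __init__(self):
        self.records = {}
    def create(self, key, value):
        self.records[key] = value
    def read(self, key):
        return self.records[key]
    def update(self, key, value):
        self.records[key] = value
    def delete(self, key):
        del self.records[key]

class CrapStore:
    """Append-only log: create, replicate, append, process -- no update or delete."""
    def __init__(self):
        self.log = []                                    # write once, read many
    def append(self, event):
        self.log.append(event)
    def replicate(self):
        return copy.deepcopy(self.log)                   # e.g. copy to another node
    def process(self, fn):
        return [fn(event) for event in self.log]         # batch or streaming analysis

if __name__ == "__main__":
    sensors = CrapStore()
    sensors.append({"device": "car-17", "oil_viscosity": 0.82})
    sensors.append({"device": "car-17", "oil_viscosity": 0.79})
    print(sensors.process(lambda e: e["oil_viscosity"]))     # [0.82, 0.79]
```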

  • And when we are talking about structured data versus

  • unstructured data, we say there are more and more data

  • that are unstructured than structured.

  • I think it's just because the database technology or the

  • underlying technology is not scalable enough to put them in

  • a schema or in some kind of structure.

  • That's why they are all CRAP.

  • But you still need to process them in a more efficient way.

  • And that causes a lot of your challenges.

  • I think essentially whoever designs the new data

  • management system for CRAP and makes them consumable by

  • enterprises, is going to be the winner of

  • this big data race.

  • GUSTAV HORN: So Google invented the new crapper?

  • HAL VARIAN: Yes, OK, thank you for starting us out on such

  • provocative comments.

  • I wanted to follow up on your little troika there

  • with the ingestion, transaction, and analytical.

  • I come at the end of that food chain.

  • So what we get is, the data's been pulled in, the data is

  • available to us, and we're working on

  • the analytical side.

  • I want to say a few words about that.

  • When we have these analytical systems at Google, one of the

  • things you can do is just monitor the system and make

  • sure everything's running the way we expect it to.

  • And these guys have done a fantastic job, because now you

  • can take almost anything that's gathering data at

  • Google and create a dashboard with about 20 minutes of work,

  • which is a fantastic thing for running the business.

  • The other thing you can do is, you can build the machine-learning

  • models that he alluded to and engage in this kind of

  • predictive analytics.

  • That's very in-vogue these days and it's a

  • great thing to do.

  • But the thing that a lot of people miss, I think, is you

  • can use that data to conduct experiments.

  • And that's really the secret sauce at Google.

  • Our leader of the search team, Amit Singhal, said that a

  • couple years ago, we did over 5,000 experiments with the

  • search algorithm-- made 400 changes.

  • On the ad side, we're running roughly 500

  • experiments at any one time.

  • Any time you're logged into Google-- or any time you're

  • accessing Google, I should say--

  • you're probably in a dozen or more experiments.

  • And it's having the capability to manage that data, not just

  • for the current incarnation of the system, but all the

  • variations you might contemplate, is really a

  • fantastic help in moving the whole system forward.

  • So that experimentation rule is very, very

  • important at Google.
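
One rough illustration of how a single user can sit in "a dozen or more experiments" at once (a toy sketch, not Google's actual experiment framework; names are made up): each experiment hashes the user ID with its own name as a salt, so assignments are independent and overlapping.

```python
import hashlib

# Toy sketch of overlapping experiment assignment (not Google's experiment
# infrastructure): each experiment hashes the user id with its own name,
# so one user gets an independent arm in every running experiment.

def bucket(user_id: str, experiment: str, arms=("control", "treatment")) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]

if __name__ == "__main__":
    experiments = [f"exp-{i}" for i in range(12)]
    assignments = {name: bucket("user-42", name) for name in experiments}
    print(assignments)        # twelve independent control/treatment assignments
```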

  • I wanted to raise a question of standards and

  • interoperability.

  • You mentioned Hadoop.

  • That's really become an industry standard.

  • Here at Google, we have our own internal stack.

  • It's a lot easier to enforce these standards for

  • interoperability internally, than industry wide.

  • But to make this system work--

  • of starting with ingestion and transactions,

  • and then the analysis--

  • outside of Google, or outside of other big data companies,

  • you've got to have these kinds of standards to interconnect

  • the flow of data.

  • And Charles, why don't you say a few things about what's

  • going on in that area?

  • CHARLES FAN: I do think we are at the early

  • stage of this industry.

  • And right now there are no standards, per se, to my

  • knowledge, that have emerged.

  • Hadoop has been a very popular technology that's born out of

  • the open source community's effort-- based on the Google

  • papers-- to create MapReduce and GFS, as well as the other

  • things they built on top of them.

  • And I think, in lieu of standards, my perspective is,

  • open source plays a huge role here.

  • In terms of overall data management, as I mentioned, we

  • are going from a world where everything is relational.

  • You basically have your relational data model, which

  • is the standard across all--

  • SQL being the standard query language.

  • We're going into a more chaotic world, where there are many kinds of

  • data stores, many kinds of queries.

  • Even on Hadoop, there are various ways you can

  • query on top of it.

  • And open source really gives people the choice.

  • In this chaotic period, it is the choice.

  • It's basically the developers and users who are going to

  • decide which will become the standard.

  • And open source really provides a way to make it happen.

  • GUSTAV HORN: I just want to make one comment.

  • I think open source actually is the best way to make sure

  • that you don't get yourself pigeonholed into anything

  • that's proprietary.

  • And I think that with Hadoop and big data, as I look five

  • or ten years down the road, I think that standards aren't

  • going to provide structure.

  • It'll be more of an inhibitor than a benefit in this area.

  • I think one of the key attributes--

  • and I think you can maybe talk more about that-- is the fact

  • that you want to be able to connect or stitch together a

  • bunch of disparate data sets.

  • You want to be able to look at things where you don't have to

  • be rigidly defined from the standard.

  • You want to be able to look at strange queries where weather

  • patterns, and people's buying habits, and the cars they

  • drive have some correlation.

  • And if you start imposing standards on top of something

  • that is that robust, I think it's going to probably stifle

  • development.

  • So I think the key here is open source.

  • The key is to have published innovations so that people are

  • publishing their works.

  • And I think as we get better and better at natural-language

  • processing and being able to get away from having to be

  • hard-core programmers, to glean insight into any of this

  • data store, it's going to be more beneficial.

  • I think in the next decade you'll find that you'll

  • probably be doing less and less Java programming and more

  • and more just natural language logic, I would think.

  • HAL VARIAN: Theo, I hope you're going to say a word or

  • two about protocol buffers.

  • THEO VASSILAKIS: Protocol buffers, yes, of course.

  • I'll plug protocol buffers for sure.

  • HAL VARIAN: Which you made as an open standard, right?

  • THEO VASSILAKIS: Right.

  • It's actually an open source system.

  • But before that, I was actually going to say I really

  • agree with your point about experimentation.

  • And I actually remember a time at Google where, if you wanted

  • to run an experiment--

  • for example, on search-- there was one engineer who is one of

  • our distinguished engineers now, Diane.

  • And you had to go ask her for some cookies on which you

  • could run your experiment.

  • It was sort of like, she would allot you some cookies.

  • Those days are over, but they really do generate a lot of

  • this CRAP data, because all of those experiments accumulate

  • over the years.

  • And yet it's really important to have the historical view of

  • hey, we tried this.

  • Here's what happened then.

  • And I think actually this plugs directly into this

  • problem of standards, because the way that all of the

  • engineers years back recorded their results, was very, very

  • different than the ways that engineers today

  • record their results.

  • So maybe, at the time, some of them didn't

  • have protocol buffers.

  • Which is, if you like, a kind of XML-like format for

  • representing data that Google created, but one that is much

  • more efficient.

  • And so I think the problem comes because we want to

  • integrate all of this variety of data.

  • And what I would say is, I agree with Gus that I don't

  • see a lot of appetite for very generic standards.

  • But I do see people having a need to bridge all of their

  • old data and the new data.

  • And I would basically make two analogies here. I think

  • one of the things that really helped the development of data

  • warehousing was fairly standard SQL.

  • And it was never a standard standard.

  • Like, there existed a standard, but no one really

  • followed the standard very closely.

  • But if it was close enough, you could get

  • your systems to work.

  • And I think the other aspect is file formats.

  • If you can take a file format and feed it into different

  • systems, that will really help.

  • And so until now, CSV was the end-all, be-all file format

  • for interchange.

  • I think we'll see more of these as we need to trade data

  • that's more structured--

  • that has protocol buffers or XML.
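
For readers who have not seen protocol buffers, here is a hedged sketch of what Theo is describing. The schema and the `click_pb2` module below are hypothetical (the protobuf compiler would generate the module from the `.proto` file); the serialize and parse calls follow the open-source protobuf Python API.

```python
# Hypothetical schema, compiled with `protoc --python_out=. click.proto`:
#
#   syntax = "proto3";
#   message Click {
#     string user_id = 1;
#     string query = 2;
#     int64 timestamp_ms = 3;
#   }

import click_pb2  # hypothetical module generated by protoc from the schema above

def roundtrip() -> str:
    click = click_pb2.Click(user_id="u-42", query="big data", timestamp_ms=1350000000)
    wire_bytes = click.SerializeToString()      # compact binary encoding, unlike CSV
    parsed = click_pb2.Click()
    parsed.ParseFromString(wire_bytes)          # typed, structured fields come back
    return parsed.query

if __name__ == "__main__":
    print(roundtrip())
```

Unlike a CSV row, the record carries named, typed fields, which is the kind of structured interchange Theo contrasts with CSV.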

  • CHARLES FAN: And if I could, let me add a plug for

  • VMware, as well.

  • As we mentioned, I think we are agreeing that we should

  • allow the chaos to continue for a little while.

  • However, there are certain parts I think we can help

  • people to make it easier.

  • Which is how do you stand things up.

  • Hadoop is a great system, but as Gus can probably tell you,

  • it's not so easy for enterprises to stand up a

  • Hadoop cluster.

  • Often the enterprise needs to stand up many of

  • those Hadoop clusters.

  • And some will need to stand up other types of data stores.

  • And that's where VMware is a leader in the virtualization

  • software and cloud infrastructure.

  • And we are building tools, which include an open source

  • project called Serengeti, which helps people easily stand up

  • their Hadoop clusters, as well as other data stores--

  • really automating some of those headaches and tough work.

  • And so they can focus on the work that matters.

  • HAL VARIAN: Let me put in a good word about standards.

  • Because when you look at companies, how do they grow?

  • They grow through acquisition.

  • When they grow through acquisition, you end up with

  • data silos everywhere.

  • And data silos are the enemy of big data.

  • And the amazing thing about Google, because of the work

  • that Theo and his team do, is we have no

  • data silos at Google.

  • Now that's not 100% true, of course, but when we bring an

  • acquisition in, we spend a lot of time trying to get their

  • data infrastructure aligned with our own internal

  • infrastructure.

  • And what it means is, you can basically pick an engineer off

  • of one project and move them on to another project,

  • completely at the other side of the company.

  • And they're productive in the first week because of having

  • that standardized infrastructure that we have.

  • And that is not something that most companies have the luxury

  • of dealing with.

  • The biggest problem that most companies face in data

  • management is trying to get this interoperation among the

  • different legacy systems.

  • You know, there's this old line, how did God create the

  • world in only six days?

  • And the answer is, he didn't have a legacy

  • system to worry about.

  • So everybody in the business faces, how's

  • that going to be solved?

  • That's my question.

  • How do you solve that?

  • GUSTAV HORN: I think you're right.

  • There are a lot of heterogeneous databases and a

  • lot of things that need to be stitched together.

  • And I think that big data--

  • again from the Hadoop perspective-- there are lots

  • of connectors out there-- from Flume, from Sqoop.

  • And I think that's key.

  • You'll find that a lot of these big database companies

  • are having to embrace open source.

  • They're having to embrace Hadoop, because if they don't

  • embrace it, they're going to become roadkill.

  • So they're looking for ways to monetize it, from consulting

  • services and things like that.

  • And also how then can they play in this market and become

  • leaders in this market, so they retain

  • their customer base.

  • Because the bottom line is, the Oracles of the world, the

  • SAPs, these people make money through selling licenses.

  • Hadoop is a license killer.

  • So that's going to directly impact their ability to be

  • profitable from a stock market perspective.

  • They need to find ways to innovate that allow them to

  • keep that trajectory.

  • And then the other thing I would say is, that a lot of

  • times the biggest problem I've found in industry, when I go

  • meeting with big customers or potential customers, is that

  • they don't know where to start.

  • They have a huge data problem, not just a big data problem.

  • They have data everywhere and silos in different corners of

  • the organization.

  • And they don't have one person who is competent enough from a

  • technical perspective to know how to move forward.

  • They have individual islands or teams that are looking at

  • how they can move forward.

  • And the real strength in big data and big data analytics is

  • the heterogeneous nature of the data.

  • That's one of the key strengths

  • of this entire industry--

  • is the fact that you want to stitch together all of these

  • different data sources, and then be able to find those

  • correlations amongst them.

  • It doesn't do anybody any good to do a structured database in

  • Hadoop, and you're just doing the same old thing.

  • What's the benefit?

  • There is no benefit.

  • The benefit is when you're able to combine all of these

  • sources into one place and you find that

  • needle in the haystack.

  • Or you're able to better understand your customer.

  • Because fundamentally, all of these things

  • are customer driven.

  • I don't care whether it's Google.

  • I don't care whether it's VMware.

  • If the customer isn't happy, they're not

  • going to come back.

  • They're not going to like your website.

  • They're not going to like your product.

  • So the bottom line is, how can you find ways to modify what

  • you're doing to make it better for the customer.

  • And if you're able to find those needles because you can

  • stitch together all of these different sources--

  • including social media, including global search

  • engines and global communities--

  • and find out what people are doing, you'll find out those

  • subtle differences that really become the real game changer.

  • And that's really what big data is about.

  • CHARLES FAN: Yeah, and I think another way I'll dissect

  • big data is that it can be looked at as four layers of

  • functionality.

  • At the very top are the big data applications.

  • And then the second layer, which is big data analytics--

  • the various machine learning and other

  • algorithms you can apply.

  • The third layer is the big data management--

  • the query engines and so on, that you can query the data.

  • And the bottom layer is the data infrastructure--

  • the storage, and so on where you store the data.

  • I think, to the question, the lower the layer, the closer

  • it is to standardization.

  • I think there is, maybe to Theo's comment, there probably

  • can be a unified big data store, where all the bits, all

  • the CRAP, eventually end up somewhere.

  • There's a sink, a common sink for all the CRAP.

  • And they come into here.

  • I think right now we should still allow various different

  • ways for them to be queried.

  • Even in our Hadoop system, some people like to use Pig,

  • some people like to use Hive, some people like to just do

  • HBase directly on HDFS.

  • Some people like Dremel, which is another way you can

  • interact with it.

  • And I'm sure there are new innovations coming out of

  • Google, out of everywhere in the ecosystem.

  • And like in [INAUDIBLE].

  • When I talk about standardization chaos,

  • sometimes I'll go back to the history--

  • for me, it's Chinese history.

  • Where, for those of you who have read the Chinese book

  • called "The Romance of Three Kingdoms," where the first

  • line of the novel is, "After unification it's chaos.

  • After chaos, it's unification. "

  • And it's describing how, out of all the warlords fighting in

  • the chaos, inevitably somebody will emerge from the struggle

  • and unify the land.

  • And that will be your emperor.

  • And also inevitably, whether it's after he gets old, or

  • whether he dies and his kids get weak, it will fall

  • back into chaos.

  • And this is the traditional cycle of dynasties that repeated

  • about a dozen times.

  • That's 4,000 years of Chinese history.

  • And I think that can apply to the history of anywhere else.

  • As well, it can apply to the data processing, the data

  • management here.

  • Where we are in this period, going from a more unified SQL

  • interface, a more unified data management query engines, to a

  • more diversified world.

  • But I would predict in ten years, there will be leading

  • standards or ad hoc standards-- de facto

  • standards-- that are going to emerge, where the majority of

  • big data problems are going to be solved in that way.

  • THEO VASSILAKIS: Yeah, I agree with that.

  • I don't know if it'll be in the form of a W3C standard or

  • something like that, but I think that's a little bit the

  • dynamic that Hal was referring to inside of Google.

  • That after n years of fighting with all of the different

  • varieties of things, people kind of said, well we

  • understand now that it's not the purpose of our team here

  • over in maps to really build that entire stack.

  • Because now that we know what all that entire stack entails,

  • we realize that it's really far too big for us

  • to do on our own.

  • And so we're willing to concentrate further up the

  • stack in the parts that we really care about.

  • And that then led lots of groups at Google to look

  • around and say, OK well, what is a piece of technology that

  • exists, and is reasonably mature.

  • And a lot of people will use it, and it

  • gives us this advantage.

  • And so that's how some of the components such as Dremel and

  • others emerged as de facto standards of

  • how we analyze data.

  • And I think that those de facto standards will in time,

  • probably lead into some kind of more formal standards that

  • can be adopted across companies and across

  • organizations.

  • HAL VARIAN: Let me switch gears and turn to the

  • infrastructure, the hardware infrastructure.

  • So there's two models out there.

  • You could buy your infrastructure, and people to

  • maintain it, and run it in-house.

  • Or you can lease it on the cloud.

  • And what do you see as the advantages and disadvantages

  • of those two approaches?

  • GUSTAV HORN: I think that there's a place for both, to

  • be honest with you.

  • I think that you'll find that the cloud is a great place to

  • get started.

  • It's a great place for you to kick the tires.

  • I think you're always going to have the open source-- what I

  • call white box, commodity-based approach.

  • And a lot of groups where you're going to be doing your

  • sandbox, your proof of concept, you're going to be

  • testing out your code, from an infrastructure perspective.

  • And also I think that there's a place even for what's being

  • done over at VMware, where they're looking at

  • fundamentally providing an infrastructure and product in

  • a box, so that people can go to service providers and spin

  • up MapReduce clusters and build their file systems.

  • At some point in time there is going to be again, like I

  • said, a decision where companies are either going to

  • embrace the technology because their internal leadership or

  • their leader within the company has proven

  • the value of this.

  • And that's going to be the tough slog that everybody in

  • this room is going to have to deal with over the next five

  • years-- is that you're going to be battling internal

  • processes, internal fights within every organization

  • that I've met.

  • Where you have the legacy database people-- the people

  • who said this is how we do it, this is why we do it.

  • We have these checks and balances.

  • We have these constraints.

  • That data has to stay within our walls.

  • And then you're going to have the leaders, who are more

  • aware of what's available in technology with

  • virtualization, with cloud-based technology.

  • And in some cases, it does make sense.

  • There're regulations and laws that are going to dictate

  • where data resides, or where it can reside, or

  • where it has to be.

  • And there're going to be places where the cloud is

  • going to be paramount.

  • But you're going to find in the next five years, that

  • you're going to be fighting more political battles than

  • doing anything else.

  • THEO VASSILAKIS: I agree with that.

  • I think there will certainly be lots of ways to run

  • infrastructure locally, as well as on the cloud.

  • I think though, that what people will realize over time

  • is that a lot of the reason why it may sometimes appear

  • cheaper to run locally than it is to run on the cloud these

  • days, is because with cloud services, you get a lot of

  • services by default.

  • So perhaps you would get back-up by default, perhaps

  • you would get certain compliance

  • functionality by default.

  • Whereas sort of on reasonably bare machine, in perhaps your

  • own data center, you wouldn't get these automatically.

  • And I think over time, as more of this computation becomes a

  • commodity, in that you just expect it to

  • work and that's it--

  • you won't be able to live without some of those things

  • that are today considered value-added services.

  • And I think there will be a crossover point where it'll

  • start to be more expensive to actually do all of these

  • things on your own appliance, than it will be to do it at

  • scale in somebody's data center.

  • And I think the fatter and fatter pipes that connect us

  • to these data centers are going to make that a

  • possibility.

  • HAL VARIAN: Go ahead.

  • CHARLES FAN: Again, in the anti-presidential-debate style,

  • I agree with both Theo and Gus.

  • And VMware's view is that it's a hybrid cloud, where we

  • want to provide the same benefit to customers, whether

  • they are running things in their data centers or out of a

  • cloud service provider.

  • All that being said, I do think there will be an

  • increasing amount of infrastructure moving out of

  • the data center, over time, to cloud services.

  • Meaning the applications will be more and more delivered as

  • a service to the enterprise customers, as opposed to as a

  • packaged software today.

  • That will take time, but I think that will happen.

  • But even after that happens, even after the infrastructure

  • is outsourced, so to speak, to the cloud service providers,

  • ownership of the data, of the big data, medium data, small

  • data, still is with the enterprises.

  • And it is still their responsibility to be able to

  • make their decisions based on the data that they own.

  • Even though some of the data may be sitting at the service

  • provider, at the cloud provider.

  • It is still their responsibility to analyze

  • those data and to make decisions based on those.

  • THEO VASSILAKIS: And clearly, security is going to be one of

  • those big items.

  • And so if anyone's working on cryptography, that's going to

  • continue to be a pretty hot thing.

  • HAL VARIAN: It's always good to have a job

  • where there's an adversary.

  • Coming back to the elections again--

  • same model.

  • Let's come down to the query language.

  • We've seen SQL mentioned a few times.

  • What about NoSQL?

  • Tell me what's the role of that in today's world?

  • Is SQL going to be obsolete?

  • Or are we going to continue to rely on that

  • as our query basis?

  • CHARLES FAN: OK I'll start.

  • I'm sure Gus and Theo have more to add.

  • I think NoSQL is sort of part of this common, chaotic

  • phenomenon that we are seeing.

  • It's driven by a few factors.

  • Still by far, SQL is the most popular query language today.

  • But NoSQL is born out of the need for people looking

  • for more flexible schemas.

  • And they're developing applications.

  • Sometimes they have the data stay the same, but they want

  • to structure them differently.

  • And they want to do that in an easier way.

  • And they want to relax some of the consistency requirements of

  • their databases so they can deal with scale in a much

  • easier and much better way.

  • And it's basically driven through

  • various different needs.

  • So there are different flavors of new

  • query models that emerged.

  • And I think there is no better name.

  • So the easiest one is to call them what they are not.

  • It's just NoSQL.

  • I do see that there is a strong trend in terms of

  • developers embracing them.

  • But again there are no clear new winners

  • in the query languages.

  • And I think in different companies there may be

  • different preferences being set up.

  • It doesn't mean five, ten years from now,

  • there won't be one.

  • I think right now, it [? is in ?] the model, let

  • developers decide--

  • let the developers of the world decide--

  • whether there is a newer querying language that can

  • replace SQL as a new one.
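
As a concrete reading of "more flexible schema" (an illustrative sketch only, using Python's built-in sqlite3 as a stand-in rather than any NoSQL product the panel names): a fixed relational table needs a migration before it can hold a new field, while a document-style table stores whole records and takes new fields as they come.

```python
import json
import sqlite3

# Illustrative sketch of schema flexibility; SQLite stands in for both styles.

conn = sqlite3.connect(":memory:")

# Fixed schema: adding a new attribute later means ALTER TABLE and a migration.
conn.execute("CREATE TABLE users_relational (id TEXT, name TEXT)")
conn.execute("INSERT INTO users_relational VALUES (?, ?)", ("u1", "Ada"))

# Document style: each record carries its own fields, so no migration is needed.
conn.execute("CREATE TABLE users_documents (id TEXT, doc TEXT)")
documents = [
    {"id": "u1", "name": "Ada"},
    {"id": "u2", "name": "Lin", "devices": ["android", "tv"]},  # extra field, no schema change
]
conn.executemany(
    "INSERT INTO users_documents VALUES (?, ?)",
    [(d["id"], json.dumps(d)) for d in documents],
)

for (doc,) in conn.execute("SELECT doc FROM users_documents"):
    print(json.loads(doc).get("devices", []))    # [] then ['android', 'tv']
```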

  • GUSTAV HORN: I would only say that SQL is going to be around

  • for a long time to come.

  • I still run into companies that are running

  • COBOL, of all things.

  • It's not going anywhere, any time soon.

  • I think what NoSQL is-- versus SQL, versus any

  • of these other things--

  • is yet another way of exposing all these internal politics

  • and battles that happen in big industry.

  • And you're going to have legacy databases where that's

  • the only way you can talk to them.

  • And you're going to have next generation things coming out.

  • And if it wins the battle, which I think it will, you'll

  • find NoSQL becoming more and more popular.

  • And you'll find more and more of these aggregate,

  • heterogeneous kinds of data stores becoming more popular--

  • provided they provide the answers that

  • they're supposed to.

  • Which just means that they have to be faster.

  • They have to be infinite in volume and size,

  • and they can grow.

  • And they have to never forget anything.

  • That's kind of the key.

  • When we talk big data, I always get a laugh sometimes

  • because they say, well we only need a 200-node system.

  • I said, well, that's today.

  • What are you going to do in five years?

  • How are you going to grow that?

  • I mean the most important thing in big data--

  • it's not the computational engines.

  • That's the most volatile thing in your big data system.

  • You want to get rid of that old crap

  • anyway, every two years.

  • You don't want to have to then re-migrate all your data.

  • The most important thing-- and OK, so this little plug for me

  • from NetApp-- is that the data is what's important.

  • The thing that it runs on is the most volatile or least

  • important thing.

  • It's the thing that you want to be able to flush out, and

  • read, and make faster--

  • over and over again-- provided that data stays and you don't

  • have to move it.

  • Because moving stuff is a waste.

  • And in Google, you don't want to be moving data either.

  • That's wasted energy.

  • THEO VASSILAKIS: Absolutely.

  • And I agree with that point, that the systems change.

  • Many, many systems have changed over

  • the years at Google.

  • And we'd migrate it forward and the older storage

  • systems have died.

  • But the data is always there.

  • I'm pretty sure that Jim Gray, who's a Turing Award winner,

  • felt like he needed to apologize for SQL in his

  • Turing Award acceptance speech-- sorry for SQL.

  • And as a builder of SQL systems, I think SQL will stay.

  • It's great.

  • But actually the only thing I would point out about it is, I

  • think its main and most positive attribute is that

  • it's a declarative language.

  • Meaning, it doesn't say how to compute what you want to

  • compute, but it just says what you want as the answer.

  • And I think that that's the key characteristic that--

  • whatever the language is, be it SQL, be it something else--

  • will be important.

  • Because the bigger the computation is, the more

  • complex the program is that you would have to write if

  • you're writing a real procedural program.

  • And so you're going to need systems to actually turn that

  • into computation for you.

  • So whatever the language is-- maybe it's SQL, maybe it's a

  • variant, maybe it's something else--

  • if it's declarative, then it gives the maximum ability to

  • the execution system to actually do

  • the right thing fast.
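
Theo's point about declarative languages can be made concrete with a short sketch (illustrative only; Python's built-in sqlite3 stands in for a big-data engine): the SQL statement states what answer is wanted and leaves the how to the engine, while the procedural version spells out the loop and the aggregation by hand.

```python
import sqlite3
from collections import defaultdict

# Declarative vs. procedural, on a toy table of (product, clicks) rows.
rows = [("search", 3), ("maps", 5), ("search", 7), ("docs", 2)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (product TEXT, n INTEGER)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)", rows)

# Declarative: say WHAT you want; the engine decides how to compute it.
declarative = conn.execute(
    "SELECT product, SUM(n) FROM clicks GROUP BY product ORDER BY product"
).fetchall()

# Procedural: spell out HOW to compute the same answer, step by step.
totals = defaultdict(int)
for product, n in rows:
    totals[product] += n
procedural = sorted(totals.items())

print(declarative)                      # [('docs', 2), ('maps', 5), ('search', 10)]
print(procedural == declarative)        # True -- same answer, different contracts
```

The bigger the computation, the more the declarative form matters, because the execution system is free to parallelize and optimize the plan.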

  • HAL VARIAN: We do have a few minutes for

  • questions from the audience.

  • We have a hard stop at seven because of a plane leaving.

  • But questions?

  • Back there.

  • Speak loudly please.

  • AUDIENCE: [INAUDIBLE]

  • THEO VASSILAKIS: Sure.

  • Privacy and what are we going to do.

  • So the question is, what are we going to

  • do about data privacy?

  • How are we going to make these systems protect people's data?

  • I can give you one view from Google which is, obviously

  • privacy is one of the critical things that we do here.

  • In the sense that if people don't trust

  • Google, none of it works.

  • And I think I would go back to this point

  • about declarative languages.

  • I think in the early stages of the development of analytical

  • systems, you wrote things down to the metal

  • because you had to.

  • There was no other way to do it.

  • And that gave no safeguards for what people

  • did with the data.

  • That you had to give them a code of conduct and say hey,

  • you should only apply it like this.

  • But actually when you go up the stack and up the

  • abstraction level, and you say, look tell me what you

  • want to compute.

  • And the system will actually compute it for you, then you

  • have a lot of opportunity to actually apply policy--

  • privacy policy in particular--

  • in an automatic manner.

  • So I think that ultimately, that's kind of

  • the long-term answer--

  • is that there will be mediation between the people

  • asking the questions, and systems that are executing the

  • queries, that then apply the right policies there.

  • CHARLES FAN: And I think this question mostly applies to the

  • service provider-- the cloud analytics, big data analytics,

  • the service provider.

  • And VMware recently bought a company called Cetas.

  • And we're looking at the same problem.

  • There's customers of various online gaming companies

  • uploading their data into our services.

  • And there are various encryption technologies around

  • to protect the privacy.

  • But I think more important than technology, is really the

  • business model.

  • It's like you're operating an information bank, similar to a

  • bank for money.

  • So you can argue that, why would you give your money to

  • another service provider, rather than

  • keep it in your home?

  • But the thing is, for Google, for Cetas--

  • if they breach that, they cannot continue

  • to exist as a business.

  • So it's in the best interest of the operating company, of the

  • service, to protect the privacy of their customers so

  • that it can continue to exist as a bank--

  • as a service provider, just like banks do with the money.

  • So arguably, in most countries in the world, putting money in

  • the bank is safer--

  • we have the chief economist here, who can tell me if I'm wrong--

  • than putting your money under the mattress.

  • And similarly, it can be argued there is already better

  • protection of the privacy of data if you entrust it to a service

  • provider in most cases.

  • GUSTAV HORN: So in short, trust no one.

  • And I actually mean that.

  • I mean you shouldn't really trust the banks almost.

  • And you shouldn't trust service providers

  • to any great extent.

  • It has to be earned.

  • And this is always proven, time and time again.

  • I think Google has done a very good job.

  • For better or worse, people have argued about how privacy

  • statements can be morphed and changed.

  • I think nothing stays consistent.

  • Nothing will stay the same.

  • It will always change.

  • So the minute you let any of your private information out

  • there, even if you believe it to be private, and only

  • amongst a small circle of people, I would never make

  • that assumption.

  • If you want to keep something private,

  • you keep it to yourself.

  • That's the only way.

  • HAL VARIAN: Yes.

  • AUDIENCE: I have a similar question for both of you.

  • [INAUDIBLE]

  • Also, the four layers that you discussed, [INAUDIBLE]

  • GUSTAV HORN: You can have privacy in the data.

  • You can have control of your data.

  • For example, you can encrypt the data on the drive.

  • So you can encrypt the data throughout the path.

  • But if you hand the key to that data out to anyone, then

  • you've already basically unlocked the door.

  • So I think compliance, privacy, protection of data,

  • that's a very tough problem.
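
A minimal sketch of Gus's point, assuming the third-party `cryptography` package is available: encrypting the data at rest is the easy part; whoever holds the key can read everything, so handing the key out is "unlocking the door."

```python
from cryptography.fernet import Fernet  # assumes the `cryptography` package is installed

# Encrypting data at rest: the ciphertext is safe to store or replicate,
# but any holder of `key` can decrypt it -- key handling is the hard part.

key = Fernet.generate_key()
cipher = Fernet(key)

record = b'{"patient": "p-1001", "visits": 12}'
ciphertext = cipher.encrypt(record)         # what actually lands on the drive

# Only a holder of the key recovers the plaintext.
assert Fernet(key).decrypt(ciphertext) == record
```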

  • HAL VARIAN: I think that Theo's point was really an

  • excellent one.

  • That you can build a lot of this

  • compliance into the system.

  • So you just can't link this with that, unless you have

  • some specific override from higher up.

  • And one of the advantage of having those declarative

  • languages is exactly that.

  • That the system can enforce compliance in

  • ways that humans can't.

  • THEO VASSILAKIS: Right.

  • I think I would sort of see privacy as a special case of

  • compliance.

  • You want the data in your organization, in your cloud,

  • to be managed according to a set of rules.

  • Now those rules will sometimes be about protecting

  • individuals.

  • But sometimes they'll be about financial regulations, and

  • whether revenues can be viewed by certain people, or modified

  • by certain people, or whatever it is.

  • And so exactly.

  • I'm actually responsible-- or was up until recently-- for a

  • lot of our billing-related computation.

  • And those are a lot of the questions as well.

  • Who gets to be able to touch any of that

  • data along that path.

  • And I think I agree with you, trust no one.

  • But I would apply that more to people than to systems.

  • I hope that we can get the systems where the systems are

  • proven over time to have the right behaviors.

  • CHARLES FAN: Right.

  • And to the four layers, I do think compliance cuts across

  • the entire stack--

  • the entire environment.

  • From my experience, I had more experience with

  • the bottom two layers.

  • Compliance is certainly big with both of those layers.

  • But I can imagine there're probably things on the

  • application layer that you need to pay

  • attention to as well.

  • OK.

  • HAL VARIAN: Well, on that note, let us thank the group,

  • and thanks for coming.

  • Thanks to all of you.
