LEE FLEMING: Good evening. I am really pleased to welcome you all to "Leaders in Big Data," hosted by Google and the Fung Institute for Engineering Leadership at UC Berkeley. I'm Lee Fleming. I'm director of the Institute, and this is Ikhlaq Sidhu, chief scientist and co-founder. The first and most important thing is to thank Google for hosting the event. So thank you very, very much. There are a couple of people in particular, Irena Coffman and Gail Hernandez-- thank you-- and also Arnav Anant, our entrepreneur in residence at the Fung Institute. So here's Arnav. AUDIENCE: A lot of work. LEE FLEMING: Huge amount of work. The Fung Institute-- we were founded about two years ago. And the intent is to do research and pedagogical development in topics of engineering leadership. We have our degree, the professional Master of Engineering (M.Eng.) program, centered mainly around the Institute. We also have ties, though, across the campus, as you'll see shortly. It is our intent to have a series of talks on topics of interest to engineering leaders. As it turns out, this Wednesday we have our next talk. It's sponsored by [? Thai ?] and the Fung Institute. And the topic is entrepreneurship-- being an entrepreneur within your firm. And fittingly, we have representatives from Google, and Cisco, and SAP. That's Wednesday. Consult the Fung website or the [? Thai ?] website for details on that. So besides enjoying a good discussion tonight, we have an ulterior motive, as you can probably tell. We're trying to advertise all of our fantastic programs in big data at Cal. Now, whether you're interested in computation, or inference, or application, or some combination of those things, we've got the right program for you. As I mentioned, the professional Master of Engineering, or M.Eng., across all the different engineering departments-- a one-year degree. We have another one-year degree in the stats department-- a professional degree. There's a two-year degree in the Information School. And finally, there's the Haas MBA. Tonight we've got people from all these programs. You can find their tables, ask them questions, and hopefully we'll see you at Cal soon. We also have executive and other programs associated with each of those departments and schools as well. Ikhlaq will now introduce our speakers. IKHLAQ SIDHU: OK, thanks. So let me see. LEE FLEMING: Just slide this here. IKHLAQ SIDHU: All right. Welcome. I want to also thank a couple of people. One is [? Claus Nickoli ?], who is not here at the moment, but to you in the ether-- he's just not at the meeting. But he's our host here, and so thank you. You guys can tell him that I thanked him. And also, many of you I've seen here are basically friends, and so thanks for coming. It's good to see you again. This is an event on big data. And so I'm going to give you a little data on who is speaking today-- who is here. And the way I think of this is, what we've got is three perspectives on big data from leading firms-- from people who represent leading firms in the area. And so let's start with NetApp. We've got Gustav Horn. He is a senior consulting engineer with 25 years of experience. And he's built some of the largest enterprise-class Hadoop systems in the world-- on the planet. And from Google, Theodore Vassilakis, who is a principal engineer at Google. He's the head of the team that works on data analytics.
And he's been responsible for numerous contributions to Google around search, and the visualization and representation of the results. And from VMware, Charles Fan, who's senior VP of strategic R&D. He co-founded Rainfinity and was CTO of the company prior to its acquisition by EMC in 2005. And our distinguished set of speakers is moderated by our distinguished moderator, Hal Varian. He is chief economist here at Google. He's an emeritus professor at UC Berkeley and the founding dean of the School of Information. So with that, there's hardly anything more I could possibly say. Come on up, Hal, and take it away. HAL VARIAN: Thank you. I'm very impressed with the turnout tonight, seeing as you're missing both the debate and the baseball game. But at least it eliminates a difficult choice for many people. I will say that I'm going to follow the same rules as the presidential debates. So no kicking, biting, scratching, or bean balls are allowed during this performance. We're going to talk about foreign policy-- wasn't that the agreement? No. All right. In any event, what I thought we'd do is have each person talk for about five minutes, lay out their theme, where they're coming from, what their perspective is on big data. And I will take some notes, and then ask some questions, get a conversation going. And I think we'll have a little time at the end for some questions from the floor. So, take it away. THEO VASSILAKIS: Sure. So, should I start, Hal? HAL VARIAN: Yes. THEO VASSILAKIS: All right. Well, hey, it's a real pleasure to be here. Thank you guys also, and thank you guys for coming. It's a huge, huge audience. Just a couple of words. As you heard, my name is Theo. I lead some of our analytical systems. So I'm responsible-- well, actually, up until two weeks ago, I was responsible-- for a stack that had parallel data warehousing components, query engines-- pieces like Dremel and Tenzing-- that let you query this data, and visualization layers on top. And that's one of the many, many systems at Google that, from the outside, one would think of as big-data systems. And so I'll try to give you my perspective, at least, on the Google view of big data. And hopefully someone will cut me off when it's time. I think I'll probably go for five minutes. This could take a while. AUDIENCE: [INAUDIBLE] THEO VASSILAKIS: All right, sounds good. Thank you. I think, as you guys know, Google's business is primarily about taking data and organizing the world's information, and making it universally accessible and useful. So a lot of what the company does is really about sucking in data-- whether it be the web, whether it be the imagery from Street View, or satellite imagery, or maps information, or Android pings, or you name it. And then transforming it into usable forms. So really, Google is kind of a big data machine in some sense. And I think the term big data came into currency relatively recently. And we all said, yeah, OK, that speaks to what we do. Because we didn't really have a word for it. We just kind of knew that the data was large. But just to try to put maybe more structure onto that, I think the Google view on a lot of "what is big data processing" kind of splits up into probably what I would call ingestion types of processes-- things like the crawlers, things like all those Street View cars running through all the streets of the world.
And then it goes into transaction processing systems, where perhaps we capture data through interactions on a lot of our web properties, or a lot of the web properties that we partner with. This means people clicking on search, or people interacting with docs, or people interacting with maps. All generate many, many clicks and many, many interactions that then become transactional big data. Of course, that also includes people using, let's say, Google Analytics on their sites to measure traffic on their properties, which then generates huge volumes of pings into Google-- many tens of thousands of QPS of pings. So that's kind of the second big component. And then probably the third component is the processing side of all of that. The processing side includes things like MapReduce analysis-- generating insights from that data. Maybe in the form of building machine learning models. Maybe in the form of building, for example, Zeitgeist top queries that can then be served out to the world to say, hey, here is what people are searching for. Maybe in the form of n-grams of all the books that Google scanned over many, many years of its ingestion processes. But it's really baking all of that information and then presenting it in some usable form, either through a system such as our ad system, which takes models and decides what ads to show, or in a more direct form such as the n-grams. Just to say, OK, here are those three broad classes-- ingestion, transaction processing, and analytical processing. To dig a little bit deeper into each of those areas, I would say the ingestion processes, especially the very large scale ingestion processes, are highly custom systems. If you think about our web crawlers, if you think about the Street View cars, if you think about maps stitching, or satellite imagery stitching-- those are very, very custom processes that I think, at least to this date, don't have a clear analog in the general industry. And maybe this is something that you guys might address or might see differently than how I see it. They're still highly specialized systems that produce very large images. And they're very high performance, very complex systems that are run by dedicated engineering teams. The transaction processing systems, or the storage systems, are things like the Google File System. These are things like Bigtable. These are things like Megastore. Those are the ones that we've actually published papers about and that are now reasonably well known in the industry. They have evolved a little bit past the purely custom stage, to where they're fairly general purpose. And there was a time at Google where actually most people did their own storage in some form or another, until these GFS-like systems evolved to the point where they were good enough that more than one team could use them. And actually, that evolution had many steps in which, for example, everybody ran their own GFS. And so maybe the ads team had their own GFS cells, and the search team maybe had their own GFS cells. And in time, the systems matured to the point where actually we could have a centrally managed file system. And I think recently you may have seen, we've now talked about this global system called Spanner, which takes that to yet another level of transactions and global availability. And then the third step, which is I think still in a relatively immature stage compared to some of the storage systems, is the analysis.
And I think a lot of people know about MapReduce and some of the systems that have been built on top of that. So, for example, Flume is a way of chaining MapReduces in a more programmer-friendly way, so that you don't end up with 50 MapReduce stages that are individually managed. But rather, you end up with one program that can then be pushed down into many MapReduces that are automatically managed. The process there is still very engineering-focused and essentially requires engineering teams to process this large data. And so I think what we're seeing in that area is the same maturation that we saw in the storage and transaction processing systems. Where, little by little, systems such as Dremel, such as Tenzing-- such as many others inside of Google that we haven't talked about externally-- are aggregating a lot of that usage, and saying, hey, we really should do this in a much simpler manner. And not really require people to have a full engineering team to get the value out of all that big data. Because at the end of the day, that's what Google wants as a whole. And that's what Google's customers want as a whole. How do we get the value out of those big pieces of information? I would just leave you with those three big pieces. And also this idea that this is evolving into a higher-level service that people can use without necessarily being very, very low-level engineering oriented. And that more and more value is being derived out of that-- hopefully something that you're seeing in Google's properties and Google's services. I don't know if I'm over, but I can hand over here. Thank you. GUSTAV HORN: I'm Gus Horn, and thanks again to everybody for coming tonight. I know it's a big baseball night and you probably want us to get done quick. I come at it from a different approach, in a sense and feel, because Theo has-- Google has-- really been at the forefront of big data, big data analytics, and in particular Hadoop and MapReduce. So I'm not going to go on the premise that everybody in this room understands what MapReduce is, or what big data is, and what data scientists are. These are all buzzwords that are really evolving. I think what I've found in my travels globally is that we're really at the forefront right now of big data analytics. I have a presentation that really characterizes it more like a tsunami of data. It's relentless, and it's coming at us. It's coming at us from our Android phones, from our iPhones. It's coming at us from cameras that are everywhere, from our TiVo boxes, from our PVR boxes, from everything we do and touch in our world today. We're generating data. And the question is, do we either let the data fall on the floor-- and we do nothing with it-- or are we going to pick that data up and actually do intelligent things with it? And we're finding more and more commercial applications. Google I look at from a pragmatic perspective. It's a commercial entity, but they are having a much more philanthropic and broad approach to the world as well. It was great back in 2003 that they defined GFS and gave us MapReduce, which brought us back to the mainframe days of old IBM. But this is basically what it feels like to me, right? Because it's batch-oriented processing at that point, when we're talking MapReduce jobs. But basically that was the genesis, or the beginning, of what we call Hadoop as we know it-- the Facebooks, the Yahoos, the LinkedIns-- all of these companies that are embracing this technology.
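For anyone in the audience who hasn't written one, it may help to see what this batch model looks like in practice. Below is a minimal sketch of the classic word-count job against the open-source Hadoop MapReduce API-- essentially the standard tutorial example, not anything from the panelists' own systems; the input and output paths are placeholders:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /data/in
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /data/out
    job.waitForCompletion(true);
  }
}
```

Chaining many stages like this by hand is exactly the bookkeeping burden that pipeline layers such as the Flume system Theo mentioned are meant to hide.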
But now we look at companies like Progressive Insurance, where they're giving you these dongles to plug into your car. They're generating data. They're collecting data on your habits-- your driving habits. The health care industry is looking at how often you see the doctor, what your statistics are. I was at the Mayo Clinic recently, and they have a human genome initiative where they are looking at all of their patients. And they're actually doing a full genetic map of all of their cancer patients. And they're following these people for their entire life expectancy. And they want to keep their data for 25 years, post mortem. They want to build a repository where they can understand exactly how one genetic mutation affects your propensity to be carrying a disease. Because they recognize that diseases aren't just on or off. There can't just be one mutation that gives you that problem. It's your environment plus the mutation, and that builds a susceptibility. They're trying to really paint a huge picture, and that's a big data problem. So I see big data problems in health care. I see big data problems in consumer-related industries, whether they be the Walmarts, the Targets. And not everybody is trying to be evil about this. If you think about Target or Walmart, they would much rather show you an advertisement that you care about than bore you to tears with something that doesn't matter. Just as Google doesn't want you to see a pop-up ad for baby diapers if you're 60 years old and you're not going to have a baby. It doesn't do them any good; it doesn't do you any good. There are a lot of positive things to take away from a lot of this big data, and there are some negative things, too. I'll focus on the positive, in that I look at what companies like the auto manufacturers in Europe are doing. You look at BMW. All of these cars are data-generating monsters. And nowadays, you don't even know when you have to go for an oil change, because they're predictively analyzing the fluids in that car. And they're determining when it's time for you to get that oil change. It's not like, oh, I have to do it every 4,000 miles. Your car tells you when you need to get it done, because of viscosity changes and because of analytical testing. And they're collecting all of this data. I think we're very lucky that we are at this forefront. And I think that big data-- big data scientists-- are going to become more and more important. And I think that, as Theo said, it's going to get to the point where you don't have to become a MapReduce job expert. You really need to become a logical thinker and be able to articulate the questions you're asking against a data set, where you don't even care where the data came from. You just know that all the data is in there. And that's the key-- to have a repository that's able to hold all the data, to allow for this kind of processing to take place on that data, and to produce results in a timely fashion. And what I've done is, I'm approaching it from more of a corporate perspective, where people are looking at enterprise-class systems, versus what we call white box, or dirt cheap. And there are different kinds of cut-offs for companies. And I think as you go through your process at UC Berkeley, and you're learning about where you want to go, you'll see that you have to pick and choose your battles when it comes to big data.
And the battle you have to choose is: am I going to be setting up my data centers and my infrastructure to support commodity-based platforms? Do I want to own all the data internally? Do I want to virtualize the data in the cloud? At what point do I bring that data internally? Do I want to use services from Google? These are all inflection points where you are going to be making decisions over the next five years. And this is what I'm dealing with all the time. I think, hopefully, we all learn a lot from this experience. CHARLES FAN: Thank you for coming. My name is Charles. And unlike the presidential debates, I agree with what they just said. Big data is like an elephant, and we are each touching this elephant from different angles, from different perspectives. But before that, I'll just build on what Theo and Gus mentioned. First, I think the Internet is pretty big in terms of its impact on our lives. And not only on our lives, but also on enterprise IT. And I think what we have seen in the last 20 years has been the repeated tidal waves caused by the Internet and the leaders in the Internet space, including Google-- the advances they are making, and how those are hitting the enterprise world. And I think big data is the latest such tidal wave. Essentially, the scale of data that the Internet providers are dealing with on the consumer side, the enterprises are now facing the same. And now the challenge is, how do we adapt and massage this technology so it's consumable by the various people inside the enterprise world? And that's what's behind the big data world we see. And I think, like Gus said, enterprises are working in different sectors. There are people doing retailing-- selling stuff. There are people doing manufacturing-- building cars. There are people in health care. There are people doing financial trading. Almost every field is generating more and more data. And almost every field has many questions they need to ask based on those data, and decisions they need to make based on those data. And unlike the data warehousing and BI world, which has been around also for 20 years, the amount of data, the variety of data, and the speed of data coming at you are going beyond what the existing infrastructure can take. And that's why, to answer these different questions in different verticals, everybody is seeing a need for new infrastructure-- a new database, a new storage layer-- to be created to support decision making based on all these data. What's different about those data, besides just the size or the volume? When people refer to big data, they typically call it the "three Vs"-- the volume, velocity, and variety of data. Some of them call it the "four Ss": the source-- there are more data sources-- the size, the speed, and the structure of data that are very different. And I have another name for it, which is probably less elegant, but I also think it's pretty true. When we look at the old data-- the small data, or the classic data-- it's typically record-based data, especially that generated by transactional applications. It usually has a person generating it. And it goes through a whole life cycle. So we typically call it CRUD data, which you need to create, read, update, and delete. I'm sure all of you Berkeley students know the CRUD acronym. You manage it on the storage front. You also have databases designed for it. But with the new data, more and more of it is machine generated.
We just have more and more devices that are connected to the Internet. Not all of them have a warm body sitting behind them. There are servers, as well as sensors, RFID, mobile devices, cameras, and so on. And-- like the Google cars-- they're all generating tons and tons of data, without people sitting behind them. You still need to create the data, but you don't update it that much. It's usually write-once, read-many type data. So there's not much update. And there's not much delete. You need to retain data 25 years after people die. And even after 20 years-- 25 years-- people don't remember to delete it. So there's not much delete, not much update. There's a lot of appending. So instead of CRUD, now it's create, replicate, append. There's more and more append-- all the data in append-only mode. And process-- there's a constant need to process the data in real time, during ingestion, or interactively. So it's just CRAP data, is what big data is. It's C-R-A-P-- create, replicate, append, process. And when we are talking about structured data versus unstructured data, we say there is more and more data that is unstructured rather than structured. I think it's just because the database technology, or the underlying technology, is not scalable enough to put it in a schema or in some kind of structure. That's why it is all CRAP. But you still need to process it in a more efficient way. And that causes a lot of your challenges. I think, essentially, whoever designs the new data management system for CRAP and makes it consumable by enterprises is going to be the winner of this big data race. GUSTAV HORN: So Google invented the new crapper? HAL VARIAN: Yes, OK, thank you for starting us out on such provocative comments. I wanted to follow up on your little troika there, with the ingestion, transaction, and analytical. I come at the end of that food chain. So what we get is, the data's been pulled in, the data is available to us, and we're working on the analytical side. I want to say a few words about that. When we have these analytical systems at Google, one of the things you can do is just monitor the system and make sure everything's running the way we expect it to. And these guys have done a fantastic job, because now you can take almost anything that's gathering data at Google and create a dashboard with about 20 minutes of work, which is a fantastic thing for running the business. The other thing you can do is build the machine-learning models that he alluded to and engage in this kind of predictive analytics. That's very in vogue these days, and it's a great thing to do. But the thing that a lot of people miss, I think, is that you can use that data to conduct experiments. And that's really the secret sauce at Google. Our leader of the search team, Amit Singhal, said that a couple years ago we did over 5,000 experiments with the search algorithm-- made 400 changes. On the ad side, we're running roughly 500 experiments at any one time. Any time you're logged into Google-- or any time you're accessing Google, I should say-- you're probably in a dozen or more experiments. And having the capability to manage that data-- not just for the current incarnation of the system, but for all the variations you might contemplate-- is really a fantastic help in moving the whole system forward. So that experimentation role is very, very important at Google. I wanted to raise a question of standards and interoperability. You mentioned Hadoop.
That's really become an industry standard. Here at Google, we have our own internal stack. It's a lot easier to enforce these standards for interoperability internally than industry wide. But to make this system work-- starting with ingestion and transactions, and then the analysis-- outside of Google, or outside of other big data companies, you've got to have these kinds of standards to interconnect the flow of data. And Charles, why don't you say a few things about what's going on in that area. CHARLES FAN: I do think we are at the early stage of this industry. And right now there are no standards, per se, that have emerged, to my knowledge. Hadoop has been a very popular technology, born out of the open source community's effort-- based on the Google papers-- to recreate MapReduce and GFS, as well as the other things built on top of them. And I think, in lieu of standards, my perspective is, open source plays a huge role here. In terms of overall data management, as I mentioned, we are going from a world where everything is relational-- you basically have your relational data model, which is the standard across all, with SQL being the standard query language-- into a more chaotic world, where there are many kinds of data stores, many kinds of queries. Even on Hadoop, there are various ways you can query on top of it. And open source really gives people the choice. In this chaotic period, it is the choice. It's basically the developers and users who are going to decide what will become the standard. And open source really provides the way to make that happen. GUSTAV HORN: I just want to make one comment. I think open source actually is the best way to make sure that you don't get yourself pigeonholed into anything that's proprietary. And I think that with Hadoop and big data, as I look five or ten years down the road, standards aren't going to provide structure. They'll be more of an inhibitor than a benefit in this area. I think one of the key attributes-- and maybe you can talk more about that-- is the fact that you want to be able to connect or stitch together a bunch of disparate data sets. You want to be able to look at things where you don't have to be rigidly defined by the standard. You want to be able to look at strange queries where weather patterns, and people's buying habits, and the cars they drive have some correlation. And if you start imposing standards on top of something that is that robust, I think it's probably going to stifle development. So I think the key here is open source. The key is to have published innovations, so that people are publishing their works. And I think as we get better and better at natural-language processing-- at being able to get away from having to be hard-core programmers to glean insight from any of these data stores-- it's going to be more beneficial. I think in the next decade you'll find that you'll probably be doing less and less Java programming and more and more just natural-language logic, I would think. HAL VARIAN: Theo, I hope you're going to say a word or two about protocol buffers. THEO VASSILAKIS: Protocol buffers, yes, of course. I'll plug protocol buffers for sure. HAL VARIAN: Which you made as an open standard, right? THEO VASSILAKIS: Right. It's actually an open source system. But before that, I was actually going to say I really agree with your point about experimentation.
And I actually remember a time at Google where, if you wanted to run an experiment-- for example, on search-- there was one engineer, who is one of our distinguished engineers now, Diane. And you had to go ask her for some cookies on which you could run your experiment. It was sort of like, she would allot you some cookies. Those days are over, but they really do generate a lot of this CRAP data, because all of those experiments accumulate over the years. And yet it's really important to have the historical view of, hey, we tried this; here's what happened then. And I think this actually plugs directly into the problem of standards, because the way that all of the engineers years back recorded their results was very, very different from the way that engineers today record their results. So maybe, at the time, some of them didn't have protocol buffers-- which are, if you like, a kind of XML-like format for representing data that Google created, but a much more efficient representation. And so I think the problem comes because we want to integrate all of this variety of data. And what I would say is, I agree with Gus that I don't see a lot of appetite for very generic standards. But I do see people having a need to bridge all of their old data and the new data. And I would basically draw two analogies here. I think one of the things that really helped the development of data warehousing was fairly standard SQL. And it was never a standard standard. Like, there existed a standard, but no one really followed the standard very closely. But if it was close enough, you could get your systems to work. And I think the other aspect is file formats. If you can take a file format and feed it into different systems, that will really help. And until now, CSV has been the be-all, end-all file format for interchange. I think we'll see more of these as we need to trade data that's more structured-- that has protocol buffers or XML.
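To make the comparison concrete: where XML or CSV describes each record as text, a protocol buffer starts from a small schema that is compiled into efficient binary encoders and decoders. A minimal, purely hypothetical message in the open-source proto2 syntax might look like this (the message and field names are invented for illustration):

```proto
// Hypothetical schema for one logged search interaction.
message SearchEvent {
  required string query = 1;          // what the user typed
  optional int64 timestamp_usec = 2;  // microseconds since the epoch
  repeated string clicked_urls = 3;   // results clicked, in order
}
```

Because the numeric field tags, not the field names, go on the wire, readers can skip fields they don't recognize-- which is part of what makes a format like this workable for bridging old data and new. CHARLES FAN: And if I could, let me add a plug for VMware as well. As we mentioned, I think we are agreeing that we should allow the chaos to continue for a little while. However, there are certain parts where I think we can help people make things easier-- which is how you stand things up. Hadoop is a great system, but as Gus can probably tell you, it's not so easy for enterprises to stand up a Hadoop cluster. Often the enterprise needs to stand up many of those Hadoop clusters. And some will need to stand up other types of data stores. And that's where VMware is a leader in virtualization software and cloud infrastructure. And we are building tools, including an open-source project called Serengeti, which is helping people easily stand up their Hadoop clusters, as well as other data stores-- really automating some of those headaches, or tough work, so they can focus on the work that matters. HAL VARIAN: Let me put in a good word about standards. Because when you look at companies, how do they grow? They grow through acquisition. When they grow through acquisition, you end up with data silos everywhere. And data silos are the enemy of big data. And the amazing thing about Google, because of the work that Theo and his team do, is we have no data silos at Google. Now that's not 100% true, of course, but when we bring an acquisition in, we spend a lot of time trying to get their data infrastructure aligned with our own internal infrastructure.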
And what it means is, you can basically pick an engineer off of one project and move them onto another project, completely on the other side of the company. And they're productive in the first week, because of that standardized infrastructure that we have. And that is not something most companies have the luxury of. The biggest problem that most companies face in data management is trying to get this interoperation among the different legacy systems. You know, there's this old line: how did God create the world in only six days? And the answer is, he didn't have a legacy system to worry about. So everybody in the business faces this. How is that going to be solved? That's my question. How do you solve that? GUSTAV HORN: I think you're right. There are a lot of heterogeneous databases and a lot of things that need to be stitched together. And I think that with big data-- again, from the Hadoop perspective-- there are lots of connectors out there, from Flume, from Sqoop. And I think that's key. You'll find that a lot of these big database companies are having to embrace open source. They're having to embrace Hadoop, because if they don't embrace it, they're going to become roadkill. So they're looking for ways to monetize it, from consulting services and things like that. And also, how can they play in this market and become leaders in this market, so they retain their customer base? Because the bottom line is, the Oracles of the world, the SAPs-- these people make money through selling licenses. Hadoop is a license killer. So that's going to directly impact their ability to be profitable from a stock market perspective. They need to find ways to innovate that allow them to keep that trajectory. And then the other thing I would say is that a lot of times the biggest problem I've found in industry, when I go meet with big customers or potential customers, is that they don't know where to start. They have a huge data problem, not just a big data problem. They have data everywhere, and silos in different corners of the organization. And they don't have one person who is competent enough from a technical perspective to know how to move forward. They have individual islands, or teams, that are looking at how they can move forward. And the real strength in big data and big data analytics is the heterogeneous nature of the data. That's one of the key strengths of this entire industry-- the fact that you want to stitch together all of these different data sources, and then be able to find those correlations among them. It doesn't do anybody any good to do a structured database in Hadoop, where you're just doing the same old thing. What's the benefit? There is no benefit. The benefit is when you're able to combine all of these sources into one place and you find that needle in the haystack. Or you're able to better understand your customer. Because fundamentally, all of these things are customer driven. I don't care whether it's Google. I don't care whether it's VMware. If the customer isn't happy, they're not going to come back. They're not going to like your website. They're not going to like your product. So the bottom line is, how can you find ways to modify what you're doing to make it better for the customer?
And if you're able to find those needles because you can stitch together all of these different sources-- including social media, including global search engines and global communities-- and find out what people are doing, you'll find those subtle differences that really become the real game changer. And that's really what big data is about. CHARLES FAN: Yeah, and I think another way I'll dissect big data is to look at it as four layers of functionality. At the very top are the big data applications. Then the second layer, which is big data analytics-- the various machine learning and other algorithms you can apply. The third layer is big data management-- the query engines and so on that let you query the data. And the bottom layer is the data infrastructure-- the storage and so on, where you store the data. To the question: the lower the layer, I think, the closer it is to standardization. I think there is-- maybe to Theo's comment-- there probably can be a unified big data store, where all the bits, all the CRAP, eventually end up somewhere. There's a sink, a common sink, for all the CRAP. And it comes in there. I think right now we should still allow various different ways for it to be queried. Even in our Hadoop system, some people like to use Pig, some people like to use Hive, some people like to run HBase directly on HDFS. And Dremel is another way you can interact with it. And I'm sure there are new innovations coming out of Google-- out of everywhere in the ecosystem. And like in [INAUDIBLE]. When I talk about standardization and chaos, sometimes I'll go back to history-- for me, it's Chinese history. For those of you who have read the Chinese novel called "The Romance of the Three Kingdoms," the first line of the novel is, "After unification, it's chaos. After chaos, it's unification." And it's describing how, out of all the warlords fighting in the chaos, inevitably somebody will emerge from the struggle and unify the land. And that will be your emperor. And also inevitably-- whether it's after he gets old, or after he dies and his heirs grow weak-- it will fall back into chaos. And these are the traditional dynasties, repeating about a dozen times across 4,000 years of Chinese history. And I think that can apply to history anywhere else. As well, it can apply to the data processing and data management here. We are in this period of going from a more unified SQL interface-- more unified data management query engines-- to a more diversified world. But I would predict that in ten years, there will be leading standards, or ad hoc standards-- de facto standards-- that are going to emerge, and the majority of big data problems are going to be solved in that way. THEO VASSILAKIS: Yeah, I agree with that. I don't know if it'll be in the form of a W3C standard or something like that, but I think that's a little bit the dynamic that Hal was referring to inside of Google. That after n years of fighting with all of the different varieties of things, people kind of said, well, we understand now that it's not the purpose of our team here over in maps to really build that entire stack. Because now that we know what all that entire stack entails, we realize that it's really far too big for us to do on our own. And so we're willing to concentrate further up the stack, on the parts that we really care about.
And that then led lots of groups at Google to look around and say, OK, well, what is a piece of technology that exists, is reasonably mature, that a lot of people use, and that gives us this advantage? And so that's how some of the components, such as Dremel and others, emerged as de facto standards for how we analyze data. And I think those de facto standards will, in time, probably lead into some kind of more formal standards that can be adopted across companies and across organizations. HAL VARIAN: Let me switch gears and turn to the infrastructure-- the hardware infrastructure. So there are two models out there. You can buy your infrastructure, and people to maintain it, and run it in-house. Or you can lease it on the cloud. What do you see as the advantages and disadvantages of those two approaches? GUSTAV HORN: I think that there's a place for both, to be honest with you. I think you'll find that the cloud is a great place to get started. It's a great place for you to kick the tires. I think you're always going to have the open source-- what I call white box, commodity-based-- approach. And a lot of groups where you're going to be doing your sandbox, your proof of concept, testing out your code, from an infrastructure perspective. And also I think that there's a place even for what's being done over at VMware, where they're looking at fundamentally providing an infrastructure and product in a box, so that people can go to service providers and spin up MapReduce clusters and build their file systems. At some point in time there is going to be, again, like I said, a decision where companies are either going to embrace the technology because their internal leadership, or a leader within the company, has proven the value of it. And that's going to be the tough slog that everybody in this room is going to have to deal with over the next five years-- you're going to be battling internal processes, internal fights, within every organization that I've met. Where you have the legacy database people-- the people who say, this is how we do it, this is why we do it. We have these checks and balances. We have these constraints. That data has to stay within our walls. And then you're going to have the leaders, who are more aware of what's available in technology-- with virtualization, with cloud-based technology. And in some cases, it does make sense. There are regulations and laws that are going to dictate where data resides, or where it can reside, or where it has to be. And there are going to be places where the cloud is going to be paramount. But you're going to find in the next five years that you're going to be fighting more political battles than doing anything else. THEO VASSILAKIS: I agree with that. I think there will certainly be lots of ways to run infrastructure locally, as well as on the cloud. I think, though, that what people will realize over time is that a lot of the reason why it may sometimes appear cheaper to run locally than on the cloud these days, is because with cloud services, you get a lot of services by default. So perhaps you would get backup by default; perhaps you would get certain compliance functionality by default. Whereas on a reasonably bare machine, perhaps in your own data center, you wouldn't get these automatically.
And I think over time, as more of this computation becomes a commodity-- in that you just expect it to work, and that's it-- you won't be able to live without some of those things that are today considered value-added services. And I think there will be a crossover point where it'll start to be more expensive to actually do all of these things on your own appliance than it will be to do it at scale in somebody's data center. And I think the fatter and fatter pipes that connect us to these data centers are going to make that a possibility. HAL VARIAN: Go ahead. CHARLES FAN: Again, in the anti-presidential-debate style, I agree with both Theo and Gus. And VMware's view is that it's a hybrid cloud world, where we want to provide the same benefit to customers whether they are running things in their data centers or out of a cloud service provider. All that being said, I do think there will be an increasing amount of infrastructure moving out of the data center, over time, to cloud services. Meaning the applications will be more and more delivered as a service to the enterprise customers, as opposed to as packaged software, like today. That will take time, but I think that will happen. But even after that happens-- even after the infrastructure is outsourced, so-called, to the cloud service providers-- ownership of the data, of the big data, medium data, small data, still is with the enterprises. And it is still their responsibility to be able to make their decisions based on the data that they own. Even if some of the data may be sitting at the service provider, at the cloud provider, it is still their responsibility to analyze those data and to make decisions based on those. THEO VASSILAKIS: And clearly, security is going to be one of those big items. And so if anyone's working on cryptography, that's going to continue to be a pretty hot thing. HAL VARIAN: It's always good to have a job where there's an adversary. Coming back to the elections again-- same model. Let's come down to the query language. We've seen SQL mentioned a few times. What about NoSQL? Tell me, what's the role of that in today's world? Is SQL going to be obsolete? Or are we going to continue to rely on that as our query basis? CHARLES FAN: OK, I'll start. I'm sure Gus and Theo have more to add. I think NoSQL is sort of part of this same chaotic phenomenon that we are seeing. It's driven by a few factors. Still, by far, SQL is the most popular query language today. But NoSQL was born out of the need of people looking for more flexible schemas. They're developing applications, and sometimes they have the data stay the same, but they want to structure it differently-- and they want to do that in an easier way. And they want to relax some of the consistency requirements of their databases so they can deal with scale in a much easier and much better way. So it's basically driven by various different needs. And different flavors of new query models emerged. And there was no better name, so the easiest one is to call them what they are not-- it's just NoSQL. I do see that there is a strong trend in terms of developers embracing them. But again, there are no clear new winners among the query languages. And I think in different companies there may be different preferences being set up. It doesn't mean five, ten years from now, there won't be one. I think right now, the model is to
let the developers decide-- let the developers of the world decide-- whether there is a newer query language that can replace SQL. GUSTAV HORN: I would only say that SQL is going to be around for a long time to come. I still run into companies that are running COBOL, of all things. It's not going anywhere, any time soon. I think what NoSQL is-- versus SQL, versus any of these other things-- is yet another way of exposing all these internal politics and battles that happen in big industry. You're going to have legacy databases where that's the only way you can talk to them. And you're going to have next-generation things coming out. And if it wins the battle, which I think it will, you'll find NoSQL becoming more and more popular. And you'll find more and more of these aggregate, heterogeneous kinds of data stores becoming more popular-- provided they deliver the answers that they're supposed to. Which just means that they have to be faster. They have to be infinite in volume and size, and able to grow. And they have to never forget anything. That's kind of the key. When we talk big data, I always get a laugh sometimes, because they say, well, we only need a 200-node system. I say, well, that's today. What are you going to do in five years? How are you going to grow that? I mean, the most important thing in big data-- it's not the computational engines. That's the most volatile thing in your big data system. You want to get rid of that old crap anyway, every two years. You don't want to have to then re-migrate all your data. The most important thing-- and OK, so this is a little plug for me from NetApp-- is that the data is what's important. The thing that it runs on is the most volatile, or least important, thing. It's the thing that you want to be able to flush out, and read, and make faster-- over and over again-- provided that the data stays and you don't have to move it. Because moving stuff is a waste. And at Google, you don't want to be moving data either. That's wasted energy. THEO VASSILAKIS: Absolutely. And I agree with that point, that the systems change. Many, many systems have changed over the years at Google. And we've migrated the data forward as the older storage systems have died. But the data is always there. I'm pretty sure that Jim Gray, who's a Turing Award winner, felt like he needed to apologize for SQL in his Turing Award acceptance speech-- sorry for SQL. And as a builder of SQL systems, I think SQL will stay. It's great. But actually, the only thing I would point out about it is, I think its main and most positive attribute is that it's a declarative language. Meaning, it doesn't say how to compute what you want to compute; it just says what you want as the answer. And I think that that's the key characteristic-- whatever the language is, be it SQL, be it something else-- that will be important. Because the bigger the computation is, the more complex the program is that you would have to write if you were writing a real procedural program. And so you're going to need systems to actually turn that into computation for you. So whatever the language is-- maybe it's SQL, maybe it's a variant, maybe it's something else-- if it's declarative, then it gives the maximum ability to the execution system to actually do the right thing fast.
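Theo's distinction is easy to see side by side. In the sketch below, the same question-- clicks per country-- is asked twice: once declaratively, as SQL handed to the engine, and once procedurally, with the scan-and-aggregate strategy frozen into the code. The JDBC connection string and the click_events table are invented for the example:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class DeclarativeVsProcedural {
  public static void main(String[] args) throws Exception {
    // Hypothetical JDBC URL; any SQL engine with a JDBC driver would do.
    try (Connection conn = DriverManager.getConnection("jdbc:example:clicklogs")) {

      // Declarative: state WHAT you want. The engine chooses the plan,
      // the parallelism, and the data movement.
      try (Statement stmt = conn.createStatement();
           ResultSet rs = stmt.executeQuery(
               "SELECT country, COUNT(*) AS clicks "
                   + "FROM click_events GROUP BY country")) {
        while (rs.next()) {
          System.out.println(rs.getString("country") + ": " + rs.getLong("clicks"));
        }
      }

      // Procedural: fetch the raw rows and spell out HOW to aggregate them.
      // Correct, but the strategy (single-threaded scan, one in-memory
      // hash map) is now fixed in the code, and nothing can optimize it.
      Map<String, Long> clicksByCountry = new HashMap<>();
      try (Statement stmt = conn.createStatement();
           ResultSet rs = stmt.executeQuery("SELECT country FROM click_events")) {
        while (rs.next()) {
          clicksByCountry.merge(rs.getString("country"), 1L, Long::sum);
        }
      }
      clicksByCountry.forEach((c, n) -> System.out.println(c + ": " + n));
    }
  }
}
```

The declarative version leaves the engine free to parallelize or reorder the work; the procedural one does not. HAL VARIAN: We do have a few minutes for questions from the audience. We have a hard stop at seven because of a plane leaving. But questions? Back there. Speak loudly please.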
AUDIENCE: [INAUDIBLE] THEO VASSILAKIS: Sure-- privacy, and what are we going to do about it. So the question is, what are we going to do about data privacy? How are we going to make these systems protect people's data? I can give you one view from Google, which is, obviously privacy is one of the critical things that we do here. In the sense that if people don't trust Google, none of it works. And I would go back to this point about declarative languages. I think in the early stages of the development of analytical systems, you wrote things down to the metal because you had to. There was no other way to do it. And that gave no safeguards for what people did with the data. You had to give them a code of conduct and say, hey, you should only apply it like this. But actually, when you go up the stack and up the abstraction level, and you say, look, tell me what you want to compute, and the system will actually compute it for you-- then you have a lot of opportunity to actually apply policy, privacy policy in particular, in an automatic manner. So I think that, ultimately, that's kind of the long-term answer: there will be mediation between the people asking the questions and the systems that are executing the queries, which then apply the right policies there. CHARLES FAN: And I think this question mostly applies to the service providers-- the cloud analytics, big data analytics service providers. VMware recently bought a company called Cetas. And we're looking at the same problem. There are customers-- various online gaming companies-- uploading their data into our services. And there are various encryption technologies around to protect privacy. But I think more important than the technology is really the business model. It's like you're operating an information bank, similar to a bank for money. So you can argue, why would you give your money to another service provider rather than keep it in your home? But the thing is, for Google, for Cetas-- if you breach that, they cannot continue to exist as a business. So it's in every interest of the company operating the service to protect the privacy of their customers, so that it can continue to exist as a bank-- as a service provider-- just like a bank does with money. So arguably, in most countries in the world, putting money in the bank is safer-- we have the chief economist here who can tell me if I'm wrong-- than putting your money under the mattress. And similarly, it can be argued that you already get better protection of the privacy of your data if you entrust it to a service provider, in most cases. GUSTAV HORN: So, in short, trust no one. And I actually mean that. I mean, you shouldn't really trust the banks, almost. And you shouldn't trust service providers to any great extent. It has to be earned. And this is proven, time and time again. I think Google has done a very good job. For better or worse, people have argued about how privacy statements can be morphed and changed. I think nothing stays consistent. Nothing will stay the same. It will always change. So the minute you let any of your private information out there-- even if you believe it to be private, and only among a small circle of people-- I would never make that assumption. If you want to keep something private, you keep it to yourself. That's the only way. HAL VARIAN: Yes. AUDIENCE: I have a similar question for both of you. [INAUDIBLE] Also, the four layers that you discussed, [INAUDIBLE] GUSTAV HORN: You can have privacy in the data. You can have control of your data.
For example, you can encrypt the data on the drive. You can encrypt the data throughout the path. But if you hand the key to that data out to anyone, then you've already basically unlocked the door. So I think compliance, privacy, protection of data-- that's a very tough problem. HAL VARIAN: I think that Theo's point was really an excellent one-- that you can build a lot of this compliance into the system. So you just can't link this with that, unless you have some specific override from higher up. And one of the advantages of having those declarative languages is exactly that: the system can enforce compliance in ways that humans can't. THEO VASSILAKIS: Right. I think I would sort of see privacy as a special case of compliance. You want the data in your organization, in your cloud, to be managed according to a set of rules. Now, those rules will sometimes be about protecting individuals. But sometimes they'll be about financial regulations, and whether revenues can be viewed by certain people, or modified by certain people, or whatever it is. And so, exactly. I'm actually responsible-- or was, up until recently-- for a lot of our billing-related computation. And those are a lot of the questions as well: who gets to be able to touch any of that data along that path? And I agree with you-- trust no one. But I would apply that more to people than to systems. I hope that we can get to where the systems are proven over time to have the right behaviors. CHARLES FAN: Right. And on the four layers, I do think compliance cuts across the entire stack-- the entire environment. From my experience, I have more experience with the bottom two layers, and compliance is certainly big in both of those layers. But I can imagine there are probably things on the application layer that you need to pay attention to as well. OK. HAL VARIAN: Well, on that note, let us thank the group. And thanks for coming. Thanks to all of you.