Hi, I'm Bill Appelbe, and today, in seven minutes flat, I'm going to explain how Hadoop works, what you can do with it, and what Big Data is. I've done a lot of Big Data projects in Australia, in Canada, and in the United States, and I'm also a Learning Tree instructor.

OK, so why Big Data? Firstly, we all know that governments and businesses are gathering lots of data these days: movies, images, transactions. But why? The answer is that data is incredibly valuable. Analyzing all that data lets us do things like detect fraud going back years. These days, too, disk is cheap, so we can afford to keep all that data. But there's a catch. All that data won't fit anymore on a single processor or a single disk, so we have to distribute it across thousands of nodes. But there's a good side to that. If it's distributed and we run in parallel, we can compute thousands of times faster and do things we couldn't possibly do before. And that's the trick behind Hadoop.

OK, how does Hadoop work? Suppose what I wanted to do was look for an image spread across many hundreds of files. First off, Hadoop has to know where that data is, so it queries something called the name node to find out all the places where the data file is located. Once it has figured that out, it sends your job out to each one of those nodes. Each one of those processors independently reads its input file, looks for the image, and writes the result out to a local output file. That's all done in parallel. When they all report finished, you're done.

OK, we've seen one simple example of what you might want to do with Hadoop: image recognition. But there's a lot more to it than that. For example, I can do statistical data analysis. I might want to calculate means, averages, correlations, all sorts of other statistics. For example, I might want to look at unemployment versus population versus income versus state. If I have all the data in Hadoop, I can do that.
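[Editor's sketch] The job flow described earlier (ask the name node where the file's blocks live, send the job to each of those nodes, let each node search its local data in parallel, finish when all have reported) can be imitated in a few lines of plain Python. This is not the Hadoop API. The names `NAME_NODE`, `NODE_STORAGE`, `locate`, `search_on_node`, and `run_job` are hypothetical, the "nodes" are just threads on one machine, and short strings stand in for images.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the name node: file name -> nodes holding its blocks.
NAME_NODE = {"photos.dat": ["node-a", "node-b", "node-c"]}

# Hypothetical local storage on each node (strings stand in for images).
NODE_STORAGE = {
    "node-a": ["cat", "dog"],
    "node-b": ["tree", "car"],
    "node-c": ["dog", "boat"],
}


def locate(filename):
    """Ask the simulated name node which nodes hold blocks of the file."""
    return NAME_NODE[filename]


def search_on_node(node, target):
    """Run the job on one node: scan only that node's local block."""
    local_block = NODE_STORAGE[node]
    hits = [i for i, item in enumerate(local_block) if item == target]
    return node, hits


def run_job(filename, target):
    """Send the search to every node that holds the file and collect the reports."""
    nodes = locate(filename)
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # Each node works independently; the job is done when all have reported.
        reports = list(pool.map(lambda n: search_on_node(n, target), nodes))
    return {node: hits for node, hits in reports if hits}


if __name__ == "__main__":
    print(run_job("photos.dat", "dog"))
```

In a real cluster the work moves to the machines that already hold the data; here the thread pool only mimics that independence.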
I can also do machine learning and all sorts of other analysis. Once you've got the data in Hadoop, there's almost no limit to what you can do.

OK, we've seen that in Hadoop data is always distributed, both the input and the output. There's more to it than that. The data is also replicated. Copies are kept of all the data blocks, so if one node falls over, it doesn't affect the result. That's how we get reliability. But sometimes we need to communicate between nodes; it's not enough that everybody processes their local data alone. An example is counting or sorting. In that case communication is required, and the Hadoop trick for that is called MapReduce.

Let's look at an example of how MapReduce works. What we are going to do is take a little application called Count Dates, which counts the number of times a date occurred, spread across many different files. The first phase is called the map phase. Each processor that has an input file reads the input file in, counts the number of times those dates occurred, and then writes the counts out as a set of key/value pairs. After that's done we have what's called the shuffle phase. Hadoop automatically sends all the 2000 data to one processor, all the 2001 data to another processor, and all the 2002 data to another processor. After that shuffle phase is complete we can do what's called a reduce. In the reduce phase all the 2000 data is summed up and written to the output file, and likewise for the other years. When everybody is complete with their summations, they report done and the job is done.

OK, we've seen a couple of great examples of how Hadoop works. The next question is how Hadoop compares to conventional relational databases, because they've dominated the market for years. We've seen one big difference, which is that in Hadoop the data is distributed across many nodes and the processing of that data is distributed. By contrast, in a conventional relational database, conceptually all the data sits on one server and one database. But there are more differences than that.
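[Editor's sketch] The three phases of the Count Dates example (map, shuffle, reduce) can be imitated in plain Python on a single machine. This is a minimal sketch, not Hadoop's MapReduce API: the function names, the `YYYY-MM-DD` date format, and the sample files are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def map_phase(lines):
    """Map: count the years seen in one input file and emit (year, count) pairs."""
    counts = Counter(line.split("-")[0] for line in lines if line.strip())
    return list(counts.items())


def shuffle_phase(mapped_outputs):
    """Shuffle: group the pairs by key so all counts for one year land together."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for year, count in pairs:
            grouped[year].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the partial counts for each year."""
    return {year: sum(counts) for year, counts in grouped.items()}


def count_dates(files):
    """Run map on every file, then shuffle, then reduce."""
    mapped = [map_phase(lines) for lines in files]
    return reduce_phase(shuffle_phase(mapped))


# Made-up input: each inner list stands for one file held on a different node.
SAMPLE_FILES = [
    ["2000-01-15", "2001-03-02", "2000-07-30"],
    ["2001-11-11", "2002-05-05"],
    ["2000-12-25", "2002-02-14", "2002-08-01"],
]

if __name__ == "__main__":
    print(count_dates(SAMPLE_FILES))
```

In Hadoop the map calls would run on the nodes that hold each file and the shuffle would move data across the network; here everything runs in one process so the data flow is easy to follow.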
The biggest difference is that in Hadoop data is write once, read many. In other words, once you've written data, you are not allowed to modify it. You can delete it, but you cannot modify it. By contrast, in relational databases data can be written many times, like the balance on your account. But with archival data, which Hadoop is optimized for, once you've written the data you don't want to modify it. If it's archival data about telephone calls or transactions, you don't want to change it once you've written it.

There's another difference too. In relational databases we always use SQL. By contrast, Hadoop itself doesn't support conventional SQL. It supports lightweight, SQL-like alternatives, often grouped under the name NoSQL, but not conventional SQL.

Also, Hadoop is not just a single product or platform. It's a very rich ecosystem of tools, technologies, and platforms, almost all of which are open source and all work together.

So what's in the Hadoop ecosystem? At the lowest level, Hadoop just runs on commodity hardware and software. You don't need to buy any special hardware, and it runs on many operating systems. On top of that is the Hadoop layer, which is MapReduce and the Hadoop Distributed File System. On top of that is a set of tools and utilities, such as RHadoop, which does statistical data processing using the R programming language. There's a machine learning tool. There are also tools for doing NoSQL, like Hive and Pig, and the neat thing about those tools is that they support semi-structured or unstructured data. You don't have to have your data stored in a conventional schema. Instead you can read the data and figure out the schema as you go along. Finally, we have tools for getting data into and out of the Hadoop file system, like Sqoop.

That ecosystem is constantly evolving. So, for example, there's now a new tool for managing the Pig tool called Lipstick on Pig. And there are many more, and that environment keeps being added to all the time. So now we have seen how Hadoop works and what it can do.
I'm sure you've got more questions than that, such as: how do I install Hadoop, and on what platforms? What are the differences between different Hadoop versions? How do I do Extract, Transform, and Load in Hadoop? Answers to those questions are on our website at the following URL. I really hope you enjoyed this video. Take care. Cheers!
What is Big Data and Hadoop? Ron posted on 2015/12/27