The term ‘Big Data’ was first used in a late 90s paper to describe the complexities involved in performing computer graphics calculations. The authors, NASA scientists, remarked upon it in the following fashion:
"data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources."
The problem is that nowadays data grows so fast that not even Moore’s law can match it. Moore’s law is a rule of thumb that states that the number of transistors in an integrated circuit tends to double every year, with transistor density being linked to memory capacity under more. So If technology is being outrun by data, what is there to do?
Think of it this way: if you had to carry 100 kg of bricks from point A to point B, you could have come up with two options:
- The old computing approach in mind, you could go out and search for a Hercules who could shoulder that weight.
- Alternatively, you could ask your 10 nerdy colleagues and get it done in the same amount of time.
The second solution is probably cheaper to accomplish but requires more coordination, just as parallel computing, the current paradigm of big data.
In parallel computing, one uses a cluster of (commodity) machines. Just as in our bricks example, these computers coordinate between each other in order to split up the problem in equal parts and solve their piece in parallel with the others. This is known as distributed processing. The first distributed processing paradigm to come into popularity was Map Reduce.
Map Reduce (or divide-and-conquer)
MapReduce first came into the public domain when it was published by Google in the year 2004 in the paper ‘MapReduce: Simplified Data Processing on Large Clusters’. It was subsequently put into use by (essentially) the whole world (thereby ensuring that Google would in the future think twice before publishing important advances again).
MapReduce can be perfectly explained as a concept outside of the world of computing. I will, however, need to take a few liberties with common sense to set up the following story:
Imagine that you work in a petstore and your boss is (unfortunately for you; fortunately for this story) quite the megalomaniac. Out of the blue he decides that to effectively run his business, he desperately needs a bit of market research done. More concretely, he wants to know how many pets are living in the city, and he wants this information reported per kind of animal! You are the lucky employee tasked with this all-important assignment. He casually mentions that you have only two days as he wants to include the results in his opening speech at the national petstore conference. Luckily (his words), you can count on the help of all of the city’s elementary school students. In this example, these children represent our computing resources.
After a short crisis where you swear you will quit this time, you start to think how you can accomplish this task before the deadline passes. Inspired by MapReduce you come up with an ingenious system. This system requires the creation of 3 ‘roles’:
Every class of children is given responsibility for a distinct sub-quarter of the city. The children go door to door and ask how many pets the household has. Each class has a notebook in which they record the answers they get. The entries in the notebook would start off like this:
…. (thousands and thousand lines long)
As you can see, the kind of animal is mapped to the total number of that animal encountered in a specific household. If they hear that the household has a dog, they write down ‘Dog, 1’; if the household has two parrots they write down: ‘Parrot, 2’.
One class is appointed ‘grouper’. Each child of that class is responsible for a particular type of animal. Johnny, for example, has to count all the dogs. He goes through all the notebooks given to him and copies every line which starts with the word ‘dog’. He writes those lines down in a new, pristine notebook. After having finished, his notebook starts of like this:
….. (thousands and thousands lines long)
Ofcourse, every distinct animal type is taken care of by another child. Annie copies lines starting with ‘Cat’, Bert copies lines starting with ‘Parrot’ etc.
Yet another class is given the role of ‘reducer’. For each notebook they receive from the groupers, they use their TI-84 calculator to sum all the numbers. For dogs they would calculate 1+3+1+.... and write down the result with a large red marker on the cover page of the notebook. When they are finished, they give those notebooks to me, providing me with the end result.
Hadoop Distributed File System (HDFS)
A filesystem should be designed with a clear idea of its eventual use in mind. When we look closer to the mammoth assignment carried out above, we can begin to give some pointers to the sort of capabilities the filesystem should incorporate.
Firstly, storage constraints (similar to computing constraints) dictate that we will probably need to use more than 1 machine. In nice symmetry to the processing step, this means that the filesystem will also be distributed. To achieve this, we will split the data in large pieces and divide them (distribute it) over the machines (nodes) in our cluster. Concretely, every machine will contain the pet data for a subquarter of the city on its hard disk.
However, having each part of data on a different machine gives us a new risk: what if one of those machines breaks down? If you only have 1 computer, the risk is manageable, but with more than 100 computers, the chance that 1 out of those 100 breaks down becomes already quite tangible. In jargon, this is called ‘availability’, you want to minimize the chance that you have corrupted/incomplete data at any one moment. The solution is to replicate the data pieces over the nodes. So, by default, the HDFS will keep 3 copies of every data part (pet data for each subquarter), each on different machines. As a result, you no longer have to worry about availability. The nice thing though is that this solution is so robust, that you don’t need to worry about having high-quality machines to store the data on. You can just use average, beaten-down PCs and the increased risk of machine failure hardly affects your data availability (that’s why I mentioned commodity machines in the MapReduce section).
When continuing the reasoning, you realize that the usage of commodity (read: cheap-ass) machines has suddenly made storage cheap(er). Also, to reduce complexity and save computing resources, data is read sequentially and not randomly. This makes it hard to quickly look up a specific data entry in the filesystem. Because of these two things, data is usually never modified. Once written into the file system, data stays there undefinitely. A different mindset regarding data management makes this possible: if a family has 2 dogs and one year later one of those dogs dies, you would need to change that family’s data entry, right? Not necessarily. In fact, if you timestamp each data entry, you can argue that each data entry was correct at that time, and to get the most current state, you just look up the most recent date.
To tie all of this back to Map Reduce, you can already suspect that each computer will provide its Map/Reduce operations on the data it already possesses. This in order to limit the transport of data via the network connecting all computers. On top of this, the Map Reduce code is actually sent to each data node (while in conventional computing systems, the opposite was the case).