MapReduce
MapReduce is a programming pattern that enables massive scalability across hundreds or thousands
of servers in a Hadoop cluster.
As the processing component, MapReduce is the heart of Apache Hadoop.
MapReduce is a processing technique and a programming model for distributed computing; Hadoop's implementation of it is based on Java.
MapReduce begins with some sort of input, as seen above.
The data is split to distribute the work across multiple systems.
This is followed by a process called mapping, in which each node reads its piece of the work and does a basic job, such as cataloging all of the items it sees and keeping a count associated with each of them.
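As a rough sketch, a map step like this, written against Hadoop's Java MapReduce API for a simple word count, might look something like the following (the class and field names are placeholders for illustration, not from any particular codebase):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map step: read one line of input and emit (word, 1) for every word found.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE); // catalog each word with a count of 1
    }
  }
}
```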
As seen above, we now have two nodes that each know the counts for their own words, and we have to start bringing all of those back together in a process called shuffling.
Here the system groups similar pieces close together, an effort we will sometimes call partitioning, and distributes them so that the nodes can find and load the data, and the data related to it, significantly faster.
In this particular example, we can see that with these groupings we could easily split the work across nodes that each process a portion of the load, e.g. one doing red, one doing blue, and one doing everything else.
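To make that partitioning idea concrete, here is a hedged sketch of a custom Hadoop Partitioner that routes "red" to one reducer, "blue" to another, and everything else to a third. In a real job the default hash partitioner usually handles this distribution for you, so the class below is purely illustrative:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Shuffle/partition step: decide which reducer each (word, count) pair goes to,
// so that related keys end up together on the same node.
public class ColorPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (numPartitions < 3) {
      // Fall back to the usual hash-based spread if there are fewer reducers.
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
    String word = key.toString();
    if (word.equals("red")) {
      return 0;   // one node handles "red"
    } else if (word.equals("blue")) {
      return 1;   // another node handles "blue"
    }
    return 2;     // everything else goes to a third node
  }
}
```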
Now that those nodes have access to all of that shuffled data, each one loads its share and performs the reduce step: it groups everything together, aggregates it, and gives us back the final numbers, which can then be output as the end result.
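That reduce step, again sketched for a word count against the Hadoop Java API, could look roughly like this:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce step: all counts for the same word arrive together; sum them up and
// emit the final (word, total) pair as the end result.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
```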
Essentially we are making a trade-off here: we spend a lot of storage and keep the data organized in different ways so that we can operate with lots of small CPUs, rather than having one big CPU and one big set of disks run through everything.
Using this technique, we get an answer back quickly because each individual computer only has to work on a tiny part of the task.
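Tying the pieces together, a driver that wires the sketched mapper, partitioner, and reducer above into a single Hadoop job might look like the following (the class names are the placeholders used earlier, and the input and output paths come from the command line):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: wires the mapper, partitioner, and reducer into a single Hadoop job.
public class WordCountJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountJob.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setPartitionerClass(ColorPartitioner.class); // optional custom partitioning
    job.setReducerClass(IntSumReducer.class);
    job.setNumReduceTasks(3);                        // one per partition in the sketch above
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input is split across mappers
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // results persisted to the distributed file system
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The job would then typically be packaged into a JAR and submitted with `hadoop jar`, with the input and output living on the cluster's distributed file system.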
Notice that each of these steps requires something to be persisted.
We write things to a distributed disk system, where they are replicated to ensure no data is lost.
Consequently, compute is generally not the bottleneck in these types of designs.
Disk I/O rapidly becomes the bottleneck because the disks can only process so much at once. As a result, we end up with some fundamental issues: the overall system tends to be optimized for batch processing.
With all of that I/O we lose some of the benefits we gained by distributing the system, and the overall system was not designed to be interactive like SQL Server or other classic database architectures, where CLI or GUI tools let you execute a query and get the result back very quickly.
MapReduce wasn't really designed for that. The original architecture was highly programmatic and focused heavily on Java-based jobs, and tools were later built on top of it to make it easier to convert SQL into those Java-based jobs and present the results.
This led to the evolution of Spark.
https://www.geeksforgeeks.org/mapreduce-understanding-with-real-life-example/