Sunday, June 16, 2013

Musings on the Motivations for Map Reduce

Update (Aug 25, 2013): I contacted Jeff Dean for his perspective on this, and have incorporated his response in this update.

Google's Map Reduce is something I am learning more about.  Developed at Google in the early 2000s (the paper describing it was published in 2004), it has since become a core component of many open source and commercial applications.

In reading the various explanations and motivations across the web, I was never satisfied that I understood WHY they had done things this way.  Once something successful exists, it can be difficult to put yourself in the position of its creators before it existed at all, and I felt a nagging incompleteness in the hindsight and post-facto explanations.

From a practical perspective, this motivation is not critical to users nowadays.  What matters more is knowing how to "squint" so as to extract the needed map/reduce tasks to solve your particular problem, and how to run and tune the implementing software (e.g., Hadoop).

One Perspective: Failures


I came across an interesting video from 2008 that may explain some of this history - it is an interview with Jeff Dean and Sanjay Ghemawat, the creators of Google's MapReduce, and they touch on answering some of my (esoteric) curiosity:


Google Technology Roundtable: Map Reduce
with Jeff Dean, Sanjay Ghemawat, Jerry Zhao, and Matt Austern
  (from 2008)

I watched this several times, trying to pick out the statements that gave some hint of the core midnight oil inspiration(s), and at 6:17 there is this:
"If we hadn't had to deal with failures, if we had a perfectly reliable set of computers to run this on, we probably never would have implemented Map Reduce, because without having to deal with failures, the rest of the support code just isn't that complicated." - Sanjay Ghemawat
That seems to be a fairly conclusive statement.  His collaborator Jeff Dean does not correct him here (although, to be fair, the format of this mildly awkward interview does not lend itself to any back-and-forth).

So failures, which were more likely because they were using commodity hardware to begin with, may have been the factor that prompted building a dedicated system around the two map/reduce components.  Parallelization, and its abstraction from the end user, was likely going to be a component of ANY successful solution.

Updated Perspective from Jeff Dean


After I had written this short blog post, it kept nagging at me how Jeff Dean might have felt about the origins of Map-Reduce, in contrast to the fairly unequivocal "Failures" in the video above.  So, I sent him an email asking for his perspective.  I think the subject "Map Reduce - The Movie?" got his attention.  Here was his response:
"I'd like to say that the idea for MapReduce was there in the back of our minds from the very beginning, but it wasn't until we'd actually gone through a couple iterations of rewriting our crawling and indexing systems (scaling things and adding features as part of these rewrites) that we had seen enough different kinds of parallel data processing operations that we wanted to perform that the patterns for MapReduce started to crystalize in our heads.  At that point, we started looking at the various operations in our indexing system and tried to come up with a general interface that would allow us to implement each of those operations, and would also allow us to have a number of different optimizations underneath the covers of that interface that would make things robust and scalable.  Part of the reason we didn't develop MapReduce earlier was probably because when we were operating at a smaller scale, then our computations were using fewer machines, and therefore robustness wasn't quite such a big deal: it was fine to periodically checkpoint some computations and just restart the whole computation from a checkpoint if a machine died.  Once you reach a certain scale, though, that becomes fairly untenable since you'd always be restarting things and never make any forward progress." - Jeff Dean, email, Aug 21, 2013
A few things strike me about his response.

One is that they had gone through the task at hand - crawling, indexing - several times, and were therefore intimately familiar with their peculiar problem space.  I can't find the exact source, but I remember once reading one of Tom Kyte's articles about dealing with large databases/systems: until you've done it, there's just no way to really learn.  There are too many odd things that come up when systems are at or near capacity.  Things creak in weird ways you could not have anticipated, and you just have to be there before you can start to appreciate the problem (this is true about a lot of things in life, of course: some things you just can't appreciate until you actually try to do them).

Another thing here is that the scale of the problem made robustness critical - not a surprising thing, of course.  Restarting a computation from a checkpoint, despite its "simplicity" on normal-sized systems, was not "tenable" for their time-frame needs.  This actually does seem to coincide with Ghemawat's "failure" assertion, though it is not conveyed as forcefully.
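
To convince myself of the scaling argument, I tried a rough back-of-the-envelope sketch in Python. Everything in it is my own assumption rather than anything from Google: machines fail independently with exponentially distributed lifetimes, the mean time between failures is a made-up 1,000 hours, checkpoints are taken every hour, and any single failure forces a restart from the last checkpoint. The function names are hypothetical too.

```python
import math

def p_interval_survives(machines, interval_hours, mtbf_hours):
    """Probability that no machine fails during one checkpoint interval,
    assuming independent, exponentially distributed machine lifetimes."""
    return math.exp(-machines * interval_hours / mtbf_hours)

def expected_attempts(machines, interval_hours, mtbf_hours):
    """Expected number of tries needed to complete one interval when any
    failure means restarting it (mean of a geometric distribution)."""
    return 1.0 / p_interval_survives(machines, interval_hours, mtbf_hours)

MTBF_HOURS = 1000.0      # hypothetical per-machine mean time between failures
INTERVAL_HOURS = 1.0     # hypothetical time between checkpoints

for machines in (10, 100, 1000, 10000):
    p = p_interval_survives(machines, INTERVAL_HOURS, MTBF_HOURS)
    tries = expected_attempts(machines, INTERVAL_HOURS, MTBF_HOURS)
    print(f"{machines:>6} machines: P(clean interval) = {p:.5f}, "
          f"expected attempts = {tries:,.1f}")
```

With these invented numbers, 10 machines get through an interval cleanly about 99% of the time, while 10,000 machines would need on the order of 22,000 attempts per interval. That is only a toy model, but it seems to capture what "always be restarting things and never make any forward progress" looks like.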

Map Reduce - the Movie


The original motivations for Map Reduce no longer really matter - interesting but irrelevant catalysts - at least in the context of explaining Map Reduce to new users.  But I think they do matter for understanding this history, and can serve as one more personal nugget in how we deal with our own challenges.

As I had indicated in a previous version of this note, I wondered about a movie about this time in history - the whiteboards, the iterations to the final implementation, and the negative reactions despite the method's success.  We were coming to terms with a new and dramatic explosion of data size and growth.  The repercussions of this period on current high performance computing and general commoditization of computing resources for the masses are profound.  

I'd still pay to see the movie.  
