I'm bored sitting idle on this flight from San Francisco to Toronto. I've just finished conquering multiple new levels of Where's My Water released today on my iPad. So I've decided to ramble on some random, yet simple, thoughts that I cooked up in the shower this morning.
Map Reduce is a pretty well known concept in the field of computer science. It's just a fancy name, popularized by a paper that some PhD-types at Google published, for a concept we already know quite well as 'divide and conquer'.
Map Reduce is frequently used to speed up a computation by breaking it down into multiple smaller problems and then simultaneously solving each smaller problem in parallel using multiple computers, or just CPUs. If you wanted to count the number of words in a large essay for example, you could process the essay serially, count the words in each sentence and then sum them all up. But if you had access to multiple computers, or even multiple CPUs on the same computer, you could break down the essay into smaller chunks, such as paragraphs or sentences, and then give each computer one of these chunks to count individually, but in parallel with other computers. When all computers have finished counting the number of words in their assigned paragraphs, all you do is sum up the results of each paragraph and you would get the total number of words in the entire essay.
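The word-counting idea above can be sketched in a few lines of Python. This is purely illustrative (the essay doesn't prescribe any implementation), and a thread pool stands in for the cluster of computers:

```python
# Sketch of the word-counting example: the "map" step counts words in
# each paragraph in parallel, and the "reduce" step sums the counts.
# A thread pool stands in for a cluster of computers or CPUs here.
from concurrent.futures import ThreadPoolExecutor

def count_words(paragraph):
    return len(paragraph.split())

def total_word_count(essay):
    paragraphs = essay.split("\n\n")                # break the job into chunks
    with ThreadPoolExecutor() as pool:
        counts = pool.map(count_words, paragraphs)  # map: one chunk per worker
    return sum(counts)                              # reduce: combine the results

print(total_word_count("One two three.\n\nFour five.\n\nSix."))  # prints 6
```

In a real deployment the chunks would go to separate machines or processes rather than threads, but the map/sum structure is the same.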
Similarly, if you wanted to compute the arithmetic mean of a large set of numbers, you could break the large set down into smaller but equal subsets, get each CPU to compute the mean of its subset in parallel, and then simply take the mean of the results of all CPUs to get the mean of the entire set. If you had 10 CPUs, you would divide your set into 10 equal subsets, assuming your 10 CPUs were equally powerful. If you had a large cluster of 1000 CPUs, each with different processing speeds, then the size of the subset or sub-job given to each CPU would be proportional to its processing power/speed in order to achieve the best overall throughput.
Edit: My good friend Paul Butler points out that the sum of means of each subset does not equal the mean of the whole set. He's absolutely right, and I don't know what I was smoking when I wrote that. But what about the mean of means? Does that equal the mean of the whole set? Yes, but only if the subsets are all equal in size. In general however, summing and then dividing in the end is a better solution computationally. Paul writes:
However, a better way is to simply keep a sum of the numbers and a count of the numbers at each step, and divide the sum by the count as the final reduce step. This avoids the problem of splitting into equal subsets (eg. what if the total number of numbers is not divisible by 1000, but that's how many CPUs you have) and is more numerically stable.
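Paul's (sum, count) approach can be sketched like so (hypothetical code, not from his note): each map task returns a pair, and the reduce step adds the pairs and divides only once at the end, so unequal subset sizes are handled correctly.

```python
# Sketch of the (sum, count) approach: each subset contributes a
# (sum, count) pair; the reduce step adds the pairs and divides once.
# This stays correct even when the subsets have unequal sizes,
# unlike naively averaging the subset means.
def map_subset(subset):
    return (sum(subset), len(subset))

def reduce_mean(pairs):
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return total / count

data = [1, 2, 3, 4, 5, 6, 7]
subsets = [data[0:3], data[3:5], data[5:7]]   # unequal sizes on purpose
pairs = [map_subset(s) for s in subsets]
print(reduce_mean(pairs))                     # prints 4.0, same as sum(data)/len(data)
```

Note that the mean of the three subset means here would be (2 + 4.5 + 6.5) / 3 ≈ 4.33, not 4.0, which is exactly the pitfall Paul describes.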
Map-Reduce, as the name implies, consists of two steps or stages: a) Map and b) Reduce. That's the easy part. The map step involves breaking down the problem into multiple smaller sub-problems and handing them out to the individual CPUs in the cluster, while the reduce step involves putting together the results of each sub-problem to determine the answer to the original problem.
A problem or job is labelled map-reducible only if the type of problem and its definition fits into the map-reduce framework. What this means is that the problem must be easily broken down into multiple smaller problems and that the final answer can be quickly and accurately determined from the results of each sub-computation. Counting the number of words in an essay or determining the arithmetic mean of a large set of numbers are clearly map-reducible owing to their associative nature: a) the sum of the number of words in each paragraph equals the number of words in the whole essay, and b) the sums and counts of subsets of a larger set of numbers combine directly into the arithmetic mean of the whole set. In fact, these problems are so easily broken down that they have a special name. They are called embarrassingly parallel problems. And no, I did not just make that up. Brute forcing a 4-digit online PIN is another such example.
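To see why the PIN brute force is embarrassingly parallel, here's a hedged sketch: the 10,000 candidates split into disjoint ranges with no shared state, and the reduce step just picks whichever worker found a hit. The `check_pin` oracle and the target PIN are stand-ins invented for this illustration; the essay names the problem but not an implementation.

```python
# Sketch: brute-forcing a 4-digit PIN is embarrassingly parallel since
# the 10,000 candidates split into disjoint ranges with no shared state.
# SECRET and check_pin are hypothetical stand-ins for the real oracle.
SECRET = "4821"

def check_pin(guess):
    return guess == SECRET

def search_range(start, stop):
    """Map task: try every PIN in [start, stop)."""
    for n in range(start, stop):
        guess = f"{n:04d}"
        if check_pin(guess):
            return guess
    return None

def crack(workers=4):
    step = 10000 // workers
    # Reduce step: keep the first non-None result from any range.
    results = (search_range(i, i + step) for i in range(0, 10000, step))
    return next(r for r in results if r is not None)

print(crack())  # prints 4821
```

The ranges are shown sequentially here for clarity; in a real attack each `search_range` call would run on its own CPU, and the expected time would drop roughly linearly with the number of workers.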
Some problems are inherently non-map-reducible owing to their more complex nature. If the computation of each sub-task requires one or more results of other sub-computations, then it is impossible to solve each sub-task independently and completely in parallel. Because subsets of the data cannot be processed individually, such problems may have to be computed by a single CPU that has access to the entire data set.
It turns out that there are several instances where the principle of divide and conquer, or map-reduce, can be applied in real life to parallelize a task among multiple people. Most of us only think of divide and conquer in the context of war, splitting up assignment problems at school among multiple students, or working on a group project where each team member contributes one section and the last team member puts the whole project together. These situations aren't great examples of map-reduce since there isn't much of a reduce step involved. Putting together the results of each sub-computation with one further final computation is essential in a map-reduced job. Furthermore, adding more people to the problem in a team or organization that is not a factory doesn't necessarily make the job faster, because of the Mythical Man-Month effect. Linear scalability is a must in a map-reducible job.
There are a lot of examples where map-reduce can be applied in everyday life to enhance collaboration among friends or family members, but we often overlook them since our brains are trained to solve problems independently and serially. But if there are other people with you that want to solve the same problem as you, but as quickly as possible, map-reduce may often offer a smarter and worthwhile solution.
Example 1: Line-up at customs and immigration counters
San Francisco International Airport's international terminal has multiple immigration lines for passengers arriving from abroad (flights from Canada are, in certain cases, pre-cleared and skip these lines). Each lane has an immigration officer or two at its front. Having just sat on a plane for multiple hours, the only thing on people's minds is to get through immigration as quickly as possible, collect their baggage, and go home. However, most people simply choose the shortest visible lane, which keeps the lengths of the lanes fairly even. But can we do better?
When a group of friends travels abroad together, they will often stand in the same immigration line, typically the shortest. This is a natural tendency, so the friends can at least chit-chat among themselves while waiting. But this system is highly inefficient, since the lines move forward at random speeds and the length of a line correlates poorly with how fast you will reach the immigration officer at its front. On the other hand, it is provably smarter to split up the group so that each person stands in a separate line. But how do you pick which lines to stand in? If you have 4 people, pick the 4 shortest lines. Picking the shortest lines isn't a guarantee of top speed, but it is a quick-enough and simple-enough heuristic to apply. The bigger the group, the better this heuristic works.
After the split-up, each member keeps an eye on the others to see who reaches an immigration officer first. As soon as that happens, the whole group joins that person's line so that they can be processed as a group. If visibility between lines is limited, then the use of text messaging is recommended. Apps like iMessage and GroupMe can easily text multiple people at once. Upon receiving the text, everyone but the texter abandons their line, finds the texter's line, and joins them. This guarantees that everyone is processed and out of the immigration line as fast as is practically possible with n people. It's of course not the absolute minimum time, since that is only achievable if you have at least as many people as there are lines.
Some caveats: you would want at least 3 people to take advantage of this map-reduce process. Otherwise, the extra overhead of splitting up, texting, and re-joining may not be worth the marginal improvement in processing time. Also, if the lines are short enough, it may still be worth splitting up but not re-grouping, since then at least all friends will be processed in parallel by different officers rather than serially by one officer. The above map-reduce system assumes that waiting in line is the bottleneck to the whole process, not the immigration officer inspecting your documents.
This technique doesn't apply to just airports. It applies in any situation where people are processed at multiple counters, like big grocery stores and supermarkets. However, it doesn't apply at the US immigration line at Toronto Pearson, where there is only one line that people stand in. Once they reach the front of that line, people are fanned out to individual counters where the queues are substantially shorter, and once fanned out, it is quite hard to switch lines without being noticed. In the arrival immigration halls at San Francisco and Dubai, by contrast, switching lines isn't as challenging.
The biggest drawback to this type of map-reducing is that it assumes you're the only group doing it. If every other group starts doing it, you can no longer reliably predict how long a person will take to get processed through the immigration officer. Imagine you were at the front of the line and you thought you'd be next, but then all of a sudden another 3 people cut in line to join their friend in front of you! You might have just gone from being in the fastest lane to the slowest within a matter of a few seconds.
Thusly, try to keep this type of gaming solution to yourself and don't boast about it to all your friends, lest it become public knowledge. If it does, everyone's going to start doing it, thereby reducing your success rate, or officials might prohibit people from cutting the line to join their "friend" by putting barriers between lines/counters. In publishing this very essay, I am making the same mistake I am warning you about, but unlike you, very few people have the patience to read such a lengthy essay and apply it in everyday life. So I think I'll be okay.
Example 2: Finding an open table in an open-seating arrangement at a restaurant
Ghirardelli's is a famous tourist destination in San Francisco, California that specializes in manufacturing chocolate among other equally fattening edibles. Interestingly, their chocolate sucks, so most people go there for their delicious and heart-warming sundaes and milk-shakes. During sunny weekends, the crowd is so large that finding a table to sit at in their open-seating area can be quite tedious.
When most couples or groups of friends go to Ghirardelli's, they stick together and keep walking around the place until a table clears. As we've seen in the previous example, this is often highly inefficient. Splitting up and walking around non-overlapping areas of the seating area is provably more efficient. This is the map phase. When one person finds a free table big enough to seat the whole party, that person claims the table and uses an app like iMessage or GroupMe to text the other members of the party. The other members then make their way over and join them. This concludes the reduce phase.
The biggest edge case of this algorithm is if two people find a table at the exact same time. When this happens, the team's unofficially elected "leader" will walk to both (or all) tables, and pick the better table based on location, comfort (couch vs. chair), and amount of natural sunshine. The rest of the members would then re-locate based on the final decision made by the leader.
The above are just two social examples of map-reduce in real life. With even a little thought, you could come up with at least a dozen more. I hope you'll put these ideas into practice in your everyday life and save yourself a little bit of time each time.
Thanks to Paul Butler for reviewing the final version of this essay and pointing out the salient errors.