Misplaced Pages

MapReduce: Difference between revisions

Article snapshot taken from[REDACTED] with creative commons attribution-sharealike license. Give it a read and then ask your questions in the chat. We can research this topic together.
Browse history interactively← Previous editNext edit →Content deleted Content addedVisualWikitext
Revision as of 06:09, 10 December 2006 editGwern (talk | contribs)Autopatrolled, Extended confirmed users, Pending changes reviewers, Rollbackers47,730 editsm Robot-assisted disambiguation (you can help!): Python programming language← Previous edit Revision as of 05:08, 18 December 2006 edit undo80.202.86.195 (talk) Removed end of sentence that claimed the internal network at Google does not have much bandwidth. This could of course be the case, but I suspect it is just vandalismNext edit →
Line 15: Line 15:
MapReduce achieves reliability by parceling out a number of operations on the set of data to each node in the network; each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node (similar to the master server in the ]) records the node as dead, and sends out the node's assigned data to other nodes. Individual operations use ] operations for naming file outputs as a double check to ensure that there are not parallel conflicting threads running; when files are renamed, it is possible to also copy them to another name in addition to the name of the task (allowing for ]). MapReduce achieves reliability by parceling out a number of operations on the set of data to each node in the network; each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node (similar to the master server in the ]) records the node as dead, and sends out the node's assigned data to other nodes. Individual operations use ] operations for naming file outputs as a double check to ensure that there are not parallel conflicting threads running; when files are renamed, it is possible to also copy them to another name in addition to the name of the task (allowing for ]).


The reduce operations operate much the same way, but because of their inferior properties with regard to parallel operations, the master node attempts to schedule reduce operations on the same node, or as close as possible to the node holding the data being operated on; this property is desirable for Google as it conserves bandwidth, which their internal networks do not have much of. The reduce operations operate much the same way, but because of their inferior properties with regard to parallel operations, the master node attempts to schedule reduce operations on the same node, or as close as possible to the node holding the data being operated on; this property is desirable for Google as it conserves bandwidth.


==Uses== ==Uses==

Revision as of 05:08, 18 December 2006

MapReduce is a programming tool developed by Google in C++ (Python and Java are supported through interfaces), in which parallel computations over large (> 1 terabyte) data sets are performed. The terminology of "Map" and "Reduce", and their general idea, is borrowed from functional programming languages use of the constructs map and reduce in functional programming and features of array programming languages.

The actual software is implemented by specifying a Map function that maps key-value pairs to new key-value pairs and a subsequent Reduce function that consolidates all mapped key-value pairs sharing the same keys to single key-value pairs.

Map and Reduce

A map function iterates over a list of independent elements and performs a specified operation on each element. The list of answers is stored independently from the original list. Because each element is operated on independently and the original list is not being modified, it is very easy to perform a map operation in parallel. On appropriate hardware this allows extremely large data sets to be processed in short amounts of elapsed time.

For example consider a list of test scores where each score has been found to be 1 too high. A map function of s 1 {\displaystyle s-1} could be applied to correct every score s {\displaystyle s} .

A reduce operation takes a list and combines elements according to some algorithm. Since a reduce always ends up with a single answer, it is not as parallelizable as a map function, but the large number of relatively independent calculations means that reduce functions are still useful in highly parallel environments.

Continuing the previous example, what if one wanted to know the average of the test scores? One could define a reduce function which halved the size of the list by adding an entry in the list to its neighbor, recursively continuing until there is only one (large) entry, and dividing the total sum by the original number of elements to get the average.

Distribution and reliability

MapReduce achieves reliability by parceling out a number of operations on the set of data to each node in the network; each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node (similar to the master server in the Google File System) records the node as dead, and sends out the node's assigned data to other nodes. Individual operations use atomic operations for naming file outputs as a double check to ensure that there are not parallel conflicting threads running; when files are renamed, it is possible to also copy them to another name in addition to the name of the task (allowing for side-effects).

The reduce operations operate much the same way, but because of their inferior properties with regard to parallel operations, the master node attempts to schedule reduce operations on the same node, or as close as possible to the node holding the data being operated on; this property is desirable for Google as it conserves bandwidth.

Uses

According to Google, they use MapReduce in a wide range of applications, including: "distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation..." Most significantly, when MapReduce was finished, it was used to completely regenerate Google's index of the Internet, and replaced the old ad hoc programs that updated the index and ran the various analyses.

MapReduce generates a large number of intermediate, temporary files, which are generally managed by, and accessed through, Google File System, for greater performance.

Other implementations

  • The Hadoop project has developed an experimental implementation of MapReduce.
  • A Ruby implementation is available in the Starfish gem .

References

  1. "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." -"MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat; from Google Labs
  2. "As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes." *"How Google Works"

External links

Categories:
MapReduce: Difference between revisions Add topic