Skip to content
mfisk edited this page Sep 13, 2010 · 23 revisions

FileMap

FileMap is a file-based map-reduce system for data-parallel computation.

Why Map-Reduce?

The map-reduce method of parallel computing was introduced by Google and further popularized by open source implementations like Hadoop, Disco, and others. Map-Reduce is remarkable in its simplicity and scalability. Traditional parallel environments have been based either on explicit message-passing APIs or on the appearance of a global shared-memory system. In contrast, Map-Reduce provides a rigid data-flow model in which the user need only write discrete kernel functions that fit within that dataflow. This restrictive model does not support many communication patterns in parallel codes. However, it is sufficient for a large number of data-intensive computing tasks. In return for accepting these restrictions, programmers can write simple, serial functions that a map-reduce run-time can parallelize very effectively.

What is unique about FileMap among other Map-Reduce frameworks?

FileMap is developed around several characteristic themes:

  1. File-based, rather than tuple-based processing
  2. Reuse existing domain-specific and POSIX file processing tools
  3. Thin layer on top of a POSIX (Linux, Unix, MacOS, etc.) environment. (Don’t re-invent the wheel. Use OpenSSH for network communication & authentication; use existing file systems & file access control)
  4. Intermediate result caching
  5. Data replication
  6. Streaming (Still largely a work in progress.)
  7. No privileged user access or software installation required
  8. Support multiple coalitions of participating systems

Please see the Examples page.

Installation

No real “installation” is required.

  1. Download the “fm” command (a Python script) to the computer you will use to launch computations.
  2. Create a “filemap.conf” file describing the nodes you are using.
  3. Run “fm init” once to prepare necessary directory structures on each node.
  4. Copy “fm” to each node in the computation and make sure it is in your PATH on each node.

Dependencies

  • Python
  • OpenSSH >= 4.0
Clone this wiki locally