Data science relies on the availability of large volumes of data, commonly known as "big data." Big data comes from many domains, such as the usage histories behind recommender systems, genomics, and medicine. For instance, Netflix's recommendation system utilizes historical viewing data, while Google Translate uses translations of books in many languages. Tech giants have built business models around data: they provide free services to users and monetize the data by selling it to advertisers. Amassing vast amounts of data enables more statistically significant insights, and retaining everything is often more cost-effective than deciding selectively what to keep. Big data is characterized by the "4 V's":
- Volume (the quantity of data)
- Velocity (the speed of data inflow)
- Variety (the range of data types and sources)
- Veracity (the accuracy and reliability of data)
Handling big data effectively requires:
- Automatic parallelization and distribution of data processing
- Fault tolerance
- Status and monitoring tools
- Clean programming abstractions
The MapReduce programming model, introduced by Google in 2004, allows application programs to be written using high-level operations on immutable data. The runtime system handles scheduling, load balancing, communication, and fault tolerance.
- In the Map phase, each input element is mapped to one or more <key, value> pairs.
- In the Reduce phase, all values associated with the same key are reduced to a single value.
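To make the two phases concrete, the classic word-count example can be sketched in plain Python. This is a minimal, single-machine sketch of the programming model (map_fn, reduce_fn, and run_mapreduce are illustrative names, not a real MapReduce API); an actual runtime would run the map calls in parallel on many workers, shuffle the intermediate pairs by key across the network, and execute the reducers on other workers.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    # Map phase: emit a <word, 1> pair for every word in the input element.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    # Reduce phase: collapse all values sharing the same key into one value.
    return (key, sum(values))

def run_mapreduce(documents):
    # Map: apply map_fn to every input element independently (parallelized in a real system).
    pairs = [pair for doc in documents for pair in map_fn(doc)]
    # Shuffle: the runtime groups the intermediate pairs by key before reducing.
    pairs.sort(key=itemgetter(0))
    grouped = groupby(pairs, key=itemgetter(0))
    # Reduce: each reducer receives a key and an iterator over its values.
    return [reduce_fn(key, (value for _, value in group)) for key, group in grouped]

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_mapreduce(docs))
    # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```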
The map function is stateless: its output depends only on the input element it receives. Reducers receive an immutable list (iterator) of the values associated with each key. This design lets the data be read sequentially from disk only once. Algorithms that process each datum once and avoid storing excessive state therefore perform best in this model. Typical applications involve:
- Sequences of map and reduce steps
- Data transformations
- Iterations until convergence (e.g., Google PageRank)
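An iterative workload such as PageRank is typically expressed as a driver loop that submits one map+reduce pass per iteration and stops when the result stops changing. The sketch below is a hypothetical driver in Python; run_job, has_converged, and max_iters are illustrative placeholders, not part of any actual MapReduce framework.

```python
def iterate_until_convergence(run_job, initial_data, has_converged, max_iters=50):
    """Hypothetical driver loop for iterative MapReduce workloads (e.g., PageRank).

    run_job(data) -> new_data        : one full map + reduce pass over the data
    has_converged(old, new) -> bool  : e.g., rank changes fall below a tolerance
    """
    current = initial_data
    for iteration in range(max_iters):
        updated = run_job(current)           # launch one MapReduce job
        if has_converged(current, updated):  # stop once results are stable
            print(f"Converged after {iteration + 1} iterations")
            return updated
        current = updated                    # output of one job feeds the next
    return current
```

Each pass re-reads its input and re-writes its output, which is exactly the per-step overhead that later in-memory platforms such as Spark remove.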
Strengths of MapReduce are:
- Simplified programming model
- Automated management of complex tasks (allocation, synchronization, etc.)
- Versatility
- Suitability for large-scale data analysis
Limitations of MapReduce are:
- Higher overhead compared to High-Performance Computing (HPC)
- Inherent performance constraints (e.g., intermediate results are written to disk between steps)
- Rigid structure; each step must complete before the next begins
MapReduce platforms provide several key functionalities:
- Scheduling: Efficient allocation of tasks to mappers and reducers, with consideration for data locality. The master node is responsible for scheduling tasks, monitoring their execution, and reassigning tasks in case of worker node failures.
- Data distribution: Optimization of data transfer from mappers to reducers to minimize network bandwidth consumption. The Google File System (GFS) stores data files as 64 MB blocks, and each block is replicated on three different machines. By taking the location of these replicas into account when distributing data and computation, GFS avoids bottlenecks at rack switches and achieves a higher aggregate read rate (see the sketch after this list).
- Fault tolerance: MapReduce transparently handles node failures by re-executing failed tasks. The master node monitors worker nodes and reassigns tasks as necessary.
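As a rough illustration of the block-based layout described in the data-distribution point above, the following sketch splits a file into 64 MB blocks and assigns each block to three machines. The round-robin placement is a simplification for illustration; real GFS chooses replica locations based on rack topology, free space, and load.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB blocks, as in GFS
REPLICATION_FACTOR = 3          # each block is stored on three machines

def place_blocks(file_size_bytes, machines):
    """Split a file into fixed-size blocks and assign each block to three machines.

    Simplified round-robin placement; GFS's actual policy is rack-aware.
    """
    num_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            machines[(block_id + r) % len(machines)]
            for r in range(REPLICATION_FACTOR)
        ]
    return placement

# A 1 GB file becomes 16 blocks of 64 MB, i.e. 48 block replicas in total.
layout = place_blocks(1024 * 1024 * 1024, [f"node{i}" for i in range(8)])
print(len(layout), layout[0])   # 16 ['node0', 'node1', 'node2']
```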
Over the last decade, the MapReduce framework has undergone several advancements with the introduction of new dataflow abstractions. Recent improvements in big data processing include:
- Support for complex transformation graphs
- Transition from batch to stream processing
- Shift towards in-memory and hybrid processing methods
Key features of modern platforms:
- Arbitrary numbers of processing stages, with operators such as join, filter, and groupBy in addition to map and reduce
- Caching of intermediate results that are reused multiple times
- Apache Spark adopts a scheduling approach focused on "data parallelism". This optimizes throughput, reduces overhead, and makes it possible to compress data in bulk. Spark's scheduling approach also simplifies load balancing. Spark can perform in-memory processing and supports arbitrary acyclic graphs of transformations, which can lead to faster execution times compared to the on-disk batch processing of MapReduce (see the sketch after this list).
- Apache Flink adopts a pipelined approach, which essentially means "task parallelism". Its lower latency makes Flink well suited to stream processing. Load balancing is essentially impossible, because operator placement is decided statically when the job is deployed and there is no runtime scheduling.
- Elasticity refers to the ability to adjust resource usage dynamically. Elasticity is about scaling resources to meet demand, while load balancing is about efficiently distributing the workload across the available resources. Elasticity is better supported by systems with dynamic scheduling (like Spark), where scheduling decisions take place at runtime, than by pipelined approaches, whose static deployment leaves no room for scaling.
- Fault tolerance in big data processing platforms is achieved in two ways:
  - If the processing is scheduled: re-schedule and re-compute the failed tasks when a node fails.
  - If the processing is pipelined: periodically checkpoint operator state to a distributed file system and restart from the last checkpoint.
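To make the contrast with classic MapReduce concrete, the sketch below uses the PySpark RDD API: a chain of transformations (flatMap, map, reduceByKey) is built lazily as a graph, an intermediate result is cached in memory because it is reused by two actions, and nothing executes until an action is called. The input path is a placeholder, and the example assumes a local Spark installation.

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("wordcount-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Transformations only build a graph of operations; no work happens yet.
lines = sc.textFile("data/input.txt")           # placeholder path
words = lines.flatMap(lambda line: line.lower().split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# Cache the intermediate result in memory because it is reused twice below.
counts.cache()

# Actions trigger execution; the second action reuses the cached RDD
# instead of re-reading and re-processing the input from disk.
num_distinct_words = counts.count()
top_ten = counts.takeOrdered(10, key=lambda kv: -kv[1])

print(num_distinct_words, top_ten)
sc.stop()
```

Because counts is cached, the second action avoids re-reading the input: this in-memory reuse is what distinguishes Spark from MapReduce's per-step disk materialization.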
MapReduce can still be an appropriate choice for simple batch processing tasks, where the overhead of Spark's in-memory processing or Flink's streaming capabilities is not justified.