MapReduce is a programming model for processing big data in parallel.
- Distributed processing: the job runs in parallel on several nodes
- Run the process where the data is!
- Horizontal scalability
- Map step: transform the input
  - Transform, filter, calculate
  - Works on local data
  - e.g., emit a count of 1 per word
- Shuffle (combine) step: reorganize the map output
  - Shuffle, sort, group
- Reduce step: aggregate / sum the groups
  - e.g., sum the word counts
MapReduce runs the code on the nodes where the data is stored, which saves data transfer time.
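
The three steps can be sketched in plain Python. This is only a single-machine illustration, not how a real MapReduce framework is implemented: the `partitions` list is hypothetical sample data in which each inner list stands in for the data stored on one node, a thread pool stands in for the nodes working in parallel, and the function names `map_partition`, `shuffle`, and `reduce_groups` are made up for this sketch.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_partition(lines):
    """Map step: runs on one partition and emits (word, 1) pairs."""
    return [(word, 1) for line in lines for word in line.split()]


def shuffle(mapped_partitions):
    """Shuffle step: group the emitted values by key across all partitions."""
    groups = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_groups(groups):
    """Reduce step: aggregate (here: sum) the values of each group."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical partitions: each inner list stands in for data on one node.
partitions = [
    ["big data", "data node"],
    ["node data"],
]

# Each partition is mapped independently, so the work can run in parallel.
with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(map_partition, partitions))

result = reduce_groups(shuffle(mapped))
print(result)  # {'big': 1, 'data': 3, 'node': 2}
```

Only the small `(word, 1)` pairs leave a partition; the raw lines are never moved, which is the point of running the map step where the data is.
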
Example:
From the sentence:
“how many cookies could a good cook cook if a good cook could cook cookies”
Steps:
- Map:
  - Each word becomes a pair like ("cook", 1)
- Shuffle:
  - Group by word
- Reduce:
  - Add up counts → ("cook", 4)
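
The word count for this sentence can be traced step by step in a few lines of Python. This is a minimal sketch that runs on one machine; it uses sorting plus `itertools.groupby` as a stand-in for the shuffle, and the variable names are chosen only for illustration.

```python
from itertools import groupby
from operator import itemgetter

sentence = "how many cookies could a good cook cook if a good cook could cook cookies"

# Map: each word becomes a (word, 1) pair.
pairs = [(word, 1) for word in sentence.split()]

# Shuffle: sort by word so equal keys are adjacent, then group them.
pairs.sort(key=itemgetter(0))
grouped = {
    word: [count for _, count in group]
    for word, group in groupby(pairs, key=itemgetter(0))
}

# Reduce: add up the counts in each group.
counts = {word: sum(ones) for word, ones in grouped.items()}

print(grouped["cook"])  # [1, 1, 1, 1]
print(counts["cook"])   # 4
```

The sort is needed because `groupby` only merges neighbouring items with the same key, which mirrors the "shuffle, sort, group" order of the middle step.
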

