quartz/MapReduce.md at 7b7a97b7cf19075c84db31690cf504e4dd0069a8

mirror of https://github.com/jackyzha0/quartz.git synced 2025-12-23 21:04:07 -06:00

DeGefen 7b7a97b7cf Add my Obsidian notes

2025-07-23 20:36:04 +03:00

A programming model for processing big data in parallel.

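- Distributed processing: the job runs in parallel on several nodes.
- Data locality: run the process where the data is!
- Horizontal scalability: handle more data by adding more nodes.
- Map step: transform the input.
  - Transform, filter, calculate
  - Works on local data
  - e.g., emit a count of 1 per word
- Combine step (commonly called shuffle): reorganize the map output so that all values for the same key end up together.
  - Shuffle, sort, group
- Reduce step: aggregate (e.g., sum) each group.
  - e.g., sum the word counts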

MapReduce sends the code to the nodes that hold the data instead of moving the data to the code, which saves data transfer time.
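
The map, combine, and reduce steps described above can be sketched as a single-process word count in plain Python. This is only an illustration, not how a real framework such as Hadoop is implemented: the function names (`map_step`, `shuffle_step`, `reduce_step`, `word_count`) and the `num_chunks` parameter are hypothetical, and the "nodes" are simulated by splitting the input into chunks that are mapped one after another. The input is the example sentence used in this note.

```python
from collections import defaultdict

SENTENCE = "how many cookies could a good cook cook if a good cook could cook cookies"


def map_step(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word, 1) for word in chunk.split()]


def shuffle_step(pairs):
    """Combine/shuffle: sort the pairs by key and group the counts per word."""
    groups = defaultdict(list)
    for word, count in sorted(pairs):
        groups[word].append(count)
    return dict(groups)


def reduce_step(groups):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(text, num_chunks=3):
    """Run map -> shuffle -> reduce over text split into num_chunks pieces."""
    words = text.split()
    # Each chunk stands in for the data stored on one node (an assumption
    # made for illustration; a real cluster splits files into blocks).
    size = max(1, -(-len(words) // num_chunks))  # ceiling division, at least 1
    chunks = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
    # On a real cluster each chunk would be mapped in parallel on its own node.
    mapped = [pair for chunk in chunks for pair in map_step(chunk)]
    return reduce_step(shuffle_step(mapped))


if __name__ == "__main__":
    print(word_count(SENTENCE))
```

Run as a script, this should print `{'a': 2, 'cook': 4, 'cookies': 2, 'could': 2, 'good': 2, 'how': 1, 'if': 1, 'many': 1}`. The result does not depend on `num_chunks`, which mirrors the point that the map step only needs its local chunk while grouping and summing happen afterwards.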

From the sentence:

“how many cookies could a good cook cook if a good cook could cook cookies”

Steps: