quartz/content/BigData/Hadoop/MapReduce.md
2025-07-23 20:36:04 +03:00

A programming model for processing big data in parallel.
- Distributed processing: the job runs in parallel on several nodes
- Data locality: run the process where the data is!
- Horizontal scalability: handle more data by adding nodes
- **Map** step: transform the input
    - Transform, filter, calculate
    - Works on local data
    - e.g., emit a count of 1 per word
- **Shuffle** step: reorganize the map output
    - Shuffle, sort, group by key
    - Not the same as Hadoop's optional *combiner*, which pre-aggregates map output locally
- **Reduce** step: aggregate / sum the groups
    - e.g., sum the word counts

MapReduce **runs code where the data is**, which saves data transfer time.
![[Screenshot 2025-07-23 at 13.00.20.png]]
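
As a rough illustration of the three steps, here is a minimal single-process sketch in Python. It is not how Hadoop implements MapReduce: the `map_reduce` helper, the `parity_mapper` and `sum_reducer` functions, and the sample partitions are all hypothetical names and data made up for this note, and the "nodes" are just plain lists.

```python
from collections import defaultdict


def map_reduce(partitions, mapper, reducer):
    """Map each partition on its own, group pairs by key, then reduce each group."""
    # Map: every partition is processed independently,
    # as if the code ran on the node that holds that data.
    mapped = [
        [pair for record in partition for pair in mapper(record)]
        for partition in partitions
    ]

    # Shuffle: collect all values that share a key into one group.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)

    # Reduce: aggregate each group independently of the others.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def parity_mapper(n):
    # Transform one record into a (key, value) pair.
    yield ("even" if n % 2 == 0 else "odd", n)


def sum_reducer(key, values):
    return sum(values)


# Hypothetical input: numbers already split across three "nodes".
partitions = [[1, 2, 3], [4, 5], [6, 7, 8]]

totals = map_reduce(partitions, parity_mapper, sum_reducer)
print(totals)  # {'even': 20, 'odd': 16}
```

Because the mapper only ever sees one record and the reducer only ever sees one group, both steps could be spread over many machines without changing the result.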
##### Example:
From the sentence:

> “how many cookies could a good cook cook if a good cook could cook cookies”

Steps:

1. **Map**:
    - Each word becomes a pair like ("cook", 1)
2. **Shuffle**:
    - Group by word
3. **Reduce**:
    - Add up counts → ("cook", 4)
![[Screenshot 2025-07-23 at 13.01.20.png]]
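
The same word count can be traced in a few lines of plain Python. This is only a local sketch of the three steps applied to the example sentence, using the standard library's `itertools.groupby` for a sort-then-group shuffle; the variable names are chosen for this note.

```python
from itertools import groupby
from operator import itemgetter

sentence = "how many cookies could a good cook cook if a good cook could cook cookies"

# Map: emit a (word, 1) pair for every word.
pairs = [(word, 1) for word in sentence.split()]

# Shuffle: sort by word so equal words sit next to each other, then group them.
pairs.sort(key=itemgetter(0))
grouped = {
    word: [count for _, count in group]
    for word, group in groupby(pairs, key=itemgetter(0))
}

# Reduce: add up the 1s in each group.
word_counts = {word: sum(counts) for word, counts in grouped.items()}

print(word_counts["cook"])     # 4
print(word_counts["cookies"])  # 2
```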