quartz/content/BigData/Hadoop/RDD.md
2025-07-23 20:36:04 +03:00

771 B
Raw Blame History

RDD (Resilient Distributed Dataset)

RDD is an immutable (read only) distributed collection of objects.

Dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster

!Screenshot 2025-07-23 at 19.08.40.png

Key Properties:
  • Distributed: Automatically split across cluster nodes.
  • Lazy Evaluation: Transformations arent executed until an action is called.
  • Fault-tolerant: Can recompute lost partitions using lineage graph.
  • Parallel: Operates concurrently across cluster cores.
Data Sharing

In Hadoop MapReduce !Screenshot 2025-07-23 at 19.11.44.png

In Apache Spark !Screenshot 2025-07-23 at 19.12.57.png 10-100x faster than network and disk!