mirror of
https://github.com/jackyzha0/quartz.git
synced 2025-12-23 21:04:07 -06:00
771 B
771 B
RDD (Resilient Distributed Dataset)
RDD is an immutable (read only) distributed collection of objects.
Dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster
!Screenshot 2025-07-23 at 19.08.40.png
Key Properties:
- Distributed: Automatically split across cluster nodes.
- Lazy Evaluation: Transformations aren’t executed until an action is called.
- Fault-tolerant: Can recompute lost partitions using lineage graph.
- Parallel: Operates concurrently across cluster cores.
Data Sharing
In Apache Spark !Screenshot 2025-07-23 at 19.12.57.png 10-100x faster than network and disk!