---
aliases:
  - Spark
---

> [[Hadoop Eccosystem|Systems based on MapReduce]]

## Apache Spark

> Apache Spark is a **fast**, **general-purpose**, **open-source** cluster computing system designed for large-scale data processing.

##### Key Characteristics
- **Unified analytics engine** – supports batch, streaming, SQL, machine learning, and graph processing.
- **In-memory computation** – stores intermediate results in RAM (whereas Hadoop MapReduce writes them to disk).
- **Fault tolerant** and scalable.

##### Benefits of Spark Over [[Hadoop]] [[MapReduce]]

| **Feature**         | **Spark**                                                                 | **Hadoop MapReduce**                   |
| ------------------- | ------------------------------------------------------------------------- | --------------------------------------- |
| **Performance**     | Up to **100x faster** (in-memory operations)                               | Disk-based, slower                       |
| **Ease of use**     | High-level APIs in Python, Java, Scala, R                                  | Java-based, verbose programming          |
| **Generality**      | Unified engine for batch, streaming, ML, and graph workloads               | Focused on batch processing              |
| **Fault tolerance** | Efficient recovery via lineage                                             | Slower fault recovery via re-execution   |
| **Runs everywhere** | Runs on [[Hadoop]], Apache Mesos, Kubernetes, standalone, or in the cloud  | Tied to Hadoop/YARN clusters             |

##### How is Spark Fault Tolerant?

> Resilient Distributed Datasets ([[RDD]]s)
- Restricted form of distributed shared memory
- Immutable, partitioned collections of records
- Recompute lost partitions on failure
- No cost if nothing fails

![[Screenshot 2025-07-23 at 19.17.31.png|500]]

- **Lineage Graph** – each [[RDD]] keeps track of how it was derived. If a node fails, Spark **recomputes only the lost partitions** from the original transformations (see the lineage-inspection sketch at the end of this note).

##### Writing Spark Code in Python

```python
# Spark context initialization
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local")
sc = SparkContext(conf=conf)

# Create RDDs:
# 1. From a Python list
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)

# 2. From a file (single files and glob patterns both work)
distFile = sc.textFile("data.txt")
distFile = sc.textFile("folder/*.txt")
```

##### **RDD Transformations (Lazy)**
These create a new RDD from an existing one.

| **Transformation**  | **Description**                              |
| ------------------- | -------------------------------------------- |
| `map(func)`         | Apply a function to each element             |
| `filter(func)`      | Keep elements where `func` returns `True`    |
| `flatMap(func)`     | Like `map`, but flattens the results         |
| `union(otherRDD)`   | Union of two RDDs                            |
| `distinct()`        | Remove duplicates                            |
| `reduceByKey(func)` | Combine values for each key (key-value RDDs) |
| `sortByKey()`       | Sort by keys (key-value RDDs)                |
| `join(otherRDD)`    | Join two key-value RDDs                      |
| `repartition(n)`    | Redistribute the RDD across `n` partitions   |

Transformations are **lazy** – they only execute when an action is triggered.
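To see this laziness in practice, here is a minimal word-count sketch. It reuses the `sc` context from the snippet above; the file name `data.txt` is just a placeholder.

```python
# Each step below only extends the lineage graph -- nothing runs yet.
lines = sc.textFile("data.txt")                    # placeholder input file
words = lines.flatMap(lambda line: line.split())   # split lines into words
pairs = words.map(lambda word: (word, 1))          # build (word, 1) pairs
counts = pairs.reduceByKey(lambda a, b: a + b)     # sum the counts per word

# Only an action such as collect() triggers the actual computation.
print(counts.collect())
```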
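The lineage graph from the fault-tolerance section can also be inspected directly. A small sketch, assuming the `counts` RDD from the example above:

```python
# toDebugString() describes the chain of transformations behind an RDD.
# In PySpark it returns UTF-8 bytes, hence the decode().
print(counts.toDebugString().decode())
```

Because `counts` records that it was derived from `pairs`, `words`, and the input file, Spark can rebuild any lost partition by replaying exactly these transformations.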