---
aliases:
  - Spark
---

Hadoop Ecosystem

# Apache Spark

Apache Spark is a fast, general-purpose, open-source cluster computing system designed for large-scale data processing.

**Key Characteristics:**

- Unified analytics engine supporting batch, streaming, SQL, machine learning, and graph processing.
- In-memory computation stores intermediate results in RAM (vs. Hadoop MapReduce, which writes to disk between stages); see the sketch after this list.
- Fault tolerant and scalable.
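A minimal sketch of the in-memory point, assuming a local PySpark install. `cache()` marks an RDD for storage in executor memory, so the second action reuses the cached partitions instead of recomputing them from the source; the app name, master URL, and data are illustrative.

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("CacheDemo").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Build an RDD and mark it for in-memory storage.
squares = sc.parallelize(range(1_000_000)).map(lambda x: x * x)
squares.cache()

print(squares.count())  # first action computes the RDD and caches it
print(squares.sum())    # second action reads from memory, not from the source
sc.stop()
```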
## Benefits of Spark Over Hadoop MapReduce

| Feature | Spark | Hadoop MapReduce |
| --- | --- | --- |
| Performance | Up to 100x faster (in-memory operations) | Disk-based, slower |
| Ease of use | High-level APIs in Python, Java, Scala, R | Java-based, verbose programming |
| Generality | Unified engine for batch, streaming, ML, graph | Focused on batch processing |
| Fault tolerance | Efficient recovery via lineage | Slower fault recovery via re-execution |
| Runs everywhere | On Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud | Only on Hadoop clusters (YARN) |
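The "ease of use" row is easiest to see with word count, the canonical example: a few lines of PySpark replace the full Mapper/Reducer/Driver classes a Java MapReduce job needs. A sketch, with `"data.txt"` as a placeholder input path:

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("WordCount").setMaster("local[*]"))

# Split lines into words, pair each word with 1, then sum the counts per word.
counts = (sc.textFile("data.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.take(10))  # first 10 (word, count) pairs
sc.stop()
```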
## How is Spark Fault Tolerant?

### Resilient Distributed Datasets (RDDs)

- A restricted form of distributed shared memory
- Immutable, partitioned collections of records
- Lost partitions are recomputed on failure
- No cost if nothing fails

![[Screenshot 2025-07-23 at 19.17.31.png]]

- **Lineage Graph**
  - Each RDD keeps track of how it was derived. If a node fails, Spark recomputes only the lost partitions from the original transformations, as the sketch below shows.
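You can inspect an RDD's lineage graph directly with `toDebugString()`, which prints the chain of transformations Spark would replay to rebuild lost partitions. A sketch (in recent PySpark versions the method returns bytes, hence the `decode()`):

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("Lineage").setMaster("local[*]"))

rdd = (sc.parallelize(range(100))
         .map(lambda x: x * 2)
         .filter(lambda x: x % 3 == 0))

# Lineage: parallelize -> map -> filter; nothing is recomputed unless a failure occurs.
print(rdd.toDebugString().decode())
sc.stop()
```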
## Writing Spark Code in Python
```python
# Spark context initialization
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local")
sc = SparkContext(conf=conf)

# Create RDDs:
# 1. From a Python list
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)

# 2. From a file (a single file, or a glob over a folder)
distFile = sc.textFile("data.txt")
distFile = sc.textFile("folder/*.txt")
```
## RDD Transformations (Lazy)

These create a new RDD from an existing one.

| Transformation | Description |
| --- | --- |
| `map(func)` | Apply a function to each element |
| `filter(func)` | Keep elements where `func` returns `True` |
| `flatMap(func)` | Like `map`, but flattens the results |
| `union(otherRDD)` | Union of two RDDs |
| `distinct()` | Remove duplicates |
| `reduceByKey(func)` | Combine values for each key (key-value RDDs) |
| `sortByKey()` | Sort by keys (key-value RDDs) |
| `join(otherRDD)` | Join two key-value RDDs |
| `repartition(n)` | Redistribute the RDD into `n` partitions |
Transformations are lazy: they only execute when an action (such as `collect()` or `count()`) is triggered, as the example below demonstrates.
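A small self-contained sketch of that laziness: the `filter`/`map` calls only extend the lineage graph, and no work happens until the `collect()` action forces evaluation.

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("LazyDemo").setMaster("local[*]"))

nums = sc.parallelize([1, 2, 3, 4, 5])
evens_doubled = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)  # nothing runs yet

print(evens_doubled.collect())  # [4, 8] -- the action triggers execution
sc.stop()
```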