Add my Obsidian notes
91
content/BigData/AWS/AWS Cloud Services.md
Normal file
@ -0,0 +1,91 @@
[[Cloud Computing]]

## AWS Overview

- Over 175 services
- **Pay-as-you-go** pricing
- **No upfront costs**
- **Ideal for experimentation**
- **Access to cutting-edge tools and scalability**

##### **Region**

- A physical location worldwide with multiple data centers.

##### **Availability Zone (AZ)**

- Logical group of one or more data centers within a region.
- Physically isolated (up to 100 km apart).
- Designed for **high availability and fault tolerance**.

##### **Edge Location**

- Physical sites dispersed across the globe.
- Part of Amazon's CDN (content delivery network).
- Distributes services/data closer to users to reduce latency.

##### **Planning for Failure (Resiliency)**

- **Storage**:
    * The S3 service is designed for failure.
    * Each object is replicated across multiple [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]s in the region, so you always have at least three copies of your file.

- **Compute**:
    - The owner is responsible for manually distributing resources across multiple [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]s.
    - If one AZ fails, the others keep operating.

- **Databases**:
    - The owner can configure DB deployment across multiple [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]s for redundancy.

##### **Benefits of AWS Global Infrastructure**

- High performance
- Low latency
- High availability
- Scalability
- Unlimited capacity (horizontally scalable)
- Built-in security and monitoring
- Confidentiality
- Reliability
- Low cost

##### Shared Responsibility Model of Security

![[Screenshot 2025-07-23 at 14.20.31.png]]

## AWS Core Services

##### Networking

* [[Amazon VPC]]

##### Security & Identity

- [[Amazon IAM]]

##### Compute

- [[Amazon EC2]]
- [[Amazon Lambda]]

##### Storage

- **Instance Store:**
    - Specified by the instance type. Data is stored on the same server as the [[Amazon EC2|EC2]] instance and is removed when the instance is terminated.
- [[Amazon EBS]]
- [[Amazon S3]]

##### Databases

- Relational
    - [[Amazon RDS]]
    - Amazon Redshift
    - Amazon Aurora

- Non-Relational
    - [[Amazon DynamoDB]]
    - Amazon ElastiCache
    - Amazon Neptune

- Alternatively:
    - You can install a DB of your choice on an [[Amazon EC2|EC2]] instance instead of using one provided by AWS. In that case, you take full responsibility for the security and management of your DB.

## AWS Pricing Models

##### Principles:

- **Pay-as-you-go** (only pay for usage)
- **Reserved pricing** (discounted with commitment)
- **Volume discount** (pay less per unit when you use more)

##### Free Tier Options:

- **Always free** (e.g., 1M free Lambda requests per month)
- **12-months free** (introductory offer)
- **Trial services**

### **Billing Examples:**

- [[Amazon EC2|EC2]]: Pay for runtime only.

- [[Amazon S3|S3]]: Pay for
    - Storage volume
    - Requests (PUT/GET)
    - Data transfer

- [[Amazon Lambda|Lambda]]: Pay for
    - Number of requests
    - Execution time
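
The two Lambda billing dimensions above (requests and execution time) can be combined into a back-of-the-envelope cost estimate. The rates below are illustrative, not from the source — check current AWS pricing before relying on them:

```python
# Rough Lambda bill sketch, using illustrative rates:
# $0.20 per 1M requests and ~$0.0000166667 per GB-second of compute.
REQUEST_PRICE = 0.20 / 1_000_000      # USD per request
GB_SECOND_PRICE = 0.0000166667        # USD per GB-second

def lambda_monthly_cost(requests, avg_ms, memory_mb):
    """Cost = requests * request price + (GB-seconds used) * duration price."""
    gb_seconds = requests * (avg_ms / 1000) * (memory_mb / 1024)
    return requests * REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE

# Example: 5M requests/month, 120 ms average duration, 512 MB memory.
cost = lambda_monthly_cost(5_000_000, 120, 512)
print(round(cost, 2))
```

Note how memory size multiplies the duration charge: the same code at 1024 MB would double the compute portion of the bill.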
10
content/BigData/AWS/Amazon EBS.md
Normal file
@ -0,0 +1,10 @@
---
aliases:
- EBS
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]

##### **Amazon EBS (Elastic Block Store)**

Additional block storage attached to an [[Amazon EC2|EC2]] instance; it is separate from the instance store.
- Persistent, and can be attached to any [[Amazon EC2|EC2]] instance in the [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]].
- It is not deleted when the [[Amazon EC2|EC2]] instance is terminated.
43
content/BigData/AWS/Amazon EC2.md
Normal file
@ -0,0 +1,43 @@
---
aliases:
- EC2
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]

##### **Amazon EC2 (Elastic Compute Cloud)**

A web service that provides secure, resizable compute capacity in the cloud.
* Designed to make web-scale cloud computing easier for developers.
- Secure, resizable compute capacity (virtual servers).
- Complete control over OS and apps.

**Pricing Models**:
- **On-Demand**: Pay for what you use.
- **Spot Instances**: Spare capacity at discounts of up to 90%; instances are temporary and can be reclaimed at short notice.
- **Reserved**: Lower price for a long-term commitment.

**Features:**
- **Amazon Machine Image (AMI):**
    - Preconfigured OS image
    - e.g., Linux, macOS, Windows
- **Instance type**:
    - Defines CPU, memory, storage, and networking capacity
- **Networking**:
    - [[Amazon VPC|VPC]] and subnets
- **Storage**
- **Security Group(s)**:
    - Acts like a firewall, defining access to and from the EC2 instance
- **Key pair**:
    - Used to establish a remote connection (secure SSH access)

**Instance Types**
![[Screenshot 2025-07-23 at 16.52.08.png]]
- **Instance Families**

| Family Type                       | Use Case                  |
| --------------------------------- | ------------------------- |
| General Purpose (M / T / A)       | Web servers               |
| Compute Optimized (C)             | Analytics, gaming         |
| Memory Optimized (R / X)          | High-performance DB       |
| Accelerated Computing (P / G / F) | AI, ML, GPU compute       |
| Storage Optimized (I)             | Big data, NoSQL databases |
13
content/BigData/AWS/Amazon IAM.md
Normal file
@ -0,0 +1,13 @@
---
aliases:
- IAM
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]

##### **Amazon IAM (Identity and Access Management)**

- Manages user access to services.
- Attach permission policies to identities to control which actions an identity can perform.
- Identities in Amazon IAM are ***users***, ***groups***, and ***roles***.
- Based on the ***least privilege*** principle:
    * A user or entity should only have access to the specific data, resources, and applications you have explicitly granted them.
    * Example usage:
        * Grant cross-account permissions to upload objects while ensuring that the bucket owner has full control.
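
As a sketch of the cross-account example above, here is roughly what such an S3 bucket policy could look like, built as a Python dict. The account ID `111122223333` and bucket name `example-bucket` are placeholders, not from the source:

```python
# Hypothetical bucket policy: another account may upload objects, but only
# when the upload grants the bucket owner full control via its ACL.
import json

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::example-bucket/*",
        "Condition": {
            "StringEquals": {"s3:x-amz-acl": "bucket-owner-full-control"}
        },
    }],
}
print(json.dumps(policy, indent=2))
```

The `Condition` block is what enforces the "bucket owner has full control" part: uploads that omit that ACL are denied.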
18
content/BigData/AWS/Amazon Lambda.md
Normal file
@ -0,0 +1,18 @@
---
aliases:
- Lambda
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]

##### **AWS Lambda (a serverless compute service)**

- Run backend code without provisioning servers.
- Event-driven: triggered by events (e.g., a file upload).
- Languages: Python, Node.js, Java, C#, Go, Ruby.
- Automatically scales with demand.

**Workflow**
![[Screenshot 2025-07-23 at 17.51.04.png|600]]

**Example Use Case:**
- You can configure a Lambda function to perform an action when an event occurs. For example, when an image is stored in Bucket-A, an event invokes the Lambda function, which converts the image to a new format and stores it in Bucket-B.
![[Screenshot 2025-07-23 at 17.52.39.png|400]]
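
A minimal handler sketch for the Bucket-A → Bucket-B flow above. The event shape follows the standard S3 notification; the `boto3` calls and the conversion step are left as comments so the sketch stays self-contained, and `convert_image` is a hypothetical helper:

```python
# Hedged sketch of a Lambda handler for the S3-triggered image conversion.
def handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]   # source bucket (Bucket-A)
    key = record["s3"]["object"]["key"]       # key of the uploaded image
    # In a real deployment, the S3 round-trip would go here:
    # s3 = boto3.client("s3")
    # body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    # converted = convert_image(body)               # hypothetical helper
    # s3.put_object(Bucket="bucket-b", Key=key, Body=converted)
    return {"bucket": bucket, "key": key}

# Local invocation with a minimal fake S3 event:
event = {"Records": [{"s3": {"bucket": {"name": "bucket-a"},
                             "object": {"key": "cat.jpg"}}}]}
print(handler(event, None))
```

Because the handler only receives an event describing the upload, it must fetch the object itself — Lambda passes metadata, not the file contents.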
14
content/BigData/AWS/Amazon RDS.md
Normal file
@ -0,0 +1,14 @@
---
aliases:
- RDS
- AWS RDS
---

> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]

##### **Amazon RDS (Relational Database Service)**

A managed, cloud-based, distributed relational database service.
- Managed relational DBs (e.g., MySQL, PostgreSQL, Oracle).
- AWS handles backups, patching, and scaling.
- You can build a fault-tolerant DB by configuring RDS for **Multi-[[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]** deployment:
place your primary RDS instance in one [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]] and a standby replica in another [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]].
![[Screenshot 2025-07-23 at 17.48.07.png|600]]
20
content/BigData/AWS/Amazon S3.md
Normal file
@ -0,0 +1,20 @@
---
aliases:
- S3
---

> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]

##### Amazon S3 (Simple Storage Service)

An object storage service that offers industry-leading scalability, data availability, security, and performance.
- Object-based storage.
- Designed for 99.999999999% (11 nines) durability.
- Ideal for:
    - Media storage
    - Backups / archives
    - Data lakes
    - ML and analytics

**Components:**
- **Bucket**: A container that stores an unlimited number of objects
- **Object**: The actual entities stored in buckets
- **Key**: The unique identifier of an object within a bucket
12
content/BigData/AWS/Amazon VPC.md
Normal file
@ -0,0 +1,12 @@
---
aliases:
- VPC
---

> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]

##### **Amazon VPC (Virtual Private Cloud)**

A logically **isolated network** within AWS for your resources.
- Create a ***public-facing*** subnet for your web servers, which need access to the internet.
- Create a ***private*** subnet with no internet access for your backend systems
    - e.g., databases, application servers
- Enables fine-grained control over traffic with both a public and a private subnet.
41
content/BigData/Big Data Intro.md
Normal file
@ -0,0 +1,41 @@

##### Data vs. Information

- **Data** is just raw facts (like the number 42).
    - But 42 could mean: age, shoe size, stock amount, etc.

- **Information** is data given meaning.
    - Example: “Age = 42” adds context and becomes useful.

##### Big Data Implementations

- **Delta** – *Sentiment analysis* (e.g., of customer feedback).
- **Netflix** – *User behavioral analysis* (e.g., what you watch and when).
- **Time Warner** – *Customer segmentation* (dividing customers into groups).
- **Volkswagen** – *Predictive support* (e.g., predicting car issues).
- **Visa** – *Fraud detection*.
- **Chinese government** – *Security intelligence* (national security).
- **Weather forecasting** – *Weather prediction models*.
- **Hospitals** – Diagnosing diseases using *machine learning* on images.
- **Amazon** – *Price optimization*.
- **Facebook** – Targeted advertising using *user profiling*.

##### Design Principles for Big Data

1. **Horizontal Growth** – Add more machines instead of stronger ones.
2. **Distributed Processing** – Split work across machines.
3. **Process Where the Data Is** – Don’t move data; move the code.
4. **Simplicity of Code** – Keep logic understandable.
5. **Recover from Failures** – Systems should self-heal.
6. **Idempotency** – Running the same job twice shouldn’t break results.
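
A tiny illustration of the idempotency principle (not from the source): if a job writes each result under a deterministic key, re-running it after a failure leaves the result store unchanged instead of duplicating rows.

```python
# Idempotent job sketch: deterministic keys make retries safe.
store = {}

def run_job(records):
    for rec in records:
        store[rec["id"]] = rec["value"] * 2   # same key -> same slot on retry

records = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]
run_job(records)
run_job(records)          # "accidental" second run after a partial failure
print(store)
```

An append-based version (`results.append(...)`) would instead double its output on every retry — exactly what this principle rules out.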

##### Big Data SLA (Service Level Agreement)

Defines performance expectations:

- **Reliability** – Will the data be there?
- **Consistency** – Is the data accurate across systems?
- **Availability** – Is the system always accessible?
- **Freshness** – How up-to-date is the data?
- **Response time** – How fast do queries return?

* Other concerns:
    - **Cost**
    - **Scalability**
    - **Performance**

> Next [[Cloud Services]]
15
content/BigData/Big Data.md
Normal file
@ -0,0 +1,15 @@
>“An accumulation of data
>that is too large and complex
>for processing by traditional
>database management tools”
>
>**In Short:**
>Big Data = too big for standard tools like Excel or regular SQL databases.

[[Big Data Intro]]
[[Cloud Services]]
[[AWS Cloud Services]]
[[Database Overview]]
[[RDBMS]]
[[Hadoop]]
[[Hadoop Eccosystem]]
29
content/BigData/Cloud Computing.md
Normal file
@ -0,0 +1,29 @@
##### Benefits of Cloud Computing

- **Elasticity**: Start small and scale as needed.
- **Cost-efficiency**: No need to spend money on data centers.
- **No capacity guessing**: Scale automatically based on demand.
- **Economies of scale**: Benefit from AWS’s vast infrastructure.
- **Agility**: Deploy resources quickly.
- **Global Reach**: Go international within minutes.

##### Deployment Models in the Cloud

- **IaaS** (Infrastructure as a Service):
    - Virtual machines, storage, networks
    - e.g., Amazon EC2.

- **PaaS** (Platform as a Service):
    - Managed environments for building apps
    - e.g., AWS Elastic Beanstalk.

- **[[Cloud Services#Selling Your Service|SaaS]]** (Software as a Service):
    - Full applications delivered over the internet
    - e.g., Gmail.

##### Deployment Strategies of Cloud Computing

- **On-Premises (Private Cloud)**: Owned and operated on-site.
- **Public Cloud**: Fully hosted on cloud provider infrastructure.
- **Hybrid Cloud**: Combines on-premises and cloud resources.

##### Cloud Providers Comparison

![[Screenshot 2025-07-23 at 13.54.07.png|600]]

> Next [[AWS Cloud Services]]
58
content/BigData/Cloud Services.md
Normal file
@ -0,0 +1,58 @@

Introduction to cloud computing concepts relevant for Big Data.

##### Traditional software deployment process:

1. **Coding**
2. **Compiling** – turning source code into executable files.
3. **Installing** – putting the software on computers.

##### Clustered Software

Three related architectures:

1. **Redundant Servers** – multiple servers running the same service for fault tolerance.
    - e.g., several identical web servers.

2. **Micro-services** – the system is broken into **small, independent services** that communicate with each other.
    - Each handles a specific function.

3. **Clustered Computing** – a large task is **split into sub-tasks** running on **multiple nodes**.
    - Used in Big Data systems like **NoSQL databases**.

##### Scaling a Software System

Two ways to handle growing demand:

- **Scale Up**: Make one machine stronger.
    - When running out of resources, add *memory*, *CPU*, *disk*, or *network bandwidth*.
    - Can become expensive or reach hardware limits.

- **Scale Out**: Add more machines to share the work.
    - Add **redundant servers** or use **cluster computing**.
    - Each server can be **standalone** (like a web server) or part of a **coordinated system** (like a NoSQL cluster).
    - More fault-tolerant and scalable than vertical scaling.

- Tradeoff:
    - **Scale-up** is simpler but has limits.
    - **Scale-out** is more flexible and resilient but more complex.

##### Selling Your Service

- **Install** – Software as installation
    - e.g., Microsoft’s Office package

- **SaaS** – Software as a Service
    - No need to install, just log in and use.
    - e.g., Google Docs, Zoom, Dropbox.

- Common SaaS pricing models:
    1. **Per-user** – Pay per person.
    2. **Tiered** – Fixed price for different feature levels.
    3. **Usage-based** – Pay for what you use (e.g., storage, API calls).

##### Deployment Models

Where you run your software:
- **On-Premises**: Your own machines or rented servers (or VMs).
- **Cloud**: Run on virtual machines (VMs) from a cloud provider (e.g., AWS, Azure, GCP).

##### Cloud Deployment Options

When deploying to the cloud, you have options:
1. **Vanilla Node**: Raw VM – you install everything.
2. **Cloud VM**: VM with pre-installed software.
3. **Managed Service**: Cloud provider handles setup, scaling, and updates (e.g., [[Amazon RDS|AWS RDS]], Google BigQuery).

> Next [[Cloud Computing]]
24
content/BigData/Database History.md
Normal file
@ -0,0 +1,24 @@

1. **Punch cards** – physical cards with holes. Early computers read data this way.
![[Screenshot 2025-07-23 at 12.08.22.png|400]]

2. **Magnetic media**:
    - First: **floppy disks**
    ![[Screenshot 2025-07-23 at 12.08.48.png|400]]
    - Then: **hard disks** (faster, more storage)
    ![[Screenshot 2025-07-23 at 12.09.30.png]]
3. **1960s**: First **Database Management Systems (DBMSs)** created:
    - Charles W. Bachman developed the **Integrated Data Store (IDS)**
    - IBM developed **IMS**
4. **1970s**:
    - IBM created **SQL** (Structured Query Language)
    - Modern relational databases (RDBMSs) were born
5. **Late 20th century**: Many RDBMSs
    - Oracle, Microsoft’s SQL Server, IBM’s DB2, MySQL, Sybase...
![[Screenshot 2025-07-23 at 12.10.02.png]]

##### Hadoop history

![[Screenshot 2025-07-23 at 12.16.32.png|200]]
2005 – started by ***Doug Cutting*** at Yahoo!
[[Hadoop]] is an [[Open Source]] Apache project.
Benefits: free, flexible, community-supported.
22
content/BigData/Database Overview.md
Normal file
@ -0,0 +1,22 @@

[[Database History]]
[[RDBMS]] – relational models
[[Hadoop]]

##### **Big Data Challenges**

Examples of tasks that are hard with large datasets:
1. Count the **most frequent words** in Wikipedia.
2. Find the **hottest November** per country from weather data.
3. Find the **day with the most critical errors** in company logs.

These problems require:
- **Huge data**
- **Efficient distributed computing**

#### [[RDBMS]] vs. [[Hadoop]]

| **Feature**    | **RDBMS**           | **Hadoop**                    |
| -------------- | ------------------- | ----------------------------- |
| Data structure | Structured (tables) | Any (structured/unstructured) |
| Scalability    | Limited             | Highly scalable               |
| Speed          | Fast (small data)   | Designed for huge data        |
| Access         | SQL                 | Code (e.g., Java, Python)     |
59
content/BigData/Hadoop/Apache Hive.md
Normal file
@ -0,0 +1,59 @@
---
aliases:
- Hive
---
> [[Hadoop Eccosystem|Systems based on MapReduce]]

### Apache Hive

##### **Key Features**
- Developed by **Apache**.
- SQL-like syntax (HiveQL) for querying [[HDFS]] or other large data stores.
- Translates SQL queries into one or more [[MapReduce]] jobs.
- Maps data in [[HDFS]] into virtual [[RDBMS]]-like tables.
- **Pro**:
    - Convenient for **data analytics** using familiar SQL.
* **Con**:
    * Quite slow response time (each query launches MapReduce jobs).

##### **Hive Data Model**

**Structure**
- **Physical**: Data stored in [[HDFS]] blocks across nodes.
- **Virtual Table**: Defined with a schema using metadata.
- **Partitions**: Logical splits of data to speed up queries.

**Metadata**
- Hive stores metadata in a database (the metastore):
    - Maps physical files to tables.
    - Maps fields (columns) to line structures in the raw data.

![[Screenshot 2025-07-23 at 18.25.32.png]]

**Hive Architecture**
![[Screenshot 2025-07-23 at 18.27.30.png]]

##### Hive Usage
```
# Start a Hive shell:
$ hive

# Create a Hive table:
hive> CREATE TABLE mta (id BIGINT, name STRING, startdate TIMESTAMP, email STRING);

# Show all tables:
hive> SHOW TABLES;

# Add a new column to the table:
hive> ALTER TABLE mta ADD COLUMNS (description STRING);

# Load an HDFS data file into the table:
hive> LOAD DATA INPATH '/home/hadoop/mta_users' OVERWRITE INTO TABLE mta;

# Query employees that have worked more than a year:
hive> SELECT name FROM mta WHERE (unix_timestamp() - startdate > 365 * 24 * 60 * 60);

# Execute a command without the shell:
$ hive -e 'SELECT name FROM mta;'

# Execute a script from a file:
$ hive -f hive_script.txt
```
68
content/BigData/Hadoop/Apache Spark.md
Normal file
@ -0,0 +1,68 @@
---
aliases:
- Spark
---
> [[Hadoop Eccosystem|Systems based on MapReduce]]

## Apache Spark
> Apache Spark is a **fast**, **general-purpose**, **open-source** cluster computing system designed for large-scale data processing.

##### Key Characteristics:
- **Unified analytics engine** – supports batch, streaming, SQL, machine learning, and graph processing.
- **In-memory computation** – stores intermediate results in RAM (vs. Hadoop, which writes to disk).
- **Fault-tolerant** and scalable.

##### Benefits of Spark Over [[Hadoop]] [[MapReduce]]

| **Feature**         | **Spark**                                                                  | **Hadoop MapReduce**                   |
| ------------------- | -------------------------------------------------------------------------- | -------------------------------------- |
| **Performance**     | Up to **100x faster** (in-memory operations)                               | Disk-based, slower                     |
| **Ease of use**     | High-level APIs in Python, Java, Scala, R                                  | Java-based, verbose programming        |
| **Generality**      | Unified engine for batch, stream, ML, graph                                | Focused on batch processing            |
| **Fault tolerance** | Efficient recovery via lineage                                             | Slower fault recovery via re-execution |
| **Runs everywhere** | Runs on [[Hadoop]], Apache Mesos, Kubernetes, standalone, or in the cloud  | Tied to the Hadoop/YARN stack          |

##### How Is Spark Fault Tolerant?
> Resilient Distributed Datasets ([[RDD]]s)

- Restricted form of distributed shared memory
- Immutable, partitioned collections of records
- Recompute lost partitions on failure
- No cost if nothing fails

![[Screenshot 2025-07-23 at 19.17.31.png|500]]

- **Lineage Graph**
    - Each [[RDD]] keeps track of how it was derived. If a node fails, Spark **recomputes only the lost partition** from the original transformations.

##### Writing Spark Code in Python
```python
# Spark context initialization
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local")
sc = SparkContext(conf=conf)

# Create RDDs:
# 1. From a Python list
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)

# 2. From a file
distFile = sc.textFile("data.txt")
distFile = sc.textFile("folder/*.txt")
```
##### **RDD Transformations (Lazy)**
These create a new RDD from an existing one.

| map(func)         | Apply function to each element               |
| ----------------- | -------------------------------------------- |
| filter(func)      | Keep elements where func returns True        |
| flatMap(func)     | Like map, but flattens results               |
| union(otherRDD)   | Union of two RDDs                            |
| distinct()        | Remove duplicates                            |
| reduceByKey(func) | Combine values for each key (key-value RDDs) |
| sortByKey()       | Sort by keys                                 |
| join(otherRDD)    | Join two key-value RDDs                      |
| repartition(n)    | Re-distribute the RDD into n partitions      |

Transformations are **lazy** – they only execute when an action is triggered.
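
The semantics of `map`, `filter`, and `reduceByKey` from the table can be mimicked in plain Python — a single-machine illustration of what each transformation computes, not Spark itself (real Spark distributes the data across partitions and defers execution until an action):

```python
# Plain-Python mimic of three RDD transformations (illustration only).
from itertools import groupby

words = ["a", "good", "cook", "a", "cook", "cookies"]

mapped = [(w, 1) for w in words]                       # map: word -> (word, 1)
filtered = [p for p in mapped if len(p[0]) > 1]        # filter: drop 1-letter words
grouped = groupby(sorted(filtered), key=lambda p: p[0])
counts = {k: sum(v for _, v in g) for k, g in grouped} # reduceByKey: sum per key
print(counts)
```

In real Spark the `sorted`/`groupby` step corresponds to the shuffle that `reduceByKey` performs across the cluster.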
27
content/BigData/Hadoop/Google Dremel.md
Normal file
@ -0,0 +1,27 @@

> [[Hadoop Eccosystem|Systems based on MapReduce]]

**Key Ideas**
- Leverages a columnar file format
- Optimized for SQL performance

**Concepts**
- Tree-based **query execution**.
- Efficient scanning and aggregation of **nested columnar data**.

### Columnar Data Format
> Illustration of what columnar storage is all about:
> given three columns:

![[Screenshot 2025-07-23 at 18.42.46.png|170]]
> In row-oriented storage, the data is laid out one row at a time:

![[Screenshot 2025-07-23 at 18.45.25.png|500]]
> Whereas in column-oriented storage, it is laid out one column at a time:

![[Screenshot 2025-07-23 at 18.46.55.png|500]]
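
The two layouts can also be shown in a few lines of Python, using a made-up three-column table: row-oriented storage interleaves the fields of each record, while column-oriented storage groups all values of a column together, which lets a query scan only the columns it needs.

```python
# Row-oriented vs. column-oriented layout of the same three-column table.
rows = [("a1", "b1", "c1"),
        ("a2", "b2", "c2"),
        ("a3", "b3", "c3")]

row_layout = [field for row in rows for field in row]   # a1 b1 c1 a2 b2 c2 ...
col_layout = [field for col in zip(*rows) for field in col]  # a1 a2 a3 b1 ...

print(row_layout)
print(col_layout)
```

A query touching only column `b` reads one contiguous third of `col_layout`, but has to skip through all of `row_layout` — the core advantage Dremel exploits.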

**Nested data in columnar format**
![[Screenshot 2025-07-23 at 18.50.10.png]]![[Screenshot 2025-07-23 at 18.50.16.png]]

### Frameworks Inspired by Google Dremel
- Apache Drill (MapR)
- Apache Impala (Cloudera)
- Apache Tez (Hortonworks)
- Presto (Facebook)
64
content/BigData/Hadoop/HDFS.md
Normal file
@ -0,0 +1,64 @@

##### HDFS ([[Hadoop]] Distributed File System)
Stores huge files (typical file size: GB–TB) across multiple machines.
- Breaks files into **blocks** (typically 128 MB).
- **Replicates** blocks (default: 3 copies) for fault tolerance.
- Accessed through a POSIX-like API (HDFS relaxes some POSIX requirements).
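
Quick arithmetic for the block model above: a file is split into fixed-size blocks and each block is stored three times, so the cluster footprint is roughly an upper bound of `ceil(size / block) * 3 * block` (the final block may be only partially filled):

```python
# Block count and worst-case storage footprint for a file in HDFS.
import math

BLOCK_MB = 128
REPLICATION = 3

def hdfs_footprint(file_mb):
    blocks = math.ceil(file_mb / BLOCK_MB)
    return blocks, blocks * REPLICATION * BLOCK_MB  # upper bound in MB

# A 1 GB (1024 MB) file: 8 blocks, stored 3x across DataNodes.
blocks, stored_mb = hdfs_footprint(1024)
print(blocks, stored_mb)
```

This is also why HDFS prefers big files (see the design principles below the block list): a 1 KB file still occupies a metadata entry and three replicas, so millions of tiny files overload the NameNode.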

##### HDFS Design Principles
* **Immutable**: **write-once, read-many**
* **No failures**: A disk or node failure does not affect the file system
* **File size unlimited**: Up to 512 yottabytes (2^63 × 64 MB)
* **File number limited**: 1,048,576 files per directory
* **Prefer bigger files**: Big files provide better performance

##### HDFS File Formats
- Text/CSV – no schema, no metadata
- JSON Records – metadata is stored with the data
- Avro Files – schema independent of the data
- Sequence Files – binary files (used as intermediate storage in M/R)
- RC Files – Record Columnar files
- ORC Files – Optimized RC files; compress better
- Parquet Files – yet another RC file format

##### HDFS Command Line
```
# List files
hadoop fs -ls /path

# Make directory
hadoop fs -mkdir /user/hadoop

# Print file
hadoop fs -cat /file

# Upload file
hadoop fs -copyFromLocal file.txt hdfs://...
```

#### HDFS Architecture – Main Components

##### 1. NameNode (Master Node)
- **Stores metadata** about the filesystem:
    - Filenames
    - Directory structure
    - Block locations
    - Permissions

- It **does not store the actual data**.
- There is **one active NameNode** per cluster.

##### 2. DataNodes (Worker Nodes)
- Store the **actual data blocks** of files.
- Send **heartbeat** messages to the NameNode to report that they are alive.
- When a file is written, it’s split into blocks and distributed across many DataNodes.
- DataNodes also **replicate** blocks (typically 3 copies) to provide **fault tolerance**.

#### File Read / Write
**When a file is written:**
1. The client contacts the **NameNode** to ask: “Where should I write the blocks?”
2. The NameNode responds with a list of **DataNodes** to use.
3. The client sends the blocks of the file to those DataNodes.
4. Blocks are **replicated** automatically across different nodes for redundancy.

**When a file is read:**
1. The client contacts the **NameNode** to get the list of DataNodes storing the required blocks.
2. The client reads the blocks **directly** from the DataNodes.
17
content/BigData/Hadoop/Hadoop Eccosystem.md
Normal file
@ -0,0 +1,17 @@

### Systems based on [[MapReduce]]
> Early-generation frameworks for big data processing.
* [[Apache Hive]]

### Systems that replace MapReduce
> Newer, faster frameworks with different architectures and performance improvements.

**Motivation**: [[MapReduce]] and [[Apache Hive|Hive]] are too slow!
- [[Google Dremel]]
- [[Apache Spark]]
    - Replaces MapReduce with its own engine, which works much faster without compromising consistency
    - Architecture based not on MapReduce but on two concepts:
        - RDD (Resilient Distributed Dataset)
        - DAG (Directed Acyclic Graph)
    - Pros:
        - Works much faster than MapReduce
        - Fast-growing community
13
content/BigData/Hadoop/Hadoop.md
Normal file
@ -0,0 +1,13 @@

![[Screenshot 2025-07-23 at 12.20.09.png|400]]
> Hadoop is an **[[Open Source]] framework** for:
> - **Distributed storage** (across many machines)
> - **Distributed processing** (run programs on many machines in parallel)
>
> > It is **not a database** — it is an ecosystem for managing and analyzing **Big Data**.

## **Hadoop Components Overview**
![[Screenshot 2025-07-23 at 11.58.48.png]]
> 1. [[HDFS]]
> 2. [[MapReduce]]
> 3. [[Yarn]]

[[Hadoop Eccosystem]]
32
content/BigData/Hadoop/MapReduce.md
Normal file
@ -0,0 +1,32 @@

A programming model for processing big data in parallel.
- Distributed processing – the job runs in parallel on several nodes
- Run the process where the data is!
- Horizontal scalability

- **Map** step: transform the input
    - Transform, filter, calculate
    - Local data
    - e.g., count 1 per word

- **Shuffle** step: reorganization of the map output
    - Shuffle, sort, group

- **Reduce** step: aggregate / sum the groups
    - e.g., sum word counts

MapReduce **runs code where the data is**, saving data transfer time.

![[Screenshot 2025-07-23 at 13.00.20.png]]
##### Example:
From the sentence:
> “how many cookies could a good cook cook if a good cook could cook cookies”

Steps:
1. **Map**:
    - Each word becomes a pair like ("cook", 1)
2. **Shuffle**:
    - Group by word
3. **Reduce**:
    - Add up counts → ("cook", 4)

![[Screenshot 2025-07-23 at 13.01.20.png]]
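
The three steps above can be sketched in plain Python — a single-machine illustration; real MapReduce runs each phase in parallel across nodes:

```python
# Single-machine sketch of map -> shuffle -> reduce for the sentence above.
from collections import defaultdict

sentence = ("how many cookies could a good cook cook "
            "if a good cook could cook cookies")

# Map: each word -> ("word", 1)
mapped = [(word, 1) for word in sentence.split()]

# Shuffle: group the pairs by word
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum each group
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts["cook"])  # → 4
```

In a real cluster, the map output for each word is routed to one reducer, so each reducer sees all counts for its words — the same role `groups` plays here.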
20
content/BigData/Hadoop/RDD.md
Normal file
@ -0,0 +1,20 @@

## RDD (Resilient Distributed Dataset)
>An RDD is an immutable (read-only) distributed collection of objects.
>
>The dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

![[Screenshot 2025-07-23 at 19.08.40.png|600]]
##### **Key Properties:**
- Distributed: Automatically split across cluster nodes.
- Lazy evaluation: Transformations aren’t executed until an action is called.
- Fault-tolerant: Can **recompute lost partitions** using the lineage graph.
- Parallel: Operates concurrently across cluster cores.
##### Data Sharing
> In [[Hadoop]] [[MapReduce]]
![[Screenshot 2025-07-23 at 19.11.44.png|500]]

> In [[Apache Spark|Spark]]
![[Screenshot 2025-07-23 at 19.12.57.png|500]]
>10–100x faster than network and disk!
35
content/BigData/Hadoop/Yarn.md
Normal file
@ -0,0 +1,35 @@
**YARN (Yet Another Resource Negotiator)** is [[Hadoop]]’s cluster resource management system.
- Multiple jobs run simultaneously
- Multiple jobs share the same resources (disk, CPU, memory)
- YARN assigns resources to jobs and tasks exclusively
##### YARN is in charge of:
1. Allocating resources
2. Scheduling jobs
    - assigns priorities to jobs according to a policy:
      FIFO scheduler, Fair scheduler, or Capacity scheduler
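The scheduler policy is selected in `yarn-site.xml` via the ResourceManager's scheduler class — a config sketch choosing the Capacity scheduler (the property and class names are the standard Hadoop ones):

```xml
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
```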
##### Components:
- **ResourceManager**
    - oversees resource allocation across the cluster
- **NodeManager**
    - Each node in the cluster runs a NodeManager.
    - This component manages the execution of containers on its node.
- **ApplicationMaster**
    - manages the lifecycle of a single application (one per application)
    - handles job scheduling and monitors progress
- **Resource Container**
    - a logical bundle of resources (e.g., CPU, Memory) that is allocated by the ResourceManager

![[Screenshot 2025-07-23 at 13.29.37.png]]
##### YARN ecosystem
YARN can run other applications besides Hadoop [[MapReduce]] that integrate into the Hadoop ecosystem:
- Apache Storm (data streaming engine)
- [[Apache Spark]] (data batch and streaming engine)
- Apache Solr (search platform)
13
content/BigData/Open Source.md
Normal file
@ -0,0 +1,13 @@
- Source Code available
- Free Redistribution
- Derived Works

![[Screenshot 2025-07-23 at 12.24.23.png]]
Open-source solutions replace closed-source ones:
![[Screenshot 2025-07-23 at 12.25.00.png]]
More open-source solutions:
![[Screenshot 2025-07-23 at 12.25.28.png]]

![[Screenshot 2025-07-23 at 12.27.11.png]]![[Screenshot 2025-07-23 at 12.27.39.png]]
63
content/BigData/RDBMS.md
Normal file
@ -0,0 +1,63 @@
[[Database Overview]]
##### What is an RDBMS?
**Relational Database Management System**:
- Data is stored in **tables**:
    - **Rows** = records
    - **Columns** = fields
- Each table has:
    - **Indexes** for fast searching
    - **Relationships** with other tables (via keys)
##### Relational model - Keys and Indexes
Keys and indexes make it possible to find record(s) quickly.
- Operations become efficient:
    - **Find by key** → O(log n)
    - **Fetch record by ID** → O(1)

Indexes = sorted references to data locations → like a book index.
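The O(log n) key lookup over a sorted index can be sketched with Python's `bisect` — the index here is a hypothetical sorted list of keys paired with row locations, invented for illustration:

```python
import bisect

# A tiny "index": keys kept sorted, each pointing at a row location on disk
index_keys = [3, 8, 15, 21, 42, 57, 90]
row_locations = [0x10, 0x48, 0x88, 0xA0, 0xC8, 0xF0, 0x128]

def find_by_key(key):
    """Binary search over the sorted keys: O(log n) comparisons."""
    i = bisect.bisect_left(index_keys, key)
    if i < len(index_keys) and index_keys[i] == key:
        return row_locations[i]  # fetching the row at this location is then O(1)
    return None  # key not present in the index

print(hex(find_by_key(42)))  # → 0xc8
```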
##### Relational model - Operations
Relational databases support **CRUD**:
- **C**reate
- **R**ead
- **U**pdate
- **D**elete

Each operation uses both:
- The **index** (to locate data)
- The **data** itself (to read/write)
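The four CRUD operations in SQL, sketched with Python's built-in `sqlite3` (the table and column names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Create
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
# Read
name = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()[0]
# Update
conn.execute("UPDATE users SET name = ? WHERE id = 1", ("bob",))
# Delete
conn.execute("DELETE FROM users WHERE id = 1")

remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(name, remaining)  # → alice 0
```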
##### Relational model - Transactional
Relational databases guarantee **transaction safety** with ACID:
- **A**tomicity – all or nothing
- **C**onsistency – valid data only
- **I**solation – no interference from other transactions
- **D**urability – survives crashes

* Examples:
    - Transferring money, posting a tweet
    - Each must either **succeed completely** or **fail completely**.

Transactions guarantee data validity despite errors & failures.
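Atomicity for the money-transfer example, sketched with `sqlite3` (the account IDs and balances are invented; `with conn` wraps the statements in one transaction that commits on success and rolls back on any error):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

try:
    with conn:  # one transaction: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 60 WHERE id = 1")
        # ...the matching credit to account 2 would go here, but we
        # simulate a crash in the middle of the transfer:
        raise RuntimeError("crash mid-transfer")
except RuntimeError:
    pass

balances = [r[0] for r in conn.execute("SELECT balance FROM accounts ORDER BY id")]
print(balances)  # → [100, 0]  (the debit was rolled back, no money vanished)
```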
##### Relational model - SQL
**SQL** is the language used to talk to relational databases.
- **S**tructured
- **Q**uery
- **L**anguage

- All RDBMSs use it (MySQL, PostgreSQL, Oracle, etc.)
##### Pros and Cons of RDBMS
**Pros:**
- Structured data
- ACID transactions
- Powerful SQL
- Fast (for small/medium data sizes)

**Cons**:
- Doesn’t scale well (single machine, SPOF = Single Point of Failure)
- Becomes **slow** with **big data**
- **Less fault tolerant**
- Not designed for **massive, distributed systems**