Add my Obsidian notes

This commit is contained in:
DeGefen 2025-07-23 20:36:04 +03:00
parent 059848f8b0
commit 7b7a97b7cf
59 changed files with 821 additions and 0 deletions

View File

@ -0,0 +1,91 @@
[[Cloud Computing]]
## AWS Overview
- Over 175 services
- **Pay-as-you-go** pricing
- **No upfront costs**
- **Ideal for experimentation**
- **Access to cutting-edge tools and scalability**
##### **Region**
- A physical location worldwide with multiple data centers.
##### **Availability Zone (AZ)**
- Logical group of one or more data centers within a region.
- Physically isolated (up to 100 km apart).
- Designed for **high availability and fault tolerance**.
##### **Edge Location**
- Physical sites dispersed across the globe
- Part of Amazon's CDN (Content Delivery Network).
- Distributes services/data closer to users to reduce latency.
##### **Planning for Failure (Resiliency)**
- **Storage**:
* S3 service is designed for failure.
* Each file is replicated across multiple [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]s in the region, so you always have at least three copies of your file.
- **Compute**:
- The owner is responsible for manually distributing resources across multiple [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]s.
- If one fails, the others still operate.
- **Databases**:
- The owner can configure DB deployment across multiple [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]s for redundancy.
##### **Benefits of AWS Global Infrastructure**
- High performance
- Low latency
- High availability
- Scalability
- Unlimited capacity (horizontally scalable)
- Built-in security and monitoring
- Confidentiality
- Reliability
- Low cost
##### Shared Responsibility of Security
![[Screenshot 2025-07-23 at 14.20.31.png]]
## AWS Core Services
##### Networking
* [[Amazon VPC]]
##### Security & Identity
- [[Amazon IAM]]
##### Compute
- [[Amazon EC2]]
- [[Amazon Lambda]]
##### Storage
- **Instance Store:**
- Specified by instance type. Data is stored on the same server as the [[Amazon EC2|EC2]] instance. It is removed when the instance is terminated.
- [[Amazon EBS]]
- [[Amazon S3]]
##### Databases
- Relational
- [[Amazon RDS]]
- Amazon Redshift
- Amazon Aurora
- Non-Relational
- [[Amazon DynamoDB]]
- Amazon ElastiCache
- Amazon Neptune
- Alternatively:
- You can install a DB of your choice on an [[Amazon EC2|EC2]] instance instead of using one provided by AWS. In that case, you take full responsibility for the security and management of your DB.
## AWS Pricing Models
##### Principles:
- **Pay-as-you-go** (only pay for usage)
- **Reserved pricing** (discounted with commitment)
- **Volume discount** (pay less when you use more)
##### Free Tier Options:
- **Always free** (e.g., 1M free Lambda requests per month)
- **12 months free** (introductory offer)
- **Trial services**
### **Billing Examples:**
- [[Amazon EC2|EC2]]: Pay for runtime only.
- [[Amazon S3|S3]]: Pay for
- Storage volume
- Requests (PUT/GET)
- Data transfer
- [[Amazon Lambda|Lambda]]: Pay for
- Number of requests
- Execution time
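To make the billing model concrete, here is a back-of-the-envelope sketch of a monthly Lambda bill in Python. The rates and free-tier figures below are illustrative assumptions, not current AWS prices:
```
# Rough Lambda bill estimate. Rates/free-tier values are assumptions
# for illustration; check the AWS pricing page for real numbers.
PRICE_PER_MILLION_REQUESTS = 0.20   # assumed USD per 1M requests
PRICE_PER_GB_SECOND = 0.0000167     # assumed USD per GB-second
FREE_REQUESTS = 1_000_000           # assumed always-free requests/month
FREE_GB_SECONDS = 400_000           # assumed always-free GB-seconds/month

def lambda_monthly_cost(requests, avg_duration_ms, memory_mb):
    # Compute charge = requests x duration (s) x memory (GB), in GB-seconds.
    gb_seconds = requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    billable_requests = max(0, requests - FREE_REQUESTS)
    billable_gb_seconds = max(0, gb_seconds - FREE_GB_SECONDS)
    return (billable_requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
            + billable_gb_seconds * PRICE_PER_GB_SECOND)

# 5M requests/month, 120 ms average duration, 512 MB memory:
print(f"${lambda_monthly_cost(5_000_000, 120, 512):.2f}")  # $0.80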

View File

@ -0,0 +1,10 @@
---
aliases:
- EBS
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]
##### **Amazon EBS (Elastic Block Store)**
Extra block storage that attaches to an [[Amazon EC2|EC2]] instance; it is separate from the instance store.
- Persistent and can be attached to any [[Amazon EC2|EC2]] instance in the [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]].
- It is not deleted when the [[Amazon EC2|EC2]] instance is terminated.

View File

@ -0,0 +1,43 @@
---
aliases:
- EC2
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]
##### **Amazon EC2 (Elastic Compute Cloud)**
A web service that provides secure, resizable compute capacity in the cloud.
* Designed to make web-scale cloud computing easier for developers.
- Secure, resizable compute capacity (virtual servers).
- Complete control over OS and apps.
**Pricing Models**:
- **On-Demand**: Pay for what you use.
- **Spot Instances**: Cheap, interruptible, short-lived instances; save up to 90%.
- **Reserved**: Lower price for long-term usage.
**Features:**
- **Amazon Machine Image (AMI):**
- Preconfigured OS image
- e.g., Linux, macOS, Windows
- **Instance type**:
- Defines CPU, memory, storage, networking capacity
- **Networking**:
- [[Amazon VPC|VPC]] and subnets
- **Storage**
- **Security Group(s)**:
- Acts like a firewall; defines inbound and outbound access to the EC2 instance
- **Key pair**:
- Establishes a remote connection (secure SSH access)
![[Screenshot 2025-07-23 at 16.52.08.png]]
- **Instance Families**
| Family Type                        | Use Case                  |
| ---------------------------------- | ------------------------- |
| General Purpose (M / T / A) | Web servers |
| Compute Optimized (C) | Analytics, gaming |
| Memory Optimized (R / X) | High-performance DB |
| Accelerated Computing (P / G / F) | AI, ML, GPU compute |
| Storage Optimized (I) | Big data, NoSQL databases |
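The launch settings above (AMI, instance type, key pair, security group) map one-to-one onto the EC2 API. A minimal launch sketch with boto3; the AMI ID, key name, and security group ID are hypothetical placeholders:
```
# Minimal EC2 launch sketch (boto3). All IDs/names are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",            # AMI: preconfigured OS image
    InstanceType="t3.micro",                    # instance type: CPU/memory/network
    KeyName="my-key-pair",                      # key pair: SSH access
    SecurityGroupIds=["sg-0123456789abcdef0"],  # security group: firewall
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```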

View File

@ -0,0 +1,13 @@
---
aliases:
- IAM
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]
##### **Amazon IAM (Identity and Access Management)**
- Manages user access to services.
- Attach permission policies to identities to control which actions they can perform.
- Identities in Amazon IAM are ***users***, ***groups*** and ***roles***.
- Based on ***least privilege*** principle.
* A user or entity should have access only to the specific data, resources, and applications that you have explicitly granted.
* Example usage:
* Grant cross-account permissions to upload objects while ensuring that the bucket owner retains full control (see the sketch below).
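One way to express that example is an S3 bucket policy that lets another account call `s3:PutObject` only when the upload grants the bucket owner full control. A sketch with boto3; the account ID and bucket name are placeholders:
```
# Sketch: cross-account uploads where the bucket owner keeps full control.
# The account ID and bucket name are placeholders.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::my-shared-bucket/*",
        # Only allow uploads that grant the bucket owner full control:
        "Condition": {
            "StringEquals": {"s3:x-amz-acl": "bucket-owner-full-control"}
        },
    }],
}

boto3.client("s3").put_bucket_policy(
    Bucket="my-shared-bucket", Policy=json.dumps(policy)
)
```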

View File

@ -0,0 +1,18 @@
---
aliases:
- Lambda
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]
##### **AWS Lambda (a serverless compute service)**
- Run backend code without provisioning servers.
- Event-driven: triggered by events (e.g., file upload).
- Languages: Python, Node.js, Java, C#, Go, Ruby.
- Automatically scales with demand.
**Workflow**
![[Screenshot 2025-07-23 at 17.51.04.png|600]]
**Example Use Case:**
- You can configure a Lambda function to perform an action when an event occurs. For example, when an image is stored in Bucket-A, the event invokes a Lambda function that converts the image to a new format and stores it in Bucket-B (see the handler sketch below).
![[Screenshot 2025-07-23 at 17.52.39.png|400]]
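A minimal handler sketch for that pipeline, assuming Bucket-B's name and treating the format conversion as a placeholder step:
```
# Handler sketch: triggered by an S3 event on Bucket-A, writes to Bucket-B.
# OUTPUT_BUCKET and convert() are assumptions/placeholders.
import boto3

s3 = boto3.client("s3")
OUTPUT_BUCKET = "bucket-b"

def convert(image_bytes):
    # Placeholder for real image processing (e.g., Pillow, which would
    # have to be packaged with the function).
    return image_bytes

def lambda_handler(event, context):
    for record in event["Records"]:                  # S3 event records
        bucket = record["s3"]["bucket"]["name"]      # source: Bucket-A
        key = record["s3"]["object"]["key"]
        obj = s3.get_object(Bucket=bucket, Key=key)
        result = convert(obj["Body"].read())
        s3.put_object(Bucket=OUTPUT_BUCKET, Key=key, Body=result)
```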

View File

@ -0,0 +1,14 @@
---
aliases:
- RDS
- AWS RDS
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]
##### **Amazon RDS (Relational Database Service)**
A managed, distributed, cloud-based relational database service.
- Managed relational DBs (e.g., MySQL, PostgreSQL, Oracle).
- AWS handles backups, patching, and scaling.
- You can build a fault-tolerant DB by configuring RDS for **Multi-[[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]** deployment: place your master RDS instance in one [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]] and a standby replica in another [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]].
![[Screenshot 2025-07-23 at 17.48.07.png|600]]
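A sketch of provisioning such a deployment with boto3; identifiers and credentials are placeholders, and `MultiAZ=True` is what requests the standby replica in a second AZ:
```
# Multi-AZ RDS sketch (boto3; identifiers and credentials are placeholders).
import boto3

rds = boto3.client("rds")

rds.create_db_instance(
    DBInstanceIdentifier="mydb",
    Engine="mysql",
    DBInstanceClass="db.t3.micro",
    AllocatedStorage=20,               # GiB
    MasterUsername="admin",
    MasterUserPassword="change-me",    # placeholder secret
    MultiAZ=True,  # primary in one AZ, synchronous standby in another
)
```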

View File

@ -0,0 +1,20 @@
---
aliases:
- S3
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]
##### Amazon S3 (Simple Storage Service)
An object storage service that offers industry-leading scalability, data availability, security, and performance.
- Object-based storage.
- Designed for 99.999999999% (11 nines) durability.
- Ideal for:
- Media storage
- Backups / archives
- Data lakes
- ML and analytics
**Components:**
- **Bucket**: A container to store an unlimited number of objects
- **Object**: The actual entities stored in the buckets
- **Key**: Unique identifier for the object
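The three components map directly onto the S3 API, as in this boto3 sketch (bucket name and key are placeholders):
```
# Bucket / object / key in practice (boto3; names are placeholders).
import boto3

s3 = boto3.client("s3")

s3.create_bucket(Bucket="my-example-bucket")       # bucket: the container
s3.put_object(
    Bucket="my-example-bucket",
    Key="backups/2025/notes.txt",                  # key: unique identifier
    Body=b"hello s3",                              # object: the stored entity
)
obj = s3.get_object(Bucket="my-example-bucket", Key="backups/2025/notes.txt")
print(obj["Body"].read())
```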

View File

@ -0,0 +1,12 @@
---
aliases:
- VPC
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]
##### **Amazon VPC (Virtual Private Cloud)**
A logically **isolated network** within AWS for your resources.
- Create a ***public-facing*** subnet for your web servers, which have access to the internet.
- Create a ***private*** subnet with no internet access for your backend systems
- e.g., databases, application servers
- Enables fine-grained control over traffic with both a public and a private subnet.
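A sketch of carving a VPC into one public-facing and one private subnet with boto3. The CIDR blocks are arbitrary, and a real public subnet additionally needs an internet gateway and route table, omitted here:
```
# Sketch: a VPC with a public-facing and a private subnet (boto3).
# CIDR blocks are arbitrary; the internet gateway + route table for the
# public subnet are omitted for brevity.
import boto3

ec2 = boto3.client("ec2")

vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]

public_subnet = ec2.create_subnet(        # web servers live here
    VpcId=vpc["VpcId"], CidrBlock="10.0.1.0/24"
)["Subnet"]

private_subnet = ec2.create_subnet(       # databases / backends live here
    VpcId=vpc["VpcId"], CidrBlock="10.0.2.0/24"
)["Subnet"]

print(public_subnet["SubnetId"], private_subnet["SubnetId"])
```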

View File

@ -0,0 +1,41 @@
##### data vs. information
- **Data** is just raw facts (like the number 42).
- But 42 could mean: age, shoe size, stock amount, etc.
- **Information** is when you give meaning to the data.
- Example: “Age = 42” gives context and becomes useful.
##### Big Data implementations
- **Delta**: *Sentiment analysis* (e.g., of customer feedback).
- **Netflix**: *User behavioral analysis* (e.g., what you watch and when).
- **Time Warner**: *Customer segmentation* (dividing customers into groups).
- **Volkswagen**: *Predictive support* (e.g., predicting car issues).
- **Visa**: *Fraud detection*.
- **Chinese government**: *Security intelligence* (national security).
- **Weather forecasting**: *Weather prediction models*.
- **Hospitals**: Diagnosing diseases using *machine learning* on images.
- **Amazon**: *Price optimization*.
- **Facebook**: Targeted advertising using *user profiling*.
##### Design Principles for Big Data
1. **Horizontal Growth**: Add more machines instead of stronger ones.
2. **Distributed Processing**: Split work across machines.
3. **Process where Data is**: Don't move the data; move the code.
4. **Simplicity of Code**: Keep logic understandable.
5. **Recover from Failures**: Systems should self-heal.
6. **Idempotency**: Running the same job twice shouldn't break results (see the sketch below).
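A tiny sketch of the idempotency principle: writing results keyed by record ID makes a re-run overwrite instead of duplicate, so retrying after a failure is safe.
```
# Idempotency sketch: keyed writes make re-runs safe; appends would not be.
results = {}

def process_record(record_id, value):
    # Overwriting by key: running the job twice yields the same state.
    results[record_id] = value * 2

process_record("r1", 21)
process_record("r1", 21)   # re-run after a failure: no duplicate, same result
assert results == {"r1": 42}
```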
##### Big Data SLA (Service Level Agreement)
Defines performance expectations:
- **Reliability**: Will the data be there?
- **Consistency**: Is the data accurate across systems?
- **Availability**: Is the system always accessible?
- **Freshness**: How up-to-date is the data?
- **Response time**: How fast do queries return?
* Other concerns:
- **Cost**
- **Scalability**
- **Performance**
> Next [[Cloud Services]]

View File

@ -0,0 +1,15 @@
>“An accumulation of data
>that is too large and complex
>for processing by traditional
>database management tools”
>
>**In Short:**
>Big Data = too big for standard tools like Excel or regular SQL databases.
[[Big Data Intro]]
[[Cloud Services]]
[[AWS Cloud Services]]
[[Database Overview]]
[[RDBMS]]
[[Hadoop]]
[[Hadoop Eccosystem]]

View File

@ -0,0 +1,29 @@
##### Benefits of Cloud Computing
- **Elasticity**: Start small and scale as needed.
- **Cost-efficiency**: No need to spend money on data centers.
- **No capacity guessing**: Scale automatically based on demand.
- **Economies of scale**: Benefit from AWS's vast infrastructure.
- **Agility**: Deploy resources quickly.
- **Global Reach**: Go international within minutes.
##### Deployment Models in the Cloud
- **IaaS** (Infrastructure as a Service):
- Virtual machines, storage, networks
- e.g., Amazon EC2.
- **PaaS** (Platform as a Service):
- Managed environments for building apps
- e.g., AWS Elastic Beanstalk.
- **[[Cloud Services#Selling Your Service |SaaS]]** (Software as a Service):
- Full applications delivered over the internet
- e.g., Gmail.
##### Deployment Strategies of Cloud Computing
- **On-Premises (Private Cloud)**: Owned and operated on-site.
- **Public Cloud**: Fully hosted on cloud provider infrastructure.
- **Hybrid Cloud**: Combines on-premises and cloud resources.
##### Cloud Providers Comparison
![[Screenshot 2025-07-23 at 13.54.07.png | 600]]
> Next [[AWS Cloud Services]]

View File

@ -0,0 +1,58 @@
Introduction to cloud computing concepts relevant for Big Data.
##### Traditional Software Deployment Process
1. **Coding**
2. **Compiling**: turning source code into executable files.
3. **Installing**: putting the software on computers.
##### Clustered Software
Introduces three related architectures:
1. **Redundant Servers**: multiple servers running the same service for fault tolerance.
- E.g., several identical web servers.
2. **Micro-services**: the system is broken into **small, independent services** that communicate with each other.
- Each handles a specific function.
3. **Clustered Computing**: a large task is **split into sub-tasks** running on **multiple nodes**.
- Used in Big Data systems like **NoSQL databases**.
##### Scaling a Software System
Two ways to handle growing demand:
- **Scale Up**: Make one machine stronger
- When running out of resources we can add: *Memory*, *CPU*, *Disk*, *Network Bandwidth*
- Can become expensive or reach hardware limits.
- **Scale Out**: Add more machines to share the work.
- Add **redundant servers** or use **cluster computing**.
- Each server can be **standalone** (like a web server), or part of a **coordinated system** (like a NoSQL cluster).
- More fault-tolerant and scalable than vertical scaling.
- Tradeoff:
- **Scale-up** is simpler but has limits.
- **Scale-out** is more flexible and resilient but more complex.
##### Selling Your Service
- **Install**: Software sold as an installation
- e.g., Microsoft's Office package
- **SaaS**: Software as a Service
- No need to install; just log in and use.
- e.g., Google Docs, Zoom, Dropbox.
- Common SaaS pricing models:
1. **Per-user**: Pay per person.
2. **Tiered**: Fixed price for different feature levels.
3. **Usage-based**: Pay for what you use (e.g., storage, API calls).
##### Deployment Models
Where you run your software:
- **On-Premises**: Your own machines or rented servers (or VMs).
- **Cloud**: Run on virtual machines (VMs) from a cloud provider (e.g., AWS, Azure, GCP).
##### Cloud Deployment Options
When deploying to the cloud, you have options:
1. **Vanilla Node**: Raw VM; you install everything yourself.
2. **Cloud VM**: VM with pre-installed software.
3. **Managed Service**: Cloud provider handles setup, scaling, updates (e.g., [[Amazon RDS|AWS RDS]], Google BigQuery).
> Next [[Cloud Computing]]

View File

@ -0,0 +1,24 @@
1. **Punch cards**: physical cards with holes. Early computers read data this way. ![[Screenshot 2025-07-23 at 12.08.22.png | 400]]
2. **Magnetic media**:
- First: **Floppy disks**
![[Screenshot 2025-07-23 at 12.08.48.png | 400]]
- Then: **Hard disks** (faster, more storage)
![[Screenshot 2025-07-23 at 12.09.30.png]]
3. **1960s**: First **Database Management Systems (DBMSs)** created:
- Charles W. Bachman developed the **Integrated Data Store (IDS)**
- IBM developed **IMS**
4. **1970s**:
- IBM created **SQL** (Structured Query Language)
- Modern relational databases (RDBMS) were born
5. **Late 20th century:** Many RDBMSs
- Oracle, Microsoft's SQL Server, IBM's DB2, MySQL, Sybase, ...
![[Screenshot 2025-07-23 at 12.10.02.png]]
##### Hadoop history
![[Screenshot 2025-07-23 at 12.16.32.png | 200]]
2005 - Started by ***Doug Cutting*** at Yahoo!
[[Hadoop]] is an [[Open Source]] Apache project
Benefits: free, flexible, community-supported.

View File

@ -0,0 +1,22 @@
[[Database History]]
[[RDBMS]] - Relational Models
[[Hadoop]]
##### **Big Data Challenges**
Examples of tasks that are hard with large datasets:
1. Count the **most frequent words** in Wikipedia.
2. Find the **hottest November** per country from weather data.
3. Find the **day with most critical errors** in company logs.
These problems require:
- **Huge data**
- **Efficient distributed computing**
#### [[RDBMS]] vs. [[Hadoop]]
| **Feature** | **RDBMS** | **Hadoop** |
| -------------- | ------------------- | ----------------------------- |
| Data structure | Structured (tables) | Any (structured/unstructured) |
| Scalability | Limited | Highly scalable |
| Speed | Fast (small data) | Designed for huge data |
| Access | SQL | Code (e.g., Java, Python) |

View File

@ -0,0 +1,59 @@
---
aliases:
- Hive
---
> [[Hadoop Eccosystem|Systems based on MapReduce]]
### Apache Hive
##### **Key Features**
- Developed by **Apache**.
- General SQL-like syntax for querying [[HDFS]] or other large databases
- Translates SQL queries into one or more [[MapReduce]] jobs.
- Maps data in [[HDFS]] into virtual [[RDBMS]]-like tables.
- **Pro**:
- Convenient for **data analytics**: uses SQL.
* **Con**:
* Quite slow response times
##### **Hive Data Model**
**Structure**
- **Physical**: Data stored in [[HDFS]] blocks across nodes.
- **Virtual Table**: Defined with schema using metadata.
- **Partitions**: Logical splits of data to speed up queries.
**Metadata**
- Hive stores metadata in a DB (the metastore)
- Maps physical files to tables.
- Maps fields (columns) to line structures in the raw data.
![[Screenshot 2025-07-23 at 18.25.32.png]]
**Hive Architecture**
![[Screenshot 2025-07-23 at 18.27.30.png|]]
##### Hive Usage
```
# Start a Hive shell:
$ hive
# Create a Hive table:
hive> CREATE TABLE mta (id BIGINT, name STRING, startdate TIMESTAMP, email STRING);
# Show all tables:
hive> SHOW TABLES;
# Add a new column to the table:
hive> ALTER TABLE mta ADD COLUMNS (description STRING);
# Load an HDFS data file into the table:
hive> LOAD DATA INPATH '/home/hadoop/mta_users' OVERWRITE INTO TABLE mta;
# Query employees that have worked for more than a year:
hive> SELECT name FROM mta WHERE (unix_timestamp() - startdate > 365 * 24 * 60 * 60);
# Execute a command without the shell:
$ hive -e 'SELECT name FROM mta;'
# Execute a script from a file:
$ hive -f hive_script.txt
```

View File

@ -0,0 +1,68 @@
---
aliases:
- Spark
---
> [[Hadoop Eccosystem|Systems based on MapReduce]]
## Apache Spark
> Apache Spark is a **fast**, **general-purpose**, **open-source** cluster computing system designed for large-scale data processing.
##### Key Characteristics:
- **Unified analytics engine**: supports batch, streaming, SQL, machine learning, and graph processing.
- **In-memory computation**: stores intermediate results in RAM (vs. Hadoop MapReduce, which writes to disk).
- **Fault tolerant** and scalable.
##### Benefits of Spark Over [[Hadoop]] [[MapReduce]]
| **Feature** | **Spark** | **Hadoop MapReduce** |
| ------------------- | ------------------------------------------------------------------------- | -------------------------------------- |
| **Performance** | Up to **100x faster** (in-memory operations) | Disk-based, slower |
| **Ease of use** | High-level APIs in Python, Java, Scala, R | Java-based, verbose programming |
| **Generality** | Unified engine for batch, stream, ML, graph | Focused on batch processing |
| **Fault tolerance** | Efficient recovery via lineage | Slower fault recovery via re-execution |
| **Runs Everywhere** | Runs on [[Hadoop]], Apache Mesos, Kubernetes, standalone, or in the cloud | Tied to the Hadoop ecosystem (YARN) |
##### How is Spark Fault Tolerant?
> Resilient Distributed Datasets ([[RDD]]s)
- Restricted form of distributed shared memory
- Immutable, partitioned collections of records
- Recompute lost partitions on failure
- No cost if nothing fails
![[Screenshot 2025-07-23 at 19.17.31.png|500]]
- **Lineage Graph**
- Each [[RDD]] keeps track of how it was derived. If a node fails, Spark **recomputes only the lost partition** from the original transformations.
##### Writing Spark Code in Python
```
# Spark Context Initialization
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("MyApp").setMaster("local")
sc = SparkContext(conf=conf)
# Create RDDs:
# 1. From a Python list
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
# 2. From a file
distFile = sc.textFile("data.txt")
distFile = sc.textFile("folder/*.txt")
```
##### **RDD Transformations (Lazy)**
These create a new RDD from an existing one.
| Transformation    | Description                                  |
| ----------------- | -------------------------------------------- |
| map(func)         | Apply function to each element               |
| filter(func)      | Keep elements where func returns True        |
| flatMap(func)     | Like map, but flattens results               |
| union(otherRDD)   | Union of two RDDs                            |
| distinct()        | Remove duplicates                            |
| reduceByKey(func) | Combine values for each key (key-value RDDs) |
| sortByKey()       | Sort by keys                                 |
| join(otherRDD)    | Join two key-value RDDs                      |
| repartition(n)    | Re-distribute RDD to n partitions            |
Transformations are **lazy**: they only execute when an action is triggered, as the sketch below shows.
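A short sketch of that laziness, reusing the `sc` context from the block above: the transformations only build the lineage graph, and nothing runs until the `collect()` action.
```
# Transformations build the lineage graph; an action triggers execution.
rdd = sc.parallelize(["cook", "cookies", "cook"])
pairs = rdd.map(lambda w: (w, 1))                # lazy: nothing runs yet
counts = pairs.reduceByKey(lambda a, b: a + b)   # still lazy
print(counts.collect())                          # action: the job runs now
# [('cook', 2), ('cookies', 1)]  (partition order may vary)
```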

View File

@ -0,0 +1,27 @@
> [[Hadoop Eccosystem|Systems based on MapReduce]]
**Key Ideas**
- Leverages columnar file format
- Optimized for SQL performance
**Concepts**
- Tree-based **query execution**.
- Efficient scanning and aggregation of **nested columnar data**.
### Columnar Data Format
> Illustration of what columnar storage is all about:
> given 3 columns:
![[Screenshot 2025-07-23 at 18.42.46.png|170]]
> In a row-oriented storage, the data is laid out one row at a time as follows:
![[Screenshot 2025-07-23 at 18.45.25.png|500]]
> Whereas in a column-oriented storage, it is laid out one column at a time:
![[Screenshot 2025-07-23 at 18.46.55.png|500]]
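The same idea as a toy Python sketch with invented sample data: row-oriented storage keeps whole records together, while column-oriented storage keeps whole columns together, so scanning one column touches far less data.
```
# Toy illustration of row- vs column-oriented layout (invented data).
rows = [(1, "a", 10), (2, "b", 20), (3, "c", 30)]

# Row-oriented: values laid out one record at a time.
row_layout = [v for row in rows for v in row]
# [1, 'a', 10, 2, 'b', 20, 3, 'c', 30]

# Column-oriented: values laid out one column at a time.
col_layout = {"id": [1, 2, 3], "name": ["a", "b", "c"], "value": [10, 20, 30]}

# Aggregating one column only touches that column's contiguous values:
print(sum(col_layout["value"]))  # 60
```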
**Nested data in columnar format**
![[Screenshot 2025-07-23 at 18.50.10.png]]![[Screenshot 2025-07-23 at 18.50.16.png]]
### Frameworks inspired by Google Dremel
- Apache Drill (MapR)
- Apache Impala (Cloudera)
- Apache Tez (Hortonworks)
- Presto (Facebook)

View File

@ -0,0 +1,64 @@
##### HDFS ([[Hadoop]] Distributed File System)
Stores huge files (typical file sizes: GB to TB) across multiple machines.
- Breaks files into **blocks** (typically 128 MB).
- **Replicates** blocks (default 3 copies) for fault tolerance.
- Accessed through a POSIX-like API.
##### HDFS design principles
* **Immutable**: **write-once, read-many**
* **No Failures**: A disk or node failure does not affect the file system
* **File Size Unlimited**: Up to 512 yottabytes (2^63 × 64 MB)
* **File Num Limited**: 1,048,576 files per directory
* **Prefer bigger files**: Big files provide better performance
##### HDFS File Formats
- Text/CSV - No schema, no metadata
- JSON Records - metadata is stored with the data
- Avro Files - schema independent of data
- Sequence Files - binary files (used as intermediate storage in M/R)
- RC Files - Record Columnar files
- ORC Files - Optimized RC files. Compress better
- Parquet Files - Yet another RC file
##### HDFS Command Line
```
# List files
hadoop fs -ls /path
# Make directory
hadoop fs -mkdir /user/hadoop
# Print file
hadoop fs -cat /file
# Upload file
hadoop fs -copyFromLocal file.txt hdfs://...
```
#### HDFS Architecture Main Components
##### **1.** NameNode (Master Node)
- **Stores metadata** about the filesystem:
- Filenames
- Directory structure
- Block locations
- Permissions
- It **does not store the actual data**.
- There is **one active NameNode** per cluster.
##### **2.** DataNodes (Worker Nodes)
- Store the **actual data blocks** of files.
- Send **heartbeat** messages to the NameNode to report that they are alive.
- When a file is written, it's split into blocks and distributed across many DataNodes.
- DataNodes also **replicate** blocks (typically 3 copies) to provide **fault tolerance**.
#### File Read / Write
**When a file is written:**
1. The client contacts the **NameNode** to ask: “Where should I write the blocks?”
2. The NameNode responds with a list of **DataNodes** to use.
3. The client sends the blocks of the file to those DataNodes.
4. Blocks are **replicated** automatically across different nodes for redundancy.
**When a file is read:**
1. The client contacts the **NameNode** to get the list of DataNodes storing the required blocks.
2. The client reads the blocks **directly** from the DataNodes.
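The WebHDFS REST interface mirrors this read path: the NameNode answers the metadata request with a redirect, and the client then fetches the data directly from a DataNode. A sketch in Python; the host, port, and path are assumptions:
```
# WebHDFS sketch of the read path (host/port/path are assumptions).
# Steps 1-2: the NameNode answers with a redirect to a DataNode;
# step 3: the client reads the block data directly from that DataNode.
import requests

NAMENODE = "http://namenode:9870"  # assumed WebHDFS endpoint (Hadoop 3.x)

r = requests.get(
    f"{NAMENODE}/webhdfs/v1/user/hadoop/file.txt",
    params={"op": "OPEN"},
    allow_redirects=False,            # keep the redirect visible
)
datanode_url = r.headers["Location"]  # points at a DataNode

data = requests.get(datanode_url).content
print(len(data), "bytes read directly from the DataNode")
```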

View File

@ -0,0 +1,17 @@
### Systems based on [[MapReduce]]
> Early generation frameworks for big data processing.
* [[Apache Hive]]
### Systems that replace MapReduce
> Newer, faster frameworks with different architectures and performance improvements.
**Motivation**: [[MapReduce]] and [[Apache Hive|Hive]] are too slow!
- [[Google Dremel]]
- [[Apache Spark]]
- Replaces MapReduce with its own engine that works much faster without compromising consistency
- Architecture not based on Map-reduce but rather on two concepts:
- RDD (Resilient Distributed Dataset)
- DAG (Directed Acyclic Graph)
- Pros:
- Works much faster than MapReduce
- Fast-growing community

View File

@ -0,0 +1,13 @@
![[Screenshot 2025-07-23 at 12.20.09.png | 400]]
> Hadoop is an **[[Open Source]] framework** for:
> - **Distributed storage** (across many machines)
> - **Distributed processing** (run programs on many machines in parallel)
>
> > It is **not a database** — it is an ecosystem for managing and analyzing **Big Data**.
## **Hadoop Components Overview**
![[Screenshot 2025-07-23 at 11.58.48.png ]]
> 1. [[HDFS]]
> 2. [[MapReduce]]
> 3. [[Yarn]]
[[Hadoop Eccosystem]]

View File

@ -0,0 +1,32 @@
A programming model for processing big data in parallel.
- Distributed processing - Job is run in parallel on several nodes
- Run the process where the data is!
- Horizontal Scalability
- **Map** step: transform input
- Transform, Filter, Calculate
- Local data
- e.g., count 1 per word
- **Shuffle** step: Reorganization of the map output.
- Shuffle, Sort, Group
- **Reduce** step: Aggregate / Sum the groups
- e.g., sum word counts
MapReduce **runs code where the data is**, saving data transfer time.
![[Screenshot 2025-07-23 at 13.00.20.png]]
##### Example:
From the sentence:
> “how many cookies could a good cook cook if a good cook could cook cookies”
Steps:
1. **Map**:
- Each word becomes a pair like ("cook", 1)
2. **Shuffle**:
- Group by word
3. **Reduce**:
- Add up counts → ("cook", 4)
![[Screenshot 2025-07-23 at 13.01.20.png]]
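The same three steps as a plain-Python simulation (no Hadoop required); `defaultdict` stands in for the shuffle's grouping:
```
# Word count as explicit map / shuffle / reduce steps (plain Python).
from collections import defaultdict

sentence = ("how many cookies could a good cook cook "
            "if a good cook could cook cookies")

# Map: each word becomes a ("word", 1) pair.
pairs = [(word, 1) for word in sentence.split()]

# Shuffle: group pairs by key.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce: sum the counts in each group.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts["cook"])  # 4
```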

View File

@ -0,0 +1,20 @@
## RDD (Resilient Distributed Dataset)
>RDD is an immutable (read only) distributed collection of objects.
>
>Dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster
![[Screenshot 2025-07-23 at 19.08.40.png|600]]
##### **Key Properties:**
- Distributed: Automatically split across cluster nodes.
- Lazy Evaluation: Transformations aren't executed until an action is called.
- Fault-tolerant: Can **recompute lost partitions** using lineage graph.
- Parallel: Operates concurrently across cluster cores.
##### Data Sharing
> In [[Hadoop]] [[MapReduce]]
![[Screenshot 2025-07-23 at 19.11.44.png|500]]
> In [[Apache Spark|Spark]]
![[Screenshot 2025-07-23 at 19.12.57.png|500]]
>In-memory data sharing is 10-100x faster than network and disk!

View File

@ -0,0 +1,35 @@
**YARN (Yet Another Resource Negotiator)**
is [[Hadoop]]'s cluster resource management system
- Multiple jobs running simultaneously
- Multiple jobs use same resources (disk, CPU, memory)
- Assign resources to jobs and tasks exclusively
##### YARN is in charge of:
1. Allocating resources
2. Scheduling jobs
- Assigns priorities to jobs according to policies:
FIFO scheduler, Fair scheduler, Capacity scheduler
##### Components:
- **ResourceManager**
- Oversees resource allocation across the cluster
- **NodeManager**
- Each node in the cluster runs a NodeManager.
- This component manages the execution of containers on its node.
- **ApplicationMaster**
- Manages the lifecycle of applications.
- Handles job scheduling and monitors progress.
- **Resource Container**
- A logical bundle of resources (e.g., CPU, memory) allocated by the ResourceManager
![[Screenshot 2025-07-23 at 13.29.37.png]]
##### YARN ecosystem
YARN can run other applications besides Hadoop [[MapReduce]], which can
integrate into the Hadoop ecosystem:
- Apache Storm (data streaming engine)
- [[Apache Spark]] (data batch and streaming engine)
- Apache Solr (search platform)

View File

@ -0,0 +1,13 @@
- Source Code available
- Free Redistribution
- Derived Works
![[Screenshot 2025-07-23 at 12.24.23.png]]
Open-source replaces closed-source
![[Screenshot 2025-07-23 at 12.25.00.png]]
More Open-source solutions
![[Screenshot 2025-07-23 at 12.25.28.png]]
![[Screenshot 2025-07-23 at 12.27.11.png]]![[Screenshot 2025-07-23 at 12.27.39.png]]

View File

@ -0,0 +1,63 @@
[[Database Overview]]
##### What is an RDBMS?
**Relational Database Management System**:
- Data is stored in **tables**:
- **Rows** = records
- **Columns** = fields
- Each table has:
- **Indexes** for fast searching
- **Relationships** with other tables (via keys)
##### Relational model - Keys and Indexes
Ability to find record(s) quickly
- Operations become efficient:
- **Find by key** → O(log n)
- **Fetch record by ID** → O(1)
Indexes = sorted references to data locations → like a book index.
##### Relational model - Operations
Relational databases support **CRUD**:
- **C**reate
- **R**ead
- **U**pdate
- **D**elete
Each operation uses both:
- The **index** (to locate data)
- The **data** itself (to read/write)
##### Relational model - Transactional
Relational databases guarantee **transaction safety** with ACID:
- **A**tomicity: all or nothing
- **C**onsistency: valid data only
- **I**solation: no interference from other transactions
- **D**urability: survives crashes
* Examples:
- Transferring money, posting a tweet.
- Both must either **succeed completely** or **fail completely**.
Transactions guarantee data validity despite errors and failures.
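The money-transfer example as an atomic transaction, sketched with Python's built-in sqlite3 on a toy schema: either both updates commit or neither does.
```
# Atomic money transfer sketch with sqlite3 (toy schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 'bob'")
except sqlite3.Error:
    pass  # rollback already happened; balances are unchanged

print(conn.execute("SELECT * FROM accounts").fetchall())
# [('alice', 50), ('bob', 50)]
```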
##### Relational model - SQL
**SQL** is the language used to talk to relational databases.
- **S**tructured
- **Q**uery
- **L**anguage
- All RDBMSs use it (MySQL, PostgreSQL, Oracle, etc.)
##### Pros and Cons of RDBMS
**Pros:**
- Structured data
- ACID transactions
- Powerful SQL
- Fast (for small/medium size)
**Cons**:
- Doesn't scale well (single machine = Single Point of Failure, SPOF)
- Becomes **slow** with **big data**
- **Less fault tolerant**
- Not designed for **massive, distributed systems**

Binary files (screenshots) not shown.