Add my Obsidian notes
91
content/BigData/AWS/AWS Cloud Services.md
Normal file
@ -0,0 +1,91 @@
[[Cloud Computing]]

## AWS Overview

- Over 175 services
- **Pay-as-you-go** pricing
- **No upfront costs**
- **Ideal for experimentation**
- **Access to cutting-edge tools and scalability**

##### **Region**

- A physical location worldwide with multiple data centers.

##### **Availability Zone (AZ)**

- Logical group of one or more data centers within a region.
- Physically isolated (up to 100 km apart).
- Designed for **high availability and fault tolerance**.

##### **Edge Location**

- Physical sites dispersed across the globe.
- Part of Amazon's CDN (content delivery network).
- Distributes services/data closer to users to reduce latency.

##### **Planning for Failure (Resiliency)**

- **Storage**:
    * The S3 service is designed for failure.
    * Each object is replicated across multiple [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]s in the region, so you always have at least three copies of your file.

- **Compute**:
    - The owner is responsible for manually distributing resources across multiple [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]s.
    - If one AZ fails, the others keep operating.

- **Databases**:
    - The owner can configure DB deployment across multiple [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]s for redundancy.

##### **Benefits of AWS Global Infrastructure**

- High performance
- Low latency
- High availability
- Scalability
- Unlimited capacity (horizontally scalable)
- Built-in security and monitoring
- Confidentiality
- Reliability
- Low cost

##### Shared Responsibility Model of Security

![[Screenshot 2025-07-23 at 14.20.31.png]]

## AWS Core Services

##### Networking

* [[Amazon VPC]]

##### Security & Identity

- [[Amazon IAM]]

##### Compute

- [[Amazon EC2]]
- [[Amazon Lambda]]

##### Storage

- **Instance Store:**
    - Specified by the instance type. Data is stored on the same server as the [[Amazon EC2|EC2]] instance and is removed when the instance is terminated.
- [[Amazon EBS]]
- [[Amazon S3]]

##### Databases

- Relational
    - [[Amazon RDS]]
    - Amazon Redshift
    - Amazon Aurora

- Non-Relational
    - [[Amazon DynamoDB]]
    - Amazon ElastiCache
    - Amazon Neptune

- Alternatively:
    - You can install a DB of your choice on an [[Amazon EC2|EC2]] instance instead of using one provided by AWS. In that case, you take full responsibility for the security and management of your DB.

## AWS Pricing Models

##### Principles:

- **Pay-as-you-go** (only pay for usage)
- **Reserved pricing** (discounted with commitment)
- **Volume discount** (pay less per unit when you use more)

##### Free Tier Options:

- **Always free** (e.g., 1M free Lambda requests per month)
- **12-months free** (introductory offer)
- **Trial services**

### **Billing Examples:**

- [[Amazon EC2|EC2]]: Pay for runtime only.

- [[Amazon S3|S3]]: Pay for
    - Storage volume
    - Requests (PUT/GET)
    - Data transfer

- [[Amazon Lambda|Lambda]]: Pay for
    - Number of requests
    - Execution time
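
The two Lambda billing dimensions above (requests and execution time) can be combined into a back-of-the-envelope cost estimate. The rates below are illustrative, not from the source — check current AWS pricing before relying on them:

```python
# Rough Lambda bill sketch, using illustrative rates:
# $0.20 per 1M requests and ~$0.0000166667 per GB-second of compute.
REQUEST_PRICE = 0.20 / 1_000_000      # USD per request
GB_SECOND_PRICE = 0.0000166667        # USD per GB-second

def lambda_monthly_cost(requests, avg_ms, memory_mb):
    """Cost = requests * request price + (GB-seconds used) * duration price."""
    gb_seconds = requests * (avg_ms / 1000) * (memory_mb / 1024)
    return requests * REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE

# Example: 5M requests/month, 120 ms average duration, 512 MB memory.
cost = lambda_monthly_cost(5_000_000, 120, 512)
print(round(cost, 2))
```

Note how memory size multiplies the duration charge: the same code at 1024 MB would double the compute portion of the bill.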
10
content/BigData/AWS/Amazon EBS.md
Normal file
@ -0,0 +1,10 @@
---
aliases:
- EBS
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]

##### **Amazon EBS (Elastic Block Store)**

Additional block storage attached to an [[Amazon EC2|EC2]] instance; it is separate from the instance store.
- Persistent, and can be attached to any [[Amazon EC2|EC2]] instance in the [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]].
- It is not deleted when the [[Amazon EC2|EC2]] instance is terminated.
43
content/BigData/AWS/Amazon EC2.md
Normal file
@ -0,0 +1,43 @@
---
aliases:
- EC2
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]

##### **Amazon EC2 (Elastic Compute Cloud)**

A web service that provides secure, resizable compute capacity in the cloud.
* Designed to make web-scale cloud computing easier for developers.
- Secure, resizable compute capacity (virtual servers).
- Complete control over OS and apps.

**Pricing Models**:
- **On-Demand**: Pay for what you use.
- **Spot Instances**: Spare capacity at discounts of up to 90%; instances are temporary and can be reclaimed at short notice.
- **Reserved**: Lower price for a long-term commitment.

**Features:**
- **Amazon Machine Image (AMI):**
    - Preconfigured OS image
    - e.g., Linux, macOS, Windows
- **Instance type**:
    - Defines CPU, memory, storage, and networking capacity
- **Networking**:
    - [[Amazon VPC|VPC]] and subnets
- **Storage**
- **Security Group(s)**:
    - Acts like a firewall, defining access to and from the EC2 instance
- **Key pair**:
    - Used to establish a remote connection (secure SSH access)

**Instance Types**
![[Screenshot 2025-07-23 at 16.52.08.png]]
- **Instance Families**

| Family Type                       | Use Case                  |
| --------------------------------- | ------------------------- |
| General Purpose (M / T / A)       | Web servers               |
| Compute Optimized (C)             | Analytics, gaming         |
| Memory Optimized (R / X)          | High-performance DB       |
| Accelerated Computing (P / G / F) | AI, ML, GPU compute       |
| Storage Optimized (I)             | Big data, NoSQL databases |
13
content/BigData/AWS/Amazon IAM.md
Normal file
@ -0,0 +1,13 @@
---
aliases:
- IAM
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]

##### **Amazon IAM (Identity and Access Management)**

- Manages user access to services.
- Attach permission policies to identities to control which actions an identity can perform.
- Identities in Amazon IAM are ***users***, ***groups***, and ***roles***.
- Based on the ***least privilege*** principle:
    * A user or entity should only have access to the specific data, resources, and applications you have explicitly granted them.
    * Example usage:
        * Grant cross-account permissions to upload objects while ensuring that the bucket owner has full control.
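
As a sketch of the cross-account example above, here is roughly what such an S3 bucket policy could look like, built as a Python dict. The account ID `111122223333` and bucket name `example-bucket` are placeholders, not from the source:

```python
# Hypothetical bucket policy: another account may upload objects, but only
# when the upload grants the bucket owner full control via its ACL.
import json

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::example-bucket/*",
        "Condition": {
            "StringEquals": {"s3:x-amz-acl": "bucket-owner-full-control"}
        },
    }],
}
print(json.dumps(policy, indent=2))
```

The `Condition` block is what enforces the "bucket owner has full control" part: uploads that omit that ACL are denied.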
18
content/BigData/AWS/Amazon Lambda.md
Normal file
@ -0,0 +1,18 @@
---
aliases:
- Lambda
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]

##### **AWS Lambda (a serverless compute service)**

- Run backend code without provisioning servers.
- Event-driven: triggered by events (e.g., a file upload).
- Languages: Python, Node.js, Java, C#, Go, Ruby.
- Automatically scales with demand.

**Workflow**
![[Screenshot 2025-07-23 at 17.51.04.png|600]]

**Example Use Case:**
- You can configure a Lambda function to perform an action when an event occurs. For example, when an image is stored in Bucket-A, an event invokes the Lambda function, which converts the image to a new format and stores it in Bucket-B.
![[Screenshot 2025-07-23 at 17.52.39.png|400]]
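
A minimal handler sketch for the Bucket-A → Bucket-B flow above. The event shape follows the standard S3 notification; the `boto3` calls and the conversion step are left as comments so the sketch stays self-contained, and `convert_image` is a hypothetical helper:

```python
# Hedged sketch of a Lambda handler for the S3-triggered image conversion.
def handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]   # source bucket (Bucket-A)
    key = record["s3"]["object"]["key"]       # key of the uploaded image
    # In a real deployment, the S3 round-trip would go here:
    # s3 = boto3.client("s3")
    # body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    # converted = convert_image(body)               # hypothetical helper
    # s3.put_object(Bucket="bucket-b", Key=key, Body=converted)
    return {"bucket": bucket, "key": key}

# Local invocation with a minimal fake S3 event:
event = {"Records": [{"s3": {"bucket": {"name": "bucket-a"},
                             "object": {"key": "cat.jpg"}}}]}
print(handler(event, None))
```

Because the handler only receives an event describing the upload, it must fetch the object itself — Lambda passes metadata, not the file contents.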
14
content/BigData/AWS/Amazon RDS.md
Normal file
@ -0,0 +1,14 @@
---
aliases:
- RDS
- AWS RDS
---

> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]

##### **Amazon RDS (Relational Database Service)**

A managed, cloud-based, distributed relational database service.
- Managed relational DBs (e.g., MySQL, PostgreSQL, Oracle).
- AWS handles backups, patching, and scaling.
- You can build a fault-tolerant DB by configuring RDS for **Multi-[[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]** deployment:
place your primary RDS instance in one [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]] and a standby replica in another [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]].
![[Screenshot 2025-07-23 at 17.48.07.png|600]]
20
content/BigData/AWS/Amazon S3.md
Normal file
@ -0,0 +1,20 @@
---
aliases:
- S3
---

> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]

##### Amazon S3 (Simple Storage Service)

An object storage service that offers industry-leading scalability, data availability, security, and performance.
- Object-based storage.
- Designed for 99.999999999% (11 nines) durability.
- Ideal for:
    - Media storage
    - Backups / archives
    - Data lakes
    - ML and analytics

**Components:**
- **Bucket**: A container that stores an unlimited number of objects
- **Object**: The actual entities stored in buckets
- **Key**: The unique identifier of an object within a bucket
12
content/BigData/AWS/Amazon VPC.md
Normal file
@ -0,0 +1,12 @@
---
aliases:
- VPC
---

> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]

##### **Amazon VPC (Virtual Private Cloud)**

A logically **isolated network** within AWS for your resources.
- Create a ***public-facing*** subnet for your web servers, which need access to the internet.
- Create a ***private*** subnet with no internet access for your backend systems
    - e.g., databases, application servers
- Enables fine-grained control over traffic with both a public and a private subnet.
41
content/BigData/Big Data Intro.md
Normal file
@ -0,0 +1,41 @@

##### Data vs. Information

- **Data** is just raw facts (like the number 42).
    - But 42 could mean: age, shoe size, stock amount, etc.

- **Information** is data given meaning.
    - Example: “Age = 42” adds context and becomes useful.

##### Big Data Implementations

- **Delta** – *Sentiment analysis* (e.g., of customer feedback).
- **Netflix** – *User behavioral analysis* (e.g., what you watch and when).
- **Time Warner** – *Customer segmentation* (dividing customers into groups).
- **Volkswagen** – *Predictive support* (e.g., predicting car issues).
- **Visa** – *Fraud detection*.
- **Chinese government** – *Security intelligence* (national security).
- **Weather forecasting** – *Weather prediction models*.
- **Hospitals** – Diagnosing diseases using *machine learning* on images.
- **Amazon** – *Price optimization*.
- **Facebook** – Targeted advertising using *user profiling*.

##### Design Principles for Big Data

1. **Horizontal Growth** – Add more machines instead of stronger ones.
2. **Distributed Processing** – Split work across machines.
3. **Process Where the Data Is** – Don’t move data; move the code.
4. **Simplicity of Code** – Keep logic understandable.
5. **Recover from Failures** – Systems should self-heal.
6. **Idempotency** – Running the same job twice shouldn’t break results.
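
A tiny illustration of the idempotency principle (not from the source): if a job writes each result under a deterministic key, re-running it after a failure leaves the result store unchanged instead of duplicating rows.

```python
# Idempotent job sketch: deterministic keys make retries safe.
store = {}

def run_job(records):
    for rec in records:
        store[rec["id"]] = rec["value"] * 2   # same key -> same slot on retry

records = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]
run_job(records)
run_job(records)          # "accidental" second run after a partial failure
print(store)
```

An append-based version (`results.append(...)`) would instead double its output on every retry — exactly what this principle rules out.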

##### Big Data SLA (Service Level Agreement)

Defines performance expectations:

- **Reliability** – Will the data be there?
- **Consistency** – Is the data accurate across systems?
- **Availability** – Is the system always accessible?
- **Freshness** – How up-to-date is the data?
- **Response time** – How fast do queries return?

* Other concerns:
    - **Cost**
    - **Scalability**
    - **Performance**

> Next [[Cloud Services]]
15
content/BigData/Big Data.md
Normal file
@ -0,0 +1,15 @@
>“An accumulation of data
>that is too large and complex
>for processing by traditional
>database management tools”
>
>**In Short:**
>Big Data = too big for standard tools like Excel or regular SQL databases.

[[Big Data Intro]]
[[Cloud Services]]
[[AWS Cloud Services]]
[[Database Overview]]
[[RDBMS]]
[[Hadoop]]
[[Hadoop Eccosystem]]
29
content/BigData/Cloud Computing.md
Normal file
@ -0,0 +1,29 @@
##### Benefits of Cloud Computing

- **Elasticity**: Start small and scale as needed.
- **Cost-efficiency**: No need to spend money on data centers.
- **No capacity guessing**: Scale automatically based on demand.
- **Economies of scale**: Benefit from AWS’s vast infrastructure.
- **Agility**: Deploy resources quickly.
- **Global Reach**: Go international within minutes.

##### Deployment Models in the Cloud

- **IaaS** (Infrastructure as a Service):
    - Virtual machines, storage, networks
    - e.g., Amazon EC2.

- **PaaS** (Platform as a Service):
    - Managed environments for building apps
    - e.g., AWS Elastic Beanstalk.

- **[[Cloud Services#Selling Your Service|SaaS]]** (Software as a Service):
    - Full applications delivered over the internet
    - e.g., Gmail.

##### Deployment Strategies of Cloud Computing

- **On-Premises (Private Cloud)**: Owned and operated on-site.
- **Public Cloud**: Fully hosted on cloud provider infrastructure.
- **Hybrid Cloud**: Combines on-premises and cloud resources.

##### Cloud Providers Comparison

![[Screenshot 2025-07-23 at 13.54.07.png|600]]

> Next [[AWS Cloud Services]]
58
content/BigData/Cloud Services.md
Normal file
@ -0,0 +1,58 @@

Introduction to cloud computing concepts relevant for Big Data.

##### Traditional software deployment process:

1. **Coding**
2. **Compiling** – turning source code into executable files.
3. **Installing** – putting the software on computers.

##### Clustered Software

Three related architectures:

1. **Redundant Servers** – multiple servers running the same service for fault tolerance.
    - e.g., several identical web servers.

2. **Micro-services** – the system is broken into **small, independent services** that communicate with each other.
    - Each handles a specific function.

3. **Clustered Computing** – a large task is **split into sub-tasks** running on **multiple nodes**.
    - Used in Big Data systems like **NoSQL databases**.

##### Scaling a Software System

Two ways to handle growing demand:

- **Scale Up**: Make one machine stronger.
    - When running out of resources, add *memory*, *CPU*, *disk*, or *network bandwidth*.
    - Can become expensive or reach hardware limits.

- **Scale Out**: Add more machines to share the work.
    - Add **redundant servers** or use **cluster computing**.
    - Each server can be **standalone** (like a web server) or part of a **coordinated system** (like a NoSQL cluster).
    - More fault-tolerant and scalable than vertical scaling.

- Tradeoff:
    - **Scale-up** is simpler but has limits.
    - **Scale-out** is more flexible and resilient but more complex.

##### Selling Your Service

- **Install** – Software as installation
    - e.g., Microsoft’s Office package

- **SaaS** – Software as a Service
    - No need to install, just log in and use.
    - e.g., Google Docs, Zoom, Dropbox.

- Common SaaS pricing models:
    1. **Per-user** – Pay per person.
    2. **Tiered** – Fixed price for different feature levels.
    3. **Usage-based** – Pay for what you use (e.g., storage, API calls).

##### Deployment Models

Where you run your software:
- **On-Premises**: Your own machines or rented servers (or VMs).
- **Cloud**: Run on virtual machines (VMs) from a cloud provider (e.g., AWS, Azure, GCP).

##### Cloud Deployment Options

When deploying to the cloud, you have options:
1. **Vanilla Node**: Raw VM – you install everything.
2. **Cloud VM**: VM with pre-installed software.
3. **Managed Service**: Cloud provider handles setup, scaling, and updates (e.g., [[Amazon RDS|AWS RDS]], Google BigQuery).

> Next [[Cloud Computing]]
24
content/BigData/Database History.md
Normal file
@ -0,0 +1,24 @@

1. **Punch cards** – physical cards with holes. Early computers read data this way.
![[Screenshot 2025-07-23 at 12.08.22.png|400]]

2. **Magnetic media**:
    - First: **floppy disks**
    ![[Screenshot 2025-07-23 at 12.08.48.png|400]]
    - Then: **hard disks** (faster, more storage)
    ![[Screenshot 2025-07-23 at 12.09.30.png]]
3. **1960s**: First **Database Management Systems (DBMSs)** created:
    - Charles W. Bachman developed the **Integrated Data Store (IDS)**
    - IBM developed **IMS**
4. **1970s**:
    - IBM created **SQL** (Structured Query Language)
    - Modern relational databases (RDBMSs) were born
5. **Late 20th century**: Many RDBMSs
    - Oracle, Microsoft’s SQL Server, IBM’s DB2, MySQL, Sybase...
![[Screenshot 2025-07-23 at 12.10.02.png]]

##### Hadoop history

![[Screenshot 2025-07-23 at 12.16.32.png|200]]
2005 – started by ***Doug Cutting*** at Yahoo!
[[Hadoop]] is an [[Open Source]] Apache project.
Benefits: free, flexible, community-supported.
22
content/BigData/Database Overview.md
Normal file
@ -0,0 +1,22 @@

[[Database History]]
[[RDBMS]] – relational models
[[Hadoop]]

##### **Big Data Challenges**

Examples of tasks that are hard with large datasets:
1. Count the **most frequent words** in Wikipedia.
2. Find the **hottest November** per country from weather data.
3. Find the **day with the most critical errors** in company logs.

These problems require:
- **Huge data**
- **Efficient distributed computing**

#### [[RDBMS]] vs. [[Hadoop]]

| **Feature**    | **RDBMS**           | **Hadoop**                    |
| -------------- | ------------------- | ----------------------------- |
| Data structure | Structured (tables) | Any (structured/unstructured) |
| Scalability    | Limited             | Highly scalable               |
| Speed          | Fast (small data)   | Designed for huge data        |
| Access         | SQL                 | Code (e.g., Java, Python)     |
59
content/BigData/Hadoop/Apache Hive.md
Normal file
@ -0,0 +1,59 @@
---
aliases:
- Hive
---
> [[Hadoop Eccosystem|Systems based on MapReduce]]

### Apache Hive

##### **Key Features**
- Developed by **Apache**.
- SQL-like syntax (HiveQL) for querying [[HDFS]] or other large data stores.
- Translates SQL queries into one or more [[MapReduce]] jobs.
- Maps data in [[HDFS]] into virtual [[RDBMS]]-like tables.
- **Pro**:
    - Convenient for **data analytics** using familiar SQL.
* **Con**:
    * Quite slow response time (each query launches MapReduce jobs).

##### **Hive Data Model**

**Structure**
- **Physical**: Data stored in [[HDFS]] blocks across nodes.
- **Virtual Table**: Defined with a schema using metadata.
- **Partitions**: Logical splits of data to speed up queries.

**Metadata**
- Hive stores metadata in a database (the metastore):
    - Maps physical files to tables.
    - Maps fields (columns) to line structures in the raw data.

![[Screenshot 2025-07-23 at 18.25.32.png]]

**Hive Architecture**
![[Screenshot 2025-07-23 at 18.27.30.png]]

##### Hive Usage
```
# Start a Hive shell:
$ hive

# Create a Hive table:
hive> CREATE TABLE mta (id BIGINT, name STRING, startdate TIMESTAMP, email STRING);

# Show all tables:
hive> SHOW TABLES;

# Add a new column to the table:
hive> ALTER TABLE mta ADD COLUMNS (description STRING);

# Load an HDFS data file into the table:
hive> LOAD DATA INPATH '/home/hadoop/mta_users' OVERWRITE INTO TABLE mta;

# Query employees that have worked more than a year:
hive> SELECT name FROM mta WHERE (unix_timestamp() - startdate > 365 * 24 * 60 * 60);

# Execute a command without the shell:
$ hive -e 'SELECT name FROM mta;'

# Execute a script from a file:
$ hive -f hive_script.txt
```
68
content/BigData/Hadoop/Apache Spark.md
Normal file
@ -0,0 +1,68 @@
---
aliases:
- Spark
---
> [[Hadoop Eccosystem|Systems based on MapReduce]]

## Apache Spark
> Apache Spark is a **fast**, **general-purpose**, **open-source** cluster computing system designed for large-scale data processing.

##### Key Characteristics:
- **Unified analytics engine** – supports batch, streaming, SQL, machine learning, and graph processing.
- **In-memory computation** – stores intermediate results in RAM (vs. Hadoop, which writes to disk).
- **Fault-tolerant** and scalable.

##### Benefits of Spark Over [[Hadoop]] [[MapReduce]]

| **Feature**         | **Spark**                                                                  | **Hadoop MapReduce**                   |
| ------------------- | -------------------------------------------------------------------------- | -------------------------------------- |
| **Performance**     | Up to **100x faster** (in-memory operations)                               | Disk-based, slower                     |
| **Ease of use**     | High-level APIs in Python, Java, Scala, R                                  | Java-based, verbose programming        |
| **Generality**      | Unified engine for batch, stream, ML, graph                                | Focused on batch processing            |
| **Fault tolerance** | Efficient recovery via lineage                                             | Slower fault recovery via re-execution |
| **Runs everywhere** | Runs on [[Hadoop]], Apache Mesos, Kubernetes, standalone, or in the cloud  | Tied to the Hadoop/YARN stack          |

##### How Is Spark Fault Tolerant?
> Resilient Distributed Datasets ([[RDD]]s)

- Restricted form of distributed shared memory
- Immutable, partitioned collections of records
- Recompute lost partitions on failure
- No cost if nothing fails

![[Screenshot 2025-07-23 at 19.17.31.png|500]]

- **Lineage Graph**
    - Each [[RDD]] keeps track of how it was derived. If a node fails, Spark **recomputes only the lost partition** from the original transformations.

##### Writing Spark Code in Python
```python
# Spark context initialization
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local")
sc = SparkContext(conf=conf)

# Create RDDs:
# 1. From a Python list
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)

# 2. From a file
distFile = sc.textFile("data.txt")
distFile = sc.textFile("folder/*.txt")
```
##### **RDD Transformations (Lazy)**
These create a new RDD from an existing one.

| map(func)         | Apply function to each element               |
| ----------------- | -------------------------------------------- |
| filter(func)      | Keep elements where func returns True        |
| flatMap(func)     | Like map, but flattens results               |
| union(otherRDD)   | Union of two RDDs                            |
| distinct()        | Remove duplicates                            |
| reduceByKey(func) | Combine values for each key (key-value RDDs) |
| sortByKey()       | Sort by keys                                 |
| join(otherRDD)    | Join two key-value RDDs                      |
| repartition(n)    | Re-distribute the RDD into n partitions      |

Transformations are **lazy** – they only execute when an action is triggered.
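
The semantics of `map`, `filter`, and `reduceByKey` from the table can be mimicked in plain Python — a single-machine illustration of what each transformation computes, not Spark itself (real Spark distributes the data across partitions and defers execution until an action):

```python
# Plain-Python mimic of three RDD transformations (illustration only).
from itertools import groupby

words = ["a", "good", "cook", "a", "cook", "cookies"]

mapped = [(w, 1) for w in words]                       # map: word -> (word, 1)
filtered = [p for p in mapped if len(p[0]) > 1]        # filter: drop 1-letter words
grouped = groupby(sorted(filtered), key=lambda p: p[0])
counts = {k: sum(v for _, v in g) for k, g in grouped} # reduceByKey: sum per key
print(counts)
```

In real Spark the `sorted`/`groupby` step corresponds to the shuffle that `reduceByKey` performs across the cluster.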
27
content/BigData/Hadoop/Google Dremel.md
Normal file
@ -0,0 +1,27 @@

> [[Hadoop Eccosystem|Systems based on MapReduce]]

**Key Ideas**
- Leverages a columnar file format
- Optimized for SQL performance

**Concepts**
- Tree-based **query execution**.
- Efficient scanning and aggregation of **nested columnar data**.

### Columnar Data Format
> Illustration of what columnar storage is all about:
> given three columns:

![[Screenshot 2025-07-23 at 18.42.46.png|170]]
> In row-oriented storage, the data is laid out one row at a time:

![[Screenshot 2025-07-23 at 18.45.25.png|500]]
> Whereas in column-oriented storage, it is laid out one column at a time:

![[Screenshot 2025-07-23 at 18.46.55.png|500]]
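
The two layouts can also be shown in a few lines of Python, using a made-up three-column table: row-oriented storage interleaves the fields of each record, while column-oriented storage groups all values of a column together, which lets a query scan only the columns it needs.

```python
# Row-oriented vs. column-oriented layout of the same three-column table.
rows = [("a1", "b1", "c1"),
        ("a2", "b2", "c2"),
        ("a3", "b3", "c3")]

row_layout = [field for row in rows for field in row]   # a1 b1 c1 a2 b2 c2 ...
col_layout = [field for col in zip(*rows) for field in col]  # a1 a2 a3 b1 ...

print(row_layout)
print(col_layout)
```

A query touching only column `b` reads one contiguous third of `col_layout`, but has to skip through all of `row_layout` — the core advantage Dremel exploits.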

**Nested data in columnar format**
![[Screenshot 2025-07-23 at 18.50.10.png]]![[Screenshot 2025-07-23 at 18.50.16.png]]

### Frameworks Inspired by Google Dremel
- Apache Drill (MapR)
- Apache Impala (Cloudera)
- Apache Tez (Hortonworks)
- Presto (Facebook)
64
content/BigData/Hadoop/HDFS.md
Normal file
@ -0,0 +1,64 @@

##### HDFS ([[Hadoop]] Distributed File System)
Stores huge files (typical file size: GB–TB) across multiple machines.
- Breaks files into **blocks** (typically 128 MB).
- **Replicates** blocks (default: 3 copies) for fault tolerance.
- Accessed through a POSIX-like API (HDFS relaxes some POSIX requirements).
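
Quick arithmetic for the block model above: a file is split into fixed-size blocks and each block is stored three times, so the cluster footprint is roughly an upper bound of `ceil(size / block) * 3 * block` (the final block may be only partially filled):

```python
# Block count and worst-case storage footprint for a file in HDFS.
import math

BLOCK_MB = 128
REPLICATION = 3

def hdfs_footprint(file_mb):
    blocks = math.ceil(file_mb / BLOCK_MB)
    return blocks, blocks * REPLICATION * BLOCK_MB  # upper bound in MB

# A 1 GB (1024 MB) file: 8 blocks, stored 3x across DataNodes.
blocks, stored_mb = hdfs_footprint(1024)
print(blocks, stored_mb)
```

This is also why HDFS prefers big files (see the design principles below the block list): a 1 KB file still occupies a metadata entry and three replicas, so millions of tiny files overload the NameNode.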

##### HDFS Design Principles
* **Immutable**: **write-once, read-many**
* **No failures**: A disk or node failure does not affect the file system
* **File size unlimited**: Up to 512 yottabytes (2^63 × 64 MB)
* **File number limited**: 1,048,576 files per directory
* **Prefer bigger files**: Big files provide better performance

##### HDFS File Formats
- Text/CSV – no schema, no metadata
- JSON Records – metadata is stored with the data
- Avro Files – schema independent of the data
- Sequence Files – binary files (used as intermediate storage in M/R)
- RC Files – Record Columnar files
- ORC Files – Optimized RC files; compress better
- Parquet Files – yet another RC file format

##### HDFS Command Line
```
# List files
hadoop fs -ls /path

# Make directory
hadoop fs -mkdir /user/hadoop

# Print file
hadoop fs -cat /file

# Upload file
hadoop fs -copyFromLocal file.txt hdfs://...
```

#### HDFS Architecture – Main Components

##### 1. NameNode (Master Node)
- **Stores metadata** about the filesystem:
    - Filenames
    - Directory structure
    - Block locations
    - Permissions

- It **does not store the actual data**.
- There is **one active NameNode** per cluster.

##### 2. DataNodes (Worker Nodes)
- Store the **actual data blocks** of files.
- Send **heartbeat** messages to the NameNode to report that they are alive.
- When a file is written, it’s split into blocks and distributed across many DataNodes.
- DataNodes also **replicate** blocks (typically 3 copies) to provide **fault tolerance**.

#### File Read / Write
**When a file is written:**
1. The client contacts the **NameNode** to ask: “Where should I write the blocks?”
2. The NameNode responds with a list of **DataNodes** to use.
3. The client sends the blocks of the file to those DataNodes.
4. Blocks are **replicated** automatically across different nodes for redundancy.

**When a file is read:**
1. The client contacts the **NameNode** to get the list of DataNodes storing the required blocks.
2. The client reads the blocks **directly** from the DataNodes.
17
content/BigData/Hadoop/Hadoop Eccosystem.md
Normal file
@ -0,0 +1,17 @@

### Systems based on [[MapReduce]]
> Early-generation frameworks for big data processing.
* [[Apache Hive]]

### Systems that replace MapReduce
> Newer, faster frameworks with different architectures and performance improvements.

**Motivation**: [[MapReduce]] and [[Apache Hive|Hive]] are too slow!
- [[Google Dremel]]
- [[Apache Spark]]
    - Replaces MapReduce with its own engine, which works much faster without compromising consistency
    - Architecture based not on MapReduce but on two concepts:
        - RDD (Resilient Distributed Dataset)
        - DAG (Directed Acyclic Graph)
    - Pros:
        - Works much faster than MapReduce
        - Fast-growing community
13
content/BigData/Hadoop/Hadoop.md
Normal file
@ -0,0 +1,13 @@

![[Screenshot 2025-07-23 at 12.20.09.png|400]]
> Hadoop is an **[[Open Source]] framework** for:
> - **Distributed storage** (across many machines)
> - **Distributed processing** (run programs on many machines in parallel)
>
> > It is **not a database** — it is an ecosystem for managing and analyzing **Big Data**.

## **Hadoop Components Overview**
![[Screenshot 2025-07-23 at 11.58.48.png]]
> 1. [[HDFS]]
> 2. [[MapReduce]]
> 3. [[Yarn]]

[[Hadoop Eccosystem]]
32
content/BigData/Hadoop/MapReduce.md
Normal file
@ -0,0 +1,32 @@

A programming model for processing big data in parallel.
- Distributed processing – the job runs in parallel on several nodes
- Run the process where the data is!
- Horizontal scalability

- **Map** step: transform the input
    - Transform, filter, calculate
    - Local data
    - e.g., count 1 per word

- **Shuffle** step: reorganization of the map output
    - Shuffle, sort, group

- **Reduce** step: aggregate / sum the groups
    - e.g., sum word counts

MapReduce **runs code where the data is**, saving data transfer time.

![[Screenshot 2025-07-23 at 13.00.20.png]]
##### Example:
From the sentence:
> “how many cookies could a good cook cook if a good cook could cook cookies”

Steps:
1. **Map**:
    - Each word becomes a pair like ("cook", 1)
2. **Shuffle**:
    - Group by word
3. **Reduce**:
    - Add up counts → ("cook", 4)

![[Screenshot 2025-07-23 at 13.01.20.png]]
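
The three steps above can be sketched in plain Python — a single-machine illustration; real MapReduce runs each phase in parallel across nodes:

```python
# Single-machine sketch of map -> shuffle -> reduce for the sentence above.
from collections import defaultdict

sentence = ("how many cookies could a good cook cook "
            "if a good cook could cook cookies")

# Map: each word -> ("word", 1)
mapped = [(word, 1) for word in sentence.split()]

# Shuffle: group the pairs by word
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum each group
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts["cook"])  # → 4
```

In a real cluster, the map output for each word is routed to one reducer, so each reducer sees all counts for its words — the same role `groups` plays here.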
20
content/BigData/Hadoop/RDD.md
Normal file
@ -0,0 +1,20 @@

## RDD (Resilient Distributed Dataset)
>An RDD is an immutable (read-only) distributed collection of objects.
>
>The dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

![[Screenshot 2025-07-23 at 19.08.40.png|600]]
##### **Key Properties:**
- Distributed: Automatically split across cluster nodes.
- Lazy evaluation: Transformations aren’t executed until an action is called.
- Fault-tolerant: Can **recompute lost partitions** using the lineage graph.
- Parallel: Operates concurrently across cluster cores.
##### Data Sharing
> In [[Hadoop]] [[MapReduce]]
![[Screenshot 2025-07-23 at 19.11.44.png|500]]

> In [[Apache Spark|Spark]]
![[Screenshot 2025-07-23 at 19.12.57.png|500]]
>10–100x faster than network and disk!
35
content/BigData/Hadoop/Yarn.md
Normal file
@ -0,0 +1,35 @@
**YARN (Yet Another Resource Negotiator)** is [[Hadoop]]’s cluster resource management system.
- Multiple jobs run simultaneously
- Multiple jobs share the same resources (disk, CPU, memory)
- YARN assigns resources to jobs and tasks exclusively
##### YARN is in charge of:
1. Allocating resources
2. Scheduling jobs
    - assigns priorities to jobs according to a policy:
      FIFO scheduler, Fair scheduler, or Capacity scheduler
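The scheduler policy is selected in `yarn-site.xml` via the ResourceManager's scheduler class — a config sketch choosing the Capacity scheduler (the property and class names are the standard Hadoop ones):

```xml
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
```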
##### Components:
- **ResourceManager**
    - oversees resource allocation across the cluster
- **NodeManager**
    - Each node in the cluster runs a NodeManager.
    - This component manages the execution of containers on its node.
- **ApplicationMaster**
    - manages the lifecycle of a single application (one per application)
    - handles job scheduling and monitors progress
- **Resource Container**
    - a logical bundle of resources (e.g., CPU, Memory) that is allocated by the ResourceManager

![[Screenshot 2025-07-23 at 13.29.37.png]]
##### YARN ecosystem
YARN can run other applications besides Hadoop [[MapReduce]] that integrate into the Hadoop ecosystem:
- Apache Storm (data streaming engine)
- [[Apache Spark]] (data batch and streaming engine)
- Apache Solr (search platform)
13
content/BigData/Open Source.md
Normal file
@ -0,0 +1,13 @@
- Source Code available
- Free Redistribution
- Derived Works

![[Screenshot 2025-07-23 at 12.24.23.png]]
Open-source solutions replace closed-source ones:
![[Screenshot 2025-07-23 at 12.25.00.png]]
More open-source solutions:
![[Screenshot 2025-07-23 at 12.25.28.png]]

![[Screenshot 2025-07-23 at 12.27.11.png]]![[Screenshot 2025-07-23 at 12.27.39.png]]
63
content/BigData/RDBMS.md
Normal file
@ -0,0 +1,63 @@
[[Database Overview]]
##### What is an RDBMS?
**Relational Database Management System**:
- Data is stored in **tables**:
    - **Rows** = records
    - **Columns** = fields
- Each table has:
    - **Indexes** for fast searching
    - **Relationships** with other tables (via keys)
##### Relational model - Keys and Indexes
Keys and indexes make it possible to find record(s) quickly.
- Operations become efficient:
    - **Find by key** → O(log n)
    - **Fetch record by ID** → O(1)

Indexes = sorted references to data locations → like a book index.
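The O(log n) key lookup over a sorted index can be sketched with Python's `bisect` — the index here is a hypothetical sorted list of keys paired with row locations, invented for illustration:

```python
import bisect

# A tiny "index": keys kept sorted, each pointing at a row location on disk
index_keys = [3, 8, 15, 21, 42, 57, 90]
row_locations = [0x10, 0x48, 0x88, 0xA0, 0xC8, 0xF0, 0x128]

def find_by_key(key):
    """Binary search over the sorted keys: O(log n) comparisons."""
    i = bisect.bisect_left(index_keys, key)
    if i < len(index_keys) and index_keys[i] == key:
        return row_locations[i]  # fetching the row at this location is then O(1)
    return None  # key not present in the index

print(hex(find_by_key(42)))  # → 0xc8
```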
##### Relational model - Operations
Relational databases support **CRUD**:
- **C**reate
- **R**ead
- **U**pdate
- **D**elete

Each operation uses both:
- The **index** (to locate data)
- The **data** itself (to read/write)
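The four CRUD operations in SQL, sketched with Python's built-in `sqlite3` (the table and column names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Create
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
# Read
name = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()[0]
# Update
conn.execute("UPDATE users SET name = ? WHERE id = 1", ("bob",))
# Delete
conn.execute("DELETE FROM users WHERE id = 1")

remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(name, remaining)  # → alice 0
```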
##### Relational model - Transactional
Relational databases guarantee **transaction safety** with ACID:
- **A**tomicity – all or nothing
- **C**onsistency – valid data only
- **I**solation – no interference from other transactions
- **D**urability – survives crashes

* Examples:
    - Transferring money, posting a tweet
    - Each must either **succeed completely** or **fail completely**.

Transactions guarantee data validity despite errors & failures.
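Atomicity for the money-transfer example, sketched with `sqlite3` (the account IDs and balances are invented; `with conn` wraps the statements in one transaction that commits on success and rolls back on any error):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

try:
    with conn:  # one transaction: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 60 WHERE id = 1")
        # ...the matching credit to account 2 would go here, but we
        # simulate a crash in the middle of the transfer:
        raise RuntimeError("crash mid-transfer")
except RuntimeError:
    pass

balances = [r[0] for r in conn.execute("SELECT balance FROM accounts ORDER BY id")]
print(balances)  # → [100, 0]  (the debit was rolled back, no money vanished)
```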
##### Relational model - SQL
**SQL** is the language used to talk to relational databases.
- **S**tructured
- **Q**uery
- **L**anguage

- All RDBMSs use it (MySQL, PostgreSQL, Oracle, etc.)
##### Pros and Cons of RDBMS
**Pros:**
- Structured data
- ACID transactions
- Powerful SQL
- Fast (for small/medium data sizes)

**Cons**:
- Doesn’t scale well (single machine, SPOF = Single Point of Failure)
- Becomes **slow** with **big data**
- **Less fault tolerant**
- Not designed for **massive, distributed systems**