Add my Obsidian notes

This commit is contained in:
DeGefen 2025-07-23 20:36:04 +03:00
parent 059848f8b0
commit 7b7a97b7cf
59 changed files with 821 additions and 0 deletions

View File

@ -0,0 +1,91 @@
[[Cloud Computing]]
## AWS Overview
- Over 175 services
- **Pay-as-you-go** pricing
- **No upfront costs**
- **Ideal for experimentation**
- **Access to cutting-edge tools and scalability**
##### **Region**
- A physical location worldwide with multiple data centers.
##### **Availability Zone (AZ)**
- Logical group of one or more data centers within a region.
- Physically isolated (up to 100 km apart).
- Designed for **high availability and fault tolerance**.
##### **Edge Location**
- Physical sites dispersed across the globe
- Part of Amazon's CDN (Content Delivery Network).
- Distributes services/data closer to users to reduce latency.
##### **Planning for Failure (Resiliency)**
- **Storage**:
* S3 service is designed for failure.
* Each file is replicated across multiple [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]s in the region, so you always have at least three copies of your file.
- **Compute**:
- The owner is responsible for manually distributing resources across multiple [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]s.
- If one fails, the others still operate.
- **Databases**:
- The owner can configure DB deployment across multiple [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]s for redundancy.
##### **Benefits of AWS Global Infrastructure**
- High performance
- Low latency
- High availability
- Scalability
- Unlimited capacity (horizontally scalable)
- Built-in security and monitoring
- Confidentiality
- Reliability
- Low cost
##### Shared Responsibility of Security
![[Screenshot 2025-07-23 at 14.20.31.png]]
## AWS Core Services
##### Networking
* [[Amazon VPC]]
##### Security & Identity
- [[Amazon IAM]]
##### Compute
- [[Amazon EC2]]
- [[Amazon Lambda]]
##### Storage
- **Instance Store:**
- Specified by instance type. Data is stored on the same server as the [[Amazon EC2|EC2]] instance. It is removed when the instance is terminated.
- [[Amazon EBS]]
- [[Amazon S3]]
##### Databases
- Relational
- [[Amazon RDS]]
- Amazon Redshift
- Amazon Aurora
- Non-Relational
- [[Amazon DynamoDB]]
- Amazon ElastiCache
- Amazon Neptune
- Alternatively:
- You can install a DB of your choice on an [[Amazon EC2|EC2]] instance instead of using one provided by AWS. In that case, you take full responsibility for the security and management of your DB.
## AWS Pricing Models
##### Principles:
- **Pay-as-you-go** (only pay for usage)
- **Reserved pricing** (discounted with commitment)
- **Volume discount** (pay less when you use more)
##### Free Tier Options:
- **Always free** (e.g., 1M free Lambda requests per month)
- **12 months free** (introductory offer)
- **Trial services**
### **Billing Examples:**
- [[Amazon EC2|EC2]]: Pay for runtime only.
- [[Amazon S3|S3]]: Pay for
- Storage volume
- Requests (PUT/GET)
- Data transfer
- [[Amazon Lambda|Lambda]]: Pay for
- Number of requests
- Execution time
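To make the billing model concrete, here is a back-of-the-envelope sketch of a monthly Lambda bill in Python. The rates and free-tier figures below are illustrative assumptions, not current AWS prices:
```
# Rough Lambda bill estimate. Rates/free-tier values are assumptions
# for illustration; check the AWS pricing page for real numbers.
PRICE_PER_MILLION_REQUESTS = 0.20   # assumed USD per 1M requests
PRICE_PER_GB_SECOND = 0.0000167     # assumed USD per GB-second
FREE_REQUESTS = 1_000_000           # assumed always-free requests/month
FREE_GB_SECONDS = 400_000           # assumed always-free GB-seconds/month

def lambda_monthly_cost(requests, avg_duration_ms, memory_mb):
    # Compute charge = requests x duration (s) x memory (GB), in GB-seconds.
    gb_seconds = requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    billable_requests = max(0, requests - FREE_REQUESTS)
    billable_gb_seconds = max(0, gb_seconds - FREE_GB_SECONDS)
    return (billable_requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
            + billable_gb_seconds * PRICE_PER_GB_SECOND)

# 5M requests/month, 120 ms average duration, 512 MB memory:
print(f"${lambda_monthly_cost(5_000_000, 120, 512):.2f}")  # $0.80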

View File

@ -0,0 +1,10 @@
---
aliases:
- EBS
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]
##### **Amazon EBS (Elastic Block Store)**
Extra block storage that attaches to an [[Amazon EC2|EC2]] instance; it is separate from the instance store.
- Persistent and can be attached to any [[Amazon EC2|EC2]] instance in the [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]].
- It is not deleted when the [[Amazon EC2|EC2]] instance is terminated.

View File

@ -0,0 +1,43 @@
---
aliases:
- EC2
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]
##### **Amazon EC2 (Elastic Compute Cloud)**
A web service that provides secure, resizable compute capacity in the cloud.
* Designed to make web-scale cloud computing easier for developers.
- Secure, resizable compute capacity (virtual servers).
- Complete control over OS and apps.
**Pricing Models**:
- **On-Demand**: Pay for what you use.
- **Spot Instances**: Cheap, interruptible, short-lived instances; save up to 90%.
- **Reserved**: Lower price for long-term usage.
**Features:**
- **Amazon Machine Image (AMI):**
- Preconfigured OS image
- e.g., Linux, macOS, Windows
- **Instance type**:
- Defines CPU, memory, storage, networking capacity
- **Networking**:
- [[Amazon VPC|VPC]] and subnets
- **Storage**
- **Security Group(s)**:
- Acts like a firewall; defines inbound and outbound access to the EC2 instance
- **Key pair**:
- Establishes a remote connection (secure SSH access)
![[Screenshot 2025-07-23 at 16.52.08.png]]
- **Instance Families**
| Family Type                        | Use Case                  |
| ---------------------------------- | ------------------------- |
| General Purpose (M / T / A) | Web servers |
| Compute Optimized (C) | Analytics, gaming |
| Memory Optimized (R / X) | High-performance DB |
| Accelerated Computing (P / G / F) | AI, ML, GPU compute |
| Storage Optimized (I) | Big data, NoSQL databases |
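The launch settings above (AMI, instance type, key pair, security group) map one-to-one onto the EC2 API. A minimal launch sketch with boto3; the AMI ID, key name, and security group ID are hypothetical placeholders:
```
# Minimal EC2 launch sketch (boto3). All IDs/names are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",            # AMI: preconfigured OS image
    InstanceType="t3.micro",                    # instance type: CPU/memory/network
    KeyName="my-key-pair",                      # key pair: SSH access
    SecurityGroupIds=["sg-0123456789abcdef0"],  # security group: firewall
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```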

View File

@ -0,0 +1,13 @@
---
aliases:
- IAM
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]
##### **Amazon IAM (Identity and Access Management)**
- Manages user access to services.
- Attach permission policies to identities to control which actions they can perform.
- Identities in Amazon IAM are ***users***, ***groups*** and ***roles***.
- Based on ***least privilege*** principle.
* A user or entity should have access only to the specific data, resources, and applications that you have explicitly granted.
* Example usage:
* Grant cross-account permissions to upload objects while ensuring that the bucket owner retains full control (see the sketch below).
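One way to express that example is an S3 bucket policy that lets another account call `s3:PutObject` only when the upload grants the bucket owner full control. A sketch with boto3; the account ID and bucket name are placeholders:
```
# Sketch: cross-account uploads where the bucket owner keeps full control.
# The account ID and bucket name are placeholders.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::my-shared-bucket/*",
        # Only allow uploads that grant the bucket owner full control:
        "Condition": {
            "StringEquals": {"s3:x-amz-acl": "bucket-owner-full-control"}
        },
    }],
}

boto3.client("s3").put_bucket_policy(
    Bucket="my-shared-bucket", Policy=json.dumps(policy)
)
```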

View File

@ -0,0 +1,18 @@
---
aliases:
- Lambda
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]
##### **AWS Lambda (a serverless compute service)**
- Run backend code without provisioning servers.
- Event-driven: triggered by events (e.g., file upload).
- Languages: Python, Node.js, Java, C#, Go, Ruby.
- Automatically scales with demand.
**Workflow**
![[Screenshot 2025-07-23 at 17.51.04.png|600]]
**Example Use Case:**
- You can configure a Lambda function to perform an action when an event occurs. For example, when an image is stored in Bucket-A, the event invokes a Lambda function that converts the image to a new format and stores it in Bucket-B (see the handler sketch below).
![[Screenshot 2025-07-23 at 17.52.39.png|400]]
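A minimal handler sketch for that pipeline, assuming Bucket-B's name and treating the format conversion as a placeholder step:
```
# Handler sketch: triggered by an S3 event on Bucket-A, writes to Bucket-B.
# OUTPUT_BUCKET and convert() are assumptions/placeholders.
import boto3

s3 = boto3.client("s3")
OUTPUT_BUCKET = "bucket-b"

def convert(image_bytes):
    # Placeholder for real image processing (e.g., Pillow, which would
    # have to be packaged with the function).
    return image_bytes

def lambda_handler(event, context):
    for record in event["Records"]:                  # S3 event records
        bucket = record["s3"]["bucket"]["name"]      # source: Bucket-A
        key = record["s3"]["object"]["key"]
        obj = s3.get_object(Bucket=bucket, Key=key)
        result = convert(obj["Body"].read())
        s3.put_object(Bucket=OUTPUT_BUCKET, Key=key, Body=result)
```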

View File

@ -0,0 +1,14 @@
---
aliases:
- RDS
- AWS RDS
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]
##### **Amazon RDS (Relational Database Service)**
A managed, distributed, cloud-based relational database service.
- Managed relational DBs (e.g., MySQL, PostgreSQL, Oracle).
- AWS handles backups, patching, and scaling.
- You can build a fault-tolerant DB by configuring RDS for **Multi-[[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]** deployment: place your master RDS instance in one [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]] and a standby replica in another [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]].
![[Screenshot 2025-07-23 at 17.48.07.png|600]]
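A sketch of provisioning such a deployment with boto3; identifiers and credentials are placeholders, and `MultiAZ=True` is what requests the standby replica in a second AZ:
```
# Multi-AZ RDS sketch (boto3; identifiers and credentials are placeholders).
import boto3

rds = boto3.client("rds")

rds.create_db_instance(
    DBInstanceIdentifier="mydb",
    Engine="mysql",
    DBInstanceClass="db.t3.micro",
    AllocatedStorage=20,               # GiB
    MasterUsername="admin",
    MasterUserPassword="change-me",    # placeholder secret
    MultiAZ=True,  # primary in one AZ, synchronous standby in another
)
```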

View File

@ -0,0 +1,20 @@
---
aliases:
- S3
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]
##### Amazon S3 (Simple Storage Service)
An object storage service that offers industry-leading scalability, data availability, security, and performance.
- Object-based storage.
- Designed for 99.999999999% (11 nines) durability.
- Ideal for:
- Media storage
- Backups / archives
- Data lakes
- ML and analytics
**Components:**
- **Bucket**: A container to store an unlimited number of objects
- **Object**: The actual entities stored in the buckets
- **Key**: Unique identifier for the object
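The three components map directly onto the S3 API, as in this boto3 sketch (bucket name and key are placeholders):
```
# Bucket / object / key in practice (boto3; names are placeholders).
import boto3

s3 = boto3.client("s3")

s3.create_bucket(Bucket="my-example-bucket")       # bucket: the container
s3.put_object(
    Bucket="my-example-bucket",
    Key="backups/2025/notes.txt",                  # key: unique identifier
    Body=b"hello s3",                              # object: the stored entity
)
obj = s3.get_object(Bucket="my-example-bucket", Key="backups/2025/notes.txt")
print(obj["Body"].read())
```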

View File

@ -0,0 +1,12 @@
---
aliases:
- VPC
---
> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]]
##### **Amazon VPC (Virtual Private Cloud)**
A logically **isolated network** within AWS for your resources.
- Create a ***public-facing*** subnet for your web servers, which have access to the internet.
- Create a ***private*** subnet with no internet access for your backend systems
- e.g., databases, application servers
- Enables fine-grained control over traffic with both a public and a private subnet.
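A sketch of carving a VPC into one public-facing and one private subnet with boto3. The CIDR blocks are arbitrary, and a real public subnet additionally needs an internet gateway and route table, omitted here:
```
# Sketch: a VPC with a public-facing and a private subnet (boto3).
# CIDR blocks are arbitrary; the internet gateway + route table for the
# public subnet are omitted for brevity.
import boto3

ec2 = boto3.client("ec2")

vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]

public_subnet = ec2.create_subnet(        # web servers live here
    VpcId=vpc["VpcId"], CidrBlock="10.0.1.0/24"
)["Subnet"]

private_subnet = ec2.create_subnet(       # databases / backends live here
    VpcId=vpc["VpcId"], CidrBlock="10.0.2.0/24"
)["Subnet"]

print(public_subnet["SubnetId"], private_subnet["SubnetId"])
```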

View File

@ -0,0 +1,41 @@
##### data vs. information
- **Data** is just raw facts (like the number 42).
- But 42 could mean: age, shoe size, stock amount, etc.
- **Information** is when you give meaning to the data.
- Example: “Age = 42” gives context and becomes useful.
##### Big Data implementations
- **Delta**: *Sentiment analysis* (e.g., of customer feedback).
- **Netflix**: *User behavioral analysis* (e.g., what you watch and when).
- **Time Warner**: *Customer segmentation* (dividing customers into groups).
- **Volkswagen**: *Predictive support* (e.g., predicting car issues).
- **Visa**: *Fraud detection*.
- **Chinese government**: *Security intelligence* (national security).
- **Weather forecasting**: *Weather prediction models*.
- **Hospitals**: Diagnosing diseases using *machine learning* on images.
- **Amazon**: *Price optimization*.
- **Facebook**: Targeted advertising using *user profiling*.
##### Design Principles for Big Data
1. **Horizontal Growth**: Add more machines instead of stronger ones.
2. **Distributed Processing**: Split work across machines.
3. **Process where Data is**: Don't move the data; move the code.
4. **Simplicity of Code**: Keep logic understandable.
5. **Recover from Failures**: Systems should self-heal.
6. **Idempotency**: Running the same job twice shouldn't break results (see the sketch below).
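A tiny sketch of the idempotency principle: writing results keyed by record ID makes a re-run overwrite instead of duplicate, so retrying after a failure is safe.
```
# Idempotency sketch: keyed writes make re-runs safe; appends would not be.
results = {}

def process_record(record_id, value):
    # Overwriting by key: running the job twice yields the same state.
    results[record_id] = value * 2

process_record("r1", 21)
process_record("r1", 21)   # re-run after a failure: no duplicate, same result
assert results == {"r1": 42}
```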
##### Big Data SLA (Service Level Agreement)
Defines performance expectations:
- **Reliability**: Will the data be there?
- **Consistency**: Is the data accurate across systems?
- **Availability**: Is the system always accessible?
- **Freshness**: How up-to-date is the data?
- **Response time**: How fast do queries return?
* Other concerns:
- **Cost**
- **Scalability**
- **Performance**
> Next [[Cloud Services]]

View File

@ -0,0 +1,15 @@
>“An accumulation of data
>that is too large and complex
>for processing by traditional
>database management tools”
>
>**In Short:**
>Big Data = too big for standard tools like Excel or regular SQL databases.
[[Big Data Intro]]
[[Cloud Services]]
[[AWS Cloud Services]]
[[Database Overview]]
[[RDBMS]]
[[Hadoop]]
[[Hadoop Eccosystem]]

View File

@ -0,0 +1,29 @@
##### Benefits of Cloud Computing
- **Elasticity**: Start small and scale as needed.
- **Cost-efficiency**: No need to spend money on data centers.
- **No capacity guessing**: Scale automatically based on demand.
- **Economies of scale**: Benefit from AWS's vast infrastructure.
- **Agility**: Deploy resources quickly.
- **Global Reach**: Go international within minutes.
##### Deployment Models in the Cloud
- **IaaS** (Infrastructure as a Service):
- Virtual machines, storage, networks
- e.g., Amazon EC2.
- **PaaS** (Platform as a Service):
- Managed environments for building apps
- e.g., AWS Elastic Beanstalk.
- **[[Cloud Services#Selling Your Service |SaaS]]** (Software as a Service):
- Full applications delivered over the internet
- e.g., Gmail.
##### Deployment Strategies of Cloud Computing
- **On-Premises (Private Cloud)**: Owned and operated on-site.
- **Public Cloud**: Fully hosted on cloud provider infrastructure.
- **Hybrid Cloud**: Combines on-premises and cloud resources.
##### Cloud Providers Comparison
![[Screenshot 2025-07-23 at 13.54.07.png | 600]]
> Next [[AWS Cloud Services]]

View File

@ -0,0 +1,58 @@
Introduction to cloud computing concepts relevant for Big Data.
##### Traditional Software Deployment Process
1. **Coding**
2. **Compiling**: turning source code into executable files.
3. **Installing**: putting the software on computers.
##### Clustered Software
Introduces three related architectures:
1. **Redundant Servers**: multiple servers running the same service for fault tolerance.
- E.g., several identical web servers.
2. **Micro-services**: the system is broken into **small, independent services** that communicate with each other.
- Each handles a specific function.
3. **Clustered Computing**: a large task is **split into sub-tasks** running on **multiple nodes**.
- Used in Big Data systems like **NoSQL databases**.
##### Scaling a Software System
Two ways to handle growing demand:
- **Scale Up**: Make one machine stronger
- When running out of resources we can add: *Memory*, *CPU*, *Disk*, *Network Bandwidth*
- Can become expensive or reach hardware limits.
- **Scale Out**: Add more machines to share the work.
- Add **redundant servers** or use **cluster computing**.
- Each server can be **standalone** (like a web server), or part of a **coordinated system** (like a NoSQL cluster).
- More fault-tolerant and scalable than vertical scaling.
- Tradeoff:
- **Scale-up** is simpler but has limits.
- **Scale-out** is more flexible and resilient but more complex.
##### Selling Your Service
- **Install**: Software sold as an installation
- e.g., Microsoft's Office package
- **SaaS**: Software as a Service
- No need to install; just log in and use.
- e.g., Google Docs, Zoom, Dropbox.
- Common SaaS pricing models:
1. **Per-user**: Pay per person.
2. **Tiered**: Fixed price for different feature levels.
3. **Usage-based**: Pay for what you use (e.g., storage, API calls).
##### Deployment Models
Where you run your software:
- **On-Premises**: Your own machines or rented servers (or VMs).
- **Cloud**: Run on virtual machines (VMs) from a cloud provider (e.g., AWS, Azure, GCP).
##### Cloud Deployment Options
When deploying to the cloud, you have options:
1. **Vanilla Node**: Raw VM; you install everything yourself.
2. **Cloud VM**: VM with pre-installed software.
3. **Managed Service**: Cloud provider handles setup, scaling, updates (e.g., [[Amazon RDS|AWS RDS]], Google BigQuery).
> Next [[Cloud Computing]]

View File

@ -0,0 +1,24 @@
1. **Punch cards**: physical cards with holes. Early computers read data this way. ![[Screenshot 2025-07-23 at 12.08.22.png | 400]]
2. **Magnetic media**:
- First: **Floppy disks**
![[Screenshot 2025-07-23 at 12.08.48.png | 400]]
- Then: **Hard disks** (faster, more storage)
![[Screenshot 2025-07-23 at 12.09.30.png]]
3. **1960s**: First **Database Management Systems (DBMSs)** created:
- Charles W. Bachman developed the **Integrated Data Store (IDS)**
- IBM developed **IMS**
4. **1970s**:
- IBM created **SQL** (Structured Query Language)
- Modern relational databases (RDBMS) were born
5. **Late 20th century:** Many RDBMSs
- Oracle, Microsoft's SQL Server, IBM's DB2, MySQL, Sybase, ...
![[Screenshot 2025-07-23 at 12.10.02.png]]
##### Hadoop history
![[Screenshot 2025-07-23 at 12.16.32.png | 200]]
2005 - Started by ***Doug Cutting*** at Yahoo!
[[Hadoop]] is an [[Open Source]] Apache project
Benefits: free, flexible, community-supported.

View File

@ -0,0 +1,22 @@
[[Database History]]
[[RDBMS]] - Relational Models
[[Hadoop]]
##### **Big Data Challenges**
Examples of tasks that are hard with large datasets:
1. Count the **most frequent words** in Wikipedia.
2. Find the **hottest November** per country from weather data.
3. Find the **day with most critical errors** in company logs.
These problems require:
- **Huge data**
- **Efficient distributed computing**
#### [[RDBMS]] vs. [[Hadoop]]
| **Feature** | **RDBMS** | **Hadoop** |
| -------------- | ------------------- | ----------------------------- |
| Data structure | Structured (tables) | Any (structured/unstructured) |
| Scalability | Limited | Highly scalable |
| Speed | Fast (small data) | Designed for huge data |
| Access | SQL | Code (e.g., Java, Python) |

View File

@ -0,0 +1,59 @@
---
aliases:
- Hive
---
> [[Hadoop Eccosystem|Systems based on MapReduce]]
### Apache Hive
##### **Key Features**
- Developed by **Apache**.
- General SQL-like syntax for querying [[HDFS]] or other large databases
- Translates SQL queries into one or more [[MapReduce]] jobs.
- Maps data in [[HDFS]] into virtual [[RDBMS]]-like tables.
- **Pro**:
- Convenient for **data analytics**: uses SQL.
* **Con**:
* Quite slow response times
##### **Hive Data Model**
**Structure**
- **Physical**: Data stored in [[HDFS]] blocks across nodes.
- **Virtual Table**: Defined with schema using metadata.
- **Partitions**: Logical splits of data to speed up queries.
**Metadata**
- Hive stores metadata in a DB (the metastore)
- Maps physical files to tables.
- Maps fields (columns) to line structures in the raw data.
![[Screenshot 2025-07-23 at 18.25.32.png]]
**Hive Architecture**
![[Screenshot 2025-07-23 at 18.27.30.png|]]
##### Hive Usage
```
# Start a Hive shell:
$ hive
# Create a Hive table:
hive> CREATE TABLE mta (id BIGINT, name STRING, startdate TIMESTAMP, email STRING);
# Show all tables:
hive> SHOW TABLES;
# Add a new column to the table:
hive> ALTER TABLE mta ADD COLUMNS (description STRING);
# Load an HDFS data file into the table:
hive> LOAD DATA INPATH '/home/hadoop/mta_users' OVERWRITE INTO TABLE mta;
# Query employees that have worked for more than a year:
hive> SELECT name FROM mta WHERE (unix_timestamp() - startdate > 365 * 24 * 60 * 60);
# Execute a command without the shell:
$ hive -e 'SELECT name FROM mta;'
# Execute a script from a file:
$ hive -f hive_script.txt
```

View File

@ -0,0 +1,68 @@
---
aliases:
- Spark
---
> [[Hadoop Eccosystem|Systems based on MapReduce]]
## Apache Spark
> Apache Spark is a **fast**, **general-purpose**, **open-source** cluster computing system designed for large-scale data processing.
##### Key Characteristics:
- **Unified analytics engine**: supports batch, streaming, SQL, machine learning, and graph processing.
- **In-memory computation**: stores intermediate results in RAM (vs. Hadoop MapReduce, which writes to disk).
- **Fault tolerant** and scalable.
##### Benefits of Spark Over [[Hadoop]] [[MapReduce]]
| **Feature** | **Spark** | **Hadoop MapReduce** |
| ------------------- | ------------------------------------------------------------------------- | -------------------------------------- |
| **Performance** | Up to **100x faster** (in-memory operations) | Disk-based, slower |
| **Ease of use** | High-level APIs in Python, Java, Scala, R | Java-based, verbose programming |
| **Generality** | Unified engine for batch, stream, ML, graph | Focused on batch processing |
| **Fault tolerance** | Efficient recovery via lineage | Slower fault recovery via re-execution |
| **Runs Everywhere** | Runs on [[Hadoop]], Apache Mesos, Kubernetes, standalone, or in the cloud | Tied to the Hadoop ecosystem (YARN) |
##### How is Spark Fault Tolerant?
> Resilient Distributed Datasets ([[RDD]]s)
- Restricted form of distributed shared memory
- Immutable, partitioned collections of records
- Recompute lost partitions on failure
- No cost if nothing fails
![[Screenshot 2025-07-23 at 19.17.31.png|500]]
- **Lineage Graph**
- Each [[RDD]] keeps track of how it was derived. If a node fails, Spark **recomputes only the lost partition** from the original transformations.
##### Writing Spark Code in Python
```
# Spark Context Initialization
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("MyApp").setMaster("local")
sc = SparkContext(conf=conf)
# Create RDDs:
# 1. From a Python list
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
# 2. From a file
distFile = sc.textFile("data.txt")
distFile = sc.textFile("folder/*.txt")
```
##### **RDD Transformations (Lazy)**
These create a new RDD from an existing one.
| Transformation    | Description                                  |
| ----------------- | -------------------------------------------- |
| map(func)         | Apply function to each element               |
| filter(func)      | Keep elements where func returns True        |
| flatMap(func)     | Like map, but flattens results               |
| union(otherRDD)   | Union of two RDDs                            |
| distinct()        | Remove duplicates                            |
| reduceByKey(func) | Combine values for each key (key-value RDDs) |
| sortByKey()       | Sort by keys                                 |
| join(otherRDD)    | Join two key-value RDDs                      |
| repartition(n)    | Re-distribute RDD to n partitions            |
Transformations are **lazy**: they only execute when an action is triggered, as the sketch below shows.
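A short sketch of that laziness, reusing the `sc` context from the block above: the transformations only build the lineage graph, and nothing runs until the `collect()` action.
```
# Transformations build the lineage graph; an action triggers execution.
rdd = sc.parallelize(["cook", "cookies", "cook"])
pairs = rdd.map(lambda w: (w, 1))                # lazy: nothing runs yet
counts = pairs.reduceByKey(lambda a, b: a + b)   # still lazy
print(counts.collect())                          # action: the job runs now
# [('cook', 2), ('cookies', 1)]  (partition order may vary)
```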

View File

@ -0,0 +1,27 @@
> [[Hadoop Eccosystem|Systems based on MapReduce]]
**Key Ideas**
- Leverages columnar file format
- Optimized for SQL performance
**Concepts**
- Tree-based **query execution**.
- Efficient scanning and aggregation of **nested columnar data**.
### Columnar Data Format
> Illustration of what columnar storage is all about:
> given 3 columns:
![[Screenshot 2025-07-23 at 18.42.46.png|170]]
> In a row-oriented storage, the data is laid out one row at a time as follows:
![[Screenshot 2025-07-23 at 18.45.25.png|500]]
> Whereas in a column-oriented storage, it is laid out one column at a time:
![[Screenshot 2025-07-23 at 18.46.55.png|500]]
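The same idea as a toy Python sketch with invented sample data: row-oriented storage keeps whole records together, while column-oriented storage keeps whole columns together, so scanning one column touches far less data.
```
# Toy illustration of row- vs column-oriented layout (invented data).
rows = [(1, "a", 10), (2, "b", 20), (3, "c", 30)]

# Row-oriented: values laid out one record at a time.
row_layout = [v for row in rows for v in row]
# [1, 'a', 10, 2, 'b', 20, 3, 'c', 30]

# Column-oriented: values laid out one column at a time.
col_layout = {"id": [1, 2, 3], "name": ["a", "b", "c"], "value": [10, 20, 30]}

# Aggregating one column only touches that column's contiguous values:
print(sum(col_layout["value"]))  # 60
```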
**Nested data in columnar format**
![[Screenshot 2025-07-23 at 18.50.10.png]]![[Screenshot 2025-07-23 at 18.50.16.png]]
### Frameworks inspired by Google Dremel
- Apache Drill (MapR)
- Apache Impala (Cloudera)
- Apache Tez (Hortonworks)
- Presto (Facebook)

View File

@ -0,0 +1,64 @@
##### HDFS ([[Hadoop]] Distributed File System)
Stores huge files (typical file sizes: GB to TB) across multiple machines.
- Breaks files into **blocks** (typically 128 MB).
- **Replicates** blocks (default 3 copies) for fault tolerance.
- Accessed through a POSIX-like API.
##### HDFS design principles
* **Immutable**: **write-once, read-many**
* **No Failures**: A disk or node failure does not affect the file system
* **File Size Unlimited**: Up to 512 yottabytes (2^63 × 64 MB)
* **File Num Limited**: 1,048,576 files per directory
* **Prefer bigger files**: Big files provide better performance
##### HDFS File Formats
- Text/CSV - No schema, no metadata
- JSON Records - metadata is stored with the data
- Avro Files - schema independent of data
- Sequence Files - binary files (used as intermediate storage in M/R)
- RC Files - Record Columnar files
- ORC Files - Optimized RC files. Compress better
- Parquet Files - Yet another RC file
##### HDFS Command Line
```
# List files
hadoop fs -ls /path
# Make directory
hadoop fs -mkdir /user/hadoop
# Print file
hadoop fs -cat /file
# Upload file
hadoop fs -copyFromLocal file.txt hdfs://...
```
#### HDFS Architecture Main Components
##### **1.** NameNode (Master Node)
- **Stores metadata** about the filesystem:
- Filenames
- Directory structure
- Block locations
- Permissions
- It **does not store the actual data**.
- There is **one active NameNode** per cluster.
##### **2.** DataNodes (Worker Nodes)
- Store the **actual data blocks** of files.
- Send **heartbeat** messages to the NameNode to report that they are alive.
- When a file is written, it's split into blocks and distributed across many DataNodes.
- DataNodes also **replicate** blocks (typically 3 copies) to provide **fault tolerance**.
#### File Read / Write
**When a file is written:**
1. The client contacts the **NameNode** to ask: “Where should I write the blocks?”
2. The NameNode responds with a list of **DataNodes** to use.
3. The client sends the blocks of the file to those DataNodes.
4. Blocks are **replicated** automatically across different nodes for redundancy.
**When a file is read:**
1. The client contacts the **NameNode** to get the list of DataNodes storing the required blocks.
2. The client reads the blocks **directly** from the DataNodes.
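The WebHDFS REST interface mirrors this read path: the NameNode answers the metadata request with a redirect, and the client then fetches the data directly from a DataNode. A sketch in Python; the host, port, and path are assumptions:
```
# WebHDFS sketch of the read path (host/port/path are assumptions).
# Steps 1-2: the NameNode answers with a redirect to a DataNode;
# step 3: the client reads the block data directly from that DataNode.
import requests

NAMENODE = "http://namenode:9870"  # assumed WebHDFS endpoint (Hadoop 3.x)

r = requests.get(
    f"{NAMENODE}/webhdfs/v1/user/hadoop/file.txt",
    params={"op": "OPEN"},
    allow_redirects=False,            # keep the redirect visible
)
datanode_url = r.headers["Location"]  # points at a DataNode

data = requests.get(datanode_url).content
print(len(data), "bytes read directly from the DataNode")
```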

View File

@ -0,0 +1,17 @@
### Systems based on [[MapReduce]]
> Early generation frameworks for big data processing.
* [[Apache Hive]]
### Systems that replace MapReduce
> Newer, faster frameworks with different architectures and performance improvements.
**Motivation**: [[MapReduce]] and [[Apache Hive|Hive]] are too slow!
- [[Google Dremel]]
- [[Apache Spark]]
- Replaces MapReduce with its own engine that works much faster without compromising consistency
- Architecture not based on Map-reduce but rather on two concepts:
- RDD (Resilient Distributed Dataset)
- DAG (Directed Acyclic Graph)
- Pros:
- Works much faster than MapReduce
- Fast-growing community

View File

@ -0,0 +1,13 @@
![[Screenshot 2025-07-23 at 12.20.09.png | 400]]
> Hadoop is an **[[Open Source]] framework** for:
> - **Distributed storage** (across many machines)
> - **Distributed processing** (run programs on many machines in parallel)
>
> > It is **not a database** — it is an ecosystem for managing and analyzing **Big Data**.
## **Hadoop Components Overview**
![[Screenshot 2025-07-23 at 11.58.48.png ]]
> 1. [[HDFS]]
> 2. [[MapReduce]]
> 3. [[Yarn]]
[[Hadoop Eccosystem]]

View File

@ -0,0 +1,32 @@
A programming model for processing big data in parallel.
- Distributed processing - Job is run in parallel on several nodes
- Run the process where the data is!
- Horizontal Scalability
- **Map** step: transform input
- Transform, Filter, Calculate
- Local data
- e.g., count 1 per word
- **Shuffle** step: Reorganization of the map output.
- Shuffle, Sort, Group
- **Reduce** step: Aggregate / Sum the groups
- e.g., sum word counts
MapReduce **runs code where the data is**, saving data transfer time.
![[Screenshot 2025-07-23 at 13.00.20.png]]
##### Example:
From the sentence:
> “how many cookies could a good cook cook if a good cook could cook cookies”
Steps:
1. **Map**:
- Each word becomes a pair like ("cook", 1)
2. **Shuffle**:
- Group by word
3. **Reduce**:
- Add up counts → ("cook", 4)
![[Screenshot 2025-07-23 at 13.01.20.png]]
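The same three steps as a plain-Python simulation (no Hadoop required); `defaultdict` stands in for the shuffle's grouping:
```
# Word count as explicit map / shuffle / reduce steps (plain Python).
from collections import defaultdict

sentence = ("how many cookies could a good cook cook "
            "if a good cook could cook cookies")

# Map: each word becomes a ("word", 1) pair.
pairs = [(word, 1) for word in sentence.split()]

# Shuffle: group pairs by key.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce: sum the counts in each group.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts["cook"])  # 4
```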

View File

@ -0,0 +1,20 @@
## RDD (Resilient Distributed Dataset)
>RDD is an immutable (read only) distributed collection of objects.
>
>Dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster
![[Screenshot 2025-07-23 at 19.08.40.png|600]]
##### **Key Properties:**
- Distributed: Automatically split across cluster nodes.
- Lazy Evaluation: Transformations aren't executed until an action is called.
- Fault-tolerant: Can **recompute lost partitions** using lineage graph.
- Parallel: Operates concurrently across cluster cores.
##### Data Sharing
> In [[Hadoop]] [[MapReduce]]
![[Screenshot 2025-07-23 at 19.11.44.png|500]]
> In [[Apache Spark|Spark]]
![[Screenshot 2025-07-23 at 19.12.57.png|500]]
>In-memory data sharing is 10-100x faster than network and disk!

View File

@ -0,0 +1,35 @@
**YARN (Yet Another Resource Negotiator)**
is [[Hadoop]]'s cluster resource management system
- Multiple jobs running simultaneously
- Multiple jobs use same resources (disk, CPU, memory)
- Assign resources to jobs and tasks exclusively
##### YARN is in charge of:
1. Allocating resources
2. Scheduling jobs
- Assigns priorities to jobs according to policies:
FIFO scheduler, Fair scheduler, Capacity scheduler
##### Components:
- **ResourceManager**
- Oversees resource allocation across the cluster
- **NodeManager**
- Each node in the cluster runs a NodeManager.
- This component manages the execution of containers on its node.
- **ApplicationMaster**
- Manages the lifecycle of applications.
- Handles job scheduling and monitors progress.
- **Resource Container**
- A logical bundle of resources (e.g., CPU, memory) allocated by the ResourceManager
![[Screenshot 2025-07-23 at 13.29.37.png]]
##### YARN ecosystem
YARN can run other applications besides Hadoop [[MapReduce]], which can
integrate into the Hadoop ecosystem:
- Apache Storm (data streaming engine)
- [[Apache Spark]] (data batch and streaming engine)
- Apache Solr (search platform)

View File

@ -0,0 +1,13 @@
- Source Code available
- Free Redistribution
- Derived Works
![[Screenshot 2025-07-23 at 12.24.23.png]]
Open-source replaces closed-source
![[Screenshot 2025-07-23 at 12.25.00.png]]
More Open-source solutions
![[Screenshot 2025-07-23 at 12.25.28.png]]
![[Screenshot 2025-07-23 at 12.27.11.png]]![[Screenshot 2025-07-23 at 12.27.39.png]]

View File

@ -0,0 +1,63 @@
[[Database Overview]]
##### What is an RDBMS?
**Relational Database Management System**:
- Data is stored in **tables**:
- **Rows** = records
- **Columns** = fields
- Each table has:
- **Indexes** for fast searching
- **Relationships** with other tables (via keys)
##### Relational model - Keys and Indexes
Ability to find record(s) quickly
- Operations become efficient:
- **Find by key** → O(log n)
- **Fetch record by ID** → O(1)
Indexes = sorted references to data locations → like a book index.
##### Relational model - Operations
Relational databases support **CRUD**:
- **C**reate
- **R**ead
- **U**pdate
- **D**elete
Each operation uses both:
- The **index** (to locate data)
- The **data** itself (to read/write)
##### Relational model - Transactional
Relational databases guarantee **transaction safety** with ACID:
- **A**tomicity: all or nothing
- **C**onsistency: valid data only
- **I**solation: no interference from other transactions
- **D**urability: survives crashes
* Examples:
- Transferring money, posting a tweet.
- Both must either **succeed completely** or **fail completely**.
Transactions guarantee data validity despite errors and failures.
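The money-transfer example as an atomic transaction, sketched with Python's built-in sqlite3 on a toy schema: either both updates commit or neither does.
```
# Atomic money transfer sketch with sqlite3 (toy schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 'bob'")
except sqlite3.Error:
    pass  # rollback already happened; balances are unchanged

print(conn.execute("SELECT * FROM accounts").fetchall())
# [('alice', 50), ('bob', 50)]
```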
##### Relational model - SQL
**SQL** is the language used to talk to relational databases.
- **S**tructured
- **Q**uery
- **L**anguage
- All RDBMSs use it (MySQL, PostgreSQL, Oracle, etc.)
##### Pros and Cons of RDBMS
**Pros:**
- Structured data
- ACID transactions
- Powerful SQL
- Fast (for small/medium size)
**Cons**:
- Doesn't scale well (single machine = Single Point of Failure, SPOF)
- Becomes **slow** with **big data**
- **Less fault tolerant**
- Not designed for **massive, distributed systems**

Binary files (screenshots) not shown.