diff --git a/content/BigData/AWS/AWS Cloud Services.md b/content/BigData/AWS/AWS Cloud Services.md new file mode 100644 index 000000000..975749916 --- /dev/null +++ b/content/BigData/AWS/AWS Cloud Services.md @@ -0,0 +1,91 @@ +[[Cloud Computing]] +## AWS Overview +- Over 175 services +- **Pay-as-you-go** pricing +- **No upfront costs** +- **Ideal for experimentation** +- **Access to cutting-edge tools and scalability** +##### **Region** +- A physical location worldwide with multiple data centers. +##### **Availability Zone (AZ)** +- Logical group of one or more data centers within a region. +- Physically isolated (up to 100 km apart). +- Designed for **high availability and fault tolerance**. +##### **Edge Location** +- Physical sites dispersed across the globe. +- Part of Amazon’s CDN (content delivery network). +- Distributes services/data closer to users to reduce latency. +##### **Planning for Failure (Resiliency)** +- **Storage**: + * The S3 service is designed for failure. + * Each file is copied to multiple [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]s in the region, so you always have three copies of your file. + +- **Compute**: + - The owner is responsible for manually distributing resources across multiple [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]s. + - If one fails, the others still operate. + +- **Databases**: + - The owner can configure DB deployment in multiple [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]s for redundancy.
+ +##### **Benefits of AWS Global Infrastructure** +- High performance +- Low latency +- High availability +- Scalability +- Unlimited capacity (horizontally scalable) +- Built-in security and monitoring +- Confidentiality +- Reliability +- Low cost +##### Shared Responsibility of Security +![[Screenshot 2025-07-23 at 14.20.31.png]] + +## AWS Core Services +##### Networking +* [[Amazon VPC]] +##### Security & Identity +- [[Amazon IAM]] +##### Compute +- [[Amazon EC2]] +- [[Amazon Lambda]] +##### Storage +- **Instance Store:** + - Specified by instance type. Data is stored on the same server as the [[Amazon EC2|EC2]] instance. It is removed when the instance is terminated. +- [[Amazon EBS]] +- [[Amazon S3]] +##### Databases +- Relational + - [[Amazon RDS]] + - Amazon Redshift + - Amazon Aurora + +- Non-Relational + - [[Amazon DynamoDB]] + - Amazon ElastiCache + - Amazon Neptune + +- Alternatively: + - You can install a DB of your choice on an [[Amazon EC2|EC2]] instance instead of using one provided by AWS. In that case, you take all responsibility for the security and management of your DB. + +## AWS Pricing Models +##### Principles: +- **Pay-as-you-go** (only pay for usage) +- **Reserved pricing** (discounted with commitment) +- **Volume discount** (pay less when you use more) +##### Free Tier Options: +- **Always free** (e.g., 1M free Lambda calls) +- **12-months free** (introductory offer) +- **Trial services** + +### **Billing Examples:** +- [[Amazon EC2|EC2]]: Pay for runtime only.
+ +- [[Amazon S3|S3]]: Pay for + - Storage volume + - Requests (PUT/GET) + - Data transfer + +- [[Amazon Lambda|Lambda]]: Pay for + - Number of requests + - Execution time + diff --git a/content/BigData/AWS/Amazon EBS.md b/content/BigData/AWS/Amazon EBS.md new file mode 100644 index 000000000..733b74109 --- /dev/null +++ b/content/BigData/AWS/Amazon EBS.md @@ -0,0 +1,10 @@ +--- +aliases: + - EBS +--- +> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]] +##### **Amazon EBS (Elastic Block Store)** +Extra storage attached to an [[Amazon EC2|EC2]] instance, separate from the instance store. +- Persistent and can be attached to any [[Amazon EC2|EC2]] instance in the [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]. +- It is not deleted when the [[Amazon EC2|EC2]] instance is terminated. + diff --git a/content/BigData/AWS/Amazon EC2.md b/content/BigData/AWS/Amazon EC2.md new file mode 100644 index 000000000..09089f99f --- /dev/null +++ b/content/BigData/AWS/Amazon EC2.md @@ -0,0 +1,43 @@ +--- +aliases: + - EC2 +--- +> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]] +##### **Amazon EC2 (Elastic Compute Cloud)** +A web service that provides secure, resizable compute capacity in +the cloud. +* Designed to make web-scale cloud computing easier for developers. +- Secure, resizable compute capacity (virtual servers). +- Complete control over OS and apps. + +**Pricing Models**: +- **On-Demand**: Pay for what you use. +- **Spot Instances**: Cheap, temporary, short-lived instances; save up to 90%. +- **Reserved**: Lower price for long-term usage.
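The three pricing models above trade flexibility for price, and the break-even point depends on how many hours the instance actually runs. A back-of-the-envelope sketch (the hourly prices below are invented placeholders, not real AWS rates):

```python
# Toy comparison of the three EC2 pricing models for one month.
# All hourly prices are illustrative placeholders, NOT real AWS rates.

ON_DEMAND_PER_HOUR = 0.10   # pay only while the instance runs
SPOT_PER_HOUR = 0.03        # much cheaper, but can be interrupted
RESERVED_PER_HOUR = 0.06    # discounted, but you commit to the full term

def monthly_cost(hours_running, hours_in_month=730):
    """Cost of each model for a given number of hours actually used."""
    return {
        "on_demand": hours_running * ON_DEMAND_PER_HOUR,
        "spot": hours_running * SPOT_PER_HOUR,
        # Reserved is billed for the whole term whether you run or not:
        "reserved": hours_in_month * RESERVED_PER_HOUR,
    }

print(monthly_cost(200))  # light usage
print(monthly_cost(730))  # running 24/7
```

With light usage, on-demand beats the reserved commitment; once the instance runs close to 24/7, the reserved discount wins, which is why reserved pricing suits long-term usage.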
+ +**Features:** +- **Amazon Machine Image (AMI):** + - Preconfigured OS image + - e.g., Linux, macOS, Windows +- **Instance type**: + - Defines CPU, memory, storage, networking capacity +- **Networking**: + - [[Amazon VPC|VPC]] and subnets +- **Storage** +- **Security Group(s)**: + - Acts like a firewall; defines access to and from the EC2 instance +- **Key pair**: + - Establishes a remote connection (secure SSH access) + +- **Instance Type**: + - Defines CPU, memory, storage, and network performance. + ![[Screenshot 2025-07-23 at 16.52.08.png]] +- **Instance Families** + | Family Type | Use Case
| +| --------------------------------- | ------------------------- | +| General Purpose (M / T / A) | Web servers | +| Compute Optimized (C) | Analytics, gaming | +| Memory Optimized (R / X) | High-performance DB | +| Accelerated Computing (P / G / F) | AI, ML, GPU compute | +| Storage Optimized (I) | Big data, NoSQL databases | diff --git a/content/BigData/AWS/Amazon IAM.md b/content/BigData/AWS/Amazon IAM.md new file mode 100644 index 000000000..5e0eaf863 --- /dev/null +++ b/content/BigData/AWS/Amazon IAM.md @@ -0,0 +1,13 @@ +--- +aliases: + - IAM +--- +> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]] +##### **Amazon IAM (Identity and Access Management)** +- Manages user access to services. +- Attach permission policies to identities to control which actions an identity can perform. + - Identities in Amazon IAM are ***users***, ***groups*** and ***roles***. +- Based on the ***least privilege*** principle. + * A user or entity should only have access to the specific data, resources, and applications that it has been explicitly granted. +* Example usage: + * Grant cross-account permissions to upload objects while ensuring that the bucket owner has full control. \ No newline at end of file diff --git a/content/BigData/AWS/Amazon Lambda.md b/content/BigData/AWS/Amazon Lambda.md new file mode 100644 index 000000000..d0bcbe571 --- /dev/null +++ b/content/BigData/AWS/Amazon Lambda.md @@ -0,0 +1,18 @@ +--- +aliases: + - Lambda +--- +> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]] +##### **AWS Lambda (a serverless compute service)** +- Run backend code without provisioning servers. +- Event-driven: triggered by events (e.g., file upload). +- Languages: Python, Node.js, Java, C#, Go, Ruby. +- Automatically scales with demand. + +**Workflow** +![[Screenshot 2025-07-23 at 17.51.04.png|600]] + +**Example Use Case:** +- You can configure a Lambda function to perform an action when an event occurs.
For example, when an image is stored in Bucket-A, an event invokes the Lambda function to convert the image to a new format and store it in Bucket-B. + ![[Screenshot 2025-07-23 at 17.52.39.png|400]] + diff --git a/content/BigData/AWS/Amazon RDS.md b/content/BigData/AWS/Amazon RDS.md new file mode 100644 index 000000000..362073a9a --- /dev/null +++ b/content/BigData/AWS/Amazon RDS.md @@ -0,0 +1,14 @@ +--- +aliases: + - RDS + - AWS RDS +--- + +> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]] +##### **Amazon RDS (Relational Database Service)** +A managed, distributed, cloud-based relational database service. +- Managed relational DBs (e.g., MySQL, PostgreSQL, Oracle). +- AWS handles backups, patching, and scaling. +- You can build a fault-tolerant DB by configuring RDS for **Multi-[[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]** deployment: + place your master RDS instance in one [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]] and a standby replica in another [[AWS Cloud Services#**Availability Zone (AZ)**|AZ]]. +![[Screenshot 2025-07-23 at 17.48.07.png|600]] \ No newline at end of file diff --git a/content/BigData/AWS/Amazon S3.md b/content/BigData/AWS/Amazon S3.md new file mode 100644 index 000000000..10d618f8b --- /dev/null +++ b/content/BigData/AWS/Amazon S3.md @@ -0,0 +1,20 @@ +--- +aliases: + - S3 +--- + +> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]] +##### Amazon S3 (Simple Storage Service) +An object storage service that offers industry-leading scalability, data availability, security, and performance. +- Object-based storage. +- Designed for 99.999999999% (11 9’s) of durability.
+- Ideal for: + - Media storage + - Backups / archives + - Data lakes + - ML and analytics + +**Components:** +- **Bucket**: A container to store an unlimited number of objects +- **Object**: The actual entities stored in the buckets +- **Key**: Unique identifier for the object \ No newline at end of file diff --git a/content/BigData/AWS/Amazon VPC.md b/content/BigData/AWS/Amazon VPC.md new file mode 100644 index 000000000..4f0f64a95 --- /dev/null +++ b/content/BigData/AWS/Amazon VPC.md @@ -0,0 +1,12 @@ +--- +aliases: + - VPC +--- + +> Part of [[AWS Cloud Services#AWS Core Services|AWS Core Services]] +##### **Amazon VPC (Virtual Private Cloud)** +A logically **isolated network** within AWS for your resources. +- Create a ***public-facing*** subnet for your web servers which have access to the internet. +- Create a ***private-facing*** subnet with no internet access for your backend systems + - e.g., databases, application servers +- Enables fine-grained control over traffic with both a public and a private subnet. \ No newline at end of file diff --git a/content/BigData/Big Data Intro.md b/content/BigData/Big Data Intro.md new file mode 100644 index 000000000..0f16cd97b --- /dev/null +++ b/content/BigData/Big Data Intro.md @@ -0,0 +1,41 @@ +##### data vs. information +- **Data** is just raw facts (like the number 42). + - But 42 could mean: age, shoe size, stock amount, etc. + +- **Information** is when you give meaning to the data. + - Example: “Age = 42” gives context and becomes useful. + +##### Big Data implementations +- **Delta** – *Sentiment analysis* (e.g., of customer feedback). +- **Netflix** – *User Behavioral Analysis* (e.g., what you watch and when). +- **Time Warner** – *Customer segmentation* (dividing customers into groups). +- **Volkswagen** – *Predictive support* (e.g., predict car issues). +- **Visa** – *Fraud detection*. +- **Chinese government** – *Security Intelligence* (National security).
+- **Weather forecasting** – *Weather prediction models*. +- **Hospitals** – Diagnosing diseases using *machine learning* on images. +- **Amazon** – *Price optimization*. +- **Facebook** – Targeted advertising using *user profiling*. +##### Design Principles for Big Data +1. **Horizontal Growth** – Add more machines instead of stronger ones. +2. **Distributed Processing** – Split work across machines. +3. **Process where Data is** – Don’t move data, move the code. +4. **Simplicity of Code** – Keep logic understandable. +5. **Recover from Failures** – Systems should self-heal. +6. **Idempotency** – Running the same job twice shouldn’t break results. + +##### Big Data SLA (Service Level Agreement) +Defines performance expectations: + +- **Reliability** – Will the data be there? +- **Consistency** – Is the data accurate across systems? +- **Availability** – Is the system always accessible? +- **Freshness** – How up-to-date is the data? +- **Response time** – How fast do queries return? + +* Other concerns: + - **Cost** + - **Scalability** + - **Performance** + +> Next [[Cloud Services]] \ No newline at end of file diff --git a/content/BigData/Big Data.md b/content/BigData/Big Data.md new file mode 100644 index 000000000..8f3ba3c3d --- /dev/null +++ b/content/BigData/Big Data.md @@ -0,0 +1,15 @@ +>“An accumulation of data +>that is too large and complex +>for processing by traditional +>database management tools” +> +>**In Short:** +>Big Data = too big for standard tools like Excel or regular SQL databases. + +[[Big Data Intro]] +[[Cloud Services]] + [[AWS Cloud Services]] +[[Database Overview]] + [[RDBMS]] + [[Hadoop]] + [[Hadoop Eccosystem]] diff --git a/content/BigData/Cloud Computing.md b/content/BigData/Cloud Computing.md new file mode 100644 index 000000000..c561cce0a --- /dev/null +++ b/content/BigData/Cloud Computing.md @@ -0,0 +1,29 @@ +##### Benefits of Cloud Computing +- **Elasticity**: Start small and scale as needed.
+- **Cost-efficiency**: No need to spend money on data centers. +- **No capacity guessing**: Scale automatically based on demand. +- **Economies of scale**: Benefit from AWS’s vast infrastructure. +- **Agility**: Deploy resources quickly. +- **Global Reach**: Go international within minutes. +##### Deployment Models in the Cloud +- **IaaS** (Infrastructure as a Service): + - Virtual machines, storage, networks + - e.g., Amazon EC2. + +- **PaaS** (Platform as a Service): + - Managed environments for building apps + - e.g., AWS Elastic Beanstalk. + +- **[[Cloud Services#Selling Your Service |SaaS]]** (Software as a Service): + - Full applications delivered over the internet + - e.g., Gmail. + +##### Deployment Strategies of Cloud Computing +- **On-Premises (Private Cloud)**: Owned and operated on-site. +- **Public Cloud**: Fully hosted on cloud provider infrastructure. +- **Hybrid Cloud**: Combines on-premises and cloud resources. + +##### Cloud Providers Comparison +![[Screenshot 2025-07-23 at 13.54.07.png | 600]] + +> Next [[AWS Cloud Services]] \ No newline at end of file diff --git a/content/BigData/Cloud Services.md b/content/BigData/Cloud Services.md new file mode 100644 index 000000000..4ee541da2 --- /dev/null +++ b/content/BigData/Cloud Services.md @@ -0,0 +1,58 @@ +Introduction to cloud computing concepts relevant for Big Data. +##### traditional software deployment process: +1. **Coding** +2. **Compiling** – turning source code into executable files. +3. **Installing** – putting the software on computers. + +##### Clustered Software +Introduces three related architectures: + +1. **Redundant Servers** – multiple servers running the same service for fault-tolerance. + - E.g., several identical web servers. + +2. **Micro-services** – the system is broken into **small, independent services** that communicate with each other. + - Each handles a specific function. + +3. 
**Clustered Computing** – a large task is **split into sub-tasks** running on **multiple nodes**. + - Used in Big Data systems like **NoSQL databases**. + +##### Scaling a Software System +Two ways to handle growing demand: +- **Scale Up**: Make one machine stronger + - When running out of resources we can add: *Memory*, *CPU*, *Disk*, *Network Bandwidth* + - Can become expensive or reach hardware limits. + +- **Scale Out**: Add more machines to share the work. + - Add **redundant servers** or use **cluster computing**. + - Each server can be **standalone** (like a web server), or part of a **coordinated system** (like a NoSQL cluster). + - More fault-tolerant and scalable than vertical scaling. + +- Tradeoff: + - **Scale-up** is simpler but has limits. + - **Scale-out** is more flexible and resilient but more complex. + +##### Selling Your Service +- **Install** - Software as installation + - e.g., Microsoft's Office package + +- **SaaS** - Software as a Service + - No need to install, just log in and use. + - e.g., Google Docs, Zoom, Dropbox. + +- Common SaaS pricing models: +1. **Per-user** – Pay per person. +2. **Tiered** – Fixed price for different feature levels. +3. **Usage-based** – Pay for what you use (e.g., storage, API calls). + +##### Deployment Models +Where you run your software: +- **On-Premises**: Your own machines or rented servers (or VMs). +- **Cloud**: Run on virtual machines (VMs) from a cloud provider (e.g., AWS, Azure, GCP). + +##### Cloud Deployment Options +When deploying to the cloud, you have options: +1. **Vanilla Node**: Raw VM – you install everything. +2. **Cloud VM**: VM with pre-installed software. +3. **Managed Service**: Cloud provider handles setup, scaling, updates (e.g., [[Amazon RDS|AWS RDS]], Google BigQuery).
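The three common SaaS pricing models listed above can be sketched as tiny functions; every price and tier threshold here is made up purely for illustration:

```python
# Sketch of the three common SaaS pricing models.
# All prices and tier thresholds are invented for illustration.

def per_user(users, price_per_user=9.0):
    """Per-user: pay per person."""
    return users * price_per_user

def tiered(users):
    """Tiered: fixed price per feature level (thresholds are made up)."""
    if users <= 10:
        return 50.0    # hypothetical "basic" tier
    if users <= 100:
        return 300.0   # hypothetical "team" tier
    return 1000.0      # hypothetical "enterprise" tier

def usage_based(api_calls, price_per_1k_calls=0.05):
    """Usage-based: pay for what you use (here, API calls)."""
    return api_calls / 1000 * price_per_1k_calls

print(per_user(25), tiered(25), usage_based(2_000_000))
```

The same 25-person team lands on a different bill under each model, which is why vendors mix them (e.g., a tiered base price plus usage-based overage).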
+ +> Next [[Cloud Computing]] \ No newline at end of file diff --git a/content/BigData/Database History.md b/content/BigData/Database History.md new file mode 100644 index 000000000..a4fd84d4d --- /dev/null +++ b/content/BigData/Database History.md @@ -0,0 +1,24 @@ + +1. **Punch cards** – physical cards with holes. Early computers read data this way.![[Screenshot 2025-07-23 at 12.08.22.png | 400]] + +2. **Magnetic media**: + - First: **Floppy disks** + ![[Screenshot 2025-07-23 at 12.08.48.png | 400]] + - Then: **Hard disks** (faster, more storage) + ![[Screenshot 2025-07-23 at 12.09.30.png]] +3. **1960s**: First **Database Management Systems (DBMSs)** created: + - Charles W. Bachman developed the **Integrated Database System** + - IBM developed **IMS** +4. **1970s**: + - IBM created **SQL** (Structured Query Language) + - Modern relational databases (RDBMS) were born +5. **20th century:** Many RDBMSs + - Oracle, Microsoft's SQL Server, IBM's DB2, MySQL, Sybase... + ![[Screenshot 2025-07-23 at 12.10.02.png]] + + +##### Hadoop history +![[Screenshot 2025-07-23 at 12.16.32.png | 200]] +2005 - Started by ***Doug Cutting*** at Yahoo! +[[Hadoop]] is an [[Open Source]] Apache project +Benefits: free, flexible, community-supported. \ No newline at end of file diff --git a/content/BigData/Database Overview.md b/content/BigData/Database Overview.md new file mode 100644 index 000000000..990198597 --- /dev/null +++ b/content/BigData/Database Overview.md @@ -0,0 +1,22 @@ + +[[Database History]] +[[RDBMS]] - Relational Models +[[Hadoop]] +##### **Big Data Challenges** +Examples of tasks that are hard with large datasets: +1. Count the **most frequent words** in Wikipedia. +2. Find the **hottest November** per country from weather data. +3. Find the **day with most critical errors** in company logs. + +These problems require: +- **Huge data** +- **Efficient distributed computing** + +#### [[RDBMS]] vs.
[[Hadoop]] + +| **Feature** | **RDBMS** | **Hadoop** | +| -------------- | ------------------- | ----------------------------- | +| Data structure | Structured (tables) | Any (structured/unstructured) | +| Scalability | Limited | Highly scalable | +| Speed | Fast (small data) | Designed for huge data | +| Access | SQL | Code (e.g., Java, Python) | diff --git a/content/BigData/Hadoop/Apache Hive.md b/content/BigData/Hadoop/Apache Hive.md new file mode 100644 index 000000000..7d6b5f0ff --- /dev/null +++ b/content/BigData/Hadoop/Apache Hive.md @@ -0,0 +1,59 @@ +--- +aliases: + - Hive +--- +> [[Hadoop Eccosystem|Systems based on MapReduce]] + +### Apache Hive +##### **Key Features** +- Developed by **Apache**. +- General SQL-like syntax for querying [[HDFS]] or other large databases +- Translates SQL queries into one or more [[MapReduce]] jobs. +- Maps data in [[HDFS]] into virtual [[RDBMS]]-like tables. +- **Pro**: + - Convenient for **data analysts** who use SQL. +* **Con**: + * Quite slow response times + +##### **Hive Data Model** +**Structure** +- **Physical**: Data stored in [[HDFS]] blocks across nodes. +- **Virtual Table**: Defined with a schema using metadata. +- **Partitions**: Logical splits of data to speed up queries. + +**Metadata** +- Hive stores metadata in a DB +- Maps physical files to tables. +- Maps fields (columns) to line structures in raw data.
+ +![[Screenshot 2025-07-23 at 18.25.32.png]] + +**Hive Architecture** +![[Screenshot 2025-07-23 at 18.27.30.png|]] + +##### Hive Usage +``` +# Start a hive shell: +$ hive + +# Create a hive table: +hive> CREATE TABLE mta (id BIGINT, name STRING, startdate TIMESTAMP, email STRING); + +# Show all tables: +hive> SHOW TABLES; + +# Add a new column to the table: +hive> ALTER TABLE mta ADD COLUMNS (description STRING); + +# Load an HDFS data file into the table: +hive> LOAD DATA INPATH '/home/hadoop/mta_users' OVERWRITE INTO TABLE mta; + +# Query employees that have worked more than a year: +hive> SELECT name FROM mta WHERE (unix_timestamp() - startdate > 365 * 24 * 60 * 60); + +# Execute a command without the shell: +$ hive -e 'SELECT name FROM mta;' + +# Execute a script from a file: +$ hive -f hive_script.txt +``` diff --git a/content/BigData/Hadoop/Apache Spark.md b/content/BigData/Hadoop/Apache Spark.md new file mode 100644 index 000000000..f3188d9c0 --- /dev/null +++ b/content/BigData/Hadoop/Apache Spark.md @@ -0,0 +1,68 @@ +--- +aliases: + - Spark +--- +> [[Hadoop Eccosystem|Systems based on MapReduce]] + +## Apache Spark +> Apache Spark is a **fast**, **general-purpose**, **open-source** cluster computing system designed for large-scale data processing. + +##### Key Characteristics: +- **Unified analytics engine** – supports batch, streaming, SQL, machine learning, and graph processing. +- **In-memory computation** – stores intermediate results in RAM (vs. Hadoop, which writes to disk). +- **Fault tolerant** and scalable.
+ +##### Benefits of Spark Over [[Hadoop]] [[MapReduce]] + +| **Feature** | **Spark** | **Hadoop MapReduce** | +| ------------------- | ------------------------------------------------------------------------- | -------------------------------------- | +| **Performance** | Up to **100x faster** (in-memory operations) | Disk-based, slower | +| **Ease of use** | High-level APIs in Python, Java, Scala, R | Java-based, verbose programming | +| **Generality** | Unified engine for batch, stream, ML, graph | Focused on batch processing | +| **Fault tolerance** | Efficient recovery via lineage | Slower fault recovery via re-execution | +| **Runs Everywhere** | Runs on [[Hadoop]], Apache Mesos, Kubernetes, Standalone or in the cloud. | | + +##### How is Spark Fault Tolerant? +> Resilient Distributed Datasets ([[RDD]]s) + +- Restricted form of distributed shared memory +- Immutable, partitioned collections of records +- Recompute lost partitions on failure +- No cost if nothing fails + +![[Screenshot 2025-07-23 at 19.17.31.png|500]] + +- **Lineage Graph** + - Each [[RDD]] keeps track of how it was derived. If a node fails, Spark **recomputes only the lost partition** from the original transformations. + +##### Writing Spark Code in Python +``` +# Spark Context Initialization +from pyspark import SparkConf, SparkContext + +conf = SparkConf().setAppName("MyApp").setMaster("local") +sc = SparkContext(conf=conf) + +# Create RDDs: +# 1. From a Python list +data = [1, 2, 3, 4, 5] +distData = sc.parallelize(data) + +# 2. From a file +distFile = sc.textFile("data.txt") +distFile = sc.textFile("folder/*.txt") +``` +##### **RDD Transformations (Lazy)** +These create a new RDD from an existing one. 
+ +| Transformation | Description | +| ----------------- | -------------------------------------------- | +| map(func) | Apply function to each element | +| filter(func) | Keep elements where func returns True | +| flatMap(func) | Like map, but flattens results | +| union(otherRDD) | Union of two RDDs | +| distinct() | Remove duplicates | +| reduceByKey(func) | Combine values for each key (key-value RDDs) | +| sortByKey() | Sort by keys | +| join(otherRDD) | Join two key-value RDDs | +| repartition(n) | Re-distribute RDD to n partitions | +Transformations are **lazy** – they only execute when an action is triggered. \ No newline at end of file diff --git a/content/BigData/Hadoop/Google Dremel.md b/content/BigData/Hadoop/Google Dremel.md new file mode 100644 index 000000000..a30cbba30 --- /dev/null +++ b/content/BigData/Hadoop/Google Dremel.md @@ -0,0 +1,27 @@ +> [[Hadoop Eccosystem|Systems based on MapReduce]] + +**Key Ideas** +• Leverages columnar file format +• Optimized for SQL performance + +**Concepts** +- Tree-based **query execution**. +- Efficient scanning and aggregation of **nested columnar data**.
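The column-oriented layout that makes this scanning efficient can be mimicked in a few lines of plain Python (a toy sketch, not Dremel itself): aggregating one column only touches that column's contiguous values.

```python
# The same 3-column table stored two ways.
rows = [("a", 1, 10.0), ("b", 2, 20.0), ("c", 3, 30.0)]

# Row-oriented: the values of one record sit together
# (good for reading whole records).
row_store = [value for record in rows for value in record]

# Column-oriented: the values of one column sit together
# (good for scans and aggregation).
col_store = {
    "name":  [r[0] for r in rows],
    "count": [r[1] for r in rows],
    "price": [r[2] for r in rows],
}

# Summing one column touches only that column's values,
# never the unrelated "name" and "count" data:
print(sum(col_store["price"]))
```

In a real columnar file format this locality translates into reading fewer disk blocks and better compression, since values of the same type and column are stored adjacently.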
+### Columnar data format +> Illustration of what columnar storage is all about: +> given 3 columns: +![[Screenshot 2025-07-23 at 18.42.46.png|170]] +> In row-oriented storage, the data is laid out one row at a time, as follows: +![[Screenshot 2025-07-23 at 18.45.25.png|500]] +> Whereas in column-oriented storage, it is laid out one column at a time: +![[Screenshot 2025-07-23 at 18.46.55.png|500]] + +**Nested data in columnar format** +![[Screenshot 2025-07-23 at 18.50.10.png]]![[Screenshot 2025-07-23 at 18.50.16.png]] + +### Frameworks inspired by Google Dremel +• Apache Drill (MapR) +• Apache Impala (Cloudera) +• Apache Tez (Hortonworks) +• Presto (Facebook) + diff --git a/content/BigData/Hadoop/HDFS.md b/content/BigData/Hadoop/HDFS.md new file mode 100644 index 000000000..276a5029a --- /dev/null +++ b/content/BigData/Hadoop/HDFS.md @@ -0,0 +1,64 @@ +##### HDFS ([[Hadoop]] Distributed File System) +Stores huge files (typical file size: GB–TB) across multiple machines. +- Breaks files into **blocks** (typically 128 MB). +- **Replicates** blocks (default 3 copies) for fault tolerance. +- Accessed through a POSIX-like API. + +##### HDFS design principles +* **Immutable**: **write-once, read-many** +* **No Failures**: Disk or node failure does not affect the file system +* **File Size Unlimited**: Up to 512 yottabytes (2^63 X 64MB) +* **File Num Limited**: 1048576 files in a directory +* **Prefer bigger files**: Big files provide better performance + +##### HDFS File Formats +- Text/CSV - No schema, no metadata +- JSON Records - metadata is stored with data +- Avro Files - schema independent of data +- Sequence Files - binary files (used as intermediate storage in M/R) +- RC Files - Record Columnar files +- ORC Files - Optimized RC files.
Compress better +- Parquet Files - Yet another RC file + +##### HDFS Command Line +``` +# List files +hadoop fs -ls /path + +# Make directory +hadoop fs -mkdir /user/hadoop + +# Print file +hadoop fs -cat /file + +# Upload file +hadoop fs -copyFromLocal file.txt hdfs://... +``` + +#### HDFS Architecture – Main Components +##### **1.** NameNode (Master Node) +- **Stores metadata** about the filesystem: + - Filenames + - Directory structure + - Block locations + - Permissions + +- It **does not store the actual data**. +- There is **one active NameNode** per cluster. + +##### **2.** DataNodes (Worker Nodes) +- Store the **actual data blocks** of files. +- Send **heartbeat** messages to the NameNode to report that they are alive. +- When a file is written, it’s split into blocks and distributed across many DataNodes. +- DataNodes also **replicate** blocks (typically 3 copies) to provide **fault tolerance**. + +#### File Read / Write +**When a file is written:** +1. The client contacts the **NameNode** to ask: “Where should I write the blocks?” +2. The NameNode responds with a list of **DataNodes** to use. +3. The client sends the blocks of the file to those DataNodes. +4. Blocks are **replicated** automatically across different nodes for redundancy. + +**When a file is read:** +1. The client contacts the **NameNode** to get the list of DataNodes storing the required blocks. +2. The client reads the blocks **directly** from the DataNodes. \ No newline at end of file diff --git a/content/BigData/Hadoop/Hadoop Eccosystem.md b/content/BigData/Hadoop/Hadoop Eccosystem.md new file mode 100644 index 000000000..87ee684a8 --- /dev/null +++ b/content/BigData/Hadoop/Hadoop Eccosystem.md @@ -0,0 +1,17 @@ +### Systems based on [[MapReduce]] +> Early generation frameworks for big data processing. +* [[Apache Hive]] + +### Systems that replace MapReduce +> newer, faster frameworks with different architectures and performance improvements. 
+ +**Motivation**: [[MapReduce]] and [[Apache Hive|Hive]] are too slow! +- [[Google Dremel]] +- [[Apache Spark]] + - Replaces MapReduce with its own engine that works much faster without compromising consistency + - Architecture not based on MapReduce but rather on two concepts: + - RDD (Resilient Distributed Dataset) + - DAG (Directed Acyclic Graph) + - Pros: + - Works much faster than MapReduce + - Fast-growing community. diff --git a/content/BigData/Hadoop/Hadoop.md b/content/BigData/Hadoop/Hadoop.md new file mode 100644 index 000000000..7c76dda23 --- /dev/null +++ b/content/BigData/Hadoop/Hadoop.md @@ -0,0 +1,13 @@ +![[Screenshot 2025-07-23 at 12.20.09.png | 400]] +> Hadoop is an **[[Open Source]] framework** for: +> - **Distributed storage** (across many machines) +> - **Distributed processing** (run programs on many machines in parallel) +> +> > It is **not a database** — it is an ecosystem for managing and analyzing **Big Data**. +## **Hadoop Components Overview** +![[Screenshot 2025-07-23 at 11.58.48.png ]] +> 1. [[HDFS]] +> 2. [[MapReduce]] +> 3. [[Yarn]] + +[[Hadoop Eccosystem]] diff --git a/content/BigData/Hadoop/MapReduce.md b/content/BigData/Hadoop/MapReduce.md new file mode 100644 index 000000000..5ad0bf7a5 --- /dev/null +++ b/content/BigData/Hadoop/MapReduce.md @@ -0,0 +1,32 @@ +A programming model for processing big data in parallel. +- Distributed processing - Job is run in parallel on several nodes +- Run the process where the data is! +- Horizontal Scalability + +- **Map** step: transform input + - Transform, Filter, Calculate + - Local data + - e.g., count 1 per word + +- **Combine** step: Reorganization of map output. + - Shuffle, Sort, Group + +- **Reduce** step: Aggregate / Sum the groups + - e.g., sum word counts + +MapReduce **runs code where the data is**, saving data transfer time.
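The map → shuffle → reduce steps can be simulated single-process in plain Python (a sketch of the programming model, not of Hadoop's distributed runtime), using the cookie sentence from this note's example:

```python
from collections import defaultdict

text = ("how many cookies could a good cook cook "
        "if a good cook could cook cookies")

# Map: emit a ("word", 1) pair for every word.
mapped = [(word, 1) for word in text.split()]

# Shuffle: group the pairs by key (word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts within each group.
counts = {word: sum(values) for word, values in groups.items()}
print(counts["cook"])
```

In real MapReduce the mapped pairs are produced on many nodes in parallel, the shuffle moves each key's pairs to one reducer over the network, and the reducers sum their groups independently.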
+ +![[Screenshot 2025-07-23 at 13.00.20.png]] +##### Example: +From the sentence: +> “how many cookies could a good cook cook if a good cook could cook cookies” + +Steps: +1. **Map**: + - Each word becomes a pair like ("cook", 1) +2. **Shuffle**: + - Group by word +3. **Reduce**: + - Add up counts → ("cook", 4) + +![[Screenshot 2025-07-23 at 13.01.20.png]] \ No newline at end of file diff --git a/content/BigData/Hadoop/RDD.md b/content/BigData/Hadoop/RDD.md new file mode 100644 index 000000000..875730013 --- /dev/null +++ b/content/BigData/Hadoop/RDD.md @@ -0,0 +1,20 @@ +## RDD (Resilient Distributed Dataset) +>RDD is an immutable (read only) distributed collection of objects. +> +>Dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster + +![[Screenshot 2025-07-23 at 19.08.40.png|600]] +##### **Key Properties:** +- Distributed: Automatically split across cluster nodes. +- Lazy Evaluation: Transformations aren’t executed until an action is called. +- Fault-tolerant: Can **recompute lost partitions** using lineage graph. +- Parallel: Operates concurrently across cluster cores. +##### Data Sharing +> In [[Hadoop]] [[MapReduce]] +![[Screenshot 2025-07-23 at 19.11.44.png|500]] + +> In [[Apache Spark|Spark]] +![[Screenshot 2025-07-23 at 19.12.57.png|500]] +>10-100x faster than network and disk! + + diff --git a/content/BigData/Hadoop/Yarn.md b/content/BigData/Hadoop/Yarn.md new file mode 100644 index 000000000..ce52b8529 --- /dev/null +++ b/content/BigData/Hadoop/Yarn.md @@ -0,0 +1,35 @@ +**YARN (Yet Another Resource Negotiator)** +is [[Hadoop]]’s cluster resource management system +- Multiple jobs running simultaneously +- Multiple jobs use same resources (disk, CPU, memory) +- Assign resources to jobs and tasks exclusively + +##### YARN is in charge of: +1. Allocates Resources +2. 
Schedules Jobs + - Allocates priorities to jobs via policies: + FIFO scheduler, Fair scheduler, Capacity scheduler + +##### Components: +- **ResourceManager** + - Oversees resource allocation across the cluster + +- **NodeManager** + - Each node in the cluster runs a NodeManager. + - This component manages the execution of containers on its node. + +- **ApplicationMaster** + - Manages the lifecycle of applications. + - Handles job scheduling and monitors progress. + +- **Resource Container** + - A logical bundle of resources (e.g., CPU, memory) that is allocated by the ResourceManager + +![[Screenshot 2025-07-23 at 13.29.37.png]] + +##### YARN ecosystem +YARN can run other applications besides Hadoop [[MapReduce]] that integrate into the Hadoop ecosystem: +• Apache Storm (Data Streaming engine) +• [[Apache Spark]] (Data Batch and streaming engine) +• Apache Solr (Search platform) \ No newline at end of file diff --git a/content/BigData/Open Source.md b/content/BigData/Open Source.md new file mode 100644 index 000000000..964268e5a --- /dev/null +++ b/content/BigData/Open Source.md @@ -0,0 +1,13 @@ +• Source Code available +• Free Redistribution +• Derived Works + +![[Screenshot 2025-07-23 at 12.24.23.png]] + +Open source replaces closed source +![[Screenshot 2025-07-23 at 12.25.00.png]] + +More Open-source solutions +![[Screenshot 2025-07-23 at 12.25.28.png]] + +![[Screenshot 2025-07-23 at 12.27.11.png]]![[Screenshot 2025-07-23 at 12.27.39.png]] \ No newline at end of file diff --git a/content/BigData/RDBMS.md b/content/BigData/RDBMS.md new file mode 100644 index 000000000..9929e2fc4 --- /dev/null +++ b/content/BigData/RDBMS.md @@ -0,0 +1,63 @@ +[[Database Overview]] +##### What is an RDBMS?
+**Relational Database Management System**:
+- Data is stored in **tables**:
+    - **Rows** = records
+    - **Columns** = fields
+
+- Each table has:
+    - **Indexes** for fast searching
+    - **Relationships** with other tables (via keys)
+
+##### Relational model - Keys and Indexes
+Keys and indexes make it possible to find records quickly
+- Operations become efficient:
+    - **Find by key** → O(log n)
+    - **Fetch record by ID** → O(1)
+
+Indexes = sorted references to data locations → like a book index.
+
+##### Relational model - Operations
+Relational databases support **CRUD**:
+- **C**reate
+- **R**ead
+- **U**pdate
+- **D**elete
+
+Each operation uses both:
+- The **index** (to locate data)
+- The **data** itself (to read/write)
+
+##### Relational model - Transactional
+Relational databases guarantee **transaction safety** with ACID:
+- **A**tomicity – all or nothing
+- **C**onsistency – valid data only
+- **I**solation – no interference from other transactions
+- **D**urability – survives crashes
+
+* Examples:
+    - Transferring money, posting a tweet
+    - Each must either **succeed completely** or **fail completely**.
+
+Transactions guarantee data validity despite errors & failures.
+
+##### Relational model - SQL
+**SQL** is the language used to talk to relational databases.
+- **S**tructured
+- **Q**uery
+- **L**anguage
+
+- All RDBMSs use it (MySQL, PostgreSQL, Oracle, etc.)
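The CRUD and ACID ideas above can be sketched with SQLite (bundled with Python's standard library); this is a minimal illustration, and the `accounts` table with its columns is a made-up example, not a standard schema:

```python
# Sketch of CRUD plus an ACID transaction using Python's built-in sqlite3
# module. The `accounts` table is a hypothetical example schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance INTEGER)")

# Create: insert two records
conn.executemany("INSERT INTO accounts (owner, balance) VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])

# Read: fetch a record by key (the primary-key index makes the lookup fast)
print(conn.execute("SELECT balance FROM accounts WHERE owner = 'alice'").fetchone())

# Update and Delete work the same way with UPDATE/DELETE statements.

# Transaction: transfer 30 from alice to bob. Atomicity means both
# UPDATEs commit together, or neither does (rollback on any exception).
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE owner = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE owner = 'bob'")

print(conn.execute("SELECT owner, balance FROM accounts ORDER BY owner").fetchall())
# → [('alice', 70), ('bob', 80)]
```

Here `with conn:` commits on success and rolls back if any statement raises, which is exactly the all-or-nothing behavior the Atomicity bullet describes.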
+ +#####  Pros and Cons of RDBMS +**Pros:** +- Structured data +- ACID transactions +- Powerful SQL +- Fast (for small/medium size) + +**Cons**: +- Doesn’t scale well (single machine or SPOF = Single Point of Failure) +- Becomes **slow** with **big data** +- **Less fault tolerant** +- Not designed for **massive, distributed systems** diff --git a/content/BigData/res/Pasted image 20250723182835.png b/content/BigData/res/Pasted image 20250723182835.png new file mode 100644 index 000000000..3acddd0bc Binary files /dev/null and b/content/BigData/res/Pasted image 20250723182835.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 11.58.31.png b/content/BigData/res/Screenshot 2025-07-23 at 11.58.31.png new file mode 100644 index 000000000..d3cd6b689 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 11.58.31.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 11.58.48.png b/content/BigData/res/Screenshot 2025-07-23 at 11.58.48.png new file mode 100644 index 000000000..3fbd4bdc7 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 11.58.48.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 12.08.22.png b/content/BigData/res/Screenshot 2025-07-23 at 12.08.22.png new file mode 100644 index 000000000..17a919e13 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 12.08.22.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 12.08.48.png b/content/BigData/res/Screenshot 2025-07-23 at 12.08.48.png new file mode 100644 index 000000000..09f100c2e Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 12.08.48.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 12.09.30.png b/content/BigData/res/Screenshot 2025-07-23 at 12.09.30.png new file mode 100644 index 000000000..1fec3979b Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 12.09.30.png differ diff --git a/content/BigData/res/Screenshot 
2025-07-23 at 12.10.02.png b/content/BigData/res/Screenshot 2025-07-23 at 12.10.02.png new file mode 100644 index 000000000..2f4852a6c Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 12.10.02.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 12.16.32.png b/content/BigData/res/Screenshot 2025-07-23 at 12.16.32.png new file mode 100644 index 000000000..19ce0ec03 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 12.16.32.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 12.20.09.png b/content/BigData/res/Screenshot 2025-07-23 at 12.20.09.png new file mode 100644 index 000000000..732dd4edb Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 12.20.09.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 12.24.23.png b/content/BigData/res/Screenshot 2025-07-23 at 12.24.23.png new file mode 100644 index 000000000..4a5d3c6ba Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 12.24.23.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 12.25.00.png b/content/BigData/res/Screenshot 2025-07-23 at 12.25.00.png new file mode 100644 index 000000000..335554769 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 12.25.00.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 12.25.28.png b/content/BigData/res/Screenshot 2025-07-23 at 12.25.28.png new file mode 100644 index 000000000..b207799d5 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 12.25.28.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 12.27.11.png b/content/BigData/res/Screenshot 2025-07-23 at 12.27.11.png new file mode 100644 index 000000000..36c57ebdc Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 12.27.11.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 12.27.39.png b/content/BigData/res/Screenshot 2025-07-23 at 12.27.39.png 
new file mode 100644 index 000000000..6f895961d Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 12.27.39.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 13.00.20.png b/content/BigData/res/Screenshot 2025-07-23 at 13.00.20.png new file mode 100644 index 000000000..a100f9aaf Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 13.00.20.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 13.01.20.png b/content/BigData/res/Screenshot 2025-07-23 at 13.01.20.png new file mode 100644 index 000000000..b0459096b Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 13.01.20.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 13.29.37.png b/content/BigData/res/Screenshot 2025-07-23 at 13.29.37.png new file mode 100644 index 000000000..9f139b430 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 13.29.37.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 13.54.07.png b/content/BigData/res/Screenshot 2025-07-23 at 13.54.07.png new file mode 100644 index 000000000..7ad78013a Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 13.54.07.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 14.20.31.png b/content/BigData/res/Screenshot 2025-07-23 at 14.20.31.png new file mode 100644 index 000000000..3420cf554 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 14.20.31.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 16.52.08.png b/content/BigData/res/Screenshot 2025-07-23 at 16.52.08.png new file mode 100644 index 000000000..e41e99249 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 16.52.08.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 17.48.07.png b/content/BigData/res/Screenshot 2025-07-23 at 17.48.07.png new file mode 100644 index 000000000..fef381224 Binary files /dev/null and 
b/content/BigData/res/Screenshot 2025-07-23 at 17.48.07.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 17.51.04.png b/content/BigData/res/Screenshot 2025-07-23 at 17.51.04.png new file mode 100644 index 000000000..204a4cfa7 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 17.51.04.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 17.52.39.png b/content/BigData/res/Screenshot 2025-07-23 at 17.52.39.png new file mode 100644 index 000000000..beacd3ef3 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 17.52.39.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 18.25.32.png b/content/BigData/res/Screenshot 2025-07-23 at 18.25.32.png new file mode 100644 index 000000000..b080f36c2 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 18.25.32.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 18.27.30.png b/content/BigData/res/Screenshot 2025-07-23 at 18.27.30.png new file mode 100644 index 000000000..1556a0899 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 18.27.30.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 18.42.46.png b/content/BigData/res/Screenshot 2025-07-23 at 18.42.46.png new file mode 100644 index 000000000..0d03e111c Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 18.42.46.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 18.45.25.png b/content/BigData/res/Screenshot 2025-07-23 at 18.45.25.png new file mode 100644 index 000000000..96d780a21 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 18.45.25.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 18.46.55.png b/content/BigData/res/Screenshot 2025-07-23 at 18.46.55.png new file mode 100644 index 000000000..e23786c2e Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 18.46.55.png differ diff --git 
a/content/BigData/res/Screenshot 2025-07-23 at 18.50.10.png b/content/BigData/res/Screenshot 2025-07-23 at 18.50.10.png new file mode 100644 index 000000000..21af0b8e9 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 18.50.10.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 18.50.16.png b/content/BigData/res/Screenshot 2025-07-23 at 18.50.16.png new file mode 100644 index 000000000..fd722273b Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 18.50.16.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 19.08.40.png b/content/BigData/res/Screenshot 2025-07-23 at 19.08.40.png new file mode 100644 index 000000000..3ca538126 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 19.08.40.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 19.11.44.png b/content/BigData/res/Screenshot 2025-07-23 at 19.11.44.png new file mode 100644 index 000000000..197db8170 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 19.11.44.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 19.12.57.png b/content/BigData/res/Screenshot 2025-07-23 at 19.12.57.png new file mode 100644 index 000000000..eade485b0 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 19.12.57.png differ diff --git a/content/BigData/res/Screenshot 2025-07-23 at 19.17.31.png b/content/BigData/res/Screenshot 2025-07-23 at 19.17.31.png new file mode 100644 index 000000000..d21530a51 Binary files /dev/null and b/content/BigData/res/Screenshot 2025-07-23 at 19.17.31.png differ