mirror of
https://github.com/jackyzha0/quartz.git
synced 2025-12-24 05:14:06 -06:00
64 lines
2.3 KiB
Markdown
64 lines
2.3 KiB
Markdown
##### HDFS ([[Hadoop]] Distributed File System)
|
||
Stores huge files (Typical file size GB-TB) across multiple machines.
|
||
- Breaks files into **blocks** (typically 128 MB).
|
||
- **Replicates** blocks (default 3 copies) for fault tolerance.
|
||
- Access using POSIX API.
|
||
|
||
##### HDFS design principles
|
||
* **Immutable**: **write-once, read-many**
|
||
* **No Failures**: Disk or node failure does not affect file system
|
||
* **File Size Unlimited**: Up to 512 yottabytes (2^63 X 64MB)
|
||
* **File Num Limited**: 1048576 files in a directory
|
||
* **Prefer bigger files**: Big files provide better performance
|
||
|
||
##### HDFS File Formats
|
||
- Text/CSV - No schema, no metadata
|
||
- Json Records - metadata is stored with data
|
||
- Avro Files - schema independent of data
|
||
- Sequence Files - binary files (used as intermediate storage in M/R)
|
||
- RC Files - Record Columnar files
|
||
- ORC Files - Optimized RC files. Compress better
|
||
- Parquet Files - Yet another RC file
|
||
|
||
##### HDFS Command Line
|
||
```
|
||
# List files
|
||
hadoop fs -ls /path
|
||
|
||
# Make directory
|
||
hadoop fs -mkdir /user/hadoop
|
||
|
||
# Print file
|
||
hadoop fs -cat /file
|
||
|
||
# Upload file
|
||
hadoop fs -copyFromLocal file.txt hdfs://...
|
||
```
|
||
|
||
#### HDFS Architecture – Main Components
|
||
##### **1.** NameNode (Master Node)
|
||
- **Stores metadata** about the filesystem:
|
||
- Filenames
|
||
- Directory structure
|
||
- Block locations
|
||
- Permissions
|
||
|
||
- It **does not store the actual data**.
|
||
- There is **one active NameNode** per cluster.
|
||
|
||
##### **2.** DataNodes (Worker Nodes)
|
||
- Store the **actual data blocks** of files.
|
||
- Send **heartbeat** messages to the NameNode to report that they are alive.
|
||
- When a file is written, it’s split into blocks and distributed across many DataNodes.
|
||
- DataNodes also **replicate** blocks (typically 3 copies) to provide **fault tolerance**.
|
||
|
||
#### File Read / Write
|
||
**When a file is written:**
|
||
1. The client contacts the **NameNode** to ask: “Where should I write the blocks?”
|
||
2. The NameNode responds with a list of **DataNodes** to use.
|
||
3. The client sends the blocks of the file to those DataNodes.
|
||
4. Blocks are **replicated** automatically across different nodes for redundancy.
|
||
|
||
**When a file is read:**
|
||
1. The client contacts the **NameNode** to get the list of DataNodes storing the required blocks.
|
||
2. The client reads the blocks **directly** from the DataNodes. |