
##### HDFS ([[Hadoop]] Distributed File System)
Stores huge files (typically GBs to TBs each) across multiple machines.
- Breaks files into **blocks** (typically 128 MB).
- **Replicates** blocks (default 3 copies) for fault tolerance.
- Accessed through its own API and shell; the interface is POSIX-like, but HDFS relaxes several POSIX requirements for streaming throughput.
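The block math above can be sketched directly. A minimal illustration, assuming the default 128 MiB block size and replication factor 3 (both are configurable):

```python
import math

BLOCK_SIZE = 128 * 1024 ** 2   # default HDFS block size: 128 MiB
REPLICATION = 3                # default replication factor

def block_count(file_size: int) -> int:
    """Number of HDFS blocks a file of `file_size` bytes occupies."""
    return max(1, math.ceil(file_size / BLOCK_SIZE))

def raw_cluster_bytes(file_size: int) -> int:
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size * REPLICATION

one_gib = 1024 ** 3
print(block_count(one_gib))        # 8 blocks
print(raw_cluster_bytes(one_gib))  # 3 GiB of raw storage
```

Note that a block holding less than 128 MiB only consumes its actual size on disk; the block size is an upper bound per block, not an allocation unit.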
##### HDFS design principles
* **Immutable**: **write-once, read-many** — files can be appended to, but not edited in place
* **Failure-tolerant**: disk and node failures are expected and handled without affecting the file system
* **File Size Unlimited**: up to 512 YiB in theory (2^63 blocks × 64 MiB)
* **File Num Limited**: at most 1,048,576 entries per directory (default)
* **Prefer bigger files**: fewer, larger files perform better than many small ones
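The 512 YiB figure can be checked directly from the two numbers given (2^63 addressable blocks of 64 MiB each):

```python
max_blocks = 2 ** 63           # addressable blocks
block_bytes = 64 * 1024 ** 2   # 64 MiB per block, as in the note above
total = max_blocks * block_bytes
print(total == 512 * 1024 ** 8)  # True -- 512 yobibytes (2^89 bytes)
```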
##### HDFS File Formats
- Text/CSV - no schema; no metadata stored with the data
- JSON Records - metadata (field names) is stored with every record
- Avro Files - row-based binary format; schema is stored independently of the data
- Sequence Files - binary key/value files (used as intermediate storage in M/R)
- RC Files - Record Columnar files
- ORC Files - Optimized RC files; compress better
- Parquet Files - another columnar format, widely supported beyond Hadoop
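The schema/metadata trade-off between the first two formats can be shown with the standard library alone (real Avro/Parquet writers need extra libraries, so they are omitted here):

```python
import csv
import io
import json

rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

# Text/CSV: values only -- the schema lives outside the file.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(r.values() for r in rows)
csv_text = buf.getvalue()
print(csv_text)   # 1,alice / 2,bob -- nothing says what the columns mean

# JSON Records: field names repeat in every record, so metadata travels
# with the data at the cost of extra bytes per row.
json_text = "\n".join(json.dumps(r) for r in rows)
print(json_text)
```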
##### HDFS Command Line
```
# List files
hadoop fs -ls /path
# Make directory
hadoop fs -mkdir /user/hadoop
# Print file
hadoop fs -cat /file
# Upload file
hadoop fs -copyFromLocal file.txt hdfs://...
```
#### HDFS Architecture Main Components
##### 1. NameNode (Master Node)
- **Stores metadata** about the filesystem:
	- Filenames
	- Directory structure
	- Block locations
	- Permissions
- It **does not store the actual data**.
- There is **one active NameNode** per cluster.
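A toy picture of what the NameNode tracks — hypothetical structure and names, purely for illustration (the real NameNode keeps this in memory, with block locations reported by DataNodes):

```python
# NameNode side: file -> blocks + attributes, and block -> DataNode locations.
# No file contents appear anywhere here -- metadata only.
namenode_metadata = {
    "/user/hadoop/file.txt": {
        "blocks": ["blk_001", "blk_002"],
        "permissions": "rw-r--r--",
    },
}
block_locations = {
    "blk_001": ["datanode-1", "datanode-2", "datanode-3"],
    "blk_002": ["datanode-2", "datanode-3", "datanode-4"],
}

def locate(path):
    """Return the DataNodes holding each block of `path` (no data touched)."""
    return {b: block_locations[b] for b in namenode_metadata[path]["blocks"]}

print(locate("/user/hadoop/file.txt"))
```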
##### 2. DataNodes (Worker Nodes)
- Store the **actual data blocks** of files.
- Send **heartbeat** messages to the NameNode to report that they are alive.
- When a file is written, it's split into blocks and distributed across many DataNodes.
- DataNodes also **replicate** blocks (typically 3 copies) to provide **fault tolerance**.
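The heartbeat mechanism can be sketched as follows — the timeout here is illustrative (real HDFS marks a DataNode dead only after roughly 10.5 minutes by default), and the node names are hypothetical:

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds; illustrative, not the real HDFS default

# Last heartbeat the NameNode saw from each DataNode:
# datanode-2's heartbeat is 60 s old, well past the timeout.
now = time.time()
last_heartbeat = {"datanode-1": now, "datanode-2": now - 60}

def live_datanodes(at=now):
    """DataNodes whose most recent heartbeat is within the timeout."""
    return [n for n, t in last_heartbeat.items() if at - t < HEARTBEAT_TIMEOUT]

print(live_datanodes())  # ['datanode-1'] -- datanode-2 is presumed dead
```

Once a node is presumed dead, the NameNode schedules re-replication of its blocks so the replica count recovers.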
#### File Read / Write
**When a file is written:**
1. The client contacts the **NameNode** to ask: “Where should I write the blocks?”
2. The NameNode responds with a list of **DataNodes** to use.
3. The client sends the blocks of the file to those DataNodes.
4. Blocks are **replicated** automatically across different nodes for redundancy.
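The four write steps can be simulated end to end. This is a deliberately tiny sketch — round-robin placement, the client writing every replica itself (real HDFS pipelines replicas DataNode-to-DataNode), and all names hypothetical:

```python
import itertools

class DataNode:
    """Toy DataNode: just a dict of block_id -> bytes."""
    def __init__(self, name):
        self.name, self.storage = name, {}
    def store(self, block_id, data):
        self.storage[block_id] = data

class NameNode:
    """Toy NameNode: metadata only, round-robin block placement."""
    def __init__(self, datanodes):
        self._cycle = itertools.cycle(datanodes)
        self.metadata = {}
    def choose_datanodes(self, replication):
        return [next(self._cycle) for _ in range(replication)]
    def record(self, path, placement):
        self.metadata[path] = placement

def write_file(namenode, path, data, block_size=4, replication=3):
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = {}
    for i, block in enumerate(blocks):
        block_id = f"blk_{i}"
        targets = namenode.choose_datanodes(replication)  # steps 1-2: ask NameNode
        for dn in targets:                                # step 3: ship the block
            dn.store(block_id, block)                     # step 4: replicas land
        placement[block_id] = [dn.name for dn in targets]
    namenode.record(path, placement)                      # metadata update
    return placement

nodes = [DataNode(f"datanode-{i}") for i in range(1, 4)]
nn = NameNode(nodes)
print(write_file(nn, "/user/hadoop/file.txt", b"hello world!"))
```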
**When a file is read:**
1. The client contacts the **NameNode** to get the list of DataNodes storing the required blocks.
2. The client reads the blocks **directly** from the DataNodes.
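The read path, sketched with the same hypothetical names: the only thing the NameNode supplies is the block-to-DataNode map; the bytes flow straight from the DataNodes to the client.

```python
# What each DataNode holds (toy data, hypothetical names).
datanode_storage = {
    "datanode-1": {"blk_0": b"hello "},
    "datanode-2": {"blk_0": b"hello ", "blk_1": b"world"},
}
# What the NameNode answers in step 1.
block_locations = {"blk_0": ["datanode-1", "datanode-2"],
                   "blk_1": ["datanode-2"]}

def read_file(block_ids):
    data = b""
    for blk in block_ids:
        dn = block_locations[blk][0]       # step 1: location from the NameNode
        data += datanode_storage[dn][blk]  # step 2: read directly from a DataNode
    return data

print(read_file(["blk_0", "blk_1"]))  # b'hello world'
```

Because any replica can serve a read, the client typically picks the closest DataNode holding the block; here it simply takes the first one listed.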