HDFS (Hadoop Distributed File System)
Stores huge files (typical file size GB-TB) across multiple machines.
- Breaks files into blocks (typically 128 MB).
- Replicates blocks (default 3 copies) for fault tolerance; both defaults can be checked as shown below.
- Accessed through a POSIX-like file API (HDFS relaxes some POSIX requirements).
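Block size and replication factor are ordinary configuration values, so a quick sanity check on a running cluster looks like this (the values in the comments are just the common defaults):
# Default block size in bytes (134217728 = 128 MB)
hdfs getconf -confKey dfs.blocksize
# Default number of replicas per block (usually 3)
hdfs getconf -confKey dfs.replication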
HDFS design principles
- Immutable: write-once, read-many
- Fault tolerant: a disk or node failure does not affect the file system as a whole
- File Size Unlimited: up to 512 yottabytes (2^63 blocks × 64 MB)
- File Num Limited: at most 1,048,576 items in a directory
- Prefer bigger files: fewer, larger files perform better than many small ones (each file costs NameNode metadata), as sketched below
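One practical way to follow the "prefer bigger files" rule is to pack many small files into a single archive so the NameNode tracks one item instead of thousands. A minimal sketch using Hadoop Archives; the paths and archive name are made up.
# Pack /data/logs into a single .har archive under /data/archives
hadoop archive -archiveName logs.har -p /data logs /data/archives
# The archive can still be browsed like a directory
hadoop fs -ls har:///data/archives/logs.har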
HDFS File Formats
- Text/CSV - plain text; no schema, no metadata
- JSON Records - each record carries its own metadata (field names stored with the data)
- Avro Files - row-oriented binary format; the schema is stored in the file header and can evolve independently of the data (see the sketch after this list)
- Sequence Files - binary key-value files (used as intermediate storage in MapReduce)
- RC Files - Record Columnar files
- ORC Files - Optimized RC files; compress better
- Parquet Files - another columnar format, widely supported outside Hadoop as well
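As an illustration of the Avro point, the schema embedded in an Avro file can be inspected with the avro-tools jar; the jar version and file name below are placeholders.
# Print the schema stored in the Avro file header
java -jar avro-tools-1.11.1.jar getschema events.avro
# Dump the records as JSON for a quick look
java -jar avro-tools-1.11.1.jar tojson events.avro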
HDFS Command Line
# List files
hadoop fs -ls /path
# Make directory
hadoop fs -mkdir /user/hadoop
# Print file
hadoop fs -cat /file
# Upload file
hadoop fs -copyFromLocal file.txt hdfs://...
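A few more everyday commands in the same style; the paths are examples.
# Download a file from HDFS to the local disk
hadoop fs -copyToLocal /file ./file
# Show disk usage, human readable
hadoop fs -du -h /path
# Delete a file
hadoop fs -rm /file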
HDFS Architecture – Main Components
1. NameNode (Master Node)
- Stores metadata about the filesystem:
  - Filenames
  - Directory structure
  - Block locations
  - Permissions
- It does not store the actual data.
- There is one active NameNode per cluster (see the commands below).
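Both of the following commands involve only NameNode metadata, never the file data itself; the path and format string are illustrative.
# List the NameNode(s) configured for this cluster
hdfs getconf -namenodes
# Ask the NameNode for a file's size, replication factor, block size and name
hadoop fs -stat "%b %r %o %n" /file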
2. DataNodes (Worker Nodes)
- Store the actual data blocks of files.
- Send heartbeat messages to the NameNode to report that they are alive.
- When a file is written, it’s split into blocks and distributed across many DataNodes.
- Blocks are replicated across DataNodes (typically 3 copies) for fault tolerance; when a DataNode dies, the NameNode schedules re-replication of its blocks (see the command below).
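To see which DataNodes are currently reporting in via heartbeats and how much each one stores, a single command is enough; the exact output format varies by Hadoop version.
# Cluster summary: capacity, live/dead DataNodes, last contact per node
hdfs dfsadmin -report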
File Read / Write
When a file is written:
- The client contacts the NameNode to ask: “Where should I write the blocks?”
- The NameNode responds with a list of DataNodes to use.
- The client streams each block to the first DataNode in the list, which forwards it to the next one (a replication pipeline).
- Blocks are therefore replicated automatically across different nodes for redundancy.
When a file is read:
- The client contacts the NameNode to get the list of DataNodes storing the required blocks.
- The client reads the blocks directly from the DataNodes.
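An end-to-end sketch of both flows; the file name and path are made up.
# Write: the client gets target DataNodes from the NameNode, then streams the blocks to them
hadoop fs -put big.log /data/big.log
# Inspect: list every block of the file and the DataNodes holding its replicas
hdfs fsck /data/big.log -files -blocks -locations
# Read: the client fetches the block list from the NameNode, then reads directly from the DataNodes
hadoop fs -cat /data/big.log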