
HDFS (Hadoop Distributed File System)

Stores huge files (typically GB to TB in size) across multiple machines.

  • Breaks files into blocks (typically 128 MB).
  • Replicates blocks (default 3 copies) for fault tolerance.
  • Accessed through its own API; HDFS relaxes some POSIX requirements in favor of streaming throughput.
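The block size and replication factor above translate directly into block counts and raw storage cost. A small illustrative sketch (not Hadoop code; the function names are invented):

```python
# Illustrative only: how a file is split into fixed-size blocks and how
# much raw space replication consumes. Numbers match the defaults above
# (128 MB blocks, replication factor 3).
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MiB
REPLICATION = 3

def block_count(file_size: int) -> int:
    """Number of HDFS blocks needed for a file of the given size."""
    return math.ceil(file_size / BLOCK_SIZE)

def raw_storage(file_size: int) -> int:
    """Bytes consumed on disk once every block is replicated.
    A final partial block only uses its actual size, so this is
    simply file size times the replication factor."""
    return file_size * REPLICATION

one_gib = 1024 * 1024 * 1024
print(block_count(one_gib))              # 8 blocks for a 1 GiB file
print(raw_storage(one_gib) // one_gib)   # 3 GiB of raw cluster storage
```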
HDFS design principles
  • Immutable: write-once, read-many
  • Fault tolerant: a disk or node failure does not bring down the file system or lose data
  • File size practically unlimited: up to 512 yobibytes (2^63 blocks × 64 MiB)
  • File count limited: 1,048,576 (2^20) files per directory
  • Prefer bigger files: fewer, larger files perform better than many small ones
HDFS File Formats
  • Text/CSV - no schema, no metadata
  • JSON Records - metadata (field names) is stored with the data
  • Avro Files - schema stored independently of the data
  • Sequence Files - binary files (used as intermediate storage in MapReduce)
  • RC Files - Record Columnar files
  • ORC Files - Optimized RC files; compress better
  • Parquet Files - yet another columnar format
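The contrast between the first two formats is easy to see with the standard library: a CSV row carries values only, while a JSON record repeats the field names alongside the data ("metadata stored with data"):

```python
# Illustrative contrast: CSV rows carry no schema, JSON records are
# self-describing because each one embeds its field names.
import csv
import io
import json

rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

# CSV: values only; the reader must already know what each column means.
buf = io.StringIO()
writer = csv.writer(buf)
for r in rows:
    writer.writerow(r.values())
csv_text = buf.getvalue()

# JSON records: one self-describing record per line.
json_text = "\n".join(json.dumps(r) for r in rows)

print(csv_text)   # 1,alice / 2,bob
print(json_text)  # {"id": 1, "name": "alice"} ...
```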
HDFS Command Line
```shell
# List files
hadoop fs -ls /path

# Make directory
hadoop fs -mkdir /user/hadoop

# Print file
hadoop fs -cat /file

# Upload file
hadoop fs -copyFromLocal file.txt hdfs://...
```

HDFS Architecture Main Components

1. NameNode (Master Node)
  • Stores metadata about the filesystem:

    • Filenames
    • Directory structure
    • Block locations
    • Permissions
  • It does not store the actual data.

  • There is one active NameNode per cluster.

2. DataNodes (Worker Nodes)
  • Store the actual data blocks of files.
  • Send heartbeat messages to the NameNode to report that they are alive.
  • When a file is written, it's split into blocks and distributed across many DataNodes.
  • DataNodes also replicate blocks (typically 3 copies) to provide fault tolerance.
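The heartbeat mechanism can be sketched as follows. This is a toy model, not Hadoop code: the class and timeout are invented for illustration. The NameNode records the last heartbeat time per DataNode and treats any node silent for longer than a timeout as dead:

```python
# Toy sketch of heartbeat tracking: the NameNode remembers when each
# DataNode last reported in, and flags nodes that have gone silent.
class HeartbeatMonitor:
    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        """Record a heartbeat from a DataNode at time `now` (seconds)."""
        self.last_seen[node] = now

    def dead_nodes(self, now: float) -> list[str]:
        """DataNodes that have been silent longer than the timeout."""
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout=30.0)
monitor.heartbeat("datanode-1", now=0.0)
monitor.heartbeat("datanode-2", now=25.0)
print(monitor.dead_nodes(now=40.0))  # ['datanode-1']
```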

File Read / Write

When a file is written:

  1. The client contacts the NameNode to ask: “Where should I write the blocks?”
  2. The NameNode responds with a list of DataNodes to use.
  3. The client sends the blocks of the file to those DataNodes.
  4. Blocks are replicated automatically across different nodes for redundancy.
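The four write steps above can be simulated in a few lines. All names here are invented for illustration; real HDFS pipelines replication between DataNodes rather than having the client write every copy:

```python
# Minimal simulation of the write path: the client asks a NameNode-like
# object where to put each block, then sends the block to every DataNode
# in the returned list.
import itertools

class FakeNameNode:
    """Hands out a list of target DataNodes per block, round-robin."""
    def __init__(self, datanodes):
        self._cycle = itertools.cycle(datanodes)
        self.block_map = {}   # block id -> list of DataNode names

    def allocate(self, block_id, replication=3):
        targets = [next(self._cycle) for _ in range(replication)]
        self.block_map[block_id] = targets
        return targets

storage = {name: {} for name in ("dn1", "dn2", "dn3", "dn4")}
namenode = FakeNameNode(list(storage))

def write_file(blocks):
    for block_id, data in enumerate(blocks):
        for dn in namenode.allocate(block_id):   # steps 1-2: ask NameNode
            storage[dn][block_id] = data         # steps 3-4: send + replicate

write_file([b"block-a", b"block-b"])
print(namenode.block_map)  # each block lives on 3 of the 4 DataNodes
```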

When a file is read:

  1. The client contacts the NameNode to get the list of DataNodes storing the required blocks.
  2. The client reads the blocks directly from the DataNodes.
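The read path can be sketched the same way, again with invented names: the client consults NameNode-style metadata for block locations, then fetches each block directly from a DataNode holding it:

```python
# Minimal sketch of the read path: look up block locations in
# NameNode-style metadata, then read each block straight from the
# first DataNode that holds it.
block_locations = {0: ["dn1", "dn2", "dn3"], 1: ["dn4", "dn1", "dn2"]}
storage = {
    "dn1": {0: b"block-a", 1: b"block-b"},
    "dn2": {0: b"block-a", 1: b"block-b"},
    "dn3": {0: b"block-a"},
    "dn4": {1: b"block-b"},
}

def read_file(num_blocks: int) -> bytes:
    data = b""
    for block_id in range(num_blocks):
        dn = block_locations[block_id][0]   # step 1: ask for locations
        data += storage[dn][block_id]       # step 2: read from a DataNode
    return data

print(read_file(2))  # b'block-ablock-b'
```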