HDFS | Big Data Analysis With HDFS

HDFS(Hadoop Distributed File System)

HDFS is a filesystem designed for storing very large files with streaming data access patterns,running on clusters of commodity hardware

HDFS is a filesystem written in Java
--Based on Google’s GFS
Sits on top of a native filesystem
--ext3, xfs etc
Provides redundant storage for massive amounts of data
--Using cheap, unreliable computers
HDFS performs best with a ‘modest’ number of large files
--Millions, rather than billions, of files
--Each file typically 100Mb or more
Files in HDFS are ‘write once’
--No random writes to files are allowed
HDFS is optimized for large, streaming reads of files
--Rather than random reads

How Files Are Stored
Files are split into blocks.
Data is distributed across many machines at load time
--Different blocks from the same file will be stored on different machines
--This provides for efficient MapReduce processing
Blocks are replicated across multiple machines, known as DataNodes
--Default replication is three-fold
  – i.e., each block exists on three different machines
A master node called the NameNode keeps track of which blocks make up a file, and where those 
--blocks are located
--Known as the metadata

How Files Are Stored:Example

NameNode holds metadata for the data files.
DataNodes hold the actual blocks
--Each block is replicated three times on the cluster

HDFS:Point To Note
When a client application wants to read a file:
--It communicates with the NameNode to determine which blocks make up the file,and which DataNodes those blocks reside on
--It then communicates directly with the DataNodes to read the data

Big Data Analysis With HDFS

HDFS Concepts: Blocks, Replicas,Namenode, Datanode
NameNode manages the File system Namespace


Command line interface

Hdfs File Read

Hdfs File Write

Start-up process
-Namenode enters Safemode
  --Replication does not occur in Safemode
-Each Datanode sends Heartbeat 
-Each Datanode sends Blockreport
  --Lists all HDFS data blocks
-Namenode creates Blockmap from Blockreports
-Namenode exits Safemode
-Replicate any under-replicated blocks
Checkpoint process
-Performed by Namenode
-Two versions of FsImage
   --One stored on disk
   --One in memory
-Applies all transactions in EditLog to in-memory FsImage
-Flushes FsImage to disk
-Truncates EditLog
Namenode memory concern
For fast access Namenode keeps all block metadata in-memory
--The bigger the cluster - the more RAM required
--Best for millions of large files (100mb or more) rather than billions
--Will work well for clusters of 100s machines
Hadoop 2+
--Namenode Federations
--Each namenode will host part of the blocks
--Horizontally scale the Namenode
--Support for 1000+ machine clusters
--Yahoo! runs 50,000+ machines
For more detail visit Apache Hadoop
Namenode’s fault tolerance
Namenode daemon process must be running at all times
--If process crashes then cluster is down
Namenode is a single point of failure
--Host on a machine with reliable hardware (ex. sustain a diskfailure)
--Usually is not an issue
Hadoop 2+
--High Availability Namenode
--Active Standby is always running and takes over in case main namenode fails
--Still in its infancy
Powered by Blogger.