Basics of the Hadoop Framework
HDFS is a Hadoop distributed file system. It can be installed on commodity servers and run on as many servers as you need - HDFS easily scales to thousands of nodes and petabytes of data.
The larger HDFS setup is, the bigger probability that some disks, servers or network switches will fail. HDFS survives these types of failures by replicating data on multiple servers. HDFS automatically detects that a given component has failed and takes necessary recovery actions that happen transparently to the user.
HDFS is designed for storing large files of the magnitude of hundreds of megabytes or gigabytes and provides high-throughput streaming data access to them. Last but not least, HDFS supports the write-once-read-many model. For this use case HDFS works like a charm. If you need, however, to store a large number of small files with a random read-write access, then other systems like RDBMS and Apache HBase can do a better job.
Note: HDFS does not allow you to modify a file’s content. There is only support for appending data at the end of the file. However, Hadoop was designed with HDFS to be one of many pluggable storage options – for example, with MapR-Fs, a proprietary filesystem, files are fully read-write. Other HDFS alternatives include Amazon S3 and IBM GPFS.
Architecture of HDFS
HDFS consists of following daemons that are installed and run on selected cluster nodes:
- NameNode - the master process responsible for managing the file system namespace (filenames, permissions and ownership, last modification date etc.) and controlling access to data stored in HDFS. It is the one place where there is a full overview of the distributed file system. If the NameNode is down, you can not access your data. If your namespace is permanently lost, you’ve essentially lost all of your data!
- DataNodes - slave processes that take care of storing and serving data. A DataNode is installed on each worker node in the cluster.
Figure 1 illustrates installation of HDFS on a 4-node cluster. One of the node hosts the NameNode daemon while the other three run DataNode daemons.
Note: NameNode and DataNode are Java processes that run on top of Linux distribution such as RedHat, Centos, Ubuntu and more. They use local disks for storing HDFS data.
HDFS splits each file into a sequence of smaller, but still large, blocks (default block size equals to 128MB – bigger blocks mean fewer disk seek operations, which results in large throughput). Each block is stored redundantly on multiple DataNodes for fault-tolerance. The block itself does not know which file it belongs to – this information is only maintained by the NameNode that has a global picture of all directories, files and blocks in HDFS.
Figure 2 illustrates the concept of splitting files into blocks. File X is split into blocks B1 and B2 and File Y comprises of only one block B3. All blocks are replicated 2 times within the cluster. As mentioned, information about which blocks compose a file is kept by the NameNode while raw data is stored by DataNodes.
Interacting with HDFS
HDFS provides a simple POSIX-like interface to work with data. You perform file system operations using hdfs dfs command.
To start playing with Hadoop you don’t have to go through process of setting up a whole cluster. Hadoop can run in so called pseudo-distributed mode on a single machine. You can download the sandbox Virtual Machine with all HDFS components already installed and start using Hadoop in no time! Just follow one of these links:
The following steps illustrate typical operations that a HDFS user can perform:
List the content of home directory
$ hdfs dfs -ls /user/adam
Upload a file from local file system to HDFS
$ hdfs dfs -put songs.txt /user/adam
Read the content of the file from HDFS
$ hdfs dfs -cat /user/adam/songs.txt
Change the permission of a file
$ hdfs dfs -chmod 700 /user/adam/songs.txt
Set the replication factor of a file to 4
$ hdfs dfs -setrep -w 4 /user/adam/songs.txt
Check the size of the file
$ hdfs dfs -du -h /user/adam/songs.txt
Move the file to the newly created subdirectory
$ hdfs dfs -mv songs.txt songs/
Remove directory from HDFS
$ hdfs dfs -rm -r songs