HPC: Hadoop Distributed File System (HDFS) Tutorial


MapReduce is a software framework, originally developed at Google, for easily writing applications that process large amounts of data in parallel on clusters. A MapReduce computation to solve a problem consists of two kinds of tasks:

  1. mappers, which process the data on a given node of the cluster; and
  2. reducers, which take the results produced by the mappers and combine them as needed to solve the problem.

A MapReduce job splits the input data-set into independent chunks that are processed by the mapper tasks in a completely parallel manner. The MapReduce framework sorts the outputs of the maps, which are then sent to the reduce tasks as inputs. The framework schedules tasks, monitors them and re-executes failed tasks.
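The map-sort-reduce flow described above can be sketched in a few lines of Python, using the classic word-count problem. (This is a toy, single-process simulation for building intuition; it is not the Hadoop API, and a real job would run many mapper and reducer tasks in parallel.)

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(sorted_pairs):
    """Reducer: input arrives grouped by key; sum the counts per word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog"]
# The framework sorts the mappers' output by key before the reduce phase;
# sorted() plays that role here.
result = dict(reduce_phase(sorted(map_phase(lines))))
print(result)
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

Note that because the pairs are sorted by key before reducing, all occurrences of a given word are adjacent, so a reducer can process each key's group in a single pass.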

Hadoop is an open source implementation of MapReduce, developed largely at Yahoo and made freely available. In Hadoop, both the input and output of a job are usually stored in a shared file system called the Hadoop Distributed File System (HDFS). As its name implies, HDFS is a file system that is distributed across the nodes of a cluster, and that provides a unified interface to the distributed files. For fault tolerance, HDFS distributes multiple copies of the data files to different nodes. By keeping track of which data files are on which nodes, the Hadoop/MapReduce framework can schedule processes on the nodes where data is already present, minimizing the movement of data through the cluster's network.
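To make the replication idea concrete, here is a toy Python sketch of block placement. The node names, sizes, and round-robin policy are illustrative assumptions only; real HDFS splits files into large blocks (typically 128 MB), keeps 3 replicas of each block by default, and uses a rack-aware placement policy rather than round-robin.

```python
def place_blocks(file_size, block_size, nodes, replicas=3):
    """Toy sketch: split a file into blocks and assign each block to
    `replicas` distinct nodes, round-robin (real HDFS is rack-aware)."""
    num_blocks = -(-file_size // block_size)   # ceiling division
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replicas)]
    return placement

# A 350 MB file with 128 MB blocks needs 3 blocks, each stored on 3 nodes.
layout = place_blocks(350, 128, ["node1", "node2", "node3", "node4"])
print(layout)
# {0: ['node1', 'node2', 'node3'],
#  1: ['node2', 'node3', 'node4'],
#  2: ['node3', 'node4', 'node1']}
```

Because every block lives on several nodes, the scheduler has several candidate nodes on which a mapper can run "next to" its data, and the loss of any single node destroys no data.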

A Hadoop job client submits a job (jar, executable, etc) and job configuration to the Hadoop master ResourceManager. The ResourceManager distributes the software and configuration to its workers, schedules tasks, monitors them, restarts them if necessary, and provides status and diagnostic information to the job-client. In short, it handles the nitty-gritty details of making sure the computation runs and completes.

The Hadoop framework is implemented in Java, and we will be using MapReduce applications written in Java. There are several utilities, such as Hadoop Streaming and Hadoop Pipes, that can be used to allow the use of other executables for MapReduce; feel free to explore these on your own.
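As a taste of how Hadoop Streaming works: a Streaming mapper and reducer are ordinary programs that read lines from standard input and write key/value lines (separated by a tab) to standard output. The sketch below expresses the classic word-count example in that line protocol, with the mapper and reducer written as Python functions so the whole pipeline can be simulated in one process; with real Hadoop Streaming they would be separate scripts reading sys.stdin, wired together via the -mapper and -reducer options.

```python
from itertools import groupby

def mapper(lines):
    """Streaming-style mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Streaming-style reducer: input arrives sorted by key; sum per word."""
    parsed = (line.split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Simulate the framework by hand: map, then sort by key, then reduce.
for line in reducer(sorted(mapper(["the quick brown fox", "the lazy dog"]))):
    print(line)
```

Since the protocol is just lines of text, the same mapper and reducer could equally be written in any language that can read stdin and write stdout.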

Here at Calvin, Dahl is configured to run Hadoop on all the nodes in the cluster. Each of the nodes is configured as a Hadoop storage node, and HDFS provides a shared filesystem across the nodes. In the rest of this part of today's exercise, we will be learning how to use HDFS, so take a minute to login to Dahl using our usual approach.

Getting Started with HDFS

HDFS is the primary filesystem that Hadoop uses for MapReduce jobs. Using it is similar to using the Unix file system, so let's practice using some basic commands to become comfortable using its command line interface.

For a full list of HDFS commands, see the File System Shell guide at apache.org. For additional (local) help, you may enter the command:

   hadoop fs -help

In the following examples, we will: (1) create our HDFS home directory, (2) list its contents, (3) create a directory within it, (4) copy a file from our normal home directory into HDFS, (5) move and copy files within HDFS, (6) check our disk usage, and (7) remove our test files and directories.
  1. Creating Your Home Directory. To create your own home directory within the HDFS file system, enter:
         hadoop fs -mkdir /user/yourUserName 
    replacing yourUserName with your user name.
  2. Listing Your Home Directory. To view the contents of your HDFS home directory, enter:
         hadoop fs -ls
    Since your home directory is empty, you should not see anything listed. To view the contents of a non-empty directory, enter:
         hadoop fs -ls /user 
    You should see the names of the home directories of all the other Hadoop users.
  3. Creating a Directory in Your HDFS Home Directory. Let's create a directory named testHDFS within your HDFS home directory. To do this, enter:
         hadoop fs -mkdir testHDFS
    Verify that the directory exists by re-entering the hadoop fs -ls command from the previous step. You should see the testHDFS directory listed.

    Last, let's verify it again using the HDFS full pathname to your home directory by entering:

         hadoop fs -ls /user/yourUserName
    Verify that this works correctly before continuing.
  4. Copying a File from the Normal Filesystem into HDFS. Before we learn how to copy a file, let's create a file to copy. Enter:
         echo "HDFS test file" >> testFile
    That will create a new file named testFile, containing the characters HDFS test file. To verify this, enter:
         ls
    to check that the file was created; then enter:
         cat testFile
    to check that it contains the characters HDFS test file.

    Next, let's copy this file into HDFS. To copy files in Linux, we use the standard Unix command cp command. To copy files from Linux into HDFS, we have to use the HDFS equivalent of cp. Enter:

         hadoop fs -copyFromLocal testFile 
    Note that we use -copyFromLocal here rather than -cp: the -cp command copies files within HDFS, while -copyFromLocal copies a file from the local filesystem into HDFS.

    Verify that the file copied over from the file system, and contains our test string by entering:

         hadoop fs -ls
         hadoop fs -cat testFile
    For future reference, the command hadoop fs -moveFromLocal will move a file from the Unix filesystem into HDFS (instead of copying it), removing it from the local filesystem.
  5. Moving and Copying Files Within HDFS. When we copied testFile into HDFS, we put it in our HDFS home directory. Let's see how to move it into the testHDFS directory we created earlier. Enter:
          hadoop fs -mv testFile testHDFS/
          hadoop fs -ls
          hadoop fs -ls testHDFS/
    The first command moves testFile from our HDFS home directory into our testHDFS directory. The second command should reveal that testFile is indeed gone from our HDFS home directory. The third command should reveal that testFile now resides in our testHDFS directory.

    Similarly, we could copy a file:

          hadoop fs -cp testHDFS/testFile testHDFS/testFile2
          hadoop fs -ls testHDFS/
  6. Checking Disk Usage. Another useful command shows how much disk space we are using in HDFS. Enter:
          hadoop fs -du
    This command will give you an idea of how much space you are using in your HDFS home directory. Another command will show you how much space is available in HDFS across the cluster:
          hadoop fs -df
  7. Deleting Files and Directories. Now that we have covered the basics of HDFS, let's clean up the various test files and directories we created. Enter:
          hadoop fs -rm testHDFS/testFile
          hadoop fs -ls testHDFS/
    Notice that we still have our testHDFS directory and testFile2 left over. Try to remove the directory by entering:
          hadoop fs -rmdir testHDFS
    You should see an error message to the effect that rmdir: `testHDFS': Directory is not empty.

    Similarly to how the Unix-based file systems work, a directory must be empty before it can be removed. Fortunately, there is a "do this recursively" switch that we can pass with the "rm" command to recursively remove a directory plus everything it contains:

          hadoop fs -rm -r testHDFS
          hadoop fs -ls
    At this point, we have cleaned up our test files and directories.
Congratulations! You now know how to use HDFS!


This page maintained by Joel Adams.