MapReduce is a software framework, originally developed at Google, for easily writing applications that process large amounts of data in parallel on clusters. A MapReduce computation to solve a problem consists of two kinds of tasks: map tasks and reduce tasks.
A MapReduce job splits the input dataset into independent chunks that are processed by the map tasks in a completely parallel manner. The MapReduce framework sorts the outputs of the maps, which are then sent to the reduce tasks as inputs. The framework schedules tasks, monitors them, and re-executes failed tasks.
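To make the map/sort/reduce pipeline concrete, here is a toy word count in plain Python. This is an illustration of the model only, not Hadoop code; real Hadoop jobs (as we will see) are written in Java and run by the framework.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit a (word, 1) pair for each word in one chunk of input.
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    # Combine all the values emitted for one key.
    return (word, sum(counts))

lines = ["HDFS test file", "HDFS test"]

# Map phase: each input chunk is processed independently (in parallel, in Hadoop).
pairs = [pair for line in lines for pair in mapper(line)]

# Shuffle/sort phase: the framework sorts map output by key,
# so all pairs for one word end up adjacent.
pairs.sort(key=itemgetter(0))

# Reduce phase: each reduce call sees one key and all of its values.
result = dict(reducer(word, (c for _, c in group))
              for word, group in groupby(pairs, key=itemgetter(0)))
print(result)  # {'HDFS': 2, 'file': 1, 'test': 2}
```

In a real job, the map and reduce phases each run as many parallel tasks across the cluster; the sort between them is handled entirely by the framework.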
Hadoop is an open-source implementation of MapReduce, originally developed at Yahoo! and made freely available. In Hadoop, both the input and output of a job are usually stored in a shared file system called the Hadoop Distributed File System (HDFS). As its name implies, HDFS is a file system that is distributed across the nodes of a cluster and that provides a unified interface to the distributed files. For fault tolerance, HDFS stores multiple copies of each data file on different nodes. By keeping track of which data files are on which nodes, the Hadoop/MapReduce framework can schedule processes on the nodes where the data is already present, minimizing the movement of data through the cluster's network.
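The data-locality idea can be sketched in a few lines of Python. The block and node names below are hypothetical, and this is a simplification of what the real scheduler does, but it captures the "move the computation to the data" principle:

```python
# Hypothetical replica map: each block is stored on several nodes
# (three copies is HDFS's default replication factor).
replicas = {
    "block-0": ["node1", "node2", "node3"],
    "block-1": ["node2", "node4", "node5"],
}

def schedule(block, free_nodes):
    # Prefer a free node that already holds the block: the task
    # can then read its input locally, with no network transfer.
    for node in replicas[block]:
        if node in free_nodes:
            return node
    # Otherwise run anywhere, and fetch the block over the network.
    return free_nodes[0]

print(schedule("block-0", ["node2", "node4"]))  # node2: local read
print(schedule("block-1", ["node1", "node6"]))  # node1: remote read
```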
A Hadoop job client submits a job (jar, executable, etc.) and its configuration to the Hadoop master ResourceManager. The ResourceManager distributes the software and configuration to its workers, schedules tasks, monitors them, restarts them if necessary, and provides status and diagnostic information to the job client. In short, it handles the nitty-gritty details of making sure the computation runs and completes.
The Hadoop framework is implemented in Java, and we will be using MapReduce applications written in Java. There are several utilities, such as Hadoop Streaming and Hadoop Pipes, that can be used to allow the use of other executables for MapReduce; feel free to explore these on your own.
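To give a flavor of Hadoop Streaming, which connects any executable that reads standard input and writes standard output into a MapReduce job, here is a sketch of a streaming-style word-count mapper and reducer in Python. The sample data is made up, and the in-memory driver at the bottom stands in for the stdin/stdout plumbing a real streaming job would use:

```python
from itertools import groupby

def map_lines(lines):
    # Streaming mapper: emit one "word<TAB>1" record per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(sorted_records):
    # Streaming reducer: its input arrives sorted by key, so all
    # records for one word are adjacent and can be summed with groupby.
    keyed = (rec.split("\t") for rec in sorted_records)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# A real streaming mapper/reducer would read sys.stdin and print;
# here we demo the plumbing with an in-memory sample instead.
sample = ["HDFS test file", "test"]
mapped = sorted(map_lines(sample))  # Hadoop performs this sort between stages
for record in reduce_lines(mapped):
    print(record)
```

In an actual streaming job, the two stages would be supplied as the -mapper and -reducer programs of the hadoop-streaming jar, and the framework would provide the sort between them.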
Here at Calvin, Dahl is configured to run Hadoop on all the nodes in the cluster. Each of the nodes is configured as a Hadoop storage node, and HDFS provides a shared filesystem across the nodes. In the rest of this part of today's exercise, we will be learning how to use HDFS, so take a minute to log in to Dahl using our usual approach.
HDFS is the primary filesystem that Hadoop uses for MapReduce jobs. Using it is similar to using the Unix file system, so let's practice using some basic commands to become comfortable using its command line interface.
For a full list of HDFS commands, see this overview from Apache.org. For additional (local) help, you may enter the command:
hadoop fs -help

In the following examples, we will: (0) create our HDFS home directory, (1) list our HDFS home directory, (2) copy a file from our normal home directory into our HDFS home directory (and thus into the distributed file system), (3) view the file within HDFS, and then (4) remove our test files.
hadoop fs -mkdir /user/yourUserName

replacing yourUserName with your user name.
hadoop fs -ls

Since your home directory is empty, you should not see anything listed. To view the contents of a non-empty directory, enter:
hadoop fs -ls /user

You should see the names of the home directories of all the other Hadoop users.
hadoop fs -mkdir testHDFS

Verify that the directory exists by entering the same command you entered at the beginning of (1). You should see the testHDFS directory listed.
Last, let's verify it again using the HDFS full pathname to your home directory by entering:
hadoop fs -ls /user/yourUserName

Verify that this works correctly before continuing.
echo "HDFS test file" >> testFile

That will create a new file named testFile, containing the characters HDFS test file. To verify that, enter:
ls

to check that the file was created; then enter:
cat testFile

to check that it contains the characters HDFS test file.
Next, let's copy this file into HDFS. To copy files in Linux, we use the standard Unix cp command. To copy files from Linux into HDFS, we have to use the HDFS equivalent of cp. Enter:
hadoop fs -copyFromLocal testFile

Note that we have to use -copyFromLocal because the -cp command is reserved for copying files within HDFS.
Verify that the file copied over from the file system, and contains our test string by entering:
hadoop fs -ls
hadoop fs -cat testFile

For future reference, the command hadoop fs -moveFromLocal will move a file from the Unix filesystem into HDFS (instead of copying it), removing it from the local filesystem.
hadoop fs -mv testFile testHDFS/
hadoop fs -ls
hadoop fs -ls testHDFS/

The first command moves testFile from our HDFS home directory into our testHDFS directory. The second command should reveal that testFile is indeed gone from our HDFS home directory. The third command should reveal that testFile now resides in our testHDFS directory.
Similarly, we could copy a file:
hadoop fs -cp testHDFS/testFile testHDFS/testFile2
hadoop fs -ls testHDFS/
hadoop fs -du

This command will give you an idea of how much space you are using in your HDFS home directory. Another command will show you how much space is available in HDFS across the cluster:
hadoop fs -df
hadoop fs -rm testHDFS/testFile
hadoop fs -ls testHDFS/

Notice that we still have our testHDFS directory and testFile2 left over. Try to remove the directory by entering:
hadoop fs -rmdir testHDFS

You should see an error message to the effect that rmdir: `testHDFS': Directory is not empty.
As in Unix-based file systems, a directory must be empty before it can be removed. Fortunately, the rm command accepts a "do this recursively" switch (-r) that removes a directory plus everything it contains:
hadoop fs -rm -r testHDFS
hadoop fs -ls

At this point, we have cleaned up our test files and directories.