CS 374: Using Torque on Dahl

Very few supercomputers allow their users to work interactively, because the users' programs might interfere with one another. Instead, users submit their "jobs" (the programs they want to run) to a batch queue, which holds a given job until all of the resources it needs to run become available.

There are a variety of batch queue systems available. Commonly used ones include the Portable Batch System (PBS), Torque, Grid Engine, and Slurm. In this course, we will be using Torque, the Terascale Open-source Resource and QUEue Manager, which is an extension of the original PBS project.

Running a job via Torque involves four steps:

  1. Creating a job submission script.
  2. Submitting the job.
  3. Waiting until the job has completed.
  4. Retrieving the job's output/results.

Let's take these one at a time.

1. Creating a Job Submission Script

Before you can submit a job in Torque, you must first create a text file containing a job submission script, in which you specify the resources your program needs. For example, suppose you want to run a program named spmd, using 4 of the supercomputer's nodes, and with 8 processes running on each node (a total of 32 processes). Then we might create the following job submission script:
#!/bin/bash

# Specify the resources needed: the number of nodes and the processes per node
#PBS -l nodes=4:ppn=8

# Give the job a unique-to-you name in Torque
#PBS -N yourUserName_spmd_4_8

# Optional: If you want to be notified, tell Torque how/when...
#    -m  accepts up to all three control flags 'a','b','e', where:
#        a = mail is sent when the job is aborted
#        b = mail is sent when the job begins execution
#        e = mail is sent when the job finishes execution
#  and be sure to use your email address!
#PBS -m abe -M yourUserName@students.calvin.edu

# Change to the working directory and execute spmd
cd $PBS_O_WORKDIR
mpiexec ./spmd

Within the script, we specify the number of nodes we want to use (4), the number of processes per node (8), a unique name for the job (yourUserName_spmd_4_8, with yourUserName replaced by your user name), how we want Torque to notify us, and the name of the program we want Torque to run (spmd). These are the parameters that Torque needs to run our job.

When you save the script, give it a descriptive name, such as run_spmd.pbs. Most people create at least one script for each project, so save it in the same directory as your program and its Makefile.

We will be running each of our programs multiple times, varying the number of processes in order to test the scalability of our programs. Whether you use a single script and change its values for each submission, or write a separate script for each submission, is up to you. (Since you will be following this same procedure for each MPI project, a separate script for each submission might be faster in the long run.)
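If you take the single-script approach, one convenience (not required for this course) is that options passed to qsub on the command line override the corresponding #PBS directives in the script. The sketch below is a hedged example of that idea: the job names are hypothetical, and the leading echo makes it a dry run that only prints the commands instead of submitting them.

```shell
#!/bin/bash
# Dry run: print the qsub command for several node counts.
# Remove the leading 'echo' to actually submit the jobs.
for nodes in 1 2 4; do
    echo qsub -l "nodes=${nodes}:ppn=8" -N "yourUserName_spmd_${nodes}_8" run_spmd.pbs
done
```

Each printed line shows one submission, varying only the node count, which is exactly the kind of series you will run when measuring scalability.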

2. Submitting the Job

Once you have a script created, the next step is to submit it to the Torque scheduling system. To do this, you use the qsub command:

   qsub run_spmd.pbs
This tells Torque to put your job in its job queue. Torque will look at the resources the job requires (i.e., number of nodes, processes per node) and schedule it when all of those resources are available.

After you submit your job, qsub will output a line like this:

   30997.dahl-node-00.calvin.edu
The 30997 is your job's submission ID number.
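That numeric part can be captured in a shell variable for later use with qstat and qdel. A minimal sketch, using the sample string from above in place of a live qsub call (the real ID comes from running qsub yourself):

```shell
# Strip everything after the first '.' to get the numeric job ID.
# The string below is the sample qsub output from above, not a live call.
submission="30997.dahl-node-00.calvin.edu"
jobid="${submission%%.*}"   # keep only the text before the first '.'
echo "$jobid"               # prints 30997
```

In a real script you would instead write jobid=$(qsub run_spmd.pbs) and then apply the same "${jobid%%.*}" expansion.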

3. Waiting Until the Job has Completed

The more resources your script requires, the less likely it is that they will all be simultaneously available, and the longer it will take to get scheduled. (This keeps people from "hogging" a supercomputer's resources.)

While you wait, you can enter the qstat command to monitor the status of your submission, for example:

   qstat 30997
will provide a status update for that submission, displaying something like this:
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
30997.dahl-node-00         adams_spmd_4_8   adams           00:00:00 R batch          
The fields are mostly self-explanatory, except for the S field, whose values may be:
   Q - Queued
   R - Running
   C - Completed
(There are also other status values, but these are the three most common. Enter man qstat for more information.)
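In a script, you may want just the S column, for instance to poll until the job completes. A hedged sketch that parses the sample qstat line shown above with awk (the field position assumes the default qstat column layout):

```shell
# Extract the status (S) field, the fifth whitespace-separated column.
# The line below is the sample qstat output from above, not a live call.
line="30997.dahl-node-00         adams_spmd_4_8   adams           00:00:00 R batch"
status=$(echo "$line" | awk '{print $5}')
echo "$status"   # prints R
```

In practice you would pipe the output of qstat itself into awk rather than using a saved line.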

Entering the qstat command without an ID number should give you a list of all your jobs in the queue. The Moab showq command can also be used to see all jobs in the queue from the scheduler's point of view. This can be used to get a sense of how busy the cluster is at the present time.

If you need to remove your submission from the queue (e.g., it seems to be stuck in an infinite loop), you can do so using the qdel command:

   qdel 30997
There are many other options that can be given to these commands. See the manual pages for qsub, qstat, and qdel for more information.

4. Retrieving the Job's Output/Results

When your job has completed, Torque creates two files: one containing whatever your job wrote to standard output, and another containing whatever it wrote to standard error (if anything).

To view the names of these files, enter
   ls
In the example above, Torque produced files named adams_spmd_4_8.o30997 and adams_spmd_4_8.e30997, respectively. As you can see, each file's name consists of three parts:
  1. The job name you specified back in your submission script.
  2. Either o for output or e for error.
  3. The ID number of your submission.
Each time you submit a new job, Torque will give that submission a unique ID number, so the files produced by each submission will be unique.

Over the course of testing your program, many of these files will be created. If you use a descriptive name for each submission in your submission script, then the file's name will tell you which submission it represents.
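Because the names follow this three-part pattern, you can reconstruct them, or glob for them, in the shell. A small sketch, using the hypothetical job name and ID from the example above:

```shell
# Build the expected output and error file names from the job name and ID.
jobname="adams_spmd_4_8"   # the -N name from the submission script (hypothetical)
jobid=30997                # the ID reported by qsub (hypothetical)
outfile="${jobname}.o${jobid}"
errfile="${jobname}.e${jobid}"
echo "$outfile"   # prints adams_spmd_4_8.o30997
echo "$errfile"   # prints adams_spmd_4_8.e30997
```

A pattern like ls adams_spmd_* would then list the files from every run of that program at once.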

To view the contents of a short file, use the cat command:

   cat adams_spmd_4_8.o30997
To view the contents of a longer file, use the less command:
   less adams_spmd_4_8.o30997

If you experience difficulty getting this to work, please contact Chris Wieringa or Prof. Adams.

Congratulations! You can now run your programs on Calvin's supercomputer!

