Very few supercomputers allow their users to work interactively, because the users' programs might interfere with one another. Instead, users submit their "jobs" (the programs they want to run) to a batch queue, which holds a given job until all of the resources it needs to run become available.
There are a variety of batch queues available. Commonly used ones include Portable Batch System (PBS), Torque, Grid Engine, and Slurm. In this course, we will be using Torque, the Terascale Open-Source Resource and QUEue Manager, which is an extension of the original PBS project.
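Running a job via Torque involves three steps: creating a submission script, submitting that script to Torque, and viewing the results once the job has finished. The first step is to create a script like this one: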
    #!/bin/bash
    #
    # MPI: specify resources needed in number of nodes and processes per node
    #PBS -l nodes=4:ppn=8
    # Give the job a unique-to-you name in Torque
    #PBS -N yourUserName_spmd_4_8
    # Optional: If you want to be notified, tell Torque how/when...
    # -m accepts up to all three control flags 'a','b','e', where:
    #    a = mail is sent when the job is aborted
    #    b = mail is sent when the job begins execution
    #    e = mail is sent when the job finishes execution
    # and be sure to use your email address!
    #PBS -m abe -M yourUserName@students.calvin.edu
    # Change to the working directory and execute spmd
    cd $PBS_O_WORKDIR
    mpiexec ./spmd
Within the script, we specify the number of nodes we want to use (4), the number of processes per node (8), a unique name for the job (yourUserName_spmd_4_8, with yourUserName replaced by your user name), how we want Torque to notify us, and the name of the program we want Torque to run (spmd). These are the parameters that Torque needs to run our job, and they are the values you will change in the example script above.
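As a quick sanity check on a resource request, the total number of processes is the number of nodes times the processes per node. The sketch below works through that arithmetic for the values in the example script. The variable names are illustrative only (they are not Torque settings), and it assumes mpiexec starts one process per requested slot, as the job name spmd_4_8 suggests.

```shell
#!/bin/bash
# Illustrative values matching "#PBS -l nodes=4:ppn=8" in the example script.
# These variable names are placeholders, not Torque settings.
nodes=4
ppn=8

# Total processes requested = nodes * processes per node.
total=$((nodes * ppn))
echo "Requesting $nodes nodes x $ppn processes per node = $total processes"
```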
When you save the script, give it a descriptive name, such as run_spmd.pbs. Most people create at least one script for each project, so save it in the same directory as your program and its Makefile.
We will be running each of our programs multiple times, varying the number of processes in order to test the scalability of our programs. Whether you use a single script and change its values for each submission or write a separate script for each submission is up to you. (Since you will be doing this same procedure for each MPI project, a separate script for each submission might be faster in the long run.)
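If you prefer a separate script for each submission, one possible approach is to generate them from a template. The following is a minimal sketch, not a required part of the procedure: the function name make_script, the variables USER_NAME, PROGRAM, and OUT_DIR, and the list of configurations are all hypothetical, and the notification lines are omitted for brevity.

```shell
#!/bin/bash
# Hypothetical helper: write one Torque script per (nodes, ppn) combination.
# USER_NAME, PROGRAM, OUT_DIR, and the configurations below are placeholders.
USER_NAME="yourUserName"
PROGRAM="spmd"
OUT_DIR="${OUT_DIR:-.}"

make_script() {
    local nodes=$1 ppn=$2
    local job="${USER_NAME}_${PROGRAM}_${nodes}_${ppn}"
    local file="${OUT_DIR}/run_${PROGRAM}_${nodes}_${ppn}.pbs"
    # \$PBS_O_WORKDIR is escaped so it is expanded when the job runs,
    # not when this generator runs.
    cat > "$file" <<EOF
#!/bin/bash
#PBS -l nodes=${nodes}:ppn=${ppn}
#PBS -N ${job}
cd \$PBS_O_WORKDIR
mpiexec ./${PROGRAM}
EOF
    echo "$file"
}

for config in "1 1" "1 8" "2 8" "4 8"; do
    # Intentional word splitting: each entry is "nodes ppn".
    make_script $config
done
```

Each generated file can then be submitted on its own, and the job name inside it records which configuration it represents.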
Once you have a script created, the next step is to submit it to the Torque scheduling system. To do this, you use the qsub command:
    qsub run_spmd.pbs

This tells Torque to put our program in its job queue. Torque will look at the resources it requires (i.e., number of nodes, processes per node) and schedule it when all of those resources are available.
After you submit your job, qsub will output a line like this:
    30997.dahl-node-00.calvin.edu

The 30997 is your job's submission ID number.
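If you want to reuse the ID in later commands, you can capture the line that qsub prints and keep only the part before the first dot. This is a small sketch using the sample line above; the variable names are illustrative.

```shell
#!/bin/bash
# Sketch: keep only the numeric ID from the line qsub prints.
# In real use you would capture it with: qsub_output=$(qsub run_spmd.pbs)
qsub_output="30997.dahl-node-00.calvin.edu"   # sample line from above

# Remove everything from the first '.' onward.
job_id="${qsub_output%%.*}"
echo "$job_id"
```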
The more resources your script requires, the less likely it is that they will all be simultaneously available, and the longer it will take to get scheduled. (This keeps people from "hogging" a supercomputer's resources.)
While your job is waiting in the queue or running, you can enter the qstat command to monitor the status of your submission. For example,

    qstat 30997

will provide a status update for my submission, displaying something like this:
    Job ID                    Name             User            Time Use S Queue
    ------------------------- ---------------- --------------- -------- - -----
    30997.dahl-node-00        adams_spmd_4_8   adams           00:00:00 R batch

The fields are mostly self-explanatory, except for the S field, which shows the job's current state; the R in this example means the job is running.
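The state letters can be decoded with a small helper. The sketch below pulls the S column out of qstat-style output and prints a description. The letter meanings follow the Torque qstat manual page, the function name describe_state is hypothetical, and the code assumes the three-line layout shown in the sample.

```shell
#!/bin/bash
# Sketch: extract the S column from qstat-style output and describe it.
# Letter meanings follow the Torque qstat manual page.
describe_state() {
    case "$1" in
        Q) echo "queued" ;;
        R) echo "running" ;;
        E) echo "exiting after having run" ;;
        C) echo "completed" ;;
        H) echo "held" ;;
        W) echo "waiting for its execution time" ;;
        T) echo "being moved to a new location" ;;
        S) echo "suspended" ;;
        *) echo "unknown state: $1" ;;
    esac
}

# Sample output copied from above; in real use: qstat_output=$(qstat 30997)
qstat_output='Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
30997.dahl-node-00        adams_spmd_4_8   adams           00:00:00 R batch'

# Assumes the job is on the third line and S is the second-to-last field.
state=$(printf '%s\n' "$qstat_output" | awk 'NR == 3 { print $(NF - 1) }')
echo "Job state: $state ($(describe_state "$state"))"
```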
Entering the qstat command without an ID number should give you a list of all your jobs in the queue. The Moab showq command can also be used to see all jobs in the queue from the scheduler's point of view. This can be used to get a sense of how busy the cluster is at the present time.
If you need to remove your submission from the queue (e.g., it seems to be stuck in an infinite loop), you can do so using the qdel command:
    qdel 30997

There are many other options that can be given to these commands. See the manual pages for qsub, qstat, and qdel for more information.
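When your job has completed, Torque creates two files, one containing your program's standard output and one containing its standard error. You can see them using the ls command:

    ls

In the example above, Torque produced files named adams_spmd_4_8.o30997 and adams_spmd_4_8.e30997, respectively. As you can see, each file's name consists of three parts: the name you gave the job, a .o (output) or .e (error) marker, and the job's submission ID number.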
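To make the naming pattern concrete, here is a sketch that takes one of these file names apart with bash parameter expansion: the job name, then a single letter (o for output, e for error), then the job's ID number. The variable names are illustrative.

```shell
#!/bin/bash
# Sketch: split a Torque result file name into job name, stream, and job ID.
file="adams_spmd_4_8.o30997"

job_name="${file%.*}"     # everything before the last '.'
suffix="${file##*.}"      # everything after the last '.', e.g. "o30997"
stream="${suffix:0:1}"    # 'o' (standard output) or 'e' (standard error)
job_id="${suffix:1}"      # the remaining digits, e.g. "30997"

echo "job=$job_name stream=$stream id=$job_id"
```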
Over the course of testing your program, many of these files will be created. If you use a descriptive name for each submission in your submission script, then the file's name will tell you which submission it represents.
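Once many of these files have accumulated, it can help to see which error files actually contain something. The following is a hedged sketch of one way to do that. The function name nonempty_error_files is hypothetical, and it assumes the name.eNNNN pattern shown above.

```shell
#!/bin/bash
# Sketch: list the error files (name.eNNNN) in a directory that are not empty.
# nonempty_error_files is a hypothetical helper name.
nonempty_error_files() {
    local dir=${1:-.}
    local f
    for f in "$dir"/*.e[0-9]*; do
        # -s is true only for existing files with size > 0, so it also
        # skips the unexpanded pattern when nothing matches.
        [ -s "$f" ] && echo "${f##*/}"
    done
    return 0
}

# Check the current directory.
nonempty_error_files .
```

An empty result means none of the runs in that directory wrote anything to standard error.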
To view the contents of a short file, use the cat command:
    cat adams_spmd_4_8.o30997

To view the contents of a longer file, use the less command:

    less adams_spmd_4_8.o30997
If you experience difficulty getting this to work, please contact Chris Wieringa or Prof. Adams.
Congratulations! You can now run your programs on Calvin's supercomputer!