How do I use PBS?

PBS at TPAC

The Portable Batch System (PBS) used at TPAC varies between compute systems.

PBS Pro User Guide
PBS Pro Reference Guide

The following is intended as generic information; for more detailed information refer to the “man” pages or the user documentation above.

For kunanyi, example qsub scripts are available in /share/apps/pbs_script_examples.

Quick Syntax guide

qstat

Standard queue status command supplied by PBS. See man qstat for details of options.

Some common uses are:

  • List available queues: qstat -Q

qdel jobid

Delete your unwanted jobs from the queues. The jobid is returned by qsub at job
submission time, and is also displayed in the nqstat output.
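
For example (123456 is a placeholder job id – use the id reported for your own job –
and -u is one of the standard qstat options described in man qstat):

$ qstat -Q            # list the available queues
$ qstat -u $USER      # list only your own jobs
$ qstat 123456        # show the status of one particular job
$ qdel 123456         # remove that job from the queue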
qsub

Submit jobs to the queues. The simplest use of the qsub command is typified
by the following example (note that there is a carriage-return after ./a.out):

$ qsub -P a99 -l select=1:ncpus=28 -l walltime=20:00:00
./a.out
^D     (that is control-D)

or simply

$ qsub jobscript

where jobscript is an ASCII file containing the shell script to run your commands
(not the compiled executable, which is a binary file).
The qsub options are then placed within the script to avoid typing them for each
job, e.g.:

#!/bin/bash
#PBS -P a99
#PBS -l select=1:ncpus=28
#PBS -l walltime=20:00:00
./a.out

You may need to supply input data to the program, and you may be used to doing this
interactively when prompted by the program.

There are two ways of doing this in batch jobs.

Suppose, for example, that the program requires the numbers 1000 and then 50 to be
entered when prompted. You can either create a file called, say, input containing
these values:

$ cat input
1000
50

then run the program as

./a.out < input
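
Putting this first approach together, a complete job script might look like the
following sketch (it assumes the input file sits in the directory the job was
submitted from, which is why the -l wd directive described below is included):

#!/bin/bash
#PBS -P a99
#PBS -l select=1:ncpus=28
#PBS -l walltime=20:00:00
#PBS -l wd
./a.out < input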

or the data can be included in the batch job script as follows:

#!/bin/bash
#PBS -P a99
#PBS -l select=1:ncpus=28
#PBS -l walltime=20:00:00
#PBS -l wd
./a.out << EOF
1000
50
EOF

Notice that the PBS directives are all at the start of the script, that there are
no blank lines between them, and that there are no other non-PBS commands
until after all the PBS directives.
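
For example, in the following sketch the select directive would not be seen by PBS
because it appears after the first shell command (the directory name is just a
placeholder):

#!/bin/bash
#PBS -P a99
#PBS -l walltime=20:00:00
cd /path/to/workdir        # a non-PBS command, so PBS stops reading directives here
#PBS -l select=1:ncpus=28  # this directive will be ignored
./a.out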

qsub options of note (a combined example using several of these options follows this list):

-l select=? The number of nodes to be allocated to the job.
-q queue Select the queue to run the job in. The queues you can use are
listed by running nqstat. By default the routeq will be used, which will automatically
determine the queue based on the resources requested.
-l walltime=??:??:?? The wall clock time limit for the job. Time is expressed in seconds as
an integer, or in the form:
[[hours:]minutes:]seconds[.milliseconds]
System scheduling decisions depend heavily on the walltime request, so it is always
best to make as accurate a request as possible.
-l mem=???MB The total memory limit across all nodes for the job – can be specified
with units of “MB” or “GB”, but only integer values can be given. There
is a small default value.
Your job will only run if there is sufficient free memory, so making a
sensible memory request will allow your jobs to run sooner. A little trial
and error may be required to find how much memory your jobs are
using – nqstat lists jobs’ actual usage.
-l ncpus=? The number of cpus required for the job. The default is 1.

If the number of cpus requested, N, is small (currently
16 or less on NF systems), the job will run within a single shared-memory node.

If the number of cpus specified is greater, the job will be distributed
over multiple nodes. Currently on NF systems, these larger requests
are restricted to multiples of 16 cpus.

-l jobfs=???GB The requested job scratch space. This will reserve disk space, making it
unavailable for other jobs, so please do not overestimate your needs. Any files
created in the $PBS_JOBFS directory are automatically
removed at the end of the job. Ensure that you use integers, and
units of mb, MB, gb, or GB.
-l software=??? Specifies licensed software the job requires to run. See the software
documentation for the string to use for specific software. The string should be a
colon-separated list (no spaces) if more than one software product is used. If your
job uses licensed software and you do not specify this option (or mis-spell the
software), you will probably receive an automatically generated email from the
license shadowing daemon (lsd), and the job may be terminated. You can check the lsd
status and find out more by looking at the license status website.
-l other=??? Specifies other requirements or attributes of the job. The string should be
a colon separated list (no spaces) if more than one attribute is required.
Generally supported attributes are:

  • iobound – the job should not share a node with other IO bound jobs
  • mdss – the job requires access to the MDSS (usually via the mdss
    command). If MDSS is down, the job will not be started.
  • gdata1 – the job requires access to the /g/data1. If /g/data1 filesystem
    is down, the job will not be started.
  • pernodejobfs – the job’s jobfs resource request should be treated as a
    per node request.
    Normally the jobfs request is for total jobfs summed over all nodes allocated
    to the job (like mem). Only relevant to distributed parallel jobs using jobfs.

You may be asked to specify other options at times to support particular needs
or circumstances.

-r y Specifies your job is restartable, and if the job is executing on a node when it
crashes, the job will be requeued. Both resources used by and resource limits set for
the original job will carry over to the requeued job.
Hence a restartable job must be checkpointing such that it will still be
able to complete in the remaining walltime should it suffer a node crash. The default
is that jobs are assumed to not be restartable.
Note that regardless of the restartable status of a job, time used by
jobs on crashed nodes is charged against the project they are running under,
since the onus is on users to ensure minimum waste of resources via a
checkpointing mechanism which they must build into any particularly long-running codes.
-l wd Start the job in the directory from which it was submitted. Normally jobs
are started in the user’s home directory.
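
As an illustrative sketch only (the project code a99, the queue name, the resource
amounts and the software string are placeholders – adjust them to suit your project,
and run nqstat to see the queues actually available), a script header combining
several of the options above might look like:

#!/bin/bash
#PBS -P a99
#PBS -q normal
#PBS -l select=2:ncpus=28
#PBS -l walltime=10:00:00
#PBS -l mem=32GB
#PBS -l jobfs=10GB
#PBS -l software=matlab
#PBS -l wd
#PBS -r y
./a.out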
qps jobid – show the processes of a running job
qls jobid – list the files in a job’s jobfs directory
qcat jobid – show a running job’s stdout, stderr or script
qcp jobid – copy a file from a running job’s jobfs directory

The man pages for these commands on the system detail the various options you will probably need to use.