HPC FAQ

How do I connect to the HPC systems?

An SSH client is required to connect to all HPC systems.  For Windows users, SmarTTY, MobaXterm and PuTTY are good free clients, but there are many others.  Mac OS X users can use the built-in Terminal app.

Your account will have been enabled only for the HPC system appropriate for your project(s).

Access to Eddy and kunanyi is via jumpbox.tpac.org.au.  Once connected to jumpbox, please read the “Message of the Day” (MOTD) for cluster status and instructions on how to connect to each cluster.

Storage in HPC systems and transferring data

/u Home directory

All clusters share the same home directory (/u), so files in your cluster home directory can be accessed on any cluster.  The home directory file system is provided by a Hierarchical Storage Management (HSM) system.  As disk space gets short, the HSM automatically moves the files which have been least accessed to tape.  When such a file is accessed again it is automatically retrieved to disk.  This can result in a brief delay before the file can be accessed while the tape is found and read.

The tape system does not perform well with large numbers of files.  Please use “tar” or “zip” utilities to group together large numbers of files into a single archive.  Both utilities will conveniently work on a directory tree.
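As a minimal sketch of grouping many small files into one archive, the following runs tar on a directory tree.  The directory and archive names (run_output, run_output.tar.gz) are hypothetical, and the files are created in a temporary directory only so that the example is self-contained.

```shell
#!/bin/bash
# Work in a throwaway directory so the example does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for a results directory holding many small files (hypothetical name).
mkdir -p run_output/logs
for i in 1 2 3 4 5; do
    echo "result $i" > "run_output/logs/part_$i.txt"
done

# Pack the whole directory tree into a single compressed archive.
tar -czf run_output.tar.gz run_output

# List the archive contents to confirm every file was captured.
tar -tzf run_output.tar.gz

# Count regular files stored in the archive (directory entries end in "/").
file_count=$(tar -tzf run_output.tar.gz | grep -vc '/$')
echo "files archived: $file_count"
```

To restore the tree later, run “tar -xzf run_output.tar.gz” in the directory where you want it unpacked.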

Your jumpbox home directory is different to your cluster home directory, but you can still access your cluster home directory on jumpbox via /cluster-home/{username}.

/scratch

Each cluster has a separate /scratch filesystem on high performance infrastructure suitable for jobs to read and write data at speed.  /scratch should be considered ephemeral, i.e. files may be deleted from this file system at short notice.  These systems also provide very little data resiliency.  It is your responsibility to ensure you have a copy of critical data in your home directory.  Please consider using “tar” for this.

Transferring data into your HPC home directory

For small amounts of data (< 100GB) you can use scp-like tools, e.g. WinSCP.  SmarTTY and MobaXterm also have inbuilt tools to transfer files as well as to make ssh connections.  Transfer the data to jumpbox.tpac.org.au:/cluster-home/{username}
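From a command line, a small transfer with scp might look like the sketch below.  The username jsmith and the local directory name results are placeholders, and the script only prints the command so that it can be checked before being run for real.

```shell
#!/bin/bash
# Placeholder values: substitute your own username and local directory.
user="jsmith"
local_dir="results"

# Destination is your cluster home directory as seen from jumpbox.
dest="${user}@jumpbox.tpac.org.au:/cluster-home/${user}/"

# -r copies a directory tree recursively. The command is printed, not
# executed; run the printed line yourself to perform the transfer.
transfer_cmd="scp -r ${local_dir} ${dest}"
echo "$transfer_cmd"
```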

For large amounts of data (> 100GB) please contact helpdesk@tpac.org.au.

How do I use PBS?

PBS at TPAC

The Portable Batch System (PBS) used at TPAC varies between compute systems.

PBS Pro User Documentation

The following is intended to be generic information.  For more detailed information refer to the “man” pages or the user documentation above.

For kunanyi, example qsub scripts are available in /share/apps/pbs_script_examples

Quick Syntax guide

qstat
    Standard queue status command supplied by PBS. See man qstat for details of options.
    Some common uses are:

      • List available queues: qstat -Q

qdel jobid
    Delete your unwanted jobs from the queues. The jobid is returned by qsub at job
    submission time, and is also displayed in the nqstat output.

qsub
    Submit jobs to the queues. The simplest use of the qsub command is typified
    by the following example (note that there is a carriage-return after ./a.out):

$ qsub -P a99 -q normal -l walltime=20:00:00,mem=300MB
./a.out
^D     (that is control-D)

or simply

$ qsub jobscript

where jobscript is an ASCII file containing the shell script to run your commands
(not the compiled executable, which is a binary file).
The qsub options can then be placed within the script to avoid typing them for each
job, e.g.:

#!/bin/bash
#PBS -P a99
#PBS -q normal
#PBS -l walltime=20:00:00,mem=300MB
./a.out

You may need to enter data to the program and may be used to doing this interactively
when prompted by the program.

There are two ways of doing this in batch jobs.

If, for example, the program requires the numbers 1000 then 50 to be entered when
prompted, you can either create a file called, say, input containing these values

$ cat input
1000
50

then run the program as

./a.out < input

or the data can be included in the batch job script as follows:

#!/bin/bash
#PBS -P a99
#PBS -q normal
#PBS -l walltime=20:00:00,mem=300MB
#PBS -l wd
./a.out << EOF
1000
50
EOF

Notice that the PBS directives are all at the start of the script, that there are
no blank lines between them, and there are no other non-PBS commands
until after all the PBS directives.
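Both input methods can be tried outside the batch system before submitting a job.  In this sketch a small shell script, prompt_prog.sh (a hypothetical stand-in for ./a.out), reads two values from standard input.  It is fed first from a file and then from a here-document.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for ./a.out: reads two numbers and reports them.
cat > prompt_prog.sh << 'PROG'
#!/bin/bash
read -r steps
read -r size
echo "steps=$steps size=$size"
PROG

# Method 1: keep the answers in a file and redirect it to standard input.
printf '1000\n50\n' > input
from_file=$(bash prompt_prog.sh < input)

# Method 2: embed the answers in the script with a here-document.
from_heredoc=$(bash prompt_prog.sh << EOF
1000
50
EOF
)

echo "$from_file"
echo "$from_heredoc"
```

Both methods deliver the same lines to the program, so the choice mostly depends on whether you want the input values kept in a separate file or alongside the job script.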

qsub options of note:

-P project
    The project which you want to charge the job's resource usage to.
    The default project is specified by the PROJECT environment variable.

-q queue
    Select the queue to run the job in. The queues you can use are
    listed by running nqstat.

-l walltime=??:??:??
    The wall clock time limit for the job. Time is expressed in seconds as
    an integer, or in the form:
    [[hours:]minutes:]seconds[.milliseconds]
    System scheduling decisions depend heavily on the walltime request, so
    it is always best to make as accurate a request as possible.

-l mem=???MB
    The total memory limit across all nodes for the job. It can be specified
    with units of “MB” or “GB” but only integer values can be given. There
    is a small default value.
    Your job will only run if there is sufficient free memory, so making a
    sensible memory request will allow your jobs to run sooner. A little trial
    and error may be required to find how much memory your jobs are
    using; nqstat lists jobs' actual usage.

-l ncpus=?
    The number of cpus required for the job to run on. The default is 1.

    If the number of cpus requested (-l ncpus=N) is small (currently
    16 or less on NF systems) the job will run within a single shared memory node.

    If the number of cpus specified is greater, the job will be distributed
    over multiple nodes. Currently on NF systems, these larger requests
    are restricted to multiples of 16 cpus.

-l jobfs=???GB
    The requested job scratch space. This will reserve disk space, making it
    unavailable for other jobs, so please do not overestimate your needs.
    Any files created in the $PBS_JOBFS directory are automatically
    removed at the end of the job. Ensure that you use integers, and
    units of mb, MB, gb, or GB.

-l software=???
    Specifies licensed software the job requires to run. See the documentation
    for the software concerned for the string to use.
    The string should be a colon separated list (no spaces) if more than
    one software product is used.

    If your job uses licensed software and you do not specify this option (or
    mis-spell the software), you will probably receive an automatically generated
    email from the license shadowing daemon, and the job may be terminated.

    You can check the lsd status and find out more by looking at the license
    status website.

-l other=???
    Specifies other requirements or attributes of the job. The string should be
    a colon separated list (no spaces) if more than one attribute is required.
    Generally supported attributes are:

      • iobound – the job should not share a node with other IO bound jobs
      • mdss – the job requires access to the MDSS (usually via the mdss
        command). If MDSS is down, the job will not be started.
      • gdata1 – the job requires access to /g/data1. If the /g/data1 filesystem
        is down, the job will not be started.
      • pernodejobfs – the job’s jobfs resource request should be treated as a
        per node request.
        Normally the jobfs request is for total jobfs summed over all nodes allocated
        to the job (like mem). Only relevant to distributed parallel jobs using jobfs.

    You may be asked to specify other options at times to support particular needs
    or circumstances.

-r y
    Specifies that your job is restartable: if the job is executing on a node when it
    crashes, the job will be requeued.
    Both resources used by and resource limits set for the original job will carry
    over to the requeued job.
    Hence a restartable job must checkpoint so that it will still be
    able to complete in the remaining walltime should it suffer a node crash.

    The default is that jobs are assumed to not be restartable.
    Note that regardless of the restartable status of a job, time used by
    jobs on crashed nodes is charged against the project they are running under,
    since the onus is on users to ensure minimum waste of resources via a
    checkpointing mechanism which they must build into any particularly long running codes.

-l wd
    Start the job in the directory from which it was submitted. Normally jobs
    are started in the user's home directory.
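Since the walltime request described above is usually written as hours:minutes:seconds, a small helper can convert an estimated run time in seconds into that form.  The function name to_walltime and the 90 minute estimate are illustrative assumptions, and the qsub line is only printed, not submitted.

```shell
#!/bin/bash
# Convert a whole number of seconds into HH:MM:SS for a walltime request.
to_walltime() {
    local total=$1
    printf '%02d:%02d:%02d' $((total / 3600)) $(((total % 3600) / 60)) $((total % 60))
}

# Hypothetical estimate: the job needs about 90 minutes.
estimate_seconds=5400
walltime=$(to_walltime "$estimate_seconds")

# Show the resource request that would be passed to qsub (not executed here).
request="-l walltime=${walltime},mem=300MB"
echo "qsub $request jobscript"
```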

 

qps jobid  show the processes of a running job
qls jobid  list the files in a job’s jobfs directory
qcat jobid  show a running job’s stdout, stderr or script
qcp jobid  copy a file from a running job’s jobfs directory

 

The man pages for these commands on the system detail the various options you will probably need to use.

What applications/software is available?

On each of the compute systems you can run “module avail” which will provide a list of installed software and their versions.  See “module help” for information about how to load modules and other ways to search for modules.

Additional software and/or versions can be requested via the TPAC helpdesk or by email to helpdesk@tpac.org.au.

How do I get further assistance?

Requests to TPAC helpdesk can be submitted online via the TPAC Jira Portal or via email to helpdesk@tpac.org.au.  Through the Jira portal you will be able to track the progress of your request.

What are the basic Unix commands that one needs to know to use the Linux cluster?

How to get online help

To find a command or library routine that performs a required function, try searching by keyword e.g.

man -k keyword
or
apropos keyword

Use “man command_name” to find details on how to use a Unix command or library e.g.

man cat
man ls

If no man page is found, try “info command_name” e.g.

info module

Manipulating files and directories

  • ls
    List contents of current directory
  • cd
    Change directory
  • rm
    Remove file or directory
  • mkdir
    Make a new directory

Use the “man” command for more information on the above.

A few notes on Unix directory names.

A Unix full path name is constructed from the directory and subdirectory names separated by slashes “/”, e.g. /u/jsmith/work/file1.  When you first log in you will be in your “home” directory; at TPAC this is usually /u/username.  In most shells this can also be referenced as ~username.

For example, if your username is asmith then “cd ~asmith/work” will take you to the “work” directory in your home directory.

All Unix full path names start with “/” (there are no Drive/Volume names as in Windows).  Hence any filename starting with “/” is a full pathname.

A filename that does not start with “/” is relative to the “current working directory”; if it contains one or more slashes it refers to a file in a subdirectory of the current working directory.  The current working directory may also be referenced as dot “.”, i.e. ./subdirectory/file.

The parent of the “current working directory” may be referenced as dot-dot “..”.  For example if you have two directories in your home directory work1 and work2 and you cd to work1 you can then change to work2 by typing the command “cd ../work2”.
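The relative path rules above can be tried safely in a temporary directory.  The directory names work1 and work2 follow the example in the text; the temporary directory (a stand-in for your home directory) and the file name notes.txt are assumptions made for the sketch.

```shell
#!/bin/bash
# A temporary directory stands in for the home directory.
fake_home=$(mktemp -d)
mkdir "$fake_home/work1" "$fake_home/work2"

cd "$fake_home/work1" || exit 1
echo "started in: $(basename "$PWD")"

# ".." is the parent of the current directory, so this moves sideways to work2.
cd ../work2 || exit 1
after_dotdot=$(basename "$PWD")
echo "after cd ../work2: $after_dotdot"

# "." is the current directory, so ./notes.txt is created inside work2.
echo "hello" > ./notes.txt
ls "$fake_home/work2"
```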

 

How do I capture the STDOUT and STDERR generated by my job?

If the application you are running on a node produces output on STDOUT it is important to ensure that this output is captured to a file in your home directory.  If it isn’t redirected it will be captured to a file within the node’s local storage which has limited space.  If the local storage file system fills up it may cause the job that is running to terminate early or produce inconsistent data.

It is recommended that you use the -e and -o options when running qsub.  To ensure these are not forgotten it would be best to create a shell script to start your job as follows:

qsub_job.sh:
#!/bin/bash
#PBS -e errors.txt
#PBS -o output.txt

cmds

Alternatively the two output streams can be joined together.  The following sends the error stream to the output stream:

qsub_job.sh:
#!/bin/bash
#PBS -j oe
#PBS -o output.txt

cmds

It is also possible on kunanyi to see the output and error streams while the job is still running using jobtail.sh and jobcat.sh.

How can I monitor the progress of my PBS jobs?

Use “qstat -f” to provide information about the current status of the job.

In your job startup script you should redirect STDOUT to a file in your home directory.  This may give you information about what the job is doing depending on your application.
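Redirection inside the job script itself might look like the following sketch.  The log file names and the run_step function (a stand-in for your application) are hypothetical; on the cluster you would point the logs at a path in your home directory rather than a temporary directory.

```shell
#!/bin/bash
# Directory for the logs; a temporary one is used here so the sketch runs anywhere.
log_dir=$(mktemp -d)

# Hypothetical stand-in for the real application: writes to both streams.
run_step() {
    echo "progress: step $1 complete"
    echo "warning: step $1 was slow" >&2
}

# Send STDOUT and STDERR to separate files (append so earlier output is kept).
run_step 1 >> "$log_dir/output.txt" 2>> "$log_dir/errors.txt"

# Or send both streams to one file, similar in effect to "#PBS -j oe".
run_step 2 > "$log_dir/combined.txt" 2>&1

cat "$log_dir/output.txt"
```

While the job is running, “tail -f” on the output file from a login session shows new lines as they are written.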

 

How do I specify memory allocation in the PBS script?

#PBS -l mem=600mb

Because of the unique architecture of Eddy, cpus and memory are linked.  If you specify both it is possible to end up with a job that can’t be run and will remain queued until deleted.  When submitting jobs on Eddy, specify memory or cpus but not both.

On kunanyi each node has 128GB of RAM and 28 CPUs.  If you are using whole nodes then there is no need to specify memory.  If you are using a portion of a node then specifying memory accurately will allow the scheduler to find a suitable node to run your jobs more easily.
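For a job that uses only part of a kunanyi node, one reasonable starting point (an assumption, not a stated site policy) is to request memory in proportion to the cpus requested, using the 128GB and 28 CPUs per node quoted above.  The 7 cpu job size is a made-up example, and the directive is only printed so that it can be copied into a job script and adjusted to the job's measured usage.

```shell
#!/bin/bash
# Node size on kunanyi as stated in this FAQ.
node_mem_gb=128
node_cpus=28

# Hypothetical job size: a quarter of a node.
ncpus=7

# Proportional share of node memory, rounded down because mem takes integers.
mem_gb=$(( ncpus * node_mem_gb / node_cpus ))

# Print the directive that would go in the job script.
directive="#PBS -l select=1:ncpus=${ncpus}:mem=${mem_gb}gb"
echo "$directive"
```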

 

Running array jobs on kunanyi gives invalid option '-t'

When specifying an array request to qsub please be aware of the following.  The version of PBS running on kunanyi has replaced the “-t” option with “-J”.  This change has not been reflected in the qsub man pages which still refer to “-t” but qsub will complain with invalid option ‘-t’ if used.
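A minimal array job script using “-J” might look like the sketch below.  The index range 1-10, the resource request and the input file naming are assumptions for illustration.  PBS_ARRAY_INDEX is the variable PBS Pro sets for each sub-job, and the fallback value lets the script be tested outside the batch system.

```shell
#!/bin/bash
#PBS -J 1-10
#PBS -l select=1:ncpus=1
#PBS -j oe

# PBS Pro sets PBS_ARRAY_INDEX for each sub-job; default to 1 when run by hand.
index="${PBS_ARRAY_INDEX:-1}"

# Hypothetical naming scheme: one input file per sub-job.
input_file="input_${index}.dat"

echo "sub-job ${index} would process ${input_file}"
```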

How do I run StarCCM/Matlab/Ansys interactively?

StarCCM, Matlab and Ansys all provide an X-Windows GUI.  To use these applications interactively you will need to run an application on your desktop that provides an X-Server, e.g. SmarTTY, MobaXterm, Mac OS (with XQuartz) or any Linux desktop.

With MacOS start a Terminal from within the XQuartz application.

If using SmarTTY or MobaXterm, connect to jumpbox.tpac.org.au.  The X-Server is enabled by default; if not, please refer to the help for the respective application.  If using a Linux or Mac OS X terminal, run:

ssh -Y jumpbox.tpac.org.au

The -Y option enables forwarding of X sessions from the remote server.

On jumpbox connect to the kunanyi login node using:

ssh -Y kunanyi

No applications other than compiling software should be run on the login node.  If they are, all your login sessions will be terminated.  You must submit a job to the cluster first using qsub.  An example session is as follows:

[kunanyi-01]~% qsub -I -X -l select=1:ncpus=28
qsub: waiting for job 14237.kunanyi-ohpc.tpac.org.au to start
qsub: job 14237.kunanyi-ohpc.tpac.org.au ready

localhost:50.0
[n209]~%

The above will allocate all 28 cpus on one node to your job.  You can also use “select=2:ncpus=28” to request 28 cpus on each of 2 nodes, for a total of 56 cpus, etc.  Run “man qsub” on the login node for more information about the qsub options.  After running the qsub command you will be in a shell running on a compute node.  The prompt will change to include “nXXX” where XXX is in the range 002 to 256.  You can now load the environment module for the application you wish to run e.g.

module load starccm+
module load matlab
module load ansys

The above will load the most recent version that is available on the cluster.  To see which versions are available run:

module avail starccm+
module avail matlab
module avail ansys

Then use the module load command with the full name including the version as shown in the module avail listing.