HPC FAQ

How do I use PBS?

PBS at TPAC

The Portable Batch System (PBS) implementations used at TPAC vary between compute systems.  Vortex and Katabatic use the Maui/Torque package, while Eddy and Kunanyi use the open-source PBS Pro.  The following is intended to be generic across both PBS types.

Quick Syntax guide

qstat – Standard queue status command supplied by PBS. See man qstat for details of options. (But see the local nqstat command below.)

nqstat – Local version of qstat. The queue header of nqstat gives the limit on wall clock time and memory for you and your project. The fields in the job lines are fairly straightforward.

qdel jobid – Delete your unwanted jobs from the queues. The jobid is returned by qsub at job submission time, and is also displayed in the nqstat output.

qsub – Submit jobs to the queues. The simplest use of the qsub command is typified by the following example (note that there is a carriage-return after -l wd and ./a.out):

 

$ qsub -P a99 -q normal -l walltime=20:00:00,mem=300MB -l wd
./a.out
^D     (that is control-D)

or

$ qsub -P a99 -q normal -l walltime=20:00:00,mem=300MB -l wd jobscript

where jobscript is an ASCII file containing the shell script that runs your commands (not the compiled executable, which is a binary file).
More conveniently, the qsub options can be placed within the script to avoid typing them for each job:

#!/bin/bash
#PBS -P a99
#PBS -q normal
#PBS -l walltime=20:00:00,mem=300MB
#PBS -l wd
./a.out

You submit this script for execution by PBS using the command:

$ qsub jobscript
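
qsub prints the jobid of the new job, which the other commands then accept. For example (the jobid shown here is illustrative; its exact form depends on the local PBS server):

$ qsub jobscript
1234.pbsserver
$ qdel 1234     (delete the job if it is no longer wanted)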

You may need to supply input data to the program, and you may be used to doing this interactively when prompted by the program.

There are two ways of doing this in batch jobs.

Suppose, for example, the program requires the numbers 1000 then 50 to be entered when prompted. You can either create a file called, say, input, containing these values:

$ cat input
1000
50

then run the program as

./a.out < input
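
Putting this together, a complete job script using such an input file might look like the following (a minimal sketch, reusing the a99 project and a.out executable from the earlier examples):

#!/bin/bash
#PBS -P a99
#PBS -q normal
#PBS -l walltime=20:00:00,mem=300MB
#PBS -l wd
./a.out < input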

or the data can be included in the batch job script as follows:

#!/bin/bash
#PBS -P a99
#PBS -q normal
#PBS -l walltime=20:00:00,mem=300MB
#PBS -l wd
./a.out << EOF
1000
50
EOF

Notice that the PBS directives are all at the start of the script, that there are no blank lines between them, and that no non-PBS commands appear until after all the PBS directives.

qsub options of note:

-P project The project to which you want to charge the job’s resource usage. The default project is specified by the PROJECT environment variable.
-q queue Select the queue to run the job in. The queues you can use are listed by running nqstat.
-l walltime=??:??:?? The wall clock time limit for the job. Time is expressed in seconds as an integer, or in the form:
[[hours:]minutes:]seconds[.milliseconds]
System scheduling decisions depend heavily on the walltime request – it is always best to make as accurate a request as possible.
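
For example, the following requests illustrate the accepted forms (the first two are equivalent):

#PBS -l walltime=72000        20 hours, expressed in seconds
#PBS -l walltime=20:00:00     the same 20 hours
#PBS -l walltime=30:00        30 minutes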
-l mem=???MB The total memory limit across all nodes for the job – can be specified with units of “MB” or “GB”, but only integer values can be given. There is a small default value.
Your job will only run if there is sufficient free memory, so making a sensible memory request will allow your jobs to run sooner. A little trial and error may be required to find how much memory your jobs are using – nqstat lists jobs’ actual usage.
-l ncpus=N The number of CPUs required for the job. The default is 1.

If the number of CPUs requested, N, is small (currently 16 or less on NF systems), the job will run within a single shared-memory node.

If the number of CPUs specified is greater, the job will be distributed over multiple nodes. Currently on NF systems, these larger requests are restricted to multiples of 16 CPUs.

-l jobfs=???GB The requested job scratch space. This will reserve disk space, making it unavailable for other jobs, so please do not overestimate your needs.

Any files created in the $PBS_JOBFS directory are automatically removed at the end of the job. Ensure that you use integers, and units of mb, MB, gb, or GB.

-l software=??? Specifies licensed software the job requires to run. See the software documentation for the string to use for specific software.

The string should be a colon separated list (no spaces) if more than one software product is used.

If your job uses licensed software and you do not specify this option (or misspell the software string), you will probably receive an automatically generated email from the license shadowing daemon, and the job may be terminated.

You can check the license shadowing daemon (lsd) status and find out more by looking at the license status website.
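
For example, a job using two licensed products might request (the product strings here are illustrative – check the software documentation for the exact strings to use):

#PBS -l software=matlab:idl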

-l other=??? Specifies other requirements or attributes of the job. The string should be a colon separated list (no spaces) if more than one attribute is required. Generally supported attributes are:

  • iobound – the job should not share a node with other IO bound jobs
  • mdss – the job requires access to the MDSS (usually via the mdss command). If MDSS is down, the job will not be started.
  • gdata1 – the job requires access to /g/data1. If the /g/data1 filesystem is down, the job will not be started.
  • pernodejobfs – the job’s jobfs resource request should be treated as a per node request.
    Normally the jobfs request is for total jobfs summed over all nodes allocated to the job (like mem). Only relevant to distributed parallel jobs using jobfs.

You may be asked to specify other options at times to support particular needs or circumstances.

-r y Specifies that your job is restartable: if the job is executing on a node that crashes, the job will be requeued.

Both resources used by and resource limits set for the original job will carry over to the requeued job.
Hence a restartable job must checkpoint its work so that it can still complete in the remaining walltime should it suffer a node crash.

The default is that jobs are assumed not to be restartable.
Note that regardless of the restartable status of a job, time used by jobs on crashed nodes is charged against the project they are running under,
since the onus is on users to minimise wasted resources via a checkpointing mechanism, which they must build into any particularly long-running codes.

-l wd Start the job in the directory from which it was submitted. Normally jobs are started in the user’s home directory.
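
Putting several of these options together, a complete job script might look like the following (a minimal sketch – the project, queue, resource values and software string are all illustrative):

#!/bin/bash
#PBS -P a99
#PBS -q normal
#PBS -l walltime=10:00:00,mem=2GB,ncpus=16
#PBS -l jobfs=5GB
#PBS -l software=matlab
#PBS -l other=mdss
#PBS -l wd
./a.out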

 

qps jobid – show the processes of a running job
qls jobid – list the files in a job’s jobfs directory
qcat jobid – show a running job’s stdout, stderr or script
qcp jobid – copy a file from a running job’s jobfs directory
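
Each of these takes the jobid reported by qsub, for example (the jobid shown is illustrative):

$ qps 1234     (show the processes of job 1234)
$ qls 1234     (list the files in job 1234’s jobfs directory)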

 

The man pages for these commands on the system detail the various options you will probably need to use.

How do I connect to the HPC systems?

An SSH client is required to connect to all HPC systems.  For Windows users the PuTTY client is a good free option, but there are many others.  Mac OS X users can use the built-in Terminal app.
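
For example, from a terminal on Mac OS X or Linux you might connect with (the username and hostname below are placeholders – substitute your own account name and the hostname you have been given for your system):

$ ssh username@kunanyi.tpac.org.au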

Your account will have been enabled only for the HPC system appropriate for your project(s).

How do I get further assistance?

Requests to the TPAC helpdesk can be submitted online via the TPAC Jira Portal or via email to helpdesk@tpac.org.au.  Through the Jira portal you will be able to track the progress of your request.

What are the basic Unix commands that one needs to know to use the Linux cluster?

How to get online help

To find a command or library routine that performs a required function, try searching by keyword e.g.

man -k keyword
or
apropos keyword

Use “man command_name” to find details on how to use a Unix command or library, e.g.

man cat
man ls

If no man page is found, try “info command_name” e.g.

info module

Manipulating files and directories

  • ls
    List contents of current directory
  • cd
    Change directory
  • rm
    Remove a file or directory
  • mkdir
    Make a new directory

Use the “man” command for more information on the above.

A few notes on Unix directory names.

A Unix full path name is constructed of the directory and subdirectory names separated by slashes “/”, e.g. /u/jsmith/work/file1.  When you first log in you will be in your “home” directory; at TPAC this is usually /u/username.  In most shells this can also be referenced as ~username.

For example, if your username is asmith then “cd ~asmith/work” will take you to the “work” directory in your home directory.

All Unix full path names start with “/” (there are no drive/volume names as in Windows), hence any filename starting with “/” is a full pathname.

A filename that does not start with “/” refers to a file relative to the “current working directory”.  The current working directory may also be referenced as dot “.”, i.e. ./subdirectory/file.

The parent of the “current working directory” may be referenced as dot-dot “..”.  For example, if you have two directories work1 and work2 in your home directory and you cd to work1, you can then change to work2 by typing the command “cd ../work2”.
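
For example, a short session using these ideas (assuming the asmith account from above, with directories work1 and work2 under the home directory):

$ cd ~asmith/work1
$ cd ../work2
$ pwd
/u/asmith/work2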

 

What application/software is available?

On each of the compute systems you can run “module avail”, which will provide a list of installed software and their versions.  Additional software and/or versions can be requested via the TPAC helpdesk or email to helpdesk@tpac.org.au.
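
For example (the package name below is illustrative – “module avail” shows what is actually installed):

$ module avail              (list all installed software)
$ module load netcdf        (add an installed package to your environment)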

How do I capture the STDOUT and STDERR generated by my job?

If the application you are running on a node produces output on STDOUT, it is important to ensure that this output is captured to a file in your home directory.  If it isn’t redirected, it will be captured to a file within the node’s local storage, which has limited space.  If the local storage file system fills up, it may cause the running job to terminate early or produce inconsistent data.

It is recommended that you use the -e and -o options when running qsub.  To ensure these are not forgotten, it is best to create a shell script to start your job as follows:

start_job.sh:
#!/bin/bash

qsub -e errors.txt -o output.txt pbs_job.sh

 

Alternatively the output can be redirected in pbs_job.sh using the standard STDOUT and STDERR redirection operators.  Refer to “What are the shells control and redirection operators” for more information.
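
For example, inside pbs_job.sh the two streams could be redirected directly (a minimal sketch):

./a.out > $HOME/output.txt 2> $HOME/errors.txt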

 

How can I monitor the progress of my PBS jobs?

Use “qstat -f jobid” to display detailed information about the current status of the job.

In your job startup script you should redirect STDOUT to a file in your home directory.  This may give you information about what the job is doing, depending on your application.
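
For example (the jobid and output file name are illustrative):

$ qstat -f 1234             (detailed status of job 1234)
$ tail -f ~/output.txt      (follow the job’s redirected STDOUT as it runs)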

 

How do I specify memory allocation in the PBS script?

The queuing system now enforces memory limits on jobs.  If you do not specify a “mem” limit for your job, you will receive the default limit of 600mb. This corresponds to:

#PBS -l mem=600mb

If your job uses more memory per thread than this and you do not explicitly ask for more, your job will be killed by the PBS scheduler.  If your job uses less than this and you do not specify a limit, you will simply be telling PBS that your job needs more memory than it actually uses, which will make that memory unavailable to other jobs.  It is best if your mem setting accurately reflects your job’s actual memory requirements.