pyissm.model.classes.cluster
Cluster classes for ISSM.
Classes
gadi | Gadi HPC cluster interface for ISSM job submission and management.
generic | Generic cluster class for ISSM.
- class pyissm.model.classes.cluster.gadi(config_file=None, other=None)
Bases: manage_state
Gadi HPC cluster interface for ISSM job submission and management.
This class represents the Gadi HPC cluster at the National Computational Infrastructure (NCI) and provides methods for configuring cluster parameters, building PBS queue scripts, and managing job submission and file transfers.
The Gadi cluster uses PBS Pro for job scheduling and supports parallel execution via MPI. Configuration can be provided via YAML config files or programmatically through object attributes.
- Parameters:
config_file (str, optional) – Path to YAML configuration file containing cluster parameters. If provided, will override default parameters with values from the file.
other (object, optional) – Another cluster object to inherit matching fields from.
- name
Hostname of the cluster. Defaults to ‘gadi.nci.org.au’ if not on Gadi.
- Type:
str
- login
Login username for the cluster. Must be provided for cluster access.
- Type:
str
- np
Number of processors to use for job execution. Default is 16.
- Type:
int
- memory
Memory per node in GB. Default is 40.
- Type:
int
- port
SSH port number for cluster connection. Default is 0.
- Type:
int
- queue
PBS queue name. Options include ‘normal’, ‘express’, ‘hugemem’. Default is ‘normal’.
- Type:
str
- time
Walltime limit for job execution in minutes. Default is 60.
- Type:
int
- codepath
Path to the ISSM executable directory (e.g., $ISSM_DIR/bin). Must be provided.
- Type:
str
- executionpath
Path to the execution/working directory on the cluster. Must be provided.
- Type:
str
- project
NCI project code for job submission. Must be provided.
- Type:
str
- storage
Storage paths to access (e.g., ‘gdata/XXX+scratch/XXX’). Must be provided.
- Type:
str
- moduleload
List of module load commands needed for PBS job execution.
- Type:
list of str
- moduleuse
List of module use commands to specify module paths.
- Type:
list of str
Notes
All required attributes (login, codepath, executionpath, project, storage) must be set either via configuration file or programmatically before building and launching jobs. The moduleload and moduleuse lists must have equal length.
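The two rules in this note can be checked before a job is built. The following is a minimal sketch, not pyissm's own validation: `GadiSettings` and `missing_settings` are hypothetical stand-ins that mirror only the attribute names documented here, and treating an empty string as "not set" is an assumption.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical stand-in for the documented attributes; not the pyissm class.
@dataclass
class GadiSettings:
    login: str = ""
    codepath: str = ""
    executionpath: str = ""
    project: str = ""
    storage: str = ""
    moduleload: List[str] = field(default_factory=list)
    moduleuse: List[str] = field(default_factory=list)

# Attribute names the note lists as required.
REQUIRED = ("login", "codepath", "executionpath", "project", "storage")

def missing_settings(settings: GadiSettings) -> List[str]:
    """Return readable problems; an empty list means the settings pass."""
    problems = [f"{name} is not set" for name in REQUIRED
                if not getattr(settings, name)]
    if len(settings.moduleload) != len(settings.moduleuse):
        problems.append("moduleload and moduleuse must have equal length")
    return problems
```

Calling `missing_settings(GadiSettings())` reports all five required attributes as unset, which is a cheap check to run locally before anything is uploaded.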
Queue specifications:
normal: 48 hours on up to 3072 cores
express: 2 hours on up to 960 cores
hugemem: 48 hours on up to 3072 cores
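Because the time attribute is given in minutes while the queue limits above are given in hours, a request is easy to get wrong. The sketch below compares a request against the listed limits. `QUEUE_LIMITS` and `check_queue_request` are hypothetical names, and whether pyissm itself enforces these limits is not stated here.

```python
# Limits copied from the queue specifications above: (max hours, max cores).
QUEUE_LIMITS = {
    "normal": (48, 3072),
    "express": (2, 960),
    "hugemem": (48, 3072),
}

def check_queue_request(queue: str, time_minutes: int, np: int) -> list:
    """Return a list of problems with the request; empty if it fits the queue."""
    if queue not in QUEUE_LIMITS:
        return [f"unknown queue: {queue}"]
    max_hours, max_cores = QUEUE_LIMITS[queue]
    problems = []
    # The time attribute is in minutes, the queue limit in hours.
    if time_minutes > max_hours * 60:
        problems.append(f"time exceeds {max_hours} hours on {queue}")
    if np > max_cores:
        problems.append(f"np exceeds {max_cores} cores on {queue}")
    return problems
```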
Examples
>>> cluster = gadi(config_file='gadi_config.yaml')
>>> cluster.np = 32
>>> cluster.queue = 'express'
- build_queue_script(dir_name, model_name, solution, io_gather, is_valgrind, is_gprof, is_dakota, is_ocean_coupling, executable='issm.exe')
Generate a PBS queue submission script for running ISSM models on the Gadi cluster. The script includes resource specifications, module loading, and execution commands.
- Parameters:
dir_name (str) – Directory name where the model execution files are stored.
model_name (str) – Name of the model, used for output file naming.
solution (str) – Solution type or identifier to pass to the executable.
io_gather (bool) – If True, output files are pre-gathered. If False, output binary files are concatenated after execution.
is_valgrind (bool) – If True, raises NotImplementedError as Valgrind is not supported.
is_gprof (bool) – If True, raises NotImplementedError as gprof is not supported.
is_dakota (bool) – If True, raises NotImplementedError as DAKOTA is not supported.
is_ocean_coupling (bool) – If True, raises NotImplementedError as ocean coupling is not supported.
executable (str, optional) – Name of the executable to run. Default is ‘issm.exe’.
- Raises:
IOError – If Python wrappers are not installed.
NotImplementedError – If any of the unsupported features (DAKOTA, ocean coupling, Valgrind, gprof) are requested.
- Returns:
Writes a queue script file named ‘{model_name}.queue’ to the current directory.
- Return type:
None
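To give a feel for the resource section of such a script, here is a rough sketch that turns the documented attributes into PBS directives. It is not the script pyissm writes: `walltime_from_minutes` and `pbs_header` are hypothetical helpers, the choice and order of directives are assumptions, and the memory request, module commands, and execution line are left out because their exact form is not documented here.

```python
def walltime_from_minutes(minutes: int) -> str:
    """Format a minute count as an HH:MM:SS walltime string."""
    hours, mins = divmod(minutes, 60)
    return f"{hours:02d}:{mins:02d}:00"

def pbs_header(queue: str, project: str, np: int,
               time_minutes: int, storage: str) -> str:
    """Build an illustrative PBS directive block from cluster attributes."""
    lines = [
        "#!/bin/bash",
        f"#PBS -q {queue}",        # queue attribute
        f"#PBS -P {project}",      # NCI project code
        f"#PBS -l ncpus={np}",     # np attribute
        f"#PBS -l walltime={walltime_from_minutes(time_minutes)}",
        f"#PBS -l storage={storage}",
    ]
    return "\n".join(lines) + "\n"
```

With the default time of 60 minutes, `walltime_from_minutes(60)` gives `01:00:00`.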
- check_consistency(md, solution, analyses)
Check consistency of the [cluster.gadi] parameters.
- Parameters:
md (pyissm.model.Model) – The model object to check.
solution (str) – The solution name to check.
analyses (list of str) – List of analyses to check consistency for.
- Returns:
md – The model object with any consistency errors noted.
- Return type:
pyissm.model.Model
- download(dir_name, file_list)
Download files from a remote cluster to the local machine.
- Parameters:
dir_name (str) – The name of the directory on the remote cluster containing the files to download.
file_list (list of str) – A list of filenames to download from the remote cluster directory.
- Return type:
None
- launch_queue_job(model_name, dir_name, restart=None, batch=False)
Launch a job on the Gadi cluster queue system.
This method submits a job to the Gadi PBS queue system. It handles both fresh job submissions and job restarts, with optional batch processing mode.
- Parameters:
model_name (str) – Name of the model to be executed on the cluster.
dir_name (str) – Name of the directory where the job will be executed.
restart (bool or None, optional) – If not None, indicates this is a restart of an existing job. When restarting, the method assumes the job directory already exists and only submits the queue script via qsub. Default is None.
batch (bool, optional) – Flag indicating whether to run in batch mode. Currently unused for Gadi cluster but maintained for interface compatibility. Default is False.
- Return type:
None
Notes
The method performs different operations based on the restart parameter:
- If restart is not None: Changes to the existing execution directory and submits the queue script using qsub.
- If restart is None: Removes any existing directory, creates a new one, moves and extracts the tar.gz file, then submits the queue script using qsub.
The job is launched via SSH connection to the cluster using the cluster’s name, login credentials, and port configuration.
Examples
# Launch a new job:
>>> cluster.launch_queue_job('simulation_01', 'run_dir')
# Restart an existing job:
>>> cluster.launch_queue_job('simulation_01', 'run_dir', restart=True)
- upload_queue_job(model_name, dir_name, file_list)
Upload job files to the cluster queue system.
This method compresses the specified files into a tar.gz archive and transfers it to the cluster using SCP. If running in interactive mode, also includes error and output log files in the archive.
- Parameters:
model_name (str) – Name of the model, used for naming log files in interactive mode (not used here).
dir_name (str) – Name of the directory/archive to be created (without extension).
file_list (list of str) – List of file paths to be included in the compressed archive.
Notes
The function creates a tar.gz archive with the name {dir_name}.tar.gz containing all files in file_list. The compressed archive is then transferred to the cluster using the cluster’s configured connection parameters (name, execution path, login, and port).
- class pyissm.model.classes.cluster.generic(config_file=None, other=None)
Bases: manage_state
Generic cluster class for ISSM.
This class provides a generic interface for managing cluster configurations and job execution in the ISSM framework. It handles cluster parameters, queue script generation, job submission, and result retrieval.
- Parameters:
config_file (str, optional) – Path to YAML configuration file containing cluster parameters. If provided, will override default parameters with values from the file.
other (object, optional) – Another cluster object to inherit matching fields from.
- name
Name of the cluster (defaults to hostname).
- Type:
str
- login
Login username for the cluster (defaults to current username).
- Type:
str
- np
Number of processors to use (default: 1).
- Type:
int
- port
Port number for connections (default: 0).
- Type:
int
- interactive
Interactive mode flag (default: 1).
- Type:
int
- codepath
Path to the ISSM executables directory (default: $ISSM_DIR/bin).
- Type:
str
- executionpath
Path to the execution directory on the cluster (default: $ISSM_DIR/execution).
- Type:
str
- valgrind
Path to valgrind executable for memory debugging (default: $ISSM_DIR/externalpackages/valgrind/bin/valgrind).
- Type:
str
- valgrindlib
Path to valgrind MPI debug library (default: $ISSM_DIR/externalpackages/valgrind/install/lib/libmpidebug.so).
- Type:
str
- valgrindsup
List of valgrind suppression files (default: $ISSM_DIR/externalpackages/valgrind/issm.supp).
- Type:
list of str
- verbose
Verbose output flag (default: 1).
- Type:
int
- shell
Shell to use for command execution (default: ‘/bin/sh’).
- Type:
str
Notes
Configuration parameters can be overridden via YAML configuration files or by inheriting from other cluster objects.
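The override behaviour described in this note can be pictured with a small sketch. It makes several assumptions: a plain dict stands in for the parsed YAML file, `apply_overrides` and `inherit_matching` are hypothetical helpers, and silently skipping unknown keys is a guess, since how pyissm treats them is not documented here.

```python
from types import SimpleNamespace

def apply_overrides(target, values: dict) -> list:
    """Copy entries of `values` onto `target` where the attribute already exists.

    Returns the names that were skipped because `target` has no such field.
    """
    ignored = []
    for key, value in values.items():
        if hasattr(target, key):
            setattr(target, key, value)
        else:
            ignored.append(key)
    return ignored

def inherit_matching(target, other) -> None:
    """Copy fields from `other` whose names also exist on `target`."""
    apply_overrides(target, vars(other))

# Usage with a stand-in object holding a few of the documented defaults.
cluster = SimpleNamespace(name="localhost", np=1, port=0)
skipped = apply_overrides(cluster, {"np": 4, "queue": "normal"})
```

Here `cluster.np` becomes 4, while `queue` is reported in `skipped` because the stand-in object has no such field.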
Examples
>>> cluster = generic()
>>> cluster.np = 4
>>> cluster.name = 'my_cluster'
>>> cluster = generic(config_file='cluster_config.yaml')
- build_kriging_queue_script(model_name, solution, io_gather, is_valgrind, is_gprof, executable='kriging.exe')
Build a queue script for executing kriging models on the cluster.
This method generates platform-specific execution scripts (bash for Linux/Mac, batch for Windows) that handle kriging model execution with various configurations including MPI, debugging tools, and profiling.
- Parameters:
model_name (str) – Name of the kriging model to execute.
solution (str) – Solution type or configuration parameter.
io_gather (bool) – Flag indicating whether to gather I/O operations. If False, output files will be concatenated.
is_valgrind (bool) – Flag to enable Valgrind memory debugging tool execution.
is_gprof (bool) – Flag to enable gprof profiling tool execution.
executable (str, optional) – Name of the executable file to run. Default is ‘kriging.exe’.
- Raises:
IOError – If Python wrappers are not installed.
Notes
On Linux/Mac systems, creates a ‘.queue’ bash script
On Windows systems, creates a ‘.bat’ batch script
Automatically handles MPI execution for kriging operations
In interactive mode, creates empty error and output log files
Supports memory debugging with Valgrind and profiling with gprof
Specifically designed for kriging executable execution
- build_queue_script(dir_name, model_name, solution, io_gather, is_valgrind, is_gprof, is_dakota, is_ocean_coupling, executable='issm.exe')
Build a queue script for executing ISSM models on the cluster.
This method generates platform-specific execution scripts (bash for Linux/Mac, batch for Windows) that handle model execution with various configurations including MPI, debugging tools, and specialized executables.
- Parameters:
dir_name (str) – Directory name where the model files are located.
model_name (str) – Name of the model to execute.
solution (str) – Solution type or configuration parameter.
io_gather (bool) – Flag indicating whether to gather I/O operations. If False, output files will be concatenated.
is_valgrind (bool) – Flag to enable Valgrind memory debugging tool execution.
is_gprof (bool) – Flag to enable gprof profiling tool execution.
is_dakota (bool) – Flag to use DAKOTA optimization executable.
is_ocean_coupling (bool) – Flag to use ocean coupling executable.
executable (str, optional) – Name of the executable file to run. Default is ‘issm.exe’.
- Raises:
IOError – If Python wrappers are not installed or if DAKOTA support is requested but not available in the ISSM build.
Notes
On Linux/Mac systems, creates a ‘.queue’ bash script
On Windows systems, creates a ‘.bat’ batch script
Automatically handles MPI execution when available
In interactive mode, creates empty error and output log files
Supports various debugging and profiling tools integration
Handles different executable types based on coupling requirements
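The platform split in these notes amounts to choosing a script suffix. A minimal sketch follows, assuming the script is named after the model (as the gadi class documents for its ‘{model_name}.queue’ file). `queue_script_name` is a hypothetical helper, and how pyissm actually detects the platform is not stated here.

```python
import platform

def queue_script_name(model_name: str, system: str = "") -> str:
    """Return the script filename for a platform name.

    `system` takes values such as those returned by platform.system()
    ("Linux", "Darwin", "Windows"); when empty, the current platform is used.
    """
    system = system or platform.system()
    suffix = ".bat" if system == "Windows" else ".queue"
    return model_name + suffix
```

Passing the platform name explicitly keeps the choice testable from any machine.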
- check_consistency(md, solution, analyses)
Check consistency of the [cluster.generic] parameters.
- Parameters:
md (pyissm.model.Model) – The model object to check.
solution (str) – The solution name to check.
analyses (list of str) – List of analyses to check consistency for.
- Returns:
md – The model object with any consistency errors noted.
- Return type:
pyissm.model.Model
- download(dir_name, file_list)
Download files from a remote cluster to the local machine.
This method retrieves specified files from a remote cluster directory to the current local directory. On Windows systems, this operation is skipped as it’s not supported.
- Parameters:
dir_name (str) – The name of the directory on the remote cluster containing the files to download.
file_list (list of str) – A list of filenames to download from the remote cluster directory.
- Return type:
None
Notes
- This method does nothing on Windows platforms and returns immediately.
- Files are copied from the cluster’s execution path combined with the specified directory name.
- The actual file transfer is handled by the model.io.issm_scp_in function.
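The source location of each file follows from the notes: execution path, then directory name, then filename. The sketch below only composes those paths and performs no transfer. `remote_paths` is a hypothetical helper, and `posixpath` is used on the assumption that the cluster uses POSIX paths whatever the local operating system is.

```python
import posixpath

def remote_paths(executionpath: str, dir_name: str, file_list: list) -> list:
    """Join the execution path, run directory, and each filename with '/'."""
    return [posixpath.join(executionpath, dir_name, name) for name in file_list]
```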
- launch_queue_job(model_name, dir_name, restart=None, batch=False)
Launch a job on the cluster queue system.
This method builds and executes the appropriate launch command for submitting a job to the cluster’s queue system. It handles both fresh job submissions and job restarts, with optional batch processing mode.
- Parameters:
model_name (str) – Name of the model to be executed on the cluster.
dir_name (str) – Name of the directory where the job will be executed.
restart (bool or None, optional) – If not None, indicates this is a restart of an existing job. When restarting, the method assumes the job directory already exists and only executes the queue script. Default is None.
batch (bool, optional) – Flag indicating whether to run in batch mode. When True, only extracts the tar.gz file without executing the queue script. When False (default), extracts and immediately executes the job. Only relevant when restart is None.
Notes
The method performs different operations based on the restart parameter:
- If restart is not None: Changes to the execution directory and runs the existing queue script.
- If restart is None: Removes any existing directory, creates a new one, moves and extracts the tar.gz file, and optionally runs the queue script depending on the batch parameter.
The job is launched via SSH connection to the cluster using the cluster’s name, login credentials, and port configuration.
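The branching described in these notes can be written out as an ordered list of remote shell steps. This is an illustrative sketch only: `launch_steps` is a hypothetical helper, the exact shell commands and the location of the uploaded archive are assumptions, and the real method sends its command over SSH rather than returning a list.

```python
import posixpath

def launch_steps(executionpath: str, model_name: str, dir_name: str,
                 restart=None, batch: bool = False) -> list:
    """List, in order, the remote steps for a fresh launch or a restart."""
    run_script = f"./{model_name}.queue"
    if restart is not None:
        # Restart: the run directory is assumed to exist already.
        return [f"cd {posixpath.join(executionpath, dir_name)}", run_script]
    steps = [
        f"cd {executionpath}",
        f"rm -rf {dir_name}",                 # remove any existing directory
        f"mkdir {dir_name}",                  # create a new one
        f"mv {dir_name}.tar.gz {dir_name}/",  # move the uploaded archive in
        f"cd {dir_name}",
        f"tar -zxf {dir_name}.tar.gz",        # extract it
    ]
    if not batch:
        steps.append(run_script)              # batch mode stops after extraction
    return steps
```

Because the documented test is "not None", passing `restart=False` takes the restart branch in this sketch; leaving the argument out is the way to request a fresh launch.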
Examples
# Launch a new job:
>>> cluster.launch_queue_job('simulation_01', 'run_dir')
# Restart an existing job:
>>> cluster.launch_queue_job('simulation_01', 'run_dir', restart=True)
# Launch in batch mode (extract only, no execution):
>>> cluster.launch_queue_job('simulation_01', 'run_dir', batch=True)
- upload_queue_job(model_name, dir_name, file_list)
Upload job files to the cluster queue system.
Compresses the specified files into a tar.gz archive and transfers it to the cluster using SCP. If running in interactive mode, also includes error and output log files in the archive.
- Parameters:
model_name (str) – Name of the model, used for naming log files in interactive mode.
dir_name (str) – Name of the directory/archive to be created (without extension).
file_list (list of str) – List of file paths to be included in the compressed archive.
Notes
The function creates a tar.gz archive with the name {dir_name}.tar.gz containing all files in file_list. In interactive mode, it also includes {model_name}.errlog and {model_name}.outlog files. The compressed archive is then transferred to the cluster using the cluster’s configured connection parameters (name, execution path, login, and port).
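The archiving step can be sketched with the standard library. This is not the pyissm implementation, which may well call the system tar instead: `make_job_archive` is a hypothetical helper, it assumes every listed file already exists in the current directory, and it leaves out the SCP transfer entirely.

```python
import tarfile

def make_job_archive(dir_name: str, file_list: list, model_name: str = "",
                     interactive: bool = False) -> str:
    """Write {dir_name}.tar.gz containing file_list and return its filename."""
    members = list(file_list)
    if interactive:
        # In interactive mode the two log files travel with the job.
        members += [model_name + ".errlog", model_name + ".outlog"]
    archive = dir_name + ".tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for path in members:
            tar.add(path)
    return archive
```

The returned name is what a later transfer step would hand to the copy command.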
See also
model.io.issm_scp_out
Function used for transferring files to cluster