HPC system: Job Scheduling Slurm


Introduction

The Teton cluster uses the Slurm Workload Manager to schedule jobs, control resource access, provide fairshare scheduling, implement preemption, and keep records. All compute activity should be performed from within a Slurm resource allocation (i.e., a job). Teton is a condominium resource, so investors have priority on the resources they have invested in. This is implemented through preemption: jobs not associated with an investment may be requeued when the investor submits jobs. If an investor prefers not to have preemption enabled on their resources, ARCC can disable it and offer next-in-line access instead.

  • There are default concurrent-usage limits in place to prevent individual project accounts and users from saturating the cluster at the expense of others. The default limits are listed below. To incentivize investment in the condo system, investors have their limits increased.
  • The system uses a fairshare mechanism to give projects that run jobs only occasionally priority over those that run jobs continuously. To incentivize investment in the condo system, investors also have their fairshare value increased.
  • Finally, individual jobs are subject to runtime limits based on a study performed around 2014; the maximum walltime for a compute job is 7 days. ARCC is currently evaluating whether independent limits on CPU count and walltime are the best operational model, and is considering concurrent-usage limits based on a combination of CPU count, memory, and walltime that would allow more flexibility for different areas of science. There will likely still be an upper limit on individual job walltime, since ARCC will not allow unlimited walltime and hardware faults remain possible.

Required Inputs, Default Values, and Limits

There are some default limits set for Slurm jobs. By default, the following are required for every submission (see the example batch script after the list):

  1. Walltime limit
    (--time=[days-hours:mins:secs])
  2. Project account
    (--account=account)
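
As a minimal sketch (the account name, time, and workload below are placeholders, not values specific to Teton), a batch script supplying the two required inputs might look like:

  #!/bin/bash
  #SBATCH --time=0-01:00:00      # walltime limit: 1 hour
  #SBATCH --account=myproject    # project account (placeholder name)

  srun hostname                  # replace with your actual workload

Submit it with sbatch script.sh.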

Default Values

Additionally, the default submission has the following characteristics:

  • Node count: one node (-N 1, --nodes=1)
  • Task count: one task (-n 1, --ntasks=1)
  • Memory: 1000 MB of RAM per CPU (--mem-per-cpu=1000)

These defaults can be changed by requesting a different allocation scheme with the appropriate flags, as in the sketch below. Please reference our Slurm documentation for details.
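
The sketch below (placeholder account name and executable) overrides each of these defaults with explicit #SBATCH directives:

  #!/bin/bash
  #SBATCH --time=0-04:00:00        # 4-hour walltime
  #SBATCH --account=myproject      # project account (placeholder name)
  #SBATCH --nodes=2                # override the default of 1 node
  #SBATCH --ntasks-per-node=16     # override the default of 1 task per node
  #SBATCH --mem-per-cpu=2000       # override the default of 1000 MB per CPU

  srun ./my_mpi_program            # placeholder executable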

Default Limits

On Mount Moran, the default limits were expressed as the number of cores each project account could use concurrently, and investors received an increased concurrent-core limit. To allow more flexible scheduling for all research groups, ARCC is looking at implementing limits based on concurrent usage of cores, memory, and job walltime. These limits will be defined in the near future and will be subject to FAC review.


Partitions

The Slurm configuration on Teton is fairly complex in order to accommodate the hardware layout, investors, and runtime limits. The following tables describe the partitions on Teton. Some partitions require a QoS, which is auto-assigned during job submission. The tables list Slurm allocatable units rather than hardware units. Example commands for targeting a general partition follow the first table.

Teton General Slurm Partitions
Partition      Max Walltime  Node Cnt  Core Cnt  Thds/Core  CPUs  Mem (MB)/Node  Req'd QoS
teton          7-00:00:00    180       5760      1          5760  128000         N/A
teton-gpu      7-00:00:00    8         256       1          256   512000         N/A
teton-hugemem  7-00:00:00    8         256       1          256   1024000        N/A
teton-knl      7-00:00:00    12        864       4          3456  384000         N/A
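
For example, a general partition can be selected with the --partition flag (the account name, times, and job script below are placeholders):

  # Interactive allocation on the general teton partition (no QoS needed)
  salloc --account=myproject --time=0-02:00:00 --partition=teton --ntasks=1

  # Batch job targeting the huge-memory nodes
  sbatch --account=myproject --time=1-00:00:00 --partition=teton-hugemem job.sh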

Investor Partitions

Investor partitions are likely to be heterogeneous and may contain a mix of hardware, as indicated below where appropriate. They require a special QoS for access; a hedged example follows the table.

Teton Investor Slurm Partitions
Partition         Max Walltime  Node Cnt  Core Cnt  Thds/Core  Mem (MB)/Node  Req'd QoS  Preemption  Owner
t-inv-microbiome  7-00:00:00    88        2816      1          128000         TODO       Disabled    EPSCoR
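
As a sketch only (the actual QoS for t-inv-microbiome is not listed above, so the value below is a placeholder, as is the account name), an investor job would add the partition and QoS to its directives:

  #SBATCH --account=myproject            # placeholder project name
  #SBATCH --time=2-00:00:00
  #SBATCH --partition=t-inv-microbiome
  #SBATCH --qos=<investor-qos>           # placeholder; obtain the real QoS from ARCC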

Special Partitions

Special partitions require access to be granted directly to user accounts or project accounts and will likely require additional approval. A sketch of a request follows the table.

Partition  Max Walltime  Node Cnt  Core Cnt  Thds/Core  Mem (MB)/Node  Owner           Notes
dgx        7-00:00:00    1         40        2          512000         EvolvingAI Lab  NVIDIA V100 with NVLink, Ubuntu 16.04
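
A request for the dgx node might look like the sketch below; the GPU GRES name and count are assumptions (V100 GPUs are commonly requested as gpu:<count>), and the account name is a placeholder:

  #SBATCH --account=myproject     # placeholder project name
  #SBATCH --time=0-08:00:00
  #SBATCH --partition=dgx
  #SBATCH --gres=gpu:1            # assumed GRES name and count; confirm with ARCC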



More details

Generally, to run a job on the cluster you will need the inputs described above: a project account, a walltime limit, and any resource requests beyond the defaults.

A handy migration reference comparing MOAB/Torque commands to Slurm commands can be found on the Slurm home site: Batch system Rosetta stone

For further details on using Slurm, see Slurm.