Slurm Workload Manager

Basics

TODO


Compiling and Installing

Prerequisites

Red Hat / EPEL provided dependencies (use yum with appropriate repositories configured):

  1. GCC
  2. readline(-devel)
  3. MariaDB(-devel)
  4. Perl(-devel)
  5. lua(-devel)
  6. cURL(curl & libcurl(-devel))
  7. JSON (json-c(-devel))
  8. munge(-devel)

[root@tmgt1 ~]# yum -y install \
  gcc \
  readline readline-devel \
  mariadb mariadb-devel \
  perl perl-devel \
  lua lua-devel \
  curl libcurl libcurl-devel \
  json-c json-c-devel \
  munge munge-devel munge-libs

There are likely additional dependencies that should be listed explicitly here; as they are identified, they should be added.


Mellanox provided dependencies (use the mlnxofedinstall script):

  1. libibmad(-devel)
  2. libibumad(-devel)
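After running the mlnxofedinstall script, the presence of these packages can be verified with a quick RPM query, for example:

[root@tmgt1 ~]# rpm -q libibmad libibmad-devel libibumad libibumad-devel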


ARCC Supplied RPMs

  1. PMIx
  2. UCX (Slurm is not currently configured to use it)


PMIx

PMIx is used to exchange information about the communication and launch environments of parallel applications (i.e., mpirun, srun, etc.). The PMIx implementation prefers to perform launch communications in conjunction with the job scheduler rather than via the older RSH/SSH methods. At high node counts, the time to start applications can be reduced significantly compared to the older ring start-up or even the Hydra implementation. This may now be provided by EPEL; double-check before building locally.

powerman $ rpmbuild ...
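To double-check whether EPEL now provides a suitable PMIx package before resorting to a local rpmbuild, a repository query along these lines works:

powerman $ yum list available 'pmix*'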

UCX

TODO. This may actually be an EPEL-provided RPM; verify before building from source.

HDF5

Compile HDF5 from source and install it into a system directory. This installation is not to be confused with the user-accessible HDF5 installations, which may have additional dependencies such as the Intel compilers or an MPI implementation.

powerman $ cd /software/slurm

powerman $ tar xf hdf5-1.10.5.tar.bz2

powerman $ cd hdf5-1.10.5

powerman $ ./configure --prefix=/apps/s/hdf5/1.10.5

powerman $ make -j4

powerman $ make install

If you keep track of ABI compatibility (you're a sysadmin, you should), you may want to create a symbolic link named "latest" to this release in the parent of the installation directory, as shown below.

powerman $ cd /apps/s/hdf5

powerman $ ln -s 1.10.5 latest
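As an optional sanity check, the h5cc wrapper can report the configuration it was built with (-showconfig is a standard flag of the HDF5 compiler wrapper):

powerman $ /apps/s/hdf5/latest/bin/h5cc -showconfig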

hwloc

Use the ultra-stable version of hwloc and install it in a global location. Users can point at this installation if needed, but a separate user-facing hwloc installation can be provided when necessary. This copy exists specifically to support cgroup handling within the system and is used by Slurm.

powerman $ cd /software/slurm

powerman $ tar xf hwloc-1.11.13.tar.bz2

powerman $ cd hwloc-1.11.13

powerman $ ./configure --prefix=/apps/s/hwloc/1.11.13

powerman $ make -j4

powerman $ make install

As with HDF5, create the "latest" symbolic link in the parent of the installation directory:

powerman $ cd /apps/s/hwloc

powerman $ ln -s 1.11.13 latest
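A quick sanity check confirms the installation is usable, e.g.:

powerman $ /apps/s/hwloc/latest/bin/hwloc-info --version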

Slurm

Download the latest version that's available. Slurm is on a 9-month major release cycle, but point releases frequently carry build fixes, fixes for specific CVEs, and occasionally hot fixes that need to be addressed.

Assuming the downloaded tarball is 'slurm-19.05.3-2.tar.bz2', the instructions are below. Note that the configure line references the "latest" symbolic links for the HDF5 and hwloc libraries.

powerman $ cd /software/slurm

powerman $ tar xf slurm-19.05.3-2.tar.bz2

powerman $ cd slurm-19.05.3-2

powerman $ ./configure \
  --prefix=/apps/s/slurm/19.05.3-2 \
  --with-hdf5=/apps/s/hdf5/latest/bin/h5cc \
  --with-hwloc=/apps/s/hwloc/latest

powerman $ make -j8

powerman $ make install

Additional Features / Utilities

powerman $ cd contribs

powerman $ make

powerman $ for i in lua openlava perlapi seff sjobexit torque;
do
  cd $i
  make install
  cd -
done

The PAM libraries are special: if you want to use them, they must go in a specific node-local location, generally /usr/lib64/security/.

powerman $ for i in pam pam_slurm_adopt;
do
  cd $i
  make
  cd -
done
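The built modules must then be installed, as root, into the node-local PAM directory mentioned above. A minimal sketch, assuming the libtool build leaves each shared object under the directory's .libs/ subdirectory:

[root@tmgt1 ~]# install -m 0755 pam_slurm_adopt/.libs/pam_slurm_adopt.so /usr/lib64/security/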

NOTE: Honestly, the only one worth chasing, if you find users abusing nodes, is pam_slurm_adopt, which adopts user SSH sessions into the cgroups of their jobs and denies users access to any node on which they have no running job. Additionally, remember that configuring PAM is more than just installing libraries: the PAM stack (/etc/pam.d/...) will need to be modified appropriately. An example will be posted at a later date.
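Until then, a minimal sketch of the relevant sshd stack entry (assuming pam_slurm_adopt.so has been copied to /usr/lib64/security/):

account    required     pam_slurm_adopt.so

This line goes in /etc/pam.d/sshd ahead of the other account entries; module arguments such as action_no_jobs can be tuned to control what happens to users without a running job.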


First Installation & Configuration

Munge

TODO


MariaDB

TODO


Performing Upgrades

Slurm upgrades must happen in a specific order to ensure continuous service: the communication scheme is backward compatible in the sense that newer daemons can talk to older clients, but not the reverse. Specifically, the ordering is as follows:

  1. Slurm database
  2. Slurm controller
  3. Slurm compute nodes

Updating the Slurm Database

The Slurm database is a critical component of the ARCC infrastructure: keeping accounts and investorship in line relies extensively on this database being active. It is therefore quite important to back up the database before attempting an upgrade, using the normal MySQL/MariaDB backup capability. Also be aware that ARCC does not currently prune the database; that may become an issue later if more high-throughput computing is introduced.
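For the backup step marked in the procedure below, a plain mysqldump is typically sufficient. A minimal sketch, assuming the default accounting database name slurm_acct_db and local socket authentication:

[root@tdb1]# mysqldump slurm_acct_db > /root/slurm_acct_db.$(date +%F).sql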


[root@tmgt1]# ssh tdb1

[root@tdb1]# systemctl stop slurmdbd.service

[root@tdb1]# ## PERFORM DB BACKUP ##

[root@tdb1]# install -m 0644 /software/slurm/19.05.3-2/slurmdbd.service /etc/systemd/system/slurmdbd.service

[root@tdb1]# systemctl daemon-reload

[root@tdb1]# su - slurm

bash-4.2$ cd /tmp

bash-4.2$ /apps/s/slurm/19.05.3-2/sbin/slurmdbd -D -vvv

Wait for slurmdbd to finish making the necessary database changes and resume normal operation. Once the changes are done, interrupt the process with Ctrl-C:

bash-4.2$ ^C

[root@tdb1]# systemctl start slurmdbd.service

[root@tdb1]# exit

Updating the Slurm Controller

[root@tmgt1]# systemctl stop slurmctld.service

[root@tmgt1]# install -m 0644 /apps/s/slurm/19.05.3-2/slurmctld.service /etc/systemd/system/slurmctld.service

[root@tmgt1]# systemctl daemon-reload

[root@tmgt1]# systemctl start slurmctld.service
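After the restart, a quick responsiveness check is worthwhile; scontrol ping reports whether the primary (and any backup) controller is up:

[root@tmgt1]# scontrol ping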

Updating the Slurm Compute Nodes

Running Nodes

TODO

Compute Node Image

TODO


Slurm Validation

Version Checks

Check command versions ...
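For example, from a login or management node:

powerman $ sinfo --version

powerman $ srun --version

powerman $ scontrol version

Each should report the newly installed release (here, 19.05.3-2).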


Controller & Database Checks

Make sure scontrol, sacctmgr, sacct, sreport, and job_submit.lua work appropriately...
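Checks along these lines are a reasonable starting point; the sbatch at the end exercises job_submit.lua (the --wrap job is just a trivial test):

powerman $ sacctmgr show cluster

powerman $ sacct -a -X --starttime=today | head

powerman $ sreport cluster utilization

powerman $ sbatch --wrap="hostname"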

PMI Checks

Make sure Intel MPI launches appropriately...
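A minimal sketch of such a check, assuming Intel MPI and a pre-built hello_mpi test binary (hypothetical here); I_MPI_PMI_LIBRARY points Intel MPI at Slurm's PMI library so that srun performs the launch:

powerman $ export I_MPI_PMI_LIBRARY=/apps/s/slurm/19.05.3-2/lib/libpmi.so

powerman $ srun -n 2 ./hello_mpi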

Special Nodes

DGX Systems

TODO