Difference between revisions of "Slurm"

From arccwiki
Reference for filling in this page: https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
 

Latest revision as of 16:45, 10 June 2019

Placeholder for general Slurm information.




Troubleshooting

  • Node won't come online

    If a node won't come online, check the node information for the reason Slurm recorded. Run

    scontrol show node=XXX
    

    The output should include the reason Slurm won't bring the node online. For example:

    root@tmgt1:/apps/s/lenovo/dsa# scontrol show node=mtest2
    NodeName=mtest2 Arch=x86_64 CoresPerSocket=10 
       CPUAlloc=0 CPUTot=20 CPULoad=0.02
       AvailableFeatures=ib,dau,haswell,arcc
       ActiveFeatures=ib,dau,haswell,arcc
       Gres=(null)
       NodeAddr=mtest2 NodeHostName=mtest2 Version=18.08
       OS=Linux 3.10.0-693.21.1.el7.x86_64 #1 SMP Fri Feb 23 18:54:16 UTC 2018 
       RealMemory=64000 AllocMem=0 FreeMem=55805 Sockets=2 Boards=1
       State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
       Partitions=arcc 
       BootTime=06.08-11:44:57 SlurmdStartTime=06.08-11:47:35
       CfgTRES=cpu=20,mem=62.50G,billing=20
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
       Reason=Low RealMemory [slurm@06.10-10:00:27]
    

    Reason=Low RealMemory indicates that Slurm found less memory on the node than the node definition specifies. You can use

    free -m
    

    to see how much memory the system itself reports.
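If the check needs to be scripted, the total can be read from the "Mem:" line of the `free -m` output. The following is a minimal sketch, not part of the original page; the function names `total_memory_mib` and `read_total_memory_mib` are made up for illustration, and it assumes the usual procps layout in which the first number on the "Mem:" line is the total.

```python
import subprocess


def total_memory_mib(free_output: str) -> int:
    """Return the 'total' column of the 'Mem:' line from `free -m` output."""
    for line in free_output.splitlines():
        fields = line.split()
        # The memory row starts with the literal label "Mem:";
        # the first number after it is the total in MiB.
        if fields and fields[0] == "Mem:":
            return int(fields[1])
    raise ValueError("no 'Mem:' line found in free output")


def read_total_memory_mib() -> int:
    """Run `free -m` on the local machine and return its total (hypothetical helper)."""
    result = subprocess.run(
        ["free", "-m"], capture_output=True, text=True, check=True
    )
    return total_memory_mib(result.stdout)
```

Calling `read_total_memory_mib()` on the node in question gives the number to compare against the node definition.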

    The RealMemory value in the node definition should be less than or equal to the total shown by the "free" command. Verify that the configured value is correct for the memory the node is supposed to have. If it is not, investigate the cause of the discrepancy.
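The comparison described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than an official Slurm tool: the helper names (`node_field`, `drain_reason`, `real_memory_excess_mib`) are hypothetical, and the code expects the text of `scontrol show node` output in the Key=Value layout shown in the example.

```python
import re
from typing import Optional


def node_field(record: str, key: str) -> Optional[str]:
    """Return the value of a single-token Key=Value field, or None if absent."""
    match = re.search(rf"\b{re.escape(key)}=(\S+)", record)
    return match.group(1) if match else None


def drain_reason(record: str) -> Optional[str]:
    """Return the Reason text without the trailing [user@timestamp] part."""
    match = re.search(
        r"^\s*Reason=(.*?)(?:\s*\[[^\]]*\])?\s*$", record, re.MULTILINE
    )
    return match.group(1) if match else None


def real_memory_excess_mib(record: str, detected_mib: int) -> int:
    """Return by how many MiB the configured RealMemory exceeds the detected
    total, or 0 if the configured value fits."""
    configured = node_field(record, "RealMemory")
    if configured is None:
        raise ValueError("record has no RealMemory field")
    return max(0, int(configured) - detected_mib)
```

For the mtest2 record in the example, `drain_reason` yields "Low RealMemory" and `node_field(record, "RealMemory")` yields "64000". A nonzero result from `real_memory_excess_mib` means the node definition asks for more memory than the node reports.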