XCAT

ARCC uses xCAT, the Extreme Cloud/Cluster Administration Toolkit, to automate cluster management and node deployment.

xCAT General Group Names

Following are the xCAT group names used as general group identifiers:

{| class="wikitable"
! Group Name !! Usage
|-
| all || Used to select all compute nodes in the cluster, except the DGX nodes.
|-
| moran || Used to select all the old Moran nodes in the cluster.
|-
| teton || Used to select all the new Teton nodes in the cluster.
|-
| tknl || Knights Landing nodes.
|-
| mbm || Moran big-memory nodes (512 GB memory).
|-
| mhm || Moran huge-memory nodes (1024 GB memory).
|-
| tbm || Teton big-memory nodes (512 GB memory).
|-
| thm || Teton huge-memory nodes (1024 GB memory).
|-
| dbg || Debug nodes.
|-
| dx || Used to select all the older IBM/Lenovo DX-type nodes in the cluster.
|-
| nx || Used to select all the newer IBM/Lenovo NX-type nodes in the cluster.
|-
| rack number || Used to select all nodes in a specific rack.
|-
| ipmi || Used to select all IPMI-enabled nodes in the cluster.
|-
| supermicro || Any cluster node that is manufactured by SuperMicro.
|-
| sd530 || Used to select the SD530 nodes in the cluster.
|}
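
These group names can be used anywhere xCAT accepts a noderange. As a minimal sketch (assuming the groups above are defined on the management node), listing a group's members and checking their power state looks like:
    # List every node in the teton group
    nodels teton
    # Check the power state of the Moran big-memory nodes
    rpower mbm stat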

xCAT CPU Group Names

The following table details the group names used to identify a specific CPU type:

{| class="wikitable"
! Group Name !! CPU Type
|-
| sandybridge || Intel Sandy Bridge E5-2670
|-
| ivybridge || Intel Ivy Bridge E5-2650v2, E5-2620v2
|-
| haswell || Intel Haswell E5-2640v3, E5-2647v3, 2660v3
|-
| broadwell || Intel Broadwell E5-2683v4
|-
| cascade || Intel Cascade Lake Gold 6230
|}

xCAT GPU Group Names

The following table details the group names used to identify GPU types within the cluster:

{| class="wikitable"
! Group Name !! GPU Type
|-
| gpu || Used to select all GPU-enabled nodes.
|-
| k20m || Nvidia Tesla K20m
|-
| k20xm || Nvidia Tesla K20xm
|-
| Titan || Nvidia Titan
|-
| Titan X || Nvidia Titan X
|-
| k40c || Nvidia Tesla K40c
|-
| k80 || Nvidia Tesla K80
|-
| P100 || Nvidia Tesla P100
|-
| v100 || Nvidia Tesla V100 16GB
|}
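
Group names can also be combined with xCAT's noderange operators to narrow a selection. The sketch below assumes the "@" operator (the intersection of two noderanges) and the groups defined above:
    # All GPU-capable nodes
    nodels gpu
    # Teton nodes that carry a V100 (intersection of the teton and v100 groups)
    nodels teton@v100
    # Power status of every K80 node
    rpower k80 stat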

xCAT NX Chassis Names

NX chassis naming currently uses the following format:

<rack-number>c{1-4}

The chassis are numbered from the lowest chassis in the rack to the highest; for example, a rack numbered 12 would contain chassis 12c1 (bottom) through 12c4 (top).

xCAT Node Setup

In the beginning there was a new node and that node was without configuration. Then along came xCAT and the node became useful.

  1. Edit the /etc/hosts file and add the new nodes in the appropriate places. You'll need to add three entries for each node: the management IP, the InfiniBand IP and the IPMI IP (see the example /etc/hosts entries after this list).
  2. Run "makedns" to insert the new nodes into the DNS.
  3. Run "nodeadd" to create a new node definition in xCAT.
    nodeadd t[465-479] groups=all,teton,g19,sd530,cascade
  4. Run "makegocons" to update the console configuration file with the new nodes.
  5. Edit the xCAT switch table and add the new node and switch information. You may need to edit additional tables, depending on whether you are adding systems other than Lenovo or SuperMicro; these include nodehm, noderes and ipmi (a sketch for the switch table follows this list).
  6. Power on one node first to make sure the node is auto-discovered. You can watch the xCAT log and messages files to see where the node currently is in the Genesis process. You can also dump the "mac" table and see if the node's MAC address has been added.
    1. Once the node has been discovered, tell the node what image to boot.
      nodeset <node range> osimage=XXX
      Where XXX is the current image to boot; as of this writing it is t2018.03.
      The node will then reboot and PXE boot into the specified osimage.
    2. Once the node has rebooted you should be able to ssh to it. Verify that the node is running properly and has the right network address.
    If everything looks OK, go ahead and power on all the nodes and repeat the steps above for each of them.
  7. Run "setup_ipmi <node range>" to configure the IPMI settings on each node. Note: we don't use the xCAT bmcsetup script as part of the Genesis process because it has had issues setting up the IPMI interface.
  8. Run "UWset_bios <node option> <node range>"; this script configures the BIOS settings as defined for ARCC.
  9. Add the nodes to GPFS.
    mmaddnode -N nodelist
    mmchlicense client --accept -N nodelist
    Make sure to specify the InfiniBand address.
  10. Reboot all the nodes so the BIOS changes take effect.
  11. Verify the nodes once they have all rebooted.
  12. Add the nodes to Racktables.
  13. Add the nodes to Observium.
  14. Add the nodes to Slurm (see the slurm.conf sketch after this list).
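
The /etc/hosts entries from step 1 follow a three-address pattern per node. The sketch below uses made-up addresses and hypothetical -ib/-ipmi hostname suffixes; substitute the real addressing scheme for the rack being added:
    # Hypothetical entries for one new node (t465): management, InfiniBand and IPMI addresses
    10.84.1.165   t465          # management network (example address)
    10.85.1.165   t465-ib       # InfiniBand network (example address and suffix)
    10.86.1.165   t465-ipmi     # IPMI network (example address and suffix)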
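
For the switch table in step 5, one way to record the cabling is to set the node's switch attributes with chdef, which fills in the same table that tabedit switch edits. The switch name and port number below are made up for illustration:
    # Hypothetical example: t465 is cabled to port 12 of management switch mgtsw-g19
    chdef t465 switch=mgtsw-g19 switchport=12
    # Confirm what xCAT stored in the switch table
    tabdump switch | grep t465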
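
Adding the nodes to Slurm in step 14 means defining them in slurm.conf and restarting the controller. A minimal sketch, with placeholder CPU and memory values that need to match the real hardware:
    # Hypothetical slurm.conf entry for the new nodes (CPUs and RealMemory are placeholders)
    NodeName=t[465-479] CPUs=40 RealMemory=192000 State=UNKNOWN
    # Restart slurmctld (and slurmd on the nodes) so the new definitions are picked up
    systemctl restart slurmctld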

xCAT Installation

xCAT Creating a New Node Image

xCAT Notes and Warnings

1. If a node appears not to boot, check the node definition; it could be that the console setup is incorrect in the nodehm table.
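
A quick way to check the console settings is to query the node definition or the nodehm table directly; the node name here is just an example:
    # Show the console-related attributes xCAT has for the node
    lsdef t465 -i cons,serialport,serialspeed
    # Or dump the whole nodehm table
    tabdump nodehm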

xCAT References

xCAT Wikipedia Page

xCAT Documentation Page