HPC system: purchasing nodes

Unlike traditional clusters, Teton is a collaborative system wherein the majority of nodes are purchased and shared by the cluster users, known as condo investors.

The model for sustaining Teton is premised on faculty and principal investigators ("Investors") purchasing compute nodes (individual servers) from their grants or other available funds, which are then added to the cluster. Investor-owned nodes thereby take advantage of the high-speed InfiniBand interconnect and high-performance GPFS parallel filesystem storage associated with Teton. Operating costs for managing and housing Investor-owned compute nodes are waived in exchange for letting other users make use of any idle compute cycles on those nodes. Investors have priority access to computing resources equivalent to those purchased with their funds, but can access more nodes for their research if needed. This gives the Investor much greater flexibility than owning a standalone cluster.

We use job pre-emption so that an Investor has immediate access to their invested nodes: any jobs running on an Investor's nodes will be stopped and re-queued to run at a later time.

Teton also has a number of community nodes available to all users. Jobs running on these nodes will not be pre-empted unless the job's node allocation includes Investor-owned nodes in which the user has no stake.
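Assuming the cluster's scheduler is Slurm (the "partition" and "pre-emption" terminology above suggests this), the policy described above could be sketched in a scheduler configuration roughly as follows. All partition, node, and group names here are hypothetical, not Teton's actual configuration:

```
# Hypothetical slurm.conf sketch of the pre-emption policy described above.
# Partition, node, and group names are illustrative only.
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# Investor partition: higher priority tier, restricted to the investing group
PartitionName=inv-lab Nodes=t[001-004] PriorityTier=2 AllowGroups=inv-lab

# Community/general partition: jobs landing on Investor-owned nodes may be
# pre-empted and re-queued when the Investor needs those nodes back
PartitionName=community Nodes=t[001-128] PriorityTier=1 PreemptMode=REQUEUE Default=YES
```

With partition-priority pre-emption, a job submitted to the higher-tier investor partition displaces community jobs occupying those nodes, which are then re-queued rather than lost.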

The Details

Compute nodes are purchased and maintained on a 5-year life-cycle. Investors will be notified during year 4 that their investment nodes will expire at the end of the 5th year. Nodes left in the cluster after five years may be removed and disposed of at the discretion of the ARCC director.

Once an Investor has decided to participate, the Investor or their designate works with the ARCC team to procure the desired number of compute nodes. There is a 1-node minimum buy-in for any given compute node type (i.e., Standard, Bigmem, Hugemem, or GPU). The Standard node is the least expensive, while GPU nodes are the most expensive. Procurement generally takes about two to three months from start to finish. Once the nodes have been provisioned, an investor partition will be created and the Investor will be notified.

An Investor may submit jobs to the general partitions on the cluster before the new nodes are provisioned. Such jobs are subject to general partition limitations, and guaranteed access to the purchased node(s) and cores is not provided until the purchased nodes are provisioned.

Please contact the ARCC at arcc-info@uwyo.edu for information and current pricing.
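As a back-of-the-envelope illustration of how a node purchase amortizes over the 5-year life-cycle, the sketch below computes a per-core-hour cost. The purchase price used is purely hypothetical; contact ARCC for actual figures.

```python
# Hypothetical amortization of a node purchase over Teton's 5-year life-cycle.
# The purchase price below is illustrative only; contact arcc-info@uwyo.edu
# for current pricing.
node_price = 8500.0   # hypothetical USD purchase price for a Standard node
cores = 32            # dual Xeon Gold 6130, 16 cores per socket
years = 5             # node life-cycle
hours = years * 365 * 24

cost_per_core_hour = node_price / (cores * hours)
print(f"${cost_per_core_hour:.4f} per core-hour")
```

Even a rough figure like this can be useful when comparing a condo investment against commercial cloud pricing in a grant budget.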

Node Types

Teton is currently architected with two generations of hardware: the Lenovo DX and NX series. Starting in January 2019, all expansions to Teton will be from the Lenovo ThinkSystem series of hardware.

The ThinkSystem hardware uses a 2U chassis which supports either 4 standard nodes or 2 GPU nodes.

There are three node types: Standard, Bigmem, and Hugemem. A dual GPU tray may be added to any of these node types. The memory configurations are:

  • Standard: 128GB, as 8 x 16GB TruDDR4 2666 MHz (2Rx8 1.2V) RDIMMs
  • Bigmem: 512GB, as 16 x 32GB TruDDR4 2666 MHz (2Rx8 1.2V) RDIMMs
  • Hugemem: 1024GB, as 16 x 64GB TruDDR4 2666 MHz (2Rx8 1.2V) RDIMMs
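The memory totals above can be sanity-checked from the DIMM counts and sizes:

```python
# Verify each node type's total memory from its DIMM population.
configs = {
    "Standard": (8, 16),   # 8 x 16GB RDIMMs
    "Bigmem":   (16, 32),  # 16 x 32GB RDIMMs
    "Hugemem":  (16, 64),  # 16 x 64GB RDIMMs
}
totals = {name: count * size for name, (count, size) in configs.items()}
for name, total in totals.items():
    print(f"{name}: {total}GB")
```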

Following is the base node specification.

  • Lenovo ThinkSystem SD530 dual socket compute node
  • Two Intel Xeon Gold 6130 16-core 2.1GHz processors
  • One 2.5" Intel S4510 240GB 6Gb SATA Hot Swap SSD
  • One Mellanox ConnectX-4 1x100GbE/EDR IB QSFP28 VPI Adapter
  • 5Yr Next Business Day warranty
  • Ground shipping to Laramie

Additional items required for each node:

  • One 1m Mellanox EDR IB Passive Copper QSFP28 Cable
  • IBM Spectrum Scale Standard Edition Client License for 5 years
  • One 1Gig network cable

Any of the above nodes can be configured with dual Nvidia Tesla V100 16GB GPUs.

Speciality Nodes

There are two speciality nodes available:

  • KNL Nodes
  • Nvidia DGX GPU nodes.

Other special nodes can be architected; please contact the ARCC for help and information on obtaining prices.

These nodes are available by special order only.

Grant Statement

Following is a statement about the current Teton configuration which can be used as part of a grant request.

"The University of Wyoming hosts excellent computational resources for the University of Wyoming research community. The Advanced Research Computing Center (ARCC) hosts a large, ~500 TFLOPS cluster that serves the University of Wyoming. This cluster currently hosts 502 nodes, each with Intel processors supporting 16 or 32 cores (totaling >14,400 cores), with some nodes containing dual Nvidia GPUs. There are a number of speciality nodes, which include Knights Landing, Nvidia DGX GPU, and others. Teton also hosts a 1.2PB GPFS high-performance storage system attached to the cluster. Teton is continually growing in both computing and storage capacity."