Tuesday, July 23, 2024

High Performance Network Fabric Islands

Previously, high-performance network fabrics existed at two physically separate data centers, with no high-performance connection between them. To keep a job on a single fabric, it was sufficient to specify either the data center (-l dc=lc or -l dc=itf) or the high-performance fabric type (-l fabric=omnipath or -l fabric=infiniband).

With the 2024 expansion, we have added a third high-performance network fabric that connects the new compute nodes at the ITF data center. Because the ITF data center now hosts more than one fabric, we have added a new requestable resource, “island,” so that jobs requiring the high-performance network are guaranteed slots on nodes that share the same fabric. If you specify an island, you can remove any data center or fabric requests, although the data center and fabric resources remain available.

The three islands are:

  • 1: LC data center with Argon phase 1 & 2 systems
  • 2: ITF data center with Argon phase 3 systems
  • 3: ITF data center with Argon phase 4 (2024 expansion) systems

If you are submitting to a queue with nodes on multiple islands and need high-performance network connectivity, make sure you select an island for the job. Request the island as a resource in your SGE job submission, for example -l island=2. Detailed resource request information is on our wiki at Advanced Job Submission.
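
As a minimal sketch of the island request described above, the following shell snippet builds the -l island=N argument and prints the resulting qsub command instead of running it. The helper name island_request and the job script name myjob.sh are hypothetical, chosen only for illustration; the island numbers 1 to 3 come from the list above.

```shell
#!/bin/sh
# Sketch only: island_request and myjob.sh are hypothetical names.
# Valid island numbers (1, 2, 3) are taken from the island list above.

island_request() {
  case "$1" in
    1|2|3) printf -- '-l island=%s\n' "$1" ;;
    *) echo "unknown island: $1 (expected 1, 2, or 3)" >&2; return 1 ;;
  esac
}

# Print the command rather than executing it, so the sketch
# can be tried on a machine without SGE installed.
echo "qsub $(island_request 2) myjob.sh"
```

The same request can also be placed inside the job script as an SGE directive line, #$ -l island=2, in which case it does not need to be repeated on the qsub command line.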

Other Hardware

We have added several new GPU types to the cluster, detailed in the wiki table Advanced Job Submission.

The new nodes use Xeon Gold 6430 CPUs, with 32 physical cores per CPU. Each node has two sockets, both populated, for a total of 64 physical cores or 128 threads. Each node will have 128 slots available for jobs, with per-slot memory varying by node configuration. The combinations are listed on our wiki at Argon Cluster.
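
The per-node totals above can be checked with a short calculation. This is a sketch that assumes the 32-core figure applies to each CPU and that each core provides two hardware threads, which is what the 64-core and 128-thread totals imply.

```shell
#!/bin/sh
# Sketch of the per-node arithmetic; values are taken from the paragraph above.
sockets=2            # CPUs installed per node
cores_per_cpu=32     # assumption: the 32-core figure is per CPU
threads_per_core=2   # assumption: implied by 64 cores giving 128 threads

node_cores=$((sockets * cores_per_cpu))
node_threads=$((node_cores * threads_per_core))

echo "physical cores per node: $node_cores"
echo "threads (job slots) per node: $node_threads"
```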