HPC Model

All HPC compute nodes are purchased with a five-year warranty. Compute nodes will be allowed to run for up to seven years, within the following parameters.

  • The final two years of a compute node's life are outside warranty; support during this period is provided on a best-effort basis.
  • If the software/OS can no longer support compute equipment before the end of its seven-year life, the HPC team may, in consultation with the HPC Policy Committee, determine that the life of the equipment is shorter than seven years. Should this occur, the HPC team will strive to provide at least six months' notice to the HPC community before equipment is decommissioned.
  • In the event that compute nodes fail outside of warranty they will not be repaired. The HPC team will attempt to keep investor queues at the purchased capacity, to the extent possible, based on the following process and guidelines.
    • Investor compute nodes that fail out of warranty will be replaced with compute nodes from the UI queue within the same generation of hardware.
    • When possible, compute nodes will be replaced with hardware of the same or a higher specification. This will not be possible in all cases; where it is not, the HPC team will contact the investor with available options.
    • Transfer of compute nodes from the UI queue to investor queues will occur in the order in which failures occur.
    • UI queue compute node availability is finite and is unlikely to sustain all investor queues at full capacity through a seven-year life. Investors should therefore not assume that their queue will remain at full capacity for the final two years outside of warranty.
    • Investors may opt out of UI Queue backfill for their capacity upon hardware failure.
    • Investors who have opted out of UI Queue backfill for their capacity will be notified of the loss of capacity and may opt to purchase replacement nodes from current generation options.

What does this mean in the context of Argon hardware?

The Argon HPC system is the result of the initial transition to the “New Model” and currently consists of three phases of purchases.

Phase     Purchase Date   End of Warranty   Retirement Begins   UI Nodes   Total Nodes
Phase 1   Jan 2017        March 2022        March 2024          51         343
Phase 2   July 2018       July 2023         July 2025           13         37
Phase 3   Oct 2019        Oct 2024          Oct 2026            54         132

Hardware purchased as part of Phase 1 (Lenovo) will begin retirement on or about March 1, 2024.
Approximately 10% of Phase 1 hardware is owned by the UI queue, and the failure rate on Phase 1 hardware has been about 10% per year. As such, we estimate that by approximately January 2023 the UI backfill pool will be depleted and investor queues will no longer remain at full capacity.
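The mechanism behind this kind of estimate can be illustrated with a simple back-of-the-envelope model. The sketch below is hypothetical (the function name and parameters are illustrative only, and it is not the calculation behind the January 2023 figure): it uses the Phase 1 table figures (343 total nodes, 51 UI nodes) and an assumed flat 10%-per-year failure rate, with every out-of-warranty investor failure backfilled from the UI pool. Real depletion depends on actual failure timing.

```python
# Illustrative sketch only: month-by-month depletion of the UI backfill
# pool after warranty expiry, under simplified assumptions.

TOTAL_NODES = 343          # Phase 1 total nodes (from the table above)
UI_NODES = 51              # Phase 1 UI-queue nodes (from the table above)
ANNUAL_FAILURE_RATE = 0.10 # stated approximate failure rate

def months_until_ui_pool_empty(total=TOTAL_NODES, ui=UI_NODES,
                               annual_rate=ANNUAL_FAILURE_RATE):
    """Months after warranty expiry until the UI pool is exhausted,
    assuming investor capacity is kept whole via UI backfill and UI
    nodes themselves fail at the same rate."""
    monthly_rate = annual_rate / 12
    investor = total - ui  # investor capacity held constant by backfill
    months = 0
    while ui > 0:
        # Expected failures this month across all running nodes; each
        # one is covered (or lost) from the shrinking UI pool.
        failures = (investor + ui) * monthly_rate
        ui -= failures
        months += 1
    return months

print(months_until_ui_pool_empty())
```

Under these assumptions the pool lasts on the order of a year and a half to two years after warranty expiry, which is why the out-of-warranty tail cannot be guaranteed at full capacity.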

What is the impact of this policy on UI queue compute capacity?

The UI queues are the most highly utilized queues on the campus HPC clusters. This policy does mean that generally available compute capacity will decrease over time, compared with an approach in which investor queues were not kept whole to the extent possible. The University has, however, budgeted for periodic additions of new hardware, and we will work to mitigate this capacity constraint as budget allows. Additionally, this is not a significant departure for the UI queue from the previous model, in which an HPC cluster was shut down after approximately five years.

What is the impact of this policy on the HPC team?

Because compute nodes that fail outside of warranty are not repaired under this policy, it does not significantly increase the burden on the HPC team in terms of hardware repairs or the reallocation of nodes between queues. The larger aggregate number of nodes and the diversity of hardware architectures do have an impact on the HPC team, but this was expected as part of the HPC model change.

For a More Complete Explanation of our HPC policies...