HPC Model

All HPC compute nodes are purchased with a five-year warranty. Compute nodes will be allowed to run for up to seven years, within the following parameters.

  • The final two years of a compute node's life are outside warranty; support during this period is provided on a best-effort basis.
  • If the software/OS can no longer support compute equipment before the end of its seven-year life, the HPC team may, in consultation with the HPC Policy Committee, determine that the life of the equipment is shorter than seven years. Should this occur, the HPC team will strive to provide at least six months' notice to the HPC community before equipment is decommissioned.
  • In the event that compute nodes fail outside of warranty they will not be repaired. The HPC team will attempt to keep investor queues at the purchased capacity, to the extent possible, based on the following process and guidelines.
    • Investor compute nodes that fail out of warranty will be replaced with compute nodes from the UI queue within the same generation of hardware.
    • When possible, compute nodes will be replaced with hardware of the same or a higher specification. This will not be possible in all cases; where it is not, the HPC team will contact the investor with available options.
    • Transfer of compute nodes from the UI queue to investor queues will occur in the order in which failures occur.
    • UI queue compute node availability is finite and is unlikely to sustain all investor queues at full capacity through a seven-year life. Investors should therefore not assume that their queue will remain at full capacity for the final two years outside of warranty.
    • Investors may opt out of UI Queue backfill for their capacity upon hardware failure.
    • Investors who have opted out of UI Queue backfill for their capacity will be notified of the loss of capacity and may opt to purchase replacement nodes from current generation options.

What does this mean in the context of Argon hardware?

The Argon HPC system is the result of the initial transition to the “New Model” and currently consists of three phases of purchases.

Phase     Purchase Date   End of Warranty   Retirement Begins   UI Nodes   Total Nodes
Phase 1   Jan 2017        March 2022        March 2024          51         343
Phase 2   July 2018       July 2023         July 2025           13         37
Phase 3   Oct 2019        Oct 2024          Oct 2026            54         132

Hardware purchased as part of Phase 1 (Lenovo) will begin retirement on or about March 1, 2024.
Approximately 10% of Phase 1 hardware is owned by the UI queue, and the failure rate on Phase 1 hardware has been about 10% per year. As such, we estimate that by approximately January 2023 the UI backfill pool will be depleted and investor queues will no longer remain at full capacity.
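The mechanism behind this kind of estimate can be illustrated with a simple back-of-the-envelope model. The sketch below is hypothetical (the function name and parameters are illustrative only, and it is not the calculation behind the January 2023 figure): it uses the Phase 1 table figures (343 total nodes, 51 UI nodes) and an assumed flat 10%-per-year failure rate, with every out-of-warranty investor failure backfilled from the UI pool. Real depletion depends on actual failure timing.

```python
# Illustrative sketch only: month-by-month depletion of the UI backfill
# pool after warranty expiry, under simplified assumptions.

TOTAL_NODES = 343          # Phase 1 total nodes (from the table above)
UI_NODES = 51              # Phase 1 UI-queue nodes (from the table above)
ANNUAL_FAILURE_RATE = 0.10 # stated approximate failure rate

def months_until_ui_pool_empty(total=TOTAL_NODES, ui=UI_NODES,
                               annual_rate=ANNUAL_FAILURE_RATE):
    """Months after warranty expiry until the UI pool is exhausted,
    assuming investor capacity is kept whole via UI backfill and UI
    nodes themselves fail at the same rate."""
    monthly_rate = annual_rate / 12
    investor = total - ui  # investor capacity held constant by backfill
    months = 0
    while ui > 0:
        # Expected failures this month across all running nodes; each
        # one is covered (or lost) from the shrinking UI pool.
        failures = (investor + ui) * monthly_rate
        ui -= failures
        months += 1
    return months

print(months_until_ui_pool_empty())
```

Under these assumptions the pool lasts on the order of a year and a half to two years after warranty expiry, which is why the out-of-warranty tail cannot be guaranteed at full capacity.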

What is the impact of this policy on UI queue compute capacity?

The UI queues are the most highly utilized queues on the campus HPC clusters. This policy does mean that generally available compute capacity will decrease over time, compared with an approach in which investor queues were not kept whole to the extent possible. The University has, however, budgeted for periodic additions of new hardware, and we will work to mitigate this capacity constraint as budget allows. Additionally, this is not a significant departure for the UI queue from the previous model, in which an HPC cluster was shut down after approximately five years.

What is the impact of this policy on the HPC team?

Because compute nodes that fail outside of warranty are not repaired under this policy, it does not significantly increase the burden on the HPC team in terms of hardware repairs or the reallocation of nodes between queues. The larger aggregate number of nodes and the diversity of hardware architectures do have an impact on the HPC team, but this was expected as part of the HPC model change.

For a More Complete Explanation of our HPC policies...