
Performance anomaly detection in HPC

In recent years, the demand for high-performance computing (HPC) data centers has increased. An HPC system often consists of thousands of computing services. Given the high cost of setting up such systems, it is vital that the service provider use the limited data center resources as efficiently as possible and reduce the service cost to fit the “pay as you go” pricing model. As HPC systems and applications continue to grow in complexity, they become more exposed to performance problems (resource contention, software- and firmware-related problems, etc.) that can lead to premature job termination, reduced performance, and wasted compute platform resources. Managing the health of such systems well and continuously has a large financial and operational impact, so it is essential for HPC operators to monitor and analyze the performance of these complex environments. Monitoring systems of this size and complexity manually is impossible, because thousands of computational nodes generate a huge amount of data per day in the form of resource usage metrics and other key performance indicators (KPIs). Many visualization tools are available that monitor and collect HPC performance data that may contain evidence of anomalies, but they lack an analytic engine that processes this data to identify performance anomalies. Performance problem management has therefore become a major task in HPC cloud environments. It comprises three main tasks: (i) real-time detection of performance anomalies within HPC cloud data centers; (ii) identification of the root cause of these anomalies; (iii) identification of methods to prevent these anomalies from occurring. These performance problems have moved research on computational intelligence into a new era of developing tools and techniques to identify such anomalies.
These tools use data analytic techniques (statistical, machine learning, time series, threshold, etc.) that capture information on a large number of time-varying system performance metrics and then analyze the relationships among system components and applications.
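As a rough illustration of the statistical and threshold families of techniques named above, the following minimal Python sketch flags samples in a single metric series whose z-score against a trailing window exceeds a fixed threshold. It is not the method of this thesis: the function name `zscore_anomalies`, the window size, the threshold of 3 standard deviations, and the sample CPU-utilization values are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def zscore_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` sample
    standard deviations (hypothetical helper, for illustration only)."""
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: a z-score is undefined, so treat any change
            # from the constant value as an anomaly.
            if samples[i] != mu:
                flagged.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Made-up CPU utilization (%) for one node, with a single spike.
cpu_util = [41.0, 43.0, 42.0, 40.0, 44.0, 42.0, 43.0, 95.0, 41.0, 42.0]
print(zscore_anomalies(cpu_util))  # -> [7]
```

A trailing window keeps the baseline local, so slow drifts in load are not flagged, but a spike inflates the window's standard deviation for the next few samples and can mask anomalies that follow it closely. Robust statistics such as the median and median absolute deviation are a common way to reduce that effect.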

University of Vigo

Mohamed Soliman Halawa

2026



