Performance anomaly detection in HPC
In recent years, demand for high-performance computing (HPC) data centers has increased. An HPC system often consists of thousands of computing services. Given the high cost of setting up such systems, it is vital that the service provider use the limited data center resources as efficiently as possible and reduce the service cost to fit the “pay as you go” pricing model.
As HPC systems and applications continue to increase in complexity, they become more exposed to performance problems such as resource contention and software- and firmware-related faults, which can lead to premature job termination, reduced performance, and wasted compute platform resources. Continuously managing the health of such systems has a large financial and operational impact, so it is essential for HPC operators to monitor and analyze the performance of these complex environments.
Manually monitoring systems of this size and complexity is an impossible task, since thousands of computational nodes generate a huge amount of data per day in the form of resource usage metrics and other key performance indicators (KPIs). Many visualization tools are available that monitor and collect HPC performance data that may contain evidence of anomalies, but what is lacking is an analytic engine that processes this data to identify performance anomalies.
Therefore, performance problem management has become a major task in HPC cloud environments. It comprises three main tasks:
(i) Real-time detection of performance anomalies within HPC cloud data centers.
(ii) Identifying the root cause of these anomalies.
(iii) Identifying methods to prevent these anomalies from occurring.
These performance problems have pushed research on computational intelligence toward developing tools and techniques to identify these anomalies. These tools use data analytic techniques (statistical, machine learning, time series, threshold-based, etc.) that capture information on a large number of time-varying system performance metrics and then analyze the relationships among system components and applications.
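As a rough illustration of the statistical and threshold-based family mentioned above, the following minimal sketch flags samples of a single metric that deviate strongly from a trailing window. It is not a method taken from this work: the function name `rolling_zscore_anomalies`, the window size, the threshold of 3 standard deviations, and the sample CPU utilization values are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` standard deviations.

    Hypothetical helper for illustration; parameters are assumed defaults.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as anomalous.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up CPU utilization readings (percent) from one node.
cpu_util = [41.0, 42.5, 40.8, 43.1, 41.7, 42.2, 97.5, 42.0, 41.3]
print(rolling_zscore_anomalies(cpu_util))  # the spike at index 6 is flagged
```

A detector like this only looks at one metric in isolation. Analyzing relationships among components and applications, as described above, would need multivariate or learned models on top of it.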