Javascript must be enabled to continue!

Lifeline-based global load balancing

On shared-memory systems, Cilk-style work-stealing has been used to effectively parallelize irregular task-graph based applications such as Unbalanced Tree Search (UTS). There are two main difficulties in extending this approach to distributed memory. In the shared memory approach, thieves (nodes without work) constantly attempt to asynchronously steal work from randomly chosen victims until they find work. In distributed memory, thieves cannot autonomously steal work from a victim without disrupting its execution. When work is sparse, this results in performance degradation. In essence, a direct extension of traditional work-stealing to distributed memory violates the work-first principle underlying work-stealing. Further, thieves spend useless CPU cycles attacking victims that have no work, resulting in system inefficiencies in multi-programmed contexts. Second, it is non-trivial to detect active distributed termination (detect that programs at all nodes are looking for work, hence there is no work). This problem is well-studied and requires careful design for good performance. Unfortunately, in most existing languages/frameworks, application developers are forced to implement their own distributed termination detection. In this paper, we develop a simple set of ideas that allow work-stealing to be efficiently extended to distributed memory. First, we introduce lifeline graphs: low-degree, low-diameter, fully connected directed graphs. Such graphs can be constructed from k -dimensional hypercubes. When a node is unable to find work after w unsuccessful steals, it quiesces after informing the outgoing edges in its lifeline graph. Quiescent nodes do not disturb other nodes. A quiesced node is reactivated when work arrives from a lifeline and itself shares this work with those of its incoming lifelines that are activated. Termination occurs precisely when computation at all nodes has quiesced. In a language such as X10, such passive distributed termination can be detected automatically using the finish construct -- no application code is necessary. Our design is implemented in a few hundred lines of X10. On the binomial tree described in olivier:08}, the program achieve 87% efficiency on an Infiniband cluster of 1024 Power7 cores, with a peak throughput of 2.37 GNodes/sec. It achieves 87% efficiency on a Blue Gene/P with 2048 processors, and a peak throughput of 0.966 GNodes/s. All numbers are relative to single core sequential performance. This implementation has been refactored into a reusable global load balancing framework. Applications can use this framework to obtain global load balance with minimal code changes. In summary, we claim: (a) the first formulation of UTS that does not involve application level global termination detection, (b) the introduction of lifeline graphs to reduce failed steals (c) the demonstration of simple lifeline graphs based on k-hypercubes, (d) performance with superior efficiency (or the same efficiency but over a wider range) than published results on UTS. In particular, our framework can deliver the same or better performance as an unrestricted random work-stealing implementation, while reducing the number of attempted steals.

Association for Computing Machinery (ACM)

Vijay A. Saraswat Prabhanjan Kambadur Sreedhar Kodali David Grove Sriram Krishnamoorthy

ACM SIGPLAN Notices

2011

Title: Lifeline-based global load balancing

Description:

On shared-memory systems, Cilk-style work-stealing has been used to effectively parallelize irregular task-graph based applications such as Unbalanced Tree Search (UTS).

There are two main difficulties in extending this approach to distributed memory.

In the shared memory approach, thieves (nodes without work) constantly attempt to asynchronously steal work from randomly chosen victims until they find work.

In distributed memory, thieves cannot autonomously steal work from a victim without disrupting its execution.

When work is sparse, this results in performance degradation.

In essence, a direct extension of traditional work-stealing to distributed memory violates the work-first principle underlying work-stealing.

Further, thieves spend useless CPU cycles attacking victims that have no work, resulting in system inefficiencies in multi-programmed contexts.

Second, it is non-trivial to detect active distributed termination (detect that programs at all nodes are looking for work, hence there is no work).

This problem is well-studied and requires careful design for good performance.

Unfortunately, in most existing languages/frameworks, application developers are forced to implement their own distributed termination detection.

In this paper, we develop a simple set of ideas that allow work-stealing to be efficiently extended to distributed memory.

First, we introduce lifeline graphs: low-degree, low-diameter, fully connected directed graphs.

Such graphs can be constructed from k -dimensional hypercubes.

When a node is unable to find work after w unsuccessful steals, it quiesces after informing the outgoing edges in its lifeline graph.

Quiescent nodes do not disturb other nodes.

A quiesced node is reactivated when work arrives from a lifeline and itself shares this work with those of its incoming lifelines that are activated.

Termination occurs precisely when computation at all nodes has quiesced.

In a language such as X10, such passive distributed termination can be detected automatically using the finish construct -- no application code is necessary.

Our design is implemented in a few hundred lines of X10.

On the binomial tree described in olivier:08}, the program achieve 87% efficiency on an Infiniband cluster of 1024 Power7 cores, with a peak throughput of 2.

37 GNodes/sec.

It achieves 87% efficiency on a Blue Gene/P with 2048 processors, and a peak throughput of 0.

966 GNodes/s.

All numbers are relative to single core sequential performance.

This implementation has been refactored into a reusable global load balancing framework.

Applications can use this framework to obtain global load balance with minimal code changes.

In summary, we claim: (a) the first formulation of UTS that does not involve application level global termination detection, (b) the introduction of lifeline graphs to reduce failed steals (c) the demonstration of simple lifeline graphs based on k-hypercubes, (d) performance with superior efficiency (or the same efficiency but over a wider range) than published results on UTS.

In particular, our framework can deliver the same or better performance as an unrestricted random work-stealing implementation, while reducing the number of attempted steals.

Back

Lifeline infrastructure systems provide services to support communities, including life safety, public health, and social-economic factors. They require both built assets and the h...

Zhong-Yong as dynamic balancing between Yin-Yang opposites

Purpose The purpose of this paper is to comment on Peter Ping Li’s understanding of Zhong-Yong balancing, presented in his article titled “Global implications of the indigenous epi...

Crane Load Moment System For Offshore Crane Operations

Abstract History has shown that dependency upon the crane operator to monitor loads and boom angle or load radius do not allow the margin necessary to perform the...

Implementasi Metode Load Balancing Sebagai Upaya Meningkatkan Kinerja Server

The use of the internet turns out to cause problems that must be found a solution. One of them is the service provider's server load which continues to increase. This condition is ...

Improved Model of Load Balancing in the Infocommunication Network

The paper proposes an improved mathematical model of load balancing in the infocommunication network (ICN), corresponding to the Traffic Engineering (TE) concept principles. The mo...

Load Balancing in Peer-to-Peer Systems

Structured peer-to-peer (P2P) overlay networks like Distributed Hash Tables (DHTs) map data items to the network based on a consistent hashing function. Such mapping for data distr...

Solar-Powered Wireless Load Cell Application in Kuwait's Field

Abstract Artificial-lift systems account for a major portion of Kuwait's heavy-oil production infrastructure. Among these, the sucker rod pump remains the most ec...

Multiobjective Dynamic Resource Allocation in Cloud Computing using Harris Hawk Optimization Algorithm (MDLB-HHO)

Effective load balancing and resource distribution strategies are essential for optimizing performance and resource usage in cloud computing. Cloud computing necessitates flexible,...

Email:
Password:

Email:

Lifeline-based global load balancing

Related Results