Abstract Dr. Andrei Tchernykh

 

 

 

 

   

Large Scale Supercomputing: Resource Management in Computational Grids

 

Since introducing Grid computing in the early 90s and resource management in the last years, a broad variety of scheduling strategies have been suggested and investigated with different assumptions such as a Grid model (centralized or decentralized), hierarchy, allocation (co-allocation), job description, etc. Many different classes of these strategies have been defined in the last decade. A main drawback of these approaches is that most of them are based on experimental analysis and theoretical analysis is weakly exploited. Our talk provides a theoretical foundation and performance evaluation that will include:
(1) theoretical analysis of scheduling algorithms under different models (such as one-level, multilayer hierarchy scheduling, work stealing, load balancing, adaptive scheduling, idle regulation, etc) and constrains;
(2) experimental studies of proposed strategies based on realistic Grid parameters and real workloads;
(3) comparison of the behavior of such algorithms.

Scheduling in Grids is vital to achieve efficiently operating Grids. While scheduling in general is well understood and has been subject of research for many years, there are still only few theoretical results available. This presentation addresses non preemptive online scheduling of parallel jobs on a Grid. We consider two Grid scheduling models: peer-to-peer and two layer hierarchical models. We show that the performance of Garey and Graham’s list scheduling algorithm is significantly worse in Grids than in multiprocessors.
For the peer-to-peer Grids we present a new scheduling algorithm that guarantees a competitive factor of 5.The offline non-clairvoyant version of this algorithm has an approximation factor of 3. This "one-layer" algorithm can be implemented in centralized fashion or use a distributed "job stealing" approach. Although jobs are allocated to a machine at their submission times they can migrate before they have been started if another machine become sidle. It may be well suited to serve as a starting point for Grid scheduling algorithms in real systems. Other model uses a two layer hierarchical structure and covers the main properties of Grids, for instance, different machine sizes and parallel jobs. At the first layer jobs are allocated to a suitable machine while at the second layer, local scheduling is applied to each machine independently. Migration between machines is not allowed. We use this model to find a suitable tradeoff between a fully centralized and a fully decentralized mode. The highest layer is often a Grid-level scheduler that may have a more general view of the resources while the lowest layer is the local resource management system that manages a specific resource or set of resources. Other layers may exist in between. At every layer, additional constraints and specifications must be considered, for instance, related to the dynamics of the resource situation. Suitable scheduling algorithms are needed to support such multilayer structures of resource management.
We discuss strategies based on various combinations of allocation strategies and local scheduling algorithms. We propose a relatively simple scheme named adaptive admissible allocation. The theoretical worst case analysis yields decent bounds of the competitive ratio for certain workload configurations. We show that the algorithm is beneficial under certain conditions and allows an efficient implementation in real systems. Furthermore, a dynamic and adaptive approach is presented which can cope with different workloads and Grid properties. We evaluate the practical performance of the proposed strategies and their derivatives. To this end, we made simulations using real workload traces and corresponding Grid configurations. Further, we will compare our approach with other existing Grid scheduling strategies which are typically based on heuristics.