Present position: Temporary researcher
|Thesis title:||Optimal Resource Allocation in Grid Architectures|
|Research area:||Systems architectures|
Grid Computing has emerged as a powerful technology to solve scientific problems with high computational requirements. It allows to virtualize the access to a set of geographically-distributed resources, such as desktop computers, cluster systems and high-performance parallel machines, thus helping to share and use their computing power, memory, bandwidth, services, databases, etc., in a transparent and highly dynamic way.
Grid systems are complex and their architecture is beyond control (e.g., nodes become unavailable, new resources plug in). In this scenario, selecting resources and allocating jobs in such a way that the application requirements are met, in terms of throughput and execution time, is a critical issue. Therefore, we need suitable performance models to evaluate the capability of Grid architectures and Grid applications in order to optimally utilize the available resources. For instance, such models can be used in the development of scheduling algorithms to distribute requests among the Grid nodes and to determine the best allocation policies. Modeling techniques allow to reach the expected quality of service (QoS), especially when bounded by service level agreements (SLA).
Differently from parallel computing, overhead factors due to application partitioning and communication latency are strengthened by the characteristics of Grid systems, thus they can not be neglected. As a consequence, the existing techniques based on extensions of previous models used in parallel systems must be replaced by new approaches.
The contribute of this work is threefold: (i) we create a performance model able to accurately analyze Grid architectures, (ii) we define a technique to estimate some parameters of the model and (iii) we use the model in a Desktop Grid environment in order to define a new partitioning and allocation strategy that can be used in efficiently scheduling the tasks of an application.
As for the first result, we describe a Grid system by means of a multi-level hierarchical performance model, suitable to describe complex and heterogeneous architectures by decomposing the system into layers and meta-nodes. Each level in the hierarchy is composed by a number of meta-nodes and each meta-node is, in turn, composed of a number of lower-level meta-nodes. For instance, Cluster Grids and Desktop Grids are the lowest level and can be used to compose higher-level Campus Grids and Global Grids.
On the top of the architectural model, we consider an application model: we assume a Grid application to be composed of a number of statistically identical CPU bursts, internal communication bursts (i.e., communication within a meta-node) and external communication bursts (i.e., communication among meta-nodes) that are executed in a cyclic fashion. The resulting coupled architecture/application models are based on fork-join queueing networks and order statistics theories. The proposed approaches can significantly improve the classical performance models, estimating the performance of a Grid application much more accurately. In some cases, the enhancement in the accuracy has been proved to be greater than 50% with respect to classical techniques.
The models require the knowledge of some hardware characteristics of the target Grid environment (e.g., cpu performance, network bandwidth and latency, etc.), which can be obtained, for example, by benchmarking the architecture. Furthermore, the knowledge of few characteristics of the application being executed, like the presence of synchronous/asynchronous communications and the amount of data transferred among processes, is needed. To this end, we present a robust and constrained optimization-based technique in order to estimate one of the key parameters of the model: the service requirements. In real systems, such metrics are typically difficult to measure directly. Thus we propose an estimation based on aggregate measurements, such as the application throughput and the server utilization, usually much easier to be collected. In the analyzed cases we obtain estimation relative errors lower than 4%.
Finally, the models are used to exploit new job partitioning policies in Desktop Grid environment. The models show that, increasing the number of nodes allocated to a Grid application does not automatically ensure increasing its speed-up and throughput. In heterogeneous and large-scale environments, such as Desktop Grids, this drawback results even more stressed and leads to much less than linear speed-ups. We show that, in certain cases, the optimal performance is not obtained by partitioning the jobs among all the available resources: a trade-off has to be found between partitioning and the resulting overheads. Emerging concepts like job replication can help in achieving better performance. The basic idea of job replication is the concurrent execution of multiple copies of the same job on independent nodes. This guarantees that particular, very slow, machines will not slow the aggregate progress of a computation. By merging both job partitioning and job replication, we propose a hybrid allocation strategy, which, in several cases, can be shown to outperform the speed-up of the traditional partitioning strategy.