Thesis title: Extracting synchronization-free parallelism with the slicing framework
Advisor: Pierluigi San Pietro
Research area: Advanced Software Architectures and Methodologies
With the recent growth in complexity of real-world applications in fields such as networked multimedia systems, scientific and engineering computing, weather prediction, chemical dynamics, structural biology, and online transaction processing, feasible techniques are needed to meet real-time response constraints. As the performance of highly integrated single-chip CMOS microprocessors progressively increases, the current trend is to build multiprocessor computer systems out of small, low-power processors. At the same time, the processor clock cycle is approaching its limit, and it is becoming increasingly difficult to achieve higher speeds. By distributing data and computations among processors (multiprocessor systems) and computers (distributed systems), parallel and distributed computing allows us to meet speed and memory requirements. For many large-scale problems, parallel and distributed applications are the only feasible way to obtain answers in a reasonable amount of time.
Parallel algorithms and programs can be developed manually or automatically. Manual development requires carrying out research, and therefore demands considerable skill and time and incurs high costs. An alternative is to develop and apply techniques that automatically find parallelism in existing sequential algorithms and generate corresponding parallel programs.
In real-world applications, most computation is concentrated in loops; hence the parallelization of sequential programs consists in detecting such loops and extracting the parallelism available in them.
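To make loop-level dependence concrete, the sketch below (illustrative only; the function `carried_dependence` and its instance encoding are our own, not part of the thesis) checks whether any iteration of a simple loop reads a value written by an earlier iteration, which is what blocks naive parallelization:

```python
# Illustrative sketch (our own encoding, not the thesis's dependence tests):
# detect whether a loop carries a flow dependence, i.e. whether some
# iteration reads an array element written by an earlier iteration.

def carried_dependence(writes, reads):
    """writes[k] is the element written at iteration k; reads[k] the
    elements read there. Returns True on a loop-carried flow dependence."""
    written = {}
    for k, (w, rs) in enumerate(zip(writes, reads)):
        for r in rs:
            if r in written and written[r] < k:
                return True
        written[w] = k
    return False

n = 5
# Loop 1: a[i] = a[i-1] + b[i] -- iteration i reads a[i-1], written at i-1.
loop1 = carried_dependence([('a', i) for i in range(1, n)],
                           [[('a', i - 1)] for i in range(1, n)])
# Loop 2: c[i] = a[i] * 2 -- iterations never exchange values.
loop2 = carried_dependence([('c', i) for i in range(n)],
                           [[('a', i)] for i in range(n)])
print(loop1, loop2)  # -> True False
```

Loop 1's iterations must run in order, while Loop 2's may run in parallel as they stand.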
There exist two kinds of parallelism that can be extracted from a loop: fine- and coarse-grained parallelism, permitting the creation of fine- and coarse-grained processes in computer systems, respectively. A process is fine-grained if its completion time is small relative to the time it takes to create it and send data to it. Vector and superscalar architectures deal with such processes. A coarse-grained process is one whose completion time is large relative to the overhead of the parallel system. A grain may be considered small, or fine, in one system, and large, or coarse, in another. Coarse-grained processes are critical for multiprocessors with shared memory. Such processes are obtained by creating a thread of computations on each processor to be executed independently or with occasional synchronization. To achieve good performance in these systems, the granularity of parallelism should be large enough to compensate for the overhead of exploiting parallelism and synchronizing processes. Although finding the coarse-grained parallelism available in loops allows us to obtain scalable performance on parallel computers and distributed systems, its purpose is not limited to this: it may also increase performance on a uniprocessor by enhancing data locality and reducing memory requirements, hence leading to decreased cost and power consumption in, e.g., embedded systems.
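As a minimal illustration of coarse granularity (our own toy example, not drawn from the thesis), consider the loop nest `for i: for j: a[i][j] = a[i][j-1] + b[i][j]`: each row `i` is a chain of dependent operations, but rows never interact, so each row can become a coarse-grained process and the rows may run in any order without synchronization:

```python
# Our own toy example of coarse-grained, synchronization-free parallelism:
# in  for i: for j in 1..n-1: a[i][j] = a[i][j-1] + b[i][j]
# each row i is a chain of dependent operations, but rows never interact,
# so rows may be distributed over threads and run in any order.
m, n = 4, 6
b = [[i * n + j for j in range(n)] for i in range(m)]

def run(rows, a):
    for i in rows:                # rows processed in the given order
        for j in range(1, n):     # within a row the order matters (a chain)
            a[i][j] = a[i][j - 1] + b[i][j]
    return a

fresh = lambda: [[0] * n for _ in range(m)]
seq = run(range(m), fresh())            # sequential row order
rev = run(reversed(range(m)), fresh())  # any other row order
assert seq == rev                       # identical results: rows are independent
```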
This work deals with extracting coarse-grained parallelism by means of parallelizing compiler techniques. Many techniques have been proposed that extract coarse-grained parallelism by transforming a sequential loop into a parallel one. However, all well-known methods have limitations that lead to losses of some of the coarse-grained parallelism available in loops. The Affine Transformation Framework, or Polyhedral Framework, is one of the most powerful techniques for extracting parallelism today. It unifies a large class of transformations and allows us to extract synchronization-free parallelism for shared-memory multiprocessors and communication-free parallelism for distributed-memory systems. But even this framework is not optimal, because it does not extract all of the parallelism available in the general case of arbitrarily nested loops. The non-optimality of well-known techniques motivates further research aimed at developing advanced approaches that extract more coarse-grained parallelism than well-known techniques do. In this work, we consider coarse-grained parallelism represented by synchronization-free slices.
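The idea behind synchronization-free slices can be pictured with a simplified graph model (ours, not the thesis's relation-based formulation): transitively dependent statement instances must stay together, so the slices correspond to the weakly connected components of the dependence graph, and distinct slices can execute with no synchronization between them:

```python
# Simplified model (ours): synchronization-free slices as the weakly
# connected components of a dependence graph over statement instances.
from collections import defaultdict

def slices(instances, deps):
    adj = defaultdict(set)
    for s, t in deps:                 # dependences viewed as undirected edges
        adj[s].add(t)
        adj[t].add(s)
    seen, comps = set(), []
    for v in instances:
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                comp.add(u)
                stack.extend(adj[u] - seen)
        comps.append(comp)
    return comps

# a[i] = a[i-2] + 1, i = 2..9: even and odd iterations never interact,
# yielding two synchronization-free slices.
comps = sorted(sorted(c) for c in slices(range(10), [(i - 2, i) for i in range(2, 10)]))
print(comps)  # -> [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
```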
The goal of this thesis was to derive algorithms that extract more coarse-grained parallelism, represented by synchronization-free slices, from a loop than well-known techniques do, as well as to develop tools and carry out experiments that permit us to evaluate the effectiveness and efficiency of the proposed approaches on different codes. Our algorithms are within the Iteration Space Slicing Framework, which takes dependence information as input to find all statement instances of a given loop nest that must be executed to produce the correct values for specified array elements. We contribute to the slicing framework by proposing algorithms for the extraction of synchronization-free slices and for code generation.
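The core question of Iteration Space Slicing — which statement instances must execute to produce given array elements — amounts, in a simplified graph model (the names `backward_slice` and `producers` are ours, used only for illustration), to a backward transitive closure over dependences:

```python
# Sketch (ours) of the question Iteration Space Slicing answers: which
# statement instances must run to produce given array elements? In graph
# terms, the backward transitive closure over dependences.

def backward_slice(targets, producers):
    """producers maps an instance to the instances whose results it reads."""
    need, stack = set(), list(targets)
    while stack:
        v = stack.pop()
        if v not in need:
            need.add(v)
            stack.extend(producers.get(v, ()))
    return need

# a[i] = a[i-1] + b[i], i = 1..7: instance i reads the result of i-1, so
# producing a[4] requires instances 0..4 and nothing later.
producers = {i: [i - 1] for i in range(1, 8)}
assert backward_slice([4], producers) == {0, 1, 2, 3, 4}
```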
The thesis contains nine chapters and a bibliography. Chapter 2 first presents basic concepts from loop transformation theory and then introduces the basic philosophy of the slicing theory. Chapter 3 describes the state of the art in loop parallelization techniques aimed at the extraction of coarse-grained parallelism; it illustrates the limitations of well-known techniques and motivates the need for novel, more powerful approaches. Chapters 4, 5, and 6 present our approach to extracting synchronization-free slices. Chapter 4 describes how to extract the sources of synchronization-free slices, while Chapter 5 presents code-generation algorithms to scan the elements of slices in the following cases:
• slices are described with affine expressions (the topology of a slice is arbitrary);
• slices are described with non-linear expressions and form chains of dependent operations;
• slices are described with non-linear expressions and form trees of dependent operations.
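For the chain case above, scanning a slice can be pictured as starting at its source and repeatedly applying the dependence relation until it yields no image. The sketch below is a toy model (the relation and the name `scan_chain` are ours, not the thesis's generated code), using the non-linear relation `i -> i*i`:

```python
# Toy model (ours) of scanning a chain-topology slice: start at the slice's
# source and repeatedly apply the dependence relation until it has no image.
def scan_chain(source, next_instance):
    order, v = [], source
    while v is not None:
        order.append(v)       # the statement instance would execute here
        v = next_instance(v)
    return order

# A chain induced by the non-linear relation i -> i*i (while i*i < 100):
assert scan_chain(2, lambda i: i * i if i * i < 100 else None) == [2, 4, 16]
```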
Chapter 5 also describes how to deal with slices that are described with non-linear expressions and whose topology is a graph (neither a chain nor a tree). An algorithm is presented to convert dependence relations describing such graphs into dependence relations representing trees or chains, so that all dependences are preserved when the code generated from the converted relations is executed. Chapter 6 introduces an approach to extract synchronization-free slices in inner loop nests when it is not possible to extract slices in the entire loop domain. Chapter 7 presents the results of experiments performed to estimate the scope of applicability of the proposed algorithms for the extraction of coarse-grained parallelism, as well as the efficiency of the generated code. Chapter 8 outlines directions for further research needed to exploit the full power of the proposed algorithms for extracting synchronization-free slices, as well as to devise, on the basis of the algorithms proposed in this thesis, novel algorithms for extracting slices that require synchronization. Chapter 9 contains our conclusions.
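A much-simplified picture of such a graph-to-chain conversion (ours, not the thesis algorithm): serializing a graph-topology slice along a topological order yields a chain in which every original dependence points forward, so all dependences are respected when the chain is executed in order:

```python
# Simplified sketch (not the thesis algorithm): serialize a graph-topology
# slice into a chain given by a topological order; every original dependence
# (s, t) then runs s before t, so all dependences are preserved.
def topo_chain(instances, deps):
    indeg = {v: 0 for v in instances}
    succ = {v: [] for v in instances}
    for s, t in deps:
        succ[s].append(t)
        indeg[t] += 1
    ready = sorted(v for v in instances if indeg[v] == 0)
    order = []
    while ready:
        v = ready.pop(0)
        order.append(v)
        for t in succ[v]:
            indeg[t] -= 1
            if indeg[t] == 0:
                ready.append(t)
    return order

# A diamond-shaped slice: 0 -> {1, 2} -> 3 (neither a chain nor a tree).
deps = [(0, 1), (0, 2), (1, 3), (2, 3)]
order = topo_chain([0, 1, 2, 3], deps)
assert all(order.index(s) < order.index(t) for s, t in deps)
```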