Data analytics applications expose a significant amount of Task Level Parallelism (TLP), but their input datasets are too large to be stored in the memory of a single node, so the data to be accessed may reside on different nodes with different access times. Implementing this type of application efficiently on a general purpose processor can be a hard task. Hiding communication latencies by switching to ready tasks can mitigate the problem, but this type of application can require a large number of task switches, whose cumulative overhead can significantly reduce the speedup obtained from parallelism.
To overcome these limitations we propose to develop a heterogeneous architecture that exploits the FPGA connected to the general purpose processor to accelerate the parallel portions of the applications, which are characterized by a large number of irregular memory accesses on distributed memory systems. The hardware accelerators are composed of a set of modules implementing the tasks that make up the application. The tasks assigned to the same module share its functional units but have private registers and memories, allowing zero-delay switching among the tasks mapped on that module. In this way, tasks that issue external accesses can be suspended until their data are ready without introducing further delay penalties, while optimizing the resource usage of the FPGA device. The memory requests issued by the tasks on the different modules are managed by a parallel memory controller, which dispatches each request to the appropriate component (the internal node memory or the external bus connecting the node with other nodes). Assigning multiple tasks to the same module makes it possible to hide the latency of the memory accesses, maximising the exploitation of the computational power of the FPGA.
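The latency-hiding argument can be illustrated with a minimal software model of a single module. This is only a sketch under stated assumptions, not the proposed hardware design: the function `simulate` and all of its parameters are hypothetical, every task is assumed to alternate a fixed compute burst with one memory access of fixed latency, and the memory controller is assumed to accept any number of outstanding requests. Setting `switch_cost` to zero models the private per-task registers of a module, while a positive value models the context-switch overhead of a general purpose processor.

```python
import heapq


def simulate(n_tasks, bursts, compute_cycles, mem_latency, switch_cost=0):
    """Model one module whose tasks share a single functional unit.

    Each task runs `bursts` compute bursts of `compute_cycles` cycles and
    waits `mem_latency` cycles for data between consecutive bursts.
    All names and parameters are illustrative assumptions.
    Returns (total_cycles, utilization of the functional unit).
    """
    # Heap entries: (cycle at which the task's data is ready, task id,
    # number of bursts the task still has to run).
    pending = [(0, task, bursts) for task in range(n_tasks)]
    heapq.heapify(pending)

    now = 0      # current cycle
    busy = 0     # cycles spent on useful computation
    last = None  # task that used the functional unit most recently

    while pending:
        ready_at, task, remaining = heapq.heappop(pending)
        now = max(now, ready_at)      # unit idles until the data arrives
        if last is not None and task != last:
            now += switch_cost        # zero when contexts are private
        now += compute_cycles
        busy += compute_cycles
        last = task
        if remaining > 1:
            # Issue the next memory request; the task is suspended
            # until the request completes.
            heapq.heappush(pending, (now + mem_latency, task, remaining - 1))

    return now, busy / now


if __name__ == "__main__":
    for n_tasks, cost in [(1, 0), (10, 0), (10, 5)]:
        cycles, util = simulate(n_tasks, bursts=4, compute_cycles=10,
                                mem_latency=90, switch_cost=cost)
        print(f"tasks={n_tasks} switch_cost={cost} "
              f"cycles={cycles} utilization={util:.2f}")
```

With 10-cycle bursts and a 90-cycle memory latency, a single task keeps the functional unit busy for only about 13% of the time in this model. Ten tasks with zero-cost switching keep it fully busy and finish in 400 cycles, whereas a 5-cycle switch cost stretches the same work to 595 cycles.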