2019
informatics
Improving Locality of Unstructured Mesh Algorithms on GPUs
Supervisor:
Dr. Reguly István
Summary
To utilise modern parallel architectures most efficiently, the memory access patterns of algorithms must make heavy use of the cache hierarchy: data accessed in succession must be close in memory (spatial locality), and each piece of data must be reused as many times as possible (temporal locality). In unstructured mesh applications the challenge is to find an ordering of operations and a layout of data structures that maximises locality while avoiding data races.
In this work we analyse the performance of such algorithms on GPUs. We specifically look at the optimisation of using shared memory to programmatically cache the accessed data within thread blocks. To avoid data races, we use two-layered or hierarchical colouring: we colour the blocks so that no two blocks of the same colour write the same data, and we colour the threads to avoid data races inside the blocks.
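A minimal CUDA sketch of this two-level pattern is given below. It assumes host-side preprocessing has already built per-block index maps and colour arrays; the names edge_to_local, local_to_global, block_node_offset and edge_colour are illustrative only, not the actual interface of the library described here, and a real implementation would launch one such kernel per block colour.

// Sketch only: each launch covers blocks of a single block colour, so blocks in
// the same launch never write the same nodes; only intra-block races remain.
__global__ void increment_nodes_hier(const float *edge_val,        // value read on each edge
                                     float       *node_val,        // values incremented on nodes
                                     const int   *edge_to_local,   // 2 block-local node indices per edge
                                     const int   *local_to_global, // global indices of nodes touched by each block
                                     const int   *block_node_offset,
                                     const char  *edge_colour,     // thread (edge) colour within its block
                                     int          num_thread_colours,
                                     int          num_edges)
{
    extern __shared__ float s_node[];        // block-local cache of the touched node data

    int e      = blockIdx.x * blockDim.x + threadIdx.x;
    int n_base = block_node_offset[blockIdx.x];
    int n_cnt  = block_node_offset[blockIdx.x + 1] - n_base;

    // Gather: stage the nodes this block touches into shared memory.
    for (int i = threadIdx.x; i < n_cnt; i += blockDim.x)
        s_node[i] = node_val[local_to_global[n_base + i]];
    __syncthreads();

    float inc0 = 0.0f, inc1 = 0.0f;
    int   l0 = 0, l1 = 0;
    if (e < num_edges) {
        l0 = edge_to_local[2 * e];
        l1 = edge_to_local[2 * e + 1];
        inc0 =  0.5f * edge_val[e];          // the user's computation itself is unchanged
        inc1 = -0.5f * edge_val[e];
    }

    // Thread-level colouring: apply increments one colour at a time so that
    // threads of the same block never update the same shared-memory slot at once.
    for (int c = 0; c < num_thread_colours; ++c) {
        if (e < num_edges && edge_colour[e] == c) {
            s_node[l0] += inc0;
            s_node[l1] += inc1;
        }
        __syncthreads();
    }

    // Scatter: write the cached nodes back to global memory.
    for (int i = threadIdx.x; i < n_cnt; i += blockDim.x)
        node_val[local_to_global[n_base + i]] = s_node[i];
}

The loop over thread colours is where the synchronisation cost of the method appears: each additional thread colour adds a __syncthreads() barrier.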
With this method there is a trade-off between data reuse and the amount of synchronisation required between threads. We examine different block layouts in structured meshes to analyse the effect of each.
We analyse how performance can be increased by partitioning the mesh to increase data reuse, how warp divergence can be reduced by ordering the operations within a block according to their colour, and how Structure of Arrays transformations affect the memory access patterns.
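As an illustration of the last point, the hypothetical kernels below contrast an Array of Structures layout with a Structure of Arrays layout for a three-component nodal field; with SoA, consecutive threads read consecutive addresses, so the loads coalesce into fewer memory transactions.

// Array of Structures: thread t reads three values 12 bytes apart, so
// consecutive threads access strided addresses.
struct NodeAoS { float x, y, z; };

__global__ void scale_aos(NodeAoS *nodes, float s, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) { nodes[t].x *= s; nodes[t].y *= s; nodes[t].z *= s; }
}

// Structure of Arrays: consecutive threads read consecutive elements of
// x[], y[] and z[], giving fully coalesced accesses.
__global__ void scale_soa(float *x, float *y, float *z, float s, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) { x[t] *= s; y[t] *= s; z[t] *= s; }
}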
We developed a standalone library that can transparently reorder the operations performed and the data accessed by a kernel, without requiring the user to modify the algorithm.
Using this, we performed measurements on kernels from different applications, such as Airfoil, Bookleaf and Lulesh. We did not change the code of the computations, only the representation of the data, the order of the operations and the way the result is written back to memory. We observed speedups as high as 20%.