Optimized and Parallelized Floyd-Warshall Algorithm for CPUs and GPUs
- Skills & Tools: C, Nvidia CUDA, OpenMP, BU Shared Computing Cluster, Nvidia Tesla K40m
- Github URL: Project Link
- Paper URL: Link
- Slideshow URL: Google Slide
Optimized the Floyd-Warshall all-pairs shortest path algorithm for both CPU and GPU architectures. Achieved over a 6 times speedup on the CPU and 20 times speedup on the Nvidia Tesla K40m GPU.
Designed a tiled and milti-threaded version of the algorithm optimized for an 8-core CPU to improve cache locality and thread-level parallelism using OpenMP while carefully respecting loop carried dependencies to maintain correctness.
On the GPU, I designed a CUDA implementation using GPU global memory, launching the CUDA kernel within a loop over k (the outermost loop) to respect loop carried data dependencies. For parallelism and further acceleration, the CUDA kernel flattend the inner loops to leverage massive thread-level and block-level parallelism.
Gained valuable experience in memory hierarchy, multi-threaded programming, CUDA, synchronization, and parallel algorithm design.