Optimized and Parallelized Floyd-Warshall Algorithm for CPUs and GPUs

Skills & Tools: C, Nvidia CUDA, OpenMP, BU Shared Computing Cluster, Nvidia Tesla K40m
Github URL: Project Link
Paper URL: Link
Slideshow URL: Google Slide

Optimized the Floyd-Warshall all-pairs shortest path algorithm for both CPU and GPU architectures. Achieved over a 6 times speedup on the CPU and 20 times speedup on the Nvidia Tesla K40m GPU.

Designed a tiled and milti-threaded version of the algorithm optimized for an 8-core CPU to improve cache locality and thread-level parallelism using OpenMP while carefully respecting loop carried dependencies to maintain correctness.

On the GPU, I designed a CUDA implementation using GPU global memory, launching the CUDA kernel within a loop over k (the outermost loop) to respect loop carried data dependencies. For parallelism and further acceleration, the CUDA kernel flattend the inner loops to leverage massive thread-level and block-level parallelism.

Gained valuable experience in memory hierarchy, multi-threaded programming, CUDA, synchronization, and parallel algorithm design.