Implementation of parallel tridiagonal solvers for a heterogeneous computing environment


  • Hamish J Macintosh
  • David J Warne
  • Neil A Kelson
  • Jasmine E Banks
  • Troy W Farrell



Tridiagonal diagonally dominant linear systems arise in many scientific and engineering applications. The standard Thomas algorithm for solving such systems is inherently serial, forming a bottleneck in computation. Algorithms such as cyclic reduction and SPIKE reduce a single large tridiagonal system into multiple small independent systems which are solved in parallel. We develop portable cyclic reduction and the SPIKE algorithm for Open Computing Language implementations on a range of co-processors in a heterogeneous computing environment, including field programmable gate arrays, graphics processing units and other multi-core processors. We evaluate these designs in the context of solver performance, resource efficiency and numerical accuracy. References
  • Altera. Implementing FPGA design with the OpenCL standard. Technical report, Altera Inc., November 2013.
  • Altera. Altera SDK for OpenCL: Best practices guide. Technical report, Altera Inc., May 2015.
  • P. Arbenz, A. Cleary, J. Dongarra, and M. Hegland. A comparison of parallel solvers for diagonally dominant and general narrow-banded linear systems. Parallel and Distributed Computing Practices, 2:385–400, 1999.
  • P. Arbenz, A. Cleary, J. Dongarra, and M. Hegland. A comparison of parallel solvers for diagonally dominant and general narrow-banded linear systems II. In Euro-Par'99 Parallel Processing, Berlin, vol. 1685 of Lecture Notes in Computer Science pp. 1078–1087. Springer, 1999. doi:10.1007/3-540-48311-X_151
  • L.-W. Chang, J. A. Stratton, H.-S. Kim, and W. W. Hwu. A scalable, numerically stable, high-performance tridiagonal solver using GPUs. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11. IEEE Computer Society Press, 2012. doi:10.1109/SC.2012.12
  • Y. Dou, Y. Lei, G. Wu, S. Guo, J. Zhuo, and L. Shen. FPGA accelerating double/quad-double high precision floating-point applications for exascale computing. In Proceedings of the 24th ACM International Conference on Supercomputing, pp. 325–336, 2010. doi:10.1145/1810085.1810129
  • C. R. Dun, M. Hegland, and M. R. Osborne. Parallel stable solution methods for tridiagonal linear systems of equations. In Computational Techniques and Applications: CTAC95 Proceedings of the Seventh Biennial Conference, pp. 267–274. World Scientific, 1996. doi:10.1142/9789814530651
  • W. Gander and G. H. Golub. Cyclic reduction–-history and applications. In Scientific Computing, Proceedings of the Workshop, 1997, pp. 73–111. Springer, 1998.
  • Khronos OpenCL Working Group. The OpenCL specification. Technical report, Khronos Group, October 2009.
  • M. Hegland. On the parallel solution of tridiagonal systems by wrap-around partitioning and incomplete LU factorization. Numer. Math., 59(1):453–472, 1991. doi:10.1007/BF01385791
  • D. Jensen and A. F. Rodrigues. Embedded systems and exascale computing. Comput. Sci. Eng., 12(6):20–29, 2010. doi:10.1109/MCSE.2010.95
  • S. Palmer. Accelerating implicit finite difference schemes using a hardware optimised implementation of the thomas algorithm for FPGAs. Technical Report, 2014.
  • E. Polizzi and A. H. Sameh. A parallel hybrid banded system solver: the spike algorithm. Parallel Comput., 32:177–194, 2006. doi:10.1016/j.parco.2005.07.005
  • W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in FORTRAN: The art of Scientific Computation. Cambridge University Press, Cambridge, England, 2nd edition, 1992.
  • J. Shalf, S. Dosanjh and J. Morrison. Exascale computing technology challenges. In High Performance Computing for Computational Science–-VECPAR 2010, vol. 6449 of Lecture Notes in Computer Science, pp 1–25. Springer, 2011. doi:10.1007/978-3-642-19328-6_1
  • S. Skalicky, S. Lopez, M. Lukowiak, J. Letendre and M. Ryan. Performance modeling of pipelined linear algebra architectures on FPGAs. In Reconfigurable Computing: Architectures, Tools and Applications, vol. 7806, Lecture Notes in Computer Science pp. 146-153. Springer, 2013. doi:10.1007/978-3-642-36812-7_14
  • D. J. Warne, R. F. Hayward, N. A. Kelson, J. E. Banks, and L. Mejias. Pulse-coupled neural network performance for real-time identification of vegetation during forced landing. In Engineering Mathematics and Applications Conference (EMAC2013), vol. 55 of ANZIAM J., pp. C1–C16, 2014.
  • D. J. Warne, N. A. Kelson, and R. F. Hayward. Solving tri-diagonal linear systems using field programmable gate arrays. In Proceedings of the 4th International Conference on Computational Methods (ICCM2012), 2012.
  • D. J. Warne, N. A. Kelson, and R. F. Hayward. Comparison of high level FPGA hardware design for solving tri-diagonal linear systems. Proc. Comput. Sci., 29:95–101, 2014. doi:10.1016/j.procs.2014.05.009
  • L. Wirbel. Xilinx SDAccel: A unified development environment for tomorrow's data center. Technical report, The Linley Group Inc., November 2014.
  • G. Wu, Y. Dou, J. Sun and G. D. Peterson. A high performance and memory efficient LU decomposer on FPGAs. IEEE T. Comput., 61:366–378, 2012. doi:10.1109/TC.2010.278





Proceedings Computational Techniques and Applications Conference