### Implementation of parallel tridiagonal solvers for a heterogeneous computing environment

#### Abstract

Tridiagonal diagonally dominant linear systems arise in many scientific and engineering applications. The standard Thomas algorithm for solving such systems is inherently serial, forming a bottleneck in computation. Algorithms such as cyclic reduction and SPIKE reduce a single large tridiagonal system into multiple small independent systems which are solved in parallel. We develop portable cyclic reduction and the SPIKE algorithm for Open Computing Language implementations on a range of co-processors in a heterogeneous computing environment, including field programmable gate arrays, graphics processing units and other multi-core processors. We evaluate these designs in the context of solver performance, resource efficiency and numerical accuracy.

**References**- Altera.
*Implementing FPGA design with the OpenCL standard.*Technical report, Altera Inc., November 2013. http://www.altera.com/literature/wp/wp-01173-opencl.pdf - Altera.
*Altera SDK for OpenCL: Best practices guide.*Technical report, Altera Inc., May 2015. https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/hb/opencl-sdk/aocl_optimization_guide.pdf - P. Arbenz, A. Cleary, J. Dongarra, and M. Hegland. A comparison of parallel solvers for diagonally dominant and general narrow-banded linear systems.
*Parallel and Distributed Computing Practices*, 2:385–400, 1999. http://www.scpe.org/index.php/scpe/article/view/152 - P. Arbenz, A. Cleary, J. Dongarra, and M. Hegland. A comparison of parallel solvers for diagonally dominant and general narrow-banded linear systems II. In
*Euro-Par'99 Parallel Processing*, Berlin, vol. 1685 of*Lecture Notes in Computer Science*pp. 1078–1087. Springer, 1999. doi:10.1007/3-540-48311-X_151 - L.-W. Chang, J. A. Stratton, H.-S. Kim, and W. W. Hwu. A scalable, numerically stable, high-performance tridiagonal solver using GPUs. In
*Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis*, pp. 1–11. IEEE Computer Society Press, 2012. doi:10.1109/SC.2012.12 - Y. Dou, Y. Lei, G. Wu, S. Guo, J. Zhuo, and L. Shen. FPGA accelerating double/quad-double high precision floating-point applications for exascale computing. In
*Proceedings of the 24th ACM International Conference on Supercomputing*, pp. 325–336, 2010. doi:10.1145/1810085.1810129 - C. R. Dun, M. Hegland, and M. R. Osborne. Parallel stable solution methods for tridiagonal linear systems of equations. In
*Computational Techniques and Applications: CTAC95 Proceedings of the Seventh Biennial Conference*, pp. 267–274. World Scientific, 1996. doi:10.1142/9789814530651 - W. Gander and G. H. Golub. Cyclic reduction–-history and applications. In
*Scientific Computing, Proceedings of the Workshop*, 1997, pp. 73–111. Springer, 1998. http://www.springer.com/fr/book/9789813083608 - Khronos OpenCL Working Group.
*The OpenCL specification.*Technical report, Khronos Group, October 2009. http://www.khronos.org/registry/cl/specs/opencl-1.0.pdf - M. Hegland. On the parallel solution of tridiagonal systems by wrap-around partitioning and incomplete LU factorization.
*Numer. Math.*, 59(1):453–472, 1991. doi:10.1007/BF01385791 - D. Jensen and A. F. Rodrigues. Embedded systems and exascale computing.
*Comput. Sci. Eng.*, 12(6):20–29, 2010. doi:10.1109/MCSE.2010.95 - S. Palmer. Accelerating implicit finite difference schemes using a hardware optimised implementation of the thomas algorithm for FPGAs.
*Technical Report*, 2014. http://arxiv.org/abs/1402.5094 - E. Polizzi and A. H. Sameh. A parallel hybrid banded system solver: the spike algorithm.
*Parallel Comput.*, 32:177–194, 2006. doi:10.1016/j.parco.2005.07.005 - W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling.
*Numerical Recipes in FORTRAN: The art of Scientific Computation*. Cambridge University Press, Cambridge, England, 2nd edition, 1992. http://www.cambridge.org/au/academic/subjects/mathematics/numerical-recipes/numerical-recipes-fortran-77-art-scientific-computing-volume-1-2nd-edition - J. Shalf, S. Dosanjh and J. Morrison. Exascale computing technology challenges. In
*High Performance Computing for Computational Science–-VECPAR 2010*, vol. 6449 of*Lecture Notes in Computer Science*, pp 1–25. Springer, 2011. doi:10.1007/978-3-642-19328-6_1 - S. Skalicky, S. Lopez, M. Lukowiak, J. Letendre and M. Ryan. Performance modeling of pipelined linear algebra architectures on FPGAs. In
*Reconfigurable Computing: Architectures, Tools and Applications*, vol. 7806, Lecture Notes in Computer Science pp. 146-153. Springer, 2013. doi:10.1007/978-3-642-36812-7_14 - D. J. Warne, R. F. Hayward, N. A. Kelson, J. E. Banks, and L. Mejias. Pulse-coupled neural network performance for real-time identification of vegetation during forced landing. In
*Engineering Mathematics and Applications Conference (EMAC2013)*, vol. 55 of*ANZIAM J.*, pp. C1–C16, 2014. http://journal.austms.org.au/ojs/index.php/ANZIAMJ/article/view/7851 - D. J. Warne, N. A. Kelson, and R. F. Hayward. Solving tri-diagonal linear systems using field programmable gate arrays. In
*Proceedings of the 4th International Conference on Computational Methods (ICCM2012)*, 2012. http://eprints.qut.edu.au/54894/ - D. J. Warne, N. A. Kelson, and R. F. Hayward. Comparison of high level FPGA hardware design for solving tri-diagonal linear systems.
*Proc. Comput. Sci.*, 29:95–101, 2014. doi:10.1016/j.procs.2014.05.009 - L. Wirbel.
*Xilinx SDAccel: A unified development environment for tomorrow's data center.*Technical report, The Linley Group Inc., November 2014. http://www.xilinx.com/publications/prod_mktg/sdnet/sdaccel-wp.pdf - G. Wu, Y. Dou, J. Sun and G. D. Peterson. A high performance and memory efficient LU decomposer on FPGAs.
*IEEE T. Comput.*, 61:366–378, 2012. doi:10.1109/TC.2010.278

DOI: http://dx.doi.org/10.21914/anziamj.v56i0.9371

**Remember,**for most actions you have to record/upload into this online system

**and then**inform the editor/author via clicking on an email icon or Completion button.

ANZIAM Journal, ISSN 1446-8735, copyright Australian Mathematical Society.