Concurrent number cruncher - A GPU implementation of a general sparse linear sol

Luc Buatois, Guillaume Caumon and Bruno Levy (2008)

in: Proc. 28th Gocad Meeting, Nancy

Abstract

A wide class of numerical methods needs to solve a linear system in which the pattern of non-zero matrix coefficients can be arbitrary. These problems can greatly benefit from the highly multithreaded computational power and large memory bandwidth available on GPUs, especially since dedicated general-purpose APIs such as CTM (AMD-ATI) and CUDA (NVIDIA) have appeared. CUDA even provides a BLAS implementation (CuBLAS), but only for dense matrices. Other existing linear solvers for the GPU are also limited by their internal matrix representation. This paper describes how to combine recent GPU programming techniques and new GPU-dedicated APIs with high-performance computing strategies (namely block compressed row storage, register blocking and vectorization) to implement a sparse general-purpose linear solver. Our implementation of the Jacobi-preconditioned Conjugate Gradient algorithm outperforms leading-edge CPU counterparts by up to a factor of 11.5, making it attractive for applications that can be content with single precision.
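For readers unfamiliar with the algorithm named in the abstract, the following is a minimal CPU sketch of a Jacobi-preconditioned Conjugate Gradient solver, not the authors' GPU implementation. It makes several simplifying assumptions: it uses plain (non-blocked) compressed row storage rather than the block variant the paper describes, it runs in Python double precision rather than single precision, and all function and variable names (`csr_matvec`, `csr_diagonal`, `jacobi_pcg`, `values`, `col_idx`, `row_ptr`) are hypothetical. The matrix is assumed symmetric positive definite with a non-zero diagonal.

```python
import math


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A x for a matrix A held in compressed row storage."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        # Row i owns the stored entries row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        y[i] = s
    return y


def csr_diagonal(values, col_idx, row_ptr):
    """Extract the diagonal of A; the Jacobi preconditioner is its inverse."""
    n = len(row_ptr) - 1
    diag = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            if col_idx[k] == i:
                diag[i] = values[k]
    return diag


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def jacobi_pcg(values, col_idx, row_ptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b with CG, preconditioned by the inverse diagonal of A."""
    n = len(b)
    inv_diag = [1.0 / d for d in csr_diagonal(values, col_idx, row_ptr)]
    x = [0.0] * n
    r = list(b)                                   # residual b - A x, with x = 0
    z = [m * ri for m, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(z)                                   # first search direction
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b)) or 1.0
    for _ in range(max_iter):
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            break
        ap = csr_matvec(values, col_idx, row_ptr, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = [m * ri for m, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


# Small example: the tridiagonal matrix [[4, -1, 0], [-1, 4, -1], [0, -1, 4]].
values = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
col_idx = [0, 1, 0, 1, 2, 1, 2]
row_ptr = [0, 2, 5, 7]
b = [3.0, 2.0, 3.0]  # chosen so that the exact solution is all ones

x = jacobi_pcg(values, col_idx, row_ptr, b)
print(x)
```

The sparse matrix-vector product in `csr_matvec` dominates the cost of each iteration, which is why the storage layout of the matrix matters so much for a GPU port.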

BibTeX Reference

@inproceedings{131.buatois,
 abstract = { A wide class of numerical methods needs to solve a linear system, where the matrix pattern of non-zero
coefficients can be arbitrary. These problems can greatly benefit from highly multithreaded computational
power and large memory bandwidth available on GPUs, especially since dedicated general purpose APIs
such as CTM (AMD-ATI) and CUDA (NVIDIA) have appeared. CUDA even provides a BLAS implementation,
but only for dense matrices (CuBLAS). Other existing linear solvers for the GPU are also limited by
their internal matrix representation.
This paper describes how to combine recent GPU programming techniques and new GPU dedicated
APIs with high performance computing strategies (namely block compressed row storage, register blocking
and vectorization), to implement a sparse general-purpose linear solver. Our implementation of the Jacobi-preconditioned
Conjugate Gradient algorithm outperforms leading-edge CPU counterparts by up to a factor of 11.5,
making it attractive for applications that can be content with single precision. },
 author = { Buatois, Luc AND Caumon, Guillaume AND Levy, Bruno },
 booktitle = { Proc. 28th Gocad Meeting, Nancy },
 title = { Concurrent number cruncher - A GPU implementation of a general sparse linear sol },
 year = { 2008 }
}
