preconditioner. Uses graph coloring to exploit
parallelism in upper and triangular solves when
computing a diagonal approximate inverse of a
sparse matrix. Supports blocksizes up to 3.
Implement calls to cuBlas, cuSparse and implement necessary
CUDA kernels to perform a single iteration of the jacobi preconditioner.
Add tests that verify new kernels and the preconditioner in its totality.
The preconditioner is verified on 2x2 and 3x3 blocks, which as of now
are the only supported sizes. 1x1 are not supported because cuSparse
does not support it.