Implement graphcoloring to expose rows in level sets that that can be
executed in parallel during the sparse triangular solves.
Add copy of A matrix that is reordered to ensure continuous memory reads
when traversing the matrix in level set order.
TODO: add number of threads available as constructor argument in DILU