[firedrake] cached kernels

Eike Mueller E.Mueller at bath.ac.uk
Thu Nov 12 11:49:53 GMT 2015


Hi David,


> Is a bandwidth 3, 64 row matrix enough to make LAPACK fast? That's not a hell of a lot of non-zero entries....


that’s a good question, and it could well be that LAPACK is not efficient on problems of this size. I use the banded LAPACK storage format, so that’s only about 3x64 = 192 stored entries per matrix, which is not much. Naively I would have thought that LAPACK adapts its algorithm to the matrix size, but maybe it simply doesn’t work well on problems this small.
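
To put a number on that, here is a back-of-envelope sketch (assuming "bandwidth 3" means a tridiagonal system, i.e. kl = ku = 1, plus the kl extra fill-in rows that dgbtrf requires):

#include <stdio.h>

int main(void) {
  int n = 64, kl = 1, ku = 1;
  int ldab = 2 * kl + ku + 1;                       /* 4 rows incl. fill-in */
  size_t bytes = (size_t)ldab * n * sizeof(double)  /* band factor     */
               + (size_t)n * sizeof(int)            /* pivot indices   */
               + (size_t)n * sizeof(double);        /* right-hand side */
  printf("one column system: %zu bytes\n", bytes);  /* 2816 bytes */
  return 0;
}

so one column system is under 3kB and sits comfortably in L1 cache, which would fit the suspicion that the problem is simply too small for LAPACK's blocking to pay off.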

Eike

On Thu, 12 Nov 2015 at 11:39 Eike Mueller <E.Mueller at bath.ac.uk> wrote:
Hi Lawrence,

>> I have effectively no idea what's going on.  Does the LU solve take
>> this long on this much data if you just call it from C?

I just carried out exactly that experiment (see the C code in firedrake-multigridpaper/code/test_lusolve). In each vertical column I do an LU solve with exactly the same matrix size and bandwidth as in the firedrake code. I use the same horizontal grid size as for the firedrake run, and the compiler/linker flags that PETSc prints at the end of the run, so I’m confident that the C code does exactly the same work as the code autogenerated by firedrake (the only difference is that I initialise the matrix with random values, made diagonally dominant to avoid excessive pivoting).
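
For anyone who wants to try this without checking out the paper repository, the core of the experiment looks roughly like the sketch below. This is not the actual test code: it assumes a tridiagonal system per column (kl = ku = 1), calls LAPACK's dgbtrf/dgbtrs through the Fortran interface, and uses the lowest-order dimensions quoted further down.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Fortran LAPACK band-LU routines (link with -llapack, or Cray libsci
 * on ARCHER). */
extern void dgbtrf_(int *m, int *n, int *kl, int *ku,
                    double *ab, int *ldab, int *ipiv, int *info);
extern void dgbtrs_(char *trans, int *n, int *kl, int *ku, int *nrhs,
                    double *ab, int *ldab, int *ipiv,
                    double *b, int *ldb, int *info);

int main(void) {
  int ncols = 81920;            /* vertical columns, lowest-order case */
  int n = 64, kl = 1, ku = 1, nrhs = 1, info;
  int ldab = 2 * kl + ku + 1;   /* dgbtrf needs kl extra rows for fill-in */
  char trans = 'N';
  size_t absz = (size_t)ldab * n * sizeof(double);

  double *ab0 = calloc((size_t)ldab * n, sizeof(double));
  double *ab  = malloc(absz);
  double *b   = malloc((size_t)n * sizeof(double));
  int *ipiv   = malloc((size_t)n * sizeof(int));

  /* Band storage: AB(kl+ku+1+i-j, j) = A(i,j) in 1-based indexing; with
   * kl = ku = 1, the 0-based rows 1/2/3 of column j hold the super-,
   * main and subdiagonal.  Off-diagonals in [0,1) and a diagonal of 3
   * make the matrix diagonally dominant, so pivoting stays cheap. */
  for (int j = 0; j < n; ++j) {
    if (j > 0)     ab0[(size_t)j * ldab + 1] = (double)rand() / RAND_MAX;
    if (j < n - 1) ab0[(size_t)j * ldab + 3] = (double)rand() / RAND_MAX;
    ab0[(size_t)j * ldab + 2] = 3.0;
  }

  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (int c = 0; c < ncols; ++c) {
    memcpy(ab, ab0, absz);                   /* fresh matrix per column */
    for (int j = 0; j < n; ++j) b[j] = 1.0;  /* fresh right-hand side */
    dgbtrf_(&n, &n, &kl, &ku, ab, &ldab, ipiv, &info);
    dgbtrs_(&trans, &n, &kl, &ku, &nrhs, ab, &ldab, ipiv, b, &n, &info);
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);
  double dt = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
  printf("%d column solves in %.3f s (b[0] = %g)\n", ncols, dt, b[0]);
  free(ab0); free(ab); free(b); free(ipiv);
  return 0;
}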

Interestingly, I can reproduce the problem in the pure C code, so there must either be an issue with the LAPACK/BLAS on ARCHER, or the matrices are simply too small to get good performance (for example because the indirect addressing in the horizontal direction prevents efficient reuse of the higher-level caches). More specifically, I get the following bandwidths, calculated by working out the data volume that is streamed through memory for every LU solve and dividing it by the measured runtime (see the sketch below the numbers for the arithmetic):

*** lowest order ***
(81920 vertical columns, matrix size = 64x64, matrix bandwidth=3, 24 cores on ARCHER)
Measured memory bandwidth = 0.530GB/s (per core), 12.722GB/s (per node)

*** higher order ***
(5120 vertical columns, matrix size = 384x384, matrix bandwidth=55, 24 cores on ARCHER)
Measured memory bandwidth = 3.601GB/s (per core), 86.434GB/s (per node)

so the higher-order case is running at the memory bandwidth peak, but the lowest-order case is far away from it.
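
For reference, turning the measured runtime into these figures is just data volume divided by time. Here is a sketch; the assumed per-solve volume (band factor, pivot indices and right-hand side each streamed once) is my guess at the accounting, not necessarily what the test code in the repository does:

#include <stdio.h>

/* Hypothetical bandwidth model: band factor, pivot indices and RHS each
 * streamed through memory once per column solve.  For per-core figures,
 * pass the number of columns handled by a single core. */
static double bandwidth_gbs(long ncols, int n, int ldab, double seconds) {
  double bytes = (double)ldab * n * sizeof(double)
               + (double)n * (sizeof(int) + sizeof(double));
  return (double)ncols * bytes / seconds / 1.0e9;
}

int main(void) {
  /* 0.5 s is a made-up runtime; feed in the time printed by the test
   * program above instead. */
  printf("%.3f GB/s\n", bandwidth_gbs(81920, 64, 4, 0.5));
  return 0;
}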

That implies that the problem lies with the BLAS/LAPACK implementation, not with hidden firedrake overheads.

The way forward is probably to follow this up with ARCHER support; I can now give them a well-defined test case which reproduces the problem.

Any other ideas?

Thanks,

Eike
_______________________________________________
firedrake mailing list
firedrake at imperial.ac.uk
https://mailman.ic.ac.uk/mailman/listinfo/firedrake
