[firedrake] cached kernels

Eike Mueller E.Mueller at bath.ac.uk
Thu Nov 12 14:45:46 GMT 2015


Hi Lawrence,

OK, problem solved. If I use an in-place Thomas algorithm for the lowest-order tridiagonal system instead of LAPACK's LU solver routines, I get excellent memory throughput (3.4GB/s per core on average, so about peak for the full node).
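
For reference, the replacement is essentially the textbook in-place Thomas algorithm; the sketch below is schematic (array names and layout are illustrative, not the code firedrake actually generates). The RHS d is overwritten with the solution, and the diagonal b is destroyed in the process:

    /* In-place Thomas solve for one vertical column of length n.
     * a = sub-diagonal (a[0] unused), b = diagonal, c = super-diagonal,
     * d = right-hand side, overwritten with the solution. */
    void thomas_solve(int n, const double *a, double *b, const double *c,
                      double *d)
    {
        /* forward elimination */
        for (int i = 1; i < n; ++i) {
            double m = a[i] / b[i-1];
            b[i] -= m * c[i-1];
            d[i] -= m * d[i-1];
        }
        /* backward substitution */
        d[n-1] /= b[n-1];
        for (int i = n - 2; i >= 0; --i)
            d[i] = (d[i] - c[i] * d[i+1]) / b[i];
    }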

The time per iteration drops significantly, from 0.44s to 0.24s (compared to 0.35s for the PETSc solver with a hypre preconditioner), so this was really a change worth implementing!

Thanks,

Eike

> On 12 Nov 2015, at 11:51, Lawrence Mitchell <lawrence.mitchell at imperial.ac.uk> wrote:
> 
> On 12/11/15 11:39, Eike Mueller wrote:
>> Hi Lawrence,
>> 
>>>> I have effectively no idea what's going on.  Does the LU solve
>>>> take this long on this much data if you just call it from C?
>> 
>> I just carried out exactly that experiment (see the C code in
>> firedrake-multigridpaper/code/test_lusolve). In each vertical column
>> I do an LU solve with exactly the same matrix size and bandwidth as
>> in the firedrake code. I use the same horizontal grid size as for
>> the firedrake code, and also the compiler/linker flags which PETSc
>> prints at the end of the run, so I'm confident that the C code does
>> exactly the same as the code autogenerated by firedrake (the only
>> difference is that I initialise the matrix with random values, but
>> make it diagonally dominant to avoid excessive pivoting).
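>> 
>> Per column, the benchmark boils down to something like the sketch
>> below. This is a schematic stand-in, not the actual test code: it
>> uses LAPACK's banded driver dgbsv rather than the exact routines the
>> generated code calls, with sizes matching the lowest-order case
>> (n = 64, bandwidth 3, i.e. kl = ku = 1):
>> 
>>     #include <stdlib.h>
>> 
>>     /* LAPACK's banded LU solver (Fortran calling convention) */
>>     extern void dgbsv_(const int *n, const int *kl, const int *ku,
>>                        const int *nrhs, double *ab, const int *ldab,
>>                        int *ipiv, double *b, const int *ldb, int *info);
>> 
>>     void solve_one_column(int n)
>>     {
>>         int kl = 1, ku = 1, nrhs = 1, ldab = 2*kl + ku + 1, info;
>>         double *ab  = calloc((size_t)ldab * n, sizeof(double));
>>         double *rhs = malloc(n * sizeof(double));
>>         int *ipiv   = malloc(n * sizeof(int));
>>         /* random but diagonally dominant tridiagonal matrix in LAPACK
>>          * banded storage: A(i,j) lives at ab[kl+ku+i-j + j*ldab] */
>>         for (int j = 0; j < n; ++j) {
>>             if (j > 0)     ab[kl+ku-1 + j*ldab] = -drand48(); /* super */
>>             ab[kl+ku + j*ldab] = 4.0;                         /* diag  */
>>             if (j < n - 1) ab[kl+ku+1 + j*ldab] = -drand48(); /* sub   */
>>             rhs[j] = drand48();
>>         }
>>         dgbsv_(&n, &kl, &ku, &nrhs, ab, &ldab, ipiv, rhs, &n, &info);
>>         free(ab); free(rhs); free(ipiv);
>>     }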
>> 
>> Interestingly, I can reproduce the problem in the pure C code, so
>> either there is an issue with the LAPACK/BLAS on ARCHER or the
>> matrices are simply too small to get good performance (for example
>> because the indirect addressing in the horizontal prevents efficient
>> reuse of higher-level caches). More specifically, I get the
>> following bandwidths (calculated by working out the data volume
>> that is streamed through for every LU solve and dividing this by
>> the measured runtime):
>> 
>> *** lowest order *** (81920 vertical columns, matrix size = 64x64,
>> matrix bandwidth = 3, 24 cores on ARCHER)
>> Measured memory bandwidth = 0.530GB/s (per core), 12.722GB/s (per node)
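>> 
>> (For scale, one plausible accounting, assuming 8-byte doubles and
>> counting the three diagonals plus the RHS and solution vectors: each
>> lowest-order solve streams roughly 5 x 64 x 8 = 2560 bytes, so the
>> per-core figure above corresponds to 2560 bytes times the number of
>> column solves per second on each core.)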
> 
> OK, the matrix here is small, so plausibly there's no room for LAPACK
> to do anything clever, and the overhead of calling out to a library
> rather than inlining a simple algorithm is hurting.
> 
>> *** higher order *** (5120 vertical columns, matrix size = 384x384,
>> matrix bandwidth = 55, 24 cores on ARCHER)
>> Measured memory bandwidth = 3.601GB/s (per core), 86.434GB/s (per node)
> 
> Good, the matrix here is pretty big, so LAPACK does a good job.
> 
>> so the higher-order case is running at bandwidth peak, but the
>> lowest-order case is far below it.
>> 
>> That suggests that the problem lies in the BLAS/LAPACK calls
>> themselves, not in hidden firedrake overheads.
>> 
>> Probably the way forward is to follow this up with ARCHER support;
>> I can now give them a well-defined test case which reproduces the
>> problem.
> 
> Hopefully they will be able to suggest something.
> 
>> Any other ideas?
> 
> At lowest order the system really is just tridiagonal, right?  Can one
> just drop in an "inlined" tridiagonal solve for this case?
> 
> Cheers,
> 
> Lawrence
> 


