[firedrake] cached kernels

Lawrence Mitchell lawrence.mitchell at imperial.ac.uk
Thu Nov 12 11:51:44 GMT 2015


On 12/11/15 11:39, Eike Mueller wrote:
> Hi Lawrence,
> 
>>> I have effectively no idea what's going on.  Does the LU solve
>>> take this long on this much data if you just call it from C?
> 
> I just carried out exactly that experiment (see the C-code in
> firedrake-multigridpaper/code/test_lusolve). Basically in each
> vertical column I do an LU solve with exactly the same matrix size
> and bandwidth as in the firedrake code. I use the same horizontal
> grid size as for the firedrake code and also use the
> compiler/linker flags which are printed out at the end of the run
> by PETSc, so I’m confident that the C-code does exactly the same as
> the code autogenerated by firedrake (the only difference is that I
> initialise the matrix with random values, but make it diagonally
> dominant to avoid excessive pivoting).
> 
> Interestingly I can reproduce the problem in the pure C-code, so
> there must be an issue with the LAPACK/BLAS on ARCHER or the
> matrices are simply too small to get good performance (for example
> because the indirect addressing in the horizontal prevents efficient
> reuse of higher-level caches). More specifically I get the
> following bandwidths (calculated by working out the data volume
> that is streamed through for every LU solve and dividing this by
> the measured runtime):
> 
> *** lowest order *** (81920 vertical columns, matrix size = 64x64,
> matrix bandwidth = 3, 24 cores on ARCHER)
> Measured memory bandwidth = 0.530 GB/s (per core), 12.722 GB/s (per node)

OK, the matrix here is small, so plausibly there's no room for LAPACK
to do anything clever, and the overhead of calling out to a library,
rather than inlining a simple algorithm, is hurting.

> *** higher order *** (5120 vertical columns, matrix size = 384x384,
> matrix bandwidth = 55, 24 cores on ARCHER)
> Measured memory bandwidth = 3.601 GB/s (per core), 86.434 GB/s (per node)

Good, the matrix here is pretty big, so LAPACK does a good job.

> so the higher order case is running at bandwidth peak, but the
> lowest order case is far away from it.
> 
> That implies that the problem lies with the BLAS/LAPACK
> implementation, not hidden firedrake overheads.
> 
> Probably the way forward is to follow this up with ARCHER support,
> I can now give them a well-defined test case which reproduces the
> problem.

Hopefully they will be able to suggest something.

> Any other ideas?

At lowest order the system really is just tridiagonal, right?  Can one
just drop in an "inlined" tridiagonal solve for this case?

Cheers,

Lawrence
