Conjugate-Gradient Solver in Symphony
This post is in relation to a research project investigating the applicability of the Symphony paradigm to (scientific and engineering) numerical applications. The project's name is High Performance Numerical Computing on Service-Oriented Architectures
Like many financial applications, some numerical applications can be broken down into a set of independent tasks, whose outputs can be combined to produce the overall result (so-called`embarrassingly-parallel' applications). These can be easily be efficiently parallelized in the current Symphony API.
A conjugate-gradient linear systems solver represents a class of applications which is more challenging in this respect. It, or similar algorithms, are widely used in scientific applications (e.g. structural engineering). It iteratively solves for a vector x the linear system A x = b, which is of order N. See Conjugate gradient method - Wikipedia for more details.
The basic algorithm is to repeatedly:
compute z = A*w (matrix-vector multiply)
update x, w, r, etc (vector-vector operations)
until the norm of the residual r ~= 0 (r effectively stores the value of Ax - b). The number of repetitions required depends on the structure of A, but is normally a slowly growing function of N, R(N).
The most straightforward parallelization involves parallelizing the O(N*N) matrix-vector multiply using p tasks, and perform the rest of the calculations (O(N)) via the client. Each task gets the intermediate vector w at each step and computes roughly N/P elements of z using the corresponding rows of A. Computationally, that should be efficient for large enough N.
While not `embarrassingly-parallel', the CG solver does have some properties which make it relatively easy to parallelize:- the data that changes (x, w, r, etc) is relatively small
- communication only occurs in distinct phases. In fact, these phases could be described as `scatter-gather' operations.
With that in mind, we could implement this in the Symphony paradigm as follows. Noting that A is constant throughout the computation, it can be made available to all tasks via through the Common Data mechanism. For each iteration, the client creates p tasks, sending each task the vector w and a specification of which rows to process, and then waits for the tasks to complete. At which time, it assembles the output to form z.
In terms of efficiency, the client forms a bottleneck on the `scatter-gather', but as this is O(N), this is not serious. We do create p*R(N) service tasks in total, but if Symphony spawns p' SI processes (where p is a multiple of p') and keeps them all going throughout the computation, the overhead of starting tasks (calls to onInvoke())) should be small.
Each SI is sent, and stores, the whole of A, rather than just the rows that its tasks will later need. This represents a scalability problem (communication and memory), but this is second-order. In any case, the same problem exists on the client which must store and send the whole matrix.
This could be overcome with A being put on a parallel file system (DSF - see Add the MapReduce API for SOAM). However, we would not want the tasks to be downloading the relevant rows of A from the DFS each time; we would want the SI to cache them.
It might be also nice to have an effective broadcast mechanism for w to speed up the setting up of input by the client at each stage: something like a secondary `common data' area which the current tasks can all use, and which can be reset by the client as desired.
In summary, the common data mechanism and the architecture of the SI in Symphony appears to enable a reasonably efficient CG solver. Jaison (mulerik) who works on the project with me will soon followup with some implementation details and performance results. Some extensions to Symphony might be useful; does anyone thing these (or similar) would also be useful in other contexts?
Last edited by Blackbird; September 19th, 2008 at 09:16 AM.
-- Cheers, Peter
(Peter Strazdins, Australian National University)