[pooma-dev] Sparse Engine
John Hall
jhh at zianet.com
Fri Sep 26 09:33:01 UTC 2003
Richard:
The idea is to get rid of the loop over materials found in previous
iterations of our code. We simply compute cell-average quantities for
the mixed cells and then perform a single data-parallel computation
over the one mixed-material field, which does the work for all
materials at once. So we do a little unstructured work to compute the
cell-average quantities, followed by a data-parallel computation. The
unstructured approach has some other advantages as well, such as
letting us store information that would normally be too expensive to
store, but we won't go into that here.
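To make the control-flow change concrete, here is a minimal sketch in
plain C++ (not the POOMA API; the names are made up). The old pattern
sweeps the whole mesh once per material, while the new pattern makes a
single sweep over one cell-average field:

    #include <vector>
    #include <cstddef>

    // Old pattern: the mesh is swept once per material.
    void perMaterialUpdate(std::vector<std::vector<double> >& fieldByMat)
    {
        for (std::size_t m = 0; m < fieldByMat.size(); ++m)   // loop over materials
            for (std::size_t c = 0; c < fieldByMat[m].size(); ++c)
                fieldByMat[m][c] *= 2.0;                      // stand-in for the physics
    }

    // New pattern: one data-parallel sweep over the single
    // mixed-material (cell-average) field does the work for all
    // materials at once.
    void singleSweepUpdate(std::vector<double>& cellAvg)
    {
        for (std::size_t c = 0; c < cellAvg.size(); ++c)
            cellAvg[c] *= 2.0;
    }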
The complication is that we want to (in the grand tradition of POOMA)
hide the underlying complexity of our storage scheme and make things
appear beautiful and logically simple. A good analogy is a storage
scheme for a symmetric matrix that stores only the upper triangle but
allows you to access any index into the array, internally mapping the
indices onto the correct storage location.
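As a sketch of that analogy (ordinary C++, nothing POOMA-specific), a
packed upper-triangular store for an n-by-n symmetric matrix keeps
only n*(n+1)/2 values yet accepts any (i, j):

    #include <vector>
    #include <cstddef>
    #include <algorithm>   // std::swap

    class SymmetricMatrix {
    public:
        explicit SymmetricMatrix(std::size_t n)
            : n_(n), data_(n * (n + 1) / 2, 0.0) { }

        // Any (i, j) is accepted; the indices are folded into the
        // stored upper triangle, so a(i, j) and a(j, i) alias the
        // same storage location.
        double& operator()(std::size_t i, std::size_t j)
        {
            if (i > j) std::swap(i, j);
            return data_[i * n_ - i * (i - 1) / 2 + (j - i)];
        }

    private:
        std::size_t n_;
        std::vector<double> data_;
    };

The caller sees a full matrix; the index mapping is the storage
scheme's private business, which is exactly the effect we want for the
mixed-cell storage.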
In our example, the index array entry is positive for a pure cell and
is simply the material ID of the material contained in that cell. If
the entry is negative, it has traditionally been an index into an
unstructured linked list of the mixed-cell data. We can then access
this data and compute a cell-average value, which we store in that
cell of the multi-material field, and then we perform our
data-parallel operations on that multi-material field.
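In sketch form (plain C++ again; the exact negative encoding,
-entry - 1 here, is an assumption on my part, as is the flat MixedCell
record standing in for the traditional linked list):

    #include <vector>
    #include <cstddef>

    struct MixedCell {
        std::vector<int>    mat;       // material IDs present in the cell
        std::vector<double> volFrac;   // volume fraction of each material
        std::vector<double> value;     // per-material value (density, etc.)
    };

    // Overwrite each mixed cell of the multi-material field with its
    // volume-weighted cell average; pure cells already hold their value.
    void fillCellAverages(std::vector<double>& field,
                          const std::vector<int>& index,
                          const std::vector<MixedCell>& mixed)
    {
        for (std::size_t c = 0; c < index.size(); ++c) {
            if (index[c] > 0)
                continue;                               // pure: entry is the material ID
            const MixedCell& mc = mixed[-index[c] - 1]; // assumed encoding
            double avg = 0.0;
            for (std::size_t k = 0; k < mc.value.size(); ++k)
                avg += mc.volFrac[k] * mc.value[k];
            field[c] = avg;
        }
    }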
We occasionally need to gather all of the pure and mixed values of a
single material so that we can do a single-material calculation like
an EOS evaluation; that is why we want the work array (which we
compress/deallocate when we are not using it). So the various views of
the data that we would take are the multi-material cell-average view,
the gathered single-material view, and the overall
complicated-storage-scheme view. To get the kind of performance the
old code has, we will also need to introduce windowing and activity
flags. Basically, we are attempting to throw away any unnecessary
computations and minimize the data we are pushing through cache.
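The gather might look something like this (a sketch reusing the
MixedCell record and the assumed index encoding from the previous
sketch; gatherMaterial is a made-up name):

    // Collect every value of material m, from pure and mixed cells
    // alike, into the 1-D work array. A single-material calculation
    // such as an EOS evaluation then runs over the work array alone,
    // and the results are scattered back by the same walk.
    std::vector<double> gatherMaterial(int m,
                                       const std::vector<int>& index,
                                       const std::vector<double>& field,
                                       const std::vector<MixedCell>& mixed)
    {
        std::vector<double> work;
        for (std::size_t c = 0; c < index.size(); ++c) {
            if (index[c] == m) {
                work.push_back(field[c]);              // pure cell of material m
            } else if (index[c] < 0) {
                const MixedCell& mc = mixed[-index[c] - 1];
                for (std::size_t k = 0; k < mc.mat.size(); ++k)
                    if (mc.mat[k] == m)
                        work.push_back(mc.value[k]);   // material m's share
            }
        }
        return work;
    }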
The sparse tile layout doesn't have the concept of indirect addressing
using an index field; it is simply intended for block-AMR-type meshes.
If we do AMR, it would probably be a completely unstructured problem
in which any cell can be refined, rather than a block-structured one.
Unfortunately, this again introduces the possibility of all-to-all
communications (slow) to find your neighbors, etc.
We have also been dealing with the issue of how best to do masking. I
am beginning to think that we need another sparse storage idea, so
that we end up with something equivalent to a block in which the data
is collected into lists using a test and the computation is done only
over that collection, which gets progressively smaller as the number
of tests increases. Currently, when using a mask, you end up
traversing all of the data, maybe even doing the computation
everywhere and then simply throwing away the result where the mask is
not set (either by a conditional or by multiplying by 0.0). Building
the list for extremely sparse data can be a huge win. Like I said, the
old version of this algorithm is running 20 times faster than the
data-parallel version. That is only possible by simply doing less work.
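The contrast, in sketch form (made-up names; the point is that the
test is paid once when the list is built, after which every pass
touches only the survivors):

    #include <vector>
    #include <cstddef>

    // Masked sweep: the full field is traversed (and possibly computed)
    // no matter how few cells actually pass the test.
    void maskedSweep(std::vector<double>& f, const std::vector<char>& mask)
    {
        for (std::size_t i = 0; i < f.size(); ++i)
            if (mask[i]) f[i] *= 2.0;       // stand-in for the physics
    }

    // Build the list once from the test...
    std::vector<std::size_t> buildList(const std::vector<char>& mask)
    {
        std::vector<std::size_t> list;
        for (std::size_t i = 0; i < mask.size(); ++i)
            if (mask[i]) list.push_back(i);
        return list;
    }

    // ...then each later sweep does work proportional to the list,
    // not to the whole mesh, which is the whole win for sparse data.
    void listSweep(std::vector<double>& f,
                   const std::vector<std::size_t>& list)
    {
        for (std::size_t k = 0; k < list.size(); ++k)
            f[list[k]] *= 2.0;
    }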
We would also like to have exterior guards on a box containing many
small, logically distinct shared-memory patches, with no guard cells
within the box. Then we could maybe achieve some reasonable
compression, and our computations should approach AMR storage schemes
without the undesired Gibbs phenomenon that AMR has, due to poor
impedance matching across T-joints.
I should note that we are aware we must avoid certain types of
dynamically allocated data structures, because the guard-cell copy
scheme might move only the pointer to the data and not the actual
data. We are taking this into account.
Hope this helps,
John Hall
On Friday, September 26, 2003, at 02:07 AM, Richard Guenther wrote:
> Ok, still trying to understand - this is something like (statically)
> specifying which cells participate in computation? Like you would
> have a usual brick engine in conjunction with a bitfield specifying
> a mask and using this in the evaluator loop (of course this would be
> less memory efficient)? So this would be a cheap way to do this
> compared to using the sparse tile layout?
>
> Thanks,
>
> Richard.
>
> On Fri, 26 Sep 2003, John H.Hall wrote:
>
>> Richard:
>> OK. Here goes. The basic idea is that we have a hierarchical field
>> structure (built using hierarchical engines similar to the current
>> multi-material field abstraction) which has a collection of 1-D
>> dynamicFields (for the sparse unstructured storage), a shared Index
>> (n-D) integer Array (or Field), and a single (n-D) scalar, vector or
>> tensor field which has either the data for a pure cell or a
>> cell-average value for a mixed-material cell's data. As the problem evolves
>> the material interfaces migrate and so the actual position of the
>> unstructured cells changes. However, all the indirect indexing is
>> still local to the processor (except for normal guard cell
>> communications).
>> So this is much simpler than a real unstructured problem with
>> all-to-all communications. In the general case, the sparse dynamic
>> fields are only used to compute the cell-average quantities before a
>> data-parallel computation across the single multi-material or
>> cell-average field is performed. We would also like to take some views
>> of the field in which all of the data for a particular material is
>> gathered/scattered to/from a single sparse dynamic work Array that is
>> shared in this hierarchical structure.
>>
>> Field<Mesh_t, Real, MixedCellEngine> would look like this:
>>
>>  ____________________
>> |____________________| Single-material Gather/Scatter 1-D Dynamic
>>                        Work Array (both mixed and pure cells)
>>  ______
>> |______| mat A (1-D Dynamic Array/Field)
>>  ______
>> |______| mat B (1-D Dynamic Array/Field)
>>  ______
>> |______| mat C (1-D Dynamic Array/Field)
>>  ____________________________________________________
>> |____________________________________________________|
>> Cell Average Quantities (n-D)
>>  ____________________________________________________
>> |____________________________________________________|
>> Integer Index Array (n-D)
>> A single Index Array is shared by all Sparse Fields (e.g. Density,
>> Pressure, etc.). It does double duty, providing the material index
>> for a pure cell and an offset into a collection tracking the
>> unstructured mixed-cell data for a mixed cell.
>>
>> Multi-Patch should still work although the guard cell communications
>> might be slightly more complicated.
>>
>> The number of cells which are indirectly addressed is very small
>> (< 5% of the total), so even using compressible brick we are wasting
>> a lot of memory bandwidth and performing numerous extraneous
>> computations. A
>> comparison code using this structure is running 20 times faster than
>> the equivalent data parallel POOMA R1 computation for the single
>> processor serial case. We believe we can match that performance by
>> building an engine that encapsulates the sparse nature reflected in
>> the problem and by making more use of the new engines POOMA R2
>> provides (stencil, etc.).
>>
>> Again, most of the computations are performed on the Cell-Average
>> Quantities, so we just take a view, operator[]?, that returns that
>> single field.
>> John and Jean
>
> --
> Richard Guenther <richard dot guenther at uni-tuebingen dot de>
> WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/