[pooma-dev] Sparse Engine
John Hall
jhh at zianet.com
Fri Sep 26 09:33:01 UTC 2003
Richard:
The idea is to get rid of the loop over materials found in previous
iterations of our code. We simply compute cell-average quantities for
the mixed cells and then perform a single data-parallel computation
over the one mixed-material field, which does the work for all
materials at once. So we do a little unstructured work to compute the
cell-average quantities, followed by a data-parallel computation. The
unstructured approach has some other advantages as well, such as
letting us store information that would normally be too expensive to
store, but we won't go into that here.
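To make the control-flow change concrete, here is a minimal sketch in
plain C++ (not the POOMA API; the names are made up). The old pattern
sweeps the whole mesh once per material, while the new pattern makes a
single sweep over one cell-average field:

    #include <vector>
    #include <cstddef>

    // Old pattern: the mesh is swept once per material.
    void perMaterialUpdate(std::vector<std::vector<double> >& fieldByMat)
    {
        for (std::size_t m = 0; m < fieldByMat.size(); ++m)   // loop over materials
            for (std::size_t c = 0; c < fieldByMat[m].size(); ++c)
                fieldByMat[m][c] *= 2.0;                      // stand-in for the physics
    }

    // New pattern: one data-parallel sweep over the single
    // mixed-material (cell-average) field does the work for all
    // materials at once.
    void singleSweepUpdate(std::vector<double>& cellAvg)
    {
        for (std::size_t c = 0; c < cellAvg.size(); ++c)
            cellAvg[c] *= 2.0;
    }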
The complication is that we want to (in the grand tradition of POOMA)
hide the underlying complexity of our storage scheme and make things
appear beautiful and logically simple. A good analogy is a storage
scheme for a symmetric matrix that stores only the upper triangle but
allows you to access any index into the array, internally mapping the
indices onto the correct storage location.
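As a sketch of that analogy (ordinary C++, nothing POOMA-specific), a
packed upper-triangular store for an n-by-n symmetric matrix keeps
only n*(n+1)/2 values yet accepts any (i, j):

    #include <vector>
    #include <cstddef>
    #include <algorithm>   // std::swap

    class SymmetricMatrix {
    public:
        explicit SymmetricMatrix(std::size_t n)
            : n_(n), data_(n * (n + 1) / 2, 0.0) { }

        // Any (i, j) is accepted; the indices are folded into the
        // stored upper triangle, so a(i, j) and a(j, i) alias the
        // same storage location.
        double& operator()(std::size_t i, std::size_t j)
        {
            if (i > j) std::swap(i, j);
            return data_[i * n_ - i * (i - 1) / 2 + (j - i)];
        }

    private:
        std::size_t n_;
        std::vector<double> data_;
    };

The caller sees a full matrix; the index mapping is the storage
scheme's private business, which is exactly the effect we want for the
mixed-cell storage.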
In our example, the index array entry is positive for a pure cell and
is simply the material ID of the material contained in that cell. If
the entry is negative, it has traditionally been an index into an
unstructured linked list of the mixed-cell data. We can then access
this data and compute a cell-average value, which we store in that
cell of the multi-material field, and then we perform our
data-parallel operations on that multi-material field.
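In sketch form (plain C++ again; the exact negative encoding,
-entry - 1 here, is an assumption on my part, as is the flat MixedCell
record standing in for the traditional linked list):

    #include <vector>
    #include <cstddef>

    struct MixedCell {
        std::vector<int>    mat;       // material IDs present in the cell
        std::vector<double> volFrac;   // volume fraction of each material
        std::vector<double> value;     // per-material value (density, etc.)
    };

    // Overwrite each mixed cell of the multi-material field with its
    // volume-weighted cell average; pure cells already hold their value.
    void fillCellAverages(std::vector<double>& field,
                          const std::vector<int>& index,
                          const std::vector<MixedCell>& mixed)
    {
        for (std::size_t c = 0; c < index.size(); ++c) {
            if (index[c] > 0)
                continue;                               // pure: entry is the material ID
            const MixedCell& mc = mixed[-index[c] - 1]; // assumed encoding
            double avg = 0.0;
            for (std::size_t k = 0; k < mc.value.size(); ++k)
                avg += mc.volFrac[k] * mc.value[k];
            field[c] = avg;
        }
    }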
We occasionally need to gather all of the pure and mixed values of a
single material so that we can do a single-material calculation like
an EOS evaluation; that is why we want the work array (which we
compress/deallocate when we are not using it). So the various views of
the data that we would take are the multi-material cell-average view,
the gathered single-material view, and the overall
complicated-storage-scheme view. To get the kind of performance the
old code has, we will also need to introduce windowing and activity
flags. Basically, we are attempting to throw away any unnecessary
computations and minimize the data we are pushing through cache.
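The gather might look something like this (a sketch reusing the
MixedCell record and the assumed index encoding from the previous
sketch; gatherMaterial is a made-up name):

    // Collect every value of material m, from pure and mixed cells
    // alike, into the 1-D work array. A single-material calculation
    // such as an EOS evaluation then runs over the work array alone,
    // and the results are scattered back by the same walk.
    std::vector<double> gatherMaterial(int m,
                                       const std::vector<int>& index,
                                       const std::vector<double>& field,
                                       const std::vector<MixedCell>& mixed)
    {
        std::vector<double> work;
        for (std::size_t c = 0; c < index.size(); ++c) {
            if (index[c] == m) {
                work.push_back(field[c]);              // pure cell of material m
            } else if (index[c] < 0) {
                const MixedCell& mc = mixed[-index[c] - 1];
                for (std::size_t k = 0; k < mc.mat.size(); ++k)
                    if (mc.mat[k] == m)
                        work.push_back(mc.value[k]);   // material m's share
            }
        }
        return work;
    }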
The sparse tile layout doesn't have the concept of indirect addressing
using an index field; it is simply intended for block-AMR-type meshes.
If we do AMR, it would probably be a completely unstructured problem
in which any cell can be refined, rather than a block-structured one.
Unfortunately, this again introduces the possibility of all-to-all
communications (slow) to find your neighbors, etc.
We have also been dealing with the issue of how best to do masking. I
am beginning to think that we need another sparse storage idea, so
that we end up with something equivalent to a block in which the data
is collected into lists using a test and the computation is done only
over that collection, which gets progressively smaller as the number
of tests increases. Currently, when using a mask, you end up
traversing all of the data, maybe even doing the computation
everywhere and then simply throwing away the result where the mask is
not set (either by a conditional or by multiplying by 0.0). Building
the list for extremely sparse data can be a huge win. Like I said, the
old version of this algorithm is running 20 times faster than the
data-parallel version. That is only possible by simply doing less work.
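The contrast, in sketch form (made-up names; the point is that the
test is paid once when the list is built, after which every pass
touches only the survivors):

    #include <vector>
    #include <cstddef>

    // Masked sweep: the full field is traversed (and possibly computed)
    // no matter how few cells actually pass the test.
    void maskedSweep(std::vector<double>& f, const std::vector<char>& mask)
    {
        for (std::size_t i = 0; i < f.size(); ++i)
            if (mask[i]) f[i] *= 2.0;       // stand-in for the physics
    }

    // Build the list once from the test...
    std::vector<std::size_t> buildList(const std::vector<char>& mask)
    {
        std::vector<std::size_t> list;
        for (std::size_t i = 0; i < mask.size(); ++i)
            if (mask[i]) list.push_back(i);
        return list;
    }

    // ...then each later sweep does work proportional to the list,
    // not to the whole mesh, which is the whole win for sparse data.
    void listSweep(std::vector<double>& f,
                   const std::vector<std::size_t>& list)
    {
        for (std::size_t k = 0; k < list.size(); ++k)
            f[list[k]] *= 2.0;
    }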
We would also like to have exterior guards on a box containing many
small, logically distinct shared-memory patches, with no guard cells
within the box. Then we could maybe achieve some reasonable
compression, and our computations should approach AMR storage schemes
without the undesired Gibbs phenomenon that AMR has, due to poor
impedance matching across T-joints.
I should note that we are aware we must avoid certain types of
dynamically allocated data structures, because the guard-cell copy
scheme might move only the pointer to the data and not the actual
data. We are taking this into account.
Hope this helps,
John Hall
On Friday, September 26, 2003, at 02:07 AM, Richard Guenther wrote:
> Ok, still trying to understand - this is something like (statically)
> specifying which cells participate in computation? Like you would
> have a usual brick engine in conjunction with a bitfield specifying
> a mask and using this in the evaluator loop (of course this would be
> less memory efficient)? So this would be a cheap way to do this
> compared to using the sparse tile layout?
>
> Thanks,
>
> Richard.
>
> On Fri, 26 Sep 2003, John H.Hall wrote:
>
>> Richard:
>> OK. Here goes. The basic idea is that we have a hierarchical field
>> structure (built using hierarchical engines similar to the current
>> multi-material field abstraction) which has a collection of 1-D
>> dynamicFields (for the sparse unstructured storage), a shared Index
>> (n-D) integer Array (or Field), and a single (n-D) scalar, vector or
>> tensor field which has either the data for a pure cell or a
>> cell-average value for a mixed-material cell's data. As the problem evolves
>> the material interfaces migrate and so the actual position of the
>> unstructured cells changes. However, all the indirect indexing is
>> still local to the processor (except for normal guard cell
>> communications).
>> So this is much simpler than a real unstructured problem with
>> all-to-all communications. In the general case, the sparse dynamic
>> fields are only used to compute the cell-average quantities before a
>> data-parallel computation across the single multi-material or
>> cell-average field is performed. We would also like to take some views
>> of the field in which all of the data for a particular material is
>> gathered/scattered to/from a single sparse dynamic work Array that is
>> shared in this hierarchical structure.
>>
>> Field<Mesh_t, Real, MixedCellEngine> would look like this:
>>
>>  ____________________
>> |____________________| Single-material Gather/Scatter 1-D Dynamic
>>                        Work Array (both mixed and pure cells)
>>  ______
>> |______| mat A (1-D Dynamic Array/Field)
>>  ______
>> |______| mat B (1-D Dynamic Array/Field)
>>  ______
>> |______| mat C (1-D Dynamic Array/Field)
>>  ____________________________________________________
>> |____________________________________________________|
>> Cell Average Quantities (n-D)
>>  ____________________________________________________
>> |____________________________________________________|
>> Integer Index Array (n-D)
>> A single Index Array is shared by all Sparse Fields (e.g. Density,
>> Pressure, etc.). It does double duty, providing the material index
>> for a pure cell and an offset into a collection tracking the
>> unstructured mixed-cell data for a mixed cell.
>>
>> Multi-Patch should still work although the guard cell communications
>> might be slightly more complicated.
>>
>> The number of cells which are indirectly addressed is very small
>> (< 5% of the total), so even using compressible brick we are wasting
>> a lot of memory bandwidth and performing numerous extraneous
>> computations. A
>> comparison code using this structure is running 20 times faster than
>> the equivalent data parallel POOMA R1 computation for the single
>> processor serial case. We believe we can match that performance by
>> building an engine that encapsulates the sparse nature reflected in
>> the problem and by making more use of the new engines POOMA R2
>> provides (stencil, etc.).
>>
>> Again, most of the computations are performed on the Cell-Average
>> Quantities, so we just take a view, operator[]?, that returns that
>> single field.
>> John and Jean
>
> --
> Richard Guenther <richard dot guenther at uni-tuebingen dot de>
> WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/