[vsipl++] [patch] Support for Cell FFT's up to 4K points.

Mon Feb 12 23:22:52 UTC 2007

Don McCoy wrote:
 > The attached patch provides support for 1-D complex-complex FFTs up to
 > 4K points in length.  This implementation limits it to 4K to save stack
 > space, even though the underlying SPE library routine (libfft) allows up
 > to 8K points.

Don,

This looks good.  I have a couple of comments below.  Once those are
addressed, please check it in.

			thanks,
			-- Jules

 >
 > Regards,
 >
 > ------------------------------------------------------------------------
 >
 > Index: src/vsip/opt/cbe/ppu/fft.cpp
 > ===================================================================

 > +template <typename T,
 > +          int      E>

[1] It doesn't look like any advantage is being gained by having
E as a template parameter.  It would reduce code-size to pass
the FFT direction as a run-time parameter instead.

Also, I'm guessing that 'E' stands for Exponent.  This would be
good to document, or change to another more verbose name.

(After looking over the rest of the patch, I realize you're being
consistent with the name, which is good!  However, a comment would
still be nice.)

 > +void
 > +fft_8K(std::complex<T>* out, std::complex<T> const* in,
 > +  std::complex<T> const* W, length_type length, T scale)

[2] 'in' and 'out' are good names.  'W' should probably be
'twiddle_factors' or something more descriptive.

Also, it would be good to document the expected sizes, since 'in'
and 'out' are of size 'length', while 'twiddle_factors' is of
size 'length/4'.

 > +{
 > +  Fft_params fftp;
 > +  fftp.direction = (E == -1 ? fwd_fft : inv_fft);
 > +  fftp.elements = length;

 > +  fftp.scale = (E == -1 ? 1.f : scale);

[3] From VSIPL++ API, it is permissible to scale on both forward and
inverse FFTs. I suspect this should just pass 'scale' through.

 > +  Task_manager *mgr = Task_manager::instance();
 > +  Task task = mgr->reserve<Fft_tag, void(complex<T>,complex<T>)>(
 > +    sizeof(Fft_params), sizeof(complex<T>)*(length*5/4),
 > +    sizeof(complex<T>)*length);
 > +  Workblock block = task.create_block();
 > +  block.set_parameters(fftp);
 > +  block.add_input(in, length);
 > +  block.add_input(W, length/4);
 > +  block.add_output(out, length);
 > +  task.enqueue(block);
 > +  task.wait();
 > +}
 > +
 > +template<typename T>
 > +void
 > +compute_twiddle_factors(std::complex<T>* W, length_type length)
 > +{
 > +  unsigned int i = 0;
 > +  unsigned int n = length;
 > +  T* pW = reinterpret_cast<T*>(W);
 > +  pW[0] = 1.0f;
 > +  pW[1] = 0.0f;
 > +  for (i = 1; i < n / 4; ++i)
 > +  {
 > +    pW[2*i] = cos(i * 2*M_PI / n);
 > +    pW[2*(n/4 - i)+1] = -pW[2*i];
 > +  }
 > +}

[4] There are other ways to compute twiddle factors iteratively that
may avoid the 'cos()' call, and hence may be more efficient.

This is definitely a comment for later: functionality is more
important now, and in terms of the priority of optimizations, twiddle
factor computation is low on the list, since it is done at object
creation outside of the compute loop.

 > +
 > +
 >  template <dimension_type D, typename I, typename O, int A, int E> 
class Fft_impl;
 >
 >  // 1D complex -> complex FFT
 > @@ -46,14 +88,24 @@
 >    typedef std::pair<rtype*, rtype*> ztype;
 >
 >  public:
 > -  Fft_impl(Domain<1> const &dom)
 > +  Fft_impl(Domain<1> const &dom, rtype scale) 
VSIP_THROW((std::bad_alloc))
 > +      : scale_(scale),
 > +        W_(alloc_align<ctype>(128, dom.size()/4))

[5] 128 is probably a good alignment.  However, it is kind of a magic
number that should be a macro (VSIP_IMPL_CELL_DMAALIGNMENT, pick a
good name) to call it out.

Was this the alignment problem that was causing the bad twiddle factors?

 >    {
 > -    // TBD

[6] Stefan is right on.  This would be a good time to compute the twiddle
factors.  Early binding!

 >    }

 > +
 > +private:
 > +  rtype scale_;
 > +  ctype* W_;

[7] Q for Stefan: how do we handle FFT assignment?  Is there a problem with
W_ being freed multiple times?

 >  };

 >  VSIPL_IMPL_PROVIDE(1, float, std::complex<float>, 0, -1)

[8] Do we really have a real->complex FFT on the SPE?

 > Index: src/vsip/opt/cbe/ppu/fft.hpp
 > ===================================================================
 > --- src/vsip/opt/cbe/ppu/fft.hpp	(revision 163034)
 > +++ src/vsip/opt/cbe/ppu/fft.hpp	(working copy)
 > @@ -25,6 +25,7 @@
 >  #include <vsip/domain.hpp>
 >  #include <vsip/core/fft/factory.hpp>
 >  #include <vsip/core/fft/util.hpp>
 > +#include <vsip/opt/cbe/common.h>
 >
 >  /***********************************************************************
 >    Declarations
 > @@ -37,14 +38,14 @@
 >  namespace cbe
 >  {
 >
 > -template <typename I, dimension_type D>
 > +template <typename I, dimension_type D, typename S>
 >  std::auto_ptr<I>
 > -create(Domain<D> const &dom);
 > +create(Domain<D> const &dom, S scale);
 >
 >  #define VSIP_IMPL_FFT_DECL(D,I,O,A,E)                          \
 >  template <>                                                    \
 >  std::auto_ptr<fft::backend<D,I,O,A,E> >                        \
 > -create(Domain<D> const &);
 > +create(Domain<D> const &, fft::backend<D, I, O, A, E>::scalar_type);
 >
 >  #define VSIP_IMPL_FFT_DECL_T(T)				       \
 >  VSIP_IMPL_FFT_DECL(1, T, std::complex<T>, 0, -1)               \
 > @@ -60,7 +61,7 @@
 >  #define VSIP_IMPL_FFT_DECL(I,O,A,E)                            \
 >  template <>                                                    \
 >  std::auto_ptr<fft::fftm<I,O,A,E> >                             \
 > -create(Domain<2> const &);
 > +create(Domain<2> const &, fft::backend<2, I, O, A, E>::scalar_type);
 >
 >  #define VSIP_IMPL_FFT_DECL_T(T)				       \
 >  VSIP_IMPL_FFT_DECL(T, std::complex<T>, 0, -1)                  \
 > @@ -92,16 +93,23 @@
 >  struct evaluator<D, I, O, S, R, N, Cbe_sdk_tag>
 >  {
 >    static bool const ct_valid = true;
 > -  static bool rt_valid(Domain<D> const &) { return true;}
 > +  static bool rt_valid(Domain<D> const &dom)
 > +  {
 > +    if (dom.size() < MIN_FFT_1D_SIZE)
 > +      return false;
 > +    if (dom.size() > MAX_FFT_1D_SIZE)
 > +      return false;
 > +    return true;

[9] Should rt_valid also check that the size is a power of 2 (or does
libfft handle non-power-of-two sizes?)

Also, I don't see how you could check for unit-stride here since there
is no layout info.  I need to refresh my memory on rt_valid for FFT
dispatch a bit.

 > +  }

 > Index: src/vsip/opt/cbe/ppu/alf.hpp
 > ===================================================================
 > --- src/vsip/opt/cbe/ppu/alf.hpp	(revision 163034)
 > +++ src/vsip/opt/cbe/ppu/alf.hpp	(working copy)
 > @@ -64,25 +64,28 @@
 >    template <typename P>
 >    void set_parameters(P const &p)
 >    {
 > -    alf_wb_add_param(impl_, const_cast<P *>(&p), sizeof(p), 
ALF_DATA_BYTE, 0);
 > +    assert( alf_wb_add_param(impl_, const_cast<P *>(&p),
 > +        sizeof(p), ALF_DATA_BYTE, 0) >= 0 );

Error checking good! :)

 > Index: src/vsip/opt/cbe/spu/vmul.cpp
 > ===================================================================
 > --- src/vsip/opt/cbe/spu/vmul.cpp	(revision 163034)
 > +++ src/vsip/opt/cbe/spu/vmul.cpp	(working copy)
 > @@ -1,155 +0,0 @@
 > -/* Copyright (c) 2006, 2007 by CodeSourcery.  All rights reserved.

[10] Do we have an ALF implementation of vmul, or was this our only impl?

 > Index: src/vsip/opt/cbe/spu/alf_fft_c.c
 > ===================================================================
 > --- src/vsip/opt/cbe/spu/alf_fft_c.c	(revision 0)
 > +++ src/vsip/opt/cbe/spu/alf_fft_c.c	(revision 0)
 > @@ -0,0 +1,84 @@
 > +/* Copyright (c) 2007 by CodeSourcery.  All rights reserved.
 > +
 > +   This file is available for license from CodeSourcery, Inc. under 
the terms
 > +   of a commercial license and under the GPL.  It is not part of the 
VSIPL++
 > +   reference implementation and is not available under the BSD license.
 > +*/
 > +/** @file    vsip/opt/cbe/spu/alf_fft_c.c
 > +    @author  Don McCoy
 > +    @date    2007-02-03
 > +    @brief   VSIPL++ Library: Kernel to compute complex float FFT's.
 > +*/
 > +
 > +#include <stdio.h>
 > +#include <alf_accel.h>
 > +#include <assert.h>
 > +#include <libfft.h>
 > +#include "../common.h"
 > +
 > +unsigned int log2i(unsigned int size)
 > +{
 > +  unsigned int log2_size = 0;
 > +  while (!(size & 1))
 > +  {
 > +    size >>= 1;
 > +    log2_size++;
 > +  }
 > +  return log2_size;
 > +}
 > +
 > +void fft_1d_r2_inv(vector float* out, vector float* in, vector 
float* W,
 > +		   unsigned int log2_size, float scale)
 > +{
 > +  vector unsigned int mask = (vector unsigned int){-1, -1, 0, 0};
 > +  vector float *start, *end, s0, s1, e0, e1;
 > +  unsigned int i;
 > +  unsigned int n = 1 << log2_size;
 > +
 > +  fft_1d_r2(out, in, W, log2_size);
 > +
 > +  vector float vscale = spu_splats(scale);
 > +  vector float s, e;
 > +  start = out;
 > +  end = start + 2 * n / 4;  // two complex values for each n, four 
per vector

[11] While it's not strictly necessary (because this is C, not C++; and
because the SPEs don't go fast for double), I would put the magic value
'4' into a const variable (local to the function):

	int const vec_size = 4;

That way, if this code (or a cut-and-paste copy) ever makes the jump
to C++ and is generalized to work for both float and double, there is only
one magic value to change.

 > +  s0 = e1 = *start;
 > +  for (i = 0; i < n / 4; ++i) {

It would also make the code more self-documenting

I.e.

	for (i = 0; i < n / vec_size; ++i) {

 > +    s1 = *(start + 1);
 > +    e0 = *(--end);
 > +
 > +    s = spu_sel(s0, s1, mask);
 > +    e = spu_sel(e0, e1, mask);

[12] Are 's' and 'e' being used?

 > +    *start++ = spu_mul(spu_sel(e0, e1, mask), vscale);
 > +    *end = spu_mul(spu_sel(s0, s1, mask), vscale);

[13] Can you describe what this loop is doing?  It looks like it is (a)
scaling by vmul and (b) reversing the vector (which includes using
spu_sel to swap the two complex values in a SIMD registers).  Looks
good though!

 > +    s0 = s1;
 > +    e1 = e0;
 > +  }
 > +}
 > +
 > +
 > +int alf_comp_kernel(void volatile *params,
 > +                    void volatile *input,
 > +                    void volatile *output,
 > +                    unsigned int iter,
 > +                    unsigned int n)
 > +{
 > +  int i;
 > +  Fft_params* fftp = (Fft_params *)params;
 > +  unsigned int length = fftp->elements;
 > +
 > +  vector float* in = (vector float *)input;
 > +  vector float* W = (vector float *)((float *)in + length * 2);
 > +  vector float* out = (vector float*)output;
 > +
 > +  assert(length <= MAX_FFT_1D_SIZE);
 > +  unsigned int log2_size = log2i(length);
 > +
 > +  // Perform the FFT,
 > +  //   -- 'in' may be the same as 'out'
 > +  if (fftp->direction == fwd_fft)
 > +    fft_1d_r2(out, in, W, log2_size);
 > +  else
 > +    fft_1d_r2_inv(out, in,  W, log2_size, fftp->scale);

[14] we need to allow scaling for forward FFTs.

 > +
 > +  return 0;
 > +}
 > Index: src/vsip/opt/cbe/common.h
 > ===================================================================

 > +typedef struct
 > +{
 > +  fft_dir_type direction;
 > +  unsigned int elements;
 > +  double scale;

[15] why is scale a double?

'float' should be enough for single-precision FFTs.

Also, while it seems reasonable to use a 'double' to scale
double-precision FFTs (and we may want to actually do it that
way when implementing), the VSIPL++ spec defines scale to be
a 'float' regardless of the FFT precision.  I need to check how
the C-VSIPL spec defines that because IIRC there was some confusion
between the two specs here.

 > +} Fft_params;
 > +
 > +#endif // VSIP_OPT_CBE_COMMON_H

-- 
Jules Bergmann
CodeSourcery
jules at codesourcery.com
(650) 331-3385 x705