
Some Implementation Details

The processor I'm using is a TMS320C6711 DSP. It runs at 150 MHz and has a 4k L1 cache for code and a 4k L1 cache for data. The L2 cache is 64k, and all other memory is external to the chip. The echo canceller uses a filter of length 1024 (about 1/8 second at an 8 kHz sample rate) and a projection order of 16.

To take advantage of the blocking algorithm, matrix multiplication is implemented using Fast Fourier Transforms (FFTs). The optimized FFT routine executes a length-512 FFT in approximately 7200 cycles. I am using a property of the FFT to compute two real-valued FFTs with a single complex FFT (at the cost of some unpacking overhead). The number of FFTs per block is $1+6.5M$, where $M$ is the number of microphones, and the block size is 256. So for a sample rate of 8 kHz, the FFT code alone costs approximately $8000/256 \times (1+6.5M) \times 7200$ cycles, or about 12 million cycles for 8 microphones. At first this seems pretty good. But once the matrix-inverse update code, the FFT overhead, and the correction of the filter output are included, the total turns out to be 5 or 6 times as much computation. And this doesn't even include the beamforming.
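
To illustrate the two-real-FFTs-in-one trick, here is a minimal C sketch. The routine name cfft and its in-place (re, im, N) interface are assumptions made for the sketch; the actual optimized FFT routine on the C6711 has its own interface, which is not shown here.

/* Sketch: compute the FFTs of two real signals x and y with one complex FFT.
 * cfft() is a placeholder for an in-place complex FFT routine (hypothetical name).
 * Output arrays must not alias the scratch buffers zre/zim.
 */
extern void cfft(float *re, float *im, int N);   /* assumed in-place complex FFT */

void two_real_ffts(const float *x, const float *y, int N,
                   float *Xre, float *Xim,       /* FFT of x, length N */
                   float *Yre, float *Yim,       /* FFT of y, length N */
                   float *zre, float *zim)       /* scratch, length N  */
{
    int k;

    /* Pack the two real signals into one complex signal z = x + j*y. */
    for (k = 0; k < N; k++) {
        zre[k] = x[k];
        zim[k] = y[k];
    }

    cfft(zre, zim, N);   /* one complex FFT instead of two real-valued FFTs */

    /* Unpack using conjugate symmetry:
     *   X[k] = (Z[k] + conj(Z[N-k])) / 2
     *   Y[k] = (Z[k] - conj(Z[N-k])) / (2j)
     */
    for (k = 0; k < N; k++) {
        int m = (N - k) % N;
        Xre[k] = 0.5f * (zre[k] + zre[m]);
        Xim[k] = 0.5f * (zim[k] - zim[m]);
        Yre[k] = 0.5f * (zim[k] + zim[m]);
        Yim[k] = 0.5f * (zre[m] - zre[k]);
    }
}

The unpacking loop is the "overhead" mentioned above: one extra pass over the data per pair of real FFTs.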

Ok, so it still runs in under 150 million cycles, right? Not quite. The memory requirements exceed the 64k of on-chip memory, and even the L2 cache runs considerably slower than the L1 caches. At a minimum, the data needs to stay in L2 as much as possible. Unfortunately, the algorithm runs through more than 64k of memory 8000 times a second, so relying on the cache alone may not be very efficient. Another solution is to use memory overlays, which give the programmer control over where the data resides at all times. And this is what I'm currently working on.
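
One possible shape for such an overlay scheme is ping-pong buffering: while one chunk of filter state is processed out of internal RAM, the next chunk is copied in from external memory. The sketch below assumes hypothetical dma_start, dma_wait, and filter_chunk routines; on the C6711 the transfers would be done by the DMA hardware, whose actual API is not shown here.

/* Sketch: ping-pong overlay of filter coefficients that do not fit on chip.
 * dma_start/dma_wait/filter_chunk are placeholder names (hypothetical), not
 * the actual C6711 library calls.
 */
#define CHUNK 2048                      /* coefficients staged on chip at a time */

extern float ext_coeffs[];              /* full filter state in external memory  */
extern void  dma_start(float *dst, const float *src, int n);   /* hypothetical  */
extern void  dma_wait(void);                                    /* hypothetical  */
extern void  filter_chunk(const float *in, float *out,
                          const float *coeffs, int n);          /* hypothetical  */

static float onchip[2][CHUNK];          /* ping-pong buffers in internal RAM     */

void process_block(const float *in, float *out, int n_chunks)
{
    int c, cur = 0;

    /* Stage the first chunk of coefficients on chip. */
    dma_start(onchip[cur], &ext_coeffs[0], CHUNK);
    dma_wait();

    for (c = 0; c < n_chunks; c++) {
        int nxt = cur ^ 1;

        /* Start fetching the next chunk while the current one is processed. */
        if (c + 1 < n_chunks)
            dma_start(onchip[nxt], &ext_coeffs[(c + 1) * CHUNK], CHUNK);

        filter_chunk(in, out, onchip[cur], CHUNK);

        dma_wait();                     /* transfer must finish before swapping  */
        cur = nxt;
    }
}

The point of the double buffer is that the external-memory traffic overlaps the computation, so the slow external accesses largely disappear from the cycle count instead of stalling the filter.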

