UW-Madison • Department of Electrical and Computer Engineering
Physical Computation Laboratory
Fast cryogenic FFT for sensors
George Tzimpragos
April 24, 2025
Fully Parallel FFT
- We assume the FFT is applied to
a stream of samples from a sensor that
operates at 50 GHz
- In the fully parallel implementation,
butterfly units form a network that
connects inputs a fixed
distance apart at each stage
- A fully parallel implementation is not
suitable for the cryogenic
environment’s area and power budget
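To make the fixed-distance wiring concrete, here is a minimal software sketch of a radix-2 decimation-in-frequency butterfly network. The radix and the decimation-in-frequency ordering are assumptions for illustration (the slides do not specify them), and the names `fft_fully_parallel` and `butterfly_count` are hypothetical. In hardware every butterfly of a stage would be a separate unit; the loops below only enumerate them.

```python
import cmath


def fft_fully_parallel(x):
    """Radix-2 DIF FFT written stage by stage (assumed variant, for illustration)."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    a = [complex(v) for v in x]
    dist = n // 2  # distance between the two inputs of each butterfly
    while dist >= 1:
        # Every butterfly in this stage connects inputs exactly `dist` apart.
        for start in range(0, n, 2 * dist):
            for j in range(dist):
                w = cmath.exp(-2j * cmath.pi * j / (2 * dist))  # twiddle factor
                top, bot = a[start + j], a[start + j + dist]
                a[start + j] = top + bot
                a[start + j + dist] = (top - bot) * w
        dist //= 2
    # A DIF network leaves the results in bit-reversed order; undo that here.
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        r = int(format(i, "0{}b".format(bits))[::-1], 2) if bits else 0
        out[r] = a[i]
    return out


def butterfly_count(n):
    """Butterfly units in the fully parallel network: (n / 2) * log2(n)."""
    return (n // 2) * (n.bit_length() - 1)
```

The butterfly count grows as (N/2)·log2(N), which is the area and power cost the last bullet refers to.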
Approach 1: Pipelined FFT
- To apply the FFT to a streaming input
sequence, we can use a pipelined
implementation such as single-path delay
feedback (SDF) (Figure)
- At each stage, inputs are buffered for a
delay equal to the distance between
connected samples in the fully parallel
network, so that each pair arrives aligned
at a butterfly unit
- All memory for this implementation is in
the form of delays of constant size,
which can be implemented cheaply with
delay constructs based on Passive
Transmission Lines (PTLs)
- There is no feedback from the outputs
of a stage to its inputs, so the butterfly
units can be further pipelined to
increase throughput
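The buffering described above can be sketched as a sample-by-sample software model. This is a textbook radix-2 SDF under assumed conventions (decimation in frequency, zero-initialized delay lines), not necessarily the exact structure in the figure; `sdf_stage`, `sdf_fft_stream`, and `natural_order` are hypothetical names. Each stage owns one fixed-length delay line and one butterfly, and data only moves forward from stage to stage.

```python
import cmath
from collections import deque


def sdf_stage(stream, delay):
    """One SDF stage: a fixed delay line of length `delay` plus one butterfly."""
    fifo = deque([0j] * delay)  # models the constant-size delay element
    out = []
    for t, x in enumerate(stream):
        head = fifo.popleft()
        if (t // delay) % 2 == 0:
            # First half of a block: buffer the input and emit what the
            # delay line holds (differences computed for the previous block).
            out.append(head)
            fifo.append(x)
        else:
            # Second half: the delayed sample and the current one are now
            # aligned; emit their sum and store the twiddled difference.
            w = cmath.exp(-2j * cmath.pi * (t % delay) / (2 * delay))
            out.append(head + x)
            fifo.append((head - x) * w)
    return out


def sdf_fft_stream(samples, n):
    """Run back-to-back n-point frames through log2(n) stages.

    Frame f of the result holds that frame's FFT in bit-reversed order.
    """
    stream = [complex(v) for v in samples] + [0j] * (n - 1)  # flush the pipeline
    delay = n // 2
    while delay >= 1:
        stream = sdf_stage(stream, delay)
        delay //= 2
    return stream[n - 1:]  # total latency is n/2 + n/4 + ... + 1 = n - 1 samples


def natural_order(frame):
    """Reorder one bit-reversed output frame into natural bin order."""
    bits = len(frame).bit_length() - 1
    out = [0j] * len(frame)
    for i, v in enumerate(frame):
        out[int(format(i, "0{}b".format(bits))[::-1], 2) if bits else 0] = v
    return out
```

In this model the total memory is the sum of the delay lengths, N − 1 samples, and every delay has a fixed size, which is what makes a PTL-based delay a plausible fit.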
Reducing clock speed
- In order to provide a throughput of 50
GSa/s at a lower clock speed for the
butterfly units, we can divide the inputs
round-robin to create slower
sub-sequences, apply a pipelined FFT to
each in parallel, then combine them
(multi-path delay feedback, MDF, architecture)
- While this architecture has only 50% utilization,
similar architectures fix this by shuffling the
buffered data (e.g., multi-path delay commutator, MDC)
- The tradeoff between a lower operating
speed and the extra hardware needed for
more parallelism also bears on the choice
of logic family (e.g., RSFQ is faster,
xSFQ is more area-efficient)
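The split-and-combine arithmetic behind this idea can be sketched as follows. It shows only the math of round-robin splitting (a decimation-in-time decomposition), not the MDF datapath itself; the lane count and the helper names `split_round_robin` and `combine_lanes` are assumptions, and a naive DFT stands in for each lane's pipelined FFT.

```python
import cmath


def dft(x):
    """Naive DFT, standing in here for one lane's pipelined FFT."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]


def split_round_robin(x, lanes):
    """Lane p receives samples p, p + lanes, p + 2*lanes, ... (1/lanes of the rate)."""
    return [x[p::lanes] for p in range(lanes)]


def combine_lanes(lane_spectra, n):
    """X[k] = sum over p of W_n^(p*k) * X_p[k mod (n / lanes)], W_n = exp(-2j*pi/n)."""
    lanes = len(lane_spectra)
    m = n // lanes
    return [sum(cmath.exp(-2j * cmath.pi * p * k / n) * lane_spectra[p][k % m]
                for p in range(lanes))
            for k in range(n)]
```

With 4 lanes, for example, each lane would see 50 / 4 = 12.5 GSa/s, at the cost of four lane FFTs plus the combining stage.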
Approach 2: Sliding window FFT
- In a continuous input setting, rather than calculating a new FFT from
scratch every N samples, the known previous values of the FFT can be
updated for each new sample using a complex multiplication and an
addition for each frequency bin
- Unfortunately, the addition and multiplication sit inside a feedback loop
from one cycle to the next, which prevents pipelining these operations,
so the clock frequency cannot be reduced below 50 GHz
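A minimal sketch of the per-sample update, assuming the standard sliding DFT recurrence X_k(n) = (X_k(n−1) + x(n) − x(n−N)) · e^{j2πk/N} over a window of the last N samples. The class name `SlidingDFT` and its interface are hypothetical.

```python
import cmath
from collections import deque


class SlidingDFT:
    """Tracks selected DFT bins of the most recent n samples (illustrative)."""

    def __init__(self, n, bins):
        self.n = n
        self.bins = list(bins)
        # One constant rotation factor per tracked bin.
        self.rot = {k: cmath.exp(2j * cmath.pi * k / n) for k in self.bins}
        self.window = deque([0j] * n, maxlen=n)  # starts as an all-zero window
        self.spectrum = {k: 0j for k in self.bins}

    def push(self, x):
        oldest = self.window[0]
        self.window.append(x)  # maxlen drops the oldest sample
        delta = x - oldest     # shared by all bins, outside the feedback loop
        for k in self.bins:
            # One addition and one complex multiplication per bin. The new
            # value depends on the previous one, which is the feedback loop.
            self.spectrum[k] = (self.spectrum[k] + delta) * self.rot[k]
        return self.spectrum
```

Because `spectrum[k]` at one step is an input to the next step, the add and multiply must complete within a single sample period.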
Sliding FFT with batch updates
- We can reduce the throughput
requirement by collecting a batch of k
consecutive samples and applying the
update to the value of the FFT for all
of them together, parallelizing the
complex multiplications
- The batch size can be increased until
the update can be performed at a
feasible clock frequency, but more
hardware will be required for the
parallelized calculation
- Sliding FFT is appealing when the
downstream task only uses the results
for a few frequencies, since only those
bins need to be implemented
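One way to write the batch update, derived from the DFT definition under the assumption that the window slides forward by m samples at once: X_k ← e^{j2πmk/N} · (X_k + Σ_{j=0}^{m−1} (x_in[j] − x_out[j]) · e^{−j2πjk/N}). The function names below are hypothetical, and the slides may formulate the update differently.

```python
import cmath


def dft_bin(window, k):
    """Direct evaluation of one DFT bin, used only as a reference."""
    n = len(window)
    return sum(window[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))


def batch_update(value, k, n, leaving, entering):
    """Slide an n-sample window by m = len(entering) samples for bin k.

    `leaving` holds the m oldest samples of the current window and
    `entering` the m new samples, both in time order.
    """
    m = len(entering)
    # These m products do not depend on `value`, so they can be computed
    # in parallel and pipelined outside the feedback loop.
    correction = sum((entering[j] - leaving[j]) * cmath.exp(-2j * cmath.pi * j * k / n)
                     for j in range(m))
    # Only this final add and rotation sit in the loop, once per m samples.
    return (value + correction) * cmath.exp(2j * cmath.pi * m * k / n)
```

In this form the feedback path runs once per batch, so its clock can be roughly m times slower than the sample rate, while the feed-forward correction needs m multipliers per tracked bin.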