Final Project Report MRI Reconstruction
INTRODUCTION
Modern GPUs are increasingly attractive platforms for accelerating applications ranging
from problems as large as black hole simulations to DNA sequencing, and the list of
applications accelerated with CUDA grows every day. To learn and experience the
effectiveness of CUDA first-hand, we took up the challenge of speeding up the
computations involved in advanced Magnetic Resonance Imaging (MRI) reconstruction.
In this paper, we present an approach to accelerating an MRI image reconstruction
algorithm using CUDA C on a GTX-480 Graphics Processing Unit (GPU). The raw k-space
data captured by the scanner coils must be processed before a doctor can interpret the
results of an MRI scan, and this processing takes a few hours for the best algorithm
implemented on a CPU. We observed that the algorithm contains portions that exhibit high
data parallelism, and these computations can run at a much faster rate on the GPU. We
identified the bottlenecks of the MRI image reconstruction algorithm and implemented
those functions in CUDA C. This paper focuses on the approach and the techniques used to
convert the MATLAB version of the algorithm into a highly parallel CUDA C version.
MOTIVATION
The motivation for this project stems from the fact that an MRI scan can be a long and
uncomfortable experience for patients, requiring them to lie motionless in the machine for
close to an hour. This project, carried out under the guidance of Prof. John Sartori and
Prof. Mehmet Akcakaya, aims at reducing the time required for the scan and the associated
computations by utilizing the GPU. Dr. Mehmet Akcakaya is currently exploring an
algorithm called LOST (LOw dimensional-structure Self-learning and Thresholding), which
adaptively finds a sparse representation of a given image using the features of the image
itself rather than a pre-determined fixed transform domain.
The algorithm consists of two phases: one phase de-noises the image for accuracy, and the
other performs Fourier transforms for reconstruction.
We proposed that there is large scope for parallelism in the FFT and IFFT computations,
which operate on 4D input data as large as 256x232x88x32. We planned to use cuFFT, the
built-in CUDA library specialised for FFTs, to optimise the execution of these transforms
in parallel.
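As a sketch of that plan, the snippet below creates a single 3D cuFFT plan for one
256x232x88 volume and applies it coil by coil. The function name and the in-place
execution are illustrative assumptions, not the final implementation.

#include <cufft.h>
#include <cuda_runtime.h>

// Sketch: one 3D complex-to-complex plan over a 256x232x88 volume,
// executed once per coil on data already resident in device memory.
void ifft_all_coils(cufftComplex *d_data /* 256*232*88*32 elements */)
{
    const int Nx = 256, Ny = 232, Nz = 88, nCoils = 32;
    const size_t volElems = (size_t)Nx * Ny * Nz;

    cufftHandle plan;
    // cuFFT takes the dimensions from slowest- to fastest-varying.
    cufftPlan3d(&plan, Nz, Ny, Nx, CUFFT_C2C);

    for (int c = 0; c < nCoils; ++c)
        cufftExecC2C(plan, d_data + c * volElems,
                     d_data + c * volElems, CUFFT_INVERSE);

    cudaDeviceSynchronize();   // note: cuFFT's inverse is unnormalized
    cufftDestroy(plan);
}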
DESIGN OVERVIEW:
The Discrete Fourier Transform maps a complex-valued vector $x_0, \dots, x_{N-1}$ into its
frequency-domain representation, given by

$X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, \dots, N-1.$
cuFFT vs FFTW: A study from the University of Waterloo [6] found that cuFFT is well
suited to larger FFTs: it starts to outperform FFTW at data sizes of around 8192 elements.
It is therefore safe to assume that cuFFT works better than FFTW for input sizes greater
than roughly 10,000 elements.
PROFILE SUMMARY of MATLAB Code:
Profiling showed that, of all the functions in the MATLAB code provided to us, just three
(recons_cs, data_consistency_3d, and RecPF_denoise) consumed 72% (548 s of 752 s) of the
running time. It is therefore sensible to convert only those functions into CUDA C.
FLOW CHART:
Step 1: Input the image file (256x232x88x32) into the GPU's global memory.
Step 2: Find the centre dimensions to determine the critical image area and filter the
image with a Tukey window filter.
Step 3: Perform the inverse fast Fourier transform for all coils and sum them down to a
3D volume.
Step 4: De-noise the combined image with RecPF_Denoise.
Step 5: Enforce data consistency (dataconsistency3d) and repeat from Step 4 for 25
iterations.
IMPLEMENTATION: RecPF_Denoise
The RecPF_Denoise function performs the de-noising manipulation of the input data. The
input data is the single image produced after each cycle of the main reconstruction
algorithm (mainly the 3D data consistency part).
RecPF_Denoise first initializes the constants for its finite difference method in constant
memory and compiles them to the device. It also places its optimization parameters in
constant memory. Values in constant memory are retained across kernel calls.
RecPF_Denoise then calculates the initial parameters for the main calculation loop, a
numerator and a denominator, both of which require an FFT.
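A minimal sketch of this constant-memory pattern is shown below; the array names and
values are placeholders, not the actual RecPF parameters.

#include <cuda_runtime.h>

// Constant memory persists across kernel launches, so these only need to
// be written once before the main loop starts.
__constant__ float c_opt_params[3];   // placeholder optimization parameters
__constant__ float c_fd_coeffs[4];    // placeholder finite-difference coefficients

void init_device_constants(void)
{
    const float opt_params[3] = { 1e-4f, 0.0f, 10.0f };
    const float fd_coeffs[4]  = { -1.0f, 1.0f, -1.0f, 1.0f };

    // Copy host values into the device's constant memory.
    cudaMemcpyToSymbol(c_opt_params, opt_params, sizeof(opt_params));
    cudaMemcpyToSymbol(c_fd_coeffs,  fd_coeffs,  sizeof(fd_coeffs));
}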
Following those initializations, we start our main loop, which is divided into several
parts. We would have preferred a single kernel for the entire processing of the data, but
because we must call cuFFT on the data throughout, we have to split the calculations into
stages and save the current state in global memory so we can resume from the cuFFT
outputs. The loop runs three times and solves the three big parts of the MATLAB code:
the W sub-problem, the U sub-problem, and finally the Bregman update of values for the
next run. We first wrote a brute-force version of the code and then started optimizing,
as sketched below.
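The sketch below shows the resulting loop structure under those constraints. The
sub-problem kernels are stand-ins (the real arithmetic lives in the appendix code), and
the inverse-FFT normalization is folded into the frequency-domain kernel.

#include <cufft.h>
#include <cuComplex.h>

// Stand-in kernels showing only the loop structure, not the real arithmetic.
__global__ void w_subproblem_kernel(const cuComplex *u, cuComplex *w, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) w[i] = u[i];                        // placeholder for shrinkage
}

__global__ void u_subproblem_kernel(cuComplex *f, float invN, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { f[i].x *= invN; f[i].y *= invN; } // placeholder pointwise solve;
                                                   // also supplies cuFFT's missing 1/N
}

void denoise_loop(cufftHandle plan, cuComplex *d_u, cuComplex *d_w,
                  cuComplex *d_freq, int n)
{
    int block = 256, grid = (n + block - 1) / block;
    for (int iter = 0; iter < 3; ++iter) {
        w_subproblem_kernel<<<grid, block>>>(d_u, d_w, n);        // W sub-problem
        cufftExecC2C(plan, (cufftComplex *)d_w, (cufftComplex *)d_freq,
                     CUFFT_FORWARD);
        u_subproblem_kernel<<<grid, block>>>(d_freq, 1.0f / n, n); // U sub-problem
        cufftExecC2C(plan, (cufftComplex *)d_freq, (cufftComplex *)d_u,
                     CUFFT_INVERSE);
        // The Bregman update would follow as a third kernel, writing the
        // state read by the next iteration.
    }
}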
For some of the computations, we reordered operations to minimize the number of variables
used in the brute-force approach (and in the MATLAB code). By removing the extra
variables, we reduced memory consumption and increased performance. Additionally, parts
of the MATLAB code were not very efficient; for example, some data was calculated but
never used. We removed those computations as well.
For the Fourier transform code, the implementation was straightforward. For the finite
difference methods, we used code published on the NVIDIA blog [5] and modified it to run
without a hard-coded input size and with cuComplex data. The code uses shared memory, and
we had to tune the shared memory size because, with our input size, the compiler initially
reported that we were allocating more shared memory than the device could provide.
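The sketch below illustrates the shared-memory pattern on a simple forward difference
along x for cuComplex data; it follows the spirit of the blog kernel [5] rather than
reproducing our exact modified version.

#include <cuComplex.h>

#define TILE 256

// Sketch: forward difference along x using a shared-memory tile with a
// one-element halo. Launch with grid (ceil(nx/TILE), ny) and block (TILE, 1).
__global__ void fd_forward_x(const cuComplex *in, cuComplex *out, int nx)
{
    __shared__ cuComplex tile[TILE + 1];        // +1 for the right halo
    int row = blockIdx.y;                       // one image row per y-block
    int i   = blockIdx.x * TILE + threadIdx.x;  // column index
    const cuComplex *rowIn = in + (size_t)row * nx;

    if (i < nx) tile[threadIdx.x] = rowIn[i];
    if (threadIdx.x == TILE - 1 && i + 1 < nx)
        tile[TILE] = rowIn[i + 1];              // last thread loads the halo
    __syncthreads();

    if (i + 1 < nx)                             // last column is left untouched
        out[(size_t)row * nx + i] =
            cuCsubf(tile[threadIdx.x + 1], tile[threadIdx.x]);
}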
Combining the code proved challenging: we started running into problems with the finite
differences when we tried to put everything into one kernel call, so we abandoned that
idea, especially since the RecPF_Denoise part was already having verification problems.
VERIFICATION: RecPF_Denoise
We compared intermediate outputs against MATLAB to identify which parts were the problem.
Starting from the beginning, we noticed that the results for the numerator and the
denominator do not match those from MATLAB. At the moment, the method produces nan + j*nan
for all of its outputs; for comparison, the first two expected elements are
-2.483373e-10+j-1.389864e-09 and 1.088936e-10+j6.383455e-10. Further work should go into
finding the cause of this problem; we ran out of time to continue triaging it.
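A checking pattern we would apply while triaging the NaN outputs is sketched below: wrap
every API call and kernel launch, since a silently failed launch leaves uninitialized
(often NaN-looking) data in the output buffers. This is a standard idiom, not code from
our current tree.

#include <cstdio>
#include <cuda_runtime.h>

// Report any CUDA error together with the file and line it came from.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess)                                       \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
    } while (0)

// Usage after a kernel launch:
//   my_kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());        // launch-time errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // execution-time errors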
PERFORMANCE
Initially we implemented a CUDA program for the RecPF_Denoise function that took 40
seconds over 10 iterations. We later noticed that our average calculation was incorrect
(we weren't dividing by 10), so the real figure was about 4 seconds per loop iteration.
This is below the 7-second baseline, so we were getting some performance improvement,
though not much yet. After further analysis, we discovered that nvcc was building in
debug mode; after creating a new Makefile and re-running the kernel, we achieved a huge
performance increase, with run times now under half a second. That is about a 14x
improvement (not 100x yet, but substantial).
We tried changing the block size of the finite difference part but saw little change in
performance; we were already using a long stencil in shared memory, which was already
well optimized.
The memory cost of RecPF_Denoise is large: besides the input and output data, we allocate
10 other matrices, for a total of 256 x 232 x 88 elements x 8 bytes (one cuComplex) x
(10+2) buffers = 501,743,616 bytes, or about 0.5 GB. It is difficult to determine whether
we are memory-bound or compute-bound in this case.
IMPLEMENTATION: dataconsistency3d
The source MATLAB code for the dataconsistency3d function is shown below:
sig_check = fftshift(fftn(ifftshift(img, 1)));
sig_check(picks) = sig(picks);
z = fftshift(ifftn(ifftshift(sig_check)), 1);
The main goal of this function is to retain the critical part of the image. During the
RecPF de-noise process, all of the image data is de-noised; picks holds the indices of the
critical parts of the data, and sig is the original 4D data. The function performs an FFT,
replaces the sampled locations with the original values, performs an IFFT, and sends the
result back to the RecPF de-noise process. This cycle repeats 25 times. Since many 3D FFTs
and IFFTs are performed, a CUDA implementation speeds up the process considerably.
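As an illustration, the replacement step sig_check(picks) = sig(picks) maps naturally
onto a one-thread-per-index CUDA kernel. The sketch below assumes picks has already been
converted to 0-based linear indices on the device (MATLAB's indices are 1-based).

#include <cuComplex.h>

// Each thread copies one sampled k-space location from the original data
// into the freshly transformed estimate.
__global__ void data_consistency_kernel(cuComplex *sig_check,
                                        const cuComplex *sig,
                                        const int *picks, int nPicks)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPicks)
        sig_check[picks[i]] = sig[picks[i]];
}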
The first step is to convert MATLAB functions such as fftshift and ifftshift into CUDA
kernels (see the Appendix). The following table compares the time taken by the fftshift
and ifftshift functions on the CPU and on the GPU for a 3D matrix of size 256x232x88.
Table 1. fftshift and ifftshift functions: CPU vs GPU

Function  | Time on CPU | Time on GPU | Performance gain
fftshift  | 12.5 ms     | 0.51 ms     | 24.5x
ifftshift | 12.945 ms   | 0.753 ms    | 17.19x
The FFT and IFFT themselves are performed with cuFFT. The first step is to create a plan
using a cufftPlan function; once created, a plan can be reused for subsequent calls. We
give the plan the transform dimensions and its computation type (e.g., complex-to-complex).
The FFT and IFFT calls are then made by executing the plan with cufftExec. The table below
shows the time taken by the dataconsistency3d function in MATLAB and on the GPU.
cuFFT implementation of the FFT:
cufftHandle plan_fft;
cufftPlan2d(&plan_fft, y, x, CUFFT_C2C);
cufftExecC2C(plan_fft, img_d, out_d, CUFFT_FORWARD);
cudaDeviceSynchronize();

cuFFT implementation of the IFFT:
cufftHandle plan_ifft;
cufftPlan2d(&plan_ifft, y, x, CUFFT_C2C);
cufftExecC2C(plan_ifft, img_d, out_d, CUFFT_INVERSE);
cudaDeviceSynchronize();
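The snippets above create 2D plans; since the MATLAB code uses the 3D fftn and ifftn, a
single 3D plan over the whole volume is a natural alternative. The sketch below is our
illustration of that variant (an assumption on our part, not the shipped code), reusing
the variable names from above.

// Alternative: one 3D plan matching MATLAB's fftn over the whole volume;
// cufftPlan3d takes the dimensions from slowest- to fastest-varying.
cufftHandle plan3d;
cufftPlan3d(&plan3d, z, y, x, CUFFT_C2C);
cufftExecC2C(plan3d, img_d, out_d, CUFFT_FORWARD);
cudaDeviceSynchronize();
cufftDestroy(plan3d);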
Table 2. dataconsistency3d: MATLAB vs GPU execution time

No. of coils | Input size (MB) | Time in MATLAB (t1, ms) | Time on GPU (t2, ms) | Gain (t1/t2)
1            | 39.875          | 975                     | 57                   | 17.11
2            | 79.75           | 1437                    | 103.927              | 13.83
4            | 159.5           | 2083                    | 200                  | 10.42
8            | 319             | 4106                    | 385.53               | 10.65
16           | 638             | 7761                    | 777.987              | 9.98
32           | 1276            | 15356                   | 1455                 | 10.55
Figure 5. Execution time of dataconsistency3d in MATLAB (t1, in ms) for input sizes from
39.875 MB to 1276 MB.
Figure 6. Performance gain (speedup) for input sizes from 39.875 MB to 1276 MB.
In the MATLAB implementation, the time taken to compute the data consistency 3D function
increases drastically with input size, reaching a maximum of about 16 s for all 32 coils,
whereas the CUDA implementation takes at most 1.5 s for all 32 coils.
The FFT kernel is launched once for every coil, so as the number of coils increases, the
overhead of memory transfers between the host and the device also increases. Currently,
only one stream is used for performing the FFT/IFFT; using multiple streams to pipeline
the memcpy and kernel-execution tasks could increase performance further. We experimented
with multiple streams and software pipelining to improve throughput, but that caused
output mismatches, and further research is needed to resolve the issue. The full
implementation is attached in the Appendix for reference.
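The sketch below shows the double-buffered structure we experimented with: two streams
alternate over the coils so that the host-device copies of one coil overlap the FFT of
another. The buffer and plan names are illustrative assumptions, and the host buffers
must be pinned (cudaMallocHost) for the asynchronous copies to actually overlap.

#include <cufft.h>
#include <cuda_runtime.h>

// Two streams alternate over the coils; each plan is bound to its stream so
// copies for one coil can overlap the FFT of the previous one.
void fft_coils_streamed(const cufftComplex *h_in, cufftComplex *h_out,
                        cufftComplex *d_buf[2], cufftHandle plan[2],
                        size_t volElems, int nCoils)
{
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&s[i]);
        cufftSetStream(plan[i], s[i]);
    }
    for (int c = 0; c < nCoils; ++c) {
        int k = c & 1;                          // alternate between the streams
        cudaMemcpyAsync(d_buf[k], h_in + c * volElems,
                        volElems * sizeof(cufftComplex),
                        cudaMemcpyHostToDevice, s[k]);
        cufftExecC2C(plan[k], d_buf[k], d_buf[k], CUFFT_FORWARD);
        cudaMemcpyAsync(h_out + c * volElems, d_buf[k],
                        volElems * sizeof(cufftComplex),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
}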
VERIFICATION: dataconsistency3d
I0 is the known, correct answer, obtained by running the function in MATLAB. For each coil
we compute the mean square error (MSE) between the GPU output and I0, and report the peak
signal-to-noise ratio PSNR = 20 log10(max(I0) / sqrt(MSE)) in decibels.
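A host-side sketch of these metrics, assuming the outputs are held as cuComplex arrays:

#include <math.h>
#include <cuComplex.h>

// MSE against the MATLAB reference and PSNR using the peak magnitude of I0:
// PSNR = 20*log10(max|I0| / sqrt(MSE)).
double psnr_db(const cuComplex *out, const cuComplex *ref, size_t n)
{
    double mse = 0.0, peak = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = cuCabsf(cuCsubf(out[i], ref[i]));   // |out - I0|
        mse += d * d;
        double m = cuCabsf(ref[i]);                    // |I0|
        if (m > peak) peak = m;
    }
    mse /= (double)n;
    return 20.0 * log10(peak / sqrt(mse));
}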
The following table shows MSE and PSNR for different coil inputs.
Table 3. MSE and PSNR for various coils

Coil   | Mean Square Error | Maximum I0 | PSNR (dB)
Coil1  | 2.76485E-05       | 0.8731     | 44.40
Coil2  | 0.000262461       | 0.8297     | 34.19
Coil3  | 2.52628E-05       | 0.8984     | 45.04
Coil4  | 2.45329E-05       | 0.8798     | 44.99
Coil5  | 4.57182E-07       | 0.9082     | 62.56
Coil6  | 6.1467E-06        | 0.9304     | 51.49
Coil7  | 1.42437E-05       | 0.9319     | 47.85
Coil8  | 1.46217E-06       | 0.9719     | 58.10
Coil9  | 9.14289E-05       | 0.97       | 40.12
Coil10 | 0.00010088        | 0.9363     | 39.39
Coil11 | 6.15868E-06       | 0.9399     | 51.57
Coil12 | 0.000287483       | 0.9498     | 34.97
Coil13 | 0.000110432       | 0.978      | 39.38
Coil14 | 4.6884E-05        | 0.9709     | 43.03
Coil15 | 6.61326E-05       | 0.9484     | 41.34
Coil16 | 0.000187387       | 0.8771     | 36.13
Coil17 | 8.53287E-06       | 0.9545     | 50.28
Coil18 | 9.77867E-06       | 0.8778     | 48.97
Coil19 | 1.49731E-06       | 0.9177     | 57.50
Coil20 | 1.22615E-05       | 0.8416     | 47.62
Coil21 | 3.36993E-05       | 0.9124     | 43.93
Coil22 | 2.44395E-05       | 0.8917     | 45.12
Coil23 | 7.40974E-05       | 0.8703     | 40.10
Coil24 | 4.89955E-05       | 0.8472     | 41.66
Coil25 | 0.000186808       | 0.8877     | 36.25
Coil26 | 4.91792E-05       | 0.9376     | 42.52
Coil27 | 1.26168E-05       | 0.9754     | 48.77
Coil28 | 5.07543E-05       | 0.9687     | 42.67
Coil29 | 7.33233E-05       | 0.9274     | 40.69
Coil30 | 2.85137E-06       | 0.9467     | 54.97
Coil31 | 1.08874E-05       | 0.9251     | 48.95
Coil32 | 9.10939E-06       | 0.9596     | 50.05
Figure 7. PSNR (in dB) of the GPU output for coils 1 to 32.
As can be seen from the figure above, an average PSNR of about 45 dB is obtained across
the outputs of all coils.
CONCLUSION
The bottleneck functions in the MATLAB code for MRI reconstruction were identified. Of the
two bottleneck functions, RecPF denoise did not exhibit a usable form of parallelism and
works better when implemented sequentially. The parallel implementation of the other
function, data consistency 3d, showed considerable speedup (10.55x for 32 coils) over its
sequential counterpart. Further improvements can be made by performing the cuFFT FFT/IFFT
of multiple coils simultaneously over parallel streams. The fftshift and ifftshift
functions were implemented both on the CPU and in CUDA on the GPU; the parallel fftshift
is 24x faster and the parallel ifftshift 17x faster. Overall, we experimented with
multiple ways to improve the existing MATLAB code and successfully accelerated the
computation.
REFERENCES
1. NVIDIA cuFFT library. https://developer.nvidia.com/cufft
2. Wen-mei W. Hwu (ed.), GPU Computing Gems, Emerald Edition. ISBN 978-0-12-384988-5.
3. http://developer.nvidia.com/object/matlab_cuda.html
4. http://mri-q.com/index.html
5. Finite Difference Methods in CUDA C/C++, Part 1 (NVIDIA Parallel Forall blog).
http://devblogs.nvidia.com/parallelforall/finite-difference-methods-cuda-cc-part-1/
6. University of Waterloo (2007). http://www.science.uwaterloo.ca/hmerz/CUDA_benchFFT/
APPENDIX: CUDA kernel code

// 2D fftshift kernel: swaps the four quadrants of an Nx x Ny complex image.
// The signature is reconstructed from the launch calls further below.
__global__ void cufftShift_2D_kernel(cuComplex *output, const cuComplex *input,
                                     int Nx, int Ny)
{
    // 2D slice and 1D line sizes
    int sLine = Ny;
    int sSlice = Nx * Ny;

    // Transformation equations for the quadrant swaps
    int sEq1 = (sSlice + sLine) / 2;
    int sEq2 = (sSlice - sLine) / 2;

    // 2D thread index
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

    // Converted into a 1D index
    int index = (yIndex * Nx) + xIndex;

    // No shared memory is used, so no __syncthreads() is needed; the original
    // barriers inside the divergent branches were unnecessary and unsafe.
    if (xIndex < Nx / 2) {
        if (yIndex < Ny / 2)
            output[index] = input[index + sEq1];   // first quadrant
        else
            output[index] = input[index - sEq2];   // third quadrant
    } else {
        if (yIndex < Ny / 2)
            output[index] = input[index + sEq2];   // second quadrant
        else
            output[index] = input[index - sEq1];   // fourth quadrant
    }
}
// 1D shift along x (fragment from the corresponding ifftshift kernel):
// each element is exchanged with the one half a line away.
    if (xIndex < Nx / 2) {
        output[index] = input[index + Nx / 2];     // first half
    } else {
        output[index] = input[index - Nx / 2];     // second half
    }
}
// Forward FFTs for four coils at a time:
cufftExecC2C(plan_fft, in_d_0, out_d_0, CUFFT_FORWARD);
cufftExecC2C(plan_fft, in_d_1, out_d_1, CUFFT_FORWARD);
cufftExecC2C(plan_fft, in_d_2, out_d_2, CUFFT_FORWARD);
cufftExecC2C(plan_fft, in_d_3, out_d_3, CUFFT_FORWARD);
// Shift and copy back the four transformed coils; each kernel output is
// staged in out_d and copied to the host at its coil offset.
cuifftShift_2D_kernel<<<dimgrid, dimblock>>>(out_d, out_d_0, x, y);
cudaDeviceSynchronize();
cudaMemcpy(out_h + i*inputSize, out_d, inputSize*sizeof(cuComplex),
           cudaMemcpyDeviceToHost);

cuifftShift_2D_kernel<<<dimgrid, dimblock>>>(out_d, out_d_1, x, y);
cudaDeviceSynchronize();
cudaMemcpy(out_h + (i+1)*inputSize, out_d, inputSize*sizeof(cuComplex),
           cudaMemcpyDeviceToHost);

cuifftShift_2D_kernel<<<dimgrid, dimblock>>>(out_d, out_d_2, x, y);
cudaDeviceSynchronize();
cudaMemcpy(out_h + (i+2)*inputSize, out_d, inputSize*sizeof(cuComplex),
           cudaMemcpyDeviceToHost);

cuifftShift_2D_kernel<<<dimgrid, dimblock>>>(out_d, out_d_3, x, y);
cudaDeviceSynchronize();
cudaMemcpy(out_h + (i+3)*inputSize, out_d, inputSize*sizeof(cuComplex),
           cudaMemcpyDeviceToHost);