MP Report v2
Multiprocessor Programming (521288S)
Kaushik Sundarajayaraman Venkat
# Clone Repository
$ git clone https://bitbucket.org/kaushiksv2/zncc.git
$ cd zncc
The application writes the final output image to outputs/depthmap.png, overwriting any existing file. The
`outputs` directory must exist; it is included in the repository. The output to stdout contains the following information:
Timestamp
Maximum disparity: maxdisp
Window size: winsize
Threshold for Cross-Checking: thresh
Neighborhood size for Occlusion Fill: nhood
Number of threads (CPU mode only): nthreads
Time taken to shrink and greyscale the original images: t_sg
Time taken to calculate the preliminary disparity maps: t_d0 and t_d1
Time taken for cross check: t_cc
Time taken for occlusion filling: t_of
The same information is also appended to performance_log.txt for record-keeping. Note that the nthreads value is
included in the output only when running in CPU mode.
In GPU mode, all time values displayed are kernel execution times, usually in multiples of 1000 ns, and
hence 0.000 ms is seen often (a low-level resolution limitation). In GPU mode, the time measured by gettimeofday happened to
be around 7.85 ms for the shrink_and_grey kernel and less than 5 ms for the cross-checking and occlusion-filling kernels,
including the calls to clEnqueueNDRangeKernel and clWaitForEvents.
Computes a ZNCC-based depthmap. Looks for im0.png and im1.png in the working
directory. Outputs to outputs/depthmap.png relative to the working directory. See
zncc.cpp and zncc_gpu.cpp for details.
In CPU mode, set shell variable INTIMG=1 in order to output intermediary images
to 'outputs/' directory.
The CPU version of the depth-mapping is robust, running the job under a wide range of parameters. The parameters are
configurable via command-line arguments, as shown on the previous page.
URL: http://www.ee.oulu.fi/~ksundara/mpo/depthmap.hd.png
Application Development:
Initial Stage Problem:
The programming exercise began on the 1st of February with Microsoft Visual C++, as a dialog-based
MFC application. After experimenting with the LodePNG functions, quite some time was spent
deciding the best way to perform the looping. The initial code involved plenty of boundary checking,
shrinking the window suitably at the boundaries so as to fit within the image. These complications were later
removed as they were deemed unnecessary.
The incorrect preliminary disparity map at that beginning stage looked something like this:
zncc = 0; // <-- This was inside the `for` loop below and went unnoticed for a long time.
The few non-white grey values seen in the pictures were the result of negative zncc values.
Without this having been fixed, the multithreading part was implemented on Windows in the hope that changing
parameters would yield an image like the one shown in the course instructions document. Later, on Linux,
after confirming that the problem was in the algorithm itself rather than an incorrect window_size, a
binge-debugging session finally helped spot the problem.
The correct output for window_size=9 is shown below:
(Left: moving the window in im1.png, keeping that of im0.png fixed. Right: moving the window in im0.png, keeping that of im1.png fixed.)
main.cpp parses the command-line arguments and, depending on the mode of operation (default CPU; use the -g
or --use-gpu flag for OpenCL/GPU), invokes the exec_project_cpu or exec_project_gpu function, defined in
zncc.cpp and zncc_gpu.cpp respectively.
cmdline.c is used for command line parsing, and is auto-generated from getopts.ggo by GNU gengetopt 2.22.6.
.h files: includes.h, zncc.h, util.h and cmdline.h
The other headers declare functions, a few static helper functions, macros, etc.
zncc.cl: all __kernel functions in one .cl file, for the sake of simplicity.
In calculating the preliminary depth maps using the ZNCC method, the mean and standard deviation of the window
on the left image are computed just once and reused for calculating ZNCC values across the different ‘d’ (disparity) values.
This saves many computations. To my surprise, many implementations on the internet are oblivious to
this basic and easy optimization opportunity. For a 735x504 image with maxdisp=64, this saves up to
23,708,160 calculations if the optimizer fails to hoist the loop-invariant computations, which is
especially likely when the buffer is not marked const, or in some other edge cases.
6. Vectorization:
The Shrink/Grey kernel uses vectorized notation for demonstration purposes.
7. Memory Coalescing:
Given that the barriers used in the ‘compute_disparity’ kernel have no conditional branching between the two calls
to barrier(), memory accesses during the ZNCC calculation can be expected to be coalesced.
Also, certain parts of occlusion_fill used column-major access for debugging purposes, and they were promptly
changed to row-major access.
(This commit: https://bitbucket.org/kaushiksv2/zncc/diff/zncc.cl?diff2=3b73400a0ab1&at=master)
CPU-side parallelism is applied only during the preliminary depth-mapping, and not in the other steps, because the
overhead of creating threads would outweigh the benefits of parallelism for small tasks like shrinking,
greyscaling, cross-checking, and occlusion filling. This might not hold if a thread pool were ready to execute jobs, but that is
outside the scope here.
Performance Gain:
Timings for computing one preliminary disparity map from the two small images:
It is interesting to note that the home PC is faster and also shows a better gain than the server. This can be explained by
CPU/GPU clock frequency, and possibly by many other factors, such as the number of users accessing the server
at a given time.
Device Specification Comparison:
kaushik@kaushik-ubuntu:~/mp/zncc$ ./zncc -q
CL_DEVICE_LOCAL_MEM_TYPE ............... : CL_LOCAL
CL_DEVICE_LOCAL_MEM_SIZE ............... : 49152 Bytes
CL_DEVICE_MAX_COMPUTE_UNITS ............ : 9
CL_DEVICE_MAX_CLOCK_FREQUENCY .......... : 1784 MHz
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE ..... : 65536 Bytes
CL_DEVICE_MAX_WORK_GROUP_SIZE .......... : 1024 work-items
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS ..... : 3
CL_DEVICE_MAX_WORK_ITEM_SIZES .......... : 1024x1024x64
It can be seen that the maximum number of compute units and the clock frequency are
higher on the Ubuntu PC, and it is therefore faster than the Tesla K80, at least for
this specific problem.
Time analysis on GPU:
The following analysis of execution time vs. window size was done on the Ubuntu PC and later transferred to
cse-cn0011.oulu.fi for evaluation.
[Chart: time (ms) for kernel execution vs. window size (side of square), series ‘w vs t_d0’ and ‘w vs t_d1’, window sizes 0–40.]
[Chart: ‘w vs log10(t_d0)’, i.e. log(time_taken) vs. window size (side of square), window sizes 0–40.]
It can be seen that the time taken for complete execution of the ‘compute_disparity’ kernel increases in a fairly predictable
fashion with the window size in use. Further investigation would likely shed light on the relationship
between the variables, and help fine-tune real-world applications to strike a balance between execution
time and any other desired variable(s).
The output depth maps for a wide range of parameters can be found at:
http://www.ee.oulu.fi/~ksundara/mpo/analysis/
Acknowledgements:
Thanks to the University for providing me this opportunity to learn more about multiprocessor
programming.
Special thanks to Clifford Wolf (http://clifford.at) for having “unlicensed” his example program from
GNU GPL v2 to the public domain. The CL_CHECK and CL_CHECK_ERR macros are used from his example
program available at http://svn.clifford.at/tools/trunk/examples/cldemo.c *
* - The links to cldemo.c were erroneously pointing to different version, and have been rectified in this document.