Technische Universität München
Department of Informatics
Jan Botsch
Ich versichere, dass ich diese Bachelorarbeit selbstständig verfasst und nur die angegebenen Quellen
und Hilfsmittel verwendet habe.
I hereby declare that the thesis submitted is my own unaided work. All direct or indirect sources
used are acknowledged as references.
Munich, 20.03.2015
Jan Botsch
Abstract
Road lane detection and tracking methods are state of the art in present driver assistance systems.
However, lane detection methods that exploit the parallel processing capabilities of heterogeneous high
performance computing devices such as FPGAs (or GPUs), a technology that may replace
ECUs in a coming generation of cars, have rarely been investigated. In this thesis a road lane detection
and tracking algorithm is developed and implemented that is specifically designed to incorporate one or many,
and even heterogeneous, hardware accelerators. Road lane markings are detected and tracked with a
Sequential Monte Carlo (SMC) method. Lane detection is done by populating a pre-processed gradient
image with randomly sampled, straight lines. Each line is assigned a weight according to its position
and the best positioned lines are used to represent the lane markings. Subsequently, lane tracking is
performed with the help of a particle filter. The code was tested on three devices: one GPU, the
NVIDIA GeForce GTX 660 Ti, and two FPGAs, the Altera Stratix V and the Altera Cyclone
V SoC. The tests revealed a processing frame rate of up to 627 Hz on the GPU, 478 Hz on the Stratix
V FPGA and 38 Hz on the Cyclone V SoC. They also showed a significant improvement in accuracy
and robustness, a 2.4-4.6 times faster execution on the GPU, an 8.4-29.7 times faster execution on the
Stratix V and a reduction of memory consumption by 71.94 % compared to a similar lane detection
method. The algorithm was tested on different recorded videos, on independent benchmark datasets
and in multiple test drives, confronting it with a wide range of scenarios, such as varying lighting
conditions, the presence of disturbing shadows or light beams, and varying traffic densities. In all these
scenarios the algorithm proved very robust in detecting and tracking one or multiple lane markings.
Contents

1 Introduction
  1.1 Motivation
  1.2 Proposed Method
    1.2.1 Structure of the method
    1.2.2 Verification of the performance
  1.3 Structure of this paper
2 Background
  2.1 Particle Filter
  2.2 High performance computing with hardware accelerators
    2.2.1 Graphical Computing Units (GPUs)
    2.2.2 Field-Programmable Gate Arrays (FPGAs)
    2.2.3 OpenCL
    2.2.4 Setup for testing
3 Method
  3.1 Pre-Processing
    3.1.1 Region of Interest
    3.1.2 Grayscaling
    3.1.3 Edge detection
    3.1.4 Thresholding
    3.1.5 Parallelization of the pre-processing
    3.1.6 Summary of the Pre-processing
  3.2 Description of lane markings
    3.2.1 Straight-Line-Assumption
    3.2.2 Position description
    3.2.3 Comparison of Methods
  3.3 Lane Detection
    3.3.1 Characteristics of lane markings
    3.3.2 Lane Detection Method
    3.3.3 Sampling of a Line
    3.3.4 Parallelization of the Lane Detection
  3.4 Lane Tracking
    3.4.1 Setup of the particle filter
    3.4.2 The lane tracking algorithm
    3.4.3 Parallelization of the lane tracking
  3.5 Redetection Criteria
4 Details on the Implementation
  4.2 Lane detection on multiple hardware accelerators
    4.2.1 Installable client driver loader (ICD loader)
5 Results
  5.1 Accuracy of lane detection and tracking
    5.1.1 Pre-processing
    5.1.2 Lane Detection
    5.1.3 Lane tracking/particle filter
  5.2 Performance in challenging environments
  5.3 KITTI-ROAD Dataset
  5.4 Test drives
  5.5 Processing speed
    5.5.1 Computation on multiple devices
  5.6 Comparison to previous works
    5.6.1 Accuracy
    5.6.2 Memory consumption
    5.6.3 Processing Speed
6 Conclusion
  6.1 Future Work
Bibliography
List of Figures

3.1 The flow of the lane detection and tracking method presented in this thesis
3.2 Flow and elements of the pre-processing stage
3.3 Original image and output of the different pre-processing stages
3.4 Description of a lane marking
3.5 Detection of lane markings
3.6 Calculation of the importance weight
4.1 Processing flow of the particle filter when multiple hardware accelerators are used
List of Tables

5.1 Parameters that were used for comparing the performance of the current work to Nihil Madduri's algorithm
5.2 Comparison of the memory consumption

List of Algorithms

1 Pre-processing
2 Lane Detection
3 Lane Tracking
1 Introduction
1.1 Motivation
Road accidents lead to thousands of fatalities in Germany every year (destatis, 2013, p. 7). The vast
majority of these accidents are caused by driving errors, e.g. 86 % in 2012 (destatis, 2013, p. 15), and
many can be avoided by driver assistance systems (DAS). A variety of DAS exist that aim to improve
road and driver safety. One of these systems is the lane departure warning system, which uses a road
lane detection algorithm to detect if a car departs from a lane. Today lane departure warning systems
only notify the driver and do not intervene, but their role will become more active in the future as the
automation of driving progresses.
The increasing responsibility poses new challenges to DAS. They have to be more reliable, more
robust and react even faster. Consequently, lane departure warning systems will be required to detect
a departure from a lane fast enough to counteract it. In order to do so, road lane detection algorithms
are necessary that deliver very accurate and fast estimates of lane positions.
Road lane detection algorithms are already state of the art and exist in many variations. Most
algorithms are vision-based and process a stream of frames depicting the road. The required
computations are performed on electronic control units (ECUs), application-specific embedded systems.
Tens or even up to a hundred ECUs are integrated in modern vehicles, which has resulted in an
enormous complexity and power consumption. However, recent developments have made other processing
units such as GPUs and, more importantly, FPGAs a feasible alternative or complement to ECUs.
Especially FPGAs promise faster and more flexible processing at lower power consumption. This
development has progressed to the point that the manufacturer Xilinx has already started to design
automotive ECUs that embed FPGAs (F. Fons and Fons, 2012).
The introduction of new processing units, mostly hardware accelerators, to the automotive market
transforms the demands on the software. Methods and algorithms have to be adapted to the physical
layout of the computing device in order to achieve maximal computation speed.
Existing lane detection methods, however, are designed for conventional ECUs. Though some
methods may be adapted to the parallel computation structure of GPUs and FPGAs, they were not
specifically developed for this purpose and hence suffer from bottlenecks and other disadvantages. They
certainly lag behind a method that is specifically designed for the use of hardware accelerators.
1.2 Proposed Method
The proposed method is vision-based and aims at detecting the lane markings, as their appearance
differs significantly from the underlying road. It requires no knowledge of any physical parameters, such
as the position and orientation of the camera, and is hence very flexible.
2 Background
2.1 Particle Filter
Particle filters belong to the family of Sequential Monte Carlo (SMC) methods and use a set of
particles to describe a posterior density distribution of a state-space model. They require little or no
assumptions on the model, only periodic (indirect) measurements of the true state. Indirect position
“measurements” in the form of images are the only available information in a vision-based lane tracking
algorithm, making a particle filter a promising tool for lane tracking.
Particle filters implement the Bayesian recursion equation
P(X|Y) = P(Y|X) P(X) / P(Y)    (2.1)
in order to determine the posterior density distribution P (X|Y ) of a state X given the measurements Y
(Gordon, Salmond, and Smith, 1993). P (X) describes the distribution prior to taking measurements,
P (Y |X) the likelihood of the measurement conditioned on the state X and P (Y ) the evidence, or the
overall probability of the measurement.
Equation (2.1) is implemented by sampling a number of weighted particles i = 0, 1, . . . , N from the
prior distribution. Each particle has a state vector Xi, transforming equation (2.1) to

P(Xi|Y) = P(Y|Xi) P(Xi) / P(Y) = P(Y|Xi) P(Xi) / Σ(i=0..N) P(Y|Xi) P(Xi)    (2.2)
In addition, each particle is assigned an importance weight wi . The importance weight expresses the
likelihood that Xi is identical with the true state of the state-space model given the measurement Y ,
hence wi = P (Y |Xi ) P (Xi ). Inserting this in formula (2.2) yields
P(Xi|Y) = P(Y|Xi) P(Xi) / Σ(i=0..N) P(Y|Xi) P(Xi) = wi / Σ(i=0..N) wi    (2.3)
In this work the particle filter is implemented in three consecutive steps that are performed for
each frame:
Prediction update The state Xi of each particle is updated in the same manner as the true state is
expected to change.
Importance weight update The importance weight wi is calculated for each particle and formula (2.3)
is evaluated.
Resampling A resampling step is introduced that randomly samples particles according to their im-
portance weight and copies them to a new, equally sized set. Particles may be copied more than
once. This yields a set with a higher density of “good” particles and avoids a degeneration of the
particle set (Cappe, Godsill, and Moulines, 2007, pp. 3-4).
A detailed description of the particle filter for the lane tracking based on formulas 2.1-2.3 is presented
in section 3.4.
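As an illustration, the three steps can be sketched for a one-dimensional toy state. The motion and measurement models below are placeholders, not the lane model of this thesis (that model is developed in section 3.4):

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

// Minimal sketch of the three-step particle filter recursion described above.
struct Particle {
    double state;   // X_i
    double weight;  // w_i
};

// One filter iteration: prediction update, importance weight update, resampling.
void particleFilterStep(std::vector<Particle>& particles, double measurement,
                        double motionSigma, double measurementSigma,
                        std::mt19937& rng) {
    std::normal_distribution<double> motionNoise(0.0, motionSigma);

    // 1. Prediction update: move each particle as the true state is expected to move.
    for (auto& p : particles)
        p.state += motionNoise(rng);

    // 2. Importance weight update: w_i ~ P(Y | X_i), then normalize (formula 2.3).
    for (auto& p : particles) {
        double d = p.state - measurement;
        p.weight = std::exp(-(d * d) / (2.0 * measurementSigma * measurementSigma));
    }
    double evidence = std::accumulate(particles.begin(), particles.end(), 0.0,
        [](double s, const Particle& p) { return s + p.weight; });
    for (auto& p : particles)
        p.weight /= evidence;

    // 3. Resampling: draw a new, equally sized set according to the weights;
    //    a particle may be copied more than once.
    std::vector<double> weights;
    for (const auto& p : particles) weights.push_back(p.weight);
    std::discrete_distribution<int> pick(weights.begin(), weights.end());
    std::vector<Particle> resampled;
    for (std::size_t i = 0; i < particles.size(); ++i)
        resampled.push_back(particles[pick(rng)]);
    particles = resampled;
}
```

With repeated iterations the particle set concentrates around the measured state, which is the behaviour exploited for lane tracking later on.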
The minimum voltage is dependent on the frequency, so that if the frequency is raised, the power
consumption rises on the order of frequency^(2-3).
This, and the rising awareness that many applications are better suited to thread-level parallelism
(TLP) methods, led to the development of multi-core CPUs, consisting first of two and today already
of up to ten cores. They deliver better performance at reduced power consumption, as each
individual core operates at a lower frequency than the single cores that were used before.
Still, there are numerous applications where even multi-core CPUs proved inefficient (Navarro,
Hitschfeld-Kahler, and Mateu, 2013). This is the case for applications which would allow extensive
parallel computing, like the rendering of 2D- and 3D-images for graphical interfaces, where the pixels
are (mostly) independent and can be processed simultaneously.
2.2 High performance computing with hardware accelerators
While GPUs in today's vehicles mostly serve in entertainment systems, it can be expected that in the future they will be used for DAS
like the lane departure warning system.
2.2.3 OpenCL
The change to a more general hardware layout of GPUs gave rise to the development of new, equally
general APIs replacing the old graphics APIs. Developers like ATI (now AMD Graphics Product Group)
and Nvidia created Close to Metal (CTM) and the Compute Unified Device Architecture
(CUDA) in 2006/2007, parallel computing platforms that supported their respective GPGPU architectures. The
wish for a unified programming model targeting heterogeneous platforms led to the development
of OpenCL 1.0 in 2009. OpenCL is maintained by the non-profit technology consortium Khronos
Group and quickly evolved into versions 1.1, 1.2, 2.0 and 2.1, adding wider support for devices and
functionalities. The OpenCL specification provides the following four models to organize the execution
of programmes on hardware accelerators:
Platform model This model defines one processor that coordinates the execution (the host) and one or
more processors capable of executing OpenCL code (the devices). It further specifies that the
code for the devices is organized in C-like functions (called kernels) that execute on the
devices.
Execution model Outlines the OpenCL environment, the framework's different abstraction levels and
its use in a C/C++ program, including the setting up of an OpenCL context, the organisation
of host-device interaction and the different concurrency models for the kernel executions (e.g.
SIMD). It specifies that a C/C++ programme is executed on the host and the host distributes
tasks between the devices, which execute the required kernels. The kernels are executed in
work-groups consisting of one or many work-items. Within a work-group each work-item executes
the same kernel, but different work-groups may execute different kernels.
Memory model Defines the abstract memory hierarchy that kernels use, regardless of the actual
underlying memory architecture. Four different memory types are provided,
• global memory: is very large and accessible by all processing elements, but access has a high
latency.
• constant memory: can be written to only by the host. It is smaller than the global memory
and has a lower latency.
• local memory: is shared by a workgroup (of processing elements), smaller than constant
memory, but access is faster.
• private memory: belongs to a particular work item, mostly implemented by registers. It is
the fastest and smallest memory.
Programming model specifies the mapping of the concurrency model to physical hardware.
If OpenCL is used with multiple devices from different vendors, an OpenCL implementation for
each device has to be available on the target platform (the host). In addition, an installable client
driver loader (ICD loader) is required in order to use the different OpenCL
implementations in the same application.
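As an illustration of the platform and execution model, a minimal host-side setup might look as follows. The kernel and buffer sizes are illustrative placeholders (not code from this thesis), error handling is omitted for brevity, and on FPGAs the program would typically be loaded from a pre-compiled binary via clCreateProgramWithBinary instead of being built from source:

```cpp
#include <CL/cl.h>
#include <cstdio>

// Illustrative kernel: every work-item scales one element of a buffer.
static const char* kSource =
    "__kernel void scale(__global float* data, float factor) {\n"
    "    size_t gid = get_global_id(0);\n"
    "    data[gid] *= factor;\n"
    "}\n";

int main() {
    // Platform model: one host, one or more devices.
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;  // a GPU or FPGA, depending on the installed ICD
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

    // Execution model: context and command queue organize host-device interaction.
    cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, nullptr);

    // Build the kernel from source (FPGA flows use clCreateProgramWithBinary).
    cl_program program = clCreateProgramWithSource(context, 1, &kSource, nullptr, nullptr);
    clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(program, "scale", nullptr);

    // Memory model: a buffer in global memory, initialized from the host.
    float data[256];
    for (int i = 0; i < 256; ++i) data[i] = 1.0f;
    cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(data), data, nullptr);
    float factor = 2.0f;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(float), &factor);

    // Launch 256 work-items; the runtime groups them into work-groups.
    size_t globalSize = 256;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &globalSize, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, nullptr, nullptr);
    printf("data[0] = %f\n", data[0]);
    return 0;
}
```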
3 Method
The lane detection and tracking method presented in this chapter processes an incoming live video
stream frame by frame and extracts the position of lane markings. The video stream, showing a road
and the area surrounding it, will be delivered by a camera installed in a vehicle or mobile phone.
A frame is processed in two subsequent steps. First, information on the lane markings’ position
is amplified and extracted from the frame in a pre-processing stage. Then, depending on whether
previous estimates of the position exist or not, the exact position of the lane markings is detected in
a lane detection step or tracked in a lane tracking step. This process is illustrated in figure 3.1 and
will be explained in the following in detail.
Figure 3.1: The flow of the lane detection and tracking method presented in this thesis.
3.1 Pre-Processing
In the pre-processing stage information on the position of lane markings is extracted from a frame
and passed on to the lane detection or tracking step. The pre-processing stage itself consists of four
different procedures that are applied successively to the raw images. The procedures are
1. ROI selection. A region of interest (ROI) is defined within the raw image and only this region
is further processed.
Figure 3.2: Flow and elements of the pre-processing stage.
In this work size and position of the ROI are kept adjustable, so that the algorithm can be tested
in a wide range of scenarios. In a real application with a fixed camera, both size and position can be
hard-coded, which will yield an additional acceleration of the computation. Once the ROI is selected,
only the area within the region is processed in the subsequent stages.
3.1.2 Grayscaling
A raw image from a camera (and hence the ROI) is provided in the RGB colour format. In this format
each pixel is assigned three colour channels, one for red, one for green and one for blue. The values of
the three channels are combined to yield the actual colour of the pixel.
In the RGB colour format the distinction of lane markings from their environment is challenging or
not possible. First, all three colour channels would have to be compared. Second, and more importantly,
the comparison would have to be generic to account for varying colours (e.g. white or orange)
intensities of lane markings.
It is more promising to make use of the characteristic that lane markings are substantially brighter
than the road they are painted on. The ROI can be transformed to a grayscale format, where each
pixel reflects the intensity of the corresponding pixel in the original image. Dark pixels will receive low
intensity values and bright pixels will receive high values.
Different methods exist to transform an image from the RGB to the grayscale format, all based on a
weighted summation of the different channels. In this thesis a resource saving numerical approximation
is used for the transformation. It avoids the use of floating point arithmetic, which accelerates the
computation on hardware accelerators, especially on FPGAs. The transformation formulas are
where R is the red colour channel of the pixel, G the green and B the blue, all in the range [0, 255].
Y is the intensity and will have a value in the range [16, 235]. This conversion is performed for each
pixel of the ROI individually, resulting in a grayscaled image as shown in figure 3.3d.
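Since the exact transformation formulas are not reproduced in this excerpt, the following sketch shows one common integer-only approximation (BT.601-style coefficients, an assumption on my part) that matches the stated input range [0, 255] and output range [16, 235] without any floating point arithmetic:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical integer-only RGB -> intensity conversion in the spirit of the
// approximation described above (the thesis's exact constants are not shown
// in this excerpt). The coefficients 66/256, 129/256 and 25/256 approximate
// the BT.601 luma weights; +128 rounds the shift.
inline std::uint8_t rgbToGray(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return static_cast<std::uint8_t>(((66 * r + 129 * g + 25 * b + 128) >> 8) + 16);
}
```

Using only integer multiplications and a shift, this maps black (0, 0, 0) to 16 and white (255, 255, 255) to 235, the limits of the stated output range.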
Figure 3.3: Original image and output of the different pre-processing stages. In (a) the original image
depicting a road is shown. In (b) a ROI is marked that contains all required information
for detecting/tracking lane markings. (c) shows a zoomed image of the ROI and (d) the
result after grayscaling. In (e) the lane markings and other edges were detected by a Sobel
filter. In (f) the lane markings were highlighted and disturbances were removed by using
a threshold value.
calculated by applying two discrete differentiation operators to the image, one for the horizontal and
one for the vertical direction. The operators include smoothing and have the form of two 3x3 kernels,
which are convolved with the original image. The formulas for the convolutions are
     ( −1  0  +1 )                   ( −1  −2  −1 )
Gx = ( −2  0  +2 ) ∗ Image,     Gy = (  0   0   0 ) ∗ Image     (3.4)
     ( −1  0  +1 )                   ( +1  +2  +1 )
The formulas show that 9 pixels are required in order to determine the gradients at a position. Gx and
Gy are then combined to yield the overall gradient. Several methods exist to combine the gradients;
here a computationally inexpensive variant is used:

G = |Gx| + |Gy|    (3.5)
Figure 3.3e illustrates the output of the Sobel filter. The Sobel filter delivers optimal results in an
environment consisting of a dark street, bright lane markings and no disturbances. Here the lane
markings are represented by pixels with high intensity and the rest of the image is populated with
zero intensity pixels.
3.1.4 Thresholding
In a real application, however, images contain noise. This noise will be present in the form of additional,
undesired edges. Possible sources for edges are:
• Other road markings (e.g. arrows for turns)
3.2 Description of lane markings
Algorithm 1 Pre-processing
1: Load nine required PIXELS
2: for all PIXELS P do
3: Transform P to grayscale space (Grayscaling)
4: Calculate Gxp and Gyp (Sobel filter)
5: Gx += Gxp
6: Gy += Gyp
7: end for
8: Calculate G = |Gx | + |Gy |
9: if G < thres then (Thresholding)
10: G←0
11: else if G ≥ thres then
12: G ← MAX.
13: end if
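As a sketch, the Sobel and thresholding steps of Algorithm 1 might look as follows in C++ for a single inner pixel of an already grayscaled ROI. The function name, row-major layout and border handling are assumptions for illustration, not the thesis's kernel code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

const int MAX_INTENSITY = 255;  // output value MAX of Algorithm 1

// Sobel gradient plus thresholding for the pixel at (x, y) of a grayscaled
// ROI stored row-major with `width` pixels per row. Border pixels are assumed
// to be handled by the caller.
int sobelThreshold(const std::vector<std::uint8_t>& gray, int width,
                   int x, int y, int thres) {
    // 3x3 Sobel kernels, as in equation (3.4)
    static const int kx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static const int ky[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
    int gx = 0, gy = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int p = gray[(y + dy) * width + (x + dx)];
            gx += kx[dy + 1][dx + 1] * p;
            gy += ky[dy + 1][dx + 1] * p;
        }
    int g = std::abs(gx) + std::abs(gy);    // inexpensive gradient combination
    return (g < thres) ? 0 : MAX_INTENSITY; // thresholding
}
```

Each output pixel depends only on its nine neighbours, which is what makes this stage straightforward to parallelize, one work-item per pixel.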
In the following sections on lane detection and lane tracking, the pre-processed ROI will be used to
determine the exact position of the lane markings. Before that, the next section will introduce an
assumption on the shape of lane markings and establish an efficient way to describe their position.
Whereas demands one and two are obvious, the third requirement stems from the nature of the
algorithms used in this thesis. Both the lane detection and lane tracking are implemented with the
help of random sampling methods. This includes the creation of a large number of samples for lane markings, in
the following called candidate lines, from which the most fitting candidate line will be chosen (more
in sections 3.3 and 3.4). As there might be several thousand candidate lines, each should consume as
few resources as possible.
3.2.1 Straight-Line-Assumption
If a lane marking appears in a complicated shape, it is challenging to find a feasible description.
Either a complex mathematical function or a set that contains all the pixels of the lane marking
would be required.
The description is greatly simplified by the observation that lane markings mostly take the shape
of a straight line within the ROI (compare figure 3.3f). This is due to the fact that the ROI captures
only a small section of the street ahead.
Figure 3.4: Description of a lane marking. xstart and xend and the height of the ROI (the light blue
area) fully define the line.
On straight roads and even in moderate bends the straight-line-assumption holds. In sharp
bends or on other exceptionally routed roads, the lane markings might exhibit a bend. Then the ROI
can be split horizontally into two or more regions, yielding subregions with straight lane markings.
Given the straight-line-assumption, the challenge of describing shape and position of a lane marking
reduces to describing a mathematical line.
Previous Approach. Still, there are different ways of describing a line for the purpose of lane detection.
In (Madduri, 2014) two arrays are used to store position and shape of a lane marking, one for
all the x-values of the line and one for all the y-values. In addition, the angle of the line is stored and
used to calculate all x- and y-values.
This approach is intuitive and thorough. Every bit of information on the lane markings is stored.
It has, however, two disadvantages. First, the angle requires trigonometric functions for
calculations. These functions are expensive on hardware accelerators (especially FPGAs) and consume a
considerable amount of the available computing resources.
Second, each line is excessive in terms of its memory consumption. In a ROI of N rows, a single line
will need storage of 2N + 1 values (N x-values + N y-values + 1 angle). Since thousands of candidate
lines are used, the available memory on a hardware accelerator might not be sufficient.
New Method. In this thesis the description of lane markings is simpler and more efficient. It is based
on the fact that a straight line can be defined by two points on the line. The points at the
top and the bottom of the ROI are chosen to represent a line. The position (or state) of a line can
thus be expressed as

X = ( xtop , xbottom )ᵀ

if the height of the ROI is known. Since the y-values of the points are constant and given by the
respective line number in the ROI (first line and last line), they do not need storing. Figure 3.4 shows
a line in the ROI that is completely defined by the state X and the height of the ROI.
3.3 Lane Detection
Reduced memory consumption The memory a line consumes was reduced from 2N + 1 to exactly
two values (compare figure 3.4).
Simple mathematical operations The need for trigonometric operations was eliminated. The slope
sX and the x-value xn at any row n can be determined by

sX = (xbottom − xtop) / ROI_HEIGHT,    xn = xtop + sX · n
Both formulas use only basic mathematical operations that are inexpensive on FPGAs.
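The two-value line description and the formulas above can be sketched as follows; the ROI height is an example value, not a parameter from this thesis:

```cpp
#include <cassert>

// A candidate line stores only xtop and xbottom; slope and intermediate
// points are derived on demand with basic arithmetic.
struct Line {
    float xTop;     // x at the first ROI row
    float xBottom;  // x at the last ROI row
};

const int ROI_HEIGHT = 120;  // example ROI height in rows (an assumption)

// Slope s_X = (xbottom - xtop) / ROI_HEIGHT
inline float slope(const Line& l) {
    return (l.xBottom - l.xTop) / ROI_HEIGHT;
}

// x-value of the line at ROI row n (0 = top row): x_n = xtop + s_X * n
inline float xAt(const Line& l, int n) {
    return l.xTop + slope(l) * n;
}
```

Compared to the 2N + 1 values of the previous approach, each line here occupies two floats regardless of the ROI height.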
Figure 3.5: Detection of lane markings. (a) shows the division of the ROI in four equally sized regions.
(b) shows a lane marking that is only partly within the ROI. In (c) one subregion is filled
with randomly sampled candidate lines and in (d) a candidate line that represents the lane
marking is highlighted in red.
3.4 Lane Tracking
the region. If it is chosen very high, the lines are spread far and the actual lane marking might be
missed.
In this work σsam was set to region width/2. From the definition of the standard deviation of a
normal distribution it follows that statistically about 68 % of the sampled lines will be placed completely
within a region and the other 32 % of the lines might overlap into other regions. This choice of parameters
proved to detect lane markings in all kinds of positions, as the results in chapter 5 will show.
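The sampling of candidate lines within one subregion can be sketched as follows; whether xtop and xbottom are drawn independently, and the number of candidates, are assumptions for illustration:

```cpp
#include <cassert>
#include <random>
#include <vector>

struct CandidateLine { float xTop, xBottom; };

// Draw `count` candidate lines for one ROI subregion: xtop and xbottom are
// sampled from a normal distribution centred on the region, with
// sigma_sam = region_width / 2 as chosen above.
std::vector<CandidateLine> sampleCandidates(float regionCenterX, float regionWidth,
                                            int count, std::mt19937& rng) {
    std::normal_distribution<float> dist(regionCenterX, regionWidth / 2.0f);
    std::vector<CandidateLine> lines;
    lines.reserve(count);
    for (int i = 0; i < count; ++i)
        lines.push_back({dist(rng), dist(rng)});  // top and bottom sampled independently
    return lines;
}
```

Because each candidate line is sampled and weighted independently, this step maps naturally onto one work-item per line on a hardware accelerator.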
The selection of the best lines from the candidate lines requires knowledge of all lines and is therefore
performed on the host.
The lane tracking algorithm, which will be described in the following section, requires a set of
possible candidates/particles for each lane marking. Therefore not only the best candidate is stored,
but for each lane marking a set with a few dozen candidate lines is kept. In the following this set will
be referred to as the good lines. The line with the highest weight amongst the good lines will be called
the best line and represents the actual lane marking.
Prediction update. Step one from algorithm 2, the sampling of candidate lines from a normal
distribution, can be omitted; the particles are already given in the form of the good lines. Instead, an
additional prediction update step is introduced.
This step is required, because the particles (good lines) are representing the lane markings in one
frame, but are used as prior distribution in the next frame. In the new frame the lane markings might
have moved slightly (because the vehicle moves). The particles need to move the same distance in
order to be a valid prior distribution in the new frame.
The distance the lane markings shift is not known. Therefore the particles are shifted by a random
value sampled from a normal distribution with mean µshif t = 0 and standard deviation σshif t > 0.
µshif t = 0 indicates that we expect no shift in an optimal case and σshif t > 0 accounts for a deviation
from the optimal case.
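A minimal sketch of this prediction update; shifting both state components by one common offset is an assumption here:

```cpp
#include <cassert>
#include <random>
#include <vector>

struct ParticleLine { float xTop, xBottom; };

// Shift every particle (good line) by noise drawn from N(0, sigma_shift),
// so the set remains a valid prior for the new frame.
void predictionUpdate(std::vector<ParticleLine>& particles, float sigmaShift,
                      std::mt19937& rng) {
    std::normal_distribution<float> shift(0.0f, sigmaShift);
    for (auto& p : particles) {
        float d = shift(rng);
        p.xTop += d;
        p.xBottom += d;
    }
}
```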
Importance weight update. For each particle of the prior distribution the importance weight
is calculated. This is done by applying formula 2.3. As mentioned there, the importance weight of a
particle is determined by the numerator P(Y|Xi) · P(Xi). It assesses the likelihood that the predicted
particle Xi produces the observation Y . In this work this is determined by fitting the state of the
predicted particle to the Gaussian function
wi = 1 / (σf √(2π)) · exp( −(Xi − µf)² / (2σf²) )    (3.7)
where µf = Y . The standard deviation σf represents the measurement noise that accounts for a
possible error in the assumption that the position of a lane marking does not change between two
frames.
The term Xi − µf = Xi − Y in formula 3.7 represents the distance of two lines, the particle Xi and
the observation Y . This distance is calculated by summing up the pointwise distance of line Xi and
line Y . It yields the area between the two lines, as shown in figure 3.6.
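The weight computation can be sketched as follows; the row-wise approximation of the area between the two lines is an illustrative assumption:

```cpp
#include <cassert>
#include <cmath>

struct TrackedLine { float xTop, xBottom; };

// x-value of a line at a given ROI row (0 = top row)
float xAtRow(const TrackedLine& l, int row, int roiHeight) {
    return l.xTop + (l.xBottom - l.xTop) * row / roiHeight;
}

// Distance X_i - Y of formula (3.7): the area between particle and observed
// line, approximated by summing the pointwise |x| distance over all ROI rows.
float lineDistance(const TrackedLine& particle, const TrackedLine& observed, int roiHeight) {
    float area = 0.0f;
    for (int row = 0; row <= roiHeight; ++row)
        area += std::fabs(xAtRow(particle, row, roiHeight) - xAtRow(observed, row, roiHeight));
    return area;
}

// Unnormalized importance weight, formula (3.7) with mu_f = observation Y
float importanceWeight(const TrackedLine& particle, const TrackedLine& observed,
                       int roiHeight, float sigmaF) {
    float d = lineDistance(particle, observed, roiHeight);
    return std::exp(-(d * d) / (2.0f * sigmaF * sigmaF))
           / (sigmaF * std::sqrt(2.0f * 3.14159265f));
}
```

A particle identical to the observation yields distance 0 and hence the maximum weight; the weight falls off with the enclosed area, controlled by the measurement noise σf.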
Figure 3.6: Calculation of the importance weight. In (a) the best line from the previous frame and a
candidate line after the prediction update are shown. In (b) the area that determines the
importance weight of the particle is highlighted in orange.
The evidence P (Y ), or marginal likelihood, is required in order to put the importance weight of
different particles in relation to each other. The evidence describes the overall probability of the
observation Y and is simply calculated as
P(Y) = Σ(i=0..N) P(Y|Xi) · P(Xi) = Σ(i=0..N) wi
In accordance with formula 2.3 from the section on particle filter, this results in following formula for
the importance weight update:
wi,updated = P(Xi|Y) = P(Y|Xi) · P(Xi) / P(Y) = wi / Σ(i=0..N) wi    (3.8)
Resampling. The prediction and importance weight update produce a new set of good lines, where
each particle has a normalized importance weight. Finally, a resampling step is performed in order to
increase the accuracy of the lane detection algorithm and to prevent a degeneration of the set (compare
section 2.1).
The resampling algorithm selects particles from the updated set according to their normalized
importance weight and shifts them to a new set with the same number of particles. Since particles
are selected according to their importance weight, good particles are more likely to be selected than
less accurate particles. A particle can also be chosen multiple times. The resampling algorithm used
in this thesis is the same as in (Madduri, 2014, p. 57).
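The selection step can be illustrated with a standard multinomial resampling sketch. This is a generic textbook variant, not necessarily identical to the algorithm from (Madduri, 2014) used in the implementation.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Standard multinomial resampling: each of the N slots in the new set
// is filled by a particle drawn with probability proportional to its
// importance weight, so a good particle can be selected several times.
std::vector<std::size_t> resampleIndices(const std::vector<double>& weights,
                                         std::mt19937& rng) {
    std::vector<double> cumulative(weights.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i)
        cumulative[i] = (sum += weights[i]);

    std::uniform_real_distribution<double> draw(0.0, sum);
    std::vector<std::size_t> picked(weights.size());
    for (std::size_t n = 0; n < weights.size(); ++n) {
        const double r = draw(rng);
        // Pick the smallest index whose cumulative weight exceeds r.
        std::size_t i = 0;
        while (i + 1 < cumulative.size() && cumulative[i] <= r) ++i;
        picked[n] = i;
    }
    return picked;
}
```

Particles with weight zero are never selected, while heavily weighted particles appear multiple times in the new set, which prevents the degeneration mentioned above.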
• all detected lane markings have a minimum distance, which was chosen in this work to be 20 % of
the width of the ROI. However, the minimum distance can be adjusted by the user and reduced
if many lane markings have to be detected or the ROI is very small
• at least 30 % of a lane marking is within the ROI. Again, this is an adjustable parameter.
The last check is important for the lane tracking. The algorithm tracks lane markings based on
existing estimates and it will do so even if a lane marking moves to the boundary or even out of the
ROI. No new lane markings are detected in the lane tracking stage. In many scenarios, for example
when a car changes lanes on the highway, one lane marking moves out of the ROI and another one
moves in. The third check of the redetection criteria detects these cases and triggers a lane detection
step to discover the new lane marking.
As the lane tracking is the computationally cheaper stage, it is desirable to perform it as often as
possible and to use the lane detection only when necessary. The redetection criteria define the
situations in which a lane detection step is required. With their introduction the lane detection and
tracking method of this thesis is complete.
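The checks described above can be sketched as a single predicate. The `Marking` struct and its fields are illustrative assumptions, not the thesis' actual data structures; only the thresholds (20 % of the ROI width, 30 % inside the ROI) come from the text.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// A lane marking is represented here simply by its mean horizontal
// position and by the fraction of its points lying inside the ROI.
struct Marking {
    double meanX;         // mean horizontal position in pixels
    double fractionInRoi; // share of the marking's points inside the ROI
};

// Returns true if a new lane detection step must be triggered.
bool needsRedetection(const std::vector<Marking>& markings,
                      double roiWidth,
                      double minDistFactor = 0.2,  // 20 % of ROI width
                      double minInRoi = 0.3) {     // 30 % inside ROI
    for (std::size_t i = 0; i < markings.size(); ++i) {
        if (markings[i].fractionInRoi < minInRoi)
            return true;  // marking moved (almost) out of the ROI
        for (std::size_t j = i + 1; j < markings.size(); ++j)
            if (std::abs(markings[i].meanX - markings[j].meanX)
                    < minDistFactor * roiWidth)
                return true;  // two estimates collapsed onto one marking
    }
    return false;
}
```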
4 Details on the Implementation
The road lane detection and tracking method presented in the previous chapter was implemented in
C++. The parts of the method that can be computed in parallel were implemented as OpenCL kernels
and are performed on the hardware accelerator (FPGA/GPU).
The communication between host and devices is organized in a server-client model. The host
distributes the workload between the devices and collects the results when they are ready. A more
dynamic, direct communication between the devices is not implemented, as OpenCL does not yet
support this feature. Figure 4.1 illustrates the processing flow when multiple hardware accelerators
are used. The distribution of the workload is kept adjustable. The user defines how many particles
are processed on each device and the distribution may be changed at runtime.
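The host-side split of the workload can be sketched as plain C++. The function name and the share-based interface are illustrative assumptions; the point is that the per-device particle counts are user-defined and can be recomputed at runtime.

```cpp
#include <cstddef>
#include <vector>

// Distribute totalParticles over the devices according to user-defined
// shares. Leftover particles from the integer division are assigned to
// the first device, so the counts always sum to totalParticles.
std::vector<int> splitParticles(int totalParticles,
                                const std::vector<int>& shares) {
    int shareSum = 0;
    for (int s : shares) shareSum += s;

    std::vector<int> counts(shares.size());
    int assigned = 0;
    for (std::size_t i = 0; i < shares.size(); ++i) {
        counts[i] = totalParticles * shares[i] / shareSum;
        assigned += counts[i];
    }
    counts[0] += totalParticles - assigned;  // remainder to first device
    return counts;
}
```

With equal shares, 64 particles would be split as 32 per device; changing the shares at runtime changes the split for the next frame.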
Figure 4.1: Processing flow of the particle filter when multiple hardware accelerators are used. The
host distributes n and m particles to the devices and collects and resamples the results. The
user defines how many particles are processed on each accelerator.
5 Results
The previously developed algorithm was subjected to comprehensive accuracy and efficiency testing.
The average processing speed was determined for the overall algorithm and the individual parts of the
algorithm, as was the memory consumption. Finally, the performance of this algorithm was compared
to a preceding work presented in (Madduri, 2014).
5.1.1 Pre-processing
The pre-processing has only one adjustable parameter, the threshold value. All other pre-processing
parts are fixed operations on pixels. The tests on the videos were performed with threshold values
of 50, 75, 100, 125 and 150, and a threshold value of 50 delivered very good results, both at day
and at night. Hence the following tests are performed with a threshold value of 50, if not indicated
otherwise.
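The thresholding part of the pre-processing can be sketched as follows. This is an assumption about the exact rule (pixels below the threshold are suppressed to zero); in the implementation this work runs inside an OpenCL kernel, while here it is shown as plain C++ for clarity.

```cpp
#include <cstdint>
#include <vector>

// Suppress gradient pixels below the threshold so that weak edges and
// noise do not attract candidate lines in the detection step.
void applyThreshold(std::vector<std::uint8_t>& gradientImage,
                    std::uint8_t threshold) {
    for (auto& px : gradientImage)
        if (px < threshold) px = 0;
}
```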
(c) City road at night. (d) Lane detection, 256 sampled lines.
Figure 5.1: Output of the lane detection. The lane markings are successfully detected at day and night.
The number of lines required to detect the lane markings depends on the width of the ROI, and the testing revealed that a total
of N_lines = WIDTH_ROI / 2 lines per lane marking provides a reliable and accurate detection.
5.2 Performance in challenging environments
(e) Tracking with 128 particles. (f) Tracking with 256 particles.
Figure 5.2: Output of the particle filter for a varying number of particles. 64 particles are required to
deliver an accurate and robust result, as shown in (d).
Figure 5.3 displays a selection of images where the lane tracking is not trivial, together with the algorithm's
results. In 5.3a misleading lane markings are present. The algorithm proved very robust against this
kind of disturbance (as it employs estimates from previous images) and tracks the lane markings
accurately. Figure 5.3b shows the results of the algorithm for an image displaying a turn. Even
though the markings in this image are far from parallel, they are tracked successfully.
The accuracy of the algorithm does not depend on the number of tracked lane markings, as figure
5.3c illustrates. Neither do negative implications occur if detection and tracking take place at day or
night, as figures 5.3d and 5.3e emphasize. A specific challenge at night is the presence of
disturbing light beams (5.3e), but the particle filter delivers reliable results even then.
Many more scenarios can be imagined, and more profound testing would be required to further
consolidate the algorithm's robustness. The tests performed in this thesis, however, indicate a very
accurate detection and tracking throughout a wide range of environments.
Figures 5.3f-5.3h show the application of the redetection criteria when the algorithm loses track of
a lane marking. The figures depict a car changing lanes on the highway. In figure 5.3f the algorithm
loses track of the right lane marking as it moves out of the ROI. The lane detection is immediately
triggered and detects the new lane markings (figure 5.3g). Then the algorithm continues to track the
newly detected lane markings (figure 5.3h).
(e) Tracking when a disturbing light beam from another car is present. (f) Redetection 1: loss of track of the right marking when changing lanes. (g) Redetection 2: successful detection of new markings. (h) Redetection 3: continued tracking of the lane markings.
Figure 5.3: Output for road traffic situations that present a challenge to lane tracking algorithms.
The particular situation is described below each figure. The current work was capable
of detecting/tracking lane markings in all these situations. Figures (f)-(h) show a car
changing lanes on the highway. Consequently the algorithm loses track of the right lane
marking (f), performs a successful redetection (g) and continues to track the new lane
markings (h).
5.3 KITTI-ROAD Dataset
(a) Presence of misleading edges from tracks. (b) Tracking in a sharp bend. (c) Presence of strong light and shadows. (d) Misleading edges by shadows. (e) Tracking when there are no markings. (f) No markings and shadows. (g) No markings, pavement edges are tracked. (h) Tracking of three lanes in the city. (i) Three lanes, disturbing edges from tracks. (j) Three lanes, no markings on the right.
Figure 5.4: Output for the KITTI-ROAD dataset. All images are taken from (Fritsch, Kuehnl, and
Geiger, 2013). The images show a collection of especially challenging scenarios for lane
detection/tracking during daylight and the very accurate results of the current work. Each
scenario is outlined below the respective figure.
5.5 Processing speed
(a) Test drive on highway, day. (b) Test drive on country road. (c) Test drive on highway, night. (d) Tracking with faded lane marking. (e) Tracking of three lane markings. (f) Tracking despite disturbing light beam.
Figure 5.5: Snapshots from test drives. (a) and (b) were recorded during test drives at day, (c)-(f)
at night. In (d) the right lane marking is barely visible, in (e) three lane markings are
tracked and in (f) a disturbing light beam is present. The results remain accurate.
The Nvidia GeForce GTX 660 TI GPU delivered the highest processing speed. This was expected, as it has the highest computational power of the tested devices. It is followed
closely by the Altera Stratix V FPGA with 540 to 353 fps and, at a considerable distance, by the Cyclone V FPGA
with 38 to 33 fps.
Influence of the number of particles. Figure 5.6 demonstrates that the number of particles
has only a minor influence on the processing speed. On all three devices the average frames per second
decrease by less than 10 % if the number of particles is raised from 8 to 128. If the number of
particles is further increased to 256 or 512, the particles' impact on the execution time grows stronger.
But as 64 particles proved sufficient to deliver accurate estimates, there is no need to use more particles.
Influence of the ROI size. The size of the ROI has a much stronger impact on the computation
time. In figure 5.7 the processing speed is shown for three different ROIs (72x512 pixels, 96x512
pixels and 144x1024 pixels). Increasing the ROI from 72x512 pixels to 96x512 pixels reduces the
processed frames per second by 40-50 %, depending on the device. A further increase to 144x1024 pixels
yields an additional performance reduction of 30-40 %. This is a sharp decline in computation
speed, which gives rise to the maxim of choosing the ROI as small as possible. This is also a possible
approach for future work to further improve the performance.
Figure 5.6: Performance at a ROI size of 72x512 pixels. The performance is shown using the Nvidia
GeForce GTX 660 TI GPU (red), the Altera Stratix V FPGA (green) and the Altera
Cyclone V SOC (blue).
Figure 5.7: Performance using 64 particles and varying ROI sizes. Three ROI sizes are compared:
72x512 pixels, 96x512 pixels and 144x1024 pixels. A sharp performance decrease is detected
with an increasing ROI size.
Figure 5.8: (a) Distribution of the computation time between host and device on the Altera Cyclone V SOC. (b) Distribution of the computation time between pre-processing, lane detection and lane tracking.
Composition of the computation time. In a setup using 64 particles and a ROI of 72x512
pixels the algorithm requires 1.59 ms per frame on the GPU, 2.09 ms per frame on the Stratix V
FPGA and 26.31 ms on the Cyclone V FPGA (compare figure 5.6). It was expected that the Cyclone
V would process the videos more slowly, but the difference is immense. The reason is explained by figure
5.8a. The chart shows that the calculations on the host system of the Cyclone V SOC require 97.40 %
(25.63 s) of the overall computation time. A lot of time is consumed by loading images and
other peripheral tasks. This indicates that the performance is reduced drastically by the slow host
system running on the ARM Cortex-A9 processor, and not by a design fault or the Altera Cyclone V
FPGA itself.
Figure 5.8b shows the average distribution of the computation time on the hardware accelerators.
The distribution on all three devices was similar, and therefore an average is presented. The pre-processing
requires the largest share of the computation time, 48.91 %, further emphasising the
need to minimize the ROI size. Lane detection was performed in the tests on only 3 % of the frames,
but it still accounts for 31.85 % of the processing time. This justifies the use of the particle filter,
which required only 19.24 % of the total time but processed 97 % of the frames.
Figure 5.9: Performance using Altera’s Stratix V FPGA and Nvidia’s GeForce GTX 660 TI in com-
bination
than the FPGA. A further increase of the particles to 10240 delivered similar results and consolidated
this explanation.
Summarizing the above, the use of multiple devices delivers no speed-up in computation for the
current work, where 64 particles are sufficient for tracking. It accelerates the computation if many
more particles are used, which might be interesting for other applications.
2. The resource consumption of the algorithm, including the data transfer between host and devices
3. The processing speed (average frames per second that can be processed by each algorithm)
5.6 Comparison to previous works
ROI size: 108x448 pixels (the size of the region of interest)
Particles: 16, 32, 64, 128, 300, 512 (number of candidate lines/particles that were used to track each lane marking)
Table 5.1: Parameters that were used for comparing the performance of the current work to Nihil
Madduri's algorithm
5.6.1 Accuracy
The output of both particle filters using a varying number of particles is presented and compared in
figure 5.10. In the tests, Madduri's algorithm required a minimum of 128 particles to deliver accurate
estimates of the lane markings. This is shown by figures 5.10a, 5.10c and 5.10e, where his particle filter
fails to track at least one marking.
Further, the previous work often loses track of lane markings, even if 128 or more particles are used.
This is for example the case when the orientation of lane markings changes, as shown in figure 5.10g.
Other changes in the scenery, such as bright street signs or disturbing light beams, frequently led to a
failure as well.
These failures are caused by a flaw in the design of Madduri's particle filter. It only uses the angle
of the particles (candidate lines) as the resampling criterion. The resulting particles are parallel lines, but
their distance to the actual lane markings is not limited and may be large.
Since the particles are parallel, it is impossible to accurately track a lane marking with an orientation
that is not parallel to the estimates of the previous frame. Moreover, distant light beams or other
areas with high-intensity pixels are frequently mistaken for lane markings, as particles are not limited
in their distance to the previous best line. This can be seen in figure 5.10c, where the particles are
spread throughout the ROI.
In addition, the tests showed that once Madduri's algorithm lost track, it took some time (at
least one or two seconds) to find its way back to the actual lane marking, if it was able to at all. This is
unacceptable for a driver assistance system, which might need to react very fast.
The current work, on the other hand, delivered very accurate results already with 32 particles.
Figures 5.10d and 5.10f show the algorithm tracking the lane markings, and the results are very
accurate irrespective of whether 32 or 64 particles are used. This is tantamount to an enormous
increase in accuracy at a much lower computational effort: far fewer particles are used to achieve the
same or an even better result.
In addition, the robustness of the algorithm exceeds that of the previous work by far.
As shown in figure 5.10h, the current work tracks both lane markings successfully when changes
in the orientation of lane markings occur. This also held true for other disturbing influences, such
as additional lane markings, disturbing light beams and blurred images. The improved resampling
criterion, presented in section 3.4.2, ensured robust lane tracking throughout all these scenarios.
(g) Madduri's algorithm, tracking fails. (h) Current work, tracking successful.
Figure 5.10: Accuracy of the current work in comparison to (Madduri, 2014). Figures (a), (c), (e) show
the accuracy of Madduri's work at 32, 64 and 128 particles, respectively. At least 128
particles are necessary to ensure reliable tracking. Figures (b), (d), (f) show the accuracy
of the current work; 32 particles are already sufficient for the lane tracking. Figure (g)
shows a case where Madduri's algorithm fails to track a lane marking, as only parallel
lines are resampled. Figure (h) shows the same scene, and the current work successfully
tracks both lane markings.
• The pre-processing is performed in parallel in the current work and in series in Madduri's work.
(a) Performance using the GPU. (b) Performance using the FPGA.
Figure 5.11: Processing speed of the current work in comparison to Madduri's algorithm. The ROI
size is 108x448 pixels in both figures. Figure (a) compares the processing speed of both
algorithms using a Nvidia GeForce GTX 660 TI GPU; the current work processed videos
between 2.4 and 4.6 times faster. Figure (b) compares the processing speed using an
Altera Stratix V FPGA; the current work executed between 8.4 and 29.7 times faster.
• The design of the particle filter. In this thesis the particles consume less memory and do not
require computationally expensive operations (sine, cosine, divisions)
• A more efficient implementation of the random number generator, as in the current work the random
number generator is only initialized once.
The charts in figure 5.11 reveal another difference between the two algorithms. The previous work's
performance is strongly dependent on the number of particles. On the NVIDIA GeForce GTX 660
TI the processed frames per second decrease by 50 % (from 131 fps to 66 fps) when the number of
particles is increased from 64 to 512. On the Altera Stratix V FPGA the difference is more pronounced. With
64 particles the algorithm is able to process 32 fps, which is just enough to process a normal video with
30 fps, but with 512 particles the performance drops by 78 % to 7 fps, which is too slow to process a
live video.
The current work’s processing speed, however, exhibits a much lower dependence on the number
of particles. Its performance decreases only by 4 % (314 fps to 301 fps) on the GeForce GTX 660
TI GPU and by 22 % (270 fps to 208 fps) on the Stratix V FPGA. This can be explained by the
novel representation of lane markings, which was introduced in this work. It reduced the memory
consumption of the particles remarkably, making them insignificant in terms of resource consumption.
6 Conclusion
A vision-based road lane detection and tracking algorithm is developed, implemented in C++ and
OpenCL, and tested on two FPGAs and one GPU. An incoming live video stream is processed in two
steps: an initial pre-processing step that creates a gradient image of a selected region and removes
minor noise, followed by a lane detection or lane tracking step. The lane detection method is a greatly
improved version of a novel approach proposed in (Madduri, 2014). It populates the gradient image
with randomly sampled lines, weights these lines according to their distance to the lane markings
and selects the best fitting lines to represent the lane markings. Lane tracking uses a computationally
efficient particle filter to track lane markings. A redetection criterion was introduced that triggers an
additional detection step if the results are not plausible.
The pre-processing is implemented as an OpenCL kernel and performed on a hardware accelerator.
The random sampling and weighting of lines in the lane detection step are moved to a kernel, as are
the prediction and importance weight update in the lane tracking step.
In various tests with recorded videos, independent datasets and live test drives, the algorithm
delivered very accurate and robust results using only 64 particles. The tests covered the detection on
highways, in cities, with low and high traffic density, at day and night and in many exceptional
situations, including turns, sharp bends, the presence of shadows or disturbing light beams, and even
barely visible lane markings. In all cases the algorithm delivered excellent results.
The performance varied with the selection of the ROI size and the hardware accelerator. The Nvidia
GeForce GTX 660 TI GPU delivered an extremely fast result with an average of 216-627 processed
fps. It was followed by the still impressive performance of the Altera Stratix V FPGA with 177-478
fps and the Altera Cyclone V FPGA with 11-38 fps. In the case of the Cyclone V, the slow performance
can be explained by the significantly slower host system in which the Cyclone V is embedded. The
integration of multiple hardware accelerators at the same time delivered no significant acceleration of
the computation, mainly because 64 particles are already sufficient for tracking.
In a comparison to a preceding work, the algorithm showed an enormous increase in accuracy and
robustness, a memory consumption reduced by 71.94 %, a 2.4-4.6 times faster execution on the GeForce
GTX 660 TI and a 8.4-29.7 times faster execution on the Stratix V, due to numerous improvements
in design and implementation.
Bibliography
Cappe, O., Godsill, S., & Moulines, E. (2007, May). An overview of existing methods and recent
advances in sequential monte carlo. Proceedings of the IEEE, 95 (5), 899–924. doi:10.1109/JPROC.2007.893250
Cunningham, W. (2011). Bmw adopts nvidia gpu for in-car displays. Retrieved March 9, 2015, from
http://www.cnet.com/news/bmw-adopts-nvidia-gpu-for-in-car-displays/
destatis. (2013, July). Unfallentwicklung auf deutschen strassen 2012. Brochure. Wiesbaden: Statistis-
ches Bundesamt Deutschland.
Fons, F. & Fons, M. (2012, March). Fpga-based automotive ecu design addresses autosar and iso 26262
standards. Xcell Journal, 78, 20–31.
Fritsch, J., Kuehnl, T., & Geiger, A. (2013). A new performance measure and evaluation benchmark
for road detection algorithms. In International conference on intelligent transportation systems
(itsc).
Gordon, N., Salmond, D., & Smith, A. (1993, April). Novel approach to nonlinear/non-gaussian
bayesian state estimation. Radar and Signal Processing, IEE Proceedings F, 140 (2), 107–113.
Leva, J. L. & Mitre Corporation. (1992). A fast normal random number generator. ACM Trans. Math.
Software, 449–453.
Madduri, N. (2014). Hardware accelerated particle filter for lane detection and tracking in opencl
(Master’s thesis, TU München).
Muyan-Özçelik, P. & Glavtchev, V. (2008). Gpu computing in tomorrow's automobiles. Retrieved
March 9, 2015, from http://www.nvidia.com/content/nvision2008/tech_presentations/automotive_track/nvision08-gpu_computing_in_tomorrows_automobiles.pdf
Navarro, C. A., Hitschfeld-Kahler, N., & Mateu, L. (2013). A survey on parallel computing and its ap-
plications in data-parallel problems using gpu architectures. Communications in Computational
Physics, 15 (2), 285–329. doi:10.4208/cicp.110113.010813a
Owens, J., Houston, M., Luebke, D., Green, S., Stone, J., & Phillips, J. (2008, May). Gpu computing.
Proceedings of the IEEE, 96 (5), 879–899. doi:10.1109/JPROC.2008.917757
Owens, J. D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A., & Purcell, T. J. (2007).
A survey of general-purpose computation on graphics hardware. Computer Graphics Forum,
26 (1), 80–113. doi:10.1111/j.1467-8659.2007.01012.x
Peddie, J. (2013). Mobile devices and the gpus inside. Jon Peddie Research.
Thomas, D. B. (2011). The MWC64X Random Number Generator. Retrieved from http://cas.ee.ic.
ac.uk/people/dt10/research/rngs-gpu-mwc64x.html
Tian, X. & Benkrid, K. (2010, November). High-performance quasi-monte carlo financial simulation:
fpga vs. gpp vs. gpu. ACM Trans. Reconfigurable Technol. Syst. 3 (4), 26:1–26:22. doi:10.1145/1862648.1862656
Yazbeck, F. & Kenny, R. (2012, March). White paper: reducing power consumption and increasing
bandwidth on 28-nm fpgas. Altera. Retrieved from https://www.altera.com/en_US/pdfs/literature/wp/wp-01148-stxv-power-consumption.pdf