Park-NET: Identifying Public Urban Green Spaces Using Multi-Source Spatial Data and Convolutional Networks
Master’s thesis
Supervisor: S.M. Labib
July 2022
Public urban green spaces (PUGSs) are a vital part of cities. They have a positive influence on the
ecosystem and on the well-being of local citizens. This is now widely recognized, with the United Nations
(UN) including public urban green space accessibility in its Sustainable Development Goals (SDGs).
To make the most of PUGSs, create them, and manage them sustainably, we need a better understanding
of them and of their spatial and temporal characteristics. There is also a need for more accessible and
cheaper PUGS datasets, so that officials and residents can use them when designing more sustainable
neighbourhoods and cities. While spatial PUGS data for some cities is available in land use datasets,
these are often inaccessible to outside researchers and not updated frequently. Hence, there are
methodological limitations on how PUGS data can be created more efficiently for wider usage. Deep
learning methods applied to satellite images are a promising way to achieve this, because they can
capture geometric patterns of land cover types using freely available satellite images of moderate
spatial and temporal resolution. This study evaluates two convolutional network architectures, U-Net
and U-Net with a ResNet34 encoder, for semantic segmentation of PUGSs at the metropolitan level in
multiple cities worldwide. Open-source data, mainly Sentinel imagery and ground truth PUGS data from
various open data portals, is used as input. The chosen best model had an average test
Intersection over Union (IoU) of 0.5610 and an average test F1 score of 0.64515 across two external
cities, which is a moderate to good performance for a deep learning model detecting PUGSs.
This shows that the approach is promising for creating reliable, new PUGS datasets that can help officials
in urban planning.
Keywords: Convolutional Neural Networks, Public Urban Green Spaces, Image Processing, Satellite
Imagery, Semantic Segmentation
1. Introduction
Nowadays more than half of the people globally live in cities, and this proportion is expected to increase
to 68% by 2050 according to the United Nations (UN, 2018). That is why making our cities healthier to live
in, by significantly transforming the way we build and manage our public urban green spaces (PUGSs),
is even more important than before. In 2015 the United Nations adopted the
Sustainable Development Goals (SDGs) as a “universal call to action to end poverty, protect the planet,
and ensure that by 2030 all people enjoy peace and prosperity” (United Nations Development Programme,
n.d.). Goal 11, “Sustainable cities and communities”, highlights that everyone should have universal
access to safe, inclusive, accessible, green, and public spaces. Therefore, it is important to study PUGSs,
so that we understand exactly how they influence neighbourhoods, and how we should
transform our cities to make the most of them.
PUGSs can be studied in various ways using tools and techniques related to GIS, satellite images,
and, more recently, machine learning. Spatial delineation of GIS elements (such as PUGSs) has often been
based on a re-classification of available land cover data combined with information about the natural
values of each cover class, or on visual interpretation of aerial imagery, remotely sensed data
interpretation, and manual digitalization (Skokanová et al., 2020). These processes are costly
and time-consuming, and the resulting datasets are often outdated due to irregular update frequency.
Therefore, in the last few years, various studies have tried to use different machine learning
algorithms for the detection of green spaces or land cover classification. Among them are supervised
maximum likelihood classification used on Sentinel-2A satellite imagery, provided in the frame of the
European Copernicus programme (Kopecká et al., 2017), random forest and support vector regression
used on Landsat imagery (Sharifi & Hosseingholizadeh, 2019), and Convolutional Neural Networks (CNNs),
specifically a U-Net architecture with very high resolution (VHR) imagery (Huerta et al., 2021).
This is especially relevant because CNNs have recently shown state-of-the-art performance in high-
level vision tasks, such as image classification and object detection (Chen et al., 2014). Some studies
have implemented CNNs with a transfer learning approach. Transfer learning builds upon learned knowledge
from one dataset to improve learning in another dataset (Wurm et al., 2019). It uses weights
pre-trained on datasets from other domains to improve learning accuracy and rate for a model
on a second task (Ulmas & Liiv, 2020).
To study urban green spaces with a more global approach, some studies used datasets that
are available worldwide, for example OpenStreetMap (OSM) and the Urban Atlas (Ludwig et al., 2021).
This information can become outdated quickly because of rapid city growth, and updating these
datasets is costly. There may also be inconsistencies in these datasets, because the definition
of urban green space differs across countries. That is why there is a need to create a uniform
dataset of urban green spaces.
The goal of this study is to analyse to what extent a reproducible CNN model that identifies public
urban green spaces based on open-source data can be created. The process involves creating two CNN
models. One is a baseline U-Net model built from scratch, and the other is a U-Net with a ResNet34
backbone (details discussed in Section 2.2) implemented from the Segmentation Models library (Pavel
Yakubovskiy, 2019) to make use of the transfer learning approach. Sentinel satellite imagery channels
and vegetation indices calculated from them, European Space Agency (ESA) WorldCover land cover,
and PUGS datasets from local open-source portals were used as input data. An updated PUGS dataset
at the metropolitan level, created by this study, would improve the understanding of public urban green
spaces for better urban area management.
Subsequently, Section 2 presents a literature review of related studies and network architectures.
Section 3 describes the input data, preprocessing methods, model setup, training, and metrics.
Section 4 presents the results, external test performance, and the created PUGS dataset. Section 5
is the discussion, and Section 6 concludes this study.
2. Literature
2.1 Related work
The literature on the use of remote sensing data for applications in urban planning, environmental
science, and other fields has a long and rich history (Albert et al., 2017). Recently, with technological
advancement and the evolution of deep learning, automating the acquisition of urban green
spaces has become possible through the detection of spectral and geometric patterns available in satellite
imagery (Khryaschev & Ivanovsky, 2019).
Convolutional networks, specifically a U-Net architecture with ResNet34 and ResNet50 backbones, were
used to study urban green spaces by Huerta et al. (2021). They used very high-resolution satellite
imagery, WorldView-2 with 0.5 m spatial resolution, and focused on one area. They had green, red, and
NIR bands, from which they calculated the Normalized Difference Vegetation Index (NDVI), the Two-band
Enhanced Vegetation Index (EVI2), and the Normalized Difference Water Index (NDWI), and built
three-band compositions to check which gives the best results. For measuring model performance,
they used the F1 score, also known as the Dice coefficient, which is often used as a metric to evaluate
semantic segmentation when there is class imbalance (the F1 score is also explained in Section 3.5). They
found that the NDVI-red-NIR composition gave the best results using the ResNet34 encoder, with
an F1 score of 0.5748 and an accuracy of 0.9503, and that different band compositions produce
models with different performance levels.
Albert et al. (2017) took a more global, open-source approach. They used convolutional neural
networks, Google Maps satellite images with 1.2 m spatial resolution, and the Urban Atlas dataset.
They used two different architectures, ResNet50 and VGG-16, and found that ResNet50 gave better
results for their purpose. They also studied the transferability of their models, applying a model learned
on data from one city to another city. They found that a model trained on a diverse set
of cities yielded better performance when applied at different locations than models trained
on individual cities. They also tested different tile sizes, found that 224x224 was the best tile
size for them, and suggested that images at smaller scales might not capture enough variation in urban
form. They used only the three visible bands: red, green, and blue.
While the previous studies used only spectral bands and combinations of indices, Gülçin and Akpınar
(2018) performed Object-Based Image Analysis (OBIA) to combine spectral and shape features to map
urban green spaces. They extracted the features from an orthophoto with three visible bands
(blue, green, and red) at 10 cm resolution, and also calculated the Green-Red Vegetation Index (GRVI).
The study focused on one area. They first performed a segmentation and selected optimal values for
the main parameters. To combine spectral and shape features, multiresolution segmentation was
implemented, which has been proven to be one of the most successful image segmentation
algorithms in the OBIA framework, though they did not use a deep learning approach.
2.2 Network architectures
Several semantic segmentation architectures are relevant for this task. The first worth mentioning is
the Pyramid Scene Parsing Network (PSPNet) created by Zhao et al. (2016), a semantic segmentation
model that utilises a pyramid parsing module, meaning that it has pooling kernels of different sizes,
smaller ones like 3x3 and bigger ones like 6x6. The pyramid parsing module is applied to harvest
different sub-region representations, followed by upsampling and concatenation layers to form the
final feature representation, which carries both local and global context information. Finally, the
representation is fed into a convolution layer to get the final per-pixel prediction. The local and global
clues together make the final prediction more reliable (Zhao et al., 2016). Its architecture is presented
in Figure 1.
The second is LinkNet, created by Chaurasia and Culurciello (2017). This network was designed to perform
semantic segmentation in real time for tasks such as self-driving vehicles, augmented reality, etc.
This is achieved by efficiently sharing the information learnt by the encoder with the decoder after each
downsampling block, which proves to be better than using pooling indices in the decoder or just using
fully convolutional networks in the decoder. This feature-forwarding technique gives good accuracy values
(Chaurasia & Culurciello, 2017). Its architecture is presented in Figure 2.
The third example is the U-Net model architecture. It is a convolutional network that was created for
biomedical image segmentation by Ronneberger et al. (2015), but various geoscience studies have already
used it for the segmentation or classification of satellite images and achieved good results (Huerta et
al., 2021; Ulmas & Liiv, 2020; Rakhlin et al., 2018). Its architecture is presented in Figure 3.
Figure 3 U-Net architecture example from Ronneberger et al. (2015). Each blue box corresponds
to a multi-channel feature map. The number of channels is denoted on top of the box.
The x-y-size is provided at the lower left edge of the box. White boxes represent
copied feature maps. The arrows denote the different operations.
The U-Net architecture is based on a Fully Convolutional Network (FCN) and similarly has a symmetric
encoder-decoder composition. The encoder, sometimes called the 'contracting path', is on the left side,
and it is responsible for capturing context. It follows the typical architecture of a convolutional network
and is used to detect features in an image. It consists of the repeated application of two 3x3
convolutions, each followed by a rectified linear unit (ReLU), and a 2x2 max pooling operation with
stride 2 for downsampling. Each downsampling step doubles the number of feature channels and
makes the feature maps smaller (Ulmas & Liiv, 2020).
The second, right part is called the decoder, and it is responsible for upsampling the feature map to the
input image's original size. It consists of a 2x2 convolution ('up-convolution') that halves the number
of feature channels, a concatenation with the correspondingly cropped feature map from the encoder
part, and two 3x3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss
of border pixels in every convolution. At the final layer, a 1x1 convolution is used to map each
64-component feature vector to the desired number of classes. In total, U-Net has 23 convolutional
layers (Ronneberger et al., 2015). The decoder part brings back the spatial information
of the input image (Ulmas & Liiv, 2020).
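To make this concrete, here is a minimal sketch of such encoder and decoder blocks in Keras (an illustration under assumptions, not code from this study; with 'same' padding, no cropping of the skip connections is needed, unlike the original valid-padding U-Net):

```python
from tensorflow.keras import layers

def encoder_block(x, n_filters):
    # Two 3x3 convolutions, each followed by ReLU, then 2x2 max pooling (stride 2)
    x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
    skip = x                       # feature map kept for the skip connection
    x = layers.MaxPooling2D(2)(x)  # halves the spatial size
    return x, skip

def decoder_block(x, skip, n_filters):
    # 2x2 up-convolution halves the feature channels, then concatenation with
    # the corresponding encoder map, then two 3x3 convolutions with ReLU
    x = layers.Conv2DTranspose(n_filters, 2, strides=2, padding="same")(x)
    x = layers.Concatenate()([x, skip])
    x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
    return x
```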
The biggest advantage of the U-Net architecture is that it effectively fuses low-level and high-level image
features by combining low-resolution and high-resolution feature maps through skip connections.
That is why it is so useful for geoscience and segmentation purposes, or other cases where we want an
output size equal to the input's size, rather than only labels of the detected features.
Some studies tried a backbone approach, using a different architecture as the encoder in their
chosen CNN architecture. A backbone network is used to extract basic features for detecting objects,
and is usually designed originally for image classification and pre-trained on the ImageNet dataset.
If a backbone can extract more representational features, its host detector will perform better
accordingly; in other words, a more powerful backbone can bring better detection performance (Liang
et al., 2021). Various studies used this approach of a pre-trained encoder (backbone) in another
CNN architecture (Huerta et al., 2021; Ulmas & Liiv, 2020; Pollatos et al., 2020), so a combination
of this approach and transfer learning, and found it to be successful.
The transfer learning approach allows the learning gained in a first task to be used to solve a new, more
complex task (Ulmas & Liiv, 2020). This approach was reported by other studies to yield both better
performance and faster training times, because the network has already learned to recognize basic
shapes and patterns (Albert et al., 2017). It is executed by using pre-trained weights as the starting
values for the weights while training a model for a second task and, as mentioned earlier, it is often done
by adding an encoder that has pre-trained weights to the chosen architecture (Huerta et al., 2021; Ulmas
& Liiv, 2020; Pollatos et al., 2020).
Various studies used different model architectures as encoders. ResNet50 was used as an encoder
in a U-Net architecture by Ulmas and Liiv (2020) and by Pollatos et al. (2020). A ResNet34 encoder
in a U-Net architecture was compared with other encoders and found to be the best by Rakhlin et al.
(2018) and Huerta et al. (2021) for tasks similar to this study's.
ResNet34 is a residual model with 34 layers, which uses a repeating pattern of layer blocks (Figure 4)
with skip connections. These are meant to avoid the vanishing gradient problem, which helps to maintain
performance and precision despite increases in the number of training layers (He et al., 2015). When
ResNet34 is used in the U-Net architecture as the encoder, it is used to detect features in an image and
carries out the downsampling process, as in the encoder-decoder composition explained earlier. At the
point of the greatest compression in the network, a decoder is attached that follows the principle of the
U-Net architecture to finally obtain an output equal in size to the input images (Huerta et al., 2021).
Figure 4 Residual learning: a building block from the ResNet architecture (He et al., 2015)
3. Data and Methods
The workflow of the whole methodological process is presented in Figure 5.
Figure 5 Summary of the methodological process for semantic segmentation of PUGSs using
deep learning. Input data in grey, data preprocessing in blue, CNN model setup and implementation
in yellow, and evaluation in green. Abbreviation: Public Urban Green Spaces (PUGS).
Visualisation methods adapted from Huerta et al. (2021)
3.1 Study area
This study was not centred around one particular case study area; instead, it focused on multiple cities
from across the world, with a focus on European and US cities. The cities used in this study
are presented in Table 1. Ten cities were used as training cities; two cities, Tel Aviv and Washington,
were used for external testing; and Kampala was used for prediction.
3.2 Input data
The satellite images used are Sentinel-2 multispectral images. They were downloaded from
the open-source online tool EarthExplorer (U.S. Geological Survey). The downloaded images had different
acquisition dates, which are presented in Table 2. Sentinel-2 products have radiometric and
geometric corrections along with orthorectification to generate highly accurate geolocated products.
This is especially important when using them to analyse areas in different parts of the world.
Sentinel-2 has 13 spectral bands with spatial resolutions of 10 m (three visible bands and a near-infrared
band), 20 m (six red-edge/shortwave-infrared bands), and 60 m (three atmospheric correction bands).
In this study the blue, green, red, near-infrared, and short-wave infrared bands were used.
PUGS data was collected from different open-source data sources. These sources are presented
in Table 2, along with the Sentinel image acquisition dates.
City | Source | URL | Sentinel-2 acquisition date
Amsterdam | Gemeente Amsterdam | https://maps.amsterdam.nl/open_geodata/?k=99 | 30.05.2020
Dhaka | Detail Area Plan (DAP) by RAJUK (city planning authority) | http://43.243.207.51/Website/R2_2_1/viewer.htm | 17.01.2020
Dublin | Dublinked | https://data.smartdublin.ie/ | 01.06.2020
Ghent | Open Data Stad Gent | https://data.stad.gent/explore/dataset/parken-gent/table/ | 24.06.2020
Tel-Aviv | TLV Open Data | https://opendata.tel-aviv.gov.il/en/pages/item.aspx?ids=10 | 14.07.2020
Vancouver | City of Vancouver Open Data Portal | https://opendata.vancouver.ca/explore/dataset/parks-polygon-representation/information/ | 29.06.2020
Kampala | - | - | 21.09.2020
Table 2 Data sources for PUGS data and Sentinel-2 image acquisition dates
Land cover data was also used: the European Space Agency (ESA) WorldCover product, a worldwide land
cover mapping that provides a new baseline global land cover product at 10 m resolution for 2020,
based on Sentinel-1 and Sentinel-2 data (European Space Agency). The land cover data was used to
evaluate whether adding already classified data would be beneficial for the segmentation model
performance. The city bounding boxes, which were used to crop the satellite images, were extracted from
OpenStreetMap using the OSMnx package.
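As an illustration, a bounding box like this could be obtained with OSMnx roughly as follows (a sketch; the exact query strings used in this study are not documented here):

```python
import osmnx as ox

# Geocode the city boundary to a GeoDataFrame and read off its bounding box
city = ox.geocode_to_gdf("Amsterdam, Netherlands")
west, south, east, north = city.total_bounds  # WGS84 coordinates
print(west, south, east, north)
```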
3.3 Data preprocessing
Preprocessing was done in QGIS 3.16.14 open-source software and included: cropping satellite images
and landcover to city bounding boxes, resampling SWIR band to 10 m resolution, rasterization of vector
urban greenspace dataset, and finally calculating 3 indices:
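A minimal sketch of computing these indices from band arrays (illustrative only; the dummy arrays stand in for bands cropped and resampled to 10 m):

```python
import numpy as np

def normalized_difference(a, b):
    """(a - b) / (a + b), guarding against division by zero."""
    denom = a + b
    return np.where(denom == 0, 0.0, (a - b) / denom)

# Dummy bands standing in for a cropped Sentinel-2 scene
rng = np.random.default_rng(0)
red, green, nir, swir = (rng.random((256, 256)) for _ in range(4))

ndvi = normalized_difference(nir, red)    # vegetation
ndwi = normalized_difference(green, nir)  # water
ndbi = normalized_difference(swir, nir)   # built-up areas
```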
After the QGIS preprocessing, two rasters per city were created: the first with 8 bands (blue, green,
red, NIR, NDVI, NDBI, NDWI, and ESA land cover), and the second a categorical raster with two values,
1 representing PUGSs and 0 representing the background. Both rasters and all their bands had
a 10 m spatial resolution.
Nine different three-band compositions were chosen, similar to what Huerta et al. (2021) did.
These compositions were: Blue-Green-Red, Green-Red-NIR, Red-NIR-NDVI, NDVI-NDBI-Landcover,
Red-NDWI-Landcover, NDVI-NDWI-Landcover, NIR-NDWI-NDBI, NDBI-NDWI-Landcover, and NIR-
NDWI-Landcover. For each band composition, one set of image chips per training city was created. This
process included creating image chips for each city from both the satellite image and the mask raster.
All the chips had a size of 256x256 pixels, but they were created with different strides. The stride
represents the step between two neighbouring chips, i.e., the number of pixels between two consecutive
chips. When the stride is smaller than the chip size, the chips overlap: a smaller stride, e.g., 30, makes
the chips overlap more than a bigger stride, e.g., 60. Strides were chosen for each city separately
depending on the city's size, which is explained further in Section 4.1. Chips were created with the
Patchify library (Weiyuan Wu et al., n.d.) and were saved as NumPy arrays on Google Drive. Examples
of chips are presented in Figure 6.
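A sketch of this chipping step with the Patchify library (illustrative; the stride value is per city, as described above):

```python
import numpy as np
from patchify import patchify

# Dummy stand-ins for one city's 3-band composition and its binary PUGS mask
image = np.zeros((2048, 2048, 3), dtype="float32")
mask = np.zeros((2048, 2048), dtype="uint8")

stride = 60  # step between neighbouring chips; < 256 means they overlap
image_chips = patchify(image, (256, 256, 3), step=stride).reshape(-1, 256, 256, 3)
mask_chips = patchify(mask, (256, 256), step=stride).reshape(-1, 256, 256)

np.save("image_chips.npy", image_chips)  # saved as NumPy arrays, as in the text
np.save("mask_chips.npy", mask_chips)
```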
The amount of public green space differs between cities, with Amsterdam and San Francisco having
only 13% and Dublin 26%, but it rarely goes up to 30% or 40% (World Cities Culture Forum), which
makes the dataset imbalanced. Because of that, it was decided to filter out the image chips with
a smaller percentage of public green space than a given threshold of 10%. This was done to get a more
balanced dataset to train the models on, and to make the data load smaller.
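The filtering step could look like this sketch (illustrative; it continues from the chipping example above):

```python
import numpy as np

def filter_chips(image_chips, mask_chips, threshold=0.10):
    """Keep only chip pairs whose mask has at least `threshold` PUGS pixels."""
    pugs_fraction = mask_chips.mean(axis=(1, 2))  # share of 1-pixels per chip
    keep = pugs_fraction >= threshold
    return image_chips[keep], mask_chips[keep]

# images, masks = filter_chips(image_chips, mask_chips)
```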
The next step was data augmentation. It was done using the Keras ImageDataGenerator class, with a batch
size of 8, and included:
• zoom up to 0.1,
• random horizontal flip,
• random vertical flip,
• reflect fill mode.
These techniques generated transformations for each epoch within the models and increased the size
of the training samples.
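A sketch of this augmentation setup with Keras' ImageDataGenerator (the parameters mirror the list above; pairing the image and mask generators through a shared seed is an assumption about the implementation):

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

aug_args = dict(
    zoom_range=0.1,        # zoom up to 0.1
    horizontal_flip=True,  # random horizontal flip
    vertical_flip=True,    # random vertical flip
    fill_mode="reflect",   # reflect fill mode
)
image_gen = ImageDataGenerator(**aug_args)
mask_gen = ImageDataGenerator(**aug_args)

# Dummy arrays standing in for the filtered chips
images = np.zeros((32, 256, 256, 3), dtype="float32")
masks = np.zeros((32, 256, 256, 1), dtype="float32")

# The same seed keeps the image and mask transformations in sync
train_flow = zip(image_gen.flow(images, batch_size=8, seed=42),
                 mask_gen.flow(masks, batch_size=8, seed=42))
```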
The input data was also split into training and validation sets, with 80% of the input data used for training
and 20% for validation. When using the model with a backbone, the setup also included preprocessing the
training and validation data to fit its architecture; this was done using a function from the Segmentation
Models library, from which the model itself also came.
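A sketch of the split and the backbone-specific preprocessing (sm.get_preprocessing is the Segmentation Models helper referred to above; the rest is illustrative):

```python
import os
os.environ["SM_FRAMEWORK"] = "tf.keras"  # select the tf.keras backend

import numpy as np
import segmentation_models as sm
from sklearn.model_selection import train_test_split

images = np.zeros((32, 256, 256, 3), dtype="float32")  # dummy chips
masks = np.zeros((32, 256, 256, 1), dtype="float32")

X_train, X_val, y_train, y_val = train_test_split(
    images, masks, test_size=0.2, random_state=42)  # 80% / 20% split

# Normalize inputs the way the pre-trained ResNet34 encoder expects
preprocess = sm.get_preprocessing("resnet34")
X_train, X_val = preprocess(X_train), preprocess(X_val)
```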
3.4 Model setup and training
The first model had each layer built separately, which made it possible to see how each layer was built
and to alter the dropout rate or the size of the convolution kernels. This model was built from
9 layers, had a dropout between 0.2 and 0.5 at each layer, and a sigmoid activation function. For the
training, a learning rate of 1e-4, a batch size of 16, and 50 epochs were used.
The second model was implemented from the Segmentation Models library, so it was more of a 'black
box' approach, but it had a ResNet34 backbone with weights pre-trained on the 2012 ILSVRC ImageNet
dataset. Using transfer learning in this study was a good way to try to improve model performance
and tackle the limited RAM issue. A learning rate of 1e-4, a batch size of 16, and 40 epochs were used.
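A minimal sketch of how this second model could be constructed and trained with the Segmentation Models library (a sketch under assumptions: the loss is the focal + dice combination from Section 3.5, and the hyperparameters follow the text, but the study's exact code may differ):

```python
import os
os.environ["SM_FRAMEWORK"] = "tf.keras"

import segmentation_models as sm
import tensorflow as tf

# U-Net with a ResNet34 encoder pre-trained on ImageNet; one sigmoid output class
model = sm.Unet("resnet34", input_shape=(256, 256, 3),
                encoder_weights="imagenet", classes=1, activation="sigmoid")

total_loss = sm.losses.DiceLoss() + sm.losses.BinaryFocalLoss()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss=total_loss,
              metrics=[sm.metrics.IOUScore(threshold=0.5),
                       sm.metrics.FScore(threshold=0.5)])

# model.fit(X_train, y_train, batch_size=16, epochs=40,
#           validation_data=(X_val, y_val))
```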
Setup and training of the models was done in Google Colab with the Pro subscription, which provided
a Tesla P100 PCIe 16 GB GPU and 25 GB of RAM. This allowed data from 10 cities to be used,
after deleting chips with less than 10% of PUGSs and using three-band compositions to train the models.
The data from the 10 training cities was divided into two sets, with 80% for training and
20% for validation. The validation set was used to choose the best model. Additional data from 2 external
cities was used for testing and evaluation of the chosen model, and a third external city was used
for prediction purposes.
3.5 Loss functions and evaluation metrics
In machine learning tasks, the loss function is used to evaluate the difference between training results
and labelled data. To decide which loss function fits best in this semantic segmentation project,
two factors were taken into consideration: the binary aspect and the imbalance. This model has two output
classes, PUGSs and background, which makes it a binary case. As mentioned before, PUGSs are a small
portion of a city; it can be between a few percent and a few dozen percent, but usually below 40%. This
causes an imbalance that could produce errors and a bias towards the background class that covers most
of the area of interest. Because of that, it was decided to use a combination of two loss functions that
were found to be useful by others: binary focal loss (Lin et al., 2017) and dice loss (Huerta et al., 2021).
Both losses were used from the Segmentation Models library (Pavel Yakubovskiy, 2019), and the total loss
was calculated as their sum:

$$L_{total} = L_{focal} + L_{dice}$$
Binary focal loss is a focal loss for binary classification. This loss function generalizes binary cross-
entropy by introducing a hyperparameter called the focusing parameter, which allows hard-to-classify
examples to be penalized more heavily relative to easy-to-classify examples. The focal loss is therefore
designed to address scenarios in which there is an extreme imbalance between foreground and
background classes during training (Lin et al., 2017). The library implementation defines it as:

$$L(gt, pr) = -gt \cdot \alpha \, (1 - pr)^{\gamma} \log(pr) - (1 - gt) \cdot \alpha \cdot pr^{\gamma} \log(1 - pr)$$

Where:
• gt – the ground truth label (0 or 1)
• pr – the predicted probability
• α – a weighting factor
• γ – the focusing parameter
Semantic segmentation studies that used deep learning have shown that the dice coefficient (F1 score)
is a loss function adequate for configurations with a big imbalance of classes. Using it helps
prevent errors and bias towards the background class that covers most of the area (Huerta et al.,
2021). The Dice coefficient is a measure of overlap between two sets and was used both as a loss
function and as a measurement metric. The dice loss in the used library is defined as follows:

$$DL(tp, fp, fn) = \frac{(1 + \beta^2) \cdot tp}{(1 + \beta^2) \cdot tp + \beta^2 \cdot fn + fp}$$

Where:
• tp – true positives
• fp – false positives
• fn – false negatives
• β – float or integer coefficient for precision and recall balance, using the default value of 1
The F1 score (dice coefficient) was also used as a metric in this study and can also be interpreted
as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and its
worst at 0. The relative contributions of precision and recall to the F1 score are equal. The formula
for the F1 score (dice coefficient) using precision and recall, as defined in the used library, is:

$$F1(precision, recall) = \frac{2 \cdot precision \cdot recall}{precision + recall}$$
To assess the performance of the model, this study used Intersection over Union (IoU) as the first
performance measurement and the already explained F1 score as the second one. This is because there
is a lot of imbalance in the dataset, so these measures fit the configuration best. IoU, also known as
the Jaccard index, computes the amount of overlap between the predicted polygons and the ground truth
data (Huerta et al., 2021), with values > 0.50 considered to be correctly predicted. The used function
in the library defines it as:

$$IoU(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Where:
• A – the predicted segmentation
• B – the ground truth segmentation
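To make these definitions concrete, a small NumPy sketch that mirrors the IoU and F1 formulas above on binary masks (illustrative, not the library's internals):

```python
import numpy as np

def iou_f1(pred, truth):
    """IoU and F1 (dice) for binary {0, 1} masks of equal shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)   # true positives
    fp = np.sum(pred & ~truth)  # false positives
    fn = np.sum(~pred & truth)  # false negatives
    iou = tp / (tp + fp + fn)          # |A ∩ B| / |A ∪ B|
    f1 = 2 * tp / (2 * tp + fp + fn)   # equals the dice score with beta = 1
    return iou, f1

pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [0, 0]])
print(iou_f1(pred, truth))  # IoU = 0.5, F1 = 0.667
```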
4. Results
4.1 Model setup results
The model setup included steps like creating chips, filtering them, and data augmentation. For the
process of creating data chips, different strides were used for different cities based on their size and
the proportion of PUGSs in the city. The stride for each city was chosen arbitrarily and is presented
along with the final number of chips in Table 3. The presented number of chips is after filtering out chips
that had less than 10% of PUGSs in them.
From this data it is clear that there is an imbalance: PUGS pixels are a much smaller percentage
of the input images than the background pixels. The proportion presented in this table, though, cannot
be treated as the proportion of PUGSs in a city, but in the city bounding box. A bounding box is a bigger
area than the city itself, making the imbalance bigger. This also meant that there were parts of the
mask raster with only background pixels, which created some chips without any PUGS pixels.
The data is also imbalanced when it comes to city representation. Some cities had more chips than
others; the cities with the most chips were Manchester, London, and Philadelphia. Manchester and
London are both a lot bigger than the rest of the cities, with an average percentage
of PUGS pixels. Philadelphia, on the other hand, is not that much bigger than the rest; it had a lot more
chips because it had a higher percentage of PUGS pixels.
The model setup also included data augmentation of matching image and mask chips. Examples are
presented in Figure 7.
Figure 7 Examples of data augmentation on image and mask chips. The first and third
rows are original chips, the second and fourth augmented ones
4.2 Training, validation results and choosing the best model
In this section, training and validation results are presented for all the created models. The training set
was 80% of the input data, and the validation set 20%. There were 18 models created in total, because
there were 9 different three-band compositions and two different model architectures: the baseline U-Net
model from scratch and the U-Net model with a pre-trained ResNet34 encoder. All the models were
trained on the 10 cities mentioned in Table 2. The aim was to find the best combination of model
architecture and band composition based on validation metrics. Table 4 shows the results for the baseline
U-Net model from scratch, and Table 5 for the U-Net with a pre-trained ResNet34 backbone.
Band composition | Training accuracy | Training IoU | Training F1 | Validation accuracy | Validation IoU | Validation F1 | Validation average
Blue-Green-Red | 0.9249 | 0.6753 | 0.8055 | 0.9212 | 0.6581 | 0.7931 | 0.7908
Green-Red-NIR | 0.9436 | 0.7435 | 0.8524 | 0.9406 | 0.7286 | 0.8423 | 0.8372
Table 4 Training and validation accuracy, IoU, and F1 score of different band compositions
for the baseline U-Net model from scratch
For the baseline U-Net from scratch, the average validation IoU was 0.7259 and the average validation
F1 score was 0.8403. For the U-Net with ResNet34 encoder they were 0.8681 and 0.9274, which is a 12%
rise for validation IoU and an 8% rise for validation F1 score. This visible rise shows that the U-Net with
a ResNet34 backbone achieved better performance than the baseline model, and that transfer learning
is a very good approach. The best model should therefore be chosen from among the models with
a pre-trained encoder.
The NIR-NDWI-NDBI composition with the U-Net with ResNet34 encoder architecture achieved the best
average validation score among all the models and was chosen as the best model, to be used later for
evaluation and predictions. It had a validation IoU of 0.8770 and a validation F1 score of 0.9326.
Band composition | Training accuracy | Training IoU | Training F1 | Validation accuracy | Validation IoU | Validation F1 | Validation average
Blue-Green-Red | 0.9552 | 0.8669 | 0.9266 | 0.9534 | 0.8616 | 0.9234 | 0.9128
Green-Red-NIR | 0.9581 | 0.8738 | 0.9308 | 0.9562 | 0.8677 | 0.9271 | 0.9170
Red-NIR-NDVI | 0.9570 | 0.8709 | 0.9291 | 0.9554 | 0.8659 | 0.9261 | 0.9158
NDVI-NDBI-Landcover | 0.9548 | 0.8667 | 0.9265 | 0.9530 | 0.8612 | 0.9232 | 0.9125
Red-NDWI-Landcover | 0.9607 | 0.8810 | 0.9351 | 0.9595 | 0.8769 | 0.9326 | 0.9230
NDVI-NDWI-Landcover | 0.9564 | 0.8699 | 0.9284 | 0.9545 | 0.8632 | 0.9243 | 0.9140
NIR-NDWI-NDBI | 0.9611 | 0.8813 | 0.9352 | 0.9600 | 0.8770 | 0.9326 | 0.9232
NDBI-NDWI-Landcover | 0.9601 | 0.8787 | 0.9338 | 0.9586 | 0.8740 | 0.9310 | 0.9212
NIR-NDWI-Landcover | 0.9573 | 0.8712 | 0.9293 | 0.9555 | 0.8656 | 0.9259 | 0.9157
Average | 0.9579 | 0.8734 | 0.9305 | 0.9562 | 0.8681 | 0.9274 |
Table 5 Training and validation accuracy, IoU, and F1 score of different band compositions
for the U-Net with a ResNet34 backbone model. Average validation values are given per band
composition, and the last row averages the metrics over all models. The chosen model
is the NIR-NDWI-NDBI composition
Transfer learning not only helps in achieving better performance, but also makes the training
process converge faster. For the U-Net with ResNet34 encoder, the learning process is also less bumpy
than for the model from scratch, meaning that the loss and IoU change more smoothly. Figure 8
compares the training process of the chosen best model, the NIR-NDWI-NDBI composition with the
U-Net with ResNet34 encoder, with the NDVI-NDWI-Landcover baseline U-Net model from scratch,
which had the highest average validation scores across the models from scratch.
Figure 8 Comparison of the training process between the NIR-NDWI-NDBI U-Net with
a ResNet34 backbone and NDVI-NDWI-Landcover baseline U-Net from scratch
4.3 External evaluation and prediction of PUGSs based on the best model
The external evaluation was based on a test set consisting of two external cities: Washington and Tel
Aviv. Neither of these cities was part of the training or validation process. The test performance of the
chosen model, the NIR-NDWI-NDBI U-Net with a ResNet34 encoder, is presented in Table 6.
The best model achieved an average test IoU of 0.5610 and an average test F1 score of 0.64515 across
the two external cities. Washington had an average of 0.6622 across the test metrics and Tel Aviv 0.544,
so the model performed visibly better on Washington than on Tel Aviv.
Differences between the validation metrics of the baseline models from scratch and the models with
a backbone were visible, but when we compare their predictions on the external dataset this difference
in performance is even more apparent. Figure 9 illustrates the differences in the prediction of
Washington PUGSs between the chosen NIR-NDWI-NDBI U-Net with a ResNet34 encoder and the
model from scratch that achieved the best validation metrics, the NDVI-NDWI-Landcover baseline model.
It is evident that the transfer learning model performs a lot better on an external test set.
Figure 9 Comparison of PUGS prediction in Washington. Left: ground truth; middle: the chosen
NIR-NDWI-NDBI U-Net with a ResNet34 encoder model; right: the best model from scratch,
the NDVI-NDWI-Landcover baseline. PUGSs are white, and the background is black
The best chosen model, the NIR-NDWI-NDBI U-Net with a ResNet34 encoder, was used to create new
PUGS datasets for 3 external cities: Washington, Tel Aviv, and Kampala. Figure 10 presents the ground
truth PUGS dataset and the prediction for Washington. Predictions for Tel Aviv and Kampala are shown
in the appendix. Looking at the predictions for Washington, some parts, like the big green
space in the north and the longitudinal green space in the middle, were predicted well, but there
are also some misclassifications that should be analysed up close.
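A minimal sketch of producing such a city-wide prediction map by tiling non-overlapping 256x256 chips and stitching them back together (illustrative; this study's actual stitching procedure is not documented here, and the stub below stands in for the trained network from Section 3.4):

```python
import numpy as np
from patchify import patchify, unpatchify

# Stand-in for the trained model (replace with the real trained network)
class _StubModel:
    def predict(self, batch):
        return np.zeros((len(batch), 256, 256, 1), dtype="float32")
model = _StubModel()

# Dummy stand-in for a preprocessed 3-band city raster
city = np.zeros((2048, 2048, 3), dtype="float32")

chips = patchify(city, (256, 256, 3), step=256)  # non-overlapping tiles
rows, cols = chips.shape[:2]
batch = chips.reshape(-1, 256, 256, 3)

probs = model.predict(batch)[..., 0]             # per-pixel PUGS probabilities
binary = (probs > 0.5).astype("uint8")           # threshold to PUGS / background

pred_map = unpatchify(binary.reshape(rows, cols, 256, 256), city.shape[:2])
```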
Figure 10 Left: ground truth PUGS data for Washington; right: predicted PUGS data for Washington.
PUGSs are green, the background is white
Looking more closely at the predicted and the ground truth PUGS data for Washington, a few groups
of misclassifications are present:
Figure 11 Left: true colour Washington image; right: satellite image overlaid with PUGS predictions
in green and ground truth PUGSs symbolised with cross filling
2. Some parts of PUGSs are built-up, not green space.
Figure 12 shows an example where part of a PUGS is infrastructure or a building, so a built-up
area, not green. The model may predict that those parts are background, not PUGS.
Figure 12 Left: true colour Washington image; right: satellite image overlaid with PUGS predictions
in green and ground truth PUGSs symbolised with cross filling
3. Green spaces that were not in the ground truth data and are probably not public.
Figure 13 shows an example of an area that was predicted as being a public urban green space,
but is probably a golf club, which is not public.
Figure 13 Left: true colour Washington image; right: satellite image overlaid with PUGS predictions
in green and ground truth PUGSs symbolised with cross filling
5. Discussion
5.1 Key findings
The two presented multi-city models, created to identify public urban green spaces at the metropolitan
level across the world, and their external accuracy show that this approach is promising. It allows for
the efficient creation of new PUGS datasets that yield useful information about location, geometry,
and other spatial attributes. This can be very helpful for local authorities in managing cities and in
decision-making about future urban development.
The transfer learning approach shows a really big improvement compared to a baseline model built
from scratch. ResNet34, which was used as the encoder for the U-Net model, was pre-trained on the 2012
ILSVRC ImageNet dataset on blue, green, and red bands. The benefits of this approach are
visible: the average validation performance of the U-Net with a ResNet34 backbone is better than that
of the baseline model built from scratch, and fewer epochs are needed for training.
The cities that these models were trained on were mostly from the US and Europe, and model performance
on the external testing dataset is higher for Washington than for Tel Aviv. Washington is a US city and
is spatially close to some of the cities that the model was trained on. Being located in the same country
and culture makes its structure and urbanization features more similar to the training cities than
Tel Aviv's. Tel Aviv is on a different continent, has a different urbanization style, and has
neighbourhoods that date back to the 14th century BC. These big differences are probably the reason
why Washington has a higher prediction performance.
This approach was based solely on open-source data, which makes it cheaper than using data from
third-party providers. This is especially important when thinking about cities in developing countries
in Africa or Asia: the fastest-growing cities in the world are on those continents (Investment Monitor,
2022), and those are usually the countries that do not have the resources to buy (very) high resolution
satellite images.
This study was done using a Google Colab Pro subscription, which shows that fairly advanced and
computationally heavy work can be done using relatively inexpensive services. This was possible because
of the relatively coarse spatial resolution of the images for this kind of study, and the deletion of image
chips that had less than 10% of PUGSs in them. Both of these choices made the data load smaller, and
thus enabled as many as 10 cities to be used for training.
5.2 Literature comparison
This study is one of many studies that focus on urban analysis with the use of satellite images
and CNN-type models. The methodology used in this project is fairly similar to that chosen by Huerta
et al. (2021), meaning that U-Net models with a ResNet encoder are used with a group of three-band
compositions to analyse urban green spaces. The biggest difference is that they used satellite images
with 0.5 m spatial resolution, in contrast to the 10 m resolution Sentinel images used by this study.
Huerta et al. (2021) also focused on one study area, whereas this study included 10 cities in the training
process. They got the best results for the NDVI-red-NIR composition using the ResNet34 encoder, with
an F1 score of 0.5748 and an evaluation IoU of 0.75. This study got the best results for the NIR-NDWI-
NDBI composition with the same architecture, a U-Net with ResNet34 encoder. It achieved an average
test IoU of 0.5610 and an average test F1 score of 0.64515 across two external cities. Model performance
in this study was calculated on 2 external cities that were not part of the training process, whereas
Huerta et al. (2021) used a subset of 1% of the initial training dataset for evaluation purposes.
The second study, by Albert et al. (2017), identified urban patterns, which is a different
purpose from this study's, but there are still a lot of similarities. Albert et al. (2017) worked on open-
source data, but the satellite images they used had a spatial resolution of 1.2 m. They experimented
with transfer learning and found that it gave them around a 5% increase in performance. For this
study the gain was bigger: a 12% rise in validation IoU and 8% in validation F1 score. They also noticed
that performance was poor when the model was trained on samples from one city and tested on
samples from a different city. This was visibly worse when cities were more different, and a bit better
when they were more similar. This is in line with the results of this study, especially in that a model
trained on mainly US and European cities had better performance on Washington than on Tel Aviv.
5.3 Limitations
There were three main limitations that this study faced. The first is connected to how complex the
definition of a public urban green space is. It is hard to determine what is public and what is private with
only satellite images and land cover data, and this was probably the main reason for the not-so-high test
performance. The model was trained to identify public urban green spaces, but looking at the
predictions for the external cities, especially the examples for Washington presented in Section 4.3,
suggests that it predicts urban green spaces both public and private.
The second limitation was the computational load, both the RAM capacity and the training time. This
study had a global approach, and 10 cities could be included after filtering out some chips, using the
25 GB of RAM available with the Google Colab Pro subscription. The training time of one model was
around an hour for transfer learning and two hours for the baseline model from scratch. This was
especially challenging because 18 models in total were created.
The third limitation is that this project used open-source Sentinel satellite images, which have
a relatively coarse spatial resolution compared to other studies in this field. This had the disadvantage
of being less detailed, yielding less information about objects, and not being able to depict smaller
details.
5.4 Future research
If future research is focused on achieving higher performance in general, then satellite images with
a finer spatial resolution should be taken into consideration. Using them would give many more
details and should yield better performance. Focusing on the difference between private and public
urban green spaces could also improve the performance. More varied data sources
connected to PUGSs, such as their size, proximity to accompanying infrastructure like roads and
bridges, frequency of visits, or other data about urban features and the ways they are used, could
be included in the input data to achieve better performance.
Future research could also focus on including even more cities in the training set. To do so, the technical
aspects mentioned above should be improved, such as having more available RAM, more disc
memory to host the data and models, and a better machine to execute the training process
faster. The use of services that provide ready-to-use image chips, such as Google Earth Engine, could
also be considered.
6. Conclusion
This study aimed to investigate the use of convolutional neural networks for identifying public urban
green spaces through multi-source, open-source data at the metropolitan level with a global focus.
This goal was attained with a moderate to good performance. The study focused on evaluating two
deep learning model setups for semantic segmentation of PUGSs: a U-Net architecture and a U-Net
with a pre-trained ResNet34 encoder architecture. It used the transfer learning approach to improve
the performance of the created model.
The results show that this approach is promising for creating new, open-source-based PUGS datasets
for various cities, especially the transfer learning approach, which gave an average test IoU of 0.5610
and an average test F1 score of 0.64515 across two external cities. This is quite a good result considering
the limitations of this study, but it also leaves room for improvement in future research. Using a pre-
trained encoder also allowed for fewer epochs, which makes the learning process faster.
The use of similar deep learning models could improve the identification of public urban green
spaces for management and development purposes. Nowadays most people live in urban areas, and
working towards making cities greener and more sustainable, with more accessible green spaces,
is one of the UN Sustainable Development Goals. If those goals are to be achieved, new ways
of urban management should be created, and this kind of deep learning approach could be part of that
process.
Acknowledgements
This thesis project was made in collaboration with S.M. Labib, Department of Human Geography
and Spatial Planning at Utrecht University, and Jiawei Zhao, MSc in Applied Data Science at Utrecht
University.
Appendix
GitHub repository link with all code - https://github.com/mar-koz22/Park-NET-identifying-Urban-
parks-using-multi-source-spatial-data-and-Geo-AI/tree/main/Marta's-approach
Figure 15 Left: Tel Aviv ground truth PUGS dataset; right: predicted PUGS data for Tel Aviv
using the NIR-NDWI-NDBI U-Net with ResNet34 encoder architecture. PUGSs are green,
the background is white
Figure 14 Right: true colour satellite image of Kampala; left: predicted PUGS data for Kampala
using the NIR-NDWI-NDBI U-Net with ResNet34 encoder architecture.
PUGSs are green, the background is white
References
Albert, A., Kaur, J., & Gonzalez, M. (2017). Using convolutional networks and satellite imagery to
identify patterns in urban environments at a large scale. http://arxiv.org/abs/1704.02965
Chaurasia, A., & Culurciello, E. (2017). LinkNet: Exploiting Encoder Representations for Efficient
Semantic Segmentation. https://doi.org/10.48550/ARXIV.1707.03718
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2014). Semantic Image
Segmentation with Deep Convolutional Nets and Fully Connected CRFs.
https://doi.org/10.48550/ARXIV.1412.7062
Gülçin, D., & Akpınar, A. (2018). Mapping Urban Green Spaces Based on an Object-Oriented
Approach. Bilge International Journal of Science and Technology Research.
https://doi.org/10.30516/bilgesci.486893
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition.
https://doi.org/10.48550/ARXIV.1512.03385
Huerta, R. E., Yépez, F. D., Lozano-García, D. F., Guerra Cobián, V. H., Ferriño Fierro, A. L., de León
Gómez, H., Cavazos González, R. A., & Vargas-Martínez, A. (2021). Mapping Urban Green
Spaces at the Metropolitan Level Using Very High Resolution Satellite Imagery and Deep
Learning Techniques for Semantic Segmentation. Remote Sensing, 13(11), 2031.
https://doi.org/10.3390/rs13112031
Investment Monitor. (2022, February). The ten fastest-growing cities in the world.
https://www.investmentmonitor.ai/analysis/fastest-growing-cities-in-the-world
Khryaschev, V., & Ivanovsky, L. (2019). Urban areas analysis using satellite image segmentation and
deep neural networks. E3S Web of Conferences, 135, 01064.
https://doi.org/10.1051/e3sconf/201913501064
Kopecká, M., Szatmári, D., & Rosina, K. (2017). Analysis of Urban Green Spaces Based on Sentinel-2A:
Case Studies from Slovakia. Land, 6(2), 25.
Liang, T., Chu, X., Liu, Y., Wang, Y., Tang, Z., Chu, W., Chen, J., & Ling, H. (2021). CBNetV2: A
Composite Backbone Network Architecture for Object Detection.
http://arxiv.org/abs/2107.00420
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal Loss for Dense Object Detection.
https://doi.org/10.48550/ARXIV.1708.02002
Ludwig, C., Hecht, R., Lautenbach, S., Schorcht, M., & Zipf, A. (2021). Mapping Public Urban Green
Spaces Based on OpenStreetMap and Sentinel-2 Imagery Using Belief Functions. ISPRS
International Journal of Geo-Information, 10(4), 251.
Pavel Yakubovskiy. (2019). Segmentation Models [Computer software].
https://github.com/qubvel/segmentation_models
Pollatos, V., Kouvaras, L., & Charou, E. (2020). Land Cover Semantic Segmentation Using ResUNet.
https://doi.org/10.48550/ARXIV.2010.06285
Rakhlin, A., Davydow, A., & Nikolenko, S. (2018). Land Cover Classification from Satellite Imagery with
U-Net and Lovász-Softmax Loss. 2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition Workshops (CVPRW).
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image
Segmentation. https://doi.org/10.48550/ARXIV.1505.04597
Sharifi, A., & Hosseingholizadeh, M. (2019). The Effect of Rapid Population Growth on Urban
Expansion and Destruction of Green Space in Tehran from 1972 to 2017. Journal of the Indian
Society of Remote Sensing.
Skokanová, H., González, I. L., & Slach, T. (2020). Mapping Green Infrastructure Elements Based on
Available Data, A Case Study of the Czech Republic. Journal of Landscape Ecology, 13(1),
85–103. https://doi.org/10.2478/jlecol-2020-0006
Ulmas, P., & Liiv, I. (2020). Segmentation of Satellite Imagery using U-Net Models for Land Cover
Classification. https://doi.org/10.48550/ARXIV.2003.02899
United Nations. (2018, May 16). 68% of the world population projected to live in urban areas by 2050,
says UN. https://www.un.org/development/desa/en/news/population/2018-revision-of-
world-urbanization-prospects.html
United Nations Development Programme. (n.d.). Sustainable Development Goals. Retrieved 15 June
2022, from https://www.undp.org/sustainable-development-goals
Weiyuan Wu, Divakar Verma, & Wangyin Yang. (n.d.). Patchify. https://pypi.org/project/patchify/
World Cities Culture Forum. (n.d.). % of public green space (parks and gardens).
http://www.worldcitiescultureforum.com/data/of-public-green-space-parks-and-gardens
Wurm, M., Stark, T., Zhu, X. X., Weigand, M., & Taubenböck, H. (2019). Semantic segmentation of
slums in satellite images using transfer learning on fully convolutional neural networks. ISPRS
Journal of Photogrammetry and Remote Sensing, 150, 59–69.
https://doi.org/10.1016/j.isprsjprs.2019.02.006
Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2016). Pyramid Scene Parsing Network.
https://doi.org/10.48550/ARXIV.1612.01105