Spatial Data r 3
Diego LEGROS
2022-10-10
Spatial Data
By spatial data (also called geospatial data) we mean data that contain both locational and attribute
information. Spatial data describe objects, events, or phenomena that have a location on the surface of
the earth. The location may be static in the short term (e.g., the location of a road, an earthquake event,
children living in poverty), or dynamic (e.g., a moving vehicle or pedestrian, the spread of an infectious
disease).
Geospatial data combine location information (usually coordinates on the earth), attribute information (the
characteristics of the object, event, or phenomenon concerned), and often also temporal information (the time
or life span during which the location and attributes exist).
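As a minimal sketch of these three components, the table below stores a location, an attribute, and a date for each record. The column names and all values are invented for illustration and do not come from a real dataset.

```r
# Hypothetical event records: location + attribute + time (invented values)
events <- data.frame(
  lon       = c(2.35, 5.04, -0.58),          # location: longitude
  lat       = c(48.86, 47.32, 44.84),        # location: latitude
  magnitude = c(3.1, 2.4, 4.0),              # attribute of the event
  observed  = as.Date(c("2022-01-15", "2022-03-02", "2022-06-20"))  # time
)

# Inspect the structure of the table
str(events)

# Events observed after 1 March 2022, with their coordinates and attribute
recent <- events[events$observed > as.Date("2022-03-01"),
                 c("lon", "lat", "magnitude")]
print(recent)
```

Keeping the three kinds of information in separate columns makes it possible to filter on time or attributes and still retain the coordinates.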
To manage spatial data, a Geographic Information System (GIS) is very useful. A GIS is a multi-component
environment used to create, manage, visualize and analyse spatial data, i.e. data with information about their
locations (address, longitude, latitude, Cartesian coordinates (X, Y), ...). In a GIS there are two main types of
data: vector and raster.
Vector data
Vector data can be seen as the digital version of communicating locations with coordinates. Just as people used to
share spatial information by writing coordinates on paper, they now share it by writing the coordinates
to files. It is as simple as that.
There are three subcategories of vector data: point, line, and polygon. A single coordinate pair is a point.
Houses, cars, and places where a particular incident occurs are usually represented by points. A series of
points connected to each other is a line. Roads, rivers, cables, and pipelines are typical examples of line data.
An enclosed area, created by connecting a number of lines, is a polygon. Neighborhoods, cities, and countries
are examples of polygon data.
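The three vector types can be sketched with plain coordinate matrices in base R. This is only an illustration of the idea with made-up coordinates; it is not how a GIS or a spatial package stores them internally.

```r
# A point: one coordinate pair (x, y)
point <- c(x = 2, y = 3)

# A line: an ordered series of connected points (one row per vertex)
line <- rbind(c(0, 0), c(1, 2), c(3, 3))
colnames(line) <- c("x", "y")

# A polygon: a closed ring, so the last vertex repeats the first
polygon <- rbind(c(0, 0), c(4, 0), c(4, 3), c(0, 3), c(0, 0))
colnames(polygon) <- c("x", "y")

# Length of the line: sum of the lengths of its segments
line_length <- sum(sqrt(rowSums(diff(line)^2)))

# Area of the polygon, computed with the shoelace formula
n <- nrow(polygon)
polygon_area <- abs(sum(polygon[-n, "x"] * polygon[-1, "y"] -
                        polygon[-1, "x"] * polygon[-n, "y"])) / 2

print(line_length)
print(polygon_area)
```

Only lines have a length and only polygons have an area, which is one reason the three types are kept distinct.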
Raster Data
Raster data, also known as grid data, is a spatial data type that is typically produced by imaging the earth
from above. Raster data is stored as a grid of pixels (sometimes called cells), where the grid is an
array of rows and columns. Satellite images and aerial photographs are typical examples of raster data.
Raster data comes in two kinds: single-band and multi-band (or single-layer and multi-layer). If a
raster has only one grid of pixels, it is called a single-band raster. Sometimes, however, a raster contains
more than one kind of information. In that case there are as many grids as there are kinds of
information, all of the same size and stacked on top of each other. Such a raster is called a multi-band raster.
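The difference between the two kinds can be sketched in base R with a matrix for a single band and a three-dimensional array for several bands. The cell values and the band names ("red", "green", "blue") are placeholders chosen for illustration.

```r
# Single-band raster: one grid of cells (3 rows x 4 columns)
single_band <- matrix(1:12, nrow = 3, ncol = 4)

# Multi-band raster: three grids of the same size stacked on top of each other
multi_band <- array(1:36, dim = c(3, 4, 3),
                    dimnames = list(NULL, NULL, c("red", "green", "blue")))

dim(single_band)   # rows, columns
dim(multi_band)    # rows, columns, bands

# Values of every band at the cell in row 2, column 3
cell_values <- multi_band[2, 3, ]
print(cell_values)
```

Every band shares the same rows and columns, so a single (row, column) position identifies the same place on the ground in all bands.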
are universal. For this course, we will focus on a subset of spatial data file formats: shapefile for vector data.
The shapefile format is a popular geospatial vector data format for geographic information system (GIS)
software for storing the location, shape, and attributes of geographic features.
A shapefile consists of several files that share the same core filename but have different suffixes (i.e. file
extensions). The format is developed and regulated by Esri as a (mostly) open specification for data
interoperability among Esri and other GIS software products.
A shapefile is stored in a set of related files and contains one feature class. The shapefile is by far the most
common geospatial file type you’ll encounter. A shapefile is only usable when all of its mandatory files are
present.
The files that make up a shapefile are:
.shp is a mandatory file that stores the geometry of the features. Every shapefile has its own .shp file, which
holds the spatial vector data, for example the points, lines, or polygons of a map.
.shx is a mandatory shape index file. It records the position of each feature in the .shp file, so that software
can seek forward and backward through the features.
.dbf is a standard database file used to store attribute data and object IDs. A .dbf file is mandatory for
shapefiles. You can open .dbf files in Microsoft Access or Excel.
.prj is an optional file that contains the metadata describing the shapefile’s coordinate and projection
system. If this file does not exist, you will get the error “unknown coordinate system”. To fix this
error, you can use the “define projection” tool, which generates a .prj file.
.xml is an optional file that contains the metadata associated with the shapefile. If you delete this file, you
essentially delete your metadata. You can open and edit it in any text editor.
.sbn is an optional spatial index file that optimizes spatial queries. This file type is saved together with a
.sbx file. These two files make up a spatial index that speeds up spatial queries.
.sbx is similar to .sbn in that it speeds up loading times. It works with the .sbn file to optimize
spatial queries.
.cpg is an optional plain text file that describes the encoding used to create the shapefile. If your shapefile
doesn’t have a .cpg file, then it uses the system default encoding.
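Because a shapefile only works when its mandatory components sit side by side, it can be handy to check a folder before reading it. The helper below is a hypothetical sketch in base R (the function name missing_components, the folder, and the layer name "roads" are all invented); it tests for the three mandatory extensions listed above.

```r
# Return the mandatory shapefile components that are missing for a layer
missing_components <- function(dir, layer) {
  required <- c("shp", "shx", "dbf")
  files <- list.files(dir)
  # Keep only the files whose core name (without extension) matches the layer
  same_layer <- files[tools::file_path_sans_ext(files) == layer]
  present <- tolower(tools::file_ext(same_layer))
  setdiff(required, present)
}

# Demonstration with empty placeholder files in a temporary folder
demo_dir <- file.path(tempdir(), "shp_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir,
                                c("roads.shp", "roads.dbf", "roads.prj"))))

# Lists the mandatory extensions that have no matching file
print(missing_components(demo_dir, "roads"))
```

The placeholder files are empty, so this only checks names; it does not verify that the files are valid.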
Geometry type   Attribute table   Class
Points          No                SpatialPoints
Points          Yes               SpatialPointsDataFrame
Lines           No                SpatialLines
Lines           Yes               SpatialLinesDataFrame
Polygons        No                SpatialPolygons
Polygons        Yes               SpatialPolygonsDataFrame
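The first two rows of the table can be illustrated with a small sketch, assuming the sp package is installed (the code is skipped otherwise). The coordinates and attribute values are invented for illustration.

```r
# Runs only if the sp package is available
if (requireNamespace("sp", quietly = TRUE)) {
  # Three hypothetical points given by their (x, y) coordinates
  coords <- cbind(x = c(1, 2, 3), y = c(4, 5, 6))

  # Geometry only: no attribute table
  pts <- sp::SpatialPoints(coords)

  # Geometry plus an attribute table with one row per point
  attrs <- data.frame(name = c("a", "b", "c"), value = c(10, 20, 30))
  pts_df <- sp::SpatialPointsDataFrame(coords, data = attrs)

  print(class(pts))
  print(class(pts_df))
}
```

The attribute table of the second object is stored in its data slot and can be reached with pts_df@data.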
The readOGR function takes two main arguments: dsn and layer. Exactly what you pass to these arguments
depends on the kind of data you are reading in. For a shapefile, dsn should be the path
to the directory in which the file is stored and layer is the filename of the shapefile (without any extension).
The arguments are separated by a comma, and when they are unnamed the order in which they are specified matters. You do not
have to type dsn = ... or layer = ... explicitly, as R matches unnamed arguments by position. For clarity, it is good
practice to include argument names when learning a new function, so we will continue to do so.
An example using the Columbus data available on the GeoDa website:
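The difference between matching by position and matching by name can be sketched with a hypothetical stand-in function. describe_source is invented for this illustration and only mimics the first two arguments of readOGR; the path "data/columbus" is a placeholder.

```r
# Hypothetical stand-in with the same first two argument names as readOGR
describe_source <- function(dsn, layer) {
  paste0("folder = ", dsn, ", shapefile = ", layer, ".shp")
}

# Unnamed arguments are matched by position
by_position <- describe_source("data/columbus", "columbus")

# Named arguments can be given in any order
by_name <- describe_source(layer = "columbus", dsn = "data/columbus")

# Swapping unnamed arguments silently changes their meaning
swapped <- describe_source("columbus", "data/columbus")

print(identical(by_position, by_name))
print(identical(by_position, swapped))
print(swapped)
```

Naming the arguments protects against this kind of silent swap, which is why the examples in this document spell out dsn = and layer =.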
# Ensure that rgdal package is installed
if (!require("rgdal")) install.packages("rgdal")
library(rgdal)
columbus <- readOGR(dsn = "/cloud/project/MappingWithR/columbus", layer = "columbus")
## It has 20 fields
## Integer64 fields read as strings: COLUMBUS_ COLUMBUS_I POLYID
In the code above, the readOGR function is used to load a shapefile and assign it to a new spatial object
called columbus. Another way is to first create an object that contains the location where the data are saved.
# Ensure that rgdal package is installed
if (!require("rgdal")) install.packages("rgdal")
library(rgdal)
# Store the folder where the data are saved
data.columbus <- "/cloud/project/MappingWithR/columbus"
columbus <- readOGR(dsn = data.columbus, layer = "columbus")
# Check the class of the new object
class(columbus)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
Let’s now analyse the columbus object with some basic commands:
head(columbus@data, n = 2)
# Summary statistics for every variable of the attribute table
summary(columbus@data)
## AREA PERIMETER COLUMBUS_ COLUMBUS_I
## Min. :0.03438 Min. :0.9021 Length:49 Length:49
## 1st Qu.:0.09315 1st Qu.:1.4023 Class :character Class :character
## Median :0.17477 Median :1.8410 Mode :character Mode :character
## Mean :0.18649 Mean :1.8887
## 3rd Qu.:0.24669 3rd Qu.:2.1992
## Max. :0.69926 Max. :5.0775
## POLYID NEIG HOVAL INC
## Length:49 Min. : 1 Min. :17.90 Min. : 4.477
## Class :character 1st Qu.:13 1st Qu.:25.70 1st Qu.: 9.963
## Mode :character Median :25 Median :33.50 Median :13.380
## Mean :25 Mean :38.44 Mean :14.375
## 3rd Qu.:37 3rd Qu.:43.30 3rd Qu.:18.324
## Max. :49 Max. :96.40 Max. :31.070
## CRIME OPEN PLUMB DISCBD
## Min. : 0.1783 Min. : 0.0000 Min. : 0.1327 Min. :0.370
## 1st Qu.:20.0485 1st Qu.: 0.2598 1st Qu.: 0.3323 1st Qu.:1.700
## Median :34.0008 Median : 1.0061 Median : 1.0239 Median :2.670
## Mean :35.1288 Mean : 2.7709 Mean : 2.3639 Mean :2.852
## 3rd Qu.:48.5855 3rd Qu.: 3.9364 3rd Qu.: 2.5343 3rd Qu.:3.890
## Max. :68.8920 Max. :24.9981 Max. :18.8111 Max. :5.570
## X Y NSA NSB
## Min. :24.25 Min. :24.96 Min. :0.0000 Min. :0.0000
## 1st Qu.:36.15 1st Qu.:28.26 1st Qu.:0.0000 1st Qu.:0.0000
## Median :39.61 Median :31.91 Median :0.0000 Median :1.0000
## Mean :39.46 Mean :32.37 Mean :0.4898 Mean :0.5102
## 3rd Qu.:43.44 3rd Qu.:35.92 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :51.24 Max. :44.07 Max. :1.0000 Max. :1.0000
## EW CP THOUS NEIGNO
## Min. :0.0000 Min. :0.0000 Min. :1000 Min. :1001
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1000 1st Qu.:1013
## Median :1.0000 Median :0.0000 Median :1000 Median :1025
## Mean :0.5918 Mean :0.4898 Mean :1000 Mean :1025
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1000 3rd Qu.:1037
## Max. :1.0000 Max. :1.0000 Max. :1000 Max. :1049
columbus@data[columbus$CRIME < 20.0485, ]
The line of code above asks R to select only the rows of the columbus attribute table where crime is lower than
the first quartile (20.0485). The square brackets work as follows: anything before the comma selects rows, and
anything after the comma selects columns. Leaving the position after the comma empty returns all the columns.
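The same bracket logic can be tried on a small table that does not require the shapefile. The zones data frame below is invented for illustration, and the first quartile is computed with quantile rather than typed by hand.

```r
# Hypothetical attribute table (invented values)
zones <- data.frame(
  id     = 1:6,
  crime  = c(12, 48, 35, 5, 60, 22),
  income = c(20, 9, 14, 25, 7, 16)
)

# First quartile of the crime variable
q1 <- quantile(zones$crime, probs = 0.25)

# Before the comma: the rows to keep. After the comma: empty, so all columns
low_crime <- zones[zones$crime < q1, ]

# Same rows, but only two named columns
low_crime_income <- zones[zones$crime < q1, c("id", "income")]

print(low_crime)
print(low_crime_income)
```

Computing the threshold in code avoids copying a rounded value from the summary output.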
We can compute the mean of a specific variable, for example the income.
mean(columbus$INC) # Compute the mean of the income
## [1] 14.37494
The $ symbol refers to the INC column (a variable within the table) in the data slot. The mean
function works here because INC is numeric. To check the class (that is, the type) of every
variable in a spatial dataset, you can use the sapply command:
sapply(columbus@data, class)
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X Y NSA NSB EW CP
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## THOUS NEIGNO
## "numeric" "numeric"
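One use of these classes is to pick out the numeric variables before computing statistics. The sketch below does this on a small invented table; the column names are placeholders.

```r
# Hypothetical table mixing a character column and numeric columns
zones <- data.frame(
  id     = c("A", "B", "C"),
  income = c(10, 14, 18),
  crime  = c(30, 20, 10),
  stringsAsFactors = FALSE
)

# Class of every column, as a named character vector
column_classes <- sapply(zones, class)
print(column_classes)

# Names of the numeric columns only
numeric_cols <- names(column_classes)[column_classes == "numeric"]

# Column means computed on the numeric columns
numeric_means <- colMeans(zones[, numeric_cols])
print(numeric_means)
```

Calling mean on a character column such as id would return NA with a warning, which is why the classes are checked first.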
To explore the columbus object further, try typing nrow(columbus) (which displays the number of rows) and record how
many zones the dataset contains. You can also try ncol(columbus).
numberOfZones <- nrow(columbus)
print(numberOfZones)
## [1] 49
Now that we have seen something of the structure of spatial objects in R, let us look at plotting them with
the plot function. The plot function is one of the most useful functions in R, as it changes its behavior
depending on the input data (computer scientists call this polymorphism). Passing another object,
such as plot(columbus@data), will generate an entirely different type of plot. Note that, for a spatial object,
the plot function uses the geometry data, contained primarily in the @polygons slot.
plot(columbus)
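The idea of a function that changes its behavior with the class of its input can be sketched with a small generic of our own. The describe function below is hypothetical and uses R's S3 dispatch; the sp package may register its plot methods through a different mechanism, so this only illustrates the principle.

```r
# A hypothetical generic: R picks the method from the class of x
describe <- function(x, ...) UseMethod("describe")

# Method used when x is a data frame
describe.data.frame <- function(x, ...) {
  paste("table with", nrow(x), "rows and", ncol(x), "columns")
}

# Method used when x is a numeric vector
describe.numeric <- function(x, ...) {
  paste("numeric vector of length", length(x))
}

# Fallback for every other class
describe.default <- function(x, ...) {
  paste("object of class", class(x)[1])
}

print(describe(data.frame(a = 1:3, b = 4:6)))
print(describe(c(1.5, 2.5)))
print(describe("text"))
```

The caller always writes describe(x); which body runs depends only on the class of x, just as plot draws a map for a spatial object and a scatterplot matrix for a data frame.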