Brendan P. Malone
Budiman Minasny
Alex B. McBratney
Using R for Digital Soil Mapping
Progress in Soil Science
Series editors
Alfred E. Hartemink, Department of Soil Science, FD Hole Soils Lab,
University of Wisconsin—Madison, USA
Alex B. McBratney, Sydney Institute of Agriculture,
The University of Sydney, Eveleigh, NSW, Australia
Aims and Scope
The Progress in Soil Science series aims to publish books that contain novel approaches
in soil science in its broadest sense – books should focus on true progress in a
particular area of the soil science discipline. The scope of the series is to publish
books that enhance the understanding of the functioning and diversity of soils
in all parts of the globe. The series includes multidisciplinary approaches to soil
studies and welcomes contributions of all soil science subdisciplines such as: soil
genesis, geography and classification, soil chemistry, soil physics, soil biology,
soil mineralogy, soil fertility and plant nutrition, soil and water conservation,
pedometrics, digital soil mapping, proximal soil sensing, digital soil morphometrics,
soils and land use change, global soil change, natural resources and the environment.
Brendan P. Malone
Sydney Institute of Agriculture
The University of Sydney
Eveleigh, NSW, Australia

Budiman Minasny
Sydney Institute of Agriculture
The University of Sydney
Eveleigh, NSW, Australia
Alex B. McBratney
Sydney Institute of Agriculture
The University of Sydney
Eveleigh, NSW, Australia
Foreword

Digital soil mapping is a runaway success. It has changed the way we approach
soil resource assessment all over the world. New quantitative DSM products with
associated uncertainty are appearing weekly. Many techniques and approaches have
been developed. We can map the whole world or a farmer’s field. All of this has
happened since the turn of the millennium. DSM is now beginning to be taught
in tertiary institutions everywhere. Government agencies and private companies
are building capacity in this area. Both practitioners of conventional soil mapping
methods and undergraduate and research students will benefit from following the
easily laid out text and associated scripts in this book carefully crafted by Brendan
Malone and colleagues. Have fun and welcome to the digital soil century.
Dominique Arrouays – Scientific coordinator of GlobalSoilMap.
Preface
Digital soil mapping (DSM) has evolved from a science-driven research phase of
the early 1990s to what is now a fully operational and functional process for spatial
soil assessment and measurement. This evolution is evidenced by the increasing
extents of DSM projects from small research areas towards regional, national and
even continental extents.
Significant contributing factors to the evolution of DSM have been the advances
in information technologies and computational efficiency in recent times. Such
advances have motivated numerous initiatives around the world to build spatial data
infrastructures aiming to facilitate the collection, maintenance, dissemination and
use of spatial information. Essentially, fine-scaled earth resource information of
improving qualities is gradually coming online. This is a boon for the advancement
of DSM. More importantly, however, the contribution of the DSM community in
general to the development of such generic spatial data infrastructure has been
through the ongoing creation and population of regional, continental and worldwide
soil databases from existing legacy soil information. Ambitious projects such as
those proposed by the GlobalSoilMap consortium, whose objective is to generate
a fine-scale 3D grid of a number of soil properties across the globe, provide
some guide to where DSM is headed operationally. We are also seeing in some
countries of the world the development of nationally consistent comprehensive
digital soil information systems—the Australian Soil Grid http://www.clw.csiro.au/
aclep/soilandlandscapegrid/ being particularly relevant in that regard. Besides the
mapping of soil properties and classes, DSM approaches have been extended to
other soil spatial analysis domains such as those of digital soil assessment (DSA)
and digital soil risk assessment (DSRA).
It is an exciting time to be involved in DSM. But with development and an
increase in the operational status of DSM, there comes a requirement to teach, share
and spread the knowledge of DSM. Put more simply, there is a need to teach more
people how to do it. This book attempts to share and disseminate some of that knowledge.
The focus of the materials contained in the book is to learn how to carry out DSM
in a real work situation. It is procedural and attempts to give the participant a taste
and a conceptual framework to undertake DSM in their own technical fields. The
book is very instructional—a manual of sorts—and therefore completely interactive
in that participants can access and use the available data and complete exercises
using the available computer scripts. The examples and exercises in the book
are delivered using the R computer programming environment. Consequently, this course provides training in both DSM and R. It will introduce some basic R operations and functionality in order to help you gain some fluency in this popular scripting
language. The DSM exercises will cover procedures for handling and manipulating
soil and spatial data in R and then introduce some basic concepts and practices
relevant to DSM, which importantly includes the creation of digital soil maps. As
you will discover, DSM is a broad term that entails many applications, of which a
few are covered in this book.
The material contained in this book has been cobbled together over successive
years from 2009. This effort has largely been motivated by the need to prepare a
hands-on DSM training course with associated materials as an outreach programme
of the Pedometrics and Soil Security research group at the University of Sydney. The
various DSM workshops have been delivered to a diverse range of participants: from
undergraduates, to postgraduates, to tenured academics, as well as both private and
government scientists and consultants. These workshops have been held both at the
Soil Security laboratories at the University of Sydney, as well as various locations
around the world. The ongoing development of teaching materials for DSM needs to
continue over time as new discoveries and efficiencies are made in the field of DSM
and, more generally, pedometrics. Therefore, we would be very grateful to receive
feedback and suggestions on ways to improve the book so that the materials remain
accessible, up to date and relevant.
Endorsements

This book entitled Using R for Digital Soil Mapping is an excellent book that
clearly outlines the step-by-step procedures required for many aspects of digital soil
mapping. This was my first time learning the R language and spatial modelling for DSM, but with this instructive book it is easy to produce different DSMs by following the text and associated R scripts. It has been especially useful in Taiwan for soil organic carbon stock mapping at different soil depths and across different parent materials and land uses. The other good experience is the clear pointers on how to prepare the covariates used to build the spatial prediction functions for DSM by regression models if we do not have enough soil data. I strongly recommend this excellent book to anyone applying DSM techniques to study spatial variability in the agricultural and environmental sciences.
Distinguished Professor Zueng-Sang Chen, Department of Agricultural
Chemistry, National Taiwan University, Taipei, Taiwan.
I can recommend this book as an excellent support for those wanting to learn
digital soil mapping methods. The hands-on exercises provide invaluable examples
of code for implementing in the R computing language. The book will certainly
assist you to develop skills in R. It will also introduce you to a very wide range
of powerful numerical and categorical modelling approaches that are emerging to
enable quantitative spatial and temporal inference of soil attributes at all scales from
local to global. There is also a valuable chapter on how to assess uncertainty of the
digital soil map that has been produced. The book exemplifies the quantum leap that
is occurring in quantitative spatial and temporal modelling of soil attributes, and is
a must for students of this discipline.
Carolyn Hedley, Soil Scientist, New Zealand.
Using R for Digital Soil Mapping is a fantastic resource that has enabled us to
develop and build our skills in digital soil mapping (DSM) from scratch, so much so
that this discipline has now become part of our agency core business in Tasmanian
land evaluation. Its thorough instructional content has enabled us to deliver a state-
wide agricultural enterprise suitability mapping programme, developing quantitative
Chapter 1
Digital Soil Mapping
In recent times we have borne witness to the advancement of the computer and
information technology ages. With such advances, there have come vast amounts of
data and tools in all fields of endeavor. This has motivated numerous initiatives
around the world to build spatial data infrastructures aiming to facilitate the
collection, maintenance, dissemination and use of spatial information. Soil science
potentially contributes to the development of such generic spatial data infrastructure
through the ongoing creation of regional, continental and worldwide soil databases,
which are now operational for some uses, e.g., land resource assessment and risk
evaluation (Lagacherie and McBratney 2006).
Unfortunately the existing soil databases are neither exhaustive enough nor
precise enough for promoting an extensive and credible use of the soil information
within the spatial data infrastructure that is being developed worldwide. The
main reason is that their present capacities only allow the storage of data from
conventional soil surveys which are scarce and sporadically available (Lagacherie
and McBratney 2006).
The main reason for this lack of soil spatial data is simply that conventional
soil survey methods are relatively slow and expensive. Furthermore, we have also
witnessed a global reduction in soil science funding that started in the 1980s
(Hartemink and McBratney 2008), which has meant a significant scaling back in
wide scale soil spatial data collection and/or conventional soil surveying.
To face this situation, it is necessary for the current spatial soil information
systems to extend their functionality from the storage and use of digitized (existing)
soil maps, to the production of soil maps ab initio (Lagacherie and McBratney
2006). This is precisely the aim of Digital Soil Mapping (DSM) which can be
defined as:
The creation and population of spatial soil information systems by numerical models inferring the spatial and temporal variations of soil types and soil properties from soil observation and knowledge from related environmental variables. (Lagacherie and McBratney 2006)
This definition is commonly formalised through the scorpan spatial prediction function, in which soil classes or attributes (S) are modelled as a function of soil (s), climate (c), organisms (o), relief (r), parent material (p), age (a) and spatial position (n):

S = f(s, c, o, r, p, a, n) + ε

or

S = f(Q) + ε
Long-handed, the equation states that the soil type or attribute at an unvisited site (S) can be predicted from a numerical function or model (f) given the factors just described, plus the locally varying, spatially dependent residuals (ε). The f(Q) part
of the formulation is the deterministic component or in other words, the empirical
quantitative function linking S to the scorpan factors (Lagacherie and McBratney
2006). The scorpan factors or environmental covariates come in the form of spatially
populated digitally available data, for instance from digital elevation models and
the indices derived from them (slope, aspect, MRVBF, etc.), Landsat data and other remote sensing images, radiometric data, geological survey maps, and legacy soil maps and data, just to name a few. For the residuals (ε) part of the formulation, we assume
there to be some spatial structure. This is for a number of reasons which include
that the attributes used in the deterministic component were inadequate, interactions
between attributes were not taken into account, or the form of f() was mis-specified.
Overall this general formulation is called the scorpan kriging method, where the
kriging component is the process of defining the spatial trend of the residuals (with
variograms) and using kriging to estimate the residuals at the non-visited sites.
Without getting into detail with regards to some of the statistical nuances such as
bias issues—which can be prevalent when using legacy soil point data for DSM—
that are encountered with using this type of data, the application of scorpan kriging
can only be done in extents where there is available soil point data. The challenge
therefore is: what to do in situations where this type of data is not available? In the
context of mapping key soil attributes globally, this is a problem, but it can be overcome with the use of other sources of legacy soil data, such as existing soil
maps. It is even more of a problem when this information is not available either.
However, in the context of global soil mapping, Minasny and McBratney (2010)
proposed a decision tree structure for actioning DSM on the basis of the nature of
available legacy soil data. This is summarized in Fig. 1.1. But bear in mind that this
Fig. 1.1 A decision tree for digital soil mapping based on legacy soil data (Adapted from Minasny
and McBratney 2010)
decision tree is not constrained only to DSM at a global scale but at any mapping
extent where the user wishes to perform DSM given the availability of soil data for
their particular area.
As can be seen from Fig. 1.1, once you have defined an area of interest, and
assembled a suite of environmental covariates for that area, then determined the
availability of the soil data there, you follow the respective pathway. scorpan kriging
is performed exclusively when there is only point data, but can be used also when
there is both point and map data available, e.g., (Malone et al. 2014). The work flow
is quite different when there is only soil map information available. Bear in mind
that the quality of the soil map depends on its scale and, subsequently, the variation of the soil cover it captures; larger scale maps, e.g., 1:100,000, would be considered better and more detailed than smaller scale maps, e.g., 1:500,000. The elemental basis
for extracting soil properties from legacy soil maps comes from the central and
distributional concepts of soil mapping units. For example, modal soil profile data
of soil classes can be used to quickly build soil property maps. Where mapping
units consist of more than one component, we can use a spatially weighted means
type method i.e., estimation of the soil properties is based on the modal profile
of the components and the proportional area of the mapping unit each component
covers, e.g., (Odgers et al. 2012). As a pre-processing step prior to creating soil
attribute maps, it may be necessary to harmonize soil mapping units (in the case of
adjacent soil maps) and/or perform some type of disaggregation technique in order
to retrieve the map unit component information. Some approaches for doing so have
been described in Bui and Moran (2003). More recently soil map disaggregation has
been a target of DSM interest with a sound contribution from Odgers et al. (2014)
for extracting individual soil series or soil class information from convolved soil
map units by way of the DSMART algorithm. The DSMART algorithm can best be explained as a data mining algorithm with repeated resampling. Furthering the
DSMART algorithm, Odgers et al. (2015) then introduced the PROPR algorithm
which takes probability outputs from DSMART together with modal soil profile
data of given soil classes, to estimate soil attribute quantities (with estimates of
uncertainty).
What is the process when there is no soil data available at all? This is obviously
quite a difficult situation to confront, but a real one at that. The central concept that
was discussed by Minasny and McBratney (2010) for addressing these situations is
based on the assumed homology of soil forming factors between a reference area
and the region of interest for mapping. Malone et al. (2016) provides a further
overview of the topic together with a real world application which compared
different extrapolating functions. Overall, the soil homologue concept, or Homosoil, is still in its development relative to other areas of DSM research. But considering, from a global perspective, the sparseness of soil data and the limited research funds for new soil survey, application of the Homosoil approach or other analogues will become increasingly important for the operational advancement of DSM.
This book covers some of the territory that is described in Fig. 1.1, particularly
the scorpan kriging type approach of DSM; as this is probably most commonly
undertaken. Also covered is spatial disagregation of polygonal maps. This is framed
in the context of updating digital soil maps and downscaling in terms of deriving
soil class or attribute information from aggregated soil mapping units. Importantly
there is a theme of implementation throughout this book; it is a how-to guide of sorts. So
there are some examples of how to create digital soil maps of both continuous
and categorical target variable data, given available points and a portfolio of
available covariates. The procedural detail is explained and implemented using the
R computing language. Consequently, some effort is required to become literate in this programming language, both for general purpose usage and for DSM and other related soil studies. With a few exceptions, all the data used in this book to demonstrate methods, together with additional functions, are provided via the R package ithir. This package can be downloaded free of cost. Instructions for getting
this package are in the next chapter.
The motivation of the book then shifts to operational concerns and is based around real case studies. For example, the book looks at how we might statistically validate
a digital soil map. Another operational study is that of digital soil assessment (Carre
et al. 2007). Digital soil assessment (DSA) is akin to the translation of digital
soil maps into decision making aids. These could be risk-based assessments, or
assessing threats to soil (erosion, decline of organic matter etc.), and assessing
soil functions. These types of assessments cannot be easily derived from a digital soil map alone, but require some form of post-processing inference. This could be done with quantitative modeling and/or a deep mechanistic understanding of the
assessment that needs to be made. A natural candidate in this realm of DSM is land
capability or agricultural enterprise suitability. A case study of this type of DSA is
demonstrated in this book. Specific topics of this book include:
1. Attainment of R literacy in general and for DSM.
2. Algorithmic development for soil science.
3. General GIS operations relevant to DSM.
4. Soil data preparation, examination and harmonization for DSM.
5. Quantitative functions for continuous and categorical (and combinations of
both) soil attribute modeling and mapping.
6. Quantifying digital soil map uncertainty.
7. Assessing the quality of digital soil maps.
8. Updating, harmonizing and disaggregating legacy soil mapping.
9. Digital soil assessment in terms of land suitability for agricultural enterprises.
10. Digital identification of soil homologues.
References
Bui E, Moran CJ (2003) A strategy to fill gaps in soil survey over large spatial extents: an example
from the Murray-Darling basin of Australia. Geoderma 111:21–41
Carre F, McBratney AB, Mayr T, Montanarella L (2007) Digital soil assessments: beyond DSM.
Geoderma 142(1–2):69–79
Hartemink AE, McBratney AB (2008) A soil science renaissance. Geoderma 148:123–129
Jenny H (1941) Factors of soil formation. McGraw-Hill, New York
Lagacherie P, McBratney AB (2006) Spatial soil information systems and spatial soil inference systems: perspectives for digital soil mapping, chapter 1. In: Digital soil mapping: an introductory perspective. Elsevier, Amsterdam, pp 3–22
Malone BP, Minasny B, Odgers NP, McBratney AB (2014) Using model averaging to combine soil
property rasters from legacy soil maps and from point data. Geoderma 232–234:34–44
Malone BP, Jha SK, Minasny B, McBratney AB (2016) Comparing regression-based digital soil
mapping and multiple-point geostatistics for the spatial extrapolation of soil data. Geoderma
262:243–253
McBratney AB, Mendonca Santos ML, Minasny B (2003) On digital soil mapping. Geoderma
117:3–52
Minasny B, McBratney AB (2010) Methodologies for global soil mapping, chapter 34. In: Digital soil mapping: bridging research, environmental application, and operation. Springer, Dordrecht, pp 429–425
Odgers NP, Libohova Z, Thompson JA (2012) Equal-area spline functions applied to a legacy
soil database to create weighted-means maps of soil organic carbon at a continental scale.
Geoderma 189–190:153–163
Odgers NP, McBratney AB, Minasny B (2015) Digital soil property mapping and uncertainty
estimation using soil class probability rasters. Geoderma 237–238:190–198
Odgers NP, Sun W, McBratney AB, Minasny B, Clifford D (2014) Disaggregating and harmonising
soil map units through resampled classification trees. Geoderma 214–215:91–100
Chapter 2
R Literacy for Digital Soil Mapping
2.1 Objective
The immediate objective here is to skill up in data analytics and basic graphics
with R. The range of analyses that can be completed, and the types of graphics that can be created in R, are simply astounding. In addition to the wide variety of
functions available in the “base” packages that are installed with R, more than
4500 contributed packages are available for download, each with its own suite of
functions. Some individual packages are the subject of entire books.
For this chapter of the book and the later chapters that will deal with digital soil
mapping exercises, we will not be able to cover every type of analysis or plot that
R can be used for, or even every subtlety associated with each function covered in this book. Given its inherent flexibility, R is difficult to master in the way one might master a stand-alone software package. Rather, R is an environment in which one can continually increase one's knowledge and fluency; effectively, learning R is a boundless pursuit of knowledge.
In a disclaimer of sorts, this introduction to R borrows many ideas and structures from the plethora of online materials that are freely available on the internet. It will be worth your while to do a Google search from time to time if you get stuck—you will be amazed to find how many other R users have had, or are having, the same problems as you.
2.2 Introduction to R
R is available for Windows, Mac, and Linux operating systems. Installation files
and instructions can be downloaded from the Comprehensive R Archive Network
(CRAN) site at http://cran.r-project.org/. Although the graphical user interface
(GUI) differs slightly across systems, the R commands do not.
There are two basic ways to use R on your machine: through the GUI, where R
evaluates your code and returns results as you work, or by writing, saving, and then
running R script files. R script files (or scripts) are just text files that contain the same
types of R commands that you can submit to the GUI. Scripts can be submitted to
R using the Windows command prompt, other shells, batch files, or the R GUI. All
the code covered in this book is or is able to be saved in a script file, which then
can be submitted to R. Working directly in the R GUI is great for the early stages of
code development, where much experimentation and trial-and-error occurs. For any
code that you want to save, rerun, and modify, you should consider working with R
scripts.
So, how do you work with scripts? Any simple text editor works—you just need
to save text in the ASCII format i.e., “unformatted” text. You can save your scripts
and either call them up using the command source (“file_name.R”) in the
R GUI, or, if you are using a shell (e.g., Windows command prompt) then type
R CMD BATCH file_name.R. The Windows and Mac versions of the R GUI come with a basic script editor, shown below in Fig. 2.1.
Unfortunately, this editor is not very good; the Windows version, for example, does not have syntax highlighting.
Fig. 2.1 R GUI, its basic script editor, and plot window
There are some useful (in most cases, free) text editors available that can be
set up with R syntax highlighting and other features. TINN-R is a free text editor
http://nbcgib.uesc.br/lec/software/des/editores/tinn-r/en that is designed specifically
for working with R script files. Notepad++ is a general purpose text editor, but
includes syntax highlighting and the ability to send code directly to R with the
NppToR plugin. A list of text editors that work well with R can be found at: http://www.sciviews.org/_rgui/projects/Editors.html.
2.2.4 RStudio
corner. Also in this region are various help documents, plus information and documentation regarding which packages and functions are currently available to use.
The frame on the left is where the action happens. This is the R console. Every
time you launch RStudio, it will have the same text at the top of the console
telling you the version that is being used. Below that information is the prompt.
As the name suggests, this is where you enter commands into R. So let's enter some
commands.
Before we start anything, it is good to get into the habit of making scripts of our
work. With RStudio launched, go to the File menu, then New, and then R Script. A new blank window will open in the top left panel. Here you can enter your R commands. For example, type the following: 1+1. Now roll your pointer over the top of the panel to the right-pointing green arrow (the first one), which is a button for running the
line of code down to the R console. Click this button and R will evaluate it. In the
console you should see something like the following:
1 + 1
## [1] 2
You could have just entered the command directly into the prompt and gotten the
same result. Try it now for yourself. You will notice a couple of things about this
code. The > character is the prompt that will always be present in the GUI. The
line following the command starts with a [1], which is simply the position of the
adjacent element in the output—this will make some sense later.
For the above command, the result is printed to the screen and lost—there is no
assignment involved. In order to do anything other than the simplest analyses, you
must be able to store and recall data. In R, you can assign the results of commands to
symbolic variables (as in other computer languages) using the assignment operator
<-. Note that other computer scripting languages often use the equals sign (=) as
the assignment operator. When a command is used for assignment, the result is no
longer printed to the GUI console.
x <- 1 + 1
x
## [1] 2
x < -1 + 1
## [1] FALSE
In this case, putting a space between the two characters that make up the assignment operator causes R to interpret the command as an expression that asks if x is less than zero. However, spaces usually do not matter in R, as long as they do not separate a single operator or a variable name. This, for example, is fine:
x <- 1
Note that you can recall a previous command in the R GUI by hitting the up
arrow on your keyboard. This becomes handy when you are debugging code.
When you give R an assignment, such as the one above, the object referred to as
x is stored into the R workspace. You can see what is stored in the workspace by
looking to the workspace panel in RStudio (top right panel). Alternatively, you can
use the ls function.
ls()
## [1] "x"
rm(x)
x
As you can see, you will get an error if you try to evaluate what x is.
If you want to assign the same value to several symbolic variables, you can use
the following syntax.

x <- X <- 1 + 1
x

## [1] 2

X

## [1] 2
In R, commands can be separated by moving onto a new line (i.e., hitting enter)
or by typing a semicolon (;), which can be handy in scripts for condensing code. If
a command is not completed in one line (by design or error), the typical R prompt
> is replaced with a +.
x<-
+ 1+1
There are several operators that are used in the R language. Some of the more
common are listed below.
Arithmetic
+ - * / ^ plus, minus, multiply, divide, power
Relational
a == b a is equal to b (do not confuse with =)
a != b a is not equal to b
a < b a is less than b
a > b a is greater than b
a <= b a is less than or equal to b
a >= b a is greater than or equal to b
Logical/grouping
! not
& and
| or
Indexing
$ part of a data frame
[] part of a data frame, array, list
[[]] part of a list
Grouping commands
{} specifying a function, for loop, if statement etc.
Making sequences
a:b returns the sequence a, a+1, a+2, ..., b
Others
# commenting (very very useful!)
; alternative for separating commands
~ model formula specification
() order of operations, function arguments
Commands in R operate on objects, which can be thought of as anything that can
be assigned to a symbolic variable. Objects include vectors, matrices, factors, lists,
data frames, and functions. Excluding functions, these objects are also referred to
as data structures or data objects.
When you want to finish up an R session, RStudio will ask you if you want to “save workspace image”. This refers to the workspace that you have created, i.e., all the objects you have created or even loaded. It is generally good practice to save your workspace after each session. More important, however, is the need to save all the commands that you have created in your script file. Saving a script file in RStudio is just like saving a Word document. Give both a go—save the script file and then save the workspace. You can then close RStudio.
The term “data type” refers to the type of data that is present in a data structure, and
does not describe the data structure itself. There are four common types of data in R:
numerical, character, logical, and complex numbers. These are referred to as modes
and are shown below:
Numerical data
x <- 10.2
x
## [1] 10.2
Character data
Any time character data are entered in the R GUI, you must surround individual
elements with quotes. Otherwise, R will look for an object.
Either single or double quotes can be used in R. When character data are read
into R from a file, the quotes are not necessary.
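For example (the value assigned here is purely illustrative; the class of this variable is checked again a little further below):

name <- "John Doe"
name

## [1] "John Doe"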
Logical data contain only three values: TRUE, FALSE, or NA (NA indicates a missing value—more on this later). R will also recognize T and F (for true and false, respectively), but these are not reserved and can therefore be overwritten by the user; it is therefore good to avoid using these shortened terms.
a <- TRUE
a
## [1] TRUE
Note that there are no quotes around the logical values—this would make them
character data. R will return logical data for any relational expression submitted to
it.
4 < 2
## [1] FALSE
or
b <- 4 < 2
b
## [1] FALSE
And finally, complex numbers, which will not be covered in this book, are the final data type in R.

10 + 3i

## [1] 10+3i
You can use the mode or class function to see what type of data is stored in
any symbolic variable.
class(name)
## [1] "character"
class(a)
## [1] "logical"
class(x)
## [1] "numeric"
mode(x)
## [1] "numeric"
Data in R are stored in data structures (also known as data objects)—these are the objects that you will perform calculations on, plot data from, etc. Data structures in R include vectors, matrices, arrays, data frames, lists, and factors. In a following section we will learn how to make use of these different data structures. The examples below simply give you an idea of their structure.
Vectors are perhaps the most important type of data structure in R. A vector is
simply an ordered collection of elements (e.g., individual numbers).
x <- 1:12
x
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
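Matrices are two-dimensional collections of elements (rows and columns); a minimal sketch, with illustrative values, is:

X <- matrix(1:6, nrow = 2)
X

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6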
Arrays are similar to matrices, but can have more than two dimensions.
Y <- array(1:30, dim = c(2, 5, 3))
Y
## , , 1
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
##
## , , 2
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 11 13 15 17 19
## [2,] 12 14 16 18 20
##
## , , 3
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 21 23 25 27 29
## [2,] 22 24 26 28 30
One feature that is shared for vectors, matrices, and arrays is that they can only
store one type of data at once, be it numerical, character, or logical. Technically
speaking, these data structures can only contain elements of the same mode.
Data frames are similar to matrices—they are two-dimensional. However, a data
frame can contain columns with different modes. Data frames are similar to data
sets used in other statistical programs: each column represents some variable, and
each row usually represents an “observation”, or “record”, or “experimental unit”.
dat <- (data.frame(profile_id = c("Chromosol", "Vertosol", "Sodosol"),
FID = c("a1", "a10", "a11"), easting = c(337859, 344059, 347034),
northing = c(6372415, 6376715, 6372740), visited = c(TRUE, FALSE, TRUE)))
dat
Lists are similar to vectors, in that they are an ordered collection of elements, but
with lists, the elements can be other data objects (the elements can even be other
lists). Lists are important in the output from many different functions. In the code
below, the variables defined above are used to form a list.
summary.1 <- list(1.2, x, Y, dat)
summary.1
## [[1]]
## [1] 1.2
##
## [[2]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
##
## [[3]]
## , , 1
##
Note that a particular data structure need not contain data to exist. This may seem
unusual, but it can be useful when it is necessary to set up an object for holding some
data later on.
x <- NULL
Real data sets often contain missing values. R uses the marker NA (for “not
available”) to indicate a missing value. Any operation carried out on an NA will
return NA.
x <- NA
x - 2
## [1] NA
Note that the NA used in R does not have the quotes around it—this would make
it character data. To determine if a value is missing, use the is.na function—this function
can also be used to set elements in a data object to NA.
is.na(x)
## [1] TRUE
!is.na(x)
## [1] FALSE
Indefinite values are indicated with the marker NaN, for “not a number”. Infinite
values are indicated with the markers Inf or -Inf. You can find these values with
the functions is.infinite, is.finite, and is.nan.
In R, you can carry out complicated and tedious procedures using functions.
Functions require arguments, which include the object(s) that the function should act
upon. For example, the function sum will calculate the sum of all of its arguments.
sum(1, 12.5, 3.33, 5, 88)

## [1] 109.83
The arguments in (most) R functions can be named, i.e., by typing the name of
the argument, an equal sign, and the argument value (arguments specified in this
way are also called tagged). For example, for the function plot, the help file lists
the following arguments.
plot (x, y,...)
a <- 1:10
b <- a
plot(x = a, y = b)
plot(a, b)
This code does the same as the previous code. The expected position of
arguments can be found in the help file for the function you are working with or
by asking R to list the arguments using the args function.
args(plot)
It usually makes sense to use the positional arguments for only the first few
arguments in a function. After that, named arguments are easier to keep track of.
Many functions also have default argument values that will be used if values are not
specified in the function call. These default argument values can be seen by using
the args function and can also be found in the help files. For example, for the
function rnorm, the arguments mean and sd have default values.
args(rnorm)
Any time you want to call up a function, you must include parentheses after it,
even if you are not specifying any arguments. If you do not include parentheses, R
will return the function code (which at times might actually be useful).
Note that it is not necessary to use explicit numerical values as function arguments—symbolic variable names which represent appropriate data structures can be used. It is also possible to use functions as arguments within functions. R will evaluate such expressions from the inside outward. While this may seem trivial, this quality makes R very flexible. There is no explicit limit to the degree of nesting that can be used. You could use:
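For example (the values other than 8.4, 1.2 and 7, and the sample size passed to rnorm, are placeholders):

plot(rnorm(n = 100, sd = sqrt(mean(c(9.2, 3.6, sum(8.4, 1.2, 7))))))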
The above code includes 5 levels of nesting (the sum of 8.4, 1.2 and 7 is combined
with the other values to form a vector, for which the mean is calculated, then the
square root of this value is taken and used as the standard deviation in a call to
rnorm, and the output of this call is plotted). Of course, it is often easier to assign
intermediate steps to symbolic variables. R evaluates nested expressions based on
the values that functions return or the data represented by symbolic variables. For
example, if a function expects character data for a particular argument, then you can
use a call to the function paste in place of explicit character data.
Many functions (including sum, plot, and rnorm) come with the R “base
packages”, i.e., they are loaded and ready to go as soon as you open R. These
packages contain the most common functions. While the base packages include
many useful functions, for specialized procedures, you should check out the content
that is available in the add-on packages. The CRAN website currently lists more
than 4500 contributed packages that contain functions and data that users have
contributed. You can find a list of the available packages at the CRAN website http://cran.r-project.org/. During the course of this book and described in more detail later
on, we will be looking and using a number of specialized packages for application
of DSM. Another repository of R packages is the R-Forge website https://r-forge.r-project.org/. R-Forge offers a central platform for the development of R packages,
R-related software and further projects. Packages in R-Forge are not necessarily
always on the CRAN website. However, many packages on the CRAN website
are developed in R-Forge as ongoing projects. Sometimes to get the latest changes
made upon a package, it pays to visit R-Forge first, as the uploading of the revised
functions to CRAN is not instantaneous.
To utilize the functions in contributed R packages, you first need to install
and then load the package. Packages can be installed via the packages menu
in the right bottom panel of RStudio (select the “packages” menu, then “install
packages”). Installation could be retrieved from the nearest mirror site (CRAN
server location)—you will need to have first selected this by going to the tools, then
options, then packages menu where you can then select the nearest mirror site from a
suite of possibles. Alternatively, you may just install a package from a local zip file.
This is fine, but often when using a package, there are other peripheral packages (or
dependencies) that also need to be loaded (and installed). If you install the package
from CRAN or a mirror site, the dependency packages are also installed. This is
not the case when you are installing packages from zip files—you will also have to
manually install all the dependencies too.
Or just use the command:
install.packages("package name")
where “package name” should be replaced with the actual name of the package
you want to install, for example:
install.packages("Cubist")
This command will install the package of functions for running the Cubist rule-
based machine learning models for regression.
Installation is a one-time process, but packages must be loaded each time you
want to use them. This is very simple, e.g., to load the package Cubist, use the
following command.
library(Cubist)
Other popular repositories for R packages include Github and BitBucket. These
repositories as well as R-Forge are version control systems that provide a central
place for people to collaborate on everything from small to very large projects with
speed and efficiency. The companion R package to this book, ithir, is hosted on
Bitbucket for example. ithir contains most of the data, and some important functions
that are covered in this book so that users can replicate all of the analyses contained
within. ithir can be downloaded and installed on your computer using the following
commands:
library(devtools)
install_bitbucket("brendo1001/ithir/pkg")
library(ithir)
The above commands assume you have already installed the devtools package. Any package that you want to use that is not included as one of the “base” packages needs to be loaded every time you start R. Alternatively, you can add code to the file Rprofile.site that will be executed every time you start R.
You can find information on specific packages through CRAN, by browsing
to http://cran.r-project.org/ and selecting the packages link. Each package has a
separate web page, which will include links to source code and a PDF manual. In RStudio, you can select the packages tab on the lower right panel. You will then see all the packages that are currently installed in your R environment. By clicking on any package, you can view information on the various functions contained in the package, plus documentation and manuals for their usage. It becomes quite clear that within the RStudio environment there is, at your fingertips, a wealth of information to consult whenever you get stuck. When working with a new package, it is a good idea to read the manual.
To “unload” functions, use the detach function:
detach("package:Cubist")
For tasks that you repeat, but which have no associated function in R, or if you do not like the functions that are available, you can write your own functions. This will be covered a little bit later on. Perhaps one day you may be able to compile all the functions that you have created into an R package for everyone else to use.
It is usually easy to find the answer about specific functions or about R in general.
There are several good introductory books on R. For example, “R for Dummies”, which has had many positive reviews: http://www.amazon.com/R-Dummies-Joris-Meys/dp/1119962846. You can also find free detailed manuals on the CRAN website. Also, it helps to keep a copy of the “R Reference Card”, which demonstrates the use of many common functions and operators in 4 pages: http://cran.r-project.org/doc/contrib/Short-refcard.pdf. Often a Google search https://www.google.com.au/ of your problem can be a very helpful and fruitful exercise. To limit the results to R related pages, adding “cran” generally works well. R even has an internet search engine of sorts called rseek, which can be found at http://rseek.org/—it is really just like the Google search engine, but just for R stuff!
Each function in R has a help file associated with it that explains the syntax and
usually includes an example. Help files are concisely written. You can bring up a
help file by typing ? and then the function name.
>?cubist
This will bring up the help file for the cubist function in the help panel of
RStudio. But, what if you are not sure what function you need for a particular
task? How can you know what help file to open? In addition to the sources given above, you can search the help files of installed packages for a keyword using the help.search function (or its ?? shorthand), for example:

help.search("polygon")

This will bring up a search results page in the help panel of RStudio listing all the various help files that have something to do with polygons. In this case, I am only interested in a function that assesses whether a point is situated within a polygon.
So looking down the list, one can see (provided the “SDMTools” package is
installed) a function called pnt.in.poly. Clicking on this function, or submitting
?pnt.in.poly to R will bring up the necessary help file.
There is an R help mailing list http://www.r-project.org/mail.html, which can be
very helpful. Before posting a question, be sure to search the mailing list archives,
and check the posting guide http://www.r-project.org/posting-guide.html.
One of the best sources of help on R functions is the mailing list archives
(http://cran.r-project.org/, then select “search”, then “searchable mail archives”).
Here you can find suggestions for functions for particular problems, help on using
specific functions, and all kinds of other information. A quick way to search the
mailing list archives is by entering:
RSiteSearch("A Keyword")
For one more trick, to search for objects (including functions) that include a
particular string, you can use the apropos function:
apropos("mean")
2.2.11 Exercises
1. You can use R for magic tricks: pick any number. Double it, and then add 12 to the result. Divide by 2, and then subtract your original number. Did you end up with 6.0?
2. If you want to work with a set of 10 numbers in R, something like this:
11 8.3 9.8 9.6 11.0 12.0 8.5 9.9 10.0 11.0
• What type of data structure should you use to store these in R?
• What if you want to work with a data set that contains site names, site
locations, soil categorical information, soil property information, and some
terrain variables—what type of data structure should you use to store these
in R?
3. Install and load a package—take a look at the list of available packages, and
pick one. To make sure you have loaded it correctly, try to run an example from
the package reference manual. Identify the arguments required for calling up the
function. Detach the package when you are done.
4. Assign your full name to a variable called my.name. Print the value of
my.name. Try to subtract 10 from my.name. Finally determine the type of
data stored in my.name and 10 using the class function. If you are unsure of
what class does, check out the help file.
5. You are interested in seeing what functions R has for fitting variograms (or some other topic of your choosing). Can you figure out how to search for relevant functions? Are you able to identify a function or two that may do what you want?
There are several ways to create a vector in R. Where the elements are spaced by
exactly 1, just separate the values of the first and last elements with a colon.
1:5
## [1] 1 2 3 4 5
The function seq (for sequence) is more flexible. Its typical arguments are
from, to, and by (or, in place of by, you can specify length.out).
seq(-10, 10, 2)
## [1] -10 -8 -6 -4 -2 0 2 4 6 8 10
Note that the by argument does not need to be an integer. When all the elements
in a vector are identical, use the rep function (for repeat).
rep(4, 5)
## [1] 4 4 4 4 4
To create a vector with arbitrary elements, use the c function (for combine or concatenate); the elements can also be named as they are defined.

c(2, 1, 5, 100, 2)
## [1] 2 1 5 100 2
c(a = 2, b = 1, c = 5, d = 100, e = 2)
## a b c d e
## 2 1 5 100 2
## [1] 2 1 5 100 2
Variable names can be any combination of letters, numbers, and the symbols . and _, but they cannot start with a number or with _. Google has an R style guide http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html which describes good and poor examples of variable naming, but generally it is a personal preference how you name your variables.
seq(1, 2, 0.1)

## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
The c function is very useful for setting up arguments for other functions, as will
be shown later. As with all R functions, both variable names and function names can
be substituted into function calls in place of numeric values.
x <- rep(1:3)
y <- 4:10
z <- c(x, y)
z
## [1] 1 2 3 4 5 6 7 8 9 10
x <- 1:10
x > 5
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
Also, note that when logical vectors are used in arithmetic, they are changed
(coerced in R terms) into a vector of binary elements: 1 or 0. Continuing with the
above example:
a <- x > 5
a
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
a * 1.4
## [1] 0.0 0.0 0.0 0.0 0.0 1.4 1.4 1.4 1.4 1.4
Note that the paste function is very different from c. The paste function
concatenates its arguments into a single character value, while the c function
combines its arguments into a vector, where each argument becomes a single
element. The paste function becomes handy when you want to combine the
character data that are stored in several symbolic variables.
month <- "April"
day <- 29
year <- 1770
paste("Captain Cook, on the ", day, "th day of ", month, ", "
, year, ", sailed into Botany Bay", sep = "")
This is especially useful with loops, when a variable with a changing value is
combined with other data. Loops will be discussed in a later section.
group <- 1:10
id <- LETTERS[1:10]
for (i in 1:10) {
print(paste("group =", group[i], "id =", id[i]))
}
x <- 6:10
x
## [1] 6 7 8 9 10
x + 2
## [1] 8 9 10 11 12
For an operation carried out on two vectors, the mathematical operation is applied
on an element-by-element basis.
y <- c(4, 3, 7, 1, 1)
y
## [1] 4 3 7 1 1
z <- x + y
z
## [1] 10 10 15 10 11
x <- 1:10
m <- 0.8
b <- 2
y <- m * x + b
y
## [1] 2.8 3.6 4.4 5.2 6.0 6.8 7.6 8.4 9.2 10.0
If the length of the longer vector is not a multiple of the length of the shorter vector (often indicative of an error), R will return a warning.
x <- 1:10
m <- 0.8
b <- c(2, 1, 1)
y <- m * x + b
## [1] 2.8 2.6 3.4 5.2 5.0 5.8 7.6 7.4 8.2 10.0
pi
## [1] 3.141593
7 - 2 * 4
## [1] -1
is different from:
(7 - 2) * 4
## [1] 20
and
10^1:5
## [1] 10 9 8 7 6 5
is different from:
10^(1:5)
Many functions in R are capable of accepting vectors (or even data frames, arrays,
and lists) as input for single arguments, and returning an object with the same
structure. These vectorised functions make vector manipulations very efficient.
Examples of such functions include log, sin, and sqrt. For example:
x <- 1:10
sqrt(x)
or
sqrt(1:10)
sqrt(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
But they are not the same as the following, where all the numbers are interpreted
as individual values for multiple arguments.
sqrt(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
There are also some functions designed for making vectorised operations on lists,
matrices, and arrays: these include apply and lapply.
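For instance, a brief sketch with illustrative values: apply summarizes the rows or columns of a matrix, while lapply applies a function to each element of a list.

m <- matrix(1:6, nrow = 2)                   # a small example matrix
apply(m, MARGIN = 2, FUN = sum)              # sum of each column: 3 7 11
lapply(list(a = 1:3, b = 4:6), FUN = mean)   # mean of each list element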
Arrays are multi-dimensional collections of elements and matrices are simply two-
dimensional arrays. R has several operators and functions for carrying out operations
on arrays, and matrices in particular (e.g., matrix multiplication).
To generate a matrix, the matrix function can be used. For example:
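One such call (the 5 x 3 dimensions used here are an assumption, consistent with the unpacking shown below) is:

X <- matrix(1:15, nrow = 5, ncol = 3)
X

##      [,1] [,2] [,3]
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13
## [4,]    4    9   14
## [5,]    5   10   15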
Note that the filling order is by column by default (i.e., each column is filled
before moving onto the next one). The “unpacking” order is the same:
as.vector(X)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
If, for any reason, you want to change the filling order, you can use the byrow argument:
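A sketch of such a call is given here; the object name Z and the reuse of the values 1 to 15 are assumptions, chosen so that Z can also be used in the arithmetic examples further below:

Z <- matrix(1:15, nrow = 5, ncol = 3, byrow = TRUE)
Z

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
## [4,]   10   11   12
## [5,]   13   14   15

The column-first default also applies to arrays of more than two dimensions; a call such as the following (again an assumption) produces the output shown next:

array(1:30, dim = c(5, 3, 2))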
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 6 11
## [2,] 2 7 12
## [3,] 3 8 13
## [4,] 4 9 14
## [5,] 5 10 15
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 16 21 26
## [2,] 17 22 27
## [3,] 18 23 28
## [4,] 19 24 29
## [5,] 20 25 30
Arithmetic with matrices and arrays that have the same dimensions is straightforward, and is done on an element-by-element basis. This is true for all the arithmetic operators listed in earlier sections.
X + Z
x <- 1:9
Z + x
y <- 1:3
Z + y
R also has operators for matrix algebra. The operator %*% carries out matrix
multiplication, and the function solve can invert matrices.
X <- matrix(c(1, 2.5, 6, 7.5, 4.9, 5.6, 9.9, 7.8, 9.3), nrow = 3)
X
solve(X)
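As a quick check (a sketch using the matrix X defined just above), multiplying X by its inverse returns the 3 x 3 identity matrix, allowing for rounding error:

X %*% solve(X)   # matrix multiplied by its inverse gives the identity matrix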
2.3.4 Exercises
5. If you are bored, try this. Given the following set of linear equations:
27.2x + 32y - 10.8z = 401.2
x - 1.48y = 0
409.1x + 13.5z = 2.83
Solve for x, y, and z.
As described above, a data frame is a type of data structure in R with rows and
columns, where different columns contain data with different modes. A data frame
is probably the most common data structure that you will use for storing soil
information data sets. Recall the data frame that we created earlier.
We can quickly assess the different modes of a data frame (or any other object for that matter) by using the str (structure) function.

str(dat)

The str function is probably one of the most used R functions. It is great for exploring the format and contents of any object created or imported.
The easiest way to create a data frame is to read in data from a file—this is done
using the function read.table, which works with ASCII text files. Data can be
read in from other files as well, using different functions, but read.table is the
most commonly used approach. R is very flexible in how it reads in data from text
files.
Note that the column labels in the header have to be compatible with R’s variable
naming convention, or else R will make some changes as they are read in (or will
not read in the data correctly). So let's import some real soil data. These soil data are some chemical and physical properties from a collection of soil profiles sampled at various locations in New South Wales, Australia.
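A call of the following general form would be used; the file name and path here are purely illustrative:

soil.data <- read.table("your_directory/soil_data.txt", header = TRUE, sep = ",")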
str(soil.data)
head(soil.data)
However, you may find that an error occurs, saying something like: the file does not exist. This is true, as the file has not been provided to you. Rather, to use these data you will need to load up the previously installed ithir package.
library(ithir)
data(USYD_soil1)
soil.data <- USYD_soil1
str(soil.data)
head(soil.data)
When we import a file into R, it is a good habit to look at its structure (str) to ensure the data are as they should be. As can be seen, this data set (frame) has 166 observations and 15 columns. The head function is also a useful exploratory function, which simply allows us to print out the data frame, but only the first 6 rows of it (good for checking data frame integrity). Note that you must specify header=TRUE, or else R will interpret the row of labels as data. If the file you are loading is not in the directory that R is working in (the working directory, which can be checked with getwd() and changed with setwd()), you will need to supply the full path to the file.
When setting the working directory (setwd()), you can include the file path, but
note that the path should have forward, not backward slashes (or double backward
slashes, if you prefer).
The column separator argument sep (of the read.table function) lets you tell R where the column breaks or delimiters occur. In the soil.data object, we
specify that the data is comma separated. If you do not specify a field separator,
R assumes that any spaces or tabs separate the data in your text file. However,
any character data that contain spaces must be surrounded by quotes (otherwise,
R interprets the data on either side of the white spaces as different elements).
which(is.na(soil.data$CEC))
## [1] 9 10 45 63 115
soil.data[8:11, ]
In most cases, it makes sense to put your data into a text file for reading into R.
This can be done in various ways. Data downloaded from the internet are often in text files to begin with. Data can be entered directly into a text file using a text editor. For data that are in a spreadsheet program such as Excel or JMP, there are facilities available for saving these tabular data to text files for reading into R.
This all may seem confusing, but it is really not that bad. Your best bet is to play around with the different options, find one that you like, and stick with it. Lastly, data frames can also be edited interactively in R using the edit function. This is really only useful for small data sets, and the function is not supported by RStudio (you could try using Tinn-R instead if you want to explore this function).
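For example, a small table of soil organic carbon values can be entered directly; the object name soils used here is arbitrary, and the values are those shown in the output that follows:

soils <- data.frame(soil.type = c("Chromosol", "Vertosol", "Organosol", "Anthroposol"),
    soil.OC = c(2.1, 2.9, 5.5, 0.2))
soils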
## soil.type soil.OC
## 1 Chromosol 2.1
## 2 Vertosol 2.9
## 3 Organosol 5.5
## 4 Anthroposol 0.2
While this approach is not an efficient way to enter data that could be read
in directly, it can be very handy for some applications, such as the creation of
customized summary tables. Note that column names are specified using an equal
sign. It is also possible to specify (or change, or check) column names for an existing
data frame using the function names.
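For example, using the soil.frame object created above (a sketch matching the output that follows):
names(soil.frame) <- c("soil", "SOC")
soil.frame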
## soil SOC
## 1 Chromosol 2.1
## 2 Vertosol 2.9
## 3 Organosol 5.5
## 4 Anthroposol 0.2
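Row names can also be supplied at creation time through the row.names argument of data.frame, which is what produces the output below (a sketch):
data.frame(soil.type = c("Chromosol", "Vertosol", "Organosol", "Anthroposol"),
soil.OC = c(2.1, 2.9, 5.5, 0.2), row.names = c("Ch", "Ve", "Or", "An"))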
## soil.type soil.OC
## Ch Chromosol 2.1
## Ve Vertosol 2.9
## Or Organosol 5.5
## An Anthroposol 0.2
Specifying row names can be useful if you want to index data, which will be
covered later. Row names can also be specified for an existing data frame with the
rownames function (not to be confused with the row.names argument).
So what do you do with data in R once it is in a data frame? Commonly, the data in a
data frame will be used in some type of analysis or plotting procedure. It is usually
necessary to be able to select and identify specific columns (i.e., vectors) within data
frames. There are two ways to specify a given column of data within a data frame.
The first is to use the $ notation. To see what the column names are, we can use the
names function. Using our soil.data set:
names(soil.data)
The $ just uses a $ between the data frame and column name to specify a
particular column. Say we want to look at the ESP column, which is the acronym
for exchangeable sodium percentage.
soil.data$ESP
## [1] 0.3 0.5 0.9 0.2 0.9 0.3 0.3 0.6 NA NA 0.4 0.9 0.2 0.1
## [15] NA 0.4 0.5 0.7 0.2 0.1 NA 0.2 0.3 NA 0.8 0.6 0.8 0.9
## [29] 0.9 1.1 0.5 0.6 1.1 0.2 0.6 1.1 0.4 0.3 0.5 1.0 2.2 0.1
## [43] 0.1 0.1 NA 0.1 0.1 0.4 NA 0.2 0.1 0.3 0.4 0.1 0.4 0.1
## [57] 0.1 NA 0.1 0.3 NA 0.1 NA 0.2 1.8 2.6 0.2 13.0 0.0 0.1
## [71] 0.3 0.1 0.1 0.3 0.0 0.3 0.6 0.9 0.4 NA NA 2.4 0.2 0.3
## [85] 0.2 0.0 0.1 0.4 NA 0.3 0.3 0.2 0.3 0.6 0.3 0.2 0.7 0.3
## [99] 0.4 1.0 7.9 6.1 5.7 5.2 4.7 2.9 5.8 7.2 9.6 NA 17.4 NA
## [113] 11.1 6.4 NA 4.0 12.1 21.2 2.2 1.9 NA 4.0 13.2 0.9 0.8 0.5
## [127] 0.2 0.4 NA 1.2 0.6 0.2 1.0 0.4 0.6 0.1 0.4 NA 0.7 0.5
## [141] 0.7 0.9 4.8 3.8 4.9 6.2 10.4 16.4 2.7 NA 1.2 0.5 1.9 2.0
## [155] 2.1 1.9 1.8 3.5 7.7 2.7 1.8 0.8 0.5 0.5 0.3 0.9
mean(soil.data$ESP)
## [1] NA
R cannot calculate the mean because of the NA values in the vector. Let's remove
them first using the na.omit function.
mean(na.omit(soil.data$ESP))
## [1] 1.99863
The second option for working with individual columns within a data frame is
to use the commands attach and detach. Both of these functions take a data
frame as an argument. Attaching a data frame puts all the columns within that data
frame into R's search path, so they can be called by their names alone, without the
$ notation.
attach(soil.data)
ESP
## [1] 0.3 0.5 0.9 0.2 0.9 0.3 0.3 0.6 NA NA 0.4 0.9 0.2 0.1
## [15] NA 0.4 0.5 0.7 0.2 0.1 NA 0.2 0.3 NA 0.8 0.6 0.8 0.9
## [29] 0.9 1.1 0.5 0.6 1.1 0.2 0.6 1.1 0.4 0.3 0.5 1.0 2.2 0.1
## [43] 0.1 0.1 NA 0.1 0.1 0.4 NA 0.2 0.1 0.3 0.4 0.1 0.4 0.1
## [57] 0.1 NA 0.1 0.3 NA 0.1 NA 0.2 1.8 2.6 0.2 13.0 0.0 0.1
## [71] 0.3 0.1 0.1 0.3 0.0 0.3 0.6 0.9 0.4 NA NA 2.4 0.2 0.3
## [85] 0.2 0.0 0.1 0.4 NA 0.3 0.3 0.2 0.3 0.6 0.3 0.2 0.7 0.3
## [99] 0.4 1.0 7.9 6.1 5.7 5.2 4.7 2.9 5.8 7.2 9.6 NA 17.4 NA
## [113] 11.1 6.4 NA 4.0 12.1 21.2 2.2 1.9 NA 4.0 13.2 0.9 0.8 0.5
## [127] 0.2 0.4 NA 1.2 0.6 0.2 1.0 0.4 0.6 0.1 0.4 NA 0.7 0.5
## [141] 0.7 0.9 4.8 3.8 4.9 6.2 10.4 16.4 2.7 NA 1.2 0.5 1.9 2.0
## [155] 2.1 1.9 1.8 3.5 7.7 2.7 1.8 0.8 0.5 0.5 0.3 0.9
Note that when you are done using the individual columns, it is good practice
to detach your data frame. Once the data frame is detached, R will no longer
know what you mean when you specify the name of the column alone:
detach(soil.data)
ESP
If you modify a variable that is part of an attached data frame, the data within
the data frame remain unchanged; you are actually working with a copy of the data
frame.
Another option (for selecting particular columns) is to use the square braces []
to specify the column you want. Using the square braces to select the ESP column
from our data set you would use:
soil.data[, 10]
Here you are specifying the column in the tenth position, which, as you should
check, is the ESP column. When using the square braces, the row position precedes
the comma and the column position follows it. By leaving a blank space in front of
the comma, we are essentially instructing R to print out the whole column.
You may be able to surmise that it is also possible to subset a selection of columns
quite efficiently with this square brace method. We will use the square braces more
a little later on.
The $ notation can also be used to add columns to a data frame. For example, if
we want to express our Upper.Depth and Lower.Depth columns in cm rather
than m we could do the following.
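A sketch of how this could be done (the new column names are a matter of choice):
soil.data$Upper.Depth.cm <- soil.data$Upper.Depth * 100
soil.data$Lower.Depth.cm <- soil.data$Lower.Depth * 100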
Many data frames that contain real data will have some missing observations. R
has several tools for working with these observations. For starters, the na.omit
function can be used for removing NAs from a vector. Working again with the ESP
column of our soil.data set:
soil.data$ESP
## [1] 0.3 0.5 0.9 0.2 0.9 0.3 0.3 0.6 NA NA 0.4 0.9 0.2 0.1
## [15] NA 0.4 0.5 0.7 0.2 0.1 NA 0.2 0.3 NA 0.8 0.6 0.8 0.9
## [29] 0.9 1.1 0.5 0.6 1.1 0.2 0.6 1.1 0.4 0.3 0.5 1.0 2.2 0.1
## [43] 0.1 0.1 NA 0.1 0.1 0.4 NA 0.2 0.1 0.3 0.4 0.1 0.4 0.1
## [57] 0.1 NA 0.1 0.3 NA 0.1 NA 0.2 1.8 2.6 0.2 13.0 0.0 0.1
## [71] 0.3 0.1 0.1 0.3 0.0 0.3 0.6 0.9 0.4 NA NA 2.4 0.2 0.3
## [85] 0.2 0.0 0.1 0.4 NA 0.3 0.3 0.2 0.3 0.6 0.3 0.2 0.7 0.3
## [99] 0.4 1.0 7.9 6.1 5.7 5.2 4.7 2.9 5.8 7.2 9.6 NA 17.4 NA
## [113] 11.1 6.4 NA 4.0 12.1 21.2 2.2 1.9 NA 4.0 13.2 0.9 0.8 0.5
## [127] 0.2 0.4 NA 1.2 0.6 0.2 1.0 0.4 0.6 0.1 0.4 NA 0.7 0.5
## [141] 0.7 0.9 4.8 3.8 4.9 6.2 10.4 16.4 2.7 NA 1.2 0.5 1.9 2.0
## [155] 2.1 1.9 1.8 3.5 7.7 2.7 1.8 0.8 0.5 0.5 0.3 0.9
na.omit(soil.data$ESP)
## [1] 0.3 0.5 0.9 0.2 0.9 0.3 0.3 0.6 0.4 0.9 0.2 0.1 0.4 0.5
## [15] 0.7 0.2 0.1 0.2 0.3 0.8 0.6 0.8 0.9 0.9 1.1 0.5 0.6 1.1
## [29] 0.2 0.6 1.1 0.4 0.3 0.5 1.0 2.2 0.1 0.1 0.1 0.1 0.1 0.4
## [43] 0.2 0.1 0.3 0.4 0.1 0.4 0.1 0.1 0.1 0.3 0.1 0.2 1.8 2.6
## [57] 0.2 13.0 0.0 0.1 0.3 0.1 0.1 0.3 0.0 0.3 0.6 0.9 0.4 2.4
## [71] 0.2 0.3 0.2 0.0 0.1 0.4 0.3 0.3 0.2 0.3 0.6 0.3 0.2 0.7
## [85] 0.3 0.4 1.0 7.9 6.1 5.7 5.2 4.7 2.9 5.8 7.2 9.6 17.4 11.1
## [99] 6.4 4.0 12.1 21.2 2.2 1.9 4.0 13.2 0.9 0.8 0.5 0.2 0.4 1.2
## [113] 0.6 0.2 1.0 0.4 0.6 0.1 0.4 0.7 0.5 0.7 0.9 4.8 3.8 4.9
## [127] 6.2 10.4 16.4 2.7 1.2 0.5 1.9 2.0 2.1 1.9 1.8 3.5 7.7 2.7
## [141] 1.8 0.8 0.5 0.5 0.3 0.9
## attr(,"na.action")
## [1] 9 10 15 21 24 45 49 58 61 63 80 81 89 110 112 115 121
## [18] 129 138 150
## attr(,"class")
## [1] "omit"
Note that the result of na.omit contains more information than just the non-NA
values (see the attributes printed above); however, only the non-NA values will be
used in subsequent operations. The function can also be applied to complete data
frames, in which case any row containing an NA is removed (so be careful with its
usage).
is.na(soil.data$ESP)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## [12] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [23] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [111] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [155] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [166] FALSE
With R, it is easy to write data to files. The function write.table is usually the
best function for this purpose. Given only a data frame and a file name, this function
will write the data contained in the data frame to a text file. There are a number of
arguments that can be controlled with this function as shown below (also look at the
help file).
write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ", eol = "\n",
na = "NA", dec = ".", row.names = TRUE, col.names = TRUE,
qmethod = c("escape", "double"), fileEncoding = "")
The important ones (or most frequently used) are the column separator sep
argument, and whether or not you want to keep the column and row names
(col.names and row.names, respectively). For example, if we want to write
soil.data to a text file ("file name.txt"), retaining the column names, not retaining
row names, and using a tab-delimited column separator, we would use:
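A sketch of such a call:
write.table(soil.data, file = "file name.txt", col.names = TRUE, row.names = FALSE, sep = "\t")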
Setting the append argument to TRUE lets you add data to a file that already
exists.
The write.table function cannot be used with all data structures in R (lists, for
example). However, it can be used for such things as vectors and matrices.
2.4.5 Exercises
1. Using the soil.data object, determine the minimum and maximum soil pH
(pH_CaCl2) in the data frame. Next add a new column to the data frame that
contains the log10 of soil carbon (Total_Carbon).
2. Create a new data frame that contains the mean SOC, pH, and clay of the data
set. Write out the summary to a new file using the default options. Finally, try
changing the separator to a tab and write to a new file.
3. There are a number of NA values in the data set. We want to remove them.
Could this be done in one step, i.e., delete every row that contains an NA? Is
this appropriate? How would you go about ensuring that no data are lost? Can you
do this? If so, try it!
To plot a single vector, all we need to do is supply that vector as the only argument
to the function. This plot is shown in Fig. 2.3.
Fig. 2.3 A default plot of a single vector (values plotted against their index)
z <- rnorm(10)
plot(z)
In this case, R simply plots the data in the order they occur in the vector. To plot
one variable versus another, just specify the two vectors as the first two arguments
(see Fig. 2.4).
x <- -15:15
y <- x^2
plot(x, y)
And this is all it takes to generate plots in R, as long as you like the default set-
tings. Of course, the default settings generally will not be sufficient for publication-
or presentation-quality graphics. Fortunately, plots in R are very flexible. The table
below shows some of the more common arguments to the plot function, and some
of the common settings. For many more arguments, see the help file for par or
consult some online materials where http://www.statmethods.net/graphs/ is a useful
starting point.
Use of some of the arguments in Table 2.1 is shown in the following example
(Fig. 2.5).
Fig. 2.4 A plot of y against x
plot(x, y, type = "o", xlim = c(-20, 20), ylim = c(-10, 300), pch = 21,
col = "red", bg = "yellow", xlab = "The X variable", ylab = "X squared")
The plot function is effectively vectorised. It accepts vectors for the first two
arguments (which specify the x and y position of your observations), but can also
accept vectors for some of the other arguments, including pch or col. Among
other things, this provides an easy way to produce a reference plot demonstrating
R’s plotting symbols and lines. If you use R regularly, you may want to print a copy
out (or make your own)—see Fig. 2.6.
Fig. 2.5 Your first plot using some of the plot arguments
2.5.2 Exercises
1. Produce a data frame with two columns: x, which ranges from -2π to 2π and has
a small interval between values (for plotting), and cosine(x). Plot the cosine(x)
vs. x as a line. Repeat, but try some different line types or colours.
2. Read in the data from the ithir package called “USYD_dIndex”, which
contains some observed soil drainage characteristics based on some defined soil
colour and drainage index (first column). In the second column is a corresponding
prediction which was made by a soil spatial prediction function. Plot the
observed drainage index (DI_observed) vs. the predicted drainage index
(DI_predicted). Ensure your plot has appropriate axis limits and labels, and
a heading. Try a few plotting symbols and colours. Add some informative text
somewhere. If you feel inspired, draw a line of concordance i.e., a 1:1 line on the
plot.
Fig. 2.6 A reference chart of R plotting symbols (pch 1–25), shown in default, blue, and blue/red colouring
As described before, the mode of an object describes the type of data that it contains.
In R, mode is an object attribute. All objects have at least two attributes: mode and
length, but many objects have more.
x <- 1:10
mode(x)
## [1] "numeric"
length(x)
## [1] 10
It is often necessary to change the mode of a data structure, e.g., to have your
data displayed differently, or to apply a function that only works with a particular
type of data structure. In R this is called coercion. There are many functions in R
that have the structure as.something that change the mode of a submitted object
to “something”. For example, say you want to treat numeric data as character data.
x <- 1:10
as.character(x)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1 1 4 7 10 13 16 19 22 25 28
## 2 2 5 8 11 14 17 20 23 26 29
## 3 3 6 9 12 15 18 21 24 27 30
If you are unsure of whether or not a coercion function exists, give it a try—two
other common examples are as.numeric and as.vector.
Attributes are important internally for determining how objects should be
handled by various functions. In particular, the class attribute determines how
a particular object will be handled by a given function. For example, output from a
linear regression has the class “lm” and will be handled differently by the print
function than will a data frame, which has the class “data.frame”. The utility of
this object-orientated approach will become more apparent later on.
It is often necessary to know the length of an object. Of course, length can mean
different things. Three useful functions for this are nrow, NROW, and length.
The function nrow will return the number of rows in a two-dimensional data
structure.
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 4 7 10 13 16 19 22 25 28
## [2,] 2 5 8 11 14 17 20 23 26 29
## [3,] 3 6 9 12 15 18 21 24 27 30
nrow(X)
## [1] 3
ncol(X)
## [1] 10
You can get both of these at once with the dim function.
dim(X)
## [1] 3 10
x <- 1:10
NROW(x)
## [1] 10
The value returned from the function length depends on the type of data
structure you submit, but for most data structures, it is the total number of elements.
length(X)
## [1] 30
length(x)
## [1] 10
Sub-setting and indexing are ways to select specific parts of a data structure (such
as specific rows within a data frame) in R. Indexing (also known as subscripting)
is done using the square braces in R:
v1 <- c(5, 1, 3, 8)
v1
## [1] 5 1 3 8
v1[3]
## [1] 3
R is very flexible in terms of what can be selected or excluded. For example, the
following returns the 1st through 3rd observations:
v1[1:3]
## [1] 5 1 3
v1[-4]
## [1] 5 1 3
This bracket notation can also be used with relational constraints. For example,
if we want only those observations that are <5.0:
v1[v1 < 5]
## [1] 1 3
This may seem confusing, but if we evaluate each piece separately, it becomes
clearer:
v1 < 5
## [1] FALSE  TRUE  TRUE FALSE
While we are on the topic of subscripts, we should note that, unlike some
other programming languages, the size of a vector in R is not limited by its initial
assignment. This is true for other data structures as well. To increase the size of a
vector, just assign a value to a position that does not currently exist:
length(v1)
## [1] 4
v1[8] <- 10
length(v1)
## [1] 8
v1
## [1] 5 1 3 8 NA NA NA 10
library(ithir)
data(USYD_soil1)
soil.data <- USYD_soil1
dim(soil.data)
## [1] 166 16
str(soil.data)
## $ Upper.Depth : num 0 0.02 0.05 0.1 0.2 0.7 0 0.02 0.05 0.1 ...
## $ Lower.Depth : num 0.02 0.05 0.1 0.2 0.3 0.8 0.02 0.05 0.1 0.2 ...
## $ clay : int 8 8 8 8 NA 57 9 9 9 NA ...
## $ silt : int 9 9 10 10 10 8 10 10 10 10 ...
## $ sand : int 83 83 82 83 79 36 81 80 80 81 ...
## $ pH_CaCl2 : num 6.35 6.34 4.76 4.51 4.64 6.49 5.91 ...
## $ Total_Carbon: num 1.07 0.98 0.73 0.39 0.23 0.35 1.14 ...
## $ EC : num 0.168 0.137 0.072 0.034 NA 0.059 0.123 ...
## $ ESP : num 0.3 0.5 0.9 0.2 0.9 0.3 0.3 0.6 NA NA ...
## $ ExchNa : num 0.01 0.02 0.02 0 0.02 0.04 0.01 0.02 NA NA ...
## $ ExchK : num 0.71 0.47 0.52 0.38 0.43 0.46 0.7 0.56 NA NA ...
## $ ExchCa : num 3.17 3.5 1.34 1.03 1.5 9.13 2.92 3.2 NA NA ...
## $ ExchMg : num 0.59 0.6 0.22 0.22 0.5 5.02 0.51 0.5 NA NA ...
## $ CEC : num 5.29 3.7 2.86 2.92 2.6 ...
If we want to subset out only the first 5 rows, and the first 2 columns:
soil.data[1:5, 1:2]
## PROFILE Landclass
## 1 1 native pasture
## 2 1 native pasture
## 3 1 native pasture
## 4 1 native pasture
## 5 1 native pasture
If an index is left out, R returns all values in that dimension (you need to include
the comma).
soil.data[1:2, ]
You can also specify row or column names directly within the brackets—this can
be very handy when column order may change in future versions of your code.
soil.data[1:5, "Total_Carbon"]
You can also specify multiple column names using the c function.
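A sketch of such a call, which produces the output below:
soil.data[1:5, c("Total_Carbon", "CEC")]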
## Total_Carbon CEC
## 1 1.07 5.29
## 2 0.98 3.70
## 3 0.73 2.86
## 4 0.39 2.92
## 5 0.23 2.60
Relational constraints can also be used in indexes. Let's subset out the soil
observations that are extremely sodic, i.e., those with an ESP greater than 10%.
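A sketch using indexing (wrapping the condition in which avoids returning rows where ESP is NA):
soil.data[which(soil.data$ESP > 10), ]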
While indexing can clearly be used to create a subset of data that meet certain
criteria, the subset function is often easier and shorter to use for data frames.
Sub-setting is used to select a subset of a vector, data frame, or matrix that meets a
certain criterion (or criteria). To return what was given in the last example:
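A sketch using subset:
subset(soil.data, ESP > 10)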
Note that the $ notation does not need to be used in the subset function. As
with indexing, multiple constraints can also be used:
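For example (the second condition is purely illustrative):
subset(soil.data, ESP > 10 & clay < 30)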
In some cases you may want to select observations that include any one value out
of a set of possibilities. Say we only want those observations where Landclass is
native pasture or forest. We could use:
subset(soil.data, Landclass == "Forest" | Landclass == "native pasture")
But this is an easier way (we are using the head function just to limit the number
of rows printed, so try it without the head function).
head(subset(soil.data, Landclass %in% c("Forest", "native pasture")))
Both of the above methods produce the same result, so it just comes down to a
matter of efficiency.
Indexing matrices and arrays follows what we have just covered. For example:
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 4 7 10 13 16 19 22 25 28
## [2,] 2 5 8 11 14 17 20 23 26 29
## [3,] 3 6 9 12 15 18 21 24 27 30
X[3, 8]
## [1] 24
X[, 3]
## [1] 7 8 9
## [1] 3
Indexing is a little trickier for lists—you need to use double square braces,
[[i]], to specify an element within a list. Of course, if the element within the
list has multiple elements, you could use indexing to select specific elements within
it.
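The list used in the following examples is not created in this extract; a sketch consistent with the output shown (reusing the vector 1:10 and the matrix X from earlier) is:
list.1 <- list(1:10, X)
list.1[[1]]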
## [1] 1 2 3 4 5 6 7 8 9 10
It is also possible to use double, triple, etc. indexing with all types of data
structures. R evaluates the expression from left to right. As a simple example, let's
extract the element in the third row of the second column of the second element of
list.1:
list.1[[2]][3, 2]
## [1] 6
An easy way to divide data into groups is to use the split function. This
function will divide a data structure (typically a vector or a data frame) into one
subset for each level of the variable you would like to split by. The subsets are stored
together in a list. Here we split our soil.data set into the separate or individual
soil profile (splitting by the PROFILE column—note output is not shown here for
sake of brevity).
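A sketch of the call:
soil.data.split <- split(soil.data, soil.data$PROFILE)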
If you apply split to individual vectors, the resulting list can be used directly in
some plotting or summarizing functions to give you results for each separate group.
(There are usually other ways to arrive at this type of result). The split function
can also be handy for manipulating and analyzing data by some grouping variable,
as we will see later.
It is often necessary to sort data. For a single vector, this is done with the function
sort.
x <- rnorm(5)
x
y <- sort(x)
y
But what if you want to sort an entire data frame by one column? In this case it
is necessary to use the function order, in combination with indexing.
head(soil.data[order(soil.data$clay), ])
The function order returns a vector that contains the row positions of the ranked
data:
order(soil.data$clay)
The previous discussion in this section showed how to isolate data that meet
certain criteria from a data structure. But sometimes it is important to know where
data reside in their original data structure. Two functions that are handy for
locating data within an R data structure are match and which. The match
function will tell you where specific values reside in a data structure, while the
which function will return the locations of values that meet certain criteria.
## [1] 41 59 18
Note that the match function matches the first observation only (this makes
it difficult to use when there are multiple observations of the same value). This
function is vectorised. The match function is useful for finding the location of the
unique values, such as the maximum.
match(max(soil.data$CEC, na.rm = TRUE), soil.data$CEC)
## [1] 95
Note the call to the na.rm argument in the max function as a means to overlook
the presence of NA values. So what is the maximum CEC value in our soil.data
set?
soil.data$CEC[95]
## [1] 28.21
The which function, on the other hand, will return all locations that meet the
criteria.
which(soil.data$ESP > 5)
## [1] 68 101 102 103 104 107 108 109 111 113 114 117 118 123 146 147 148
## [18] 159
The which function can also be useful for locating missing values.
which(is.na(soil.data$ESP))
soil.data$ESP[c(which(is.na(soil.data$ESP)))]
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2.6.3 Factors
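The vector used in this example is not shown in this extract; a sketch consistent with the output below is:
a <- c(0, 0, 0, 0, 1, 1, 1, 1)
a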
## [1] 0 0 0 0 1 1 1 1
a <- factor(a)
a
## [1] 0 0 0 0 1 1 1 1
## Levels: 0 1
The levels that R assigns to your factor are by default the unique values given
in your original vector. This is often fine, but you may want to assign more
meaningful levels. For example, say you have a vector that contains soil drainage
class categories.
soil.drainage<- c("well drained", "imperfectly drained", "poorly drained",
"poorly drained", "well drained", "poorly drained")
If you designate this as a factor, the default levels will be sorted alphabetically.
soil.drainage1 <- factor(soil.drainage)
soil.drainage1
as.numeric(soil.drainage1)
## [1] 3 1 2 2 3 2
If you specify levels as an argument of the factor function, you can control
the order of the levels.
soil.drainage2 <- factor(soil.drainage, levels = c("well drained",
"imperfectly drained", "poorly drained"))
as.numeric(soil.drainage2)
## [1] 1 2 3 3 1 3
This can be useful for obtaining a logical order in statistical output or summaries.
Data frames (or vectors or matrices) often need to be combined for analysis or
plotting. Two R functions that are very useful for combining data are rbind and
cbind. The function rbind simply "stacks" objects on top of each other to make
a new object ("row bind"). The function cbind ("column bind") carries out an
analogous operation with columns of data.
soil.info1 <- data.frame(soil = c("Vertosol", "Hydrosol", "Sodosol"),
response = 1:3)
soil.info1
## soil response
## 1 Vertosol 1
## 2 Hydrosol 2
## 3 Sodosol 3
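A second data frame can be made in the same way (a sketch matching the output that follows):
soil.info2 <- data.frame(soil = c("Chromosol", "Dermosol", "Tenosol"),
response = 4:6)
soil.info2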
## soil response
## 1 Chromosol 4
## 2 Dermosol 5
## 3 Tenosol 6
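Stacking the two with rbind then gives the combined frame shown below (a sketch):
rbind(soil.info1, soil.info2)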
## soil response
## 1 Vertosol 1
## 2 Hydrosol 2
## 3 Sodosol 3
## 4 Chromosol 4
## 5 Dermosol 5
## 6 Tenosol 6
2.6.5 Exercises
Here are some useful functions (note the usage of the na.rm argument) for
calculating means (mean), medians (median), standard deviations (sd), and
variances (var):
mean(soil.data$clay, na.rm = TRUE)
## [1] 26.95302
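The calls for the median, standard deviation and variance are analogous; the three outputs that follow correspond to these calls, in order:
median(soil.data$clay, na.rm = TRUE)
sd(soil.data$clay, na.rm = TRUE)
var(soil.data$clay, na.rm = TRUE)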
## [1] 21
## [1] 15.6996
## [1] 246.4775
summary(soil.data[, 1:6])
Box plots and histograms are simple but useful ways of summarizing data. You can
generate a histogram in R using the function hist.
hist(soil.data$clay)
The histogram (Fig. 2.7) can be made to look nicer, by applying some of the
plotting parameters or arguments that we covered for the plot function. There are
also some additional “plotting” arguments that can be sourced in the hist help
file. One of these is the ability to specify the number or location of breaks in the
histogram.
Box plots are also a useful way to summarize data. We can use one simply, for
example, to summarize the clay content in the soil.data set (Fig. 2.8).
boxplot(soil.data$clay)
By default, the heavy line shows the median, the box shows the 25th and 75th
percentiles, the “whiskers” show the extreme values, and points show outliers
beyond these.
Another approach is to plot a single variable by some factor. Here we will plot
Total_Carbon by Landclass (Fig. 2.9).
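A sketch of the call, using the formula interface discussed next:
boxplot(Total_Carbon ~ Landclass, data = soil.data)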
Note the use of the tilde symbol "~" in the above command. The code
Total_Carbon ~ Landclass is analogous to a model formula in this case, and
simply indicates that Total_Carbon is described by Landclass and should be
split up based on the categories of this variable. We will see more of this character
with the specification of soil spatial prediction functions later on.
Fig. 2.8 Boxplot of clay content from the soil.data set
Fig. 2.10 Normal QQ plot (sample quantiles vs. theoretical quantiles)
One way to assess the normality of the distribution of a given variable is with a
quantile-quantile plot. This plot shows data values vs. quantiles based on a normal
distribution (Fig. 2.10).
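A sketch of the call that produces such a plot:
qqnorm(soil.data$Total_Carbon)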
There definitely seems to be some deviation from normality here. This is not
unusual for soil carbon information. It is common (in order to proceed with
statistical modelling) to perform a transformation of sorts to get these data
to conform to a normal distribution. Let's see if a log transformation works any
better (Fig. 2.11).
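A sketch, assuming a natural log transformation:
qqnorm(log(soil.data$Total_Carbon))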
Finally, another useful data exploratory tool is quantile calculations. R will return
the quantiles of a given data set with the quantile function. Note that there are
nine different algorithms available for doing this—you can find descriptions in the
help file for quantile.
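For example (the chosen probabilities are illustrative):
quantile(soil.data$clay, probs = c(0.05, 0.25, 0.5, 0.75, 0.95), na.rm = TRUE)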
2.7.4 Exercises
1. Using the soil.data set firstly determine the summary statistics for each
of the numerical or quantitative variables. You want to calculate things like
maximum, minimum, mean, median, standard deviation, and variance. There are
a couple of ways to do this. However, put all the results into a data frame and
export as a text file.
2. Generate histograms and QQ plots for each of the quantitative variables. Do any
need some sort of transformation so that their distribution is normal? If so, do the
transformation and create the plots again.
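The summary of the clay and CEC columns shown below could be produced with a call such as the following (a sketch; the exact call is not shown in this extract):
summary(soil.data[, c("clay", "CEC")])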
## clay CEC
## Min. : 5.00 Min. : 1.900
## 1st Qu.:15.00 1st Qu.: 5.350
## Median :21.00 Median : 8.600
## Mean :26.95 Mean : 9.515
## 3rd Qu.:37.00 3rd Qu.:12.110
## Max. :68.00 Max. :28.210
## NA’s :17 NA’s :5
The (alternative) hypothesis here is that clay content is a good predictor of CEC.
As a start, let us have a look at what the data looks like (Fig. 2.12).
plot(soil.data$clay, soil.data$CEC)
25
20
soil.data$CEC
15
10
5
10 20 30 40 50 60 70
soil.data$clay
Fig. 2.12 plot of CEC against clay from the soil.data set
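The model fit that produces the output below is reconstructed here from the Call element of that output:
mod.1 <- lm(CEC ~ clay, data = soil.data, x = TRUE, y = TRUE)
mod.1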
##
## Call:
## lm(formula = CEC ~ clay, data = soil.data, x = TRUE, y = TRUE)
##
## Coefficients:
## (Intercept) clay
## 3.7791 0.2053
R returns only the call and coefficients by default. You can get more information
using the summary function.
summary(mod.1)
##
## Call:
## lm(formula = CEC ~ clay, data = soil.data, x = TRUE, y = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1829 -2.3369 -0.6767 1.0185 19.0924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.77913 0.63060 5.993 1.58e-08 ***
## clay 0.20533 0.02005 10.240 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 3.783 on 144 degrees of freedom
## (20 observations deleted due to missingness)
## Multiple R-squared: 0.4214,Adjusted R-squared: 0.4174
## F-statistic: 104.9 on 1 and 144 DF, p-value: < 2.2e-16
class(mod.1)
## [1] "lm"
To get at the elements listed above, you can simply index the lm object, i.e., call
up part of the list.
mod.1$coefficients
## (Intercept) clay
## 3.7791256 0.2053256
coef(mod.1)
## (Intercept) clay
## 3.7791256 0.2053256
head(residuals(mod.1))
## 1 2 3 4 6 7
## -0.1317300 -1.7217300 -2.5617300 -2.5017300 -0.5226822 -2.0370556
names(summary(mod.1))
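The R-squared value printed below can be obtained by indexing the summary object (this specific call is an assumption, reconstructed to match the output):
summary(mod.1)$r.squared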
## [1] 0.4213764
This flexibility is useful, but makes for some redundancy in R. For many model
statistics, there are three ways to get your data: an extractor function (such as coef),
indexing the lm object, and indexing the summary function. The best approach is to
use an extractor function whenever you can. In some cases, the summary function
will return results that you cannot get by indexing or using extractor functions.
Once we have fit a model in R, we can generate predicted values using the
predict function.
head(predict(mod.1))
## 1 2 3 4 6 7
## 5.421730 5.421730 5.421730 5.421730 15.482682 5.627056
Let's plot the observed values vs. the predictions from this model (Fig. 2.13).
plot(mod.1$y, mod.1$fitted.values)
As we will see later on, the predict function works for a whole range of
statistical models in R—not just lm objects. We can treat the predictions as we
would any vector. For example we can add them to the above plot or put them back
in the original data frame. The predict function can also give confidence and
prediction intervals.
Fig. 2.13 Observed CEC values (mod.1$y) plotted against the model fitted values (mod.1$fitted.values)
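The object subs.soil.data used below is not created in this extract; judging from the variables that appear in the correlation matrix, pairs plot and multiple regression, it is assumed to be a column subset such as:
subs.soil.data <- soil.data[, c("clay", "CEC", "ExchNa", "ExchCa")]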
A quick way to look for relationships between variables in a data frame is with
the cor function. Note the use of the na.omit function.
cor(na.omit(subs.soil.data))
Fig. 2.14 A pairs plot of a select few soil attributes from the soil.data set
pairs(na.omit(subs.soil.data))
There are some interesting relationships here. Now for fitting the model:
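A sketch of the fit (the formula is taken from the Call element of the output below; the object name mod.2 is an assumption):
mod.2 <- lm(CEC ~ clay + ExchNa + ExchCa, data = subs.soil.data)
summary(mod.2)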
##
## Call:
## lm(formula = CEC ~ clay + ExchNa + ExchCa, data = subs.soil.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2008 -0.7065 -0.0470 0.6455 9.4025
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.048318 0.274264 3.822 0.000197 ***
## clay 0.050503 0.009867 5.119 9.83e-07 ***
## ExchNa 2.018149 0.163436 12.348 < 2e-16 ***
## ExchCa 1.214156 0.046940 25.866 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.522 on 142 degrees of freedom
## (20 observations deleted due to missingness)
## Multiple R-squared: 0.9076,Adjusted R-squared: 0.9057
## F-statistic: 465.1 on 3 and 142 DF, p-value: < 2.2e-16
For much of the remainder of this book, we will be investigating these regression-
type relationships using a variety of different model types for soil spatial prediction.
The fundamental modelling concepts of lm will become useful as we progress.
2.8.3 Exercises
1. Using the soil.data set firstly generate a correlation matrix of all the soil
variables.
2. Choose two variables that you think would be good to regress against each
other, and fit a model. Is the model any good? Can you plot the observed vs.
predicted values? Can you draw a line of concordance (you may want to consult
an appropriate help file to do this)? Something a bit trickier: can you add the
predictions to the data frame soil.data correctly?
3. Now repeat what you did for the previous question, except this time perform
a multiple linear regression i.e., use more than 1 predictor variable to make a
prediction of the variable you are targeting.
look around its 3 × 3 neighbors and determine which pixel has the lowest elevation.
The algorithm for the principal toposequence can be written as:
1. Determine the highest point in an area.
2. Examine its 3 × 3 neighborhood, and determine whether there are any lower points.
3. If yes, set the lowest point as the next point in the toposequence, and then repeat
step 2. If no, the toposequence has ended.
To facilitate the 3 × 3 neighbor search in R, we can code the neighbors using
relative coordinates. If the current cell is designated as [0, 0], then its left neighbor
is [-1, 0], and so on. We can visualize this as in Fig. 2.15.
If we designate the current cell [0, 0] as z1, the function below will look for the
lowest neighbor of pixel z1 in a DEM.
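The body of find_steepest is not reproduced in this extract, so the following is a minimal sketch of what such a function could look like (the function name and its use are taken from the surrounding text; the internal logic and the c(row, column, elevation) return value are assumptions):
find_steepest <- function(dem, row_z, col_z) {
  z1 <- dem[row_z, col_z]
  lowest <- c(row_z, col_z, z1)          # start with the current cell
  for (i in -1:1) {                      # relative row offsets
    for (j in -1:1) {                    # relative column offsets
      nr <- row_z + i
      nc <- col_z + j
      # stay inside the DEM and skip the centre cell
      if (nr >= 1 && nr <= nrow(dem) && nc >= 1 && nc <= ncol(dem) &&
          !(i == 0 && j == 0) && !is.na(dem[nr, nc]) && dem[nr, nc] < lowest[3]) {
        lowest <- c(nr, nc, dem[nr, nc])
      }
    }
  }
  lowest                                 # lowest neighbour, or the cell itself if none is lower
}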
Now we want to create a data matrix to store the results of the toposequence, i.e.,
the row, column, and elevation values that are selected using the find_steepest
function.
transect <- matrix(data = NA, nrow = 20, ncol = 3)
Now we want to find, within the topo_dem matrix, the maximum elevation value
and its corresponding row and column position.
max_elev <- which(topo_dem == max(topo_dem), arr.ind = TRUE)
row_z = max_elev[1] # row of max_elev
col_z = max_elev[2] # col of max_elev
z1 = topo_dem[row_z, col_z] # max elevation
We then use the find_steepest function iteratively to find the lowest-value pixel
from the current one, which in turn becomes the selected z1, and so on, until the
values in the neighborhood are no longer smaller than the selected z1.
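The iteration itself is not shown in this extract; a minimal sketch, using the find_steepest sketch given earlier and the transect matrix just created, is:
t <- 1
transect[t, ] <- c(row_z, col_z, z1)     # start at the highest point
repeat {
  result <- find_steepest(dem = topo_dem, row_z, col_z)
  if (result[3] >= z1 || t >= nrow(transect)) break  # no lower neighbour (or matrix full)
  t <- t + 1
  row_z <- result[1]
  col_z <- result[2]
  z1 <- topo_dem[row_z, col_z]
  transect[t, ] <- c(row_z, col_z, z1)
}
transect <- na.omit(transect)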
Finally we can plot the transect. First let's calculate a distance relative to the top
of the transect. After this we can generate a plot as in Fig. 2.16.
So let's take this a step further and consider the idea of a random toposequence.
In reality, water does not only flow in the steepest direction; it can potentially
move down to any lower elevation. And a toposequence does not necessarily start at
the highest elevation either. We can generate a random toposequence (Odgers et al.
2008), where we select a random point in the landscape, then find a random path to
the top and bottom of a hillslope. In addition to the downhill routine, we need an
uphill routine too.
The algorithm for the random toposequence could be written as:
1. Select a random point from a DEM.
2. Travel uphill:
2.1 Examine its 3 × 3 neighborhood, and determine whether there are any higher points.
2.2 If yes, randomly select a higher point, add it to the uphill sequence, and repeat
step 2.1. If this point is the highest, the uphill sequence has ended.
3. Travel downhill:
3.1 Examine its 3 × 3 neighborhood, and determine whether there are any lower points.
3.2 If yes, randomly select a lower point, add it to the downhill sequence, and
repeat step 3.1. If this point is the lowest or a stream has been reached, the downhill
sequence has ended.
From this algorithm plan, we need to specify two functions: one that allows the
transect to travel uphill and another that allows it to travel downhill. For the one
that travels downhill, we could use the function from before (find_steepest), but
we want to build on that function by allowing the user to indicate whether they want
a randomly selected lower value or the minimum every time. The two new functions
would take the following form (see the sketch below):
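The definitions of travel_up and travel_down are not shown in this extract. The following sketches are assumptions, written to be consistent with how the functions are called below: each returns c(row, col, flag), with flag <= 0 when no suitable neighbour exists.
travel_up <- function(dem, row_z, col_z, random = TRUE) {
  z1 <- dem[row_z, col_z]
  cands <- NULL
  for (i in -1:1) {
    for (j in -1:1) {
      nr <- row_z + i
      nc <- col_z + j
      if (nr >= 1 && nr <= nrow(dem) && nc >= 1 && nc <= ncol(dem) &&
          !(i == 0 && j == 0) && !is.na(dem[nr, nc]) && dem[nr, nc] > z1) {
        cands <- rbind(cands, c(nr, nc, dem[nr, nc] - z1))  # candidate higher neighbours
      }
    }
  }
  if (is.null(cands)) return(c(row_z, col_z, 0))   # no higher neighbour: highest point
  if (random) cands[sample.int(nrow(cands), 1), ] else cands[which.max(cands[, 3]), ]
}

travel_down <- function(dem, row_z, col_z, random = TRUE) {
  z1 <- dem[row_z, col_z]
  cands <- NULL
  for (i in -1:1) {
    for (j in -1:1) {
      nr <- row_z + i
      nc <- col_z + j
      if (nr >= 1 && nr <= nrow(dem) && nc >= 1 && nc <= ncol(dem) &&
          !(i == 0 && j == 0) && !is.na(dem[nr, nc]) && dem[nr, nc] < z1) {
        cands <- rbind(cands, c(nr, nc, z1 - dem[nr, nc]))  # candidate lower neighbours
      }
    }
  }
  if (is.null(cands)) return(c(row_z, col_z, 0))   # no lower neighbour: lowest point
  if (random) cands[sample.int(nrow(cands), 1), ] else cands[which.max(cands[, 3]), ]
}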
Now we can generate a random toposequence. We will use the same topo_dem
data as before. First we select a point at random using a random selection of a
row and column value. Keep in mind that the random point selected here may be
different to the one you get because we are using a random number generator via
the sample.int function.
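A sketch of the set-up (the objects row_z1, col_z1 and transect_up are referenced in the script that follows; the exact calls, and the use of set.seed, are assumptions):
set.seed(123)                            # only so the example is repeatable
row_z1 <- sample.int(nrow(topo_dem), 1)  # random row
col_z1 <- sample.int(ncol(topo_dem), 1)  # random column
row_z <- row_z1
col_z <- col_z1
z1 <- topo_dem[row_z, col_z]             # elevation at the random seed point
# storage for the uphill transect, seeded with the starting point
transect_up <- matrix(data = NA, nrow = 100, ncol = 3)
t <- 1
transect_up[t, 1] <- row_z
transect_up[t, 2] <- col_z
transect_up[t, 3] <- z1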
We then can use the travel_up function to get our transect to go up the slope.
highest = FALSE
# iterate up the hill until highest point
while (highest == FALSE) {
result <- travel_up(dem = topo_dem, row_z, col_z,
random = TRUE)
if (result[3] <= 0) {
highest <- TRUE
break
} # if found highest point
t <- t + 1
row_z = result[1]
col_z = result[2]
z1 = topo_dem[row_z, col_z]
transect_up[t, 1] = row_z
transect_up[t, 2] = col_z
transect_up[t, 3] = z1
}
transect_up <- na.omit(transect_up)
Next we then use the travel_down function to get our transect to go down the
slope from the seed point.
# travel downhill create a data matrix to store results
transect_down <- matrix(data = NA, nrow = 100, ncol = 3)
# starting point
row_z <- row_z1
col_z <- col_z1
z1 = topo_dem[row_z, col_z] # a random pixel
t <- 1
transect_down[t, 1] = row_z
transect_down[t, 2] = col_z
transect_down[t, 3] = z1
lowest = FALSE
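The downhill iteration itself is not shown; a sketch that mirrors the uphill loop above would be:
while (lowest == FALSE) {
  result <- travel_down(dem = topo_dem, row_z, col_z, random = TRUE)
  if (result[3] <= 0) {
    lowest <- TRUE
    break
  } # if found lowest point
  t <- t + 1
  row_z <- result[1]
  col_z <- result[2]
  z1 <- topo_dem[row_z, col_z]
  transect_down[t, 1] <- row_z
  transect_down[t, 2] <- col_z
  transect_down[t, 3] <- z1
}
transect_down <- na.omit(transect_down)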
The idea then is to bind both uphill and downhill transects into a single one.
Note we are using the rbind function for this. Furthermore, we are also using the
order function here to re-arrange the uphill transect so that the resultant binding
will be sequential from highest to lowest elevation. Finally, we then calculate the
distance relative to the hilltop.
Fig. 2.17 Generated random toposequence (the red point indicates the random seed point)
The last step is to make the plot (Fig. 2.17) of the transect. We can also add the
randomly selected seed point for visualization purposes.
After seeing how this algorithm works, you can modify the script to take in
stream networks, and make the toposequence end once it reaches a stream. You
can also add "error trapping" to handle missing values, and to cover the case where
the downhill routine ends up in a local depression. This algorithm can also be used
to calculate slope length, distance to a particular landscape feature (e.g. hedges),
and so on.
Reference
Odgers NP, McBratney AB, Minasny B (2008) Generation of kth-order random toposequences.
Comput Geosci 34(5):479–490
Chapter 3
Getting Spatial in R
R has a very rich capacity to work with, analyse, manipulate and map spatial data.
Many procedures one would carry out in GIS software can more-or-less be
performed relatively easily in R. The application of spatial data analysis in R is well
documented in Bivand et al. (2008). Naturally, in DSM, we constantly work with
spatial data in one form or another e.g., points, polygons, rasters. We need to do
such things as import, view, and export points to, in, and from a GIS. Similarly for
polygons and rasters. In this chapter we will cover the fundamentals for doing these
basic operations as they are very handy skills, particularly if we want to automate
procedures.
Many of the functions used for working with spatial data do not come with the
base function suite installed with the R software. Thus we need to use specific
functions from a range of different contributed R packages. Probably the most
important and most frequently used are:
sp: contains many functions for handling vector (point, line and polygon) data.
raster: a very rich source of functions for handling raster data.
rgdal: functions for projections and spatial data I/O.
Consult the help files and online documentation regarding these packages, and
you will quickly realize that we are only scratching the surface of what spatial data
analysis functions these few packages are able to perform.
3.1.1 Points
We will be working with a small data set of soil information that was collected from
the Hunter Valley, NSW in 2010 called HV100. This data set is contained in the
ithir package. So first load it in:
library(ithir)
data(HV100)
str(HV100)
Now load the necessary R packages (you may have to install them onto your
computer first):
library(sp)
library(raster)
library(rgdal)
Using the coordinates function from the sp package we can define which
columns in the data frame refer to actual spatial coordinates—here the coordinates
are listed in columns x and y.
coordinates(HV100) <- ~x + y
str(HV100)
Note now that by using the str function, the class of HV100 has now changed
from a dataframe to a SpatialPointsDataFrame. We can do a spatial plot
of these points using the spplot plotting function in the sp package. There are
a number of plotting options available, so it will be helpful to consult the help file.
Here we are plotting the SOC concentration observed at each location (Fig. 3.1).
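A sketch of the plot call (the plotting arguments are a judgment call):
spplot(HV100, "OC", scales = list(draw = TRUE))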
Fig. 3.1 A plot of the site locations with reference to SOC concentration for the 100 points in the
HV100 data set
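The CRS assignment that produces the output below is not shown; a sketch (assuming the WGS84 UTM Zone 56 South projection, EPSG 32756, as indicated by the output) would be:
proj4string(HV100) <- CRS("+init=epsg:32756")
HV100@proj4string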
## CRS arguments:
## +init=epsg:32756 +proj=utm +zone=56 +south +datum=WGS84
## +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
We need to define the CRS so that we can perform any sort of spatial
analysis. For example, we may wish to use these data in a GIS environment such
as Google Earth, ArcGIS, SAGA GIS etc. This means we need to export the
SpatialPointsDataFrame of HV100 to an appropriate spatial data format
(for vector data) such as a shapefile or KML. rgdal is again used for this via the
writeOGR function. To export the data set as a shapefile:
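A sketch of the export (writing to the working directory; the layer name matches the shapefile that is read back in later):
writeOGR(HV100, dsn = ".", layer = "HV_dat_shape", driver = "ESRI Shapefile")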
Note that the object we wish to export needs to be a spatial points data
frame. You should try opening up this exported shapefile in a GIS software of your
choosing.
To look at the locations of the data in Google Earth, we first need to make sure
the data are in the WGS84 geographic CRS. As the data are currently in a UTM
projection, we need to perform a coordinate transformation. This is facilitated by
using the spTransform function in sp. The EPSG code for WGS84 geographic
is 4326. We can then export our transformed HV100 data set to a KML file and
visualize it in Google Earth.
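A sketch of the transformation and KML export (the object name HV100.ll is reused later in the interactive mapping section; the exact writeOGR arguments are assumptions):
HV100.ll <- spTransform(HV100, CRS("+init=epsg:4326"))
writeOGR(HV100.ll, dsn = "HV100.kml", layer = "HV100", driver = "KML")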
Sometimes, to conduct further analysis of spatial data, we may just want to import
it into R directly; for example, reading in a shapefile (this includes both points and
polygons). So let's read in the shapefile that was created just before and saved
to the working directory, "HV_dat_shape.shp":
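A sketch of the import (assuming the shapefile sits in the working directory):
imp.HV.dat <- readOGR(dsn = ".", layer = "HV_dat_shape")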
imp.HV.dat@proj4string
## CRS arguments:
## +proj=utm +zone=56 +south +datum=WGS84 +units=m
## +no_defs +ellps=WGS84 +towgs84=0,0,0
3.1.2 Rasters
Most of the functions needed for handling raster data are contained in the raster
package. There are functions for reading and writing raster files from and to different
raster formats. In DSM we often work with data in table format and then rasterise
these data so that we can make a map. To do this in R, let's bring in a data frame.
This could come from a text file, but as on previous occasions the data are imported
from the ithir package. These data are a digital elevation model with 100 m grid
resolution, from the Hunter Valley, NSW, Australia.
library(ithir)
data(HV_dem)
str(HV_dem)
As the data are already on a raster (such that the row observations indicate locations
on a regularly spaced grid), but in a table format, we can just use the rasterFromXYZ
function from raster. We can also define the CRS, just as we did with the
HV100 point data we worked with before.
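A sketch of the rasterisation and CRS definition (the object name r.DEM is used in the plotting code that follows; the CRS string is assumed to match the HV100 points):
r.DEM <- rasterFromXYZ(HV_dem)
crs(r.DEM) <- "+init=epsg:32756"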
Fig. 3.2 Digital elevation model for the Hunter Valley, overlayed with the HV100 sampling sites
So let's do a quick plot of this raster and overlay the HV100 point locations
(Fig. 3.2).
plot(r.DEM)
points(HV100, pch = 20)
So we may want to export this raster to a suitable format for further work in a
standard GIS environment. See the help file for writeRaster for information
about the supported grid formats to which data can be exported. For demonstration,
we will export our data to ESRI ASCII, as it is a common and universal raster
format.
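A sketch of the export (the file name matches the raster that is read back in later):
writeRaster(r.DEM, filename = "HV_dem100.asc", format = "ascii", overwrite = TRUE)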
What about exporting raster data to a KML file? Here you could use the KML
function. Remember that we need to reproject our data because it is in the UTM
system, and we need to get it to WGS84 geographic. The raster re-projection is
performed using the projectRaster function. Look at the help file for this
function. Probably the most important parameter is crs, which takes the CRS
string of the projection you want to convert the existing raster to, assuming it already
has a defined CRS. The other is method, which controls the interpolation method.
For continuous data, "bilinear" is suitable, but for categorical data, "ngb" (nearest
neighbor interpolation) is probably better suited. KML is a handy function
from raster for exporting grids to KML format.
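A sketch of the re-projection and KML export (the object name p.r.DEM is reused in the interactive mapping section; the file name is an assumption):
p.r.DEM <- projectRaster(r.DEM, crs = "+init=epsg:4326", method = "bilinear")
KML(p.r.DEM, "HV_dem100.kml", overwrite = TRUE)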
Now visualize this in Google Earth and overlay this map with the points that
were created before.
The other useful procedure we can perform is to import rasters directly into R so
we can perform further analyses. rgdal interfaces with the GDAL library, which
means that there are many supported grid formats that can be read into R
(http://www.gdal.org/formats_list.html). Here we will load in the "HV_dem100.asc"
raster that was made just before.
## class       : RasterLayer
## dimensions  : 215, 169, 36335 (nrow, ncol, ncell)
## resolution  : 100, 100 (x, y)
## extent      : 334459.8, 351359.8, 6362591, 6384091 (xmin, xmax, ymin, ymax)
## coord. ref. : NA
## data source : in memory
## names       : band1
## values      : 29.61407, 315.6837 (min, max)
You will notice from the data source field of the R-generated output that the raster
is loaded into memory. This is fine for small rasters, but can become a problem
when very large rasters need to be handled. A really powerful feature of the raster
package is the ability to point to the location of a raster (or rasters) without the need
to load it into memory. It is only very rarely that one needs to use all the data
contained in a raster at one time. As will be seen later on, this useful feature makes
for a very efficient way to perform digital soil mapping across very large spatial
extents.
To point to the “HV_dem100.asc” raster that was created earlier we would use the
following or similar command (where getwd() is the function to return the address
string of the working directory):
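A sketch of pointing to the raster on disk rather than loading it into memory (the printed summary below shows the on-disk data source):
grid.dem <- raster(paste0(getwd(), "/HV_dem100.asc"))
grid.dem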
## class       : RasterLayer
## dimensions  : 215, 169, 36335 (nrow, ncol, ncell)
## resolution  : 100, 100 (x, y)
## extent      : 334459.8, 351359.8, 6362591, 6384091 (xmin, xmax, ymin, ymax)
## coord. ref. : NA
## data source : C:\Users\bmalone\Dropbox\2015\DSM_book\HV_dem100.asc
## names       : HV_dem100
# plot(grid.dem)
A step beyond creating kml files of your digital soil information is the creation of
customized interactive mapping products that can be visualized within your web
browser. Interactive mapping makes sharing your data with colleagues simpler, and
importantly improves the visualization experience via customization features that
are difficult to achieve via the Google Earth software platform. The interactive
mapping is made possible via the Leaflet R package. Leaflet is one of the most
popular open-source JavaScript libraries for interactive maps. The Leaflet R package
makes it easy to integrate and control Leaflet maps in R. More detailed information
about Leaflet can be found at http://leafletjs.com/, and information specifically about
the R package is at https://rstudio.github.io/leaflet/.
There is a common workflow for creating Leaflet maps in R. First is the creation
of a map widget (calling leaflet()); followed by the adding of layers or
features to the map by using layer functions (e.g. addTiles, addMarkers,
addPolygons) to modify the map widget. The map can then be printed and
visualized in the R image window or saved to HTML file for visualization within
a web browser. The following R script is a quick taste of creating an interactive
Leaflet map. It is assumed that the leaflet and magrittr packages are installed.
library(leaflet)
library(magrittr)
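A minimal sketch of such a map (the landmark, taken here to be the Sydney Opera House, and its coordinates are illustrative assumptions):
leaflet() %>%
addTiles() %>%
addMarkers(lng = 151.2153, lat = -33.8568, popup = "Sydney Opera House")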
You should now see in your plot window a map of an iconic Australian landmark.
Interactive features of this map include markers with text, plus the ability to zoom
and pan the map. More will be discussed about the layer functions of the leaflet map
further on. What has not been encountered yet is the forward pipe operator %>%.
This operator forwards a value, or the result of an expression, into the next
function call or expression. To use this operator the magrittr package is required.
The example script below shows the same example written with and without the
forward pipe operator.
sqrt(sum(range(x)))
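The piped equivalent (the vector x is defined here only for illustration):
x <- c(1, 25, 4, 16)
x %>% range %>% sum %>% sqrt # same result as sqrt(sum(range(x)))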
With the above, we are calling upon a pre-existing base map via the
addTiles() function. Leaflet supports base maps using map tiles, popularized
by Google Maps and now used by nearly all interactive web maps. By default,
OpenStreetMap https://www.openstreetmap.org/#map=13/-33.7760/150.6528&
layers=C tiles are used. Alternatively, many popular free third-party base
maps can be added using the addProviderTiles() function, which is
implemented using the leaflet-providers plugin. For example, previously we used
the Esri.WorldImagery base mapping. The full set of possible base maps
can be found at http://leaflet-extras.github.io/leaflet-providers/preview/index.html.
Note that an internet connection is required for access to the base maps and map
tiling. The last function used above is addMarkers, where we simply call
up the point data we used previously, which are the soil point observations and
measurements from the Hunter Valley, NSW. A basic map will have been created
within your plot window. For the next step, let's populate the markers we have created
with some of the data that were measured, then add the Esri.WorldImagery
base mapping.
# Populate pop-ups
my_pops <- paste0("<strong>Site: </strong>", HV100.ll$site,
"<br>\n <strong> Organic Carbon (%): </strong>",
HV100.ll$OC, "<br>\n <strong> soil pH: </strong>", HV100.ll$pH)
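A sketch of the map with these pop-ups attached to the markers (the exact arguments are assumptions):
leaflet(data = HV100.ll) %>%
addProviderTiles("Esri.WorldImagery") %>%
addMarkers(popup = my_pops)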
Further, we can colour the markers and add a map legend. Here we will get the
quantiles of the measured SOC% and color the markers accordingly. Note that you
will need the colour ramp package RColorBrewer installed.
library(RColorBrewer)
# Colour ramp
pal1 <- colorQuantile("YlOrBr", domain = HV100.ll$OC)
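A sketch of the coloured markers and legend (a judgment on the exact styling; the colorQuantile palette pal1 from above is reused):
leaflet(data = HV100.ll) %>%
addProviderTiles("Esri.WorldImagery") %>%
addCircleMarkers(color = ~pal1(OC), popup = my_pops) %>%
addLegend("topright", pal = pal1, values = ~OC, title = "SOC (%) quantiles")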
It is well worth consulting the help files associated with the leaflet R package
for further tips on creating more customized maps. The website dedicated to the
package, mentioned above, is also a very helpful resource.
Raster maps can also be featured in our interactive mapping, as illustrated in
the following script.
# Colour ramp
pal2 <- colorNumeric(brewer.pal(n = 9, name = "YlOrBr"),
domain = values(p.r.DEM), na.color = "transparent")
# interactive map
leaflet() %>%
addProviderTiles("Esri.WorldImagery") %>%
addRasterImage(p.r.DEM, colors = pal2, opacity = 0.7) %>%
addLegend("topright", opacity = 0.8, pal = pal2,
values = values(p.r.DEM), title = "Elevation")
Lastly, we can create an interactive map that allows us to switch between the
different mapping outputs that we have created.
# layer switching (the last three lines are a completion sketch; assumed)
leaflet() %>%
addTiles(group = "OSM (default)") %>%
addProviderTiles("Esri.WorldImagery", group = "Esri.WorldImagery") %>%
addRasterImage(p.r.DEM, colors = pal2, opacity = 0.7, group = "DEM") %>%
addLayersControl(baseGroups = c("OSM (default)", "Esri.WorldImagery"), overlayGroups = "DEM")
With the interactive maps created, we can then export them as a web page in
HTML format. This can be done via the export menu within the RStudio plot
window, where you want to select the option "Save as Web Page". This file can
then be easily shared and viewed by your colleagues.
Notwithstanding the rich statistical and analytical resources provided through the
R base functionality, the following R packages (and their contained functions) are
what we think are an invaluable resource for DSM. As with all things in R, one
discovers new tricks all the time, which subsequently means that the functions
and analyses that are useful now may be superseded or made obsolete later on. There
are four main groups of tasks that are critical for implementing DSM in general.
These are: (1) soil science and pedometric type tasks; (2) using GIS tools and related
GIS tasks; (3) modelling; and (4) making maps, plotting, etc. The following are short
introductions to those packages that fall into these categories.
Soil science and pedometrics
• aqp: Algorithms for quantitative pedology. http://cran.r-project.org/web/
packages/aqp/index.html. A collection of algorithms related to modeling of
soil resources, soil classification, soil profile aggregation, and visualization.
• GSIF: Global soil information facility. http://cran.r-project.org/web/packages/
GSIF/index.html. Tools, functions and sample datasets for digital soil mapping.
GIS
• sp: http://cran.r-project.org/web/packages/sp/index.html. A package that pro-
vides classes and methods for spatial data. The classes document where the
spatial location information resides, for 2D or 3D data. Utility functions are
provided, e.g. for plotting data as maps, spatial selection, as well as methods
for retrieving coordinates, for sub-setting, print, summary, etc.
Reference
Bivand RS, Pebesma EJ, Gomez-Rubio V (2008) Applied spatial data analysis with R. UseR!
series. Springer, New York
Chapter 4
Preparatory and Exploratory Data Analysis
for Digital Soil Mapping
At the start of this book, some of the history and theoretical underpinnings of DSM
was discussed. Now with a solid foundation in R, it is time to put this all into practice
i.e. do DSM with R.
In this chapter some common methods for soil data preparation and exploration
are covered. Soil point databases are inherently heterogeneous because soils are
measured non-uniformly from site to site. However, one more-or-less universal
commonality is that soil point observations will generally have some sort of label, together with
some spatial coordinate information that indicates where the sample was collected
from. Then things begin to vary from site to site. Probably the biggest difference is
that all soils are not measured universally at the same depths. Some soils are sampled
per horizon or at regular depths. Some soil studies examine only the topsoil, while
others sample to the bedrock depth. Then different soil attributes are measured at
some locations and depths, but not at others. Overall, it becomes quickly apparent
when one begins working with soil data that a number of preprocessing steps are
needed to fulfill the requirements of a particular analysis.
In order to prepare a collection of data for use in a DSM project, as described in
Minasny and McBratney (2010), one needs to examine what data are available: What
is the soil attribute or class to be modeled? What is the support of the data? This
includes whether observations represent soil point observations or some integral
over a defined area (for now we just consider observations to be point observations).
However, we may also consider the support to be a function of depth, in that we
may be interested in mapping soil only for the top 10 cm, to 1 m, to any depth
in between, or to the depth of bedrock. The depth interval could be a single value
(such as one value for the 0–1 m depth interval as an example), or we may wish to
map simultaneously the depth variation of the target soil attribute with the lateral
or spatial variation. These questions add complexity to the soil mapping project,
but are an important consideration when planning a project and assessing what the
objectives are.
More recent digital soil mapping research has examined the combination of
soil depths functions with spatial mapping in order to create soil maps with a
near 3-D support. In the following section some approaches for doing this are
discussed with emphasis and instruction on a particular method, namely the use
of a mass-preserving soil depth function. This will be followed by a section that
will examine the important DSM step of linking observed soil information with
available environmental covariates and the subsequent preparation of covariates for
spatial modelling.
4.1 Soil Depth Functions

The traditional method of sampling soil involves dividing a soil profile into horizons.
The number of horizons and the position of each are generally based on attributes
easily observed in the field, such as morphological soil properties (Bishop et al.
1999). From each horizon, a bulk sample is taken and it is assumed to represent
the average value for a soil attribute over the depth interval from which it is
sampled. There are some issues with this approach: firstly from a pedological perspective, and secondly from the difficulty of using this legacy data within a
Digital Soil Mapping (DSM) framework where we wish to know the continuous
variability of a soil both in the lateral and vertical dimensions. From the pedological
perspective soil generally varies continuously with depth; however, representing
the soil attribute value as the average over the depth interval of horizons leads to
discontinuous or stepped profile representations. Difficulties can arise in situations
where one wants to know the value of an attribute at a specified depth. The second
issue regards DSM, where we use a database of soil profiles to generate a model of soil variability in the area in which they exist. Because observations at
each horizon for each profile will rarely be the same between any two profiles, it
then becomes difficult to build a model where predictions are made at a set depth or
at standardized depth intervals.
Legacy soil data is too valuable to do away with and thus needs to be molded
to suit the purposes of the map producer, such that one needs to be able to derive
a continuous function using the available horizon data as some input. This can be
done with many methods including polynomials and exponential decay type depth
functions. A more general continuous depth function is the equal-area quadratic
spline function. The usage and mathematical expression of this function have been
detailed in Ponce-Hernandez et al. (1986), Bishop et al. (1999), and Malone et al.
(2009). A useful feature of the spline function is that it is mass preserving, or in
other words the original data is preserved and can be retrieved again via integration
of the continuous spline. Compared to exponential decay functions where the goal
is in defining the actual parameters of the decay function, the spline parameters are
the values of the soil attribute at the standard depths that are specified by the user.
This is a useful feature, because firstly, one can harmonize a whole collection of
soil profile data and then explicitly model the soil for a specified depth. For example
the GlobalSoilMap.net project (Arrouays et al. 2014) has a specification that digital
soil maps be created for each target soil variable for the 0–5, 5–15, 15–30, 30–60,
60–100, and 100–200 cm depth intervals. In this case, the mass-preserving splines
can be fitted to the observed data, then values can be extracted from them at the
required depths, and are then ready for exploratory analysis and spatial modelling.
In the following, we will use legacy soil data and the spline function to prepare
data to be used in a DSM framework. This will specifically entail fitting splines to
all the available soil profiles and then through a process of harmonization, integrate
the splines to generate harmonized depths of each observation.
We will demonstrate the mass-preserving spline fitting using a single soil profile
example for which there are measurements of soil carbon density to a given
maximum depth. We can fit a spline to the maximum soil depth, or alternatively
any depth that does not exceed the maximum soil depth. The function used for
fitting splines is called ea_spline and is from the ithir package. Look at
the help file for further information on this function. For example, there is further
information about how the ea_spline function can also accept data of class
SoilProfileCollection from the aqp package in addition to data of the
more generic data.frame class. In the example below the input data is of class
data.frame. The data we need (oneProfile) is in the ithir package.
library(ithir)
data(oneProfile)
str(oneProfile)
As you can see above, the data table shows the soil depth information and carbon
density values for a single soil profile. Note the discontinuity of the observed depth
intervals which can also be observed in Fig. 4.1.
The ea_spline function will predict a continuous function from the top of
the soil profile to the maximum soil depth, such that it will interpolate values both
within the observed depths and between the depths where there is no observation.
Fig. 4.1 Soil profile plot of the oneProfile data. Note this figure was produced using the plot_soilProfile function from ithir

To parameterize the ea_spline function we could accept the defaults; however, it is likely we will want to change the lam and d parameters to suit the objective of the analysis being undertaken. lam is the lambda parameter, which controls the
smoothness or fidelity of the spline. Increasing this value will make the spline
more rigid. Decreasing it towards zero will make the spline more flexible, such that it follows the observed data almost directly. A sensitivity analysis is generally
recommended in order to optimize this parameter. From experience a lam value of
0.1 works well generally for most soil properties, and is the default value for the
function. The d parameter represents the depth intervals at which we want to get
soil values for. This is a harmonization process where regardless of which depths
soil was observed at, we can derive the soil values for regularized and standard
depths. In practice, the continuous spline function is first fitted to the data, then we
get the integrals of this function to determine the values of the soil at the standard
depths. d is a matrix; on the basis of the default values, it indicates that we want soil values for the following depth intervals: 0–5, 5–15, 15–30, 30–60, 60–100, and 100–200 cm. These are the depths specified for the GlobalSoilMap.net project (Arrouays et al. 2014). Naturally, one can alter these values to suit their own particular requirements. To fit a spline to the carbon density
values of the oneProfile data, the following script could be used:
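A minimal sketch of such a call, assuming the argument names var.name, d and lam from the ea_spline help file (check the help file of your installed version):

eaFit <- ea_spline(oneProfile, var.name = "C.kg.m3.",
    d = t(c(0, 5, 15, 30, 60, 100, 200)), lam = 0.1)
str(eaFit)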
## List of 4
## $ harmonised:’data.frame’: 1 obs. of 8 variables:
## ..$ id : num 1
## ..$ 0-5 cm : num 21
## ..$ 5-15 cm : num 15.8
## ..$ 15-30 cm : num 9.89
## ..$ 30-60 cm : num 7.18
## ..$ 60-100 cm : num 2.76
## ..$ 100-200 cm: num 1.73
## ..$ soil depth: num 360
## $ obs.preds :’data.frame’: 8 obs. of 6 variables:
## ..$ Soil.ID : num [1:8] 1 1 1 1 1 1 1 1
## ..$ Upper.Boundary: num [1:8] 0 10 30 50 70 120 250 350
## ..$ Lower.Boundary: num [1:8] 10 20 40 60 80 130 260 360
## ..$ C.kg.m3. : num [1:8] 20.7 11.7 8.2 6.3 2.4 2 0.7 1.2
## ..$ predicted : num [1:8] 19.84 12.45 8.24 6.2 2.56 ...
## ..$ FID : num [1:8] 1 1 1 1 1 1 1 1
## $ var.1cm : num [1:200, 1] 21.6 21.4 21.1 20.8 20.3 ...
## $ tmse : num [1, 1] 0.263
The output of the function is a list, where the first element is a dataframe (harmonised) containing the predicted spline estimates at the specified depth intervals. The second element (obs.preds) is another dataframe that contains
the observed soil data together with spline predictions for the actual depths of
observation for each soil profile. The third element (var.1cm) is a matrix which
stores the spline predictions of the depth function at (in this case) 1 cm resolution.
Each column represents a given soil profile and each row represents a 1 cm depth increment down to either the maximum depth we wish to extract values for, or the maximum observed soil depth (whichever is smaller). The last element (tmse) is another matrix that stores a single mean square error estimate for each soil profile. This value estimates the magnitude of the difference between observed values and the associated predicted values within each profile. It is often easier, however, to visualize the performance of the spline fitting. Subsequently, plotting the outputs
of ea_spline is made possible by the associated plot_ea_spline function
(see help file for use of this function):
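A sketch of such a call, using the eaFit object from above (the argument names d, maxd and type are assumptions based on the plot_ea_spline help file):

plot_ea_spline(eaFit, d = t(c(0, 5, 15, 30, 60, 100, 200)), maxd = 200, type = 1)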
Fig. 4.2 Output plots of the fitted spline from plot_ea_spline. Plot 1 is type 1, plot 2 is type 2 and plot 3 is type 3
4.2 Intersecting Soil Point Observations with Environmental Covariates
In order to carry out digital soil mapping in terms of evaluating the significance of
environmental variables in explaining the spatial variation of the target soil variable
under investigation, we need to link both sets of data together and extract the values
of the covariates at the locations of the soil point data. The first task is to bring some soil point data into our working environment. We will be using a preprocessed version of the Edgeroi data set (McGarry et al. 1989), with the target variable being soil carbon density. The data were preprocessed such that the values are outputs of the mass-preserving depth function. The data is loaded in from the
ithir package with the following script:
data(edgeroi_splineCarbon)
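The structure of these data can be inspected with, for example:

str(edgeroi_splineCarbon)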
As the summary shows above in the column headers, the soil depths correspond
to harmonized depth intervals. Before we create a spatial plot of these data, it is
a good time to introduce some environmental covariates. A small subset of them
is available for the whole Edgeroi District at a consistent pixel resolution of 90 m.
These can be accessed using the script:
data(edgeroiCovariates)
library(raster)
elevation
## class :RasterLayer
## dimensions :400, 577, 230800 (nrow, ncol, ncell)
## resolution :90, 90 (x, y)
## extent :738698.6, 790628.6, 6643808, 6679808
(xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=55 +south +ellps=WGS84
## +datum=WGS84 +units=m +no_defs
## data source : in memory
## names : elevation
## values : 181.4204, 960.1074 (min, max)
twi
## class :RasterLayer
## dimensions :400, 577, 230800 (nrow, ncol, ncell)
## resolution :90, 90 (x, y)
## extent :738698.6, 790628.6, 6643808, 6679808
(xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=55 +south +ellps=WGS84
## +datum=WGS84 +units=m +no_defs
## data source : in memory
## names : twi
## values : 9.801188, 23.89634 (min, max)
radK
## class :RasterLayer
## dimensions :400, 577, 230800 (nrow, ncol, ncell)
## resolution :90, 90 (x, y)
## extent :738698.6, 790628.6, 6643808, 6679808
(xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=55 +south +ellps=WGS84
## +datum=WGS84 +units=m +no_defs
## data source : in memory
## names : radK
## values : -0.00929, 5.16667 (min, max)
landsat_b3
## class :RasterLayer
## dimensions :400, 577, 230800 (nrow, ncol, ncell)
## resolution :90, 90 (x, y)
## extent :738698.6, 790628.6, 6643808, 6679808
(xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=55 +south +ellps=WGS84
## +datum=WGS84 +units=m +no_defs
## data source : in memory
## names : landsat_b3
## values : 18.86447, 170.517 (min, max)
landsat_b4
## class :RasterLayer
## dimensions :400, 577, 230800 (nrow, ncol, ncell)
## resolution :90, 90 (x, y)
## extent :738698.6, 790628.6, 6643808, 6679808
(xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=55 +south +ellps=WGS84
## +datum=WGS84 +units=m +no_defs
## data source : in memory
## names : landsat_b4
## values : 13.18422, 154.5758 (min, max)
Fig. 4.3 Edgeroi elevation map with the soil point locations overlaid upon it
# plot raster
plot(elevation, main = "Edgeroi elevation map with overlaid point locations")
## plot points
plot(edgeroi_splineCarbon, add = T)
When the covariate data are of common resolution and extent, rather than working with each raster independently it is much more efficient to stack them all into a single object. The stack function from raster is ready-made for this, and is simply enacted with the following script:
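For example, a call of this form creates the stack and prints its summary:

covStack <- stack(elevation, twi, radK, landsat_b3, landsat_b4)
covStack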
## class :RasterStack
## dimensions :400, 577, 230800, 5 (nrow, ncol, ncell, nlayers)
## resolution :90, 90 (x, y)
## extent :738698.6, 790628.6, 6643808, 6679808
(xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=55 +south +ellps=WGS84
## +datum=WGS84 +units=m +no_defs
## names : elevation, twi, radK,
landsat_b3, landsat_b4
## min values : 181.420395, 9.801188, -0.009290,
18.864470, 13.184220
## max values : 960.10742, 23.89634, 5.16667,
170.51700, 154.57581
As mentioned earlier, it is always preferable for all the rasters you are working with to share a common spatial extent, resolution and projection; otherwise the stack function will encounter an error.
With the soil point data and covariates prepared, it is time to perform the
intersection between the soil observations and covariate layers using the script:
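A sketch of that call (edgeroi_splineCarbon is assumed here to already be a SpatialPointsDataFrame, as described below):

DSM_data <- extract(covStack, edgeroi_splineCarbon, sp = 1, method = "simple")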
The extract function is quite useful. Essentially the function ingests the rasterStack object together with the SpatialPointsDataFrame object edgeroi_splineCarbon. Setting the sp parameter to 1 means that the extracted covariate data are appended to the existing SpatialPointsDataFrame object. The method parameter specifies the extraction method; in our case it is "simple", which retrieves the covariate value of the pixel each point falls within, i.e. it is akin to "drilling down".
A good practice is to then export the soil and covariate data intersect object to
file for later use. First we convert the spatial object to a dataframe, then export
as a comma separated text file.
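For example (the file name matches the one read back in further below):

DSM_data <- as.data.frame(DSM_data)
write.table(DSM_data, "edgeroiSoilCovariates_C.TXT",
    col.names = TRUE, row.names = FALSE, sep = ",")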
In the previous example the rasters we wanted to use were available from the ithir package. More generally, the raster data we need will be sitting somewhere on our computer or disk. The steps for intersecting the soil observation data with the covariates are the same as before, except we now need to specify the location where our raster covariate data are stored. We do not even have to load the rasters into memory; we just point R to where they are and then run the raster extract function. This is obviously a very handy feature when we are dealing with very large rasters, or a large number of them. The workhorse function we need is list.files. For example:
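A sketch of such a call (the directory path is taken from the output below; adjust it to wherever your rasters are stored):

files <- list.files(path = "C:/temp/testGrids/", pattern = "\\.tif$",
    full.names = TRUE)
files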
## [1] "C:/temp/testGrids/edge_elevation.tif"
## [2] "C:/temp/testGrids/edge_landsat_b3.tif"
## [3] "C:/temp/testGrids/edge_landsat_b4.tif"
## [4] "C:/temp/testGrids/edge_radK.tif"
## [5] "C:/temp/testGrids/edge_twi.tif"
The parameter path is essentially the directory location where the raster files are sitting. If needed, we may also do recursive listing into directories within that path directory. We want list.files() to return all the files (in our case) that have the .tif extension. This criterion is set via the pattern parameter: the $ at the end anchors the match to the end of the string, and adding \\. ensures that you match only files with the extension .tif; otherwise it may also list (if they exist) files that end in .atif, for example. Any other type of pattern matching criterion could be used to suit your own specific data. The full.names logical parameter simply controls whether the full pathname of each raster file is returned, which in this case we want.
All we then need to do is perform a raster stack of these individual rasters, and then perform the intersection. This is where the handy feature comes in: to perform the stack, we still do not need to load the rasters into R memory; they remain on file.
# stack rasters
r1 <- raster(files[1])
for (i in 2:length(files)) {
r1 <- stack(r1, files[i])
}
r1
## class :RasterStack
## dimensions :400, 577, 230800, 5 (nrow, ncol, ncell, nlayers)
## resolution :90, 90 (x, y)
## extent :738698.6, 790628.6, 6643808, 6679808
(xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=55 +south +datum=WGS84
## +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
## names : edge_elevation, edge_landsat_b3,
edge_landsat_b4, edge_radK, edge_twi
## min values : 181.420395, 18.864470,
## 13.184220, -0.009290, 9.801188
## max values : 960.10742, 170.51700,
## 154.57581, 5.16667, 23.89634
Note that the stacking of rasters is only possible if they are all equivalent in terms of resolution and extent. If they are not, you will find the raster package functions resample and projectRaster invaluable for harmonizing all your different raster layers. With the stacked rasters, we can perform
the soil point data intersection as done previously.
DSM_data <- extract(r1, edgeroi_splineCarbon, sp = 1,
method = "simple")
4.3 Some Exploratory Data Analysis

We will continue using the DSM_data object that was created in the previous section. As the data set was saved to file, you will also find it in your working directory. Type getwd() in the console to indicate the specific file location. So let's read the file in using the read.table function:
edge.dat <- read.table("edgeroiSoilCovariates_C.TXT",
sep = ",", header = T)
str(edge.dat)
Hereafter soil carbon density will be referred to as SOC for simplicity. Now let's first look at some of the summary statistics of SOC (we will concentrate on the 0–5 cm depth interval in the following examples).
round(summary(edge.dat$X0.5.cm), 1)
The observation that the mean and median are not equivalent indicates the
distribution of this data deviates from normal. To assess this more formally, we
can perform other analyses such as tests of skewness, kurtosis and normality. Here
we need to use functions from the fBasics and nortest packages (If you do
not have these already you should install them.)
library(fBasics)
library(nortest)
# skewness
sampleSKEW(edge.dat$X0.5.cm)
## SKEW
## 0.1964286
# kurtosis
sampleKURT(edge.dat$X0.5.cm)
## KURT
## 1.303571
Here we see that the data is positively skewed. A formal test for normality is given by the Anderson-Darling test statistic. There are others, so it is worth a look at the help files associated with the nortest package.
ad.test(edge.dat$X0.5.cm)
##
## Anderson-Darling normality test
##
## data: edge.dat$X0.5.cm
## A = 10.594, p-value < 2.2e-16
For these data to be considered normally distributed, the p value would need to be greater than 0.05. The departure from normality is also clear when we look at the histogram and qq-plot of the data in Fig. 4.4.
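A sketch of how such plots could be produced with base graphics (the exact plotting options used for the figure are not critical):

par(mfrow = c(1, 2))
hist(edge.dat$X0.5.cm)
qqnorm(edge.dat$X0.5.cm)
qqline(edge.dat$X0.5.cm, col = "red")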
The histogram in Fig. 4.4 shows that there are just a few high values that are more-or-less outliers in the data. Generally, for fitting most statistical models, we need to assume our data is normally distributed. A way to make the data more normal is to transform it. Common transformations include the square root,
logarithmic, or power transformations. Below is an example of taking the natural
log transform of the data.
Fig. 4.4 Histogram and qq-plot of SOC in the 0–5 cm depth interval
sampleSKEW(log(edge.dat$X0.5.cm))
## SKEW
## 0.03287885
sampleKURT(log(edge.dat$X0.5.cm))
## KURT
## 1.196472
ad.test(log(edge.dat$X0.5.cm))
##
## Anderson-Darling normality test
##
## data: log(edge.dat$X0.5.cm)
## A = 1.9117, p-value = 6.935e-05
While not perfect, this is an improvement on before. This is also apparent when we look at the plots in Fig. 4.5.
Fig. 4.5 Histogram and qq-plot of the natural log of SOC in the 0–5 cm depth interval
Fig. 4.6 Spatial distribution of points in the Edgeroi for the untransformed SOC data at the 0–5 cm
depth interval
library(ggplot2)
ggplot(edge.dat, aes(x = x, y = y)) +
geom_point(aes(size = edge.dat$X0.5.cm))
In Fig. 4.6 (which illustrates the untransformed data), there is a subtle east to west trend of high to low values. This trend is generally related to differences in land use in this area: intensive cropping is practiced in the west, where the land is an open floodplain, while to the east the land is slightly undulating and land use is generally associated with pastures and natural vegetation.
Ultimately we are interested in making maps. So, as a first exercise, and to get a clearer sense of the "spatial structure" of the data, it is useful to apply an interpolation method to estimate SOC values at the unvisited locations. Two common ways of doing this are inverse distance weighted (IDW) interpolation and kriging. For IDW, predictions at unvisited locations are calculated as a weighted average of the values available at the known points, where the weights are based only on the distance from the interpolation location. Kriging is a similar distance weighted interpolation procedure, except that the weights are derived from a model of the spatial auto-correlation of the data, i.e. a variogram.
data(edgeroiCovariates)
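The covariate rasters provide a convenient prediction grid. A sketch of how the gXY object described below could be constructed (using the elevation raster here is an assumption; any covariate raster on the same grid would do):

tempD <- data.frame(cellNos = seq(1:ncell(elevation)))
vals <- as.data.frame(getValues(elevation))
tempD <- cbind(tempD, vals)
tempD <- tempD[complete.cases(tempD), ]
cellNos <- c(tempD$cellNos)
gXY <- data.frame(xyFromCell(elevation, cellNos, spatial = FALSE))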
The script above essentially gets the pixels which have values associated with
them (discards all NA occurrences), and then uses the cell numbers to extract the
associated spatial coordinate locations using the xyFromCell function. The result
is saved in the gXY object.
Using the idw function from gstat we fit the formula as below. We need to
specify the observed data, their spatial locations, and the spatial locations of the
points we want to interpolate onto. The idp parameter allows you to specify the
inverse distance weighting power. The default is 2, yet can be adjusted if you want
to give more weighting to points closer to the interpolation point. As we cannot evaluate the uncertainty of prediction with IDW, we cannot really optimize this parameter.
library(gstat)
names(edge.dat)[2:3] <- c("x", "y")
IDW.pred <- idw(log(edge.dat$X0.5.cm) ~ 1, locations = ~x + y,
data = edge.dat, newdata = gXY, idp = 2)
Fig. 4.7 Map of log SOC (0–5 cm) predicted using IDW
Plotting the resulting map (Fig. 4.7) can be done using the following script.
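A sketch of such a script (the idw output is assumed to be a data frame with x, y and var1.pred columns, as gstat returns for data frame inputs):

IDW.raster <- rasterFromXYZ(as.data.frame(IDW.pred[, c("x", "y", "var1.pred")]))
plot(IDW.raster, main = "IDW predicted log SOC (0-5cm)")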
For soil science it is more common to use kriging for the reasons that we are
able to formally define the spatial relationships in our data and get an estimate of
the prediction uncertainty. As mentioned before, this is done using a variogram. Variograms measure the spatial auto-correlation of phenomena such as soil properties
(Pringle and McBratney 1999). The average variance between any pair of sampling
points (calculated as the semi-variance) for a soil property S at any point of distance
h apart can be estimated by the formula:
$$\gamma(h) = \frac{1}{2m(h)} \sum_{i=1}^{m(h)} \left\{ s(x_i) - s(x_i + h) \right\}^2 \qquad (4.1)$$
where $m(h)$ is the number of pairs of observation points separated by distance $h$. A mathematical model is then fitted to the empirical variogram; four of the more common ones are the linear model,
the spherical model, the exponential model, and the Gaussian model. Once an
appropriate variogram has been modeled it is then used for distance weighted
interpolation (kriging) at unvisited locations.
First, we calculate the empirical variogram, i.e. calculate the semivariances of all point pairs in our data set. Then we fit a variogram model (in this case we will use a spherical model). To do this we need to make some initial estimates of this model's
parameters; namely, the nugget, sill, and range. The nugget is the very short-range
error (effectively zero distance) which is often attributed to measurement errors.
The sill is the limit of the variogram (effectively the total variance of the data).
The range is the distance at which the data are no longer auto-correlated. Once we
have made the first estimates of these parameters, we use the fit.variogram
function for their optimization. The width parameter of the variogram function is
the width of distance intervals into which data point pairs are grouped or binned for
semi variance estimates as a function of distance. An automated way of estimating
the variogram parameters is to use the autofitVariogram function from the
automap package. For now we will stick with the gstat implementation.
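A sketch of how this could look with gstat (the width value and the initial nugget, sill and range estimates are assumptions to be adjusted for your data; cressie = TRUE matches the robust semivariance shown in Fig. 4.8):

vgm1 <- variogram(log(edge.dat$X0.5.cm) ~ 1, locations = ~x + y,
    data = edge.dat, width = 1000, cressie = TRUE)
# initial guesses of the spherical model parameters
mod <- vgm(psill = 0.15, model = "Sph", range = 10000, nugget = 0.05)
model_1 <- fit.variogram(vgm1, mod)
model_1
plot(vgm1, model = model_1)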
The plot in Fig. 4.8 shows the empirical variogram together with the fitted variogram model line.
We can make the maps as we did before, but now we can also look at the variances
of the predictions too (Fig. 4.9).
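A sketch of the kriging call, using the fitted model from above:

krig.pred <- krige(log(edge.dat$X0.5.cm) ~ 1, locations = ~x + y,
    data = edge.dat, newdata = gXY, model = model_1)
# var1.pred holds the predictions and var1.var the prediction variances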
Fig. 4.8 Empirical variogram and spherical variogram model of log SOC for the 0–5 cm depth
interval
Fig. 4.9 Kriging predictions and variances for log SOC (0–5 cm)
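As a final piece of exploration, we can examine the correlation between each covariate and log SOC. A call of roughly this form would produce the output below (the column names are those of the intersected data):

cor(edge.dat[, c("elevation", "twi", "radK", "landsat_b3", "landsat_b4")],
    log(edge.dat$X0.5.cm))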
## [,1]
## elevation 0.440924269
## twi -0.408020193
## radK 0.094710772
## landsat_b3 -0.002556606
## landsat_b4 0.060669516
It appears the highest correlations with log SOC are for the variables derived from the digital elevation model: elevation and twi. Weak correlations are found for the other covariates. In the following chapter we will explore a range of models for mapping the soil as a function of this suite of covariates.
Chapter 5
Continuous Soil Attribute Modeling and Mapping
The implementation of some of the most commonly used model functions used for
digital soil mapping will be covered in this chapter. Before this is done however,
some general concepts of model validation are covered.
5.1 Model Validation

A commonly used measure of model quality is the root mean square error (RMSE):

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(pred_i - obs_i\right)^2} \qquad (5.1)$$

where obs is the observed soil property, pred is the predicted soil property from a given model, and n is the number of observations i. Bias, also called the mean error of prediction, is defined as:

$$bias = \frac{\sum_{i=1}^{n}\left(pred_i - obs_i\right)}{n} \qquad (5.2)$$
The R2 measures the precision of the relationship (between observed and pre-
dicted). Concordance, or more formally Lin's concordance correlation coefficient (Lin 1989), is on the other hand a single statistic that evaluates both the accuracy and precision of the relationship. It is often referred to as the goodness of fit along
a 45 degree line. Thus it is probably a more useful statistic than the R2 alone.
Concordance, $\rho_c$, is defined as:

$$\rho_c = \frac{2\rho \, \sigma_{pred} \, \sigma_{obs}}{\sigma_{pred}^2 + \sigma_{obs}^2 + (\mu_{pred} - \mu_{obs})^2} \qquad (5.4)$$

where $\mu_{pred}$ and $\mu_{obs}$ are the means of the predicted and observed values respectively, $\sigma_{pred}^2$ and $\sigma_{obs}^2$ are the corresponding variances, and $\rho$ is the correlation coefficient between the predictions and observations.
So let's fit a simple linear model. We will use the soil.data set used before in the introductory R chapter. First load the data in; we then want to regress CEC on clay content (also being sure to remove any NAs).
library(ithir)
library(MASS)
data(USYD_soil1)
soil.data <- USYD_soil1
mod.data <- na.omit(soil.data[, c("clay", "CEC")])
mod.1 <- lm(CEC ~ clay, data = mod.data, y = TRUE, x = TRUE)
mod.1
##
## Call:
## lm(formula = CEC ~ clay, data = mod.data, x = TRUE, y = TRUE)
##
## Coefficients:
## (Intercept) clay
## 3.7791 0.2053
You will recall that this is the same model that we fitted during the introduction to
R chapter. What we now want to do is evaluate some of the model quality statistics
that were just described. Conveniently, these are available in the goof function
in the ithir package. We will use this function a lot during this chapter, so it is useful to describe it. goof takes four inputs: a vector of observed values, a vector of predicted values, a logical choice of whether an output plot is required, and a character input specifying what type of output is required. There are a number of possible goodness of fit statistics that can be requested, with only some being used frequently in digital soil mapping projects. Therefore, setting the type parameter to "DSM" will output only the R2, RMSE, MSE, bias and concordance statistics, as these are most relevant to DSM. Additional statistics can be returned if "spec" is specified for the type parameter. You may wish to generate a plot, in which case you would set the plot.it logical to TRUE.
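A call of roughly this form produces those statistics for mod.1 (parameter names as described above):

goof(observed = mod.data$CEC, predicted = mod.1$fitted.values,
    plot.it = TRUE, type = "DSM")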
This model mod.1 does not seem to be too bad. On average the predictions
are 3.75 cmol(c)/kg off the true value. The model on average is neither over-
or under-predictive, but we can see that a few high CEC values are influencing
the concordance and R2 . This outcome may mean that there are other factors that
influence the CEC, such as mineralogy type.
For k-fold cross-validation, we divide the data set into equal sized partitions or folds, with all but one of the folds being used for the model calibration; the remaining fold is used for
validation. We could repeat this k-fold process a number of times, each time using a
different random sample from the data set for model calibration and validation. This
allows one to efficiently derive distributions of the validation statistics as a means
of assessing the stability and sensitivity of the models and parameters.
Leave-one-out cross-validation (LOCV) involves a little more computation: if we had n data points,
we would subset n-1 of these data, and fit a model. Using this model we would make
a prediction for the single data that was left out of the model (and save the residual).
This is repeated for all n. LOCV would be undertaken when there are very few data
to work with. When we can sacrifice a few data points, the random-hold back or
k-fold cross-validation procedure would be acceptable.
When we are validating trained models with some sort of data sub-setting mechanism, always keep in mind that the validation statistics will be biased. As Brus et al. (2011) explain, the samples from the target mapping area used for DSM more often than not come from legacy soil survey, which would not have been based on a probability sampling design. Therefore, that sample will be biased, i.e. not a true representation of the total population. Even though we may randomly select observations from the legacy soil survey sites, those validation points do not constitute a probability sample of the target area, and consequently will only provide biased estimates of model quality. Thus an independent probability sample is required.
Further ideas on the statistical validation of models can be found in Hastie et al.
(2001).
So let's implement some of the validation techniques in R. We will use the same data as before, i.e. regressing CEC against clay content. First we will do the random hold-back validation using 70 % of the data for calibration. A random sample of the data
will be performed using the sample function.
set.seed(123)
training <- sample(nrow(mod.data), 0.7 * nrow(mod.data))
training
These values correspond to row numbers, which identify the rows we will use for the calibration data. We subset these rows out of mod.data and fit a new linear model.
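For example (the mod.rh object name follows its use in the next script):

mod.rh <- lm(CEC ~ clay, data = mod.data[training, ], y = TRUE, x = TRUE)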
goof(predicted = mod.rh$fitted.values,
observed = mod.data$CEC[training])
But we are more interested in how this model performs when we use the
validation data. Here we use the predict function to predict upon this data.
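A sketch of those steps (the plot produced by goof corresponds to Fig. 5.1):

mod.rh.V <- predict(mod.rh, newdata = mod.data[-training, ])
goof(observed = mod.data$CEC[-training], predicted = mod.rh.V, plot.it = TRUE)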
Fig. 5.1 Observed vs. predicted plot of CEC model (validation data set) with line of concordance
(red line)
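A minimal sketch of such a LOCV loop, storing each left-out prediction in a looPred object:

looPred <- numeric(nrow(mod.data))
for (i in 1:nrow(mod.data)) {
    looMod <- lm(CEC ~ clay, data = mod.data[-i, ])
    looPred[i] <- predict(looMod, newdata = mod.data[i, ])
}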
The i here is the counter, so for each loop it increases by 1 until we get to the end
of the data set. As you can see, we can index the mod.data using the i, meaning
that for each loop we will have selected a different calibration set. On each loop,
the prediction on the point left out of the calibration is made onto the corresponding
row position of the looPred object. Again we can assess the performance of the
LOCV using the goof function.
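For example (a sketch using the looPred object from the loop above):

goof(observed = mod.data$CEC, predicted = looPred)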
5.2 Multiple Linear Regression

Multiple linear regression (MLR) is where we regress a target variable against more
than one covariate. In terms of soil spatial prediction functions, MLR is a least-
squares model whereby we want to predict a continuous soil variable from a suite of
covariates. There are a couple of ways to go about this. We could just put everything
(all the covariates) in the model and then fit it (estimate the model parameters). We
could perform a stepwise regression model where we only enter variables that are
statistically significant, based on some selection criteria. Alternatively we could fit
what could be termed, an “expert” model, such that based on some pre-determined
knowledge of the soil variable we are trying to model, we include covariates that
5.2 Multiple Linear Regression 123
best describe this knowledge. In some ways this is a biased model, because we really do not know everything about the (spatial) characteristics of the soil property under investigation. Yet in many situations it is better to rely on expert knowledge gained in the field than on some other form.
So let's first get the data organized. Recall from before in the data preparatory
exercises that we were working with the soil point data and environmental covariates
for the Edgeroi area. These data are stored in the edgeroi_splineCarbon and
edgeroiCovariates objects from the ithir package. For the succession of
models to be used, we will concentrate on modelling and mapping the soil carbon
stocks for the 0–5 cm depth interval. To refresh, lets load the data in, perform a
log-transform of the soil carbon data (in order to make the distribution exhibit
normality), then intersect the data with the available covariates.
library(ithir)
library(raster)
library(rgdal)
# point data
data(edgeroi_splineCarbon)
names(edgeroi_splineCarbon)[2:3] <- c("x", "y")
# natural log transform
edgeroi_splineCarbon$log_cStock0_5 <- log(edgeroi_splineCarbon$X0.5.cm)
# grids
data(edgeroiCovariates)
It is a general preference to progress with a data frame containing just the data and covariates required for the modelling. In this case, we will subset the columns pertaining to the target variable log_cStock0_5, the covariates, and, for later on, the spatial coordinates.
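A sketch of how DSM_data might be assembled (the use of coordinates() and the exact column selection are assumptions; the essential step is the raster extract as before):

# promote the point data to a SpatialPointsDataFrame (sp is loaded with raster)
coordinates(edgeroi_splineCarbon) <- ~x + y
# stack the covariates and intersect them with the points
covStack <- stack(elevation, twi, radK, landsat_b3, landsat_b4)
DSM_data <- extract(covStack, edgeroi_splineCarbon, sp = 1, method = "simple")
# keep only the coordinates, target variable and covariates
DSM_data <- as.data.frame(DSM_data)[, c("x", "y", "log_cStock0_5",
    "elevation", "twi", "radK", "landsat_b3", "landsat_b4")]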
Often it is handy to check whether there are missing values, both in the target variable and in the covariates. It is possible that a point location does not fall within the extent of the available covariates; in these cases the data should be excluded. A quick way to assess whether there are missing or NA values in the data
is to use the complete.cases function.
which(!complete.cases(DSM_data))
## integer(0)
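The full MLR model whose summary output is shown below can be fitted with a call of this form (reconstructed from the Call element of that output):

edge.MLR.Full <- lm(log_cStock0_5 ~ elevation + twi + radK + landsat_b3 +
    landsat_b4, data = DSM_data)
summary(edge.MLR.Full)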
##
## Call:
## lm(formula = log_cStock0_5 ~ elevation + twi + radK +
## landsat_b3 + landsat_b4, data = DSM_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8090 -0.2312 -0.0156 0.2590 1.4837
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3294707 1.1371723 1.169 0.243194
## elevation 0.0045601 0.0012970 3.516 0.000498 ***
From the summary output above, it seems only a few of the covariates are significant in describing the spatial variation of the target variable. To determine the most parsimonious model we could perform a stepwise regression using the step function. With this function we can also specify in which direction we want the stepwise algorithm to proceed.
edge.MLR.Step <- step(edge.MLR.Full, trace = 0, direction="both")
summary(edge.MLR.Step)
##
## Call:
## lm(formula = log_cStock0_5 ~ elevation + landsat_b3,
data = DSM_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7556 -0.2325 -0.0122 0.2611 1.4594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.411781 0.181818 7.765 9.84e-14 ***
## elevation 0.004641 0.000491 9.454 < 2e-16 ***
## landsat_b3 0.004684 0.001993 2.350 0.0193 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.4961 on 338 degrees of freedom
## Multiple R-squared: 0.2091,Adjusted R-squared: 0.2044
## F-statistic: 44.69 on 2 and 338 DF, p-value: < 2.2e-16
Comparing the outputs of both the full and stepwise MLR models, there is very
little difference in the model diagnostics such as the R2 . Both models explain about
20 % of variation of the target variable. Obviously the “full” model is more complex
as it has more parameters than the “step” model. If we apply Occam’s Razor, the
“step” model is preferable.
As described earlier, it is more acceptable to test the performance of a model using an external validation. Let's fit a new model using the covariates selected in the stepwise regression to a random subset of the available data. We will sample 70 % of the available rows for the model calibration data set.
set.seed(123)
training <- sample(nrow(DSM_data), 0.7 * nrow(DSM_data))
edge.MLR.rh <- lm(log_cStock0_5 ~ elevation + landsat_b3,
data = DSM_data[training,])
# calibration predictions
MLR.pred.rhC <- predict(edge.MLR.rh, DSM_data[training, ])
# validation predictions
MLR.pred.rhV <- predict(edge.MLR.rh, DSM_data[-training, ])
Now we can evaluate the test statistics of the calibration model using the goof
function.
# calibration
goof(observed = DSM_data$log_cStock0_5[training], predicted
= MLR.pred.rhC)
# validation
goof(observed = DSM_data$log_cStock0_5[-training], predicted
= MLR.pred.rhV)
In this situation the calibration model does not appear to be over-fitting, because the test statistics for the validation are equal to or better than those of the calibration data. While this is a good result, the prediction model performs only moderately well, as there is a noticeable deviation between observations and the corresponding model predictions. Examining other candidate models is a way to try to improve upon this result.
From a soil mapping perspective the important question to ask is: what does the map resulting from a particular model look like? In practice this can be answered by applying the model parameters to the grids of the covariates that were used in the model. There are a few options for how to do this.
The traditional approach has been to collate a grid table, with two columns for the coordinates followed by other columns for each of the available covariates that
were sourced. This was seen as an efficient way to organize all the covariate data as
it ensured that a common grid was used which also meant that all the covariates are
of the same scale in terms of resolution and extent. We can simulate the covariate
table approach using the edgeroiCovariates object as below.
data(edgeroiCovariates)
covStack <- stack(elevation, twi, radK, landsat_b3, landsat_b4)
tempD <- data.frame(cellNos = seq(1:ncell(covStack)))
vals <- as.data.frame(getValues(covStack))
tempD <- cbind(tempD, vals)
tempD <- tempD[complete.cases(tempD), ]
cellNos <- c(tempD$cellNos)
gXY <- data.frame(xyFromCell(covStack, cellNos, spatial = FALSE))
tempD <- cbind(gXY, tempD)
str(tempD)
The result shown above is that the covariate table contains 201313 rows and 8 variables. It is always necessary to have the coordinate columns, but some memory savings could be gained if only the required covariates are appended to the table. It will quickly become obvious, however, that the covariate table approach can be limiting when mapping extents get very large or the grid resolution of the mapping becomes finer, or both.
With the covariate table arranged it then becomes a matter of using the MLR
predict function.
map.MLR <- predict(edge.MLR.rh, newdata = tempD)
map.MLR <- cbind(data.frame(tempD[, c("x", "y")]), map.MLR)
Now we can rasterise the predictions for mapping (Fig. 5.2) and grid export. In
the example below we set the CRS to WGS84 Zone 55 before exporting the raster
file out as a Geotiff file.
map.MLR.r <- rasterFromXYZ(as.data.frame(map.MLR[, 1:3]))
plot(map.MLR.r, main = "MLR predicted log SOC stock (0-5cm)")
# set the projection
crs(map.MLR.r) <- "+proj=utm +zone=55 +south +ellps=WGS84
+datum=WGS84 +units=m +no_defs"
writeRaster(map.MLR.r, "cStock_0_5_MLR.tif", format = "GTiff",
datatype = "FLT4S", overwrite = TRUE)
# check working directory for presence of raster
Fig. 5.2 MLR predicted log SOC stock 0–5 cm Edgeroi
Some of the parameters used within the writeRaster function are worth noting. format is the raster format that we want to write to; here "GTiff" is specified (use the writeFormats function to see which other raster formats can be used). The datatype parameter is specified as "FLT4S", which indicates that 4 byte, signed floating point values are to be written to file. Look at the dataType function for other alternatives, for example for categorical data where we may be interested in logical or integer values.
Probably a more efficient way of applying the fitted model is to apply it directly
to the rasters themselves. This avoids the step of arranging all covariates into table
format. If multiple rasters are being used, it is necessary to have them arranged as
a rasterStack object. This is useful as it also ensures all the rasters are of the
same extent and resolution. Here we can use the raster predict function such
as below using the covStack raster stack as input.
map.MLR.r1 <- predict(covStack, edge.MLR.rh, "cStock_0_5_MLR.tif",
format = "GTiff", datatype = "FLT4S", overwrite = TRUE)
# check working directory for presence of raster
The prediction function is quite versatile. For example, we can also map the standard error of prediction, the confidence interval, or even the prediction interval. The script below is an example of creating maps of the 90 % prediction intervals for the edge.MLR.rh model. We need to explicitly create a function, called in this case predfun, which will direct the raster predict function to output the predictions plus the upper and lower prediction limits. In the predict function we insert predfun for the fun parameter and control the output by changing the index value to 1, 2, or 3 to request the prediction, lower limit, or upper limit respectively. Setting the level parameter to 0.90 indicates that we want to return the 90 % prediction interval. The resulting plots are shown in Fig. 5.3.
par(mfrow = c(3, 1))
predfun <- function(model, data) {
v <- predict(model, data, interval = "prediction", level = 0.9)
}
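The three maps can then be produced along these lines (the object names are illustrative):

# prediction (index = 1), lower limit (index = 2) and upper limit (index = 3)
map.MLR.fit <- predict(covStack, edge.MLR.rh, fun = predfun, index = 1)
map.MLR.lwr <- predict(covStack, edge.MLR.rh, fun = predfun, index = 2)
map.MLR.upr <- predict(covStack, edge.MLR.rh, fun = predfun, index = 3)
plot(map.MLR.fit)
plot(map.MLR.lwr)
plot(map.MLR.upr)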
Fig. 5.3 MLR predicted log SOC stock 0–5 cm Edgeroi with lower and upper prediction limits
An extension of using the raster predict function is to apply the model to the rasters across multiple compute nodes. This is akin to breaking a job up into smaller pieces and then processing the pieces in parallel rather than sequentially, with the smaller pieces passed to more than one compute node. Most desktop computers these days have up to 8 compute nodes, which can result in some excellent gains in efficiency when applying models across massive extents and/or at fine resolutions. The raster package has some built-in dependencies with other R packages that facilitate parallel processing options. For example, the raster package ports with the snow package for setting up and controlling the compute node processes. The script below is an example of using 4 compute nodes to apply the edge.MLR.rh model to the covStack raster stack.
beginCluster(4)
cluserMLR.pred <- clusterR(covStack, predict, args = list(edge.MLR.rh),
filename = "cStock_0_5_MLR_pred.tif", format = "GTiff",
progress = FALSE, overwrite = T)
endCluster()
To set up the compute nodes, you use the beginCluster function and inside
it, specify how many compute nodes you want to use. If empty brackets are used,
the function will use 100 % of the compute resources. The clusterR function
is the work horse function that then applies the model in parallel to the rasters.
The parameters and subsequent options are similar to the raster predict function,
although it would help to look at the help files on this function for more detailed
explanations. It is always important after the prediction is completed to shutdown
the nodes using the endCluster function.
The relative ease of setting up parallel processing for our mapping needs has really opened up the potential for performing DSM using very large data sets and rasters. Moreover, using parallel processing together with the file pointing ability of raster (that was discussed earlier) has made big DSM a reality and, importantly, practicable.
5.3 Decision Trees

library(rpart)
set.seed(123)
training <- sample(nrow(DSM_data), 0.7 * nrow(DSM_data))
edge.RT.Exp <- rpart(log_cStock0_5 ~ elevation + twi + radK +
landsat_b3 + landsat_b4, data = DSM_data[training, ],
control = rpart.control(minsplit = 50))
It is worthwhile to look at the help file for rpart particularly those aspects
regarding the rpart.control parameters which control the rpart fit. Often it
is helpful to just play around with the parameters to get a sense of what does what.
Here for the minsplit parameter within rpart.control we are specifying
that we want at least 50 observations in a node in order for a split to be attempted.
Detailed results of the model fit can be provided via the summary and printcp
functions.
summary(edge.RT.Exp)
The summary output provides detailed information of the data splitting as well
as information as to the relative importance of the covariates.
printcp(edge.RT.Exp)
Fig. 5.4 Decision tree structure of the fitted edge.RT.Exp model
The printcp function provides useful output indicating which covariates were included in the final model. For the visually inclined, a plot of the tree greatly assists in interpreting the model diagnostics and assessing the important covariates (Fig. 5.4).
plot(edge.RT.Exp)
text(edge.RT.Exp)
As before, we can use the goof function to test the performance of the model fit
both internally and externally.
# Internal validation
RT.pred.C <- predict(edge.RT.Exp, DSM_data[training, ])
goof(observed = DSM_data$log_cStock0_5[training], predicted = RT.pred.C)
# External validation
RT.pred.V <- predict(edge.RT.Exp, DSM_data[-training, ])
goof(observed = DSM_data$log_cStock0_5[-training], predicted = RT.pred.V)
Fig. 5.5 Decision tree xy-plot of log SOC stock 0–5 cm (validation data set)
The decision tree model performance is not too dissimilar to that of the MLR model. Looking at the xy-plot from the external validation (Fig. 5.5) and the decision tree (Fig. 5.4), a potential issue becomes apparent: there are only a finite number of possible outcomes in terms of the predictions.
This finite property becomes obviously apparent once we make a map by
applying the edge.RT.Exp model to the covariates (using the raster predict
function and covStack object) (Fig. 5.6).
5.4 Cubist Models

The Cubist model is currently a very popular model structure used within the DSM community. Its popularity is due to its ability to "mine" non-linear relationships in data, without the issues of finite predictions that occur for other decision and regression tree models. In a similar vein to regression trees, however, Cubist
Fig. 5.6 Decision tree predicted log SOC stock 0–5 cm Edgeroi
models also are a data partitioning algorithm. The Cubist model is based on the
M5 algorithm of Quinlan (1992), and is implemented in the R Cubist package.
The Cubist model first partitions the data into subsets within which their
characteristics are similar with respect to the target variable and the covariates. A
series of rules (a decision tree structure may also be defined if requested) defines the
partitions, and these rules are arranged in a hierarchy. Each rule takes the form:
if [condition is true]
then [regress]
else [apply the next rule]
The condition may be a simple one based on one covariate or, more often, it
comprises a number of covariates. If a condition results in being true then the
next step is the prediction of the soil property of interest by ordinary least-squares
regression from the covariates within that partition. If the condition is not true then
the rule defines the next node in the tree, and the sequence of if, then, else is repeated.
The result is that the regression equations, though general in form, are local to the
partitions and their errors smaller than they would otherwise be. More details of the
Cubist model can be found in the Cubist help files or Quinlan (1992).
Luckily, fitting a Cubist model in R is not too difficult, although it will be useful to spend some time playing around with the many controllable parameters the function has. In the example below we control the number of potential rules that could partition the data (note this limits the number of possible rules; it does not necessarily mean that that number of rules will actually be realised, i.e. the outcome is internally optimised). We can also limit the extrapolation of the
model predictions, which is a useful model constraint feature. These various control parameters, plus others, can be adjusted within the cubistControl parameter. The committees parameter is specified as an integer giving how many committee models (e.g. boosting iterations) are required. Here we just set it to 1, but naturally it is possible to perform some sort of sensitivity analysis when this committee model option is set to greater than 1. In terms of specifying the target variable and covariates, we do not define a formula as we did earlier for the MLR model. Rather, we specify the columns explicitly: those that are the covariates (x), and the one that is the target variable (y).
library(Cubist)
set.seed(123)
training <- sample(nrow(DSM_data), 0.7 * nrow(DSM_data))
mDat <- DSM_data[training, ]
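The model fit referred to below can be reproduced with a call of this form (reconstructed from the Call element of the summary output):

edge.cub.Exp <- cubist(x = mDat[, c("elevation", "twi", "radK", "landsat_b3",
    "landsat_b4")], y = mDat$log_cStock0_5, committees = 1,
    control = cubistControl(rules = 5, extrapolation = 5))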
The output generated from fitting a Cubist model can be retrieved using the
summary function. This provides information about the conditions for each rule,
the regression models for each rule, and information about the diagnostics of the
model fit, plus the frequency of which the covariates were used as conditions and/or
within a model.
summary(edge.cub.Exp)
##
## Call:
## cubist.default(x = mDat[, c("elevation", "twi", "radK",
## "landsat_b3", "landsat_b4")], y = mDat$log_cStock0_5, committees =
## 1, control = cubistControl(rules = 5, extrapolation = 5))
##
##
## Cubist [Release 2.07 GPL Edition] Mon Feb 01 13:18:28 2016
## ---------------------------------
##
## Target attribute ‘outcome’
##
## Read 238 cases (6 attributes) from undefined.data
##
## Model:
##
## Rule 1: [238 cases, mean 2.7634952, range -1.147828 to 4.533301,
## est err 0.3358926]
##
## outcome = -0.409619 + 0.0066 elevation + 0.063 twi + 0.0042 landsat_b4
##
##
## Evaluation on training data (238 cases):
##
## Average |error| 0.3968150
It appears the edge.cub.Exp model contains only 1 rule in this case. A useful feature of the Cubist model is that it does not unnecessarily overfit and partition the data. Let's see how well it validates.
# Internal validation
Cubist.pred.C <- predict(edge.cub.Exp, newdata = DSM_data[training, ])
goof(observed = DSM_data$log_cStock0_5[training], predicted = Cubist.pred.C)
# External validation
Cubist.pred.V <- predict(edge.cub.Exp, newdata = DSM_data[-training, ])
goof(observed = DSM_data$log_cStock0_5[-training], predicted = Cubist.pred.V)
The calibration model validates quite well, but its performance against the validation data does not appear to be as good. From Fig. 5.7 it appears a few observations were very much under-predicted, which has had an impact on the subsequent model performance diagnostics; otherwise the Cubist model performs reasonably well.
Creating the map resulting from the edge.cub.Exp model can be imple-
mented as before using the raster predict function (Fig. 5.8).
map.cubist.r1 <- predict(covStack, edge.cub.Exp, "cStock_0_5_cubist.tif",
format = "GTiff", datatype = "FLT4S", overwrite = TRUE)
5.5 Random Forests

An increasingly popular data mining algorithm in DSM and soil science, and even in the applied sciences in general, is the Random Forests model. This algorithm is provided in the randomForest package and can be used for both regression and
Fig. 5.7 Cubist model xy-plot plot of log SOC stock 0–5 cm (validation data set)
Fig. 5.8 Cubist model predicted log SOC stock 0–5 cm Edgeroi
classification purposes. Random Forests are an ensemble (bagged) decision tree model: they are an ensemble learning method for classification and regression that operates by constructing a multitude of decision trees at training time, which are later aggregated to give one single prediction for each observation in a data set. For regression the prediction is the average of the individual tree outputs, whereas in classification the trees vote by majority on the correct classification (mode). For further information regarding Random Forests and their underlying theory it is worth consulting Breiman (2001), and Grimm et al. (2008) as an example of their application in DSM studies.
Fitting a Random Forest model in R is relatively straightforward. It is worth
consulting the richly populated help files regarding the randomForest package
and its functions. We will be using the randomForest function and a couple of
extractor functions to tease out some of the model fitting diagnostics. The formula structure of the model will be familiar. As for the Cubist model, there are many
model fitting parameters to consider such as the number of trees to build (ntree)
and the number of variables (covariates) that are randomly sampled as candidates
at each decision tree split (mtry), plus many others. The print function allows
one to quickly assess the model fit. The importance parameter (logical variable)
used within the randomForest function specifies that we want to also assess the
importance of the covariates used.
library(randomForest)
set.seed(123)
training <- sample(nrow(DSM_data), 0.7 * nrow(DSM_data))
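The model fit printed below can be reproduced with a call of this form (reconstructed from the Call element of the print output):

edge.RF.Exp <- randomForest(log_cStock0_5 ~ elevation + twi + radK +
    landsat_b3 + landsat_b4, data = DSM_data[training, ],
    importance = TRUE, ntree = 1000)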
print(edge.RF.Exp)
##
## Call:
## randomForest(formula = log_cStock0_5 ~ elevation + twi + radK +
## landsat_b3 + landsat_b4, data = DSM_data[training, ],
## importance = TRUE, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 0.2747959
## % Var explained: 15.62
Using the varImpPlot function allows one to visualize which covariates are
of most importance to the prediction of our soil variable (Fig. 5.9).
varImpPlot(edge.RF.Exp)
Fig. 5.9 Covariate importance (ranking of predictors) from Random Forest model fitting
There is a lot of talk about how the variable importance is measured in Random
Forest models. So it is probably best to quote from the source:
Here are the definitions of the variable importance measures. For each tree, the prediction
accuracy on the out-of-bag portion of the data is recorded. Then the same is done after
permuting each predictor variable. The difference between the two accuracy measurements
are then averaged over all trees, and normalized by the standard error. For regression, the
MSE is computed on the out-of-bag data for each tree, and then the same computed after
permuting a variable. The differences are averaged and normalized by the standard error. If
the standard error is equal to 0 for a variable, the division is not done (but the measure is
almost always equal to 0 in that case). For the node purity, it is the total decrease in node
impurities from splitting on the variable, averaged over all trees. For classification, the node
impurity is measured by the Gini index. For regression, it is measured by residual sum of
squares.
# External validation
RF.pred.V <- predict(edge.RF.Exp, newdata = DSM_data[-training, ])
goof(observed = DSM_data$log_cStock0_5[-training], predicted = RF.pred.V)
Essentially these results are quite similar to those of the validations from the other models. One needs to be careful about accepting very good model fit results without a proper external validation. This is particularly pertinent for Random Forest models, which have a tendency to provide excellent calibration results that can often give the wrong impression about their suitability for a given application. Therefore, when using Random Forest models it pays to look at the validation on the out-of-bag samples when evaluating the goodness of fit of the model.
So let's look at the map resulting from applying the edge.RF.Exp model to
the covariates (Fig. 5.10).
map.RF.r1 <- predict(covStack, edge.RF.Exp, "cStock_0_5_RF.tif",
    format = "GTiff", datatype = "FLT4S", overwrite = TRUE)
plot(map.RF.r1,
    main = "Random Forest model predicted log carbon stocks (0-5cm)")
Fig. 5.10 Random Forest model predicted log SOC stock 0–5 cm Edgeroi
5.6 Advanced Work: Model Fitting with Caret Package
It quickly becomes apparent that there are many variants of prediction functions
that could be used for DSM. As was observed, each of the models used has its
relative advantages and disadvantages, and each has its own specific
parameterisations and quirks for fitting. Sometimes the various parameters used
for model training are chosen without any sort of optimisation, or even due
consideration. Sometimes we might be confronted with many possible model
structures, and it is often difficult to choose which to use; we may simply default
to a model we know well or have used often without considering alternatives.
This is where the caret R package (http://topepo.github.io/caret/index.html)
comes into its own in terms of efficiency and streamlining the workflow for
fitting models and optimising some of those parameter variables. As the dedicated
website indicates, the caret package (short for Classification And REgression
Training) is a set of functions that attempt to streamline the process for creating
predictive models. As we have seen, there are many different modeling functions
in R. Some have different syntax for model training and/or prediction. The caret
package provides a uniform interface to the various functions themselves, as well
as a way to standardize common tasks (such as parameter tuning and variable
importance). There are currently over 300 model functions that the caret package
interfaces with.
To begin, we first need to load the package into R:
library(caret)
The workhorse of the caret package is the train function. We can specify
the model to be fitted in two ways:
# 1.
fit <- train(form = log_cStock0_5 ~ elevation + twi + radK +
landsat_b3 + landsat_b4, data = DSM_data, method = "lm")
# or 2.
fit <- train(x = DSM_data[, 4:8], y = DSM_data$log_cStock0_5,
method = "lm")
fit
## Linear Regression
##
## 341 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times)
## Summary of sample sizes: 273, 273, 272, 273, 273, 273, ...
## Resampling results
##
## RMSE Rsquared RMSE SD Rsquared SD
## 0.4918208 0.2197718 0.09759374 0.09535488
##
##
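The resampling summary above reports a repeated 5-fold cross-validation, whereas train uses bootstrap resampling by default. A hedged sketch of how such a resampling scheme could be requested (the object name fitControl is an assumption):
# request 5-fold cross-validation, repeated 10 times
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 10)
fit <- train(x = DSM_data[, 4:8], y = DSM_data$log_cStock0_5,
    method = "lm", trControl = fitControl)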
There are a lot of other potential models that you could also consider for DSM. Check
out http://topepo.github.io/caret/modelList.html, or count them as below:
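The command that produced the count below is not shown in the text; one hedged possibility, using caret's getModelInfo function, is:
# count the models caret can interface with
length(names(getModelInfo()))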
## [1] 372
You can choose which model to use in the train function with the method
option. You will note that the fitting of the Cubist and Random Forest models below
automatically attempts to optimise some of the fitting parameters, for example the
mtry parameter for Random Forest. To look at which parameters can be optimised for
each model in caret we can use the modelLookup function.
# Cubist model
modelLookup(model = "cubist")
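The calls that fit the fit_cubist model (and a corresponding Random Forest model) via train are not shown in the text; a hedged sketch, reusing the covariate columns from above (the object name fit_rf is an assumption), might be:
fit_cubist <- train(x = DSM_data[, 4:8], y = DSM_data$log_cStock0_5,
    method = "cubist")
fit_rf <- train(x = DSM_data[, 4:8], y = DSM_data$log_cStock0_5,
    method = "rf")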
Using the fitted model, predictions can be achieved with the predict function:
# Cubist model
pred_cubist <- predict(fit_cubist, DSM_data)
# To raster data
pred_cubistMap <- predict(covStack, fit_cubist)
5.7 Regression Kriging
In the previous sections we looked at a few soil spatial prediction functions which,
at the most fundamental level, target the correlation between the target soil variable
and the available covariate information. We fitted a number of models ranging from
simple linear functions to non-linear functions such as regression trees, through to
more complicated data mining techniques (Cubist and Random Forest). In this
section we will extend this DSM approach from what are called deterministic
models to also include the spatially correlated residuals that result from fitting these
models.
The approach we will now concentrate on is a hybrid approach to modelling,
whereby the predictions of the target variable are made via a deterministic method
(a regression model with covariate information) and a stochastic method, where we
determine the spatial auto-correlation of the model residuals with a variogram. The
deterministic model essentially "detrends" the data, leaving behind the residuals,
for which we need to investigate whether there is additional spatial structure that
could be added to the regression model predictions. These residuals are the random
component (e) of the scorpan model. This method is described as regression kriging
and has been formally described in Odeh et al. (1995); it is synonymous with
universal kriging (Hengl et al. 2007), which is the formal linear model procedure for
this soil spatial modeling approach. The purpose of this exercise is to introduce some
basic concepts of regression kriging. You will already have had some experience with
regression models, and we have also briefly investigated the fundamental concepts of
kriging, for which the variogram is central.
The universal kriging function in R is found in the gstat package. It is useful in that
both the regression model and the variogram modeling of the residuals are handled
together. Using universal kriging, one can efficiently derive prediction
uncertainties by way of the kriging variance. A limitation of universal kriging in
the true sense of the model parameter fitting is that the model is linear. The general
preference in DSM studies is to use non-linear and recursive models that do not
require strict model assumptions, such as a linear relationship between the target
variable and the covariates.
One of the strict requirements of universal kriging in gstat is that the CRS
(coordinate reference system) of the point data and covariates must be exactly the
same. First we will take a subset of the data to use for an external validation.
Unfortunately some of our data has to be sacrificed for this.
set.seed(123)
training <- sample(nrow(DSM_data), 0.7 * nrow(DSM_data))
cDat <- DSM_data[training, ]
coordinates(cDat) <- ~x + y
crs(cDat) <- "+proj=utm +zone=55 +south +ellps=WGS84 +datum=WGS84
+units=m +no_defs"
crs(covStack) = crs(cDat)
# check
crs(cDat)
## CRS arguments:
## +proj=utm +zone=55 +south +ellps=WGS84 +datum=WGS84 +units=m
## +no_defs +towgs84=0,0,0
crs(covStack)
## CRS arguments:
## +proj=utm +zone=55 +south +ellps=WGS84 +datum=WGS84 +units=m
## +no_defs +towgs84=0,0,0
Now let's parameterise the universal kriging model; we will use all of the
available covariates.
library(gstat)
vgm1 <- variogram(log_cStock0_5 ~ elevation + twi + radK +
landsat_b3 + landsat_b4, cDat, width = 250, cressie = TRUE,
cutoff = 10000)
mod <- vgm(psill = var(cDat$log_cStock0_5), "Exp", range = 5000, nugget = 0)
model_1 <- fit.variogram(vgm1, mod)
model_1
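The construction of the gstat object (gUK) that is used below with the interpolate function is not shown in the text; a hedged sketch, with the id label assumed, is:
gUK <- gstat(NULL, id = "log_cStock0_5_UK", formula = log_cStock0_5 ~
    elevation + twi + radK + landsat_b3 + landsat_b4, data = cDat,
    model = model_1)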
Using the validation data we can assess the performance of universal kriging
using the goof function.
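The kriging of the validation points (UK.preds.V) is also not shown; a hedged sketch follows. Note that the goof call below indexes the prediction column of the resulting data frame as column 3, so adjust the index if the column order of your krige output differs.
vDat <- DSM_data[-training, ]
coordinates(vDat) <- ~x + y
crs(vDat) <- crs(cDat)
UK.preds.V <- as.data.frame(krige(log_cStock0_5 ~ elevation + twi + radK +
    landsat_b3 + landsat_b4, cDat, model = model_1, newdata = vDat))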
goof(observed = DSM_data$log_cStock0_5[-training],
predicted = UK.preds.V[,3])
The universal kriging model performs more-or-less the same as the MLR model
that was fitted earlier.
Applying the universal kriging model spatially is facilitated through the
interpolate function from raster. One can also use the clusterR function
used earlier in order to speed things up a bit by applying the model over multiple
compute nodes. Kriging results in two main outputs: the prediction and the
prediction variance. When using the interpolate function we can control which
output is returned by changing the index parameter. The script below results in the
maps in Fig. 5.11.
# prediction variance
UK.Pvar.map <- interpolate(covStack, gUK, xyOnly = FALSE, index = 2)
plot(UK.Pvar.map, main = "Universal kriging prediction variance")
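The companion prediction map (index = 1) that appears in Fig. 5.11 is not shown in the text; a hedged sketch is:
# prediction
UK.P.map <- interpolate(covStack, gUK, xyOnly = FALSE, index = 1)
plot(UK.P.map, main = "Universal kriging prediction")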
Fig. 5.11 Universal kriging prediction and prediction variance of log SOC stock 0–5 cm Edgeroi
Even though we do not expect to achieve much by modeling the spatial structure
of the model residuals using the Edgeroi data, the following example provides
the steps one would use to perform regression kriging that incorporates a complex
model structure such as a data mining algorithm. Here we will use the Cubist model
that was fitted earlier. Let's start from the beginning.
set.seed(123)
training <- sample(nrow(DSM_data), 0.7 * nrow(DSM_data))
mDat <- DSM_data[training, ]
Now derive the model residual, which is the model prediction subtracted from the
observed value.
mDat$residual <- mDat$log_cStock0_5 - predict(edge.cub.Exp,
newdata = mDat)
mean(mDat$residual)
## [1] 0.01197274
If you check the histogram of these residuals you will find that the mean is around
zero and the data seem normally distributed. Now we can assess the residuals for
any spatial structure.
coordinates(mDat) <- ~x + y
crs(mDat) <- "+proj=utm +zone=55 +south +ellps=WGS84 +datum=WGS84
+units=m +no_defs"
With the two model components together, we can now compare the external
validation statistics of the Cubist model alone against the Cubist model plus the
kriged residuals, as sketched below.
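The intermediate steps (fitting a variogram to the Cubist residuals, kriging those residuals at the validation sites, and adding them to the Cubist predictions) are not shown in the text; a hedged sketch follows, with object names and the variogram settings assumed (mirroring those used earlier):
# residual variogram
vgm.r <- variogram(residual ~ 1, mDat, width = 250, cutoff = 10000)
mod.r <- vgm(psill = var(mDat$residual), "Exp", range = 5000, nugget = 0)
model.r <- fit.variogram(vgm.r, mod.r)
# krige the residuals at the validation sites
vDat <- DSM_data[-training, ]
coordinates(vDat) <- ~x + y
crs(vDat) <- crs(mDat)
RK.res.V <- krige(residual ~ 1, mDat, model = model.r, newdata = vDat)
# Cubist predictions with and without the kriged residuals
cubist.pred.V <- predict(edge.cub.Exp, newdata = DSM_data[-training, ])
RK.pred.V <- cubist.pred.V + RK.res.V$var1.pred
# Cubist model alone
goof(observed = DSM_data$log_cStock0_5[-training], predicted = cubist.pred.V)
# Cubist model plus kriged residuals
goof(observed = DSM_data$log_cStock0_5[-training], predicted = RK.pred.V)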
Fig. 5.12 Regression kriging predictions with the Cubist model. Log carbon stock (0–5 cm), Edgeroi
6 Categorical Soil Attribute Modeling and Mapping
The other form of soil spatial prediction functions are those dealing with categorical
target variables such as soil classes. Naturally the models we will use further on are
not specific to soil classes but can be generally applied to any type of categorical
data too.
In the examples below we will demonstrate the prediction of soil-landscape
classes termed: Terrons. Terron relates to a soil and landscape concept that has
an associated model. The concept was first described by Carre and McBratney
(2005). The embodiment of Terron is a continuous soil-landscape unit or class which
combines soil knowledge, landscape information, and their interactions together.
Malone et al. (2014) detailed an approach for defining Terrons in the Lower Hunter
Valley, NSW, Australia. This area is a prominent wine-growing region of Australia,
and the development of Terrons is a first step in the realization of viticultural
terroir—an established concept of identity that incorporates much more than just soil
and landscape qualities (Vaudour et al. 2015). In the work by Malone et al. (2014)
they defined 12 Terrons for the Hunter Valley, which are distinguished by such soil
and landscape characteristics as: geomorphometric attributes (derived from a digital
elevation model) and specific attributes pertaining to soil pH, clay percentage, soil
mineralogy (clay types and presence of iron oxides), continuous soil classes, and
presence or absence of marl. The brief in this chapter is to predict the Terron classes
across the prescribed Lower Hunter Valley, given a set of observations (sampled
directly from the Terron map produced by Malone et al. (2014)).
By now you will be familiar with the process of fitting models and plotting/mapping
the outputs. Thus, in many ways, categorical data modeling is similar (in terms of
implementation) to prediction modeling of continuous variables. The example in the
previous chapter demonstrating the use of the caret package can also be applied
to categorical variables, where you will find many model functions suited to that
type of data.
6.1 Model Validation of Categorical Prediction Models
The special characteristic of categorical data and its prediction within models is that
a class is either predicted or it is not. For binary variables, the prediction is either
a yes or no, black or white, present or absent, etc. For multinomial variables, there
are more than two classes, for example black, grey, or white, etc. (which could
actually be an ordinal categorical classification, rather than nominal). There is no
in-between; classes are discrete entities. One exception is that some models do
estimate the probability of the existence of a particular class, which will be
touched on later. Additionally, there are methods of fuzzy classification which are
common in the soil sciences (McBratney and Odeh 1997), but these will not be
covered in this section. Discrete categories, and models for their prediction, require
other measures of validation than those used for continuous variables. The most
important quality measures are described in Congalton (1991) and include:
1. Overall accuracy
2. User’s accuracy
3. Producer’s accuracy
4. Kappa coefficient of agreement
Using a contrived example, each of these quality measures will be illustrated. We
will make a 4 × 4 matrix, call it con.mat, and append some column and row
names—in this case Australian Soil Classification Order codes. We then populate
the matrix with some more-or-less random integers.
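The matrix-building code is not shown in the text; a reconstruction consistent with the printed output below is:
con.mat <- matrix(c(5, 0, 0, 0,
    0, 15, 1, 10,
    1, 0, 31, 2,
    2, 5, 0, 11), nrow = 4, byrow = TRUE)
rownames(con.mat) <- colnames(con.mat) <- c("DE", "VE", "CH", "KU")
con.mat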
## DE VE CH KU
## DE 5 0 0 0
## VE 0 15 1 10
## CH 1 0 31 2
## KU 2 5 0 11
con.mat takes the form of a confusion matrix, and ones such as this are often
the output of a classification model. If we summed each of the columns (using the
colSums function), we would obtain the total number of observations for each soil
6.1 Model Validation of Categorical Prediction Models 153
class. Having column sums reflecting the number of observations is a widely used
convention in classification studies.
colSums(con.mat)
## DE VE CH KU
## 8 20 32 23
Similarly, if we summed each of the rows we would retrieve the total number of
predictions of each soil class. The predictions could have been made through any
sort of model or classification process.
rowSums(con.mat)
## DE VE CH KU
## 5 26 34 18
Therefore, the numbers on the diagonal of the matrix will indicate fidelity
between the observed class and the subsequent prediction. Numbers on the off-
diagonals indicate a mis-classification or error. Overall accuracy is therefore
computed by dividing the total correct (i.e., the sum of the diagonal) by the total
number of observations (sum of the column sums).
ceiling(sum(diag(con.mat))/sum(colSums(con.mat)) * 100)
## [1] 75
Producer's accuracy is, for each class, the number of correctly classified observations
divided by the total number of observations of that class (the diagonal divided by the
column sums). User's accuracy is the number correctly classified divided by the total
number of predictions of that class (the diagonal divided by the row sums).
ceiling(diag(con.mat)/colSums(con.mat) * 100)
## DE VE CH KU
## 63 75 97 48
ceiling(diag(con.mat)/rowSums(con.mat) * 100)
## DE VE CH KU
## 100 58 92 62
$$p_e = \sum_{i=1}^{n}\left(\frac{\mathrm{colSum}_i}{TO}\right)\left(\frac{\mathrm{rowSum}_i}{TO}\right) \qquad (6.2)$$
TO is the total number of observations and n is the number of classes. Rather than
scripting the above equations, the kappa coefficient together with the other accuracy
measures are contained in a function called goofcat in the ithir package. As we
already have a confusion matrix prepared, we can enter it directly into the function
as in the script below.
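The call itself is not shown; it is likely of the form below, though the argument names are an assumption:
goofcat(conf.mat = con.mat, imp = TRUE)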
## $confusion_matrix
## DE VE CH KU
## DE 5 0 0 0
## VE 0 15 1 10
## CH 1 0 31 2
## KU 2 5 0 11
##
## $overall_accuracy
## [1] 75
##
6.2 Multinomial Logistic Regression 155
## $producers_accuracy
## DE VE CH KU
## 63 75 97 48
##
## $users_accuracy
## DE VE CH KU
## 100 58 92 62
##
## $kappa
## [1] 0.6389062
A rule of thumb for the interpretation of the Kappa coefficient, as indicated in
Landis and Koch (1977), is:
< 0: Less than chance agreement
0.01–0.20: Slight agreement
0.21–0.40: Fair agreement
0.41–0.60: Moderate agreement
0.61–0.80: Substantial agreement
0.80–0.99: Almost perfect agreement
We will be using these prediction quality indices for categorical variable
prediction models in the following examples.
6.2 Multinomial Logistic Regression
In R we can use the multinom function from the nnet package to perform
multinomial logistic regression. There are other implementations of this type of
model in R, so it is worth comparing and contrasting them. Fitting multinom is
just like fitting a linear model, as seen below.
As described earlier, the data to be used for the following modelling exercises
are Terron classes as sampled from the map presented in Malone et al. (2014). The
sample data contain 1000 entries, among which there are 12 different Terron classes.
Before getting into the modeling, we first load in the data and then perform the
covariate layer intersection using the suite of environmental variables contained in
the hunterCovariates data object in the ithir package. The small selection
of covariates covers an area of approximately 220 km2 at a spatial resolution of 25 m.
They include covariates derived from a DEM: altitude above channel network (AACN),
solar light insolation and terrain wetness index (TWI). Gamma radiometric data
(total count) are also included, together with a surface that depicts soil drainage in
the form of a continuous index (ranging from 0 to 5). These five covariate layers
are stacked together via a rasterStack.
library(ithir)
library(sp)
library(raster)
data(terron.dat)
data(hunterCovariates)
coordinates(terron.dat) <- ~x + y
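The covariate intersection step itself is not shown in the text; a hedged sketch using the raster extract function is:
# extract the covariate values at the Terron observation sites
DSM_data <- extract(hunterCovariates, terron.dat, sp = 1, method = "simple")
DSM_data <- as.data.frame(DSM_data)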
It is always good practice to check whether any of the observational data returned
NA values for any of the covariates. If there are NA values, it indicates that those
observations lie outside the extent of the covariate layers; it is best to remove them
from the data set.
which(!complete.cases(DSM_data))
## integer(0)
Now for model fitting. The target variable is terron, so let's just use all the
available covariates in the model. We will also subset the data for an external
validation, i.e. a random hold-back validation.
library(nnet)
set.seed(655)
training <- sample(nrow(DSM_data), 0.7 * nrow(DSM_data))
hv.MNLR <- multinom(terron ~ AACN + Drainage.Index
+ Light.Insolation + TWI + Gamma.Total.Count,
data = DSM_data[training, ])
Using the summary function allows us to see the linear models for each Terron
class, which are the result of the log-odds of each soil class modeled as a linear
combination of the covariates. We can also see the probabilities of occurrence for
each Terron class at each observation location by using the fitted function.
summary(hv.MNLR)
Subsequently, we can also determine the most probable Terron class using the
predict function.
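The prediction call that produces the class counts below is not shown; a hedged sketch is:
pred.hv.MNLR <- predict(hv.MNLR, newdata = DSM_data[training, ])
summary(pred.hv.MNLR)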
## 1 2 3 4 5 6 7 8 9 10 11 12
## 21 7 60 62 110 73 169 115 18 29 12 24
Let's now perform an internal validation of the model to assess its general
performance. Here we again use the goofcat function, but this time we pass in
two vectors corresponding to the observations and predictions respectively.
goofcat(observed = DSM_data$terron[training],
predicted = pred.hv.MNLR)
## $confusion_matrix
## 1 2 3 4 5 6 7 8 9 10 11 12
## 1 13 3 1 2 0 0 1 0 1 0 0 0
## 2 1 5 0 0 0 0 0 0 1 0 0 0
## 3 0 1 35 2 0 0 8 0 2 3 6 3
## 4 4 0 9 30 1 1 3 3 5 0 6 0
## 5 0 0 0 0 56 10 1 10 13 17 1 2
## 6 0 0 0 1 16 47 0 6 1 2 0 0
## 7 0 0 14 7 4 0 92 16 7 8 9 12
## 8 0 0 0 4 10 7 14 57 4 6 13 0
## 9 0 0 0 5 3 0 2 2 5 0 1 0
## 10 0 0 0 0 4 1 2 6 3 13 0 0
## 11 0 0 4 3 0 0 0 3 0 0 2 0
## 12 1 0 2 1 0 0 1 0 0 4 0 15
##
## $overall_accuracy
## [1] 53
##
## $producers_accuracy
## 1 2 3 4 5 6 7 8 9 10 11 12
## 69 56 54 55 60 72 75 56 12 25 6 47
##
## $users_accuracy
## 1 2 3 4 5 6 7 8 9 10 11 12
## 62 72 59 49 51 65 55 50 28 45 17 63
##
## $kappa
## [1] 0.4637285
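The external (random hold-back) validation statistics that follow would be produced by a call of the following form (a hedged sketch):
pred.hv.MNLR.V <- predict(hv.MNLR, newdata = DSM_data[-training, ])
goofcat(observed = DSM_data$terron[-training], predicted = pred.hv.MNLR.V)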
## $confusion_matrix
## 1 2 3 4 5 6 7 8 9 10 11 12
## 1 1 1 1 2 0 0 0 0 1 0 0 0
## 2 5 0 2 0 0 0 0 0 1 1 0 0
## 3 0 0 13 1 0 0 5 0 1 2 2 2
## 4 2 3 9 8 0 0 0 4 8 0 3 0
## 5 0 0 0 0 21 7 0 8 6 3 0 2
## 6 0 0 0 1 8 18 0 5 1 0 0 0
## 7 0 0 8 4 0 0 38 9 4 5 2 5
## 8 0 0 0 3 3 0 6 15 1 3 1 0
## 9 0 0 0 1 2 0 0 1 0 0 0 0
## 10 0 0 0 0 4 1 1 4 1 9 2 0
## 11 0 0 0 0 0 0 0 1 0 0 0 0
## 12 0 0 1 0 0 0 1 0 0 2 0 4
##
## $overall_accuracy
## [1] 43
##
## $producers_accuracy
## 1 2 3 4 5 6 7 8 9 10 11 12
## 13 0 39 40 56 70 75 32 0 36 0 31
##
## $users_accuracy
## 1 2 3 4 5 6 7 8 9 10 11 12
## 17 0 50 22 45 55 51 47 0 41 0 50
##
## $kappa
## [1] 0.3476539
Using the raster predict function is the means of applying the hv.MNLR
model across the whole area. Note that the clusterR function can also be used
here if there is a requirement to perform the spatial prediction across multiple
compute nodes. Note also that it is possible for multinomial logistic regression
to create the map of the most probable class, as well as maps of the probabilities for
all classes. The first script example below is for mapping the most probable class,
which is specified by setting the type parameter to "class". If probabilities are
required, "probs" would be used for the type parameter, together with an
index integer to indicate which class probabilities you wish to map. The second
script example below shows the parameterisation for predicting the probabilities for
Terron class 1.
# class prediction
map.MNLR.c <- predict(covStack, hv.MNLR, type = "class",
filename = "hv_MNLR_class.tif",format = "GTiff",
overwrite = T, datatype = "INT2S")
# class probabilities
map.MNLR.p <- predict(covStack, hv.MNLR, type = "probs",
index = 1, filename = "edge_MNLR_probs1.tif", format = "GTiff",
overwrite = T, datatype = "FLT4S")
[Fig. 6.1 Map of the most probable Terron class (HVT_001 to HVT_011) predicted by the multinomial logistic regression model]
The color scheme used for the map follows that of Malone et al. (2014). The colors are defined in terms of HEX color codes. A
very good resource for selecting colors or deciding on color ramps for maps is
colorbrewer located at http://colorbrewer2.org/. The script below produces the map
in Fig. 6.1.
## HEX colors
area_colors <- c("#FF0000", "#38A800", "#73DFFF", "#FFEBAF",
"#A8A800", "#0070FF", "#FFA77F", "#7AF5CA", "#D7B09E",
"#CCCCCC", "#B4D79E", "#FFFF00")
# plot (levelplot for Raster objects requires the rasterVis package)
library(rasterVis)
levelplot(map.MNLR.c, col.regions = area_colors,
    xlab = "", ylab = "")
6.3 C5 Decision Trees
The C5 decision tree model is available in the C50 package. The function C5.0
fits classification tree models or rule-based models using Quinlan's C5.0 algorithm
(Quinlan 1993).
Essentially we will go through the same process as we did for the multinomial
logistic regression. The C5.0 function and its internal parameters are similar in
nature to that of the Cubist function for predicting continuous variables. The
trials parameter lets you implement a “boosted” classification tree process, with
the results aggregated at the end. There are also many other useful model tuning
parameters in the C5.0Control parameter set too that are worth a look. Just see
the help files for more information.
Using the same training and validation sets as before, we will fit the C5 model as
in the script below.
library(C50)
hv.C5 <- C5.0(x = DSM_data[training, c("AACN", "Drainage.Index",
    "Light.Insolation", "TWI", "Gamma.Total.Count")],
    y = DSM_data$terron[training], trials = 1, rules = FALSE)
By calling the summary function, some useful model fit diagnostics are given.
These include the tree structure, together with the covariates used and mis-
classification error of the model, as well as a rudimentary confusion matrix. A useful
feature of the C5 model is that it can omit unnecessary covariates in an automated
fashion.
The predict function can either return the predicted class or the confidence
of the predictions (which is controlled using the type=“prob” parameter). The
probabilities are quantified such that if an observation is classified by a single leaf
of a decision tree, the confidence value is the proportion of training cases at that leaf
that belong to the predicted class. If more than one leaf is involved (i.e., one or more
of the attributes on the path has missing values), the value is a weighted sum of the
individual leaves’ confidences. For rule-classifiers, each applicable rule votes for a
class with the voting weight being the rule’s confidence value. If the sum of the votes
for class C is W(C), then the predicted class P is chosen so that W(P) is maximal, and
the confidence is the greater of (1) the voting weight of the most specific applicable
rule for predicted class P, or (2) the average vote for class P (that is, W(P) divided by
the number of applicable rules for class P). Boosted classifiers are similar, but
individual classifiers vote for a class with weight equal to their confidence value.
Overall, the confidences associated with the classes for each observation are made
to sum to 1.
So let's look at the calibration and validation statistics. First, the calibration
statistics:
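The calibration call is not shown; a hedged sketch is:
pred.hv.C5 <- predict(hv.C5, newdata = DSM_data[training, ])
goofcat(observed = DSM_data$terron[training], predicted = pred.hv.C5)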
## $confusion_matrix
## 1 2 3 4 5 6 7 8 9 10 11 12
## 1 17 9 4 4 0 0 0 0 1 0 0 1
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 42 4 0 0 13 0 4 3 7 14
## 4 2 0 12 37 0 1 5 6 9 1 9 1
## 5 0 0 0 0 48 2 1 9 14 18 0 0
## 6 0 0 0 0 22 55 0 3 2 1 1 0
## 7 0 0 7 6 7 0 95 18 9 12 5 14
## 8 0 0 0 4 12 8 2 61 3 4 9 2
## 9 0 0 0 0 0 0 0 0 0 0 0 0
## 10 0 0 0 0 5 0 8 6 0 14 7 0
## 11 0 0 0 0 0 0 0 0 0 0 0 0
## 12 0 0 0 0 0 0 0 0 0 0 0 0
##
## $overall_accuracy
## [1] 53
##
## $producers_accuracy
## 1 2 3 4 5 6 7 8 9 10 11 12
## 90 0 65 68 52 84 77 60 0 27 0 0
##
## $users_accuracy
## 1 2 3 4 5 6 7 8 9 10 11 12
## 48 NaN 49 45 53 66 55 59 NaN 35 NaN NaN
##
## $kappa
## [1] 0.4618099
It will be noticed that some of the Terron classes failed to be predicted by the
fitted model. For example Terron classes 2, 9, 11, and 12 were all predicted as being
a different class. All observations of Terron class 2 were predicted as Terron class
1. Doing the external validation we return the following:
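Again, the validation call is not shown; a hedged sketch is:
pred.hv.C5.V <- predict(hv.C5, newdata = DSM_data[-training, ])
goofcat(observed = DSM_data$terron[-training], predicted = pred.hv.C5.V)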
## $confusion_matrix
## 1 2 3 4 5 6 7 8 9 10 11 12
## 1 7 3 5 1 0 0 1 0 2 1 0 1
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 14 1 0 0 4 0 2 2 2 6
## 4 1 1 5 9 0 0 4 3 7 0 0 0
## 5 0 0 0 0 20 2 0 7 5 2 0 1
## 6 0 0 0 0 10 22 0 7 0 1 0 0
## 7 0 0 10 5 0 0 35 11 2 7 5 3
## 8 0 0 0 3 6 2 4 16 2 5 2 2
## 9 0 0 0 0 0 0 0 0 0 0 0 0
## 10 0 0 0 1 2 0 3 3 4 7 1 0
## 11 0 0 0 0 0 0 0 0 0 0 0 0
## 12 0 0 0 0 0 0 0 0 0 0 0 0
##
## $overall_accuracy
## [1] 44
##
## $producers_accuracy
## 1 2 3 4 5 6 7 8 9 10 11 12
## 88 0 42 45 53 85 69 35 0 29 0 0
##
## $users_accuracy
## 1 2 3 4 5 6 7 8 9 10 11 12
## 34 NaN 46 30 55 56 45 39 NaN 34 NaN NaN
##
## $kappa
## [1] 0.3565075
And finally, we create the map derived from the hv.C5 model using the raster
predict function (Fig. 6.2). Note that the C5 model returned 0 % producer's
accuracy for Terron classes 2, 9, 11 and 12. These data account for only a small
proportion of the data set, and/or they may be similar to other existing Terron
classes (based on the available predictive covariates). Consequently, they did not
feature in the hv.C5 model and, ultimately, the final map.
[Fig. 6.2 Map of the most probable Terron class predicted by the hv.C5 model (only the classes retained by the model appear in the legend)]
# class prediction
map.C5.c <- predict(covStack, hv.C5, type = "class",
filename = "hv_C5_class.tif",format = "GTiff",
overwrite = T, datatype = "INT2S")
# plot
levelplot(map.C5.c, col.regions = area_colors,
xlab = "", ylab = "")
6.4 Random Forests
The final model we will look at is the Random Forest, which we should be familiar
with by now as this model type was examined in the continuous variable prediction
methods section. It can also be used for categorical variables. Extractor functions
like print and importance give some useful information about the model
performance.
library(randomForest)
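The fitting of the hv.RF model is not shown in the text; a hedged sketch (the ntree setting is an assumption) is:
hv.RF <- randomForest(terron ~ AACN + Drainage.Index + Light.Insolation +
    TWI + Gamma.Total.Count, data = DSM_data[training, ],
    importance = TRUE, ntree = 500)
print(hv.RF)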
Three types of prediction outputs can be generated from Random Forest models,
and these are specified via the type parameter of the predict function. The
different types are response (the predicted class), prob (class probabilities) or
vote (vote counts, which really just appear to return the probabilities).
# Prediction of classes
predict(hv.RF, type = "response", newdata = DSM_data[training, ])
6.4 Random Forests 165
# Class probabilities
predict(hv.RF, type = "prob", newdata = DSM_data[training, ])
From the print diagnostics of the hv.RF model, a confusion matrix is
automatically generated, except it has a different orientation to what we have been
looking at in previous examples. This confusion matrix is based on what is called
the OOB, or out-of-bag, data, i.e. the model is validated dynamically with
observations withheld from the model fit. So let's just evaluate the model as we have
done for the previous models. For calibration:
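A hedged sketch of the calibration call producing the output below:
goofcat(observed = DSM_data$terron[training],
    predicted = predict(hv.RF, newdata = DSM_data[training, ]))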
## $confusion_matrix
## 1 2 3 4 5 6 7 8 9 10 11 12
## 1 19 0 0 0 0 0 0 0 0 0 0 0
## 2 0 9 0 0 0 0 0 0 0 0 0 0
## 3 0 0 65 0 0 0 0 0 0 0 0 0
## 4 0 0 0 55 0 0 0 0 0 0 0 0
## 5 0 0 0 0 94 0 0 0 0 0 0 0
## 6 0 0 0 0 0 66 0 0 0 0 0 0
## 7 0 0 0 0 0 0 124 0 0 0 0 0
## 8 0 0 0 0 0 0 0 103 0 0 0 0
## 9 0 0 0 0 0 0 0 0 42 0 0 0
## 10 0 0 0 0 0 0 0 0 0 53 0 0
## 11 0 0 0 0 0 0 0 0 0 0 38 0
## 12 0 0 0 0 0 0 0 0 0 0 0 32
##
## $overall_accuracy
## [1] 100
##
## $producers_accuracy
## 1 2 3 4 5 6 7 8 9 10 11 12
## 100 100 100 100 100 100 100 100 100 100 100 100
##
## $users_accuracy
## 1 2 3 4 5 6 7 8 9 10 11 12
## 100 100 100 100 100 100 100 100 100 100 100 100
##
## $kappa
## [1] 1
It seems quite incredible that this particular model is indicating 100 % accuracy.
Here it pays to look at the out-of-bag error of the hv.RF model for a better
indication of the model goodness of fit. For the random hold-back validation:
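A hedged sketch of the hold-back validation call producing the output below:
goofcat(observed = DSM_data$terron[-training],
    predicted = predict(hv.RF, newdata = DSM_data[-training, ]))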
## $confusion_matrix
## 1 2 3 4 5 6 7 8 9 10 11 12
## 1 2 2 1 2 0 0 0 0 1 0 0 0
## 2 4 0 2 0 0 0 0 0 1 1 0 0
## 3 0 0 11 1 0 0 7 0 2 0 2 1
## 4 2 2 4 10 0 0 1 3 6 0 0 0
## 5 0 0 0 0 29 8 2 7 5 2 0 1
## 6 0 0 0 0 2 18 0 6 0 1 0 0
## 7 0 0 8 5 0 0 31 8 3 4 3 5
## 8 0 0 0 0 4 0 8 19 3 1 2 1
## 9 0 0 1 0 0 0 0 0 2 0 0 0
## 10 0 0 2 0 2 0 1 3 1 12 1 0
## 11 0 0 5 2 1 0 0 1 0 1 2 0
## 12 0 0 0 0 0 0 1 0 0 3 0 5
##
## $overall_accuracy
## [1] 47
##
## $producers_accuracy
## 1 2 3 4 5 6 7 8 9 10 11 12
## 25 0 33 50 77 70 61 41 9 48 20 39
##
## $users_accuracy
## 1 2 3 4 5 6 7 8 9 10 11 12
## 25 0 46 36 54 67 47 50 67 55 17 56
##
## $kappa
## [1] 0.4015957
So, based on the model validation, the Random Forest performs quite similarly to
the other models that were used before, despite a perfect performance based on the
diagnostics of the calibration model.
And finally, the map that results from applying the hv.RF model to the covariate
rasters is shown in Fig. 6.3.
# class prediction
map.RF.c <- predict(covStack, hv.RF, filename = "hv_RF_class.tif",
format = "GTiff",overwrite = T, datatype = "INT2S")
# plot
levelplot(map.RF.c, col.regions = area_colors,
xlab = "", ylab = "")
[Fig. 6.3 Map of the most probable Terron class (HVT_001 to HVT_012) predicted by the Random Forest model]
7 Some Methods for the Quantification of Prediction Uncertainties for Digital Soil Mapping
Soil scientists are quite aware of the current issues concerning the natural envi-
ronment because our expertise is intimately aligned with their understanding and
alleviation. We know that sustainable soil management alleviates soil degradation,
improves soil quality and will ultimately ensure food security. Critical to better soil
management is information detailing the soil resource, its processes and its variation
across landscapes. Consequently, under the broad umbrella of “environmental
monitoring”, there has been a growing need to acquire quantitative soil information
(Grimm and Behrens 2010; McBratney et al. 2003). The concerns of soil-related
issues in reference to environmental management were raised by McBratney (1992),
who stated that it is our duty as soil scientists to ensure that the information we
provide to the users of soil information is both accurate and precise, or at least of
known accuracy and precision.
However, a difficulty we face is that soil can vary, seemingly erratically in the
context of space and time (Webster 2000). Thus the conundrum in model-based pre-
dictions of soil phenomena is that models are not “error free”. The unpredictability
of soil variation combined with simplistic representations of complex soil processes
inevitably leads to errors in model outputs.
We do not know the true character and processes of soils and our models are
merely abstractions of these real processes. We know this; or in other words, in the
absence of such confidence, we know we are uncertain about the true properties and
processes that characterize soils (Brown and Heuvelink 2005). The key is therefore
to determine to what extent our uncertainties are propagated through a model, and
how they affect the final predictions of a real-world process.
In modeling exercises, uncertainty of the model output is the summation of
the three main sources generally described as: model structure uncertainty, model
parameter uncertainty and model input uncertainty (Brown and Heuvelink 2005;
Minasny and McBratney 2002b). A detailed analysis of the contribution of each of
the different sources of uncertainty is generally recommended. In this chapter
we will cover a few approaches to estimating the uncertainty of model outputs.
Essentially what this means is that, given a defined level of confidence, model
predictions from digital soil mapping will be accompanied by the requisite prediction
interval or range. The approaches for quantifying the prediction uncertainties are:
• Universal kriging prediction variance
• Bootstrapping
• Empirical uncertainty quantification through data partitioning and cross
validation
• Empirical uncertainty quantification through fuzzy clustering and cross
validation
The data that will be used in this chapter is a small data set of subsoil pH that has
been collected from 2001 to the present from the Lower Hunter Valley in New South
Wales, Australia. The soil data cover an area of approximately 220 km2. Validation
of the quantification of uncertainty will be performed using a subset of these data.
The mapping of the uncertainties will be conducted for a small region of the study
area. The data for this section can be retrieved from the ithir package. The soil
data are called HV_subsoilpH, while the grids of environmental covariates are
called hunterCovariates_sub.
7.1 Universal Kriging Prediction Variance
First we need to load in all the libraries that are necessary for this section, and load
in the necessary data.
library(ithir)
library(sp)
library(rgdal)
library(raster)
library(gstat)
# Point data
data(HV_subsoilpH)
str(HV_subsoilpH)
# Raster data
data(hunterCovariates_sub)
hunterCovariates_sub
## class : RasterStack
## dimensions : 249, 210, 52290, 11 (nrow, ncol, ncell, nlayers)
## resolution : 25, 25 (x, y)
## extent : 338422.3, 343672.3, 6364203, 6370428
## (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=56 +south +ellps=WGS84
## +datum=WGS84 +units=m +no_defs
## names : Terrain_Ruggedness_Index, AACN,
## Landsat_Band1, Elevation, Hillshading, Light_insolation,
## Mid_Slope_Positon, MRVBF, NDVI, TWI, Slope
## min values : 0.194371, 0.000000,
## 26.000000, 72.217499, 0.000677, 1236.662840,
## 0.000009, 0.000002, -0.573034, 8.224325, 0.001708
## max values : 15.945321, 106.665482,
## 140.000000, 212.632507, 32.440960, 1934.199950, 0.956529,
## 4.581594, 0.466667, 20.393652, 21.809752
You will notice for HV_subsoilpH that these data have already been
intersected with a number of covariates. The hunterCovariates_sub object is
a rasterStack of the same covariates (although the spatial extent is smaller).
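The partitioning of the data into calibration and validation sets is not shown in the text; a hedged reconstruction consistent with the row counts printed below (354 and 152) is given here (the seed is an assumption):
set.seed(123)
training <- sample(nrow(HV_subsoilpH), 0.7 * nrow(HV_subsoilpH))
cDat <- HV_subsoilpH[training, ]
vDat <- HV_subsoilpH[-training, ]
nrow(cDat)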
## [1] 354
nrow(vDat)
## [1] 152
The cDat and vDat objects correspond to the model calibration and validation
data sets respectively.
Now to prepare the data for the model:
# coordinates
coordinates(cDat) <- ~X + Y
We will first use a stepwise regression to determine a parsimonious model, i.e. one
that retains only the most important covariates.
# Full model
lm1 <- lm(pH60_100cm ~ Terrain_Ruggedness_Index + AACN
+ Landsat_Band1 + Elevation + Hillshading + Light_insolation
+ Mid_Slope_Positon + MRVBF + NDVI + TWI + Slope, data = cDat)
# Parsimous model
lm2 <- step(lm1, direction = "both", trace = 0)
as.formula(lm2)
summary(lm2)
##
## Call:
## lm(formula = pH60_100cm ~ AACN + Landsat_Band1 + Hillshading +
## Mid_Slope_Positon + MRVBF + NDVI + TWI, data = cDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9409 -0.8467 -0.1431 0.6870 3.2195
##
## Coefficients:
Now we can construct the universal kriging model using the stepwise-selected
covariates.
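The construction of the gstat object (gUK) whose summary is printed below is not shown in the text; a hedged sketch, with the initial variogram settings assumed, is:
vgm.UK <- variogram(pH60_100cm ~ AACN + Landsat_Band1 + Hillshading +
    Mid_Slope_Positon + MRVBF + NDVI + TWI, cDat, width = 200)
mod.UK <- vgm(psill = var(cDat$pH60_100cm), "Sph", range = 10000, nugget = 0)
model_1 <- fit.variogram(vgm.UK, mod.UK)
gUK <- gstat(NULL, id = "hunterpH_UK", formula = pH60_100cm ~ AACN +
    Landsat_Band1 + Hillshading + Mid_Slope_Positon + MRVBF + NDVI + TWI,
    data = cDat, model = model_1)
gUK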
## data:
## hunterpH_UK : formula = pH60_100cm‘~‘AACN + Landsat_Band1 +
## Hillshading + Mid_Slope_Positon + MRVBF + NDVI + TWI ;
## data dim = 354 x 12
## variograms:
## model psill range
## hunterpH_UK[1] Nug 0.8895319 0.000
## hunterpH_UK[2] Sph 0.5204803 1100.581
Setting the index value to 2 lets us map the kriging variance, which is needed for
the prediction interval; setting it to 1 returns the prediction itself (sketched below).
Taking the square root of the variance gives the standard deviation, which we then
multiply by the z value corresponding to a two-sided 90 % prediction interval,
i.e. qnorm(0.95) = 1.644854. We then both add and subtract that result from the
universal kriging prediction to derive the 90 % prediction limits.
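The prediction map itself (UK.P.map, used in the plotting script further on) is not shown in the text; a hedged sketch (the file name is an assumption) is:
# prediction
UK.P.map <- interpolate(hunterCovariates_sub, gUK, xyOnly = FALSE,
    index = 1, filename = "UK_predMap.tif", format = "GTiff",
    overwrite = T)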
# prediction variance
UK.var.map <- interpolate(hunterCovariates_sub, gUK,
xyOnly = FALSE, index = 2, filename = "UK_predVarMap.tif",
format = "GTiff", overwrite = T)
# standard deviation
f2 <- function(x) (sqrt(x))
UK.stdev.map <- calc(UK.var.map, fun = f2,
filename = "UK_predSDMap.tif", format = "GTiff",
progress = "text", overwrite = T)
# Z level
zlev <- qnorm(0.95)
f2 <- function(x) (x * zlev)
UK.mult.map <- calc(UK.stdev.map, fun = f2,
filename = "UK_multMap.tif", format = "GTiff",
progress = "text", overwrite = T)
# stack the prediction and the multiplier maps (m1 is used below)
m1 <- stack(UK.P.map, UK.mult.map)
# upper PL
f3 <- function(x) (x[1] + x[2])
UK.upper.map <- calc(m1, fun = f3,
filename = "UK_upperMap.tif", format = "GTiff",
progress = "text", overwrite = T)
# lower PL
f4 <- function(x) (x[1] - x[2])
UK.lower.map <- calc(m1, fun = f4, filename = "UK_lowerMap.tif",
format = "GTiff", progress = "text", overwrite = T)
# prediction range
m2 <- stack(UK.upper.map, UK.lower.map)
UK.piRange.map <- calc(m2, fun = function(x) (x[1] - x[2]),
filename = "UK_piRangeMap.tif", format = "GTiff", overwrite = T)
So to plot them all together we use the following script. Here we explicitly create
a color ramp that follows reasonably closely the pH color ramp. Then we scale each
map to the common range for better comparison (Fig. 7.1).
Fig. 7.1 Soil pH predictions and prediction limits derived using a universal kriging model
# color ramp
phCramp <- c("#d53e4f", "#f46d43", "#fdae61", "#fee08b",
"#ffffbf", "#e6f598", "#abdda4", "#66c2a5", "#3288bd",
"#5e4fa2", "#542788", "#2d004b")
brk <- c(2:14)
par(mfrow = c(2, 2))
plot(UK.lower.map, main = "90% Lower prediction limit",
breaks = brk, col = phCramp)
plot(UK.P.map, main = "Prediction", breaks = brk, col = phCramp)
plot(UK.upper.map, main = "90% Upper prediction limit",
breaks = brk, col = phCramp)
plot(UK.piRange.map, main = "Prediction limit range",
col = terrain.colors(length(seq(0,
6.5, by = 1)) - 1), axes = FALSE, breaks = seq(0, 6.5, by = 1))
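A few pieces needed by the validation script below are not shown in the text: promoting the validation data to spatial points and the vector of z-values (qp) for the series of confidence levels. A hedged sketch of those pieces is:
coordinates(vDat) <- ~X + Y
# z-values for two-sided confidence levels of
# 99, 97.5, 95, 90, 80, 60, 40, 20, 10 and 5 %
cs <- c(99, 97.5, 95, 90, 80, 60, 40, 20, 10, 5)
qp <- qnorm(1 - (1 - cs/100)/2)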
# Prediction
UK.preds.V <- as.data.frame(krige(pH60_100cm ~ AACN +
Landsat_Band1 + Hillshading + Mid_Slope_Positon + MRVBF + NDVI +
TWI, cDat, model = model_1, newdata = vDat))
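# (hedged addition: the kriging standard deviation column used in the loop
# below, derived from the kriging variance returned by krige)
UK.preds.V$stdev <- sqrt(UK.preds.V$var1.var)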
# zfactor multiplication
vMat <- matrix(NA, nrow = nrow(UK.preds.V), ncol = length(qp))
for (i in 1:length(qp)) {
vMat[, i] <- UK.preds.V$stdev * qp[i]
}
# upper
uMat <- matrix(NA, nrow = nrow(UK.preds.V), ncol = length(qp))
for (i in 1:length(qp)) {
uMat[, i] <- UK.preds.V$var1.pred + vMat[, i]
}
# lower
lMat <- matrix(NA, nrow = nrow(UK.preds.V), ncol = length(qp))
for (i in 1:length(qp)) {
lMat[, i] <- UK.preds.V$var1.pred - vMat[, i]
}
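# (hedged addition: flag whether each observed value falls within the
# corresponding prediction limits at each confidence level, for the PICP)
bMat <- matrix(NA, nrow = nrow(UK.preds.V), ncol = length(qp))
for (i in 1:length(qp)) {
bMat[, i] <- as.numeric(vDat$pH60_100cm <= uMat[, i] &
vDat$pH60_100cm >= lMat[, i])
}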
colSums(bMat)/nrow(bMat)
Plotting the confidence level against the PICP provides a visual means to assess
the fidelity about the 1:1 line. As can be seen on Fig. 7.2, the PICP follows closely
the 1:1 line.
# make plot
cs <- c(99, 97.5, 95, 90, 80, 60, 40, 20, 10, 5)
plot(cs, ((colSums(bMat)/nrow(bMat)) * 100))
We may then assess the uncertainties on the basis of the PICP, as shown in
Fig. 7.2, together with assessing the quantiles of the distribution of the prediction
limit range for a given prediction confidence level (here 90 %).
cs <- c(99, 97.5, 95, 90, 80, 60, 40, 20, 10, 5)
colnames(lMat) <- cs
colnames(uMat) <- cs
quantile(uMat[, "90"] - lMat[, "90"])
7.2 Bootstrapping
In the bootstrapping approach, a random sample (with replacement) of the
available data is taken, a model is fitted, and can then be applied to generate a
digital soil map. By repeating the process of random sampling and applying the
model, we are able to generate probability distributions of the prediction
realizations from each model at each pixel.
A robust estimate may be determined by taking the average of all the simulated
predictions at each pixel. By being able to obtain probability distributions of the
outcomes, one is also able to quantify the uncertainty of the modeling by computing
a prediction interval given a specified level of confidence. While the bootstrapping
approach is relatively straightforward, there is a requirement to generate x number
of maps, where x is the number of bootstrap samples. This obviously could be
prohibitive from a computational and data storage point of view, but not altogether
impossible (given parallel processing capabilities etc.) as was demonstrated by both
Viscarra Rossel et al. (2015) and Liddicoat et al. (2015) whom both performed
bootstrapping for quantification of uncertainties across very large mapping extents.
In the case of Viscarra Rossel et al. (2015) this was for the entire Australian
continent at 100 m resolution.
In the example below, the bootstrap method is demonstrated. We will use
Cubist modeling for the model structure and perform 50 bootstrap samples. We will
use 70 % of the available data for fitting models; the remaining 30 %, as has been
done for all previous DSM approaches, is for validation.
For the first step, we do the random partitioning of the data into calibration
and validation data sets. Again we are using the HV_subsoilpH data and the
associated hunterCovariates_sub raster data stack.
# subset data for modeling
set.seed(667)
training <- sample(nrow(HV_subsoilpH), 0.7 * nrow(HV_subsoilpH))
cDat <- HV_subsoilpH[training, ]
vDat <- HV_subsoilpH[-training, ]
The nbag variable below holds the value for the number of bootstrap models we
want to fit; here it is 50. Essentially the bootstrap can be contained within a for
loop, where upon each loop a sample of the available data is taken (here 100 %,
with replacement) and then a model is fitted. Note below the use of the replace
parameter to indicate that we want a random sample with replacement. After a
model is fitted, we save the model to file and will come back to it later. The
modelFile variable shows the extensive use of the paste function in order to
provide the pathway and file name for the model that we want to save on each loop
iteration. The saveRDS function allows us to save each of the model objects as
rds files to the location specified. An alternative to saving the models individually
to file is to save them as elements of a list. When dealing with very large numbers
of models that are additionally complex in terms of their parameterizations, the
save-to-list alternative could run into computer memory limitations. The last
section of the script below simply demonstrates the use of the list.files
function to confirm that we have saved those models to file and that they are ready
to use.
# Number of bootstraps
nbag <- 50
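The bootstrap loop described above is not shown in the text; a hedged sketch (the file paths and the Cubist settings are assumptions) is:
library(Cubist)
for (i in 1:nbag) {
# bootstrap sample: 100% of the calibration data, with replacement
samp <- sample(nrow(cDat), replace = TRUE)
fit_cubist <- cubist(x = cDat[samp, c("Terrain_Ruggedness_Index", "AACN",
    "Landsat_Band1", "Elevation", "Hillshading", "Light_insolation",
    "Mid_Slope_Positon", "MRVBF", "NDVI", "TWI", "Slope")],
    y = cDat$pH60_100cm[samp], committees = 1)
# build the output file name and save the fitted model
modelFile <- paste(paste(paste(paste(getwd(), "/bootstrap/models/",
    sep = ""), "bootMod_", sep = ""), i, sep = ""), ".rds", sep = "")
saveRDS(fit_cubist, modelFile)
}
# confirm the models have been saved to file
c.models <- list.files(path = paste(getwd(), "/bootstrap/models/", sep = ""),
    pattern = "bootMod", full.names = TRUE)
head(c.models)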
## [1] "~~/bootstrap/models/bootMod_1.rds"
## [2] "~~/bootstrap/models/bootMod_10.rds"
## [3] "~~/bootstrap/models/bootMod_11.rds"
## [4] "~~/bootstrap/models/bootMod_12.rds"
## [5] "~~/bootstrap/models/bootMod_13.rds"
## [6] "~~/bootstrap/models/bootMod_14.rds"
We can then assess the goodness of fit and validation statistics of the bootstrap
models. This is done using the goof function as in previous examples. This time we
incorporate that function within a for loop. For each loop, we read in the model via
the readRDS function and then save the diagnostics to the cubiMat matrix object.
After the iterations are completed, we use the colMeans function to calculate
the means of the diagnostics over the 50 model iterations. You could also assess
the variance of those means with a command such as var(cubiMat[, 1]), which
would return the variance of the R2 values.
# calibration data
cubiMat <- matrix(NA, nrow = nbag, ncol = 5)
for (i in 1:nbag) {
fit_cubist <- readRDS(c.models[i])
cubiMat[i, ] <- as.matrix(goof(observed = cDat$pH60_100cm,
predicted = predict(fit_cubist, newdata = cDat)))
}
# Validation data
cubPred.V <- matrix(NA, ncol = nbag, nrow = nrow(vDat))
cubiMat <- matrix(NA, nrow = nbag, ncol = 5)
for (i in 1:nbag) {
fit_cubist <- readRDS(c.models[i])
cubPred.V[, i] <- predict(fit_cubist, newdata = vDat)
cubiMat[i, ] <- as.matrix(goof(observed = vDat$pH60_100cm,
predicted = predict(fit_cubist, newdata = vDat)))
}
cubPred.V_mean <- rowMeans(cubPred.V)
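Two small steps referred to in the surrounding text are not shown: averaging the diagnostics and keeping the mean validation MSE. A hedged sketch (the column position of the MSE in the goof output is an assumption) is:
# average the diagnostics over the bootstrap iterations
colMeans(cubiMat)
# mean of the validation MSE estimates, used later for the prediction variance
# (assumption: MSE is the third column of the goof output)
avGMSE <- mean(cubiMat[, 3])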
For the validation data, in addition to deriving the model diagnostic statistics,
we also save the actual model predictions for these data for each iteration to
the cubPred.V object. These will be used further on for validating the prediction
uncertainties. We also keep the mean of the mean square error (MSE) estimates
from the validation data (avGMSE above). The independent MSE estimator
accounts for both systematic and random errors in the modeling. This estimate of the
MSE is needed for quantifying the uncertainties, as this error is in addition to that
accounted for by the bootstrap, which specifically captures the uncertainty associated
with the deterministic model component, i.e. the model relationship between the
target variable and the covariates. Subsequently, an overall prediction variance (at
each point or pixel) will be the sum of the random error component (MSE) and the
bootstrap prediction variance (as estimated from the realisations of the bootstrap
modeling).
Our initial purpose here is to derive the mean and the variance of the predictions
from each bootstrap sample. This requires loading in each bootstrap model, applying
it to the covariate data, and then saving the predicted map to file or to R memory. In
the case below the predictions are saved to file, as illustrated in the following script.
for (i in 1:nbag) {
fit_cubist <- readRDS(c.models[i])
mapFile <- paste(paste(paste(paste(getwd(), "~~/bootstrap/map/",
sep = ""), "bootMap_", sep = ""), i, sep = ""), ".tif", sep = "")
predict(hunterCovariates_sub, fit_cubist, filename = mapFile,
format = "GTiff", overwrite = T)
}
To evaluate the mean at each pixel from each of the created maps, the base
function mean can be applied to a given stack of rasters. First we need to get the path
location of the rasters. Notice from the list.files function and the pattern
parameter, we are restricting the search of rasters that contain the string “bootMap”.
Next we make a stack of those rasters, followed by the calculation of the mean,
which is also written directly to file.
# Pathway to rasters
files <- list.files(paste(getwd(), "~~/bootstrap/map/", sep = ""),
pattern = "bootMap", full.names = TRUE)
# Raster stack
r1 <- raster(files[1])
for (i in 2:length(files)) {
r1 <- stack(r1, files[i])
}
# Calculate mean
meanFile <- paste(paste(paste(getwd(), "~~/bootstrap/map/",
sep = ""), "meanPred_", sep = ""), ".tif", sep = "")
bootMap.mean <- writeRaster(mean(r1), filename = meanFile,
format = "GTiff", overwrite = TRUE)
There is not a simple R function to use in order to estimate the variance at each
pixel from the prediction maps. Therefore we resort to estimating it directly from
the standard equation:
$$\mathrm{Var}(X) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \mu)^2 \qquad (7.1)$$
The symbol $\mu$ in this case is the mean bootstrap prediction, and $x_i$ is the $i$th
bootstrap map. In the first step below, we estimate the squared differences and save
the maps to file. Then we calculate the sum of those squared differences, before
deriving the variance prediction. The last step is to add the variance of the bootstrap
predictions to the averaged MSE estimated from the validation data.
# Square differences
for (i in 1:length(files)) {
r1 <- raster(files[i])
diffFile <- paste(paste(paste(paste(getwd(),
"~~/bootstrap/map/",
sep = ""), "bootAbsDif_", sep = ""), i, sep = ""),
".tif", sep = "")
jj <- (r1 - bootMap.mean)^2
writeRaster(jj, filename = diffFile, format = "GTiff",
overwrite = TRUE)
}
# stack the squared-difference rasters
files2 <- list.files(paste(getwd(), "~~/bootstrap/map/", sep = ""),
    pattern = "bootAbsDif", full.names = TRUE)
r2 <- raster(files2[1])
for (i in 2:length(files2)) {
r2 <- stack(r2, files2[i])
}
# sum of the squared differences
bootMap.sqDiff <- calc(r2, fun = sum)
# Variance
varFile <- paste(paste(paste(getwd(), "~~/bootstrap/map/", sep=""),
"varPred_", sep = ""), ".tif", sep = "")
bootMap.var <- writeRaster(((1/(nbag - 1)) * bootMap.sqDiff),
filename = varFile, format = "GTiff", overwrite = TRUE)
# add the averaged validation MSE to the bootstrap prediction variance
bootMap.varF <- bootMap.var + avGMSE
Taking the square root of this overall variance gives the standard deviation, which
is then multiplied by the z value corresponding to the desired level of probability.
The z value is obtained using the qnorm function. The result is then either added to
or subtracted from the mean prediction in order to generate the upper and lower
prediction limits respectively.
# Standard deviation
sdFile <- paste(paste(paste(getwd(), "~~/bootstrap/map/",
sep = ""), "sdPred_", sep = ""), ".tif", sep = "")
bootMap.sd <- writeRaster(sqrt(bootMap.varF), filename = sdFile,
format = "GTiff", overwrite = TRUE)
# standard error
seFile <- paste(paste(paste(getwd(), "~~/bootstrap/map/",
sep = ""), "sePred_", sep = ""), ".tif", sep = "")
bootMap.se <- writeRaster((bootMap.sd * qnorm(0.95)),
filename = seFile, format = "GTiff", overwrite = TRUE)
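The final step of adding and subtracting the standard error from the mean prediction to obtain the prediction limit maps is not shown; a hedged sketch is:
# upper and lower prediction limits (file writing omitted for brevity)
bootMap.upper <- bootMap.mean + bootMap.se
bootMap.lower <- bootMap.mean - bootMap.se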
As for the Universal kriging example, we can plot the associated maps of the
predictions and quantified uncertainties (Fig. 7.3).
Fig. 7.3 Soil pH predictions and prediction limits derived using bootstrapping
You will recall that the bootstrap model predictions on the validation data were saved
to the cubPred.V object. We want to estimate the standard deviation of those
predictions for each point. Also recall that the prediction variance is the sum of
the MSE and the variance of the bootstrap model predictions; taking the square root
of that summation results in the standard deviation estimate.
val.sd <- matrix(NA, ncol = 1, nrow = nrow(cubPred.V))
for (i in 1:nrow(cubPred.V)) {
val.sd[i, 1] <- sqrt(var(cubPred.V[i, ]) + avGMSE)
}
# zfactor multiplication
vMat <- matrix(NA, nrow = nrow(cubPred.V), ncol = length(qp))
for (i in 1:length(qp)) {
vMat[, i] <- val.sd * qp[i]
}
Now we add or subtract these limits to/from the averaged model predictions to
derive the prediction limits for each level of confidence. We then assess the PICP
for each level of confidence: we simply assess whether each observed value is
encapsulated by the corresponding prediction limits, and then calculate the
proportion of agreements relative to the total number of observations (sketched
below).
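A hedged sketch of those steps (the object names uMat, lMat and bMat follow the script below) is:
uMat <- matrix(NA, nrow = nrow(cubPred.V), ncol = length(qp))
lMat <- matrix(NA, nrow = nrow(cubPred.V), ncol = length(qp))
bMat <- matrix(NA, nrow = nrow(cubPred.V), ncol = length(qp))
for (i in 1:length(qp)) {
uMat[, i] <- cubPred.V_mean + vMat[, i]
lMat[, i] <- cubPred.V_mean - vMat[, i]
bMat[, i] <- as.numeric(vDat$pH60_100cm <= uMat[, i] &
vDat$pH60_100cm >= lMat[, i])
}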
colSums(bMat)/nrow(bMat)
As can be seen on Fig. 7.4, there is an indication that the prediction uncertainties
could be a little too liberally defined, where particularly at the higher level of
confidence the associated PICP is higher.
# make plot
cs <- c(99, 97.5, 95, 90, 80, 60, 40, 20, 10, 5)
plot(cs, ((colSums(bMat)/nrow(bMat)) * 100))
Quantiles of the distribution of the prediction limit range for the validation data
(at the 90 % level of confidence) are expressed below. Compared to the universal
kriging approach, the uncertainties quantified from the bootstrapping approach are
generally higher.
cs <- c(99, 97.5, 95, 90, 80, 60, 40, 20, 10, 5)
colnames(lMat) <- cs
colnames(uMat) <- cs
quantile(uMat[, "90"] - lMat[, "90"])
7.3 Empirical Uncertainty Quantification Through Data Partitioning and Cross Validation
The next two approaches for uncertainty quantification are empirical methods
whereby the prediction intervals are defined from the distribution of model errors,
i.e. the deviation between observations and model predictions. In the examples to
follow, however, the prediction limits are not spatially uniform; rather, they vary
according to partitions of the data.
We will use a Cubist model regression kriging approach as used previously, and then
define the subsequent prediction intervals.
First we prepare the data.
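The preparation code is not shown in the text; a hedged sketch repeating the earlier 70/30 split of HV_subsoilpH (the seed is an assumption) is:
set.seed(123)
training <- sample(nrow(HV_subsoilpH), 0.7 * nrow(HV_subsoilpH))
cDat <- HV_subsoilpH[training, ]
vDat <- HV_subsoilpH[-training, ]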
Now we fit the Cubist model using all available environmental covariates.
library(Cubist)
# Cubist model
hv.cub.Exp <- cubist(x = cDat[, c("Terrain_Ruggedness_Index",
"AACN", "Landsat_Band1", "Elevation", "Hillshading",
"Light_insolation", "Mid_Slope_Positon", "MRVBF", "NDVI", "TWI",
"Slope")],
y = cDat$pH60_100cm, cubistControl(unbiased =TRUE, rules = 100,
extrapolation = 10, sample = 0, label = "outcome"), committees=1)
summary(hv.cub.Exp)
##
## Read 354 cases (12 attributes) from undefined.data
##
## Model:
##
## Rule 1: [116 cases, mean 6.183380, range 3.956412 to
9.249626, est err 0.714710]
##
## if
## NDVI <= -0.191489
## TWI <= 13.17387
## then
## outcome = 9.610208 + 0.0548 AACN - 0.0335 Elevation + 0.131
## Hillshading + 3.5 NDVI + 0.076 Terrain_Ruggedness_Index
## - 0.055 Slope - 0.023 Landsat_Band1 + 0.03 MRVBF
## + 0.07 Mid_Slope_Positon + 0.005 TWI
##
## Rule 2: [164 cases, mean 6.533212, range 3.437355 to 9.741758,
est err 0.986175]
##
## if
## TWI > 13.17387
## then
## outcome = 7.471082 + 0.0215 AACN + 0.108 Hillshading + 4.2 NDVI
## + 0.24 MRVBF + 1.16 Mid_Slope_Positon - 0.0104 Elevation
## - 0.069 Slope + 0.077 Terrain_Ruggedness_Index
## - 0.028 Landsat_Band1 + 0.047 TWI
##
## Rule 3: [74 cases, mean 6.926269, range 2.997182 to 9.630296,
est err 1.115631]
##
## if
## NDVI > -0.191489
## TWI <= 13.17387
## then
## outcome = 11.743466 + 0.0416 AACN - 0.091 Landsat_Band1
## - 0.0117 Elevation + 3 NDVI + 0.048 Hillshading
## + 0.065 Terrain_Ruggedness_Index - 0.046 Slope
##
##
## Evaluation on training data (354 cases):
##
## Average |error| 1.085791
## Relative |error| 0.95
## Correlation coefficient 0.4
The hv.cub.Exp model indicates that three partitions (rules) of the data were made.
Assessing the goodness of fit of this model with the goof function, the model
seems to perform OK.
goof(observed = cDat$pH60_100cm, predicted = predict(hv.cub.Exp,
newdata = cDat))
Now we want to assess the model residuals for spatial autocorrelation. In this
case we need to look at the variogram of the residuals.
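The derivation of the residual column used in the variogram below is not shown; it would take the form:
# model residuals: observed minus predicted
cDat$residual <- cDat$pH60_100cm - predict(hv.cub.Exp, newdata = cDat)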
# coordinates
coordinates(cDat) <- ~X + Y
# residual variogram model
vgm1 <- variogram(residual ~ 1, cDat, width = 200)
mod <- vgm(psill = var(cDat$residual), "Sph", range = 10000,
nugget = 0)
model_1 <- fit.variogram(vgm1, mod)
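The construction of the gstat object (gOK) whose summary is printed below, and which is used later to krige the residuals, is not shown; a hedged sketch is:
gOK <- gstat(NULL, id = "hunterpH_cubistRES", formula = residual ~ 1,
    data = cDat, model = model_1)
gOK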
## data:
## hunterpH_cubistRES : formula = residual‘~‘1 ; data dim = 354 x 13
## variograms:
## model psill range
## hunterpH_cubistRES[1] Nug 0.7431729 0.0000
## hunterpH_cubistRES[2] Sph 0.5385914 937.8323
This output indicates there is a reasonable variogram of the residuals, which may
help improve the overall predictive model. We can determine whether there is
any improvement later when we perform the validation.
With a model defined, it is now necessary to estimate the within-partition
prediction limits. This is done using a leave-one-out cross-validation (LOCV)
procedure with a Cubist regression kriging model, as we did for the spatial model.
What we need to do first is to determine which observations in the data set belong
to which partition. Looking at the summary output above, we can examine the
threshold criteria that define the data partitions.
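The partition assignment producing the tabulation below is not shown; a hedged reconstruction, using the rule thresholds from the model summary, is:
cDat1 <- as.data.frame(cDat)
cDat1$rule <- NA
cDat1$rule[cDat1$NDVI <= -0.191489 & cDat1$TWI <= 13.17387] <- 1
cDat1$rule[cDat1$TWI > 13.17387] <- 2
cDat1$rule[cDat1$NDVI > -0.191489 & cDat1$TWI <= 13.17387] <- 3
table(cDat1$rule)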
## 1 2 3
## 115 165 74
Now we can subset cDat1 based on its assigned rule or partition and then perform
the LOCV. The script below shows the procedure for the first rule.
# subset the data
cDat1.r1 <- cDat1[which(cDat1$rule == 1), ]
target.C.r1 <- cDat1.r1$pH60_100cm
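The LOCV itself and the derivation of the empirical prediction limits for rule 1 are not shown in the text. A simplified, hedged sketch is given below; it omits the kriged-residual component of the full procedure, and both the 5th/95th percentile choice (a 90 % interval) and the object name r1.ulPI are assumptions taken from the mapping script further on.
covNames <- c("Terrain_Ruggedness_Index", "AACN", "Landsat_Band1",
    "Elevation", "Hillshading", "Light_insolation", "Mid_Slope_Positon",
    "MRVBF", "NDVI", "TWI", "Slope")
looResid.r1 <- numeric(nrow(cDat1.r1))
for (i in 1:nrow(cDat1.r1)) {
# fit to all but one observation, then predict the held-out observation
fit.loo <- cubist(x = cDat1.r1[-i, covNames], y = target.C.r1[-i],
    committees = 1)
looResid.r1[i] <- target.C.r1[i] -
    predict(fit.loo, newdata = cDat1.r1[i, covNames])
}
# empirical lower and upper prediction limits from the residual distribution
r1.ulPI <- quantile(looResid.r1, probs = c(0.05, 0.95))
r1.ulPI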
As can be seen, the prediction limits for each partition of the data are quite
different from each other.
To create the maps in the same fashion as was done for bootstrapping and universal
kriging, we do more-or-less the same for the Cubist regression kriging. First
create the regression kriging map.
# kriged residuals
map.cubist.res <- interpolate(hunterCovariates_sub, gOK,
xyOnly = TRUE, index = 1, filename = "rk_residuals.tif",
format = "GTiff", datatype = "FLT4S", overwrite = TRUE)
To derive the upper and lower prediction limits we can apply the raster
calculations in a very manual way, such that each line of the rasterStack is
read in and then evaluated to determine which rule each entry on the line belongs
to. Then given the rule, the corresponding upper and lower limits are appended to a
new raster together with a raster which indicates which rule was applied where. For
small mapping extents this approach works fine. It can also be applied to very
large rasters, but this can take some time as the raster is read line by line.
There are, however, options in the raster package to process rasters in larger chunks.
# Create new raster datasets
upper1 <- raster(hunterCovariates_sub[[1]])
lower1 <- raster(hunterCovariates_sub[[1]])
rule1 <- raster(hunterCovariates_sub[[1]])
upper1 <- writeStart(upper1, filename = "cubRK_upper1.tif",
format = "GTiff", overwrite = TRUE)
lower1 <- writeStart(lower1, filename = "cubRK_lower1.tif",
format = "GTiff", overwrite = TRUE)
rule1 <- writeStart(rule1, filename = "cubRK_rule1.tif",
format = "GTiff", datatype = "INT2S", overwrite = TRUE)
for (i in 1:dim(upper1)[1]) {
# extract raster information line by line
cov.Frame <- as.data.frame(getValues(hunterCovariates_sub, i))
ulr.Frame <- matrix(NA, ncol = 3, nrow = dim(upper1)[2])
# append in partition information
# rule 1
ulr.Frame[which(cov.Frame$NDVI <= -0.191489 & cov.Frame$TWI
<= 13.17387), 1] <- r1.ulPI[2]
ulr.Frame[which(cov.Frame$NDVI <= -0.191489 & cov.Frame$TWI
<= 13.17387), 2] <- r1.ulPI[1]
ulr.Frame[which(cov.Frame$NDVI <= -0.191489 & cov.Frame$TWI
<= 13.17387), 3] <- 1
# rule 2
ulr.Frame[which(cov.Frame$TWI > 13.17387), 1] <- r2.ulPI[2]
ulr.Frame[which(cov.Frame$TWI > 13.17387), 2] <- r2.ulPI[1]
ulr.Frame[which(cov.Frame$TWI > 13.17387), 3] <- 2
# rule 3
ulr.Frame[which(cov.Frame$NDVI > -0.191489 & cov.Frame$TWI
<= 13.17387), 1] <- r3.ulPI[2]
ulr.Frame[which(cov.Frame$NDVI > -0.191489 & cov.Frame$TWI
<= 13.17387), 2] <- r3.ulPI[1]
ulr.Frame[which(cov.Frame$NDVI > -0.191489 & cov.Frame$TWI
<= 13.17387), 3] <- 3
# write the evaluated line to the output rasters (an assumed completion of
# the loop; r3.ulPI follows the naming pattern of r1.ulPI and r2.ulPI)
upper1 <- writeValues(upper1, ulr.Frame[, 1], i)
lower1 <- writeValues(lower1, ulr.Frame[, 2], i)
rule1 <- writeValues(rule1, ulr.Frame[, 3], i)
}
upper1 <- writeStop(upper1)
lower1 <- writeStop(lower1)
rule1 <- writeStop(rule1)
Now we can derive the prediction interval by adding the upper and lower limits
to the regression kriging prediction that was made earlier. Then we can estimate the
prediction interval range.
# raster stack of predictions and prediction limits
r2 <- stack(map.cubist.final, lower1) #lower
mapRK.lower <- calc(r2, fun = sum, filename = "cubistRK_lowerPL.tif",
format = "GTiff", overwrite = T)
Now we can plot the maps as before. We can note this time that the prediction
interval range is smaller for the Cubist regression kriging than it is for the universal
kriging approach (Fig. 7.5).
# color ramp
phCramp <- c("#d53e4f", "#f46d43", "#fdae61", "#fee08b", "#ffffbf",
"#e6f598", "#abdda4", "#66c2a5", "#3288bd", "#5e4fa2", "#542788",
"#2d004b")
brk <- c(2:14)
par(mfrow = c(2, 2))
plot(mapRK.lower, main = "90% Lower prediction limit",
breaks = brk, col = phCramp)
Fig. 7.5 Soil pH predictions and prediction limits derived using a Cubist regression kriging model
For validation we assess both the quality of the predictions and the quantification
of uncertainty, as was done earlier for the universal kriging. Below we assess the
validation of the Cubist model alone and the Cubist regression kriging model.
# Validation regression
goof(observed = vDat$pH60_100cm, predicted = OK.preds.V$cubist)
# regression kriging
goof(observed = vDat$pH60_100cm, predicted = OK.preds.V$finalP)
The regression kriging model is a little more accurate than the Cubist model
alone.
The first step in estimating the uncertainty about the validation points is to
evaluate which rule set or partition each observation belongs to.
## 1 2 3
## 43 65 44
Then we can define the prediction interval that corresponds to each observation
for each level of confidence.
# Upper PL
ulMat <- matrix(NA, nrow = nrow(vDat1), ncol = length(r1.q))
for (i in seq(2, 20, 2)) {
ulMat[which(vDat1$rule == 1), i] <- r1.q[i]
ulMat[which(vDat1$rule == 2), i] <- r2.q[i]
ulMat[which(vDat1$rule == 3), i] <- r3.q[i]
}
# Lower PL
for (i in seq(1, 20, 2)) {
ulMat[which(vDat1$rule == 1), i] <- r1.q[i]
ulMat[which(vDat1$rule == 2), i] <- r2.q[i]
ulMat[which(vDat1$rule == 3), i] <- r3.q[i]
}
# binary
bMat <- matrix(NA, nrow = nrow(ULpreds), ncol = (ncol(ULpreds)/2))
cnt <- 1
for (i in seq(1, 20, 2)) {
bMat[, cnt] <- as.numeric(vDat1$pH60_100cm <= ULpreds[, i + 1]
& vDat1$pH60_100cm >= ULpreds[, i])
cnt <- cnt + 1
}
colSums(bMat)/nrow(bMat)
The PICP estimates appear to correspond quite well with the respective confidence
levels. This can be observed from the plot on Fig. 7.6 too.
# make plot
cs <- c(99, 97.5, 95, 90, 80, 60, 40, 20, 10, 5)
# confidence level
plot(cs, ((colSums(bMat)/nrow(bMat)) * 100))
abline(a = 0, b = 1, lty = 2, col = "red")
From the validation observations the prediction intervals range between 2.45 and
4.6 with a median of about 3.36 pH units when using the Cubist regression kriging
model.
This approach is similar to the previous one in that uncertainty
is expressed in the form of quantiles of the underlying distribution of model
error (residuals). It contrasts, however, in terms of how the environmental data
are partitioned. In the previous approach, partitioning was based upon the hard
classification defined by a fitted Cubist model; this approach uses a fuzzy clustering
method of partitioning.
Essentially the approach is based on previous research by Shrestha and Solomatine
(2006), where the idea is to partition a feature space into clusters (with a
fuzzy k-means routine) which share similar model errors. A prediction interval (PI)
is constructed for each cluster on the basis of the empirical distribution of residual
observations that belong to each cluster. A PI is then formulated for each observation
in the feature space according to the grade of its membership to each cluster.
They applied this methodology to artificial and real hydrological data sets and it
was found to be superior to other methods which estimate a PI. The Shrestha and
Solomatine (2006) approach computes the PI independently of the prediction model
structure; it requires only the model or prediction outputs. Tranter
et al. (2010) extended this approach to deal with observations that are outside of the
training domain.
The method presented in this exercise was introduced by Malone et al. (2011),
which slightly modifies the Shrestha and Solomatine (2006) and Tranter et al. (2010)
approaches for use within a DSM framework. The approach is summarized by the
flow diagram on Fig. 7.7.
Fig. 7.7 Flow diagram of the general procedure for achieving the outcome of mapping predictions
and their uncertainties (upper and lower prediction limits) within a digital soil mapping framework.
The 3 components for achieving this outcome are the prediction model, the empirical uncertainty
model and the mapping component (Sourced from Malone et al. 2011)
The process for deriving the uncertainties is much the same as for the previous
Cubist regression kriging approach. One benefit of using a fuzzy
k-means approach is that the spatial distribution of uncertainty is represented as a
continuous variable. Further, the incorporation of extragrades in the fuzzy k-means
classification provides an explicit means to identify and highlight areas of the greatest
uncertainty and possibly where new sampling efforts should be prioritized. As
shown on Fig. 7.7, the approach entails 3 main processes:
• Calibrating the spatial model
• Deriving the uncertainty model, which includes both estimation of the model errors
and the fuzzy k-means with extragrades classification
• Creating maps of both the spatial soil predictions and their uncertainties.
Naturally, this framework is validated using a withheld or, better still, independent
data set.
Here we will use a random forest regression kriging model for the prediction of soil
pH across the study area. This model will also be incorporated into the uncertainty
model via leave-one-out cross validation in order to derive the model errors. As
before, we begin by preparing the data.
# Point data subset data for modeling
set.seed(667)
training <- sample(nrow(HV_subsoilpH), 0.7 * nrow(HV_subsoilpH))
cDat <- HV_subsoilpH[training, ]
vDat <- HV_subsoilpH[-training, ]
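The fitting of the hv.RF.Exp model itself is not shown above; a minimal sketch, assuming the randomForest package and the covariates used throughout this chapter, might be:
# fit a random forest model of subsoil pH (a sketch; the covariate
# selection is an assumption and may differ from the original model)
library(randomForest)
hv.RF.Exp <- randomForest(pH60_100cm ~ AACN + Elevation + Hillshading +
NDVI + Terrain_Ruggedness_Index + Slope + Landsat_Band1 + MRVBF +
Mid_Slope_Positon + TWI, data = cDat, importance = TRUE, ntree = 1000)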
# Goodness of fit
goof(observed = cDat$pH60_100cm, predicted = predict(hv.RF.Exp,
newdata = cDat), plot.it = FALSE)
The hv.RF.Exp model appears to perform quite well when we examine the
goodness of fit statistics.
Now we can examine the model residuals for any presence of spatial structure
with variogram modeling. From the output below it does seem that there is some
useful correlation structure in the residuals that will likely help to improve upon the
performance of the hv.RF.Exp model.
# Estimate the residual
cDat$residual <- cDat$pH60_100cm - predict(hv.RF.Exp,
newdata = cDat)
# residual variogram model
coordinates(cDat) <- ~X + Y
vgm1 <- variogram(residual ~ 1, data = cDat, width = 200)
mod <- vgm(psill = var(cDat$residual), "Sph", range = 10000,
nugget = 0)
model_1 <- fit.variogram(vgm1, mod)
model_1
Like before, we need to estimate the model errors and a good way to do this is
via a LOCV approach. The script below is more-or-less a repeat from earlier with
the Cubist regression kriging modeling except now we are using the random forest
modeling.
# Uncertainty analysis
cDat1 <- as.data.frame(cDat)
names(cDat1)
cDat1.r1 <- cDat1
target.C.r1 <- cDat1.r1$pH60_100cm
In the objective function Je(C, M) of the FKM with extragrades algorithm, C is the
c × p matrix of cluster centers, where c is the number of clusters and p is the
number of variables. M is the n × c matrix of partial memberships, where n is the
number of observations; m_ij ∈ [0, 1] is the partial membership of the ith observation
to the jth cluster. The fuzziness exponent is φ (with φ > 1). The squared distance between the
ith observation and the jth cluster is d_ij^2, and m_i* denotes the membership to the extragrade
cluster. This function also requires the parameter alpha (α) to be defined, which is
used to evaluate membership to the extragrade cluster.
A very good stand-alone software package developed by Minasny and McBratney (2002a),
called FuzME, contains the FKM with extragrades method, plus other clustering
methods. The software may be downloaded for free from http://sydney.edu.au/
agriculture/pal/software/fuzme.shtml. The source code of this software has also
been ported to an R package of the same name (fuzme). Normally the stand-alone software
would be used because it is computationally much faster; however, using the fuzme
R package allows one to easily integrate the clustering procedures into a standard R
workflow. For example, one of the issues with clustering procedures is the uncertainty
regarding the selection of an optimal cluster number for a given data set. There are a
number of ways to determine an optimal solution. Some popular approaches include
determining the cluster configuration which minimizes the Fuzzy Performance
Index (FPI) or the Modified Partition Entropy (MPE). The MPE establishes the degree
of fuzziness created by a specified number of clusters for a defined exponent value;
the notion is that the smaller the MPE, the more suitable the corresponding
number of clusters at the given exponent value. A more sophisticated analysis is
to look at the derivative of Je(C, M) with respect to the fuzzy exponent φ, which can be used to
simultaneously establish the optimal exponent and cluster number. More is discussed about each of these
indices by Odeh et al. (1992) and Bragato (2004).
Now we need to prepare the data for clustering and parameterize the other inputs
of the function. The other inputs are: nclass, which is the number of clusters you
want to define (note that an extra cluster will also be defined, as this will be the extragrade
cluster with its associated memberships); data, which is the data needed for clustering; U,
an initial membership matrix used to get the fuzzy clustering algorithm started;
phi, the fuzzy exponent; and distype, which refers to the distance metric to be used
for clustering. There are 3 possible distance metrics available: Euclidean
(1), Diagonal (2), and Mahalanobis (3). As an example of using fobjk, let's
define 4 clusters with a fuzzy exponent of 1.2 and an average extragrade
membership of 10 %. Currently this function is pretty slow to compute the optimal
alpha, so be prepared to wait a while.
# Parameterize fuzzy objective function
data.t <- cDat1[, 4:14] # data to be clustered
nclass.t <- 4 # number of clusters
phi.t <- 1.2
distype.t <- 3 #3 = Mahalanobis distance
Uereq.t <- 0.1 #average extragrade membership
Remember that the fobjk function will only return the optimal alfa value. This
value then gets inserted into the associated fkme function in order to estimate
the memberships of the data to each cluster and to the extragrade cluster. The fkme
function also returns the cluster centroids.
alfa.t
## 1
## 0.01136809
The fkme function returns a list with a number of elements. At this stage we are
primarily interested in the elements membership and centroid which we will
use a little later on.
As described earlier, there are a number of criteria for assessing the validity of a
particular clustering configuration. We can evaluate these by using the fvalidity
function, which essentially takes in a few of the outputs from the fkme function.
Another useful metric is the confusion index (after Burrough et al. 1997),
which in our case looks at the similarity between the highest and second highest
cluster memberships. The confusion index is estimated for each data point; taking
the average over the data set provides some sense of whether the clusters can be
distinguished from each other.
mean(confusion(nclass = 5, U = tester$membership))
## [1] 0.3356157
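As a cross-check on what the confusion function returns, the confusion index can also be computed directly from the membership matrix as one minus the difference between the largest and second-largest memberships for each observation (a sketch based on the definition above):
# manual confusion index: 1 - (highest membership - second highest)
U <- tester$membership
ci <- apply(U, 1, function(u) {
s <- sort(u, decreasing = TRUE)
1 - (s[1] - s[2])
})
mean(ci)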
To assess the clustering performance using the criteria of the PICP and prediction
interval range, we first need to assign each data point to one of the clusters we have
derived. The assignment is based on the cluster which has the highest membership
grade. The script below provides a method for evaluating which data point belongs
to which cluster.
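The allocation itself is given here as a hedged sketch; it assumes the membership matrix returned by fkme is held in tester$membership, with one column per cluster (extragrade last), and that the hard class is stored in membs$class.
# assign each observation to the cluster with the highest membership
membs <- data.frame(class = as.factor(apply(tester$membership, 1,
which.max)))
table(membs$class)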
## 1 2 3 4 5
## 58 84 78 90 44
# combine
cDat1 <- cbind(cDat1, membs$class)
names(cDat1)[ncol(cDat1)] <- "class"
levels(cDat1$class)
Then we derive the cluster model errors. This entails splitting the cDat1 object
up on the basis of the cluster with the highest membership, i.e. cDat1$class.
The objects quanMat1 and quanMat2 represent the lower and upper model
errors for each cluster for each quantile, respectively. For the extragrade cluster,
we multiply the errors by a constant, here 2, in order to explicitly indicate that the
extragrade cluster (being the outliers of the data) has a higher uncertainty.
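A hedged sketch of how quanMat1 and quanMat2 might be derived is given below. It assumes the LOCV model errors are held in a column named looResid (an assumed name), uses the ten confidence levels applied in the validation step further on, and treats the last class as the extragrade cluster.
# cluster model errors: quantiles of the LOCV errors for each cluster
cs <- c(99, 97.5, 95, 90, 80, 60, 40, 20, 10, 5)  # confidence levels
lo.p <- (100 - cs)/200   # lower tail probabilities
hi.p <- 1 - lo.p         # upper tail probabilities
quanMat1 <- sapply(split(cDat1$looResid, cDat1$class), quantile,
probs = lo.p)   # lower model errors (rows = confidence levels)
quanMat2 <- sapply(split(cDat1$looResid, cDat1$class), quantile,
probs = hi.p)   # upper model errors
rownames(quanMat1) <- cs
rownames(quanMat2) <- cs
# inflate the extragrade cluster errors (assumed to be the last class)
quanMat1[, ncol(quanMat1)] <- quanMat1[, ncol(quanMat1)] * 2
quanMat2[, ncol(quanMat2)] <- quanMat2[, ncol(quanMat2)] * 2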
Using the validation or independent data that has been withheld, we evaluate the
PICP and prediction interval width. This requires first allocating cluster member-
ships to the points on the basis of outputs from using the fkme function, then using
these together with the cluster prediction limits to evaluate weighted averages of
the prediction limits for each point. With that done we can then derive the unique
upper and lower prediction interval limits for each point at each confidence level.
First, for the membership allocation, we use the fuzExall function. Essentially
this function takes in outputs from the fkme function and in our case, specifically
that concerning the tester object. Recall that the validation data is saved to the
vDat object.
vDat1 <- as.data.frame(vDat)
names(vDat1)
The weighted limits are evaluated as the membership-weighted sums PI^L_i = Σ_j m_ij PIC^L_j
and PI^U_i = Σ_j m_ij PIC^U_j, where PI^L_i and PI^U_i correspond to the weighted lower and upper
limits for the ith observation, PIC^L_j and PIC^U_j are the lower and upper limits for each cluster j, and
m_ij is the membership grade of the ith observation to cluster j (which were derived
in the previous step). In R, this can be interpreted as shown in the sketch below.
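This is a hedged sketch: the matrix fuz.me_v of validation memberships (one column per cluster, ordered as the columns of quanMat1 and quanMat2) and the output names lowerPL.v and upperPL.v are assumed for illustration.
# membership-weighted lower and upper model error limits for every
# validation point and every confidence level
# (rows = validation points, columns = confidence levels)
lowerPL.v <- as.matrix(fuz.me_v) %*% t(quanMat1)
upperPL.v <- as.matrix(fuz.me_v) %*% t(quanMat2)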
Then we want to add these values to the actual regression kriging predictions.
Now, as in the previous uncertainty approaches, we estimate the PICP for each
level of confidence. We can also estimate the average prediction interval length;
we will do this for the 90 % confidence level.
# PICP
colSums(bMat)/nrow(bMat)
## [,1]
## [1,] 3.913579
Recall that our motivation at the moment is to derive an optimal cluster number
and fuzzy exponent based on the criteria of the PICP and prediction interval width.
Above were the steps for evaluating those values for one clustering parameter
configuration, i.e. 4 clusters (plus an extragrade cluster) with a fuzzy exponent of 1.2.
Essentially we need to run, sequentially, different combinations of cluster number
and fuzzy exponent value and then assess the criteria resulting from each
configuration in order to find the optimum. For example, we might initiate
the process by specifying:
## [1] 2 3 4 5 6
The data frame above lists the optimal alfa, the clustering validity diagnostics,
and the PICP and prediction interval width diagnostics for each combination. The PICP
column is actually an absolute distance of the PICP from each level of prescribed
confidence; subsequently we should look for a minimum in that regard. Overall
the best combination considering PICP and PIw together is 3 clusters with a fuzzy
exponent of either 1.2 or 1.3. Based on the other fuzzy validity criteria, a fuzzy
exponent of 1.2 is optimal. Now we just re-run the function with these optimal
values in order to derive the cluster centroids, which are needed in order to create maps
of the prediction interval and range.
Now we have to calculate the cluster model error limits. This is achieved
by evaluating which cluster each data point in cDat1 belongs to based on the
maximum membership. Then we derive the quantiles of the model errors for each
cluster.
## 1 2 3 4
## 119 104 91 40
With the spatial model defined, together with the fuzzy clustering parameters and the
associated model errors, we can create the associated maps. First, the random
forest regression kriging map, which is shown on Fig. 7.9.
map.Rf <- predict(hunterCovariates_sub, hv.RF.Exp,
filename = "RF_HV.tif", format = "GTiff", overwrite = T)
# kriged residuals
crs(hunterCovariates_sub) <- NULL
map.KR <- interpolate(hunterCovariates_sub, gOK, xyOnly = TRUE,
index = 1, filename = "krigedResid_RF.tif", format = "GTiff",
datatype = "FLT4S", overwrite = TRUE)
Now we need to map the prediction intervals. Essentially, for every pixel on
the map we first need to estimate the membership value to each cluster. This
membership is based on the distance in covariate space between the pixel and the
centroid of each cluster. To do this we use the fuzzy allocation function (fuzExall) that was used
earlier; this time we use the fuzzy parameters from the fkme_final object. First
we need to create a data frame from the rasterStack of covariates.
# Prediction Intervals
hunterCovs.df <- data.frame(cellNos = seq(1:ncell(hunterCovariates_sub)))
vals <- as.data.frame(getValues(hunterCovariates_sub))
hunterCovs.df <- cbind(hunterCovs.df, vals)
hunterCovs.df <- hunterCovs.df[complete.cases(hunterCovs.df), ]
cellNos <- c(hunterCovs.df$cellNos)
gXY <- data.frame(xyFromCell(hunterCovariates_sub, cellNos, spatial = FALSE))
hunterCovs.df <- cbind(gXY, hunterCovs.df)
str(hunterCovs.df)
Now we prepare all the other inputs for the fuzExall function, and then run it.
This may take a little time.
With the memberships estimated, let's visualize them by creating the associated
membership maps (Fig. 7.8).
# combine
hvCovs <- cbind(hunterCovs.df[, 1:2], fuz.me_ALL)
# Create raster
map.class1mem <- rasterFromXYZ(hvCovs[, c(1, 2, 3)])
names(map.class1mem) <- "class_1"
map.class2mem <- rasterFromXYZ(hvCovs[, c(1, 2, 4)])
names(map.class2mem) <- "class_2"
map.class3mem <- rasterFromXYZ(hvCovs[, c(1, 2, 5)])
names(map.class3mem) <- "class_3"
Fig. 7.8 Maps of cluster memberships (panels: cluster 1, cluster 2, cluster 3, Extragrade)
The last spatial mapping task is to evaluate the 90 % prediction intervals. Again
we use the fuzzy committee approach given the cluster memberships and the cluster
model error limits.
# Lower limits
quanMat1["90", ]
## 1 2 3 4
## -1.532867 -1.451595 -1.792331 -3.140459
# upper limits
quanMat2["90", ]
## 1 2 3 4
## 2.023297 1.734574 2.182650 3.275685
# lower limit
f1 <- function(x) ((x[1] * quanMat1["90", 1]) + (x[2] *
quanMat1["90", 2]) + (x[3] * quanMat1["90", 3]) +
(x[4] * quanMat1["90", 4]))
# upper limit
f2 <- function(x) ((x[1] * quanMat2["90", 1]) + (x[2] *
quanMat2["90", 2]) + (x[3] * quanMat2["90", 3]) + (x[4] *
quanMat2["90", 4]))
And finally we can derive the upper and lower prediction limits.
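A hedged sketch of that derivation, using raster algebra that is equivalent to applying the f1 and f2 functions above (the extragrade membership raster, here called map.classExtramem, is an assumed object created in the same way as the other membership maps):
# membership-weighted lower and upper model error surfaces
mapRF.lower.dev <- map.class1mem * quanMat1["90", 1] +
map.class2mem * quanMat1["90", 2] + map.class3mem * quanMat1["90", 3] +
map.classExtramem * quanMat1["90", 4]
mapRF.upper.dev <- map.class1mem * quanMat2["90", 1] +
map.class2mem * quanMat2["90", 2] + map.class3mem * quanMat2["90", 3] +
map.classExtramem * quanMat2["90", 4]
# add to the regression kriging prediction (random forest + kriged residuals)
mapRF.fin <- map.Rf + map.KR
mapRF.lowerPI <- mapRF.fin + mapRF.lower.dev
mapRF.upperPI <- mapRF.fin + mapRF.upper.dev
mapRF.PIrange <- mapRF.upperPI - mapRF.lowerPI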
# raster stack of the prediction and prediction limits
s3 <- stack(mapRF.fin, mapRF.lowerPI, mapRF.upperPI)
# color ramp
phCramp <- c("#d53e4f", "#f46d43", "#fdae61", "#fee08b", "#ffffbf",
"#e6f598", "#abdda4", "#66c2a5", "#3288bd", "#5e4fa2",
"#542788", "#2d004b")
brk <- c(2:14)
par(mfrow = c(2, 2))
plot(mapRF.lowerPI, main = "90% Lower prediction limit",
breaks = brk, col = phCramp)
plot(mapRF.fin, main = "Prediction", breaks = brk, col = phCramp)
plot(mapRF.upperPI, main = "90% Upper prediction limit",
breaks = brk, col = phCramp)
plot(mapRF.PIrange, main = "Prediction limit range",
col = terrain.colors(length(seq(0, 6.5, by = 1)) - 1), axes = FALSE,
breaks = seq(0, 6.5, by = 1))
Fig. 7.9 Soil pH predictions and prediction limits derived using a Random Forest regression
kriging prediction model together with LOCV and fuzzy classification
For the first step we can validate the random forest model alone and with the auto-
correlated errors. As we had already applied the model earlier, it is just a matter of
using the goof function to return the validation diagnostics.
# regression kriging
goof(observed = vDat$pH60_100cm, predicted = OK.preds.V$finalP)
# Random Forest
goof(observed = vDat$pH60_100cm, predicted = OK.preds.V$randomForest)
The regression kriging model performs better than the random forest model
alone, but only marginally so; though both models are not particularly accurate in
any case.
And now to validate the quantification of uncertainty we implement the workflow
demonstrated above for the process of determining the optimal cluster parameter
settings.
# PICP
colSums(bMat)/nrow(bMat)
Plotting the PICP against the confidence level provides a nice visual. It can be
seen on Fig. 7.10 that the PICP follows the 1:1 line closely.
cs <- c(99, 97.5, 95, 90, 80, 60, 40, 20, 10, 5) # confidence level
plot(cs, ((colSums(bMat)/nrow(bMat)) * 100))
abline(a = 0, b = 1, lty = 2, col = "red")
From the validation observations the prediction intervals range between 3.2 and
6.4 with a median of about 3.6 pH units when using the Random Forest regression
kriging model.
Fig. 7.10 Plot of PICP and confidence level based on validation of Random Forest regression
kriging model
References
Bragato G (2004) Fuzzy continuous classification and spatial interpolation in conventional soil
survey for soil mapping of the lower Piave plain. Geoderma 118:1–16
Brown JD, Heuvelink GBM (2005) Assessing uncertainty propagation through physically based
models of soil water flow solute transport. In: Encyclopedia of hydrological sciences. John
Wiley and Sons, Chichester
Burrough PA, van Gaans PFM, Hootsmans R (1997) Continuous classification in soil survey:
spatial correlation, confusion and boundaries. Geoderma 77:115–135
Efron B, Tibshirani R (1993) An introduction to the bootstrap. Chapman and Hall, London
Grimm R, Behrens T (2010) Uncertainty analysis of sample locations within digital soil mapping
approaches. Geoderma 155:154–163
Lagacherie P, Cazemier D, vanGaans P, Burrough P (1997) Fuzzy k-means clustering of fields in
an elementary catchment and extrapolation to a larger area. Geoderma 77:197–216
Liddicoat C, Maschmedt D, Clifford D, Searle R, Herrmann T, Macdonald L, Baldock J (2015)
Predictive mapping of soil organic carbon stocks in south Australia’s agricultural zone. Soil
Res 53:956–973
Malone BP, McBratney AB, Minasny B (2011) Empirical estimates of uncertainty for mapping
continuous depth functions of soil attributes. Geoderma 160:614–626
Malone BP, Minasny B, Odgers NP, McBratney AB (2014) Using model averaging to combine soil
property rasters from legacy soil maps and from point data. Geoderma 232–234:34–44
McBratney AB, de Gruijter J (1992) Continuum approach to soil classification by modified fuzzy
k-means with extragrades. J Soil Sci 43:159–175
McBratney AB (1992) On variation, uncertainty and informatics in environmental soil manage-
ment. Aust J Soil Res 30:913–935
McBratney AB, Mendonca Santos ML, Minasny B (2003) On digital soil mapping. Geoderma
117:3–52
Minasny B, McBratney AB (2002a) FuzME version 3.0. Australian Centre for Precision Agricul-
ture, The University of Sydney
Minasny B, McBratney AB (2002b) Uncertainty analysis for pedotransfer functions. Eur J Soil Sci
53:417–429
Odeh I, McBratney AB, Chittleborough D (1992) Soil pattern recognition with fuzzy-c-means:
application to classification and soil-landform interrelationships. Soil Sci Soc Am J 56:
506–516
Shrestha DL, Solomatine DP (2006) Machine learning approaches for estimation of prediction
interval for the model output. Neural Netw 19:225–235
Solomatine DP, Shrestha DL (2009) A novel method to estimate model uncertainty using machine
learning techniques. Water Resour Res 45:W00B11
Tranter G, Minasny B, McBratney AB (2010) Estimating pedotransfer function prediction limits
using fuzzy k-means with extragrades. Soil Sci Soc Am J 74:1967–1975
Viscarra Rossel RA, Chen C, Grundy MJ, Searle R, Clifford D, Campbell PH (2015) The
Australian three-dimensional soil grid: Australia’s contribution to the globalsoilmap project.
Soil Res 53:845–864
Webster R (2000) Is soil variation random? Geoderma 97:149–163
Chapter 8
Using Digital Soil Mapping to Update,
Harmonize and Disaggregate Legacy Soil Maps
Digital soil maps are contrasted with legacy soil maps mainly in terms of the
underlying spatial data model. Digital soil maps are based on the pixel data model,
while legacy soil maps typically consist of a tessellation of polygons. The
advantage of the pixel model is that the information is spatially explicit. The soil
map polygons are delineations of soil mapping units which consist of a defined
assemblage of soil classes assumed to exist in more-or-less fixed proportions.
There is great value in legacy soil mapping because a huge amount of expertise
and resources went into its creation. Digital soil mapping will be all the richer for
using this existing knowledge base to derive detailed and high resolution digital
soil infrastructures. However, the digitization of legacy soil maps is not digital soil
mapping. Rather, the incorporation of legacy soil maps into a digital soil mapping
workflow involves some method (usually quantitative) of data mining to appoint
spatially explicit soil information, usually a soil class or even a measurable soil
attribute, upon a grid that covers the extent of the existing (legacy) mapping. In
some ways this process is akin to downscaling, because there is a need to extract
soil class or attribute information from aggregated soil mapping units. A better term
therefore is soil map disaggregation.
There is an underlying spatial explicitness in digital soil mapping that makes
it a powerful medium to portray spatial information. Legacy soil maps also have
an underlying spatial model in terms of the delineation of geographical space.
However, there is often some subjectivity in the actual arrangement and final shapes
of the mapping unit polygons. Yet that is a matter of discussion for another time.
For disaggregation studies the biggest impediment to overcome in a quantitative
manner is to determine the spatial configuration of the soil classes within each
map unit. It is often known which soil classes are in each mapping unit, and
sometimes there is information regarding the relative proportions of each too. What
is unknown is the spatial explicitness and configuration of said soil classes within
the unit. This is the common issue faced in studies seeking the renewal and updating
of legacy soil mapping. Some examples of soil map disaggregation studies from
the literature include Thompson et al. (2010) who recovered soil-landscape rules
from a soil map report in order to map individual soil classes. This together with
a supervised classification approach described by Nauman et al. (2012) represent
manually-based approaches to soil map disaggregation. Both of these studies were
successfully applied, but because of their manual nature, could also be seen as
time-inefficient and susceptible to subjectivity. The flip side to these studies is
those using quantitative models. Usually the modeling involves some form of data
mining algorithm where knowledge is learned and subsequently optimized based
on some model error minimization criteria. Extrapolation of the fitted model is then
executed in order to map the disaggregated soil mapping units. Such model-based or
data mining procedures for soil map disaggregation include that by Bui and Moran
(2001) in Australia, Haring et al. (2012) in Germany and Nauman and Thompson
(2014) in the USA. Some fundamental ideas of soil map disaggregation framed in a
deeper discussion of scaling of soil information are presented in McBratney (1998).
This chapter seeks to describe a soil map disaggregation method that was first
described in Wei et al. (2010) for digital harmonization of adjacent soil surveys
in southern Iowa, USA. The concept of harmonization has particular relevance in
the USA because it has been long established that the underlying soil mapping
concepts across geopolitical boundaries (i.e. counties and states) don’t always
match. This issue is obviously not a phenomenon exclusive to the USA but is
a common worldwide issue. This mismatch may include the line drawings and
named map units. Of course, soils in the field do not change at these political
boundaries. These soil-to-soil mismatches are the result of the past structuring of
the soil survey program. For example, soil surveys in the US were conducted on
a soil survey area basis, and most times the soil survey areas were based on county
boundaries. Often adjacent counties were mapped years apart. Different personnel,
different philosophies of soil survey science, new concepts of mapping and the
availability of various technologies have all played a part in why these differences
occur. These differences may be even more exaggerated at state lines, as each state
was administratively and technically responsible for the soil survey program within
its borders. The algorithm developed by Wei et al. (2010) addressed this issue,
where soil mapping units were disaggregated into soil series. Instead of mapping the
prediction of a single soil series, a probability distribution of all potential soil series
was estimated. The outcome of this was the dual disaggregation and harmonization
of existing legacy soil information into raster-based digital soil mapping product/s.
Odgers et al. (2014), using legacy soil mapping from an area in Queensland,
Australia, refined the algorithm, which they called DSMART (Disaggregation
and Harmonization of Soil Map Units Through Resampled Classification Trees).
Besides the work of Odgers et al. (2014), the DSMART algorithm has been
used in other projects throughout the world, with Chaney et al. (2014) using it to
disaggregate the entire gridded USA Soil Survey Geographic (gSSURGO) database.
The resulting POLARIS data set (Probabilistic Remapping of SSURGO) provides
the 50 most probable soil series predictions at each 30-meter grid cell over the
contiguous USA. DSMART has also been a critical component for the development
of the Soil and Landscape Grid of Australia (SLGA) data set (Grundy et al. 2015).
The SLGA is the first continental version of the GlobalSoilMap.net concept and the
first nationally consistent, fine spatial resolution set of continuous soil attributes with
Australia-wide coverage. The DSMART algorithm has been pivotal, together with
the associated PROPR algorithm (Digital Soil Property Mapping Using Soil Class
Probability Rasters; Odgers et al. (2015)) in deriving high resolution digital soil
maps where point-based DSM approaches cannot be undertaken, particularly where
soil point data is sparse. In this chapter, the fundamental features of DSMART are
described, followed by a demonstration of it on a small data set.
The DSMART algorithm has previously been written in the C++ and Python
computing languages. It is also available in an R package, which was developed
at the Soil Security Laboratory. Regardless of computing language preference,
DSMART requires three chief sources of data:
1. The soil map unit polygons that will be disaggregated.
2. Information about the soil class composition of the soil map unit polygons.
3. Geo-referenced raster covariates representing the scorpan factors, which have
complete and continuous coverage of the mapping extent. There is no restriction
in terms of the data type, i.e. continuous, categorical, ordinal etc.
The DSMART R package contains two working functions: dsmart and dsmartR.
More will be discussed about these shortly. The other items in the package are
various inputs required to run the functions. In essence these data provide some
indication of the structure and nature of the information that is required to run
the DSMART algorithm, so that it can be easily adapted to other projects. First is
the soil map to be disaggregated, which is saved to the dsT_polygons object.
In this example the small polygon map is a clipped area of the much larger soil
map that Odgers et al. (2014) disaggregated, which was the 1:250,000 soil map
of the Dalrymple Shire in Queensland, Australia by Rogers et al. (1999). In this
example data set there are 11 soil mapping units. The polygon object is of class
SpatialPolygonsDataFrame, which is what would be created if you were to
read a shapefile of the polygons into R (Fig. 8.1).
library(devtools)
# install the dsmart package from Bitbucket (only needs to be done once)
install_bitbucket("brendo1001/dsmart/rPackage/dsmart/pkg")
library(dsmart)
library(sp)
library(raster)
# Polygons
data(dsT_polygons)
class(dsT_polygons)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
summary(dsT_polygons$MAP_CODE)
## BUGA1t CGCO3t DO3n FL3d HG2g MI6t MM5g MS4g PA1f RA3t
## 1 1 1 1 1 1 1 1 1 1
Fig. 8.1 Subset of the polygon soil map from the Dalrymple Shire, Queensland Australia which
was disaggregated by Odgers et al. (2014)
## SCFL3g
## 1
plot(dsT_polygons)
invisible(text(getSpPPolygonsLabptSlots(dsT_polygons),
labels = as.character(dsT_polygons$MAP_CODE),cex = 1))
The next inputs are the soil map unit compositions, which are saved to the
dsT_composition object. This is a data frame that simply indicates, in respective
columns, the map unit name and the corresponding numerical identifier label. Then
there are the soil classes that are contained in the respective mapping unit, followed
by the relative proportion that each soil class contributes to the map unit. The relative
proportions should sum to 100 for each map unit.
# Map unit compositions
data(dsT_composition)
head(dsT_composition)
The last required inputs are the environmental covariates. These are used to inform
the model fitting for each DSMART iteration, and are ultimately used for the spatial
mapping. There are 20 different covariate rasters, which have been
derived from a digital elevation model and gamma radiometric data. These rasters
are organized into a RasterStack and are of 30 m grid resolution. This class of
data is the necessary format of the covariate data for input into DSMART.
# covariates
data(dsT_covariates)
class(dsT_covariates)
## [1] "RasterStack"
## attr(,"package")
## [1] "raster"
nlayers(dsT_covariates)
## [1] 20
res(dsT_covariates)
## [1] 30 30
Of particular interest is to derive the soil class probabilities, the most
probable soil class at each pixel, and even an estimate of the uncertainty, which
is given in terms of the confusion index that was used earlier during the fuzzy
classification of data for the derivation of digital soil map uncertainties. The confusion
index essentially measures how similar the classification is between (in most cases)
the most probable and second-most probable soil class predictions at a pixel. To derive
the probability rasters we need the rasters that were generated from the dsmart
function. This can be done via the use of the list.files function and the
raster package to read in the rasters and stack them into a rasterStack.
Rather than doing this we can use pre-prepared outputs, namely in the form of
the dsmartOutMaps (raster outputs from dsmart) and dsT_lookup (lookup
table of soil class names and numeric counterparts) objects. As with dsmart, this
function can be run in parallel mode via control of the cpus variable. A logical
entry is required for the sepP variable to determine whether probability rasters are
required to be created for each soil class. These particular outputs are important for
the follow-on procedure of soil attribute mapping, which is the focus of the study
by Odgers et al. (2015) and integral to the associated PROPR algorithm. In many
cases the user may just be interested in deriving the most probable soil class, or
sometimes the n-most probable soil class maps.
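As a hedged sketch of the first option (the file pattern, here simply all GeoTIFFs in the working directory, is an assumption about how the dsmart realisations were written to disk):
# read the rasters generated by dsmart and stack them
r.files <- list.files(path = getwd(), pattern = "\\.tif$",
full.names = TRUE)
dsmart.maps <- stack(r.files)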
Fig. 8.2 Map of the most probable soil class from DSMART
# Confusion Index
CI.map <- test.dsmartR[[4]]
plot(CI.map)
References
Grundy MJ, Viscarra Rossel R, Searle RD, Wilson PL, Chen C, Gregory LJ (2015) Soil and
landscape grid of Australia. Soil Res. http://dx.doi.org/10.1071/SR15191
Haring T, Dietz E, Osenstetter S, Koschitzki T, Schroder B (2012) Spatial disaggregation of
complex soil map units: a decision-tree based approach in Bavarian forest soils. Geoderma
185–186:37–47
McBratney A (1998) Some considerations on methods for spatially aggregating and disaggregating
soil information. In: Finke P, Bouma J, Hoosbeek M (eds) Soil and water quality at different
scales. Developments in plant and soil sciences, vol 80. Springer, Dordrecht, pp 51–62
Nauman TW, Thompson JA (2014) Semi-automated disaggregation of conventional soil maps
using knowledge driven data mining and classification trees. Geoderma 213:385–399
Nauman TW, Thompson JA, Odgers NP, Libohova Z (2012) Fuzzy disaggregation of conventional
soil maps using database knowledge extraction to produce soil property maps. In: Digital soil
assessments and beyond: Proceedings of the fifth global workshop on digital soil mapping.
CRC Press, London, pp 203–207
Odgers NP, McBratney AB, Minasny B (2015) Digital soil property mapping and uncertainty
estimation using soil class probability rasters. Geoderma 237–238:190–198
Odgers NP, Sun W, McBratney AB, Minasny B, Clifford D (2014) Disaggregating and harmonising
soil map units through resampled classification trees. Geoderma 214–215:91–100
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Mateo
Rogers L, Cannon M, Barry E (1999) Land resources of the Dalrymple shire, 1. Land resources
bulletin DNRQ980090. Department of Natural Resources, Brisbane, Queensland
Thompson JA, Prescott T, Moore AC, Bell J, Kautz DR, Hempel JW, Waltman SW, Perry C
(2010) Regional approach to soil property mapping using legacy data and spatial disaggregation
techniques. In: 19th world congress of soil science. IUSS, Brisbane
Wei S, McBratney A, Hempel J, Minasny B, Malone B, D’Avello T, Burras L, Thompson J (2010)
Digital harmonisation of adjacent analogue soil survey areas – 4 Iowa counties. In: 19th world
congress of soil science, IUSS, Brisbane
Chapter 9
Combining Continuous and Categorical
Modeling: Digital Soil Mapping of Soil Horizons
and Their Depths
The motivation for this chapter is to gain some insights into a digital soil mapping
approach that uses a combination of both continuous and categorical attribute
modeling. Subsequently, we will build on the efforts of the material in the chapters
that dealt with each of these types of modeling approaches separately. There are
some situations where a combinatorial approach might be suitable in a digital soil
mapping work flow.
An example of such a workflow is in Malone et al. (2015) with regard to the
prediction of soil depth. The context behind that approach was that often lithic
contact was not achieved during the soil survey activities, effectively indicating
soil depth was greater than the soil probing depth (which was 1.5 m). Where lithic
contact was found, the resulting depth was recorded. The difficulty in using this
data in the raw form was that there were many sites where soil depth was greater
than 1.5 m together with actual recorded soil depth measurements. The nature
of this type of data is likened to a zero-inflated distribution, where many zero
observations are recorded among actual measurements (Sileshi 2008). In Malone
et al. (2015) the zero observations were attributed to soil depth being greater
than 1.5 m. They therefore performed the modeling in two stages. First modeling
involved a categorical or binomial model of soil depth being greater than 1.5 m
or not. This was followed by continuous attribute modeling of soil depth using
the observations where lithic contact was recorded. While the approach was a
reasonable solution, the frequency of recorded measurements may be low, meaning
that the spatial modeling of the continuous attribute is made under considerable
uncertainty; this was the case in Malone et al. (2015) for soil depth and for other
environmental variables spatially modeled in that study, for example the frequency
of winter frosts.
Another interesting example of a combinatorial DSM work flow was described
in Gastaldi et al. (2012) for the mapping of occurrence and thickness of soil profiles.
There they used a multinomial logistic model to predict the presence or absence of
the given soil horizon class, followed by continuous attribute modeling of the horizon depths.
Fig. 9.1 Hunter Valley soil profiles locations overlaying digital elevation model
For the purposes of demonstrating the work flow of this combinatorial
or two-stage DSM, we will re-visit the approach described by Gastaldi et al.
(2012) and work through the various steps needed to perform it within R.
The data we will use come from 1342 soil profile and core descriptions
from the Lower Hunter Valley, NSW, Australia. These data have been collected
on an annual basis from 2001 to the present, and are distributed across the
220 km2 area as shown in Fig. 9.1. The intention is to use these data first to predict
the occurrence of given soil horizon classes (following the nomenclature of the
Australian Soil Classification (Isbell 2002)). Specifically we want to predict the
spatial distribution of the occurrence of A1, A2, AP, B1, B21, B22, B23, B24, BC,
and C horizons, and then, where those horizons occur, predict their depth.
First let's perform some data discovery both in terms of the soil profile data and
spatial covariates to be used as predictor variables and to inform the spatial mapping.
You will notice the soil profile data dat is arranged in a flat file where each row is a
soil profile. There are many columns of information which include profile identifier
and spatial coordinates. Then there are 11 further columns that are binary indicators
of whether a horizon class is present or not (indicated as 1 and 0 respectively). The
following 11 columns after the binary columns indicate the horizon depth for the
given horizon class.
library(sp)
library(raster)
# data
str(dat)
At our disposal are a number of spatial covariates that have been derived
from a digital elevation model, an airborne gamma radiometric survey, or Landsat
satellite spectral wavelengths. These are all registered to a common spatial
resolution of 25 m and have been organized together into a rasterStack.
# covariates
names(s1)
# resolution
res(s1)
## [1] 25 25
# raster properties
dim(s1)
For a quick check, let's overlay the soil profile points onto the DEM. You will
notice on Fig. 9.1 the area of concentrated soil survey (which represents locations
of annual survey) within the extent of a regional scale soil survey across the whole
study area.
plot(raster(files[17]))
points(dat, pch = 20)
The last preparatory step we need to take is the covariate intersection of the soil
profile data, and remove any sites that are outside the mapping extent.
# Covariate extract
ext <- extract(s1, dat, df = T, method = "simple")
w.dat <- cbind(as.data.frame(dat), ext)
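The removal step is shown below as a minimal sketch; it assumes that sites outside the mapping extent return NA covariate values and that the cleaned data frame is what is referred to as x.dat in the following code.
# remove sites with incomplete covariate information (assumed to be
# the sites falling outside the mapping extent)
x.dat <- w.dat[complete.cases(w.dat), ]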
A demonstration of the two-stage modeling work flow will be given for the A1
horizon, with some indication of the results for the other horizons and their
depths given further on. First we want to subset 75 % of the data for calibrating
the models, keeping the rest aside for validation purposes.
# A1 Horizon
x.dat$A1 <- as.factor(x.dat$A1)
# random subset
set.seed(123)
training <- sample(nrow(x.dat), 0.75 * nrow(x.dat))
# calibration dataset
dat.C <- x.dat[training, ]
# validation dataset
dat.V <- x.dat[-training, ]
library(nnet)
library(MASS)
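The fitting of the model itself is not reproduced above; a hedged sketch, in which a fuller multinomial model is first fitted and then simplified by stepwise AIC selection (stepAIC from MASS) to give the mn2 model summarized below, is:
# fit a multinomial (here effectively binomial) model of A1 horizon
# presence/absence, then simplify by stepwise selection (a sketch; the
# starting set of covariates is an assumption)
mn1 <- multinom(A1 ~ SAGA_wetness_index + r57 + r37 + r32 + Filled_DEM +
Altitude_Above_Channel_Network + Terrain_Ruggedness_Index + slope +
ndvi + MRVBF, data = dat.C)
mn2 <- stepAIC(mn1, direction = "both", trace = FALSE)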
summary(mn2)
## Call:
## multinom(formula = A1 ~ SAGA_wetness_index + r57 + r37 + r32 +
## Filled_DEM + Altitude_Above_Channel_Network, data = dat.C)
##
## Coefficients:
## Values Std. Err.
## (Intercept) -6.77776938 2.25295276
## SAGA_wetness_index 0.16402675 0.06413479
## r57 5.19885165 0.62748971
## r37 -3.18245854 1.50679841
## r32 -4.22725150 1.15143384
## Filled_DEM 0.02983971 0.00503181
## Altitude_Above_Channel_Network -0.03387555 0.01051877
##
## Residual Deviance: 629.0489
## AIC: 643.0489
We use the goofcat function from ithir to assess the model quality both in
terms of the calibration and validation data.
# calibration
mod.pred <- predict(mn2, newdata = dat.C, type = "class")
goofcat(observed = dat.C$A1, predicted = mod.pred)
## $confusion_matrix
## 0 1
## 0 28 20
## 1 109 841
##
## $overall_accuracy
## [1] 88
##
## $producers_accuracy
## 0 1
## 21 98
##
## $users_accuracy
## 0 1
## 59 89
##
## $kappa
## [1] 0.2492215
# validation
val.pred <- predict(mn2, newdata = dat.V, type = "class")
goofcat(observed = dat.V$A1, predicted = val.pred)
## $confusion_matrix
## 0 1
## 0 7 6
## 1 38 282
##
## $overall_accuracy
## [1] 87
##
## $producers_accuracy
## 0 1
## 16 98
##
## $users_accuracy
## 0 1
## 54 89
##
## $kappa
## [1] 0.1924603
It is clear that the mn2 model is not too effective for predicting sites where the
A1 horizon is absent.
What we want to do now is model the A1 horizon depth. We will be
using an alternative model to those that have been examined in this book so far.
This is a quantile regression forest, which is a generalized implementation of the
random forest model from Breiman (2001). The algorithm is available via the
quantregForest package, and further details about the model can be found
in Meinshausen (2006). The caret package also interfaces with this model.
Fundamentally, random forests are integral to the quantile regression algorithm.
However, the useful feature and advancement over normal random forests is the
ability to infer the full conditional distribution of a response variable. This facility
is useful for building non-parametric prediction intervals for any given level of
confidence, and also for detecting outliers in the data easily.
Quantile regression via the quantregForest algorithm is implemented in
this chapter simply to demonstrate the wide availability of prediction models and
machine learning methods that can be used in digital soil mapping.
To get the model initiated we first need to perform some preparatory tasks,
namely the removal of missing data from the available data set.
# validation
val.dat <- dat.V[!is.na(dat.V$A1d), ]
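The calibration subset and the fitting of the quantile regression forest are likewise given here as a hedged sketch; the covariate column positions (27 to 45) follow the prediction calls used below, and qrf is the fitted model object.
# calibration data with a recorded A1 horizon depth
mod.dat <- dat.C[!is.na(dat.C$A1d), ]
library(quantregForest)
set.seed(121)
# fit the quantile regression forest (covariates assumed to sit in
# columns 27 to 45; response is the A1 horizon depth)
qrf <- quantregForest(x = mod.dat[, 27:45], y = mod.dat$A1d)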
It is useful to check the inputs required for the quantile regression forest (using
the help file); however, its parameterization is largely similar to the other models that
have been used already in this book, particularly the random forest models.
Before we use the goof function to assess the model quality, a very helpful
graphical output from the model is the plot of the out-of-bag samples with respect to
whether the measured values are inside or outside their prediction interval (Fig. 9.2).
Recall that the out-of-bag samples are those that are not included in the regression
forest model iterations. Further note that it is also possible to define prediction
intervals to your own specification; the default output is for a 90 % prediction interval.
Fig. 9.2 90 % prediction intervals on out-of-bag data for predicting depth of A1 horizon
plot(qrf)
Naturally, the best test of the model is to use an external data set. In addition to
our normal validation procedure we can also derive the PICP for the validation data
too.
## Calibration
quant.cal <- predict(qrf, newdata = mod.dat[, 27:45], all = T)
goof(observed = mod.dat$A1d, predicted = quant.cal[, 2])
# Validation
quant.val <- predict(qrf, newdata = val.dat[, 27:45], all = T)
goof(observed = val.dat$A1d, predicted = quant.val[, 2])
# PICP
sum(quant.val[, 1] <= val.dat$A1d & quant.val[, 2] >=
val.dat$A1d)/nrow(val.dat)
## [1] 0.4479167
Based on the outputs above, the calibration seems a reasonable outcome
for the model, but it proves to be largely un-predictive for the validation data set.
We should also expect a PICP close to 90 %, but this is clearly not the case
above.
What has been covered above for the two-stage modeling is repeated for all the
other soil horizons, with the main results displayed in Table 9.1. These statistics are
reported based on the validation data. It is clear that there is a considerable amount
of uncertainty overall in the various soil horizon models. For some horizons the
results are a little encouraging; for example, the model to predict the presence of a
BC horizon is quite good. It is clear, however, that distinguishing between the different
B2 horizons is challenging, although predicting the presence or absence of a B22
horizon seems acceptable.
Another way to assess the quality of the two-stage modeling is to assess first the
number of soil profiles that have matching sequences of soil horizon types. We can
do this using the script shown below (after Table 9.1).
Table 9.1 Selected model validation diagnostics returned for each horizon class and associated
depth model

                  Presence/absence of horizon                 Depth of horizon
Horizon  Overall accuracy  User's accuracy            Kappa   Concordance  RMSE  PICP
A1       87 %              Pres = 89 %, Abs = 54 %    0.19    0.05         10    46 %
A2       87 %              Pres = 100 %, Abs = 87 %   0.04    0.10         12    42 %
AP       86 %              Pres = 50 %, Abs = 88 %    0.15    0.00         12    53 %
B1       91 %              Pres = 0 %, Abs = 91 %     0       0.16         12    45 %
B21      97 %              Pres = 97 %, Abs = 0 %     0       0.05         17    41 %
B22      73 %              Pres = 73 %, Abs = 34 %    0       0.10         14    41 %
B23      78 %              Pres = 0 %, Abs = 78 %     0       0.04         12    45 %
B24      97 %              Pres = 0 %, Abs = 97 %     0       0.00         22    46 %
BC       74 %              Pres = 68 %, Abs = 75 %    0.20    0.06         18    29 %
C        95 %              Pres = 0 %, Abs = 95 %     0       0            NA    68 %
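The matching script itself is given here as a hedged sketch; it assumes the predicted presence/absence indicators for the validation profiles are held in a data frame vv.dat with lower-case column names, as used in the matching step further below.
# proportion of validation profiles whose full sequence of predicted
# horizon classes matches the observed sequence
obs <- dat.V[, c("A1", "A2", "AP", "B1", "B21", "B22", "B23", "B24",
"B3", "BC", "C")]
pred <- vv.dat[, c("a1", "a2", "ap", "b1", "b21", "b22", "b23", "b24",
"b3", "bc", "c")]
matched <- apply(obs, 1, paste, collapse = "") ==
apply(pred, 1, paste, collapse = "")
sum(matched)/length(matched)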
## [1] 0.2222222
The result above indicates that just over 20 % of the validation soil profiles have
matched sequences of horizons. We can visually examine a few of these matched
profiles to see whether there is much coherence in terms of observed and
associated predicted horizon depths. We will select two soil profiles: one with
an AP horizon, and the other with an A1 horizon. We can demonstrate this using
the aqp package, which is a dedicated R package for handling soil profile data
collections.
# Subset of matching data (observations)
match.dat <- dat.V[which(dat.V$A1 == vv.dat$a1 & dat.V$A2 == vv.dat$a2 &
dat.V$AP == vv.dat$ap & dat.V$B1 == vv.dat$b1 & dat.V$B21 == vv.dat$b21 &
dat.V$B22 == vv.dat$b22 & dat.V$B23 == vv.dat$b23 & dat.V$B24 == vv.dat$b24 &
dat.V$B3 == vv.dat$b3 & dat.V$BC == vv.dat$bc & dat.V$C == vv.dat$c), ]
Now we just select any row where we know there is an AP horizon.
match.dat[49, ] #observation
## FID e n A1 A2 AP B1 B21 B22 B23 B24 B3 BC C A1d A2d APd B1d
## 195 642 338096 6372259 0 0 1 0 1 1 0 0 0 1 0 NA NA 10 NA
## B21d B22d B23d B24d B3d BCd Cd ID totalCount thppm
## 195 30 15 NA NA NA 45 NA 735 446.7597 7.192239
## Terrain_Ruggedness_Index slope SAGA_wetness_index r57 r37
## 195 0.846727 0.697118 13.34301 1.955882 0.794118
## r32 PC2 PC1 ndvi MRVBF MRRTF
## 195 1.542857 -1.89351 -2.239939 -0.076923 0.111123 3.746326
## Mid_Slope_Positon light_insolation kperc Filled_DEM drainage_2011
## 195 0.130692 1716.388 0.5863795 142.8293 3.909594
## Altitude_Above_Channel_Network
## 195 25.52147
match.dat.P[49, ] #prediction
## dat.V.FID a1 a2 ap b1 b21 b22 b23 b24 b3 bc c a1d a2d apd b1.1
## 195 642 0 0 1 0 1 1 0 0 0 1 0 18 16 21.92308 18
## b21d b22d b23d b24d b3d bcd cd
## 195 31 27 20 15.41176 NA 32 NA
We can see that in these two profiles the sequence of horizons is AP, B21, B22,
BC. Now we just need to upgrade the data to a soil profile collection. Using the
horizon classes together with the associated depths, we want to plot both soil profiles
for comparison. First we need to create a data frame of the relevant data, then
upgrade it to a soilProfileCollection, then finally plot. The script below
demonstrates this for the soil profile with the AP horizon. The same can be done
with the associated soil profile with the A1 horizon. The plot is shown on
Fig. 9.3.
# Horizon classes
H1 <- c("AP", "B21", "B22", "BC")
Fig. 9.3 Examples of observed soil profiles with associated predicted profiles from the two-stage
horizon class and horizon depth model
# Upper depths
U1 <- c(p1u, p2u)
# Lower depths
L1 <- c(p1l, p2l)
# Plot
plot(TT1, name = "H1", colors = "soil_color")
title("Selected soil with AP horizon", cex.main = 0.75)
Soil is very complex, and while there is a general agreement between observed
and associated predicted soil profiles, the power of the models used in this two-stage
example has certainly been challenged. Recreating the arrangement of soil horizons
together with maintenance of their depth properties is an interesting problem for
pedometric studies and one that is likely to be pursued with vigor as better methods
become available. The next section will briefly demonstrate a work flow for creating
maps that result from this type of modeling framework.
We will recall from previous chapters the process for applying prediction models
across a mapping extent. In the case of the two-stage model, the mapping work flow
is first to create the map of horizon presence/occurrence, and then to apply the horizon
depth model. In order to ensure that the depth model is not applied to areas
where a particular soil horizon is predicted as being absent, those areas are masked
out. Maps for the presence of the A1 and AP horizons and their respective depths
are displayed in Fig. 9.4. The following scripts show the process of applying the
two-stage model for the A1 horizon.
# Apply A1 horizon presence/absence model spatially using
# the raster multi-core facility
beginCluster(4)
A1.class <- clusterR(s1, predict, args = list(mn2, type = "class"),
filename = "class_A1.tif", format = "GTiff", progress = "text",
overwrite = T)
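The original script then applies the corresponding horizon depth model and masks out the areas where the horizon is predicted to be absent; that part is not reproduced in this extract. A minimal sketch of the second step is given below, in which the fitted depth model object (here called A1.depth.model), the class value coding absence, and the output filenames are all assumptions.

# Apply the A1 horizon depth model spatially
A1.depth <- clusterR(s1, predict, args = list(A1.depth.model),
    filename = "depth_A1.tif", format = "GTiff", progress = "text",
    overwrite = TRUE)
endCluster()

# Mask out areas where the A1 horizon is predicted as absent (here assumed to
# be coded as class value 1 in the class_A1.tif prediction)
A1.depth.masked <- mask(A1.depth, A1.class, maskvalue = 1,
    filename = "depth_A1_masked.tif", format = "GTiff", overwrite = TRUE)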
Fig. 9.4 Predicted occurrence of AP and A1 horizons, and their respective depths in the Lower
Hunter Valley, NSW
10 Digital Soil Assessments
Digital soil assessment goes beyond the goals of digital soil mapping. Digital soil
assessment (DSA) can be defined (from McBratney et al. (2012)) as the translation
of digital soil mapping outputs into decision making aids that are framed by the
particular, contextual human-value system which addresses the question/s at hand.
The concept of DSA was first framed by Carre et al. (2007) as a mechanism for assessing soil threats and soil functions, and for mechanistic soil simulations that assess risk-based scenarios to complement policy development. Very simply, DSA can be likened to the quantitative modeling of difficult-to-measure soil attributes. An
obvious candidate application for DSA is land suitability evaluation for a specified
land use type, which van Diepen et al. (1991) define as all methods to explain
or predict the use potential of land. The first part of this chapter will cover a
simple digital approach for performing this type of analysis. The second part of
the chapter will explore a different form of digital assessment by way of identifying
soil homologues (Mallavan et al. 2010).
Land evaluation in some sense has been in practice at least since the earliest known
human civilizations. The shift to sedentary agriculture from nomadic lifestyles is at
least indicative of a concerted effort of human investment to evaluate the potential
and capacity of land and its soils to support some form of agriculture like cropping
(Brevik and Hartemink 2010). In modern times there is a well-documented history of land evaluation practice and programs throughout the world, many of which are described in Mueller et al. (2010). Much of the current thinking around land evaluation for agriculture is well documented within the land evaluation
guidelines prepared by the Food and Agriculture Organization of the United Nations
(FAO) in 1976 (FAO 1976). These guidelines have strongly influenced and continue
to guide land evaluation projects throughout the world. The FAO framework is a
crop specific LSA system with a 5-class ranking of suitability (FAO Land Suitability
Classes) from 1: Highly Suitable to 5: Permanently Not Suitable. Given a suite of
biophysical information from a site, each attribute is evaluated against some expert-
defined thresholds for each suitability class. The final suitability evaluation for the site is that of the most limiting attribute.
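As a toy illustration of the most-limiting rule (the attribute classes below are made up), the overall class is simply the worst, i.e. numerically highest, of the per-attribute classes:

# FAO-style suitability classes assigned to each attribute at one hypothetical site
attr.class <- c(pH = 1, drainage = 2, rainfall = 4, temperature = 1)
# The overall evaluation is that of the most limiting attribute
max(attr.class)
## [1] 4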
Digital soil mapping is increasingly being used to complement land evaluation assessment. Examples (in Australia) include Kidd et al. (2012) in Tasmania and Harms et al. (2015) in Queensland. Perhaps an obvious reason is that, with digital soil and climate modeling, one can derive very attribute-specific maps that can be targeted to a particular agricultural land use or even to a specific enterprise (Kidd et al. 2012).
In this chapter, an example is given of a DSA where enterprise suitability is
assessed. The specific example is a digital land suitability assessment (LSA) for
hazelnuts across an area of northern Tasmania, Australia (Meander Valley) which
has been previously described in Malone et al. (2015). For context, the digital soil
assessment example has been one function of the Tasmanian Wealth from Water
project for developing detailed land suitability assessments (20 specific agricultural
enterprises) to support irrigated agricultural expansion across the state (Kidd et al.
2015, 2012). The project was commissioned for a couple of targeted areas, but has
since been rolled out across the state (Kidd et al. 2015). Further general information
about the project can be found at http://dpipwe.tas.gov.au/agriculture/investing-in-irrigation/enterprise-suitability-toolkit.
The example considered in this chapter is intended to give an overview of how to perform a relatively simple DSA. Using the most-limiting factor approach to land suitability assessment, the procedure requires a pixel-by-pixel assessment of a number of input variables which have been expert-defined as being important for the establishment and growth of hazelnuts. Malone et al. (2015) describe the digital mapping processes that went into creating the input variables for this example. The approach also assumes that the predicted maps of the input variables are error free. Figure 10.1 shows an example of
the input variable requirements and the suitability thresholds for hazelnuts. You will
notice the biophysical variables include both soil and climatic variables, and the
suitability classification has four levels of grading.
Probably the first thing to consider for enabling the DSA in this example is to codify the information in Fig. 10.1 into an R function. It would look something like the following script.
# HAZELNUT SUITABILITY ASSESSMENT FUNCTION
hazelnutSuits <- function(samp.matrix) {
out.matrix <- matrix(NA, nrow = nrow(samp.matrix), ncol = 10)
# Chill Hours
out.matrix[which(samp.matrix[, 1] > 1200), 1] <- 1
out.matrix[which(samp.matrix[, 1] > 600 & samp.matrix[, 1] <= 1200), 1] <- 2
out.matrix[which(samp.matrix[, 1] <= 600), 1] <- 4
The figure tabulates, for each of the four suitability classes (W, S, MS and U), the expert-defined thresholds for: soil depth; depth to a sodic layer; pH of the top 15 cm (H2O); EC of the top 15 cm; texture (% clay of the top 15 cm); drainage; stoniness of the top 15 cm; frost risk; mean maximum monthly temperature; mean March rainfall; and chill hours (0-7 °C, April-August inclusive).
Well suited (W): Land having no significant limitations to sustained application of a given use, or only minor limitations that will not significantly reduce productivity or benefits and will not raise inputs above an acceptable level. Any risk of crop loss is inherently low or can be easily overcome with management practices that are easy and cheap to implement.
Suitable (S): Land having limitations which are moderately severe for sustained application of a given use; the limitations will reduce productivity or benefits and increase required inputs to the extent that the overall advantage to be gained from the use, although still attractive, will be appreciably inferior to that expected on Class S1 land. Risk of crop loss is moderately high, or requires management practices that are difficult or costly to implement.
Marginally suitable (MS): Land having limitations which are severe for sustained application of a given use and will so reduce productivity or benefits, or increase required inputs, that this expenditure will be only marginally justified. Risk of crop loss may be high.
Unsuitable (U): Land which has qualities that appear to preclude sustained use of the kind under consideration.
Fig. 10.1 Suitability parameters and thresholds for hazelnuts (Sourced from DPIPWE 2015)
# Clay content
out.matrix[which(samp.matrix[, 2] > 10 & samp.matrix[, 2] <= 30), 2] <- 1
out.matrix[which(samp.matrix[, 2] > 30 & samp.matrix[, 2] <= 50), 2] <- 2
out.matrix[which(samp.matrix[, 2] > 50 | samp.matrix[, 2] <= 10), 2] <- 4
# Soil Drainage
out.matrix[which(samp.matrix[, 3] > 3.5), 3] <- 1
out.matrix[which(samp.matrix[, 3] <= 3.5 & samp.matrix[, 3] > 2.5), 3] <- 2
out.matrix[which(samp.matrix[, 3] <= 2.5 & samp.matrix[, 3] > 1.5), 3] <- 3
out.matrix[which(samp.matrix[, 3] <= 1.5), 3] <- 4
# EC (transformed variable)
out.matrix[which(samp.matrix[, 4] <= 0.15), 4] <- 1
out.matrix[which(samp.matrix[, 4] > 0.15), 4] <- 4
# Frost
out.matrix[which(samp.matrix[, 10] == 0), 5] <- 1
out.matrix[which(samp.matrix[, 10] != 0 & samp.matrix[, 5] >= 80), 5] <- 1
out.matrix[which(samp.matrix[, 10] != 0 & samp.matrix[, 5] < 80 & samp.matrix
[, 5] >= 60), 5] <- 2
out.matrix[which(samp.matrix[, 10] != 0 & samp.matrix[, 5] < 60 & samp.matrix
[, 5] >= 40), 5] <- 3
out.matrix[which(samp.matrix[, 10] != 0 & samp.matrix[, 5] < 40), 5] <- 4
# pH
out.matrix[which(samp.matrix[, 6] <= 6.5 & samp.matrix[, 6] >= 5.5), 6] <- 1
out.matrix[which(samp.matrix[, 6] > 6.5 & samp.matrix[, 6] <= 7.1), 6] <- 3
out.matrix[which(samp.matrix[, 6] < 5.5 | samp.matrix[, 6] > 7.1), 6] <- 4
# rainfall
out.matrix[which(samp.matrix[, 7] <= 50), 7] <- 1
out.matrix[which(samp.matrix[, 7] > 50), 7] <- 4
# soil depth
out.matrix[which(samp.matrix[, 13] == 0), 8] <- 1
out.matrix[which(samp.matrix[, 13] != 0 & samp.matrix[, 8] > 50), 8] <- 1
out.matrix[which(samp.matrix[, 13] != 0 & samp.matrix[, 8] <= 50 & samp.matrix
[, 8] > 40), 8] <- 2
out.matrix[which(samp.matrix[, 13] != 0 & samp.matrix[, 8] <= 40 & samp.matrix
[, 8] > 30), 8] <- 3
out.matrix[which(samp.matrix[, 13] != 0 & samp.matrix[, 8] <= 30), 8] <- 4
# temperature
out.matrix[which(samp.matrix[, 9] > 20 & samp.matrix[, 9] <= 30), 9] <- 1
out.matrix[which(samp.matrix[, 9] > 30 & samp.matrix[, 9] <= 33 | samp.matrix
[, 9] <= 20 & samp.matrix[, 9] > 18), 9] <- 2
out.matrix[which(samp.matrix[, 9] > 33 & samp.matrix[, 9] <= 35), 9] <- 3
out.matrix[which(samp.matrix[, 9] > 35 | samp.matrix[, 9] <= 18), 9] <- 4
# rocks
out.matrix[which(samp.matrix[, 11] == 0), 10] <- 1
out.matrix[which(samp.matrix[, 11] != 0 & samp.matrix[, 12] <= 2), 10] <- 1
out.matrix[which(samp.matrix[, 11] != 0 & samp.matrix[, 12] == 3), 10] <- 2
out.matrix[which(samp.matrix[, 11] != 0 & samp.matrix[, 12] == 4), 10] <- 3
out.matrix[which(samp.matrix[, 11] != 0 & samp.matrix[, 12] > 4), 10] <- 4
return(out.matrix)
}
Assuming there are digital soil and climate maps already created for use in the hazelnut land suitability assessment, it is relatively straightforward to run the LSA. Keep in mind that the biophysical variable maps were created via a number of means, which included continuous attribute modeling, binomial and ordinal logistic regression, and a combination of both, i.e. the two-stage mapping process. So let's get some sense of the LSA input variables.
## [1] "X1_chill_HAZEL_FINAL_meander"
## [2] "X2_clay_FINAL_meander"
## [3] "X3_drain_FINAL_meander"
## [4] "X4_EC_cubist_meander"
## [5] "X5_Frost_HAZEL_FINAL_meander"
## [6] "X6_pH_FINAL_meander"
## [7] "X7_rain_HAZEL_FINAL_meander"
## [8] "X8_soilDepth_FINAL_meander"
## [9] "X9_temp_HAZEL_FINAL_meander"
## [10] "X5_Frost_HAZEL_binaryClass_meander"
## [11] "X10_rocks_binaryClass_meander"
## [12] "X11_rocks_ordinalClass_meander"
## [13] "X8_soilDepth_binaryClass_meander"
class(lsa.variables)
## [1] "RasterStack"
## attr(,"package")
## [1] "raster"
# Raster resolution
res(lsa.variables)
## [1] 30 30
So there are 13 rasters of data, which you will note coincide with the inputs
required for the hazelnutSuits function. Now all that is required is to go
pixel by pixel and apply the LSA function. In R the implementation may take the
following form:
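The script for this step falls on a page that is not reproduced in this extract. A minimal sketch that is consistent with the looping code shown later, and with the output printed below, might look like the following (deriving the most limiting class with apply() and max() is an assumption):

# Covariate values for the first row of the input rasters
cov.Frame <- getValues(lsa.variables, 1)
nrow(cov.Frame)  # number of pixels in the row
# Keep only the pixels with a full suite of input data
sub.frame <- cov.Frame[which(complete.cases(cov.Frame)), ]
nrow(sub.frame)
# Run the suitability function and take the most limiting class for each pixel
apply(hazelnutSuits(sub.frame), 1, max)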
## [1] 1685
## [1] 27
names(sub.frame)
## NULL
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4
The above script has just performed the hazelnut LSA upon the entire first row
of the input variable rasters. There were only 27 pixels where the full suite of data
was available in this case. To do the LSA for the entire mapping extent we could
effectively run the script above for each row of the input rasters. Naturally, this
would take an age to do manually, so it might be more appropriate to use the script
above inside a for loop where the row index changes for each loop. Alternatively,
the custom hazelnutSuits function could be used as an input argument for
the raster package calc function, or better still using the clusterR function
if there is a need to do the LSA in parallel mode across multiple compute nodes.
Given the available options, we will demonstrate the mapping process using the looping approach. While it may be computationally slower, it illustrates the concept of applying the LSA spatially with greater clarity.
From the above script and resulting output, we essentially returned a vector of integers, and effectively lost all links to the fact that the inputs and resulting outputs are spatial data. Consequently there is a need to link the suitability assessment back to the mapping. Fortunately we know the column positions of the pixels where the full suite of input data was available. So once we have set up a raster object to which data can be written (and which has the same raster properties as the input data), it is just a matter of placing the LSA outputs into the same row and column positions as those of the input data. A full example is given below, but for the present purpose, the placement of the LSA outputs into a raster would look something like the following:
# A one-column matrix with number of rows equal to the number of
# columns in the raster inputs
a.matrix <- matrix(NA, nrow = nrow(cov.Frame), ncol = 1)
Naturally the above few lines of script would also be embedded into the looping
process as described above. Below is an example of putting all these operations
together to ultimately produce a map of the suitability assessment. To add a slight
layer of complexity, we may also want to produce maps of the suitability assessment
for each of the input variables. This helps in determining which factors are causing
the greatest limitations and where they occur. First, we need to create a number of
rasters to which we can write the outputs of the LSA.
# Create a suite of rasters with the same raster properties
# as the LSA input variables
# Overall suitability classification
LSA.raster <- raster(lsa.variables[[1]])
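Because the looping approach below is completed (see the sketch after the loop) by writing the classified values to file row by row, a file connection also needs to be opened on LSA.raster before the loop starts. A minimal sketch, with an assumed output filename, is:

# Open a file connection so rows of suitability classes can be written
# as they are computed
LSA.raster <- writeStart(LSA.raster, filename = "hazelnut_suitability.tif",
    format = "GTiff", overwrite = TRUE)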
Now we can implement the for loop procedure and do the LSA for the entire
mapping extent.
# Run the suitability model. Open loop: for each row of each input
# raster, get the raster values
for (i in 1:dim(LSA.raster)[1]) {
cov.Frame <- getValues(lsa.variables, i)
# get the complete cases
sub.frame <- cov.Frame[which(complete.cases(cov.Frame)), ]
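    # NOTE: the remainder of this loop body is not reproduced in this extract.
    # The lines below are a minimal sketch of how it might be completed; the use
    # of writeValues() (with the writeStart() call sketched above) and the
    # most-limiting rule via max() are assumptions.
    cc <- which(complete.cases(cov.Frame))
    a.matrix <- matrix(NA, nrow = nrow(cov.Frame), ncol = 1)
    if (length(cc) > 0) {
        # most limiting factor: the overall class is the maximum per-pixel class
        a.matrix[cc, 1] <- apply(hazelnutSuits(cov.Frame[cc, , drop = FALSE]),
            1, max)
    }
    # write this row of suitability classes to the output raster
    LSA.raster <- writeValues(LSA.raster, a.matrix[, 1], i)
}
LSA.raster <- writeStop(LSA.raster)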
As you may find, the above script can take quite a while to complete, but ultimately you should be able to produce a number of mapping products. Figure 10.2 shows the map of the overall suitability classification, and the script to produce it is below.
library(rasterVis)
# plot
area_colors <- c("#FFFF00", "#1D0BE0", "#1CEB15", "#C91601")
levelplot(LSA.raster, col.regions = area_colors, xlab = "",
ylab = "")
Fig. 10.2 Digital suitability assessment for hazelnuts across the Meander Valley, Tasmania
(assuming all LSA input variables are error free)
Similarly, the above plotting procedure can be repeated to look at single input-variable limitations too.
The approaches used in this chapter are described in greater detail in Malone et al. (2015), within the context of taking account of uncertainties within LSA. Taking account of the input variable uncertainties adds an additional level of complexity to what was achieved above, but it is an important consideration nonetheless, as the resulting outputs can then be assessed for reliability in an objective manner. However, that particular workflow is not covered here, as this chapter is only meant to provide a general perspective and a relatively simple example of a real-world digital soil assessment.
10.2 Homosoil: A Procedure for Identifying Areas with Similar Soil Forming Factors
In many places in the world, soil information is difficult to obtain or even non-existent. When no detailed maps or soil observations are available in a region of interest, we have to interpolate or extrapolate from other parts of the world. When dealing with global modeling at a coarse resolution, we can interpolate or extrapolate soil observations from other similar areas that are geographically close, or use spatial interpolation or a spatial soil prediction function.
Homosoil is a concept proposed by Mallavan et al. (2010) which assumes the homology of predictive soil-forming factors between a reference area and the region of interest. These factors include the climate, parent materials, and physiography of the area. We created the homosoil function to illustrate the concept. It is relatively simple: given any location (latitude, longitude) in the world, the function will determine other areas of the world that share a similar climate, lithology and topography. Shortly we will unpack the function into its elemental components. First we will describe the data and how similarity between sites is measured.
The basis is a global 0.5° × 0.5° grid of climate, topography, and lithology data. For climate, this consists of variables representing long-term mean monthly and seasonal temperature, rainfall, solar radiation and evapotranspiration. We also use a DEM representing topography, and lithology, which gives broad information on the parent material. The climate data come from the ERA-40 reanalysis and the Climatic Research Unit (CRU) dataset. More details on the datasets are available at http://www.ipcc-data.org/obs/get_30yr_means.html. For each of the 4 climatic variables (rainfall, temperature, solar radiation and evapotranspiration), we calculated 13 indicators: annual mean, mean of the driest month, mean of the wettest month, annual range, driest quarter mean, wettest quarter mean,
coldest quarter mean, hottest quarter mean, lowest ET quarter mean, highest ET
quarter mean, darkest quarter mean, lightest quarter mean, and seasonality. From this analysis, a total of 52 global climatic variables were compiled.
The DEM is from the Hydro1k dataset supplied by the USGS (https://lta.cr.usgs.gov/HYDRO1K), which includes the mean elevation, slope, and compound topographic index (CTI).
The lithology is from a global digital map (Durr et al. 2005) with seven values which represent the different broad groups of parent materials. The lithology classes are: non- or semi-consolidated sediments, mixed consolidated sediments, siliciclastic sediments, acid volcanic rocks, basic volcanic rocks, complexes of metamorphic and igneous rocks, and complex lithology.
This global data is available as a data.frame from the ithir package.
library(ithir)
data(homosoil_globeDat)
The climatic and topographic similarity between two points of the grid is calculated with the Gower similarity measure (Gower 1971). The Gower measure gives the similarity Sij between sites i and j, each described by p variables, standardized by the range of each variable:
S_{ij} = 1 - \frac{1}{p} \sum_{k=1}^{p} \frac{\lvert x_{ik} - x_{jk} \rvert}{\mathrm{range}(k)} \qquad (10.1)
where p denotes the number of climatic variables and |x_{ik} - x_{jk}| represents the absolute difference of climate variable k between sites i and j. The similarity index has a value between 0 and 1 and is applicable to continuous variables. For categorical variables, such as those for lithology, we simply want to match category for category between sites i and j.
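As a toy illustration of Eq. 10.1 (the variable values and ranges below are made up), the per-variable similarities are simply averaged:

# Climate values at two hypothetical sites, and the global range of each variable
x.i <- c(rain = 800, temp = 18, srad = 15)
x.j <- c(rain = 650, temp = 21, srad = 14)
rng <- c(rain = 3000, temp = 40, srad = 25)

# Gower similarity: mean of the per-variable similarities
mean(1 - abs(x.i - x.j) / rng)
## [1] 0.945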
Considering the scale and resolution of this study and the available global data (0.5° × 0.5°), the climatic factor is probably the most important and reliable soil forming factor. This is inspired by the study of Bui et al. (2006), who showed that at the continental extent the state factors of soil formation form a hierarchy of interacting variables, with climate being the most important factor, and with different climatic variables dominating in different regions. This is not to say that climate is the most important factor at all scales. Their results also show that lithology is almost equally important in defining broad-scale spatial patterns of soil properties, and that shorter-range variability in soil properties appears to be driven more by terrain variables.
Essentially, in what follows we are creating the homosoil function. This function takes three inputs: grid_data, which is the global environmental dataset, and recipient.lon and recipient.lat, which correspond to the coordinates of the site for which we want to find soil homologues. For brevity we will call this the recipient site. Inside the homosoil function, we first encounter another function which is an encoding of Gower's similarity measure as defined previously. This is followed by a number of indexation steps (to make the following steps clearer to implement), where we explicitly make groupings of the data; for example, the object grid.climate is composed of all the global climate information from the grid_data object. Finally we make the object world.grid, which is a data.frame for putting the outputs of the function into. Ultimately this object will be returned at the end of the function execution.
homosoil <- function (grid_data,recipient.lon,recipient.lat) {
#Gower’s similarity function
gower <- function(c1, c2, r) 1-(abs(c1-c2)/r)
We then want to find which global grid point is closest to the recipient site, based on the Euclidean distance of the coordinates, and then extract the climate, lithological, and topographical data recorded for that nearest grid point.
# find the closest recipient point
dist = sqrt((recipient.lat - grid.lat)^2 + (recipient.lon
- grid.lon)^2)
imin = which.min(dist)
We can then determine which grid points are most similar to the recipient site. Here we use an arbitrarily selected cutoff of 0.85, which corresponds to the top 15 % of grid cells most similar to the recipient site. Lastly, we save the results of the homocline analysis to the world.grid object.
Now we want to find, within the areas we have defined as homocline, areas that are homolith. We simply want to find where the lithology matches that of the recipient site. We can do this for the entire globe, and then index those sites that also correspond to homoclines. Again, we save the results of the homolith analysis to the world.grid object.
Now we want to find, within the areas we have defined as homolith, areas that are homotop. This analysis is initiated by estimating the Gower similarity measure for the whole globe. These steps are just the same as before for the climate data, except now we are using the topographic data. Again we use the arbitrarily selected threshold corresponding to the top 15 %. We then determine those areas that are homotop within the areas that are homolith.
That more-or-less completes the homosoil analysis. The last few tasks are to create a raster object of the soil homologues, and to direct the homosoil function to return the relevant outputs, which here are the world.grid object and the raster object of the soil homologues. Then finally we close the function.
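The bodies of these remaining steps are not reproduced in this extract. A condensed sketch of what the rest of the function might look like is given below; the object names (grid.climate, grid.topo, grid.lith, grid.lon, grid.lat and world.grid, assumed to have been created in the earlier indexation steps), the use of an 85th-percentile similarity cutoff, and the 0-3 coding of the output raster (built with the raster package, assumed loaded) are all assumptions based on the description above.

    # Climatic and topographic data for the grid point nearest the recipient site
    rec.clim <- as.numeric(grid.climate[imin, ])
    rec.topo <- as.numeric(grid.topo[imin, ])
    clim.range <- apply(grid.climate, 2, max) - apply(grid.climate, 2, min)
    topo.range <- apply(grid.topo, 2, max) - apply(grid.topo, 2, min)

    # Homocline: mean Gower similarity over all climatic variables
    S.clim <- apply(grid.climate, 1, function(z) mean(gower(z, rec.clim, clim.range)))
    world.grid$homocline <- as.numeric(S.clim >= quantile(S.clim, 0.85, na.rm = TRUE))

    # Homolith: lithology class match, restricted to the homoclines
    world.grid$homolith <- as.numeric(grid.lith == grid.lith[imin] &
        world.grid$homocline == 1)

    # Homotop: mean Gower similarity over the topographic variables,
    # restricted to the homoliths
    S.topo <- apply(grid.topo, 1, function(z) mean(gower(z, rec.topo, topo.range)))
    world.grid$homotop <- as.numeric(S.topo >= quantile(S.topo, 0.85, na.rm = TRUE) &
        world.grid$homolith == 1)

    # Raster of the soil homologues:
    # 0 = no match, 1 = homocline, 2 = homolith, 3 = homotop
    homo.code <- world.grid$homocline + world.grid$homolith + world.grid$homotop
    homo.raster <- rasterFromXYZ(data.frame(x = grid.lon, y = grid.lat,
        z = homo.code))

    # Return the raster of homologues and the full results table
    return(list(homo.raster, world.grid))
}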
With the homosoil function now established, let's put it to use. The coordinates below correspond to a location in Jakarta, Indonesia.
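The call itself is not shown in this extract; a sketch using approximate coordinates for Jakarta, and an assumed point object (dats) for the recipient site used in the plotting code below, might be:

# Approximate coordinates for Jakarta, Indonesia
recipient.lon <- 106.8
recipient.lat <- -6.2

# Run the homosoil analysis against the global grid data
result <- homosoil(grid_data = homosoil_globeDat,
    recipient.lon = recipient.lon, recipient.lat = recipient.lat)

# A spatial point of the recipient site, for plotting
library(sp)
dats <- data.frame(x = recipient.lon, y = recipient.lat)
coordinates(dats) <- ~x + y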
Then we plot the result. Here we want to use the map object that was
created inside the homosoil function. We also specify colors to correspond to
the non-homologue areas, homoclines, homoliths, and homotops. Because of the
hierarchical nature of the homosoil analysis, essentially the homotops are the soil
homologues to the recipient site (Fig. 10.3).
# plot
area_colors <- c("#EFEFEF", "#666666", "#FFDAD4", "#FF0000")
levelplot(result[[1]], col.regions = area_colors,
xlab = "", ylab = "") + layer(sp.points(dats,
col = "green", pch = 20, cex = 2))
Using the other object that is returned from the homosoil function (the world.grid data frame into which the analysis outputs were placed), we can also map out the homologues individually; for example, we may just want to map the homoclines. That is something you can work out in your own time.
Fig. 10.3 Soil homologues to an area of Jakarta, Indonesia (green dot on map)
References
Brevik EC, Hartemink AE (2010) Early soil knowledge and the birth and development of soil
science. Catena 83(1):23–33
Bui EN, Henderson BL, Viergever K (2006) Knowledge discovery from models of soil properties
developed through data mining. Ecol Model 191:431–446
Carre F, McBratney AB, Mayr T, Montanarella L (2007) Digital soil assessments: beyond DSM.
Geoderma 142(1–2):69–79
DPIPWE (2015) Enterprise suitability toolkit [online] dpipwe.tas.gov.au
Durr HH, Meybeck M, Durr SH (2005) Lithologic composition of the Earth's continental surfaces
derived from a new digital map emphasizing riverine material transfer. Glob Biogeochem
Cycles 19:GB4S10
FAO (1976) A framework for land evaluation. Soils bulletin, vol 32. Food and Agriculture
Organisation of the United Nations, Rome
Gower JC (1971) A general coefficient of similarity and some of its properties. Biometrics 27:857–
871
Harms B, Brough D, Philip S, Bartley R, Clifford D, Thomas M, Willis R, Gregory L (2015) Digital
soil assessment for regional agricultural land evaluation. Global Food Secur 5:25–36
Kidd DB, Webb MA, Malone BP, Minasny B, McBratney AB (2015) Digital soil assessment of agricultural suitability, versatility and capital in Tasmania, Australia. Geoderma Reg 6:7–21
Kidd DB, Webb MA, Grose CJ, Moreton RM, Malone BP, McBratney AB, Minasny B, Viscarra-
Rossel R, Sparrow LA, Smith R (2012) Digital soil assessment: guiding irrigation expansion in
Tasmania, Australia. In: Minasny B, Malone BP, McBratney AB (eds) Digital soil assessment
and beyond. CRC, Boca Raton, pp 3–9
Mallavan BP, Minasny B, McBratney AB (2010) Homosoil: a methodology for quantitative
extrapolation of soil information across the globe. In: Boettinger JL, Howell DW, Moore AC,
Hartemink AE, Kienast-Brown S (eds) Digital soil mapping: bridging research, environmental
application, and operation. Springer, New York, pp 137–149
Malone BP, Kidd DB, Minasny B, McBratney AB (2015) Taking account of uncertainties in digital
land suitability assessment. PeerJ 3:e1366
McBratney AB, Minasny B, Wheeler I, Malone BP, Linden DVD (2012) Frameworks for digital
soil assessment. In: Minasny B, Malone BP, McBratney AB (eds) Digital soil assessments and
beyond. CRC Press, London, pp 9–15
Mueller L, Schindler U, Mirschel W, Shepherd T, Ball B, Helming K, Rogasik J, Eulenstein F,
Wiggering H (2010) Assessing the productivity function of soils. A review. Agron Sustain Dev
30(3):601–614
van Diepen C, van Keulen H, Wolf J, Berkhout J (1991) Land evaluation: From intuition to
quantification. In: Stewart B (ed) Advances in soil science. Advances in soil science, vol 15.
Springer, New York, pp 139–204