Case study with cross-correlated variables

Introduction/Problem definition

In many geographical studies variables are cross-correlated, i.e. in case of positive correlation high and low trend occur simultaneously. As a result uncertainty in one variable may be statistically dependent on uncertainty in the other. For example, the spatial properties of soil, such as organic carbon (OC) and nitrogen (N) content are cross-correlated. These variables are used to derive C/N ratio, vital information to evaluate soil management and to increase the crop productivity, but maps of OC and N are approximations encumbered with errors. These errors will propagate through the calculation into the C/N prediction.

We can use the Monte Carlo (MC) method to analyse how the error propagates through spatial operations and models (briefly described in the next section). The MC method is fairly straightforward in application, but in case of spatially distributed cross-correlated variables like OC and N one should consider taking spatial cross-correlation into account. That is because the model output (e.g. C/N) may be influenced by the spatial cross-correlation in the input.

Monte Carlo methodology for spatial uncertainty analysis with spatially cross-correlated variables

The adapted uncertainty propagation analysis approach is based on the Monte Carlo method that computes the output of the model repeatedly, with input values that are randomly sampled from their joint probability distribution function (pdf). The set of model outputs forms a random sample from the output pdf, so that the parameters of the distribution, such as the mean, variance and quantiles, can be estimated from the sample. The method thus consists of the following steps:

Characterise uncertain model inputs with a joint pdf.
Repeatedly sample from the joint pdf of uncertain inputs.
Run model with sampled inputs and store model outputs.
Compute summary statistics of model outputs.

Note that the above ignores uncertainty in model parameters and model structure, but these can easily be included if available as pdfs. A random sample from the model inputs can be obtained using an appropriate pseudo-random number generator.

For each uncertain spatially distributed continuous variable, such as soil organic carbon content, rainfall or elevation we assume the following geostatistical model:

Z(x)= μ(x)+ σ(x)∙ε(x)

where x is geographic location, μ is the (deterministic) mean of Z, σ is its standard deviation and ε is is standard normal, second-order stationary stochastic residual, whose spatial autocorrelation is modelled with a semivariogram or correlogram. Note that ε has zero mean and unit variance. Both μ and σ may vary in space so that spatial trends and spatially variable uncertainty can be taken into account. The cross-correlations are modelled using a linear model of co-regionalization (Wackernagel, 2003). The random sample is drawn from the pdf of ε to further calculate a sample from Z.

C/N uncertainty propagation analysis with ‘spup’

Preliminaries - load and view the data

The example data for C/N calculations are a 250m resolution mean OC and TN (total N) of a 33km x 33km area adjacent to lake Alaotra in Madagascar.

The ‘Madagascar’ dataset contains four spatial objects: a mean OC and TN of the area and their standard deviations. It also has a saved function that calculates C/N using OC and TN that will be used later.

# load packages
library(spup)
library(raster)
library(purrr)

# set seed
set.seed(12345)

# load and view the data
data(OC, OC_sd, TN, TN_sd)
par(mfrow = c(1,2))
class(OC)

[1] "RasterLayer"
attr(,"package")
[1] "raster"

class(TN)

[1] "RasterLayer"
attr(,"package")
[1] "raster"

plot(OC, main = "Mean of Organic Carbon") 
plot(TN, main = "Mean of Total Nitrogen")

summary(OC)

        OC_Madagaskar_xd1_250m
Min.                   7.00000
1st Qu.               14.33333
Median                23.00000
3rd Qu.               39.66667
Max.                 103.00000
NA's                1545.00000

summary(TN)

        TN_Madagaskar_xd1_250m
Min.                     0.609
1st Qu.                  1.000
Median                   1.320
3rd Qu.                  2.080
Max.                     3.740
NA's                  1545.000

Define uncertainty model (UM) for the cross-correlated OC and TN

The first step in uncertainty propagation analysis is to define an uncertainty model for the uncertain input variables, here OC and TN, that will be used in the Monte Carlo uncertainty propagation analysis.

First, the marginal uncertainty model is defined for each variable separately, and next the joint uncertainty model is defined for the variables together.

In case of OC and TN, the ε are spatially correlated and in order to include this in the analysis, we need to describe it by spatial correlograms. For each of the variables, the makeCRM() function collates all necessary information into a list.

Let us assume that the spatial autocorrelation of the OC and TN errors is an are described by spherical correlation function with a short-distance correlation of 0.6 for OC and 0.4 for TN, and a range parameter of 1000m. It is important at this step to ensure that the correlation function types as well as the ranges are the same for each variable. It is a requirement for further analysis, becasue spup employs the model of co-regionalization (Wackernagel, H. 2003).

# define spatial correlogram models
OC_crm <- makeCRM(acf0 = 0.6, range = 5000, model = "Sph")
TN_crm <- makeCRM(acf0 = 0.4, range = 5000, model = "Sph")

We can view the correlograms by plotting them.

plot(OC_crm, main = "OC correlogram")
plot(TN_crm, main = "TN correlogram")

Spatial correlograms summarise patterns of spatial autocorrelation in data and model residuals. They show the degree of correlation between values at two locations as a function of the separation distance between the locations. In the case above the correlation declines with distance, as is usually the case. The correlation is zero for distances greater than 1000m. More about correlograms is included in the DEM vignette.

In order to complete the description of each individual uncertain variable we use the defineUM() function that collates all information about the OC and TN uncertainty. The minimum information required is:

a logical value that indicates if the variablet is uncertain,
the type of the probability distribution. In case of variables with spatially correlated errors only the normal distribution is supported. For details on supported distributions and required parameters see ?defineUM,
the list of distribution parameters, for example a mean and a standard deviation for the normal distribution. In the case presented here, these are maps of the means and standard deviations of the OC or TN,
correlogram model,
the variable id.

# define uncertainty model for the OC and TN
OC_UM <- defineUM(TRUE, distribution = "norm", distr_param = c(OC, OC_sd), crm = OC_crm, id = "OC")
TN_UM <- defineUM(TRUE, distribution = "norm", distr_param = c(TN, TN_sd), crm = TN_crm, id = "TN")
class(OC_UM)

[1] "MarginalNumericSpatial"

class(TN_UM)

[1] "MarginalNumericSpatial"

Both variables are of the same class “MarginalNumericSpatial”. This is one of the requirements for defining a multivariate uncertainty model next. We use the defineMUM() function to collate information about uncertainty in each variable as above, and information about their cross-correlation. The required function arguments are:

a list of defined uncertainty models (by defineUM()) for the selected variables,
a correlation matrix.

The correlation matrix must satisfy a range of conditions: - square, - symmetrical (transposed must be the same as original), - diagonal must all be 1, - all values must belong to [-1, +1] range, - it has to be positive-definite (all eigenvalues must be > 0).

# define multivariate uncertainty model
mySpatialMUM <- defineMUM(UMlist = list(OC_UM, TN_UM), 
                          cormatrix = matrix(c(1,0.7,0.7,1), nrow=2, ncol=2))
class(mySpatialMUM)

[1] "JointNumericSpatial"

Generate possible realities of OC and TN

Generating possible realities of the selected variables can be completed by using the genSample() function. The required information to pass to the function includes:

defined uncertain object (as above).
number of realizations per variable to return.
sampling method. In case of spatially correlated variables, the method “ugs” (method based on unconditional Gaussian simulation) is recommended, otherwise spatial correlation will not be taken into account. Other sampling methods include “randomSampling” and “lhs” (Latin Hypercube Sampling), where cross-correlations can be accounted for but spatial auto- and cross- correlations are neglected. See ?genSample() for more details.

Additional parameters may be also specified. For example, sampling of spatially correlated variable is based on the ‘gstat’ package that allows for limiting the number of nearest observations that should be used for simulation.

# create possible realizations from the joint distribution of OC and TN
OCTN_sample <- genSample(UMobject = mySpatialMUM, n = 3, samplemethod = "ugs", nmax = 20, asList = FALSE)

Linear Model of Coregionalization found. Good.
[using unconditional Gaussian cosimulation]

Note the argument ‘asList’ has been set to FALSE. This indicates that the sampling function will return an object of a class of the distribution parameters class. This is useful if you want to visualize the sample or compute summary statistics quickly.

# view the sample structure
OCTN_sample

class      : RasterStack 
dimensions : 134, 135, 18090, 6  (nrow, ncol, ncell, nlayers)
resolution : 250, 250  (x, y)
extent     : 3024625, 3058375, -2514625, -2481125  (xmin, xmax, ymin, ymax)
crs        : +proj=laea +lat_0=5 +lon_0=20 +x_0=0 +y_0=0 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
names      :    OC.sim1,    OC.sim2,    OC.sim3,    TN.sim1,    TN.sim2,    TN.sim3 
min values :  6.1980525,  6.3100354,  5.9434249,  0.5267884,  0.5682796,  0.5361586 
max values : 109.993199, 117.865444, 116.244065,   4.119291,   4.489208,   4.443525

# plot realizations of OC and TN
plot(OCTN_sample)

Usually the sample must be large to obtain stable results. Let us run the sampling to obtain 100 realizations. This may take a minute.

# create possible realizations from the joint 
# distribution of OC and TN
MC <- 500
OCTN_sample <- genSample(UMobject = mySpatialMUM, n = MC,
                         samplemethod = "ugs", nmax = 20, asList = FALSE)

Linear Model of Coregionalization found. Good.
[using unconditional Gaussian cosimulation]

We can view the means and standard deviations of the sampled OC and TN. If the number of samples was very large then the mean of the sample of each would equal the mean OC and TN, and the sd would equal their sds.

# compute and plot OC and TN sample statistics
# e.g. mean and standard deviation
OC_sample <- OCTN_sample[[1:MC]]
TN_sample <- OCTN_sample[[(MC+1):(2*MC)]]
OC_sample_mean <- mean(OC_sample)
TN_sample_mean <- mean(TN_sample)
OC_sample_sd <- calc(OC_sample, fun = sd)  
TN_sample_sd <- calc(TN_sample, fun = sd)

par(mfrow= c(1,2))
plot(OC_sample_mean, main = "Mean of OC realizations")
plot(TN_sample_mean, main = "Mean of TN realizations")

We can also view the cross-corelations between two variables.

# R package GGally provides a nice option for plotting correlations
library(GGally)
# octn <- cbind(as.data.frame(OC_sample[[1]]), as.data.frame(TN_sample[[1]]))
octn <- cbind(as.data.frame(t(OC_sample[1,1,])), as.data.frame(t(TN_sample[1,1,])),
              as.data.frame(t(OC_sample[1,2,])), as.data.frame(t(TN_sample[1,2,])),
              as.data.frame(t(OC_sample[10,10,])), as.data.frame(t(TN_sample[10,10,])))
colnames(octn) <- c("OC_loc1", "TN_loc1", "OC_loc2", "TN_loc2", "OC_loc3", "TN_loc3")

ggscatmat(data = octn, alpha=0.15)

Uncertainty propagation through the model that calculates C/N ratio

In order to perform uncertainty propagation analysis using ‘spup’, the model through which uncertainty is propagated needs to be defined as an R function. A pre-defined model that calculates C/N using OC and TN as input is provided.

# source code for the C/N model
source("examples/C_N_model_raster.R")
C_N_model_raster

function (OC, TN) 
{
    OC/TN
}

# C/N model
C_N_model_raster <- function (OC, TN) OC/TN

The propagation of uncertainty occurs when the model is run with the uncertain inputs. Running the model with a sample of realizations of uncertain input variable(s) yields an equally large sample of model outputs that can be further analyzed. To run the C/N ratio model with the OC and TN realizations we use the propagate() function. The propagate() function takes as arguments:

a sample from the uncertain model inputs and any other remaining model inputs and parameters as a list.
the model as a function in R.
the number of Monte Carlo runs. This can be equal or smaller than the number of realizations of the uncertain input variable(s).

In order to run the propagation function the samples of uncertain input variables must be saved in lists and then collated into a list of these lists. We can either coerce the existing OCTN_sample object or get it automatically setting up the ‘asList’ argument of genSample() to TRUE.

# coerce  a raster stack to a list 
# in our example we consider two variables, so we need a list of two lists with realizations
l <- list()
l[[1]] <- map(1:100, function(x){OCTN_sample[[x]]})
l[[2]] <- map(101:200, function(x){OCTN_sample[[x]]})
OCTN_sample <- l
     
# or sample from uncertain input and return it automatically in a list by setting asList argument to TRUE (default)
OCTN_sample <- genSample(UMobject = mySpatialMUM, n = MC, samplemethod = "ugs", nmax = 20, asList = TRUE)

Linear Model of Coregionalization found. Good.
[using unconditional Gaussian cosimulation]

# run uncertainty propagation
CN_sample <- propagate(realizations = OCTN_sample,
                       model = C_N_model_raster, n = MC)

Visualization of results

We can now view the sample of model output realizations (i.e. C/N) and visualize uncertainty by calculating and plotting the sample mean and standard deviation. In our case we need to coerce the output of the propagate()function saved as a list back to a RasterStack.

# coerce C/Ns list to a RasterStack
CN_sample <- stack(CN_sample)
names(CN_sample) <- paste("CN.", c(1:nlayers(CN_sample)), sep = "")

# view the sample of the model output
par(mfrow = c(1,1))
plot(CN_sample[[1:6]])

# compute and plot the slope sample statistics
# e.g. mean and standard deviation
CN_mean <- mean(CN_sample)
CN_sd <- calc(CN_sample, fun = sd) 
par(mfrow = c(1,2))
plot(CN_mean, main = "C/N mean")
plot(CN_sd, main = "C/N sd")

We can also view C/N realizations at specific locations, for example where its mean is highest and lowest:

l_mean <- mean(CN_sample[which.min(OC)])
l_sd <- sd(CN_sample[which.min(OC)])
h_mean <- mean(CN_sample[which.max(OC)])
h_sd <- sd(CN_sample[which.max(OC)])

par(mfrow = c(1,2))
hist(CN_sample[which.min(OC)], main = paste("C/N at lowest OC,", "\n",
     "mean = ", round(l_mean, 2), ", sd = ", round(l_sd, 2), sep = ""), xlab = "C/N")
hist(CN_sample[which.max(OC)], main = paste("C/N at highiest OC,", "\n",
     "mean = ", round(h_mean, 2), ", sd = ", round(h_sd, 2), sep = ""), xlab = "C/N")

We can also look at specific quantiles of the C/N ratio sample. The function quantile_MC_sgdf implemented in ‘spup’ allow us to do it quickly.

# calculate quantiles
CN_sample_df <- as(CN_sample, "SpatialGridDataFrame")
CN_q <- quantile_MC_sgdf(CN_sample_df, probs = c(0.1, 0.25, 0.75, 0.9), na.rm = TRUE)
spplot(CN_q[c(3,4,1,2)], main = list(label = "Quantiles of C/N realizations", cex = 1))

We may also identify all locations where the C/N ratio is smaller than 24, with 90% probability. This information might be used by farmers to identify which plots require action on improving soil quality.

CN_q$good4crops <- factor(ifelse(CN_q$prob10perc > 20, 1, 0), labels = c("<90% certain", ">90% certain"))
spplot(CN_q, "good4crops", col.regions = c("firebrick1", "darkolivegreen2"), main = "Areas with sufficient C/N for cropping")

We can also calculate OC and TN uncertainty contribution to the uncertainty in predicting C/N. We have already calculated the total uncertainty in C/N predictions assuming both OC and TN are uncertain. To identify their contributions we can run the propagation with only one of them being uncertain. For the other input we take its mean as certain information.

# calculate total variance as a measure of total uncertainty
CN_tot_var <- calc(CN_sample, fun = var)

# OC contribution
OC_sample <- OCTN_sample[[1]]
CN_sample_oc <- propagate(realizations = OC_sample, model = C_N_model_raster, TN = TN, n = MC)
CN_sample_oc <- stack(CN_sample_oc)
CN_oc_var <- calc(CN_sample_oc, fun = var)
OC_contribution <- (CN_oc_var/CN_tot_var)*100

# TN contribution
TN_sample <- OCTN_sample[[2]]
CN_sample_tn <- propagate(realizations = TN_sample, model = C_N_model_raster, OC = OC, n = MC)
CN_sample_tn <- stack(CN_sample_tn)
CN_tn_var <- calc(CN_sample_tn, fun = var)
TN_contribution <- (CN_tn_var/CN_tot_var)*100

# plot results
par(mfrow = c(2,2))
plot(CN_mean, main = "C/N mean")
plot(CN_sd, main = "C/N sd")
plot(OC_contribution, main = "OC contribution to total C/N var [%]")
plot(TN_contribution, main = "TN contribution to total C/N var [%]")

Acknowledgements

The dataset was derived from the ISRIC Soil Grid database (www.soilgrids.org) (Hengl et al., 2017).

This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no 607000.

References

HENGL, T., MENDES DE JESUS, J., HEUVELINK, G. B. M., RUIPEREZ GONZALEZ, M., KILIBARDA, M., BLAGOTIĆ, A., SHANGGUAN, W., WRIGHT, M. N., GENG, X., BAUER-MARSCHALLINGER, B., GUEVARA, M. A., VARGAS, R., MACMILLAN, R. A., BATJES, N. H., LEENAARS, J. G. B., RIBEIRO, E., WHEELER, I., MANTEL, S. & KEMPEN, B. 2017. SoilGrids250m: Global gridded soil information based on machine learning. PLOS ONE, 12, e0169748.

WACKERNAGEL, H. 2003. Multivariate Geostatistics: An Introduction with Applications, Springer.

Spatial Uncertainty Propagation Analysis