Deforestation is studied for many reasons, including predicting at-risk areas, estimating deforestation rates and informing the development of conservation policies and programs. Each study has its own set of objectives to meet (such as setting a deforestation baseline or advising on forest protection policies) and constraints to work within (such as time and data limitations and access to experts). This thesis develops a framework for helping to decide which of five statistical and machine learning methodologies might be suitable for a given deforestation study: generalised linear models (GLMs), generalised linear mixed models (GLMMs), artificial neural networks (ANNs), Bayesian networks (BNs) and Gaussian processes (GPs).
One common constraint on deforestation studies is data availability, as it is often not possible to acquire all the datasets that would ideally be included. High-resolution demographic or socio-economic information can be costly, and obtaining the values of dynamic variables, such as road locations at the correct point in time, may be difficult. By using freely available or low-cost datasets to generate the variables for this thesis, it was possible to evaluate the usefulness of these data in predicting deforestation and identifying its predisposing factors. Their demonstrated utility shows that they can provide effective substitutes in cases where the ideal datasets are not available.
The main datasets used were the Conservation International land use change data for southern Mexico and north-eastern Madagascar: raster datasets at 30 m resolution showing forest loss over either two (Mexico) or three (Madagascar) time steps. Predictor variables were also generated from the World Database on Protected Areas, the NASA Landsat digital elevation model and several Natural Earth datasets on city and river locations. Random samples were generated across the forested areas of the study zones, and models were then trained to predict whether any deforestation would occur within a 500 m × 500 m zone around each sample.
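The labelling step described above can be sketched as follows. This is an illustrative reconstruction, not the thesis code: it assumes a boolean forest-loss raster at 30 m resolution, so a 500 m zone corresponds to roughly a 17 × 17 pixel window, and the helper name `label_sample` is hypothetical.

```python
# Illustrative sketch (not the thesis code): label a random sample point
# as "deforested" if any forest loss occurred within a ~500 m zone around
# it. At 30 m resolution, 500 m spans roughly 17 pixels (assumed rounding).
WINDOW = 17
HALF = WINDOW // 2  # 8 pixels to each side of the sample point

def label_sample(loss_raster, row, col):
    """Return 1 if any pixel in the window around (row, col) shows loss."""
    n_rows, n_cols = len(loss_raster), len(loss_raster[0])
    for r in range(max(0, row - HALF), min(n_rows, row + HALF + 1)):
        for c in range(max(0, col - HALF), min(n_cols, col + HALF + 1)):
            if loss_raster[r][c]:
                return 1
    return 0
```

A model would then be trained on predictor variables for each sample against these binary labels.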
Models were implemented in either R or Matlab, and their performance was evaluated using sensitivity, specificity, the true skill statistic and the area under the receiver operating characteristic (ROC) curve. The results of the best-performing model designs for each methodology were mapped to examine whether the predicted high-risk areas were close to where actual deforestation occurred. Separate maps were produced for the predictions at a 50% probability cut-off, as well as across all probabilities to produce a map of predicted high-risk areas. Additional maps were created showing the predictions after correcting for the expected deforestation rate.
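As a worked illustration of the evaluation metrics named above (a minimal pure-Python sketch; the thesis models were evaluated in R or Matlab): sensitivity is the true-positive rate, specificity is the true-negative rate, and the true skill statistic (TSS) is sensitivity + specificity − 1, ranging from −1 to +1.

```python
def evaluate(y_true, probs, cutoff=0.5):
    """Compute sensitivity, specificity and TSS at a probability cut-off.

    Illustrative only: binary truth labels vs. predicted probabilities,
    thresholded at the 50% cut-off used for the maps described above.
    """
    y_pred = [1 if p >= cutoff else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)  # true-positive rate
    specificity = tn / (tn + fp)  # true-negative rate
    tss = sensitivity + specificity - 1
    return sensitivity, specificity, tss
```

A model that predicts every zone as deforested scores a TSS of 0, so TSS rewards skill beyond chance; AUC, by contrast, is computed across all cut-offs.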
When applied to complex problems such as deforestation analysis, machine learning (ML) techniques have several theoretical and practical advantages over classical statistics, primarily the ability to capture non-linear relationships. The ML models were therefore expected to outperform the simpler statistical versions; however, the results showed that this was not always the case. While the GLMMs outperformed the ANNs in two of the three study zones, the ML techniques did offer improvements: BNs scored higher on the true skill statistic than GLMMs when trained on standard rather than stratified data, and GPs improved performance when fewer variables were available. Most models showed promising results for predicting the location of high-risk areas, although in some cases this depended on using stratified data to boost the number of positive deforestation samples the models could learn from.
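The stratification mentioned above can be illustrated with a simple oversampling sketch. This is an assumption about the mechanics, since the exact stratification procedure is not specified here: positive (deforested) samples are duplicated until they make up a chosen share of the training set, giving the models more positive cases to learn from.

```python
import random

def oversample_positives(samples, labels, target_ratio=0.5, seed=0):
    """Duplicate positive samples until they form target_ratio of the set.

    Illustrative sketch of building stratified training data; not the
    thesis code. Returns a shuffled list of (sample, label) pairs.
    """
    rng = random.Random(seed)
    positives = [s for s, l in zip(samples, labels) if l == 1]
    negatives = [s for s, l in zip(samples, labels) if l == 0]
    # Number of positives needed so that pos / (pos + neg) == target_ratio
    n_pos_needed = int(target_ratio * len(negatives) / (1 - target_ratio))
    extra = [rng.choice(positives)
             for _ in range(max(0, n_pos_needed - len(positives)))]
    balanced = [(s, 1) for s in positives + extra] + [(s, 0) for s in negatives]
    rng.shuffle(balanced)
    return balanced
```

Evaluation would still use the original, unstratified class balance, since oversampling only reshapes the training data.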
Beyond predictive performance, other attributes of a methodology, such as interpretability and ease of implementation, can also dictate its suitability for meeting the objectives and constraints of a study. Taking this into account, several recommendations are presented based on the findings in this thesis. Firstly, freely available datasets should be considered a valuable source of data for deforestation studies. In terms of applying ML methodologies to deforestation studies, when generalised linear models are used, performance may be improved by modelling the spatial dimension as a random effect (in this case the X and Y coordinates were used). When examining the drivers of deforestation, if p-values are not required, Bayesian networks offer better interpretability than the statistical models with no decrease in predictive performance. When predicting location, the Gaussian processes implemented in this thesis outperformed the artificial neural networks and were easier to design, although each GP model took longer to run.
The recommendations derived from this research go some way towards providing guidance to environmental management practitioners on which ML methodologies can improve on the classical statistical techniques that are frequently employed. This contributes to closing the gap between the disciplines and opening up new tools and datasets that can help with the ongoing challenges facing those attempting to curb deforestation and reverse the trends of environmental degradation associated with it.