Making the most of machine learning and freely available datasets: a deforestation case study

Mayfield, Helen (2015). Making the most of machine learning and freely available datasets: a deforestation case study. PhD Thesis, School of Geography, Planning and Environmental Management, The University of Queensland. doi:10.14264/uql.2015.1018

Attached Files
Name Description MIMEType Size Downloads
s371524_phd_submission.pdf Thesis (open access) application/pdf 22.76MB 0

Author Mayfield, Helen
Thesis Title Making the most of machine learning and freely available datasets: a deforestation case study
School, Centre or Institute School of Geography, Planning and Environmental Management
Institution The University of Queensland
DOI 10.14264/uql.2015.1018
Publication date 2015-11-06
Thesis type PhD Thesis
Supervisor Marc Hockings
Carl Smith
Marcus Gallagher
Total pages 367
Total colour pages 141
Total black and white pages 226
Language eng
Subjects 050205 Environmental Management
080110 Simulation and Modelling
080109 Pattern Recognition and Data Mining
Formatted abstract
Deforestation is studied for many reasons, including predicting at-risk areas, predicting deforestation rates and informing the development of conservation policies and programs. Each study has its own objectives to meet (such as setting a deforestation baseline or advising on forest protection policies) and constraints to work within (such as limits on time, data and access to experts). This thesis develops a framework for helping to decide which of several statistical and machine learning methodologies, namely generalised linear models (GLMs), generalised linear mixed models (GLMMs), artificial neural networks (ANNs), Bayesian networks (BNs) and Gaussian processes (GPs), might be suitable for a given deforestation study.

One common constraint on deforestation studies is data availability, as it is often not possible to acquire all the datasets that would ideally be included. High-resolution demographic or socio-economic information can be costly, and obtaining the value of dynamic variables such as road location for the correct point in time may be difficult. By using either freely available or low-cost datasets to generate the variables for this thesis, it was possible to evaluate the usefulness of these data in predicting deforestation and identifying its predisposing factors. Their demonstrated utility shows that they could provide effective substitutes in those cases where the ideal datasets are not available.

The main datasets used were the Conservation International land use change data for southern Mexico and north-eastern Madagascar, which are raster datasets at 30 m resolution showing forest loss for either two (Mexico) or three (Madagascar) time steps. Predictor variables were also generated from the World Database on Protected Areas, the NASA Landsat digital elevation model and several Natural Earth datasets on city and river location. Random samples were generated across the forested areas of the study zones, and models were then trained to predict whether there would be any deforestation within a 500 m x 500 m zone around each sample.
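
The labelling step described above can be illustrated with a minimal sketch. It is not taken from the thesis: the array layout, the function name `label_samples` and the rounding of the zone to a whole number of 30 m cells are all assumptions made for illustration.

```python
import numpy as np

CELL_SIZE_M = 30   # raster resolution stated in the abstract
ZONE_SIZE_M = 500  # side length of the zone around each sample

def label_samples(forest_loss, sample_rows, sample_cols):
    """Return 1 for each sample whose surrounding zone contains any forest loss.

    forest_loss: 2-D boolean array, True where a cell lost forest (assumed layout).
    sample_rows, sample_cols: integer cell indices of the sample points.
    """
    # Assumption: approximate the 500 m zone by 8 cells either side of the sample.
    half = int(round(ZONE_SIZE_M / CELL_SIZE_M / 2))
    n_rows, n_cols = forest_loss.shape
    labels = np.zeros(len(sample_rows), dtype=int)
    for i, (r, c) in enumerate(zip(sample_rows, sample_cols)):
        # Clip the window at the raster edges.
        r0, r1 = max(r - half, 0), min(r + half + 1, n_rows)
        c0, c1 = max(c - half, 0), min(c + half + 1, n_cols)
        labels[i] = int(forest_loss[r0:r1, c0:c1].any())
    return labels

# Toy raster with a single deforested cell at row 50, column 50.
loss = np.zeros((100, 100), dtype=bool)
loss[50, 50] = True
demo_labels = label_samples(loss, [50, 10, 50], [55, 10, 59])
```

In this toy example only the first sample lies within 8 cells of the deforested cell, so it alone receives a positive label.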

Models were implemented using either R or Matlab and their performance was evaluated using sensitivity, specificity, the true skill statistic and the area under the receiver operating characteristic curve. The results of the best performing model designs for each methodology were mapped to examine whether the predicted high-risk areas were close to where actual deforestation occurred. Separate maps were produced for the predicted results at a 50% probability cut-off, as well as across all probabilities to produce a map of predicted high-risk areas. Additional maps were created showing the predictions after correcting for the rate of expected deforestation.
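
The four evaluation measures named above can be computed as in the following sketch. The thesis used R or Matlab; this Python version, its function names and the toy data are illustrative assumptions only. The true skill statistic is sensitivity + specificity - 1, and the area under the curve is computed here through its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half).

```python
import numpy as np

def confusion_metrics(y_true, prob, cutoff=0.5):
    """Sensitivity, specificity and true skill statistic at a probability cut-off."""
    y_true = np.asarray(y_true, dtype=bool)
    pred = np.asarray(prob, dtype=float) >= cutoff  # assumption: >= counts as positive
    tp = np.sum(pred & y_true)
    fn = np.sum(~pred & y_true)
    tn = np.sum(~pred & ~y_true)
    fp = np.sum(pred & ~y_true)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    tss = sensitivity + specificity - 1
    return float(sensitivity), float(specificity), float(tss)

def auc(y_true, prob):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    y_true = np.asarray(y_true, dtype=bool)
    prob = np.asarray(prob, dtype=float)
    pos, neg = prob[y_true], prob[~y_true]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

# Toy data: two deforested samples and three intact ones (made-up probabilities).
y = [1, 1, 0, 0, 0]
p = [0.9, 0.4, 0.6, 0.2, 0.1]
sens, spec, tss = confusion_metrics(y, p)
area = auc(y, p)
```

The pairwise AUC is quadratic in the number of samples, which is fine for a small illustration but would be replaced by a rank-based calculation for large sample sets.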

When applied to complex problems such as deforestation analysis, machine learning (ML) techniques have several theoretical and practical advantages over classical statistics, primarily the ability to take non-linear relationships into account. The ML models were therefore expected to outperform the simpler statistical models; however, the results showed that this was not always the case. Although the GLMMs outperformed the ANNs in two of the three study zones, ML techniques did offer improvements: BNs scored higher on the true skill statistic than GLMMs when trained on standard rather than stratified data, and GPs improved performance when fewer variables were available. Most models showed promising results for predicting the location of high-risk areas, although in some cases this depended on using stratified data to boost the number of positive deforestation samples that the models could learn from.
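
One simple way to boost the share of positive samples is to draw the same number of samples from each class. The abstract does not describe the stratification scheme actually used, so the sketch below is only an assumption about what such a step could look like; `stratified_indices` and its parameters are hypothetical names.

```python
import numpy as np

def stratified_indices(labels, n_per_class, seed=0):
    """Pick up to n_per_class sample indices from each class (0 and 1) without replacement."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    chosen = []
    for cls in (0, 1):
        idx = np.flatnonzero(labels == cls)
        take = min(n_per_class, len(idx))  # never ask for more than the class holds
        chosen.append(rng.choice(idx, size=take, replace=False))
    return np.concatenate(chosen)

# Toy labels: deforestation is rare (5 positives among 100 samples).
toy_labels = np.array([0] * 95 + [1] * 5)
picked = stratified_indices(toy_labels, n_per_class=5)
```

In the toy example the stratified subset contains five positive and five negative samples, compared with a 5% positive rate in the full set.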

Outside of predictive performance, other methodology attributes, such as interpretability and ease of implementation, can also dictate a model’s suitability in meeting the objectives and constraints of a study. Taking this into account, several recommendations are presented based on the findings in this thesis. Firstly, freely available datasets should be considered a valuable source of data for deforestation studies. In terms of the application of ML methodologies to deforestation studies, when generalised linear models are used, performance may be improved by modelling the spatial dimension as a random effect (in this case the X and Y coordinates were used). When examining the drivers of deforestation, if p-values are not required, Bayesian networks offer better interpretability than the statistical models with no decrease in predictive performance. When predicting location, the Gaussian processes implemented in this thesis outperformed the artificial neural networks and were easier to design, although each model took longer to run.

The recommendations derived from this research go some way towards providing guidance to environmental management practitioners on which ML methodologies can improve on the classical statistical techniques that are frequently employed. This contributes to closing the gap between the disciplines and opening up new tools and datasets that can help with the ongoing challenges facing those attempting to curb deforestation and reverse the trends of environmental degradation associated with it.
Keyword Deforestation
Machine Learning Methods
Bayesian network
Artificial neural network (ANN)
Gaussian process
Mixed effects models
Environmental Management
Geo-referenced datasets

Document type: Thesis
Collections: UQ Theses (RHD) - Official
UQ Theses (RHD) - Open Access
Created: Mon, 02 Nov 2015, 16:51:00 EST by Helen Mayfield on behalf of Scholarly Communication and Digitisation Service