When is variable importance estimation in species distribution modelling affected by spatial correlation?

Associate Professor from the University of Twente, Thomas Groen, and former Master’s student, Nivedita Varma Harisena, received the FAIR Data Fund in Spring 2021 to prepare their biodiversity research data and software code for publication in 4TU.ResearchData. Here, we learn about their research on species distribution modelling and the importance of sharing their educational dataset to benefit the wider ecological research community. 

Thomas Groen and Nivedita Varma Harisena are spatial ecologists working on species distribution models. They use spatial interpolation techniques and remote sensing information to map the locations that are suitable for various animal and plant species.

Species distribution modelling

Each species occupies a ‘niche’ which is the match of a species to a set of particular environmental conditions. Species distribution models use computer algorithms to predict the distribution of a species across geographic locations using environmental data. 

Thomas explains that species distribution modelling typically requires two sets of data. “The first dataset contains observations about species location, i.e. where species are present and where they are absent. The second contains information about environmental conditions [variables] within those locations, such as temperature, rainfall, vegetation cover and incoming radiation, for example.”

He continues, “Using statistical models, it’s possible to describe the relationship between the two datasets, allowing researchers to predict under which conditions they are likely or less likely to find certain species. The idea is to use these models to map species distribution across large geographic regions and estimate the importance of certain environmental variables for species survival.”
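As a rough illustration of the kind of statistical model Thomas describes, here is a minimal sketch in Python. It is not the authors' code. The data are randomly generated, the variable names (temperature, rainfall) and the coefficients are assumptions made for this example, and logistic regression is only one of several methods used in species distribution modelling.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit a logistic regression by batch gradient descent.

    Returns [intercept, coefficient_1, coefficient_2, ...].
    """
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

def predict(weights, x):
    """Predicted probability of presence for one site."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

rng = random.Random(0)
# Hypothetical, standardised environmental variables at 400 sampled sites:
# each tuple is (temperature, rainfall).
sites = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(400)]
# Assumed "truth" for this example: the species responds to temperature only.
presence = [1 if rng.random() < sigmoid(2.0 * t) else 0 for t, _ in sites]

weights = fit_logistic(sites, presence)
print("intercept, temperature, rainfall:", [round(w, 2) for w in weights])
print("predicted suitability at a warm site:", round(predict(weights, (1.0, 0.0)), 2))
```

The fitted coefficients describe the relationship between the two datasets, and `predict` can then be applied to every cell of a map to estimate where the species is more or less likely to occur.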

Species distribution models typically reuse existing data. “Species location data is typically collected by field work, a process by which scientists observe and collect data within ecosystems,” says Thomas. “Another method involves citizen science, whereby nature enthusiasts record observations about species sighted in visited locations and share the data in public databases.”

Researchers who model species distribution can also access public repositories, such as Bioclim, the Normalised Difference Vegetation Index (NDVI) Climate Data Record (CDR) or the Global Biodiversity Information Facility (GBIF), to download valuable data on environmental variables or presence of species. 

Real-world impact

“Species distribution models are particularly valuable for conservation since they can be used to identify areas of land suitable for endangered species, and these areas can be subsequently protected. Such models can also inform about the effects of climate change or land use change,” says Thomas. 

“Agronomists also use these models to identify suitable areas for crops, or where control of pests is most needed. In health sciences, the techniques are used to model disease spread by vectors, such as mosquitos,” he adds. 

Thomas provides more information about ‘Species distribution modelling’ and its value.

The problem of spatial autocorrelation 

Nivedita states that ‘spatial autocorrelation’ is a common problem affecting the accuracy of species distribution models. It arises when geographic locations that are close together have similar values for environmental variables.

“Researchers often go to the same geographic locations to collect data. And, sampling at the same locations can lead to a bias in the model output,” says Nivedita.

She provides an example. “If the temperature is consistently high in the locations sampled, one might assume that high temperature is important for species residing within those locations. However, it’s likely that the close similarity in temperatures recorded in those locations inflates the importance of temperature as an explanatory variable.”
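One widely used way to put a number on spatial autocorrelation is Moran's I, which is close to +1 when neighbouring locations have similar values, close to 0 when values are spatially random, and negative when neighbours tend to differ. The article does not say which measure the authors used, so the sketch below is only an illustration. It uses made-up grids and treats cells that share an edge as neighbours, which is an assumption of this example.

```python
import random

def morans_i(grid):
    """Moran's I for a rectangular grid; cells sharing an edge are neighbours."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    mean = sum(values) / len(values)
    dev = [[v - mean for v in row] for row in grid]
    cross, n_links = 0.0, 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    cross += dev[r][c] * dev[rr][cc]
                    n_links += 1
    total_sq = sum(d * d for row in dev for d in row)
    return (len(values) / n_links) * (cross / total_sq)

size = 10
rng = random.Random(1)
# A smooth "temperature" gradient: nearby cells have similar values.
gradient = [[r + c for c in range(size)] for r in range(size)]
# Spatially random values: no relationship between neighbours.
noise = [[rng.gauss(0, 1) for _ in range(size)] for _ in range(size)]
# A checkerboard: every cell differs from all of its neighbours.
checkerboard = [[(r + c) % 2 for c in range(size)] for r in range(size)]

for name, grid in (("gradient", gradient), ("noise", noise), ("checkerboard", checkerboard)):
    print(name, round(morans_i(grid), 2))
```

In a landscape like the smooth gradient, samples taken close together carry largely the same information, which is the situation Nivedita describes.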

Nivedita’s educational dataset 

To help researchers understand the effects of spatial autocorrelation, Nivedita created an educational dataset containing simulated landscapes of environmental variables and virtual species that respond to these variables in predictable ways. 

Because the response of the virtual species to each simulated environmental variable is controlled, the true importance of each variable in explaining species presence is known in advance. Estimating the importance of the same variables with established methods then reveals the (mis)match between “true importance” and “estimated importance” at different levels of autocorrelation.
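The idea of comparing a known “true importance” with an “estimated importance” can be sketched in a few lines of Python. This is a toy version under stated assumptions, not the published R script. The landscapes are smoothed random noise, the weights of 0.8 and 0.2 are invented, and the importance estimate is a simple share of absolute correlation rather than one of the established methods examined in the paper.

```python
import math
import random

def make_landscape(size, radius, rng):
    """Simulate one environmental variable on a size x size grid.

    White noise is summed over a (2*radius+1)^2 window that wraps around the
    edges, so a larger radius gives stronger spatial autocorrelation. The
    result is standardised and returned flattened, row by row.
    """
    noise = [[rng.gauss(0, 1) for _ in range(size)] for _ in range(size)]
    span = range(-radius, radius + 1)
    cells = [sum(noise[(r + dr) % size][(c + dc) % size] for dr in span for dc in span)
             for r in range(size) for c in range(size)]
    mean = sum(cells) / len(cells)
    sd = math.sqrt(sum((v - mean) ** 2 for v in cells) / len(cells))
    return [(v - mean) / sd for v in cells]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def lag1_correlation(cells, size):
    """Correlation between each cell and its right-hand neighbour."""
    left = [cells[r * size + c] for r in range(size) for c in range(size - 1)]
    right = [cells[r * size + c + 1] for r in range(size) for c in range(size - 1)]
    return pearson(left, right)

def estimated_importance(variables, presence):
    """Share of absolute correlation with presence (a simple stand-in measure)."""
    strength = [abs(pearson(v, presence)) for v in variables]
    return [s / sum(strength) for s in strength]

TRUE_IMPORTANCE = (0.8, 0.2)  # assumed weights of the two simulated variables
SIZE = 30
results = {}
for radius in (0, 3):  # 0 = no smoothing, 3 = strongly autocorrelated
    rng = random.Random(42)
    a = make_landscape(SIZE, radius, rng)
    b = make_landscape(SIZE, radius, rng)
    # Virtual species: present wherever the weighted suitability is above average.
    presence = [1 if TRUE_IMPORTANCE[0] * x + TRUE_IMPORTANCE[1] * y > 0 else 0
                for x, y in zip(a, b)]
    results[radius] = (lag1_correlation(a, SIZE), estimated_importance([a, b], presence))
    print("radius", radius,
          "| neighbour correlation", round(results[radius][0], 2),
          "| estimated importance", [round(e, 2) for e in results[radius][1]])
```

Running the loop with different radii, seeds or weights gives a feel for how far the estimate can drift from the known weights as neighbouring cells become more alike.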

Nivedita used the FAIR Data Fund from 4TU.ResearchData to refine and publish the data and code underlying her research paper, ‘When is variable importance estimation in species distribution modelling affected by spatial correlation?’

Aside from simulation data, her published dataset comprises a README file and an R script so that future users of species distribution models can change model parameters, visualise the results and reuse the models within their own context. 

“My dataset allows researchers to explore the simulations and learn how to correct for spatial autocorrelation in their own datasets,” says Nivedita. “Researchers can change parameters within my model, such as landscape size and resolution; sampling density; species response to environmental variables; and spatial autocorrelation levels, to replicate the conditions of their own dataset.”

“What’s more, they can learn how spatial autocorrelation could affect their dataset and can choose to adapt their methods in order to avoid bias, such as increasing sampling effort or sample size, or changing sampling locations,” she adds. 

The FAIR Data Fund

Now working as a PhD researcher at ETH Zurich, Nivedita explained that the FAIR Data Fund helped her set aside a number of hours to retrospectively prepare her Master’s dataset for open access publication. 

“I had to format, clean and annotate the dataset myself. I knew that if I handed that task to someone else they would not have been able to understand my data. I created understandable documentation, functions and variable names; removed any errors from the dataset; and structured the data to make it easier for others to find and access,” says Nivedita. 

With support from the University of Twente’s Open Science Officer, Markus Konkol, Nivedita also organised for her computational research to be independently executed and verified by CODECHECK. A certificate is available confirming that Nivedita’s computations could be independently executed by the reviewers. 

Nivedita concludes with some positive comments about data publication. 

“Publishing my data was easy and straightforward, and I received a lot of support from my supervisor, Thomas. Our dataset has been downloaded more than 100 times since it was published! It’s exciting to observe that data sharing and reuse is becoming more common in ecology. Collecting samples during fieldwork costs a lot of time, money and effort, so it’s important that researchers share their work and get properly credited for their contribution.”

Authored by Connie Clare (4TU.ResearchData)
