Forecasting biodiversity: A matter of data availability

In a previous post, we briefly discussed our internship experience with GEO BON, in which we developed a forecasting model of local contributions to beta diversity (LCBD) at the regional scale, using communities of warblers species in Quebec and Colombia as a case study. The first part of our endeavor was getting access to data. As typical grad students in quantitative ecology, we used data mostly openly available on the internet. As mentioned in the previous post, for species occurrence, we used data from the eBird database, while environmental and land-use data were obtained from the CHELSA database and the Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC), respectively. While these datasets are openly available, the steps required to actually use them and the digital space they occupy could represent a challenge for someone unfamiliar with such a task.

Another limiting factor regarding data availability was that the land-use and environmental data had to be obtained from the same climatic model and RCP scenario, which from the get-go narrowed the possible options even more. For eBird, we also had to ask for authorization to download the raw data and their size made it hard to work with in the beginning (58 GB for the compressed file and 120 GB file once decompressed). Once acquired, data transformation was still needed, but for the most part that went well and all data manipulation steps are stored in a replicable script of code. In comparison, forecasted land-use data were hard to find in good enough resolution and in the wanted timeframe. Ultimately, the environmental data were easier to obtain; working with Julia, we were able to use the SimpleSDMLayers.jl package to retrieve and work with CHELSA data. If someone wanted to replicate our work without Julia, it would be possible to get the data directly from CHELSA database, but they would need to coarsen them to fit the land-use data resolution. In our case, we used the coarsen function in SimpleSDMLayers.jl; however, without Julia, additional preliminary steps, in other software such as R and QGIS, would be required before running our analysis. Those were the main problems we encountered with the data we worked with. While they were satisfactory to be used in a proof of concept, those data were also chosen because of time constraints and a simple lack of data availability. If in the future, we want to use this method in a more polished fashion, we would really benefit from better quality data.

Here we discuss the specific problems we encountered for the different types of data and why it restricted the datasets we could use. The biggest challenge was and will probably continue to be the resolution of the available data, more specifically the land-use data. When working on large spatial scales like we wanted to, most of the available data have coarse resolution and/or are sometimes missing areas at the global scale. The environmental data we used had a really fine resolution of 30 arcseconds, which makes the size of the pixels more or less 1 km on average. In comparison, the land cover data we used had a resolution of 15 arcminutes, i.e., 30 times bigger. We did manage to find one land-use dataset with a finer resolution of 3 arcminutes, but because of time constraints and our research objectives, we kept the former. Still, even with 3 arcminute resolution data, there could be some problems with having pixels of around 30 km² as an input in this kind of model. Indeed, there is information that is not used to its fullest when we coarsen the environment data, and it can be hard to find really unique areas in terms of LCBD if the areas covered by our communities are 30 km².

Relative LCBD values in Quebec in 2020. We can see here that the entire Anticosti Island is composed of around 30 pixels.

With that being said, these problems are still only speculative, and we would need to actually try and repeat the process with different spatial resolutions to see how it would impact the result. On the other hand, if we work at a finer spatial resolution, we would most likely encounter a lot of areas with no observation and we are not fully sure how the model would handle that. Again, those are concerns for future work. Another possible improvement that could be made to the land-use dataset would be to have more information on specific ecosystem types such as wetlands, deserts, deciduous forests, etc. The dataset used here classifies landscapes mostly in an anthropocentric way, with categories such as forested land, pasture, crops, and urban area. The main problem is finding such detailed datasets at a global scale (or at least at a national scale) with a good resolution and the least missing data possible. Then again, maybe our model doesn’t need data this precise, but additional tests and data are needed to investigate this issue further. Looking at the environmental data, it would also be convenient to be able to use forecasted data for a larger set of years. In our case, we only worked with average predictions for the years 2041-2060 and 2061-2080. Having said that, the spatial resolution, the fact that the data are global, and their ease of use make the environmental dataset we used hard to compete with. Finally, if we wanted to replicate this study on another taxonomic group, acquiring the data could be a hard endeavor. As said earlier, warbler data from eBird were abundant due to the popularity of the website in recent years and the popularity of warblers. For other species, having a comparable amount of data could be a challenge.

In the end, what we did with our data was sufficient to accomplish what we wanted to do, but it is still interesting and exciting to think of how much better this approach could be. All we need is more data, more testing, enough processing power, and enough love of the craft to go through all those error messages.

This piece of writing is the second and last blog post on our internship experience with GEO BON. With this series of articles, we hope that we can spark your enthusiasm in pursuing such a science-driven internship in the not too distant future!

Samuel Provencher-Tardif , Francis Banville, and Gabriel Dansereau

Written on November 1st, 2021