Questions & Answers

The BioDT project aims to build Digital Twin prototypes for biodiversity, also exploiting the potential of the LUMI Supercomputer. This work requires efforts from a variety of experts who aim to push the current boundaries of predictive understanding of biodiversity dynamics by providing advanced modelling, simulation and prediction capabilities. By exploiting existing technologies and data from relevant research infrastructures in new ways, BioDT will be able to accurately and quantitatively model interactions between species and their environment.

Below you can find some of the most frequently asked questions concerning the BioDT project, its infrastructures and its use cases.

If you have a question that has not yet been collected in this Q&A section, feel free to reach out to us through our contact form.

Have a closer look at the questions and learn more about BioDT.


The project

Why is the existing modelling approach insufficient? Is it due to lack of data or insufficient knowledge of the species’ dynamics?

Issues with data availability and quality, together with insufficient process-level knowledge, are both important sources of uncertainty when it comes to biodiversity modelling. Particularly at large spatial scales, we often lack a detailed mechanistic understanding of species dynamics and face multiple challenges presented by heterogeneous data. While some biodiversity data sets are based on systematic surveys of presence-absence data, others are based on opportunistically collected presence-only data. Inherent biases of biodiversity data also include taxonomic and geographic bias: more data are available on, for example, birds and mammals than on fungi and insects, and more from Europe than from other areas, including Africa.

In the first BioDT general presentation you said that you want to ‘somehow combine mechanical and statistical modelling approaches’. What do you mean by ‘somehow’?

Developing hybrid modelling approaches, i.e. approaches that combine mechanistic and statistical approaches, is one of the research areas pursued in BioDT. Before going into the "how", we can also briefly discuss the "why". Mechanistic modelling methods can perform well with access to well-defined, controlled data sets and are efficient for clarifying causal links. However, they are also limited in their ability to model the full scale of complexity in natural ecosystems. On the other hand, statistical models can be highly efficient in terms of identifying patterns of interest in the data, but they lack direct causal descriptions. In BioDT, we propose that hybrid models, which combine the generality and relevance of phenomenological models with the causality of mechanistic models, can outperform the predictive capacity of both phenomenological and mechanistic models. Mechanistic elements could be brought into statistical models by, for example, informing model structures and constraining model parameters with experimental data and ecological knowledge. Conversely, statistical elements could be integrated into mechanistic models through, for example, including dependencies of transition rates on covariates.
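
As a purely illustrative sketch (in Python, with all data and parameter values invented for the example; the actual BioDT models will be far richer), one simple hybrid pattern is to let a mechanistic model constrain a statistical fit, here by penalising a growth-rate estimate towards a mechanistically motivated prior:

```python
import numpy as np

# Minimal hybrid-modelling sketch: a mechanistic logistic growth model
# supplies a prior that constrains the statistical estimation of the
# growth rate r from noisy abundance observations. All values are invented.

rng = np.random.default_rng(42)

def logistic(t, r, K, n0):
    """Mechanistic logistic growth: n(t) = K / (1 + (K/n0 - 1) * exp(-r t))."""
    return K / (1 + (K / n0 - 1) * np.exp(-r * t))

# Synthetic "observations": logistic dynamics plus observation noise.
t = np.linspace(0, 10, 50)
true_r, K, n0 = 0.8, 100.0, 5.0
obs = logistic(t, true_r, K, n0) + rng.normal(0, 5, t.size)

# Statistical step: least squares over a grid of r values, penalised towards
# a mechanistic prior (r_prior could come from experiments or ecological knowledge).
r_prior, prior_weight = 0.7, 50.0
r_grid = np.linspace(0.1, 2.0, 200)
loss = [np.sum((obs - logistic(t, r, K, n0)) ** 2)
        + prior_weight * (r - r_prior) ** 2 for r in r_grid]
r_hat = r_grid[int(np.argmin(loss))]
print(f"estimated growth rate r = {r_hat:.2f} (prior {r_prior}, truth {true_r})")
```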

A key difference between DTs and ‘conventional models’ is the two-way communication between users and the DT/model. How is this integrated/addressed in BioDT?

One way in which this will be addressed involves scanning model outputs using AI / machine learning methods to identify ways in which the original model parameters could be further improved. Generally, the exact ways in which this type of two-way interaction will be achieved will depend on the specific BioDT Use Cases being addressed, which in turn define the types of digital twins to be developed.
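
To make the idea concrete, here is a hedged sketch (Python; the toy "model" below is invented and stands in for a real biodiversity model) of such a loop, in which model outputs are scored against observations and improved parameters are fed back in:

```python
import numpy as np

# Toy two-way loop: score model output against observations and refine the
# parameters by simple random search. A real twin would use more capable
# machine-learning or data-assimilation methods; this only shows the loop.

rng = np.random.default_rng(0)

def run_model(params, drivers):
    """Stand-in model: species response as a linear function of drivers."""
    return drivers @ params

def score(params, drivers, observations):
    return np.mean((run_model(params, drivers) - observations) ** 2)

drivers = rng.normal(size=(100, 3))            # e.g. climate, land use, soil
true_params = np.array([1.5, -0.7, 0.3])
observations = run_model(true_params, drivers) + rng.normal(0, 0.1, 100)

params = np.zeros(3)
for _ in range(500):
    candidate = params + rng.normal(0, 0.05, 3)   # propose a parameter tweak
    if score(candidate, drivers, observations) < score(params, drivers, observations):
        params = candidate                        # keep improvements
print("refined parameters:", np.round(params, 2))
```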

Much of the modelling relies on external information: climate, climate change, land use (and land-use change), soil physical properties, etc. How will this be integrated?

The exact requirements will depend on the individual BioDT Use Cases. Establishing trackable and transparent ways in which to combine biological and physical data sets from different sources will require careful planning and collaboration between the four biodiversity research infrastructures involved in the project. Technical platform development activities in BioDT will also be pursued in a way that seeks to maximise the compatibility of BioDT with other European DT initiatives, including Destination Earth.

What will be the products or services coming out of BioDT to be used in policy consulting/ecosystem management?

One specific possibility with reference to the BioDT Use Cases involves interactive maps that could be used for biomonitoring.

What is the difference between a DT and species distribution modelling?

The digital twin will operate on “real-time” data streams to build a digital counterpart to a physical thing or system. The digital twin will learn to “behave” in a manner similar to its physical counterpart, so that experiments with the twin give results similar to those of experiments with the real thing or system.
A DT is a digital version of a real-time process; it is not static like an SDM.
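
A minimal sketch of that contrast (Python; the data stream and update rule are invented purely for illustration): an SDM is fitted once, while a twin keeps assimilating observations as they arrive:

```python
import numpy as np

# Invented illustration: the twin's state is nudged towards each incoming
# observation, so its estimate tracks the real system instead of staying
# fixed after a one-off model fit.

rng = np.random.default_rng(1)

state = 0.5            # twin's current occupancy estimate
learning_rate = 0.2

# Simulated "real-time" stream of presence/absence observations.
for observation in rng.binomial(1, 0.8, size=20):
    state += learning_rate * (observation - state)   # assimilate the update

print(f"twin occupancy estimate after the stream: {state:.2f}")
```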

Climate change digital twins have been created ‘in advance of policy’, hence providing a truly objective scientific view on climate change. Now that biodiversity is becoming a hot topic, how do you ensure that the creation of such DTs is not pushed into policy too much?

Due to the complexity of the work ahead, BioDT focuses on generating digital twin prototypes. Indeed, when developing digital twins for predictive biodiversity modelling, it will be critical to build an understanding of the uncertainties, limitations and biases of the different methodologies being developed. Because of this, a dedicated task in BioDT will be to systematically test the predictive performance of the digital twins and to determine whether corresponding patterns exist in their real-life counterparts. Retaining full transparency with regard to model uncertainty will be of importance with respect to all BioDT research outputs and dissemination activities, including interactions with policymakers.
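
One simple form such systematic testing could take (a Python sketch with invented numbers, not the project's actual validation protocol) is hold-out evaluation, where the twin predicts observations it has never seen and the resulting error is reported alongside the outputs:

```python
import numpy as np

# Invented hold-out test: keep the most recent observations back, predict
# them, and publish the resulting error as an explicit uncertainty measure.

rng = np.random.default_rng(7)

observations = rng.normal(10, 2, size=200)            # stand-in monitoring data
train, test = observations[:150], observations[150:]  # temporal hold-out split

prediction = train.mean()                             # deliberately simple "twin"
rmse = np.sqrt(np.mean((test - prediction) ** 2))
print(f"hold-out RMSE: {rmse:.2f} (to be reported with every prediction)")
```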

Infrastructures

The code of domain scientists (working on their own laptops) is often not suitable for deployment on HPC or cloud environments. How do you aim to convert such legacy code? Or are you only focussing on code that is already compatible with HPC or cloud platforms?

We do aim to convert such legacy code and have quite a lot of working time allocated for it. Indeed, there is much legacy code which is crucial for our work.

You may also want to connect to EuropaBON, which deals with gathering all available biodiversity data across habitats. eLTER has linkages to other Research Infrastructures and will also tackle harmonisation.

This is a valuable suggestion, as connecting with the wider European biodiversity research landscape is an objective to be approached throughout the lifetime of the project. Further to EuropaBON, initiatives of interest could include ENVRI, EOSC-Life and BiCIKL.

It was mentioned during the DITTO interoperability side event that some Digital Twins (e.g. DestinE) might produce up to 1 PB/day of data. Given that you aim to integrate with other Digital Twins, do you plan to run some parts of BioDT in data centres other than LUMI, or in the cloud?

BioDT will build on services operated by the biodiversity Research Infrastructures in Europe (GBIF, eLTER, DiSSCo, LifeWatch, etc.). We imagine that some of the twins we build can continue to run in LifeWatch.

LUMI HPC is built on AMD architecture. Do you anticipate any challenges with porting MKL or CUDA accelerated libraries to OpenCL, ROCm, or SYCL?

Several biodiversity modelling tools come with limited support for high-performance computing (HPC). To help address this challenge, one of the BioDT tasks is dedicated to model upscaling and ensuring the compatibility of selected modelling tools with LUMI. This work will be carried out by project partners with experience in code porting and HPC, in collaboration with modellers involved in the project.
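
One portability pattern that could help here (a Python sketch under the assumption that a CuPy build matching the system's CUDA or ROCm stack is available; not a statement of BioDT's chosen approach) is to write array code against a common interface and select the backend at import time:

```python
# Backend-agnostic array code: the same function runs on NumPy (CPU) or on
# CuPy, which ships builds for both CUDA and ROCm/HIP (as on LUMI's AMD GPUs).

try:
    import cupy as xp          # GPU backend if a matching CuPy build exists
    backend = "gpu"
except ImportError:
    import numpy as xp         # portable CPU fallback
    backend = "cpu"

def species_richness(presence_matrix):
    """Count the species present at each site; identical on every backend."""
    return (presence_matrix > 0).sum(axis=1)

sites = xp.asarray([[3, 0, 1],
                    [0, 0, 0],
                    [2, 5, 7]])
print(backend, species_richness(sites))
```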

During your previous event you mentioned Agent-Based modelling. Do you plan to work/integrate with current open source software like Repast HPC, or is it still to be determined?

We expect these discussions to take place during the modelling workshop to be held in Finland in November 2022, and we hope to share updates on the key outcomes afterwards.

I would like to know which DT frameworks or services scale better when only limited data are available.

The biodiversity twin prototype will be built starting from small scales and readily available data, and will then move towards more complex and less data-rich areas.

To what extent can data generated from modelling be fed into subsequent analyses?

This is hard to predict before the first versions of BioDT are operational, but we expect the outputs to be useful for downstream analyses on an open and FAIR basis.

If data covering a large enough spatial area and with sufficient measurement density are not available, what plans, if any, are there to generate synthetic data so that work on speeding up the models and underlying libraries, allowing them to run at scale on large machines like LUMI, can proceed in parallel?

Synthetic or toy data will likely be used in the BioDT building phase to help with design and prototyping, but we expect to finish with a twin based on real-life data.

Use Cases

It seems that the occurrence points of the grasspea species are plentiful in Europe and in areas where hunger and malnutrition are relatively low. Are Global South regions also suitable for the species?

The crop Lathyrus sativus (grasspea), also known as Indian pea, grows very well in the Global South. It is commonly cropped in East Africa (Ethiopia), the Middle East and South-Eastern Asia (India). It is true that it is also well distributed in southern Europe, especially in the Mediterranean area, and members of the genus have a rather wide distribution. The occurrence points we have displayed at both genus and species level are obtained from GBIF, and may therefore be misleading, since the Global South is not well sampled.

Can you provide some additional information on the scenarios (climatology/policy/etc.) that the digital twin will be used for?

In this first pilot phase, in which we validate that we can build the biodiversity digital twin, we will explore a set of eight use cases. See: https://biodt.eu/#use-cases

Are "levels of invasion" planned to incorporate local species-interaction network knowledge?

This is currently not planned, because knowledge of these interaction networks is not available for alien species over the full environmental gradient across Europe. To the best of my knowledge, we have too little detailed information available on these groups.

Which data will be used to describe the real-time habitat conditions?

For alien species we will explore different high-resolution Remote Sensing products.

For the modelling of invasive species spread, are you also using EBVs in addition to environmental variables? If so, which ones?

EBVs as such would be responses (e.g. level of invasion falls in one of the EBV classes) rather than predictors. Because we need to tap into existing data, we cannot use ready-made EBVs as responses, as they are not (yet) being made available centrally (though for invasive alien species this is in the making). But we will make use of eLTER and DiSSCo, and we are seeking contact with our colleagues from the European Vegetation Archive and sPlot.

Concerning invasive species, are there efforts being made to quantify their benefits to local societies too, for example in the case of medicinal benefits as hypothesised in ethnobotany? See https://link.springer.com/article/10.1007/s12231-017-9389-8

Currently, we do not plan to link levels of invasions to impacts or benefits, because these again differ across regions in Europe, and we only have very general, coarse-scale data on them.

Is there any global spatial database about pesticide use that can be used to model its influence on pollinators?

One starting point is Our World in Data's collection on pesticide use: https://ourworldindata.org/pesticides

What kind of data do you use for the Pollinator interactions use case?

Beehive weight is a major data source, including the TrachtNet data, as well as remotely sensed land use maps. Further relevant data are weather data and data about the phenology of all relevant crop and non-crop plant species and the amount of nectar and pollen they provide.

In both the 2nd and 3rd presentations, the objectives showed a large overlap (naturally) with the goals/targets specified by the CBD in their post-2020 framework. However, the biodiversity indicators associated with this framework that focus on pollinators and on the ‘rate of invasive alien species spread/impact’ are not ‘mature’. I think it would be useful if the DT were capable of simulating how such indicators would change over time. How do you see this?

Currently, we do not plan to link levels of invasions to impacts or benefits, because these again differ across regions in Europe, and we only have very general, coarse-scale data on them. As for pollinators, the honey bee model will be used as a "probe" to assess the availability of nectar and pollen, and the diversity of pollen, in a given agricultural landscape. This is a proxy that can help develop crop rotation schemes and land cover structures. Moreover, the risk imposed by pesticides can be made landscape-specific, and hence support biodiversity assessment and conservation.

Crop wild relatives use case - would you include DNA-based observations that have no taxonomic classification (non-described OTUs/ASVs) in the calculation of biodiversity metrics? Do you find non-described OTUs/ASVs informative for policy makers?

There is no plan to use non-described OTUs/ASVs directly in the crop wild relatives use case. But yes, we believe that such OTUs/ASVs are potentially useful: with evolving reference libraries and identification algorithms, these name-naked OTUs can eventually germinate into records with formal names.

Is there a possibility for non-academic institutions to collaborate and help in the development of these use cases?

Yes, you are welcome to reach out to the use case leaders or project management and we will look at the potential opportunities together.

DestinE

How is the Commission bringing social science perspectives into DestinE? Do you have any examples?

The Strategic Advisory Board of DestinE is currently setting up a phased and continuous living DestinE Science Plan, which will also take up this dimension as a specific thematic line. The Science Plan will set out concrete recommendations to relevant Commission services on further enhancing the contribution of this dimension to DestinE.

What are the Commission’s thoughts on how to manage expectations from policy stakeholders about what DTs can and cannot model?

Successful take-up and actual usage of such tools will typically depend on actual co-design of services between end-users and developers. To this end, the three DestinE implementing entities are putting in place a comprehensive user engagement process.

How will DestinE harness the future EU Green Deal Data Space? What is the vision for connection and/or interaction between these initiatives?

As set out in the Data Strategy, the Green Deal Data Space aims to exploit the major potential of data in support of the European Green Deal priorities through the setup of a federated ecosystem of relevant data sources, enabling policy makers, businesses, researchers and citizens to work jointly towards the realisation of the Green Deal objectives. A dedicated Coordination & Support Action has been launched under the Digital Europe Programme ("the GREAT project"). This project aims, among other things, at preparing a reference architecture and an implementation roadmap for the Green Deal Data Space, and will explore and further develop links with DestinE to ensure maximum synergies. There have already been discussions between Commission Services working on the Green Deal Data Space and DestinE on potential use cases and synergies. Further, ECMWF is part of the GREAT project.

Are there any forums at the Commission on this issue that are open to national institutions?

The specific forum to discuss issues of intended participation of Member State institutions in DestinE is the dedicated DestinE Member States forum which meets periodically and is composed of representatives of the Commission, the Member States and the three DestinE implementing institutions.