FAIR DATA Fund use case: evolution of specialised metabolite decorations in Brassicaceae

Authors: Felicia Wolters, Tina Woldu, Sibbe Bakker, Justin J. J. van der Hooft, Klaas Bouwmeester, Marnix H. Medema; Wageningen University & Research.

General background

Plants have evolved a plethora of specialized metabolites that provide valuable resources for human medicine, nutrition and agrochemistry. The bioactivity of specialized metabolites ultimately depends on highly variable modifications of their core structures. In particular, glycosylation, the addition of sugar moieties at variable sites of the core structure, determines compound toxicity. Genes encoding UDP-dependent glycosyltransferases (UGTs) evolve rapidly via tandem duplication and translocation, resulting in the formation of UGT gene tandem arrays. Although glycosylation profiles convey key features of plant specialized compound bioactivity, the functional annotation of UGT genes and the trajectories of tandem-array evolution remain largely obscure. Resolving them requires the integration of genomics, transcriptomics, and metabolomics data. However, paired omics datasets are unavailable for most plant species.

Approach

In our project, we generated paired transcriptomics and metabolomics datasets for close and distant relatives in the Brassicaceae family for which genome assemblies are publicly available.

Untargeted metabolomics data and paired RNA-seq data were generated for a panel of 17 species, consisting of close and distant relatives in the Brassicaceae family. Based on an evaluation of metabolic distance, a subset of ten species was selected for generating paired transcriptomics and metabolomics data in a time series that included plant hormone treatments to elicit specialized biosynthetic pathways.

Key results

A prominent method for integrating different data types, centered on metabolomics data analysis, was introduced recently [1]. The quality and comprehensiveness of sample metadata determine the extent to which the relatedness of extracted features can be computationally inferred. Harmonization of metadata standards was recently proposed for the large public metabolomics repository MassIVE to enable integration with the Sequence Read Archive (SRA) [2].

Refinement of metadata structure

Based on the proposed structuring of metadata [1, 2] and the use of common ontologies for metadata fields [1-3], we have constructed a metadata framework for both paired omics datasets generated within this project. The design of this framework aims to facilitate further integration of datasets and paired data types, such as genome annotation data, proteomics data, and phenomics data, through common ontology terms according to the Planteome knowledgebase [3].

Construction of a knowledge-graph framework for data integration

Based on the refined metadata structure, we have constructed a prototype knowledge graph according to the Experimental Plant Natural Products Knowledge-graph framework introduced by Gaudry et al. [1]. The metadata structure was converted into a shareable triple format, enabling the use of SPARQL queries to extract data and related metadata.
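
To illustrate the idea of converting metadata into triples, the following is a minimal sketch in Python using only the standard library. It is not the project's actual pipeline: the namespace `https://example.org/kg/`, the predicate names (`hasTaxon`, `fromSample`, `hasDataType`, `hasFile`), the sample identifier and the file names are all hypothetical and chosen for illustration only. The sketch turns two paired metadata records into subject-predicate-object triples, serializes them as N-Triples, and mimics a single SPARQL triple pattern with a small matching function. The SPARQL string is shown for comparison and is not executed here; running it would require loading the triples into an RDF store or library.

```python
BASE = "https://example.org/kg/"          # hypothetical namespace, for illustration only
OBO = "http://purl.obolibrary.org/obo/"   # PURL prefix used by OBO ontologies


def iri(namespace, local):
    """Format an IRI in N-Triples syntax."""
    return f"<{namespace}{local}>"


def literal(value):
    """Format a plain string literal, escaping backslashes and double quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def record_to_triples(rec):
    """Turn one metadata record into triples (hypothetical predicate names)."""
    sample = iri(BASE, "sample/" + rec["sample_id"])
    dataset = iri(BASE, f"dataset/{rec['sample_id']}-{rec['data_type']}")
    taxon = iri(OBO, rec["taxon"].replace(":", "_"))  # CURIE -> OBO PURL
    return [
        (sample, iri(BASE, "hasTaxon"), taxon),
        (dataset, iri(BASE, "fromSample"), sample),
        (dataset, iri(BASE, "hasDataType"), literal(rec["data_type"])),
        (dataset, iri(BASE, "hasFile"), literal(rec["file_name"])),
    ]


def to_ntriples(triples):
    """Serialize triples as N-Triples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Two paired records for one hypothetical sample (NCBITaxon:3702 is Arabidopsis thaliana).
records = [
    {"sample_id": "S01", "taxon": "NCBITaxon:3702",
     "data_type": "transcriptomics", "file_name": "S01.fastq.gz"},
    {"sample_id": "S01", "taxon": "NCBITaxon:3702",
     "data_type": "metabolomics", "file_name": "S01.mzML"},
]

# Flatten and drop duplicates (both records state the same sample-taxon triple).
triples = list(dict.fromkeys(t for rec in records for t in record_to_triples(rec)))
print(to_ntriples(triples))

# Equivalent SPARQL query, shown for comparison only (not executed in this sketch).
SPARQL_EXAMPLE = """
PREFIX kg: <https://example.org/kg/>
SELECT ?dataset ?file WHERE {
  ?dataset kg:hasDataType "metabolomics" ;
           kg:hasFile ?file .
}
"""

# The same selection with the toy matcher: all metabolomics datasets.
metabolomics = match(triples, p=iri(BASE, "hasDataType"), o=literal("metabolomics"))
```

Modelling each data file as its own dataset node linked to a shared sample node keeps the transcriptomics and metabolomics layers distinguishable while still making the pairing explicit through the common sample.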

Fig. 1: Simplified scheme of metadata integration for cross-linking public knowledge databases and construction of an Experimental Natural Products Knowledge-graph following the concept introduced by Gaudry et al. [1].

Fig. 2: Example scheme of a Knowledge-graph outline queryable using SPARQL query language.

Support from the 4TU.ResearchData FAIR DATA Fund

The FAIR DATA Fund enabled us to curate a knowledge-graph-based framework for integrating the transcriptomics and metabolomics data generated in this PhD project. Based on this framework, we can provide the plant natural products research community with interoperable data suitable for large-scale meta-analysis, and we can comply with metadata standards that are compatible with related omics data beyond the scope of this PhD project. With this support, we can share the data generated in this project in a form suitable for re-use by a broader scientific community. In particular, the standardized metadata curated in this project aim to facilitate biocuration for AI development in the life sciences, as outlined recently [4].

References

[1] Gaudry, A., Pagni, M., Mehl, F., Moretti, S., Quiros-Guerrero, L. M., Cappelletti, L., … & Allard, P. M. (2024). A sample-centric and knowledge-driven computational framework for natural products drug discovery. ACS Central Science. DOI: 10.1021/acscentsci.3c00800.

[2] El Abiead, Y., Strobel, M., Payne, T., Fahy, E., O’Donovan, C., Subramaniam, S., … & Wang, M. (2024). Enabling pan-repository reanalysis for big data science of public metabolomics data. DOI: 10.26434/chemrxiv-2024-jt46s.

[3] Cooper, L., Elser, J., Laporte, M. A., Arnaud, E., & Jaiswal, P. (2024). Planteome 2024 Update: Reference ontologies and knowledgebase for plant biology. Nucleic Acids Research, 52(D1), D1548-D1555. DOI: 10.1093/nar/gkad1028.

[4] Dessimoz, C., & Thomas, P. D. (2024). AI and the democratization of knowledge. Scientific Data, 11(1), 268. DOI: 10.1038/s41597-024-03099-1.

Felicia Wolters and the team at Wageningen University & Research are among the winners of the FAIR Data Fund 2023 edition.
