Orchids are one of the largest and most diverse families of flowering plants in the world, comprising more than 25,000 species. How can you tell the species apart? Last year’s RDNL Dutch Data Prize winner, Diah Apriyanti, is developing a tool to help!
Diah Apriyanti is a PhD student in the Department of Data Management & Biometrics at the University of Twente. For her doctoral research, Diah is developing an algorithm that identifies plant species for taxonomic classification. We asked Diah to tell us more about her research and the story behind her winning dataset.
What inspired your research?
My interest in plant identification began during my time working in the Purwodadi botanical garden in my home country of Indonesia. I recognised that botanists and taxonomists devoted a large amount of time to the manual identification of plants for their taxonomic classification using reference collections and databases.
The manual process is tedious and laborious, and the existing image-based identification systems are a ‘black box’. They provide the name of the plant species but no explanation of the identification process or the plant characteristics.
Determined to improve the process, I have been developing a computational tool to automate plant identification as part of my Master’s degree and PhD research, using orchids as a model group. My tool promises to make plant identification quicker, easier and more reliable by providing the species name and a detailed description of the flower characteristics. What’s more, it can recognise the species from images taken from different angles and under various lighting conditions.
How have you developed your tool so far, and how does it work?
The first step was to create the Orchid Flowers Dataset, which contains more than 7,000 images of orchid flowers belonging to almost 160 species. I obtained these images from a variety of sources, including Go Botany (Native Plant Trust), the Encyclopedia of Life (EOL) and Flickr. I wrote code in the Python programming language to query the Flickr application programming interface (API) using appropriate keywords about orchid species and flower characteristics. All of the images were available under the Creative Commons CC-BY license and, therefore, could be redistributed and reused.
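To give a rough idea of what such a keyword query can look like, here is a minimal sketch that only builds the request URL for Flickr's public `flickr.photos.search` method. It is not Diah's published code: the function name, the keyword string and the placeholder API key are assumptions made for illustration, and the license id used for CC-BY should be checked against Flickr's own license list before use.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Public REST endpoint of the Flickr API.
FLICKR_REST_ENDPOINT = "https://api.flickr.com/services/rest/"

# Assumption: "4" is Flickr's id for the Attribution (CC BY) license.
# Verify with the flickr.photos.licenses.getInfo method before relying on it.
CC_BY_LICENSE_ID = "4"


def build_search_url(api_key, keywords, page=1, per_page=100):
    """Build a flickr.photos.search URL for CC-BY photos matching the keywords.

    Hypothetical helper for illustration; it performs no network request.
    """
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": keywords,               # free-text search, e.g. species name
        "license": CC_BY_LICENSE_ID,    # restrict results to reusable images
        "extras": "url_m,license,owner_name",  # image URL and attribution data
        "per_page": per_page,
        "page": page,
        "format": "json",
        "nojsoncallback": 1,            # plain JSON instead of JSONP
    }
    return FLICKR_REST_ENDPOINT + "?" + urlencode(params)


# Example: a query combining a species name with a flower keyword.
example_url = build_search_url("YOUR_API_KEY", "Amerorchis rotundifolia flower")
example_query = parse_qs(urlparse(example_url).query)
```

The resulting URL could then be fetched with any HTTP client, and the owner name and license returned in the response kept alongside each image so that attribution is preserved.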
The harvested images were all very different! Some were close-up images of orchid flowers whilst others were of the whole plant, including the stem and leaves. As my plant identification tool required an image of the orchid flower, I had to clean the data as much as possible to remove irrelevant images and obtain images of flowers.
Next, I created a Bayesian deep learning algorithm to identify, describe, classify and name the orchid in each image. The algorithm uses ‘feature classifiers’ which are variables used to describe the flower characteristics. The classifiers I used were comparable to those used by botanists and taxonomists, such as ‘species name’, ‘colour of flower’, ‘texture’, ‘number of flowers’, ‘inflorescence’ (arrangement of flowers) and ‘colour of labellum’ (lip that attracts insects). These classifiers were entered into a Bayesian network to identify the species name.
Image No. 7034 from the Orchid Flowers Dataset
Source: Arthur Haines (Native Plant Trust)
Species name: Amerorchis rotundifolia
Colour of flower: Purple
Colour of labellum: Purple
Labellum characteristic: lobed, spotted
Inflorescence (arrangement of flowers): raceme, a few flowers per cluster.
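To illustrate how observed flower characteristics like those above can be combined into a species identification, here is a deliberately simplified sketch. It is not the Bayesian network used in the research: it assumes the features are independent given the species (a naive Bayes model), it treats the features as already observed rather than predicted from an image, and the second species and every probability in the tables are invented for illustration.

```python
# Hypothetical prior probability of each species (made-up values).
PRIORS = {
    "Amerorchis rotundifolia": 0.5,
    "Species B (placeholder)": 0.5,
}

# Hypothetical P(feature value | species) tables (made-up values).
LIKELIHOODS = {
    "Amerorchis rotundifolia": {
        "colour_of_flower": {"purple": 0.7, "white": 0.3},
        "inflorescence": {"raceme": 0.9, "solitary": 0.1},
    },
    "Species B (placeholder)": {
        "colour_of_flower": {"purple": 0.2, "white": 0.8},
        "inflorescence": {"raceme": 0.3, "solitary": 0.7},
    },
}


def species_posterior(observed):
    """Return P(species | observed features) under a naive Bayes assumption."""
    unnormalised = {}
    for species, tables in LIKELIHOODS.items():
        probability = PRIORS[species]
        for feature, value in observed.items():
            # Unseen values get a tiny probability instead of zero.
            probability *= tables[feature].get(value, 1e-6)
        unnormalised[species] = probability
    total = sum(unnormalised.values())
    return {species: p / total for species, p in unnormalised.items()}


# Features as they might be described for the example image above.
posterior = species_posterior(
    {"colour_of_flower": "purple", "inflorescence": "raceme"}
)
best_match = max(posterior, key=posterior.get)
```

Because every feature contributes its own term, the output can be read back as an explanation: each characteristic shows how strongly it supports or contradicts a candidate species, which is the kind of transparency a 'black box' identifier lacks.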
What were your motivations for publishing your dataset?
Plant identification datasets are scarce. Available datasets are limited to common plant species rather than wild tropical species, such as orchids. Many orchid species are endangered, growing only in conservation areas and botanic gardens, so I wanted to make my Orchid Flowers Dataset available to those who strive to protect such vulnerable species.
I published my image dataset in the Harvard Dataverse repository because other scientists in my research field have used this repository and I wanted to follow their good practice. I also published a README file, a simple text file containing the ‘feature classifiers’ that were used to identify orchid species in the images. Providing the images alone would not be sufficient; the README file allows others to understand and reuse them. In addition, the software code I used to query the Flickr API is openly available on GitHub.
Has your data been reused?
I’m proud to say that my dataset has been downloaded more than 140 times since its publication online! It has been used by students in the Bachelor’s programme in Electrical Engineering and the Master’s programme in Computer Vision and Biometrics to learn about the use of deep learning models.
Currently, I’m writing a publication about this research, and I hope that once it cites my dataset, others will find the dataset and reuse it for their own work. After all, my dataset is novel and provides a more detailed description of orchid flower characteristics than any currently available. The approach is not limited to orchids: it serves as a ‘proof of concept’ that can be adapted to identify other plant species!
How will you spend your prize winnings?
I’d like to use my RDNL Dutch Data Prize money of €1,000 to continue developing my tool for automated plant identification, and to develop a mobile application so that end-users can upload an image of a plant to identify its species and characteristics. This would be of huge benefit to the work of botanists and taxonomists, and could help save critically endangered plant species around the world.
Thank you for sharing your story with us, Diah, and good luck completing your PhD. We wish you all the best with these beautiful flowers! And when we stop to admire an orchid in the future, perhaps we can use your app to check what species it is!