|Mohtady Barakat||Faculty of Engineering, Multimedia University, Jalan Multimedia, 63100 Cyberjaya, Selangor, Malaysia|
|Gwo Chin Chung||Faculty of Engineering, Multimedia University, Jalan Multimedia, 63100 Cyberjaya, Selangor, Malaysia|
|It Ee Lee||Faculty of Engineering, Multimedia University, Jalan Multimedia, 63100 Cyberjaya, Selangor, Malaysia|
|Wai Leong Pang||School of Engineering, Taylor’s University, 47500 Subang Jaya, Selangor, Malaysia|
|Kah Yoong Chan||Faculty of Engineering, Multimedia University, Jalan Multimedia, 63100 Cyberjaya, Selangor, Malaysia|
Since 2017, durian has accounted for up to 41% of Malaysia's cultivated fruit area, making it the country's most widely planted fruit crop. The rapid increase in demand urges the authorities to search for a more systematic way to control durian cultivation and manage the productivity and quality of the fruit. This research paper proposes a deep-learning approach for detecting and sizing durian fruit in any given image. The aim is to develop zero-shot learning models that can accurately identify and measure the size of durian fruits in images, regardless of the image's background. The proposed methodology leverages two cutting-edge models, Grounding DINO and Segment Anything (SAM), which are applied in a zero-shot manner, without task-specific training on durian images. The dataset used for testing the models includes various images of durian fruits captured from different sources. The effectiveness of the proposed model is evaluated by comparing it with the Segmentation Generative Pre-trained Transformers (SegGPT) model. The results show that the Grounding DINO model, which achieves a 92.5% detection accuracy, outperforms SegGPT in terms of accuracy and efficiency. This research has significant implications for computer vision and agriculture, as it can facilitate automated detection and sizing of durian fruits, leading to improved yield estimation, quality control, and overall productivity.
Durian; Grounding DINO; Segment Anything; SegGPT; Zero-shot learning
Agriculture is the backbone of many economies around the world. Among various plantations, durian, often referred to as the "king of fruits", is of great economic importance in many countries, especially in Southeast Asia (Allied Market Research, 2021). In the countries where it is extensively grown, such as Thailand, Malaysia, and the Philippines, durian farming is a major income-generating activity. For example, in 2020, Thailand, the world's leading producer of durians, harvested about 1.05 million metric tonnes of fruit (FAOSTAT, 2021). On the export front, China's enormous appetite for durians is notable. According to a report from the Thai Office of Agricultural Economics (Saowanit, 2017), in 2016, durian exports to China amounted to over 158,000 metric tonnes, valued at approximately $0.24 billion. Durian's unique traits and high economic value have also led to considerable investments in research and development to improve cultivation techniques, disease resistance, and post-harvest handling. For instance, the Malaysian government has invested about $2.8 million in durian research and development since 2016.
In recent years, technology has significantly influenced traditional farming practices, a shift often called 'Precision Agriculture' (Haziq et al., 2022; Pedersen and Lind, 2017). One of the critical tasks in precision agriculture is the accurate measurement of fruit size, which assists in yield prediction, harvest planning, and post-harvest processing. Durian size detection holds substantial importance due to the fruit's large and variable size. While traditional methods of durian cultivation have relied on skilled farmers' expertise to determine the fruit's ripeness and quality, the global agricultural landscape is rapidly evolving. The appeal of integrating advanced deep-learning techniques into durian cultivation stems from several factors: scalability and consistency, data-driven insights, market differentiation, and future-proofing (Kamilaris and Prenafeta-Boldú, 2018). With the advent of machine learning and deep learning techniques, automated fruit detection and size estimation have become a possibility. Various models have been explored, including Convolutional Neural Networks (CNNs) (Tan et al., 2022) and You Only Look Once (YOLO).
For instance, Pedersen and Lind (2017) demonstrated the use of machine learning algorithms for predicting fruit yield, quality, and disease detection. Similarly, image processing and computer vision have been widely studied for their potential in the ripeness classification of oil palm fresh fruit bunches (Mansour, Dambul, and Choo, 2022). Rahnemoonfar and Sheppard (2017) used deep simulated learning for fruit counting, thereby highlighting the potential of deep learning in precision agriculture; their proposed Inception-ResNet architecture showed 91% average test accuracy on real images. Similarly, Mohanty, Hughes, and Salathé (2016) used deep CNNs for image-based plant disease detection and achieved nearly 100% accuracy under certain conditions. Last but not least, a game-theoretic model of the species and varietal composition of fruit plantations has also been developed (Belousova and Danilina, 2021).
While traditional machine learning techniques have shown promising results, deep learning, with its ability to learn complex features, has taken this a step further. The Grounding DINO and Segment Anything models have been proven effective in object detection and image segmentation tasks. Maninis, Radosavovic, and Kokkinos (2019) demonstrated the potential of attentive single-tasking of multiple tasks, showing exceptional results in various image segmentation tasks by reducing the number of parameters while maintaining or even improving overall performance. On the other hand, Grounding DINO, known for its disentangled non-autoregressive object detection, has also shown promising results (Carion et al., 2020). The zero-shot learning approach used by Grounding DINO has recently raised the interest of many researchers for its low computational complexity and promising performance (Wang et al., 2019). An extensive review of zero-shot learning is also available (Pourpanah et al., 2020).
In this paper, we propose a zero-shot deep learning model using the Grounding DINO and Segment Anything models for durian fruit size detection. Our methodology centers around a two-step process: we first employ Grounding DINO for initial object detection and then use Segment Anything (SAM) to segment and mask each durian fruit for size determination. Notably, we implement a zero-shot approach, eliminating the need for data collection and manual annotation. We compare the performance of the proposed model with that of the Segmentation Generative Pre-trained Transformers (SegGPT) model. Detection accuracy is analyzed and improved by adjusting the box threshold and text threshold. Finally, the results of the optimized detection models are demonstrated on practical images. Accurate fruit size detection, which is examined in this paper, holds considerable implications for precision agriculture. The ability to reliably quantify fruit sizes in real-time can transform numerous farming practices, from yield prediction and harvest planning to post-harvest processing.
This paper is organized as follows: Section 2 presents the research methodology by defining the concept of zero-shot learning models such as Grounding DINO, SegGPT, and SAM. It also discusses the design process for the detection and sizing of durian. Section 3 illustrates the results of durian size detection using the above-mentioned models, including a performance comparison and analysis of the Grounding DINO, SegGPT, and SAM models. Lastly, Section 4 comprises a summary of durian size detection using our proposed deep learning models as well as future recommendations.
2.1. Zero-Shot Learning
2.1.1. Grounding DINO
Grounding DINO, an Artificial Intelligence (AI) model developed by IDEA Research, is a significant advancement in the fields of deep learning and computer vision. The model derives its name from a combination of "grounding", a process that connects the understanding of vision and language in the AI system, and "DINO", short for self-DIstillation with NO labels, a self-supervised learning framework developed by Meta AI (Caron et al., 2021).
The architecture of Grounding DINO is built upon the principles of Vision Transformers (ViT) and the DINO framework. ViT treats an image as a sequence of patches, similar to how transformers in natural language processing treat text as a sequence of tokens. The DINO framework, on the other hand, is a novel approach to self-supervised learning that does not require labeled data for training. It learns visual representations by encouraging agreement between differently augmented views of the same image. This is achieved through a process known as knowledge distillation, in which a teacher network guides a student network to learn the appropriate feature representations. The combination of the ViT and DINO frameworks makes Grounding DINO a powerful tool for tasks where labeled training data is scarce or nonexistent (Radford et al., 2018).
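The self-distillation objective described above can be illustrated with a minimal NumPy sketch: a cross-entropy loss between a sharpened (low-temperature) teacher distribution and a student distribution computed on two augmented views of the same input. The temperatures, logit shapes, and toy data below are illustrative assumptions, not the actual DINO training configuration.

```python
import numpy as np

def softmax(x, temperature):
    """Temperature-scaled softmax over the last axis."""
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_distillation_loss(student_logits, teacher_logits,
                           t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the sharpened teacher distribution and the
    student distribution; a lower teacher temperature yields sharper targets."""
    p_teacher = softmax(teacher_logits, t_teacher)
    log_p_student = np.log(softmax(student_logits, t_student) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

# Two augmented "views" of the same image produce nearby logits; minimizing
# the loss encourages the student to agree with the teacher across views.
rng = np.random.default_rng(0)
logits_view_a = rng.normal(size=(4, 8))                          # student, view A
logits_view_b = logits_view_a + 0.05 * rng.normal(size=(4, 8))   # teacher, view B
loss = dino_distillation_loss(logits_view_a, logits_view_b)
print(loss)
```

In the full framework, the teacher's parameters are an exponential moving average of the student's and receive no gradients; this sketch only shows the shape of the loss itself.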
2.1.2. SegGPT
SegGPT, short for Segmentation with Generative Pretraining, is a deep learning model designed for object detection and segmentation tasks in images. The model builds on the foundations of OpenAI's transformer-based models, such as the Generative Pre-training Transformer (GPT), and utilizes a similar architecture with adjustments to suit vision-related tasks (Ramesh et al., 2021).
The architecture of SegGPT employs the standard transformer model, known for its exceptional performance in Natural Language Processing (NLP), and extends its capabilities to handle image data. This is achieved through an autoregressive model that allows SegGPT to generate descriptions of images by sequentially predicting bounding boxes and classes for objects present in the image. One of the key features of SegGPT is its ability to operate under a zero-shot learning paradigm, meaning it can identify and segment objects within images without having previously seen examples of those objects during training (Brown et al., 2020).
2.1.3. Segment Anything (SAM)
SAM, developed by Meta AI, is a state-of-the-art image segmentation model that can return a valid segmentation mask for any given prompt (Kirillov et al., 2023). In this context, a prompt refers to an indication of what to segment in an image and can take multiple forms, including points indicating the foreground or background, a rough box or mask, text, clicks, and more.
The architecture of SAM is composed of three components: an image encoder, a prompt encoder, and a mask decoder. SAM relies on a massive dataset, known as the "Segment Anything 1-Billion mask" (SA-1B) dataset, containing over one billion segmentation masks collected on 11 million images. This dataset is currently the largest labeled segmentation dataset available and is specifically designed for developing and evaluating advanced segmentation models.
2.2. Design Process
This research aims to devise an advanced deep-learning solution for detecting and sizing durian fruits from a variety of images. The objective is approached through a two-phase procedure: object detection, followed by segmentation and sizing, as shown in Figure 1. This involves the application of two different object detection models, SegGPT and Grounding DINO, and a comparison of their performances. Appropriate thresholds are then tested and chosen to optimize performance. After selecting the more accurate detection model, the SAM model is used for image segmentation and size calculation. SegGPT can also be used for segmentation, but it is not recommended, as elaborated in the results section. It is worth noting that the entire methodology operates under a zero-shot learning paradigm, bypassing traditional data collection and manual annotation requirements.
Figure 1 Block diagram of the design process
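The two-phase flow of Figure 1 can be sketched in a few lines of Python. Note that `detect_durians` and `segment_in_box` below are hypothetical stand-ins for Grounding DINO and SAM, not the real model APIs, and the pixel-to-centimetre scale is an assumed calibration value for illustration only.

```python
import numpy as np

def detect_durians(image):
    """Phase 1 stand-in (hypothetical): return bounding boxes as (x0, y0, x1, y1)."""
    h, w = image.shape[:2]
    return [(w // 4, h // 4, 3 * w // 4, 3 * h // 4)]  # one dummy box

def segment_in_box(image, box):
    """Phase 2 stand-in (hypothetical): return a binary mask inside the box."""
    mask = np.zeros(image.shape[:2], dtype=bool)
    x0, y0, x1, y1 = box
    mask[y0:y1, x0:x1] = True
    return mask

def mask_to_size_cm(mask, cm_per_pixel):
    """Convert mask area (pixels) to an equivalent-circle diameter in cm."""
    area_cm2 = mask.sum() * cm_per_pixel ** 2
    return 2.0 * np.sqrt(area_cm2 / np.pi)

# Run the pipeline on a dummy 200x200 image with an assumed 0.1 cm/pixel scale.
image = np.zeros((200, 200, 3), dtype=np.uint8)
sizes = [mask_to_size_cm(segment_in_box(image, b), cm_per_pixel=0.1)
         for b in detect_durians(image)]
print([round(s, 2) for s in sizes])
```

In a real deployment, the calibration factor would come from camera geometry or a reference object of known size in the scene.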
2.2.1. Phase 1: Object detection
The first phase of the methodology involves applying and comparing two deep learning models designed for object detection tasks: SegGPT and Grounding DINO. Both models have shown effectiveness in object detection and can operate within a zero-shot learning environment, making them ideal candidates for our analysis. Each model is applied to a set of images and instructed to identify and mark durian fruits by drawing bounding boxes around every detected fruit (Carion et al., 2020).
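Detection quality against reference annotations is commonly scored by Intersection-over-Union (IoU) matching of bounding boxes. The following self-contained sketch shows one such evaluation; the 0.5 IoU threshold and the toy boxes are illustrative assumptions, not values from this study.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x0, y0, x1, y1) boxes."""
    x0 = max(box_a[0], box_b[0]); y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2]); y1 = min(box_a[3], box_b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def detection_accuracy(predictions, ground_truths, iou_threshold=0.5):
    """Fraction of reference boxes matched by a prediction with IoU >= threshold."""
    matched = sum(1 for gt in ground_truths
                  if any(iou(p, gt) >= iou_threshold for p in predictions))
    return matched / len(ground_truths)

# Toy example: the first reference box overlaps a prediction, the second does not.
preds = [(10, 10, 50, 50), (60, 60, 100, 100)]
gts = [(12, 12, 52, 52), (200, 200, 240, 240)]
print(detection_accuracy(preds, gts))  # 0.5
```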
2.2.2. Phase 2: Segmentation and size determination
The focus of this study is to assess the effectiveness of three cutting-edge deep learning models (Grounding DINO, SegGPT, and SAM) for the task of durian fruit detection and sizing. The experimental process involves zero-shot learning, eliminating the need for extensive data collection or training of these models specifically on durian images. This approach offers an opportunity to test the models' ability to generalize knowledge learned from diverse domains to a new task, i.e., identifying and segmenting durian fruits, as shown in Figure 2.
3.1. Object Detection
The first model tested is Grounding DINO, which is able to ground, or localize, text in images. In the context of this research, Grounding DINO is used to identify durian fruits based on a textual description. Given an input image, the model is prompted with the text "durian", and it returns a list of bounding boxes that it believes contain durian fruits, as shown in Figure 3. Each bounding box is associated with a score indicating the model's confidence level.
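The effect of a box (confidence) threshold on the returned detections can be sketched as follows; the threshold value and the example boxes and scores are illustrative assumptions, not Grounding DINO's actual defaults or outputs.

```python
def filter_detections(boxes, scores, box_threshold=0.35):
    """Keep only boxes whose confidence score reaches the threshold,
    mirroring the effect of a box-threshold parameter during inference.
    The 0.35 default here is an illustrative value."""
    keep = [i for i, s in enumerate(scores) if s >= box_threshold]
    return [boxes[i] for i in keep], [scores[i] for i in keep]

# Toy detections: two plausible durian boxes and one low-confidence box.
boxes = [(10, 10, 80, 90), (15, 12, 85, 95), (100, 40, 130, 70)]
scores = [0.91, 0.42, 0.18]

kept_boxes, kept_scores = filter_detections(boxes, scores)
print(len(kept_boxes))  # the low-confidence box is discarded
```

Raising the threshold trades recall for precision, which is the trade-off explored when tuning the box and text thresholds in the following sections.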
Figure 2 Original durian image