Published at : 31 Oct 2023
Volume : IJtech
Vol 14, No 6 (2023)
DOI : https://doi.org/10.14716/ijtech.v14i6.6640
Mohtady Barakat | Faculty of Engineering, Multimedia University, Jalan Multimedia, 63100 Cyberjaya, Selangor, Malaysia
Gwo Chin Chung | Faculty of Engineering, Multimedia University, Jalan Multimedia, 63100 Cyberjaya, Selangor, Malaysia
It Ee Lee | Faculty of Engineering, Multimedia University, Jalan Multimedia, 63100 Cyberjaya, Selangor, Malaysia
Wai Leong Pang | School of Engineering, Taylor’s University, 47500 Subang Jaya, Selangor, Malaysia
Kah Yoong Chan | Faculty of Engineering, Multimedia University, Jalan Multimedia, 63100 Cyberjaya, Selangor, Malaysia
Since 2017, up to 41% of Malaysia's land has been
cultivated for durian, making it the most widely planted crop. The rapid increase
in demand urges the authorities to search for a more systematic way to control
durian cultivation and manage the productivity and quality of the fruit. This
research paper proposes a deep-learning approach for detecting and sizing
durian fruit in any given image. The aim is to develop zero-shot learning
models that can accurately identify and measure the size of durian fruits in
images, regardless of the image’s background. The proposed methodology
leverages two cutting-edge models: Grounding DINO and Segment Anything (SAM), which are applied without task-specific training, relying on pre-trained knowledge to recognize the essential features of the fruit. The dataset used for testing the models includes various images of durian fruits captured from different sources. The
effectiveness of the proposed model is evaluated by comparing it with the
Segmentation Generative Pre-trained Transformers (SegGPT) model. The results
show that the Grounding DINO model, which has a 92.5% detection accuracy,
outperforms the SegGPT in terms of accuracy and efficiency. This research has
significant implications for computer vision and agriculture, as it can
facilitate automated detection and sizing of durian fruits, leading to improved
yield estimation, quality control, and overall productivity.
Durian; Grounding DINO; Segment Anything; SegGPT; Zero-shot learning
Agriculture is the backbone of many economies around the world. Among various plantations, durian, often referred to as the "king of fruits", is of great economic importance in many countries, especially in Southeast Asia (Allied Market Research, 2021). In the countries where it is extensively grown, such as Thailand, Malaysia, and the Philippines, durian farming is a major income-generating activity. For example, in 2020, Thailand, the world's leading producer of durians, harvested about 1.05 million metric tonnes of fruit (FAOSTAT, 2021). On the export front, China's enormous appetite for durians is notable. According to a report from the Thai Office of Agricultural Economics (Saowanit, 2017), in 2016, durian exports to China amounted to over 158,000 metric tonnes, valued at approximately $0.24 billion. Durian's unique traits and high economic value have also led to considerable investments in research and development to improve cultivation techniques, disease resistance, and post-harvest handling. For instance, the Malaysian government has invested about $2.8 million in durian research and development since 2016.
In recent
years, technology has significantly influenced traditional farming practices, a
shift often called 'Precision Agriculture' (Haziq et
al., 2022; Pedersen and
Lind, 2017). One of
the critical tasks in precision agriculture is the accurate measurement of
fruit size, which assists in yield prediction, harvest planning, and
post-harvest processing. Durian size detection holds substantial importance
due to the fruit's variable and large size. While traditional methods of durian
cultivation have relied on skilled farmers' expertise to determine the fruit's
ripeness and quality, the global agricultural landscape is rapidly evolving.
The appeal of integrating advanced deep-learning techniques into durian
cultivation stems from several factors: scalability and consistency,
data-driven insights, market differentiation, and future-proofing (Kamilaris and Prenafeta-Boldú, 2018). With the advent of machine learning and deep
learning techniques, automated fruit detection and size estimation have become
a possibility. Various models have been explored, including Convolutional Neural Networks (CNNs) (Tan et al., 2022) and You Only Look Once (YOLO).
For instance, Pedersen and Lind (2017) demonstrated the use of machine learning algorithms for fruit yield prediction, quality assessment, and disease detection. Similarly, image processing and
computer vision have been widely studied for their potential in the ripeness
classification of oil palm fresh fruit bunches (Mansour, Dambul, and Choo, 2022). Besides that, Rahnemoonfar and Sheppard (2017) used deep simulated learning for fruit counting, thereby highlighting the potential of deep learning in precision agriculture; their proposed Inception-ResNet architecture showed 91% average test accuracy on real images. Similarly, Mohanty, Hughes, and Salathé (2016)
used deep CNNs for image-based plant disease detection and achieved nearly 100%
accuracy in detection under certain conditions. Last but not least, a
game-theoretic model of the species and varietal composition of fruit
plantations has also been developed (Belousova and Danilina,
2021).
While traditional machine learning techniques have
shown promising results, deep learning, with its ability to learn complex
features, has taken a step ahead. The Grounding DINO and Segment Anything
models have been proven effective in object detection and image segmentation
tasks. Maninis, Radosavovic, and Kokkinos (2019) demonstrated the potential
of Segment Anything in attentive single-tasking of multiple tasks. The model
has shown exceptional results in various image segmentation tasks in terms of
reducing the number of parameters while maintaining or even improving the
overall performance. On the other hand, Grounding DINO, known for its
disentangled non-autoregressive object detection, has also shown promising
results (Carion et al., 2020). The
zero-shot learning approach used by the Grounding DINO has raised the interest
of many researchers recently for its low computational complexity and promising
performance (Wang et al., 2019).
An extensive review of zero-shot learning is also available (Pourpanah et al., 2020).
In this paper, we propose a zero-shot deep learning
model using the Grounding DINO and Segment Anything models for durian fruit
size detection. Our methodology centers on a two-step process: we first
employ Grounding DINO for initial object detection and then use Segment
Anything (SAM) to segment and mask each durian fruit for size determination.
Notably, we implement a zero-shot training approach, eliminating the need for
data collection and manual annotation. We compare the performance of the
proposed model with that of the Segmentation Generative Pre-trained
Transformers (SegGPT) model. Performance accuracy is analyzed and improved with
several different box threshold and text threshold adjustments. Finally, the results of the optimized detection models are demonstrated on practical images. Accurate fruit size detection, which will be examined in this
paper, holds considerable implications for precision agriculture. The ability
to reliably quantify fruit sizes in real-time can transform numerous farming
practices, from yield prediction and harvest planning to post-harvest
processing.
This paper is organized as follows: Section 2 presents
the research methodology by defining the concept of zero-shot learning models
such as Grounding DINO, SegGPT, and SAM. It also discusses the design process
for the detection and sizing of durian. Section 3 illustrates the results of
durian size detection using the above-mentioned models. It includes performance
comparison and analysis of the Grounding DINO, SegGPT, and SAM models. Lastly,
Section 4 summarizes durian size detection using our proposed deep learning models and provides recommendations for future work.
Research Methodology
2.1. Zero-Shot Learning
2.1.1. Grounding DINO
Grounding DINO, an Artificial Intelligence (AI) model developed by IDEA Research, is a significant advancement in the fields of deep learning and computer vision. The model derives its name from a combination of "grounding", a process that connects the understanding of vision and language in the AI system, and "DINO", which stands for self-DIstillation with NO labels, a framework developed by Meta AI (Caron et al., 2021).
The architecture of Grounding DINO is
built upon the principles of Vision Transformers (ViT) and the DINO frameworks.
ViT treats an image as a sequence of patches, similar to how transformers in
natural language processing treat text as a sequence of tokens. The DINO
framework, on the other hand, is a novel approach to self-supervised learning
that does not require labeled data for training. It learns visual
representations by encouraging agreement between differently augmented views of
the same image. This is achieved through a process known as knowledge
distillation, in which a teacher network guides a student network to learn the
appropriate feature representations. The combination of the ViT and DINO frameworks makes Grounding DINO a powerful tool for tasks where labeled training data is scarce or nonexistent (Radford et al., 2018).
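As an illustration of this zero-shot workflow, the sketch below drives Grounding DINO with a text prompt using the inference helpers from the open-source IDEA-Research GroundingDINO repository. The configuration and checkpoint paths, prompt wording, and threshold values are placeholders for illustration, not the settings used in this study.

# Minimal sketch of zero-shot durian detection with Grounding DINO.
# Assumes the open-source GroundingDINO package is installed; paths and
# thresholds below are illustrative placeholders, not the authors' settings.
import cv2
from groundingdino.util.inference import load_model, load_image, predict, annotate

model = load_model(
    "GroundingDINO_SwinT_OGC.py",       # model config (placeholder path)
    "groundingdino_swint_ogc.pth",      # pre-trained weights (placeholder path)
)

image_source, image = load_image("durian_sample.jpg")

# The text prompt drives the zero-shot detection; no durian-specific training is used.
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="durian fruit",
    box_threshold=0.35,   # confidence threshold for box proposals
    text_threshold=0.25,  # threshold for matching the text phrase
)

annotated = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("annotated_durian.jpg", annotated)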
2.1.2. SegGPT
SegGPT, short for Segmentation with Generative Pretraining, is a deep learning model designed for object detection and segmentation tasks in images. The model builds on the foundations of OpenAI’s transformer-based models, such as the Generative Pre-trained Transformer (GPT), and utilizes a similar architecture with adjustments to suit vision-related tasks (Ramesh et al., 2021).
The architecture of SegGPT employs the standard transformer model, known for its exceptional performance in Natural Language Processing (NLP), and extends its capabilities to handle image data. Segmentation is framed as an in-context prediction problem: given a prompt image together with its mask, the model produces the corresponding mask for a new query image. One of the key features of SegGPT is its ability to operate under a zero-shot learning paradigm, meaning it can identify and segment objects within images without having previously seen examples of those objects during training (Brown et al., 2020).
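For comparison, a loose sketch of this in-context behavior is given below, assuming the Hugging Face transformers integration of the publicly released BAAI checkpoint. The checkpoint name, file paths, and the use of a single binary prompt mask are assumptions introduced for illustration and may differ from what the authors actually ran.

# Loose sketch of in-context segmentation with SegGPT via the (assumed)
# Hugging Face transformers integration. A prompt image with a hand-drawn
# durian mask conditions the prediction for a new query image.
import torch
from PIL import Image
from transformers import SegGptImageProcessor, SegGptForImageSegmentation

checkpoint = "BAAI/seggpt-vit-large"               # assumed public checkpoint
processor = SegGptImageProcessor.from_pretrained(checkpoint)
model = SegGptForImageSegmentation.from_pretrained(checkpoint)

query_image = Image.open("durian_query.jpg")        # placeholder path
prompt_image = Image.open("durian_prompt.jpg")      # placeholder path
prompt_mask = Image.open("durian_prompt_mask.png")  # binary durian mask

inputs = processor(
    images=query_image,
    prompt_images=prompt_image,
    prompt_masks=prompt_mask,
    num_labels=1,            # single "durian" class (assumption)
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(**inputs)

# Resize the predicted mask back to the query image resolution.
target_sizes = [query_image.size[::-1]]
pred_mask = processor.post_process_semantic_segmentation(outputs, target_sizes, num_labels=1)[0]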
2.1.3. SAM
SAM,
developed by Meta AI, is a state-of-the-art image segmentation model that can
return a valid segmentation mask for any given prompt (Kirillov
et al., 2023). In this context, a prompt refers to an
indication of what to segment in an image and can take multiple forms,
including points indicating the foreground or background, a rough box or mask,
text, clicks, and more.
The architecture of SAM is composed of three components: an image encoder, a prompt encoder, and a lightweight mask decoder. SAM relies on a massive dataset of over one billion segmentation masks annotated on roughly 11 million images, known as the "Segment Anything 1 Billion Mask" (SA-1B) dataset, to support its extensive capabilities. This dataset is currently the largest labeled segmentation dataset available and is specifically designed for developing and evaluating advanced segmentation models.
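To illustrate box-prompted segmentation with SAM, the sketch below uses Meta AI's open-source segment-anything package. The checkpoint path and the example box coordinates are placeholders; the actual prompting configuration used in this work is described in the design process that follows.

# Minimal sketch of prompting SAM with a bounding box (e.g. one produced by
# the detection stage). Checkpoint path and the example box are placeholders.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image_bgr = cv2.imread("durian_sample.jpg")
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
predictor.set_image(image_rgb)

# Box prompt in xyxy pixel coordinates (illustrative values).
box = np.array([120, 80, 460, 430])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)

durian_mask = masks[0]              # boolean mask of the segmented durian
pixel_area = int(durian_mask.sum())  # mask area in pixels, used later for sizing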
2.2. Design Process
This research aims to devise an advanced deep-learning solution for detecting and sizing durian fruits from a variety of images. The objective is approached through a two-phase procedure, object detection followed by segmentation and sizing, as shown in Figure 1. This involves the application of two different object detection models, SegGPT and Grounding DINO, and a comparison of their performances. Appropriate thresholds are then tested and chosen to optimize performance. After selecting the most accurate detection model, the SAM model is used for image segmentation and size calculation. SegGPT can also be used for segmentation, but it is not recommended, as elaborated in the results section later. It is worth noting that the entire methodology operates under a zero-shot learning paradigm, bypassing traditional data collection and manual annotation requirements.
Figure 1 Block diagram of the design process
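As a rough end-to-end sketch of the flow in Figure 1, the snippet below chains the two phases: Grounding DINO proposes durian bounding boxes from a text prompt, SAM turns each box into a mask, and the mask area is converted to a physical size. The cm_per_pixel calibration factor, prompt text, and threshold values are assumptions introduced for illustration; they are not taken from the authors' implementation.

# Rough sketch of the two-phase pipeline in Figure 1: Grounding DINO proposes
# durian boxes, SAM converts each box into a mask, and the mask area is mapped
# to a physical size using an assumed pixel-to-centimeter calibration factor.
import torch
from groundingdino.util.inference import load_model, load_image, predict
from groundingdino.util import box_ops
from segment_anything import sam_model_registry, SamPredictor


def detect_and_size_durians(image_path, dino, sam_predictor, cm_per_pixel):
    image_source, image = load_image(image_path)
    h, w, _ = image_source.shape

    # Phase 1: zero-shot detection from a text prompt (illustrative thresholds).
    boxes, logits, phrases = predict(
        model=dino, image=image, caption="durian fruit",
        box_threshold=0.35, text_threshold=0.25,
    )
    # Grounding DINO returns normalized cxcywh boxes; convert to pixel xyxy.
    boxes_xyxy = box_ops.box_cxcywh_to_xyxy(boxes) * torch.tensor([w, h, w, h])

    # Phase 2: segmentation and sizing with SAM, one box prompt per detection.
    sam_predictor.set_image(image_source)
    sizes_cm2 = []
    for box in boxes_xyxy.numpy():
        masks, _, _ = sam_predictor.predict(box=box, multimask_output=False)
        pixel_area = float(masks[0].sum())
        sizes_cm2.append(pixel_area * cm_per_pixel ** 2)  # projected area in cm^2
    return boxes_xyxy, sizes_cm2


# Example usage with placeholder paths and an assumed calibration:
# dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
# sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
# boxes, sizes = detect_and_size_durians("durian_01.jpg", dino, SamPredictor(sam), cm_per_pixel=0.05)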
2.2.1. Phase 1: Object detection
The first phase of the methodology involves applying and
comparing two deep learning models designed for object detection tasks: SegGPT
and Grounding DINO. Both models have shown effectiveness in object detection
and can operate within a zero-shot learning environment, making them ideal
candidates for our analysis. Each model is applied to a set of images and instructed to identify and mark durian fruits within each image. This is accomplished
by having the learning models draw bounding boxes around every detected durian
fruit (Carion et al., 2020).
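As a simple illustration of this bounding-box marking step, the helper below draws each detected box and its confidence score onto the image with OpenCV. It is a generic visualization sketch with an arbitrary function name and colors, not the exact annotation routine used by the authors.

# Generic OpenCV sketch for marking detected durians with their confidence scores.
import cv2


def draw_durian_boxes(image_bgr, boxes_xyxy, scores):
    # Draw each detected box and its confidence score on a copy of the image.
    annotated = image_bgr.copy()
    for (x1, y1, x2, y2), score in zip(boxes_xyxy, scores):
        cv2.rectangle(annotated, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        cv2.putText(annotated, f"durian {score:.2f}", (int(x1), max(int(y1) - 5, 15)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return annotated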
2.2.2. Phase 2: Segmentation and size determination
The focus of this study is to assess the effectiveness of three cutting-edge deep learning models, namely Grounding DINO, SegGPT, and SAM, for the task of durian fruit detection and sizing. The experimental process involves zero-shot learning, eliminating the need for extensive data collection or training of these models on durian images specifically. This approach offers an opportunity to test the models' capability to generalize their learned knowledge from diverse domains to a new task, i.e., identifying and segmenting durian fruits, as shown in Figure 2.
3.1. Object Detection
The first model tested is Grounding DINO, which is able to ground, or localize, text in images. In the context of this research, Grounding DINO is used to identify durian fruits based on the textual description "durian fruit". Given an input image, the model is prompted with the text "durian", and it returns a list of bounding boxes that it believes contain durian fruits, as shown in Figure 3. The bounding boxes are associated with scores, indicating the model's confidence level for each bounding box.
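Since the reported detection performance depends on the box and text thresholds, one simple way to explore their effect is a grid sweep such as the sketch below. The candidate threshold values, image list, and the per-image counting of detections are illustrative assumptions, not the authors' exact evaluation protocol.

# Illustrative grid sweep over Grounding DINO's box and text thresholds.
# Counts how many durians are detected per setting; the threshold grid and
# image paths are placeholders, not the authors' protocol.
from itertools import product
from groundingdino.util.inference import load_model, load_image, predict

dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_paths = ["durian_01.jpg", "durian_02.jpg"]  # placeholder test images

for box_t, text_t in product([0.25, 0.35, 0.45], [0.20, 0.25, 0.30]):
    total_detections = 0
    for path in image_paths:
        _, image = load_image(path)
        boxes, logits, phrases = predict(
            model=dino, image=image, caption="durian fruit",
            box_threshold=box_t, text_threshold=text_t,
        )
        total_detections += len(boxes)
    print(f"box_threshold={box_t:.2f}, text_threshold={text_t:.2f}: "
          f"{total_detections} detections")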
Figure 2 Original durian image