Published at : 03 Nov 2022
Volume : IJtech, Vol 13, No 6 (2022)
DOI : https://doi.org/10.14716/ijtech.v13i6.5876
Cheng Hao Ng | Faculty of Information Science & Technology, Multimedia University, 75450, Melaka, Malaysia
Tee Connie | Faculty of Information Science & Technology, Multimedia University, 75450, Melaka, Malaysia
Kan Yeep Choo | Faculty of Engineering, Multimedia University, Persiaran Multimedia, 63100, Cyberjaya, Malaysia
Michael Kah Ong Goh | Faculty of Information Science & Technology, Multimedia University, 75450, Melaka, Malaysia
Wildlife-vehicle collision (WVC) has been a significant threat to endangered species in Malaysia. Due to excessive development, tropical rainforests and their inhabitants have been edged towards extinction. Road construction and other linear infrastructure, for instance, have caused forest destruction and forced wild animals out of their natural habitats to compete for resources with humans. In Malaysia, much precious wildlife has been lost to road accidents. Road signs and warning lights have been set up near wildlife crossings, but these do not help much. In this paper, we propose a wildlife surveillance mechanism that detects the presence of wildlife near roadways using visual and audio input. Machine learning classifiers, including Convolutional Neural Network (CNN), Support Vector Machine (SVM), K-nearest neighbors (KNN), and Naive Bayes, are adopted in the study. We focus on five types of wildlife that most frequently appear on the roads: elephants, tapirs, Malayan bears, tigers, and wild boars. Experimental results demonstrate that an accuracy as high as 99% can be achieved using the proposed approach. In contrast, the Naïve Bayes classifier ranks lowest in performance, with an accuracy of only up to 86%.
Deep learning; Fusion; Machine learning; Wildlife surveillance
Studies show that Malaysia has lost many precious wildlife species to road accidents (Jantan et al., 2020). For example, around 102 tapirs were killed in road accidents over the last decade. With a total tapir population of only about 1,000 left in Peninsular Malaysia, such fatal accidents impose significant losses. The problem of wildlife-human conflict is worsening because forests have been cleared to make way for development such as road infrastructure. Rapid growth has adversely altered wildlife profiles and destroyed natural habitats. The extensive invasion of the wildlife ecosystem has pushed many species towards extinction.
In response to the rise of wildlife-vehicle collisions (WVC) in the country, the Wildlife and National Parks Department has set up road signs at wildlife crossing areas. In addition, solar lights and transverse bars have been installed in crash-prone zones. Nevertheless, these measures are sometimes ineffective due to the natural responses of wild animals when encountering humans or vehicles. Instead of fleeing, wild animals sometimes become immobilized and experience inescapable shock when caught in a traumatic situation (Zanette et al., 2019). By the time they run into vehicles, it may be too late to escape. Drivers' honking or flashing lights does not chase them away but instead causes them to freeze and be hit by the cars.
To address WVC, this paper presents a wildlife surveillance approach that fuses visual- and audio-based inputs using machine learning methods. Automatic monitoring is performed in areas frequented by wildlife to detect animals that may approach the roadway. Instead of passively letting wildlife enter the highway, which might cause a WVC, the proposed approach takes a proactive measure by warning the relevant department so the animals can be chased away before they come near the roadway. In this study, we focus on five animal species: elephant, tapir, tiger, Malayan bear, and wild boar. These animals are commonly reported to be observed near human-populated areas.
Surveillance systems that rely on visual input, e.g., CCTV cameras, are easily affected by illumination and pose changes. For example, the animals' appearance changes when viewed from different angles (e.g., front, back, left, and right). The scene also changes at different times of the day, e.g., it appears clear under bright sunlight but becomes invisible at night. Therefore, there is a need to complement visual-based recognition with audio-based input to improve the reliability of the proposed system. In this study, a pipeline approach is presented to process both the visual and audio input for the animals using AI and machine learning techniques (Berawi, 2020; Siswanto, 2022; Fagbola, 2019). Discriminative features are extracted from both types of signals, and the extracted features are fused using a deep learning approach. Experimental results show that the fusion of visual and audio signals can greatly improve the overall accuracy of the proposed method.
2.1. Size of Dataset
In this study, we focus on five types of animals: tapir, elephant, Malayan bear, tiger, and wild boar. The dataset is mainly collected from Internet sources. It is difficult to find sources that contain both images and audio, so the pictures and audio files were downloaded separately. The image data mainly come from Kaggle and Google Images. Sound recordings are much harder to collect and are almost unavailable for certain animals such as the tapir and Malayan bear. As a result, videos of the animals were downloaded from YouTube and converted manually to audio files. The audio signals were then segmented to derive multiple samples from a single audio file. In the end, there are 215 samples each for the image and audio signals for each type of animal, giving a total of 1,075 (215 × 5) samples each for the image and audio data. Some sample images are illustrated in Figure 1.
Figure 1
Sample images in the dataset. From left to right: elephant, Malayan bear, tapir, tiger, wild boar
2.2. Pre-processing
The collected audio files are trimmed so that the duration of every file is standardized to one second. The audio files are converted to WAV format, with a sampling rate of 16,000 Hz, a mono channel, and 16-bit depth. To expand the audio dataset, data augmentation is further performed using techniques such as noise addition, time shifting, stretching, speed change, and pitch change.
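As an illustration, the snippet below sketches how a single recording could be standardized to the one-second, 16 kHz, mono, 16-bit WAV format described above. It assumes the librosa and soundfile libraries are available; the file names are hypothetical examples, not files from the study.

```python
# Minimal sketch of the audio standardization step described above.
import librosa
import soundfile as sf

TARGET_SR = 16000            # 16 kHz sampling rate
TARGET_LEN = TARGET_SR * 1   # one-second clips

def standardize_audio(in_path: str, out_path: str) -> None:
    # Load as mono and resample to 16 kHz
    signal, _ = librosa.load(in_path, sr=TARGET_SR, mono=True)
    # Trim or zero-pad so every clip is exactly one second long
    signal = librosa.util.fix_length(signal, size=TARGET_LEN)
    # Write as 16-bit PCM WAV
    sf.write(out_path, signal, TARGET_SR, subtype="PCM_16")

# Hypothetical file names for illustration only
standardize_audio("tapir_call.mp3", "tapir_call.wav")
```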
To pre-process the image files, the region of interest containing the animal is manually cropped from each image to remove unnecessary background. In addition, augmentation is applied to increase the number of image samples. After that, all images are resized to 150 × 150 pixels.
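The following sketch shows one possible way to carry out the cropping and resizing step with Pillow; the crop box and file names are hypothetical examples rather than values used in the study.

```python
# Illustrative sketch of the image pre-processing step.
from PIL import Image

def preprocess_image(in_path: str, out_path: str, crop_box: tuple) -> None:
    img = Image.open(in_path).convert("RGB")
    # Manually determined region of interest (left, upper, right, lower)
    img = img.crop(crop_box)
    # Resize every sample to the common 150 x 150 resolution
    img = img.resize((150, 150))
    img.save(out_path)

# Hypothetical file names and crop coordinates for illustration only
preprocess_image("elephant_raw.jpg", "elephant_roi.jpg", (120, 60, 620, 560))
```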
2.3. Audio Signal Processing
The audio signals are represented using Mel spectrogram features (Ulutas et al., 2022). The Mel spectrogram is one of the most potent features for learning the time and frequency representation of audio sequences. Several processes are involved in calculating the Mel spectrogram, and the Fast Fourier Transform (FFT) is one of the most important. Given N audio samples, the general expression for the output of the FFT is given as (Equation 1)
$F(k) = \sum_{n=0}^{N-1} f(nT)\,e^{-j 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1$    (1)

where $f(nT)$ refers to the n-th sample value, $k$ is the discrete frequency variable, and $T$ denotes the signal period.
The input to the Mel spectrogram computation is the WAV file. The output spectrogram illustrates the signal's intensity (or "loudness") at various frequencies across time. Figure 2 and Figure 3 show some sample audio signals and the corresponding spectrograms for the different species of animals, respectively.
Figure 2
Samples of original audio signals
Figure 3
Samples of spectrograms corresponding to the audio signals
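For illustration, a Mel spectrogram like those in Figure 3 could be computed from a standardized WAV file as sketched below using librosa; the FFT size, hop length, and number of Mel bands are assumed values, not settings reported in the paper.

```python
# Possible Mel spectrogram computation for a one-second, 16 kHz WAV clip.
import librosa
import numpy as np

def mel_spectrogram(wav_path: str, sr: int = 16000) -> np.ndarray:
    signal, _ = librosa.load(wav_path, sr=sr, mono=True)
    # Short-time Fourier transform (built on the FFT of Equation 1),
    # followed by projection onto the Mel filter bank
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=128)
    # Convert power to decibels so intensity ("loudness") is easier to read
    return librosa.power_to_db(mel, ref=np.max)

# Hypothetical file name for illustration only
spec = mel_spectrogram("elephant_call.wav")
```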
Data augmentation is applied to the audio signals to further broaden the dataset by synthesizing additional samples. The techniques used include noise addition, time shifting, stretching, speed change, and pitch change. Sample audio signals and their corresponding spectrograms after data augmentation are displayed in Figure 4.
Figure 4
Samples of audio signals (left) and the corresponding spectrograms (right) after data augmentation. From top to bottom: original signal, after noise addition, stretching, and pitch change
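A minimal sketch of these augmentation operations is given below, assuming librosa and NumPy; the noise factor, shift amount, stretch rate, and pitch step are illustrative choices rather than the study's settings.

```python
# Sketch of the audio augmentation techniques listed above.
import librosa
import numpy as np

def add_noise(signal, noise_factor=0.005):
    # Add Gaussian-distributed noise to the waveform
    return signal + noise_factor * np.random.normal(size=signal.shape)

def shift_time(signal, sr, shift_sec=0.2):
    # Circularly shift the waveform in time
    return np.roll(signal, int(sr * shift_sec))

def stretch(signal, rate=0.8):
    # Time-stretch without changing pitch
    return librosa.effects.time_stretch(signal, rate=rate)

def change_pitch(signal, sr, n_steps=2):
    # Shift pitch by a number of semitones without changing duration
    return librosa.effects.pitch_shift(signal, sr=sr, n_steps=n_steps)
```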
2.4. Visual Image Processing
To process the visual images, this project adopts Principal Component Analysis (PCA) (Fromentèze et al., 2022) to extract discriminative features from the wildlife images. PCA is a well-known dimensionality reduction technique that transforms high-dimensional data into a reduced space that retains most of the information of the original set.
Given a set of N training images of the wild animals $X_i$, the covariance matrix $C_x$ is first determined. After that, a linear transformation $V$ is found to transform the original dataset $X_i$ into a new subspace $Y$. The generalized eigenvalue problem is given as (Equation 2)

$C_x V = V \Lambda$    (2)

where $\Lambda$ is the diagonal matrix of eigenvalues and $V$ contains the eigenvectors. Then, the PCA transformation can be computed as (Equation 3)

$Y_i = V^\top X_i, \quad X_i \in \mathbb{R}^{D}$    (3)

where $D$ denotes the total number of pixels in the image.
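In practice, the projection of Equations 2 and 3 can be carried out with an off-the-shelf PCA implementation. The sketch below uses scikit-learn on flattened images; the placeholder data and the number of retained components are assumptions, not values from the study.

```python
# Hedged sketch of the PCA step: images are flattened into vectors and
# projected onto the leading eigenvectors of the covariance matrix
# (Equations 2 and 3).
import numpy as np
from sklearn.decomposition import PCA

# X_img: N x D matrix, each row a flattened 150 x 150 x 3 image
# (placeholder data for illustration only)
X_img = np.random.rand(1075, 150 * 150 * 3)

pca = PCA(n_components=100)        # assumed number of components
Y_img = pca.fit_transform(X_img)   # N x 100 reduced feature matrix
```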
2.5. Fusion and Classification of Audio- and
Visual-based Features
After the audio and visual samples have been processed, they are fused together to perform classification. In this paper, feature-level fusion is adopted: the feature sets from the two input sources are combined into a single feature set. The main benefit of feature-level fusion is that it can encode prominent characteristics that may improve classification performance by capturing correlated feature values from the distinct input modalities.
Firstly, all input images are resized to the same width and height of 150 × 150 pixels. Then PCA is applied to extract distinctive features from the input images. The audio samples, on the other hand, are converted from waveform to spectrogram. A spectrogram can be viewed as a pictorial representation of the audio signal, showing its spectrum of frequencies over time. Therefore, the spectrogram is also processed by PCA to obtain a compact audio signal representation. The last step is to concatenate the image and audio features into a new fused feature vector.
The extracted visual and audio features are concatenated to form an extended vector $z = [y_v, y_a] \in \mathbb{R}^{d_v + d_a}$, where $d_v$ and $d_a$ are the dimensions of the PCA-transformed output features for the visual and audio signals, respectively. After that, a classifier is used to classify $z$ into one of the animal classes. In this study, several classifiers, including Convolutional Neural Networks (CNN) (Gautam & Singhai, 2022), Support Vector Machines (SVM) (Ansari et al., 2021), Naïve Bayes (Khamdamovich & Elshod, 2021), and K-nearest neighbours (KNN) (Saleem & Kovari, 2022), are evaluated. Figure 5 shows the overall process in the proposed method.
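To make the pipeline concrete, the sketch below fuses PCA-reduced visual and audio features at the feature level and compares several of the classical classifiers mentioned above (the CNN branch is omitted for brevity). All feature matrices, dimensions, and hyperparameters are placeholder assumptions rather than the paper's exact settings.

```python
# Illustrative sketch of feature-level fusion and classifier comparison.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Placeholder feature matrices: flattened images and flattened spectrograms
X_img = np.random.rand(1075, 150 * 150)
X_aud = np.random.rand(1075, 128 * 63)
y = np.repeat(np.arange(5), 215)   # five animal classes, 215 samples each

# Reduce each modality with PCA, then concatenate into one fused vector z
z = np.hstack([PCA(n_components=100).fit_transform(X_img),
               PCA(n_components=100).fit_transform(X_aud)])

z_tr, z_te, y_tr, y_te = train_test_split(z, y, test_size=0.2, stratify=y)

for name, clf in [("SVM", SVC()), ("KNN", KNeighborsClassifier()),
                  ("Naive Bayes", GaussianNB())]:
    clf.fit(z_tr, y_tr)
    print(name, clf.score(z_te, y_te))
```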