• International Journal of Technology (IJTech)
  • Vol 13, No 6 (2022)

Fusion of Visual and Audio Signals for Wildlife Surveillance

Fusion of Visual and Audio Signals for Wildlife Surveillance

Title: Fusion of Visual and Audio Signals for Wildlife Surveillance
Cheng Hao Ng, Tee Connie, Kan Yeep Choo, Michael Kah Ong Goh

Corresponding email:

Cite this article as:
Ng.C.H., Connie, T., Choo, K.Y., Goh, M.K.O., 2022. Fusion of Visual and Audio Signals for Wildlife Surveillance. International Journal of Technology. Volume 13(6), pp. 1213-1221

Cheng Hao Ng Faculty of Information Science & Technology, Multimedia University, 75450, Melaka, Malaysia
Tee Connie Faculty of Information Science & Technology, Multimedia University, 75450, Melaka, Malaysia
Kan Yeep Choo Faculty of Engineering, Multimedia University, Persiaran Multimedia, 63100, Cyberjaya, Malaysia
Michael Kah Ong Goh Faculty of Information Science & Technology, Multimedia University, 75450, Melaka, Malaysia
Email to Corresponding Author

Fusion of Visual and Audio Signals for Wildlife Surveillance

Wildlife-vehicle collision (WVC) has been a significant threat to endangered species in Malaysia. Due to excessive development, tropical rainforests and their inhabitants have been edged towards extinction. Road buildings and other linear infrastructures, for instance, have caused forest destruction and forced wild animals to come out from their natural habitats to compete for resources with the human-beings. In Malaysia, much precious wildlife have been lost due to road accidents. Road signs and warning lights have been set up near wildlife crossing, but these do not help much. In this paper, we aim to propose a wildlife surveillance mechanism to detect the existence of wildlife near roadways using visual and audio input. Machine learning classifiers, including Convolution Neural Network (CNN), Support Vector Machine (SVM), K-nearest neighbors (KNN), and Naive Bayes, are adopted in the study. We focus on five types of the most frequently occurring wildlife on the roads: elephants, tapirs, Malayan bears, tigers, and wild boars. Experimental results demonstrate that a good accuracy as high as 99% can be achieved using the proposed approach. On the other hand, the Naïve Bayes classifier ranks the lowest in performance with an accuracy value only up to 86%.

Deep learning; Fusion; Machine learning; Wildlife surveillance


    Studies show that Malaysia has lost many precious widlife species due to road accidents (Jantan et al., 2020). For example, around 102 tapir had been killed by road accidents in the last decade. Because of the total tapir population of only about 1,000 left in Peninsular Malaysia, such fatal accidents impose significant losses. The problem of wildlife-human conflict is worsening because the forests have been cleared to make way for development like road infrastructure building. Rapid growth has adversely altered wildlife profiles and destroyed their natural habitats. The extensive invasion of the wildlife ecosystem has pushed them to extinction.

       In response to the rise of wildlife-vehicle collision (WVC) in the country, the Wildlife and National Parks Department has set up road signs at wildlife crossing areas. Besides, solar lights and transverse bars have also been installed at the crash-prone zones. Nevertheless, these measures are sometimes ineffective due to the natural responses of the wild animals when encountering human-being or vehicles. Instead of fleeing, wild animals would sometimes become immobilized and experience inescapable shocks when caught in a traumatic situation (Zanette et al., 2019).   It might  be too late to  escape  when  they  run into vehicles. Drivers’ honking or flashing lights would not chase them away but instead causes them to freeze and be hit by the cars.                
    To address WVC, this paper presents a fusion of visual- and audio-based wildlife surveillance approach using machine learning methods. Automatic monitoring is performed at wildlife frequently occurring areas to detect potential wild animals that will go near the roadway. Instead of passively letting the wilderness enter the highway which might cause possible WVC, the proposed approach takes proactive measure to warn the related department to chase the animals away before they come near the roadway. In this study, we focus on five animal species: elephant, tapir, tiger, Malaysian bear, and wild boar. These animals are commonly reported to be observed near man areas.
        Surveillance systems that rely on visual input, e.g., CCTV cameras, are easily affected by illumination and pose changes. For example, the appearance of the animals’ changes when viewed from different angles (e.g., front, back, left, and right). Besides, the scene also changes when acquired during different times of the day, , e.g., the scene appears clear under bright sunlight, but the scene becomes invisible during the night. Therefore, there is a need to complement visual-based recognition with audio-based input to improve the reliability of the proposed system. In this study, a pipeline approach is presented to process both the visual and audio input for the animals using AI and machine learning techniques (Berawi, 2020; Siswanto, 2022; Fagbola, 2019). Discriminative features are extracted from both types of signals and the extracted features are fused using deep learning approach. Experimental results show that the fusion of visual and audio signal can greatly improve the overall accuracy of the proposed method.

Experimental Methods

2.1. Size of Dataset

        In this study, we focus on five types of animals: tapir, elephant, Malayan bear, tiger, and wild boar. The dataset are mainly collected from Internet sources. It is difficult to find sources that contain both image and audio sound. Therefore, the pictures and audio files are downloaded separately from the Internet. The image data sources mainly come from Kaggle and Google Images. As for sound sources, it is tough to collect, and it is almost not available for certain animals such as tapir and Malayan bear. As a result, the videos of the animals are downloaded from YouTube. After that conversion, the video files were converted manually to audio files. After that, the audio signals were segmented to derive multiple samples from a single audio file. In the end, there are 215 samples each for both image and audio signals, for each type of animals. So, there are a total of 1075 (215 x 5) samples each for image and audio data. Some sample images are illustrated in Figure 1.


Figure 1 Sample images in the dataset. Left column to right: elephants, Malayan bear, tapir, tiger, wild boar

2.2. Pre-processing

    The collected audio files are trimmed so that all the files' duration is standardized to one second. The audio files are converted to WAV file format, with a sampling rate of 16000 HZ, mono channels, and 16-bit depth. To expand the dataset for audio signals, data augmentation is further performed. Various techniques like addition of distribution noise, shift time, stretching, change of speed, and change of pitch are applied. 

    To pre-process the image files, the region of interest containing the animals are manually cropped from the image. The purpose of performing this step is to remove unnecessary background from the pictures. In addition, augmentation is also applied to increase the image samples. After that, all the images are converted to size 150 x 150 pixels.

2.3.  Audio Signal Processing

        The audio signals are represented using Mel spectrogram features (Ulutas et al., 2022). Mel spectrogram is one of the most potent features used to learn the time and frequency representation from the audio sequences.  Several processes are involved in calculating Mel spectrogram and Fast Fourier Transform (FFT) is one the important processes. Given N number of audio samples, the general expression to calculate the output of FFT is given as (Equation 1),

where f(n) refers to the n-th sample value, k is the discrete frequency variable, and T denotes the signal period.

        The input to Mel spectrogram is WAV files. The output spectrogram illustrates the signal's intensity (or "loudness") at various frequencies in a waveform across time. Figure 2 and Figure 3 show some sample audio signals and the spectrograms for the different species of animals, respectively.

Figure 2 Samples of original audio signals


Figure 3 Samples of spectrograms corresponding to the audio signals

        Data augmentation is applied on the audio signals to broaden the dataset size further to synthesize additional samples. The techniques used for data augmentation include addition of distribution noise, shift time, stretching, change of speed, and change of pitch. Sample audio signals, and their corresponding spectrogram after data augmentation are displayed in Figure 4.


Figure 4 Samples of audio signals (left) and the corresponding spectrogram (right) after data augmentation. From top to bottom: Original image, after noise addition, stretching, change of pitch

2.4.  Visual Image Process

        To process the visual images, this project adopts Principal Component Analysis (PCA) (Fromentèze et al., 2022) to extract discriminative features from the wildlife images. PCA is a well-known dimensionality reduction technique that transforms high-dimensional data into a reduced space that maintains the multitude of values of the more extensive set.

        Given a set of N training images of the wild animals the covariance matrix, Cx, is first determined. After that, a linear transformation V is found to transform the original dataset Xi into a new subspace Y. The generalized eigenvalue problem is given as (Equation 2),

where is the diagonal matrix and are the eigenvectors. Then, the PCA transformation can be computed as (Equation 3),

where denotes the total number of pixels in the image.

2.5.  Fusion and Classification of Audio- and Visual-based Features

        After the audio and visual samples have been processed, they are fused together to perform classification. In this paper, feature-level fusion is adopted. In feature-level fusion, feature sets from the two input sources are combined into a single feature set. The real benefit of feature-level fusion is that it is able to encode prominent characteristics that might increase classification performance by detecting associated feature values provided by the distinct input modalities.

        Firstly, all input images are resized to the same width and high of 150 × 150 pixels. Then PCA is applied to extract unique features from the input images. On the other hand, the audio samples are converted from waveform to spectrogram.  A spectrogram can be viewed as a pictorial representation of the audio signals representing a spectrum of frequencies of the movement. Therefore, the spectrogram is also processed by PCA to obtain a compact audio signal representation. The last step is to concatenate the feature of image and audio together becomes a new feature.

        The extracted visual and audio features are concatenated to form an extended vector,  are the dimension of the transformed output features using PCA for the visual and audio signals, respectively. After that, a classifier is used to classify  into one of the animal classes. In this study, many classifiers, including Convolutional Neural Networks (CNN) (Gautam & Singhai, 2022), Support Vector Machines (SVM) (Ansari et al., 2021), Naïve Bayes (Khamdamovich & Elshod, 2021), and K-nearest neighbours (KNN) (Saleem & Kovari, 2022) are evaluated. Figure 5 shows the overall process in the proposed method.