• International Journal of Technology (IJTech)
  • Vol 14, No 6 (2023)

A Combination of Light Pre-trained Convolutional Neural Networks and Long Short-Term Memory for Real-Time Violence Detection in Videos

Muhammad Shahril Nizam Bin Abdullah, Hezerul Abdul Karim, Nouar AlDahoul

Cite this article as:
Abdullah, M.S.N.B., Abdul Karim, H., AlDahoul, N. 2023. A Combination of Light Pre-trained Convolutional Neural Networks and Long Short-Term Memory for Real-Time Violence Detection in Videos. International Journal of Technology. Volume 14(6), pp. 1228-1236

Muhammad Shahril Nizam Bin Abdullah Faculty of Engineering, Multimedia University, 63100 Cyberjaya, Selangor, Malaysia
Hezerul Abdul Karim Faculty of Engineering, Multimedia University, 63100 Cyberjaya, Selangor, Malaysia
Nouar AlDahoul Computer Science, New York University, Abu Dhabi, United Arab Emirates

Machine learning techniques have been widely used to analyze videos containing violent scenes for censorship or surveillance purposes. Violence detection plays a crucial role in preventing children and teenagers from being exposed to violent acts and in ensuring a safer viewing environment. Automatic identification of violent scenes requires classifying videos into two classes: violence and non-violence. Existing violence detection models suffer from several problems, including memory inefficiency and slow inference, which make them unsuitable for embedded systems with limited resources. This article proposes a novel combination of a light Convolutional Neural Network (CNN), namely EfficientNet-B0, and Long Short-Term Memory (LSTM). A public dataset that combines two different datasets was used to train, evaluate, and compare the deep learning models in this study. The experimental results show the superiority of EfficientNet B0-LSTM, which outperforms the other models in terms of accuracy (86.38%), F1 score (86.39%), and False Positive Rate (13.53%). Additionally, the proposed model has been deployed to a low-cost embedded device, the Raspberry Pi, for real-time violence detection.

Convolutional Neural Network (CNN); Long Short-Term Memory (LSTM); Pre-trained model; Violence detection


Violence detection uses machine learning and computer vision techniques to analyze videos retrieved from surveillance and security cameras, or readily available videos, as in the case of film censorship (Ramzan et al., 2019). Violence detection and censorship are essential for preventing children and youths from becoming involved in violent acts, as violent scenes have significant implications for the character development and mental state of the audience (Hassan, Osman, and Azarian, 2009). The World Health Organization (WHO) has also reported that suicide is the second leading cause of death among people aged 15 to 19 (Abbas, 2019).

Artificial intelligence (AI) has been used for various activity recognition tasks, including the recognition of pornographic activities (Hor et al., 2022) and animal activities (Ong, Connie, and Goh, 2022). In addition, anomaly prediction using machine learning techniques, a subset of AI, has been investigated to address gas transmission (Noorsaman et al., 2023) and electricity challenges (EL-Hadad, Tan, and Tan, 2022). Likewise, various approaches have been implemented for human activity recognition in videos to distinguish normal from abnormal activities (i.e., violent acts) (Sultani, Chen, and Shah, 2018).

This paper presents several classification models that combine pre-trained CNNs for feature extraction with an LSTM for time series classification, and that obtained higher accuracy than previous experiments reported in other research. The models extract features from video frames and classify the sequence of extracted features into two classes: violence and non-violence. The feature extractors were chosen because they have fewer parameters than other available pre-trained models, which makes the system easier to deploy on edge devices.

Related Work

Most previous works have utilized small public datasets, including Movies with 200 videos (Nievas et al., 2011a), Hockey Fights with 1000 videos (Nievas et al., 2011b), and VFD with 246 videos (Hassner, Itcher, and Kliper-Gross, 2012). For instance, a spatiotemporal encoder based on the Bidirectional Convolutional LSTM (BiConvLSTM) architecture was used to detect violent acts (Yu et al., 2019). A combination of CNN and LSTM was trained and evaluated with a public dataset that consists of two different datasets, RWF-2000 (Cheng, Cai, and RWF, 2021) and RLVS-2000 (Soliman et al., 2019).
The combination of CNN and Convolutional Long Short-Term Memory (convLSTM) was investigated using the Hockey Fight Dataset, the Movies Dataset, and the Violent-Flows Crowd Violence Dataset (Sudhakaran and Lanz, 2017). Although good performance was achieved, implementing it on a low-cost Internet of Things (IoT) node presents a challenge. Another work aimed to detect violence using a memory- and computation-efficient, low-cost IoT node (AlDahoul et al., 2021). However, that method used a small, arbitrarily defined sequential model to extract features from the dataset, and it achieved less than 80% accuracy on the test set when combined with the LSTM classifier.

Experimental Methods

Two datasets were used in this research, and each is described in this section. The section also describes the methodology in terms of the different feature extractors, which are pre-trained CNNs, and the classification of the extracted features using LSTM. Finally, it discusses the experimental results and explains the findings obtained with this methodology.

3.1. Dataset Overview

Both datasets were obtained from open sources and were organized differently by their respective authors. The RWF-2000 dataset (Cheng, Cai, and RWF, 2021) contains videos captured by Closed-Circuit Television (CCTV) cameras, representing diverse real-life situations. It contains 1000 non-violence videos and 1000 violence videos, which were already divided into training and validation sets. The RLVS-2000 dataset (Soliman et al., 2019) comprises videos collected by its authors, sourced primarily from online platforms such as YouTube. In contrast to RWF-2000, RLVS-2000 offers a wider variety of video content, including fighting sequences from movies and staged fighting scenes recorded by groups of individuals to demonstrate fighting scenarios. It also contains scenes from different environments recorded using various devices, such as dash cams and mobile phones. The RLVS-2000 dataset is divided into two classes: 1000 violence videos and 1000 non-violence videos. To increase variation and improve detection accuracy, this research combined the two datasets into a total of 4000 videos: 3200 videos used for training and validation and 800 used for testing. Figure 1 shows samples of the non-violence videos used in this research, while Figure 2 shows samples of the violence videos. The division of videos into training, validation, and testing is summarized in Table 1.
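The combination and hold-out described above can be sketched in a few lines of Python. This is only an illustration: the text does not say how the 800 test videos were drawn, so the stratified 80/20 draw, the random seed, and the video identifiers below are all assumptions.

```python
import random

DATASETS = ("RWF-2000", "RLVS-2000")
LABELS = ("violence", "non-violence")


def build_split(videos_per_group=1000, test_fraction=0.2, seed=0):
    """Combine two balanced datasets and hold out a stratified test set."""
    rng = random.Random(seed)
    train_val, test = [], []
    # Split each (dataset, label) group separately so both subsets stay balanced.
    for dataset in DATASETS:
        for label in LABELS:
            # Hypothetical identifiers standing in for real video paths.
            group = [(dataset, label, i) for i in range(videos_per_group)]
            rng.shuffle(group)
            n_test = round(len(group) * test_fraction)
            test.extend(group[:n_test])
            train_val.extend(group[n_test:])
    return train_val, test


train_val, test = build_split()
print(len(train_val), len(test))  # 3200 800
```

Stratifying by both source dataset and label keeps the test set at 400 violence and 400 non-violence videos, which makes accuracy and false positive rate easier to interpret.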

Figure 1 Non-Violence videos thumbnails

Figure 2 Violence videos thumbnails

Table 1 Summary of videos

No. of Videos in Dataset | Use for

3.2. Methods

This section describes the pre-trained CNNs that were chosen as feature extractors because of their small size and high accuracy on ImageNet. These CNNs are DenseNet-121, EfficientNet B0, MobileNetV2 and ShuffleNet. An LSTM was added after each CNN to model the time series of extracted features for violence/non-violence classification.

3.2.1.  The solution pipeline

The experiments carried out in this work were divided into several stages, as shown in Figure 3:

1. Data pre-processing to read each video and convert it to a sequence of frames for further processing.

2. Feature extraction using various pre-trained CNNs such as DenseNet-121, EfficientNet B0, MobileNetV2 and ShuffleNet-v1 to extract a feature vector from each frame.

3. Training the LSTM for classification. The sequence of feature vectors extracted from the video frames was fed to the LSTM to be classified as violence or non-violence.

4. Model evaluation in terms of accuracy, recall, precision, and F1 score, as listed in the classification report and computed from the numbers of True Positives, True Negatives, False Positives, and False Negatives.

Figure 3 Block diagram for violence detection in videos

The 4000 videos were converted into a Python-accessible format using OpenCV, the Open-Source Computer Vision Library. OpenCV represents frames as arrays that can be manipulated directly with NumPy, and vector and mathematical operations on these arrays were used to characterize visual patterns and their properties. The videos were pre-processed into NumPy arrays, which were then used during feature extraction and model training.
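A minimal sketch of this pre-processing step is given below. The number of frames kept per video (20), the 224x224 frame size, and the evenly spaced sampling are assumptions made for illustration; the text does not state the values actually used. The function names are hypothetical.

```python
def sample_frame_indices(total_frames, num_frames):
    """Return up to num_frames indices spread evenly over [0, total_frames)."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


def video_to_array(path, num_frames=20, size=(224, 224)):
    """Read a video file into a uint8 array of shape (n, height, width, 3)."""
    # Imported lazily so the index helper above works without OpenCV installed.
    import cv2
    import numpy as np

    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in sample_frame_indices(total, num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the chosen frame
        ok, frame = cap.read()
        if not ok:
            continue  # skip frames that fail to decode
        frame = cv2.resize(frame, size)  # size is (width, height)
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV decodes as BGR
    cap.release()
    return np.array(frames, dtype=np.uint8)
```

Sampling a fixed number of frames gives every video the same sequence length, which simplifies batching for the LSTM stage.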

The experiment was repeated four times, each time with a different pre-trained CNN, to determine which combination of CNN and LSTM produced the best result for violence detection in videos. Each feature extractor took the converted videos (as NumPy arrays) as input and passed them through its layers, producing an output whose size depends on the architecture. These output sizes determined the number of parameters of the LSTM classifier, as shown in Table 2.
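The link between the extractor's output size and the classifier's parameter count follows from the standard LSTM parameter formula, sketched here. The feature sizes and the 10 LSTM units are hypothetical values chosen for illustration (the unit count is loosely based on the (None, 10) output shape in Table 3); they are not taken from Table 2.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters of one standard LSTM layer (four gates, with bias)."""
    # Each gate has an input kernel, a recurrent kernel, and a bias vector.
    return 4 * (input_dim * units + units * units + units)


# Hypothetical per-frame feature sizes for two extractors.
for name, feature_dim in [("extractor A", 1024), ("extractor B", 1280)]:
    print(name, lstm_param_count(feature_dim, units=10))
# extractor A 41400
# extractor B 51640
```

Because the input kernel grows linearly with the feature size, an extractor with a larger output produces a proportionally larger first LSTM layer, even when every later layer is unchanged.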

Table 2 Comparison between various feature extractors utilized for violence detection

Feature Extractor | No. of Parameters | Input Shape | CNN Output Shape
EfficientNet B0 | | |

3.2.2.  LSTM as the Classifier

The feature arrays, which had a 3-dimensional output shape after the feature extraction stage, were reshaped into 2-dimensional NumPy arrays. The reshaped arrays were used as input to the LSTM classifier, which was built with the Keras Sequential model and consists of one input layer, a stack of hidden layers, and one output layer. The layers used include LSTM, Dense, and Activation layers. The number of neurons in each layer was chosen to minimize the model's parameter count while maintaining good accuracy. Table 3 shows the final LSTM architecture used in this work for training, validation, and testing.

Table 3 LSTM Classifier Hyperparameters and Architecture

LSTM Architecture | Output Shape | No. of Parameters
 | (None, 10) |
 | (None, 512) |
ReLU Activation | (None, 512) |
 | (None, 1024) |
ReLU Activation | (None, 1024) |
 | (None, 2) |
SoftMax Activation | (None, 2) |

The LSTM classifier was compiled with the categorical cross-entropy loss, and accuracy was the monitored metric. The Adam optimizer was used with a learning rate of 0.01, and the model was trained for 50 epochs with a batch size of 30.
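A possible Keras construction of this classifier is sketched below. The layer widths follow the output shapes in Table 3, but the layer types of the unnamed rows (one LSTM layer followed by Dense layers) are an assumption, as are the names `LAYER_SPECS`, `TRAINING`, and `build_classifier`.

```python
# Widths follow the output shapes in Table 3; the layer types are assumed.
LAYER_SPECS = [
    ("lstm", 10),
    ("dense", 512), ("relu", None),
    ("dense", 1024), ("relu", None),
    ("dense", 2), ("softmax", None),
]
TRAINING = {"learning_rate": 0.01, "epochs": 50, "batch_size": 30}


def build_classifier(timesteps, feature_dim):
    """Build and compile the sequence classifier (requires TensorFlow)."""
    from tensorflow import keras  # lazy import keeps the specs usable without TF

    model = keras.Sequential()
    model.add(keras.Input(shape=(timesteps, feature_dim)))
    for kind, units in LAYER_SPECS:
        if kind == "lstm":
            model.add(keras.layers.LSTM(units))
        elif kind == "dense":
            model.add(keras.layers.Dense(units))
        else:
            model.add(keras.layers.Activation(kind))
    model.compile(
        loss="categorical_crossentropy",
        # Set explicitly rather than relying on the library default.
        optimizer=keras.optimizers.Adam(learning_rate=TRAINING["learning_rate"]),
        metrics=["accuracy"],
    )
    return model
```

With one-hot encoded labels, training would then be a call such as `model.fit(x_train, y_train, epochs=TRAINING["epochs"], batch_size=TRAINING["batch_size"], validation_data=(x_val, y_val))`, where the array names are placeholders.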

3.2.3.  Hardware implementation for inferencing purpose

The lightweight pre-trained CNN and the LSTM were combined to predict violence in videos, which makes the model suitable for devices with limited computing power and physical memory. A High-Definition (HD) USB web camera captured the sequence of images in real time and served as the input. A Raspberry Pi 4 was used as the processing unit, running the trained model on the series of images captured from the webcam. An RGB LED connected to the Raspberry Pi 4 acted as an actuator, showing one color when a violent situation is detected and another when it is not.
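One way to organize such a real-time loop is a sliding window over the most recent per-frame features, sketched below. The window logic, the class and function names, and the LED colors (red for violence, green otherwise) are assumptions; the text does not describe how frames were buffered or which colors were used.

```python
from collections import deque


class SlidingWindowDetector:
    """Keep the latest seq_len feature vectors and classify each full window."""

    def __init__(self, predict_fn, seq_len):
        self.predict_fn = predict_fn  # maps a list of feature vectors to True/False
        self.window = deque(maxlen=seq_len)

    def push(self, feature):
        """Add one frame's features; return a bool once the window is full, else None."""
        self.window.append(feature)
        if len(self.window) < self.window.maxlen:
            return None
        return bool(self.predict_fn(list(self.window)))


def led_color(is_violent):
    """RGB tuple for the indicator LED (assumed colors: red = violence, green = none)."""
    return (1, 0, 0) if is_violent else (0, 1, 0)


# Toy stand-in for the trained CNN-LSTM: flags a window whose values sum above 2.
detector = SlidingWindowDetector(lambda window: sum(window) > 2, seq_len=3)
results = [detector.push(x) for x in (1, 1, 1, 0)]
print(results)  # [None, None, True, False]
```

On the Raspberry Pi, the returned tuple could be assigned to the `color` attribute of a `gpiozero.RGBLED` object created with whichever GPIO pins the LED is wired to, and `predict_fn` would wrap the feature extractor and the trained classifier.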

Results and Discussion

This section covers the experimental part of the research, including the software and hardware used to set up and implement the methodology. It also presents the results of the comparison between the feature extractors connected to the LSTM for violence and non-violence classification.

4.1. Experimental Setup

The proposed methods were trained on a laptop with an Intel Core i5-7200U processor at a 2.50GHz base frequency and 20GB of DDR4 RAM. In other words, the experiments did not involve a Graphics Processing Unit (GPU). The purpose was to assess the suitability of the chosen lightweight pre-trained CNNs for running on edge devices with constrained resources, including limited computation and memory.

4.2. Experiments and Results

The results of the experiments are presented using the confusion matrix and the classification report. The confusion matrix gives a visual representation of the classification model's performance, showing the numbers of true positives, true negatives, false positives, and false negatives for each class. This allows the model to be assessed on its ability to classify instances correctly and reveals any misclassifications. Classification reports were also generated to provide a more detailed analysis of the model's performance. These reports include accuracy, recall, precision, and F1 score, calculated for each class. Together, these metrics show how accurately the model identifies instances of each class, how many of the relevant instances it recalls, how precise its positive predictions are, and how well precision and recall are balanced, as represented by the F1 score.

4.2.1.  Confusion Matrix

Figure 4 shows the results of the experiments conducted in this work, presented as confusion matrices. The experiments were repeated several times for each feature extractor, and the results show the model performance on the testing dataset of 800 videos. A confusion matrix is a table that displays the number of correct and incorrect predictions made by the classifier. It helps to evaluate the classifier and to calculate several classification metrics, such as accuracy, recall, precision, and F1 score.
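The metrics named above can be derived from the four confusion-matrix counts as in the following sketch. It assumes "violence" is treated as the positive class, and the counts in the example are illustrative only; they are not the results reported in Figure 4.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard binary metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Share of non-violence videos wrongly flagged as violence.
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Illustrative counts only.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
print({name: round(value, 4) for name, value in metrics.items()})
```

For a surveillance or censorship system the false positive rate matters alongside accuracy, since each false positive is a non-violent video that would be flagged or blocked.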

Based on Figure 4, the combination of EfficientNetB0 and the LSTM classifier outperformed the other combinations of CNNs and LSTM. This shows that EfficientNetB0, used for feature extraction, worked well with the proposed LSTM classifier in classifying videos as violence or non-violence. MobileNetV2 ranked second in the number of correctly predicted videos, with 82.88% of the videos in the test set predicted correctly.