|Rasoul Banaeeyan||Faculty of Engineering, Multimedia University, Cyberjaya, 63100, Malaysia|
|Hezerul Abdul Karim||Faculty of Engineering, Multimedia University, Cyberjaya, 63100, Malaysia|
|Haris Lye||Faculty of Engineering, Multimedia University, Cyberjaya, 63100, Malaysia|
|Mohammad Faizal Ahmad Fauzi||Faculty of Engineering, Multimedia University, Cyberjaya, 63100, Malaysia|
|Sarina Mansor||Faculty of Engineering, Multimedia University, Cyberjaya, 63100, Malaysia|
|John See||Faculty of Computing & Informatics, Multimedia University, Cyberjaya, 63100, Malaysia|
The main objective of this paper is pornography recognition using audio features. Unlike most previous attempts, which have concentrated on the visual content of pornographic images or videos, we propose to take advantage of sound. Sound is particularly important when the visual features are not adequately informative of the content (e.g., cluttered scenes, dark scenes, scenes with a covered body). Our hypothesis is that scenes with pornographic content contain audio with features specific to those scenes; these sounds can be in the form of speech or voice. More specifically, we extract two types of features, (I) pitch and (II) mel-frequency cepstrum coefficients (MFCC), and train five different variations of the k-nearest neighbor (KNN) supervised classification model on the fusion of these features. We then tested the hypothesis through a set of evaluations on a porno-sound dataset derived from an existing pornography video dataset. The experimental results confirm the feasibility of the proposed acoustic-driven approach, demonstrating an accuracy of 88.40%, an F-score of 85.20%, and an area under the curve (AUC) of 95% in the task of pornography recognition.
Acoustic recognition; KNN classifier; MFCC features; Pornography detection
Filtering inappropriate visual content from different sources (internet television (TV), web pages, etc.) is a primary concern in environments such as schools, homes, and workplaces. In some countries, such as Malaysia, Indonesia, and Brunei, all TV channel providers are expected to obtain suitability approval before granting access to their subscribers or public users.
One part of the suitability assessment involves pornography recognition, which usually imposes a large censorship cost on service providers because they must recruit a large workforce to work continuously over several months.
The main purpose of this research is to facilitate the task of pornography detection by exploiting the distinctive power of acoustic features (as explained in Section 3). More specifically, this study proposes employing pitch and mel-frequency cepstrum coefficient (MFCC) acoustic features, which represent both voiced and unvoiced sounds.
Although there have been several attempts to address the problem of pornography recognition (Caetano et al., 2016; Geng et al., 2016; Moreira et al., 2016; Nian et al., 2016; Zhou et al., 2016; Jin et al., 2018; More et al., 2018; Nurhadiyatna et al., 2018; Shen et al., 2018), almost all of them have relied on visual content to automate the task of sensitive content detection.
The paper is organized as follows. Section 2 briefly reviews recent related work in the domain. Section 3 presents the design framework of the proposed acoustic-driven pornography recognition, as well as the details of the system design employed in this study. Section 4 details the experimental setup and procedures followed in our research to facilitate the reproducibility of the results. Section 5 presents and discusses the results of the different experiments, and Section 6 concludes the paper and states some possible future directions.
In this research, we used acoustic information extracted from video clips in order to train different supervised classification models and test the feasibility of acoustic-driven features in the task of pornography recognition. More specifically, two types of features, pitch and MFCC, were employed to construct acoustic representations of the audio tracks.
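The feature construction described above can be illustrated with a minimal sketch. It assumes an autocorrelation-based pitch estimator and fusion by simple concatenation; the source does not state which pitch algorithm or fusion scheme was used, so both are assumptions, as are the names `estimate_pitch_hz` and `fuse_features`, the 50 to 500 Hz search range, and the 13-value placeholder standing in for a real MFCC vector.

```python
import math


def estimate_pitch_hz(frame, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate the pitch of one audio frame from its autocorrelation peak.

    Assumption: the search range fmin..fmax is illustrative only.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalised autocorrelation of the frame at this lag.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def fuse_features(pitch_hz, mfcc):
    """Fuse the two descriptors by concatenation (an assumed fusion scheme)."""
    return [pitch_hz] + list(mfcc)


# Synthetic 200 Hz tone standing in for one frame of a real soundtrack.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(400)]

pitch = estimate_pitch_hz(frame, sample_rate)
placeholder_mfcc = [0.0] * 13  # hypothetical: a real MFCC extractor would fill this
features = fuse_features(pitch, placeholder_mfcc)
print(round(pitch, 1), len(features))
```

In practice the MFCC values would come from an audio library, and the fused vectors from many frames or clips would form the training matrix for the classifiers.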
We constructed a new audio dataset of pornography soundtracks comprising training and testing partitions. After conducting multiple experiments, the best performance in terms of recall, F-score, and AUC was achieved by the Medium KNN, and the highest precision and accuracy were obtained by Cosine KNN and Weighted KNN, respectively.
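To make the difference between these KNN variants concrete, the following sketch implements three of them on toy data and computes precision, recall, F-score, and accuracy. It assumes that "Cosine" refers to cosine distance and "Weighted" to inverse-squared-distance voting; the value k = 3, the toy feature vectors, and all function names are hypothetical and are not taken from the paper. AUC is omitted because it requires ranking scores rather than hard labels.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def knn_predict(train_x, train_y, query, k=3, distance=euclidean, weighted=False):
    """Classify one query by voting among its k nearest training points."""
    neighbours = sorted((distance(x, query), y) for x, y in zip(train_x, train_y))[:k]
    votes = defaultdict(float)
    for d, label in neighbours:
        # Weighted variant: closer neighbours count more (inverse squared distance).
        votes[label] += 1.0 / (d * d + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)


def binary_metrics(y_true, y_pred, positive=1):
    """Return precision, recall, F-score, and accuracy for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score, (tp + tn) / len(pairs)


# Toy 2-D feature vectors (hypothetical); label 1 = positive class, 0 = negative.
train_x = [[3.0, 1.0], [4.0, 1.5], [3.5, 0.5], [1.0, 3.0], [0.5, 4.0], [1.5, 3.5]]
train_y = [1, 1, 1, 0, 0, 0]
queries = [[3.2, 1.1], [1.0, 3.6]]

plain = [knn_predict(train_x, train_y, q) for q in queries]
cosine = [knn_predict(train_x, train_y, q, distance=cosine_distance) for q in queries]
weighted = [knn_predict(train_x, train_y, q, weighted=True) for q in queries]

metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
print(plain, cosine, weighted, metrics)
```

On well-separated toy data the three variants agree; differences such as those reported above would only appear on real, overlapping feature distributions.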
In future work, we intend to extend our research by investigating the effects of other pitch-based feature descriptor algorithms, such as those reported by Drugman and Alwan (2011), Gonzalez and Brookes (2011), Hermes (1988), and Noll (1967). We will also explore the performance of different supervised and unsupervised learning models on a larger pornography audio dataset.
This research was fully funded by TELEKOM Malaysia Research and Development (TM R&D).
Atal, B.S., 1972. Automatic Speaker Recognition based on Pitch Contours. The Journal of the Acoustical Society of America, Volume 52(6B), pp. 1687–1697
Caetano, C., Avila, S., Schwartz, W.R., Guimarães, S.J.F., Araújo, A. de A., 2016. A Mid-level Video Representation based on Binary Descriptors: A Case Study for Pornography Detection. Neurocomputing, Volume 213, pp. 102–114
Dalal, N., Triggs, B., 2005. Histograms of Oriented Gradients for Human Detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 20-25 June 2005
Drugman, T., Alwan, A., 2011. Joint Robust Voicing Detection and Pitch Estimation based on Residual Harmonics. In: Twelfth Annual Conference of the International Speech Communication Association, 27-31 August 2011
Foggia, P., Petkov, N., Saggese, A., Strisciuglio, N., Vento, M., 2015. Reliable Detection of Audio Events in Highly Noisy Environments. Pattern Recognition Letters, Volume 65, pp. 22–28
Foggia, P., Petkov, N., Saggese, A., Strisciuglio, N., Vento, M., 2016. Audio Surveillance of Roads: A System for Detecting Anomalous Sounds. IEEE Transactions on Intelligent Transportation Systems, Volume 17(1), pp. 279–288
Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M., 2017. Audio Set: An Ontology and Human-labeled Dataset for Audio Events. In: International Conference on Acoustics, Speech, and Signal Processing, pp. 776–780
Geng, Z., Zhuo, L., Zhang, J., Li, X., 2016. A Comparative Study of Local Feature Extraction Algorithms for Web Pornographic Image Recognition. In: Proceedings of 2015 IEEE International Conference on Progress in Informatics and Computing, PIC 2015, pp. 87–92
Gonzalez, S., Brookes, M., 2011. A Pitch Estimation Filter Robust to High Levels of Noise (PEFAC). In: European Signal Processing Conference, (Eusipco), 29 August - 2 September 2011
Gupta, M., Bhaskar, D., Bera, R., 2016. Automatic Target Classification in GMTI Airborne Scenario. International Journal of Technology, Volume 7(5), pp. 840–848
Hasan, R., Jamil, M., Rabbani, G., Rahman, S., 2004. Speaker Identification using Mel Frequency Cepstral Coefficients. In: Proceedings of the 3rd International Conference on Electrical & Computer Engineering (ICECE 2004), December 2004, pp. 28–30
Hermes, D.J., 1988. Measurement of Pitch by Subharmonic Summation. The Journal of the Acoustical Society of America, Volume 83(1), pp. 257–264
Jiang, Y.G., Bhattacharya, S., Chang, S.F., Shah, M., 2013. High-level Event Recognition in Unconstrained Videos. International Journal of Multimedia Information Retrieval, Volume 2(2), pp. 73–101
Jin, X., Wang, Y., Tan, X., 2018. Pornographic Image Recognition via Weighted Multiple Instance Learning. IEEE Transactions on Cybernetics, Volume 49(12), pp. 4412–4420
Lopes, A.P.B., De Avila, S.E.F., Peixoto, A.N.A., Oliveira, R.S., Coelho, M.D.M., Araújo, A.D.A., 2009. Nude Detection in Video using Bag-of-visual-features. In: Proceedings of SIBGRAPI 2009, 22nd Brazilian Symposium on Computer Graphics and Image Processing, pp. 224–231
Lowe, D.G., 1999. Object Recognition from Local Scale-invariant Features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, Volume 2, pp. 1150–1157
Mesaros, A., Heittola, T., Virtanen, T., 2016. TUT Database for Acoustic Scene Classification and Sound Event Detection. In: European Signal Processing Conference (EUSIPCO), November 2016, pp. 1128–1132
More, M.D., Souza, D.M., Barros, R.C., 2018. Seamless Nudity Censorship: An Image-to-Image Translation Approach based on Adversarial Training. In: IEEE International Joint Conference on Neural Networks (IJCNN)
Moreira, D., Avila, S., Perez, M., Moraes, D., Testoni, V., Valle, E., Goldenstein, S., Rocha, A., 2016. Pornography Classification: The Hidden Clues in Video Space–time. Forensic Science International, Volume 268, pp. 46–61
Naik, S., Metkewar, P., 2015. Recognizing Offline Handwritten Mathematical Expressions (ME) based on a Predictive Approach of Segmentation using K-NN Classification. International Journal of Technology, Volume 6(3), pp. 345–354
Nian, F., Li, T., Wang, Y., Xu, M., Wu, J., 2016. Pornographic Image Detection Utilizing Deep Convolutional Neural Networks. Neurocomputing, Volume 210, pp. 283–293
Noll, A.M., 1967. Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, Volume 41, pp. 293–309
Nurhadiyatna, A., Cahyadi, S., Damatraseta, F., Rianto, Y., 2018. Adult Content Classification through Deep Convolution Neural Network. In: Proceedings of the 2017 International Conference on Computer, Control, Informatics and Its Applications: Emerging Trends In Computational Science and Engineering, IC3INA 2017, January 2018, pp. 106–110
Piczak, K.J., 2015. ESC: Dataset for Environmental Sound Classification. In: Proceedings of the 23rd ACM International Conference on Multimedia, pp. 1015–1018
Pieropan, A., Salvi, G., Pauwels, K., Kjellstrom, H., 2014. Audio-visual Classification and Detection of Human Manipulation Actions. In: IEEE International Conference on Intelligent Robots and Systems (IROS), pp. 3045–3052
Rabiner, L.R., Schafer, R.W., 2011. Theory and Applications of Digital Speech Processing. Pearson, Upper Saddle River, NJ
Salamon, J., Bello, J.P., 2015. Unsupervised Feature Learning for Urban Sound Classification. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), August 2015, pp. 171–175
Shen, R., Zou, F., Song, J., Yan, K., Zhou, K., 2018. EFUI: An Ensemble Framework using Uncertain Inference for Pornographic Image Recognition. Neurocomputing, Volume 322, pp. 166–176
Shi, Z., Han, J., Zheng, T., Li, J., 2013. Identification of Objectionable Audio Segments based on Pseudo and Heterogeneous Mixture Models. IEEE Transactions on Audio, Speech and Language Processing, Volume 21(3), pp. 611–623
Snoek, C.G.M., Worring, M., 2007. Concept-based Video Retrieval. Foundations and Trends® in Information Retrieval, Volume 2(4), pp. 215–322
Snoek, C.G.M., Worring, M., Smeulders, A.W.M., 2005. Early Versus Late Fusion in Semantic Video Analysis. In: Proceedings of the 13th Annual ACM International Conference on Multimedia, MULTIMEDIA ’05, pp. 399–402
Stowell, D., Giannoulis, D., Benetos, E., Lagrange, M., Plumbley, M.D., 2015. Detection and Classification of Acoustic Scenes and Events. IEEE Transactions on Multimedia, Volume 17(10), pp. 1733–1746
Zhou, K., Zhuo, L., Geng, Z., Zhang, J., Li, X.G., 2016. Convolutional Neural Networks Based Pornographic Image Classification. In: Proceedings of the 2016 IEEE 2nd International Conference on Multimedia Big Data, BigMM 2016, pp. 206–209
Zhou, W., Ahrary, A., Kamata, S.I., 2012. Image Description with Local Patterns: An Application to Face Recognition. IEICE Transactions on Information and Systems, Volume E95-D(5), pp. 1494–1505
Zuo, H., Wu, O., Hu, W., Xu, B., 2008. Recognition of Blue Movies by Fusion of Audio and Video. In: Proceedings of the 2008 IEEE International Conference on Multimedia and Expo, ICME 2008, pp. 37–40