• International Journal of Technology (IJTech)
  • Vol 10, No 7 (2019)

Acoustic Pornography Recognition using Fused Pitch and Mel-Frequency Cepstrum Coefficients

Rasoul Banaeeyan, Hezerul Abdul Karim, Haris Lye, Mohammad Faizal Ahmad Fauzi, Sarina Mansor, John See

Cite this article as:
Banaeeyan, R., Karim, H.A., Lye, H., Fauzi, M.F.A., Mansor, S., See, J., 2019. Acoustic Pornography Recognition using Fused Pitch and Mel-Frequency Cepstrum Coefficients. International Journal of Technology. Volume 10(7), pp. 1335-1343

Rasoul Banaeeyan, Faculty of Engineering, Multimedia University, Cyberjaya, 63100, Malaysia
Hezerul Abdul Karim, Faculty of Engineering, Multimedia University, Cyberjaya, 63100, Malaysia
Haris Lye, Faculty of Engineering, Multimedia University, Cyberjaya, 63100, Malaysia
Mohammad Faizal Ahmad Fauzi, Faculty of Engineering, Multimedia University, Cyberjaya, 63100, Malaysia
Sarina Mansor, Faculty of Engineering, Multimedia University, Cyberjaya, 63100, Malaysia
John See, Faculty of Computing & Informatics, Multimedia University, Cyberjaya, 63100, Malaysia

This paper addresses pornography recognition using audio features. Unlike most previous attempts, which have concentrated on the visual content of pornographic images or videos, we propose to take advantage of sound. Sound is particularly important when the visual features are not adequately informative of the content (e.g., cluttered scenes, dark scenes, or scenes with a covered body). Our hypothesis is that scenes with pornographic content carry audio with features specific to those scenes; these sounds can take the form of speech or voice. More specifically, we extract two types of features, (I) pitch and (II) mel-frequency cepstrum coefficients (MFCC), and train five variations of the k-nearest neighbor (KNN) supervised classification model on the fusion of these features. We then tested the hypothesis through a set of evaluations on a porno-sound dataset derived from an existing pornography video dataset. The experimental results confirm the feasibility of the proposed acoustic-driven approach, demonstrating an accuracy of 88.40%, an F-score of 85.20%, and an area under the curve (AUC) of 95% in the task of pornography recognition.

Acoustic recognition; KNN classifier; MFCC features; Pornography detection


Filtering inappropriate visual content from different sources (internet television (TV), web pages, etc.) is a primary concern in environments such as schools, homes, and workplaces. In some countries, such as Malaysia, Indonesia, and Brunei, all TV channel providers are expected to obtain suitability approval before granting access to their subscribers or public users.

One part of the suitability assessment involves pornography recognition, which usually imposes a large censorship cost on service providers because they must recruit many staff to work constantly over several months.

The main purpose of this research is to facilitate pornography detection by exploiting the discriminative power of acoustic features (as explained in Section 3). More specifically, this study proposes employing pitch and mel-frequency cepstrum coefficient (MFCC) features, which represent both voiced and unvoiced sounds.
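To make the voiced/unvoiced distinction concrete, the following is a minimal sketch of frame-level pitch estimation. It uses plain autocorrelation as a stand-in technique; this section does not state which pitch estimator the study used, so the function name `autocorr_pitch`, the 50 to 400 Hz search range, and the voicing threshold of 0.3 are all assumptions chosen for illustration.

```python
import numpy as np

def autocorr_pitch(frame, sample_rate, fmin=50.0, fmax=400.0, voicing_threshold=0.3):
    """Return an F0 estimate in Hz, or 0.0 when the frame looks unvoiced.

    Illustrative autocorrelation estimator (an assumption, not the paper's method).
    """
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()                       # remove DC offset
    # Keep non-negative lags only: corr[0] is the frame energy.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if corr[0] <= 0.0:                                 # all-zero (silent) frame
        return 0.0
    lag_min = int(sample_rate / fmax)                  # shortest period searched
    lag_max = min(int(sample_rate / fmin), len(corr) - 1)
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max + 1]))
    # A weak peak relative to the energy suggests no periodicity (unvoiced).
    if corr[lag] / corr[0] < voicing_threshold:
        return 0.0
    return sample_rate / lag

sample_rate = 16000
t = np.arange(1024) / sample_rate
voiced_frame = np.sin(2 * np.pi * 200.0 * t)                  # periodic, 200 Hz
unvoiced_frame = np.random.default_rng(0).normal(size=1024)   # noise-like

f0_voiced = autocorr_pitch(voiced_frame, sample_rate)
f0_unvoiced = autocorr_pitch(unvoiced_frame, sample_rate)
print(f"voiced frame F0:   {f0_voiced:.1f} Hz")
print(f"unvoiced frame F0: {f0_unvoiced:.1f} Hz")
```

A pitch value is only defined for periodic (voiced) frames, which is one reason to pair it with a spectral-envelope descriptor such as MFCC that remains informative for noise-like (unvoiced) frames.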

Although there have been several attempts to address the problem of pornography recognition (Caetano et al., 2016; Geng et al., 2016; Moreira et al., 2016; Nian et al., 2016; Zhou et al., 2016; Jin et al., 2018; More et al., 2018; Nurhadiyatna et al., 2018; Shen et al., 2018), almost all of them have utilized visual content to automate the target task of sensitive content detection.

The paper is organized as follows. Section 2 briefly reviews recent related work in the domain. Section 3 presents the design framework of the proposed acoustic-driven pornography recognition, as well as the details of the system design employed in this study. Section 4 details the experimental setup and procedures followed in our research to facilitate the reproducibility of the results. Section 5 presents and discusses the results of the different experiments, and Section 6 concludes the paper and states some possible future directions.


In this research, we used acoustic information extracted from video clips in order to train different supervised classification models and test the feasibility of acoustic-driven features in the task of pornography recognition. More specifically, two types of features, pitch and MFCC, were employed to construct acoustic representations of the audio tracks.
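The pipeline summarized above (fuse pitch and MFCC descriptors, then train KNN variants) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the per-clip descriptors are synthetic stand-ins (one pitch statistic and 13 MFCC means per clip), fusion is done by simple concatenation, and k = 10, the cosine metric, and inverse-squared-distance weighting are assumed settings. All function names are hypothetical.

```python
import numpy as np

def fuse_features(pitch_feats, mfcc_feats):
    """Early fusion (assumed): concatenate per-clip pitch and MFCC descriptors."""
    return np.hstack([pitch_feats, mfcc_feats])

def standardize(train_x, test_x):
    """Z-score both sets with statistics estimated on the training set only."""
    mu = train_x.mean(axis=0)
    sigma = train_x.std(axis=0) + 1e-12
    return (train_x - mu) / sigma, (test_x - mu) / sigma

def pairwise_distance(a, b, metric="euclidean"):
    """Distance from every row of a to every row of b."""
    if metric == "cosine":
        a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
        b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
        return 1.0 - a_n @ b_n.T
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def knn_predict(train_x, train_y, test_x, k=10, metric="euclidean", weighted=False):
    """Binary KNN; returns (predicted labels, positive-class vote share)."""
    dist = pairwise_distance(test_x, train_x, metric)
    nearest = np.argsort(dist, axis=1)[:, :k]              # indices of k neighbours
    near_dist = np.take_along_axis(dist, nearest, axis=1)
    near_label = train_y[nearest]
    # Optional inverse-squared-distance weighting of the votes.
    weights = 1.0 / (near_dist ** 2 + 1e-12) if weighted else np.ones_like(near_dist)
    score = (weights * (near_label == 1)).sum(axis=1) / weights.sum(axis=1)
    return (score >= 0.5).astype(int), score

def make_synthetic_clips(n_per_class, rng):
    """Hypothetical stand-in data: 1 pitch statistic + 13 MFCC means per clip."""
    labels = np.repeat([0, 1], n_per_class)
    shift = labels[:, None] * 1.5                          # separates the two classes
    pitch = rng.normal(size=(2 * n_per_class, 1)) + shift
    mfcc = rng.normal(size=(2 * n_per_class, 13)) + shift
    return pitch, mfcc, labels

rng = np.random.default_rng(0)
p_tr, m_tr, y_tr = make_synthetic_clips(100, rng)
p_te, m_te, y_te = make_synthetic_clips(50, rng)
x_tr, x_te = standardize(fuse_features(p_tr, m_tr), fuse_features(p_te, m_te))

variants = {
    "euclidean, k=10": dict(k=10),
    "cosine, k=10": dict(k=10, metric="cosine"),
    "distance-weighted, k=10": dict(k=10, weighted=True),
}
accuracy = {}
for name, kwargs in variants.items():
    pred, _ = knn_predict(x_tr, y_tr, x_te, **kwargs)
    accuracy[name] = float((pred == y_te).mean())
    print(f"{name}: accuracy = {accuracy[name]:.3f}")
```

Standardizing with training-set statistics keeps the single pitch dimension from being dominated by, or dominating, the MFCC dimensions in the distance computation, which matters for any distance-based classifier applied to fused features. The accuracies printed here describe the synthetic data only and say nothing about the reported results.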

We constructed a new audio dataset of pornography soundtracks comprising training and testing partitions. Across multiple experiments, the Medium KNN achieved the best recall, F-score, and AUC, while the highest precision and accuracy were obtained by the Cosine KNN and the Weighted KNN, respectively.
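For reference, the five measures named above can be computed from binary labels and classifier scores as in the sketch below. The function names and the toy labels and scores are hypothetical and unrelated to the study's data; AUC is computed here as the probability that a randomly chosen positive clip receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F-score for binary labels (1 = positive)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}

def auc_score(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

# Toy example: six clips, three per class, with made-up classifier scores.
y_true = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.2, 0.1])
y_pred = (scores >= 0.5).astype(int)      # threshold at 0.5

metrics = classification_metrics(y_true, y_pred)
auc = auc_score(y_true, scores)
print(metrics, auc)
```

Precision, recall, F-score, and accuracy depend on the decision threshold, whereas AUC depends only on how the scores rank the clips. This is one way different KNN variants can each come out best on different measures.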

In future work, we intend to extend our research by investigating other pitch-based feature descriptor algorithms, such as those reported by Drugman and Alwan (2011), Gonzalez and Brookes (2011), Hermes (1988), and Noll (1967). We will also explore the performance of different supervised and unsupervised learning models on a larger pornography audio dataset.


This research was fully funded by TELEKOM Malaysia Research and Development (TM R&D).


Atal, B.S., 1972. Automatic Speaker Recognition based on Pitch Contours. The Journal of the Acoustical Society of America, Volume 52(6B), pp. 1687–1697

Caetano, C., Avila, S., Schwartz, W.R., Guimarães, S.J.F., Araújo, A. de A., 2016. A Mid-level Video Representation based on Binary Descriptors: A Case Study for Pornography Detection. Neurocomputing, Volume 213, pp. 102–114

Dalal, N., Triggs, B., 2005. Histograms of Oriented Gradients for Human Detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 20-25 June 2005

Drugman, T., Alwan, A., 2011. Joint Robust Voicing Detection and Pitch Estimation based on Residual Harmonics. In: Twelfth Annual Conference of the International Speech Communication Association, 27-31 August 2011

Foggia, P., Petkov, N., Saggese, A., Strisciuglio, N., Vento, M., 2015. Reliable Detection of Audio Events in Highly Noisy Environments. Pattern Recognition Letters, Volume 65, pp. 22–28

Foggia, P., Petkov, N., Saggese, A., Strisciuglio, N., Vento, M., 2016. Audio Surveillance of Roads: A System for Detecting Anomalous Sounds. IEEE Transactions on Intelligent Transportation Systems, Volume 17(1), pp. 279–288

Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M., 2017. Audio Set: An Ontology and Human-labeled Dataset for Audio Events. In: International Conference on Acoustics, Speech, and Signal Processing, pp. 776–780

Geng, Z., Zhuo, L., Zhang, J., Li, X., 2016. A Comparative Study of Local Feature Extraction Algorithms for Web Pornographic Image Recognition. In: Proceedings of 2015 IEEE International Conference on Progress in Informatics and Computing, PIC 2015, pp. 87–92

Gonzalez, S., Brookes, M., 2011. A Pitch Estimation Filter Robust to High Levels of Noise (PEFAC). In: European Signal Processing Conference (EUSIPCO), 29 August - 2 September 2011

Gupta, M., Bhaskar, D., Bera, R., 2016. Automatic Target Classification in GMTI Airborne Scenario. International Journal of Technology, Volume 7(5), pp. 840–848

Hasan, R., Jamil, M., Rabbani, G., Rahman, S., 2004. Speaker Identification using Mel Frequency Cepstral Coefficients. In: Proceedings of the 3rd International Conference on Electrical & Computer Engineering (ICECE 2004), December 2004, pp. 28–30

Hermes, D.J., 1988. Measurement of Pitch by Subharmonic Summation. The Journal of the Acoustical Society of America, Volume 83(1), pp. 257–264

Jiang, Y.G., Bhattacharya, S., Chang, S.F., Shah, M., 2013. High-level Event Recognition in Unconstrained Videos. International Journal of Multimedia Information Retrieval, Volume 2(2), pp. 73–101

Jin, X., Wang, Y., Tan, X., 2018. Pornographic Image Recognition via Weighted Multiple Instance Learning. IEEE Transactions on Cybernetics, Volume 49(12), pp. 4412–4420

Lopes, A.P.B., De Avila, S.E.F., Peixoto, A.N.A., Oliveira, R.S., Coelho, M.D.M., Araújo, A.D.A., 2009. Nude Detection in Video using Bag-of-visual-features. In: Proceedings of SIBGRAPI 2009, 22nd Brazilian Symposium on Computer Graphics and Image Processing, pp. 224–231

Lowe, D.G., 1999. Object Recognition from Local Scale-invariant Features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, Volume 2, pp. 1150–1157

Mesaros, A., Heittola, T., Virtanen, T., 2016. TUT Database for Acoustic Scene Classification and Sound Event Detection. In: European Signal Processing Conference (EUSIPCO), November 2016, pp. 1128–1132

More, M.D., Souza, D.M., Barros, R.C., 2018. Seamless Nudity Censorship: An Image-to-Image Translation Approach based on Adversarial Training. In: IEEE International Joint Conference on Neural Networks (IJCNN)

Moreira, D., Avila, S., Perez, M., Moraes, D., Testoni, V., Valle, E., Goldenstein, S., Rocha, A., 2016. Pornography Classification: The Hidden Clues in Video Space–time. Forensic Science International, Volume 268, pp. 46–61

Naik, S., Metkewar, P., 2015. Recognizing Offline Handwritten Mathematical Expressions (ME) based on a Predictive Approach of Segmentation using K-NN Classification. International Journal of Technology, Volume 6(3), pp. 345–354

Nian, F., Li, T., Wang, Y., Xu, M., Wu, J., 2016. Pornographic Image Detection Utilizing Deep Convolutional Neural Networks. Neurocomputing, Volume 210, pp. 283–293

Noll, A.M., 1967. Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, Volume 41, pp. 293–309

Nurhadiyatna, A., Cahyadi, S., Damatraseta, F., Rianto, Y., 2018. Adult Content Classification through Deep Convolution Neural Network. In: Proceedings of the 2017 International Conference on Computer, Control, Informatics and Its Applications: Emerging Trends In Computational Science and Engineering, IC3INA 2017, January 2018, pp. 106–110

Piczak, K.J., 2015. ESC: Dataset for Environmental Sound Classification. In: Proceedings of the 23rd ACM International Conference on Multimedia, pp. 1015–1018

Pieropan, A., Salvi, G., Pauwels, K., Kjellstrom, H., 2014. Audio-visual Classification and Detection of Human Manipulation Actions. In: IEEE International Conference on Intelligent Robots and Systems (IROS), pp. 3045–3052

Rabiner, L.R., Schafer, R.W., 2011. Theory and Applications of Digital Speech Processing. Pearson, Upper Saddle River, NJ

Salamon, J., Bello, J.P., 2015. Unsupervised Feature Learning for Urban Sound Classification. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), August 2015, pp. 171–175

Shen, R., Zou, F., Song, J., Yan, K., Zhou, K., 2018. EFUI: An Ensemble Framework using Uncertain Inference for Pornographic Image Recognition. Neurocomputing, Volume 322, pp. 166–176

Shi, Z., Han, J., Zheng, T., Li, J., 2013. Identification of Objectionable Audio Segments based on Pseudo and Heterogeneous Mixture Models. IEEE Transactions on Audio, Speech and Language Processing, Volume 21(3), pp. 611–623

Snoek, C.G.M., Worring, M., 2007. Concept-based Video Retrieval. Foundations and Trends® in Information Retrieval, Volume 2(4), pp. 215–322

Snoek, C.G.M., Worring, M., Smeulders, A.W.M., 2005. Early Versus Late Fusion in Semantic Video Analysis. In: Proceedings of the 13th Annual ACM International Conference on Multimedia, MULTIMEDIA ’05, pp. 399–402

Stowell, D., Giannoulis, D., Benetos, E., Lagrange, M., Plumbley, M.D., 2015. Detection and Classification of Acoustic Scenes and Events. IEEE Transactions on Multimedia, Volume 17(10), pp. 1733–1746

Zhou, K., Zhuo, L., Geng, Z., Zhang, J., Li, X.G., 2016. Convolutional Neural Networks Based Pornographic Image Classification. In: Proceedings of the 2016 IEEE 2nd International Conference on Multimedia Big Data, BigMM 2016, pp. 206–209

Zhou, W., Ahrary, A., Kamata, S.I., 2012. Image Description with Local Patterns: An Application to Face Recognition. IEICE Transactions on Information and Systems, Volume E95-D(5), pp. 1494–1505

Zuo, H., Wu, O., Hu, W., Xu, B., 2008. Recognition of Blue Movies by Fusion of Audio and Video. In: Proceedings of the 2008 IEEE International Conference on Multimedia and Expo, ICME 2008, pp. 37–40