Published at : 29 Nov 2019
Volume : IJtech
Vol 10, No 7 (2019)
DOI : https://doi.org/10.14716/ijtech.v10i7.3270
Rasoul Banaeeyan | Faculty of Engineering, Multimedia University, Cyberjaya, 63100, Malaysia |
Hezerul Abdul Karim | Faculty of Engineering, Multimedia University, Cyberjaya, 63100, Malaysia |
Haris Lye | Faculty of Engineering, Multimedia University, Cyberjaya, 63100, Malaysia |
Mohammad Faizal Ahmad Fauzi | Faculty of Engineering, Multimedia University, Cyberjaya, 63100, Malaysia |
Sarina Mansor | Faculty of Engineering, Multimedia University, Cyberjaya, 63100, Malaysia |
John See | Faculty of Computing & Informatics, Multimedia University, Cyberjaya, 63100, Malaysia |
The main objective of this paper is pornography recognition using audio
features. Unlike most previous attempts, which have concentrated on the visual
content of pornographic images or videos, we propose to take advantage of
sound. Sound is particularly important when the visual features are not
adequately informative of the content (e.g., cluttered scenes, dark scenes,
scenes with a covered body). Our hypothesis is that scenes with pornographic
content carry audio with features specific to those scenes; these sounds can
take the form of speech or voice. More specifically, we propose to extract two
types of features, (I) pitch and (II) mel-frequency cepstrum coefficients
(MFCC), and to train five different variations of the k-nearest neighbor (KNN)
supervised classification model on the fusion of these features. We then
tested this hypothesis through a set of evaluations on a porno-sound dataset
derived from an existing pornography video dataset. The experimental results
confirm the feasibility of the proposed acoustic-driven approach, demonstrating
an accuracy of 88.40%, an F-score of 85.20%, and an area under the curve (AUC)
of 95% in the task of pornography recognition.
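As a rough illustration of the classification step described above, the following sketch fuses a pitch vector and an MFCC vector by concatenation (early fusion) and classifies a query clip with a k-nearest-neighbor vote. The feature values, the labels, the helper names (`fuse`, `knn_predict`), and the choice of k are all hypothetical; the actual feature dimensions, fusion scheme, and KNN settings used in the study are not stated here. The cosine-distance and distance-weighted options are included only as plausible readings of KNN variants, not as the authors' configuration.

```python
import math
from collections import defaultdict


def fuse(pitch_feats, mfcc_feats):
    """Early fusion: concatenate the two feature vectors into one."""
    return list(pitch_feats) + list(mfcc_feats)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(train_x, train_y, query, k=3, distance=euclidean, weighted=False):
    """Label the query by a vote among its k nearest training vectors."""
    neighbours = sorted(
        (distance(x, query), label) for x, label in zip(train_x, train_y)
    )[:k]
    votes = defaultdict(float)
    for dist, label in neighbours:
        # Weighted voting (an assumption): closer neighbours count more.
        votes[label] += 1.0 / (dist * dist + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)


# Hypothetical, already-normalised toy features: 2 pitch values + 3 MFCC values.
train_x = [
    fuse([0.80, 0.60], [0.70, 0.20, 0.90]),
    fuse([0.90, 0.70], [0.60, 0.30, 0.80]),
    fuse([0.85, 0.65], [0.75, 0.25, 0.85]),
    fuse([0.20, 0.10], [0.10, 0.80, 0.20]),
    fuse([0.30, 0.20], [0.20, 0.70, 0.10]),
    fuse([0.25, 0.15], [0.15, 0.75, 0.15]),
]
train_y = ["positive"] * 3 + ["negative"] * 3

query = fuse([0.82, 0.60], [0.70, 0.22, 0.88])
print(knn_predict(train_x, train_y, query, k=3))
print(knn_predict(train_x, train_y, query, k=3, distance=cosine_distance))
print(knn_predict(train_x, train_y, query, k=3, weighted=True))
```

In practice the two feature types usually live on different scales (pitch in Hz, MFCCs as unitless coefficients), so some normalisation before concatenation is typically needed for a distance-based classifier; the toy values above are assumed to be normalised already.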
Acoustic recognition; KNN classifier; MFCC features; Pornography detection
Filtering inappropriate visual content from different sources (internet
television (TV), web pages, etc.) is a primary concern in environments such as
schools, homes, and workplaces. In some countries, such as Malaysia, Indonesia,
and Brunei, all TV channel providers are expected to obtain suitability
approval before granting access to their subscribers or public users.
One part of the suitability assessment involves pornography recognition,
which, most of the time, imposes a huge censorship cost on the service
providers because a large workforce must be recruited to work constantly over
several months.
The main purpose of this research is to facilitate the task of pornography
detection by exploiting the distinctive power of acoustic features (as
explained in Section 3). More specifically, this study proposes employing
pitch and mel-frequency cepstrum coefficient (MFCC) acoustic features, which
represent both voiced and unvoiced sounds.
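To make the pitch feature more concrete, the sketch below estimates the fundamental frequency of one short frame with a plain autocorrelation search. This is only one of many pitch estimators and is not claimed to be the one used in this study; the function name `estimate_pitch`, the 80 to 400 Hz search range, the 8 kHz sample rate, and the synthetic 200 Hz test tone are all assumptions made for illustration.

```python
import math


def estimate_pitch(signal, sample_rate, fmin=80.0, fmax=400.0):
    """Estimate pitch (Hz) of one frame by picking the autocorrelation peak."""
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = int(sample_rate / fmin)  # longest period considered
    mean = sum(signal) / len(signal)
    centred = [s - mean for s in signal]  # remove DC offset

    best_lag, best_score = None, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalised autocorrelation at this lag.
        score = sum(
            centred[i] * centred[i + lag] for i in range(len(centred) - lag)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic voiced-like frame: a 200 Hz sine sampled at 8 kHz (hypothetical input).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(1024)]
pitch = estimate_pitch(frame, sample_rate)
print(round(pitch, 1))
```

An unvoiced frame (for example, noise) has no clear autocorrelation peak, so a practical system would pair an estimator like this with a voicing decision; that step is omitted here.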
Although there have been
several attempts to address the problem of pornography recognition
The paper is organized as follows. Section 2 briefly reviews recent related
work in the domain. Section 3 presents the design framework of the proposed
acoustic-driven pornography recognition, as well as the details of the system
design employed in this study. Section 4 details the experimental setup and
procedures followed in our research to facilitate the reproducibility of the
results. Section 5 presents and discusses the results of the different
experiments, and Section 6 concludes the paper and states some possible future
directions.
In this research, we used
acoustic information extracted from video clips in order to train different
supervised classification models and test the feasibility of acoustic-driven
features in the task of pornography recognition. More specifically, two types
of features, pitch and MFCC, were employed to construct acoustic
representations of the audio tracks.
We constructed a new audio dataset of pornography soundtracks, split into
training and testing partitions. Across multiple experiments, the best recall,
F-score, and AUC were achieved by the Medium KNN, while the highest precision
and accuracy were obtained by the Cosine KNN and the Weighted KNN,
respectively.
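For readers who want to see how the reported measures relate to one another, the following sketch computes precision, recall, F-score, and accuracy from binary predictions, and AUC from classifier scores via the pairwise ranking definition. The labels, scores, and the 0.5 decision threshold are hypothetical toy values, not results from this study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / len(y_true)
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "accuracy": accuracy,
    }


def auc(y_true, scores, positive=1):
    """AUC as the chance that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical ground truth and classifier scores for eight clips.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

metrics = classification_metrics(y_true, y_pred)
print(metrics)
print(auc(y_true, scores))
```

Because precision, recall, and accuracy depend on the decision threshold while AUC depends only on the ranking of scores, it is unsurprising that different KNN variants can lead on different measures.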
In future work, we intend to extend our research by investigating the effects of other pitch estimation algorithms, such as those reported by Drugman and Alwan (2011), Gonzalez and Brookes (2011), Hermes (1988), and Noll (1967). We will also explore the performance of different supervised and unsupervised learning models on a larger pornography audio dataset.
This research was
fully funded by TELEKOM Malaysia Research and Development (TM R&D).
Atal, B.S., 1972. Automatic
Speaker Recognition based on Pitch Contours. The Journal of the Acoustical
Society of America, Volume 52(6B), pp. 1687–1697
Caetano, C., Avila, S., Schwartz, W.R.,
Guimarães, S.J.F., Araújo, A. de A., 2016. A Mid-level Video Representation based
on Binary Descriptors: A Case Study for Pornography Detection. Neurocomputing,
Volume 213, pp. 102–114
Dalal, N., Triggs, B., 2005. Histograms of Oriented Gradients for Human
Detection. In: IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR'05), 20-25 June 2005
Drugman, T., Alwan, A., 2011. Joint Robust
Voicing Detection and Pitch Estimation based on Residual Harmonics. In: Twelfth
Annual Conference of the International Speech Communication Association, 27-31
August 2011
Foggia, P., Petkov, N., Saggese, A.,
Strisciuglio, N., Vento, M., 2015. Reliable Detection of Audio Events in Highly
Noisy Environments. Pattern Recognition Letters, Volume 65, pp. 22–28
Foggia, P., Petkov, N., Saggese, A., Strisciuglio,
N., Vento, M., 2016. Audio Surveillance of Roads: A System for Detecting
Anomalous Sounds. IEEE Transactions on Intelligent Transportation Systems,
Volume 17(1), pp. 279–288
Gemmeke, J.F., Ellis, D.P.W., Freedman, D.,
Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M., 2017. Audio Set:
An Ontology and Human-labeled Dataset for Audio Events. In:
International Conference on Acoustics, Speech, and Signal Processing, pp.
776–780
Geng, Z., Zhuo, L., Zhang, J., Li, X., 2016. A
Comparative Study of Local Feature Extraction Algorithms for Web Pornographic
Image Recognition. In: Proceedings of 2015 IEEE International Conference
on Progress in Informatics and Computing, PIC 2015, pp. 87–92
Gonzalez, S., Brookes, M., 2011. A Pitch
Estimation Filter Robust to High Levels of Noise (PEFAC). In: European
Signal Processing Conference (EUSIPCO), 29 August - 2 September 2011
Gupta, M., Bhaskar, D., Bera, R., 2016.
Automatic Target Classification in GMTI Airborne Scenario. International
Journal of Technology, Volume 7(5), pp. 840–848
Hasan, R., Jamil, M., Rabbani, G., Rahman, S.,
2004. Speaker Identification using Mel Frequency Cepstral Coefficients. In: Proceedings
of the 3rd International Conference on Electrical & Computer
Engineering (ICECE 2004), December 2004, pp. 28–30
Hermes, D.J., 1988. Measurement of Pitch by
Subharmonic Summation. The Journal of the Acoustical Society of America,
Volume 83(1), pp. 257–264
Jiang, Y.G., Bhattacharya, S., Chang, S.F.,
Shah, M., 2013. High-level Event Recognition in Unconstrained Videos. International
Journal of Multimedia Information Retrieval, Volume 2(2), pp. 73–101
Jin, X., Wang, Y., Tan, X., 2018. Pornographic
Image Recognition via Weighted Multiple Instance Learning. IEEE Transactions
on Cybernetics, Volume 49(12), pp. 4412–4420
Lopes, A.P.B., De Avila, S.E.F., Peixoto,
A.N.A., Oliveira, R.S., Coelho, M.D.M., Araújo, A.D.A., 2009. Nude Detection in
Video using Bag-of-visual-features. In: Proceedings of SIBGRAPI 2009, 22nd
Brazilian Symposium on Computer Graphics and Image Processing, pp. 224–231
Lowe, D.G., 1999. Object Recognition from Local
Scale-invariant Features. In: Proceedings of the Seventh IEEE
International Conference on Computer Vision, Volume 2, pp. 1150–1157
Mesaros, A., Heittola, T., Virtanen, T., 2016.
TUT Database for Acoustic Scene Classification and Sound Event Detection. In:
European Signal Processing Conference (EUSIPCO), November 2016, pp. 1128–1132
More, M.D., Souza, D.M., Barros, R.C., 2018.
Seamless Nudity Censorship: An Image-to-Image Translation Approach based on
Adversarial Training. In: IEEE International Joint Conference on Neural
Networks (IJCNN)
Moreira, D., Avila, S., Perez, M., Moraes, D.,
Testoni, V., Valle, E., Goldenstein, S., Rocha, A., 2016. Pornography
Classification: The Hidden Clues in Video Space–time. Forensic Science
International, Volume 268, pp. 46–61
Naik, S., Metkewar, P., 2015. Recognizing
Offline Handwritten Mathematical Expressions (ME) based on a Predictive
Approach of Segmentation using K-NN Classification. International Journal of
Technology, Volume 6(3), pp. 345–354
Nian, F., Li, T., Wang, Y., Xu, M., Wu, J.,
2016. Pornographic Image Detection Utilizing Deep Convolutional Neural
Networks. Neurocomputing, Volume 210, pp. 283–293
Noll, A.M., 1967. Cepstrum Pitch Determination.
The Journal of the Acoustical Society of America, Volume 41, pp. 293–309
Nurhadiyatna, A., Cahyadi, S., Damatraseta, F.,
Rianto, Y., 2018. Adult Content Classification through Deep Convolution Neural
Network. In: Proceedings of the 2017 International Conference on
Computer, Control, Informatics and Its Applications: Emerging Trends In
Computational Science and Engineering, IC3INA 2017, January 2018, pp. 106–110
Piczak, K.J., 2015. ESC: Dataset for
Environmental Sound Classification. In: Proceedings of the 23rd
ACM International Conference on Multimedia, pp. 1015–1018
Pieropan, A., Salvi, G., Pauwels, K.,
Kjellstrom, H., 2014. Audio-visual Classification and Detection of Human
Manipulation Actions. In: IEEE International Conference on Intelligent
Robots and Systems (IROS), pp. 3045–3052
Rabiner, L.R., Schafer, R.W., 2011. Theory and Applications of Digital Speech
Processing. Pearson, Upper Saddle River, NJ
Salamon, J., Bello, J.P., 2015. Unsupervised
Feature Learning for Urban Sound Classification. In: Proceedings of the
IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), August 2015, pp. 171–175
Shen, R., Zou, F., Song, J., Yan, K., Zhou, K.,
2018. EFUI: An Ensemble Framework using Uncertain Inference for Pornographic
Image Recognition. Neurocomputing, Volume 322, pp. 166–176
Shi, Z., Han, J., Zheng, T., Li, J., 2013.
Identification of Objectionable Audio Segments based on Pseudo and
Heterogeneous Mixture Models. IEEE Transactions on Audio, Speech and
Language Processing, Volume 21(3), pp. 611–623
Snoek, C.G.M., Worring, M., 2007. Concept-based
Video Retrieval. Foundations and Trends® in Information Retrieval, Volume
2(4), pp. 215–322
Snoek, C.G.M., Worring, M., Smeulders, A.W.M.,
2005. Early Versus Late Fusion in Semantic Video Analysis. In: Proceedings
of the 13th Annual ACM International Conference on Multimedia,
MULTIMEDIA ’05, pp. 399–402
Stowell, D., Giannoulis, D., Benetos, E.,
Lagrange, M., Plumbley, M.D., 2015. Detection and Classification of Acoustic
Scenes and Events. IEEE Transactions on Multimedia, Volume 17(10), pp. 1733–1746
Zhou, K., Zhuo, L., Geng, Z., Zhang, J., Li,
X.G., 2016. Convolutional Neural Networks Based Pornographic Image
Classification. In: Proceedings of the 2016 IEEE 2nd
International Conference on Multimedia Big Data, BigMM 2016, pp. 206–209
Zhou, W., Ahrary, A., Kamata, S.I., 2012. Image
Description with Local Patterns: An Application to Face Recognition. IEICE
Transactions on Information and Systems, Volume E95-D(5), pp. 1494–1505
Zuo, H., Wu, O., Hu, W., Xu, B., 2008.
Recognition of Blue Movies by Fusion of Audio and Video. In: Proceedings
of the 2008 IEEE International Conference on Multimedia and Expo, ICME 2008,
pp. 37–40