• International Journal of Technology (IJTech)
  • Vol 11, No 2 (2020)

Improving Accuracy of Isolated Word Recognition System by using Syllable Number Characteristics

Risanuri Hidayat, Anggun Winursito

Corresponding email: risanuri@ugm.ac.id


Cite this article as:
Hidayat, R., Winursito, A., 2020. Improving Accuracy of Isolated Word Recognition System by using Syllable Number Characteristics. International Journal of Technology. Volume 11(2), pp. 411–421

Risanuri Hidayat Department of Electrical Engineering and Information Technology, Faculty of Engineering, Universitas Gadjah Mada, Jl. Grafika No. 2, Yogyakarta 55281, Indonesia
Anggun Winursito Department of Electrical Engineering and Information Technology, Faculty of Engineering, Universitas Gadjah Mada, Jl. Grafika No. 2, Yogyakarta 55281, Indonesia

Abstract

Research on speech recognition systems continues to focus on improving their accuracy. This study developed an isolated word recognition system that uses the syllable number characteristics of the speech signal to be recognized. First, the number of syllables in the speech signal was detected; the detection result was then used to select the database group matching that syllable number. This method was designed to reduce the possibility of errors when matching test data features against database features. Mel frequency cepstral coefficients (MFCC) were used for feature extraction and the K-nearest neighbor (KNN) method for classification. Three versions of the proposed method were designed. The results showed that version three increased the accuracy by 4% compared to the conventional recognition system. Version three had the fastest computational time compared to the other methods. The addition of syllable detection algorithms in version three increased the computational time by only 0.151 s compared to the conventional MFCC method. The data cut length and the threshold value for the filter also influenced the accuracy of the speech recognition system.

Isolated word; K-nearest neighbor (KNN); Mel frequency cepstral coefficients (MFCC); Number of syllables; Speech recognition

Introduction

Technology plays a crucial role in daily human life and is developing rapidly. One such technology is speech recognition, which is widely used in applications such as mobile phones, home security systems, and global positioning systems. Sound recognition has also been used to recognize drones (Shi et al., 2018). Research on speech recognition systems continues to improve recognition results. Speech recognition systems use several main stages, including feature extraction and classification, to identify speech patterns (Dahake et al., 2016). The feature extraction process obtains the characteristics of a sound frame, and the classification process chooses a word by analyzing the extracted features (Jo et al., 2016). Mel frequency cepstral coefficients (MFCC) are widely used for feature extraction (Adiwijaya et al., 2017; Vijayan et al., 2017; Hidayat et al., 2018; Kumar et al., 2018; Marlina et al., 2018; Winursito et al., 2018; Li et al., 2020). Although the MFCC method is widely used in speech recognition systems, it still requires further development, especially in terms of accuracy (Winursito et al., 2018).

Several studies have tried to improve the performance of the MFCC method. One study added a delta coefficient (Hossan et al., 2010) and compared this MFCC + Delta method with the ordinary MFCC method. The results indicated that the added delta coefficient improved the speech recognition system’s accuracy. Another study (Hidayat et al., 2018) added a wavelet-transform-based noise reduction system, because MFCC is quite susceptible to noise interference in the sound input, which degrades the speech recognition system’s accuracy. Other studies used wavelets and a psychoacoustic model for speech compression (Gunjal and Raut, 2015). A study on noise removal in speech signals (Tomchuk, 2018) aimed to achieve high recognition accuracy for signals both with and without noise. A recent speech recognition system added a data compression method (Winursito et al., 2018) that compressed the total output data of all MFCC features by using principal component analysis (PCA). Data compression was performed to remove unnecessary data and retain only the important data. The study results indicated increased accuracy at the cost of increased computational time.
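To make the MFCC + Delta idea concrete, the following is a minimal sketch of one common way to compute delta coefficients, using the standard regression formula over neighboring frames. It is not necessarily the formulation used by Hossan et al. (2010); the function name `delta`, the `width` parameter, and the toy input are assumptions introduced here for illustration.

```python
import numpy as np

def delta(coeffs, width=2):
    """Regression-based delta of a (frames, n_coeffs) array.

    d_t = sum_{n=1..width} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..width} n^2)
    Edge frames are handled by repeating the first and last frame.
    """
    coeffs = np.asarray(coeffs, dtype=float)
    n_frames = coeffs.shape[0]
    padded = np.pad(coeffs, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(coeffs)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + n_frames]    # c_{t+n}
        behind = padded[width - n : width - n + n_frames]   # c_{t-n}
        out += n * (ahead - behind)
    return out / denom

# Toy stand-in for an MFCC matrix: 6 frames, 1 coefficient, linear ramp.
mfcc = np.arange(6.0).reshape(-1, 1)
# Static and delta coefficients stacked side by side as the feature matrix.
features = np.hstack([mfcc, delta(mfcc)])
```

Appending the delta columns doubles the feature dimension, which is one reason such additions tend to increase the computational load of the later matching stage.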

Many studies have focused on improving speech recognition accuracy, generally by adding algorithms from other methods to the system; a side effect is an increasingly heavy computational load. An overly large computational load is a problem because speech recognition applications are expected to work in real time. The present study aims to improve speech recognition accuracy without requiring a large computational load. It increases the accuracy of an isolated word recognition system by using the syllable number characteristics of the speech to be recognized. Most studies on speech recognition systems use utterance data in the form of isolated words (Masood et al., 2015; Hidayat et al., 2018; Raczynski, 2018; Sawant and Deshpande, 2018; Tomchuk, 2018; Winursito et al., 2018). Others developed syllable-based speech recognition systems (Can and Artuner, 2013; Soe and Theins, 2015; Kristomo et al., 2017). Isolated word recognition systems are preferable because they have high accuracy and require less-complicated algorithms; however, most developed systems add other methods to the system. In this study, the utterance data to be recognized are isolated words. The syllable number of the speech signal is used as additional feature data to increase recognition accuracy, and the added syllable detection algorithm is kept simple to avoid greatly affecting the computational load. The utterances examined were several isolated words in Bahasa. In general, words in daily speech fall into five types according to their number of syllables: one, two, three, four, or five syllables.
This study uses these characteristics to improve isolated word recognition accuracy by grouping words into several databases according to their syllable numbers. Three versions of the proposed method were designed to determine which gives the best accuracy improvement. The versions differ in the number of database groups used in the classification process. In version one, the database was divided into five parts, one per syllable number. In version two, the database was divided into three parts. Finally, in version three, the database was divided into only two parts. Test results obtained using the proposed method were then compared with previously developed methods such as MFCC + Delta and MFCC + PCA in terms of accuracy and computational time.
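The grouping idea can be sketched as follows: the reference database is split by syllable number, and KNN matching is restricted to the group selected by the detected syllable count. This is an illustrative sketch only, not the authors' implementation. The function names, the toy two-dimensional vectors standing in for MFCC features, and in particular the `two_way` split are assumptions; the text does not state how versions two and three assign syllable numbers to groups. The syllable count is passed in directly, as if already detected.

```python
from collections import Counter, defaultdict
import numpy as np

def build_groups(features, words, syllables, group_of):
    """Split the reference database into groups keyed by group_of(syllable count)."""
    groups = defaultdict(lambda: ([], []))
    for vec, word, n_syl in zip(features, words, syllables):
        vecs, labels = groups[group_of(n_syl)]
        vecs.append(np.asarray(vec, dtype=float))
        labels.append(word)
    return {g: (np.stack(v), l) for g, (v, l) in groups.items()}

def knn_predict(query, vectors, labels, k=1):
    """Plain KNN: Euclidean distance, majority vote among the k nearest."""
    dists = np.linalg.norm(vectors - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

def recognize(query, detected_syllables, groups, group_of, k=1):
    """Match the query only against the group chosen by the syllable count."""
    vectors, labels = groups[group_of(detected_syllables)]
    return knn_predict(query, vectors, labels, k)

# Hypothetical reference database: toy vectors, placeholder word labels.
features = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.1], [5.0, 5.0]]
words = ["w1", "w2", "w3", "w4"]
syllables = [1, 2, 3, 5]

per_syllable = lambda n: n              # one group per syllable number
two_way = lambda n: 0 if n <= 2 else 1  # hypothetical two-group split

groups_v1 = build_groups(features, words, syllables, per_syllable)
groups_2way = build_groups(features, words, syllables, two_way)

query = [0.04, 0.04]  # imagined utterance of the 3-syllable word "w3"
ungrouped = knn_predict(query, np.array(features), words)
grouped_v1 = recognize(query, 3, groups_v1, per_syllable)
grouped_2way = recognize(query, 3, groups_2way, two_way)
```

In this toy case the ungrouped search picks the nearer one-syllable word, while both grouped searches return the intended word because the confusable entry is excluded from the candidate set. Fewer groups make the group selection more tolerant of small syllable-count errors, at the cost of leaving more candidates in each group.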

Conclusion

Using syllable number characteristics in the development of a speech recognition system improved recognition accuracy. Version three of the proposed method improved the accuracy by 4% compared to the conventional MFCC method. This version divides the reference database into two parts based on the syllable number characteristics. With the proposed method, recognition accuracy depends strongly on the accuracy of syllable number detection: if the system incorrectly detects the syllable number, the classification process uses the wrong database and the word recognition is also wrong. The data cut length and threshold values also affected the accuracy. Version three had the fastest computational time compared to other methods. Adding the syllable detection algorithm to version three increased the computation time by only 0.151 s compared with the conventional MFCC method.
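The dependence on syllable number detection can be written as a simple relation. The symbols below are introduced here for illustration and do not appear in the original: $p_s$ is the probability that the detected syllable number selects the correct database group, $a_g$ is the classification accuracy within the correct group, $a_0$ is the accuracy of the conventional system, and $A$ is the overall accuracy. It assumes, as stated above, that selecting the wrong database always yields a wrong word.

```latex
\begin{align}
  A &= p_s \, a_g + (1 - p_s) \cdot 0 = p_s \, a_g \le p_s, \\
  A &> a_0 \iff p_s \, a_g > a_0 .
\end{align}
```

Under this assumption, the overall accuracy can never exceed the detection accuracy, and grouping helps only when the gain in within-group accuracy outweighs the detection errors.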

Supplementary Material
Filename Description
R1-EECE-3678-20200330204616.JPG Figure 1 and Figure 2
R1-EECE-3678-20200330204651.JPG Figure 3
R1-EECE-3678-20200330204715.JPG Figure 4
R1-EECE-3678-20200330204745.JPG Figure 5
References

Adiwijaya, A., Aulia, M.N., Mubarok, M.S., Novia, W.U., Nhita, F., 2017. A Comparative Study of MFCC-KNN and LPC-KNN for Hijaiyyah Letters Pronunciation Classification System. In: IEEE Fifth International Conference on Information and Communication Technology (ICoICT), pp. 1–5

Banaeeyan, R., Karim, H.A., Lye, H., Fauzi, M.F.A., Mansor, S., See, J., 2019. Acoustic Pornography Recognition using Fused Pitch and Mel-Frequency Cepstrum Coefficients. International Journal of Technology, Volume 10(7), pp. 1335–1343

Can, B., Artuner, H., 2013. A Syllable-based Turkish Speech Recognition System by using Time Delay Neural Networks (TDNNs). In: International Conference on Soft Computing and Pattern Recognition (SoCPaR). Hanoi, Vietnam, pp. 219–224

Dahake, P.P., Shaw, K., Malathi, P., 2016. Speaker Dependent Speech Emotion Recognition Using MFCC and Support Vector Machine. In: IEEE International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT), pp. 1080–1084

Gunjal, S., Raut, R., 2015. Traditional Psychoacoustic Model and Daubechies Wavelets for Enhanced Speech Coder Performance. International Journal of Technology, Volume 6(2), pp. 190–197

Hidayat, R., Bejo, A., Sumaryono, S., Winursito, A., 2018. Denoising Speech for MFCC Feature Extraction using Wavelet Transformation in Speech Recognition System. In: IEEE 10th International Conference on Information Technology and Electrical Engineering (ICITEE), Kuta, pp. 280–284

Hossan, Md.A., Memon, S., Gregory, M.A., 2010. A Novel Approach for MFCC Feature Extraction. In: IEEE 4th International Conference on Signal Processing and Communication Systems, Gold Coast, Australia, pp. 1–5

Enriko, I.K.A., Suryanegara, M., Gunawan, D., 2016. Heart Disease Prediction System using k-Nearest Neighbor Algorithm with Simplified Patient’s Health Parameters. Journal of Telecommunication, Electronic and Computer Engineering, Volume 8(12), pp. 59–65

Jo, J., Yoo, H., Park, I.-C., 2016. Energy-Efficient Floating-Point MFCC Extraction Architecture for Speech Recognition Systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 24(2), pp. 754–758

Li, Q., Yang, Y., Lan, T., Zhu, H., Wei, Q., Qiao, F., Liu, X., Yang, H., 2020. MSP-MFCC: Energy-Efficient MFCC Feature Extraction Method with Mixed-signal Processing Architecture for Wearable Speech Recognition Applications. IEEE Access, Volume 8

Kristomo, D., Hidayat, R., Soesanti, I., 2017. Classification of the Syllables Sound using Wavelet, Renyi Entropy and AR-PSD features. In: IEEE 13th International Colloquium on Signal Processing & Its Applications (CSPA), Penang, Malaysia, pp. 94–99

Kumar, C., ur Rehman, F., Kumar, S., Mehmood, A., Shabir, G., 2018. Analysis of MFCC and BFCC in a Speaker Identification System. In: International Conference on Computing, Mathematics and Engineering Technologies (ICoMET), Sukkur, pp. 1–5

Marlina, L., Wardoyo, C., Sanjaya, W.S.M., Anggraeni, D., Dewi, S.F., Roziqin, A., Maryanti, S., 2018. Makhraj Recognition of Hijaiyah Letter for Children based on Mel-Frequency Cepstrum Coefficients (MFCC) and Support Vector Machines (SVM) Method. In: IEEE International Conference on Information and Communications Technology (ICOIACT), Yogyakarta, pp. 935–940

Masood, S., Mehta, M., Namrata, Rizvi, D.R., 2015. Isolated Word Recognition using Neural Network. In: Annual IEEE India Conference (INDICON), New Delhi, India, pp. 1–5

Mufarroha, F.A., Utaminingrum, F., 2017. Hand Gesture Recognition using Adaptive Network Based Fuzzy Inference System and K-Nearest Neighbor. International Journal of Technology, Volume 8(3), pp. 559–567

Raczynski, M., 2018. Speech Processing Algorithm for Isolated Words Recognition. In: IEEE International Interdisciplinary PhD Workshop (IIPhDW), Świnoujście, pp. 27–31

Sawant, S., Deshpande, M., 2018. Isolated Spoken Marathi Words Recognition using HMM. In: IEEE 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, pp. 1–4

Shi, L., Ahmad, I., He, Y., Chang, K., 2018. Hidden Markov Model Based Drone Sound Recognition using MFCC Technique in Practical Noisy Environments. Journal of Communication and Network, Volume 20, pp. 509–518

Soe, W., Theins, Y., 2015. Syllable-based Myanmar Language Model for Speech Recognition. In: IEEE/ACIS 14th International Conference on Computer and Information Science (ICIS), Las Vegas, NV, USA, pp. 291–296

Tomchuk, K.K., 2018. Spectral Masking in MFCC Calculation for Noisy Speech. In: IEEE Wave Electronics and Its Application in Information and Telecommunication Systems (WECONF), St. Petersburg, pp. 1–4

Vijayan, A., Mathai, B.M., Valsalan, K., Johnson, R.R., Mathew, L.R., Gopakumar, K., 2017. Throat Microphone Speech Recognition using MFCC. In: IEEE International Conference on Networks & Advances in Computational Technologies (NetACT), pp. 392–395

Winursito, A., Hidayat, R., Bejo, A., Utomo, M.N.Y., 2018. Feature Data Reduction of MFCC using PCA and SVD in Speech Recognition System. In: IEEE International Conference on Smart Computing and Electronic Enterprise (ICSCEE), Shah Alam, pp. 1–6