ANALYSIS OF WHISPER AUTOMATIC SPEECH RECOGNITION PERFORMANCE ON LOW RESOURCE LANGUAGE

Riefkyanov Surya Adia Pratama; Agit Amrullah

doi:10.33480/pilar.v20i1.4633

Authors

Riefkyanov Surya Adia Pratama Universitas Amikom Yogyakarta
Agit Amrullah Universitas Amikom Yogyakarta

DOI:

https://doi.org/10.33480/pilar.v20i1.4633

Keywords:

automatic speech recognition, low-resources language, whisper fine-tuning

Abstract

Implementing Automatic Speech Recognition Technology in daily life could give convenience to its users. However, speeches that can be recognized accurately by the ASR model right now are in languages considered high resources, like English. In previous research, a few regional languages like Javanese, Sundanese, Balinese and Btaknese are used in automatic speech recognition. This research aim is to improve speech recognition using the ASR model on low-resource language. The dataset used in this research is the Javanese dataset specifically because there is a high-quality Javanese speech dataset provided by previous research. The method used is fine-tuning the Whisper model which has been trained on 680,000 hours of multilingual voice data using a Javanese speech dataset. To reduce computation requirements, parameter efficient fine-tuning (PEFT) implemented in the fine-tuning process. The trainable parameter is reduced to <1% because the implementation of PEFT reduces the computation required by the model for fine-tuning. The best WER evaluation result is 13.77%, achieved by the fine-tuned Whisper large-v2 model compared to the base model of Whisper large-v2, which achieves 89.40% in WER evaluation. Performance improvement in WER evaluation showed that fine-tuning effectively improves the performance of the Whisper automatic speech recognition model on recognizing speeches in low-resource languages like the Javanese language compared to the Original Whisper model performance with minimal computational cost needed for fine-tuning large model.

Downloads

Download data is not yet available.

References

Alharbi, S., Alrazgan, M., Alrashed, A., Alnomasi, T., Almojel, R., Alharbi, R., … Almojil, M. (2021). Automatic Speech Recognition: Systematic Literature Review. IEEE Access, 9, 131858–131876. https://doi.org/10.1109/ACCESS.2021.3112535

Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). Llm. int8 (): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339.

Fu, Z., Yang, H., So, A. M. C., Lam, W., Bing, L., & Collier, N. (2023, June). On the effectiveness of parameter-efficient fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 37, No. 11, pp. 12799-12807).

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., ... & Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

Joseph, V. R. (2022). Optimal ratio for data splitting. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(4), 531-538.

Kumar, A., & Mittal, V. (2019). Speech recognition: A complete perspective. International Journal of Recent Technology and Engineering (IJRTE), 7(6), 78-83.

Li, J. (2022). Recent advances in end-to-end automatic speech recognition. APSIPA Transactions on Signal and Information Processing, 11(1).

Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., & Raffel, C. A. (2022). Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35, 1950-1965.

Min, S., Lewis, M., Zettlemoyer, L., & Hajishirzi, H. (2021). Metaicl: Learning to learn in context. arXiv preprint arXiv:2110.15943.

Novitasari, S., Tjandra, A., Sakti, S., & Nakamura, S. (2020). Cross-lingual machine speech chain for javanese, sundanese, balinese, and bataks speech recognition and synthesis. arXiv preprint arXiv:2011.02128.

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023, July). Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning (pp. 28492-28518). PMLR.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.

Rouditchenko, A., Khurana, S., Thomas, S., Feris, R., Karlinsky, L., Kuehne, H., ... & Glass, J. (2023). Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages. arXiv preprint arXiv:2305.12606.

Wang, C., Cho, K., & Gu, J. (2020, April). Neural machine translation with byte-level subwords. In Proceedings of the AAAI conference on artificial intelligence (Vol. 34, No. 05, pp. 9154-9160).

Butryna, A., Chu, S. H. C., Demirsahin, I., Gutkin, A., Ha, L., He, F., ... & Wibawa, J. A. E. (2020). Google crowdsourced speech corpora and related open-source resources for low-resource languages and dialects: an overview. arXiv preprint arXiv:2010.06778.

Toraman, C., Yilmaz, E. H., Şahinuç, F., & Ozcelik, O. (2023). Impact of tokenization on language models: An analysis for turkish. ACM Transactions on Asian and Low-Resource Language Information Processing, 22(4), 1-21.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., ... & Le, Q. V. (2021). Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.

Yang, H., Zhang, M., Tao, S., Ma, M., & Qin, Y. (2023, February). Chinese ASR and NER Improvement Based on Whisper Fine-Tuning. In 2023 25th International Conference on Advanced Communication Technology (ICACT) (pp. 213-217). IEEE.

Zhang, C., & Lu, Y. (2021). Study on artificial intelligence: The state of the art and future prospects. Journal of Industrial Information Integration, 23, 100224.

Zhang, D., Mishra, S., Brynjolfsson, E., Etchemendy, J., Ganguli, D., Grosz, B., ... & Perrault, R. (2021). The AI index 2021 annual report. arXiv preprint arXiv:2103.06312.

Zhang, X., Peng, Y., & Xu, X. (2019, September). An overview of speech recognition technology. In 2019 4th International Conference on Control, Robotics and Cybernetics (CRC) (pp. 81-85). IEEE.

ANALYSIS OF WHISPER AUTOMATIC SPEECH RECOGNITION PERFORMANCE ON LOW RESOURCE LANGUAGE

Authors

DOI:

Keywords:

Abstract

Downloads

References

Downloads

Published

How to Cite

Issue

Section

License

Current Issue

Information

Language