DEVELOPMENT OF CNN-LSTM-BASED IMAGE CAPTIONING DATASET TO ENHANCE VISUAL ACCESSIBILITY FOR DISABILITIES

Authors

  • Muhammad Rifki, Universitas Majalengka
  • Ade Bastian
  • Ardi Mardiana

DOI:

https://doi.org/10.33480/jitk.v10i4.6657

Keywords:

accessibility, CNN-LSTM, image captioning, sidewalk, visual impairment

Abstract

Visual accessibility in public spaces remains limited for individuals with visual impairments in Indonesia, despite technological advances such as image captioning. This study aims to develop a custom dataset and a baseline CNN-LSTM image captioning model capable of describing sidewalk accessibility conditions in the Indonesian language. The methodology includes collecting 748 annotated images from various Indonesian cities, with captions manually crafted to reflect accessibility features. The model employs DenseNet201 as the CNN encoder and an LSTM as the decoder, with 70% of the data used for training and 30% for validation. Evaluation was conducted using the BLEU and CIDEr metrics. Results show a BLEU-4 score of 0.27 and a CIDEr score of 0.56, indicating moderate alignment between model-generated and reference captions. While the absence of an attention mechanism and the limited dataset size constrain overall performance, the model demonstrates the ability to identify key elements such as tactile paving, signage, and pedestrian barriers. This study contributes to assistive technology development in a low-resource language context and provides foundational work for future research. Enhancements through data expansion, the incorporation of attention mechanisms, and transformer-based models are recommended to improve descriptive richness and accuracy.
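To make the pipeline described in the abstract concrete, the following is a minimal sketch (not the authors' released code) of a DenseNet201 encoder paired with an LSTM decoder in the merge style, where an image feature vector and a partial caption are combined to predict the next word. All hyperparameters (vocab_size, max_len, embedding and LSTM sizes) are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch, assuming pre-tokenized Indonesian captions and 224x224 RGB images.
# vocab_size, max_len, embed_dim, and lstm_units are illustrative assumptions.
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import DenseNet201
from tensorflow.keras.applications.densenet import preprocess_input

vocab_size = 5000   # assumed size of the Indonesian caption vocabulary
max_len    = 30     # assumed maximum caption length in tokens
embed_dim  = 256
lstm_units = 256

# CNN encoder: DenseNet201 pretrained on ImageNet, used as a frozen feature extractor.
cnn = DenseNet201(include_top=False, weights="imagenet", pooling="avg")
cnn.trainable = False

def extract_features(images):
    """images: float array of shape (n, 224, 224, 3) -> (n, 1920) pooled DenseNet features."""
    return cnn.predict(preprocess_input(images), verbose=0)

# LSTM decoder: merge the projected image feature with the caption prefix encoding.
image_input   = layers.Input(shape=(1920,), name="image_features")
img_dense     = layers.Dense(embed_dim, activation="relu")(layers.Dropout(0.3)(image_input))

caption_input = layers.Input(shape=(max_len,), name="caption_tokens")
emb           = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(caption_input)
seq           = layers.LSTM(lstm_units)(layers.Dropout(0.3)(emb))

merged = layers.add([img_dense, seq])
hidden = layers.Dense(lstm_units, activation="relu")(merged)
output = layers.Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()
```

In this merge-style design the image feature conditions the decoder only once, which is consistent with the abstract's note that no attention mechanism is used; an attention-based or transformer decoder would instead attend to spatial image features at every decoding step.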
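The abstract reports BLEU-4 and CIDEr scores against the manually written reference captions. Below is a minimal sketch of how corpus-level BLEU-4 might be computed with NLTK; the tokenized Indonesian sentences are illustrative examples, not data from the paper's dataset. CIDEr is typically computed with a separate toolkit such as pycocoevalcap.

```python
# Minimal BLEU-4 sketch with NLTK; references and hypotheses below are illustrative.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of reference captions per image (here, a single reference each).
references = [
    [["trotoar", "dengan", "ubin", "pemandu", "dan", "rambu", "penyeberangan"]],
]
# One generated caption per image.
hypotheses = [
    ["trotoar", "dengan", "ubin", "pemandu", "di", "sisi", "jalan"],
]

smooth = SmoothingFunction().method1
bleu4 = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=smooth)
print(f"BLEU-4: {bleu4:.2f}")
```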




Published

2025-06-18

How to Cite

[1]
M. Rifki, A. Bastian, and A. Mardiana, “DEVELOPMENT OF CNN-LSTM-BASED IMAGE CAPTIONING DATASET TO ENHANCE VISUAL ACCESSIBILITY FOR DISABILITIES”, JITK (Jurnal Ilmu Pengetahuan dan Teknologi Komputer), vol. 10, no. 4, pp. 980–992, Jun. 2025.