UTILIZING RETRIEVAL-AUGMENTED GENERATION IN LARGE LANGUAGE MODELS TO ENHANCE INDONESIAN LANGUAGE NLP
Abstract
Improving Large Language Models (LLMs) such as ChatGPT with Retrieval-Augmented Generation (RAG) is an urgent need in the development of natural language translation technology and dialogue systems. LLMs often struggle with specialized requests that require information outside their training data. This study discusses the use of RAG with large language models to improve Natural Language Processing (NLP) performance for Indonesian, a language that has so far been poorly supported by high-quality data, and to overcome the limitations of traditional language models in understanding Indonesian context. The method combines retrieval (searching external information) with generation (producing text): through the retrieval process, the model draws on a broader and more structured knowledge base to produce more accurate and relevant text. The data used includes an Indonesian-language corpus drawn from the Indonesian translation of all 30 juz of the Quran. The trial results show that the RAG approach significantly improves model performance across several NLP tasks, including token usage optimization, text classification, and context understanding, by increasing the accuracy and relevance of the results.
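The retrieve-then-generate pattern described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words cosine retriever, the function names (`retrieve`, `build_prompt`), the Indonesian prompt template, and the three placeholder passages are all assumptions chosen for illustration, and the generation step is left as a prompt string that would be passed to an LLM of choice.

```python
import math
import re
from collections import Counter

# Placeholder passages (assumption): stand-ins for the study's corpus,
# which is not reproduced here.
CORPUS = [
    "Bahasa Indonesia adalah bahasa resmi Republik Indonesia.",
    "Gunung Bromo terletak di Provinsi Jawa Timur.",
    "Nasi goreng adalah makanan yang populer di Indonesia.",
]


def vectorize(text):
    """Lowercase the text and count its word tokens (bag of words)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=1):
    """Retrieval step: return the k passages most similar to the query."""
    q = vectorize(query)
    ranked = sorted(corpus, key=lambda p: cosine(q, vectorize(p)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages):
    """Generation input: place the retrieved passages ahead of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Konteks:\n{context}\n\nPertanyaan: {query}\nJawaban:"


query = "Di mana Gunung Bromo terletak?"
top = retrieve(query, CORPUS, k=1)
prompt = build_prompt(query, top)  # this string would be sent to the LLM
print(prompt)
```

Passing only the top-ranked passages, rather than the whole corpus, keeps the prompt short, which is one plausible reading of how retrieval relates to the token usage optimization mentioned above. A production system would more likely use dense embeddings and a vector index in place of the word-count retriever shown here.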
Copyright (c) 2024 Herdian Tohir, Nita Merlina, Muhammad Haris
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.