AUTOMATION OF THE BERT AND RESNET50 MODEL INFERENCE CONFIGURATION ANALYSIS PROCESS
Abstract
Inference is the process of using models to make predictions on new data; its performance is measured in terms of throughput, latency, GPU memory usage, and GPU power usage. The models used are BERT and ResNet50. The right configuration can maximise inference performance, so configuration analysis must be carried out to determine which configuration suits model inference. The main challenge of this analysis lies in its time-intensive nature and inherent complexity, which make it far from a simple task. To ease the analysis, an automation programme was built. The automation programme analyses the BERT model inference configuration across 10 configurations, bert-large_config_0 to bert-large_config_9; the result is that the right configuration is bert-large_config_2, yielding a throughput of 12.8 infer/sec with a latency of 618 ms. The ResNet50 model is divided into 5 configurations, resnet50_config_0 to resnet50_config_4; here the right configuration is resnet50_config_1, which produces a throughput of 120.6 infer/sec with a latency of 60.9 ms. The automation programme has the benefit of simplifying the process of analysing inference configurations.
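As a rough sketch of what such an automation programme can look like, the Python snippet below sweeps the generated configuration variants with NVIDIA's perf_analyzer tool and keeps the variant with the highest throughput under a latency budget. The config names come from the paper; the concurrency setting, the latency budget, and the output-parsing pattern are illustrative assumptions, not the authors' actual implementation.

import re
import subprocess

# Candidate configuration variants as named in the paper.
CONFIGS = [f"bert-large_config_{i}" for i in range(10)]
LATENCY_BUDGET_MS = 1000.0  # assumed service-level objective, not from the paper

def profile(model_name: str) -> tuple[float, float]:
    """Run perf_analyzer once and return (throughput in infer/sec, latency in ms)."""
    out = subprocess.run(
        ["perf_analyzer", "-m", model_name, "--concurrency-range", "4:4"],
        capture_output=True, text=True, check=True,
    ).stdout
    # perf_analyzer summarises each run with a line such as:
    #   Concurrency: 4, throughput: 12.8 infer/sec, latency 618000 usec
    m = re.search(r"throughput: ([\d.]+) infer/sec, latency (\d+) usec", out)
    if m is None:
        raise RuntimeError(f"could not parse perf_analyzer output for {model_name}")
    return float(m.group(1)), float(m.group(2)) / 1000.0  # usec -> ms

def best_config() -> tuple[str, float, float]:
    """Return the admissible configuration with the highest throughput."""
    results = []
    for name in CONFIGS:
        throughput, latency = profile(name)
        if latency <= LATENCY_BUDGET_MS:  # discard configs that miss the budget
            results.append((name, throughput, latency))
    return max(results, key=lambda r: r[1])

if __name__ == "__main__":
    name, throughput, latency = best_config()
    print(f"best: {name} ({throughput} infer/sec, {latency} ms)")

Ranking by throughput subject to a latency constraint is one common way to compare inference configurations; swapping the objective for GPU memory or power usage, which the paper also measures, only changes the key passed to max.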
Copyright (c) 2024 Medi Noviana, Sunny Arief Sudiro
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.