Real-Time Workload Prediction and Resource Optimization for Parallel Heterogeneous High-Performance Computing Systems Architectures
DOI: https://doi.org/10.71346/utj.v1i1.11

Keywords: Adaptive resource management, Heterogeneous parallel architectures, Task scheduling, Machine learning-driven optimization, Resource allocation, Data placement, Energy efficiency, Fault tolerance, Workload prediction, High-performance computing

Abstract
Rapid advances in heterogeneous parallel architectures comprising CPUs, GPUs, and FPGAs have introduced significant challenges for efficient resource management in high-performance computing systems. Static and heuristic-based approaches lack the adaptability required to handle varying workloads and hardware configurations, resulting in suboptimal performance and energy inefficiency. This research proposes a machine learning-driven adaptive resource management framework that dynamically optimizes task scheduling, resource allocation, and data placement. The framework employs regression models and reinforcement learning algorithms to predict workload behaviors, resource utilization, and task execution times in real time. Experimental results on a heterogeneous testbed demonstrate a 21% reduction in task execution time, an 18% improvement in energy efficiency, and a 38% decrease in fault recovery time compared with conventional methods. These findings highlight the framework's ability to improve resource utilization while maintaining reliability and minimizing energy overhead. The work advances the field by introducing a unified approach that integrates machine learning for runtime optimization across heterogeneous systems. Practical implications include applicability to large-scale scientific simulations and deep learning tasks, where adaptive resource management is critical. Future work can focus on improving prediction accuracy through advanced deep learning techniques and on extending the framework to emerging hardware accelerators and edge computing environments.
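To make the prediction-then-scheduling idea concrete, the sketch below fits a simple least-squares regression of task execution time on task size and device type, then greedily assigns a task to the device with the lowest predicted runtime. This is an illustrative toy, not the authors' implementation: the feature set, the CPU/GPU device labels, and the training data are all hypothetical, and the paper's actual models (including its reinforcement learning component) are not reproduced here.

```python
# Hypothetical sketch: least-squares runtime prediction feeding a greedy
# device-selection step. All data and feature choices are illustrative.
import numpy as np

# Toy training data: columns are (task_size, is_gpu); targets are
# observed runtimes in seconds.
X = np.array([
    [1.0, 0.0], [2.0, 0.0], [4.0, 0.0],   # CPU runs scale roughly linearly
    [1.0, 1.0], [2.0, 1.0], [4.0, 1.0],   # GPU runs are faster per unit size
], dtype=float)
y = np.array([1.1, 2.0, 4.2, 0.6, 0.9, 1.6])

# Fit runtime ~ w0 + w1*size + w2*is_gpu by ordinary least squares.
A = np.hstack([np.ones((len(X), 1)), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(size: float, is_gpu: float) -> float:
    """Predicted execution time for a task of the given size on a device."""
    return float(w @ [1.0, size, is_gpu])

def schedule(task_size: float) -> tuple[str, float]:
    """Greedily pick the device with the lower predicted execution time."""
    cpu_t = predict(task_size, 0.0)
    gpu_t = predict(task_size, 1.0)
    return ("gpu", gpu_t) if gpu_t < cpu_t else ("cpu", cpu_t)

device, t = schedule(3.0)
print(device, t)
```

In a real system the model would be retrained or updated online as new execution measurements arrive, which is what lets the scheduler adapt to workload and hardware variation at runtime.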
License
Copyright (c) 2025 Ayesha Aslam, Zhumakhanova Darya Anuarovna

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Authors retain copyright for all articles published in CrossLink Studies journals. These articles are made freely available under a Creative Commons CC BY 4.0 license, which allows unrestricted downloading and reading by the public.