Large-Model-Driven Supercomputer Center Cooling System Intelligent Operations and Maintenance: Mechanism-Based Digital Twin and Knowledge-Augmented Multi-Agent Collaboration
ID: 77  Access: attendees only  Updated: 2025-11-10 11:40:47  Views: 123  Oral presentation

Presentation start: November 23, 2025, 08:30 (Asia/Shanghai)

Duration: 20 min

Session: [S4] Parallel Session 4 [S4-2] Parallel Session 4-23 AM

Abstract
An efficient and reliable cold-source supply is vital for large-scale supercomputing systems, especially during batch computing or model training, when temperature fluctuations directly affect throughput and hardware lifespan. Conventional cooling strategies often struggle to adapt to the dynamic and complex thermal loads of modern high-performance computing (HPC), which leads to energy inefficiency and potential hardware failures. This paper introduces DCAIM-GPT, a mechanism-knowledge hybrid framework that integrates high-fidelity digital twins with a multi-agent decision engine powered by enhanced Retrieval-Augmented Generation (RAG). The framework autonomously optimizes critical operational parameters, including but not limited to the number of operating pumps, pump operating frequencies, fan speeds, and hierarchical cooling tower operation modes, under fluctuating computational workloads and outdoor temperature and humidity conditions. A large-model-driven multi-agent iterative mechanism determines the optimal parameters while keeping the supercomputer's cooling source within safe operating limits. Experiments in digital twin environments show that DCAIM-GPT consistently maintains cooling safety and improves energy efficiency under both dynamic and extreme conditions, as evidenced by stable power usage effectiveness (PUE) and an improved coefficient of performance (COP). This research provides a scalable, general solution for efficient and reliable large-model-empowered supercomputing operations, and lays a foundation for broader vertical industrial large-model applications across the supercomputing domain.
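The abstract evaluates results through PUE and COP and describes choosing operating parameters subject to a safety limit. The following Python sketch only illustrates those two standard metrics and a simple safety-filtered selection; it is not the DCAIM-GPT method. The `Candidate` fields, the `pick_safe_candidate` helper, the supply-temperature limit, and all numbers are hypothetical assumptions made for illustration, with the predicted temperatures standing in for the output of a digital twin.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power usage effectiveness: total facility power divided by IT power."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw


def cop(heat_removed_kw: float, cooling_power_kw: float) -> float:
    """Coefficient of performance: heat removed per unit of cooling power."""
    if cooling_power_kw <= 0:
        raise ValueError("cooling power must be positive")
    return heat_removed_kw / cooling_power_kw


@dataclass(frozen=True)
class Candidate:
    """Hypothetical operating point; values would come from a simulator."""
    pumps_on: int
    pump_hz: float
    supply_temp_c: float      # predicted coolant supply temperature
    cooling_power_kw: float   # predicted electrical power of the cooling plant


def pick_safe_candidate(
    candidates: Sequence[Candidate],
    it_load_kw: float,
    max_supply_temp_c: float,
) -> Optional[Candidate]:
    """Return the highest-COP candidate that respects the temperature limit."""
    safe = [c for c in candidates if c.supply_temp_c <= max_supply_temp_c]
    if not safe:
        return None  # no candidate is safe; the caller must fall back
    return max(safe, key=lambda c: cop(it_load_kw, c.cooling_power_kw))


# Illustrative numbers only (not taken from the paper).
it_load_kw = 1000.0
candidates = [
    Candidate(pumps_on=2, pump_hz=35.0, supply_temp_c=19.5, cooling_power_kw=150.0),
    Candidate(pumps_on=2, pump_hz=40.0, supply_temp_c=18.0, cooling_power_kw=180.0),
    Candidate(pumps_on=3, pump_hz=45.0, supply_temp_c=16.5, cooling_power_kw=240.0),
]
best = pick_safe_candidate(candidates, it_load_kw, max_supply_temp_c=18.0)
print(best)
print(pue(it_load_kw + 200.0 + 50.0, it_load_kw))  # 1.25
print(cop(it_load_kw, 200.0))                      # 5.0
```

In this toy example the cheapest candidate is rejected because its predicted supply temperature exceeds the assumed limit, so the selection falls to the most efficient candidate among the safe ones.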
Keywords
digital twin, multi-agent, large language models (LLMs), retrieval-augmented generation, energy efficiency, supercomputer center cooling management
Presenter
Yutong Xu
Student, University of Electronic Science and Technology of China

Authors
Yutong Xu, University of Electronic Science and Technology of China
Jiacheng Dai, University of Electronic Science and Technology of China
Liyuan Ren, Institute of Applied Physics and Computational Mathematics
Linping Wu, Institute of Applied Physics and Computational Mathematics
Ying Li, Institute of Applied Physics and Computational Mathematics
Huan Wang, City University of Hong Kong
Zhiliang Liu, University of Electronic Science and Technology of China
Important Dates
  • Conference dates: November 21 to 23, 2025
  • Draft submission deadline: October 20, 2025
  • Registration deadline: November 23, 2025


Hosts
IEEE Instrumentation and Measurement Society
South China University of Technology
Organizer
South China University of Technology