Researchers at the University of Michigan have developed an innovative optimization framework that could dramatically reduce the energy demands of training deep learning models, which power many of today’s artificial intelligence systems.
The open-source optimization framework, called Zeus, studies deep learning models during the training process to identify the optimal balance between energy consumption and training speed.
Deep learning models have seen a surge in popularity in recent years, powering a range of applications from image-generation models and expressive chatbots to recommender systems for platforms like TikTok and Amazon. However, the energy consumption associated with training these models is substantial and has significant environmental implications.
“At extreme scales, training the GPT-3 model just once consumes 1,287 MWh, which is enough to supply an average U.S. household for 120 years,” said Professor Mosharaf Chowdhury.
With the new Zeus energy optimization framework, Chowdhury and his team aim to reduce energy consumption figures like these by up to 75 percent without the need for new hardware and with only minor impacts on the time it takes to train a model. The framework was presented at the 2023 USENIX Symposium on Networked Systems Design and Implementation (NSDI) in Boston.
As cloud computing continues to grow and its emissions outpace those of commercial aviation, the increased climate burden from artificial intelligence is a pressing concern.
“Existing work primarily focuses on optimizing deep learning training for faster completion, often without considering the impact on energy efficiency,” said study first author Jae-Won Chung, a doctoral student in computer science and engineering.
“We discovered that the energy we’re pouring into GPUs is giving diminishing returns, which allows us to reduce energy consumption significantly, with relatively little slowdown.”
Deep learning is a subset of machine learning that relies on multilayered, artificial neural networks, also known as deep neural networks (DNNs), to tackle a wide array of tasks. These models are highly complex and learn from some of the largest data sets ever used in machine learning.
As a result, they benefit significantly from the parallel processing capabilities of graphics processing units (GPUs), which account for 70 percent of the power consumed during the training process.
Zeus reduces energy use by adjusting two critical software parameters in real time: the GPU power limit and the deep learning model’s batch size. The GPU power limit caps how much power the GPU may draw. Lowering it cuts energy consumption but slows down the model’s training until the setting is adjusted again.
The batch size parameter, on the other hand, determines how many samples from the training data the model processes before updating its internal representations. Larger batch sizes can decrease training time but at the cost of increased energy consumption.
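To make the tradeoff between these two settings concrete, here is a minimal Python sketch. It is an illustration only and does not use Zeus's actual interface. The `Measurement` class, the `cost` function, the weight `eta`, and every number in `MEASUREMENTS` are assumptions invented for this example. The sketch scores each combination of batch size and power limit with a weighted sum of energy and training time, then picks the lowest score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One hypothetical training run under a given configuration."""
    batch_size: int
    power_limit_w: int    # GPU power cap, in watts
    avg_power_w: float    # average power actually drawn under that cap
    train_time_s: float   # time needed to reach the target accuracy

    @property
    def energy_j(self) -> float:
        # Energy is average power multiplied by elapsed time.
        return self.avg_power_w * self.train_time_s


def cost(m: Measurement, eta: float, max_power_w: float) -> float:
    """Weighted sum of energy and time.

    eta = 1.0 cares only about energy, eta = 0.0 only about time.
    Time is multiplied by max_power_w so both terms are in joules.
    """
    return eta * m.energy_j + (1.0 - eta) * max_power_w * m.train_time_s


def best_config(measurements, eta: float, max_power_w: float) -> Measurement:
    """Return the measured configuration with the lowest cost."""
    return min(measurements, key=lambda m: cost(m, eta, max_power_w))


# Made-up numbers: the larger batch trains faster but uses more energy,
# and the lower power cap saves energy but takes longer.
MAX_POWER_W = 300.0
MEASUREMENTS = [
    Measurement(64, 300, 250.0, 1000.0),
    Measurement(64, 200, 190.0, 1080.0),
    Measurement(128, 300, 295.0, 900.0),
    Measurement(128, 200, 199.0, 1040.0),
]

if __name__ == "__main__":
    for eta in (0.0, 0.5, 1.0):
        m = best_config(MEASUREMENTS, eta, MAX_POWER_W)
        print(eta, m.batch_size, m.power_limit_w)
```

With these invented numbers, a user who cares only about speed would get the large batch at full power, a user who cares only about energy would get the small batch at the lower cap, and an even weighting lands in between.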
The unique ability of Zeus to tune these settings in real time allows it to find the optimal tradeoff point between energy usage and training time. According to Jie You, a recent doctoral graduate in computer science and engineering and co-lead author of the study, the repetitive nature of machine learning enables Zeus to learn about the DNN’s behavior across different recurrences, making it highly effective in practice.
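The idea of learning from repeated training jobs can be sketched in a few lines of Python. This is a deliberately simple explore-then-exploit loop, chosen for readability, and should not be read as the algorithm Zeus itself uses. The function names, the candidate batch sizes, and the simulated job costs are all hypothetical.

```python
import random


def tune_across_recurrences(batch_sizes, run_job, recurrences):
    """Try each batch size once, then reuse the one with the lowest average cost.

    run_job(batch_size) is assumed to train the model once and return an
    observed cost (for example, a blend of energy and time).
    """
    totals = {b: 0.0 for b in batch_sizes}
    counts = {b: 0 for b in batch_sizes}
    history = []
    for i in range(recurrences):
        if i < len(batch_sizes):
            # Exploration: the first few recurrences each try a new setting.
            choice = batch_sizes[i]
        else:
            # Exploitation: pick the best average seen so far.
            choice = min(batch_sizes, key=lambda b: totals[b] / counts[b])
        observed = run_job(choice)
        totals[choice] += observed
        counts[choice] += 1
        history.append(choice)
    return history


def make_fake_job(seed=0):
    """Simulated training job with made-up costs and a little noise."""
    rng = random.Random(seed)
    true_cost = {32: 10.0, 64: 7.0, 128: 9.0}  # hypothetical values

    def run_job(batch_size):
        return true_cost[batch_size] + rng.uniform(-0.5, 0.5)

    return run_job


history = tune_across_recurrences([32, 64, 128], make_fake_job(), recurrences=10)
print(history)  # [32, 64, 128, 64, 64, 64, 64, 64, 64, 64]
```

Because the same job recurs, the cost of the early exploratory runs is spread across every later run that benefits from the better setting.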
Zeus stands out as the first framework designed to integrate seamlessly into existing workflows for various machine learning tasks and GPUs. This innovative solution reduces energy consumption without necessitating changes to a system’s hardware or data center infrastructure.
To further reduce the carbon footprint of DNN training, the team has also developed complementary software called Chase. This software prioritizes speed when low-carbon energy is available and opts for efficiency at the expense of speed during peak times, which are more likely to involve carbon-intensive energy generation like coal.
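The switching behavior described here can be illustrated with a short Python sketch. It is not Chase's code or interface. The function names, the 300 gCO2/kWh threshold, and the forecast values are assumptions made up for this example.

```python
def choose_mode(carbon_intensity_g_per_kwh, threshold_g_per_kwh=300.0):
    """Favor speed on a clean grid and efficiency on a carbon-intensive one.

    The threshold is a hypothetical cutoff, not a value from the study.
    """
    if carbon_intensity_g_per_kwh < threshold_g_per_kwh:
        return "speed"
    return "efficiency"


def emissions_g(energy_kwh, carbon_intensity_g_per_kwh):
    """Emissions are energy used multiplied by the grid's carbon intensity."""
    return energy_kwh * carbon_intensity_g_per_kwh


# Made-up grid forecast: (time of day, grams of CO2 per kWh).
forecast = [("02:00", 120.0), ("12:00", 280.0), ("19:00", 650.0)]
schedule = {hour: choose_mode(intensity) for hour, intensity in forecast}
print(schedule)
```

In this toy forecast, training would run at full speed overnight and at midday, then switch to the energy-saving setting during the carbon-intensive evening peak.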
Chase won second place at last year’s CarbonHack hackathon and will be presented on May 4 at the International Conference on Learning Representations Workshop.
Study co-author Zhenning Yang emphasized the need for solutions that don’t conflict with the realistic constraints of DNN training, such as data regulations or the requirement for up-to-date data.
“Our aim is to design and implement solutions that do not conflict with these realistic constraints, while still reducing the carbon footprint of DNN training,” Yang said.
Zeus represents a groundbreaking step forward in addressing the energy efficiency concerns associated with training deep learning models. By optimizing the tradeoff between energy consumption and training speed, the framework has the potential to significantly reduce the environmental impact of artificial intelligence systems without sacrificing performance.
The study was supported in part by National Science Foundation grants CNS-1909067 and CNS-2104243, VMware and the Kwanjeong Educational Foundation, and computing credits provided by CloudLab and Chameleon Cloud. As the AI industry continues to grow, frameworks like Zeus and software like Chase can play a crucial role in promoting sustainability and minimizing the environmental impact of AI technologies.