The arrival of large language models (LLMs) has sparked a revolution in natural language processing, capturing worldwide attention with capabilities that stem from the massive number of parameters they employ. These LLMs, epitomized by the transformative power of dense transformer models, have not only broken records in accuracy but have also become indispensable assets in knowledge-intensive tasks. In recent years, the size of dense transformer models has grown from 1.5B parameters (GPT-2) to 540B (PaLM), an unprecedented trajectory of scaling toward linguistic mastery.
While the potential of LLMs is undeniable, a critical challenge arises from their immense parameter counts, which overwhelm even the most powerful GPUs, currently peaking at 80GB of memory. That capacity is insufficient to hold the parameters together with the optimizer states required for stochastic gradient descent-based optimization. To host such a model, one can aggregate device memory across multiple GPUs, but it takes 32 NVIDIA A100 GPUs to fit a model with 100 billion parameters for fine-tuning. This approach introduces prohibitive costs for most academic researchers, who rarely have the budget for many high-end GPU servers.
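To see why a single GPU falls short, a rough back-of-envelope estimate helps. The sketch below assumes the common mixed-precision Adam accounting of roughly 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance); the exact footprint depends on the optimizer and framework, so treat these numbers as illustrative.

```python
# Back-of-envelope memory estimate for fine-tuning a 100B-parameter model.
# Assumption: mixed-precision Adam, ~16 bytes/param (2B fp16 weights +
# 2B fp16 gradients + 4B fp32 master weights + 4B momentum + 4B variance).
# Activations and buffers come on top, so real usage is higher still.

PARAMS = 100e9          # 100 billion parameters
BYTES_PER_PARAM = 16    # weights + gradients + Adam optimizer states
GPU_MEMORY_GB = 80      # a top-end GPU such as the A100-80GB

total_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"Model + optimizer states: ~{total_gb:,.0f} GB")              # ~1,600 GB
print(f"GPUs needed (memory only): {total_gb / GPU_MEMORY_GB:.0f}+") # ~20+, before activations
```

With activations and working buffers included, the requirement grows to the 32 A100s cited above.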
Researchers from Zhejiang University proposed Fuyou, a low-cost training framework that enables efficient fine-tuning of 100B-scale models on a low-end server with a low-end GPU and limited CPU memory capacity. It is implemented on PyTorch, a popular deep-learning framework. Compared with frameworks such as ZeRO-Infinity, Fuyou can fine-tune GPT-3 175B on a consumer RTX 4090 GPU with high GPU utilization, a setting in which ZeRO-Infinity fails to fine-tune at all.
The main focus lies on integrating SSD-CPU communication as a pivotal optimization dimension, strategically coordinating computation and data swapping to unlock the full potential of GPU utilization. This effort unfolds through three innovations (see the sketch after this list):
- A synchronous out-of-core CPU optimizer that overlaps with backward propagation to maximize GPU utilization.
- A fully pipelined GPU-CPU-SSD activation swapping mechanism that allows fine-tuning of significantly larger models.
- An automatic activation swapping management scheme that determines the optimal amount of activations to swap so as to minimize epoch time.
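Fuyou's implementation is not reproduced here, but the core idea behind overlapped activation swapping can be illustrated with a minimal, generic PyTorch sketch: autograd's saved-tensors hooks move each saved activation to pinned CPU memory on a side CUDA stream during the forward pass, then fetch it back for backward. Fuyou's actual mechanism extends this into a fully pipelined GPU-CPU-SSD path with automatic management; everything below is a simplified illustration, not the paper's code.

```python
import torch

# Sketch: offload saved activations to pinned CPU memory during forward,
# overlapping the copies with compute via a side CUDA stream.

copy_stream = torch.cuda.Stream()  # side stream so copies overlap compute

def pack(t):
    if not t.is_cuda:                      # only offload GPU tensors
        return (False, t)
    buf = torch.empty(t.size(), dtype=t.dtype, device="cpu", pin_memory=True)
    copy_stream.wait_stream(torch.cuda.current_stream())  # wait for producer op
    with torch.cuda.stream(copy_stream):
        buf.copy_(t, non_blocking=True)    # async device-to-host copy
    t.record_stream(copy_stream)           # keep t alive until the copy is done
    return (True, buf)

def unpack(packed):
    offloaded, t = packed
    if not offloaded:
        return t
    torch.cuda.current_stream().wait_stream(copy_stream)  # copy must finish
    return t.to("cuda", non_blocking=True)  # bring activation back for backward

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()
x = torch.randn(8, 4096, device="cuda", requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    loss = model(x).sum()
loss.backward()  # activations stream back from CPU as they are needed
```

Fuyou adds a further tier, spilling from pinned CPU memory to SSD, and decides automatically how much to swap at each layer.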
In the dynamic realm of model fine-tuning, Fuyou emerges as a powerhouse, delivering exceptional performance whether on a cutting-edge A100-80GB or an RTX 4090 in a commodity server. When fine-tuning a GPT-3 175B model, Fuyou achieves 87 TFLOPS on the 4090 and 172 TFLOPS on the A100-80GB. It also reaches up to 3.47× the TFLOPS of ZeRO-Infinity when fine-tuning a GPT-3 13B model. To show how cheap SSDs can enhance training throughput per dollar, the cost-effectiveness of Fuyou is compared with Megatron-LM on DGX-2 nodes using tensor parallelism. Measuring throughput over the total cost of GPUs and SSDs in a server, Fuyou achieves at most 1.70× the cost-effectiveness of Megatron-LM.
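The cost-effectiveness metric here is simply training throughput divided by the hardware cost of GPUs and SSDs. A minimal illustration follows; all prices are hypothetical placeholders, not figures from the paper.

```python
# Cost-effectiveness = training throughput / (GPU cost + SSD cost),
# as in the Fuyou vs. Megatron-LM comparison. Prices below are made-up
# placeholders for illustration only.

def cost_effectiveness(tflops: float, gpu_cost_usd: float, ssd_cost_usd: float) -> float:
    """Training throughput per dollar spent on GPUs and SSDs."""
    return tflops / (gpu_cost_usd + ssd_cost_usd)

# Hypothetical example: cheap GPU + SSDs vs. an expensive multi-GPU server.
low_end = cost_effectiveness(tflops=87.0, gpu_cost_usd=2_000, ssd_cost_usd=500)
high_end = cost_effectiveness(tflops=300.0, gpu_cost_usd=60_000, ssd_cost_usd=0)
print(f"low-end server is {low_end / high_end:.2f}x more cost-effective")
```

The intuition is that even a lower absolute throughput can win on this metric when the hardware it runs on is an order of magnitude cheaper.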
In conclusion, this paper proposed Fuyou, a low-cost training framework that enables efficient fine-tuning of 100B-scale models on a low-end server with a low-end GPU and limited CPU memory capacity, implemented on PyTorch. It achieves 87 and 172 TFLOPS when fine-tuning GPT-3 175B. Besides, it reaches up to 3.42× and 6.73× the TFLOPS of ZeRO-Infinity and Colossal-AI, respectively, when fine-tuning GPT-3 13B. Fuyou also achieves at most 1.70× the cost-effectiveness of Megatron-LM.
Check out the Paper. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.