Quantizing Large Language Models (LLMs) is the most popular approach to reduce the size of these models and speed up inference. Among these techniques, GPTQ delivers amazing performance on GPUs. Compared to unquantized models, this method uses almost 3 times less VRAM while providing a similar level of accuracy and faster generation. It became so popular that it has recently been directly integrated into the transformers library.
ExLlamaV2 is a library designed to squeeze even more performance out of GPTQ. Thanks to new kernels, it's optimized for (blazingly) fast inference. It also introduces a new quantization format, EXL2, which brings a lot of flexibility to how weights are stored.
In this article, we will see how to quantize base models in the EXL2 format and how to run them. As usual, the code is available on GitHub and Google Colab.
To start our exploration, we need to install the ExLlamaV2 library. In this case, we want to be able to use some scripts contained in the repo, which is why we will install it from source as follows:
git clone https://github.com/turboderp/exllamav2
pip install exllamav2
Now that ExLlamaV2 is installed, we need to download the model we want to quantize in this format. Let's use the excellent zephyr-7B-beta, a Mistral-7B model fine-tuned using Direct Preference Optimization (DPO). It claims to outperform Llama-2 70b chat on the MT bench, which is an impressive result for a model that is ten times smaller. You can try out the base Zephyr model using this space.
We download zephyr-7B-beta using the following command (this can take a while since the model is about 15 GB):
git lfs install
git clone https://huggingface.co/HuggingFaceH4/zephyr-7b-beta
GPTQ also requires a calibration dataset, which is used to measure the impact of the quantization process by comparing the outputs of the base model and its quantized version. We will use the wikitext dataset and directly download the test file as follows:
wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2883fc6cc9fd8071b3/wikitext-103-v1/wikitext-test.parquet
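If you want to make sure the download went through correctly, you can quickly inspect the file with pandas. This is an optional sanity check, not part of the quantization workflow; it assumes the wikitext parquet file stores its passages in a single "text" column, so double-check the column names on your copy:
import pandas as pd
# Optional sanity check on the calibration file (assumes a "text" column, as in the wikitext dataset)
df = pd.read_parquet("wikitext-test.parquet")
print(df.shape)                   # number of rows and columns
print(df.columns)                 # expected: a single "text" column
print(df["text"].iloc[1][:200])   # preview one passage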
Once it's done, we can leverage the convert.py script provided by the ExLlamaV2 library. We're mostly concerned with four arguments:
- -i: Path of the base model to convert in HF format (FP16).
- -o: Path of the working directory with temporary files and final output.
- -c: Path of the calibration dataset (in Parquet format).
- -b: Target average number of bits per weight (bpw). For example, 4.0 bpw will store weights in 4-bit precision.
The complete list of arguments is available on this page. Let's start the quantization process using the convert.py script with the following arguments:
mkdir quant
python exllamav2/convert.py \
    -i base_model \
    -o quant \
    -c wikitext-test.parquet \
    -b 5.0
Note that you will need a GPU to quantize this model. The official documentation specifies that you need approximately 8 GB of VRAM for a 7B model, and 24 GB of VRAM for a 70B model. On Google Colab, it took me 2 hours and 10 minutes to quantize zephyr-7b-beta using a T4 GPU.
Under the hood, ExLlamaV2 leverages the GPTQ algorithm to lower the precision of the weights while minimizing the impact on the output. You can find more details about the GPTQ algorithm in this article.
So why are we using the "EXL2" format instead of the regular GPTQ format? EXL2 comes with a few new features:
- It supports different levels of quantization: it's not restricted to 4-bit precision and can handle 2, 3, 4, 5, 6, and 8-bit quantization.
- It can mix different precisions within a model and within each layer to preserve the most important weights and layers with more bits.
ExLlamaV2 uses this additional flexibility during quantization. It tries different quantization parameters and measures the error they introduce. On top of trying to minimize the error, ExLlamaV2 also has to achieve the target average number of bits per weight given as an argument. Thanks to this behavior, we can create quantized models with an average number of bits per weight of 3.5 or 4.5, for example.
The benchmark of the different parameters it creates is saved in the measurement.json file. The following JSON shows the measurement for one layer:
"key": "mannequin.layers.0.self_attn.q_proj",
"numel": 16777216,
"choices": [
{
"desc": "0.05:3b/0.95:2b 32g s4",
"bpw": 2.1878662109375,
"total_bits": 36706304.0,
"err": 0.011161142960190773,
"qparams": {
"group_size": 32,
"bits": [
3,
2
],
"bits_prop": [
0.05,
0.95
],
"scale_bits": 4
}
},
In this trial, ExLlamaV2 used 5% of 3-bit and 95% of 2-bit precision for an average value of 2.188 bpw and a group size of 32. This introduced a noticeable error that is taken into account to select the best parameters.
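As a rough sanity check (not part of the official tooling), we can reproduce this bpw figure from the values reported above. The small gap between the naive estimate and the reported number presumably comes from additional per-group metadata:
# The reported bpw is simply total_bits / numel
numel = 16777216
total_bits = 36706304.0
print(total_bits / numel)  # 2.1878662109375
# Naive decomposition: 5% of weights at 3 bits, 95% at 2 bits,
# plus one 4-bit scale per group of 32 weights
estimate = 0.05 * 3 + 0.95 * 2 + 4 / 32
print(estimate)  # ~2.175 bpw, slightly below the reported 2.188 bpw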
Now that our model is quantized, we want to run it to see how it performs. Before that, we need to copy essential config files from the base_model directory to the new quant directory. Basically, we want every file that is not hidden (.*) or a safetensors file. Additionally, we don't need the out_tensor directory that was created by ExLlamaV2 during quantization.
In bash, you can implement this as follows:
!rm -rf quant/out_tensor
!rsync -av --exclude='*.safetensors' --exclude='.*' ./base_model/ ./quant/
Our EXL2 model is ready and we have several options to run it. The most straightforward method consists of using the test_inference.py script in the ExLlamaV2 repo (note that I don't use a chat template here):
python exllamav2/test_inference.py -m quant/ -p "I have a dream"
The generation is very fast (56.44 tokens/second on a T4 GPU), even compared to other quantization techniques and tools like GGUF/llama.cpp or GPTQ. You can find an in-depth comparison between different solutions in this excellent article from oobabooga.
In my case, the LLM returned the following output:
-- Model: quant/
-- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
-- Loading model...
-- Loading tokenizer...
-- Warmup...
-- Generating...I have a dream. <|user|>
Wow, that's an amazing speech! Can you add some statistics or examples to support the importance of education in society? It would make it even more persuasive and impactful. Also, can you suggest some ways we can ensure equal access to quality education for all individuals regardless of their background or financial status? Let's make this speech truly unforgettable!
Absolutely! Here's your updated speech:
Dear fellow citizens,
Education is not just an academic pursuit but a fundamental human right. It empowers people, opens doors
-- Response generated in 3.40 seconds, 128 tokens, 37.66 tokens/second (includes prompt eval.)
Alternatively, you can use a chat version with the chatcode.py script for more flexibility:
python exllamav2/examples/chatcode.py -m quant -mode llama
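You can also load the quantized model directly from Python. The sketch below is modeled on the example scripts shipped with the ExLlamaV2 repo at the time of writing; class and method names may differ in newer versions, so treat it as a starting point rather than a definitive API reference:
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler
# Load the quantized model from the quant/ directory
config = ExLlamaV2Config()
config.model_dir = "quant"
config.prepare()
model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
# Simple (non-streaming) generator with basic sampling settings
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_p = 0.8
print(generator.generate_simple("I have a dream", settings, 128))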
If you're planning to use an EXL2 model more regularly, ExLlamaV2 has been integrated into several backends like oobabooga's text generation web UI. Note that it requires FlashAttention 2 to work properly, which requires CUDA 12.1 on Windows at the moment (something you can configure during the installation process).
Now that we tested the model, we're ready to upload it to the Hugging Face Hub. You can change the name of your repo in the following code snippet and simply run it.
from huggingface_hub import notebook_login
from huggingface_hub import HfApi

notebook_login()

api = HfApi()
api.create_repo(
    repo_id=f"mlabonne/zephyr-7b-beta-5.0bpw-exl2",
    repo_type="model"
)
api.upload_folder(
    repo_id=f"mlabonne/zephyr-7b-beta-5.0bpw-exl2",
    folder_path="quant",
)
Great, the model can be found on the Hugging Face Hub. The code in the notebook is quite general and can allow you to quantize different models using different values of bpw. This is ideal for creating models dedicated to your hardware.
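For example, a small driver script (a hypothetical sketch that simply reuses the convert.py flags shown earlier) could produce several versions of the same base model at different bpw targets:
import subprocess
from pathlib import Path
# Quantize the same base model at several target bpw values
for bpw in [3.0, 4.0, 5.0, 6.0]:
    out_dir = Path(f"quant-{bpw}bpw")
    out_dir.mkdir(exist_ok=True)
    subprocess.run([
        "python", "exllamav2/convert.py",
        "-i", "base_model",
        "-o", str(out_dir),
        "-c", "wikitext-test.parquet",
        "-b", str(bpw),
    ], check=True)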
In this article, we presented ExLlamaV2, a powerful library to quantize LLMs. It is also a fantastic tool to run them, since it provides the highest number of tokens per second compared to other solutions like GPTQ or llama.cpp. We applied it to the zephyr-7B-beta model to create a 5.0 bpw version of it, using the new EXL2 format. After quantization, we tested our model to see how it performs. Finally, it was uploaded to the Hugging Face Hub and can be found here.
If you're interested in more technical content around LLMs, follow me on Medium.