Generated using ideogram.ai with the prompt: “A photo of LLAMA with the banner written “QLora” on it., 3d render, wildlife photography”
It was a dream to fine-tune a 7B model on a single GPU for free on Google Colab until recently. On 23 May 2023, Tim Dettmers and his team submitted a revolutionary paper[1] on fine-tuning Quantized Large Language Models.
A quantized model is a model whose weights are stored in a data type of lower precision than the data type it was trained in. For example, if you train a model in 32-bit floating point, you can convert those weights to a lower data type such as 16/8/4-bit floating point, such that there is minimal to no effect on the performance of the model.
Source [2]
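To make this concrete, here is a minimal, purely illustrative sketch of absmax quantization in PyTorch. It is much simpler than the NF4 scheme used later in this post, and the variable names are made up for the example:
import torch
w_fp32 = torch.randn(6)                               # "trained" 32-bit weights
scale = 127 / w_fp32.abs().max()                      # absmax scaling factor
w_int8 = torch.round(w_fp32 * scale).to(torch.int8)   # quantize to 8-bit integers
w_dequant = w_int8.float() / scale                    # dequantize for compute
print(w_fp32)
print(w_dequant)                                      # nearly identical values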
We are not going to talk much about the theory of quantization here; you can refer to the excellent blog posts by Hugging Face[2][3] and a good YouTube video[4] by Tim Dettmers himself to understand the underlying theory.
In short, it can be said that QLoRA means:
Fine-tuning a Quantized Large Language Model using Low-Rank Adaptation matrices (LoRA)[5]
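If LoRA is new to you, its core idea fits in a few lines of PyTorch: the pretrained weight stays frozen, and only two small low-rank matrices are trained. The following is a minimal sketch of that idea (with arbitrary dimensions), not the actual peft implementation:
import torch
d, k, r, alpha = 512, 512, 64, 16          # r and alpha mirror lora_r / lora_alpha used below
W = torch.randn(d, k)                      # frozen pretrained weight
A = torch.randn(r, k)                      # trainable, Gaussian init (as in the LoRA paper)
B = torch.zeros(d, r)                      # trainable, zero init so B @ A starts at zero
x = torch.randn(k)
h = W @ x + (alpha / r) * (B @ (A @ x))    # LoRA-adapted forward pass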
Let’s jump straight into the code:
It is important to understand that large language models are designed to take instructions; this was first introduced in the 2021 ACL paper[6]. The idea is simple: we give a language model an instruction, and it follows the instruction and performs that task. So the dataset that we want to fine-tune our model on should be in the instruct format, and if it is not, we can convert it.
One of the common formats is the instruct format. We will be using the Alpaca Prompt Template[7], which is:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
{response}
We will be using the SNLI dataset, which contains pairs of sentences and the relationship between them: whether they are a contradiction, an entailment of each other, or neutral. We will be using it to generate contradictions for a sentence using LLaMAv2. We can load this dataset simply using pandas.
import pandas as pd
df = pd.read_csv('snli_1.0_train_matched.csv')
df['gold_label'].value_counts().plot(kind='barh')
Labels Distribution
We can see a few random contradiction examples here.
df[df['gold_label'] == 'contradiction'].sample(10)[['sentence1', 'sentence2']]
Contradiction Examples from SNLI
Now we can create a small function that takes only the contradictory sentences and converts the dataset to the instruct format.
def convert_to_format(row):
    sentence1 = row['sentence1']
    sentence2 = row['sentence2']
    prompt = """Below is an instruction that describes a task paired with input that provides further context. Write a response that appropriately completes the request."""
    instruction = """Given the following sentence, your task is to generate the negation for it in the json format"""
    input = str(sentence1)
    response = f"""```json
{{'orignal_sentence': '{sentence1}', 'generated_negation': '{sentence2}'}}
```
"""
    if len(input.strip()) == 0:  # prompt + 2 new lines + ###instruction + new line + input + new line + ###response
        text = prompt + "\n\n### Instruction:\n" + instruction + "\n### Response:\n" + response
    else:
        text = prompt + "\n\n### Instruction:\n" + instruction + "\n### Input:\n" + input + "\n" + "\n### Response:\n" + response
    # we need 4 columns for autotrain: instruction, input, output, text
    return pd.Series([instruction, input, response, text])
new_df = df[df['gold_label'] == 'contradiction'][['sentence1', 'sentence2']].apply(convert_to_format, axis=1)
new_df.columns = ['instruction', 'input', 'output', 'text']
new_df.to_csv('snli_instruct.csv', index=False)
Here is an example of a sample data point:
"Under is an instruction that describes a job paired with enter that gives additional context. Write a response that appropriately completes the request.
### Instruction:
Given the next sentence, your job is to generate the negation for it within the json format
### Enter:
A pair taking part in with a little bit boy on the seashore.
### Response:
```json
{'orignal_sentence': 'A pair taking part in with a little bit boy on the seashore.', 'generated_negation': 'A pair watch a little bit lady play by herself on the seashore.'}
```
Now that we have our dataset in the correct format, let’s start with fine-tuning. Before starting, let’s install the necessary packages. We will be using accelerate and peft (Parameter-Efficient Fine-Tuning), combined with the Hugging Face bitsandbytes and transformers libraries.
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
You can upload the formatted dataset to Drive and load it in Colab.
from google.colab import drive
import pandas as pd
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/snli_instruct.csv')
You can convert it to the Hugging Face dataset format easily using the from_pandas method; this will be helpful in training the model.
from datasets import Dataset
dataset = Dataset.from_pandas(df)
We will be using the already quantized LLaMAv2 model provided by abhishek/llama-2-7b-hf-small-shards. Let’s define some hyperparameters and variables here:
# The model that you want to train from the Hugging Face hub
model_name = "abhishek/llama-2-7b-hf-small-shards"
# Fine-tuned model name
new_model = "llama-2-contradictor"
################################################################################
# QLoRA parameters
################################################################################
# LoRA attention dimension
lora_r = 64
# Alpha parameter for LoRA scaling
lora_alpha = 16
# Dropout probability for LoRA layers
lora_dropout = 0.1
################################################################################
# bitsandbytes parameters
################################################################################
# Activate 4-bit precision base model loading
use_4bit = True
# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"
# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"
# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False
################################################################################
# TrainingArguments parameters
################################################################################
# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"
# Number of training epochs
num_train_epochs = 1
# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False
# Batch size per GPU for training
per_device_train_batch_size = 4
# Batch size per GPU for evaluation
per_device_eval_batch_size = 4
# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1
# Enable gradient checkpointing
gradient_checkpointing = True
# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3
# Initial learning rate (AdamW optimizer)
learning_rate = 1e-5
# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001
# Optimizer to use
optim = "paged_adamw_32bit"
# Learning rate schedule
lr_scheduler_type = "cosine"
# Number of training steps (overrides num_train_epochs)
max_steps = -1
# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03
# Group sequences into batches with the same length
# Saves memory and speeds up training considerably
group_by_length = True
# Save checkpoint every X update steps
save_steps = 0
# Log every X update steps
logging_steps = 100
################################################################################
# SFT parameters
################################################################################
# Maximum sequence length to use
max_seq_length = None
# Pack multiple short examples in the same input sequence to increase efficiency
packing = False
# Load the entire model on GPU 0
device_map = {"": 0}
Most of these are fairly straightforward hyperparameters with these default values. You can always refer to the documentation for more details.
We can now simply use the BitsAndBytesConfig class to create the config for 4-bit fine-tuning.
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
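Optionally, before loading the model, you can check whether your GPU supports bfloat16 (this sanity check is a common addition in QLoRA notebooks, not a required step):
# If the GPU is Ampere or newer (compute capability >= 8), bf16 is available
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("Your GPU supports bfloat16: you can set bf16 = True for faster training")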
Now we can load the base model with the 4-bit BitsAndBytesConfig and the tokenizer for fine-tuning.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1
We can now create the LoRA config and set the training parameters.
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)
Now we can simply use the SFTTrainer, which is provided by trl from Hugging Face, to start the training.
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",  # this is the text column in the dataset
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)
# Train model
trainer.train()
# Save trained model
trainer.model.save_pretrained(new_model)
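Optionally, you can verify how small the trainable LoRA footprint is; since we passed a peft_config, SFTTrainer wraps the base model in a PeftModel, which exposes this helper (an optional check, not part of the original flow):
# Prints trainable vs. total parameter counts (only the LoRA matrices train)
trainer.model.print_trainable_parameters()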
This will start the training for the number of epochs you have set above. Once the model is trained, make sure to save it in Drive so that you can load it again (as you have to restart the session in Colab). You can store the model in Drive via the zip and mv commands.
!zip -r llama-contradictor.zip results llama-2-contradictor
!mv llama-contradictor.zip /content/drive/MyDrive
Now when you restart the Colab session, you can move it back into your session again.
!unzip /content/drive/MyDrive/llama-contradictor.zip -d .
You need to load the base model again and merge it with the fine-tuned LoRA matrices. This can be done using the merge_and_unload() function.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
base_model = AutoModelForCausalLM.from_pretrained(
    "abhishek/llama-2-7b-hf-small-shards",
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
model = PeftModel.from_pretrained(base_model, '/content/llama-2-contradictor')
model = model.merge_and_unload()
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
You can test your model by simply passing in inputs in the same prompt template that we have defined above.
prompt_template = """### Instruction:
Given the following sentence, your task is to generate the negation for it in the json format
### Input:
{}
### Response:
"""
sentence = "The weather forecast predicts a sunny day with a high temperature around 30 degrees Celsius, perfect for a day at the beach with friends and family."
input_sentence = prompt_template.format(sentence.strip())
result = pipe(input_sentence)
print(result)
### Instruction:
Given the following sentence, your task is to generate the negation for it in the json format
### Input:
The weather forecast predicts a sunny day with a high temperature around 30 degrees Celsius, perfect for a day at the beach with friends and family.
### Response:
```json
{
  "sentence": "The weather forecast predicts a sunny day with a high temperature around 30 degrees Celsius, perfect for a day at the beach with friends and family.",
  "negation": "The weather forecast predicts a rainy day with a low temperature around 10 degrees Celsius, not ideal for a day at the beach with friends and family."
}
```
There will be many occasions when the model keeps on predicting even after the response is generated, due to the token limit. In this case, you need to add a post-processing function that filters out the JSON part, which is what we need. This can be done using a simple regex.
import re
import json

def format_results(s):
    pattern = r'```json\n(.*?)\n```'
    # Find all occurrences of JSON objects in the string
    json_matches = re.findall(pattern, s, re.DOTALL)
    if not json_matches:
        # try to find the 2nd pattern
        pattern = r'{.*?"sentence":.*?"negation":.*?}'
        json_matches = re.findall(pattern, s)
    # Return the first JSON object found, or None if no match is found
    return json.loads(json_matches[0]) if json_matches else None
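For example, you can run it on the pipeline output from above (a text-generation pipeline returns a list of dicts with a generated_text key):
parsed = format_results(result[0]['generated_text'])
print(parsed)  # e.g. {'sentence': '...', 'negation': '...'}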
This will give you the required output instead of the model repeating random output tokens.
In this blog, you learned the basics of QLoRA, fine-tuning a LLaMAv2 model on Colab using QLoRA, instruction tuning, and a sample template from the Alpaca dataset that can be used to instruction-tune a model further.
References
[1]: QLoRA: Efficient Finetuning of Quantized LLMs, 23 May 2023, Tim Dettmers et al.
[2]: https://huggingface.co/blog/hf-bitsandbytes-integration
[3]: https://huggingface.co/blog/4bit-transformers-bitsandbytes
[4]: https://www.youtube.com/watch?v=y9PHWGOa8HA
[5]: https://arxiv.org/abs/2106.09685
[6]: https://aclanthology.org/2022.acl-long.244/
[7]: https://crfm.stanford.edu/2023/03/13/alpaca.html
[8]: Colab Notebook by @maximelabonne https://colab.research.google.com/drive/1PEQyJO1-f6j0S_XJ8DV50NkpzasXkrzd?usp=sharing
Ahmad Anis is a passionate Machine Learning Engineer and Researcher currently working at redbuffer.ai. Beyond his day job, Ahmad actively engages with the Machine Learning community. He serves as a regional lead for Cohere for AI, a nonprofit dedicated to open science, and is an AWS Community Builder. Ahmad is an active contributor on Stack Overflow, where he has 2300+ points. He has contributed to many well-known open-source projects, including Shap-E by OpenAI.