In the last post, we talked about what CausalLM is and how Hugging Face expects data to be formatted. In this post, we're going to walk through an abridged notebook with three ways to format the data to fine-tune a model. The first is a straightforward approach building on the intuition from the previous post: simply copying input_ids into labels. The second approach uses masking to learn from select parts of the text. The third approach uses a separate library, TRL, so that we don't have to mask the data manually.
I'll leave out some function definitions to keep this readable, so it's best to reference the full notebook to get all of the code.
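Since those helpers appear throughout the post, here is a minimal sketch of what they might look like. The names (load_model, sample_generate, print_iterative_generate) come from the notebook, but these bodies are my assumptions, so defer to the notebook for the real versions:

import torch
from transformers import AutoModelForCausalLM

def load_model(model_name: str):
    # assumption: load a causal LM onto GPU if one is available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return AutoModelForCausalLM.from_pretrained(model_name).to(device)

def sample_generate(model, tokenizer, inputs, max_new_tokens: int = 5) -> str:
    # greedily generate a continuation and decode only the new tokens
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_token_ids = output_ids[0][inputs["input_ids"].shape[1] :]
    return tokenizer.decode(new_token_ids)

def print_iterative_generate(model, tokenizer, inputs) -> None:
    # predict the most likely token at every position of the prompt,
    # i.e. what the model produces under teacher forcing during training
    with torch.no_grad():
        logits = model(**inputs).logits
    print(tokenizer.decode(logits.argmax(dim=-1)[0]))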
Fine-tuning with labels copied from input_ids
We're going to be using bloom-560m, a multilingual model that's small enough to fine-tune on a typical laptop.
model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(
model_name, trust_remote_code=True, padding_side="proper"
) # padding aspect must be proper for CausalLM fashions
# overfit to 5 made-up examples
str1 = '\n\n### Human: How do you say "dog" in Spanish?\n\n### Assistant: perro'
str2 = '\n\n### Human: How do you say "water" in Spanish?\n\n### Assistant: agua'
str3 = '\n\n### Human: How do you say "hello" in Spanish?\n\n### Assistant: hola'
str4 = '\n\n### Human: How do you say "tree" in Spanish?\n\n### Assistant: árbol'
str5 = '\n\n### Human: How do you say "mother" in Spanish?\n\n### Assistant: madre'
train_data = {
    "text": [str1, str2, str3, str4, str5],
}
dataset_text = Dataset.from_dict(train_data)

# to test if we learned to generate an unknown word
holdout_str = (
    '\n\n### Human: How do you say "good" in Spanish?\n\n### Assistant:<s>'  # bueno
)
system = "cuda" if torch.cuda.is_available() else "cpu"
holdout_input = tokenizer(holdout_str, return_tensors="pt").to(system)
Let's start with some preprocessing. We're going to add some special tokens, namely "end of sequence" (eos) and "beginning of sequence" (bos). These special tokens help the model learn when it's supposed to start and stop generating text.
INSTRUCTION_TEMPLATE_BASE = "\n\n### Human:"
RESPONSE_TEMPLATE_BASE = "\n\n### Assistant:"

def add_special_tokens(
    example: Dict,
    tokenizer: PreTrainedTokenizerBase,
) -> Dict:
    # add eos_token before human text and bos_token before assistant text
    example["text"] = (
        example["text"]
        .replace(
            INSTRUCTION_TEMPLATE_BASE, tokenizer.eos_token + INSTRUCTION_TEMPLATE_BASE
        )
        .replace(RESPONSE_TEMPLATE_BASE, RESPONSE_TEMPLATE_BASE + tokenizer.bos_token)
    )
    if not example["text"].endswith(tokenizer.eos_token):
        example["text"] += tokenizer.eos_token
    # remove leading eos tokens
    while example["text"].startswith(tokenizer.eos_token):
        example["text"] = example["text"][len(tokenizer.eos_token) :]
    return example

dataset_text = dataset_text.map(lambda x: add_special_tokens(x, tokenizer))
print(f"{dataset_text=}")
print(f"{dataset_text[0]=}")
>>> dataset_text=Dataset({
    features: ['text'],
    num_rows: 5
})
>>> dataset_text[0]={'text': '\n\n### Human: How do you say "dog" in Spanish?\n\n### Assistant:<s> perro</s>'}
Now we're going to do what we learned last session: create an input with a labels key copied from input_ids.
# tokenize the text
dataset = dataset_text.map(
    lambda example: tokenizer(example["text"]), batched=True, remove_columns=["text"]
)
# copy the input_ids to labels
dataset = dataset.map(lambda x: {"labels": x["input_ids"]}, batched=True)
print(f"{dataset=}")
print(f"{dataset[0]['input_ids']=}")
print(f"{dataset[0]['labels']=}")
>>> dataset=Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 5
})
>>> dataset[0]['input_ids']=[603, 105311, 22256, 29, 7535, 727, 1152, 5894, 20587, 744, 5, 361, 49063, 7076, 105311, 143005, 29, 1, 82208, 2]
>>> dataset[0]['labels']=[603, 105311, 22256, 29, 7535, 727, 1152, 5894, 20587, 744, 5, 361, 49063, 7076, 105311, 143005, 29, 1, 82208, 2]
To start, labels and input_ids are identical. Let's see what happens when we train a model like that.
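One thing that may look odd: we never shift the labels ourselves. That's because Hugging Face's causal LM models do the shifting inside the loss computation, so the logits at position i are scored against the label at position i + 1. A toy sketch of that idea (illustrative tensors, not the notebook's code):

import torch
import torch.nn.functional as F

vocab_size = 8
logits = torch.randn(1, 4, vocab_size)  # (batch, seq_len, vocab) from the model
labels = torch.tensor([[3, 5, 1, 2]])   # an exact copy of input_ids

# shift so position i is scored against the token at position i + 1
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
loss = F.cross_entropy(shift_logits.view(-1, vocab_size), shift_labels.view(-1))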
# training code inspired by
# https://mlabonne.github.io/blog/posts/Fine_Tune_Your_Own_Llama_2_Model_in_a_Colab_Notebook.html
model = load_model(model_name)
output_dir = "./results"
# how many times to iterate over the entire dataset
num_train_epochs = 15
# we're not aligning the sequence lengths (i.e. padding or truncating),
# so batched training won't work for our toy example
per_device_train_batch_size = 1

training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    seed=1,
)
trainer = Trainer(
    model=model,
    train_dataset=dataset,
    args=training_arguments,
)
training1 = trainer.train()
# sample a generated prediction on the holdout prompt:
# '\n\n### Human: How do you say "good" in Spanish?\n\n### Assistant:'
# the correct output is "bueno</s>"
sample_generate(model, tokenizer, holdout_input, max_new_tokens=5)
>>> '</s>'
After 15 epochs, we're still kind of confused: we output '</s>', which is close, but we really want to output "bueno</s>". Let's train for another 15 epochs.
trainer.train()
sample_generate(model, tokenizer, holdout_input, max_new_tokens=5)
>>> bueno </s>
After 30 epochs, we've learned what we were supposed to!
Let's simulate what happens during training by iteratively predicting the prompt one token at a time, based on the previous tokens.
print_iterative_generate(model, tokenizer, holdout_input)
>>>
#
: How do you say "how morning in Spanish?### Assistant: gu buenopu
That's pretty close to the actual prompt, as we expected. But the task is translation, so we don't really care about being able to predict the user prompt. Is there a way to learn just the response part?
Masked approach
Hugging Face lets you learn to predict only certain tokens by "masking" the tokens you don't care about in labels. This is different from the attention mask, which controls which context tokens the model can attend to when generating a new token. Masking the labels hides the token you're supposed to output at a given index from the loss function. Note the wording: Hugging Face implements this such that during training we still generate predictions for the masked positions; however, because the true labels to compare those predictions against are hidden, we don't directly learn to improve on them.
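This works because PyTorch's cross-entropy loss ignores the label value -100 by default (ignore_index=-100). A tiny illustration, separate from the notebook:

import torch
import torch.nn.functional as F

vocab_size = 8
logits = torch.randn(1, 3, vocab_size)
labels = torch.tensor([[-100, -100, 4]])  # first two positions masked

# F.cross_entropy defaults to ignore_index=-100, so masked positions
# contribute nothing to the loss (or to its gradients)
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))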
We create the "mask" by flipping those tokens to -100 in the labels key.
def create_special_mask(example: Dict) -> Dict:
    """Mask human text and keep assistant text as it is.

    Args:
        example (Dict): Result of tokenizing some text

    Returns:
        Dict: The dict with the labels masked
    """
    # setting a token to -100 is how we "mask" a token
    # and tell the model to ignore it when calculating the loss
    mask_token_id = -100
    # assume we always start with human text
    human_text = True
    for idx, tok_id in enumerate(example["labels"]):
        if human_text:
            # mask all human text up until and including the bos token
            example["labels"][idx] = mask_token_id
            if tok_id == tokenizer.bos_token_id:
                human_text = False
        elif not human_text and tok_id == tokenizer.eos_token_id:
            # don't mask the eos token, but the next token will be human text to mask
            human_text = True
        elif not human_text:
            # leave example["labels"] as it is for assistant text
            continue
    return example

dataset_masked = dataset.map(create_special_mask)
# convert the dataset from lists to torch tensors
dataset_masked.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
print(f"{dataset_masked[0]["labels"]=}")
>>> dataset[0]["labels"]=tensor([ -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 82208, 2])
model = load_model(model_name)
trainer = Trainer(
    model=model,
    train_dataset=dataset_masked,
    args=training_arguments,
)
training2 = trainer.train()
print(f"{training2.metrics['train_runtime']=}")
print(f"{training1.metrics['train_runtime'] =}")
print(
f"{100*spherical((training1.metrics['train_runtime'] - training2.metrics['train_runtime']) / training1.metrics['train_runtime'] , 2)}%"
)
>>> training2.metrics['train_runtime']=61.7164
>>> training1.metrics['train_runtime'] =70.8013
>>> 13.0%
First off, we were more than 10% faster this time. Presumably, having fewer loss calculations to make speeds things up a bit.
I wouldn't bank on the speedup being this big in general, since our example is pretty lopsided, with much more human text than generated text. But when training times run into the hours, every percentage point helps.
The big question: did we learn the task?
sample_generate(model, tokenizer, holdout_input, max_new_tokens=5)
>>> bueno </s>
This time we only need 15 epochs to learn the task. Let's go back to what things look like under the hood during training.
print_iterative_generate(model, tokenizer, holdout_input)
>>>#include
code
to I get "we" in English?
A: Spanish: How bueno
Iteratively predicting the prompt now produces nonsense compared with our first training approach. This checks out: we masked the prompt during training, and therefore never learned to predict anything before our real target, the assistant response.
Using TRL's supervised fine-tuning trainer
Hugging Face semi-recently rolled out the TRL (Transformer Reinforcement Learning) library to add end-to-end support for the LLM training process, and one of its features is supervised fine-tuning. Using the DataCollatorForCompletionOnlyLM and SFTTrainer classes, we can create the labels just like we did with create_special_mask, with only a few configs.
model = load_model(model_name)

# a Hugging Face collator that does the copying of labels for you.
# passing the instruction and response templates masks everything from the
# instruction template up to the start of the response template
collator = DataCollatorForCompletionOnlyLM(
    instruction_template=tokenizer.eos_token,
    response_template=tokenizer.bos_token,
    tokenizer=tokenizer,
)
trainer_sft = SFTTrainer(
    model,
    train_dataset=dataset_text,
    dataset_text_field="text",
    data_collator=collator,
    args=training_arguments,
    tokenizer=tokenizer,
)
sftrain = trainer_sft.train()
sample_generate(model, tokenizer, holdout_input, max_new_tokens=5)
>>> ' perro</s>'
Success! If you dig deeper, though, training actually took longer using SFT. This might be down to the fact that we have to tokenize at training time rather than as a preprocessing step, as in the masked approach. However, this approach gives us batching for free (you'd have to tweak the tokenization process for the masked approach to batch properly), which should make things faster in the long run.
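For reference, here's a hedged sketch of the tokenization tweak the masked approach would need in order to batch properly: pad every example to a common length and mask the pad positions out of the loss. The max_length of 32 is an assumption for this toy data:

def tokenize_padded(example: Dict) -> Dict:
    # pad/truncate to a fixed length so examples can be stacked into batches
    tokens = tokenizer(
        example["text"], padding="max_length", max_length=32, truncation=True
    )
    # pad positions also get -100 so they don't contribute to the loss
    tokens["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(tokens["input_ids"], tokens["attention_mask"])
    ]
    return tokens

dataset_batched = dataset_text.map(tokenize_padded, remove_columns=["text"])

create_special_mask would then run on top of this to hide the human text, exactly as before.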
The full notebook explores a few other things, like training on multi-turn chats and using special_tokens to indicate human vs. assistant text.
Clearly, this example is a bit basic. Still, hopefully you can start to see the power of using CausalLM: you can imagine taking interactions from a large, reliable model and using the techniques above to fine-tune a smaller model on the large model's outputs. This is called knowledge distillation.
If we've learned anything over the last couple of years of LLMs, it's that we can do surprisingly intelligent things just by training on next-token prediction. Causal language models are designed to do exactly that. Even if the Hugging Face class is a bit confusing at first, once you're used to it you have a very powerful interface for training your own generative models.