If you’ve played around with recent models on HuggingFace, chances are you’ve encountered a causal language model. When you pull up the documentation for a model family, you’ll get a page with “tasks” like LlamaForCausalLM or LlamaForSequenceClassification.
If you’re like me, going from that documentation to actually finetuning a model can be a bit confusing. We’re going to focus on CausalLM, starting by explaining what CausalLM is in this post, followed by a practical example of how to finetune a CausalLM model in a subsequent post.
Background: Encoders and Decoders
Many of the biggest models today, such as LLaMA-2, GPT-2, or Falcon, are “decoder-only” models. A decoder-only model (a rough sketch follows this list):
- takes a sequence of previous tokens (AKA a prompt)
- runs those tokens through the model (often creating embeddings from the tokens and running them through transformer blocks)
- outputs a single output (usually the probability of the next token).
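Here is a minimal sketch of that single step, assuming the bigscience/bloom-560m checkpoint used later in this post (any CausalLM checkpoint would behave the same way):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
# Run the prompt through the model and look at the logits for the last position.
inputs = tokenizer("the dog likes", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch_size, sequence_length, vocab_size)
next_token_probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over the next token
print(tokenizer.decode(next_token_probs.argmax().item()))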
This is contrasted with models that have “encoder-only” or hybrid “encoder-decoder” architectures, which take the full sequence as input, not just previous tokens. This difference disposes the two architectures toward different tasks. Decoder models are designed for the generative task of writing new text. Encoder models are designed for tasks which require looking at a full sequence, such as translation or sequence classification. Things get murky because you can repurpose a decoder-only model to do translation or use an encoder-only model to generate new text. Sebastian Raschka has a nice guide if you want to dig more into encoders vs decoders. There’s also a Medium article which goes more in-depth into the differences between masked language modeling and causal language modeling.
For our purposes, all you need to know is that:
- CausalLM models generally are decoder-only models
- Decoder-only models look at past tokens to predict the next token
With decoder-only language models, we can think of the next-token prediction process as “causal language modeling” because the previous tokens “cause” each additional token.
HuggingFace CausalLM
In HuggingFace world, CausalLM (LM stands for language modeling) is a class of models which take a prompt and predict new tokens. In reality, we’re predicting one token at a time, but the class abstracts away the tediousness of having to loop through sequences one token at a time. During inference, CausalLMs will iteratively predict individual tokens until some stopping condition, at which point the model returns the final concatenated tokens, roughly as in the sketch below.
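As a sketch (again assuming bloom-560m; the prompt and the stopping condition here are just placeholders), the high-level generate() call handles that token-by-token loop for you:
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
inputs = tokenizer("the dog likes", return_tensors="pt")
# generate() appends one predicted token at a time internally; the stopping
# condition here is simply "stop after 5 new tokens."
output_ids = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output_ids[0]))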
During training, something similar happens where we give the model a sequence of tokens we want it to learn. We start by predicting the second token given the first one, then the third token given the first two tokens, and so on.
Thus, if you want to learn how to predict the sentence “the dog likes food,” assuming each word is a token, you’re making 3 predictions:
- “the” → dog
- “the dog” → likes
- “the dog likes” → food
During training, you can think about each of the three snapshots of the sentence as three observations in your training dataset. Manually splitting long sequences into individual rows for each token in a sequence would be tedious, so HuggingFace handles it for you.
As long as you give it a sequence of tokens, it will break that sequence out into individual single-token predictions behind the scenes, roughly as in the sketch below.
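Conceptually (a toy sketch, not HuggingFace’s actual implementation), that splitting looks like this:
# One sequence becomes several (context -> next token) training pairs.
tokens = ["the", "dog", "likes", "food"]
for k in range(1, len(tokens)):
    context, target = tokens[:k], tokens[k]
    print(context, "->", target)
# ['the'] -> dog
# ['the', 'dog'] -> likes
# ['the', 'dog', 'likes'] -> food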
You can create this ‘sequence of tokens’ by running a regular string through the model’s tokenizer. The tokenizer will output a dictionary-like object with input_ids and an attention_mask as keys, just like with any ordinary HuggingFace model.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
tokenizer("the canine likes meals")
>>> {'input_ids': [5984, 35433, 114022, 17304], 'attention_mask': [1, 1, 1, 1]}
With CausalLM models, there’s one additional step where the model expects a labels key. During training, we use the “previous” input_ids to predict the “current” labels token. However, you don’t want to think about labels like in a question answering model, where the first index of labels corresponds to the answer to the input_ids (i.e., that the labels should be concatenated onto the end of the input_ids). Rather, you want labels and input_ids to mirror each other with identical shapes. In algebraic notation, to predict the labels token at index k, we use all of the input_ids through the k-1 index.
If that’s confusing, practically, you can usually just make labels an identical copy of input_ids and call it a day (a minimal sketch of that follows). If you do want to understand what’s going on, we’ll walk through an example.
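A minimal sketch of that copy-the-inputs approach, using the bloom-560m tokenizer from above (HuggingFace’s DataCollatorForLanguageModeling with mlm=False does essentially the same thing for you, padding aside):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
batch = tokenizer("the dog likes food", return_tensors="pt")
# labels mirror input_ids exactly; the model shifts them internally so that
# the token at position k is predicted from positions 0..k-1.
batch["labels"] = batch["input_ids"].clone()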
A quick worked example
Let’s return to “the dog likes food.” For simplicity, let’s leave the words as words rather than assigning them to token numbers, but in practice these would be numbers which you can map back to their true string representation using the tokenizer.
Our input for a single-element batch would look like this:
{
"input_ids": [["the", "dog", "likes", "food"]],
"attention_mask": [[1, 1, 1, 1]],
"labels": [["the", "dog", "likes", "food"]],
}
The double brackets denote that technically the shape of the arrays for each key is batch_size x sequence_length. To keep things simple, we can ignore batching and just treat them like one-dimensional vectors.
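You can see those shapes directly if you tokenize the sentence with return_tensors (a quick check, again assuming the bloom-560m tokenizer):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
batch = tokenizer(["the dog likes food"], return_tensors="pt")
print(batch["input_ids"].shape)  # torch.Size([1, 4]) -> batch_size x sequence_length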
Under the hood, if the model is predicting the kth token in a sequence, it will do so somewhat like this:
pred_token_k = model(input_ids[:k]*attention_mask[:k]^T)
Note this is pseudocode.
We can ignore the attention mask for our purposes. For CausalLM models, we usually want the attention mask to be all 1s because we want to attend to all previous tokens. Also note that [:k] really means we use the 0th index through the k-1 index, because the ending index in slicing is exclusive.
With that in mind, we have:
pred_token_k = model(input_ids[:k])
The loss would be computed by comparing the true value of labels[k] with pred_token_k.
In reality, both get represented as 1 x v vectors, where v is the size of the vocabulary. Each element represents the probability of that token. For the predictions (pred_token_k), these are the actual probabilities the model predicts. For the true label (labels[k]), we can artificially make it the right shape by creating a vector with 1 for the actual true token and 0 for all other tokens in the vocabulary.
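With our toy four-token vocabulary, building that one-hot “true” vector might look like this (a sketch; the ordering [dog, food, likes, the] just mirrors the example below, and a real vocabulary is far larger):
vocab = ["dog", "food", "likes", "the"]  # toy vocabulary of size v=4
true_token = "food"
one_hot = [1.0 if token == true_token else 0.0 for token in vocab]
print(one_hot)  # [0.0, 1.0, 0.0, 0.0] -> P(food)=100%, everything else 0%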
Let’s say we’re predicting the second word of our sample sentence, meaning k=1 (we’re zero-indexing k). The first bullet item is the context we use to generate a prediction and the second bullet item is the true label token we’re aiming to predict.
k=1:
- input_ids[:1] == [the]
- labels[1] == dog
k=2:
- input_ids[:2] == [the, dog]
- labels[2] == likes
k=3:
- input_ids[:3] == [the, dog, likes]
- labels[3] == food
Let’s say k=3 and we feed the model “[the, dog, likes]”. The model outputs:
[P(dog)=10%, P(food)=60%, P(likes)=0%, P(the)=30%]
In other words, the model thinks there’s a 10% chance the next token is “dog,” a 60% chance the next token is “food,” and a 30% chance the next token is “the.”
The true label would be represented as:
[P(dog)=0%, P(food)=100%, P(likes)=0%, P(the)=0%]
In real training, we’d use a loss function like cross-entropy. To keep it as intuitive as possible, let’s just use absolute difference to get an approximate feel for loss. By absolute difference, I mean the absolute value of the difference between the predicted probability and our “true” probability: e.g., absolute_diff_dog = |0.10 – 0.00| = 0.10.
Even with this crude loss function, you can see that to minimize the loss we want to predict a high probability for the actual label (e.g., food) and low probabilities for all other tokens in the vocabulary.
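Summing that absolute difference across our toy vocabulary gives a rough overall “loss” for this prediction (a sketch using the numbers above):
# Predicted vs. "true" probabilities, ordered as [dog, food, likes, the].
predicted = [0.10, 0.60, 0.00, 0.30]
true_label = [0.00, 1.00, 0.00, 0.00]
crude_loss = sum(abs(p - t) for p, t in zip(predicted, true_label))
print(crude_loss)  # |0.1-0| + |0.6-1| + |0.0-0| + |0.3-0| = 0.8 (up to float rounding)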
For instance, let’s say after training, when we ask our model to predict the next token given [the, dog, likes], our outputs look something like the following (illustrative numbers):
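[P(dog)=0%, P(food)=99%, P(likes)=0%, P(the)=1%]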
Our loss is now smaller, because we’ve learned to predict “food” with high probability given these inputs.
Training would just be repeating this process of trying to align the predicted probabilities with the true next token for all of the tokens in your training sequences.
Conclusion
Hopefully you’re getting an intuition for what’s happening under the hood to train a CausalLM model using HuggingFace. You might have some questions like “why do we need labels as a separate array when we could just use the kth index of input_ids directly at each step? Is there any case where labels would be different from input_ids?”
I’m going to leave you to think about those questions and stop there for now. We’ll pick back up with answers and real code in the next post!