This (written jointly with Marco Tulio Ribeiro) is part 2 of a series on the art of prompt design (part 1 here), where we talk about controlling large language models (LLMs) with guidance.
In this post, we'll discuss how the greedy tokenization methods used by language models can introduce a subtle and powerful bias into your prompts, leading to puzzling generations.
Language models are not trained on raw text, but rather on tokens, which are chunks of text that often occur together, similar to words. This affects how language models "see" text, including prompts (since prompts are just sets of tokens). GPT-style models utilize tokenization methods like Byte Pair Encoding (BPE), which map all input bytes to token ids in a greedy manner. This is fine for training, but it can lead to subtle issues during inference, as shown in the example below.
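To see what this greedy chunking looks like in practice, here is a minimal sketch using the transformers library (GPT-2's tokenizer is used purely for illustration; the exact splits depend on the vocabulary):

from transformers import AutoTokenizer

# GPT-2's byte-level BPE tokenizer, used here only as a small, familiar example
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# frequent character sequences (like "://") typically end up as single tokens
print(tokenizer.tokenize('The link is <a href="http://www.google.com'))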
Consider the following example, where we are trying to generate an HTTP URL string:
import guidance

# we use StableLM as an example, but these issues affect all models to varying degrees
guidance.llm = guidance.llms.Transformers("stabilityai/stablelm-base-alpha-3b", device=0)

# we turn token healing off so that guidance acts like a normal prompting library
program = guidance('The link is <a href="http:{{gen max_tokens=10 token_healing=False}}')
program()
Note that the output generated by the LLM does not complete the URL with the obvious next characters (two forward slashes). Instead it creates an invalid URL string with a space in the middle. This is surprising, because the // completion is extremely obvious after http:. To understand why this happens, let's change our prompt boundary so that our prompt does not include the colon character:
guidance('The link is <a href="http{{gen max_tokens=10 token_healing=False}}')()
Now the language model generates a valid URL string like we expect. To understand why the : matters, we need to look at the tokenized representation of the prompts. Below is the tokenization of the prompt that ends in a colon (the prompt without the colon has the same tokenization, except for the last token):
print_tokens(guidance.llm.encode('The link is <a href="http:'))
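(The print_tokens helper is not defined in this excerpt; a minimal sketch of it, assuming direct access to the underlying Hugging Face tokenizer, might look like this.)

from transformers import AutoTokenizer

# hypothetical helper: print each token id next to the exact string it maps to
_tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-3b")

def print_tokens(token_ids):
    for tid in token_ids:
        print(f"{tid:>6}  {_tokenizer.decode([tid])!r}")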
Now note what the tokenization of a valid URL looks like, paying careful attention to token 1358, right after http:
print_tokens(guidance.llm.encode('The link is <a href="http://www.google.com/search?q'))
Most LLMs (including this one) use a greedy tokenization method, always preferring the longest possible token, i.e. :// will always be preferred over : in full text (e.g. in training).
While URLs in training are encoded with token 1358 (://), our prompt makes the LLM see token 27 (:) instead, which throws off completion by artificially splitting ://.
In fact, the model can be pretty sure that seeing token 27 (:) means whatever comes next is very unlikely to be anything that could have been encoded together with the colon using a "longer token" like ://, since in the model's training data those characters would have been encoded together with the colon (an exception to this that we'll discuss later is subword regularization during training). The fact that seeing a token means both seeing the embedding of that token and also that whatever comes next wasn't compressed by the greedy tokenizer is easy to forget, but it is important in prompt boundaries.
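You can see this bias directly by inspecting the model's next-token distribution after the colon-terminated prompt. Below is a rough standalone sketch using transformers (it assumes "//" is itself a token in this vocabulary, which the first print checks; exact numbers will vary):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-3b")
model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-base-alpha-3b")

ids = tok('The link is <a href="http:', return_tensors="pt").input_ids
with torch.no_grad():
    probs = model(ids).logits[0, -1].softmax(-1)

slash_ids = tok.encode("//")
print(slash_ids, repr(tok.decode(slash_ids[:1])))  # check how "//" tokenizes in this vocabulary
print(probs[slash_ids[0]].item())                  # expected to be small if the boundary bias holds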
Let's search over the string representation of all the tokens in the model's vocabulary, to see which ones start with a colon:
print_tokens(guidance.llm.prefix_matches(":"))
Note that there are 34 different tokens starting with a colon, and thus ending a prompt with a colon means the model will likely not generate completions with any of these 34 token strings. This subtle and powerful bias can have all kinds of unintended consequences. And this applies to any string that could potentially be extended into a longer single token (not just :). Even our "fixed" prompt ending with "http" has a built-in bias as well, since it communicates to the model that what comes after "http" is likely not "s" (otherwise "http" would not have been encoded as a separate token):
print_tokens(guidance.llm.prefix_matches("http"))
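Conceptually, prefix_matches just scans the vocabulary for token strings that begin with a given prefix. A rough sketch (not guidance's actual implementation, and assuming a Hugging Face tokenizer) would be:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-3b")

def prefix_matches_sketch(prefix):
    # collect every token string in the vocabulary that starts with `prefix`
    return [
        s for s in (tok.convert_tokens_to_string([t]) for t in tok.get_vocab())
        if s.startswith(prefix)
    ]

print(len(prefix_matches_sketch(":")))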
Lest you think this is an arcane problem that only touches URLs, remember that most tokenizers treat tokens differently depending on whether they start with a space, punctuation, quotes, etc., and thus ending a prompt with any of these can lead to wrong token boundaries and break things:
# Accidentally adding a space will lead to weird generation
guidance('I read a book about {{gen max_tokens=5 token_healing=False temperature=0}}')()

# No space, works as expected
guidance('I read a book about{{gen max_tokens=5 token_healing=False temperature=0}}')()
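The culprit is again the prompt boundary: most word tokens in the vocabulary begin with a space, so a trailing space in the prompt biases the model away from all of them. You can inspect this the same way as before (a small check reusing the print_tokens helper; the exact splits depend on the vocabulary):

# the trailing space becomes its own piece of the last prompt token, biasing the model
# against completions whose first token starts with a space
print_tokens(guidance.llm.encode('I read a book about '))
print_tokens(guidance.llm.encode('I read a book about'))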
Another example of this is the "[" character. Consider the following prompt and completion:
guidance('An example ["like this"] and another example [{{gen max_tokens=10 token_healing=False}}')()
Why is the second string not quoted? Because by ending our prompt with the " [" token, we are telling the model that it should not generate completions that match the following 27 longer tokens (one of which adds the quote character, 15640):
print_tokens(guidance.llm.prefix_matches(" ["))
Token boundary bias happens everywhere. Over 70% of the 10k most-common tokens for the StableLM model used above are prefixes of longer possible tokens, and so cause token boundary bias when they are the last token in a prompt.
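If you want to check a figure like this for your own tokenizer, a rough sketch is below (an approximation: token ids are used as a stand-in for frequency ranking, since BPE assigns lower ids to earlier, more common merges):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-3b")

# string form of every token, ordered by token id as a rough proxy for "most common"
vocab = sorted(tok.get_vocab().items(), key=lambda kv: kv[1])
strings = [tok.convert_tokens_to_string([t]) for t, _ in vocab]

# the set of all proper prefixes of all token strings
prefixes = {s[:i] for s in strings for i in range(1, len(s))}

top10k = strings[:10000]
extendable = sum(1 for s in top10k if s in prefixes)
print(extendable / len(top10k))  # fraction of tokens that are prefixes of longer tokens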
What can we do to avoid these unintended biases? One option is to always end our prompts with tokens that cannot be extended into longer tokens (for example a role tag for chat-based models), but this is a severe limitation.
Instead, guidance has a feature called "token healing", which automatically backs up the generation process by one token before the end of the prompt, then constrains the first token generated to have a prefix that matches the last token in the prompt. In our URL example, this would mean removing the :, and forcing generation of the first token to have a : prefix. Token healing allows users to express prompts however they wish, without worrying about token boundaries.
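The core idea can be sketched in a few lines (a conceptual sketch only, not guidance's actual implementation; the real feature also constrains decoding as the model generates):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-3b")

def healed_prompt_and_allowed_tokens(prompt):
    ids = tok.encode(prompt)
    last_str = tok.decode(ids[-1:])        # e.g. ':' in our URL example
    healed = tok.decode(ids[:-1])          # back the prompt up by one token
    # the first generated token must start with the string we just removed
    allowed = [i for i in range(len(tok)) if tok.decode([i]).startswith(last_str)]
    return healed, allowed

healed, allowed = healed_prompt_and_allowed_tokens('The link is <a href="http:')
print(healed)          # ends with "http"; the ':' has been backed out
print(len(allowed))    # the constrained first-token set (it should include '://', token 1358 above)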
For example, let's re-run some of the URL examples above with token healing turned on (it's on by default for Transformer models, so we remove token_healing=False):
# With token healing we generate valid URLs,
# even when the prompt ends with a colon:
guidance('The link is <a href="http:{{gen max_tokens=10}}')()
# With token healing, we will sometimes generate https URLs,
# even when the prompt ends with "http":
program = guidance('''The link is <a href="http{{gen 'completions' max_tokens=10 n=10 temperature=1}}''')
program()["completions"]
Similarly, we don't have to worry about extra spaces:
# Accidentally adding a space will not impact generation
program = guidance('''I read a book about {{gen max_tokens=5 temperature=0}}''')
program()

# This will generate the same text as above
program = guidance('''I read a book about{{gen max_tokens=6 temperature=0}}''')
program()
And we now get quoted strings even when the prompt ends with a " [" token:
guidance('An example ["like this"] and another example [{{gen max_tokens=10}}')()
If you are familiar with how language models are trained, you may be wondering how subword regularization fits into all this. Subword regularization is a technique where, during training, sub-optimal tokenizations are randomly introduced to increase the model's robustness. This means that the model does not always see the best greedy tokenization. Subword regularization is great at helping the model be more robust to token boundaries, but it does not altogether remove the bias that the model has towards the standard greedy tokenization. This means that while models may exhibit more or less token boundary bias depending on the amount of subword regularization used during training, all models still have this bias. And as shown above, it can still have a powerful and unexpected impact on the model output.
When you write prompts, remember that greedy tokenization can have a significant impact on how language models interpret your prompts, particularly when the prompt ends with a token that could be extended into a longer token. This easy-to-miss source of bias can impact your results in surprising and unintended ways.
To address this, either end your prompt with a non-extendable token, or use something like guidance's "token healing" feature so you can express your prompts however you wish, without worrying about token boundary artifacts.
To reproduce the results in this article yourself, check out the notebook version.