16, 8, and 4-bit Floating Point Formats — How Does it Work? | by Dmitrii Eliuseev

[ad_1]

Let’s go into bits and bytes

13 min learn

10 hours in the past

For 50 years, from the time of Kernighan, Ritchie, and their 1st version of the C Language ebook, it was identified {that a} single-precision “float” sort has a 32-bit measurement and a double-precision sort has 64 bits. There was additionally an 80-bit “lengthy double” sort with prolonged precision, and all these varieties coated virtually all of the wants for floating-point information processing. Nonetheless, throughout the previous few years, the appearance of huge neural community fashions required builders to maneuver into one other a part of the spectrum and to shrink floating level varieties as a lot as doable.

Actually, I used to be shocked after I found that the 4-bit floating-point format exists. How on Earth can or not it’s doable? One of the simplest ways to know is to check it on our personal. On this article, we are going to uncover the preferred floating level codecs, make a easy neural community, and see the way it works.

Let’s get began.

A “Customary” 32-bit Floating level

Earlier than going into “excessive” codecs, let’s recall an ordinary one. An IEEE 754 commonplace for floating-point arithmetic was established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). A typical quantity in a 32-float sort appears like this:

Right here, the primary bit is an indication, the subsequent 8 bits symbolize an exponent, and the final bits symbolize the mantissa. The ultimate worth is calculated utilizing the components:

This straightforward helper operate permits us to print a floating level worth in binary kind:

import structdef print_float32(val: float):
""" Print Float32 in a binary kind """
m = struct.unpack('I', struct.pack('f', val))[0]
return format(m, 'b').zfill(32)
print_float32(0.15625)
# > 00111110001000000000000000000000

Let’s additionally make one other helper for backward conversion, which can be helpful later:

def ieee_754_conversion(signal, exponent_raw, mantissa, exp_len=8, mant_len=23):
""" Convert binary information into the floating level worth """
sign_mult = -1 if signal == 1 else 1
exponent = exponent_raw - (2 ** (exp_len - 1) - 1)
mant_mult = 1
for b in vary(mant_len - 1, -1, -1):
if mantissa & (2 **…

[ad_2]

Source link

16, 8, and 4-bit Floating Point Formats — How Does it Work? | by Dmitrii Eliuseev | Sep, 2023

Getty Images Unveils a New AI-Powered Image-Creation Tool: Revolutionizing Visual Content with Generative AI Technology

Researchers from Apple and EPFL Introduce the Boolformer Model: The First Transformer Architecture Trained to Perform End-to-End Symbolic Regression of Boolean Functions

Editor

Researchers from Apple and EPFL Introduce the Boolformer Model: The First Transformer Architecture Trained to Perform End-to-End Symbolic Regression of Boolean Functions

Leave a Reply Cancel reply

Browse by Category

Categories

Recommended

16, 8, and 4-bit Floating Point Formats — How Does it Work? | by Dmitrii Eliuseev | Sep, 2023

Let’s go into bits and bytes

A “Customary” 32-bit Floating level

Getty Images Unveils a New AI-Powered Image-Creation Tool: Revolutionizing Visual Content with Generative AI Technology

Researchers from Apple and EPFL Introduce the Boolformer Model: The First Transformer Architecture Trained to Perform End-to-End Symbolic Regression of Boolean Functions

Editor

Researchers from Apple and EPFL Introduce the Boolformer Model: The First Transformer Architecture Trained to Perform End-to-End Symbolic Regression of Boolean Functions

Leave a Reply Cancel reply

Browse by Category

Browse by Tags

Categories

Recommended