[ad_1]
For 50 years, from the time of Kernighan, Ritchie, and their 1st version of the C Language ebook, it was identified {that a} single-precision “float” sort has a 32-bit measurement and a double-precision sort has 64 bits. There was additionally an 80-bit “lengthy double” sort with prolonged precision, and all these varieties coated virtually all of the wants for floating-point information processing. Nonetheless, throughout the previous few years, the appearance of huge neural community fashions required builders to maneuver into one other a part of the spectrum and to shrink floating level varieties as a lot as doable.
Actually, I used to be shocked after I found that the 4-bit floating-point format exists. How on Earth can or not it’s doable? One of the simplest ways to know is to check it on our personal. On this article, we are going to uncover the preferred floating level codecs, make a easy neural community, and see the way it works.
Let’s get began.
A “Customary” 32-bit Floating level
Earlier than going into “excessive” codecs, let’s recall an ordinary one. An IEEE 754 commonplace for floating-point arithmetic was established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). A typical quantity in a 32-float sort appears like this:
Right here, the primary bit is an indication, the subsequent 8 bits symbolize an exponent, and the final bits symbolize the mantissa. The ultimate worth is calculated utilizing the components:
This straightforward helper operate permits us to print a floating level worth in binary kind:
import structdef print_float32(val: float):
""" Print Float32 in a binary kind """
m = struct.unpack('I', struct.pack('f', val))[0]
return format(m, 'b').zfill(32)
print_float32(0.15625)
# > 00111110001000000000000000000000
Let’s additionally make one other helper for backward conversion, which can be helpful later:
def ieee_754_conversion(signal, exponent_raw, mantissa, exp_len=8, mant_len=23):
""" Convert binary information into the floating level worth """
sign_mult = -1 if signal == 1 else 1
exponent = exponent_raw - (2 ** (exp_len - 1) - 1)
mant_mult = 1
for b in vary(mant_len - 1, -1, -1):
if mantissa & (2 **…
[ad_2]
Source link