Floating-point numbers often lead to confusion and frustration. Unfortunately, these problems cannot be avoided, and working properly with floating-point values requires a basic understanding of how they are stored.

**How floating-point numbers are stored**

Decimal floating-point values generally do not have an exact binary representation. This is a side effect of how the FPU represents and processes floating-point numbers. The storage format for **Double** and **Single** is the format the FPU of the CPU expects, which ensures consistency and fast reading from and writing to memory.

There are two types of floating-point numbers: **Float** (or **Single**), which takes 4 bytes, and **Double**, which takes 8 bytes. The size determines the minimum and maximum values that can be stored: the largest number a **Float** can hold is much smaller than the largest number a **Double** can hold. A **Double** is also more accurate.

The problem, however, is *how to store a floating-point value in a binary computer*. Since there are infinitely many real numbers, they cannot all be stored in either 4 bytes (Float) or 8 bytes (Double). To store as many numbers as possible, with as much accuracy as possible, a floating-point value is stored as a formula:

$$X = (-1)^{\mathit{sign}} \times 2^{(\mathit{exponent} - \mathit{bias})} \times (1 + \mathit{fraction} \times 2^{-23})$$

The formula contains three variables (*sign*, *exponent*, *fraction*) and one constant (*bias*). The *bias* is 127 for single-precision numbers and 1023 for double-precision numbers. The factor 2^-23 belongs to the 23-bit fraction of a Single; a Double has a 52-bit fraction, so there the factor is 2^-52. The values of the three variables are stored in either 4 bytes for a Single or 8 bytes for a Double.

The next example uses a user-defined type *Sfloat* to illustrate the storage of the **Float** data type. The values for *fraction* and *exponent*, together with a *sign* bit, are stored in the 4 bytes. By using a **Union** we can assign a value to a **Single** variable and then use *Sfloat* to dump the 4 bytes that make up the **Float**:

    Debug.Show
    Type Sfloat
      fraction As Bits 23   // fractional part
      exponent As Bits 8    // exponent + 127
      sign As Bits 1        // sign bit
    EndType
    Type TFloat Union       // sizeof() = 4
      value As Float
      sf As Sfloat
    EndType
    Dim fv As TFloat
    fv.value = 2.0 : DumpFloat(fv)
    fv.value = 0.0 : DumpFloat(fv)
    fv.value = -345.01 : DumpFloat(fv)
    ' Do test some more ..

    Proc DumpFloat(ByRef tflt As TFloat)
      Global Const bias As Int = 127          ' Standard IEEE
      Global Const frexp As Float = 2 ^ -23   ' Standard IEEE
      Dim flt!
      Debug "> DumpFloat:"; tflt.value;
      With tflt.sf
        Debug " (sign =";.sign; " exponent ="; .exponent; _
          " fraction =";.fraction;")"
        Debug "Binary format: ";Bin(.sign, 1)` _
          Bin(.exponent, 8)`Bin(.fraction, 23)
        ' Reconstruct value from Sfloat using formula:
        flt! = ((-1) ^ .sign Mul 2 ^ (.exponent - bias)) _
          * (1 + (.fraction * frexp))
        Debug "Float reconstructed ="; flt!
      EndWith
      Debug
    EndProc

The output of the demo is:

    > DumpFloat: 2 (sign = 0 exponent = 128 fraction = 0)
    Binary format: 0 10000000 00000000000000000000000
    Float reconstructed = 2

    > DumpFloat: 0 (sign = 0 exponent = 0 fraction = 0)
    Binary format: 0 00000000 00000000000000000000000
    Float reconstructed = 0

    > DumpFloat:-345.01 (sign = 1 exponent = 135 fraction = 2916680)
    Binary format: 1 10000111 01011001000000101001000
    Float reconstructed =-345.01

What does this tell us? A **Float** (or **Double**) is stored and described by three components packed into the bits of a 4-byte or 8-byte type. Because the storage for these components is limited, often only an approximation of the decimal value can be ‘described’. When a floating-point value is assigned to a variable, the value is dissected into these three components. To get back to the floating-point number, the three components are substituted into the standardized formula.
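The same dissection can be sketched for a **Double**. The block below is a Python illustration rather than GFA-BASIC code, and the helper names `decompose_double` and `reconstruct_double` are made up for this example. It assumes the standard IEEE 754 layout of 1 sign bit, 11 exponent bits (bias 1023) and 52 fraction bits. The formula only applies to normalized values; a stored exponent of 0 marks zero and subnormal numbers, which IEEE 754 treats as a special case.

```python
import struct

def decompose_double(x: float):
    """Split a 64-bit IEEE 754 double into (sign, exponent, fraction)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # 11 bits, stored with bias 1023
    fraction = bits & ((1 << 52) - 1)     # low 52 bits
    return sign, exponent, fraction

def reconstruct_double(sign: int, exponent: int, fraction: int) -> float:
    """Apply the storage formula; valid for normalized values only."""
    return (-1) ** sign * 2.0 ** (exponent - 1023) * (1 + fraction * 2.0 ** -52)

for value in (2.0, -345.01):
    s, e, f = decompose_double(value)
    print(value, (s, e, f), reconstruct_double(s, e, f))
```

For 2.0 the stored exponent is 1024 (1 + 1023) with a fraction of 0, which mirrors the 128 (1 + 127) shown for the **Float** above.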

**Effect of floating point values**

A floating-point value is stored as a description, not as the value itself, and the description has a limited number of bits. This makes most values inherently inexact. Even common decimal fractions, such as 0.0001, cannot be represented exactly in binary. Only fractional numbers of the form n/f, where f is an integer power of 2, can be expressed exactly with a finite number of bits. Examples are 1/4, 7/16 and 3/128; in each of them f is a power of 2.
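A quick way to check this rule is to convert a fraction to a floating-point value and compare it with the exact rational number. The following Python sketch (not GFA-BASIC; the function name `is_exact_in_binary` is invented for this illustration) uses the standard `fractions` module, which converts a double to its exact rational value:

```python
from fractions import Fraction

def is_exact_in_binary(numerator: int, denominator: int) -> bool:
    """True if numerator/denominator survives conversion to a double unchanged."""
    as_double = numerator / denominator          # rounded to the nearest double
    return Fraction(as_double) == Fraction(numerator, denominator)

for n, f in ((1, 4), (7, 16), (3, 128), (1, 10000)):
    print(f"{n}/{f}: exact = {is_exact_in_binary(n, f)}")
# 1/4, 7/16 and 3/128 report True; 1/10000 (0.0001) reports False
```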

The inaccuracy may increase slightly when values pass through the 80-bit FPU registers. The 80-bit representation is different from the 32-bit Single or 64-bit Double format: loading a value extends it to 80 bits, and calculations are carried out with that extra precision. Moving a value out of the FPU register rounds it back to fit in either a Single or a Double. Storing and transporting values in this way may add to the inaccuracy of the result.

The following example shows what happens when the small error in representing 0.0001 propagates to the sum:

    Dim dSum As Double, i As Int
    For i = 1 To 10000
      dSum = dSum + 0.0001
    Next i
    Debug dSum   ' = 0.999999999999906

Theoretically the sum should be 1.0.
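To see how the error builds up with the number of additions, the loop can be repeated for different counts. This is a Python sketch, not GFA-BASIC, and `repeated_sum` is a name chosen for this example. Python floats are 64-bit doubles, so the behavior should be comparable to the **Double** loop above, although the exact digits printed may differ between runtimes.

```python
def repeated_sum(step: float, count: int) -> float:
    """Add `step` to a running total `count` times, rounding after each addition."""
    total = 0.0
    for _ in range(count):
        total += step
    return total

for count in (10, 100, 1000, 10000):
    total = repeated_sum(0.0001, count)
    expected = count / 10000
    print(count, total, abs(total - expected))
```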

Calculations are not the only thing that suffers from inaccuracy; comparisons with floating-point numbers are equally problematic. The following example demonstrates a ‘forbidden’ comparison between a floating-point constant and the result of a calculation:

    Global Double dVal1, dVal2
    dVal1 = 69.82
    dVal2 = 69.20 + 0.62
    Assert dVal1 == dVal2   ' Not equal

This throws an ASSERT exception, because the assertion that dVal1 and dVal2 are equal fails.

A comparison between two floating-point constants of the same type is allowed. For instance, the GFA-BASIC runtime returns a Single constant from **DllVersion** (2.33; 2.341; etc.). This constant may be compared to a literal Single constant. Note the exclamation mark: without it, 2.33 is a Double!

If DllVersion == 2.33! MsgBox "This is version 2.33"

Never compare two different data types, such as a **Single** to a **Double**, or a floating-point value to an integer. These comparisons will most likely fail, unless both values can be described exactly using a finite number of bits (see above). Any comparison with a calculated floating-point value is likely to fail, because the comparison is executed in the FPU, which expands the values to 80 bits. The same is true for comparing the results of two floating-point calculations: it will most likely fail.
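The mismatch between a Single and a Double holding the "same" literal can be imitated outside GFA-BASIC. The Python sketch below rounds a double to 32-bit precision with the standard `struct` module and compares the result with the original double; `to_single` is a helper invented for this illustration.

```python
import struct

def to_single(x: float) -> float:
    """Round a double to the nearest 32-bit float and widen it back to a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

for literal in (0.5, 0.1, 2.33):
    single = to_single(literal)
    print(literal, "Single == Double:", single == literal)
# 0.5 is a power of 2 and compares equal; 0.1 and 2.33 do not
```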

The most logical solution for comparing floating-point values of type Double is the special operator NEAR, which uses only 7 decimal digits of both expressions for the comparison. In practice the expressions are compared as if they were both of Single precision.
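The idea behind such a comparison can be sketched in Python. This is not how NEAR is implemented; it only imitates the two descriptions given above, first by rounding both values to Single precision and then by using a relative tolerance of roughly 7 digits. The helper `to_single` and the tolerance value are assumptions made for this example.

```python
import math
import struct

def to_single(x: float) -> float:
    """Round a double to the nearest 32-bit float and widen it back to a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

d_val1 = 69.82
d_val2 = 69.20 + 0.62

print(d_val1 == d_val2)                             # False: the doubles differ in the last bit
print(to_single(d_val1) == to_single(d_val2))       # True: equal at Single precision
print(math.isclose(d_val1, d_val2, rel_tol=1e-7))   # True: equal to about 7 digits
```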

**Improve floating-point consistency in calculations**

If your application performs chains of floating-point calculations, it is necessary to keep the intermediate values in the proper data format; otherwise small errors are propagated through the calculations. For multiple floating-point calculations the FPU uses the intermediate results that it holds in the 80-bit FPU registers. However, these 80-bit calculations do not reflect the data types involved, the Single or Double. Due to the extra level of accuracy, multiple calculations may produce unexpected results. The compiler setting ‘*Improve floating-point consistency*’ inserts code to write intermediate results to memory and load them back in the appropriate type. This decreases program speed, but improves the chance that a calculation gives the expected result. Make sure ‘*Improve floating-point consistency*’ is always checked (unless you know exactly what you’re doing).

**Conclusion**

Floating-point values are inherently inaccurate, so you might want to avoid them as much as possible. Use integers whenever possible, or otherwise use **Currency**, which is an integer value as well. The **Currency** data type stores up to 19 digits exactly, with 4 digits after the decimal point.
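The integer approach can be sketched as a scaled integer with four decimal places, which is the principle the text describes for **Currency**. The Python block below is only an illustration of that principle, not of the GFA-BASIC data type itself; `SCALE` and `to_fixed` are names chosen for this example.

```python
from decimal import Decimal

SCALE = 10_000  # four digits after the decimal point

def to_fixed(text: str) -> int:
    """Parse a decimal string such as '0.0001' into a whole number of 1/10000 units."""
    return int(Decimal(text) * SCALE)

# Adding 0.0001 ten thousand times now gives exactly 1.0000
total = 0
for _ in range(10000):
    total += to_fixed("0.0001")
print(total == SCALE)                                              # True

# The 'forbidden' comparison from earlier becomes a plain integer comparison
print(to_fixed("69.20") + to_fixed("0.62") == to_fixed("69.82"))   # True
```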
