Numerical Accuracy

Before GEMPACK and similar systems became available, economists who wished to solve CGE models required skills in scientific computing and numerical methods. Today, GEMPACK frees the modeller from these requirements. Nevertheless, it might be interesting or comforting to know a little about these matters. Occasionally, such knowledge might even be useful!

Contents

Significant figures and decimal places
Binary storage of real numbers
What accuracy does GEMPACK use?
Rounding errors in computer arithmetic
Could double precision help?
Problems with small numbers
Should the database be scaled?
Could a shock be too small to register?
New or embryo industries

Significant figures and decimal places

A reminder of the difference between these two measures of accuracy: "decimal places" is merely the number of digits after the decimal point, while "significant figures" is the number of digits including and following the first non-zero digit. For example:

Value    Significant figures    Decimal places
7.5      2                      1
7.50     3                      2
0.075    2                      3

Significant figures measure relative or proportional accuracy, and are by far the more meaningful of the two.

Binary storage of real numbers

Computers usually store real numbers in a format similar to the familiar scientific notation. For example, in scientific notation:

          -6.144215*10^3 = -6144.215

Here, we call 6.144215 the mantissa and 3 the exponent.

However, computers use binary bits (not decimal digits, as above). One binary bit is used to store the sign. A few more bits store the exponent. Most bits are used for the mantissa. The actual value is given by:

          [sign]*mantissa*[2^exponent]
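To see these three pieces concretely, you can unpack the bit fields of a single precision number yourself. Below is a minimal Python sketch (purely illustrative -- not part of GEMPACK), valid for normal, non-zero values:

  import struct

  def float32_parts(x):
      # Pack x as an IEEE single precision number, then reinterpret its 4 bytes
      # as a 32-bit unsigned integer so the individual bit fields can be read.
      (bits,) = struct.unpack(">I", struct.pack(">f", x))
      sign     = bits >> 31                       # 1 sign bit
      exponent = ((bits >> 23) & 0xFF) - 127      # 8 exponent bits, stored with a bias of 127
      mantissa = 1 + (bits & 0x7FFFFF) / 2**23    # 23 mantissa bits, plus an implicit leading 1
      return sign, exponent, mantissa

  s, e, m = float32_parts(-6144.215)
  print(s, e, m)                 # 1, 12, ~1.50005
  print((-1)**s * m * 2**e)      # ~ -6144.215, i.e. [sign]*mantissa*[2^exponent]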

Three commonly used real number formats are the IEEE Single, Double and Extended formats:

                         Single       Double       Extended
                         precision    precision    precision
Size (bytes)             4            8            10
Size (bits)              32           64           80
Exponent bits            8            11           15
Mantissa bits            23           52           63
Significant figures      7-8          15-16        19-20
Accurate to 1 part in    8 million    4*10^15      10^19
Minimum absolute value   ~10^-45      ~10^-324     ~10^-4951
Maximum absolute value   ~10^38       ~10^308      ~10^4932

Note that the number of significant figures is determined by the number of mantissa bits, while the minimum and maximum absolute values are determined by the number of exponent bits.
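You can confirm several of these figures in any language that exposes IEEE arithmetic. For example, the following Python/NumPy check (illustrative only) prints the relative precision, largest value and smallest positive value of the single and double formats:

  import numpy as np

  for kind in (np.float32, np.float64):
      info = np.finfo(kind)
      print(kind.__name__, "eps =", info.eps, "max =", info.max)
      # eps ~1.2e-07 (single) and ~2.2e-16 (double); max ~3.4e+38 and ~1.8e+308

  # Smallest positive (subnormal) single precision value, about 1.4*10^-45
  print(np.nextafter(np.float32(0), np.float32(1)))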

What accuracy does GEMPACK use?

GEMPACK uses single precision numbers (aka singles) to store numbers in RAM and in HAR files. This applies to original and updated data, and to percent (or ordinary) change results -- none can be accurate to more than 7 significant digits, or 1 part in 8 million. Nonetheless, this inaccuracy is tiny compared to that induced by other sources of doubt, such as (largest doubt first):

  1. Doubt about model specifications, closure, or parameter values.
  2. Measurement errors in IO tables or SAMs.
  3. Linearization errors, arising because GEMPACK uses several linear approximations to mimic non-linear equations.
  4. Non-balanced initial data -- which are inconsistent with an accurate solution of the levels equation system.
  5. Change equations which cannot be derived from the levels equation system.
  6. Accumulated rounding errors (see below) which reduce accuracy to 5 or 6 significant digits.

Rounding errors in computer arithmetic

Until 2001 most PCs did arithmetic using hardware known as the 8087 or x87 FPU (floating point unit). The x87 FPU does arithmetic extremely accurately -- using 10-byte extended precision. Rounding errors arise when the result is stored (to save space) as a single precision number. For example, 4.5678 x 3.4567 = 15.78951426. This is stored in single precision as 1.578951*10^1, losing several digits. Rounding can be more noticeable in addition. For example, 10000 + 0.0001 = 10000.0001. This is stored in single precision as 1.000000*10^4 (= 10000 !). However, the result is still accurate to 7 significant figures, or 1 part in 8 million.
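The two examples above are easy to reproduce. The Python/NumPy sketch below (illustrative only) forces each result to be stored in single precision:

  import numpy as np

  a = np.float32(4.5678) * np.float32(3.4567)
  print(a)                             # ~15.789514 -- the trailing digits of 15.78951426 are lost

  b = np.float32(10000.0) + np.float32(0.0001)
  print(b, b == np.float32(10000.0))   # 10000.0 True -- the tiny addend vanishes entirely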

Modern or 64-bit compilers using SSE2 may be less accurate

Since 2001, different arithmetic hardware, called SSE2, has become widely used, especially in 64-bit computing. Under SSE2, intermediate results are computed with 4 or 8 byte precision (rather than the x87's 10 byte precision). So SSE2 tends to be less accurate (but is usually faster). Your compiler might choose to use either method, or combine the two. As configured by GEMPACK, the older Lahey compiler (which does not support 64-bit) uses the more accurate x87 hardware. Some GEMPACK users who moved to one of the Intel or GFortran compilers (which use SSE2 and do support 64-bit) have noticed small accuracy losses.

To combat these problems, successive releases of GEMPACK have introduced more frequent internal use of double precision calculation and storage for intermediate results.

Read more about differences between x87 FPU and SSE2.

Accumulated rounding errors

TABLO translates each of your model formulae and equations into many Fortran code statements. Even though the PC may evaluate the RHS of Fortran formulae using extended (10 byte) reals, intermediate results from such formulae are stored with single (4 byte) precision, and combined (via many complex steps) to eventually yield results and updated data.

In such a huge calculation, tiny individual rounding errors will accumulate. Nonetheless, this source of inaccuracy is surprisingly small. You can see this by running a Johansen 10% numeraire shock with GTAP -- results here are independent of original data or linearization error. Price results ranging from 9.999998 to 10.000002 (should be 10) are still accurate to 7 significant figures. We can safely rely on 5 or 6 significant figures.
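The effect of accumulation can be mimicked outside GEMPACK. The Python sketch below (illustrative only, and a deliberately unfavourable case, since the same inexactly-represented value is added over and over so the rounding bias is systematic) accumulates 0.1 one hundred thousand times, once in single and once in double precision:

  import numpy as np

  n = 100_000
  single_total = np.float32(0.0)
  double_total = 0.0                   # Python floats are double precision
  for _ in range(n):
      single_total = single_total + np.float32(0.1)
      double_total = double_total + 0.1

  print(single_total)   # falls short of 10000 by roughly 0.01-0.02% -- several figures lost
  print(double_total)   # within about 1 part in 10^11 of 10000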

Could double precision help?

Some programs, such as Excel or GAMS, use double precision internally -- greatly reducing rounding error. Should GEMPACK do likewise? Hitherto, we have thought not: doubling the precision of stored data would roughly double memory requirements (limiting the size of model that can be solved) and double the size of data files, while the extra accuracy would matter little beside the larger sources of doubt listed above.

These are significant disadvantages. However, we have produced double precision experimental versions of GEMPACK which might be developed for future public release. The current [2012] 64-bit version of GEMPACK uses 8-byte (64-bit) memory addressing, so that more than 2 GB of RAM can be accessed, but still uses 4-byte (32-bit, single precision) real numbers.

Standard GEMPACK programs (eg, GEMSIM and ViewHAR) and TABLO-generated programs make limited use of double precision, where it seems advantageous but cheap.

Problems with small numbers

Quite often, solution problems seem to be associated with small flows. As we have seen, small flows are stored/computed with the same relative accuracy as large flows. So why do small flows often cause problems?

One reason is simply that most database flows are small (Pareto's Law). This is particularly so with large multi-regional databases: some sectors (eg, paddy rice) will be insignificant in some regions; and techniques used to generate multi-dimensional data from observed sub-totals often give rise to many small flows. As well, tiny values are sometimes inserted to avoid problems associated with zero flows (such as zero-divide or singular-matrix errors).

A subtler reason is that the careful scrutiny of data, which underpins model quality, tends to be concentrated on larger, more important flows -- on the grounds that large or erroneous percent changes in tiny values should not affect results for significant variables. However, this reasonable strategy fails when the simulation crashes because a tiny sector has spun out of control: then no results are available!

The remedy is to incorporate in your model a system of automated and rigorous data checking which applies properly to tiny flows as well as large flows.

For example, a condition for accurate solutions is that initial data must be consistent with model equations. The main implication is that accounting balance conditions must be satisfied. Some models might check this by an assertion:

  Assertion # check costs=sales # (all,s,SEC)(all,r,REG) ABS[COSTS(s,r)-SALES(s,r)] < 0.01;

However, although the above check works fine for SALES values around 100, it is useless where SALES has a tiny value, such as 0.001. The assertion should be replaced by something like:

  Formula
   (all,s,SEC)(all,r,REG) DIFF(s,r)  =  ABS[COSTS(s,r)-SALES(s,r)];
   (all,s,SEC)(all,r,REG) DENOM(s,r) = 0.5*[COSTS(s,r)+SALES(s,r)];
  Assertion # check costs=sales # (all,s,SEC) (all,r,REG: DENOM(s,r) > 0)
                    [DIFF(s,r)/DENOM(s,r)] < 0.000001;

The improved version checks that proportional (not absolute) error is small, and works much better for small flows.
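The same principle applies in any language. For example, Python's standard math.isclose test compares values using a relative tolerance; the sketch below (with invented numbers, for illustration only) shows how an absolute check passes while the relative check correctly objects:

  import math

  costs, sales = 0.0010001, 0.0010000   # an invented tiny flow, out of balance by 0.01%

  # Absolute check, like the first Assertion: passes however badly a tiny flow is unbalanced
  print(abs(costs - sales) < 0.01)                              # True -- tells us nothing

  # Relative check, like the improved Assertion: flags the 0.01% discrepancy
  print(math.isclose(costs, sales, rel_tol=1e-6, abs_tol=0.0))  # False -- discrepancy detected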

Again, some models compute shares with formulae like:

  Formula (all,s,SEC) LABSHR(s) = WAGES(s)/[0.0000001+VALUEADDED(s)];

The "0.0000001" avoids zero-divide errors, but has no effect on the LABSHR value as long as VALUEADDED>10. However, a significant error is introduced when VALUEADDED is very small. The formula could instead be written using GEMPACK's ID01 function:

  Formula (all,s,SEC) LABSHR(s) = WAGES(s)/ID01[VALUEADDED(s)];

For all non-zero x, ID01(x) = x; ID01(0) = 1. So zero-divides are avoided, without compromising accuracy.
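The difference between the two formulae is easy to see with invented numbers. In the Python sketch below, id01 simply mimics the behaviour described above for ID01:

  def id01(x):
      # Mimic ID01: return x unchanged unless it is zero, in which case return 1
      return x if x != 0 else 1.0

  value_added = 0.0000004    # an invented, genuinely tiny sector
  wages       = 0.0000003    # so the true labour share is 0.75

  print(wages / (0.0000001 + value_added))   # ~0.6  -- the additive "fudge" distorts the share
  print(wages / id01(value_added))           # ~0.75 -- small non-zero values are left untouched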

Your model should include very many assertions of this kind, checking that the accounting balance and share conditions implied by your model's equations hold -- for tiny flows as well as large ones.

It takes extra time to prepare a database which conforms with such rules. However, you save time later: simulations run more smoothly; and when they do crash, the error is trapped earlier, at a point where it is easier to identify the cause.

Subtractive Cancellation

You might naively suppose that:

  A = B + C - B     implies  A = C

However, consider what happens when B=1000000 and C=0.000001, and all calculations are carried out in single precision:

  A = 1000000 + 0.000001 - 1000000 

The first two numbers may be added to give:

  1000000 + 0.000001 = 1000000          !! 

and then the subtraction gives:

  A = 1000000 - 1000000
    = 0 

However, compilers often rearrange expressions to increase execution speed. The compiler might actually implement:

  A = B + C - B     
    = B - B + C     
    = (1000000 - 1000000) + 0.000001     
    =  0.000001     

On the other hand, if the compiler chose to evaluate the entire RHS in double precision, both orders of evaluation would give the same answer [0.000001].
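The two orders of evaluation are easy to reproduce by forcing every intermediate result into single precision. A Python/NumPy sketch (illustrative only):

  import numpy as np

  B = np.float32(1000000.0)
  C = np.float32(0.000001)

  print((B + C) - B)     # 0.0    -- C is lost when it is first added to the much larger B
  print((B - B) + C)     # ~1e-06 -- the rearranged order keeps C

  # In double precision, either order preserves C
  print((1000000.0 + 0.000001) - 1000000.0)   # ~1e-06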

Moral: Since you cannot foresee just what arithmetic method the compiler will choose, avoid computing expressions that are sensitive to details of computer arithmetic.

Read more about Subtractive Cancellation.

Should the database be scaled?

GAMS users are sometimes urged to scale (choose appropriate units for) their data. GEMPACK users do not need to do this. As explained above, because of the exponential format used by the computer to store real numbers, small numbers are stored and manipulated as accurately as large ones. Scaling all flow values by some common factor would affect neither percent change results nor the accuracy of those results.
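The claim that small numbers are stored as accurately as large ones is easy to verify: the gap between adjacent single precision numbers grows in proportion to their size, so relative accuracy is essentially constant. A Python/NumPy sketch (illustrative only):

  import numpy as np

  for flow in (0.001, 1.0, 1000000.0):
      x = np.float32(flow)
      gap = np.spacing(x)         # distance from x to the next representable single
      print(flow, gap, gap / x)   # relative gap stays close to 10^-7 whatever the magnitude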

Why should GAMS users choose appropriate database units? Starting from an initial set of levels data values, a GAMS solver attempts to adjust endogenous variables so that all equations are true. "True" means that the left and right hand sides of each levels equation must be equal to within some tolerance (which is the same for all equations). The criterion will be harder to meet if terms within equations are very large; conversely, if terms are too small, equations might appear to be solved when they are not. Correct data scaling reduces these problems.

GEMPACK by contrast never (or rarely) evaluates errors in equations, obviating the need for scaling. The exception is in models with quota constraints. Here GEMPACK (like GAMS) evaluates constraints in the levels to see if they are binding. Modellers are advised to choose complementarity variables (eg, the ratio of actual imports to the quota limit) that naturally have values around 1.

Could a shock be too small to register?

GAMS's levels solution method underlies another difference from GEMPACK. When a GAMS simulation starts from a set of levels variables which satisfy the equation system, very small changes in exogenous (and endogenous) variables will cause only small errors in equations. The solver will soon reach a new solution, where equations are solved accurately enough in the sense described above. Due to rounding errors, there is a small amount of "noise" in the new solution values which, if shocks are very small, could dominate the true induced changes in variables. The modeller must take care to apply shocks which are large enough to drown out these random variations.

Nothing like this happens in GEMPACK (which does not usually evaluate errors in equations). Results (including rounding errors) in a Johansen simulation are strictly proportional to shock size, so the relative size of arithmetic errors does not increase with tiny shocks. Indeed, since halving shock size reduces linearization errors by more than half, GEMPACK simulations get more accurate with tiny shocks!

The levels approach requires a little more numerical sophistication than GEMPACK's percent-change approach. That is one reason why GAMS uses double precision -- it greatly extends the range of variable (or shock) values that work well.

New or embryo industries

We have argued above that small database flows should cause no more problems than large flows, if proper precautions are taken. An exception arises when we wish to model new activities by creating, in the database, a very tiny industry -- which will in our simulation grow to appreciable size. The problem here is not that the initial sector is small -- but that the simulated rate of growth is so large. Very many steps may be needed to reduce linearization errors. In addition, if the input supply and output demand curves facing the new sector are not flat, large percentage increases in output will be accompanied by changes in relative input and output prices -- leading to changes in input proportions. Then, the adult industry may little resemble its tiny beginnings.

Some of these problems can be reduced by creating an embryo industry which is not quite so tiny, so that the enormous percentage changes in output are reduced. On the other hand, if your embryonic industry is larger, extra work may be needed to ensure that the initial data are balanced.