7  A Case Study on Floating-Point Arithmetic Stability

This chapter is inspired by the example in Christoph Überhuber, “Computer-Numerik” Vol. 1, Springer 1995, Chap. 2.3.

7.1 Calculation of \(\pi\) According to Archimedes

A lower bound for \(2\pi\) – the circumference of the unit circle – is obtained by summing the side lengths of a regular \(n\)-gon inscribed in the unit circle. The figure on the left illustrates the iterative doubling of the number of vertices starting from a square with side length \(s_4=\sqrt{2}\).

Fig. 1: Iterative doubling of the vertices of the inscribed polygon. Fig. 2: Geometry of the vertex doubling.

The second figure shows the geometry of the vertex doubling.

With \(|\overline{AC}|= s_{n},\quad |\overline{AB}|= |\overline{BC}|= s_{2n},\quad |\overline{MN}| =a, \quad |\overline{NB}| =1-a,\) the Pythagorean theorem applied to triangles \(MNA\) and \(NBA\) respectively yields \[ a^2 + \left(\frac{s_{n}}{2}\right)^2 = 1\qquad\text{and } \qquad (1-a)^2 + \left(\frac{s_{n}}{2}\right)^2 = s_{2n}^2 \] Elimination of \(a\) gives the recursion \[ s_{2n} = \sqrt{2-\sqrt{4-s_n^2}} \qquad\text{with initial value}\qquad s_4=\sqrt{2} \] for the length \(s_n\) of one side of the inscribed regular \(n\)-gon.
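As a quick plausibility check (the closed form \(s_n = 2\sin(\pi/n)\) for the side length of the inscribed regular \(n\)-gon is a standard fact, not derived in the text): one recursion step applied to \(s_4=\sqrt{2}=2\sin(\pi/4)\) must reproduce \(2\sin(\pi/8)\).

```julia
s4 = sqrt(2.0)                   # s_4 = 2·sin(π/4) = √2
s8 = sqrt(2 - sqrt(4 - s4^2))    # one recursion step yields s_8
s8 ≈ 2*sin(π/8)                  # true
```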

The sequence \((n\cdot s_n)\) converges monotonically from below to the limit \(2\pi\): \[ n\, s_n \uparrow 2\pi \] The relative error of approximating \(2\pi\) by an \(n\)-gon is: \[ \epsilon_n = \left| \frac{n\, s_n-2 \pi}{2\pi} \right| \]
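For the initial square, for instance, this error is about ten percent (a small illustrative computation, not from the text):

```julia
# relative error of the square approximation: ϵ_4 = |4·s_4 − 2π| / 2π with s_4 = √2
ϵ4 = abs(4*sqrt(2) - 2π) / (2π)   # ≈ 0.0997
```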

7.2 Two Iteration Formulas

The equation \[ s_{2n} = \sqrt{2-\sqrt{4-s_n^2}}\qquad \qquad \text{(Iteration A)} \] is mathematically equivalent to: \[ s_{2n} = \frac{s_n}{\sqrt{2+\sqrt{4-s_n^2}}} \qquad \qquad \text{(Iteration B)} \]
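The equivalence follows by expanding the radicand with the conjugate factor \(2+\sqrt{4-s_n^2}\) and using \(\bigl(2-\sqrt{4-s_n^2}\bigr)\bigl(2+\sqrt{4-s_n^2}\bigr)=4-(4-s_n^2)=s_n^2\): \[ \sqrt{2-\sqrt{4-s_n^2}} = \sqrt{\frac{s_n^2}{2+\sqrt{4-s_n^2}}} = \frac{s_n}{\sqrt{2+\sqrt{4-s_n^2}}} \]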

However, Iteration A is numerically unstable – it contains an ill-conditioned subtraction – as the following code demonstrates. The output shows the resulting approximations of \(\pi\) for both iteration formulas.

using Printf

iterationA(s) =  sqrt(2 - sqrt(4 - s^2))
iterationB(s) = s / sqrt(2 + sqrt(4 - s^2))

s_B = s_A = sqrt(2)  # initial value

ϵ(x) = abs(x - 2π)/2π  # relative error   

ϵ_A = Float64[]  # vectors for the plot
ϵ_B = Float64[]
is = Float64[]

@printf """Approx. value of π
           n-gon       Iteration A            Iteration B
"""

for i in 3:35
    push!(is, i)
    s_A = iterationA(s_A)         
    s_B = iterationB(s_B)         
    doublePi_A = 2^i * s_A
    doublePi_B = 2^i * s_B
    push!(ϵ_A, ϵ(doublePi_A))
    push!(ϵ_B, ϵ(doublePi_B))
    @printf "%14i %20.15f %20.15f\n" 2^i doublePi_A/2 doublePi_B/2
end
Approx. value of π
           n-gon       Iteration A            Iteration B
             8    3.061467458920719    3.061467458920718
            16    3.121445152258053    3.121445152258052
            32    3.136548490545941    3.136548490545939
            64    3.140331156954739    3.140331156954753
           128    3.141277250932757    3.141277250932773
           256    3.141513801144145    3.141513801144301
           512    3.141572940367883    3.141572940367092
          1024    3.141587725279961    3.141587725277160
          2048    3.141591421504635    3.141591421511200
          4096    3.141592345611077    3.141592345570118
          8192    3.141592576545004    3.141592576584873
         16384    3.141592633463248    3.141592634338564
         32768    3.141592654807589    3.141592648776986
         65536    3.141592645321215    3.141592652386592
        131072    3.141592607375720    3.141592653288993
        262144    3.141592910939673    3.141592653514594
        524288    3.141594125195191    3.141592653570994
       1048576    3.141596553704820    3.141592653585094
       2097152    3.141596553704820    3.141592653588619
       4194304    3.141674265021758    3.141592653589501
       8388608    3.141829681889202    3.141592653589721
      16777216    3.142451272494134    3.141592653589776
      33554432    3.142451272494134    3.141592653589790
      67108864    3.162277660168380    3.141592653589794
     134217728    3.162277660168380    3.141592653589794
     268435456    3.464101615137754    3.141592653589795
     536870912    4.000000000000000    3.141592653589795
    1073741824    0.000000000000000    3.141592653589795
    2147483648    0.000000000000000    3.141592653589795
    4294967296    0.000000000000000    3.141592653589795
    8589934592    0.000000000000000    3.141592653589795
   17179869184    0.000000000000000    3.141592653589795
   34359738368    0.000000000000000    3.141592653589795

While Iteration B stabilizes to a value accurate to within machine precision, Iteration A is unstable and eventually breaks down completely. A plot of the relative errors \(\epsilon_i\) confirms this:

using PlotlyJS

layout = Layout(xaxis_title="Iteration steps", yaxis_title="Relative error",
            yaxis_type="log", yaxis_exponentformat="power", 
            xaxis_tick0=2, xaxis_dtick=2)

plot([scatter(x=is, y=ϵ_A, mode="markers+lines", name="Iteration A"),
      scatter(x=is, y=ϵ_B, mode="markers+lines", name="Iteration B")],
      layout)
      layout)

7.3 Stability and Cancellation

At \(i=26\), Iteration B reaches a relative error on the order of the machine epsilon (the slice below corresponds to \(i=24,\dots,30\), since the error vectors start at \(i=3\)):

ϵ_B[22:28]
7-element Vector{Float64}:
 5.3716034620272725e-15
 9.895059008997609e-16
 1.4135798584282297e-16
 4.240739575284689e-16
 5.654319433712919e-16
 5.654319433712919e-16
 5.654319433712919e-16

Further iterations do not improve the result; it stabilizes at a relative error of approximately 2.5 machine epsilons.

ϵ_B[end]/eps(Float64)
2.5464790894703255

Iteration A is unstable; already at \(i=16\), the relative error begins to grow again.

The cause is a typical numerical cancellation effect. The side lengths \(s_n\) shrink rapidly, so \(a_n=\sqrt{4-s_n^2}\) is only slightly smaller than 2, and the subtraction in \(s_{2n}=\sqrt{2-a_n}\) suffers catastrophic cancellation.
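The effect can be reproduced in isolation (a minimal sketch; \(s=10^{-8}\) is a hypothetical side length of roughly the size reached after 27 doublings). Since \(s^2=10^{-16}\) lies below the rounding unit of 4 in Float64, \(\mathrm{eps}(4.0)\approx 8.9\cdot10^{-16}\), the subtraction \(4-s^2\) already rounds to 4, and Iteration A collapses to 0, exactly the zeros in the table above:

```julia
s = 1.0e-8     # hypothetical side length after roughly 27 doublings
x = 4 - s^2    # 4 - 1e-16 rounds to 4.0, since eps(4.0) ≈ 8.9e-16
x == 4.0                   # true: all information from s² is lost
sqrt(2 - sqrt(x))          # 0.0, Iteration A collapses as in the table above
s / sqrt(2 + sqrt(x))      # 5.0e-9, i.e. s/2: Iteration B is unaffected
```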

setprecision(80) # precision for BigFloat

s = sqrt(BigFloat(2))

@printf "     a = √(4-s^2) as BigFloat       and as Float64\n\n"

for i in 3:44
    s = iterationA(s)
    x = sqrt(4 - s^2)
    if i > 20
        @printf "%i %30.26f %20.16f \n" i x Float64(x)
    end
end
     a = √(4-s^2) as BigFloat       and as Float64

21   1.99999999999775591177215422   1.9999999999977560 
22   1.99999999999943897794303856   1.9999999999994389 
23   1.99999999999985974448576005   1.9999999999998597 
24   1.99999999999996493612143919   1.9999999999999649 
25   1.99999999999999123403035980   1.9999999999999913 
26   1.99999999999999780850758995   1.9999999999999978 
27   1.99999999999999945212689707   1.9999999999999996 
28   1.99999999999999986303172344   1.9999999999999998 
29   1.99999999999999996575793045   2.0000000000000000 
30   1.99999999999999999143948303   2.0000000000000000 
31   1.99999999999999999785987034   2.0000000000000000 
32   1.99999999999999999946496800   2.0000000000000000 
33   1.99999999999999999986624159   2.0000000000000000 
34   1.99999999999999999996656040   2.0000000000000000 
35   1.99999999999999999999163886   2.0000000000000000 
36   1.99999999999999999999790889   2.0000000000000000 
37   1.99999999999999999999947722   2.0000000000000000 
38   1.99999999999999999999986931   2.0000000000000000 
39   1.99999999999999999999996691   2.0000000000000000 
40   1.99999999999999999999999173   2.0000000000000000 
41   1.99999999999999999999999835   2.0000000000000000 
42   1.99999999999999999999999835   2.0000000000000000 
43   1.99999999999999999999999835   2.0000000000000000 
44   1.99999999999999999999999835   2.0000000000000000 

This demonstrates the loss of significant digits. It also shows that using BigFloat with 80 bits of precision (significand length) only slightly delays the onset of the cancellation effect.

Countermeasures against unstable algorithms typically require better, stable algorithms, not more precise machine numbers.
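The point can be illustrated with the example at hand (a sketch, not from the original text; the helper names `iterA`, `iterB`, `final_sides` are my own): even single precision combined with the stable Iteration B beats double precision combined with the unstable Iteration A.

```julia
# The two iteration formulas from above, renamed here
iterA(s) = sqrt(2 - sqrt(4 - s^2))       # unstable formula
iterB(s) = s / sqrt(2 + sqrt(4 - s^2))   # stable formula

function final_sides(n_steps)
    s_A = sqrt(2.0)      # Float64, unstable recursion
    s_B = sqrt(2.0f0)    # Float32, stable recursion
    for _ in 1:n_steps
        s_A = iterA(s_A)
        s_B = iterB(s_B)
    end
    return s_A, s_B
end

s_A, s_B = final_sides(33)   # 33 doublings: n = 2^35, as in the table above
ϵ(x) = abs(x - 2π) / (2π)

ϵ(2.0^35 * s_A)              # 1.0: the Float64 Iteration A has collapsed to 0
ϵ(2.0^35 * Float64(s_B))     # small: limited by Float32 precision, not by the algorithm
```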