Math 321 Lecture 1
Newton’s method in one and more dimensions and IEEE floating point
One of the giants of scientific computing was Isaac Newton, 300 years before he could have
had access to a computer! Many of the best algorithms in use today were invented before the
computer. The lesson you should draw is not to underestimate the importance of mathematics
in scientific computing. The more you know, the better.
A prototype numerical problem is to find a root of the equation
$$f(x) = 0.$$
To make it more practical, let us suppose that we want to produce a computer subroutine to compute
$$x = \sqrt{z}.$$
In order to cast it into the standard form suitable for applying Newton’s method, we rewrite
it as
$$f(x) = x^2 - z = 0.$$
This equation has two roots, $+\sqrt{z}$ and $-\sqrt{z}$, so we will have to make sure that we get the
answer that we are after. Of course, if $z < 0$, then there will be no real solution.
One step of Newton’s method:
Suppose that we have arrived somehow at an estimate $x_n$ of the root $x^*$ on iteration $n$. We
draw the tangent to the curve at the point $(x_n, f(x_n))$ and follow it down to the x axis; the point where it crosses
becomes our next iterate $x_{n+1}$, hopefully closer to $x^*$. Mathematically,
$$\tan(\theta) = f'(x_n) = \frac{f(x_n)}{x_n - x_{n+1}},$$
which, after rearrangement, gives us the computational procedure for obtaining the next iterate
$x_{n+1}$ from the current iterate $x_n$:
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.$$
In our sqrt example, this has a very simple form:
$$f(x) = x^2 - z, \qquad f'(x) = 2x,$$
$$x_{n+1} = x_n - \frac{x_n^2 - z}{2 x_n} = \frac{1}{2}\left(x_n + \frac{z}{x_n}\right).$$
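The square root iteration can be sketched in a few lines of Python. This is only an illustration, not a library-quality routine: the function name `newton_sqrt`, the default starting guess of 1.0, the relative tolerance and the iteration cap are all assumptions made for the example.

```python
def newton_sqrt(z, x0=1.0, tol=1e-15, max_iter=200):
    """Approximate sqrt(z) with Newton's method applied to f(x) = x^2 - z."""
    if z < 0:
        raise ValueError("no real square root for negative z")
    if z == 0:
        return 0.0
    x = x0
    for _ in range(max_iter):
        # One Newton step: x - (x^2 - z) / (2x), simplified to (x + z/x) / 2
        x_new = 0.5 * (x + z / x)
        # Stop when successive iterates agree to about the working precision
        if abs(x_new - x) <= tol * abs(x_new):
            return x_new
        x = x_new
    return x


root_two = newton_sqrt(2.0)
```

Starting from a positive guess, the iterates stay positive, so the procedure picks out the root $+\sqrt{z}$ rather than $-\sqrt{z}$.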
This example is typical of a lot of scientific computation, in that we have a problem and a
proposed method of solution. We would like to know when it will work and when it won’t
(the Russian mathematician Kantorovich worked on this problem); how fast it will give us an
answer (Newton understood this); and what to watch out for in a numerical implementation
(the Berkeley mathematician Kahan gave us the IEEE754 standard for floating point arithmetic
used in almost all computers today).
Obviously, it won’t always work. We will not go into all the gory details, but if $f'(x_n) = 0$
it fails disastrously (the tangent is horizontal and never meets the x axis), and if $f'(x_n)$ ever becomes numerically small, the procedure is likely to
wander far from the desired solution.
To understand the rate of convergence, we have to do some analysis. Write
$$x_{n+1} = x^* + \delta_{n+1}, \qquad x_n = x^* + \delta_n,$$
where $x^*$ is the solution point where $f(x^*) = 0$, $\delta_n$ is the error at step $n$, and $\delta_{n+1}$ is the error
at step $n + 1$. Then
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$
becomes
$$x^* + \delta_{n+1} = x^* + \delta_n - \frac{f(x^* + \delta_n)}{f'(x^* + \delta_n)},$$
which gives
$$\delta_{n+1} = \delta_n - \frac{f(x^* + \delta_n)}{f'(x^* + \delta_n)}.$$
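Expanding using Taylor series:
$$f(x^* + \delta_n) = f(x^*) + \delta_n f'(x^*) + \frac{\delta_n^2}{2} f''(x^*) + \frac{\delta_n^3}{6} f'''(x^*) + \ldots,$$
$$f'(x^* + \delta_n) = f'(x^*) + \delta_n f''(x^*) + \frac{\delta_n^2}{2} f'''(x^*) + \ldots,$$
and, using $f(x^*) = 0$,
$$\delta_{n+1} = \delta_n - \frac{\delta_n \left( f'(x^*) + \frac{\delta_n}{2} f''(x^*) + \frac{\delta_n^2}{6} f'''(x^*) + \ldots \right)}{f'(x^*) + \delta_n f''(x^*) + \frac{\delta_n^2}{2} f'''(x^*) + \ldots}
= \frac{\delta_n^2}{2} \, \frac{f''(x^*) + \frac{2\delta_n}{3} f'''(x^*) + \ldots}{f'(x^*) + \delta_n f''(x^*) + \frac{\delta_n^2}{2} f'''(x^*) + \ldots}.$$
When we are near the solution, $\delta_n \to 0$, so provided that $f'(x^*) \neq 0$,
$$\delta_{n+1} \approx \frac{\delta_n^2}{2} \, \frac{f''(x^*)}{f'(x^*)}.$$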
This is described as quadratic convergence, a wonderful outcome 300 years ago and still a very
desirable property of numerical algorithms today.
If $f'(x^*) = 0$, the leading term in both numerator and denominator is $f''(x^*)$, and the convergence reverts to linear convergence:
$$\delta_{n+1} \approx \frac{\delta_n}{2}.$$
If you were to solve $f(x) = x^2 - 4 = 0$ and $f(x) = (x - 2)^2 = 0$, both starting from $x_0 = 1$, you
would clearly see the difference. Newton’s algorithm does not work as well for repeated roots.
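The comparison suggested above can be tried with a short Python sketch. The helper name `newton_errors` and the choice of five steps are assumptions for illustration; the two functions and the starting point $x_0 = 1$ are the ones named in the text.

```python
def newton_errors(f, df, x0, root, steps):
    """Return the errors |x_n - root| for n = 0 .. steps under Newton's iteration."""
    x = x0
    errors = [abs(x - root)]
    for _ in range(steps):
        x = x - f(x) / df(x)
        errors.append(abs(x - root))
    return errors


# Simple root: f(x) = x^2 - 4, f'(2) = 4 is non-zero, so the error is roughly squared each step
simple = newton_errors(lambda x: x * x - 4.0, lambda x: 2.0 * x, 1.0, 2.0, 5)

# Repeated root: f(x) = (x - 2)^2, f'(2) = 0, so the error is only halved each step
repeated = newton_errors(lambda x: (x - 2.0) ** 2, lambda x: 2.0 * (x - 2.0), 1.0, 2.0, 5)

for n, (e_simple, e_repeated) in enumerate(zip(simple, repeated)):
    print(n, e_simple, e_repeated)
```

After five steps the simple-root error is at the level of rounding error, while the repeated-root error has only fallen from 1 to 1/32.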
Computer implementation
Much scientific computation involves iterations of the type
$$x_{n+1} = \text{some function of } x_n$$
and on the computer these will usually involve inexact arithmetic if we are using C++ or
FORTRAN or the evalf function in Maple. For example,
$$\sqrt{2} = 1.4142\ldots$$
We know $\sqrt{2}$ is not rational, so it has a non-terminating, non-recurring decimal expansion,
and a similar but less familiar non-terminating, non-recurring binary expansion.
Computers are able to represent integers, and do integer arithmetic exactly. For example, a 32
bit binary representation of the decimal number 11 is
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1
The leftmost bit is the sign bit, and the second bit from the right is the $2^1$ place.
The sign bit is 0 for positive integers. The decimal number 11 is $2^3 + 2^1 + 2^0$.
The ones complement of a number replaces each 1 bit by a 0 bit, and each 0 bit by a 1 bit.
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0
The twos complement of a number is the ones complement + 1;
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1
The twos complement representation is used most commonly today. The number directly above
is the binary representation of −11 in twos complement 32 bit computers. As you can see, the
leading sign bit of a negative integer is 1.
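The bit patterns above can be reproduced with a small Python sketch. Python integers are not fixed-width, so the 32 bit width is imposed here by masking; the function names are assumptions made for the example.

```python
def twos_complement_bits(value, width=32):
    """Return the width-bit two's complement bit pattern of an integer as a string."""
    if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
        raise ValueError("value does not fit in the requested width")
    # Masking with 2^width - 1 maps a negative value onto its two's complement pattern
    return format(value & ((1 << width) - 1), "0{}b".format(width))


def ones_complement_bits(bits):
    """Flip every bit of a bit string."""
    return "".join("1" if b == "0" else "0" for b in bits)


eleven = twos_complement_bits(11)
minus_eleven = twos_complement_bits(-11)
flipped = ones_complement_bits(eleven)
```

Adding 1 to the integer value of `flipped` gives the same pattern as `minus_eleven`, which is the "ones complement + 1" rule stated above.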
The representation of floating point, or real, numbers is a little more complicated. Over the
last 50 years of computing, various formats with varying precision have been used
in different computers, with the somewhat unnerving consequence that different computers
have not always produced exactly identical answers. We could spend
a long time talking about the IEEE754 standard, but we cannot afford the time to discuss it
thoroughly. A Google search on IEEE754 will provide more detail, including some of Kahan’s
proposals for the standard and comments on its implementation. What I will try to give here
is a brief outline which may help you to read about it in more detail if you are tempted.
To illustrate the point, consider the scientific notation numbers
$$1.234 \times 10^{6}, \qquad 1.234 \times 10^{-6}, \qquad -1.234 \times 10^{6}, \qquad -1.234 \times 10^{-6}.$$
There are two parts to the number, the fraction (mantissa) and the power (exponent). There
are two signs, one for the number itself and one for the exponent. There are many possible
representations of the same number, all of which are perfectly valid scientific notation:
$$1.234 \times 10^{6} = 12.34 \times 10^{5} = 0.1234 \times 10^{7} = \ldots$$
Let us pretend for a moment we are designing a decimal floating point computer. (In fact,
such a computer, the IBM 1620, was introduced in 1959.)
Firstly we have to decide on a dynamic range of numbers. Say we decide to accept numbers
with exponents in the range of roughly $10^{-50}$ to $10^{+50}$.
Then suppose that we want each fraction to be represented with 7 decimal digits,
and that a properly normalised real will resemble $1.234567 \times 10^{16}$, in that there is exactly one digit before the
decimal point, and that digit is 0 only if the entire
number is 0.0. We can allocate one digit to the sign of the number, but we still have to represent
the sign of the exponent. This can be done by using an excess 50 notation. We allocate two
digits for the exponent, and equate 00 with an exponent of $10^{-50}$, 50 with an exponent of
$10^{0}$, and 99 with an exponent of $10^{+49}$. Because the decimal point always follows the first
digit, we need not represent the decimal point itself. Then we can represent the number
$1.234567 \times 10^{16}$ by + 66 1234567. Apart from a couple of minor details, this is the design of
the IBM1620 floating point number.
Some of the things that can go wrong when we do floating point arithmetic are

operation        result
big * big        +Inf (overflow)
big * (-big)     -Inf (overflow)
x / 0.0          zero divide: +Inf (x > 0), -Inf (x < 0)
0.0 / 0.0        Nan (not a number)
small * small    subnormal result, which could be 0.0
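Most of these cases can be observed directly in Python, whose `float` is an IEEE754 double on essentially all current platforms. The sketch below makes one substitution, labelled in the comments: Python raises `ZeroDivisionError` for a floating point divide by zero instead of returning Inf or Nan, so the Nan is produced here as Inf - Inf, which is also an invalid operation. The variable names are assumptions for the example.

```python
import math
import sys

big = sys.float_info.max      # largest finite double, about 1.8e308
tiny = sys.float_info.min     # smallest normalised double, about 2.2e-308

overflow_pos = big * big                      # big * big: overflows to +Inf
overflow_neg = big * (-big)                   # big * (-big): overflows to -Inf
not_a_number = overflow_pos - overflow_pos    # Inf - Inf stands in for 0.0 / 0.0: Nan
subnormal = tiny * 0.25                       # small * small: a subnormal result
flushed = tiny * tiny                         # far too small even for a subnormal: 0.0

print(overflow_pos, overflow_neg, not_a_number, subnormal, flushed)
```

Note that a Nan compares unequal to everything, including itself, so `math.isnan` is the reliable test for it.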
The principles underlying IEEE754 representation are similar, except that the fraction is a
binary fraction, and the exponent is a power of 2 and not a power of 10.
A single precision real occupies 32 bits, or four 8-bit bytes, has an 8-bit exponent giving a
dynamic range of roughly $10^{-38}$ to $10^{38}$, and a 24-bit fraction roughly equivalent to 7 decimal
digits. Actually, because the leading digit of the fraction of a properly normalised number is
non-zero, it has to be a 1, so there is no need to store it, and the fraction can be stored in 23
bits, leaving one bit over for the leading sign bit (0 for a positive number, and 1 for a negative
number). Thus the binary representation of 13.0, which is $1.101_2 \times 2^3$, is

sign       0
exponent   1 0 0 0 0 0 1 0
fraction   1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The leading sign bit is 0, denoting a positive number.
Note the excess exponent: 1 0 0 0 0 0 0 0 is the representation of
exponent 1, so 1 0 0 0 0 0 1 0 is the representation of exponent 3.
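The layout of 13.0 can be checked with Python's standard `struct` module, which can pack a value as an IEEE754 single precision number. This is a minimal sketch; the function name `float32_fields` is an assumption for the example.

```python
import struct


def float32_fields(x):
    """Split the single precision bit pattern of x into (sign, exponent, fraction) bit strings."""
    # Pack as a big-endian 32 bit float, then reinterpret the same bytes as an unsigned integer
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(word, "032b")
    return bits[0], bits[1:9], bits[9:]


sign, exponent, fraction = float32_fields(13.0)
true_exponent = int(exponent, 2) - 127   # remove the excess (bias) of 127
print(sign, exponent, fraction, true_exponent)
```

The stored fraction begins 101 because the leading 1 of 1.101 is implied and not stored.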
A double precision real occupies 64 bits, or eight 8-bit bytes, has an 11-bit exponent giving a
dynamic range of roughly $10^{-308}$ to $10^{308}$, and a 53-bit fraction (with 52 bits stored and the
first an implied 1), roughly equivalent to 16 decimal digits, plus a leading sign bit.
We give a table of special numbers in single precision:

Hexadecimal representation    Value
00000000 or 80000000          0.0 (both are possible: +0.0 and -0.0)
7F7FFFFF                      the biggest normalised number
00800000                      the smallest normalised number
00000001 to 007FFFFF          subnormal number
00000001                      the smallest subnormal number
Note the subnormal numbers, which have an exponent as small as it can be (0 0 0 0 0 0 0 0).
For this exponent, the fraction is represented explicitly, and does not have an implied 1 before
the binary point. This allows computations to underflow gracefully and to represent numbers
smaller than would be possible if only normalised numbers were permitted.
What causes underflow?
The smallest single precision normalised number is 1.17549435e-38, so if we did any of the
following, we would produce an underflow:
1.2e-20 * 1.2e-20 = 1.44e-40
1.2345e-36 - 1.2344e-36 = 1.0e-40
Notice that when you subtract quantities that are almost equal you lose precision in the fraction
part of the result as well. Allowing subnormal numbers is ”better” than saying either of the
above results is 0.0, which is the only alternative if we allow only normalised numbers. The
very next operation might be a divide by this result, which will be catastrophic if it has been
set to 0.0.
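Python has no built-in single precision type, but underflow into the subnormal range can still be imitated by rounding each value to single precision with the `struct` module. The sketch below is only an approximation of true single precision arithmetic (the operation itself is done in double precision and then rounded), and the helper name `to_float32` is an assumption. The operands are chosen so that the product is about 1.44e-40 and the difference about 1.0e-40.

```python
import struct

SMALLEST_NORMAL = 1.17549435e-38   # smallest normalised single precision number


def to_float32(x):
    """Round a Python float (a double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


product = to_float32(to_float32(1.2e-20) * to_float32(1.2e-20))
difference = to_float32(to_float32(1.2345e-36) - to_float32(1.2344e-36))

print(product, difference)
```

Both results are non-zero but smaller than the smallest normalised number, so they survive only because subnormal numbers are allowed.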
Quiet Nan and Signalling Nan results are treated differently, in that every time an operation
is performed on a signalling Nan an interrupt is generated, which if properly handled, will
allow a program in trouble to terminate gracefully. If the interrupts are ignored, Nans will
likely propagate through the program and produce output which can immediately be seen to
be wrong.
For many programs, it is not necessary for a programmer to know the details of the binary
representation of the numbers in the computer, but programmers have to be aware of
the fact that the precision is finite, even in double precision. You should know that
terminating decimal fractions such as 0.1 and 0.01 do not have terminating binary fractions,
so it would not be guaranteed, for example, that $10 \times 0.01 = 0.1$ exactly. When you compute
$x_n - x_{n+1}$ in Newton’s method to see how your solution is converging, the closer they come
together, the more digits of precision you will lose when computing the difference. Sometimes
it is useful to know the precise details. Suppose that you wish to use Newton’s method to
compute the sqrt() function. If you are trying to compute sqrt(1.234e+50), you would like
to start with a guess like 1.0e+25, so you really need to be able to manipulate the binary
representation of the exponent.
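The point about terminating decimal fractions can be made concrete in Python, where `fractions.Fraction` recovers the exact rational value that a double actually stores. This is a small illustrative sketch; the variable names are assumptions.

```python
from fractions import Fraction

stored_tenth = Fraction(0.1)     # exact value of the double nearest to 0.1
exact_tenth = Fraction(1, 10)    # the true decimal fraction 1/10

# The stored value is a ratio with a power-of-two denominator, so it cannot equal 1/10
representation_error = stored_tenth - exact_tenth

# Ten additions of 0.1, each one rounded, do not give exactly 1.0
total = sum([0.1] * 10)

print(stored_tenth, float(representation_error), total)
```

For this reason, convergence tests should compare against a tolerance and never test two computed reals for exact equality.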
Newton’s method extended to higher dimensions
Suppose we are looking for the solution of two equations in two unknowns:
f (x, y) = 0
g(x, y) = 0
For example, we could be looking for the points of intersection of the line
y = mx + b
with the circle
$$(x - p)^2 + (y - q)^2 = r^2.$$
We recast Newton’s method in one dimension and generalise:
We would like to go from our current value $x_n$ to our new point $x_{n+1} = x_n + h$, where
$$f(x_n + h) = f(x_n) + h f'(x_n) + \ldots = 0,$$
$$h = -f(x_n)/f'(x_n),$$
$$x_{n+1} = x_n + h = x_n - f(x_n)/f'(x_n).$$
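The two dimensional analogue is as follows: From the point $(x_n, y_n)$, we want to move to the
point $(x_{n+1}, y_{n+1})$ where
$$x_{n+1} = x_n + h, \qquad y_{n+1} = y_n + k$$
such that
$$f(x_{n+1}, y_{n+1}) = 0, \qquad g(x_{n+1}, y_{n+1}) = 0.$$
Using the first order Taylor series approximation, now in 2 dimensions,
$$f(x_{n+1}, y_{n+1}) \approx f(x_n, y_n) + h \frac{\partial f}{\partial x} + k \frac{\partial f}{\partial y} + \ldots = 0,$$
$$g(x_{n+1}, y_{n+1}) \approx g(x_n, y_n) + h \frac{\partial g}{\partial x} + k \frac{\partial g}{\partial y} + \ldots = 0,$$
with the partial derivatives evaluated at $(x_n, y_n)$, which we can write as
$$J\,\mathbf{h} = -\mathbf{f}, \quad \text{where} \quad
J = \begin{pmatrix} \partial f/\partial x & \partial f/\partial y \\ \partial g/\partial x & \partial g/\partial y \end{pmatrix}, \quad
\mathbf{h} = \begin{pmatrix} h \\ k \end{pmatrix}, \quad \text{and} \quad
\mathbf{f} = \begin{pmatrix} f(x_n, y_n) \\ g(x_n, y_n) \end{pmatrix},$$
so we find that we have a matrix equation to solve for $h$ and $k$ at each step.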
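Returning to our example of finding the point(s) where the line
$$f(x, y) = mx - y + b = 0$$
cuts the circle
$$g(x, y) = (x - p)^2 + (y - q)^2 - r^2 = 0,$$
we find
$$\frac{\partial f}{\partial x} = m, \qquad \frac{\partial f}{\partial y} = -1, \qquad
\frac{\partial g}{\partial x} = 2(x - p), \qquad \frac{\partial g}{\partial y} = 2(y - q),$$
so that each step requires the solution of
$$\begin{pmatrix} m & -1 \\ 2(x_n - p) & 2(y_n - q) \end{pmatrix}
\begin{pmatrix} h \\ k \end{pmatrix}
= - \begin{pmatrix} m x_n - y_n + b \\ (x_n - p)^2 + (y_n - q)^2 - r^2 \end{pmatrix}.$$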
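The line and circle example is small enough to code directly. The Python sketch below solves the 2 by 2 system at each step with Cramer's rule, which is a choice made here for brevity and not something prescribed by the notes; the function name, tolerance and iteration cap are likewise assumptions.

```python
def line_circle_newton(m, b, p, q, r, x, y, tol=1e-12, max_iter=50):
    """Newton's method for the line y = m x + b and the circle (x-p)^2 + (y-q)^2 = r^2."""
    for _ in range(max_iter):
        f = m * x - y + b
        g = (x - p) ** 2 + (y - q) ** 2 - r ** 2
        # Jacobian entries at the current point
        j11, j12 = m, -1.0
        j21, j22 = 2.0 * (x - p), 2.0 * (y - q)
        det = j11 * j22 - j12 * j21
        if det == 0.0:
            raise ZeroDivisionError("singular Jacobian")
        # Cramer's rule for J (h, k)^T = -(f, g)^T
        h = (-f * j22 + g * j12) / det
        k = (-g * j11 + f * j21) / det
        x, y = x + h, y + k
        if abs(h) + abs(k) < tol:
            break
    return x, y


# The line y = x + 1 cuts the circle x^2 + y^2 = 25 at (3, 4) and (-4, -3)
upper = line_circle_newton(1.0, 1.0, 0.0, 0.0, 5.0, 2.0, 5.0)
lower = line_circle_newton(1.0, 1.0, 0.0, 0.0, 5.0, -5.0, -2.0)
```

Which of the two intersection points is found depends on the starting guess, just as the sign of the square root did in one dimension.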
The power of matrix notation is that it extends to any number of dimensions. If we have $m$
functions $f_1, f_2, \ldots, f_m$, each of which is a function of $m$ variables $x_1, x_2, \ldots, x_m$, then we can
write one step of Newton’s method as
$$J\,\mathbf{h} = -\mathbf{f}, \qquad J_{ij} = \frac{\partial f_i}{\partial x_j},$$
evaluated at $\mathbf{x}_n$.
The full Newton algorithm is as follows:
Begin with a starting guess $\mathbf{x}_0$
REPEAT
    Evaluate the matrix $J$, with operation count $m^2$
    Evaluate the vector $\mathbf{f}$, with operation count $m$
    Solve the matrix equation $J\,\mathbf{h} = -\mathbf{f}$, with operation count $m^3$
    Update $\mathbf{x}_{n+1} \leftarrow \mathbf{x}_n + \mathbf{h}$, with operation count $m$
UNTIL $\mathbf{h}$ and $\mathbf{f}$ are ”small enough”, or maximum iterations exceeded.
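The algorithm above can be sketched in plain Python for any number of dimensions. The linear solve is done here by Gaussian elimination with partial pivoting, one common way to get the $m^3$ operation count; the function names, the tolerance and the calling convention (the caller supplies both the function and its Jacobian) are assumptions made for this example.

```python
def solve_linear(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    m = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ZeroDivisionError("singular Jacobian")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * m
    for i in range(m - 1, -1, -1):                            # back substitution
        s = sum(aug[i][j] * x[j] for j in range(i + 1, m))
        x[i] = (aug[i][m] - s) / aug[i][i]
    return x


def newton_system(func, jacobian, x0, tol=1e-12, max_iter=50):
    """Iterate x <- x + h, where J h = -f, until h and f are both small."""
    x = list(x0)
    for _ in range(max_iter):
        fx = func(x)
        h = solve_linear(jacobian(x), [-v for v in fx])
        x = [xi + hi for xi, hi in zip(x, h)]
        if max(abs(v) for v in h) < tol and max(abs(v) for v in func(x)) < tol:
            break
    return x


# Usage: the circle x^2 + y^2 = 25 and the line x - y + 1 = 0, starting from (2, 5)
solution = newton_system(
    lambda v: [v[0] ** 2 + v[1] ** 2 - 25.0, v[0] - v[1] + 1.0],
    lambda v: [[2.0 * v[0], 2.0 * v[1]], [1.0, -1.0]],
    [2.0, 5.0],
)
```

For large $m$ the linear solve dominates the cost of each iteration, which is why the $m^3$ term in the operation counts matters.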
One application of Newton’s method is Bairstow’s method for extracting the roots of polynomials.
It is worthwhile for you to look at this application, because it shows you another
important idea, that you can differentiate through a set of recurrence relations in a way that
perhaps you have not seen before, though of course Maple will support such operations.
A fundamental theorem of algebra is that a polynomial of degree $n$ with complex coefficients
has $n$ complex roots satisfying
$$P_n(z) = 0.$$
If we look at polynomials with real coefficients, it is not difficult to establish that if
$$z_1 = x + \iota y$$
is a root of
$$P_n(z) = 0,$$
then
$$z_2 = x - \iota y$$
is also a root. That is, polynomials with real coefficients have roots which are either real, or
can be grouped together in complex conjugate pairs. As an example to illustrate this, consider
the quartic polynomial:
$$P_4(z) = z^4 + a_3 z^3 + a_2 z^2 + a_1 z + a_0$$
$$= (x + \iota y)^4 + a_3 (x + \iota y)^3 + a_2 (x + \iota y)^2 + a_1 (x + \iota y) + a_0$$
$$= \text{Real} + \iota\,\text{Imag},$$
where
$$\text{Real} = x^4 - 6x^2y^2 + y^4 + a_3(x^3 - 3xy^2) + a_2(x^2 - y^2) + a_1 x + a_0$$
contains only even powers of $y$, and
$$\text{Imag} = 4x^3y - 4xy^3 + a_3(3x^2y - y^3) + a_2(2xy) + a_1 y$$
contains only odd powers of $y$. Replacing $y$ by $-y$ therefore leaves Real unchanged and changes the sign of Imag, so it follows that
$$\text{Real} + \iota\,\text{Imag} = 0 \quad \text{implies} \quad \text{Real} - \iota\,\text{Imag} = 0,$$
which in turn illustrates that the complex roots of a real polynomial occur in complex conjugate
pairs, which we could think of as the pair of roots of the real quadratic equation $z^2 + pz + q = 0$,
where $p$ and $q$ are real.
Looking at the first few monic polynomials:
$P_1(z) = z + a_0 = 0$ has one real root;
$P_2(z) = z^2 + a_1 z + a_0 = 0$ has 2 real roots, or a complex conjugate pair;
$P_3(z) = z^3 + a_2 z^2 + a_1 z + a_0 = 0$ has one real root, and either a complex conjugate pair, or two more real roots; . . .
Bairstow’s idea is that we can write $P_n(z)$ as follows:
$$P_n(z) = (z^2 + pz + q)\,Q_{n-2}(z) + Rz + S$$
where $Rz + S$ is the remainder which is left when we divide through by the quadratic, and
the coefficients $R$ and $S$ depend on the coefficients $p$ and $q$ of the quadratic. If the quadratic
$z^2 + pz + q$ exactly divides $P_n(z)$, then $R$ and $S$ will be zero. Although it is obvious that $R$
and $S$ depend on $p$ and $q$, the dependence is implicit, and is usually established through the
Horner recurrence relations. Let
$$P_n(z) = z^n + a_{n-1} z^{n-1} + a_{n-2} z^{n-2} + \ldots + a_1 z + a_0,$$
$$Q_{n-2}(z) = z^{n-2} + b_{n-3} z^{n-3} + b_{n-4} z^{n-4} + \ldots + b_1 z + b_0.$$
Multiply and equate coefficients:
$$(z^2 + pz + q)\,Q_{n-2}(z) + Rz + S = z^n + z^{n-1}(b_{n-3} + p) + z^{n-2}(b_{n-4} + p\,b_{n-3} + q) + z^{n-3}(b_{n-5} + p\,b_{n-4} + q\,b_{n-3})$$
$$+ \ldots + z^2(b_0 + p\,b_1 + q\,b_2) + z(p\,b_0 + q\,b_1 + R) + (q\,b_0 + S).$$
This leads to the recurrence relations:
$$b_{n-3} + p = a_{n-1}$$
$$b_{n-4} + p\,b_{n-3} + q = a_{n-2}$$
$$b_{n-5} + p\,b_{n-4} + q\,b_{n-3} = a_{n-3}$$
$$\ldots$$
$$b_0 + p\,b_1 + q\,b_2 = a_2$$
$$R + p\,b_0 + q\,b_1 = a_1$$
$$S + q\,b_0 = a_0$$
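from which we can define the computational procedure:
# b[n-2] is the leading (monic) coefficient of Q; b[n-1] is a zero placeholder
b[n-1] := 0;
b[n-2] := 1;
for i from n-3 by -1 to 0 do
    b[i] := a[i+2] - p * b[i+1] - q * b[i+2];
end do;
# coefficients of the remainder R z + S
R := a[1] - p * b[0] - q * b[1];
S := a[0] - q * b[0];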
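The same recurrence can be tried in Python. In this sketch the list `a` holds $a_0, \ldots, a_{n-1}$ (the leading coefficient 1 is implied, as in the monic form above), and the function name `bairstow_remainder` is an assumption made for the example.

```python
def bairstow_remainder(a, p, q):
    """Divide z^n + a[n-1] z^(n-1) + ... + a[0] by z^2 + p z + q.

    Returns (b, R, S): b[0..n-2] are the quotient coefficients (b[n-2] = 1 is the
    leading one, b[n-1] = 0 is a placeholder) and R z + S is the remainder.
    """
    n = len(a)                    # degree of the monic polynomial, n >= 2
    b = [0.0] * n
    b[n - 2] = 1.0
    for i in range(n - 3, -1, -1):
        b[i] = a[i + 2] - p * b[i + 1] - q * b[i + 2]
    R = a[1] - p * b[0] - q * b[1]
    S = a[0] - q * b[0]
    return b, R, S


# (z^2 + 2z + 5)(z^2 - 3z + 2) = z^4 - z^3 + z^2 - 11z + 10
coefficients = [10.0, -11.0, 1.0, -1.0]
b, R, S = bairstow_remainder(coefficients, 2.0, 5.0)
```

With the exact factor $p = 2$, $q = 5$ the remainder vanishes and `b` holds the coefficients of the other factor, $z^2 - 3z + 2$.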
So, given $p$, $q$ we can calculate $R(p, q)$ and $S(p, q)$. We are looking for values of $p$ and $q$ such that
$$R(p, q) = 0, \qquad S(p, q) = 0,$$
an obvious application for Newton’s method, but in order to implement it, we will have to be
able to compute the partial derivatives $\partial R/\partial p$, etc. We do not have an explicit function $R(p, q)$,
but if we go back to the definition of a partial derivative, we have
$\partial R/\partial p$ = rate of change of $R$ with $p$, when $q$ is held constant;
$\partial R/\partial q$ = rate of change of $R$ with $q$, when $p$ is held constant.
Returning to our set of equations giving R and S as a function of p and q we can differentiate
each in turn with respect to p to give:
# b[n-1] and b[n-2] are constants, so their derivatives are zero
dbdp[n-1] := 0;
dbdp[n-2] := 0;
for i from n-3 by -1 to 0 do
    dbdp[i] := - b[i+1] - p * dbdp[i+1] - q * dbdp[i+2];
end do;
dRdp := - b[0] - p * dbdp[0] - q * dbdp[1];
dSdp := - q * dbdp[0];
and again with respect to q to give:
dbdq[n-1] := 0;
dbdq[n-2] := 0;
for i from n-3 by -1 to 0 do
    dbdq[i] := - p * dbdq[i+1] - q * dbdq[i+2] - b[i+2];
end do;
dRdq := - p * dbdq[0] - q * dbdq[1] - b[1];
dSdq := - q * dbdq[0] - b[0];
I wanted to show you Bairstow’s method because it is an illustration of the way partial derivatives
can be computed, even though the functions $R(p, q)$ and $S(p, q)$ are defined implicitly
through a set of recurrence relations. There are several automatic differentiation packages,
including two of the well known packages, ADIFOR for FORTRAN and ADIC for C, which, if you
supply the program to evaluate the function, will return the program to compute the partial
derivatives. These automatic differentiation
packages have made Newton’s method for complicated functions much easier to use, because they
provide mistake free programs to compute the derivatives. It is very easy to make a mistake
in coding the partial derivatives, and of course if the function is a function of many variables
there is a lot of work to be done when the differentiation is performed manually.
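With these procedures available, it is easy to iterate $p \leftarrow p + \delta_p$, $q \leftarrow q + \delta_q$ where
$$\begin{pmatrix} \partial R/\partial p & \partial R/\partial q \\ \partial S/\partial p & \partial S/\partial q \end{pmatrix}
\begin{pmatrix} \delta_p \\ \delta_q \end{pmatrix}
= - \begin{pmatrix} R(p, q) \\ S(p, q) \end{pmatrix}$$
until $R$ and $S$ are 0 (to within rounding error), implying $z^2 + pz + q$ is a factor of $P_n(z)$.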
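Putting the recurrences and the Newton update together gives a complete, if basic, Bairstow iteration. The Python sketch below follows the recurrences in the text; the function names, the use of Cramer's rule for the 2 by 2 solve, the tolerance and the iteration cap are assumptions for illustration, and no safeguards against poor starting guesses are included.

```python
def bairstow_step_data(a, p, q):
    """Return R, S and their partial derivatives for the divisor z^2 + p z + q.

    a holds a[0] .. a[n-1] of the monic polynomial z^n + a[n-1] z^(n-1) + ... + a[0].
    """
    n = len(a)
    b = [0.0] * n
    dbdp = [0.0] * n       # dbdp[i] = d b[i] / d p
    dbdq = [0.0] * n       # dbdq[i] = d b[i] / d q
    b[n - 2] = 1.0
    for i in range(n - 3, -1, -1):
        b[i] = a[i + 2] - p * b[i + 1] - q * b[i + 2]
        dbdp[i] = -b[i + 1] - p * dbdp[i + 1] - q * dbdp[i + 2]
        dbdq[i] = -p * dbdq[i + 1] - q * dbdq[i + 2] - b[i + 2]
    R = a[1] - p * b[0] - q * b[1]
    S = a[0] - q * b[0]
    dRdp = -b[0] - p * dbdp[0] - q * dbdp[1]
    dSdp = -q * dbdp[0]
    dRdq = -p * dbdq[0] - q * dbdq[1] - b[1]
    dSdq = -q * dbdq[0] - b[0]
    return R, S, dRdp, dRdq, dSdp, dSdq


def bairstow(a, p, q, tol=1e-12, max_iter=100):
    """Newton's method on R(p, q) = 0, S(p, q) = 0."""
    for _ in range(max_iter):
        R, S, dRdp, dRdq, dSdp, dSdq = bairstow_step_data(a, p, q)
        det = dRdp * dSdq - dRdq * dSdp
        if det == 0.0:
            raise ZeroDivisionError("singular Jacobian")
        # Cramer's rule for the 2 by 2 system J (delta_p, delta_q)^T = -(R, S)^T
        delta_p = (-R * dSdq + S * dRdq) / det
        delta_q = (-S * dRdp + R * dSdp) / det
        p, q = p + delta_p, q + delta_q
        if abs(delta_p) + abs(delta_q) < tol:
            break
    return p, q


# z^4 - z^3 + z^2 - 11z + 10 = (z^2 + 2z + 5)(z^2 - 3z + 2)
p_found, q_found = bairstow([10.0, -11.0, 1.0, -1.0], 2.1, 5.1)
```

Once a quadratic factor has been found, its two roots follow from the quadratic formula, and the process can be repeated on the quotient $Q_{n-2}(z)$.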
While on the subject of polynomials, it is worth mentioning that if you have to program, in a
procedural language, the evaluation of the polynomial
$$P_n(z) = a_0 + a_1 z + a_2 z^2 + \ldots + a_n z^n$$
with the coefficients $a_0, a_1, \ldots, a_n$ stored in a[0] . . . a[n], the procedure should be as follows:
f := a[n];
for i from n-1 by -1 to 0 do
    f := z * f + a[i];
end do;
As you can see, the total operation count for performing the loop is n additions and n multiplies.
Even when coding small polynomials, you should use the efficient evaluation process, for example
f := a[0] + z * (a[1] + z * (a[2] + z * a[3]))
for a cubic, rather than the obvious, but less efficient
f := a[0] + a[1] * z + a[2] * z * z + a[3] * z * z * z
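For comparison, here is the same nested evaluation as a Python sketch. The function name `horner` is an assumption; the coefficient ordering (`a[0]` is the constant term) matches the text.

```python
def horner(a, z):
    """Evaluate a[0] + a[1] z + ... + a[n] z^n with n multiplies and n additions."""
    f = a[-1]
    for coefficient in reversed(a[:-1]):
        f = z * f + coefficient
    return f


# 1 + 2z + 3z^2 + 4z^3 at z = 2
value = horner([1.0, 2.0, 3.0, 4.0], 2.0)
```

The same loop works unchanged for complex `z`, which is convenient when checking the roots of a polynomial.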
Finally, I will show how detailed knowledge of IEEE format can be used in a language like C to
generate a good starting approximation to sqrt(x), which, as you have seen from the tutorial,
is necessary for Newton’s method to converge rapidly. In double precision, a floating point
number is stored as one sign bit, followed by an 11 bit exponent stored in excess notation
(i.e. 01111111111 represents $2^0$), followed by a 53 bit mantissa with the first 1 bit implied and
the remaining 52 bits stored. To generate a close approximation to the correct exponent, the
following sequence suffices:
AND (z, mask), where mask = 0111111111110000000000.....
SHIFT result 52 places to the right
SUBTRACT 01111111111 to remove the excess from the exponent
DIVIDE by 2, to halve the real exponent
ADD 01111111111 to restore the representation of the halved exponent
SHIFT result 52 places to the left to put exponent in the correct position
AND result with the same mask to produce the approximate sqrt
Of course, in a language like Maple such bit manipulations are unnecessary, and procedural
languages like C will always provide sqrt() in their libraries, but it may be useful to know how
to manipulate exponents in other applications. For example, if you were writing a procedure
to compute the fifth root, rather than the sqrt, the process would be the same, except that
”DIVIDE by 2” would be replaced by ”DIVIDE by 5”.
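The sequence of bit operations can be tried out in Python, using the `struct` module to reinterpret the 64 bits of a double as an integer. This is a sketch under stated assumptions: it takes the excess of the double precision exponent to be 1023 (so that 01111111111 represents $2^0$), it assumes the input is positive and normalised, and the names `sqrt_initial_guess`, `EXPONENT_MASK` and `BIAS` are invented for the example.

```python
import struct

EXPONENT_MASK = 0x7FF0000000000000   # sign bit 0, eleven exponent bits 1, fraction bits 0
BIAS = 1023                          # excess used for the double precision exponent


def sqrt_initial_guess(z):
    """Return the power of two whose exponent is half the binary exponent of z."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", z))
    exponent = ((bits & EXPONENT_MASK) >> 52) - BIAS   # AND, SHIFT right, SUBTRACT
    halved = exponent // 2                             # DIVIDE by 2 (floor, so negatives work)
    new_bits = ((halved + BIAS) << 52) & EXPONENT_MASK # ADD, SHIFT left, AND
    (guess,) = struct.unpack(">d", struct.pack(">Q", new_bits))
    return guess


guess = sqrt_initial_guess(1.234e50)
```

Because only the exponent is halved and the fraction is discarded, the guess lies between one half of the true square root and the true square root itself, which is close enough for Newton’s method to converge in a handful of iterations.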