### Bengtsson, Ewert

- Centre for Image Analysis, Swedish University of Agricultural Sciences

Research article, 2009, peer reviewed

Barrera, Tony; Spångberg, Daniel; Hast, Anders; Bengtsson, Ewert

This paper presents fast implementations of the inverse square root and arcsine, both in double precision. In single precision it is often possible to use a small table and one ordinary Newton-Raphson iteration to compute elementary functions such as the square root. In double precision a substantially larger table is necessary to obtain the desired precision, or, if a smaller table is used, the additional Newton-Raphson iterations required to reach that precision often require the evaluation of other expensive elementary functions. Furthermore, large tables use a lot of the cache memory that should be available to the application code. The desired precision can instead be obtained from a small table by using a method of higher order than the second-order Newton-Raphson method. A generalization of Newton’s method to higher order is Householder’s method, which unfortunately often results in very complicated expressions requiring many multiplications, additions, and even divisions. We show how a high-order method can be used that requires only a few extra additions and multiplications for each additional degree of order. The method starts from the Taylor expansion of the difference between the value of the elementary function and a starting guess value for each iteration. If the Taylor series is truncated after the second term, ordinary Newton iterations are obtained. In several cases it is possible to algebraically simplify the difference between the true value and the starting guess value. In those cases we show that it is advantageous to carry the Taylor series to higher order to obtain a fast-converging method. Moreover, we show how the coefficients of a Chebyshev polynomial can be fitted to give as little error as possible for the functions close to zero and at the same time reduce the terms in the Taylor expansion. In the paper we benchmark two example implementations of the method on the x86-64 architecture.
The first is the inverse square root, where the actual table (to 12-bit precision) is provided by the processor hardware. The inverse square root is important in many application programs, including computer graphics and explicit particle simulation codes, for instance the Monte Carlo and Molecular Dynamics methods of statistical mechanics. The other example is the arcsine function, which has a slowly converging Taylor expansion and for which no tables are provided by the hardware. The vectorized versions of the implementations of the inverse square root are 3.5 times faster than compiled code on the Athlon64 and about 5 times faster on the Core 2. The scalar version of the arcsine function is, depending on order and table size, between 2 and 3 times faster than the compiled code, and the vectorized version is between 3 and 4 times faster on the Athlon64, while it is between 4 and 5 times faster than the compiled version on the Core 2.
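
The idea of a higher-order iteration built from a Taylor expansion can be illustrated with a minimal Python sketch for the inverse square root. It is not the authors' implementation: it assumes the standard identity 1/sqrt(x) = y (1 - e)^(-1/2) with e = 1 - x y^2 for a guess y, and truncates the series of (1 - e)^(-1/2). The names `COEFFS`, `rsqrt_step`, and `crude_guess` are hypothetical, and `crude_guess` only imitates a roughly 12-bit hardware estimate by perturbing the exact value.

```python
import math

# Taylor coefficients of (1 - e)**(-1/2) = 1 + e/2 + 3e^2/8 + 5e^3/16 + 35e^4/128 + ...
COEFFS = [1.0, 1 / 2, 3 / 8, 5 / 16, 35 / 128]


def rsqrt_step(x, y, order):
    """One iteration of convergence order `order` (2..5) towards 1/sqrt(x).

    order=2 keeps the terms 1 + e/2, which is the Newton-Raphson step
    y * (3 - x*y*y) / 2. Each higher order adds one multiply and one add.
    """
    if not 2 <= order <= len(COEFFS):
        raise ValueError("order must be between 2 and 5 in this sketch")
    e = 1.0 - x * y * y  # algebraically simplified error of the guess
    # Horner evaluation of the series truncated to `order` terms
    p = COEFFS[order - 1]
    for c in reversed(COEFFS[:order - 1]):
        p = p * e + c
    return y * p


def crude_guess(x, bits=12):
    """Hypothetical stand-in for a hardware estimate: relative error 2**-bits."""
    return (1.0 / math.sqrt(x)) * (1.0 + 2.0 ** -bits)


if __name__ == "__main__":
    x = 2.0
    exact = 1.0 / math.sqrt(x)
    y0 = crude_guess(x)
    for order in range(2, 6):
        y1 = rsqrt_step(x, y0, order)
        print(order, abs(y1 - exact) / exact)
```

In this sketch the relative error after one step shrinks roughly like the starting error raised to the power `order`, so a single high-order step from a 12-bit guess can stand in for several Newton-Raphson steps, at the cost of one extra multiplication and addition per order.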

APLIMAT - Journal of Applied Mathematics

2009, Volume: 2, number: 3, pages: 171-178

Other Mathematics

https://res.slu.se/id/publ/27701