Square root and inverse square root are important operations for several applications, such as digital signal processing, multimedia, computer graphics, and scientific computing. Although they occur less frequently than the two basic arithmetic operations, the poor performance of many processors when computing these operations can make their overall execution time comparable to the time spent performing addition and multiplication. For a low-precision computation of these functions, it is possible to employ direct table look-up, bipartite tables (table look-up and addition), or low-degree polynomial or rational approximations, but the area requirements of table-based methods become prohibitive for high-precision computations (such as the 53-bit double-precision floating-point format). Iterative algorithms are therefore employed for these calculations. On one hand, digit-recurrence methods such as the SRT algorithm result in small units, but their linear convergence sometimes leads to long latency and makes them inadequate for these computations; high-radix digit-recurrence methods are faster but result in bigger designs. On the other hand, multiplicative methods such as the Newton-Raphson and Goldschmidt algorithms have quadratic convergence, which leads to faster execution times, usually at the expense of greater hardware requirements.
These methods employ an initial table-based low-precision approximation (seed value) of the final result to reduce the number of iterations to be performed, thus reducing the overall latency of the algorithm. A new multiplicative method is proposed in this paper. Our method combines an efficient second-degree minimax polynomial approximation to obtain the seed values with a modified Goldschmidt iteration to provide double-precision floating-point results. The high accuracy (about 30 bits of precision) of the second-degree approximation allows the computation of a single iteration, significantly reducing the overall latency of the standard Goldschmidt algorithm. In addition, the modification performed in the iteration allows an important reduction in the hardware requirements. We have chosen this algorithm instead of Newton-Raphson due to its higher intrinsic parallelism, which leads to lower execution times. The second-degree minimax polynomial approximation consists of three look-up tables storing C0, C1, and C2, the coefficients of a second-degree polynomial. This approximation combines the speed of linear approximations with the reduced area of second-degree interpolations. The look-up tables are addressed with the upper part of the input significand, X1 (with a word length of m1 = 9 bits), and the evaluation of the quadratic polynomial is carried out by a specialized squaring unit and a multi-operand adder. When obtaining the initial approximation for several different functions, only the look-up tables storing the coefficients must be replicated, because the squaring unit and multi-operand adder can be shared among the different computations.
Since the Goldschmidt algorithm has two different implementations for the division and square root functions, we propose two architectures in this paper. The first one deals only with the computation of reciprocal and division operations, with a very low latency and reasonable hardware requirements. The second proposed architecture also deals with the computation of square root and inverse square root, with an increased latency (but still a fast execution time compared with previous methods) and about the same amount of hardware as the first scheme. The modification performed in the Goldschmidt iteration for computing square root and inverse square root allows the reuse of the logic blocks employed in the first architecture, saving a significant amount of area.
Vedic Mathematics is the ancient methodology of Indian mathematics, which features a unique technique of calculation based on 16 Sutras (formulae). The idea for designing the square-root processor was adopted from the ancient Indian mathematical texts, the Vedas. Based on the Vedic formulae, the square root of a number is generated in a single step, which reduces the number of iterations. As the number of iterations decreases, the time required to extract the square root also decreases. Consequently, if this square-root extraction method is implemented on an FPGA chip, the processing will be faster than with other traditional methods.