+\1f
+File: gmp.info, Node: Exact Remainder, Next: Small Quotient Division, Prev: Exact Division, Up: Division Algorithms
+
+15.2.6 Exact Remainder
+----------------------
+
+If the exact division algorithm is done with a full subtraction at each
+stage and the dividend isn't a multiple of the divisor, then low zero
+limbs are produced but with a remainder in the high limbs. For dividend
+a, divisor d, quotient q, and b = 2^mp_bits_per_limb, this remainder r
+is of the form
+
+ a = q*d + r*b^n
+
+ n represents the number of zero limbs produced by the subtractions,
+that being the number of limbs produced for q. r will be in the range
+0<=r<d and can be viewed as a remainder, but one shifted up by a factor
+of b^n.
+
+ Carrying out full subtractions at each stage means the same number of
+cross products must be done as a normal division, but there's still some
+single limb divisions saved. When d is a single limb some
+simplifications arise, providing good speedups on a number of
+processors.
+
+ The functions 'mpn_divexact_by3', 'mpn_modexact_1_odd' and the
+internal 'mpn_redc_X' functions differ subtly in how they return r,
+leading to some negations in the above formula, but all are essentially
+the same.
+
+ Clearly r is zero when a is a multiple of d, and this leads to
+divisibility or congruence tests which are potentially more efficient
+than a normal division.
+
+ The factor of b^n on r can be ignored in a GCD when d is odd, hence
+the use of 'mpn_modexact_1_odd' by 'mpn_gcd_1' and 'mpz_kronecker_ui'
+etc (*note Greatest Common Divisor Algorithms::).
+
+ Montgomery's REDC method for modular multiplications uses operands of
+the form of x*b^-n and y*b^-n and on calculating (x*b^-n)*(y*b^-n) uses
+the factor of b^n in the exact remainder to reach a product in the same
+form (x*y)*b^-n (*note Modular Powering Algorithm::).
+
+ Notice that r generally gives no useful information about the
+ordinary remainder a mod d since b^n mod d could be anything. If
+however b^n == 1 mod d, then r is the negative of the ordinary
+remainder. This occurs whenever d is a factor of b^n-1, as for example
+with 3 in 'mpn_divexact_by3'. For a 32 or 64 bit limb other such
+factors include 5, 17 and 257, but no particular use has been found for
+this.
+
+\1f
+File: gmp.info, Node: Small Quotient Division, Prev: Exact Remainder, Up: Division Algorithms
+
+15.2.7 Small Quotient Division
+------------------------------
+
+An NxM division where the number of quotient limbs Q=N-M is small can be
+optimized somewhat.
+
+ An ordinary basecase division normalizes the divisor by shifting it
+to make the high bit set, shifting the dividend accordingly, and
+shifting the remainder back down at the end of the calculation. This is
+wasteful if only a few quotient limbs are to be formed. Instead a
+division of just the top 2*Q limbs of the dividend by the top Q limbs of
+the divisor can be used to form a trial quotient. This requires only
+those limbs normalized, not the whole of the divisor and dividend.
+
+ A multiply and subtract then applies the trial quotient to the M-Q
+unused limbs of the divisor and N-Q dividend limbs (which includes Q
+limbs remaining from the trial quotient division). The starting trial
+quotient can be 1 or 2 too big, but all cases of 2 too big and most
+cases of 1 too big are detected by first comparing the most significant
+limbs that will arise from the subtraction. An addback is done if the
+quotient still turns out to be 1 too big.
+
+ This whole procedure is essentially the same as one step of the
+basecase algorithm done in a Q limb base, though with the trial quotient
+test done only with the high limbs, not an entire Q limb "digit"
+product. The correctness of this weaker test can be established by
+following the argument of Knuth section 4.3.1 exercise 20 but with the
+v2*q>b*r+u2 condition appropriately relaxed.
+
+\1f
+File: gmp.info, Node: Greatest Common Divisor Algorithms, Next: Powering Algorithms, Prev: Division Algorithms, Up: Algorithms
+
+15.3 Greatest Common Divisor
+============================
+
+* Menu:
+
+* Binary GCD::
+* Lehmer's Algorithm::
+* Subquadratic GCD::
+* Extended GCD::
+* Jacobi Symbol::
+
+\1f
+File: gmp.info, Node: Binary GCD, Next: Lehmer's Algorithm, Prev: Greatest Common Divisor Algorithms, Up: Greatest Common Divisor Algorithms
+
+15.3.1 Binary GCD
+-----------------
+
+At small sizes GMP uses an O(N^2) binary style GCD. This is described
+in many textbooks, for example Knuth section 4.5.2 algorithm B. It
+simply consists of successively reducing odd operands a and b using
+
+ a,b = abs(a-b),min(a,b)
+ strip factors of 2 from a
+
+ The Euclidean GCD algorithm, as per Knuth algorithms E and A,
+repeatedly computes the quotient q = floor(a/b) and replaces a,b by v, u
+- q v. The binary algorithm has so far been found to be faster than the
+Euclidean algorithm everywhere. One reason the binary method does well
+is that the implied quotient at each step is usually small, so often
+only one or two subtractions are needed to get the same effect as a
+division. Quotients 1, 2 and 3 for example occur 67.7% of the time, see
+Knuth section 4.5.3 Theorem E.
+
+ When the implied quotient is large, meaning b is much smaller than a,
+then a division is worthwhile. This is the basis for the initial a mod
+b reductions in 'mpn_gcd' and 'mpn_gcd_1' (the latter for both Nx1 and
+1x1 cases). But after that initial reduction, big quotients occur too
+rarely to make it worth checking for them.
+
+
+ The final 1x1 GCD in 'mpn_gcd_1' is done in the generic C code as
+described above. For two N-bit operands, the algorithm takes about 0.68
+iterations per bit. For optimum performance some attention needs to be
+paid to the way the factors of 2 are stripped from a.
+
+ Firstly it may be noted that in twos complement the number of low
+zero bits on a-b is the same as b-a, so counting or testing can begin on
+a-b without waiting for abs(a-b) to be determined.
+
+ A loop stripping low zero bits tends not to branch predict well,
+since the condition is data dependent. But on average there's only a
+few low zeros, so an option is to strip one or two bits arithmetically
+then loop for more (as done for AMD K6). Or use a lookup table to get a
+count for several bits then loop for more (as done for AMD K7). An
+alternative approach is to keep just one of a or b odd and iterate
+
+ a,b = abs(a-b), min(a,b)
+ a = a/2 if even
+ b = b/2 if even
+
+ This requires about 1.25 iterations per bit, but stripping of a
+single bit at each step avoids any branching. Repeating the bit strip
+reduces to about 0.9 iterations per bit, which may be a worthwhile
+tradeoff.
+
+ Generally with the above approaches a speed of perhaps 6 cycles per
+bit can be achieved, which is still not terribly fast with for instance
+a 64-bit GCD taking nearly 400 cycles. It's this sort of time which
+means it's not usually advantageous to combine a set of divisibility
+tests into a GCD.
+
+ Currently, the binary algorithm is used for GCD only when N < 3.
+
+\1f
+File: gmp.info, Node: Lehmer's Algorithm, Next: Subquadratic GCD, Prev: Binary GCD, Up: Greatest Common Divisor Algorithms
+
+15.3.2 Lehmer's algorithm
+-------------------------
+
+Lehmer's improvement of the Euclidean algorithms is based on the
+observation that the initial part of the quotient sequence depends only
+on the most significant parts of the inputs. The variant of Lehmer's
+algorithm used in GMP splits off the most significant two limbs, as
+suggested, e.g., in "A Double-Digit Lehmer-Euclid Algorithm" by Jebelean
+(*note References::). The quotients of two double-limb inputs are
+collected as a 2 by 2 matrix with single-limb elements. This is done by
+the function 'mpn_hgcd2'. The resulting matrix is applied to the inputs
+using 'mpn_mul_1' and 'mpn_submul_1'. Each iteration usually reduces
+the inputs by almost one limb. In the rare case of a large quotient, no
+progress can be made by examining just the most significant two limbs,
+and the quotient is computed using plain division.
+
+ The resulting algorithm is asymptotically O(N^2), just as the
+Euclidean algorithm and the binary algorithm. The quadratic part of the
+work are the calls to 'mpn_mul_1' and 'mpn_submul_1'. For small sizes,
+the linear work is also significant. There are roughly N calls to the
+'mpn_hgcd2' function. This function uses a couple of important
+optimizations:
+
+ * It uses the same relaxed notion of correctness as 'mpn_hgcd' (see
+ next section). This means that when called with the most
+ significant two limbs of two large numbers, the returned matrix
+ does not always correspond exactly to the initial quotient sequence
+ for the two large numbers; the final quotient may sometimes be one
+ off.
+
+ * It takes advantage of the fact the quotients are usually small.
+ The division operator is not used, since the corresponding
+ assembler instruction is very slow on most architectures. (This
+ code could probably be improved further, it uses many branches that
+ are unfriendly to prediction).
+
+ * It switches from double-limb calculations to single-limb
+ calculations half-way through, when the input numbers have been
+ reduced in size from two limbs to one and a half.
+
+\1f
+File: gmp.info, Node: Subquadratic GCD, Next: Extended GCD, Prev: Lehmer's Algorithm, Up: Greatest Common Divisor Algorithms
+
+15.3.3 Subquadratic GCD
+-----------------------
+
+For inputs larger than 'GCD_DC_THRESHOLD', GCD is computed via the HGCD
+(Half GCD) function, as a generalization to Lehmer's algorithm.
+
+ Let the inputs a,b be of size N limbs each. Put S = floor(N/2) + 1.
+Then HGCD(a,b) returns a transformation matrix T with non-negative
+elements, and reduced numbers (c;d) = T^{-1} (a;b). The reduced numbers
+c,d must be larger than S limbs, while their difference abs(c-d) must
+fit in S limbs. The matrix elements will also be of size roughly N/2.
+
+ The HGCD base case uses Lehmer's algorithm, but with the above stop
+condition that returns reduced numbers and the corresponding
+transformation matrix half-way through. For inputs larger than
+'HGCD_THRESHOLD', HGCD is computed recursively, using the divide and
+conquer algorithm in "On Schönhage's algorithm and subquadratic integer
+GCD computation" by Möller (*note References::). The recursive
+algorithm consists of these main steps.
+
+ * Call HGCD recursively, on the most significant N/2 limbs. Apply
+ the resulting matrix T_1 to the full numbers, reducing them to a
+ size just above 3N/2.
+
+ * Perform a small number of division or subtraction steps to reduce
+ the numbers to size below 3N/2. This is essential mainly for the
+ unlikely case of large quotients.
+
+ * Call HGCD recursively, on the most significant N/2 limbs of the
+ reduced numbers. Apply the resulting matrix T_2 to the full
+ numbers, reducing them to a size just above N/2.
+
+ * Compute T = T_1 T_2.
+
+ * Perform a small number of division and subtraction steps to satisfy
+ the requirements, and return.
+
+ GCD is then implemented as a loop around HGCD, similarly to Lehmer's
+algorithm. Where Lehmer repeatedly chops off the top two limbs, calls
+'mpn_hgcd2', and applies the resulting matrix to the full numbers, the
+sub-quadratic GCD chops off the most significant third of the limbs (the
+proportion is a tuning parameter, and 1/3 seems to be more efficient
+than, e.g, 1/2), calls 'mpn_hgcd', and applies the resulting matrix.
+Once the input numbers are reduced to size below 'GCD_DC_THRESHOLD',
+Lehmer's algorithm is used for the rest of the work.
+
+ The asymptotic running time of both HGCD and GCD is O(M(N)*log(N)),
+where M(N) is the time for multiplying two N-limb numbers.
+
+\1f
+File: gmp.info, Node: Extended GCD, Next: Jacobi Symbol, Prev: Subquadratic GCD, Up: Greatest Common Divisor Algorithms
+
+15.3.4 Extended GCD
+-------------------
+
+The extended GCD function, or GCDEXT, calculates gcd(a,b) and also
+cofactors x and y satisfying a*x+b*y=gcd(a,b). All the algorithms used
+for plain GCD are extended to handle this case. The binary algorithm is
+used only for single-limb GCDEXT. Lehmer's algorithm is used for sizes
+up to 'GCDEXT_DC_THRESHOLD'. Above this threshold, GCDEXT is
+implemented as a loop around HGCD, but with more book-keeping to keep
+track of the cofactors. This gives the same asymptotic running time as
+for GCD and HGCD, O(M(N)*log(N))
+
+ One difference to plain GCD is that while the inputs a and b are
+reduced as the algorithm proceeds, the cofactors x and y grow in size.
+This makes the tuning of the chopping-point more difficult. The current
+code chops off the most significant half of the inputs for the call to
+HGCD in the first iteration, and the most significant two thirds for the
+remaining calls. This strategy could surely be improved. Also the stop
+condition for the loop, where Lehmer's algorithm is invoked once the
+inputs are reduced below 'GCDEXT_DC_THRESHOLD', could maybe be improved
+by taking into account the current size of the cofactors.
+
+\1f
+File: gmp.info, Node: Jacobi Symbol, Prev: Extended GCD, Up: Greatest Common Divisor Algorithms
+
+15.3.5 Jacobi Symbol
+--------------------
+
+Jacobi symbol (A/B)
+
+ Initially if either operand fits in a single limb, a reduction is
+done with either 'mpn_mod_1' or 'mpn_modexact_1_odd', followed by the
+binary algorithm on a single limb. The binary algorithm is well suited
+to a single limb, and the whole calculation in this case is quite
+efficient.
+
+ For inputs larger than 'GCD_DC_THRESHOLD', 'mpz_jacobi',
+'mpz_legendre' and 'mpz_kronecker' are computed via the HGCD (Half GCD)
+function, as a generalization to Lehmer's algorithm.
+
+ Most GCD algorithms reduce a and b by repeatatily computing the
+quotient q = floor(a/b) and iteratively replacing
+
+ a, b = b, a - q * b
+
+ Different algorithms use different methods for calculating q, but the
+core algorithm is the same if we use *note Lehmer's Algorithm:: or *note
+HGCD: Subquadratic GCD.
+
+ At each step it is possible to compute if the reduction inverts the
+Jacobi symbol based on the two least significant bits of A and B. For
+more details see "Efficient computation of the Jacobi symbol" by Möller
+(*note References::).
+
+ A small set of bits is thus used to track state
+ * current sign of result (1 bit)
+
+ * two least significant bits of A and B (4 bits)
+
+ * a pointer to which input is currently the denominator (1 bit)
+
+ In all the routines sign changes for the result are accumulated using
+fast bit twiddling which avoids conditional jumps.
+
+ The final result is calculated after verifying the inputs are coprime
+(GCD = 1) by raising (-1)^e
+
+ Much of the HGCD code is shared directly with the HGCD
+implementations, such as the 2x2 matrix calculation, *Note Lehmer's
+Algorithm:: basecase and 'GCD_DC_THRESHOLD'.
+
+ The asymptotic running time is O(M(N)*log(N)), where M(N) is the time
+for multiplying two N-limb numbers.
+