This file lists itemized GMP development tasks. Not all the tasks listed here are suitable for volunteers, but many of them are. Please see the projects file for more sizeable projects.
_mpz_realloc with a small (1 limb) size.
mpz_XXX(a,a,a).
mp_exp_t, the precision of
     __mp_bases[base].chars_per_bit_exactly is insufficient and
     mpf_get_str aborts.  Detect and compensate.
mpz_get_si to work properly for MIPS N32 ABI (and other
     machines that use long long for storing limbs).
mpz_*_ui division routines.  Perhaps make them return the
     real remainder instead?  That changes the return type to signed long int.
mpf_eq is not always correct when one operand is
     1000000000... and the other operand is 0111111111..., i.e., extremely
     close.  There is a special case in mpf_sub for this
     situation; put similar code in mpf_eq.
__builtin_constant_p is unavailable?  Same problem with MacOS
     X.
dump_abort in
     mp?/tests/*.c.
mpz_get_si returns 0x80000000 for -0x100000000.
count_leading_zeros if
     the bit is set.  This is an optimization on all machines, and significant
     on machines with slow count_leading_zeros.
count_trailing_zeros is used
     on more or less uniformly distributed numbers.  For some CPUs
     count_trailing_zeros is slow and it's probably worth
     handling the frequently occurring 0 to 2 trailing zeros cases specially.
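A small sketch of that special-casing, in C; the function name and the
fallback loop are illustrative only, and a real version would fall back to
count_trailing_zeros rather than looping:

  #include <stdint.h>

  /* Handle the common 0, 1 and 2 trailing-zero cases before doing a full
     bit count.  n must be nonzero.  */
  static int
  ctz_with_fast_paths (uint64_t n)
  {
    int c;
    if (n & 1)
      return 0;                 /* no trailing zeros: the most common case */
    if (n & 2)
      return 1;
    if (n & 4)
      return 2;
    /* Rare case, three or more trailing zeros: use a generic method
       (count_trailing_zeros in real code).  */
    for (c = 3, n >>= 3; (n & 1) == 0; n >>= 1)
      c++;
    return c;
  }
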
udiv_qrnnd for inverting limbs to
     instead use invert_limb.
umul_ppmm to use floating-point for generating the
     most significant limb (if BITS_PER_MP_LIMB <= 52 bits).
     (Peter Montgomery has some ideas on this subject.)
umul_ppmm code in longlong.h: Add partial
     products with fewer operations.
mpn_get_str and mpn_set_str running in
     the sub-O(n^2) range, using some divide-and-conquer approach, preferably
     without using division.
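A rough sketch of the divide-and-conquer structure, built on the public mpz
interface for clarity.  It still uses division (mpz_tdiv_qr), so it only
shows the recursion a sub-O(n^2) mpn_get_str would need; the name dc_get_str
and the cutoff of 32 digits are illustrative, not GMP code:

  #include <string.h>
  #include <gmp.h>

  /* Write exactly `digits' characters of the base-`base' representation of
     nonnegative x into buf, zero-padded on the left.  The top-level caller
     passes digits >= mpz_sizeinbase (x, base).  */
  static void
  dc_get_str (char *buf, mpz_srcptr x, size_t digits, int base)
  {
    if (digits <= 32)
      {
        char s[40];                       /* small: use the quadratic method */
        size_t len;
        mpz_get_str (s, base, x);
        len = strlen (s);
        memset (buf, '0', digits - len);
        memcpy (buf + (digits - len), s, len);
      }
    else
      {
        size_t lo = digits / 2;
        mpz_t pow, q, r;
        mpz_init (pow);
        mpz_init (q);
        mpz_init (r);
        mpz_ui_pow_ui (pow, base, lo);
        mpz_tdiv_qr (q, r, x, pow);       /* x = q * base^lo + r */
        dc_get_str (buf, q, digits - lo, base);         /* high digits */
        dc_get_str (buf + (digits - lo), r, lo, base);  /* low digits */
        mpz_clear (pow);
        mpz_clear (q);
        mpz_clear (r);
      }
  }
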
mpn_get_str to mpf/get_str.  (Talk to Torbjörn about this.)
mpz_size,
     mpz_set_ui, mpz_set_q, mpz_clear,
     mpz_init, mpz_get_ui, mpz_scan0,
     mpz_scan1, mpz_getlimbn,
     mpz_init_set_ui, mpz_perfect_square_p,
     mpz_popcount, mpf_size,
     mpf_get_prec, mpf_set_prec_raw,
     mpf_set_ui, mpf_init, mpf_init2,
     mpf_clear, mpf_set_si.
mpz_powm and mpz_powm_ui aren't very
     fast for one- or two-limb moduli, due to a lot of function call
     overhead.  These could perhaps be handled as special cases.
mpz_powm and mpz_powm_ui want better
     algorithm selection, and the latter should use REDC.  Both could
     change to use an mpn_powm and mpn_redc.
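For reference, a minimal sketch of single-limb REDC (Montgomery reduction),
assuming 64-bit limbs, the gcc/clang unsigned __int128 extension, and an odd
modulus below 2^63 so the 128-bit intermediate cannot overflow; the function
names are illustrative and this is not the proposed mpn_redc interface:

  #include <stdint.h>

  /* -n^-1 mod 2^64 for odd n, by Newton iteration.  */
  static uint64_t
  neg_invert_limb (uint64_t n)
  {
    uint64_t inv = n;                   /* correct to 3 bits since n is odd */
    int i;
    for (i = 0; i < 5; i++)
      inv *= 2 - n * inv;               /* precision doubles each step */
    return -inv;
  }

  /* Return a*b * 2^-64 mod n, for a, b < n and odd n < 2^63.  Keeping
     operands in Montgomery form and using this in the multiply steps of an
     exponentiation avoids a division per modular reduction.  */
  static uint64_t
  mulredc (uint64_t a, uint64_t b, uint64_t n, uint64_t ninv)
  {
    unsigned __int128 t = (unsigned __int128) a * b;
    uint64_t m = (uint64_t) t * ninv;   /* makes low 64 bits of t + m*n zero */
    unsigned __int128 s = t + (unsigned __int128) m * n;
    uint64_t r = (uint64_t) (s >> 64);
    return r >= n ? r - n : r;
  }
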
mpn_gcd might be able to be sped up on small to
     moderate sizes by improving find_a, possibly just by
     providing an alternate implementation for CPUs with slowish
     count_leading_zeros.
USE_MORE_MPN code.  The necessary
     right-to-left mpn_divexact_by3c exists.
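The principle behind exact division by 3, shown for a single 64-bit limb
(mpn_divexact_by3c extends this right-to-left across a limb array with
borrow handling); the function name is illustrative:

  #include <stdint.h>

  /* x must be an exact multiple of 3.  Since 3 * 0xAAAAAAAAAAAAAAAB == 1
     (mod 2^64), multiplying by that inverse gives exactly x/3.  */
  static uint64_t
  divexact_by3_limb (uint64_t x)
  {
    return x * (uint64_t) 0xAAAAAAAAAAAAAAABULL;
  }
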
mpn_mul_basecase on NxM with big N but small M could try for
     better cache locality by taking N piece by piece.  The current code could
     be left available for CPUs without caching.  Depending on how Karatsuba
     etc. is applied to unequal-size operands it might be possible to assume M
     is always smallish.
mpn_addmul_1, mpn_submul_1, and
     mpn_mul_1 for the 21264.  On 21264, they should run at 4, 3,
     and 3 cycles/limb respectively, if the code is unrolled properly.  (Ask
     Torbjörn for his xm.s and xam.s skeleton files.)
mpn_addmul_1, mpn_submul_1, and
     mpn_mul_1 for the 21164.  This should use both integer
     multiplies and floating-point multiplies.  For the floating-point
     operations, the single-limb multiplier should be split into three 21-bit
     chunks.
mpn_addmul_1,
     mpn_submul_1, and mpn_mul_1.  Should use
     floating-point operations, and split the invariant single-limb multiplier
     into 21-bit chunks.  Should give about 18 cycles/limb, but the pipeline
     will become very deep.  (Torbjörn has C code that is useful as a starting
     point.)
mpn_lshift and mpn_rshift.
     Should give 2 cycles/limb.  (Torbjörn has code that just needs to be
     finished.)
mpn_addmul_1
     and the other multiplies varies so much on successive sizes.
mpn_addmul_1, mpn_submul_1, and
     mpn_mul_1.  The current development code runs at 11
     cycles/limb, which is already very good.  But it should be possible to
     saturate the cache, which will happen at 7.5 cycles/limb.
umul_ppmm.  Important in particular for
     mpn_sqr_basecase.  Using four "mulx"s either
     with an asm block or via the generic C code is about 90 cycles.
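For comparison, the generic C scheme builds the double-limb product from four
half-limb partial products; a sketch for 64-bit limbs (the function name is
illustrative, longlong.h expresses the same thing as a macro):

  #include <stdint.h>

  static void
  umul_ppmm_generic (uint64_t *hi, uint64_t *lo, uint64_t u, uint64_t v)
  {
    uint64_t ul = u & 0xFFFFFFFF, uh = u >> 32;
    uint64_t vl = v & 0xFFFFFFFF, vh = v >> 32;
    uint64_t x0 = ul * vl;              /* contributes at bit 0 */
    uint64_t x1 = ul * vh;              /* contributes at bit 32 */
    uint64_t x2 = uh * vl;              /* contributes at bit 32 */
    uint64_t x3 = uh * vh;              /* contributes at bit 64 */

    x1 += x0 >> 32;                     /* cannot overflow */
    x1 += x2;                           /* may overflow ... */
    if (x1 < x2)
      x3 += (uint64_t) 1 << 32;         /* ... so propagate the carry */

    *hi = x3 + (x1 >> 32);
    *lo = (x1 << 32) | (x0 & 0xFFFFFFFF);
  }
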
mpn_mul_basecase and mpn_sqr_basecase
     for important machines.  Helping the generic sqr_basecase.c with an
     mpn_sqr_diagonal might be enough for some of the RISCs.
mpn_lshift/mpn_rshift.
     Will bring time from 1.75 to 1.25 cycles/limb.
mpn_lshift for shifts by 1.  (See Pentium code.)
count_leading_zeros.
udiv_qrnnd.  (Ask Torbjörn for the file
     test-udiv-preinv.c as a starting point.)
mpn_add_n and mpn_sub_n.
     It should just require 3 cycles/limb, but the current code propagates
     carry poorly.  The trick is to add carry-in later than we do now,
     decreasing the number of operations used to generate carry-out from 4
     to 3.
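A C sketch of that ordering: sum the two limbs first, fold the incoming carry
in afterwards, and note that the two wraparounds can never both happen, so
the carries combine with a plain OR.  The real work would be in assembler;
this only illustrates the dependency structure:

  #include <stddef.h>
  #include <stdint.h>

  /* rp[] = ap[] + bp[] + cy, returning the carry out (0 or 1).  */
  static uint64_t
  add_n_sketch (uint64_t *rp, const uint64_t *ap, const uint64_t *bp,
                size_t n, uint64_t cy)
  {
    size_t i;
    for (i = 0; i < n; i++)
      {
        uint64_t s = ap[i] + bp[i];
        uint64_t c1 = s < ap[i];        /* wrap in the main addition */
        uint64_t t = s + cy;
        uint64_t c2 = t < s;            /* wrap when folding the carry in */
        rp[i] = t;
        cy = c1 | c2;                   /* at most one of c1, c2 is set */
      }
    return cy;
  }
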
mpn_lshift.
     The pipeline is now extremely deep, perhaps unnecessarily deep.  Also, r5
     is unused.  (Ask Torbjörn for a copy of the current code.)
mpn_rshift based on new mpn_lshift.
mpn_add_n and mpn_sub_n.  Should
     run at just 3.25 cycles/limb.  (Ask for xxx-add_n.s as a starting point.)
mpn_mul_basecase and
     mpn_sqr_basecase.  This should use a "vertical multiplication
     method" to avoid carry propagation, splitting one of the operands into
     11-bit chunks.
mpn_mul_basecase and
     mpn_sqr_basecase.  Same comment applies to this as to the
     same functions for Fujitsu VPP.
count_leading_zeros for 64-bit machines:
  
  if ((x >> (W_TYPE_SIZE - W_TYPE_SIZE/2)) == 0) { x <<= W_TYPE_SIZE/2; cnt += W_TYPE_SIZE/2; }
  if ((x >> (W_TYPE_SIZE - W_TYPE_SIZE/4)) == 0) { x <<= W_TYPE_SIZE/4; cnt += W_TYPE_SIZE/4; }
  ... 
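A complete version of the same idea for 64-bit words, written as a plain C
function for illustration (x must be nonzero):

  #include <stdint.h>

  static int
  clz64 (uint64_t x)
  {
    int cnt = 0;
    if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
    if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
    if ((x >> 56) == 0) { x <<= 8;  cnt += 8;  }
    if ((x >> 60) == 0) { x <<= 4;  cnt += 4;  }
    if ((x >> 62) == 0) { x <<= 2;  cnt += 2;  }
    if ((x >> 63) == 0) { cnt += 1; }
    return cnt;
  }
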
mpz_get_nth_ui.  Return the nth word (not necessarily the nth limb).
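One possible reading of the semantics, sketched with the public interface:
return bits [n*ULBITS, (n+1)*ULBITS) of |op|, where ULBITS is the width of an
unsigned long (which need not equal a limb).  The name and behaviour below
are a guess at what mpz_get_nth_ui would mean, not an existing function:

  #include <limits.h>
  #include <gmp.h>

  static unsigned long
  mpz_get_nth_ui_sketch (mpz_srcptr op, unsigned long n)
  {
    unsigned long bits = CHAR_BIT * sizeof (unsigned long);
    unsigned long result;
    mpz_t t;
    mpz_init (t);
    mpz_tdiv_q_2exp (t, op, n * bits);  /* shift the wanted word down ... */
    result = mpz_get_ui (t);            /* ... and keep its low bits */
    mpz_clear (t);
    return result;
  }
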
mpz_crr (Chinese Remainder Reconstruction).
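The two-modulus case is easy to sketch with the public interface; an mpz_crr
would presumably generalize this to an array of residue/modulus pairs (the
function name below is illustrative):

  #include <gmp.h>

  /* Find r with r = a1 (mod m1) and r = a2 (mod m2), for coprime m1, m2 and
     0 <= a1 < m1, giving 0 <= r < m1*m2.  r must not alias the inputs.  */
  static void
  crr2 (mpz_ptr r, mpz_srcptr a1, mpz_srcptr m1, mpz_srcptr a2, mpz_srcptr m2)
  {
    mpz_t t, u;
    mpz_init (t);
    mpz_init (u);
    mpz_invert (t, m1, m2);             /* t = m1^-1 mod m2 */
    mpz_sub (u, a2, a1);
    mpz_mul (u, u, t);
    mpz_mod (u, u, m2);                 /* u = (a2 - a1) * m1^-1 mod m2 */
    mpz_mul (r, u, m1);
    mpz_add (r, r, a1);                 /* r = a1 + u*m1 */
    mpz_clear (t);
    mpz_clear (u);
  }
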
mpq_set_f for assignment from mpf_t
     (cf. mpq_set_d).
mpz_init (and mpq_init) do lazy
     allocation.  Set ALLOC(var) to 0, and have
     mpz_realloc special-handle that case.  Update functions that
     rely on a single limb (like mpz_set_ui,
     mpz_[tfc]div_r_ui, and others).
mpf_out_raw and mpf_inp_raw.  Make sure
     format is portable between 32-bit and 64-bit machines, and between
     little-endian and big-endian machines.
gmp_errno.
gmp_fprintf, gmp_sprintf, and
     gmp_snprintf.  Think about some sort of wrapper
     around printf so it and its several variants don't
     have to be completely reimplemented.
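A very small sketch of the wrapper idea: print literal text directly and
handle only a "%Zd" conversion by formatting the mpz_t argument, leaving
everything else to a real implementation (which would also have to forward
the standard conversions to the system printf).  The function name is
illustrative:

  #include <stdarg.h>
  #include <stdio.h>
  #include <gmp.h>

  static void
  tiny_gmp_printf (const char *fmt, ...)
  {
    va_list ap;
    va_start (ap, fmt);
    for (; *fmt != '\0'; fmt++)
      {
        if (fmt[0] == '%' && fmt[1] == 'Z' && fmt[2] == 'd')
          {
            mpz_srcptr z = va_arg (ap, mpz_srcptr);
            mpz_out_str (stdout, 10, z);
            fmt += 2;                   /* skip past "Zd" */
          }
        else
          putchar (*fmt);
      }
    va_end (ap);
  }

A call would then look like tiny_gmp_printf ("x = %Zd\n", x) for an mpz_t x.
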
mpq input and output functions.
mpz_kronecker, leave
     mpz_jacobi for compatibility.
mpz_set_str etc taking string
     lengths rather than null-terminators.
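A simple way to layer that on the existing call, as a sketch (a native
implementation would avoid the copy); the function name is illustrative:

  #include <stdlib.h>
  #include <string.h>
  #include <gmp.h>

  static int
  mpz_set_strn_sketch (mpz_ptr x, const char *s, size_t n, int base)
  {
    char *tmp = malloc (n + 1);
    int ret;
    if (tmp == NULL)
      return -1;
    memcpy (tmp, s, n);
    tmp[n] = '\0';                      /* terminate the n-character slice */
    ret = mpz_set_str (x, tmp, base);
    free (tmp);
    return ret;
  }
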
<=" rather than "<", so a threshold can
     be set to MP_SIZE_T_MAX to get only the simpler code (the
     compiler will know size <= MP_SIZE_T_MAX is always true).
mpz_cdiv_q_2exp and mpz_cdiv_r_2exp
     could be implemented to match the corresponding tdiv and fdiv.
     Maybe some code sharing is possible.
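One easy route to the ceiling variants is the identity
ceil(n / 2^k) = -floor(-n / 2^k), sketched here with the existing fdiv call;
a native mpz_cdiv_q_2exp would instead adjust the shifted value directly:

  #include <gmp.h>

  static void
  cdiv_q_2exp_sketch (mpz_ptr q, mpz_srcptr n, unsigned long k)
  {
    mpz_neg (q, n);                     /* q = -n */
    mpz_fdiv_q_2exp (q, q, k);          /* q = floor(-n / 2^k) */
    mpz_neg (q, q);                     /* q = ceil(n / 2^k) */
  }
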
implver and
     amask assembly instructions are better, but that
     doesn't differentiate between ev5 and ev56.
In general, getting exactly the right configuration and passing exactly the right options to the compiler might mean that GMP performance more than doubles.
When testing, make sure to test at least the following for all our target machines: (1) Both gcc and cc (and c89). (2) Both 32-bit mode and 64-bit mode (such as -n32 vs -64 under Irix). (3) Both the system `make' and GNU `make'. (4) With and without GNU binutils.
mpz_div and mpz_divmod use rounding
     analogous to mpz_mod.  Document, and list as an
     incompatibility.
mcount call at
     the start of each assembler subroutine (for important targets at
     least).
malloc
     separately for each TMP_ALLOC block, so a redzoning
     malloc debugger could be used during development.
ASSERTs at the start of each user-visible
     mpz/mpq/mpf function to check the validity of each
     mp?_t parameter, in particular to check they've been
     mp?_inited.  This might catch elementary mistakes in
     user programs.  Care would need to be taken over
     MPZ_TMP_INITed variables used internally.
 unsigned long int is used for
     bit counts/ranges, and that mp_size_t is used for limb counts.
mpz_inp_str (etc) doesn't say when it stops reading digits.