Detailed description 
http://bugzilla.redhat.com/643657
This source should show a bug, where mpfr_fma (IMNSHO) gives incorrect result
(1ulp smaller), which causes also __builtin_fmal to misbehave.
The testcase comment explains why I believe the right result is
0xc.0bffffffffffffep3942L (== 0x3.02fffffffffffff8p3940L)
instead of
0xc.0bffffffffffffdp3942L (== 0x3.02fffffffffffff4p3940L)
FMA is supposed to be computed with infinite precision and only rounded at the
end. When stepping through mpfr_fma, it seems u is computed correctly:
(gdb) p mpfr_printf ("%Ra\n", u)
0x3.02fffffffffffffap3940
$1 = 27
so the problem seems to be in the following mpfr_add (s, u, z, rnd_mode);
where u has precision 128, but s and z have precision 64.
The textcase is for extended 80bit long double, so ?86/x86_64/ia64.
#include <math.h>
#include <gmp.h>
#include <stdio.h>
#include <mpfr.h>
/*
80bit extended long double
fmal (0x8.07fffffffffffffp+7418L, 0xcp11363L, 0xf.ff80f8p4070L)
0x8.07fffffffffffffp+7418L
* 0xcp11363L
= 0xc.0bffffffffffffe8p3942L
0xc.0bffffffffffffe8p3942L
 0x0.0000000000000000000000000000000fff80f8p3942L
= 0xc.0bffffffffffffe7fffffffffffffff0007f08p3942L
RNDN:
= 0xc.0bffffffffffffep3942L instead of 0xc.0bffffffffffffdp3942L
__builtin_fmal uses mpfr_fma internally in the compiler.
*/
volatile long double zero;
int
main (void)
{
mpz_t a, b, c;
mpz_init (a); mpz_init (b); mpz_init (c);
mpz_set_str (a, "807fffffffffffff", 16);
mpz_set_str (b, "c", 16);
mpz_mul (c, a, b);
mpz_mul_si (c, c, 2);
mpz_out_str (stdout, 16, c);
printf ("\n");
mpz_set_str (a, "c0bffffffffffffe80000000000000000000000", 16);
mpz_set_str (b, "00000000000000000000000000000000fff80f8", 16);
mpz_sub (c, a, b);
mpz_out_str (stdout, 16, c);
printf ("\n");
mpz_clear (a); mpz_clear (b); mpz_clear (c);
printf ("mul %La\n__builtin_fmal %La\nfmal %La\n",
0x8.07fffffffffffffp+7418L * 0xcp11363L,
__builtin_fmal (0x8.07fffffffffffffp+7418L, 0xcp11363L,
0xf.ff80f8p4070L),
fmal (0x8.07fffffffffffffp+7418L, 0xcp11363L,
0xf.ff80f8p4070L + zero));
mpfr_set_default_prec (64);
mpfr_set_emin (16384 64 +4);
mpfr_set_emax (16384);
mpfr_t xx, xy, xz, xr, xm;
mpfr_init (xx); mpfr_set_ld (xx, 0x8.07fffffffffffffp+7418L, GMP_RNDN);
mpfr_init (xy); mpfr_set_ld (xy, 0xcp11363L, GMP_RNDN);
mpfr_init (xz); mpfr_set_ld (xz, 0xf.ff80f8p4070L, GMP_RNDN);
mpfr_init (xr); mpfr_init (xm);
int i = mpfr_mul (xm, xx, xy, GMP_RNDN);
mpfr_subnormalize (xm, i, GMP_RNDN);
i = mpfr_fma (xr, xx, xy, xz, GMP_RNDN);
mpfr_subnormalize (xr, i, GMP_RNDN);
mpfr_printf ("mpfr_mul %Ra\nmpfr_fma %Ra\n", xm, xr);
mpfr_clear (xx); mpfr_clear (xy); mpfr_clear (xz);
mpfr_clear (xm); mpfr_clear (xr);
return 0;
}
mpfr_sub1 is called, where b is 2 limbs long (128bit), while a and c is just
one limb long (64bit).
As c is small, but not small enough (difference in exponents exactly 128), we
reach the:
for (k = 0; (bn > 0)  (cn > 0); k = 1)
loop in sub1.c, with sh == 0. In the first iteration bb is equal to
MPFR_LIMB_HIGHBIT and cc is 0 (it is before the highest limb in c), and we
reach the special case:
/* the case rounding to nearest with sh=0 is special since one couldn't
subtract above 1/2 ulp in the trailing limb of the result */
if ((rnd_mode == GMP_RNDN) && sh == 0 && k == 0)
{
mp_limb_t half = MPFR_LIMB_HIGHBIT;
is_exact = (bb == cc);
/* add one ulp if bb > cc + half
truncate if cc  half < bb < cc + half
sub one ulp if bb < cc  half
*/
...
}
The comments seems to cover most of the cases, just not two important ones,
where either bb is equal to cc + half (this case) or bb is equal to cc  half.
In that case after this bb is equal to cc, eventhough they weren't equal
before. In this case with round to nearest we really can't decide right away,
need to look at next (or following limbs) to decide if exactly 0.5ulp was added
resp. subtracted, or if tiny bit above it or below it, but that decision should
take into account that the special case above with bb == cc + half or bb == cc
 half happened. In the code it doesn't, so we fall through to the next limb 
0 in bb (as it is after lowest limb) and 0xfff80f8000000000 in cc. k is
nonzero, so the special case isn't done now, bb is smaller than cc, so it sets
down and reaches the
420 else if (down && sh == 0)
421 goto sub_one_ulp;
case, so it incorrectly subtracts 1ulp instead of just raising inexact and
truncating. 
