LAProof.accuracy_proofs.float_acc_lems

Floating-Point Operation Accuracy

This file establishes accuracy lemmas for the basic floating-point operations BPLUS, BMINUS, BMULT, and BFMA as defined in the VCFloat library. Each operation is analyzed in the round-to-nearest-even (RNE) rounding mode.
The central results take the form of the standard rounding error model: given that a floating-point operation does not overflow, its result fl(op) satisfies a bound of the form
\[ \mathtt{fl}(a \mathbin{\mathrm{op}} b) = (a \mathbin{\mathrm{op}} b)(1 + \delta) + \varepsilon \]
where the relative error \(\delta\) and absolute error \(\varepsilon\) satisfy
\[ |\delta| \leq \mathbf{u}, \qquad |\varepsilon| \leq \eta, \qquad \delta \cdot \varepsilon = 0 \]
and where \(\mathbf{u}\) denotes the unit roundoff and \(\eta\) denotes the underflow threshold for the given floating-point type t . The mutual exclusion condition \(\delta \cdot \varepsilon = 0\) reflects the standard decomposition: in the normal range only relative error is incurred, while in the subnormal range only absolute error is incurred.
For BPLUS and BMINUS the error model simplifies to \(\mathtt{fl}(a \mathbin{\mathrm{op}} b) = (a \mathbin{\mathrm{op}} b)(1 + \delta)\), since addition and subtraction on floating-point numbers that are already representable incur only relative error.

Structure

The file is organized as follows:
  • Signed zero lemmas: behavior of operations when one argument is zero.
  • Signed zero facts: FT2R evaluations at signed zeros.
  • No-overflow predicates: fma_no_overflow, Bmult_no_overflow, Bplus_no_overflow, Bminus_no_overflow.
  • Generic rounding lemma: generic_round_property, which extracts the \((\delta, \varepsilon)\) decomposition from the Flocq library for an arbitrary real number.
  • Per-operation accuracy and finiteness lemmas for BFMA, BMULT, BPLUS, and BMINUS, each in two forms:
    • a form assuming a no-overflow hypothesis on real-valued arguments;
    • a primed form (fma_accurate', BMULT_accurate', etc.) assuming only that the floating-point result is finite, from which the no-overflow condition is derived automatically.
  • Correctness lemmas (BFMA_correct, BMULT_correct, BPLUS_correct): these combine the finiteness of inputs and the rounding identity into a single conclusion.

From LAProof.accuracy_proofs Require Import preamble common.

Section GenFloat.

Context {NAN : FPCore.Nans} {t : type}.

Signed Zero Lemmas

Behavior of BMULT, BPLUS, and BFMA when one operand evaluates to zero under FT2R. These are used in higher-level proofs to eliminate zero-valued terms from error bounds.

Lemma Bmult_0R (a f : ftype t) :
  Binary.is_finite (BMULT a f)
  FT2R a = 0
  (BMULT a f) = neg_zero (BMULT a f) = pos_zero.

Lemma Bplus_0R (a f : ftype t) :
  Binary.is_finite (BPLUS a f)
  FT2R f = 0
  FT2R (BPLUS a f) = FT2R a.

Lemma Bfma_mult_0R (a f s : ftype t) :
  Binary.is_finite (BFMA a f s)
  FT2R a = 0
  FT2R (BFMA a f s) = FT2R s.

Signed Zero Values

Signed zeros are finite, and their real-valued interpretations and that of Zconst t 0 under FT2R are all zero. These are used directly in arithmetic simplifications.

Fact neg_zero_is_finite :
  Binary.is_finite (@neg_zero t).

Fact FT2R_neg_zero :
  FT2R (@neg_zero t) = 0.

Fact FT2R_pos_zero :
  FT2R (@pos_zero t) = 0.

Fact FT2R_Zconst_0 :
  FT2R (Zconst t 0) = 0.

Notation fmax := (@fmax t).

No-Overflow Predicates

Each predicate asserts that the RNE-rounded exact result of the corresponding operation has absolute value strictly less than fmax, i.e., the result falls within the normal floating-point range. These are the preconditions for the main accuracy lemmas and are derived automatically in the primed variants.

Definition fma_no_overflow (x y z : R) : Prop :=
  (Rabs (rounded t (x × y + z)) < fmax)%R.

Definition Bmult_no_overflow (x y : R) : Prop :=
  (Rabs (rounded t (x × y)) < fmax)%R.

Definition Bplus_no_overflow (x y : R) : Prop :=
  (Rabs (Generic_fmt.round Zaux.radix2
           (SpecFloat.fexp (fprec t) (femax t))
           (BinarySingleNaN.round_mode BinarySingleNaN.mode_NE)
           (x + y)) < fmax)%R.

Definition Bminus_no_overflow (x y : R) : Prop :=
  (Rabs (Generic_fmt.round Zaux.radix2
           (SpecFloat.fexp (fprec t) (femax t))
           (BinarySingleNaN.round_mode BinarySingleNaN.mode_NE)
           (x - y)) < fmax)%R.

Notation D := (@default_rel t).
Notation E := (@default_abs t).

Generic Rounding Error Decomposition

The lemma below is the shared foundation for all per-operation accuracy results. It packages the Flocq error_N_FLT theorem into the \((\delta, \varepsilon)\) form used throughout this library.
generic_round_property is the fundamental \((\delta,\varepsilon)\) decomposition of RNE rounding error in the Flocq FLT format.
The condition \(\delta * \varepsilon = 0\) asserts that exactly one of the two error terms is zero, reflecting the fact that the two sources of rounding error are mutually exclusive:
  • In the normal range, error is purely relative: \(\varepsilon = 0\).
  • In the subnormal range, error is purely absolute: \(\delta = 0\)
No rounding event can simultaneously be in both regimes, so the two error terms never both appear at once.

Lemma generic_round_property :
   (x : R),
   delta epsilon : R,
    delta × epsilon = 0
    (Rabs delta D)%R
    (Rabs epsilon E)%R
    Generic_fmt.round Zaux.radix2
      (SpecFloat.fexp (fprec t) (femax t))
      (BinarySingleNaN.round_mode BinarySingleNaN.mode_NE)
      x = (x × (1 + delta) + epsilon)%Re.

Fused Multiply-Add

fma_accurate establishes the standard rounding error model for the fused multiply-add operation.

Lemma fma_accurate :
   (x : ftype t) (FINx : Binary.is_finite x)
         (y : ftype t) (FINy : Binary.is_finite y)
         (z : ftype t) (FINz : Binary.is_finite z)
         (FIN : fma_no_overflow (FT2R x) (FT2R y) (FT2R z)),
   delta, epsilon,
    delta × epsilon = 0
    Rabs delta D
    Rabs epsilon E
    (FT2R (BFMA x y z) = (FT2R x × FT2R y + FT2R z) × (1 + delta) + epsilon)%Re.

is_finite_fma_no_overflow shows that finiteness of the result of an FMA implies the no-overflow condition on the exact value \(x \cdot y + z\). This allows the primed form fma_accurate' to work directly from a finiteness hypothesis.

Lemma is_finite_fma_no_overflow :
   (x y z : ftype t)
    (HFINb : Binary.is_finite (BFMA x y z)),
  fma_no_overflow (FT2R x) (FT2R y) (FT2R z).

BFMA_finite_e extracts finiteness of each individual argument from finiteness of the FMA result. This is a standard regularity property: the IEEE 754 FMA result is non-finite if any input is non-finite (with the exception of a zero-times-infinity addend, which produces a NaN).

Lemma BFMA_finite_e :
   (a f u : ftype t)
    (Hfin : Binary.is_finite (BFMA a f u)),
  Binary.is_finite a
  Binary.is_finite f
  Binary.is_finite u.

is_finite_fma_no_overflow': If all three inputs to an FMA are finite and no overflow occurs, then the FMA result is finite.

Lemma is_finite_fma_no_overflow' :
   (x y z : ftype t)
    (Hfinx : Binary.is_finite x = true)
    (Hfiny : Binary.is_finite y = true)
    (Hfinz : Binary.is_finite z = true)
    (Hov : fma_no_overflow (FT2R x) (FT2R y) (FT2R z)),
    Binary.is_finite (BFMA x y z) = true.

fma_accurate' is the finiteness-hypothesis form of fma_accurate: it requires only that the result is finite, deriving the no-overflow condition and input finiteness internally.

Lemma fma_accurate' :
   (x y z : ftype t)
    (FIN : Binary.is_finite (BFMA x y z)),
   delta, epsilon,
    delta × epsilon = 0
    Rabs delta @default_rel t
    Rabs epsilon @default_abs t
    (FT2R (BFMA x y z) = (FT2R x × FT2R y + FT2R z) × (1 + delta) + epsilon)%Re.

BFMA_correct combines BFMA_finite_e and is_finite_fma_no_overflow to give the full correctness statement: finiteness of the result implies finiteness of all inputs and a rounding identity.

Lemma BFMA_correct (a b s : ftype t) :
  Binary.is_finite (BFMA a b s)
  (Binary.is_finite a Binary.is_finite b Binary.is_finite s)
  FT2R (BFMA a b s) =
    Generic_fmt.round Zaux.radix2 (SpecFloat.fexp (fprec t) (femax t))
      (BinarySingleNaN.round_mode BinarySingleNaN.mode_NE)
      (FT2R a × FT2R b + FT2R s).

Floating-Point Multiplication

BMULT_accurate establishes the rounding error model for multiplication, which computes \(\mathtt{fl}(x \cdot y)\) under RNE rounding.

Lemma BMULT_accurate :
   (x y : ftype t)
    (FIN : Bmult_no_overflow (FT2R x) (FT2R y)),
   delta, epsilon,
    delta × epsilon = 0
    Rabs delta @default_rel t
    Rabs epsilon @default_abs t
    (FT2R (BMULT x y) = (FT2R x × FT2R y) × (1 + delta) + epsilon)%Re.

is_finite_BMULT_no_overflow shows that finiteness of BMULT x y implies that the exact product \(x \cdot y\) does not overflow.

Lemma is_finite_BMULT_no_overflow :
   (x y : ftype t)
    (HFINb : Binary.is_finite (BMULT x y)),
  Bmult_no_overflow (FT2R x) (FT2R y).

BMULT_accurate' is the finiteness-hypothesis form of BMULT_accurate.

Lemma BMULT_accurate' :
   (x y : ftype t)
    (FIN : Binary.is_finite (BMULT x y)),
   delta, epsilon,
    delta × epsilon = 0
    Rabs delta @default_rel t
    Rabs epsilon @default_abs t
    (FT2R (BMULT x y) = (FT2R x × FT2R y) × (1 + delta) + epsilon)%Re.

BMULT_finite_e extracts finiteness of each factor from finiteness of the product.

Lemma BMULT_finite_e :
   (a b : ftype t)
    (Hfin : Binary.is_finite (BMULT a b)),
  Binary.is_finite a Binary.is_finite b.

BMULT_correct gives the full correctness statement for multiplication: finiteness of the result implies finiteness of the operands and a rounding identity.

Lemma BMULT_correct (a b : ftype t) :
  Binary.is_finite (BMULT a b)
  (Binary.is_finite a Binary.is_finite b)
  FT2R (BMULT a b) =
    Generic_fmt.round Zaux.radix2 (SpecFloat.fexp (fprec t) (femax t))
      (BinarySingleNaN.round_mode BinarySingleNaN.mode_NE)
      (FT2R a × FT2R b).

Floating-Point Addition

BPLUS_B2R_zero is a special case: adding the zero constant to a finite value returns a result with the same real value.

Lemma BPLUS_B2R_zero (a : ftype t) :
  Binary.is_finite a
  FT2R (BPLUS a (Zconst t 0)) = FT2R a.

BPLUS_accurate establishes the standard relative rounding error model for floating-point addition BPLUS x y.

Lemma BPLUS_accurate :
   (x : ftype t) (FINx : Binary.is_finite x)
         (y : ftype t) (FINy : Binary.is_finite y)
         (FIN : Bplus_no_overflow (FT2R x) (FT2R y)),
   delta,
    Rabs delta @default_rel t
    (FT2R (BPLUS x y) = (FT2R x + FT2R y) × (1 + delta))%Re.

is_finite_sum_no_overflow shows that finiteness of the result implies the no-overflow condition on the exact sum \(x + y\).

Lemma is_finite_sum_no_overflow :
   (x y : ftype t)
    (HFINb : Binary.is_finite (BPLUS x y)),
  Bplus_no_overflow (FT2R x) (FT2R y).

no_overflow_sum_is_finite is the converse direction: if both summands are finite and the exact sum does not overflow, then the result is finite.

Lemma no_overflow_sum_is_finite :
   (x y : ftype t)
    (H1 : Binary.is_finite x)
    (H2 : Binary.is_finite y)
    (Hov : Bplus_no_overflow (FT2R x) (FT2R y)),
  Binary.is_finite (BPLUS x y).

BPLUS_accurate' is the finiteness-hypothesis form of BPLUS_accurate.

Lemma BPLUS_accurate' :
   (x y : ftype t)
    (FIN : Binary.is_finite (BPLUS x y)),
   delta,
    Rabs delta @default_rel t
    (FT2R (BPLUS x y) = (FT2R x + FT2R y) × (1 + delta))%Re.

BPLUS_finite_e extracts finiteness of each summand from finiteness of the sum.

Lemma BPLUS_finite_e :
   (a b : ftype t)
    (Hfin : Binary.is_finite (BPLUS a b)),
  Binary.is_finite a Binary.is_finite b.

BPLUS_correct gives the full correctness statement for floating-point addition: finiteness of the result implies finiteness of each summand and relates the floating-point result to the rounded real result.

Lemma BPLUS_correct (a b : ftype t) :
  Binary.is_finite (BPLUS a b)
  (Binary.is_finite a Binary.is_finite b)
  FT2R (BPLUS a b) =
    Generic_fmt.round Zaux.radix2 (SpecFloat.fexp (fprec t) (femax t))
      (BinarySingleNaN.round_mode BinarySingleNaN.mode_NE)
      (FT2R a + FT2R b).

Floating-Point Subtraction

BMINUS_accurate establishes the pure relative rounding error model for floating-point subtraction.

Lemma BMINUS_accurate :
   (x : ftype t) (FINx : Binary.is_finite x)
         (y : ftype t) (FINy : Binary.is_finite y)
         (FIN : Bminus_no_overflow (FT2R x) (FT2R y)),
   delta,
    Rabs delta @default_rel t
    (FT2R (BMINUS x y) = (FT2R x - FT2R y) × (1 + delta))%Re.

is_finite_minus_no_overflow shows that finiteness of the floating-point result implies that the exact difference does not overflow.

Lemma is_finite_minus_no_overflow :
   (x y : ftype t)
    (HFINb : Binary.is_finite (BMINUS x y)),
  Bminus_no_overflow (FT2R x) (FT2R y).

no_overflow_minus_is_finite is the converse direction for subtraction: if both operands are finite and the exact difference does not overflow, then the result is finite.

Lemma no_overflow_minus_is_finite :
   (x y : ftype t)
    (H1 : Binary.is_finite x)
    (H2 : Binary.is_finite y)
    (Hov : Bminus_no_overflow (FT2R x) (FT2R y)),
  Binary.is_finite (BMINUS x y).

BMINUS_accurate' is the finiteness-hypothesis form of BMINUS_accurate.

Lemma BMINUS_accurate' :
   (x y : ftype t)
    (FIN : Binary.is_finite (BMINUS x y)),
   delta,
    Rabs delta @default_rel t
    (FT2R (BMINUS x y) = (FT2R x - FT2R y) × (1 + delta))%Re.

BMINUS_finite_sub extracts finiteness of each operand from finiteness of the difference.

Lemma BMINUS_finite_sub :
   (a b : ftype t)
    (Hfin : Binary.is_finite (BMINUS a b)),
  Binary.is_finite a Binary.is_finite b.

End GenFloat.