# Formal Verification of an IEEE Floating Point Adder 

Formale Verifikation eines IEEE-GleitKomma-Addierers


## Christoph Berg CB @ CS.UNI-SB.DE

Universität des SaArlandes<br>FAChrichtung 6.2 - Informatik<br>Lehrstuhl für Rechnerarchitektur und Parallelrechner<br>Prof. Dr. Wolfgang J. Paul

## Eidesstattliche Erklärung

Hiermit erkläre ich an Eides Statt, daß ich die vorliegende Arbeit selbstständig verfaßt und nur die angegebenen Quellen benutzt habe. Ich habe diese Arbeit keinem anderen Prüfungsamt vorgelegt.

Saarbrücken, im Mai 2001

Christoph Berg

Im Gebirge der Wahrheit kletterst du nie umsonst: entweder du kommst schon heute weiter hinauf oder du übst deine Kräfte, um morgen höher steigen zu können.

In the mountains of truth you never climb in vain: either you ascend today, or you practise your strengths for climbing higher tomorrow.

Friedrich Nietzsche

## Danke

Meinen Dank aussprechen möchte ich all denjenigen, die mich beim Entstehen dieser Diplomarbeit unterstützt haben:

- Christian Jacobi, der vom Betreuer meiner Arbeit zu einem guten Freund wurde, und dessen Unterstützung eine große Hilfe war,
- Prof. Wolfgang Paul, der mir durch seine Vorlesungen und die Vergabe des Themas ermöglicht hat, an einem spannenden Projekt zu arbeiten,
- Jochen Preiß, Daniel Kröning, Sven Beyer, Dirk Leinenbach, Michael Klein und Stefan Kunde für die „soziale Infrastruktur" am Lehrstuhl,
- meinen Freunden, die mir genügend Ablenkung vom Streß geboten haben, der sich beim Arbeiten immer wieder einstellt,
- und nicht zuletzt meinen Eltern, die mich auf meinem universitären Werdegang immer wieder ermutigt haben.


#### Abstract

Our group at Saarland University is formally verifying the correctness of a complete microprocessor called VAMP. The PVS theorem prover is used to specify the circuit definitions and to prove their correctness. For the VAMP project, a library of basic circuits was developed. The formal verification of this library is described in the first part of this thesis. Part of the VAMP is an IEEE compliant floating point unit. The second part of this thesis describes the formal verification of the gate level correctness of the VAMP floating point adder.


## Zusammenfassung

Die Korrektheit eines vollständigen Mikroprozessors, dem VAMP, wird von unserer Gruppe an der Universität des Saarlandes formal verifiziert. Der Theorembeweiser PVS wird benutzt, um die Schaltkreis-Definitionen zu spezifizieren und deren Korrektheit zu zeigen. Für das VAMPProjekt wurde eine Bibliothek von Standard-Schaltkreisen entwickelt. Die formale Verifikation dieser Bibliothek ist im ersten Teil dieser Arbeit beschrieben. Teil des VAMP ist eine IEEE-Gleitkomma-Einheit. Die formale Verifikation der Korrektheit des VAMP-Gleitkomma-Addierers auf Gatterebene ist Thema des zweiten Teils dieser Diplomarbeit.

## Contents

1 Introduction ..... 1
2 The PVS Theorem Prover ..... 5
2.1 The PVS Logic ..... 5
2.2 An Example Proof ..... 6
2.2.1 Proof of sum_is_recsum ..... 7
2.2.2 Proof of recsum_is_gauss ..... 12
2.2.3 Proof of sum_is_gauss ..... 12
2.2.4 TCCs ..... 14
2.3 PVS and Mathematics ..... 14
2.4 Bitvectors ..... 15
3 Basic Components ..... 17
3.1 Halfadder, Fulladder ..... 17
3.2 Carry Chain Incrementer ..... 19
3.3 Carry Chain Adder ..... 21
3.4 Compound Adder ..... 21
3.5 Generic Adder ..... 23
3.6 Carry Save Adder ..... 26
3.7 Arithmetic Unit, Subtraction ..... 28
3.8 Absolute Value ..... 29
3.9 Multiplier ..... 30
3.10 Decoder ..... 31
3.11 Halfdecoder ..... 32
3.12 Encoder ..... 33
3.13 Leading Zero Counter ..... 34
3.14 Cyclic Left Shifter ..... 36
3.15 Logical Left Shifter ..... 38
3.16 Logical Right Shifter ..... 39
3.17 Or Tree ..... 40
4 IEEE Floating Point Arithmetic ..... 41
4.1 Factorings ..... 41
4.2 Rounding ..... 42
$4.3 \quad \alpha$-Equivalence ..... 43
4.4 Exceptions ..... 45
4.5 Correctness of the FPU ..... 45
5 Floating Point Adder ..... 47
5.1 Adder Correctness ..... 47
5.2 Addition Algorithm ..... 48
5.3 Algorithm Correctness ..... 49
5.4 Adder Hardware ..... 51
5.5 Stage 1: Computing $s_{b}^{\prime}$ ..... 52
5.6 Stage 2: Alignment Shift ..... 52
5.6.1 Exponent Subtract ..... 53
5.6.2 Exponent Select ..... 54
5.6.3 Circuit Limit ..... 54
5.6.4 Significand Swapping ..... 56
5.6.5 Alignment Shift and Sticky Bit Computation ..... 57
5.6.6 Alignment Shifter Correctness ..... 59
5.7 Stage 3: Significand Addition ..... 61
5.8 Putting It All Together ..... 62
5.9 Boundary Constraints ..... 63
6 Summary ..... 65
6.1 The VAMP Project ..... 65
6.2 Bugs ..... 65
6.3 Related Work ..... 66
6.4 Prospect ..... 67

## Chapter 1

## Introduction

Floating point hardware consists of complex circuits that tend to have subtle errors. Design flaws are usually eliminated by testing, but it is impossible to test every state the circuit can enter. Even millions of test vectors were unable to find the 1995 Pentium division bug [Pra95].

Proving the correctness using mathematical reasoning overcomes this limitation. But reasoning about single bits in paper-and-pencil proofs is tedious and error prone, so the correctness of the design is not entirely certain.

Formal verification-using theorem proving or other formal techniques-provides a way to rigorously prove the correctness of a design. Theorem proving is a general framework for formal reasoning in logic. It allows to prove very complex statements and can-using abstraction or induction-even reason about infinite state spaces. The disadvantage is that a considerable amount of manual work is needed. Finding proofs in a theorem prover is essentially as hard as finding proofs in mathematics.
This thesis covers the formal verification of a library of basic circuits and an IEEE floating point adder using theorem proving.

The VAMP project. Müller and Paul design a complete floating point unit (FPU) on the gate level in their textbook [MP00]. The FPU features an addition/subtraction unit, a multiplication/division unit, a rounder common to the functional units, conversion to/from integer operands, and floating point comparison. Along with the designs come paper-and-pencil proofs for the correctness of the circuits.

The FPU from [MP00] is embedded into the DLX processor. The DLX is a RISC processor based on the MIPS instruction set architecture [HP96]. Features of the DLX implementation in [MP00] are a 5 -stage pipeline, precise and nested interrupts, delayed branch, and a cache memory interface. [MP00] also includes paper-and-pencil proofs for the correctness of the DLX integer core.

In the VAMP project, our group formally verifies the DLX processor [JK00, BJ01, Krö01, Jac01b, VAM] using the PVS theorem prover [OSR92]. The VAMP-standing for Verified Architecture Microprocessor-is an implementation of the paper designs from [MP00] in the PVS language. We verify the correctness of the [MP00] proofs. We successively add new features to the VAMP and formally verify them. The major improvement over the DLX from [MP00] is the implementation of a Tomasulo scheduler. A cache memory interface with TLB is being worked


Figure 1.1: The VAMP FPU
on. The topic of this thesis is the formal verification of the VAMP floating point adder correctness.
Our group has implemented a translation tool that converts the hardware specifications from PVS to Verilog HDL [BJKL01]. The Verilog files of the VAMP processor are compiled for a Xilinx FPGA, which is hosted on a PCI board. We are porting the gcc compiler and the GNU C library to the VAMP architecture to evaluate the verified VAMP processor.

Basic circuits. The correctness of the basic circuits library used in the VAMP processor is proved in the first part of this thesis [BJK01]. Basic circuits are components like adders, multipliers, shifters, and decoders that are used frequently in hardware design. The library has been developed for the VAMP project, but may be used in any hardware verification project, as the circuits are of arbitrary bit width and for general use.

FPU verification. For the verification of floating point hardware, a formalization of the IEEE standard 754 [Ins85] is needed [Jac01b, Jac01a]. Based on this abstraction level, the FPU modules are verified.

The VAMP FPU consists of three major parts (figure 1.1). The unpacker is given the FPU input operands in the IEEE format and converts them to an internal representation; the unpacker also handles integer operands and special cases as operations with NaN or infinite operands. The functional units then compute the operation's result. In the last step, the rounding unit rounds the result and packs it into the IEEE format.

Other than the adder in [MP00], the adder design in this thesis does not handle special cases, as this is done by the unpacker. The adder is a combinatorial circuit. Pipelining circuits by inserting registers and building a functional unit for the Tomasulo scheduler is described in [Jac01b].

Project status. For his PhD thesis, Daniel Kröning has verified the correctness of the VAMP integer core and implemented a verified Tomasulo scheduler [Krö01]. Christian Jacobi has implemented the VAMP FPU functional modules (except for the adder) and the pipeline control for the FPU. The verification of these will be described in his PhD thesis [Jac01b]. The verification of the floating point adder hardware is the topic of this thesis. Sven Beyer has started to work on the verification of the cache memory interface and TLB. Dirk Leinenbach and Sven Beyer have implemented the PVS to Verilog translation tool [BJKL01].

The PVS files-hardware specifications and proofs-and the Verilog sources of the VAMP are publically available at our web site [VAM].

Outline. Chapter 2 is a brief introduction to the PVS theorem prover. A simple example is given to make the reader familiar with PVS proofs. Notations and lemmas used in the following chapters are introduced. Chapter 3 covers the basic circuits library. For each circuit, the circuit implementation is specified and the correctness is proved. Chapter 4 is an overview on the IEEE floating point arithmetic formalization used. The correctness of the floating point adder is proved in chapter 5 . A summary and an overview on related work are given in chapter 6 .

## Chapter 2

## The PVS Theorem Prover

PVS (short for Prototype Verification System) is a general purpose theorem prover [OSR92]. This chapter gives a brief introduction to the PVS logic, presents a simple example proof to make the reader familiar with PVS, and introduces some of the most important PVS commands. For a further treatment, see the tutorial [COR ${ }^{+95] . \text { At the end of the chapter, we introduce notations, }}$ definitions, and lemmas that will be used in the following chapters.

### 2.1 The PVS Logic

PVS is based on typed higher order logic (HOL). Predefined types include bool, nat, int, real, $b i t, \ldots$ From these, sets, functions, tuples, and records are derived. Each PVS file consists of one or more theories which in turn contain definitions, lemmas, theorems, and axioms written as HOL formulas.

Within proofs, PVS operates on sequents of the form $\Gamma \vdash \Delta$, where $\Gamma$ and $\Delta$ are lists of HOL formulas, called antecedents and consequents, respectively [Gen35]. For the proof to succeed, we have to show that the disjunction of the consequents is a logical consequence of the conjunction of the antecedents: ${ }^{1}$

$$
\gamma_{-1}, \ldots, \gamma_{-i} \vdash \delta_{1}, \ldots, \delta_{j} \quad \text { is to be read as } \quad \gamma_{-1} \wedge \cdots \wedge \gamma_{-i} \Longrightarrow \delta_{1} \vee \cdots \vee \delta_{j}
$$

A proof of a lemma $F$ starts with the sequent $\vdash F$. Sequents are modified by proof commands. Proof commands transform a sequent into several—possibly zero-child sequents. These sequents are subcases of the parent sequent; their conjunction implies the validity of the parent sequent. The proof sequents form a tree. A proof branch is closed if the last proof command yielded no children, i.e., if PVS was able to verify the validity of the sequent. A proof is finished when there are no open branches left in the proof.

PVS has to verify that all types used in the specification are well-defined and generates so-called $T C C s$ (type correctness conditions) if it encounters any non-trivial type usage. For example, in the expression $\sqrt{n}, n$ must not be negative. The user then has to prove the TCCs to show that the specification is sound. In most cases, TCCs are simple statements like $\forall n \in \mathbb{N}^{+}: n-1 \geq 0$ that can be proved automatically. Sometimes, however, TCCs require a substantial amount of manual work.

[^0]```
gauss: Theory
Begin
Importing bitvectors@sums
n, i: Var nat
sum(n): nat = sigma(0, n, Lambda i: i)
recsum(n): Recursive nat =
    If n = O Then 0 Else n + recsum(n-1) EndIf
    Measure n
gauss(n): real = n * (n + 1) / 2
sum_is_recsum: Lemma
    sum(n) = recsum(n)
recsum_is_gauss: Lemma
    recsum(n) = gauss(n)
sum_is_gauss: Theorem
    sum(n) = gauss(n)
```

End gauss
Figure 2.1: PVS file gauss.pvs

A PVS lemma is called proved if its associated proof is finished. It is called complete if all lemmas used in the proof and all TCCs generated are proved and complete themselves. Only lemmas that are proved and complete may be considered valid, as unproved lemmas that are used in proofs may be unsound.

### 2.2 An Example Proof

To give an intuition about how the PVS theorem prover works, we present a rather simple example: the proof of GAUSS's theorem.

Theorem 1 (GaUSs) For all $n \in \mathbb{N}$ :

$$
\sum_{i=0}^{n} i=\frac{n(n+1)}{2}
$$

We define three functions whose equality we want to prove using PVS (figure 2.1):

1. sum (n) : nat $=$ sigma(0, $n$, Lambda i: i)
sum uses the sigma function. sigma sums up a function over a finite natural domain. The term $i$ in theorem 1 is an expression, not a function. We therefore use the $\lambda$ term $(\lambda i: i)$ -
which is the identity function-in the formalization of the $\Sigma$ operator. In mathematical notation, this is

$$
\operatorname{sum}(n):=\sum_{0}^{n}(\lambda i: i)
$$

2. recsum(n) : Recursive nat $=$

If $\mathrm{n}=0$ Then 0 Else $\mathrm{n}+$ recsum( $\mathrm{n}-1$ ) EndIf
Measure n
Another formalization is given as the recursive function recsum. Recursive functions have to be well-founded, i.e., the recursion must terminate eventually; a measure has to be supplied whose natural value must decrease with each recursive call. In our case this is trivially true since $n-1<n$.

$$
\operatorname{recsum}(n):= \begin{cases}0 & \text { if } n=0 \\ n+\operatorname{recsum}(n-1) & \text { else }\end{cases}
$$

3. gauss $(\mathrm{n})$ : real $=n *(\mathrm{n}+1) / 2$

Finally, gauss is GAUSS's formula.

We aim to prove Gauss's theorem, formalized in sum_is_gauss. To clarify proof techniques commonly used in PVS, we split this theorem into the two lemmas sumis_recsum and recsum_is_gauss.

### 2.2.1 Proof of sum_is_recsum

We first show that sum and recsum are equivalent.

Lemma 1 sum_is_recsum:

```
sum(n) = recsum(n)
```

Proof. After starting the PVS prover, we are presented the first sequent.

```
sum_is_recsum :
```

|-------
\{1\} FORALL (n: nat): sum(n) $=\operatorname{recsum}(n)$

Rule?

Note that PVS has augmented the lemma by a $\forall$ quantor binding the free occurrence of $n$. We aim to prove the lemma by induction on $n$. We start with the PVS command (induct "n"), which yields two subgoals that correspond to induction base and step.

```
Rule? (induct "n")
Inducting on n on formula 1,
```

```
this yields 2 subgoals:
sum_is_recsum.1 :
    |-------
{1} sum(0) = recsum(0)
Rule?
```

Proof branches are named by the prover. Here, the induction base is named sum is recsum. 1. This subgoal is resolved by (grind). This proof command repeatedly expands all definitions and applies various simplification rules. (grind) is one of the most powerful proof commands, but in some cases, it will not terminate if recursive definitions are unwinded ad infinitum. Another possibility to prove the induction base was to expand sum, sigma, and recsum manually (using (expand) ), but (grind) is more compact.

```
Rule? (grind)
sigma rewrites sigma(0, 0, LAMBDA i: i)
    to 0
sum rewrites sum(0)
    to 0
recsum rewrites recsum(0)
    to 0
Trying repeated skolemization, instantiation, and if-lifting,
This completes the proof of sum_is_recsum.1.
sum_is_recsum.2 :
    |-------
{1} FORALL j: sum(j) = recsum(j) IMPLIES
        sum(j + 1) = recsum(j + 1)
Rule?
```

The induction step is named sum_is_recsum. 2. We replace the quantified $j$ by an arbitrary, but fixed $j_{1}$ (a so called skolem constant) using (skosimp*). This PVS command also simplifies the sequent: in this case, it splits the implication in formula 1 into an antecedent part -1 and a consequent part 1.

```
Rule? (skosimp*)
Repeatedly Skolemizing and flattening,
this simplifies to:
sum_is_recsum.2 :
{-1} sum(j!1) = recsum(j!1)
    |-------
{1} sum(j!1 + 1) = recsum(j!1 + 1)
```

Rule?

We have to prove that the equality in formula 1 (the induction claim) is an implication of formula -1 (the induction hypothesis). We want to use properties of the sigma operator which is currently hidden in our function sum. We therefore expand the definition of sum:

```
Rule? (expand "sum")
Expanding the definition of sum,
this simplifies to:
sum_is_recsum.2 :
{-1} sigma(0, j!1, LAMBDA i: i) = recsum(j!1)
    |-------
{1} sigma(0, 1 + j!1, LAMBDA i: i) = recsum(1 + j!1)
```

Rule?

Before we can apply the induction hypothesis, we use the lemma Sigma Split from the PVS bitvectors library. The command (lemma "sigma_split") introduces the lemma as a new antecedent formula.

```
Rule? (lemma "sigma_split")
Applying sigma_split
this simplifies to:
sum_is_recsum.2 :
{-1} FORALL (F: [nat -> nat], high, low, m: nat):
    m >= low AND high > m IMPLIES
        sigma(low, high, F) =
                        sigma(low, m, F) + sigma(m + 1, high, F)
[-2] sigma(0, j!1, LAMBDA i: i) = recsum(j!1)
    |-------
[1] sigma(0, 1 + j!1, LAMBDA i: i) = recsum(1 + j!1)
```

Rule?

This lemma allows us to split a sigma term over $F$ ranging from low to high at any intermediate $m$. In our case, $F=(\lambda i: i)$, $h i g h=j_{1}+1$, low $=0$, and $m=j_{1}$. Thus, we instantiate formula -1 accordingly: ${ }^{2}$

```
Rule? (inst -1 "LAMBDA i: i" "j!1+1" "0" "j!1")
Instantiating the top quantifier in -1 with the terms:
    LAMBDA i: i, j!1+1, 0, j!1,
this simplifies to:
```

[^1]```
sum_is_recsum.2 :
{-1} j!1 >= 0 AND j!1 + 1 > j!1 IMPLIES
    sigma(0, j!1 + 1, LAMBDA i: i) =
        sigma(0, j!1, LAMBDA i: i) +
                        sigma(j!1 + 1, j!1 + 1, LAMBDA i: i)
[-2] sigma(0, j!1, LAMBDA i: i) = recsum(j!1)
    |-------
[1] sigma(0, 1 + j!1, LAMBDA i: i) = recsum(1 + j!1)
```

Rule?

The lemma's prerequisites $j_{1} \geq 0 \wedge j_{1}+1>j_{1}$ require $j_{1}$ to lie in the range low . . high, which is true. By (assert), we invoke the PVS decision procedures for linear arithmetic, thereby simplifying the formula and eliminating the redundant prerequisites. We are left with the equation

$$
\sum_{0}^{j_{1}+1}(\lambda i: i)=\sum_{0}^{j_{1}}(\lambda i: i)+\sum_{j_{1}+1}^{j_{1}+1}(\lambda i: i)
$$

```
Rule? (assert)
Simplifying, rewriting, and recording with decision procedures,
this simplifies to:
sum_is_recsum.2 :
{-1} sigma(0, 1 + j!1, LAMBDA i: i) =
        sigma(0, j!1, LAMBDA i: i) +
            sigma(1 + j!1, 1 + j!1, LAMBDA i: i)
[-2] sigma(0, j!1, LAMBDA i: i) = recsum(j!1)
    |-------
[1] sigma(0, 1 + j!1, LAMBDA i: i) = recsum(1 + j!1)
```

Rule?

The left side of equation -1 (the instantiated lemma Sigma Split) is equal to the left side of equation 1 . To apply the lemma, we replace any occurrence of the left side of equation -1 by the right side. Since we do not need the lemma afterwards, we hide the formula.

```
Rule? (replace -1 :hide? t)
Replacing using formula -1,
this simplifies to:
sum_is_recsum.2 :
[-1] sigma(0, j!1, LAMBDA i: i) = recsum(j!1)
    |-------
{1} sigma(0, j!1, LAMBDA i: i) +
    sigma(1 + j!1, 1 + j!1, LAMBDA i: i) =
                        recsum(1 + j!1)
```

Rule?

The above four steps-importing a lemma, instantiating it, discharging its prerequisites, and replacing subterms-are the usual way to use lemmas in PVS.

We can now apply the induction hypothesis in formula -1, again by (replace) :

```
Rule? (replace -1 :hide? t)
Replacing using formula -1,
this simplifies to:
sum_is_recsum.2 :
    |-------
{1} recsum(j!1) + sigma(1 + j!1, 1 + j!1, LAMBDA i: i) =
    recsum(1 + j!1)
Rule?
```

The remaining sigma term is trivial, (expand "sigma") reduces it to $1+j_{1}$. Note that PVS tries to generate normal forms of expressions and moves the 1 to the front.

```
Rule? (expand "sigma")
Expanding the definition of sigma,
this simplifies to:
sum_is_recsum.2 :
    |-------
{1} 1 + recsum(j!1) + j!1 = recsum(1 + j!1)
Rule?
```

Next, we unwind one step of the recursion in the second occurrence of recsum in formula 1 by (expand "recsum" 12 ). This yields $1+\operatorname{recsum}\left(j_{1}\right)+j_{1}$ on the right side, which is syntactically equivalent to the left side. The PVS prover recognizes this and finishes the proof automatically.

```
Rule? (expand "recsum" 1 2)
Expanding the definition of recsum,
this simplifies to:
sum_is_recsum.2 :
```

    |-------
    \{1\} TRUE
which is trivially true.

```
This completes the proof of sum_is_recsum.2.
Q.E.D.
Run time = 0.50 secs.
Real time = 144.64 secs.
```

The proof subgoals generated by the proof commands form a tree representing the proof's structure. PVS visualizes this tree during the proof attempt in a separate window (figure 2.2).

### 2.2.2 Proof of recsum_is_gauss

To demonstrate a less 'manual' proof, we prove the next lemma using a more powerful proof command.

Lemma 2 recsum_is_gauss:

```
recsum(n) = gauss(n)
```


## Proof.

```
    |-------
{1} FORALL (n: nat): recsum(n) = gauss(n)
```

One step, namely (induct-and-simplify "n"), suffices to prove this goal. This command starts an induction and repeatedly expands and simplifies expressions.

### 2.2.3 Proof of sum_is_gauss

Finally, we can proceed to prove our original goal, the theorem sum is_gauss.

## Theorem 2 sum_is_gauss:

$$
\operatorname{sum}(n)=\operatorname{gauss}(n)
$$

## Proof.

```
    |-------
{1} FORALL (n: nat): sum(n) = gauss(n)
```

We first have to get rid of the quantor by using (skosimp*) , as in the sum is_recsum proof.
|-------
$\{1\} \quad \operatorname{sum}(n!1)=\operatorname{gauss}(n!1)$
Manually importing a lemma, instantiating it and replacing sub-terms using the equality in the lemma-as in our first proof-is a task that can be automated if PVS is able to guess the right instantiation. We now use the two previously proved lemmas to rewrite the formula in our current goal. (rewrite "sum_is_recsum") replaces the occurrence of sum by the corresponding recsum term:


Figure 2.2: The sum_is_recsum, recsum_is_gauss, and sum_is_gauss proof trees

```
    |-------
{1} recsum(n!1) = gauss(n!1)
```

Finally, (rewrite "recsum_is_gauss") replaces recsum by gauss, leading to the trivial sequent gauss $(n!1)=$ gauss $(n!1)$, which PVS recognizes as true. The theorem is proved; it is complete as well, since all lemmas we used are proved and complete.

The above proofs demonstrate some of the most important proof techniques used in PVS. We did not try to make the proofs as short as possible, but to give the reader the intuition of how PVS proofs work. In fact, the step (induct-and-simplify "n") that was used to prove lemma recsum_is_gauss would have resolved the first lemma, sum_is_recsum, immediately. Of course, more difficult theorems need many more steps, and our small proofs may be regarded as 'light' proof examples for more complex properties we want to verify.

### 2.2.4 TCCs

Two TCCs are generated for the gauss theory, both for the recursive call recsum ( $n-1$ ) in the recsum function. Since recsum is only defined on natural numbers, we have to prove that $n-1$ is non-negative. But recsum ( $\mathrm{n}-1$ ) is only called if $n \neq 0$ and hence the TCC

```
recsum_TCC1: OBLIGATION
    FORALL (n): NOT n = 0 IMPLIES n - 1 >= 0
```

is easily proved via (assert). The second TCC assures the termination of recursive recsum calls. The measure $n$ supplied in the recsum definition must decrease with each recursive call. This TCC is trivially discharged as well.

```
recsum_TCC2: OBLIGATION
    FORALL (n): NOT n = 0 IMPLIES n - 1 < n
```

Although we know that gauss returns a natural, we assigned it a type of real. Otherwise, we had to prove a third TCC that $\frac{n(n+1)}{2}$ was natural, which we do not want to do here. ${ }^{3}$

### 2.3 PVS and Mathematics

Syntax. The reader should be convinced by now that the PVS syntax closely resembles mathematical notation, and translating between both is straightforward. For readability, we will use conventional mathematical notation in the remaining part of this thesis.

Proofs. In textbooks, circuits are usually defined by using figures, and correctness proofs argue about these figures. Translating hardware correctness proofs from conventional mathematics to PVS means to formalize the figures in PVS, and then to adapt the proofs using proof techniques exploiting the capabilities of PVS. If the proofs are not entirely formal or use lemmas that are not

[^2]available in PVS, the proofs have to redone from scratch, or several auxillary lemmas have to be proved.

Conventional mathematical proofs use shortcuts where the line of reasoning is obvious-or at least supposed to be obivous. Usually, these 'proof gaps' are marked by flowery phrases like trivial, obvious, analogous, and without loss of generality. The gaps have to be filled for the formal verification in PVS.

Contrarily, translating PVS proofs back to mathematical notation means to 'extract' the essential proof commands from a proof tree, and to provide the necessary intuition on the proof goal. Only about one in every three or four PVS proof commands is worth being mentioned in the mathematical transcript, the others being uninteresting-or in other words trivial—transformations like evaluating expressions (e.g., $1-1=0$ ), or using associativity, commutativity, and such. Of course, we will mark these steps using phrases like those mentioned above, but we hope that the reader will be confident-due to the formal verification of the proofs in PVS-that there are no proof gaps left.

### 2.4 Bitvectors

PVS provides a bitvectors library that provides bits, bitvectors, and a rich collection of lemmas for bitvector transformation and arithmetic [ $\mathrm{BMS}^{+} 96$ ].

Notation. PVS defines the type bit, which we will denote by $\mathbb{B}$, to be the type boolean $:=$ $\{$ false, true $\}$. For convenience, we will interpret $\mathbb{B}$ as well as the set $\{\mathbf{0}, \mathbf{1}\}$. A bitvector of width $n$ is a function mapping the domain $\{n-1, \ldots, 0\}$ to $\mathbb{B}$. We denote the bitvector type by $\mathbb{B}^{n}$. We will implicitly identify bits $\mathbb{B}$ and bitvectors $\mathbb{B}^{1}$ of width 1 .

A bitvector $b$ of width $n$ is indexed by $b[i]$ with $i$ ranging from $n-1$ to 0 . For $h \geq l, b[h, l]$ denotes the extracted bit vector consisting of bits $h$ to $l$ of $b$; o is the concatenation operator. For $x \in \mathbb{B}, x^{i}$ denotes the bitvector consisting of bit $x$ repeated $i$ times. The bit operators $\neg, \wedge$, and $\vee$ will be applied to bitvectors as well, meaning bitwise application.

The natural number represented by $b$ is denoted by

$$
\langle b\rangle:=\sum_{i=0}^{n} 2^{i} \cdot b[i] .
$$

The two's complement value of $b$ is

$$
[b]:= \begin{cases}\langle b\rangle & \text { if }\langle b\rangle<2^{n-1} \\ \langle b\rangle-2^{n} & \text { else. }\end{cases}
$$

The range of the $n$-bit two's complement numbers is denoted by

$$
T_{n}:=\left\{-2^{n-1}, \ldots, 2^{n-1}-1\right\}
$$

The proof that $T_{n}$ is indeed the range of the $n$-bit two's complement numbers can be found in the bitvectors library.

Lemmas. In the remaining part of this thesis, we need the following lemmas. Most of these come from the PVS bitvectors library; we do not give the proofs. Unless otherwise noted, let $n \in \mathbb{N}^{+}$and $b \in \mathbb{B}^{n}$.

## Lemma 3

$$
\langle b\rangle=b[n-1] \cdot 2^{n-1}+\langle b[n-2,0]\rangle .
$$

Lemma 4 For all $n, m \in \mathbb{N}^{+}$, bn $\in \mathbb{B}^{n}$, $b m \in \mathbb{B}^{m}$ :

$$
\langle b n \circ b m\rangle=\langle b n\rangle \cdot 2^{m}+\langle b m\rangle
$$

## Lemma 5

$$
\langle b\rangle \geq 2^{n-1} \Longleftrightarrow b[n-1] .
$$

Lemma 6 For all $l \in \mathbb{N}_{<n}$ :

$$
\langle b\rangle<2^{l} \Longleftrightarrow b[n-1, l]=\mathbf{0}^{n-l} .
$$

Lemma 7 For all $l \in \mathbb{N}_{<n}$ :

$$
\langle b[l, 0]\rangle=\langle b\rangle \bmod 2^{l+1}
$$

Lemma 8 For all $l \in \mathbb{N}_{<n}$ :

$$
\langle b[n-1, l]\rangle=\langle b\rangle \operatorname{div} 2^{l} .
$$

## Lemma 9

$$
\langle\neg b\rangle=2^{n}-\langle b\rangle-1
$$

Lemma 10

$$
[b]=\langle b\rangle-2^{n} \cdot b[n-1] .
$$

## Lemma 11

$$
[b]=\langle b[n-2,0]\rangle-2^{n-1} \cdot b[n-1] .
$$

## Lemma 12

$$
[b]<0 \Longleftrightarrow b[n-1] .
$$

## Lemma 13

$$
-[b]=[\neg b]+1
$$

## Lemma 14

$$
\langle b[n-1] \circ \neg b[n-1] \circ b[n-2,0]\rangle=2^{n-1}+\langle b\rangle .
$$

## Chapter 3

## Basic Components

Large circuits are built from smaller modules that tend to be used frequently. Therefore, it is profitable to collect these standard modules in a library of basic components. This section describes the library developed as part of this thesis. The library consists of various combinatorial circuits as listed in table 3.1. A summary of this chapter has been published as [BJK01].

Each component is specified as a PVS function. The correctness criterion is formulated as a lemma stating the intended circuit behaviour using a mathematical formula. The correctness criterion is then used as a rewrite rule in proofs of larger circuits that use the component. The circuits are of arbitrary bit width and are designed for general use. Some circuits, however, are limited to bit widths that are powers of 2 .

The designs and most proofs were taken from [MP95, KP97, MP00]. As we will only prove the correctness of the circuits, the reader should refer to the cited publications for a further explanation of the circuit functionality. Cost and delay in table 3.1 are asymptotic measures, i.e., $n$ means $O(n)$. We will not give proofs for these here.

The goal of the VAMP project is to obtain a completely verified CPU. The designs are not necessarily optimized for speed. There are faster adder designs than carry chain adders, but as the correctness of the design does not depend on the adder implementation used (as long as the adder is correct), we use the simple, but slow carry chain implementation. Another reason for carry chain adders is the FPGA implementation: due to fast carry lines, the carry chain adder is the most efficient adder on Xilinx FPGAs [Xil00].

Unless noted otherwise, $n$ is a positive natural number in this chapter: $n \in \mathbb{N}^{+}$.

### 3.1 Halfadder, Fulladder

Halfadder. The halfadder takes two bits $a$ and $c$ and computes a bitvector of length 2 representing the sum $a+c$ (figure 3.1).

Circuit 1 Inputs $a, c \in \mathbb{B}$, output $s \in \mathbb{B}^{2}$.

$$
\text { halfadder }:=(a \wedge c) \circ(a \oplus c)
$$

| component | symbol | width | cost | delay |
| :--- | :--- | :---: | :---: | :---: |
| halfadder | halfadder | 1 | 1 | 1 |
| fulladder | fulladder | 1 | 1 | 1 |
| carry chain incrementer | carry_chain_inc | $n$ | $n$ | $n$ |
| carry chain adder | carry_chain | $n$ | $n$ | $n$ |
| compound adder | compound | $n$ | $n \log n$ | $\log n$ |
| generic adder | Add | $n$ | $n$ | $n$ |
| carry save adder | csa | $n$ | $n$ | 1 |
| arithmetic unit | add_sub | $n$ | $n$ | $n$ |
| subtract unit | sub | $n$ | $n$ | $n$ |
| absolute value | abs | $n$ | $n$ | $n$ |
| linear multiplier | mult_lin | $n, m$ | $n \cdot m$ | $n+m$ |
| decoder | dec | $n$ | $2^{n}$ | $\log n$ |
| half decoder | hdec | $n$ | $2^{n}$ | $n$ |
| encoder | enc | $2^{n}$ | $2^{n}$ | $n$ |
| leading zero counter | $l z$ | $2^{n}$ | $n$ | $\log n$ |
| cyclic left shifter | cls | $2^{n}$ | $n \log n$ | $\log n$ |
| logic left shifter | lls | $n$ | $n \log n$ | $\log n$ |
| logic right shifter | lrs | $n$ | $n \log n$ | $\log n$ |
| or tree | ortree | $n$ | $n$ | $\log n$ |

Table 3.1: The components contained in the library


Figure 3.1: Halfadder HA

Lemma 15 (Halfadder Correct) For all $a, c \in \mathbb{B}$ :

$$
\langle h a l f a d d e r(a, c)\rangle=a+c .
$$

Proof. Since the halfadder is a small circuit of fixed size, its correctness is automatically verified by using the PVS command (grind).

Fulladder. Similar to the halfadder, the fulladder sums up three bits (figure 3.2).

Circuit 2 Inputs $a, b, c \in \mathbb{B}$, output $s \in \mathbb{B}^{2}$. Let $x:=a \oplus b$.

$$
\text { fulladder }:=((a \wedge b) \vee(c \wedge x)) \circ(x \oplus c)
$$



Figure 3.2: Fulladder FA

Lemma 16 (Fulladder Correct) For all $a, b, c \in \mathbb{B}$ :

$$
\langle\text { fulladder }(a, b, c)\rangle=a+b+c \text {. }
$$

Proof. Trivial by (grind).

### 3.2 Carry Chain Incrementer

Our first non-trivial circuit is a carry chain incrementer that is built from halfadders (figure 3.3). Depending on a carry-in signal $c_{i n}$, the circuit increments $a$ by one or passes it unmodified. The carry chain incrementer is defined recursively for bitwidths $n$. We use indices to refer to a specific circuit instance, e.g., carry_chain_inc $c_{n-1}$ denotes an $n-1$ bit carry chain incrementer. The index is omitted if the circuit width is clear from the context. Later, indices will also be used for circuits that have more than one output (e.g., $A d d_{o v f}$ ).

Circuit 3 Input $a \in \mathbb{B}^{n}, c_{\text {in }} \in \mathbb{B}$, output $s \in \mathbb{B}^{n+1}$. For $n=1$,

$$
\text { carry_chain_inc } c_{1}:=\text { halfadder }\left(a[0], c_{i n}\right) .
$$

For $n>1$, let $C:=$ carry_chain_inc $n_{n-1}\left(a[n-2,0], c_{i n}\right)$,

$$
\text { carry_chain_inc }_{n}:=\text { halfadder }(a[n-1], C[n-1]) \circ C[n-2,0] .
$$

Lemma 17 (Carry Chain Inc Correct) For all $a \in \mathbb{B}^{n}, c_{i n} \in \mathbb{B}$ :

$$
\left\langle\text { carry_chain_inc }\left(a, c_{i n}\right)\right\rangle=\langle a\rangle+c_{i n} .
$$



Figure 3.3: Carry chain incrementer

Proof. The proof is by induction on the width $n$ of the input. The induction base $n=1$ is resolved by (grind).

In the induction step $n+1$, let $C:=$ carry_chain_inc $c_{n}\left(a[n-1,0], c_{i n}\right)$. The claim is

$$
\left\langle\text { carry_chain_inc } c_{n+1}\left(a, c_{i n}\right)\right\rangle=\langle a\rangle+c_{i n}
$$

Expanding the incrementer definition leads to

$$
\langle\text { halfadder }(a[n], C[n]) \circ C[n-1,0]\rangle=\langle a\rangle+c_{i n},
$$

By lemma 4, this is equivalent to

$$
\langle\text { halfadder }(a[n], C[n])\rangle \cdot 2^{n}+\langle C[n-1,0]\rangle=\langle a\rangle+c_{i n} .
$$

The halfadder is correct by lemma 15 :

$$
(a[n]+C[n]) \cdot 2^{n}+\langle C[n-1,0]\rangle=\langle a\rangle+c_{i n}
$$

By lemma 3 and the induction hypothesis, $\langle C\rangle=\langle a[n-1,0]\rangle+c_{i n}$, we have

$$
a[n] \cdot 2^{n}+\langle a[n-1,0]\rangle+c_{i n}=\langle a\rangle+c_{i n},
$$

which is true by lemma 3.

Incrementer. When we use an incrementer, we do not care about its actual implementation-as long as the circuit satisfies the above correctness criterion lemma 17. In the following, we will use inc to refer to the carry chain incrementer, but any other correct incrementer implementation could also be used.

$$
\text { inc }:=\text { carry_chain_inc. }
$$



Figure 3.4: Carry chain adder

### 3.3 Carry Chain Adder

By substituting the halfadders in the carry chain incrementer by fulladders, we obtain a carry chain adder (figure 3.4). It adds two input numbers $a$ and $b$, and a carry-in $c_{i n}$.

Circuit 4 Inputs $a, b \in \mathbb{B}^{n}, c_{\text {in }} \in \mathbb{B}$, output $s \in \mathbb{B}^{n+1}$. For $n=1$,

$$
\text { carry_chain }_{1}:=\text { fulladder }\left(a[0], b[0], c_{i n}\right) .
$$

For $n>1$, let $C:=$ carry_chain $_{n-1}\left(a[n-2,0], b[n-2,0], c_{i n}\right)$,

$$
\operatorname{carry}_{\_} \text {chain } n_{n}:=\text { fulladder }(a[n-1], b[n-1], C[n-1]) \circ C[n-2,0] .
$$

Lemma 18 (Carry Chain Correct) For all $a, b \in \mathbb{B}^{n}, c_{i n} \in \mathbb{B}$ :

$$
\left\langle\text { carry_chain }_{n}\left(a, b, c_{i n}\right)\right\rangle=\langle a\rangle+\langle b\rangle+c_{i n} .
$$

Proof. The proof is the same as the carry chain incrementer proof, with halfadder replaced by fulladder.

Adder. Analogous to $i n c$, we define $a d d$ to refer to any adder implementation satisfying lemma 18. We use the carry chain adder.

$$
\text { add }:=\text { carry_chain. }
$$

### 3.4 Compound Adder

Another adder implementation is the compound adder (figure 3.5). The compound adder features no carry-in bit, but computes both $a+b$ and $a+b+1$. The compound adder is used in the floating point rounder.


Figure 3.5: Compound adder

Circuit 5 Inputs $a, b \in \mathbb{B}^{n}$, outputs $s_{0}, s_{1} \in \mathbb{B}^{n+1}$. For $n=1$,

$$
\begin{aligned}
& \text { compound }_{1, s_{1}}:=(a[0] \vee b[0]) \circ \neg(a[0] \oplus b[0]), \\
& \text { compound }_{1, s_{0}}:=(a[0] \wedge b[0]) \circ(a[0] \oplus b[0]) .
\end{aligned}
$$

For $n>1$, let $m:=\left\lfloor\frac{n}{2}\right\rfloor, C H:=$ compound $_{n-m}(a[n-1, m\rfloor, b[n-1, m]), C L:=$ com-$\operatorname{pound}_{m}(a[m-1,0], b[m-1,0])$,

$$
\begin{aligned}
& \text { compound }_{n, s_{1}}:=\left\{\begin{array}{ll}
C H_{s_{1}} & \text { if } C L_{s_{1}}[m] \\
C H_{s_{0}} & \text { else }
\end{array}\right\} \circ C L_{s_{1}}[m-1,0], \\
& \text { compound }_{n, s_{0}}:=\left\{\begin{array}{ll}
C H_{s_{1}} & \text { if } C L_{s_{0}}[m] \\
C H_{s_{0}} & \text { else }
\end{array}\right\} \circ C L_{s_{0}}[m-1,0] .
\end{aligned}
$$

Lemma 19 (Compound Adder Correct) For all $a, b \in \mathbb{B}^{n}$ :
(i) $\left\langle\right.$ compound $\left._{s_{1}}(a, b)\right\rangle=\langle a\rangle+\langle b\rangle+1$,
(ii) $\left\langle\right.$ compound $\left._{s_{0}}(a, b)\right\rangle=\langle a\rangle+\langle b\rangle$.

Proof. The base $n=1$ of the induction on $n$ is resolved by case analysis on $a[0]$ and $b[0]$ using (grind).

In the induction step $n$, let $m:=\left\lfloor\frac{n}{2}\right\rfloor, C H:=$ compound $_{n-m}(a[n-1, m], b[n-1, m\rfloor)$, $C L:=$ compound $_{m}(a[m-1,0], b[m-1,0])$. The four induction hypotheses are

1. $\left\langle C H_{s_{1}}\right\rangle=\langle a[n-1, m]\rangle+\langle b[n-1, m]\rangle+1$,
2. $\left\langle C H_{s_{0}}\right\rangle=\langle a[n-1, m]\rangle+\langle b[n-1, m]\rangle$,
3. $\left\langle C L_{s_{1}}\right\rangle=\langle a[m-1,0]\rangle+\langle b[m-1,0]\rangle+1$,
4. $\left\langle C L_{s_{0}}\right\rangle=\langle a[m-1,0]\rangle+\langle b[m-1,0]\rangle$.
(i) In the case $C L_{s_{1}}[m]$, we show

$$
\begin{aligned}
\left\langle C H_{s_{1}} \circ C L_{s_{1}}[m-1,0]\right\rangle & =\langle a[n-1, m] \circ a[m-1,0]\rangle \\
& +\langle b[n-1, m] \circ b[m-1,0]\rangle+1
\end{aligned}
$$



Figure 3.6: Generic adder $A d d$

By lemma 4, this reduces to

$$
\begin{aligned}
\left\langle C H_{s_{1}}\right\rangle \cdot 2^{m}+\left\langle C L_{s_{1}}[m-1,0]\right\rangle & =\langle a[n-1, m]\rangle \cdot 2^{m}+\langle a[m-1,0]\rangle \\
& +\langle b[n-1, m]\rangle \cdot 2^{m}+\langle b[m-1,0]\rangle+1
\end{aligned}
$$

By lemma $3,\left\langle C L_{s_{1}}[m-1,0]\right\rangle=\left\langle C L_{s_{1}}\right\rangle-C L_{s_{1}}[m] \cdot 2^{m}$, and $C L_{s_{1}}[m]=\mathbf{1}$ :

$$
\begin{aligned}
\left\langle C H_{s_{1}}\right\rangle \cdot 2^{m}+\left\langle C L_{s_{1}}\right\rangle-2^{m} & =\langle a[n-1, m]\rangle \cdot 2^{m}+\langle a[m-1,0]\rangle \\
& +\langle b[n-1, m]\rangle \cdot 2^{m}+\langle b[m-1,0]\rangle+1
\end{aligned}
$$

We finish this case by applying the induction hypotheses:

$$
\begin{aligned}
(\langle a[n-1, m]\rangle+\langle b[n-1, m]\rangle+1) \cdot 2^{m} & \\
+(\langle a[m-1,0]\rangle+\langle b[m-1,0]\rangle+1)-2^{m} & =\langle a[n-1, m]\rangle \cdot 2^{m}+\langle a[m-1,0]\rangle \\
& +\langle b[n-1, m]\rangle \cdot 2^{m}+\langle b[m-1,0]\rangle+1
\end{aligned}
$$

The proofs for the case $\neg C L_{s_{1}}[m]$ and part (ii) are analogous.

### 3.5 Generic Adder

The adders presented so far do not operate on two's complement numbers. In this section, we build an adder $A d d$ that provides the signals $s, c_{o u t}, n e g$, and $o v f$. The correctness lemmas for binary and two's complement operands are proved. In the construction of $A d d$, we use the adder $a d d$ from section 3.3. We assume that $a d d$ is correct, this holds by lemma 18:

$$
\left\langle a d d\left(a, b, c_{i n}\right)\right\rangle=\langle a\rangle+\langle b\rangle+c_{i n}
$$

In the following, let $s$ denote $a d d\left(a, b, c_{i n}\right)$.

Circuit 6 Given a circuit add, inputs $a, b \in \mathbb{B}^{n}$, $c_{\text {in }} \in \mathbb{B}$, outputs $s \in \mathbb{B}^{n}$, neg, ov $f, c_{\text {out }} \in \mathbb{B}$.

$$
\begin{aligned}
& A d d_{s}:=s[n-1,0] \\
& A d d_{c_{o u t}}:=s[n] \\
& A d d_{n e g}:=s[n] \oplus a[n-1] \oplus b[n-1] \\
& A d d_{o v f}:=A d d_{n e g} \oplus s[n-1]
\end{aligned}
$$

We first show that only certain combinations of the bits $s[n], a[n-1]$, and $b[n-1]$ occur:
Lemma 20 For all $a, b \in \mathbb{B}^{n}, c_{i n} \in \mathbb{B}$ :

$$
\text { (i) } s[n] \Longrightarrow a[n-1] \vee b[n-1]
$$

(ii) $\quad s[n] \Longleftarrow a[n-1] \wedge b[n-1]$.

Proof. (i) The most significant bit of $s$ is set, and $\langle s\rangle=\langle a\rangle+\langle b\rangle+c_{i n}$. We therefore conclude by lemma 5

$$
\langle a\rangle+\langle b\rangle+c_{i n} \geq 2^{n}
$$

If $a[n-1]=b[n-1]=\mathbf{0}$, we have $\langle a\rangle<2^{n-1}$ and $\langle b\rangle<2^{n-1}$ by lemma 5 . The sum of both is less than $2^{n}-1$, which is a contradiction. Hence, we have

$$
a[n-1]=\mathbf{1} \vee b[n-1]=\mathbf{1}
$$

(ii) By lemma 5, we have

$$
\begin{aligned}
\langle a\rangle & \geq 2^{n-1} \\
\langle b\rangle & \geq 2^{n-1} \\
\langle a\rangle+\langle b\rangle+c_{i n}=\langle s\rangle & \geq 2^{n}
\end{aligned}
$$

Again by lemma 5, we conclude

$$
s[n]=\mathbf{1}
$$

The sum $[a]+[b]+c_{i n}$ cannot always be represented within the $n$ bits of $A d d_{s}$. The next lemma provides the representation of the sum in $n+1$ bits.

Lemma 21 For all $a, b \in \mathbb{B}^{n}$, $c_{i n} \in \mathbb{B}$ :

$$
\left[A d d_{n e g} \circ s[n-1,0]\right]=[a]+[b]+c_{i n} .
$$

Proof. After applying lemma 11 on the left side and lemma 10 on the right, we have to show

$$
-A d d_{n e g} \cdot 2^{n}+\langle s[n-1,0]\rangle=-a[n-1] \cdot 2^{n}+\langle a\rangle-b[n-1] \cdot 2^{n}+\langle b\rangle+c_{i n} .
$$

By lemma 3, $\langle s[n-1,0]\rangle=\langle s\rangle-2^{n} \cdot s[n]$ :

$$
-A d d_{n e g} \cdot 2^{n}+\langle s\rangle-2^{n} \cdot s[n]=-a[n-1] \cdot 2^{n}+\langle a\rangle-b[n-1] \cdot 2^{n}+\langle b\rangle+c_{i n} .
$$

$\langle s\rangle=\langle a\rangle+\langle b\rangle+c_{i n}$ gives

$$
-A d d_{n e g} \cdot 2^{n}-2^{n} \cdot s[n]=-a[n-1] \cdot 2^{n}-b[n-1] \cdot 2^{n}
$$

We expand $A d d_{n e g}$ and divide by $-2^{n}$ :

$$
(s[n] \oplus a[n-1] \oplus b[n-1])+s[n]=a[n-1]+b[n-1] .
$$

By case analysis of all combinations of $s[n], a[n-1]$, and $b[n-1]$ that are permissible by lemma 20 , this equation holds.

A two's complement number represented by a bitvector of width $n+1$ can be represented in $n$ bits if the two topmost bits are the same:

Lemma 22 For all $s \in \mathbb{B}^{n+1}$ :

$$
[s] \in T_{n} \Longleftrightarrow s[n]=s[n-1]
$$

Proof. We expand the definition of $T_{n}$ and apply lemma 3. We show

$$
-2^{n-1} \leq\left(-2^{n} \cdot s[n]+\langle s[n-1,0]\rangle\right) \leq 2^{n-1}-1 \Longleftrightarrow s[n]=s[n-1]
$$

By lemma 5, this is equivalent to the following, where $S:=\langle s[n-1,0]\rangle$ :

$$
-2^{n-1} \leq\left(-2^{n} \cdot s[n]+S\right) \leq 2^{n-1}-1 \Longleftrightarrow s[n]=\left(S \geq 2^{n-1}\right)
$$

This statement is verified by case analysis on $s[n]$ and the truth value of $\left(S \geq 2^{n-1}\right)$.
We now use the above lemmas to show the correctness of $A d d$ :
Lemma 23 (Adder Correct) For all $a, b \in \mathbb{B}^{n}, c_{i n} \in \mathbb{B}$, let binsum $:=\langle a\rangle+\langle b\rangle+c_{i n}$, intsum $:=[a]+[b]+c_{i n}$, and assume $\left\langle a d d\left(a, b, c_{i n}\right)\right\rangle=\langle a\rangle+\langle b\rangle+c_{i n}:$

$$
\text { (i) }\left\langle A d d_{c_{\text {out }}} \circ A d d_{s}\right\rangle=\text { binsum }
$$

(ii) $\quad$ Add $d_{o v f} \Longleftrightarrow$ intsum $\notin T_{n}$,
(iii) intsum $\in T_{n} \Longrightarrow\left[A d d_{s}\right]=$ intsum,
(iv) Add $\quad \Longleftrightarrow$ intsum $<0$.
where $T_{n}=\left\{-2^{n-1}, \ldots, 2^{n-1}-1\right\}$ denotes the range of the $n$-bit two's complement numbers.

Proof. (i) is trivial, since $A d d_{\text {cout }}$ and $A d d_{s}$ are re-concatenated after having been split in the definition of $A d d$.
(ii) We prove

$$
A d d_{n e g} \oplus s[n-1] \Longleftrightarrow[a]+[b]+c_{i n} \notin T_{n}
$$

By lemma 21, we have

$$
A d d_{n e g} \oplus s[n-1] \Longleftrightarrow\left[A d d_{n e g} \circ s[n-1,0]\right] \notin T_{n}
$$

The proof is finished by lemma 22, which yields

$$
A d d_{n e g} \oplus s[n-1] \Longleftrightarrow A d d_{n e g} \neq s[n-1]
$$

(iii) We proceed to prove

$$
[s[n-1,0]]=[a]+[b]+c_{i n} .
$$

We apply lemma 10 :

$$
\langle s[n-1,0]\rangle-2^{n} \cdot s[n-1]=\langle a\rangle-2^{n} \cdot a[n-1]+\langle b\rangle-2^{n} \cdot b[n-1]+c_{i n} .
$$

$\langle s[n-1,0]\rangle=\langle s\rangle-2^{n} \cdot s[n]$ by lemma 3:

$$
\langle s\rangle-2^{n} \cdot s[n]-2^{n} \cdot s[n-1]=\langle a\rangle-2^{n} \cdot a[n-1]+\langle b\rangle-2^{n} \cdot b[n-1]+c_{i n} .
$$

Using $\langle s\rangle=\langle a\rangle+\langle b\rangle+c_{i n}$ and simplifying gives

$$
s[n]+s[n-1]=a[n-1]+b[n-1] .
$$

By assumption, we have $[a]+[b]+c_{i n} \in T_{n}$, and hence, by part (ii), $A d d_{n e g} \oplus s[n-1]=\mathbf{0}$. By the definition of $A d d_{n e g}$, we equivalently have $s[n-1]=s[n] \oplus a[n-1] \oplus b[n-1]$. We therefore have to prove

$$
s[n]+(s[n] \oplus a[n-1] \oplus b[n-1])=a[n-1]+b[n-1] .
$$

This statement has already been proved at the end of the proof of lemma 21.
(iv) We show

$$
A d d_{n e g} \Longleftrightarrow[a]+[b]+c_{i n}<0
$$

With lemma 21, we have

$$
A d d_{n e g} \Longleftrightarrow\left[A d d_{n e g} \circ s[n-1,0]\right]<0
$$

By using lemma 12 , this is equal to

$$
A d d_{n e g} \Longleftrightarrow\left(A d d_{n e g} \circ s[n-1,0]\right)[n]
$$

which is trivially true.

### 3.6 Carry Save Adder

A carry save adder (also called 3/2-adder) sums up three input numbers $a, b, c$ and yields two outputs $s$ and $t$ (figure 3.7). The sum of the inputs is equal to the sum of the outputs. This adder is used in wallace tree multipliers and the floating point rounding and multiplication units, and has a constant delay independent of the input width.


Figure 3.7: Carry save adder (3/2-adder)

Circuit 7 Inputs $a, b, c \in \mathbb{B}^{n}$, outputs $s \in \mathbb{B}^{n}, t \in \mathbb{B}^{n+1}$. For $n=1$, let $F:=$ fulladder $(a[0]$, $b[0], c[0])$,

$$
\begin{aligned}
& \operatorname{carry}_{2} \text { save }_{1, s}:=F[0], \\
& \text { carry_save }_{1, t}:=F[1] \circ \mathbf{0} .
\end{aligned}
$$

For $n>1$, let $(S, T):=$ carry_save $_{n-1,(s, t)}(a[n-2,0], b[n-2,0], c[n-2,0])$, and $F:=$ fulladder $(a[n-1], b[n-1], c[n-1])$,

$$
\begin{aligned}
& \operatorname{carry}_{\_} \operatorname{save}_{n, s}:=F[0] \circ S, \\
& \text { carry_save }_{n, t}:=F[1] \circ T .
\end{aligned}
$$

Lemma 24 (CARry Save Correct) For all $a, b, c \in \mathbb{B}^{n}$, let $C=\operatorname{carry\_ save~}(a, b, c)$ :
(i) $\langle a\rangle+\langle b\rangle+\langle c\rangle=\left\langle C_{s}\right\rangle+\left\langle C_{t}\right\rangle$,
(ii) $[a]+[b]+[c]=\left[C_{s}\right]+\left[C_{t}\right]$.

Proof. (i) We induct on $n$. The induction base $n=1$ is trivial by (grind).
In the induction step for $n+1$, let $(S, T):=$ carry_save $_{n,(s, t)}(a[n-1,0], b[n-1,0], c[n-1,0])$, and $F=$ fulladder $(a[n], b[n], c[n])$. The induction claim is

$$
\langle a\rangle+\langle b\rangle+\langle c\rangle=\langle F[0] \circ S\rangle+\langle F[1] \circ T\rangle
$$

By lemma 3, we have

$$
\begin{gathered}
(a[n]+b[n]+c[n]) \cdot 2^{n}+ \\
\langle a[n-1,0]\rangle+\langle b[n-1,0]\rangle+\langle c[n-1,0]\rangle=f a[0] \cdot 2^{n}+\langle S\rangle+f a[1] \cdot 2^{n+1}+\langle T\rangle .
\end{gathered}
$$

We can simplify this by the induction hypothesis:

$$
(a[n]+b[n]+c[n]) \cdot 2^{n}=(F[0]+2 \cdot F[1]) \cdot 2^{n}
$$

This holds because of the correctness of the fulladder (lemma 16).
The proof for (ii) is analogous.


Figure 3.8: Arithmetic Unit $a d d \_s u b$

### 3.7 Arithmetic Unit, Subtraction

The arithmetic unit $a d d \_s u b$ can add and subtract two's complement numbers. It is constructed from $A d d$. The signal $s u b$ is used to select addition or subtraction. We define $\pm_{x}$ as

$$
\pm_{x}:= \begin{cases}+ & \text { if } x=0 \\ - & \text { if } x=1\end{cases}
$$

Circuit 8 Inputs $a, b \in \mathbb{B}^{n}$, sub $\in \mathbb{B}$, outputs $s \in \mathbb{B}^{n}$, neg, ov $f \in \mathbb{B}$. ${ }^{1}$

$$
a d d_{\_} s u b_{n}:=A d d_{n}\left(a, b \oplus s u b^{n}, s u b\right) .
$$

Lemma 25 (Add Sub Correct) For all $a, b \in \mathbb{B}^{n}$, sub $\in \mathbb{B}$, let intsum $=[a] \pm_{\text {sub }}[b]$ :
(i) add_sub ovf $\Longleftrightarrow$ (intsum $\notin T_{n}$ ),
(ii) $\quad$ intsum $\in T_{n} \Longrightarrow\left[a d d_{-} s u b_{s}\right]=$ intsum,
(iii) add_sub ${ }_{n e g} \Longleftrightarrow($ intsum $<0)$.

Proof. Using properties of mod arithmetic and the correctness of $\operatorname{Add}$ (lemma 23), it suffices to show

$$
\left(\left\langle b \oplus s u b^{n}\right\rangle+s u b\right) \bmod 2^{n}=(-1)^{s u b} \cdot\langle b\rangle
$$

This is trivial by lemmas 7 and 9 .

Subtraction. We define the circuit sub to be a circuit $a d d \_s u b$ that always subtracts. Lemma 25 will also be used for the correctness of sub.

$$
\operatorname{sub}(a, b):=a d d \_s u b(a, b, \mathbf{1}) .
$$

[^3]

Figure 3.9: Absolute value computation

### 3.8 Absolute Value

The next circuit is used to compute the absolute value of a two's complement number (figure 3.9). If the input is positive, we pass the lower bits. Otherwise, we compute the two's complement inverse by inverting and incrementing the lower bits.

Circuit 9 Let $n>1$. Input $b \in \mathbb{B}^{n}$, output abs $\in \mathbb{B}^{n-1}$.

$$
a b s(b):= \begin{cases}\operatorname{inc}(\neg b[n-2,0], \mathbf{1})[n-2,0] & \text { if } b[n-1] \\ b[n-2,0] & \text { else. }\end{cases}
$$

Since the most negative two's complement number $-2^{n-1}$ does not have a corresponding positive value that is representable in $n$ bits, we exclude this number in the correctness criterion.

Lemma 26 (Abs CORRECT) For all $b \in \mathbb{B}^{n}, b \neq 1 \circ \mathbf{0}^{n-2}$ :

$$
\langle a b s(b)\rangle=|[b]| .
$$

Proof. The case $b[n-1]=\mathbf{0}$ is trivial.
In the case $b[n-1]=\mathbf{1},[b]<0$ by lemma 12 , and we have to prove

$$
\langle i n c(\neg b[n-2,0], 1)[n-2,0]\rangle=-[b] .
$$

After applying lemmas 7 and 13, we have

$$
\langle i n c(\neg b[n-2,0], 1)\rangle \bmod 2^{n-1}=[\neg b]+1
$$

inc is correct by lemma 17, and lemma 11 yields:

$$
(\langle\neg b[n-2,0]\rangle+1) \bmod 2^{n-1}=\langle\neg b[n-2,0]\rangle+1 .
$$

This is trivial for $\langle\neg b[n-2,0]\rangle+1<2^{n-1}$. The only case where this does not hold is $b[n-2,0]=$ $\mathbf{0}^{n-1}$, which is excluded by the lemma's prerequisites.


Figure 3.10: Linear multiplier mult $_{n, m}$

### 3.9 Multiplier

In this section, we construct a linear array multiplier. Multiplication with a single bit is implemented by a bitvector-AND. The single bit products are added up by a linear adder chain (figure 3.10).

Lemma 27 For all $a \in \mathbb{B}^{n}, b \in \mathbb{B}$ :

$$
\left\langle a \wedge b^{n}\right\rangle=\langle a\rangle \cdot b
$$

Proof. Trivial.

Circuit 10 Let $n, m \in \mathbb{N}^{+}$, inputs $a \in \mathbb{B}^{n}, b \in \mathbb{B}^{m}$, output $p \in \mathbb{B}^{n+m}$. For $m=1$,

$$
\text { mult_lin }_{n, 1}:=\mathbf{0} \circ\left(a \wedge b[0]^{n}\right)
$$

For $m>1$, let $M:=$ mult_lin $_{n, m-1}(a, b[m-2,0])$,

$$
\text { mult_lin }_{n, m}:=a d d_{n}\left(\left(a \wedge b[m-1]^{n}\right), M[n+m-2, m-1], \mathbf{0}\right) \circ M[m-2,0] .
$$

Lemma 28 (MUltiplier Correct) For all $n, m \in \mathbb{N}^{+}, a \in \mathbb{B}^{n}, b \in \mathbb{B}^{m}$ :

$$
\langle\text { mult_lin }(a, b)\rangle=\langle a\rangle \cdot\langle b\rangle
$$

Proof. We induct on $m$. The induction base $m=1$ is trivial with lemma 27.
In the induction step $m+1$, let $M:=$ mult_lin $_{n, m}(a, b[m-1,0])$. We show

$$
\left\langle a d d\left(\left(a \wedge b[m]^{n}\right), M[n+m-1, m], \mathbf{0}\right) \circ M[m-1,0]\right\rangle=\langle a\rangle \cdot\langle b\rangle
$$



Figure 3.11: Decoder

We apply lemma 4 :

$$
\left\langle a d d\left(\left(a \wedge b[m]^{n}\right), M[n+m-1, m], \mathbf{0}\right)\right\rangle \cdot 2^{m}+\langle M[m-1,0]\rangle=\langle a\rangle \cdot\langle b\rangle
$$

By lemma 18, add correctly sums up its inputs:

$$
\left(\left\langle a \wedge b[m]^{n}\right\rangle+\langle M[n+m-1, m]\rangle\right) \cdot 2^{m}+\langle M[m-1,0]\rangle=\langle a\rangle \cdot\langle b\rangle .
$$

Another application of lemma 4 yields

$$
\left\langle a \wedge b[m]^{n}\right\rangle \cdot 2^{m}+\langle M[n+m-1,0]\rangle=\langle a\rangle \cdot\langle b\rangle
$$

By lemma 27 and the induction hypothesis, this is

$$
\langle a\rangle \cdot b[m] \cdot 2^{m}+\langle a\rangle \cdot\langle b[m-1,0]\rangle=\langle a\rangle \cdot\langle b\rangle .
$$

This is true by lemma 3.

### 3.10 Decoder

A decoder (figure 3.11) converts numbers in binary format to unary format, that is, it calculates the function

$$
\mathbb{B}^{n} \rightarrow \mathbb{B}^{2^{n}}: b \mapsto \mathbf{0}^{2^{n}-\langle b\rangle-1} \mathbf{1 0}^{\langle b\rangle}
$$

Circuit 11 Input $b \in \mathbb{B}^{n}$, output $d \in \mathbb{B}^{2^{n}}$. For $n=1$,

$$
d e c_{1}=b[0] \circ \neg b[0]
$$

For $n>1$, let $k:=\left\lceil\frac{n}{2}\right\rceil, D H:=\operatorname{dec}_{n-k}(b[n-1, k]), D L:=\operatorname{dec}_{k}(b[k-1,0])$, and $y<2^{n}$,

$$
d e c_{n}[y]:=D H\left[y \operatorname{div} 2^{k}\right] \wedge D L\left[y \bmod 2^{k}\right]
$$

Lemma 29 (Decoder Correct) For all $b \in \mathbb{B}^{n}, y<2^{n}$ :

$$
\operatorname{dec}(b)[y] \Longleftrightarrow\langle b\rangle=y
$$

Proof. We induct on $n$. The induction base $n=1$ is solved by (grind).
The induction hypothesis for $n>1$ is

$$
\text { For all } 0<j<n, b_{j} \in \mathbb{B}^{j}, y_{j}<2^{j}: d e c_{j}\left(b_{j}\right)\left[y_{j}\right] \Longleftrightarrow\left\langle b_{j}\right\rangle=y_{j}
$$

Let $k=\left\lceil\frac{n}{2}\right\rceil$. We need induction hypotheses for two values of $j$ :

1. $j=k$ : This is permissible, since $0<k=\left\lceil\frac{n}{2}\right\rceil<n$. In this case, $b_{j}=b[k-1,0]$ and $y_{j}=y \bmod 2^{k}$.

$$
\operatorname{dec}_{k}(b[k-1,0])\left[y \bmod 2^{k}\right] \Longleftrightarrow\langle b[k-1,0]\rangle=y \bmod 2^{k} .
$$

2. $j=n-k$ : Here, $n-\left\lceil\frac{n}{2}\right\rceil<n$ holds as well. $b_{j}=b[n-1, k]$ and $y_{j}=y \operatorname{div} 2^{k}$.

$$
d e c_{n-k}(b[n-1, k])\left[y \operatorname{div} 2^{k}\right] \Longleftrightarrow\langle b[n-1, k]\rangle=y \operatorname{div} 2^{k} .
$$

For the instantiation, we have to show that $y \operatorname{div} 2^{k}<2^{n-k}$, i.e., that we stay within the width of the bitvector. We successively use: div is defined via $\lfloor\cdot\rfloor, \forall x:\lfloor x\rfloor \leq x$, and $y<2^{n}$ :

$$
y \operatorname{div} 2^{k}=\left\lfloor\frac{y}{2^{k}}\right\rfloor \leq \frac{y}{2^{k}}<\frac{2^{n}}{2^{k}}=2^{n-k}
$$

We start the proof by expanding the dec definition.

$$
d e c_{n-k}(b[n-1, k])\left[y \operatorname{div} 2^{k}\right] \wedge \operatorname{dec} c_{k}(b[k-1,0])\left[y \bmod 2^{k}\right] \Longleftrightarrow\langle b\rangle=y .
$$

After applying the induction hypotheses, it remains to show

$$
\left(\langle b[n-1, k]\rangle=y \operatorname{div} 2^{k}\right) \wedge\left(\langle b[k-1,0]\rangle=y \bmod 2^{k}\right) \Longleftrightarrow\langle b\rangle=y
$$

By properties of bitvector extraction (lemmas 7 and 8), we have

$$
\left(\langle b\rangle \operatorname{div} 2^{k}=y \operatorname{div} 2^{k}\right) \wedge\left(\langle b\rangle \bmod 2^{k}=y \bmod 2^{k}\right) \Longleftrightarrow\langle b\rangle=y
$$

The $\Leftarrow$ part is trivial. The $\Rightarrow$ part holds because the decompositions of $y$ and $\langle b\rangle$ into div and mod components are unambiguous. This is proved by using the PVS mod and div lemmas.

### 3.11 Halfdecoder

A halfdecoder (figure 3.12) converts binary numbers to half unary notation:

$$
\mathbb{B}^{n} \rightarrow \mathbb{B}^{2^{n}}: b \mapsto \mathbf{0}^{2^{n}-\langle b\rangle} \mathbf{1}^{\langle b\rangle}
$$

Circuit 12 Input $b \in \mathbb{B}^{n}$, output $d \in \mathbb{B}^{2^{n}}$. For $n=1$,

$$
h d e c_{1}:=\mathbf{0} \circ b[0]
$$

For $n>1$, let $H D=h \operatorname{dec}_{n-1}(b[n-2,0])$,

$$
h d e c_{n}:=\left(b[n-1]^{n} \wedge H D\right) \circ\left(b[n-1]^{n} \vee H D\right)
$$

Lemma 30 (HAlFDECODER CORRECT) For all $b \in \mathbb{B}^{n}, y<2^{n}$ :

$$
h d e c(b)[y] \Longleftrightarrow y<\langle b\rangle .
$$



Figure 3.12: Halfdecoder

Proof. Again, the base of the induction on $n$ can be proved by (grind). We prove the statement for $n+1$ with an induction hypothesis for $n$.

In the case $y<2^{n}$, we have to show

$$
b[n] \vee h d e c_{n}(b[n-1,0])[y] \Longleftrightarrow y<\langle b\rangle .
$$

By the induction hypothesis, this is

$$
b[n] \vee(y<\langle b[n-1,0]\rangle) \Longleftrightarrow y<\langle b\rangle
$$

The $\Leftarrow$ part is trivial for $b[n]=\mathbf{1}$.
For $\Rightarrow$ and $b[n]=\mathbf{1}$, we have $2^{n} \leq\langle b\rangle$ by lemma 5, and therefore, $y<2^{n} \leq\langle b\rangle$.
In the case $b[n]=\mathbf{0}$, we have $\langle b[n-1,0]\rangle=\langle b\rangle$.
In the case $y \geq 2^{n}$, we show

$$
b[n] \wedge h d e c_{n}(b[n-1,0])\left[y-2^{n}\right] \Longleftrightarrow y<\langle b\rangle
$$

We apply the induction hypothesis:

$$
b[n] \wedge\left(y-2^{n}<\langle b[n-1,0]\rangle\right) \Longleftrightarrow y<\langle b\rangle
$$

For $b[n]=1$, we have $\langle b[n-1,0]\rangle=\langle b\rangle-2^{n}$ by lemma 3 .
In the other case $b[n]=\mathbf{0}$, the $\Rightarrow$ part is trivial by propositional logic reasoning.
For $\Leftarrow$ and $b[n]=\mathbf{0}$, we show that $y<\langle b\rangle$ is false. This holds by lemma 5: $\langle b\rangle<2^{n} \leq y$.

### 3.12 Encoder

The encoder computes the inverse of the decoder's function: for unary number inputs, it outputs the binary encoding (figure 3.13). The output is not defined for inputs that are not in unary form. The encoder implementation and correctness proof are due to Sven Beyer. We describe the implementation here because it is part of the VAMP basic circuits library. The proof is omitted here.


Figure 3.13: Encoder

Circuit 13 Input $b \in \mathbb{B}^{2 n}$, outputs or $\in \mathbb{B}$, enc $\in \mathbb{B}^{n}$. For $n=1$,

$$
\begin{aligned}
& e n c_{1, \text { or }}:=b[1] \vee b[0], \\
& e n c_{1, e n c}:=b[1] .
\end{aligned}
$$

For $n>1$, let $E H:=e n c_{n-1}\left(b\left[2^{n}-1,2^{n-1}\right]\right), E L:=e n c_{n-1}\left(b\left[2^{n-1}-1,0\right]\right)$, and $0 \leq i \leq n-2$,

$$
\begin{aligned}
& e n c_{n, o r}:=E H_{o r} \vee E L_{o r}, \\
& e n c_{n, e n c}[i]:=E H_{e n c}[i] \vee E L_{e n c}[i] \\
& e n c_{n, e n c}[n-1]:=E H_{o r} .
\end{aligned}
$$

Lemma 31 (Encoder Correct) For all $y<2^{n}$ :

$$
\left\langle e n c_{e n c}\left(\mathbf{0}^{2^{n}-y-1} \mathbf{1 0} 0^{y}\right)\right\rangle=y
$$

The signal $e n c_{o r}$ is only used internally and therefore not included in the correctness criterion.

### 3.13 Leading Zero Counter

The leading zero counter counts the number of zeros at the beginning of a bit string (figure 3.14). This section is a revised version of section 3 from [BJK01].

Before we can prove the leading zero counter implementation correct, we need a formal notion of 'leading zeros'. We define a function lzero on bitvectors of length $n$ :

$$
\operatorname{lzero}(b):=\max \left\{i \in \mathbb{N} \mid i=0 \vee\left(i \leq n \wedge b[n-1, n-i]=\mathbf{0}^{i}\right)\right\}
$$

We start with some lemmas on the lzero function. All these lemmas are fairly obvious, but their proofs are technically complicated in PVS. We omit the proofs.


Figure 3.14: Leading zero counter

Lemma 32 lzero essentially depends on the position of the first 1 -bit. ${ }^{2}$ Let $1 \leq i \leq n-2$.
(1) $\quad l z e r o(b)=0 \Longleftrightarrow b=\mathbf{1} \circ b[n-2,0]$,
(2) $\quad$ lzero $(b)=i \Longleftrightarrow b=\mathbf{0}^{i} \circ \mathbf{1} \circ b[n-i-2,0]$,
(3) $\quad l \operatorname{zero}(b)=n-1 \Longleftrightarrow b=\mathbf{0}^{n-1} \circ \mathbf{1}$,
(4) $\quad l z \operatorname{zero}(b)=n \Longleftrightarrow b=\mathbf{0}^{n}$.

Lemma 33 lzero is bounded by $n$ :

$$
l z e r o(b) \leq n
$$

Lemma 34 Leading zero concatenation: For all $l \in \mathbb{N}^{+}$:

$$
l z e r o\left(\mathbf{0}^{l} \circ b\right)=l+\operatorname{lzero}(b)
$$

Circuit 14 Let $n \in \mathbb{N}$. Input $b \in \mathbb{B}^{2^{n}}$, output $y \in \mathbb{B}^{n+1}$. For $n=0$,

$$
l z_{0}:=\neg b[0] .
$$

For $n>0$, let $L Z H=l z_{n-1}\left(b\left[2^{n}-1,2^{n-1}\right]\right)$ and $L Z L=l z_{n-1}\left(b\left[2^{n-1}-1,0\right]\right)$,

$$
l z_{n}:= \begin{cases}L Z L[n-1] \circ \neg L Z L[n-1] \circ L Z L[n-2,0] & \text { if } L Z H[n-1] \\ \mathbf{0} \circ L Z H & \text { else. }\end{cases}
$$

The implementation is correct if it counts the number of leading zeros for all inputs correctly:

Lemma 35 (LZero Correct) For all $n \in \mathbb{N}_{0}, b \in \mathbb{B}^{2^{n}}$ :

$$
\langle l z(b)\rangle=l z \operatorname{ero}(b)
$$

[^4]Proof. The proof is by induction on $n$. The induction base $n=0$ is easily proved by using lemmas 32.4 and 33.

In the induction step $n$, let $b_{H}:=b\left[2^{n}-1,2^{n-1}\right]$ and $b_{L}:=b\left[2^{n-1}-1,0\right]$. We first look at the case $L Z H[n-1]=1$. By lemma 5, it follows that

$$
2^{n-1} \leq\left\langle l z_{n-1}\left(b_{H}\right)\right\rangle
$$

The induction hypothesis and lemma 33 yield

$$
2^{n-1} \leq \operatorname{lzero}\left(b_{H}\right) \leq 2^{n-1}
$$

which implies equality. Lemma 32.4 leads to

$$
b_{H}=\mathbf{0}^{2^{n-1}}
$$

By lemma 34, and the induction hypothesis we have

$$
\begin{aligned}
\operatorname{lzero}(b) & =l z e r o\left(\mathbf{0}^{2^{n-1}} \circ b_{L}\right) \\
& =2^{n-1}+\operatorname{lzero}\left(b_{L}\right) \\
& =2^{n-1}+\left\langle l z_{n-1}\left(b_{L}\right)\right\rangle
\end{aligned}
$$

With lemma 14 , the output $y$ of the multiplexer satisfies

$$
l z e r o(b)=2^{n-1}+\left\langle l z_{n-1}\left(b_{L}\right)\right\rangle=\langle y\rangle
$$

When $L Z H[n-1]=\mathbf{0}$, by the induction hypothesis, the multiplexer output $y$ satisfies

$$
\langle y\rangle=\langle L Z H\rangle=l \operatorname{zero}\left(b_{L}\right)
$$

By lemma 5, we have

$$
\operatorname{lzero}\left(b_{L}\right)=\langle L Z H\rangle<2^{n-1}
$$

We finish the proof with lemma 32.2, which yields

$$
l z e r o\left(b_{L} \circ b_{H}\right)=\operatorname{lzero}(b)
$$

### 3.14 Cyclic Left Shifter

A cyclic left shifter shifts its input to the left. The bits that are shifted out on the left side are filled in on the right side. The circuit is defined for bitvector widths that are powers of two (figure 3.15).

Let $m \in \mathbb{N}^{+}, n:=2^{m}$ in this section. The formal definition of a cyclic left shift of a bitvector $a \in \mathbb{B}^{n}$ shifted $i<n$ bits is

$$
\operatorname{clshift}_{n}(a, i):= \begin{cases}a[n-i-1,0] \circ a[n-1, n-i] & \text { if } i>0 \\ a & \text { else }\end{cases}
$$

Lemma 36 For all $a \in \mathbb{B}^{n}, b, c \in \mathbb{N}, b+c<n$ :

$$
\operatorname{clshift}_{n}(a, b+c)=\operatorname{clshift}_{n}\left(\operatorname{clshift}_{n}(a, b), c\right)
$$



Figure 3.15: Cyclic left shifter cls

Proof. This lemma is resolved by expansion of the bitvector concatenation and extraction definitions, and applying the PVS decision procedures.

We first define a single shifter stage, consisting of a multiplexer. Depending on a bit $b$, the stage either passes $a$ unshifted, or shifts $a$ by a fixed amount $i$ to the left.

Circuit 15 Let $0<i<n$. Inputs $a \in \mathbb{B}^{n}, b \in \mathbb{B}$, output $r \in \mathbb{B}^{n}$,

$$
\text { cls_stage }_{n, i}:= \begin{cases}a[n-i-1,0] \circ a[n-1, n-i] & \text { if } b \\ a & \text { else } .\end{cases}
$$

Lemma 37 For all $a \in \mathbb{B}^{n}, 0<i<n, b \in \mathbb{B}$ :

$$
\text { cls_stage }_{n, i}(a, b)=\text { clshift }_{n}(a, i \cdot b)
$$

Proof. Trivial, since the clshift and cls_stage definitions only differ marginally. The reader should note the subtle difference between the function clshift that shifts a bitvector by an arbitrary amount, and the cls_stage hardware that shifts by an arbitrary, but fixed amount.

The recursive shifter construction consists of a stack of stages that shift by powers of two (figure 3.15).

Circuit 16 Let $s<m$, inputs $a \in \mathbb{B}^{n}, b \in \mathbb{B}^{m}$, output $r \in \mathbb{B}^{n}$.

$$
\begin{aligned}
& \text { cls_rec }_{m, 0}:=\text { cls_stage }_{n, 1}(a, b[0]) \\
& \text { cls_rec }_{m, s}:=\text { cls_stage }_{n, 2^{s}}\left(\text { cls_rec }_{m, s-1}(a, b), b[s]\right) .
\end{aligned}
$$

To prove the correctness of the cls_rec circuit, we show that the following invariant holds for each of the $m$ stages of the shifter:

Lemma 38 (CLS Rec Correct) For all $s<m$, $a \in \mathbb{B}^{n}, b \in \mathbb{B}^{m}$ :

$$
c l s \_r e c_{m, s}(a, b)=\operatorname{clshift}_{n}(a,\langle b[s, 0]\rangle)
$$

Proof. We induct on $s$. The induction base $s=0$ is a direct consequence of lemma 37 .
In the induction step $s+1$, the induction hypothesis is

$$
c l s \_r e c_{m, s}(a, b)=\operatorname{clshift}_{n}(a,\langle b[s, 0]\rangle)
$$

We start by expanding the recursive definition of cls_rec:

$$
\text { cls_stage }_{m, 2^{s+1}}\left(\text { cls_rec }_{m, s}(a, b), b[s+1]\right)=\operatorname{clshift}_{n}(a,\langle b[s+1,0]\rangle) .
$$

By using the induction hypothesis, we have

$$
\operatorname{cls}_{-} s t a g e_{m, 2^{s+1}}\left(\operatorname{clshift} t_{n}(a,\langle b[s, 0]\rangle), b[s+1]\right)=\operatorname{clshift}_{n}(a,\langle b[s+1,0]\rangle) .
$$

By lemma 37 we have

$$
\operatorname{clshift}_{n}\left(\operatorname{clshift} t_{n}(a,\langle b[s, 0]\rangle), 2^{s+1} \cdot b[s+1]\right)=\operatorname{clshift}_{n}(a,\langle b[s+1,0]\rangle)
$$

Applying lemma 36 yields

$$
\operatorname{clshift}_{n}\left(a,\langle b[s, 0]\rangle+2^{s+1} \cdot b[s+1]\right)=\operatorname{clshift}_{n}(a,\langle b[s+1,0]\rangle)
$$

This is true by lemma 3.
Finally, we define the cyclic left shifter by an initial call to the recursive definition (figure 3.15):
Circuit 17 Inputs $a \in \mathbb{B}^{n}, b \in \mathbb{B}^{m}$, output $r \in \mathbb{B}^{n}$.

$$
c l s_{m}(a, b):=c l s \_r e c_{m, m-1}(a, b) .
$$

Lemma 39 (CLS CORRECT) For all $m \in \mathbb{N}^{+}, n:=2^{m}, a \in \mathbb{B}^{n}, b \in \mathbb{B}^{m}$ :

$$
c l s_{m}=c l s h i f t_{n}(a,\langle b\rangle)
$$

Proof. A trivial application of lemma 38, where $s=m-1$.

### 3.15 Logical Left Shifter

In contrast to the cyclic left shifter, the logical left shifter pads the right part of the bitvector with zeros. We proceed in the same way as in the previous section: the shifter is a stack of stages that shift the bitvector by powers of two (analogous to figure 3.15). The difference is in the definition of the stages which do not shift cyclicly. Unlike the cyclic left shifter, the logical shifters are defined for arbitrary bitvector width.

Let $n \in \mathbb{N}^{+}$and $\log N:=\lceil\log n\rceil$ in this section. The definition of the logical right shift of $a \in \mathbb{B}^{n}$ by $i$ bits is

$$
\text { left_shift }_{n}(a, i):= \begin{cases}a & \text { if } i=0 \\ a[n-i-1,0] \circ \mathbf{0}^{i} & \text { if } i<n \\ \mathbf{0}^{n} & \text { else }\end{cases}
$$

Circuit 18 Let $0<i<n$. Inputs $a \in \mathbb{B}^{n}, b \in \mathbb{B}$, output $r \in \mathbb{B}^{n}$.

$$
l l s_{-} s t a g e_{n, i}:= \begin{cases}a[N-i-1,0] \circ \mathbf{0}^{i} & \text { if } b \\ a & \text { else } .\end{cases}
$$

Circuit 19 Let $s<\log N$. Inputs $a \in \mathbb{B}^{n}, b \in \mathbb{B}^{\log N}$, output $r \in \mathbb{B}^{n}$.

$$
\begin{aligned}
l l s \_r e c_{n, 0} & :=l l s \_s t a g e_{n, 1}(a, b[0]) . \\
l l s \_r e c_{n, s} & :=l l s \_\operatorname{stage}_{n, 2^{s}}\left(l l s \_r e c_{n, s-1}(a, b), b[s]\right) .
\end{aligned}
$$

Lemma 40 (LLS Rec Correct) For all $s<\log N, a \in \mathbb{B}^{n}, b \in \mathbb{B}^{\log N}$ :

$$
l l s \_r e c_{n, s}(a, b)=l e f t_{-} s h i f t_{n}(a,\langle b[s, 0]\rangle)
$$

Proof. The proof is analogous to the proof of lemma 38; we induct on $s$ and expand the recursive definition of $l l s \_r e c$. The PVS decision procedures then complete the proof.

Circuit 20 Inputs $a \in \mathbb{B}^{n}, b \in \mathbb{B}^{\log N}$, output $r \in \mathbb{B}^{n}$.

$$
l l s(a, b):=l l s \_r e c_{n, l o g N-1}(a, b) .
$$

Lemma 41 (LLS Correct) For all $a \in \mathbb{B}^{n}, b \in \mathbb{B}^{\log N}$ :

$$
l l s(a, b)=l e f t_{-} s h i f t_{n}(a,\langle b\rangle)
$$

Proof. Trivial application of the previous lemma 40 with $s=\log N-1$.

### 3.16 Logical Right Shifter

The logical right shifter is symmetric to the logical left shifter. We therefore only state the specification and the correctness lemma without proof. Again, let $n \in \mathbb{N}^{+}$and $\log N:=\lceil\log n\rceil$.

$$
\text { right_shift }(a, i):= \begin{cases}a & \text { if } i=0 \\ \mathbf{0}^{i} \circ a[n-1, i] & \text { if } i<n \\ \mathbf{0}^{n} & \text { else }\end{cases}
$$

Circuit 21 Let $0<i<n$. Inputs $a \in \mathbb{B}^{n}, b \in \mathbb{B}$, output $r \in \mathbb{B}^{n}$.

$$
\text { lrs_stage }_{n, i}:= \begin{cases}\mathbf{0}^{i} \circ a[N-1, i] & \text { if } b \\ a & \text { else } .\end{cases}
$$

Circuit 22 Let $s<\log N$. Inputs $a \in \mathbb{B}^{n}, b \in \mathbb{B}^{\log N}$, output $r \in \mathbb{B}^{n}$.

$$
\begin{aligned}
& \text { lrs_rec }_{n, 0}:=\text { lrs_stage }(a, b[0], 1) \\
& \text { lrs_rec }_{n, s}:=\text { lrs_stage }_{n, 2^{s}}\left(\text { lrs_rec }_{n, s-1}(a, b), b[s]\right) . \\
& \text { lrs }:=\text { lrs_rec }_{n, \log N-1}(a, b) .
\end{aligned}
$$



Figure 3.16: Or tree

Lemma 42 (LRS Correct) For all $a \in \mathbb{B}^{n}, b \in \mathbb{B}^{\log N}$ :

$$
\operatorname{lrs}(a, b)=\text { right_shift }_{n}(a,\langle b\rangle)
$$

### 3.17 Or Tree

An or tree is a tree of OR gates (figure 3.16). It is used for testing whether a bitvector consists only of zeros.

Circuit 23 Input $b \in \mathbb{B}^{n}$, output or $\in \mathbb{B}$.

$$
\text { ortree }_{1}:=b[0] .
$$

Let $k=\left\lfloor\frac{n}{2}\right\rfloor$.

$$
\text { ortree }_{n}:=\operatorname{ortree}_{n-k}(b[n-1, k]) \vee \text { ortree }_{k}(b[k-1,0]) .
$$

Lemma 43 (Or Tree Correct) For all $b \in \mathbb{B}^{n}$ :

$$
\operatorname{ortree}(b) \Longleftrightarrow b \neq \mathbf{0}^{n} .
$$

Proof. We induct on $n$. The case $n=0$ is resolved by (grind).
In the induction step $n$, we show

$$
\operatorname{ortree}_{n-k}(b[n-1, k]) \vee \operatorname{ortree}_{k}(b[k-1,0]) \Longleftrightarrow b \neq \mathbf{0}^{n} .
$$

The proof is finished by using the induction hypotheses:

$$
b[n-1, k] \neq \mathbf{0}^{n-k} \vee b[k-1,0] \neq \mathbf{0}^{k} \Longleftrightarrow b \neq \mathbf{0}^{n} .
$$

## Chapter 4

## IEEE Floating Point Arithmetic

To formally verify the correctness of the VAMP FPU and its components, we need a formal notion of "correctness", i.e., a formalization of the IEEE standard [Ins85] the FPU shall obey. In this chapter, we sketch the formalization of the IEEE standard used in the VAMP verification project. The formalization is primarily based on [Min95, EP97, MP00]. The formalization of the IEEE standard has been implemented and formally verified in PVS by Christian Jacobi [JK00, BJ01, Jac01a, Jac01b]. The IEEE library is described here without proofs. This chapter is a revised version of section 2 from [BJ01].

### 4.1 Factorings

We abstract IEEE numbers as defined in the standard to factorings. A factoring is a triple $(s, e, f)$ with sign bit $s \in\{\mathbf{0}, \mathbf{1}\}$, exponent $e \in \mathbb{Z}$, and significand $f \in \mathbb{R}_{\geq_{0}}$. Note that exponent range and significand precision are unbounded. The value of a factoring is

$$
\llbracket s, e, f \rrbracket:=(-1)^{s} \cdot 2^{e} \cdot f
$$

The standard introduces an exponent width $N$, from which constants $e_{\min }:=-2^{N-1}+2$ and $e_{\max }:=2^{N-1}-1$ are derived. These constants are used to bound the exponent range.

We call a factoring $(s, e, f)$ normal if $e \geq e_{\min }$ and $1 \leq f<2$. A factoring is called denormal if $e=e_{\min }$ and $0 \leq f<1$. We call a factoring an IEEE factoring if it is either normal or denormal.

Lemma 44 Each $x \in \mathbb{R}_{\neq 0}$, has a unique factoring $(\hat{s}, \hat{e}, \hat{f})$ with $1 \leq \hat{f}<2$ and $\llbracket \hat{s}, \hat{e}, \hat{f} \rrbracket=x$. Each $x \in \mathbb{R}_{\neq 0}$ has a unique IEEE factoring $(s, e, f)$ with $\llbracket s, e, f \rrbracket=x$. Zero has two IEEE factorings $\left(0, e_{\text {min }}, 0\right)$ and $\left(1, e_{\text {min }}, 0\right)$, called +0 and -0 , respectively.

Let $\hat{\eta}$ and $\eta$ be the functions that map (non-zero) reals $x$ to their corresponding factorings $(\hat{s}, \hat{e}, \hat{f})$ and $(s, e, f)$, respectively. We define $\eta(0):=\left(0, e_{\min }, 0\right)$.

Lemma 45 Let $x \in \mathbb{R}$ with $x \neq 0$ in the context of $\hat{\eta}$. It holds: ${ }^{1}$

$$
\begin{array}{ll}
\hat{\eta}_{e}(x)=\left\lfloor\log _{2}|x|\right\rfloor, & \eta_{e}(x)= \begin{cases}\left\lfloor\log _{2}|x|\right\rfloor & \text { if } x \neq 0 \text { and }\left\lfloor\log _{2}|x|\right\rfloor \geq e_{\text {min }} \\
e_{\text {min }} & \text { else, }\end{cases} \\
\hat{\eta}_{f}(x)=|x| \cdot 2^{-\hat{\eta}_{e}(x)}, & \eta_{f}(x)=|x| \cdot 2^{-\eta_{e}(x)} .
\end{array}
$$

Let $P$ be the significand precision defined in the standard. A significand $f$ is called representable if $f$ has at most $P-1$ digits behind the binary point, i.e., if $2^{P-1} \cdot f \in \mathbb{N}$. We call an IEEE factoring $(s, e, f)$ semi-representable if $f$ is representable. We call an IEEE factoring representable if it is semi-representable and $e \leq e_{\max }$ holds. We call a real $x$ (semi-)representable if $\eta(x)$ is (semi-)representable.

Representable numbers exactly correspond to the representable numbers defined in the standard. Common values for $(N, P)$ are $(8,24)$ and $(11,53)$, called single and double precision, respectively. The standard defines an encoding of single and double precision IEEE factorings into bit strings of length 32 and 64 , respectively. In this chapter, factorings are triples of numbers. In chapter 5, we introduce factorings that are triples of bitvectors.

### 4.2 Rounding

We proceed with the definition of the rounding function. The IEEE standard defines four rounding modes: round to nearest, up, down, and to zero. We define a function $r_{\text {int }}(\cdot, \mathcal{M})$ for each rounding mode $\mathcal{M} \in\{$ near, up, down, zero $\}$, which rounds reals $x$ to integers [Min95]:

$$
\begin{aligned}
r_{\text {int }}(x, \text { near }) & := \begin{cases}\lfloor x\rfloor & \text { if } x-\lfloor x\rfloor<\lceil x\rceil-x \\
\lceil x\rceil & \text { if } x-\lfloor x\rfloor>\lceil x\rceil-x \\
x & \text { if }\lfloor x\rfloor=\lceil x\rceil \\
2 \cdot\lfloor\lceil x\rceil / 2\rfloor & \text { else, }\end{cases} \\
r_{\text {int }}(x, \text { up }) & :=\lceil x\rceil, \\
r_{\text {int }}(x, \text { down }) & :=\lfloor x\rfloor, \\
r_{\text {int }}(x, \text { zero }) & :=\operatorname{sign}(x) \cdot\lfloor|x|\rfloor .
\end{aligned}
$$

By scaling by $2^{P-1}$, reals are rounded to rationals with $P-1$ fractional bits:

$$
r_{r a t}(x, \mathcal{M}):=2^{-(P-1)} \cdot r_{i n t}\left(x \cdot 2^{P-1}, \mathcal{M}\right)
$$

Further scaling with $2^{e}, e:=\eta_{e}(x)$, yields the IEEE rounding function:

$$
r d(x, \mathcal{M}):=2^{e} \cdot r_{r a t}\left(x \cdot 2^{-e}, \mathcal{M}\right)
$$

It is not obvious that this definition conforms with the IEEE standard. The conformance is proved in [Jac01a].

The rounding of reals $x$ can be decomposed into three steps: $\eta$-computation, significand rounding, and post-normalization [MP00, Jac01b]. The benefit of this decomposition is that it simplifies the design and verification of the rounder (see theorem 3) [BJ01].

[^5]First, the $\eta$-computation step computes the IEEE factoring $\eta(x)$, where $x$ is the number to be rounded. The significand round then rounds the significand computed in the $\eta$-computation to $P-1$ digits behind the binary point. This is formalized in the function sigrd:

$$
\operatorname{sigrd}(X, \mathcal{M}):=\left|r_{r a t}\left((-1)^{s} \cdot f, \mathcal{M}\right)\right|
$$

where $X=(s, e, f)$ is an arbitrary IEEE factoring, and $\mathcal{M} \in\{n e a r, u p, d o w n$, zero $\}$ is a rounding mode.

In case the significand round returns 0 or 2 , the factoring has to be post-normalized: if the significand round returns 2 , the exponent is incremented, and the significand is forced to 1 ; if the significand round returns 0 , the sign bit is forced to 0 (in order to yield $\eta(0)$ ). The postnormalization is defined as follows:

$$
\operatorname{postnorm}(X, \mathcal{M})= \begin{cases}(s, e, \operatorname{sigrd}(X, \mathcal{M})) & \text { if } 0<\operatorname{sigrd}(X, \mathcal{M})<2 \\ (s, e+1,1) & \text { if } \operatorname{sigrd}(X, \mathcal{M})=2 \\ \left(0, e_{\text {min }}, 0\right) & \text { if } \operatorname{sigrd}(X, \mathcal{M})=0\end{cases}
$$

Theorem 3 (DECOMPOSITION THEOREM) For $x \in \mathbb{R}$ and rounding mode $\mathcal{M} \in\{$ near, up, down, zero\}:

$$
\operatorname{postnorm}(\eta(x), \mathcal{M})=\eta(r d(x, \mathcal{M}))
$$

## $4.3 \alpha$-Equivalence

We now define the concept of $\alpha$-equivalence and $\alpha$-representatives [EP97, MP00]. As we will see in theorem 5 , this concept is a very concise way to speak about sticky-bit computations.

Let $\alpha$ be an integer. Two reals $x$ and $y$ are said to be $\alpha$-equivalent, if

$$
x \equiv_{\alpha} y \quad: \Longleftrightarrow x=y \vee\left(\exists q \in \mathbb{Z}: q \cdot 2^{\alpha}<x, y<(q+1) \cdot 2^{\alpha}\right)
$$

i.e., if both $x$ and $y$ lie in the same open interval between two consecutive integral multiples of $2^{x}$. Clearly, if such an $q$ exists, it must be

$$
q_{\alpha}(x):=\left\lfloor x \cdot 2^{-\alpha}\right\rfloor .
$$

The $\alpha$-representative of $x$ is defined as

$$
[x]_{\alpha}:= \begin{cases}x & \text { if } x=q_{\alpha}(x) \cdot 2^{\alpha} \\ \left(q_{\alpha}(x)+\frac{1}{2}\right) \cdot 2^{\alpha} & \text { else }\end{cases}
$$

i.e., if $x$ is an integral multiple of $2^{\alpha}$, the representative of $x$ is $x$ itself, and otherwise the midpoint of the interval between the surrounding multiples of $2^{\alpha}$.

Lemma 46 Let $x, y \in \mathbb{R}$, and $\alpha, k \in \mathbb{Z}$.

1. $\equiv_{\alpha}$ is an equivalence relation,
2. $x \equiv{ }_{\alpha}[x]_{\alpha}$,
3. $x \equiv{ }_{\alpha} y \Longleftrightarrow[x]_{\alpha}=[y]_{\alpha}$,
(representative equivalence)
4. $x \equiv{ }_{\alpha} y \Longleftrightarrow-x \equiv_{\alpha}-y$, and $[-x]_{\alpha}=-[x]_{\alpha}$, (negative value)
5. $x \equiv_{\alpha} y \Longleftrightarrow 2^{k} \cdot x \equiv_{\alpha+k} 2^{k} \cdot y$, and $\left[2^{k} \cdot x\right]_{\alpha+k}=2^{k} \cdot[x]_{\alpha}$,
6. $x \equiv{ }_{\alpha} y \Longleftrightarrow x+k \cdot 2^{\alpha} \equiv_{\alpha} y+k \cdot 2^{\alpha}$, (translation)
7. $x \equiv_{\alpha} y \Longrightarrow x \equiv_{\alpha+k} y$ if $k \geq 0$,
8. $x=0 \Longleftrightarrow x \equiv_{\alpha} 0 \Longleftrightarrow[x]_{\alpha}=0$,
(zero value)
9. $0<x<2^{\alpha} \Longrightarrow[x]_{\alpha}=2^{\alpha-1}$.
(small value)

The following theorem describes the computation of IEEE factorings of representatives:

Theorem 4 Let $x \in \mathbb{R}$, let $(s, e, f):=\eta(x)$ be the corresponding IEEE factoring, and let $p \geq 0$ be an integer. The IEEE factoring of $[x]_{e-p}$ can be computed by computing the representative $[f]_{-p}$ of $f$ :

$$
\eta\left([x]_{e-p}\right)=\left(s, e,[f]_{-p}\right)
$$

Next, we show that the representative of $f$ can be computed by a sticky bit computation. Let $f \geq 0$ be a real in binary format $f_{k}, \ldots, f_{0}, \ldots, f_{-l} \in\{0,1\}^{k+l+1}$ such that $\langle f\rangle:=\sum_{i=-l}^{k} f_{i} \cdot 2^{i}$. Let $p$ be an integer, $k \geq p>-l$. The $p$-sticky-bit of $f$ is the logical OR of all bits $f_{p-1}, \ldots, f_{-l}$ :

$$
\operatorname{sticky}_{p}(f):=\bigvee_{i=-l}^{p-1} f_{i}
$$

Theorem 5 The representative $[f]_{p}$ of $f$ can be computed by replacing the less significant bits by the sticky bit:

$$
[f]_{p}=\langle f[k, p]\rangle+2^{p-1} \cdot \operatorname{sticky}_{p}(f)
$$

Theorems 4 and 5 together allow a very efficient computation of representatives (respectively their IEEE factorings) by or-ing the less significant bits in an or tree, and replacing them by the sticky bit. This technique is well known [Gol96], but introducing the formalism with $\alpha$ representatives allows for a very concise argumentation about sticky computations.

The valuable property of $\alpha$-representatives is that rounding $x$ and its representative $[x]_{-P}$ yields the same result:

Theorem 6 Let $x \in \mathbb{R}, e:=\eta_{e}(x)$, and $\mathcal{M}$ be a rounding mode. It holds

$$
r d(x, \mathcal{M})=r d\left([x]_{e-P}, \mathcal{M}\right)
$$

As a consequence, the significand round can be performed on the representative $[f]_{-P}$ of $f$ :

$$
\operatorname{sigrd}((s, e, f), \mathcal{M})=\operatorname{sigrd}\left(\left(s, e,[f]_{-P}\right), \mathcal{M}\right)
$$

Theorem 7 By lemma 46.7, theorem 6 also holds for any $\alpha \leq e-P$ :

$$
r d(x, \mathcal{M})=r d\left([x]_{\alpha}, \mathcal{M}\right)
$$

### 4.4 Exceptions

The IEEE standard defines five exceptions: invalid operation (INV), division by zero (DIVZ), overflow (OVF), underflow (UNF), and inexact result (INX). Our formalization of these exceptions is taken literally from [MP00], as the implementation in the actual hardware is. As a consequence of theorem 6, the exceptions can be detected by considering only the representative of the exact result.

In case of underflow or overflow with the respective trap handler enabled, the standard mandates scaling the result into the representable range, and passing the scaled result to the trap handler. This is called wrapped exponent. The scale factor is defined to be $2^{A}$ with $A:=3 \cdot 2^{N-2}$.

### 4.5 Correctness of the FPU

The standard requests that every floating point operation shall return a result obtained as if one first computed the exact result with infinite precision, and then rounded this exact result. We therefore call the FPU correct, if for each operation $\circ \in\{+,-, \times, \div\}$ on all representable numbers $x$ and $y$, the FPU returns the IEEE bit string encoding of the factoring

$$
\eta(r d(x \circ y, \mathcal{M}))
$$

## Chapter 5

## The Floating Point Adder

In this chapter, we prove the correctness of the floating point adder from the VAMP FPU. A summary of this chapter is part of [BJ01]. The adder is given two factorings $a$ and $b$ and a flag $s u b$. For $s u b=\mathbf{0}$, it computes the sum $a+b$, and for $s u b=\mathbf{1}$, the difference $a-b$.

### 5.1 Adder Correctness

Bitvector representation. The VAMP FPU handles single and double precision operands. Since single precision operands are embedded into double precision bitvectors by the unpacker, we only use double precision within the adder. We therefore fix the precision constants $(N, P)=(11,53)$ in this chapter.

We define the bitvector factoring type

$$
\mathbb{I}_{p}:=\left(\mathbb{B}, \mathbb{B}^{11}, \mathbb{B}^{p}\right)
$$

with the abbreviation $\mathbb{I}:=\mathbb{I}_{53}$. In section 4.1 , we defined $\llbracket \cdot \rrbracket$ on numbers. For $(s, e, f) \in \mathbb{I}_{p}$, we define

$$
\llbracket s, e, f \rrbracket_{i}:=\llbracket s,[e],\langle f\rangle \cdot 2^{i-p} \rrbracket,
$$

i.e., $e$ is interpreted as a two's complement number, and $f$ as a binary fraction with $i$ bits in front of the binary point. For convenience, we will omit the index 1 :

$$
\llbracket s, e, f \rrbracket:=\llbracket s, e, f \rrbracket_{1} .
$$

In section 4.1, we defined (semi-)representable IEEE factorings. We will also apply these notions to bitvector factorings, e.g., $(s, e, f) \in \mathbb{I}$ is called an IEEE factoring if $\left(s,[e],\langle f\rangle \cdot 2^{-p}\right)$ is an IEEE factoring (where $i$ is clear from the context).

Correctness criterion. As in chapter 3, we define $\pm_{x}$ for $x \in \mathbb{B}$ as

$$
\pm_{x}:= \begin{cases}+ & \text { if } x=\mathbf{0} \\ - & \text { if } x=1\end{cases}
$$

The VAMP unpacker passes the IEEE factorings $a$ and $b \in \mathbb{I}$ to the adder [BJ01]. Let the exact, infinitely precise result of the operation to be performed be

$$
S:=\llbracket a \rrbracket \pm_{\text {sub }} \llbracket b \rrbracket .
$$

The adder computes an approximation factoring $s$ of the exact result $S$.
The adder passes its output $s$ to the floating point rounder. From $s$, the rounder computes the rounded result $r d(S, \mathcal{M})$ according to the IEEE standard (cf. section 4.2). To meet the rounder input specifications, we have to prove that the output $s$ of the adder is close enough to $S$. In terms of $\alpha$-equivalence (see theorem 6), that is

$$
S \equiv_{e-P} \llbracket s \rrbracket,
$$

where $e:=\eta_{e}(S)$. Using $\hat{e}:=\hat{\eta}_{e}(S)$ and theorem 7, we will prove the adder correctness criterion

$$
S \equiv_{\hat{e}-P} \llbracket s \rrbracket .
$$

By lemma 45, we have $\hat{e} \leq e$, and hence, the prerequisites of theorem 7 are satisfied.
The output $s$ may be an arbitrary factoring, as the VAMP rounder is capable of rounding $s$ correctly even if $s$ is not an IEEE factoring. The rounder specification imposes two additional restrictions: the rounder input exponent must satisfy $e \leq e_{\max }$ in case of a denormal input significand, and the value of the input factoring must lie in the range that can be scaled into the range of representable numbers using the wrapped exponent from section 4.4. (See [Jac01b] for details.)

Special cases. For a zero sum $S, \hat{e}=\hat{\eta}_{e}(S)$ is not defined. We will therefore require

$$
S \neq 0
$$

as a prerequisite to the adder correctness criterion proved here. In the VAMP FPU, the case $S=0$ is detected by the unpacker. Other special cases that are handled by the unpacker are operations on NaN and $\pm \infty$. We therefore only treat numeric operands (represented by IEEE factorings). The unpacker implementation is described in detail in [Jac01b].

### 5.2 Addition Algorithm

We are given the input factorings $a:=\left(s_{a}, e_{a}, f_{a}\right)$ and $b:=\left(s_{b}, e_{b}, f_{b}\right)$, and a flag sub. We want to compute $\llbracket a \rrbracket \pm_{s u b} \llbracket b \rrbracket$, represented by $s:=\left(s_{s}, e_{s}, f_{s}\right)$.

We will implement the floating point addition using the following algorithm:

1. In case of a subtraction, flip the sign bit of $b: s_{b}:=s_{b} \oplus s u b$,
2. the larger exponent of $e_{a}$ and $e_{b}$ is the result's exponent $e_{s}$,
3. assume $e_{a} \geq e_{b}$, otherwise exchange $a$ and $b$.
4. align the significand $f_{b}$ to $f_{a}$ by shifting it $\delta:=\left|e_{a}-e_{b}\right|$ to the right: $f_{b}^{*}:=f_{b} \cdot 2^{-\delta}$,
5. let $s x:=s_{a} \oplus s_{b}^{\prime}$,
6. let $s u m:=f_{a} \pm_{s x} f_{b}^{*}$,
7. the results significand is $f_{s}:=\mid$ sum $\mid$,
8. the result's sign is $s_{s}:=s_{a} \oplus(s u m<0)$.

This addition algorithm for floating point addition is well known [Gol96].
The alignment shift would require shifters of size $\approx 2^{N+1}$, which is impractical. We therefore approximate the shifted significand $f_{b}^{*}$ by its $-(P+1)$-representative

$$
f_{b}^{\prime}:=\left[2^{-\delta} \cdot f_{b}\right]_{-(P+1)}
$$

### 5.3 Correctness of the Addition Algorithm

To justify using the algorithm from section 5.2, we prove the following theorem, which will be used as the core of the adder hardware correctness proof.

Theorem 8 For all representable IEEE factorings $a:=\left(s_{a}, e_{a}, f_{a}\right)$ and $b:=\left(s_{b}, e_{b}, f_{b}\right)$ where $e_{a} \geq e_{b}$, let $S:=\llbracket a \rrbracket+\llbracket b \rrbracket \neq 0, \hat{e}:=\hat{\eta}_{e}(S), \delta:=e_{a}-e_{b}$, and $f_{b}^{\prime}:=\left[2^{-\delta} \cdot f_{b}\right]_{-(P+1)}$.

$$
S \equiv_{\hat{e}-P} 2^{e_{a}} \cdot\left((-1)^{s_{a}} \cdot f_{a}+(-1)^{s_{b}} \cdot f_{b}^{\prime}\right)
$$

Proof. $S$ can be rewritten as

$$
\begin{aligned}
S & =\llbracket s_{a}, e_{a}, f_{a} \rrbracket+\llbracket s_{b}, e_{b}, f_{b} \rrbracket \\
& =(-1)^{s_{a}} \cdot 2^{e_{a}} \cdot f_{a}+(-1)^{s_{b}} \cdot 2^{e_{b}} \cdot f_{b} \\
& =2^{e_{a}} \cdot\left((-1)^{s_{a}} \cdot f_{a}+(-1)^{s_{b}} \cdot 2^{-\delta} \cdot f_{b}\right) .
\end{aligned}
$$

It remains to prove

$$
2^{e_{a}} \cdot\left((-1)^{s_{a}} \cdot f_{a}+(-1)^{s_{b}} \cdot 2^{-\delta} \cdot f_{b}\right) \equiv_{\hat{e}-P} 2^{e_{a}} \cdot\left((-1)^{s_{a}} \cdot f_{a}+(-1)^{s_{b}} \cdot f_{b}^{\prime}\right)
$$

For $\delta \leq 2$, we claim

$$
f_{b}^{\prime}=\left[2^{-\delta} \cdot f_{b}\right]_{-(P+1)}=2^{-\delta} \cdot f_{b}
$$

We use the premise that $b$ is a representable IEEE factoring with precision $P$. It follows that $f_{b} \cdot 2^{P-1}$ is an integer. Since $\delta \leq 2, f_{b} \cdot 2^{-\delta} \cdot 2^{P+1}$ is also integer. Therefore, $2^{-\delta} \cdot f_{b}=$ $\left[2^{-\delta} \cdot f_{b}\right]_{-(P+1)}$ by the definition of $[\cdot]_{\alpha}$, which proves the theorem for $\delta \leq 2$.

Now let $\delta \geq 3$. Starting from

$$
\left[2^{-\delta} \cdot f_{b}\right]_{-(P+1)}=f_{b}^{\prime}
$$

we have by lemma 46.3

$$
2^{-\delta} \cdot f_{b} \equiv_{-(P+1)} f_{b}^{\prime}
$$

Applying lemma 46.4 yields

$$
(-1)^{s_{b}} \cdot 2^{-\delta} \cdot f_{b} \equiv_{-(P+1)}(-1)^{s_{b}} \cdot f_{b}^{\prime}
$$

Lemma 46.6 yields

$$
(-1)^{s_{a}} \cdot f_{a}+(-1)^{s_{b}} \cdot 2^{-\delta} \cdot f_{b} \equiv_{-(P+1)}(-1)^{s_{a}} \cdot f_{a}+(-1)^{s_{b}} \cdot f_{b}^{\prime}
$$

Lemma 46.5 yields

$$
2^{e_{a}} \cdot\left((-1)^{s_{a}} \cdot f_{a}+(-1)^{s_{b}} \cdot 2^{-\delta} \cdot f_{b}\right) \equiv_{e_{a}-(P+1)} 2^{e_{a}} \cdot\left((-1)^{s_{a}} \cdot f_{a}+(-1)^{s_{b}} \cdot f_{b}^{\prime}\right)
$$

Lemma 47 below tells us that $\hat{e}-P \geq e_{a}-(P+1)$. Therefore, we can coarsen the $\alpha$-equivalence to $\hat{e}-P$ using lemma 46.7:

$$
2^{e_{a}} \cdot\left((-1)^{s_{a}} \cdot f_{a}+(-1)^{s_{b}} \cdot 2^{-\delta} \cdot f_{b}\right) \equiv_{\hat{e}-P} 2^{e_{a}} \cdot\left((-1)^{s_{a}} \cdot f_{a}+(-1)^{s_{b}} \cdot f_{b}^{\prime}\right)
$$

It remains to prove lemma 47, from which we concluded $\hat{e}-P \geq e_{a}-(P+1)$ in the above proof.

Lemma 47 For all representable IEEE factorings $a:=\left(s_{a}, e_{a}, f_{a}\right)$ and $b:=\left(s_{b}, e_{b}, f_{b}\right)$, where $\delta:=e_{a}-e_{b} \geq 2$, let $S:=\llbracket a \rrbracket+\llbracket b \rrbracket \neq 0, \hat{e}:=\hat{\eta}_{e}(S):$

$$
\hat{e} \geq e_{a}-1
$$

Proof. By lemma 45 and the definitions of $S$ and $\llbracket \cdot \rrbracket$, we have to show

$$
\hat{e}=\hat{\eta}(S) \cdot e=\left\lfloor\log _{2}\left|(-1)^{s_{a}} \cdot 2^{e_{a}} \cdot f_{a}+(-1)^{s_{b}} \cdot 2^{e_{b}} \cdot f_{b}\right|\right\rfloor \geq e_{a}-1
$$

Since $e_{a}-1$ is integer, this is equivalent to

$$
\log _{2}\left|(-1)^{s_{a}} \cdot 2^{e_{a}} \cdot f_{a}+(-1)^{s_{b}} \cdot 2^{e_{b}} \cdot f_{b}\right| \geq e_{a}-1
$$

$b$ is an IEEE factoring, i.e.

$$
0 \leq f_{b}<2
$$

$\delta \geq 2$ implies

$$
0<2^{-\delta} \leq \frac{1}{4}
$$

we conclude

$$
2^{-\delta} \cdot f_{b}<\frac{1}{2}
$$

From the fact that $b$ is an IEEE factoring, we know

$$
e_{b} \geq e_{\min }
$$

and due to $\delta=e_{a}-e_{b} \geq 2$,

$$
e_{a}=e_{b}+\delta \geq e_{\min }+\delta>e_{\min }
$$

This means that $a$ is a normal IEEE factoring, hence

$$
f_{a} \geq 1
$$



Figure 5.1: Top level schematics for the adder

By checking the four cases for the sign bits $s_{a}$ and $s_{b}$, we conclude

$$
\left|(-1)^{s_{a}} \cdot f_{a}+(-1)^{s_{b}} \cdot 2^{-\delta} \cdot f_{b}\right| \geq \frac{1}{2}
$$

Multiplying by $2^{e_{a}}$ yields

$$
\left|(-1)^{s_{a}} \cdot 2^{e_{a}} \cdot f_{a}+(-1)^{s_{b}} \cdot 2^{e_{b}} \cdot f_{b}\right| \geq 2^{e_{a}-1} .
$$

Taking logarithms on both sides finishes the proof.

### 5.4 Adder Hardware

We will now show that the VAMP floating point adder (figure 5.1) correctly implements the addition algorithm in section 5.2.

Inputs to the adder are the representable IEEE factorings $a:=\left(s_{a}, e_{a}, f_{a}\right), b:=\left(s_{b}, e_{b}, f_{b}\right) \in \mathbb{I}$ and the flag sub $\in \mathbb{B}$. Output is the factoring $s:=\left(s_{s}, e_{s}, f_{s}\right) \in \mathbb{I}_{57}$.
The adder is divided into three stages. The first stage consists of a single XOR gate to compute $s_{b}^{\prime}$ (step 1 of the algorithm). The second stage exchanges $a$ and $b$ if necessary and computes the alignment shift, producing factorings $a_{2}$ and $b_{3}$ (steps 2 to 5 ). The third stage then adds/subtracts the significands (steps 6 to 8 ).

Differences from the adder given in [MP00] are:

- The special cases of NaN and infinite operands are not handled by the adder itself, but the unpacker. A zero result is also a special case. This evades the need for a forth adder stage for the sign computation. The zero tester is moved from the significand add stage to the unpacker [BJ01, Jac01b].
- The formalization does not use binary fractions, but natural numbers. This is due to the fact that the PVS bitvectors library does not support binary fractions. One could formalize binary fractions in PVS, but one would loose the benefits of the bitvector lemmas PVS provides.


### 5.5 Stage 1: Computing $s_{b}^{\prime}$

Stage 1 implements the subtraction algorithm by inverting the sign bit $\&$ in case of a subtraction (see figure 5.1).

Circuit 24 Inputs to stage 1 are the factorings $a:=\left(s_{a}, e_{a}, f_{a}\right), b:=\left(s_{b}, e_{b}, f_{b}\right) \in \mathbb{I}$, and the subtract bit sub $\in \mathbb{B}$. Outputs are $\left(s_{a}, e_{a}, f_{a}\right)$ and $b^{\prime}:=\left(s_{b}^{\prime}, e_{b}, f_{b}\right) \in \mathbb{I}$. The sign bit of $b^{\prime}$ is computed as

$$
s_{b}^{\prime}:=s u b \oplus s_{b}
$$

All other outputs pass stage 1 unmodified.

The following lemma states that the value $b$ is computed correctly:

Lemma 48 For all factorings $\left(s_{b}, e_{b}, f_{b}\right) \in \mathbb{I}$ and subtract bits sub, the output $\left(s_{b}^{\prime}, e_{b}, f_{b}\right) \in \mathbb{I}$ of stage 1 satisfies

$$
\llbracket s_{b}^{\prime}, e_{b}, f_{b} \rrbracket=\llbracket s_{b}, e_{b}, f_{b} \rrbracket \cdot(-1)^{s u b}
$$

Proof. The proof is trivial by applying the definitions of $s_{b}$ and $\llbracket \rrbracket \rrbracket$.
Stage 2 requires that its inputs are IEEE factorings, this holds for the output of stage 1 if its inputs are IEEE factorings.

Lemma 49 For all IEEE factorings $\left(s_{a}, e_{a}, f_{a}\right),\left(s_{b}, e_{b}, f_{b}\right) \in \mathbb{I}$ :

$$
\text { Output }\left(s_{b}^{\prime}, e_{b}, f_{b}\right) \text { is an IEEE factoring. }
$$

Output $\left(s_{a}, e_{a}, f_{a}\right)$ is an IEEE factoring by prerequisite.

Proof. Trivial by the definition of IEEE factorings.
The correctness of the stage 1 output is asserted by theorem 9 .

Theorem 9 (Stage 1 Correct) For all factorings $\left(s_{a}, e_{a}, f_{a}\right)$ and $\left(s_{b}, e_{b}, f_{b}\right) \in \mathbb{I}$, and subtract bits sub $\in \mathbb{B}$ :

$$
\llbracket s_{a}, e_{a}, f_{a} \rrbracket \pm_{s u b} \llbracket s_{b}^{\prime}, e_{b}, f_{b} \rrbracket=\llbracket s_{a}, e_{a}, f_{a} \rrbracket+\llbracket s_{b}^{\prime}, e_{b}, f_{b} \rrbracket .
$$

Proof. Trivial by lemma 48.

### 5.6 Stage 2: Alignment Shift

Stage 2 consists of the alignment shifter that computes the larger of both input exponents (step 2 from the algorithm in section 5.2), swaps $a$ and $b$ if necessary (step 3), and aligns the significands according to the difference of the exponents, thereby computing the representative $f$ of the shifted significand (step 4).


Figure 5.2: Top level alignment shifter schematics


Figure 5.3: Exponent Subtract ExpSub

The alignment shifter itself is divided into several subcircuits (figure 5.2). Overall inputs are $a:=\left(s_{a}, e_{a}, f_{a}\right)$ and $b^{\prime}:=\left(s_{b}^{\prime}, e_{b}, f_{b}\right)$. Outputs are the (unrounded) result's exponent $e_{s}$ and the possibly swapped sign bits and significands $\left(s_{a 2}, e_{s}, f_{a 2}\right) \in \mathbb{I}$ and $\left(s_{b 2}, e_{s}, f_{b 3}\right) \in \mathbb{I}_{56}$. A bit $s x$ indicates whether we have to add or subtract the significands in the next stage.

### 5.6.1 Exponent Subtract

Circuit ExpSub computes the alignment shift distance $a s:=e_{a}-e_{b}$, and the flag $e_{b>a}$ indicating that the difference is negative (figure 5.3).

Circuit 25 Inputs $e_{a}, e_{b} \in \mathbb{B}^{11}$, outputs $e_{b>a} \in \mathbb{B}$, as $\in \mathbb{B}^{12}$. Let $\operatorname{sext}(b)$ denote the sign extension of b by one bit.

$$
\begin{aligned}
& a s:=\operatorname{sub}_{s}\left(\operatorname{sext}\left(e_{a}\right), \operatorname{sext}\left(e_{b}\right)\right) \\
& e_{b>a}:=\operatorname{sub}_{n e g}\left(\operatorname{sext}\left(e_{a}\right), \operatorname{sext}\left(e_{b}\right)\right) .
\end{aligned}
$$



Figure 5.4: Circuit Limit

Lemma 50 (ExpSub as Correct) For all $e_{a}, e_{b} \in \mathbb{B}^{11}$ :

$$
\text { as }=\left[e_{a}\right]-\left[e_{b}\right] .
$$

Lemma 51 (ExpSub eb-ea Correct) For all $e_{a}, e_{b} \in \mathbb{B}^{11}$ :

$$
e_{b>a} \Longleftrightarrow\left[e_{b}\right]>\left[e_{a}\right] \Longleftrightarrow[a s]<0 .
$$

Proof. Trivial application of the sub correctness lemma 25.

### 5.6.2 Exponent Select

The greater of both exponents is selected by a multiplexer (see figure 5.2).

Circuit 26 Inputs $e_{a}, e_{b} \in \mathbb{B}^{11}$, $e_{b>a} \in \mathbb{B}$, output $e_{s} \in \mathbb{B}^{11}$.

$$
e_{s}:= \begin{cases}e_{b} & \text { if } e_{b>a} \\ e_{a} & \text { else }\end{cases}
$$

Lemma 52 (Exponent es Correct) For all $e_{a}, e_{b} \in \mathbb{B}^{11}, e_{b>a} \in \mathbb{B}$, let $e_{b>a}=\left(\left[e_{b}\right]>\left[e_{a}\right]\right)$ :

$$
e_{s}=\max \left(e_{a}, e_{b}\right)
$$

Proof. Here, max is the bitvector two's complement maximum. This lemma assumes the correctness of the $e_{b>a}$ input, which was asserted by lemma 51. The proof is trivial.

### 5.6.3 Circuit Limit

The alignment shift distance $a s$ is limited to reduce the size of the shifter lrs that aligns the significands (figure 5.4). The first part limit_approx of the limit circuit computes an approximation as $s_{1}$ of the absolute value of as. Instead of computing the two's complement, the circuit only negates as, saving an incrementer on the critical path of the floating point adder. The error introduced in the shift distance is compensated by the swap circuit where $f_{b 2}$ is shifted by 1 in the erroneous case. The second part limit_limit limits $a s_{1}$ to a maximum value of $B=63$. $a s_{2}$ is the output of the concatenation of both parts.

## Approximating the Shift Distance

Circuit 27 Inputs $a s \in \mathbb{B}^{12}$, $e_{b>a} \in \mathbb{B}$, output as $s_{1} \in \mathbb{B}^{12}$.

$$
\text { limit_approx }:= \begin{cases}\neg a s & \text { if } e_{b>a} \\ a s & \text { else } .\end{cases}
$$

Lemma 53 (Limit Approx Correct) For all as $\in \mathbb{B}^{12}, e_{b>a} \in \mathbb{B}$, let $e_{b>a}=([a s]<0)$ :

$$
\left\langle a s_{1}\right\rangle=|[a s]|-e_{b>a}
$$

Proof. The assumption on $e_{b>a}$ holds by lemma 51.
In the case $e_{b>a}=\mathbf{1}$, we have to show that

$$
\langle\neg a s\rangle=|[a s]|-1
$$

holds. In this case, $[a s]<0$. By lemma 9, this is equivalent to

$$
2^{12}-1-\langle a s\rangle=-[a s]-1
$$

Again because of $[a s]<0$, we conclude that $a s[11]$ is set (lemma 12). We finish this case with lemma 10:

$$
2^{12}-\langle a s\rangle=-\left(\langle a s\rangle-1 \cdot 2^{12}\right)
$$

In the other case $e_{b>a}=\mathbf{0}$, we have $[a s] \geq 0$, and hence, $\neg a s[11]$ by lemma 12 .

$$
\begin{aligned}
\langle a s\rangle & =|[a s]|-0, \\
\langle a s\rangle & =[a s] .
\end{aligned}
$$

The latter holds by lemma 10.

## Limiting the Shift Distance

Circuit 28 Input a $s_{1} \in \mathbb{B}^{12}$, output as $s_{2} \in \mathbb{B}^{6}$.

$$
\text { limit_limit }:=a s_{1}[5,0] \vee\left(\operatorname{ortree}\left(a s_{1}[11,6]\right)\right)^{6}
$$

Lemma 54 (Limit Limit Correct) Let $B=63$. For all as $s_{1} \in \mathbb{B}^{12}$ :

$$
\left\langle a s_{2}\right\rangle=\min \left(\left\langle a s_{1}\right\rangle, B\right)
$$



Figure 5.5: Significand and sign bit swapping

Proof. First, we assume $\left\langle a s_{1}\right\rangle>B$. By lemma 6 , we have $a s_{1}[11,6] \neq \mathbf{0}^{6}$, and by lemma 43, $\operatorname{ortree}\left(a s_{1}[11,6]\right)=1$. Hence,

$$
\left\langle a s_{2}\right\rangle=\left\langle\mathbf{1}^{6}\right\rangle=B=\min \left(\left\langle a s_{1}\right\rangle, B\right) .
$$

Where $\left\langle a s_{1}\right\rangle \leq B, a s_{1}[11,6]=\mathbf{0}^{6}$, and $\operatorname{ortree}\left(a s_{1}[11,6]\right)=\mathbf{0}$. We finish with

$$
\left\langle a s_{2}\right\rangle=\left\langle a s_{1}[5,0]\right\rangle=\left\langle a s_{1}\right\rangle=\min \left(\left\langle a s_{1}\right\rangle, B\right)
$$

Finally, we combine the two circuits and lemmas:

Circuit 29 Inputs as $\in \mathbb{B}^{12}$, $e_{b>a} \in \mathbb{B}$, output as $s_{2} \in \mathbb{B}^{6}$.

$$
\text { limit }:=\text { limit_limit } \circ \text { limit_approx. }
$$

Here, o is function concatenation.

Lemma 55 (Limit Correct) For all as $\in \mathbb{B}^{12}$, $e_{b>a} \in \mathbb{B}$, let $e_{b>a}=([a s]<0)$ :

$$
\left\langle a s_{2}\right\rangle=\min \left(|[a s]|-e_{b>a}, B\right)
$$

Proof. A direct consequence of lemmas 53 and 54.

### 5.6.4 Significand Swapping

If exponent $e_{b}$ is greater than $e_{a}$, signal $e_{b>a}$ indicates that the significands $f_{a}$ and $f_{b}$, and the sign bits $s_{a}$ and $s_{b}$ have to be swapped (figure 5.5). This is realized by multiplexers. Because of the error in the shift amount introduced by the limit circuit, $f_{b 2}$ is shifted one digit to the right here when $e_{b>a}$ is active $\left(\mathbf{0} \circ f_{a}\right.$ vs. $f_{b} \circ \mathbf{0}$ in the $f_{b 2}$ definition).

Circuit 30 Inputs $s_{a}, s_{b}^{\prime}, e_{b>a} \in \mathbb{B}, f_{a}, f_{b} \in \mathbb{B}^{53}$, outputs $s_{a 2}, s_{b 2} \in \mathbb{B}, f_{a 2} \in \mathbb{B}^{53}$, $f_{b 2} \in \mathbb{B}^{55}$.

$$
\begin{array}{ll}
s_{a 2}:= \begin{cases}s_{b}^{\prime} & \text { if } e_{b>a} \\
s a & \text { else },\end{cases} & s_{b 2}:= \begin{cases}s a & \text { if } e_{b>a} \\
s_{b}^{\prime} & \text { else, }\end{cases} \\
f_{a 2}:= \begin{cases}f b & \text { if } e_{b>a} \\
f a & \text { else, }\end{cases} & f_{b 2}:=\left\{\begin{array}{ll}
\mathbf{0} \circ f_{a} & \text { if } e_{b>a} \\
f_{b} \circ \mathbf{0} & \text { else }
\end{array}\right\} \circ \mathbf{0} .
\end{array}
$$



Figure 5.6: Shifting $f_{b 2}$ by $\left\langle a s_{2}\right\rangle$ bits to the right
Lemma 56 (Significand Swap Correct) For all $f_{a}, f_{b} \in \mathbb{B}^{53}$, $e_{b>a} \in \mathbb{B}$ :

$$
\begin{aligned}
s_{a 2} & = \begin{cases}s_{b}^{\prime} & \text { if } e_{b>a} \\
s a & \text { else },\end{cases} \\
s_{b 2} & = \begin{cases}s a & \text { if } e_{b>a} \\
s_{b}^{\prime} & \text { else },\end{cases} \\
\left\langle f_{a 2}\right\rangle & =\left\{\begin{array}{lll}
\langle f b\rangle & \text { if } e_{b>a} \\
\langle f a\rangle & \text { else }, & \left\langle f_{b 2}\right\rangle
\end{array}\right.
\end{aligned}
$$

Proof. The proof is trivial by lemma 4.

### 5.6.5 Alignment Shift and Sticky Bit Computation

The alignment shifter shifts the significand $f_{b 2}$ to the right, collecting the shifted-out bits in the sticky bit (figure 5.6). We split this task into a logical right shifter lrs and the sticky bit computation stickybit (figure 5.7).

The sticky bit is the logical OR of the lower $a s_{2}$ bits of $f_{b 2}$. The lower bits are selected by an AND mask driven by a halfdecoder.

Circuit 31 Inputs $f_{b 2} \in \mathbb{B}^{53}$, as $s_{2} \in \mathbb{B}^{6}$, output stickybit $\in \mathbb{B}$.

$$
\text { stickybit }:=\operatorname{ortree}\left(h d e c\left(a s_{2}\right)[54,0] \wedge f_{b 2}\right) .
$$

This definition conforms with the definition of sticky in section 4.3:
Lemma 57 For all as $s_{2} \in \mathbb{B}^{6}, 0<\left\langle a s_{2}\right\rangle<55, f_{b 2} \in \mathbb{B}^{55}$ :

$$
\operatorname{stickybit}\left(a s_{2}, f_{b 2}\right)=\operatorname{sticky}_{\left\langle a s_{2}\right\rangle}\left(f_{b 2}\right) .
$$

Proof. After expanding stickybit and sticky, and applying the correctness lemmas of ortree (lemma 43) and hdec (lemma 30), it remains to show

$$
\left(\left(\mathbf{0}^{2^{6}-\left\langle a s_{2}\right\rangle} \circ \mathbf{1}^{\left\langle a s_{2}\right\rangle}\right)[54,0] \wedge f_{b 2}\right) \neq \mathbf{0}^{55} \Longleftrightarrow f_{b 2}\left[\left\langle a s_{2}\right\rangle-1,0\right] \neq \mathbf{0}^{\left\langle a s_{2}\right\rangle} .
$$

This is straightforward in PVS by expanding the definitions of $\circ$ and $[\cdot, \cdot]$.
We now show that the sticky bit is computed correctly for $\left\langle a s_{2}\right\rangle=0$ and $\left\langle a s_{2}\right\rangle=55$, and that all $\left\langle a s_{2}\right\rangle \geq 55$ are equivalent to the latter case.


Figure 5.7: Sticky bit computation

Lemma 58 Let $B_{55}$ be the unique bitvector of width 6 with $\left\langle B_{55}\right\rangle=55$. For $a s_{2} \in \mathbb{B}^{6}, f_{b 2} \in \mathbb{B}^{55}$ :

1. $\left\langle a s_{2}\right\rangle=0 \Longrightarrow \operatorname{stickybit}\left(a s_{2}, f_{b 2}\right)=\mathbf{0}$,
2. stickybit $\left(B_{55}, f_{b 2}\right) \Longleftrightarrow f_{b 2} \neq \mathbf{0}^{55}$,
3. $\left\langle a s_{2}\right\rangle \geq 55 \Longrightarrow \operatorname{stickybit}\left(a s_{2}, f_{b 2}\right)=\operatorname{stickybit}\left(B_{55}, f_{b 2}\right)$.

Proof. The proof is similar to the proof of lemma 57; after applying the correctness lemmas of ortree and $h d e c$, PVS is able to prove the statements by expanding $\circ$ and $[\cdot, \cdot]$.

The representative $f_{b 3}$ of $f_{b 2}$ is computed by the bitvector concatenation of a logical right shifter and stickybit (see figure 5.2):

Circuit 32 Inputs $f_{b 2} \in \mathbb{B}^{55}$, as $s_{2} \in \mathbb{B}^{6}$, output $f_{b 3} \in \mathbb{B}^{56}$.

$$
f_{b 3}:=\operatorname{lrs}\left(f_{b 2}, a s_{2}\right) \circ \operatorname{stickybit}\left(a s_{2}, f_{b 2}\right)
$$

Lemma 59 (Sticky Shift CORRECT) For all $a s_{2} \in \mathbb{B}^{6}, f_{b 2} \in \mathbb{B}^{55}$ :

$$
\left\langle f_{b 3}\right\rangle=\left[\left\langle f_{b 2}\right\rangle \cdot 2^{-\left\langle a s_{2}\right\rangle-54}\right]_{-54} \cdot 2^{55}
$$

Proof. We split the proof in three cases:

1. $\left\langle a s_{2}\right\rangle=0$ : In this case, lrs does not modify $f_{b 2}$, and stickybit $=\mathbf{0}$ by lemma 58 . We show

$$
\left\langle f_{b 2} \circ \mathbf{0}\right\rangle=\left[\left\langle f_{b 2}\right\rangle \cdot 2^{-54}\right]_{-54} \cdot 2^{55} .
$$

Since $\left\langle f_{b 2}\right\rangle$ is integer, $\left[\left\langle f_{b 2}\right\rangle \cdot 2^{-54}\right]_{-54}=\left\langle f_{b 2}\right\rangle \cdot 2^{-54}$ by the definition of $\alpha$-representatives. On the left side oft the equation, we use lemma 4.

$$
\left\langle f_{b 2}\right\rangle \cdot 2=\left\langle f_{b 2}\right\rangle \cdot 2^{-54} \cdot 2^{55}
$$

2. $0<\left\langle a s_{2}\right\rangle<55$ : After applying lemma 57, we have

$$
\left\langle\operatorname{lrs}\left(f_{b 2}, a s_{2}\right) \circ \operatorname{sticky}_{\left\langle a s_{2}\right\rangle}\left(f_{b 2}\right)\right\rangle=\left[\left\langle f_{b 2}\right\rangle \cdot 2^{-\left\langle a s_{2}\right\rangle-54}\right]_{-54} \cdot 2^{55} .
$$

We apply the correctness of lrs (lemma 42):

$$
\left\langle\mathbf{0}^{\left\langle a s_{2}\right\rangle} \circ f_{b 2}\left[54,\left\langle a s_{2}\right\rangle\right] \circ \text { sticky }_{\left\langle a s_{2}\right\rangle}\left(f_{b 2}\right)\right\rangle=\left[\left\langle f_{b 2}\right\rangle \cdot 2^{-\left\langle a s_{2}\right\rangle-54}\right]_{-54} \cdot 2^{55} .
$$

Lemma 4 yields

$$
\left\langle f_{b 2}\left[54,\left\langle a s_{2}\right\rangle\right] \circ \text { stick } y_{\left\langle a s_{2}\right\rangle}\left(f_{b 2}\right)\right\rangle=\left[\left\langle f_{b 2}\right\rangle \cdot 2^{-\left\langle a s_{2}\right\rangle-54}\right]_{-54} \cdot 2^{55} .
$$

By lemma 46.5 we can scale the $\alpha$-representative by $2^{-\left\langle a s_{2}\right\rangle-54}$, yielding

$$
\left\langle f_{b 2}\left[54,\left\langle a s_{2}\right\rangle\right] \circ \text { sticky } y_{\left\langle a s_{2}\right\rangle}\left(f_{b 2}\right)\right\rangle=\left[\left\langle f_{b 2}\right\rangle\right]_{\left\langle a s_{2}\right\rangle} \cdot 2^{1-\left\langle a s_{2}\right\rangle} .
$$

This is true by theorem 5 .
3. $\left\langle a s_{2}\right\rangle \geq 55$ : Using lemma 58 , we have to show

$$
\left\langle l r s\left(f_{b 2}, a s_{2}\right) \circ\left(f_{b 2} \neq \mathbf{0}^{55}\right)\right\rangle=\left[\left\langle f_{b 2}\right\rangle \cdot 2^{-\left\langle a s_{2}\right\rangle-54}\right]_{-54} \cdot 2^{55} .
$$

$f_{b 2}$ is shifted out completely by lrs (lemma 42):

$$
\left\langle\mathbf{0}^{55} \circ\left(f_{b 2} \neq \mathbf{0}^{55}\right)\right\rangle=\left[\left\langle f_{b 2}\right\rangle \cdot 2^{-\left\langle a s_{2}\right\rangle-54}\right]_{-54} \cdot 2^{55} .
$$

Lemma 4 yields

$$
\left\langle f_{b 2} \neq \mathbf{0}^{55}\right\rangle=\left[\left\langle f_{b 2}\right\rangle \cdot 2^{-\left\langle a s_{2}\right\rangle-54}\right]_{-54} \cdot 2^{55} .
$$

In case where $f_{b 2}=\mathbf{0}^{55}$, this is trivial by lemma 46.8. For $f_{b 2} \neq \mathbf{0}^{55}$, we use lemma 46.9,

$$
1=2^{-55} \cdot 2^{55}
$$

The application of lemma 46.9 is valid for $\left\langle f_{b 2}\right\rangle \cdot 2^{-\left\langle a s_{2}\right\rangle-54}<2^{-54}$, which is true.

### 5.6.6 Alignment Shifter Correctness

The alignment shifter consists of circuits 25 to 32 . This is also the definition of stage 2 .
Circuit 33 Inputs $\left(s_{a}, e_{a}, f_{a}\right),\left(s_{b}^{\prime}, e_{b}, f_{b}\right) \in \mathbb{I}$, signals $e_{b>a} \in B$, as $\in \mathbb{B}^{12}$, as $s_{2} \in \mathbb{B}^{6}, f_{b 2} \in \mathbb{B}^{55}$, outputs $\left(s_{a 2}, e_{s}, f_{a 2}\right) \in \mathbb{I},\left(s_{b 2}, e_{s}, f_{b 3}\right) \in \mathbb{I}_{56}, s x \in \mathbb{B}$. Let the wires be connected as defined in circuits 25 to 32 .

We now can prove that the sum of the outputs of the alignment shifter is $\alpha$-equivalent to the sum of the inputs.

Lemma 60 (Align Shift Correct) For all semi-representable IEEE factorings $a:=\left(s_{a}, e_{a}\right.$, $\left.f_{a}\right), b^{\prime}:=\left(s_{b}^{\prime}, e_{b}, f_{b}\right) \in \mathbb{I}:$ Let $S:=\llbracket a \rrbracket+\llbracket b^{\rrbracket} \rrbracket \neq 0, \hat{e}:=\hat{\eta}_{e}(S)$.

$$
S \equiv_{\hat{e}-53} 2^{\left[e_{s}\right]} \cdot\left((-1)^{s_{a 2}} \cdot \frac{\left\langle f_{a 2}\right\rangle}{2^{52}}+(-1)^{s_{b 2}} \cdot \frac{\left\langle f_{b 3}\right\rangle}{2^{55}}\right) .
$$

Proof. We apply lemma 59 for $f_{b 3}$ :

$$
S \equiv_{\hat{e}-53} 2^{\left[e_{s}\right]} \cdot\left((-1)^{s_{a 2}} \cdot \frac{\left\langle f_{a 2}\right\rangle}{2^{52}}+(-1)^{s_{b 2}} \cdot\left[\left\langle f_{b 2}\right\rangle \cdot 2^{-\left\langle a s_{2}\right\rangle-54}\right]_{-54}\right)
$$

We distinguish two cases, depending on the signal $e_{b>a}$ :

1. $\neg e_{b>a}$ : The significand swap is correct (lemma 56 for $s_{a 2}, f_{a 2}, s_{b 2}$, and $f_{b 2}$ ):

$$
S \equiv_{\hat{e}-53} 2^{\left[e_{s}\right]} \cdot\left((-1)^{s_{a}} \cdot \frac{\left\langle f_{a}\right\rangle}{2^{52}}+(-1)^{s_{b}^{\prime}} \cdot\left[4 \cdot\left\langle f_{b}\right\rangle \cdot 2^{-\left\langle a s_{2}\right\rangle-54}\right]_{-54}\right) .
$$

Lemmas $52\left(e_{s}\right)$ and $55\left(a s_{2}\right)$ yield

$$
S \equiv_{\hat{e}-53} 2^{\left[e_{a}\right]} \cdot\left((-1)^{s_{a}} \cdot \frac{\left\langle f_{a}\right\rangle}{2^{52}}+(-1)^{s_{b}^{\prime}} \cdot\left[4 \cdot \frac{\left\langle f_{b}\right\rangle}{2^{54}} \cdot 2^{-\min ([a s], B)}\right]_{-54}\right) .
$$

By lemma 50 (as), we conclude

$$
S \equiv_{\hat{e}-53} 2^{\left[e_{a}\right]} \cdot\left((-1)^{s_{a}} \cdot \frac{\left\langle f_{a}\right\rangle}{2^{52}}+(-1)^{s_{b}^{\prime}} \cdot\left[\frac{\left\langle f_{b}\right\rangle}{2^{52}} \cdot 2^{-\min \left(\left[e_{a}\right]-\left[e_{b}\right], B\right)}\right]_{-54}\right)
$$

By lemma 61 below, this is equivalent to

$$
S \equiv_{\hat{e}-53} 2^{\left[e_{a}\right]} \cdot\left((-1)^{s_{a}} \cdot \frac{\left\langle f_{a}\right\rangle}{2^{52}}+(-1)^{s_{b}^{\prime}} \cdot\left[\frac{\left\langle f_{b}\right\rangle}{2^{52}} \cdot 2^{-\left(\left[e_{a}\right]-\left[e_{b}\right]\right)}\right]_{-54}\right)
$$

This is true by theorem 8 .
2. $e_{b>a}$ : The significand swap (lemma 56 for $s_{a 2}, f_{a 2}, s_{b 2}$, and $f_{b 2}$ ) yields

$$
S \equiv_{\hat{e}-53} 2^{\left[e_{s}\right]} \cdot\left((-1)^{s_{b^{\prime}}} \cdot \frac{\left\langle f_{b}\right\rangle}{2^{52}}+(-1)^{s_{a}} \cdot\left[2 \cdot\left\langle f_{a}\right\rangle \cdot 2^{-\left\langle a s_{2}\right\rangle-54}\right]_{-54}\right)
$$

We use lemmas $52\left(e_{s}\right)$ and $55\left(a s_{2}\right)$ :

$$
S \equiv_{\hat{e}-53} 2^{\left[e_{b}\right]} \cdot\left((-1)^{s_{b}^{\prime}} \cdot \frac{\left\langle f_{b}\right\rangle}{2^{52}}+(-1)^{s_{a}} \cdot\left[2 \cdot \frac{\left\langle f_{a}\right\rangle}{2^{54}} \cdot 2^{-\min (-[a s]-1, B)}\right]_{-54}\right) .
$$

Lemma 50 (as) yields

$$
S \equiv_{\hat{e}-53} 2^{\left[e_{b}\right]} \cdot\left((-1)^{s_{b}^{\prime}} \cdot \frac{\left\langle f_{b}\right\rangle}{2^{52}}+(-1)^{s_{a}} \cdot\left[\frac{\left\langle f_{a}\right\rangle}{2^{53}} \cdot 2^{-\min \left(-\left(\left[e_{a}\right]-\left[e_{b}\right]\right)-1, B\right)}\right]_{-54}\right) .
$$

By lemma 62 below, we have

$$
S \equiv_{\hat{e}-53} 2^{\left[e_{b}\right]} \cdot\left((-1)^{s_{b}^{\prime}} \cdot \frac{\left\langle f_{b}\right\rangle}{2^{52}}+(-1)^{s_{a}} \cdot\left[\frac{\left\langle f_{a}\right\rangle}{2^{52}} \cdot 2^{-\left(\left[e_{b}\right]-\left[e_{a}\right]\right)}\right]_{-54}\right)
$$

This is proved by interchanging $a$ and $b$ in theorem 8 .

Lemma 61 For all semi-representable IEEE factorings $a:=\left(s_{a}, e_{a}, f_{a}\right), b^{\prime}:=\left(s_{b}^{\prime}, e_{b}, f_{b}\right) \in \mathbb{I}$, where $e_{a} \geq e_{b}$, let $\delta:=\left[e_{a}\right]-\left[e_{b}\right]$ :

$$
\left[\frac{\left\langle f_{b}\right\rangle}{2^{52}} \cdot 2^{-\min (\delta, B)}\right]_{-54}=\left[\frac{\left\langle f_{b}\right\rangle}{2^{52}} \cdot 2^{-\delta}\right]_{-54 .}
$$



Figure 5.8: Significand addition SigAdd
Proof. By lemma 46.8, the claim is trivial for $f_{b}=0$. It is also trivial for $\delta \leq B$, where $\min (\delta, B)=\delta$. For $\delta \geq B$, both sides of the equation are equal to $2^{-55}$ by lemma 46.9 , and the lemma is proved.

Lemma 62 For all semi-representable IEEE factorings $a:=\left(s_{a}, e_{a}, f_{a}\right), b^{\prime}:=\left(s_{b}^{\prime}, e_{b}, f_{b}\right) \in \mathbb{I}$, where $e_{a}<e_{b}$, let $\delta:=\left[e_{a}\right]-\left[e_{b}\right]$ :

$$
\left[\frac{\left\langle f_{a}\right\rangle}{2^{53}} \cdot 2^{-\min (-\delta-1, B)}\right]_{-54}=\left[\frac{\left\langle f_{a}\right\rangle}{2^{52}} \cdot 2^{-\delta}\right]_{-54}
$$

Proof. Analogous to the proof of lemma 61.

Theorem 10 (Stage 2 Correct) For all semi-representable IEEE factorings $a:=\left(s_{a}, e_{a}, f_{a}\right)$, $b^{\prime}:=\left(s_{b}^{\prime}, e_{b}, f_{b}\right) \in \mathbb{I}:$ Let $S:=\llbracket a \rrbracket+\llbracket b^{\prime} \rrbracket \neq 0, \hat{e}:=\hat{\eta}_{e}(S)$.

$$
S \equiv_{\hat{e}-53} \llbracket s_{a 2}, e_{s}, f_{a 2} \rrbracket+\llbracket s_{b 2}, e_{s}, f_{b 3} \rrbracket .
$$

Proof. Immediate consequence of lemma 60 and the definition of $\llbracket \cdot \rrbracket$.

### 5.7 Stage 3: Significand Addition

The third stage adds the significands computed by stage 2 . Depending on $s x$, we have to add or subtract the significands.

Circuit 34 Inputs $\left(s_{a 2}, \cdot, f_{a 2}\right) \in \mathbb{I},\left(s_{b 2}, \cdot, f_{b 3}\right) \in \mathbb{I}_{56}$, sx $\in \mathbb{B}$, output $\left(s_{s}, \cdot, f_{s}\right) \in \mathbb{I}_{57}$. Let $a d d \_s u b:=a d d \_s u b\left(\mathbf{0}^{2} \circ f_{a 2} \circ \mathbf{0}^{3}, \mathbf{0}^{2} \circ f_{b 3}, s x\right)$.

$$
\begin{aligned}
s_{s} & :=s_{a 2} \oplus a d d_{\_} s u b_{n e g} \\
f_{s} & :=a b s\left(a d d_{s} s u b_{s}\right) .
\end{aligned}
$$

Lemma 63 (Sig Add Correct) For all $\left(s_{a 2}, \cdot, f_{a 2}\right) \in \mathbb{I}$, $\left(s_{b 2}, \cdot, f_{b 3}\right) \in \mathbb{I}_{56}, s x \in \mathbb{B}$, sx $=$ $s_{a 2} \oplus s_{b 2}$. Let sum $:=\llbracket s_{a 2}, 0, f_{a 2} \rrbracket+\llbracket s_{b 2}, 0, f_{b 3} \rrbracket$.

$$
\text { sum }=\llbracket s_{s}, 0, f_{s} \rrbracket_{2}
$$

Proof. We show

$$
\text { sum }=(-1)^{s_{a 2} \oplus a d d \_s u b_{n e g}} \cdot \frac{\left\langle a b s\left(a d d_{\_} s u b_{s}\right)\right\rangle}{2^{55}} .
$$

Using the correctness of the $a b s$ circuit (lemma 26) yields

$$
\text { sum }=(-1)^{s_{a 2} \oplus a d d_{-} s u b_{n e g}} \cdot \frac{\left|\left[a d d_{\_} s u b_{s}\right]\right|}{2^{55}} .
$$

In the case $s_{a 2}=s_{b 2}=\mathbf{0}$, we have $s x=\mathbf{0}$ by the assumption on $s x$. The outputs $a d d_{\_} s u b_{s}$ and $a d d_{\_} s u b_{n e g}$ are correct by lemma 25 . In this case,

$$
\text { sum }=(-1)^{0 \oplus\left(\left[\mathbf{0}^{2} \circ f_{a 2} \circ \mathbf{0}^{3}\right]+\left[\mathbf{0}^{2} \circ f_{b 3}\right]<0\right)} . \frac{\left|\left[\mathbf{0}^{2} \circ f_{a 2} \circ \mathbf{0}^{3}\right]+\left[\mathbf{0}^{2} \circ f_{b 3}\right]\right|}{2^{55}} .
$$

We rewrite with lemmas 10 and 4:

$$
\frac{\left\langle f_{a 2}\right\rangle}{2^{52}}+\frac{\left\langle f_{b 3}\right\rangle}{2^{55}}=(-1)^{\left(8 \cdot\left\langle f_{a 2}\right\rangle+\left\langle f_{b 3}\right\rangle<0\right)} \cdot \frac{\left|8 \cdot\left\langle f_{a 2}\right\rangle+\left\langle f_{b 3}\right\rangle\right|}{2^{55}}
$$

The right side reduces to

$$
\frac{\left\langle f_{a 2}\right\rangle}{2^{52}}+\frac{\left\langle f_{b 3}\right\rangle}{2^{55}}=\frac{8 \cdot\left\langle f_{a 2}\right\rangle+\left\langle f_{b 3}\right\rangle}{2^{55}} .
$$

The other cases of $s_{a 2}$ and $s_{b 2}$ are resolved analogously.
It remains to show that the sum of the inputs is representable in the width of the $a d d$ sub output, and that the $a d d_{\_} s u b$ output is a valid input for $a b s$. Both statements are proved by using the fact that the range of the inputs is limited by the leading zeros fed into $a d d$ sub.

Stage_3 consists of the significand addition circuit and an additional wire for passing $e_{s}$ (see figure 5.1).

Theorem 11 (Stage 3 Correct) For all $\left(s_{a 2}, e_{s}, f_{a 2}\right) \in \mathbb{I}$, $\left(s_{b 2}, e_{s}, f_{b 3}\right) \in \mathbb{I}_{56}$, sx $\in \mathbb{B}$, where $s x=s_{a 2} \oplus s_{b 2}$, let sum $:=\llbracket s_{a 2}, e_{s}, f_{a 2} \rrbracket+\llbracket s_{b 2}, e_{s}, f_{b 3} \rrbracket$ :

$$
\text { sum }=\llbracket s_{s}, e_{s}, f_{s} \rrbracket_{2} .
$$

Proof. By multiplication with $2^{e_{s}}$ in lemma 63.

### 5.8 Putting It All Together

We now combine the above defined stages.

Circuit 35 Inputs $\left(s_{a}, e_{a}, f_{a}\right),\left(s_{b}, e_{b}, f_{b}\right) \in \mathbb{I}$, sub $\in B$, output $\left(s_{s}, e_{s}, f_{s}\right) \in \mathbb{I}_{57}$.

$$
\text { fpadder }:=\text { stage_3 ○ stage_2 ○ stage_1. }
$$

Here, o is function concatenation, not bitvector concatenation.

In section 5.1, we argued that the floating point adder is correct if its output is $(\hat{e}-P)$-equivalent to the correct result, where $P=53$. This is proved in the next theorem. Since the significand adder output $f_{s}$ has two bits in front of the binary point, we use $\llbracket \cdot \rrbracket_{2}$.

Theorem 12 (FP ADDER CORRECT) For all IEEE factorings $a:=\left(s_{a}, e_{a}, f_{a}\right), b:=\left(s_{b}, e_{b}\right.$, $\left.f_{b}\right) \in \mathbb{I}$ and subtract bits sub $\in \mathbb{B}$, let $S=\llbracket a \rrbracket \pm_{\text {sub }} \llbracket b \rrbracket$, and $\hat{e}=\hat{\eta}_{e}(S)$, where $S \neq 0$. Then the following equation holds for the output $\left(s_{s}, e_{s}, f_{s}\right) \in \mathbb{I}_{57}$ of fpadder:

$$
S \equiv_{\hat{e}-P} \llbracket s_{s}, e_{s}, f_{s} \rrbracket_{2}
$$

Proof. We start with

$$
S=\llbracket s_{a}, e_{a}, f_{a} \rrbracket \pm_{s u b} \llbracket s_{b}, e_{b}, f_{b} \rrbracket
$$

Theorem 9 yields

$$
S=\llbracket s_{a}, e_{a}, f_{a} \rrbracket+\llbracket s_{b}^{\prime}, e_{b}, f_{b} \rrbracket .
$$

By theorem 10, we have

$$
S \equiv_{\hat{e}-P} \llbracket s_{a 2}, e_{s}, f_{a 2} \rrbracket+\llbracket s_{b 2}, e_{s}, f_{b 3} \rrbracket .
$$

By theorem 11, this is

$$
S \equiv \equiv_{\hat{e}-P} \llbracket s_{s}, e_{s}, f_{s} \rrbracket_{2}
$$

Therefore, the VAMP floating point adder is correct.
Theorem 12 is an excellent example on how the task of "putting it all together" can be handled in a precise way by application of the specifications of the modules-if the modules have a well defined behaviour.

### 5.9 Boundary Constraints

It remains to prove the additional rounder input requirements from section 5.1.

$$
e_{s} \leq e_{\max }
$$

holds trivially, since $e_{s}$ is chosen from $\left\{e_{a}, e_{b}\right\}$ and $e_{\max }$ is the maximum value in $T_{N}$. (We prove this even if the significand is not denormal.)

To show that we stay in the range that is scalable into the range of representable numbers by wrapped exponent (section 4.4), we show

$$
2^{e_{\min }-A}<|\llbracket s \rrbracket|<2^{e_{\max }+A}
$$

Proof. For the lower bound, we first notice that $\llbracket s \rrbracket \neq 0$, since $S \neq 0$ by assumption, $S \equiv_{\alpha} \llbracket s \rrbracket$ by theorem 12 , and $\llbracket s \rrbracket=0 \Longleftrightarrow \llbracket s \rrbracket \equiv_{\alpha} 0$ by lemma 46.8. The smallest positive representable number is $2^{e_{\text {min }}-P+1}$. We finish this case by observing $A \gg P$, and hence,

$$
2^{e_{m i n}-A}<2^{e_{m i n}-P+1} \leq|\llbracket s \rrbracket| .
$$

For the upper bound, we start with theorem 12:

$$
\llbracket s \rrbracket \equiv \hat{e}-P \llbracket a \rrbracket \pm_{s u b} \llbracket b \rrbracket .
$$

This implies

$$
|\llbracket s \rrbracket| \equiv_{\hat{e}-P}\left|\llbracket a \rrbracket \pm_{s u b} \llbracket b \rrbracket\right| .
$$

We conclude

$$
|\llbracket s \rrbracket|<\left|\llbracket a \rrbracket \pm_{s u b} \llbracket b \rrbracket\right|+2^{\hat{e}-P} .
$$

Using the triangle inequality, this is

$$
|\llbracket s \rrbracket|<|\llbracket a \rrbracket|+|\llbracket b \rrbracket|+2^{\hat{e}-P} .
$$

$a$ and $b$ are bounded: $\llbracket a \rrbracket, \llbracket b \rrbracket<2^{e_{\max }+1}$, therefore

$$
|\llbracket s \rrbracket|<2^{e_{\max }+1}+2^{e_{\max }+1}+2^{\hat{e}-P}
$$

The claim is proved with

$$
|\llbracket s \rrbracket|<2^{e_{\max }+A}
$$

The VAMP floating point adder meets the VAMP rounder specification.

## Chapter 6

## Summary

### 6.1 The VAMP Project

Our group at Saarland University is verifying the correctness of the VAMP microprocessor. The VAMP is a RISC processor based on the DLX architecture [HP96, MP00]. The VAMP features a five stage pipeline, a Tomasulo scheduler, precise and nested interrupts, delayed branch, and a fully IEEE compliant floating point unit [JK00, Krö01, BJ01, Jac01b]. The correctness of the circuits is proved using the PVS theorem prover [OSR92]. The floating point adder that is formally verified in this thesis is part of the VAMP FPU.

The verified VAMP processor is being implemented on a Xilinx FPGA [BJKL01]. Our group is porting the $g c c$ compiler and the GNU C library to the VAMP architecture to yield an environment suitable to evaluating the verified hardware.

All PVS sources-specification, implementation, and proof scripts-and the Verilog hardware descriptions are publically available at our web site [VAM].

To the best of our knowledge, this is the first time that a complete floating point unit has been formally verified on the gate level, and all proofs and designs have been published.

### 6.2 Bugs

'Bugs' are either differences between a specification and corresponding implementation, or flaws in the specification itself. There are several ways to find bugs: proof-reading specification and implementation, testing, proofs, or combinations of these. The purpose of formal verification is to find bugs that would otherwise go unnoticed. Formal verification gives-to some extend-the confidence that the implementation adheres to its specification.

Two problems remain, however. On the one hand, the consistency of the specification itself cannot be asserted entirely. The specification may be inconsistent, which will probably be found within the verification process because of inherent errors, or the specification may be consistent, but be something else than what we wanted to formalize. We can only hope that the provably correct theorem 'the implementation is correct for the specification' is the same as 'the implementation is exactly what we intended to build'.

Furthermore, the verification tool-PVS in our case-could be unsound, meaning that it is possible to prove statements that are not true in the mathematical sense. There are some known flaws in PVS, and it is even possible to prove a theorem stating 'false' by exploiting these flaws. We tried not to use any faulty proof commands, although it is possible-though unlikely-that we accidentally got trapped by bugs in PVS that have not yet been discovered.

Bugs found. One major bug in the design of the floating point adder was found in the verification. In [MP00], the sign bit $s_{s}$ (called $s_{s 1}$ in the book) is wrongly computed as

$$
s_{s}:=\left(s_{b}^{\prime} \wedge n e g\right) \vee\left(s_{a} \wedge \neg\left(s_{b}^{\prime} \wedge n e g\right)\right)
$$

This error results from an erroneous entry in table 8.2 (page 370 ), where line 7 is marked 'impossible'. The proper entry is 'impossible' in line 8 , and ' 0 ' in line 7 . This change results in the correct equation for the sign bit:

$$
s_{s}:=s_{a} \oplus n e g
$$

Other bugs where incorrect bitvector subscripts and exponents in powers of 2 . Since these can be considered to be mere typos, we do not list them here.

No serious bugs in the PVS logic were found, i.e., bugs rendering PVS unsound. Several minor flaws in the PVS system were reported to SRI [PVS].

### 6.3 Related Work

Basic circuits. The verification of a simple adder and an arithmetic logic unit using PVS is reported in [CRSS94]. The PVS bitvectors library [ $\mathrm{BMS}^{+} 96$ ] includes a verified carry chain adder. The verification of an adder using various verification systems is described in [SBE88]. In [CB96], Bryant verifies fixed size arithmetic circuits against a mathematical specification.

Given a reference design and assuming its correctness, it is state-of-the-art to automatically verify equivalence with a new design. There are several approaches to this, e.g., boolean equivalence checkers using BDDs or variations [Bry86, BC95, CFZ95]. In [JLMC97], Clarke et al. use function abstraction and BDDs for equivalence checking. In [Sta99], Stanion proves the equivalence of two fixed bit width multipliers.

IEEE standard. Other formalizations of the IEEE standard in theorem proving systems have been given by Miner [Min95] and Harrison [Har99].

Floating point hardware. Aagaard and Seger combine BDD based methods and theorem proving techniques to verify a floating point multiplier [AS95]. Chen and Bryant [CB98] use wordlevel SMV to verify a floating point adder. Exceptions and denormals are not handled in both verification projects.

Cornea-Hasegan describes the computation of division and square root by Newton-Raphson iteration in the Intel IA-64 architecture [CH98, CHN99]. The verification is done using Mathematica. O'Leary et al. report on the verification of the gate level design of Intel's FPU using a combination of model-checking and theorem proving [OZGS99]. Denormals and exceptions are not covered in the paper.

Moore et al. have verified the AMD K5 division algorithm [MLK98] with the theorem prover ACL2. Russinoff has verified the K5 square root algorithm as well as the Athlon multiplication, division, square root, and addition algorithms [Rus98, Rus99, Rus00].

### 6.4 Prospect

There are several ways in which to extend the work on the VAMP architecture:

- Modern FPUs support a vast variety of operations, such as square root, trigonometric functions, and compound operations as $\frac{1}{1-x^{2}}$. One could verify functional units for these and incorporate them into the VAMP FPU.
- Similarly, modern CPUs support more precision modes.
- Efficiency was not a goal of the VAMP project. One could improve both throughput and latency of the verified functional units.
- In an analogous way, the above points apply to the VAMP integer core, pipeline, and control.
- Hardware verification is difficult, and even reusing proofs for similar circuits and modules requires much effort. A better framework for handling (hardware) proofs would be beneficial.
- We have verified the PVS hardware specifications, and translated these to Verilog using a tool that has not been verified. One could formally verify the Verilog specifications.


## Bibliography

[AS95] M. D. Aagaard and C.-J. H. Seger. The formal verification of a pipelined doubleprecision IEEE floating-point multiplier. In ICCAD, pages 7-10. IEEE, November 1995.
[BC95] R. E. Bryant and Y.-A. Chen. Verification of arithmetic circuits with binary moment diagrams. In 32nd ACM/IEEE Design Automation Conference, Pittsburgh, June 1995. Carnegie Mellon University.
[BJ01] Christoph Berg and Christian Jacobi. Formal verification of the VAMP floating point unit. To appear in CHARME 2001, 2001.
[BJK01] Christoph Berg, Christian Jacobi, and Daniel Kröning. Formal verification of a basic circuits library. In IASTED International Conference on Applied Informatics. ACTA Press, February 2001.
[BJKL01] Sven Beyer, Christian Jacobi, Daniel Kröning, and Dirk Leinenbach. Correct hardware by synthesis from PVS. Submitted to ICCD 2001, 2001.
[BMS $\left.{ }^{+} 96\right]$ Ricky W. Butler, Paul S. Miner, Mandayam K. Srivas, Dave A. Greve, and Steven P. Miller. A bitvectors library for PVS. Technical Report TM-110274, NASA Langley Research Center, 1996.
[Bry86] Bryant, R. E. Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers, C-35(8):677-691, August 1986.
[CB96] Y. Chen and R. Bryant. ACV: An arithmetic circuit verifier. In In Proc. of IEEE ICCD '96, pages 361-365. IEEE, 1996.
[CB98] Y.-A. Chen and R. E. Bryant. Verification of floating point adders. In CAV'98, volume 1427 of LNCS, 1998.
[CFZ95] E. M. Clarke, M. Fujita, and X. Zhao. Hybrid decision diagrams overcoming the limitations of MTBDDs and BMDs. In ICCAD, pages 159-163, Los Alamitos, Ca., USA, November 1995. IEEE Computer Society Press.
[CH98] Marius Cornea-Hasegan. Proving the IEEE correctness of iterative floating-point square root, divide, and remainder algorithms. Intel Technology Journal, Q2, 1998.
[CHN99] Marius Cornea-Hasegan and Bob Norin. IA-64 floating point operations and the IEEE standard for binary floating-point arithmetic. Intel Technology Journal, Q4, 1999.
[COR $\left.{ }^{+} 95\right]$ J. Crow, S. Owre, J. Rushby, N. Shankar, and M. Srivas. A tutorial introduction to PVS. In WIFT'95: Workshop on Industrial-Strength Formal Specification Techniques, 1995.
[CRSS94] D. Cyrluk, S. Rajan, N. Shankar, and M. K. Srivas. Effective theorem proving for hardware verification. In 2nd International Conference on Theorem Provers in Circuit Design, volume 901 of LNCS, pages 203-222. Springer, 1994.
[EP97] Guy Even and Wolfgang J. Paul. On the design of IEEE compliant floating point units. In Proceedings of the 13th Symposium on Computer Arithmetic. IEEE Computer Society Press, 1997.
[Gen35] G. Gentzen. Untersuchungen über das logische Schließen. In Mathematische Zeitschrift, volume 1, pages 176-210, 1935.
[Gol96] David Goldberg. Computer arithmetic, 1996. Appendix A in [HP96].
[Har99] John Harrison. A machine checked theory of floating point arithmetic. In TPHOL '99, volume 1690 of $L N C S$. Springer, 1999.
[HP96] J.L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo, CA, second edition, 1996.
[Ins85] Institute of Electrical and Electronics Engineers. ANSI/IEEE standard 754-1985, IEEE Standard for Binary Floating-Point Arithmetic, 1985.
[Jac01a] Christian Jacobi. A formally verified theory of IEEE rounding. Unpublished, available at http://www-wjp.cs.uni-sb.de/ ccj/ieee-lib.ps, 2001.
[Jac01b] Christian Jacobi. Formal Verification of an IEEE Compliant Floation Point Unit. PhD thesis, Saarland University, Computer Science Department, due in 2001.
[JK00] Christian Jacobi and Daniel Kröning. Proving the correctness of a complete microprocessor. In Proc. of the 30. Jahrestagung der Gesellschaft fir Informatik. Springer, 2000.
[JLMC97] Somesh Jha, Yuan Lu, Marius Minea, and Edmund M. Clarke. Equivalence checking using abstract BDDs. In Proc. of IEEE ICCD '98, pages 332-337. IEEE, 1997.
[KP97] Joerg Keller and Wolfgang J. Paul. Hardware Design, Formaler Entwurf digitaler Schaltungen, volume 15 of Teubner-Texte zur Informatik. Teubner, 1997.
[Krö01] Daniel Kröning. Formal Verification of Pipelined Microprocessors. PhD thesis, Saarland University, Computer Science Department, 2001.
[Min95] Paul S. Miner. Defining the IEEE-854 floating-point standard in PVS. Technical Report TM-110167, NASA Langley Research Center, 1995.
[MLK98] J Moore, T. Lynch, and M. Kaufmann. A mechanically checked proof of the AMD5K86 floating point division program. IEEE Transactions on Computers, 47(9):913-926, 1998.
[MP95] Silvia M. Mueller and Wolfgang J. Paul. The Complexity of Simple Computer Architectures. Springer, 1995.
[MP00] Silvia M. Mueller and Wolfgang J. Paul. Computer Architecture. Complexity and Correctness. Springer, 2000.
[OSR92] S. Owre, N. Shankar, and J. M. Rushby. PVS: A prototype verification system. In 11th International Conference on Automated Deduction (CADE), volume 607 of Lecture Notes in Artificial Intelligence, pages 748-752. Springer, 1992.
[OZGS99] J. O’Leary, X. Zhao, R. Gerth, and C.-J. H. Seger. Formally verifying IEEE compliance of floating-point hardware. Intel Technology Journal, Q1, 1999.
[Pra95] V. R. Pratt. Anatomy of the pentium bug. In TAPSOFT'95, volume 915, pages 97107, Aarhus, Denmark, 1995. Springer-Verlag.
[PVS] PVS bugs. http://pvs.csl.sri.com/cgi-bin/pvs/pvs-bug-list/, Bugs \#474, 529, 538, 551.
[Rus98] David M. Russinoff. A mechanically checked proof of IEEE compliance of the floating point multiplication, division and square root algorithms of the AMD-K7 processor. LMS Journal of Computation and Mathematics, 1:148-200, 1998.
[Rus99] David M. Russinoff. A mechanically checked proof of correctness of the AMD K5 floating point square root microcode. Formal Methods in System Design, 14(1):75125, January 1999.
[Rus00] David M. Russinoff. A case study in formal verification of register-transfer logic with ACL2: The floating point adder of the AMD Athlon processor. In FMCAD-00, volume 1954 of $L N C S$. Springer, 2000.
[SBE88] V. Stavridou, H. Barringer, and D.A. Edwards. Formal specification and verification of hardware: A comparative case study. In Proceedings of the 25th ACM/IEEE conference on Design Automation, pages 197-204, 1988.
[Sta99] Ted Stanion. Implicit verification of structurally dissimilar arithmetic circuits. In Proc. of IEEE ICCD '99, pages 46-50. IEEE, 1999.
[VAM] The VAMP processor homepage. http://www-wjp.cs.uni-sb.de/projects/verification/.
[Xil00] Xilinx, Inc. Virtex-E 1.8V Field Programmable Gate Arrays, Preliminary Product Specification, 2000.


[^0]:    ${ }^{1}$ Antecedent and consequent formulas are numbered using negative and positive integers, respectively.

[^1]:    ${ }^{2}$ Instantiating a formula means to replace a $\forall$ quantor over a variable in the antecedents (or equivalently, an $\exists$ quantor in the consequents) by a specific value of the proper type for the variable.

[^2]:    ${ }^{3}$ A possible approach would be to exploit the fact that gauss is equal to sum which is known to be natural.

[^3]:    ${ }^{1}$ The output $A d d_{c_{\text {out }}}$ is not used.

[^4]:    ${ }^{2}$ The case split is necessary to avoid bitvectors of zero length.

[^5]:    ${ }^{1} \eta_{e}(x)$ denotes the $e$-component of the factoring $\eta(x)=(s, e, f)$; analogous for other components and $\hat{\eta}$.

