#### Summary



A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

#### Electromigration

- high current density  $\longrightarrow$  movement of atoms in a conductor
- mean time to failure (MTTF) of a wire, Black's equation:

 $\mathsf{MTTF} = A \times J^{-n} \times e^{\frac{E_a}{kT}}$ 

A section, J current density,  $n \approx 2$  scale factor (cst),  $E_a$  activation energy (cst for a material), k Boltzmann's constant, T temperature

decreases IC reliability (permanent and intermittent failures)





Low-Power Arithmetic Operators

#### Arnaud Tisserand

CNRS, IRISA laboratory, CAIRN research team

ECOFAC 2014 Lorient, ENSIBS / UBS May 19th – 23th, 2014



Part I

#### Introduction

Motivations

#### Power Sources



Source: http://public.itrs.net/Links/2009ITRS/Home2009.htm

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

#### Cooling

Problems due to temperature:

• performance decreases with temperature

 $25 \text{ °C} \rightarrow 105 \text{ °C} \longrightarrow 30 \%$  performance reduction

• reliability decreases with temperature

IC temperature  $> 125 \,^{\circ}\text{C} \longrightarrow$  faults and characteristics damage

#### Solutions:

- reduce power consumption
- cool circuits (air, water,...)





A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

5/104

7/104

Electromagnetic emissions from a device or system (the culprit or attacker) that interfere with the normal operation of another device or system (the victim)

Electromagnetic Interferences (EMI)





thermography 80C51 MCU by Philips synchronous (left), asynchronous (right)

#### Electromagnetic compatibility (EMC):

- ability to avoid introducing intolerable electromagnetic disturbance
- circuit specific design rules

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

## Cooling in Data Centers

- cooling is a significant challenge
- < 50 % power for electronic equipments (30–40 % in some cases)
- problem: keeping the hardware cool and humidified



Source: J. Cho, T. Lim, B. S. Kim. *Measurements and predictions of the air distribution systems in high compute density (Internet) data centers.* Energy and Buildings, vol. 41, pp. 1107-1115, 2009

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

#### Power Consumption and Production in France (GW)



| date                | nuclear | coal | fuel oil | hydroelectric | total |
|---------------------|---------|------|----------|---------------|-------|
| Tue. 2009-01-05 20h | 57.5    | 6.8  | 2.2      | 14.2          | 80.9  |
| Sun. 2009-09-20 20h | 39.1    | 2.0  | 0.0      | 3.9           | 45.1  |
| Tue. 2009-09-22 20h | 42.5    | 4.8  | 0.0      | 4.3           | 51.7  |

Source: *RTE (Réseau de transport d'électricité)* http://www.rte-france.com/ A. Tisserand, CNRS-IRISA-CAIRN. *Low-Power Arithmetic Operators* 

MOS Transistor: N and P transistors



N transistors are made of:

- bulk (Si), P-type doping
- drain and source, N-type doping
- insulator
- gate or grid

W BULE gate 8 BIL A

In N-type doping area, the majority carriers are electrons (holes in a P-type area)

 $\mathsf{P}$  transistor: bulk is  $\mathsf{N}$  while source and drain are  $\mathsf{P}$  areas

#### Another Good Reason for Reducing Power Consumption

IL NE FAUT PAS CONFONDRE SURRÉGÉNÉRATEUR ET CENTRALE À COMIQUES ...



Idées Noires. Franquin. p. 10, Fluide Glacial, ISBN 2-85815-042-7

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

#### Fast Circuit Design: Basic Ideas

- $V_{\text{DD}} \nearrow \implies \text{speed} \nearrow$  but limited by the technology
- Transistor size :
  - $\blacktriangleright W \nearrow \implies \text{speed} \nearrow \text{GOOD}$
  - $L \nearrow \implies \text{speed} \searrow \text{BAD}$  Transistor Sizing
  - ▶ but  $W \nearrow \implies C \nearrow \implies$  speed  $\searrow$
- Topology
- Logic optimizations
- Place and route optimizations
- Algorithms, data coding...

#### Fanout

The gate delay (change output state) depends on the output load. Fanout measures this load as the number of inputs of gate connected to the output (normalized w.r.t. an inverter)





FO = 4

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

13/104

#### Power Consumption: Components

Delay [ns]

Power dissipation in CMOS circuits comes from 2 main components:

- static dissipation:
  - sub-threshold conduction through OFF transistors
  - leakage current through P-N junctions
  - tunneling current through gate oxide
  - •
- dynamic dissipation:
  - charging and discharging of load capacitances (useful + parasitic)
  - short-circuit current

$$P_{\text{total}} = P_{\text{static}} + P_{\text{dynamic}}$$

## Power Consumption: Basic Definitions

Instantaneous power:

$$P(t) = i_{DD}(t) V_{DD}$$

Energy over some time interval T:

$$E = \int_0^T i_{DD}(t) \, V_{\rm DD} \, dt$$

Average power over interval T:

$$P_{avg} = rac{E}{T} = rac{1}{T} \, \int_0^T i_{DD}(t) \, V_{
m DD} \, dt$$

Units:

- current A
- voltage V
- power W
- energy J or Wh

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

## Short-Circuit Current in CMOS Gates

Occurs when both N and P transistors are ON while the input switches



Solution : short transition (crisp edges)

#### Charging and Discharging Load Capacitances

#### Transitions

There are capacitances everywhere in the circuit: transistor gate, routing, parasitics...

CMOS gate routing m parasitic

#### Solutions:

- design small circuits (small transistor, short wires, technology shrinking)
- reduce the activity (algorithms, data coding, sleep mode)
- reduce  $V_{DD}$  (without lowering speed)

| A. Tisserand, CNRS-IRISA-CAIRN | . Low-Power Arithmetic Operators | 17/104 |
|--------------------------------|----------------------------------|--------|
|                                |                                  |        |

## Simple Power Consumption Model

Average dynamic power dissipation (no leakage, no short circuit):

$$P = \alpha \times C \times f \times V_{\rm DD}^2$$

#### where

- $\alpha$  is the activity factor
- *C* is the average switched capacitance (at each cycle)
- *f* is the circuit frequency
- $V_{\rm DD}$  is the supply voltage

Remark: the gate delay is  $d = \gamma \times \frac{C \times V_{\text{DD}}}{(V_{\text{DD}} - V_T)^2} \approx \frac{1}{V_{\text{DD}}}$ 

There are 2 kinds of transitions:

- useful transitions (data switching)
- redundant or parasitic transitions (imperfections)



A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators



#### Number systems

## Basic Number Systems

#### Exotic Number Systems

#### Positional Number System(s)

$$X = \sum_{i=-m}^{n-1} x_i \beta^i = (x_{n-1}x_{n-2}\cdots x_1x_0 \cdot x_{-1}x_{-2}\cdots x_{-m})$$

- radix  $\beta$  (usually a power of 2)
- digits  $x_i \ (\in \mathbb{N})$  in the digit set  $\mathcal{D}$
- rank or position *i*, weight  $\beta^i$
- *n* integer digits, *m* fractional digits

#### Examples:

- $\beta = 10, \mathcal{D} = \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$
- $\beta = 2, D = \{0, 1\}$
- carry save:  $\beta = 2, \mathcal{D}_{cs} = \{0, 1, 2\}$
- borrow save:  $\beta = 2, \mathcal{D}_{\mathrm{bs}} = \{-1, 0, 1\}$
- signed digits:  $\beta>2, \mathcal{D}_{\mathrm{sd},\alpha,\beta}=\{-\alpha,\ldots,\alpha\}$  with  $2\alpha+1\geq\beta$
- theoretical systems:  $\beta = \frac{1+\sqrt{5}}{2}$ ,  $\beta = 1 + i \dots$

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

#### 21/104

#### Radix-2 Signed Integers

• sign and magnitude (absolute value)

$$A = (s_a a_{n-2} \dots a_1 a_0) = (-1)^{s_a} \times \sum_{i=0}^{n-2} a_i 2^i$$

• 2's complement

$$A = (a_{n-1}a_{n-2} \dots a_1a_0) = -a_{n-1}2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$$

• biased (usually 
$$B = 2^{n-1} - 1$$
)

$$A = A_{math} + B$$

• ...

A. Tisserand, CNRS–IRISA–CAIRN. Low-Power Arithmetic Operators

22/104

## Fixed-Point Representations

Widely used in DSPs and digital integrated circuits for higher speed, lower silicon area and power consumption compared to floating point



Typical fixed-point formats: 16, 24, 32 and 48 bits

## Signed Integers

|         | representations |                |                 |  |  |  |  |  |
|---------|-----------------|----------------|-----------------|--|--|--|--|--|
| integer | sign/magnitude  | 2's complement | biased<br>(B=7) |  |  |  |  |  |
| -8      | _               | 1000           | _               |  |  |  |  |  |
| -7      | 1111            | 1001           | 0000            |  |  |  |  |  |
| -6      | 1110            | 1010           | 0001            |  |  |  |  |  |
| -5      | 1101            | 1011           | 0010            |  |  |  |  |  |
| -4      | 1100            | 1100           | 0011            |  |  |  |  |  |
| -3      | 1011            | 1101           | 0100            |  |  |  |  |  |
| -2      | 1010            | 1110           | 0101            |  |  |  |  |  |
| -1      | 1001            | 1111           | 0110            |  |  |  |  |  |
| 0       | 0000            | 0000           | 0111            |  |  |  |  |  |
| 1       | 0001            | 0001           | 1000            |  |  |  |  |  |
| 2       | 0010            | 0010           | 1001            |  |  |  |  |  |
| 3       | 0011            | 0011           | 1010            |  |  |  |  |  |
| 4       | 0100            | 0100           | 1011            |  |  |  |  |  |
| 5       | 0101            | 0101           | 1100            |  |  |  |  |  |
| 6       | 0110            | 0110           | 1101            |  |  |  |  |  |
| 7       | 0111            | 0111           | 1110            |  |  |  |  |  |
| 8       |                 |                | 1111            |  |  |  |  |  |

#### A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

#### Floating-Point Representation(s)

Radix- $\beta$  floating-point representation of x:

- sign  $s_x$ , 1-bit encoding:  $0 \Rightarrow x > 0$  and  $1 \Rightarrow x < 0$
- exponent  $e_x \in \mathbb{N}$  on k digits and  $e_{min} \leq e_x \leq e_{max}$
- mantissa  $m_x$  on n+1 digits
- encoding:

$$x = (-1)^{s_x} \times m_x \times \beta^{e_x}$$
$$m_x = x_0 \cdot x_1 x_2 x_3 \cdots x_n$$
$$x_i \in \{0, 1, \dots, \beta - 1\}$$

For accuracy purpose, the mantissa must be normalized  $(x_0 \neq 0)$ Then  $m_x \in [1, \beta]$  and a specific encoding is required for the number 0

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

25/104

## Logarithmic Number System (LNS)

Representation of *x*:

(sign of x, fixed-point approximation of  $\log_2 x$ )

LNS operations:

$$\begin{split} &\log_2(a \times b) \ = \ \log_2 a + \log_2 b \\ &\log_2(a \div b) \ = \ \log_2 a - \log_2 b \\ &\log_2(a \pm b) \ = \ \log_2 a + \log_2(1 \pm 2^{\log_2 b - \log_2 a}) \\ &\log_2(a^q) \ = \ q \times \log_2 a \end{split}$$

where the functions  $\log_2(1+2^{x})$  and  $\log_2(1-2^{x})$  are approximated (tables or polynomials)

Applications in digital signal processing and digital control

## IEEE-754: basic formats

Radix  $\beta=$  2, the first bit of the normalized mantissa is always a "1" (non-stored implicit bit)

|                  | number of bits |      |          |               |  |  |  |
|------------------|----------------|------|----------|---------------|--|--|--|
| format           | total          | sign | exponent | mantissa      |  |  |  |
| double precision | 64             | 1    | 11       | 52 + <b>1</b> |  |  |  |
| simple precision | 32             | 1    | 8        | 23 + <b>1</b> |  |  |  |



A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

## Part III



#### Basic Cells for Addition

Useful circuit element in computer arithmetic: counter

A (m, k)-counter is a cell that counts the number of 1 on its m inputs (result expressed as a k-bit integer)



Standard counters:

- half-adder or HA is a (2,2)-counter
- full-adder or FA is a (3,2)-counter

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

## FA Implementations











A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

30/104

## **Optimized FA Cell**

10-transistor solution<sup>1</sup> (some output signals are weak signals):



| Α | В | D | C              | S       |
|---|---|---|----------------|---------|
| 0 | 0 | 0 | 0              | 0       |
| 0 | 0 | 1 | 0              | 1       |
| 0 | 1 | 0 | 0 <sub>w</sub> | 1       |
| 0 | 1 | 1 | 1              | 0       |
| 1 | 0 | 0 | 0 <sub>w</sub> | 1       |
| 1 | 0 | 1 | 1 <sub>w</sub> | 0       |
| 1 | 1 | 0 | 1              | 0       |
| 1 | 1 | 1 | $1_w$          | $1_{w}$ |

<sup>1</sup>H. T. Bui, Y. Wang et Y. Jiang. *Design and analysis of low-power 10-transistor* full adders using novel XOR-XNOR gates. IEEE Trans. CaS, jan. 2002. A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

## Carry Ripple Adder (CRA)

Very simple architecture: n FA cells connected in series



|       | complexity            |
|-------|-----------------------|
| delay | O(n)                  |
| area  | <i>O</i> ( <i>n</i> ) |

Warning: Sometimes a CRA is also called *Carry Propagate Adder* (CPA), but CPA also means a non-redundant adder (that propagates)

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

Subtraction

Using 2's complement representation:

$$A-B = A+(-B)$$

Implementation:





## Useless Activity in a Carry Ripple Adder



Theoretical models (equiprobable and uniform distribution of inputs):

- worst case  $n^2/2$  transitions
- average 3n/2 transitions and only n/2 useful

A. Tisserand, CNRS–IRISA–CAIRN. Low-Power Arithmetic Operators

## Sign Extension

Required for the addition of different size operands in 2's complement



Warning:

- fanout
- order in case of multiple additions

#### Representation(s) of Numbers and Power Consumption

Impact of the representation of numbers:

- operator speed
- circuit area
- useful and useless activity

| cycle | value | 2's complement                          | $t_{c2}$ | sign/magnitude                          | t <sub>sm</sub> |
|-------|-------|-----------------------------------------|----------|-----------------------------------------|-----------------|
| 0     | 0     | 000000000000000000000000000000000000000 | 0        | 000000000000000000000000000000000000000 | 0               |
| 1     | 1     | 000000000000000000000000000000000000000 | 1        | 000000000000000000000000000000000000000 | 1               |
| 2     | -1    | 1111111111111111111                     | 15       | 1000000000000001                        | 1               |
| 3     | 8     | 0000000000001000                        | 15       | 0000000000001000                        | 3               |
| 4     | -27   | 1111111111100101                        | 15       | 100000000011011                         | 4               |
| 5     | 27    | 000000000011011                         | 15       | 000000000011011                         | 1               |
| total |       |                                         | 61       |                                         | 10              |

• sign/magnitude (absolute value):

le (absolute value):
$$A=(s_aa_{n-2}\ldots a_1a_0)=(-1)^{s_a} imes\sum_{i=0}^{n-2}a_i2^i$$

• 2's complement:

$$A = (a_{n-1}a_{n-2}\dots a_1a_0) = -a_{n-1}2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$$

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

#### Carry-Select Adder: Fanout Problem



## Carry-Select Adder

Idea: computation of the higher half part for the 2 possible input carries (0 and 1) and selection when the output carry from lower half part is known





A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

38/104

## Carry Skip Adder

Idea: split in blocks, fast detection of the block propagation in each block (all ranks of the block propagate the block input carry)



#### Questions:

- delay with uniform block size? -> analytical models
- delay with non-uniform block size?  $\rightarrow$  heuristics (papers)

#### Carry Look-ahead Adder

Idea: compute all carries as fast as possible (instead of propagating them)

At rank *i*, the input carry  $c_i$  is 1 in the following cases:

- rank *i* − 1 generates a carry
   ⇒ g<sub>i−1</sub> = 1
- rank *i* − 1 propagates a carry generated at rank *i* − 2
   ⇒ *p*<sub>*i*−1</sub> = *g*<sub>*i*−2</sub> = 1
- ranks i 1 and i 2 propagate a carry generated at rank i 3 $\hookrightarrow p_{i-1} = p_{i-2} = g_{i-3} = 1$
- ranks i 1 to 0 propagate the adder input carry  $c_0$  (set to 1)  $\Rightarrow p_{i-1} = p_{i-2} = \ldots = p_1 = p_0 = c_0 = 1$

A. Tisserand, CNRS–IRISA–CAIRN. Low-Power Arithmetic Operators

41/104

#### Carry Look-ahead Adder: 4-Bit Example

$$c_1 = g_0 + p_0 c_0$$
  

$$c_2 = g_1 + p_1 g_0 + p_1 p_0 c_0$$
  

$$c_3 = g_2 + p_2 g_1 + p_2 p_1 g_0 + p_2 p_1 p_0 c_0$$

 $c_4 = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0 + p_3 p_2 p_1 p_0 c_0$ 



All carries can be computed using the relation  $(c_i = g_{i-1} + c_{i-1}p_{i-1})$ :

 $c_i = g_{i-1} + p_{i-1}g_{i-2} + p_{i-1}p_{i-2}g_{i-3} + \ldots + p_{i-1}\cdots p_1g_0 + p_{i-1}\cdots p_0c_0$ 

CLA architecture: parallel evaluation of

- $(g_i, p_i)$  for all i
- carries  $c_i$  for all i using the above equation



A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

42/104

## Parallel-Prefix Problems

The *n* outputs  $(y_{n-1}, y_{n-2}, \dots, y_0)$  are computed using the *n* inputs  $(x_{n-1}, x_{n-2}, \dots, x_0)$  and the associative operator  $\Box$ :

$$y_0 = x_0$$
  

$$y_1 = x_1 \square x_0$$
  

$$y_2 = x_2 \square x_1 \square x_0$$
  

$$\vdots$$
  

$$y_{n-1} = x_{n-1} \square x_{n-2} \square \cdots \square x_1 \square x_0$$



A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators



45/104

## Comparison of Adders



Source: PhD R. Zimmermann [11]

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

46/104

## Redundant or Constant Time Adders

To speed-up the addition, one solution consists in "saving" the carries and using them (this makes sense only in case of multiple additions)

In 1961, Avizienis suggested to represent numbers in radix  $\beta$  with digits in  $\{-\alpha, -\alpha + 1, \dots, 0, \dots, \alpha - 1, \alpha\}$  instead of  $\{0, 1, 2, \dots, \beta - 1\}$  with  $\alpha \leq \beta - 1$ 

Using this representation, if  $2\alpha + 1 > \beta$  some numbers have several possible representation at the bit level. For instance, the value 2345 (in the standard representation) can be represented in radix 10 with digits in  $\{-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5\}$  by the values 2345, 235(-5) or 24(-5)(-5)

Such a representation is said redundant

In a redundant number system there is constant-time addition algorithm (without carry propagation) where all computations are done in parallel

#### Carry-Save Adder

In carry-save, the number A is represented in radix 2 using digits  $a_i \in \{0, 1, 2\}$  coded by 2 bits such that  $a_i = a_{i,c} + a_{i,s}$  where  $a_{i,c} \in \{0, 1\}$ and  $a_{i,s} \in \{0, 1\}$ 



A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

#### Carry-Save Trees

Example with 3 inputs: A, B and C



Carry-save reduction tree: n(h) non-redundant inputs can be reduced by a *h*-level carry-save tree where  $n(h) = \lfloor 3n(h-1)/2 \rfloor$  and n(0) = 2

| h                     | 1 | 2 | 3 | 4 | 5  | 6  | 7  | 8  | 9  | 10 | 11  |
|-----------------------|---|---|---|---|----|----|----|----|----|----|-----|
| <i>n</i> ( <i>h</i> ) | 3 | 4 | 6 | 9 | 13 | 19 | 28 | 42 | 63 | 94 | 141 |

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

#### Shift-And-Add Multiplication: Implementation



|       | complexity            |
|-------|-----------------------|
| delay | <i>O</i> ( <i>n</i> ) |
| area  | <i>O</i> ( <i>n</i> ) |

#### Borrow-Save Addition

In borrow-save, the number A is represented in radix 2 using digits  $a_i \in \{-1, 0, 1\}$  coded by 2 bits such that  $a_i = a_i^+ - a_i^-$  where  $a_i^+ \in \{0, 1\}$  and  $a_i^- \in \{0, 1\}$ 





Borrow-save addition: delay of 2 PPM cells (T = 0(1)) A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

50/104

## Shift-And-Add Multiplication: High-Radix Method

Idea: use high-radix digits for the multiplier



Problem: multiple generation A. Tisserand, CNRS–IRISA–CAIRN. Low-Power Arithmetic Operators

#### **Booth Recoding**

#### Modified Booth's Recoding

In 1951, Booth proposed to increase the number of 0s in the multiplier using the digit set  $\{-1 = \overline{1}, 0, 1\}$ 

Recoding based on the identity:  $2^{i+k} + 2^{i+k-1} + 2^{i+k-2} + \dots + 2^i = 2^{i+k+1} - 2^i$ 

Example: the integer 60 is represented by  $00111100 = 01000\overline{1}00$ 

The recoding replaces strings of 1s by a representation with more 0s

But, in some cases, this basic method leads to more 1 (or  $\overline{1}$ )!

Example: the value 01010101 is recoded to  $\overline{1}1\overline{1}1\overline{1}1\overline{1}1$ 



A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

#### Modified Booth Multiplier



Idea: do not recode isolated 1 but only strings of 1



| ai | <i>a</i> <sub><i>i</i>-1</sub> | <i>a</i> <sub><i>i</i>-2</sub> | Уi             | y <sub>i−1</sub> | meaning                     | operation   |
|----|--------------------------------|--------------------------------|----------------|------------------|-----------------------------|-------------|
| 0  | 0                              | 0                              | 0              | 0                | string of 0s                | +0          |
| 0  | 0                              | 1                              | 0              | 1                | end of a string of 1s       | +B          |
| 0  | 1                              | 0                              | 0              | 1                | isolated 1                  | +B          |
| 0  | 1                              | 1                              | 1              | 0                | end of a string of 1s       | +2B         |
| 1  | 0                              | 0                              | $\overline{1}$ | 0                | beginning of a string of 1s | -2 <i>B</i> |
| 1  | 0                              | 1                              | 1              | 1                | isolated 0                  | -B          |
| 1  | 1                              | 0                              | 0              | Ī                | beginning of a string of 1s | -B          |
| 1  | 1                              | 1                              | 0              | 0                | middle of a string of 1s    | +0          |

Improvement: leads to a *n*-product with  $\lfloor n/2 \rfloor + 1$  additions and shifts at most

A. Tisserand, CNRS–IRISA–CAIRN. Low-Power Arithmetic Operators

54/104

## 2's Complement Product

If A and B are represented using 2's complement, then some partial products have a negative weight

|                               |                                 |                                 | ×                               | a <sub>4</sub><br>b <sub>4</sub> | a <sub>3</sub><br>b <sub>3</sub> | a <sub>2</sub><br>b <sub>2</sub> | a <sub>1</sub><br>b <sub>1</sub> | a <sub>0</sub><br>b <sub>0</sub> |
|-------------------------------|---------------------------------|---------------------------------|---------------------------------|----------------------------------|----------------------------------|----------------------------------|----------------------------------|----------------------------------|
|                               |                                 |                                 |                                 | $-\mathbf{a_0}\mathbf{b_4}$      | a <sub>0</sub> b <sub>3</sub>    | a <sub>0</sub> b <sub>2</sub>    | a <sub>0</sub> b <sub>1</sub>    | a <sub>0</sub> b <sub>0</sub>    |
|                               |                                 |                                 | - a <sub>1</sub> b <sub>4</sub> | a <sub>1</sub> b <sub>3</sub>    | a <sub>1</sub> b <sub>2</sub>    | a <sub>1</sub> b <sub>1</sub>    | a <sub>1</sub> b <sub>0</sub>    |                                  |
|                               |                                 | - a <sub>2</sub> b <sub>4</sub> | a <sub>2</sub> b <sub>3</sub>   | a <sub>2</sub> b <sub>2</sub>    | a <sub>2</sub> b <sub>1</sub>    | a2b0                             |                                  |                                  |
|                               | - a <sub>3</sub> b <sub>4</sub> | a3b3                            | a <sub>3</sub> b <sub>2</sub>   | a <sub>3</sub> b <sub>1</sub>    | a <sub>3</sub> b <sub>0</sub>    |                                  |                                  |                                  |
| a <sub>4</sub> b <sub>4</sub> | - a <sub>4</sub> b <sub>3</sub> | - a <sub>4</sub> b <sub>2</sub> | - a <sub>4</sub> b <sub>1</sub> | - a <sub>4</sub> b <sub>0</sub>  |                                  |                                  |                                  |                                  |
|                               |                                 |                                 |                                 | $\overline{a_0 b_4}$             | a <sub>0</sub> b <sub>3</sub>    | a <sub>0</sub> b <sub>2</sub>    | a <sub>0</sub> b <sub>1</sub>    | a <sub>0</sub> b <sub>0</sub>    |
|                               |                                 |                                 | a <sub>1</sub> b <sub>4</sub>   | a <sub>1</sub> b <sub>3</sub>    | a <sub>1</sub> b <sub>2</sub>    | a <sub>1</sub> b <sub>1</sub>    | a <sub>1</sub> b <sub>0</sub>    |                                  |
|                               |                                 | $\overline{a_2 b_4}$            | a <sub>2</sub> b <sub>3</sub>   | a <sub>2</sub> b <sub>2</sub>    | a <sub>2</sub> b <sub>1</sub>    | a2b0                             |                                  |                                  |
|                               | a <sub>3</sub> b <sub>4</sub>   | a3b3                            | a <sub>3</sub> b <sub>2</sub>   | a <sub>3</sub> b <sub>1</sub>    | a <sub>3</sub> b <sub>0</sub>    |                                  |                                  |                                  |
| a <sub>4</sub> b <sub>4</sub> | $\overline{a_4 b_3}$            | $\overline{a_4 b_2}$            | a <sub>4</sub> b <sub>1</sub>   | $\overline{a_4 b_0}$             |                                  | Modifi                           | ed Baugh                         | -Woole                           |

1

53/104

1

#### Tree Multipliers

- 1. Partial products generation  $a_i b_j$ (with or without recoding)  $\hookrightarrow$  delay in O(1) (fanout  $a_i, b_j$  $O(\log n)$ )
- Sum of the partial products using a *carry-save* reduction tree → delay in O(log n)
- Assimilation of the carries using a fast adder
   → delay in O(log n)

n bits n<sup>2</sup> bits reduction 4n bits P (carry-save) 2n bits P

**PP** generation

B 7 n bits

Multiplication delay  $O(\log n)$ , area  $O(n^2)$ 

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

57/104

## Partial Product Generation: Booth-3





Partial Product Generation: Booth-2

#### A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

## Booth's Recoding Benefit

Booth-2 case:

- Speed: 1 or 2 levels removed from the reduction tree but this is balanced by the recoding step
- Area: true benefit (30%), recoding cells limit the fanout problem and their implementation is efficient in CMOS

#### Booth-3 case:

It is rarely used because of the very complex recoding and partial product cells  $% \left( {{{\boldsymbol{x}}_{i}}} \right)$ 

#### Partial Products Addition: Reduction Trees

Goal: compute the addition of the n/2 + 1 partial products in *carry-save* 

#### Wallace Trees

Wallace trees<sup>2</sup> are *p*-input counters ( $\lceil log_2 p \rceil$  outputs)

A Wallace tree with  $2^{p+1} - 1$  inputs can be built based on  $2^p - 1$ -input Wallace trees (a 3-input Wallace tree is a FA)



<sup>2</sup>C.S. Wallace. A suggestion for a fast multiplier. IEEE Transactions on Computers, Feb. 1964.

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

62/104

## Placement and Routing Problems at the Gate Level

What is the best topology?



#### Example: 14-bit reduction

1 ADD(8

#### 8 9 10 11 6 7 3 4 5 13 28 42 9 19 63 94 n(h)3 4 6 141 Area benefit on large multipliers. Example: n = 12 bits $\Rightarrow 11\%$

9 FA + 3 HA + 1 ADD(8

less gates compared to a Wallace tree

61/104

63/104

• Trees based on counters or compressors

Several reduction trees can be used:

• Trees based on FA cells:

fast reduction trees

Trees based on "4 to 2" cells.

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

Wallace trees

Dadda trees

## Dadda Trees

Idea: minimal reduction at each level of the tree (just enough to reach the next level in n(h) = |3n(h-1)/2| with n(0) = 2)

#### Final Addition in Multipliers

The final addition or assimilation of the carries is performed using a fast adder

Based on the relative arrival time of the partial products, some (small) optimizations can be done: use of several adder types depending on the rank region



65/104

#### MAC and FMA

MAC: multiply and accumulate  $P(t) = A \times B + P(t-1)$ A, B are n-bit values and P a m-bit with m >> n (e.g.,  $16 \times 16 + 40 \longrightarrow 40$  in some DSPs) FMA: fused multiply and add  $P = A \times B + C$  where A, B, C and P can be stored in different registers (recent general purpose processors, e.g., Itanium)



#### Power Consumption in Fast Multipliers



- 30% to 70% of redundant transitions (useless)
- place and route steps based on the internal arrival time
- add a pipeline stage

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

# 



#### Multiplication by Constants (1/2)

**Problem**: substitute a complete multiplier by an optimized sequence of shifts and additions and/or subtractions

Papers: [2, 1]

Example:  $p = 111463 \times x$ 

| algo.     | $p = 111463 \times x =$                                         | #op.    |
|-----------|-----------------------------------------------------------------|---------|
| direct    | $(x \ll 16) + (x \ll 15) + (x \ll 13) + (x \ll 12) + (x \ll 9)$ | 10 ±    |
|           | $+(x \ll 8)+(x \ll 6)+(x \ll 5)+(x \ll 2)+(x \ll 1)+x$          |         |
| CSD       | $(x \ll 17) - (x \ll 14) - (x \ll 12) + (x \ll 10)$             | $7 \pm$ |
|           | $-(x \ll 7) - (x \ll 5) + (x \ll 3) - x$                        |         |
| Bernstein | $(((t_2 \ll 2) + x) \ll 3) - x$                                 | $5 \pm$ |
|           | where                                                           |         |
|           | $t_1 = (((x \ll 3) - x) \ll 2) - x$                             |         |
|           | $t_2 = t_1 \ll 7 + t_1$                                         |         |
| Our       | $(t_2 \ll 12) + (t_2 \ll 5) + t_1$                              | 4 ±     |
|           | where                                                           |         |
|           | $t_1 = (x \ll 3) - x$                                           |         |
|           | $t_2 = (t_1 \ll 2) - x$                                         |         |

$$\label{eq:csd} \begin{split} \text{CSD: canonical signed digit, } 111463 &= 11011001101100111_2 \\ \text{A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators} \\ \end{split}$$

## Multiplication by Constants (2/2)

| Power saving      | s: 30 ur | o to 60' | %   |     |     | FIR        |
|-------------------|----------|----------|-----|-----|-----|------------|
| operator init.    |          | [1]      | [2] | our |     | ×[t]<br>→  |
| DCT 8b            | 300      | 94       | 73  | 56  |     |            |
| DCT 12b           | 368      | 100      | 84  | 70  |     |            |
| DCT 16b           | 521      | 129      | 114 | 89  |     |            |
| DCT 24b           | 789      | 212      | —   | 119 | 1   |            |
| Power saving      | s: 10%   |          |     |     |     | ⊥<br>x[t]⊳ |
| operator          |          | init.    | [1] | [2] | our | ]          |
| 8 × 8 Had.        |          | 56       | 24  | _   | 24  | ····       |
| (16, 11) RM.      |          | 61       | 43  | 31  | 31  |            |
| (15,7) BCH        |          | 72       | 48  | 47  | 44  | x[t]       |
| (24, 12, 8) Golay |          | 76       | —   | 47  | 45  |            |
| Power saving      | s: up to | o 40%    |     |     |     | [          |
| operator          | init.    | [22]     | our |     |     |            |

24

46





р

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

32

70

35

72

remez(25, [0 0.2 0.25 1], [1 1 0 0]).

8 bits

16 bits

Parks-McClellan filter

70/104

Е

## Division Not an Important Operation? Not So Clear!

Technical report<sup>3</sup> from S. Oberman and M. Flynn : "*Design issues in floating-point division*", CSL-TR-94-647 Stanford University.



<sup>3</sup>SPECfp92 DECstation MIPS R3000 (latency : 2c add, 5c mul, 19c div), compil. O3. A. Tisserand, CNRS–IRISA–CAIRN. *Low-Power Arithmetic Operators* 72/104

Part IV

#### More advanced operations

Division (square root)

#### **Elementary Functions**

## **Restoring Division**

MSB

q

## How to Speed-Up Division ?

Idea : high radix iterations

Problem: accurate comparisons with divisor multiples





zones de choix difficile

Solution: redundant representation for quotient digits  $\hookrightarrow$  approximate comparisons

x(i+1)

zones à choix multiples

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

for i from 1 to n do

if  $x \ge 0$  then

 $q_i \leftarrow 1$ 

 $q_i \leftarrow 0$ 

 $x \leftarrow x + d$ 

 $x \leftarrow 2x$ 

else

 $x \leftarrow x - d$ 

1

2

3

4

5

6

7

8

73/104

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

#### SRT Division Table



zone impossible 00.1 2d/3 d/3 00.0 0.101 0.110 0.111 0.100

SRT Division Architecture

SRT: Sweeney, Robertson and Tocher, 1958



#### Micro-Coded Division using Newton-Raphson

Specific iteration with quadratic convergence to 1/x:

$$x_{i+1} = x_i \cdot (2 - d \cdot x_i)$$

Then, use  $d \cdot (1/x)$  (and a few op. for rounding if needed)

#### Motivation: simplified processor architecture



A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

77/104

## Divgen: a Division Unit Generator

- VHDL output
- supported algorithms: restoring, non-restoring and SRT
- partial remainder: 2's complement or carry-save
- # guard bits, SRT table folding, gray encoding



Publi. [5], web: http://lipforge.ens-lyon.fr/www/divgen/

## The Return of SRT Dividers in Processors...

Problems in Newton-Raphson iterations: power consumption and thermal hot-spot!



Fig. 6. Comparison of thermal profiles: FMA alone (left) and FMA plus r16div (right). Temperatures are in °C.

Source: W. Liu and A. Nannarelli. *Power Dissipation Challenges in Multicore Floating-Point Units.* Proc. ASAP 2010.

A. Tisserand, CNRS–IRISA–CAIRN. Low-Power Arithmetic Operators

78/104

## Motivations for More Advanced Operations



Initial approximations for floating-point units:

- reciprocal  $\frac{1}{x}$  for division
- inverse square root  $\frac{1}{\sqrt{X}}$  for square root



and many other functions: exp(x), log(x), arctan(x), cosh(x), ...

20

25

30

15

quotient size [#bits]

#### Error and Accuracy

Question: how many bits are correct ?

$$\begin{cases} x_{t} = (1.000\,000\,00)_{2} & \text{theoretical value} \\ x_{c} = (0.111\,111\,11)_{2} & \text{value in the circui} \\ |x_{t} - x_{c}| = (0.000\,000\,01)_{2} = 2^{-8} \end{cases}$$

**Error**,  $\epsilon$ : distance between 2 objects (e.g.  $\epsilon = ||f(x) - p(x)||$ )

Accuracy,  $\mu$ : (fractional) number of bits required to represent values with an error  $\leq \epsilon$ 

$$\mu = -\log_2 |\epsilon|$$

**Notation**:  $\mu$  expressed in terms of correct or significant bits ([cb], [sb])

**Example**: error  $\epsilon = 0.0000107$  is equivalent to accuracy  $\mu = 16.5$  sb



## Accuracy, Degree and Evaluation Cost

Degree-*d* minimax approximation polynomials to sin(x) with  $x \in [a, b]$ :



- higher accuracy  $\implies$  higher degree
- higher degree ⇒ more costly evaluation



**Polynomial Approximations** 

#### A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

82/104

#### **Polynomial Evaluation Schemes**

| scheme | computations                      | # ± | # × |
|--------|-----------------------------------|-----|-----|
| direct | $p_0 + p_1 x + p_2 x^2 + p_3 x^3$ | 3   | 5   |
| Horner | $p_0 + (p_1 + (p_2 + p_3 x)x)x$   | 3   | 3   |
| Estrin | $p_0 + p_1 x + (p_2 + p_3 x) x^2$ | 3   | 4   |

#### Trade-off:

- direct scheme  $\longrightarrow$  high operation cost and smaller accuracy
- Horner scheme  $\longrightarrow$  smallest cost but sequential
- Estrin scheme  $\longrightarrow$  some internal parallelism

**Question**: what is the best evaluation scheme?

#### Round-off Errors

Round-off errors occur during most of computations:

- due to the finite accuracy during the computations
- small for a single operation (fraction of the LSB)
- accumulation of such errors may be a problem in long computation sequences
- $\bullet$  need for a sufficient data-path width in order to limit round-off errors

Examples:  $1/3=0.33333333\ldots \rightarrow 0.3333$  or 0.3334 in  $1\mathrm{Q}_{10}4$  format



#### Question: what is the best data-path width?

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

Gappa Overview

- developed by Guillaume Melquiond
- goal: formal verification of the correctness of numerical programs:
  - software and hardware
  - $\blacktriangleright$  integer, floating-point and fixed-point arithmetic (±, ×, ÷,  $\surd)$
- uses multiple-precision interval arithmetic, forward error analysis and expression rewriting to bound mathematical expressions (rounded and exact operators)
- generates a theorem and its proof which can be automatically checked using a proof assistant (e.g. Coq or HOL Light)
- reports tight error bounds for given expressions in a given domain
- C++ code and free software license (CeCILL  $\simeq$  GPL)
- publication: ACM Transactions on Mathematical Software, n. 1, vol. 37, 2010, pp: 2:1–20, doi: 10.1145/1644001.1644003
- source code and doc: http://gappa.gforge.inria.fr/

## Bounding Round-off Errors

Problem: it is very difficult to get tight bounds

Solutions:

- worst case: assume 1/2 LSB error for each operation  $\rightsquigarrow$  simple but very pessimistic
- qualification: exhaustive or selected simulations
   → simple but only validated bounds for small systems
- specific tools: formal accurate analysis (and proof)
   → we use gappa developed by Guillaume Melquiond

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

## Gappa Example

Degree-2 polynomial approximation to  $e^x$  over [1/2, 1] and format 1Q9:

```
1 p0 = 571/512;
                      p1 = 275/512;
                                          p2 = 545/512;
2
3 \times = fixed < -9, dn > (Mx);
5 y1 \text{ fixed} < -9, dn >= p2 * x + p1;
    fixed < -9, dn >= v1 * x + p0;
6 p
_{8}Mp = (p2 * Mx + p1) * Mx + p0;
9
10 {
      Mx in [0.5,1]
                       /\ |Mp-Mf| in [0,0.001385]
11
12->
      |p-Mf| in ?
13
14 }
```

Gappa-0.14.0 result ([a, b],  $x\{(\approx x)_{10}, \log_2 x\}, xby = x2^y\}$ ): Results for Mx in [0.5, 1] and |Mp - Mf| in [0, 0.001385]: |p - Mf| in [0, 193518932894171697b-64 {0.0104907, 2^(-6.57475)}]

#### Minimax Polynomial Approximations

- approximation error  $\epsilon_{app} = ||f p||_{\infty} = \max_{a \le x \le b} |f(x) p(x)|$
- minimax polynomial approximation to f over [a, b] is  $p^*$  such that:

$$||f - p^*||_{\infty} = \min_{p \in \mathcal{P}_d} ||f - p||_{\infty}$$

- $\mathcal{P}_d$  set of polynomials with real coefficients and degree  $\leq d$
- $p^*$  computed using an algorithm from Remez (numerically implemented in Maple, Matlab, sollya...)

#### Problems:

- $p^*$  coefficients in  $\mathbb{R} \implies$  conversion to finite precision
- during  $p^*$  evaluation, some round-off errors add up to  $\epsilon_{app}$

#### A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

#### **Basic Coefficient Selection Method**

Idea: search among all the rounding modes for all the  $p_i^*$ 

- round up  $p_i = \triangle(p_i^*)$ , round down  $p_i = \nabla(p_i^*)$
- 2 values per coeff.  $\implies$  total of  $2^{d+1}$  values (but d is small)
- for each polynomial p evaluate  $\epsilon_{app} = ||f p||_{\infty}$ , then select polynomial(s) with the smallest  $\epsilon_{app}$



Result:  $p(x) = \sum_{i=0}^{d} p_i x^i$  where all  $p_i$  are representable in target format A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators 91/104

#### Finite Precision Coefficients Selection Problem

Example:  $f(x) = e^x$  over [1/2, 1] with d = 2, the remez function from sollya gives:

 $p^* = 1.116019297... + 0.535470348... \times x + 1.065407185... \times x^2$ 

**Question**: what are "good" representable values for  $p_0$ ,  $p_1$  and  $p_2$ ?

**Problem**:  $p^*$  is the best theoretical approximation to f (i.e.  $p_i \in \mathbb{R}$ ) **Need**: find good approximations with "machine-representable" coefficients Above example with 1Q9 format (all values for domain [1/2, 1]):

•  $\epsilon_{\text{app}} = ||f - p^*||_{\infty} \simeq 1.385 \times 10^{-3} \quad \rightsquigarrow \quad \simeq 9.4 \text{ sb}$ •  $\frac{571}{512} + \frac{137}{256}x + \frac{545}{512}x^2 \implies 8.1 \text{ sb} \quad (\forall i \text{ use } \mathcal{N}(p_i))$ •  $\frac{571}{512} + \frac{275}{512}x + \frac{545}{512}x^2 \rightarrow 9.3$  sb (best selection)

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

 $(n^*) \sim 18.04 \text{ sh}$ 

89/104

90/104

[ch]

## Example for $f(x) = 2^x$ , $x \in [0, 1]$ and d = 4

|                              |                                                                                                                                                                                                                                                                                                                      |                                                        | sapp [sa]                                              |
|------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------|--------------------------------------------------------|
| $\epsilon_{\mathrm{app}}(p)$ | р                                                                                                                                                                                                                                                                                                                    | $\epsilon_{ m app}(\pmb{p})$                           | 20 🕇                                                   |
| 12.00                        | $(\nabla, \nabla, \nabla, \nabla, \Delta)$                                                                                                                                                                                                                                                                           | 13.00                                                  | d = d                                                  |
| 13.00                        | $(\bigtriangledown, \bigtriangledown, \bigtriangledown, \Delta, \Delta)$                                                                                                                                                                                                                                             | 14.03                                                  | 18                                                     |
| 13.00                        | $(\bigtriangledown, \bigtriangledown, \triangle, \bigtriangledown, \triangle)$                                                                                                                                                                                                                                       | 14.55                                                  |                                                        |
| 14.99                        | $(\bigtriangledown, \bigtriangledown, \triangle, \Delta, \Delta)$                                                                                                                                                                                                                                                    | 13.00                                                  | 10 +                                                   |
| 13.00                        | $(\bigtriangledown, \bigtriangleup, \bigtriangledown, \bigtriangledown, \bigtriangledown, \bigtriangleup)$                                                                                                                                                                                                           | 16.13                                                  | 14 + d - d                                             |
| 17.12                        | $(\bigtriangledown, \triangle, \bigtriangledown, \Delta, \Delta)$                                                                                                                                                                                                                                                    | 13.00                                                  |                                                        |
| 15.71                        | $(\bigtriangledown, \triangle, \triangle, \bigtriangledown, \triangle)$                                                                                                                                                                                                                                              | 13.00                                                  | 12 +                                                   |
| 13.00                        | $(\bigtriangledown, \triangle, \triangle, \Delta, \Delta)$                                                                                                                                                                                                                                                           | 12.00                                                  |                                                        |
| 13.00                        | $(\triangle,\bigtriangledown,\bigtriangledown,\bigtriangledown,\bigtriangledown,\triangle)$                                                                                                                                                                                                                          | 13.00                                                  | 10 + d - d                                             |
| 13.00                        | $(\triangle,\bigtriangledown,\bigtriangledown,\triangle,\triangle)$                                                                                                                                                                                                                                                  | 13.00                                                  | u                                                      |
| 13.00                        | $(\triangle,\bigtriangledown,\triangle,\bigtriangledown,\triangle)$                                                                                                                                                                                                                                                  | 13.00                                                  | 8 1                                                    |
| 12.99                        | $(\triangle, \bigtriangledown, \triangle, \triangle, \triangle)$                                                                                                                                                                                                                                                     | 12.00                                                  | 6                                                      |
| 12.99                        | $(\triangle, \triangle, \bigtriangledown, \bigtriangledown, \bigtriangledown, \triangle)$                                                                                                                                                                                                                            | 12.98                                                  | d = 1                                                  |
| 12.91                        | $(\triangle, \triangle, \bigtriangledown, \triangle, \triangle)$                                                                                                                                                                                                                                                     | 12.00                                                  | 4 +                                                    |
| 12.79                        | $(\triangle, \triangle, \triangle, \bigtriangledown, \triangle)$                                                                                                                                                                                                                                                     | 12.00                                                  |                                                        |
| 12.00                        | $(\triangle, \triangle, \triangle, \Delta, \Delta)$                                                                                                                                                                                                                                                                  | 11.41                                                  | 2 +                                                    |
|                              |                                                                                                                                                                                                                                                                                                                      |                                                        |                                                        |
|                              | ε <sub>арр</sub> (p)           12.00           13.00           13.00           14.99           13.00           17.12           15.71           13.00           13.00           13.00           13.00           13.00           13.00           12.99           12.99           12.91           12.79           12.00 | $\begin{array}{c c c c c c c c c c c c c c c c c c c $ | $\begin{array}{ c c c c c c c c c c c c c c c c c c c$ |

p represented by  $(p_0, p_1, p_2, p_3, p_4)$ 

## Improved Coefficient Selection Methods



Other selection methods:

- linear programming methods, e.g. meplib software https://lipforge.ens-lyon.fr/projects/meplib/
- euclidean lattices reduction (LLL), e.g. sollya software http://sollya.gforge.inria.fr/

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

93/104

Example:  $\sqrt{x}$  over [1, 2] and  $\mu \leq 8$  sb



| 1 |   |
|---|---|
|   | + |
| - | p |

| solution  | area | period | #cycles | latency | power |
|-----------|------|--------|---------|---------|-------|
| wo. tools | 1.00 | 1.00   | 2       | 1.00    | 1.00  |
| w. tools  | 0.59 | 0.97   | 1       | 0.48    | 0.45  |

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

#### Conclusion

When designing arithmetic operators for low-power applications:

- use adequate number system(s)
- use adequate algorithm(s)
- use specific operator(s) when possible
- use optimization (open-source) tool(s):
  - floating-point data-paths: FloPoCo flopoco.gforge.inria.fr
  - divider generator: divgen http://lipforge.ens-lyon.fr/www/divgen/
  - > polynomial approx.: sollya http://sollya.gforge.inria.fr/
  - rounding errors: gappa http://gappa.gforge.inria.fr/

Part V

#### Conclusion & references

Conclusion

References

Good Books

#### Take Care to Not So Good Ideas...



Gaston 14. La sage des gaffes. Page 5. Par Franquin. Dupuis, 1982 (réédition 1993). ISBN 2-8001-0955-6

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

97/104

## References I

#### N. Boullis and A. Tisserand.

Some optimizations of hardware multiplication by constant matrices. In J.-C. Bajard and M. Schulte, editors, *Proc. 16th Symposium on Computer Arithmetic* (*ARITH*), pages 20–27, Santiago de Compostela, Spain, June 2003. IEEE Computer Society.

#### N. Boullis and A. Tisserand.

Some optimizations of hardware multiplication by constant matrices. *IEEE Transactions on Computers*, 54(10):1271–1282, October 2005.

#### H. T. Bui, Y. Wang, and Y. Jiang.

Design and analysis of low-power 10-transistor full adders using novel XOR-XNOR gates. *IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing*, 49(1):25–30, January 2002.

#### L. Dadda.

Some schemes for parallel multipliers. *Alta Frequenza*, 34:349–356, 1965.

#### R. Michard, A. Tisserand, and N. Veyrat-Charvillon. Divgen: a divider unit generator.

In F. T. Luk, editor, *Proc. Advanced Signal Processing Algorithms, Architectures and Implementations XV*, volume 5910, pages 1–12, San Diego, California, U.S.A., August 2005. SPIE.

#### Arithmetic Operators for Circuits in 20 Years ???



Fluide Glacial Numéro HS 46 (20/04/2009) Où serons nous en 2040 ? Couverture de Goossens

98/104

#### References I

#### A. Tisserand.

Low-power arithmetic operators. In C. Piguet, editor, *Low Power Electronics Design*, chapter 9. CRC Press, November 2004.

#### A. Tisserand.

Introduction aux représentations des nombres et opérateurs arithmétiques à basse consommation d'énergie. *Technique et Science Informatiques*, 26(5):639–646, May 2007.

#### A. Tisserand.

Unités de calcul flottant. Cours École Thématique ARCHI07, March 2007.

#### A. Tisserand.

Power analysis and cryptosystem security: Attacks and countermeasures. Cours École Thématique ECOFAC 2012, May 2012.

#### C.S. Wallace.

#### A suggestion for a fast multiplier.

IEEE Transactions on Electronic Computers, EC-13:14–17, 1964.

#### R. Zimmermann.

*Binary Adder Architectures for Cell-Based VLSI and their Synthesis.* Phd thesis, Swiss Federal Institute of Technology Zurich, 1998.

#### Good Books

#### **CMOS VLSI** Design

A Circuits and Systems Perspective Neil Weste and David Harris 3rd edition, 2004 Addison Wesley ISBN: 0-321-14901-7





Micro et nano-électronique Bases, Composants, Circuits Hervé Fanet 2006 Dunod ISBN: 2–10–049141–5

#### A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

## Good Books

#### L'Énergie à découvert Sous la direction de R. Mosseri et C. Jeandel 2013 CNRS Éditions ISBN: 978-2-271-07678-6



## Good Books

#### Digital Arithmetic Milos Ercegovac and Tomas Lang 2003 Morgan Kaufmann ISBN: 1–55860–798–6



#### Arithmétique des ordinateurs Jean-Michel Muller 1989 Masson ISBN: 2–225–81689–1 (web version)

**Digital** Arithmetic

A. Tisserand, CNRS-IRISA-CAIRN. Low-Power Arithmetic Operators

#### The end, questions ?

#### Contact:

- mailto:arnaud.tisserand@irisa.fr
- http://people.irisa.fr/Arnaud.Tisserand/
- CAIRN Group http://www.irisa.fr/cairn/
- IRISA Laboratory, CNRS–INRIA–Univ. Rennes 1
   6 rue Kerampont, CS 80518, F-22305 Lannion cedex, France

Thank you