OpenTitan Big Number Accelerator (OTBN) Technical Specification

Note on the status of this document

This specification is work in progress and will see significant changes before it can be considered final. We invite input of all kinds through the standard means of the OpenTitan project; a good starting point is filing an issue in our GitHub issue tracker.

Overview

This document specifies functionality of the OpenTitan Big Number Accelerator, or OTBN. OTBN is a coprocessor for asymmetric cryptographic operations like RSA or Elliptic Curve Cryptography (ECC).

This module conforms to the Comportable guideline for peripheral functionality. See that document for an integration overview within the broader top-level system.

Features

  • Processor optimized for wide integer arithmetic
  • 32b wide control path with 32 32b wide registers
  • 256b wide data path with 32 256b wide registers
  • Full control-flow support with conditional branch and unconditional jump instructions, hardware loops, and hardware-managed call/return stacks.
  • Reduced, security-focused instruction set architecture for easier verification and the prevention of data leaks.
  • Built-in access to random numbers. Note: The (quality) properties of the provided random numbers are not currently specified; this gap in the specification will be addressed in a future revision.

Description

OTBN is a processor specialized for the execution of security-sensitive asymmetric (public-key) cryptography code, such as RSA or ECC. Such algorithms are dominated by wide integer arithmetic, which is supported by OTBN's 256b wide data path and registers, and by instructions that operate on these wide data words. The control flow, on the other hand, is clearly separated from the data and reduced to a minimum to avoid data leakage.

The data OTBN processes is security-sensitive, and the processor design centers around that. The design is kept as simple as possible to reduce the attack surface and aid verification and testing. For example, no interrupts or exceptions are included in the design, and all instructions are designed to be executable within a single cycle.

OTBN is designed as a self-contained co-processor with its own instruction and data memory, which is accessible as a bus device.

Compatibility

OTBN is not designed to be compatible with other cryptographic accelerators. It received some inspiration from assembly code available from the Chromium EC project, which has been formally verified within the Fiat Crypto project.

Instruction Set

OTBN is a processor with a custom instruction set, which is described in this section. The instruction set is split into two groups:

  • The base instruction subset operates on the 32b General Purpose Registers (GPRs). Its instructions are used for the control flow of an OTBN application. The base instructions are inspired by RISC-V's RV32I instruction set, but not compatible with it.
  • The big number instruction subset operates on 256b Wide Data Registers (WDRs). Its instructions are used for data processing. Also included are compare and select instructions to perform data-dependent control flow in a safe and constant time fashion.

The subsequent sections first describe the processor state, followed by a description of the base and big number instruction subsets.

Processor State

General Purpose Registers (GPRs)

OTBN has 32 General Purpose Registers (GPRs). Each GPR is 32b wide. General Purpose Registers in OTBN are mainly used for control flow. The GPRs are defined in line with RV32I.

Note: GPRs and Wide Data Registers (WDRs) are separate register files. They are only accessible through their respective instruction subset: GPRs are accessible from the base instruction subset, and WDRs are accessible from the big number instruction subset (BN instructions).

x0 Zero. Always reads 0. Writes are ignored.
x1 Return address. Access to the call stack. Reading x1 pops an address from the call stack. Writing x1 pushes a return address to the call stack. Reading from an empty call stack results in an alert.
x2 General Purpose Register 2.
...
x31 General Purpose Register 31.

Note: Currently, OTBN has no "standard calling convention," and GPRs except for x0 and x1 can be used for any purpose. If, at some point, a calling convention is needed, it is expected to be aligned with the RISC-V standard calling conventions, and the roles assigned to registers in that convention. Even without an agreed-upon calling convention, software authors are encouraged to follow the RISC-V calling convention where it makes sense. For example, good choices for temporary registers are x6, x7, x28, x29, x30, and x31.

Control and Status Registers (CSRs)

Control and Status Registers (CSRs) are 32b wide registers used for “special” purposes, as detailed in their description; they are not related to the GPRs. CSRs can be accessed through dedicated instructions, CSRRS and CSRRW.

Number Privilege Description
0x7C0 RW FLAGS. Wide arithmetic flags. This CSR provides access to the flags used in wide integer arithmetic.
Bit Description
0 Carry of Flag Group 0
1 LSb of Flag Group 0
2 MSb of Flag Group 0
3 Zero of Flag Group 0
4 Carry of Flag Group 1
5 LSb of Flag Group 1
6 MSb of Flag Group 1
7 Zero of Flag Group 1
0x7D0 RW MOD0. Bits [31:0] of the modulus operand, used in the BN.ADDM/BN.SUBM instructions. This CSR is mapped to the MOD WSR.
0x7D1 RW MOD1. Bits [63:32] of the modulus operand, used in the BN.ADDM/BN.SUBM instructions. This CSR is mapped to the MOD WSR.
0x7D2 RW MOD2. Bits [95:64] of the modulus operand, used in the BN.ADDM/BN.SUBM instructions. This CSR is mapped to the MOD WSR.
0x7D3 RW MOD3. Bits [127:96] of the modulus operand, used in the BN.ADDM/BN.SUBM instructions. This CSR is mapped to the MOD WSR.
0x7D4 RW MOD4. Bits [159:128] of the modulus operand, used in the BN.ADDM/BN.SUBM instructions. This CSR is mapped to the MOD WSR.
0x7D5 RW MOD5. Bits [191:160] of the modulus operand, used in the BN.ADDM/BN.SUBM instructions. This CSR is mapped to the MOD WSR.
0x7D6 RW MOD6. Bits [223:192] of the modulus operand, used in the BN.ADDM/BN.SUBM instructions. This CSR is mapped to the MOD WSR.
0x7D7 RW MOD7. Bits [255:224] of the modulus operand, used in the BN.ADDM/BN.SUBM instructions. This CSR is mapped to the MOD WSR.
0xFC0 R RND. A random number.
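
The FLAGS bit layout above can be illustrated with a short Python sketch that packs the two flag groups into a FLAGS CSR value and unpacks them again. Python is used only because the pseudocode in this document reads like it; `FlagGroup`, `pack_flags` and `unpack_flags` are hypothetical names, and the sketch assumes bits 31:8 of FLAGS read as zero, which the table does not state.

```python
from dataclasses import dataclass

@dataclass
class FlagGroup:
    """One flag group: Carry, LSb, MSb and Zero flags (each 0 or 1)."""
    c: int = 0
    l: int = 0
    m: int = 0
    z: int = 0

def pack_flags(fg0: FlagGroup, fg1: FlagGroup) -> int:
    """Build a FLAGS CSR value: FG0 in bits 3:0, FG1 in bits 7:4.

    Bits 31:8 are assumed to be zero (hypothetical; not stated in the table).
    """
    value = 0
    for offset, fg in ((0, fg0), (4, fg1)):
        value |= fg.c << offset          # bit 0 / 4: Carry
        value |= fg.l << (offset + 1)    # bit 1 / 5: LSb
        value |= fg.m << (offset + 2)    # bit 2 / 6: MSb
        value |= fg.z << (offset + 3)    # bit 3 / 7: Zero
    return value

def unpack_flags(value: int) -> tuple[FlagGroup, FlagGroup]:
    """Split a FLAGS CSR value back into (FG0, FG1)."""
    def group(offset: int) -> FlagGroup:
        return FlagGroup(c=(value >> offset) & 1,
                         l=(value >> (offset + 1)) & 1,
                         m=(value >> (offset + 2)) & 1,
                         z=(value >> (offset + 3)) & 1)
    return group(0), group(4)
```

For example, a set Carry flag in FG0 together with a set Zero flag in FG1 corresponds to bits 0 and 7, i.e. the CSR value 0x81.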

Wide Data Registers (WDRs)

In addition to the 32b wide GPRs, OTBN has a second “wide” register file, which is used by the big number instruction subset. This register file consists of NWDR = 32 Wide Data Registers (WDRs). Each WDR is WLEN = 256b wide.

Wide Data Registers (WDRs) and the 32b General Purpose Registers (GPRs) are separate register files. They are only accessible through their respective instruction subset: GPRs are accessible from the base instruction subset, and WDRs are accessible from the big number instruction subset (BN instructions).

Register
w0
w1
...
w31

Wide Special Purpose Registers (WSRs)

In addition to the Wide Data Registers, BN instructions can also access WLEN-sized special purpose registers, WSRs for short.

Number Privilege Description
0x0 RW MOD Modulus. To be used in the BN.ADDM and BN.SUBM instructions. This WSR is mapped to the MOD0 to MOD7 CSRs.
0x1 R RND A random number.
0x2 RW ACC MAC Accumulator. This gives direct access to the accumulator register used by the BN.MULQACC instruction.
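
As a sketch of the MOD mapping, assuming MODi holds bits [32*i+31:32*i] of MOD as listed in the CSR table, the following Python helpers convert between the 256-bit WSR view and the eight 32-bit CSR views. The helper names are hypothetical.

```python
WLEN = 256
WORD_MASK = 0xFFFF_FFFF

def mod_csrs_from_wsr(mod: int) -> list[int]:
    """Return [MOD0, ..., MOD7]; MODi holds bits [32*i+31 : 32*i] of MOD."""
    assert 0 <= mod < 2**WLEN
    return [(mod >> (32 * i)) & WORD_MASK for i in range(8)]

def mod_wsr_from_csrs(csrs: list[int]) -> int:
    """Reassemble the 256-bit MOD WSR from the eight 32-bit CSR values."""
    assert len(csrs) == 8
    mod = 0
    for i, word in enumerate(csrs):
        mod |= (word & WORD_MASK) << (32 * i)
    return mod

def write_mod_csr(mod: int, index: int, value: int) -> int:
    """Model a CSR write to MOD<index>: only that 32-bit slice of MOD changes."""
    mask = WORD_MASK << (32 * index)
    return (mod & ~mask) | ((value & WORD_MASK) << (32 * index))
```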

Flags

In addition to the wide register file, OTBN maintains global state in two groups of flags for use by wide integer operations. Flag groups are named Flag Group 0 (FG0), and Flag Group 1 (FG1). Each group consists of four flags. Each flag is a single bit.

  • C (Carry flag). Set to 1 if an overflow occurred in the last arithmetic instruction.

  • L (LSb flag). The least significant bit of the result of the last arithmetic or shift instruction.

  • M (MSb flag). The most significant bit of the result of the last arithmetic or shift instruction.

  • Z (Zero flag). Set to 1 if the result of the last operation was zero; otherwise 0.

The L, M, and Z flags are determined based on the result of the operation as it is written back into the result register, without considering the overflow bit.
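
The following Python sketch shows one way to derive the four flags from a WLEN-bit addition. It is an assumed stand-in for the `AddWithCarry` helper used in the instruction pseudocode, which is not defined in this section: C receives the carry out of the most significant bit, while L, M and Z are computed from the truncated result, as the paragraph above requires.

```python
from typing import NamedTuple

WLEN = 256
MASK = (1 << WLEN) - 1

class Flags(NamedTuple):
    c: int  # carry out of bit WLEN-1 (the overflow bit)
    l: int  # bit 0 of the written-back result
    m: int  # bit WLEN-1 of the written-back result
    z: int  # 1 if the written-back result is zero

def add_with_carry(a: int, b: int, carry_in: int) -> tuple[int, Flags]:
    """Add two WLEN-bit values and a carry bit; return (result, flags)."""
    full = a + b + carry_in           # up to WLEN+1 bits wide
    result = full & MASK              # value written to the destination WDR
    flags = Flags(c=full >> WLEN,     # the overflow bit only affects C
                  l=result & 1,
                  m=result >> (WLEN - 1),
                  z=int(result == 0))
    return result, flags
```

Adding 1 to the all-ones value, for instance, produces a zero result with C = 1 and Z = 1: Z looks only at the truncated result, not at the overflow bit.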

Loop Stack

The LOOP instruction allows for nested loops; the active loops are stored on the loop stack. Each loop stack entry is a tuple of loop count, start address, and end address. The number of entries in the loop stack is implementation-dependent.

Call Stack

A stack (LIFO) of function call return addresses (also known as “return address stack”). The number of entries in this stack is implementation-dependent.

The call stack is accessed through the x1 GPR (return address). Writing to x1 pushes to the call stack, reading from it pops an item.
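
The x1 behaviour can be modelled by a small Python class. The depth of 8 entries is an arbitrary assumption (the real depth is implementation-dependent), a Python exception stands in for the alert raised on an empty-stack read, and pushing onto a full stack is not specified here, so the sketch simply raises in that case as well.

```python
class CallStack:
    """Hypothetical model of the hardware call stack behind register x1."""

    def __init__(self, depth: int = 8) -> None:  # depth is an assumption
        self.depth = depth
        self.entries: list[int] = []

    def write_x1(self, value: int) -> None:
        """Writing x1 pushes a 32-bit return address."""
        if len(self.entries) == self.depth:
            raise OverflowError("call stack full (behaviour not specified)")
        self.entries.append(value & 0xFFFF_FFFF)

    def read_x1(self) -> int:
        """Reading x1 pops the most recently pushed return address."""
        if not self.entries:
            raise RuntimeError("read from empty call stack: alert")
        return self.entries.pop()
```

In this model, `jal x1, <offset>` corresponds to `write_x1(pc + 4)`, and `jalr x0, x1, 0` corresponds to jumping to the address returned by `read_x1()`.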

Accumulator

A WLEN bit wide accumulator used by the BN.MULQACC instruction.

Base Instruction Subset

The base instruction set of OTBN is a limited 32b instruction set. It is used together with the 32b wide General Purpose Register file. The primary use of the base instruction set is the control flow in applications.

The base instruction set is an extended subset of RISC-V's RV32I_Zicsr. Refer to the RISC-V Unprivileged Specification for a detailed instruction specification. Not all RV32 instructions are implemented. The implemented subset is shown below.

ADD

Add.

ADD <grd>, <grs1>, <grs2>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): 0000000 | grs2 | grs1 | 000 | grd | 0110011

ADDI

Add Immediate.

ADDI <grd>, <grs1>, <imm>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): imm | grs1 | 000 | grd | 0010011

LUI

Load Upper Immediate.

LUI <grd>, <imm>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): imm | grd | 0110111

SUB

Subtract.

SUB <grd>, <grs1>, <grs2>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): 0100000 | grs2 | grs1 | 000 | grd | 0110011

SLL

Logical left shift.

SLL <grd>, <grs1>, <grs2>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): 0000000 | grs2 | grs1 | 001 | grd | 0110011

SLLI

Logical left shift with Immediate.

SLLI <grd>, <grs1>, <shamt>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): 0000000 | shamt | grs1 | 001 | grd | 0010011

SRL

Logical right shift.

SRL <grd>, <grs1>, <grs2>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): 0000000 | grs2 | grs1 | 101 | grd | 0110011

SRLI

Logical right shift with Immediate.

SRLI <grd>, <grs1>, <shamt>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): 0000000 | shamt | grs1 | 101 | grd | 0010011

SRA

Arithmetic right shift.

SRA <grd>, <grs1>, <grs2>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): 0100000 | grs2 | grs1 | 101 | grd | 0110011

SRAI

Arithmetic right shift with Immediate.

SRAI <grd>, <grs1>, <shamt>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): 0100000 | shamt | grs1 | 101 | grd | 0010011

AND

Bitwise AND.

AND <grd>, <grs1>, <grs2>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): 0000000 | grs2 | grs1 | 111 | grd | 0110011

ANDI

Bitwise AND with Immediate.

ANDI <grd>, <grs1>, <imm>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): imm | grs1 | 111 | grd | 0010011

OR

Bitwise OR.

OR <grd>, <grs1>, <grs2>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): 0000000 | grs2 | grs1 | 110 | grd | 0110011

ORI

Bitwise OR with Immediate.

ORI <grd>, <grs1>, <imm>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): imm | grs1 | 110 | grd | 0010011

XOR

Bitwise XOR.

XOR <grd>, <grs1>, <grs2>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): 0000000 | grs2 | grs1 | 100 | grd | 0110011

XORI

Bitwise XOR with Immediate.

XORI <grd>, <grs1>, <imm>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): imm | grs1 | 100 | grd | 0010011

LW

Load Word. Loads a 32b word from address offset + grs1 in data memory, writing the result to grd. Unaligned loads are not supported. Any address that is unaligned or is above the top of memory will result in an error (with error code ErrCodeBadDataAddr).

LW <grd>, <offset>(<grs1>)

This instruction is defined in the RV32I instruction set.

This instruction takes 2 cycles.

Encoding (MSB first): offset | grs1 | 010 | grd | 0000011

SW

Store Word. Stores a 32b word in grs2 to address offset + grs1 in data memory. Unaligned stores are not supported. Any address that is unaligned or is above the top of memory will result in an error (with error code ErrCodeBadDataAddr).

SW <grs2>, <offset>(<grs1>)

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): offset[11:5] | grs2 | grs1 | 010 | offset[4:0] | 0100011

BEQ

Branch Equal.

BEQ <grs1>, <grs2>, <offset>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): offset[12] | offset[10:5] | grs2 | grs1 | 000 | offset[4:1] | offset[11] | 1100011

BNE

Branch Not Equal.

BNE <grs1>, <grs2>, <offset>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): offset[12] | offset[10:5] | grs2 | grs1 | 001 | offset[4:1] | offset[11] | 1100011

JAL

Jump And Link.

JAL <grd>, <offset>

This instruction is defined in the RV32I instruction set.

The JAL instruction has the same behavior as in RV32I, jumping by the given offset and writing PC+4 as a link address to the destination register. OTBN has a hardware managed call stack, accessed through x1, which should be used when calling subroutines. Do so by using x1 as the link register: jal x1, <offset>.

Encoding (MSB first): offset[20] | offset[10:1] | offset[11] | offset[19:12] | grd | 1101111

JALR

Jump And Link Register.

JALR <grd>, <grs1>, <offset>

This instruction is defined in the RV32I instruction set.

The JALR instruction has the same behavior as in RV32I, jumping to the address <grs1> + <offset> and writing PC+4 as a link address to the destination register. OTBN has a hardware managed call stack, accessed through x1, which should be used when calling and returning from subroutines. To return from a subroutine, use jalr x0, x1, 0. This pops a link address from the call stack and branches to it. To call a subroutine through a function pointer, use jalr x1, <grs1>, 0. This jumps to the address in <grs1> and pushes the link address onto the call stack.

Encoding (MSB first): offset | grs1 | 000 | grd | 1100111

CSRRS

Atomic Read and Set bits in CSR. Reads the value of the CSR csr, and writes it to the destination GPR grd. The initial value in grs1 is treated as a bit mask that specifies bits to be set in the CSR. Any bit that is high in grs1 will cause the corresponding bit to be set in the CSR, if that CSR bit is writable. Other bits in the CSR are unaffected (though CSRs might have side effects when written).

CSRRS <grd>, <csr>, <grs1>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): csr | grs1 | 010 | grd | 1110011

Decode

csr_num = UInt(csr)
d = UInt(grd)
s = UInt(grs1)

Operation

gpr_s_val = GPR[s]
GPR[d] = CSR[csr_num]
CSR[csr_num] |= gpr_s_val

CSRRW

Atomic Read/Write CSR. Atomically swaps values in the CSR csr with the value in the GPR grs1. Reads the old value of the CSR, and writes it to the GPR grd. Writes the initial value in grs1 to the CSR csr. If grd == x0 the instruction does not read the CSR or cause any read-related side-effects.

CSRRW <grd>, <csr>, <grs1>

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): csr | grs1 | 001 | grd | 1110011

Decode

csr_num = UInt(csr)
d = UInt(grd)
s = UInt(grs1)

Operation

gpr_s_val = GPR[s]

if d != 0:
  GPR[d] = CSR[csr_num]

CSR[csr_num] = gpr_s_val

ECALL

Environment Call. Triggers the done interrupt to indicate the completion of the operation.

ECALL 

This instruction is defined in the RV32I instruction set.

Encoding (MSB first): 000000000000 | 00000 | 000 | 00000 | 1110011

LOOP

Note

The LOOP and LOOPI instructions are under-specified, and improvements to them are being discussed. See https://github.com/lowRISC/opentitan/issues/2496 for up-to-date information.

Loop (indirect). Repeats a sequence of code multiple times. The number of iterations is read from grs, treated as an unsigned value. The number of instructions in the loop is given in the bodysize immediate.

LOOP <grs>, <bodysize>
Assembly symbol | Description

<grs>

Name of the GPR containing the number of iterations

<bodysize>

Number of instructions in the loop body

Valid range: 0..4095

Encoding (MSB first): bodysize | grs | 000 | 1111011

LOOPI

Note

The LOOP and LOOPI instructions are under-specified, and improvements to them are being discussed. See https://github.com/lowRISC/opentitan/issues/2496 for up-to-date information.

Loop Immediate. Repeats a sequence of code multiple times. The iterations unsigned immediate operand gives the number of iterations and the bodysize unsigned immediate operand gives the number of instructions in the body.

LOOPI <iterations>, <bodysize>
Assembly symbol | Description

<iterations>

Number of iterations

Valid range: 0..1023

<bodysize>

Number of instructions in the loop body

Valid range: 0..4095

Encoding (MSB first): bodysize | iterations[9:5] | 001 | iterations[4:0] | 1111011
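
Because the loop semantics are still under discussion (see the note above), the following Python sketch should be read as one plausible interpretation of a single, non-nested LOOPI, not as normative behaviour. It assumes 4-byte instructions, that the loop body starts at the instruction immediately after LOOPI, and that execution falls through to the instruction after the body once all iterations are done. Zero iterations and an empty body are exactly the kind of corner cases left open, so the sketch rejects them.

```python
def loopi_trace(loop_pc: int, iterations: int, bodysize: int,
                after: int = 1) -> list[int]:
    """Return the PCs executed by a LOOPI at loop_pc, plus `after` more.

    Hypothetical model; assumptions: 4-byte instructions, body starts
    right after LOOPI, and 1 <= iterations, 1 <= bodysize.
    """
    if not (1 <= iterations <= 1023 and 1 <= bodysize <= 4095):
        raise ValueError("outside the range this sketch models")
    body_start = loop_pc + 4
    trace = [loop_pc]                     # LOOPI itself executes once
    for _ in range(iterations):
        trace.extend(body_start + 4 * i for i in range(bodysize))
    end = body_start + 4 * bodysize       # first instruction after the body
    trace.extend(end + 4 * i for i in range(after))
    return trace
```

Under these assumptions, a LOOPI at 0x10 with two iterations of a two-instruction body executes 0x10, then 0x14 and 0x18 twice, and continues at 0x1C.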

NOP

No Operation. A pseudo-operation that has no effect.

NOP 

This instruction is defined in the RV32I instruction set.

This instruction is a pseudo-operation and expands to the following instruction sequence:

ADDI x0, x0, 0

LI

Load Immediate. Loads a 32b signed immediate value into a GPR. This uses ADDI and LUI, expanding to one or two instructions, depending on the immediate: immediates in the range -2048..2047 can be loaded with just ADDI, and immediates whose lower 12 bits are all zero can be loaded with just LUI; general immediates need a LUI followed by an ADDI.

LI <grd>, <imm>

This instruction is defined in the RV32I instruction set.
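
The one-or-two instruction decision can be sketched in Python. This is an illustrative expansion, not a statement of what any particular assembler emits. Because ADDI sign-extends its 12-bit immediate, the LUI part is rounded up whenever bit 11 of the target is set, so that the negative ADDI part brings the value back to the target.

```python
def expand_li(rd: str, imm: int) -> list[str]:
    """Expand `LI rd, imm` into ADDI and/or LUI (illustrative sketch)."""
    if not -2**31 <= imm < 2**32:
        raise ValueError("immediate does not fit in 32 bits")
    imm &= 0xFFFF_FFFF                       # work on the 32-bit pattern
    low = imm & 0xFFF
    if low >= 0x800:                         # ADDI sign-extends bit 11
        low -= 0x1000
    upper = ((imm - low) >> 12) & 0xFFFFF    # 20-bit LUI immediate
    if upper == 0:
        return [f"ADDI {rd}, x0, {low}"]     # fits in a signed 12-bit value
    if low == 0:
        return [f"LUI {rd}, {upper:#x}"]     # lower 12 bits are zero
    return [f"LUI {rd}, {upper:#x}", f"ADDI {rd}, {rd}, {low}"]
```

For example, 0x12345800 becomes `LUI x5, 0x12346` followed by `ADDI x5, x5, -2048`, since 0x12346000 - 0x800 = 0x12345800.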

RET

Return from subroutine.

RET 

This instruction is defined in the RV32I instruction set.

This instruction is a pseudo-operation and expands to the following instruction sequence:

JALR x0, x1, 0

Big Number Instruction Subset

All Big Number (BN) instructions operate on the Wide Data Registers (WDRs).

BN.ADD

Add. Adds two WDR values, writes the result to the destination WDR and updates flags. The content of the second source WDR can be shifted by an unsigned immediate before it is consumed by the operation.

BN.ADD <wrd>, <wrs1>, <wrs2>[<shift_type> <shift_bytes>B][, FG<flag_group>]
Assembly symbol | Description

<wrd>

Name of the destination WDR

<wrs1>

Name of the first source WDR

<wrs2>

Name of the second source WDR

<shift_type>

The direction of an optional shift applied to <wrs2>.

Syntax table:

Syntax Value of immediate
<< 0
>> 1

<shift_bytes>

Number of bytes by which to shift <wrs2>. Defaults to 0.

Valid range: 0..31

<flag_group>

Flag group to use. Defaults to 0.

Valid range: 0..1

Encoding (MSB first): flag_group | shift_type | shift_bytes | wrs2 | wrs1 | 000 | wrd | 0101011

Decode

d = UInt(wrd)
a = UInt(wrs1)
b = UInt(wrs2)

fg = DecodeFlagGroup(flag_group)
sb = UInt(shift_bytes)
st = DecodeShiftType(shift_type)

Operation

b_shifted = ShiftReg(b, st, sb)
(result, flags_out) = AddWithCarry(a, b_shifted, "0")

WDR[d] = result
FLAGS[flag_group] = flags_out
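
`ShiftReg` is not defined in this section. The Python sketch below assumes it performs a logical shift of the WLEN-bit register value by whole bytes, discarding bits that leave the 256-bit window, and then models BN.ADD on top of it. The function names are hypothetical.

```python
WLEN = 256
MASK = (1 << WLEN) - 1

def shift_reg(value: int, shift_type: int, shift_bytes: int) -> int:
    """Assumed ShiftReg: byte-granular logical shift (0: <<, 1: >>)."""
    assert 0 <= shift_bytes <= 31
    if shift_type == 0:
        return (value << (8 * shift_bytes)) & MASK  # bits past WLEN are lost
    return value >> (8 * shift_bytes)                # zero fill from the top

def bn_add(a: int, b: int, shift_type: int = 0,
           shift_bytes: int = 0) -> tuple[int, int]:
    """Return (result, carry) for BN.ADD with an optional shift of b."""
    full = a + shift_reg(b, shift_type, shift_bytes)
    return full & MASK, full >> WLEN
```

Under these assumptions, `BN.ADD w1, w2, w3 << 16B` computes w2 + (w3 << 128) modulo 2^256 and sets C if that sum overflows.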

BN.ADDC

Add with Carry. Adds two WDR values and the Carry flag value, writes the result to the destination WDR, and updates the flags. The content of the second source WDR can be shifted by an unsigned immediate before it is consumed by the operation.

BN.ADDC <wrd>, <wrs1>, <wrs2>[<shift_type> <shift_bytes>B][, FG<flag_group>]
Assembly symbol | Description

<wrd>

Name of the destination WDR

<wrs1>

Name of the first source WDR

<wrs2>

Name of the second source WDR

<shift_type>

The direction of an optional shift applied to <wrs2>.

Syntax table:

Syntax Value of immediate
<< 0
>> 1

<shift_bytes>

Number of bytes by which to shift <wrs2>. Defaults to 0.

Valid range: 0..31

<flag_group>

Flag group to use. Defaults to 0.

Valid range: 0..1

Encoding (MSB first): flag_group | shift_type | shift_bytes | wrs2 | wrs1 | 010 | wrd | 0101011

Decode

d = UInt(wrd)
a = UInt(wrs1)
b = UInt(wrs2)

fg = DecodeFlagGroup(flag_group)
sb = UInt(shift_bytes)
st = DecodeShiftType(shift_type)

Operation

b_shifted = ShiftReg(b, st, sb)
(result, flags_out) = AddWithCarry(a, b_shifted, FLAGS[flag_group].C)

WDR[d] = result
FLAGS[flag_group] = flags_out
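
BN.ADDC is what allows additions wider than WLEN bits to be chained through the C flag. The Python sketch below adds two 512-bit numbers, each held as a (low, high) pair of 256-bit limbs; the limb layout and function names are assumptions for illustration. The low limbs correspond to a BN.ADD and the high limbs to a BN.ADDC on the same flag group.

```python
WLEN = 256
MASK = (1 << WLEN) - 1

def add_with_carry(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Return (WLEN-bit sum, carry out), the parts BN.ADD/BN.ADDC keep."""
    full = a + b + carry_in
    return full & MASK, full >> WLEN

def add_512(x: tuple[int, int],
            y: tuple[int, int]) -> tuple[tuple[int, int], int]:
    """Add two (low, high) 512-bit operands; return ((low, high), carry)."""
    lo, c = add_with_carry(x[0], y[0], 0)   # BN.ADD  on the low limbs
    hi, c = add_with_carry(x[1], y[1], c)   # BN.ADDC on the high limbs
    return (lo, hi), c
```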

BN.ADDI

Add Immediate. Adds a zero-extended unsigned immediate to the value of a WDR, writes the result to the destination WDR, and updates the flags.

BN.ADDI <wrd>, <wrs>, <imm>[, FG<flag_group>]
Assembly symbol | Description

<wrd>

Name of the destination WDR

<wrs>

Name of the source WDR

<imm>

Immediate value

Valid range: 0..1023

<flag_group>

Flag group to use. Defaults to 0.

Valid range: 0..1

Encoding (MSB first): flag_group | 0 | imm | wrs | 100 | wrd | 0101011

Decode

d = UInt(wrd)
a = UInt(wrs)

fg = DecodeFlagGroup(flag_group)
i = ZeroExtend(imm, WLEN)

Operation

(result, flags_out) = AddWithCarry(a, i, "0")

WDR[d] = result
FLAGS[flag_group] = flags_out

BN.ADDM

Pseudo-Modulo Add. Adds two WDR values, subtracts the value of the MOD WSR once if the result is equal to or larger than MOD, and writes the result to the destination WDR. This operation is a modulo addition if the sum of the two input registers is smaller than twice the value of the MOD WSR. Flags are not used or saved.

BN.ADDM <wrd>, <wrs1>, <wrs2>

Encoding (MSB first): 0 | wrs2 | wrs1 | 101 | wrd | 0101011

Decode

d = UInt(wrd)
a = UInt(wrs1)
b = UInt(wrs2)

Operation

(result, ) = AddWithCarry(a, b, "0")

if result >= MOD:
  result = result - MOD

WDR[d] = result
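
A Python model of the BN.ADDM pseudocode makes the precondition visible: a single conditional subtraction only fully reduces sums below 2*MOD. The sketch follows the pseudocode literally, keeping only the WLEN-bit result of the addition; the function name and test values are illustrative.

```python
WLEN = 256
MASK = (1 << WLEN) - 1

def bn_addm(a: int, b: int, mod: int) -> int:
    """Pseudo-modulo add, following the BN.ADDM pseudocode."""
    result = (a + b) & MASK     # AddWithCarry result; the carry is not kept
    if result >= mod:
        result -= mod           # at most one subtraction of MOD
    return result
```

For example, with MOD = 11, adding 30 and 5 gives 35, which is not below 2*MOD; one subtraction leaves 24 instead of the fully reduced 2. Keeping both inputs below MOD is a simple way to satisfy the precondition.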

BN.MULQACC

Quarter-word Multiply and Accumulate. Multiplies two WLEN/4-bit quarter-words selected from two WDRs, shifts the product left by acc_shift_imm bits, and adds the result to the accumulator.

For versions of the instruction with writeback, see BN.MULQACC.WO and BN.MULQACC.SO.

BN.MULQACC[<zero_acc>] <wrs1>.<wrs1_qwsel>, <wrs2>.<wrs2_qwsel>, <acc_shift_imm>
Assembly symbol | Description

<zero_acc>

Zero the accumulator before accumulating the multiply result.

To specify, use the literal syntax .z

<wrs1>

First source WDR

<wrs1_qwsel>

Quarter-word select for <wrs1>.

Valid values:

  • 0: Select wrs1[WLEN/4-1:0] (least significant quarter-word)
  • 1: Select wrs1[WLEN/2-1:WLEN/4]
  • 2: Select wrs1[WLEN/4*3-1:WLEN/2]
  • 3: Select wrs1[WLEN-1:WLEN/4*3] (most significant quarter-word)

Valid range: 0..3

<wrs2>

Second source WDR

<wrs2_qwsel>

Quarter-word select for <wrs2>.

Valid values:

  • 0: Select wrs2[WLEN/4-1:0] (least significant quarter-word)
  • 1: Select wrs2[WLEN/2-1:WLEN/4]
  • 2: Select wrs2[WLEN/4*3-1:WLEN/2]
  • 3: Select wrs2[WLEN-1:WLEN/4*3] (most significant quarter-word)

Valid range: 0..3

<acc_shift_imm>

The number of bits to shift the WLEN/2-bit multiply result before accumulating.

Valid range: 0..192 in steps of 64

Encoding (MSB first): 00 | wrs2_qwsel | wrs1_qwsel | wrs2 | wrs1 | acc_shift_imm[7:6] | zero_acc | 0111011

Decode

writeback_variant = None
zero_accumulator = DecodeMulqaccZeroacc(zero_acc)

d = None
a = UInt(wrs1)
b = UInt(wrs2)

d_hwsel = None
a_qwsel = DecodeQuarterWordSelect(wrs1_qwsel)
b_qwsel = DecodeQuarterWordSelect(wrs2_qwsel)

Operation

a_qw = GetQuarterWord(a, a_qwsel)
b_qw = GetQuarterWord(b, b_qwsel)

mul_res = a_qw * b_qw

if zero_accumulator:
  ACC = 0

ACC = ACC + (mul_res << acc_shift_imm)

if writeback_variant == 'shiftout':
  if d_hwsel == 'L':
    WDR[d][WLEN/2-1:0] = ACC[WLEN/2-1:0]
  elif d_hwsel == 'U':
    WDR[d][WLEN-1:WLEN/2] = ACC[WLEN/2-1:0]
  ACC = ACC >> (WLEN/2)

elif writeback_variant == 'writeout':
  WDR[d] = ACC
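
The quarter-word selection and accumulator update can be written as a Python model of one BN.MULQACC step without writeback. With WLEN = 256, each quarter-word is 64 bits and each product is at most 128 bits. Wrapping ACC modulo 2^WLEN on overflow is an assumption; the pseudocode does not say what happens in that case.

```python
WLEN = 256
QW = WLEN // 4                    # quarter-word width: 64 bits
ACC_MASK = (1 << WLEN) - 1

def get_quarter_word(value: int, qwsel: int) -> int:
    """Return value[QW*(qwsel+1)-1 : QW*qwsel]."""
    assert 0 <= qwsel <= 3
    return (value >> (QW * qwsel)) & ((1 << QW) - 1)

def bn_mulqacc(acc: int, a: int, a_qwsel: int, b: int, b_qwsel: int,
               acc_shift_imm: int, zero_acc: bool = False) -> int:
    """Return the new ACC value after one BN.MULQACC (no writeback)."""
    assert acc_shift_imm in (0, 64, 128, 192)
    mul_res = get_quarter_word(a, a_qwsel) * get_quarter_word(b, b_qwsel)
    if zero_acc:
        acc = 0
    # Wrapping at WLEN bits is an assumption of this sketch.
    return (acc + (mul_res << acc_shift_imm)) & ACC_MASK
```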

BN.MULQACC.WO

Quarter-word Multiply and Accumulate with full-word writeback. Multiplies two WLEN/4-bit quarter-words selected from two WDRs, shifts the product left by acc_shift_imm bits, and adds the result to the accumulator. Writes the resulting accumulator to wrd.

BN.MULQACC.WO[<zero_acc>] <wrd>, <wrs1>.<wrs1_qwsel>, <wrs2>.<wrs2_qwsel>, <acc_shift_imm>
Assembly symbol | Description

<zero_acc>

Zero the accumulator before accumulating the multiply result.

To specify, use the literal syntax .z

<wrd>

Destination WDR.

<wrs1>

First source WDR

<wrs1_qwsel>

Quarter-word select for <wrs1>.

Valid values:

  • 0: Select wrs1[WLEN/4-1:0] (least significant quarter-word)
  • 1: Select wrs1[WLEN/2-1:WLEN/4]
  • 2: Select wrs1[WLEN/4*3-1:WLEN/2]
  • 3: Select wrs1[WLEN-1:WLEN/4*3] (most significant quarter-word)

Valid range: 0..3

<wrs2>

Second source WDR

<wrs2_qwsel>

Quarter-word select for <wrs2>.

Valid values:

  • 0: Select wrs2[WLEN/4-1:0] (least significant quarter-word)
  • 1: Select wrs2[WLEN/2-1:WLEN/4]
  • 2: Select wrs2[WLEN/4*3-1:WLEN/2]
  • 3: Select wrs2[WLEN-1:WLEN/4*3] (most significant quarter-word)

Valid range: 0..3

<acc_shift_imm>

The number of bits to shift the WLEN/2-bit multiply result before accumulating.

Valid range: 0..192 in steps of 64

Encoding (MSB first): 01 | wrs2_qwsel | wrs1_qwsel | wrs2 | wrs1 | acc_shift_imm[7:6] | zero_acc | wrd | 0111011

Decode

writeback_variant = 'writeout'
zero_accumulator = DecodeMulqaccZeroacc(zero_acc)

d = UInt(wrd)
a = UInt(wrs1)
b = UInt(wrs2)

d_hwsel = None
a_qwsel = DecodeQuarterWordSelect(wrs1_qwsel)
b_qwsel = DecodeQuarterWordSelect(wrs2_qwsel)

Operation

a_qw = GetQuarterWord(a, a_qwsel)
b_qw = GetQuarterWord(b, b_qwsel)

mul_res = a_qw * b_qw

if zero_accumulator:
  ACC = 0

ACC = ACC + (mul_res << acc_shift_imm)

if writeback_variant == 'shiftout':
  if d_hwsel == 'L':
    WDR[d][WLEN/2-1:0] = ACC[WLEN/2-1:0]
  elif d_hwsel == 'U':
    WDR[d][WLEN-1:WLEN/2] = ACC[WLEN/2-1:0]
  ACC = ACC >> (WLEN/2)

elif writeback_variant == 'writeout':
  WDR[d] = ACC
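
MULQACC steps are typically combined to build products wider than one quarter-word. The Python sketch below is one possible schedule (an illustration, not code taken from OTBN software) for multiplying the low 128 bits of two WDRs: three accumulating steps followed by a BN.MULQACC.WO that writes the finished 256-bit product. The register names in the comments (w1, w2, w3) are arbitrary.

```python
WLEN = 256
QW = WLEN // 4
ACC_MASK = (1 << WLEN) - 1

def qw(value: int, sel: int) -> int:
    """Quarter-word `sel` (0 = least significant) of a WLEN-bit value."""
    return (value >> (QW * sel)) & ((1 << QW) - 1)

def mulqacc(acc: int, a: int, asel: int, b: int, bsel: int,
            shift: int, zero: bool = False) -> int:
    """ACC update shared by all BN.MULQACC variants (wrap assumed)."""
    if zero:
        acc = 0
    return (acc + ((qw(a, asel) * qw(b, bsel)) << shift)) & ACC_MASK

def mul_128x128(a: int, b: int) -> int:
    """Multiply a[127:0] by b[127:0]; returns what .WO writes to wrd."""
    acc = mulqacc(0, a, 0, b, 0, 0, zero=True)  # BN.MULQACC.z  w1.0, w2.0, 0
    acc = mulqacc(acc, a, 1, b, 0, 64)          # BN.MULQACC    w1.1, w2.0, 64
    acc = mulqacc(acc, a, 0, b, 1, 64)          # BN.MULQACC    w1.0, w2.1, 64
    acc = mulqacc(acc, a, 1, b, 1, 128)         # BN.MULQACC.WO w3, w1.1, w2.1, 128
    return acc                                  # .WO: WDR[wrd] = ACC
```

The full product of two 128-bit values is below 2^256, so in this schedule the accumulator never wraps.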

BN.MULQACC.SO

Quarter-word Multiply and Accumulate with half-word writeback. Multiplies two WLEN/4-bit quarter-words selected from two WDRs, shifts the product left by <acc_shift_imm> bits, and adds the result to the accumulator. Next, shifts the resulting accumulator right by half a word (WLEN/2 bits). The bits that are shifted out are written to a half-word of <wrd>, selected with <wrd_hwsel>.

BN.MULQACC.SO[<zero_acc>] <wrd>.<wrd_hwsel>, <wrs1>.<wrs1_qwsel>, <wrs2>.<wrs2_qwsel>, <acc_shift_imm>
Assembly symbol | Description

<zero_acc>

Zero the accumulator before accumulating the multiply result.

To specify, use the literal syntax .z

<wrd>

Destination WDR.

<wrd_hwsel>

Half-word select for <wrd>. A value of L means the less significant half-word; U means the more significant half-word.

Syntax table:

Syntax Value of immediate
l 0
u 1

<wrs1>

First source WDR

<wrs1_qwsel>

Quarter-word select for <wrs1>.

Valid values:

  • 0: Select wrs1[WLEN/4-1:0] (least significant quarter-word)
  • 1: Select wrs1[WLEN/2-1:WLEN/4]
  • 2: Select wrs1[WLEN/4*3-1:WLEN/2]
  • 3: Select wrs1[WLEN-1:WLEN/4*3] (most significant quarter-word)

Valid range: 0..3

<wrs2>

Second source WDR

<wrs2_qwsel>

Quarter-word select for <wrs2>.

Valid values:

  • 0: Select wrs2[WLEN/4-1:0] (least significant quarter-word)
  • 1: Select wrs2[WLEN/2-1:WLEN/4]
  • 2: Select wrs2[WLEN/4*3-1:WLEN/2]
  • 3: Select wrs2[WLEN-1:WLEN/4*3] (most significant quarter-word)

Valid range: 0..3

<acc_shift_imm>

The number of bits to shift the WLEN/2-bit multiply result before accumulating.

Valid range: 0..192 in steps of 64

Encoding (MSB first): 1 | wrd_hwsel | wrs2_qwsel | wrs1_qwsel | wrs2 | wrs1 | acc_shift_imm[7:6] | zero_acc | wrd | 0111011

Decode

writeback_variant = 'shiftout'
zero_accumulator = DecodeMulqaccZeroacc(zero_acc)

d = UInt(wrd)
a = UInt(wrs1)
b = UInt(wrs2)

d_hwsel = DecodeHalfWordSelect(wrd_hwsel)
a_qwsel = DecodeQuarterWordSelect(wrs1_qwsel)
b_qwsel = DecodeQuarterWordSelect(wrs2_qwsel)

Operation

a_qw = GetQuarterWord(a, a_qwsel)
b_qw = GetQuarterWord(b, b_qwsel)

mul_res = a_qw * b_qw

if zero_accumulator:
  ACC = 0

ACC = ACC + (mul_res << acc_shift_imm)

if writeback_variant == 'shiftout':
  if d_hwsel == 'L':
    WDR[d][WLEN/2-1:0] = ACC[WLEN/2-1:0]
  elif d_hwsel == 'U':
    WDR[d][WLEN-1:WLEN/2] = ACC[WLEN/2-1:0]
  ACC = ACC >> (WLEN/2)

elif writeback_variant == 'writeout':
  WDR[d] = ACC
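
The .SO variant allows a full WLEN x WLEN multiplication to be streamed out in half-word chunks. The Python sketch below models one possible schedule (an illustration, not a reference implementation): the sixteen 64x64-bit partial products a.i * b.j are grouped two diagonals (values of i + j) at a time, and the last MULQACC of each group is the .SO variant. Grouping this way keeps ACC below about 2^195 throughout, so the assumed WLEN-bit accumulator never overflows.

```python
WLEN = 256
QW = WLEN // 4                      # 64-bit quarter-words
HALF = WLEN // 2                    # 128-bit half-words
ACC_MASK = (1 << WLEN) - 1

def qw(value: int, sel: int) -> int:
    """Quarter-word `sel` (0 = least significant) of a WLEN-bit value."""
    return (value >> (QW * sel)) & ((1 << QW) - 1)

# (i, j, acc_shift_imm) for each partial product a.i * b.j; the last step
# of every group is assumed to be a BN.MULQACC.SO.
SCHEDULE = [
    [(0, 0, 0), (0, 1, 64), (1, 0, 64)],
    [(0, 2, 0), (1, 1, 0), (2, 0, 0),
     (0, 3, 64), (1, 2, 64), (2, 1, 64), (3, 0, 64)],
    [(1, 3, 0), (2, 2, 0), (3, 1, 0), (2, 3, 64), (3, 2, 64)],
    [(3, 3, 0)],
]

def mul_256x256(a: int, b: int) -> int:
    """Return the 2*WLEN-bit product a * b, built from .SO half-word outputs."""
    acc = 0
    chunks = []                     # half-words shifted out, least significant first
    for group in SCHEDULE:
        for i, j, shift in group:
            acc = (acc + ((qw(a, i) * qw(b, j)) << shift)) & ACC_MASK
        # .SO on the group's last step: write ACC[WLEN/2-1:0] to a half-word
        # of the destination, then shift ACC right by WLEN/2.
        chunks.append(acc & ((1 << HALF) - 1))
        acc >>= HALF
    return sum(chunk << (HALF * k) for k, chunk in enumerate(chunks))
```

In this schedule the four chunks would land in the L and U halves of one destination WDR (low 256 bits of the product) and then in the L and U halves of a second one (high 256 bits).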

BN.SUB

Subtraction. Subtracts the second WDR value from the first one, writes the result to the destination WDR and updates flags. The content of the second source WDR can be shifted by an unsigned immediate before it is consumed by the operation.

BN.SUB <wrd>, <wrs1>, <wrs2>[<shift_type> <shift_bytes>B][, FG<flag_group>]
Assembly symbol | Description

<wrd>

Name of the destination WDR

<wrs1>

Name of the first source WDR

<wrs2>

Name of the second source WDR

<shift_type>

The direction of an optional shift applied to <wrs2>.

Syntax table:

Syntax Value of immediate
<< 0
>> 1

<shift_bytes>

Number of bytes by which to shift <wrs2>. Defaults to 0.

Valid range: 0..31

<flag_group>

Flag group to use. Defaults to 0.

Valid range: 0..1

Encoding (MSB first): flag_group | shift_type | shift_bytes | wrs2 | wrs1 | 001 | wrd | 0101011

Decode

d = UInt(wrd)
a = UInt(wrs1)
b = UInt(wrs2)

fg = DecodeFlagGroup(flag_group)
sb = UInt(shift_bytes)
st = DecodeShiftType(shift_type)

Operation

b_shifted = ShiftReg(b, st, sb)
(result, flags_out) = SubtractWithBorrow(a, b_shifted, 0)

WDR[d] = result
FLAGS[flag_group] = flags_out
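
Because the C flag of BN.SUB reports a borrow, a subtraction can double as an unsigned comparison. The Python sketch assumes `SubtractWithBorrow` sets C exactly when the subtrahend plus the borrow-in exceeds the minuend, and that Z reflects the truncated result; that helper is not defined in this section, and the names used here are hypothetical.

```python
WLEN = 256
MASK = (1 << WLEN) - 1

def subtract_with_borrow(a: int, b: int,
                         borrow_in: int) -> tuple[int, int, int]:
    """Return (WLEN-bit difference, C flag (borrow), Z flag)."""
    diff = a - b - borrow_in
    result = diff & MASK            # two's-complement wrap to WLEN bits
    return result, int(diff < 0), int(result == 0)

def unsigned_less_than(a: int, b: int) -> bool:
    """a < b exactly when BN.SUB a, b sets the C (borrow) flag."""
    _, c, _ = subtract_with_borrow(a, b, 0)
    return c == 1
```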

BN.SUBB

Subtract with borrow. Subtracts the second WDR value and the Carry from the first one, writes the result to the destination WDR, and updates the flags. The content of the second source WDR can be shifted by an unsigned immediate before it is consumed by the operation.

BN.SUBB <wrd>, <wrs1>, <wrs2>[<shift_type> <shift_bytes>B][, FG<flag_group>]
Assembly symbol | Description

<wrd>

Name of the destination WDR

<wrs1>

Name of the first source WDR

<wrs2>

Name of the second source WDR

<shift_type>

The direction of an optional shift applied to <wrs2>.

Syntax table:

Syntax Value of immediate
<< 0
>> 1

<shift_bytes>

Number of bytes by which to shift <wrs2>. Defaults to 0.

Valid range: 0..31

<flag_group>

Flag group to use. Defaults to 0.

Valid range: 0..1

Encoding (MSB first): flag_group | shift_type | shift_bytes | wrs2 | wrs1 | 011 | wrd | 0101011

Decode

d = UInt(wrd)
a = UInt(wrs1)
b = UInt(wrs2)

fg = DecodeFlagGroup(flag_group)
sb = UInt(shift_bytes)
st = DecodeShiftType(shift_type)

Operation

b_shifted = ShiftReg(b, st, sb)
(result, flags_out) = SubtractWithBorrow(a, b_shifted, FLAGS[flag_group].C)

WDR[d] = result
FLAGS[flag_group] = flags_out
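
BN.SUBB lets a subtraction wider than WLEN bits be chained through the C (borrow) flag. The Python sketch below subtracts two 512-bit values held as (low, high) pairs of 256-bit limbs; the limb layout and function names are assumptions for illustration. The low limbs correspond to a BN.SUB and the high limbs to a BN.SUBB on the same flag group.

```python
WLEN = 256
MASK = (1 << WLEN) - 1

def subtract_with_borrow(a: int, b: int, borrow_in: int) -> tuple[int, int]:
    """Return (WLEN-bit difference, borrow out)."""
    diff = a - b - borrow_in
    return diff & MASK, int(diff < 0)

def sub_512(x: tuple[int, int],
            y: tuple[int, int]) -> tuple[tuple[int, int], int]:
    """Compute x - y for (low, high) operands; return ((low, high), borrow)."""
    lo, c = subtract_with_borrow(x[0], y[0], 0)   # BN.SUB  on the low limbs
    hi, c = subtract_with_borrow(x[1], y[1], c)   # BN.SUBB on the high limbs
    return (lo, hi), c
```

A final borrow of 1 signals that y was larger than x, in which case the limbs hold x - y modulo 2^512.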

BN.SUBI

Subtract Immediate. Subtracts a zero-extended unsigned immediate from the value of a WDR, writes the result to the destination WDR, and updates the flags.

BN.SUBI <wrd>, <wrs>, <imm>[, FG<flag_group>]
Assembly symbol | Description

<wrd>

Name of the destination WDR

<wrs>

Name of the source WDR

<imm>

Immediate value

Valid range: 0..1023

<flag_group>

Flag group to use. Defaults to 0.

Valid range: 0..1

Encoding (MSB first): flag_group | 1 | imm | wrs | 100 | wrd | 0101011

Decode

d = UInt(wrd)
a = UInt(wrs)

fg = DecodeFlagGroup(flag_group)
i = ZeroExtend(imm, WLEN)

Operation

(result, flags_out) = SubtractWithBorrow(WDR[a], i, 0)

WDR[d] = result
FLAGS[flag_group] = flags_out

BN.SUBM

Pseudo-modulo subtraction. Subtracts the second WDR value from the first WDR value, adds the value of the MOD WSR if the subtraction underflowed, and writes the result to the destination WDR. This operation is equivalent to a modulo subtraction as long as wrs1 - wrs2 >= -MOD holds. This constraint is not checked in hardware. Flags are neither used nor updated.

BN.SUBM <wrd>, <wrs1>, <wrs2>

Encoding (fields from bit 31 to bit 0): 1 | wrs2 | wrs1 | 101 | wrd | 0101011

Decode

d = UInt(wrd)
a = UInt(wrs1)
b = UInt(wrs2)

Operation

(result, flags_out) = SubtractWithBorrow(WDR[a], WDR[b], 0)

if flags_out.C:  # borrow set: WDR[a] - WDR[b] was negative
  result = (MOD + result)[WLEN-1:0]

WDR[d] = result
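A minimal Python model of the pseudo-modulo subtraction follows. It assumes WLEN = 256 and uses a caller-supplied `mod` argument as a stand-in for the MOD WSR. It detects the negative case through the borrow rather than by comparing the truncated WLEN-bit result against zero, which could never be negative. It also shows that the result is only a true modular difference when wrs1 - wrs2 >= -MOD. The function name `bn_subm` is hypothetical.

```python
WLEN = 256
MASK = (1 << WLEN) - 1

def bn_subm(a: int, b: int, mod: int) -> int:
    """Model BN.SUBM: subtract, then add MOD once if the subtraction borrowed."""
    diff = a - b
    result = diff & MASK
    if diff < 0:                      # corresponds to the C (borrow) flag of SubtractWithBorrow
        result = (result + mod) & MASK
    return result
```

For inputs already reduced below MOD, the result equals (a - b) mod MOD. A difference below -MOD is not fully reduced, which is why the constraint above matters.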

BN.AND

Bitwise AND. Performs a bitwise AND of the values stored in the WDRs referenced by wrs1 and wrs2 and stores the result in the WDR referenced by wrd. The content of the second source WDR can be shifted by an immediate before it is consumed by the operation. The M, L and Z flags in the selected flag group are updated with the result of the operation; the C flag is left unchanged.

BN.AND <wrd>, <wrs1>, <wrs2>[, <shift_type> <shift_bytes>B][, FG<flag_group>]
Assembly symbol | Description

<wrd>

Name of the destination WDR

<wrs1>

Name of the first source WDR

<wrs2>

Name of the second source WDR

<shift_type>

The direction of an optional shift applied to <wrs2>.

Syntax table:

Syntax Value of immediate
<< 0
>> 1

<shift_bytes>

Number of bytes by which to shift <wrs2>. Defaults to 0.

Valid range: 0..31

<flag_group>

Flag group to use. Defaults to 0.

Valid range: 0..1

Encoding (fields from bit 31 to bit 0): flag_group | shift_type | shift_bytes | wrs2 | wrs1 | 010 | wrd | 1111011

Decode

d = UInt(wrd)
a = UInt(wrs1)
b = UInt(wrs2)

sb = UInt(shift_bytes)
st = DecodeShiftType(shift_type)
fg = DecodeFlagGroup(flag_group)

Operation

b_shifted = ShiftReg(b, st, sb)
result = WDR[a] & b_shifted

WDR[d] = result
flags_out = FlagsForResult(result)
FLAGS[flag_group] = {M: flags_out.M, L: flags_out.L, Z: flags_out.Z, C: FLAGS[flag_group].C}
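FlagsForResult is not defined among the pseudo-code functions in this document. The following Python sketch assumes it derives M, L and Z the same way AddWithCarry does: M is bit WLEN-1, L is bit 0, and Z is set for a zero result. C is carried over unchanged, as the operation above specifies. It also models the byte-granular operand shift applied to wrs2. WLEN = 256 is assumed, and the names `Flags`, `shift_reg` and `bn_and` are hypothetical.

```python
from dataclasses import dataclass, replace

WLEN = 256
MASK = (1 << WLEN) - 1

@dataclass(frozen=True)
class Flags:
    C: int = 0
    M: int = 0
    L: int = 0
    Z: int = 0

def shift_reg(value: int, shift_right: bool, shift_bytes: int) -> int:
    """Byte-granular logical shift of a WLEN-bit value, truncated to WLEN bits."""
    bits = shift_bytes * 8
    return (value >> bits) if shift_right else (value << bits) & MASK

def bn_and(a: int, b: int, shift_right: bool, shift_bytes: int, flags: Flags) -> tuple[int, Flags]:
    """Model BN.AND: M, L and Z follow the result; C is carried over unchanged."""
    result = a & shift_reg(b, shift_right, shift_bytes)
    new_flags = replace(flags, M=(result >> (WLEN - 1)) & 1, L=result & 1, Z=int(result == 0))
    return result, new_flags
```

BN.OR, BN.XOR and BN.NOT follow the same pattern with a different bitwise operator.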

BN.OR

Bitwise OR. Performs a bitwise OR of the values stored in the WDRs referenced by wrs1 and wrs2 and stores the result in the WDR referenced by wrd. The content of the second source WDR can be shifted by an immediate before it is consumed by the operation. The M, L and Z flags in the selected flag group are updated with the result of the operation; the C flag is left unchanged.

BN.OR <wrd>, <wrs1>, <wrs2>[, <shift_type> <shift_bytes>B][, FG<flag_group>]
Assembly symbol | Description

<wrd>

Name of the destination WDR

<wrs1>

Name of the first source WDR

<wrs2>

Name of the second source WDR

<shift_type>

The direction of an optional shift applied to <wrs2>.

Syntax table:

Syntax Value of immediate
<< 0
>> 1

<shift_bytes>

Number of bytes by which to shift <wrs2>. Defaults to 0.

Valid range: 0..31

<flag_group>

Flag group to use. Defaults to 0.

Valid range: 0..1

Encoding (fields from bit 31 to bit 0): flag_group | shift_type | shift_bytes | wrs2 | wrs1 | 100 | wrd | 1111011

Decode

d = UInt(wrd)
a = UInt(wrs1)
b = UInt(wrs2)

sb = UInt(shift_bytes)
st = DecodeShiftType(shift_type)
fg = DecodeFlagGroup(flag_group)

Operation

b_shifted = ShiftReg(b, st, sb)
result = WDR[a] | b_shifted

WDR[d] = result
flags_out = FlagsForResult(result)
FLAGS[flag_group] = {M: flags_out.M, L: flags_out.L, Z: flags_out.Z, C: FLAGS[flag_group].C}

BN.NOT

Bitwise NOT. Inverts all bits of the value in the WDR referenced by wrs and stores the result in the WDR referenced by wrd. The source value can be shifted by an immediate before it is consumed by the operation. The M, L and Z flags in the selected flag group are updated with the result of the operation; the C flag is left unchanged.

BN.NOT <wrd>, <wrs>[, <shift_type> <shift_bytes>B][, FG<flag_group>]
Assembly symbol | Description

<wrd>

Name of the destination WDR

<wrs>

Name of the source WDR

<shift_type>

The direction of an optional shift applied to <wrs>.

Syntax table:

Syntax Value of immediate
<< 0
>> 1

<shift_bytes>

Number of bytes by which to shift <wrs>. Defaults to 0.

Valid range: 0..31

<flag_group>

Flag group to use. Defaults to 0.

Valid range: 0..1

Encoding (fields from bit 31 to bit 0): flag_group | shift_type | shift_bytes | wrs | 101 | wrd | 1111011

Decode

d = UInt(wrd)
a = UInt(wrs)

sb = UInt(shift_bytes)
st = DecodeShiftType(shift_type)
fg = DecodeFlagGroup(flag_group)

Operation

a_shifted = ShiftReg(a, st, sb)
result = ~a_shifted

WDR[d] = result
flags_out = FlagsForResult(result)
FLAGS[flag_group] = {M: flags_out.M, L: flags_out.L, Z: flags_out.Z, C: FLAGS[flag_group].C}

BN.XOR

Bitwise XOR. Performs a bitwise XOR of the values stored in the WDRs referenced by wrs1 and wrs2 and stores the result in the WDR referenced by wrd. The content of the second source WDR can be shifted by an immediate before it is consumed by the operation. The M, L and Z flags in the selected flag group are updated with the result of the operation; the C flag is left unchanged.

BN.XOR <wrd>, <wrs1>, <wrs2>[, <shift_type> <shift_bytes>B][, FG<flag_group>]
Assembly symbol | Description

<wrd>

Name of the destination WDR

<wrs1>

Name of the first source WDR

<wrs2>

Name of the second source WDR

<shift_type>

The direction of an optional shift applied to <wrs2>.

Syntax table:

Syntax Value of immediate
<< 0
>> 1

<shift_bytes>

Number of bytes by which to shift <wrs2>. Defaults to 0.

Valid range: 0..31

<flag_group>

Flag group to use. Defaults to 0.

Valid range: 0..1

Encoding (fields from bit 31 to bit 0): flag_group | shift_type | shift_bytes | wrs2 | wrs1 | 110 | wrd | 1111011

Decode

d = UInt(wrd)
a = UInt(wrs1)
b = UInt(wrs2)

sb = UInt(shift_bytes)
st = DecodeShiftType(shift_type)
fg = DecodeFlagGroup(flag_group)

Operation

b_shifted = ShiftReg(b, st, sb)
result = WDR[a] ^ b_shifted

WDR[d] = result
flags_out = FlagsForResult(result)
FLAGS[flag_group] = {M: flags_out.M, L: flags_out.L, Z: flags_out.Z, C: FLAGS[flag_group].C}

BN.RSHI

Concatenate and right shift immediate. Concatenates the contents of the WDRs referenced by wrs1 and wrs2 (wrs1 forms the upper part), shifts the resulting 2*WLEN-bit value right by an immediate, and truncates it to WLEN bits. The result is stored in the WDR referenced by wrd.

BN.RSHI <wrd>, <wrs1>, <wrs2> >> <imm>
Assembly symbol | Description

<wrd>

Name of the destination WDR

<wrs1>

Name of the first source WDR

<wrs2>

Name of the second source WDR

<imm>

Number of bits by which to shift the concatenation of <wrs1> and <wrs2>.

Valid range: 0..255

Encoding (fields from bit 31 to bit 0): imm[7:1] | wrs2 | wrs1 | imm[0] | 11 | wrd | 1111011

Decode

d = UInt(wrd)
a = UInt(wrs1)
b = UInt(wrs2)
shift_bit = UInt(imm)

Operation

WDR[d] = (((WDR[a] << WLEN) | WDR[b]) >> shift_bit)[WLEN-1:0]
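A Python sketch of BN.RSHI follows, assuming WLEN = 256; `bn_rshi` is a hypothetical name. One use, shown here only as an illustration, is shifting a two-word value right by `n` bits. The new low word is BN.RSHI applied to (high, low, n), and the new high word is the old high word shifted right by `n`.

```python
WLEN = 256
MASK = (1 << WLEN) - 1

def bn_rshi(a: int, b: int, imm: int) -> int:
    """Model BN.RSHI: shift the 2*WLEN-bit concatenation a:b right by imm, keep the low WLEN bits."""
    assert 0 <= imm < WLEN
    return (((a << WLEN) | b) >> imm) & MASK
```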

BN.SEL

Flag Select. Returns in the destination WDR the value of the first source WDR if the flag in the chosen flag group is set, otherwise returns the value of the second source WDR.

BN.SEL <wrd>, <wrs1>, <wrs2>, [FG<flag_group>.]<flag>
Assembly symbol | Description

<wrd>

Name of the destination WDR

<wrs1>

Name of the first source WDR

<wrs2>

Name of the second source WDR

<flag_group>

Flag group to use. Defaults to 0.

Valid range: 0..1

<flag>

Flag to check. Valid values:

  • C: Carry flag
  • M: MSB flag
  • L: LSB flag
  • Z: Zero flag

Syntax table:

Syntax Value of immediate
c 0
m 1
l 2
z 3

Encoding (fields from bit 31 to bit 0): flag_group | flag | wrs2 | wrs1 | 000 | wrd | 0001011

Decode

d = UInt(wrd)
a = UInt(wrs1)
b = UInt(wrs2)
fg = DecodeFlagGroup(flag_group)
flag = DecodeFlag(flag)

Operation

flag_is_set = FLAGS[fg].get(flag)

WDR[d] = WDR[a] if flag_is_set else WDR[b]
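As a sketch, BN.SEL can be modelled in Python with the flag group represented as a dict mapping flag names to bits; this representation and the names `bn_sel` and `reduce_once` are hypothetical. The helper shows one plausible pattern. A subtraction sets C on borrow, and BN.SEL then keeps either the original value or the difference without a data-dependent branch in the OTBN program. WLEN = 256 is assumed.

```python
WLEN = 256
MASK = (1 << WLEN) - 1

def bn_sel(wrs1_val: int, wrs2_val: int, flags: dict[str, int], flag: str) -> int:
    """Model BN.SEL: return the first operand if the chosen flag is set, else the second."""
    assert flag in ("C", "M", "L", "Z")
    return wrs1_val if flags[flag] else wrs2_val

def reduce_once(x: int, m: int) -> int:
    """Return x - m unless that subtraction borrows, in which case keep x (BN.SUB, then BN.SEL on C)."""
    diff = (x - m) & MASK
    flags = {"C": int(x < m), "M": 0, "L": 0, "Z": 0}   # only C matters here
    return bn_sel(x, diff, flags, "C")
```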

BN.CMP

Compare. Subtracts the second WDR value from the first one and updates flags. This instruction is identical to BN.SUB, except that no result register is written.

BN.CMP <wrs1>, <wrs2>[, <shift_type> <shift_bytes>B][, FG<flag_group>]
Assembly symbol | Description

<wrs1>

Name of the first source WDR

<wrs2>

Name of the second source WDR

<shift_type>

The direction of an optional shift applied to <wrs2>.

Syntax table:

Syntax Value of immediate
<< 0
>> 1

<shift_bytes>

Number of bytes by which to shift <wrs2>. Defaults to 0.

Valid range: 0..31

<flag_group>

Flag group to use. Defaults to 0.

Valid range: 0..1

Encoding (fields from bit 31 to bit 0): flag_group | shift_type | shift_bytes | wrs2 | wrs1 | 001 | 0001011

Decode

a = UInt(wrs1)
b = UInt(wrs2)

fg = DecodeFlagGroup(flag_group)
sb = UInt(shift_bytes)
st = DecodeShiftType(shift_type)

Operation

b_shifted = ShiftReg(b, st, sb)
(_, flags_out) = SubtractWithBorrow(WDR[a], b_shifted, 0)

FLAGS[flag_group] = flags_out

BN.CMPB

Compare with Borrow. Subtracts the second WDR value and the Carry flag from the first one and updates flags. This instruction is identical to BN.SUBB, except that no result register is written.

BN.CMPB <wrs1>, <wrs2>[, <shift_type> <shift_bytes>B][, FG<flag_group>]
Assembly symbol | Description

<wrs1>

Name of the first source WDR

<wrs2>

Name of the second source WDR

<shift_type>

The direction of an optional shift applied to <wrs2>.

Syntax table:

Syntax Value of immediate
<< 0
>> 1

<shift_bytes>

Number of bytes by which to shift <wrs2>. Defaults to 0.

Valid range: 0..31

<flag_group>

Flag group to use. Defaults to 0.

Valid range: 0..1

Encoding (fields from bit 31 to bit 0): flag_group | shift_type | shift_bytes | wrs2 | wrs1 | 011 | 0001011

Decode

a = UInt(wrs1)
b = UInt(wrs2)

fg = DecodeFlagGroup(flag_group)
sb = UInt(shift_bytes)
st = DecodeShiftType(shift_type)

Operation

b_shifted = ShiftReg(b, st, sb)
(_, flags_out) = SubtractWithBorrow(WDR[a], b_shifted, FLAGS[flag_group].C)

FLAGS[flag_group] = flags_out
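BN.CMP and BN.CMPB chain in the same way as BN.SUB and BN.SUBB. The Python sketch below compares two multi-word numbers stored least-significant word first. It assumes WLEN = 256, and `cmp_chain` is a hypothetical name. After the final BN.CMPB, the C flag is set exactly when the first number is smaller than the second.

```python
WLEN = 256

def cmp_chain(x_words: list[int], y_words: list[int]) -> int:
    """Return the final C flag of BN.CMP on word 0 followed by BN.CMPB on the remaining words."""
    assert len(x_words) == len(y_words)
    borrow = 0                                   # the first step behaves like BN.CMP (borrow-in 0)
    for xw, yw in zip(x_words, y_words):
        borrow = int(xw - yw - borrow < 0)       # only the flags are kept; no result is written
    return borrow
```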

BN.LID

Load Word (indirect source, indirect destination). Calculates a byte memory address by adding the offset to the value in the GPR grs1. The value from this memory address is then copied into the WDR pointed to by the value in GPR grd.

After the operation, either the value in the GPR grs1, or the value in grd can be optionally incremented.

  • If grs1_inc is set, the value in grs1 is incremented by the value WLEN/8 (one word).
  • If grd_inc is set, the value in grd is incremented by the value 1.

The memory address must be aligned to WLEN bits (WLEN/8 bytes). Any address that is unaligned or is above the top of memory results in an error (with error code ErrCodeBadDataAddr).

BN.LID <grd>[<grd_inc>], <offset>(<grs1>[<grs1_inc>])

This instruction takes 2 cycles.

Assembly symbol | Description

<grd>

Name of the GPR referencing the destination WDR

<grs1>

Name of the GPR containing the memory byte address. The value contained in the referenced GPR must be WLEN-aligned.

<offset>

Offset value. Must be WLEN-aligned.

Valid range: -16384..16352 in steps of 32

<grs1_inc>

Increment the value in <grs1> by WLEN/8 (one word). Cannot be specified together with grd_inc.

To specify, use the literal syntax ++

<grd_inc>

Increment the value in <grd> by one. Cannot be specified together with grs1_inc.

To specify, use the literal syntax ++

Encoding (fields from bit 31 to bit 0): offset[11:5] | offset[14:12] | grs1_inc | grd_inc | grs1 | 100 | grd | 0001011

Decode

rd = UInt(grd)
rs1 = UInt(grs1)
offset = SInt(offset)  # signed: the valid range is -16384..16352

Operation

mem_addr = GPR[rs1] + offset
wdr_dest = GPR[rd]

assert not (grs1_inc and grd_inc)  # prevented in encoding
if mem_addr % (WLEN // 8) or mem_addr + (WLEN // 8) > DMEM_SIZE:
    raise BadDataAddr()

# LoadWlenWordFromMemory() expects a byte address and converts it to a word index itself.
WDR[wdr_dest] = LoadWlenWordFromMemory(mem_addr)

if grs1_inc:
    GPR[rs1] = GPR[rs1] + (WLEN // 8)
if grd_inc:
    GPR[rd] = GPR[rd] + 1
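The address checks and optional post-increments of BN.LID (and, equivalently, BN.SID) can be modelled in Python as below. This sketch assumes WLEN = 256 and the 4 KiB DMEM described in the Memories section. It treats negative addresses as out of range, which is also an assumption. `lid_address`, `next_pointers` and `BadDataAddr` are hypothetical names. With grs1_inc, each access moves the address pointer to the next 32-byte word.

```python
WLEN = 256
WORD_BYTES = WLEN // 8        # 32 bytes per wide word
DMEM_SIZE = 4096              # 4 KiB data memory (assumed)

class BadDataAddr(Exception):
    """Raised for unaligned or out-of-range wide-word accesses (ErrCodeBadDataAddr)."""

def lid_address(gpr_rs1: int, offset: int) -> int:
    """Compute and check the byte address used by BN.LID / BN.SID."""
    assert offset % WORD_BYTES == 0 and -16384 <= offset <= 16352
    addr = gpr_rs1 + offset
    if addr < 0 or addr % WORD_BYTES or addr + WORD_BYTES > DMEM_SIZE:
        raise BadDataAddr(hex(addr))
    return addr

def next_pointers(gpr_rs1: int, gpr_rd: int, grs1_inc: bool, grd_inc: bool) -> tuple[int, int]:
    """Apply the optional post-increments; at most one of them may be requested."""
    assert not (grs1_inc and grd_inc)
    return gpr_rs1 + (WORD_BYTES if grs1_inc else 0), gpr_rd + (1 if grd_inc else 0)
```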

BN.SID

Store Word (indirect source, indirect destination). Calculates a byte memory address by adding the offset to the value in the GPR grs1. The value from the WDR pointed to by grs2 is then copied into the memory.

After the operation, either the value in the GPR grs1, or the value in grs2 can be optionally incremented.

  • If grs1_inc is set, the value in grs1 is incremented by the value WLEN/8 (one word).
  • If grs2_inc is set, the value in grs2 is incremented by the value 1.

The memory address must be aligned to WLEN bits (WLEN/8 bytes). Any address that is unaligned or is above the top of memory results in an error (with error code ErrCodeBadDataAddr).

BN.SID <grs2>[<grs2_inc>], <offset>(<grs1>[<grs1_inc>])
Assembly symbol | Description

<grs1>

Name of the GPR containing the memory byte address. The value contained in the referenced GPR must be WLEN-aligned.

<grs2>

Name of the GPR referencing the source WDR.

<offset>

Offset value. Must be WLEN-aligned.

Valid range: -16384..16352 in steps of 32

<grs1_inc>

Increment the value in <grs1> by WLEN/8 (one word). Cannot be specified together with grs2_inc.

To specify, use the literal syntax ++

<grs2_inc>

Increment the value in <grs2> by one. Cannot be specified together with grs1_inc.

To specify, use the literal syntax ++

Encoding (fields from bit 31 to bit 0): offset[11:5] | offset[14:12] | grs1_inc | grs2_inc | grs1 | 101 | grs2 | 0001011

Decode

rs1 = UInt(grs1)
rs2 = UInt(grs2)
offset = SInt(offset)  # signed: the valid range is -16384..16352

Operation

mem_addr = GPR[rs1] + offset
wdr_src = GPR[rs2]

assert not (grs1_inc and grs2_inc)  # prevented in encoding
if mem_addr % (WLEN // 8) or mem_addr + (WLEN // 8) > DMEM_SIZE:
    raise BadDataAddr()

# StoreWlenWordToMemory() expects a byte address and converts it to a word index itself.
StoreWlenWordToMemory(mem_addr, WDR[wdr_src])

if grs1_inc:
    GPR[rs1] = GPR[rs1] + (WLEN // 8)
if grs2_inc:
    GPR[rs2] = GPR[rs2] + 1

BN.MOV

Copy content between WDRs (direct addressing).

BN.MOV <wrd>, <wrs>

Encoding (fields from bit 31 to bit 0): 0 | wrs | 110 | wrd | 0001011

Decode

s = UInt(wrs)
d = UInt(wrd)

Operation

WDR[d] = WDR[s]

BN.MOVR

Copy content between WDRs (register-indirect addressing). The indices of the source and destination WDRs are taken from the GPRs grs and grd. Optionally, either the source or the destination index can be incremented by 1 after the copy.

BN.MOVR <grd>[<grd_inc>], <grs>[<grs_inc>]
Assembly symbol | Description

<grd>

Name of the GPR referencing the destination WDR.

<grs>

Name of the GPR referencing the source WDR.

<grd_inc>

Increment the value in <grd> by one. Cannot be specified together with grs_inc.

To specify, use the literal syntax ++

<grs_inc>

Increment the value in <grs> by one. Cannot be specified together with grd_inc.

To specify, use the literal syntax ++

Encoding (fields from bit 31 to bit 0): 1 | grs_inc | grd_inc | grs | 110 | grd | 0001011

Decode

s = UInt(grs)
d = UInt(grd)

Operation

WDR[GPR[d]] = WDR[GPR[s]]

if grs_inc:
  GPR[s] = GPR[s] + 1
if grd_inc:
  GPR[d] = GPR[d] + 1
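A small Python model of register-indirect addressing follows, with both register files represented as plain lists of 32 entries. The function `bn_movr` and the register choices in the usage lines are hypothetical. Only one pointer may be incremented per instruction. The usage therefore shows one pattern this allows: broadcasting a single WDR into consecutive destination WDRs with grd_inc.

```python
def bn_movr(wdr: list[int], gpr: list[int], grd: int, grs: int,
            grd_inc: bool = False, grs_inc: bool = False) -> None:
    """Model BN.MOVR: WDR[GPR[grd]] = WDR[GPR[grs]], then optionally bump one pointer."""
    assert not (grd_inc and grs_inc)          # at most one increment may be specified
    wdr[gpr[grd]] = wdr[gpr[grs]]
    if grs_inc:
        gpr[grs] += 1
    if grd_inc:
        gpr[grd] += 1

# Usage: copy w0 into w4..w7, using x10 as the destination pointer and x11 as the source pointer.
wdr = [0] * 32
gpr = [0] * 32
wdr[0] = 0xCAFE
gpr[10], gpr[11] = 4, 0
for _ in range(4):
    bn_movr(wdr, gpr, grd=10, grs=11, grd_inc=True)
```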

BN.WSRRS

Atomic Read and Set Bits in WSR.

BN.WSRRS <wrd>, <wsr>, <wrs>

Encoding (fields from bit 31 to bit 0): 0 | wsr | wrs | 111 | wrd | 0001011

BN.WSRRW

Atomic Read/Write WSR.

BN.WSRRW <wrd>, <wsr>, <wrs>

Encoding (fields from bit 31 to bit 0): 1 | wsr | wrs | 111 | wrd | 0001011

Pseudo-Code Functions for BN Instructions

The instruction description uses Python-based pseudocode. Commonly used functions are defined once below.

Note

This “pseudo-code” is intended to be Python 3, and contains known inconsistencies at the moment. It will be further refined as we make progress in the implementation of a simulator using this syntax.

class Flag(Enum):
  C = 0  # Carry flag
  M = 1  # MSB flag
  L = 2  # LSB flag
  Z = 3  # Zero flag

class FlagGroup:
  C: Bits[1]
  M: Bits[1]
  L: Bits[1]
  Z: Bits[1]

  def set(self, flag: Flag, value: Bits[1]):
    assert flag in Flag

    if flag == Flag.C:
      self.C = value
    elif flag == Flag.M:
      self.M = value
    elif flag == Flag.L:
      self.L = value
    elif flag == Flag.Z:
      self.Z = value

  def get(self, flag: Flag):
    assert flag in Flag

    if flag == Flag.C:
      return self.C
    elif flag == Flag.M:
      return self.M
    elif flag == Flag.L:
      return self.L
    elif flag == Flag.Z:
      return self.Z


class ShiftType(Enum):
  LSL = 0 # logical shift left
  LSR = 1 # logical shift right

class HalfWord(Enum):
  LOWER = 0 # lower or less significant half-word
  UPPER = 1 # upper or more significant half-word

def DecodeShiftType(st: Bits(1)) -> ShiftType:
  if st == 0:
    return ShiftType.LSL
  elif st == 1:
    return ShiftType.LSR
  else:
    raise UndefinedException()

def DecodeFlagGroup(flag_group: Bits(1)) -> UInt:
  if flag_group > 1:
    raise UndefinedException()
  return UInt(flag_group)

def DecodeFlag(flag: Bits(2)) -> Flag:
  if flag == 0:
    return Flag.C
  elif flag == 1:
    return Flag.M
  elif flag == 2:
    return Flag.L
  elif flag == 3:
    return Flag.Z
  else:
    raise UndefinedException()


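def ShiftReg(reg: integer, shift_type: ShiftType, shift_bytes: integer) -> Bits(WLEN):
  shift_bits = shift_bytes << 3  # the shift amount is given in bytes
  if shift_type == ShiftType.LSL:
    return (WDR[reg] << shift_bits)[WLEN-1:0]
  elif shift_type == ShiftType.LSR:
    return WDR[reg] >> shift_bits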

def AddWithCarry(a: Bits(WLEN), b: Bits(WLEN), carry_in: Bits(1)) -> (Bits(WLEN), FlagGroup):
  result: Bits[WLEN+1] = a + b + carry_in

  flags_out = FlagGroup()
  flags_out.C = result[WLEN]
  flags_out.L = result[0]
  flags_out.M = result[WLEN-1]
  flags_out.Z = (result[WLEN-1:0] == 0)

  return (result[WLEN-1:0], flags_out)

def SubtractWithBorrow(a: Bits(WLEN), b: Bits(WLEN), borrow_in: Bits(1)) -> (Bits(WLEN), FlagGroup):
  result: Bits[WLEN+1] = a - b - borrow_in

  flags_out = FlagGroup()
  flags_out.C = result[WLEN]
  flags_out.L = result[0]
  flags_out.M = result[WLEN-1]
  flags_out.Z = (result[WLEN-1:0] == 0)

  return (result[WLEN-1:0], flags_out)

def DecodeHalfWordSelect(hwsel: Bits(1)) -> HalfWord:
  if hwsel == 0:
    return HalfWord.LOWER
  elif hwsel == 1:
    return HalfWord.UPPER
  else:
    raise UndefinedException()

def GetHalfWord(reg: integer, hwsel: HalfWord) -> Bits(WLEN/2):
  if hwsel == HalfWord.LOWER:
    return WDR[reg][WLEN/2-1:0]
  elif hwsel == HalfWord.UPPER:
    return WDR[reg][WLEN-1:WLEN/2]

def LoadWlenWordFromMemory(byteaddr: integer) -> Bits(WLEN):
  wordaddr = byteaddr >> 5
  return DMEM[wordaddr]

def StoreWlenWordToMemory(byteaddr: integer, storedata: Bits(WLEN)):
  wordaddr = byteaddr >> 5
  DMEM[wordaddr] = storedata
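Because the pseudo-code above is not directly executable, here is a small runnable Python sketch of AddWithCarry and SubtractWithBorrow. It assumes WLEN = 256 and represents bit vectors as Python ints; the lower-case function names are hypothetical. It follows the flag definitions above. C is bit WLEN of the (WLEN+1)-bit intermediate result, M is the top bit of the WLEN-bit result, L is its bottom bit, and Z is set for a zero result. For subtraction, the two's-complement bit WLEN acts as the borrow.

```python
from typing import NamedTuple

WLEN = 256
MASK = (1 << WLEN) - 1

class FlagGroup(NamedTuple):
    C: int
    M: int
    L: int
    Z: int

def _flags(full: int) -> FlagGroup:
    """Derive flags from a (WLEN+1)-bit intermediate result in two's complement."""
    full &= (1 << (WLEN + 1)) - 1
    res = full & MASK
    return FlagGroup(C=(full >> WLEN) & 1, M=(res >> (WLEN - 1)) & 1, L=res & 1, Z=int(res == 0))

def add_with_carry(a: int, b: int, carry_in: int) -> tuple[int, FlagGroup]:
    full = a + b + carry_in
    return full & MASK, _flags(full)

def subtract_with_borrow(a: int, b: int, borrow_in: int) -> tuple[int, FlagGroup]:
    full = a - b - borrow_in
    return full & MASK, _flags(full)
```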

Theory of Operations

Block Diagram

OTBN architecture block diagram

Hardware Interfaces

Referring to the Comportable guideline for peripheral device functionality, the module otbn has the following hardware interfaces defined.

Primary Clock: clk_i

Other Clocks: none

Bus Device Interface: tlul

Bus Host Interface: none

Peripheral Pins for Chip IO: none

Interrupts:

Interrupt Name | Description
done | OTBN has completed the operation
err | An error occurred. Read the ERR_CODE register for error details.

Security Alerts:

Alert Name | Description
imem_uncorrectable | Uncorrectable error in the instruction memory detected.
dmem_uncorrectable | Uncorrectable error in the data memory detected.
reg_uncorrectable | Uncorrectable error in one of the register files detected.

Design Details

Note

To be filled in as we create the implementation.

By design, OTBN is a simple processor and has essentially no error handling support. When anything goes wrong (an out-of-bounds memory operation, an invalid instruction encoding, etc.), OTBN will stop fetching instructions, and set the ERR_CODE register and the err bit of the INTR_STATE register.

Programmers Guide

Note

This section will be written as we move on in the design and implementation process.

Memories

The OTBN processor core has access to two dedicated memories: an instruction memory (IMEM), and a data memory (DMEM). Each memory is 4 KiB in size.

The memory layout follows the Harvard architecture. Both memories are byte-addressed, with addresses starting at 0.

The instruction memory (IMEM) is 32b wide and provides the instruction stream to the OTBN processor; it cannot be read or written from user code through load or store instructions.

The data memory (DMEM) is 256b wide and read-write accessible from the base and big number instruction subsets of the OTBN processor core. When accessed from the base instruction subset through the LW or SW instructions, accesses must read or write 32b-aligned 32b words. When accessed from the big number instruction subset through the BN.LID or BN.SID instructions, accesses must read or write 256b-aligned 256b words.
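The two DMEM access granularities differ only in width and alignment. The hypothetical Python helper below makes the rule concrete, assuming the 4 KiB DMEM size stated above. LW/SW correspond to a width of 32 bits, and BN.LID/BN.SID correspond to a width of 256 bits.

```python
DMEM_SIZE = 4096  # bytes

def dmem_access_ok(addr: int, width_bits: int) -> bool:
    """Return True if a naturally aligned access of width_bits at byte address addr fits in DMEM."""
    nbytes = width_bits // 8
    return addr >= 0 and addr % nbytes == 0 and addr + nbytes <= DMEM_SIZE
```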

Operation

Note

The exact sequence of operations is not yet finalized.

Rough expected process:

Error conditions

Note

To be filled in as we create the implementation.

Register Table

otbn.INTR_STATE @ + 0x0
Interrupt State Register
Reset default = 0x0, mask 0x3
Bits | Type | Reset | Name | Description
0 | rw1c | 0x0 | done | OTBN has completed the operation
1 | rw1c | 0x0 | err | An error occurred. Read the ERR_CODE register for error details.


otbn.INTR_ENABLE @ + 0x4
Interrupt Enable Register
Reset default = 0x0, mask 0x3
Bits | Type | Reset | Name | Description
0 | rw | 0x0 | done | Enable interrupt when INTR_STATE.done is set
1 | rw | 0x0 | err | Enable interrupt when INTR_STATE.err is set


otbn.INTR_TEST @ + 0x8
Interrupt Test Register
Reset default = 0x0, mask 0x3
Bits | Type | Reset | Name | Description
0 | wo | 0x0 | done | Write 1 to force INTR_STATE.done to 1
1 | wo | 0x0 | err | Write 1 to force INTR_STATE.err to 1
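The interplay of the three interrupt registers can be sketched in Python as follows; the function names are hypothetical. Writing 1 to a bit of INTR_STATE (rw1c) clears it, and writing 1 to INTR_TEST sets the corresponding state bit. The sketch assumes, following the usual Comportable convention rather than anything stated here, that an interrupt output is asserted while both its state bit and its enable bit are set.

```python
# Bit positions shared by INTR_STATE, INTR_ENABLE and INTR_TEST.
DONE, ERR = 1 << 0, 1 << 1
FIELD_MASK = DONE | ERR

def write_intr_state(state: int, wdata: int) -> int:
    """rw1c: every bit written as 1 is cleared; bits written as 0 are left alone."""
    return state & ~(wdata & FIELD_MASK)

def write_intr_test(state: int, wdata: int) -> int:
    """INTR_TEST: writing 1 forces the corresponding INTR_STATE bit to 1."""
    return state | (wdata & FIELD_MASK)

def interrupt_outputs(state: int, enable: int) -> int:
    """Assumed: an interrupt is raised only while its state bit and enable bit are both set."""
    return state & enable & FIELD_MASK
```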


otbn.CMD @ + 0xc
command register
Reset default = 0x0, mask 0x3
Bits | Type | Reset | Name | Description
0 | r0w1c | 0x0 | start | Start the operation. The completion is signalled by the done interrupt.
1 | r0w1c | 0x0 | dummy | Reggen doesn't generate sub-fields with only a single field specified; instead, the whole register is taken as a field, leading to signals like `hw2reg.status.d` instead of `hw2reg.status.start.d`. Since we expect to add more commands later, we force the generation of fields with this dummy field for now.


otbn.STATUS @ + 0x10
Status
Reset default = 0x0, mask 0x3
Bits | Type | Reset | Name | Description
0 | ro | 0x0 | busy | OTBN is performing an operation.
1 | ro | 0x0 | dummy | See CMD.dummy for details.


otbn.ERR_CODE @ + 0x14
Error Code
Reset default = 0x0, mask 0xffffffff
Bits | Type | Reset | Name | Description
31:0 | ro | 0x0 | err_code | The error cause if an error occurred. Software should read this register before clearing the err interrupt to avoid race conditions. Possible values: 0x0 (ErrCodeNoError): No error occurred. 0x1 (ErrCodeBadDataAddr): Load or store to invalid address.


otbn.START_ADDR @ + 0x18
Start byte address in the instruction memory
Reset default = 0x0, mask 0xffffffff
Bits | Type | Reset | Name | Description
31:0 | wo | 0x0 | start_addr | Byte address in the instruction memory from which OTBN starts executing when instructed to do so with CMD.start.


otbn.IMEM @ + 0x100000
1024 item rw window
Byte writes are not supported
32-bit items at offsets +0x100000, +0x100004, ..., +0x100ff8, +0x100ffc
Instruction Memory. Not accessible during the operation of the engine. TODO: Discuss and document behavior in that case. Alert? Ignore?


otbn.DMEM @ + 0x200000
1024 item rw window
Byte writes are not supported
32-bit items at offsets +0x200000, +0x200004, ..., +0x200ff8, +0x200ffc
Data Memory. Not accessible during the operation of the engine. TODO: Discuss and document behavior in that case. Alert? Ignore?


Algorithmic Example: Replacing BN.MULH with BN.MULQACC

This specification gives implementers the option to provide either a quarter-word multiply-accumulate instruction, BN.MULQACC, or a half-word multiply instruction, BN.MULH. Four BN.MULQACC instructions can replace one BN.MULH instruction, which operates on operands twice as wide.

BN.MULH w1, w0.l, w0.u becomes

BN.MULQACC.Z  w0.0, w0.2, 0
BN.MULQACC    w0.0, w0.3, 64
BN.MULQACC    w0.1, w0.2, 64
BN.MULQACC.WO w1, w0.1, w0.3, 128
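The identity behind this replacement can be checked numerically with the Python sketch below. It assumes WLEN = 256, so each quarter-word is 64 bits. The accumulator is modelled as an unbounded integer, with .Z taken to start from a cleared accumulator and .WO taken to write out the final sum; the finite accumulator width is not modelled. The names `quarter` and `mulh_via_mulqacc` are hypothetical. The four shifted quarter-word products add up to the product of the lower and upper half-words of w0.

```python
WLEN = 256
Q = WLEN // 4                 # quarter-word width: 64 bits
QMASK = (1 << Q) - 1

def quarter(w: int, i: int) -> int:
    """Return quarter-word i (0 = least significant) of a WLEN-bit value."""
    return (w >> (Q * i)) & QMASK

def mulh_via_mulqacc(w0: int) -> int:
    """Multiply w0.l by w0.u using four quarter-word multiply-accumulate steps."""
    acc = quarter(w0, 0) * quarter(w0, 2)            # BN.MULQACC.Z  w0.0, w0.2, 0
    acc += (quarter(w0, 0) * quarter(w0, 3)) << 64   # BN.MULQACC    w0.0, w0.3, 64
    acc += (quarter(w0, 1) * quarter(w0, 2)) << 64   # BN.MULQACC    w0.1, w0.2, 64
    acc += (quarter(w0, 1) * quarter(w0, 3)) << 128  # final BN.MULQACC.WO step: w0.1, w0.3, 128
    return acc
```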

Algorithmic Example: Multiplying two WLEN numbers with BN.MULQACC

The big number instruction subset of OTBN generally operates on WLEN-bit numbers. However, the multiplication instructions only operate on half-words or quarter-words of a WLEN-bit word. This section outlines a technique to multiply two WLEN-bit numbers using the quarter-word multiply-accumulate instruction BN.MULQACC.

The shift-out functionality can be used to perform larger multiplications without extra additions. The table below shows how two registers w0 and w1 can be multiplied together to give a result in w2 and w3. The rows show how the result is built up, where a0:a3 = w0.0:w0.3 and b0:b3 = w1.0:w1.3. The sum of a column represents WLEN/4 bits of a destination register, where c0:c3 = w2.0:w2.3 and d0:d3 = w3.0:w3.3. Each product occupies two WLEN/4-bit columns, because it is a WLEN/2-bit multiplication result. For each instruction, the accumulator holds the sum of that row's product and all products above it that have not yet been shifted out.

The outlined technique can be extended to arbitrary bit widths but requires unrolled code with all operands in registers.

Instruction | Product | Result columns (out of d3 d2 d1 d0 c3 c2 c1 c0)
BN.MULQACC.Z w0.0, w1.0, 0 | a0 * b0 | c1:c0
BN.MULQACC w0.1, w1.0, 64 | a1 * b0 | c2:c1
BN.MULQACC.SO w2.l, w0.0, w1.1, 64 | a0 * b1 | c2:c1
BN.MULQACC w0.2, w1.0, 0 | a2 * b0 | c3:c2
BN.MULQACC w0.1, w1.1, 0 | a1 * b1 | c3:c2
BN.MULQACC w0.0, w1.2, 0 | a0 * b2 | c3:c2
BN.MULQACC w0.3, w1.0, 64 | a3 * b0 | d0:c3
BN.MULQACC w0.2, w1.1, 64 | a2 * b1 | d0:c3
BN.MULQACC w0.1, w1.2, 64 | a1 * b2 | d0:c3
BN.MULQACC.SO w2.u, w0.0, w1.3, 64 | a0 * b3 | d0:c3
BN.MULQACC w0.3, w1.1, 0 | a3 * b1 | d1:d0
BN.MULQACC w0.2, w1.2, 0 | a2 * b2 | d1:d0
BN.MULQACC w0.1, w1.3, 0 | a1 * b3 | d1:d0
BN.MULQACC w0.3, w1.2, 64 | a3 * b2 | d2:d1
BN.MULQACC.SO w3.l, w0.2, w1.3, 64 | a2 * b3 | d2:d1
BN.MULQACC.SO w3.u, w0.3, w1.3, 0 | a3 * b3 | d3:d2
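To check the schedule in the table, the following Python sketch replays it with an unbounded-integer accumulator. It makes two assumptions, inferred from the description above rather than stated as the exact instruction semantics. First, .Z clears the accumulator before adding. Second, .SO writes the lower WLEN/2 bits of the updated accumulator to the named half-word and then shifts the accumulator right by WLEN/2 bits. WLEN = 256 is assumed, and `SCHEDULE` and `mul256` are hypothetical names.

```python
WLEN = 256
Q, H = WLEN // 4, WLEN // 2
QMASK, HMASK = (1 << Q) - 1, (1 << H) - 1

# (variant, destination half-word, a quarter index, b quarter index, shift) for each table row.
SCHEDULE = [
    ("Z", None, 0, 0, 0), ("", None, 1, 0, 64), ("SO", "w2.l", 0, 1, 64),
    ("", None, 2, 0, 0), ("", None, 1, 1, 0), ("", None, 0, 2, 0),
    ("", None, 3, 0, 64), ("", None, 2, 1, 64), ("", None, 1, 2, 64),
    ("SO", "w2.u", 0, 3, 64),
    ("", None, 3, 1, 0), ("", None, 2, 2, 0), ("", None, 1, 3, 0),
    ("", None, 3, 2, 64), ("SO", "w3.l", 2, 3, 64),
    ("SO", "w3.u", 3, 3, 0),
]

def mul256(a: int, b: int) -> int:
    """Replay the BN.MULQACC schedule and return the 512-bit product w3:w2."""
    acc, out = 0, {}
    for variant, dest, i, j, shift in SCHEDULE:
        if variant == "Z":
            acc = 0
        acc += (((a >> (Q * i)) & QMASK) * ((b >> (Q * j)) & QMASK)) << shift
        if variant == "SO":
            out[dest] = acc & HMASK     # write the lower half of the accumulator
            acc >>= H                   # shift out: move the accumulator down by WLEN/2
    return out["w2.l"] | out["w2.u"] << H | out["w3.l"] << (2 * H) | out["w3.u"] << (3 * H)
```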

Code snippets giving examples of 256x256 and 384x384 multiplies can be found in sw/otbn/code-snippets/mul256.S and sw/otbn/code-snippets/mul384.S.