Instruction Reference

This document combines the instruction encoding overview with a detailed reference for every opcode supported by the VPU. It focuses on the hardware semantics so that programs built for the emulator behave the same on real devices.

Instruction Encoding

VPU instructions are 64 bits wide. Most share a standard layout that selects the operation, the destination register and the two source operands. Branch instructions use a slightly different layout.

Standard Layout

Bits    Purpose
63-56   Opcode selecting the instruction
55      Result goes to spawned group when set
54-50   Destination register index
49-42   Write mask and flag controls
41-28   Source B configuration (register, swizzle)
27-12   Source A configuration (register, swizzle)
11-0    Reserved for future expansion
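
To make the bit positions concrete, the following Python sketch packs and unpacks a standard-layout word. The field names (opcode, to_spawned, dest, mask_flags, src_b, src_a) are illustrative labels, not official mnemonics, and the two source fields are treated as opaque integers because their internal register/swizzle split is not described here.

```python
# Field positions copied from the Standard Layout table: name -> (low bit, width).
# The names themselves are hypothetical labels chosen for this sketch.
STANDARD_FIELDS = {
    "opcode": (56, 8),
    "to_spawned": (55, 1),
    "dest": (50, 5),
    "mask_flags": (42, 8),
    "src_b": (28, 14),
    "src_a": (12, 16),
}


def encode_standard(**fields):
    """Pack the given fields into a 64-bit word; omitted fields are zero."""
    word = 0
    for name, value in fields.items():
        low, width = STANDARD_FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << low
    return word  # bits 11-0 are never set, so the reserved field stays zero


def decode_standard(word):
    """Split a 64-bit word back into its named fields."""
    return {
        name: (word >> low) & ((1 << width) - 1)
        for name, (low, width) in STANDARD_FIELDS.items()
    }


word = encode_standard(opcode=0x10, dest=3, mask_flags=0xFF, src_b=2, src_a=1)
print(f"{word:016x}", decode_standard(word))
```

Keeping the positions in one table means an encoder and a decoder cannot drift apart, which matters when emulator and hardware results are compared word for word.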

Branch Layout

Bits    Purpose
63-56   Opcode 0x80
55-52   Flag component controlling the branch
51-42   Reserved (must be zero)
41-32   Target program counter (10 bits)
31-0    Reserved
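
As a rough illustration of the branch layout, the sketch below builds and splits a branch word in Python. The function names are hypothetical, and the 4-bit flag field is passed through as an opaque value because its exact interpretation is not spelled out in the table.

```python
BRANCH_OPCODE = 0x80


def encode_branch(flag_select, target_pc):
    """Build a 64-bit branch word; all reserved bits are left at zero."""
    if not 0 <= flag_select < (1 << 4):
        raise ValueError("flag_select must fit in 4 bits (bits 55-52)")
    if not 0 <= target_pc < (1 << 10):
        raise ValueError("target_pc must fit in 10 bits (bits 41-32)")
    return (BRANCH_OPCODE << 56) | (flag_select << 52) | (target_pc << 32)


def decode_branch(word):
    """Return (flag_select, target_pc) from a branch word."""
    if (word >> 56) & 0xFF != BRANCH_OPCODE:
        raise ValueError("not a branch instruction")
    return (word >> 52) & 0xF, (word >> 32) & 0x3FF


word = encode_branch(flag_select=0b0001, target_pc=0x2A)
print(f"{word:016x}", decode_branch(word))
```

The 10-bit target limits a program to 1024 instruction addresses, so the encoder rejects anything larger instead of silently truncating it.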

Writeback Modes

Each component of the result can update condition flags and destination registers independently. The write mask encodes four two-bit fields with these meanings:

  • 00 – write if the corresponding flag bit is true
  • 01 – write if the flag bit is false
  • 10 – never write
  • 11 – always write

The flag control field selects whether result sign, zero status or a combination updates the flags. This mechanism avoids explicit branches when only some pixels require modification.
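
The four write modes can be modelled in a few lines of Python. This is only a sketch: it assumes the X component occupies the lowest two bits of the mask and that the flags tested are those in effect before the instruction executes, neither of which is stated above.

```python
def component_written(mode, flag):
    """Decide whether one component is written, given its 2-bit mode."""
    if mode == 0b00:
        return flag        # write if the flag bit is true
    if mode == 0b01:
        return not flag    # write if the flag bit is false
    if mode == 0b10:
        return False       # never write
    return True            # 0b11: always write


def apply_write_mask(mask, flags, old, new):
    """Merge new results into old register contents, component by component.

    Assumption: component i uses mask bits 2*i+1 .. 2*i (X in the low bits).
    """
    merged = []
    for i in range(4):
        mode = (mask >> (2 * i)) & 0b11
        merged.append(new[i] if component_written(mode, flags[i]) else old[i])
    return merged


# X: write-if-true, Y: write-if-false, Z: never, W: always
print(apply_write_mask(0b11_10_01_00, [True, True, False, False],
                       [0, 0, 0, 0], [1, 2, 3, 4]))
```

With this kind of predication a shader can update only the pixels that pass a comparison without any lane leaving the instruction stream.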

Opcode Reference

The following sections describe every instruction. Opcodes use the standard format unless stated otherwise. Bit positions refer to the tables above.

Control Flow

Branch (opcode 0x80)

Performs a conditional jump to another instruction address. Threads that do not satisfy the flag condition are masked until a following Sync re-enables them. Because there is no branch prediction, the pipeline incurs a single-cycle bubble when the jump is taken.

Bits    Meaning
63-56   0x80
55-52   Flag component controlling the branch
51-42   Reserved (must be zero)
41-32   Target program counter (10 bits)
31-0    Reserved

Typical uses include implementing loops or skipping over optional work.

Sync (opcode 0x01)

Clears the thread mask so all lanes of the group become active again. Often inserted after a conditional branch.

Bits    Meaning
63-56   0x01
55-0    Standard layout fields
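
The interaction between Branch and Sync can be pictured with a small thread-mask model. This Python sketch makes several assumptions that are not confirmed here: a group has four lanes, the lanes share one program counter, and a branch that no active lane satisfies simply falls through without changing the mask. The class and method names are hypothetical.

```python
class GroupState:
    """Toy model of one thread group's active mask and shared program counter."""

    def __init__(self, lanes=4):
        self.active = [True] * lanes
        self.pc = 0

    def branch(self, lane_flags, target):
        # Lanes that are active and satisfy the flag condition take the jump.
        takers = [a and f for a, f in zip(self.active, lane_flags)]
        if any(takers):
            self.active = takers   # the other lanes stay masked until Sync
            self.pc = target
        else:
            self.pc += 1           # assumption: untaken branch falls through

    def sync(self):
        # Clear the mask so every lane of the group is active again.
        self.active = [True] * len(self.active)
        self.pc += 1


group = GroupState()
group.branch([True, False, True, False], target=10)
print(group.active, group.pc)
group.sync()
print(group.active, group.pc)
```

The model shows why a Sync usually follows divergent code: without it, lanes masked by the branch would stay inactive for the rest of the program.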

Exit (opcode 0x00)

Terminates the current thread. When all threads in a group exit, the hardware recycles the group.

Bits    Meaning
63-56   0x00
55-0    Standard layout fields

Work Scheduling

ThreadGroupAllocate (opcode 0x04)

Reserves a global register bank for a new group. The instruction stalls if no banks are free. Programs typically follow this with register initialisation.

Bits    Meaning
63-56   0x04
55-0    Standard layout fields

ThreadGroupLaunch (opcode 0x05)

Adds a prepared group to the scheduler queue. The B register encodes the entry address and starting position. Write masks select which lanes of the new group start alive.

Bits    Meaning
63-56   0x05
55-0    Standard layout fields

LoadPosition (opcode 0x07)

Loads the current screen position and tile offset into the destination register and updates flags for active lanes. Useful when spawning child work to know where it should operate.

Bits    Meaning
63-56   0x07
55-0    Standard layout fields

Flag Operations

AndFlags (opcode 0x08)

Computes the logical AND of all flag components and writes the result back to every component. It has no register operands and is typically used before branching.

Bits    Meaning
63-56   0x08
55-0    Standard layout fields

OrFlags (opcode 0x09)

Computes the logical OR of all flag components and writes the result back to every component.

Bits    Meaning
63-56   0x09
55-0    Standard layout fields

PairAndFlags (opcode 0x0A)

Replaces X and Y with their AND and Z and W with their AND. Useful when two comparisons must both be true.

Bits    Meaning
63-56   0x0A
55-0    Standard layout fields

PairOrFlags (opcode 0x0B)

Performs OR on (X,Y) and (Z,W) pairs. It allows combining related conditions without extra branches.

Bits    Meaning
63-56   0x0B
55-0    Standard layout fields
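
The four reduction-style flag instructions (AndFlags, OrFlags, PairAndFlags, PairOrFlags) are easy to state as functions on a list of four booleans ordered X, Y, Z, W. The Python function names below are illustrative only.

```python
def and_flags(flags):
    """AndFlags: every component receives the AND of all four."""
    result = all(flags)
    return [result] * 4


def or_flags(flags):
    """OrFlags: every component receives the OR of all four."""
    result = any(flags)
    return [result] * 4


def pair_and_flags(flags):
    """PairAndFlags: X and Y receive X AND Y; Z and W receive Z AND W."""
    x, y, z, w = flags
    return [x and y, x and y, z and w, z and w]


def pair_or_flags(flags):
    """PairOrFlags: X and Y receive X OR Y; Z and W receive Z OR W."""
    x, y, z, w = flags
    return [x or y, x or y, z or w, z or w]


flags = [True, False, True, True]
print(and_flags(flags), or_flags(flags))
print(pair_and_flags(flags), pair_or_flags(flags))
```

A range test such as "lo <= v and v <= hi" maps naturally onto PairAndFlags: place the two comparisons in X and Y, and the combined result is then available in both.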

BroadcastFlags (opcode 0x0C)

Copies the X component flag from each thread to all components so threads can quickly test group-wide activity.

Bits    Meaning
63-56   0x0C
55-0    Standard layout fields

Floating Point Arithmetic

AddFloat16 (opcode 0x10)

Component-wise addition of two FP16 vectors.

Bits    Meaning
63-56   0x10
55-0    Standard layout fields

MultiplyFloat16 (opcode 0x11)

Component-wise multiplication of two FP16 vectors.

Bits    Meaning
63-56   0x11
55-0    Standard layout fields

DivideFloat16 (opcode 0x15)

Divides the X component of A by B while multiplying the remaining components. Useful for perspective correction when interpolating.

Bits    Meaning
63-56   0x15
55-0    Standard layout fields

ConvertToFloat16 (opcode 0x16)

Converts signed 16‑bit integers in A to FP16, clamping out-of-range values. Handy for unpacking colour data.

Bits    Meaning
63-56   0x16
55-0    Standard layout fields

Integer Arithmetic

ConvertToInt16 (opcode 0x20)

Converts FP16 components to signed integers by truncation.

Bits    Meaning
63-56   0x20
55-0    Standard layout fields

AddInt16 (opcode 0x21)

Adds two signed 16‑bit integer vectors.

Bits    Meaning
63-56   0x21
55-0    Standard layout fields

AddInt32 (opcode 0x22)

Adds two 32‑bit integers formed from the XY and ZW pairs. Used for counters or addresses that exceed 16 bits.

Bits    Meaning
63-56   0x22
55-0    Standard layout fields
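
The pairing used by AddInt32 can be sketched in Python as follows. The half ordering is an assumption (X and Z hold the low 16 bits, Y and W the high 16 bits), as is wrap-around on overflow; neither is confirmed above.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def pack32(low, high):
    """Join two 16-bit components into one 32-bit value (low half first)."""
    return ((high & MASK16) << 16) | (low & MASK16)


def unpack32(value):
    """Split a 32-bit value back into [low, high] 16-bit components."""
    return [value & MASK16, (value >> 16) & MASK16]


def add_int32(a, b):
    """Add the XY and ZW pairs of two 4-component vectors as 32-bit integers."""
    xy = (pack32(a[0], a[1]) + pack32(b[0], b[1])) & MASK32  # assumed to wrap
    zw = (pack32(a[2], a[3]) + pack32(b[2], b[3])) & MASK32
    return unpack32(xy) + unpack32(zw)


# 0x0000FFFF + 1 carries from the low half into the high half.
print(add_int32([0xFFFF, 0, 1, 0], [1, 0, 2, 0]))
```

The carry between halves is what distinguishes this from two independent AddInt16 operations, and is why it suits addresses wider than 16 bits.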

MultiplyInt16 (opcode 0x2C)

Multiplies pairs of 16‑bit integers and keeps the low halves. Commonly used when squaring numbers such as texture coordinates.

Bits    Meaning
63-56   0x2C
55-0    Standard layout fields

MultiplyHiInt16 (opcode 0x2E)

Produces full 32‑bit products from X and Z components. Allows higher precision results when needed.

Bits    Meaning
63-56   0x2E
55-0    Standard layout fields
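
The difference between MultiplyInt16 and MultiplyHiInt16 is easiest to see side by side. The Python sketch below assumes that MultiplyInt16 works component-wise and that MultiplyHiInt16 stores each 32-bit product as a (low, high) pair in XY and ZW; the placement of the halves is a guess and is not confirmed here.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def multiply_int16(a, b):
    """Component-wise signed multiply keeping only the low 16 bits."""
    return [(to_signed16(x) * to_signed16(y)) & 0xFFFF for x, y in zip(a, b)]


def multiply_hi_int16(a, b):
    """Full 32-bit signed products of the X and Z components.

    Assumption: the X product lands in (X, Y) and the Z product in (Z, W),
    low half first.
    """
    result = []
    for i in (0, 2):
        product = (to_signed16(a[i]) * to_signed16(b[i])) & 0xFFFFFFFF
        result += [product & 0xFFFF, product >> 16]
    return result


# 300 * 300 = 90000 does not fit in 16 bits, so the low-half result wraps.
print(multiply_int16([300, 2, 3, 4], [300, 2, 3, 4]))
print(multiply_hi_int16([300, 0, 0xFFFF, 0], [300, 0, 2, 0]))
```

The unsigned variants would differ only in skipping the to_signed16 step before multiplying.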

MultiplyInt16u (opcode 0x28)

Unsigned variant of MultiplyInt16.

Bits    Meaning
63-56   0x28
55-0    Standard layout fields

MultiplyHiInt16u (opcode 0x2A)

Unsigned version of MultiplyHiInt16.

Bits    Meaning
63-56   0x2A
55-0    Standard layout fields

And (opcode 0x30)

Bitwise AND of two integer vectors. Often masks texture coordinates.

Bits    Meaning
63-56   0x30
55-0    Standard layout fields

Or (opcode 0x31)

Bitwise OR of two vectors. Can combine flags or build masks.

Bits    Meaning
63-56   0x31
55-0    Standard layout fields

ExclusiveOr (opcode 0x32)

Bitwise XOR, sometimes used for toggling masks or computing parity.

Bits    Meaning
63-56   0x32
55-0    Standard layout fields

TextureWrapInt16 (opcode 0x33)

Converts FP16 texture coordinates to integers and masks them with B. Helps implement tiled textures.

Bits    Meaning
63-56   0x33
55-0    Standard layout fields
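
A short Python sketch shows how masking produces texture tiling. It assumes the float-to-integer step truncates toward zero (as ConvertToInt16 is described as doing) and that negative results are kept as 16-bit two's-complement values before masking; Python floats stand in for FP16 here.

```python
def texture_wrap_int16(a, b):
    """Convert each coordinate in a to an integer, then AND it with b."""
    wrapped = []
    for coord, mask in zip(a, b):
        as_int = int(coord) & 0xFFFF      # int() truncates toward zero
        wrapped.append(as_int & (mask & 0xFFFF))
    return wrapped


# A mask of 63 wraps coordinates onto a 64-texel tile.
print(texture_wrap_int16([70.5, -1.0, 3.9, 64.0], [63, 63, 63, 63]))
```

Masking only gives a clean wrap when the tile size is a power of two, since the mask must be of the form size - 1.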

Tile Memory Operations

ReadLocalTileN / WriteLocalTileN (opcodes 0x40-0x53)

Read or write tile buffer N at each thread's local position. Locks can optionally be acquired or released. Typically used for per-pixel processing without external memory traffic.

Bits    Meaning
63-56   0x40-0x53
55-0    Standard layout fields

ReadTileN / WriteTileN (opcodes 0x44-0x57)

Access tile buffer N at an arbitrary address supplied in A.x. The lower two address bits are replaced with the thread ID so each lane touches a separate element.

Bits    Meaning
63-56   0x44-0x57
55-0    Standard layout fields
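
The address adjustment for ReadTileN / WriteTileN amounts to one mask and one OR. The Python sketch below uses a hypothetical helper name and assumes thread IDs run from 0 to 3, which follows from the two-bit replacement.

```python
def lane_address(base_address, thread_id):
    """Replace the lower two bits of the address from A.x with the thread ID."""
    if not 0 <= thread_id < 4:
        raise ValueError("thread_id must fit in two bits")
    return (base_address & ~0b11) | thread_id


# All four lanes receive the same base address but touch distinct elements.
print([hex(lane_address(0x126, tid)) for tid in range(4)])
```

One consequence is that the low two bits supplied in A.x are ignored, so a group always covers an aligned block of four elements.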

ReadLockTileN / WriteUnlockTileN (opcodes 0x4C-0x5F)

Acquire or release locks when accessing tile buffers. Locks are tracked per position and help coordinate groups updating neighbouring pixels.

Bits    Meaning
63-56   0x4C-0x5F
55-0    Standard layout fields

External Memory Operations

IssueExternalLoad (opcode 0x70)

Begins an asynchronous read from external memory. The address comes from B.xy and the data becomes available when FinishExternalLoad is executed. Useful for fetching textures or geometry.

Bits    Meaning
63-56   0x70
55-0    Standard layout fields

IssueExternalLoadRgb (opcode 0x71)

Loads an RGB1555 value and converts it to FP16 on the fly.

Bits    Meaning
63-56   0x71
55-0    Standard layout fields
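
For emulator authors, the unpacking done by IssueExternalLoadRgb might look like the Python sketch below. The bit order (single-bit field in bit 15, then red, green and blue in five bits each), the normalisation to the range 0.0 to 1.0, and the (r, g, b, a) component order are all assumptions; the text above only states that an RGB1555 value is converted to FP16.

```python
def decode_rgb1555(pixel):
    """Unpack a 16-bit RGB1555 value into four floats (assumed layout)."""
    top = (pixel >> 15) & 0x1      # assumed: single-bit alpha/flag field
    red = (pixel >> 10) & 0x1F     # assumed: bits 14-10
    green = (pixel >> 5) & 0x1F    # assumed: bits 9-5
    blue = pixel & 0x1F            # assumed: bits 4-0
    # Assumed normalisation: the 5-bit maximum (31) maps to 1.0.
    return (red / 31.0, green / 31.0, blue / 31.0, float(top))


print(decode_rgb1555(0x7C00))  # pure red under the assumed layout
```

Doing the conversion during the load saves the shift, mask and convert instructions a program would otherwise need per texel.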

IssueExternalLoadLut0 / IssueExternalLoadLut1 (opcodes 0x72-0x73)

Perform palette lookups while reading from memory. Good for indexed colour formats.

Bits    Meaning
63-56   0x72-0x73
55-0    Standard layout fields

FinishExternalLoad (opcode 0x74)

Waits for a pending external load or fetch to complete. If the data is not ready the thread stalls.

Bits    Meaning
63-56   0x74
55-0    Standard layout fields

IssueExternalFetchSwap / Add32 / Or / Xor / And (opcodes 0x78-0x7C)

Start atomic read–modify–write operations in external memory. They behave like loads followed by an update. Used for counters or synchronisation across groups.

Bits    Meaning
63-56   0x78-0x7C
55-0    Standard layout fields
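
The "load followed by an update" behaviour of the fetch operations can be modelled with a dictionary standing in for external memory. In this Python sketch the operation names, the 32-bit word size for every variant, and the return of the previous value are assumptions made for illustration.

```python
import operator

MASK32 = 0xFFFFFFFF

# Update rules: each takes (old value, operand) and returns the new value.
FETCH_OPS = {
    "swap": lambda old, operand: operand,
    "add32": lambda old, operand: old + operand,
    "or": operator.or_,
    "xor": operator.xor,
    "and": operator.and_,
}


def fetch_op(memory, address, op, operand):
    """Apply an update to memory[address] and return the value it held before.

    In hardware the read and the update form one atomic step; this
    single-threaded model has no other writer, so that holds trivially.
    """
    old = memory.get(address, 0)
    memory[address] = FETCH_OPS[op](old, operand) & MASK32
    return old


memory = {0x100: 5}
ticket = fetch_op(memory, 0x100, "add32", 1)  # behaves like a shared counter
print(ticket, memory[0x100])
```

Returning the old value is what makes a shared counter usable as a ticket dispenser: each group that performs the add sees a distinct number.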

ExternalWrite (opcode 0x7F)

Writes a register to external memory at B.xy. All four components are stored unconditionally.

Bits    Meaning
63-56   0x7F
55-0    Standard layout fields

Miscellaneous Notes

  • Branch instructions cause a single-cycle bubble when taken because there is no prediction hardware.
  • Tile buffer locks serialise access, so overuse may reduce parallelism.
  • Atomic external operations bypass the cache, so repeated use can saturate memory bandwidth.