Instruction Reference
This document combines the instruction encoding overview with a detailed reference for every opcode supported by the VPU. It focuses on the hardware semantics so that programs built for the emulator behave the same on real devices.
Instruction Encoding
VPU instructions are 64 bits wide. Most share a standard layout that selects the operation, the destination register and the two source operands. Branch instructions use a slightly different layout.
Standard Layout
Bits | Purpose |
---|---|
63-56 | Opcode selecting the instruction |
55 | Result goes to spawned group when set |
54-50 | Destination register index |
49-42 | Write mask and flag controls |
41-28 | Source B configuration (register, swizzle) |
27-12 | Source A configuration (register, swizzle) |
11-0 | Reserved for future expansion |
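The table above can be turned into a small packer and unpacker. The following is a minimal sketch, not an official assembler: the field names (`to_spawned_group`, `dest_reg`, `write_ctrl`, `src_b`, `src_a`) are invented for illustration, and the internal split of the source configuration fields into register and swizzle bits is not modelled because it is not specified here.

```python
# Field positions (high bit, low bit) copied from the standard layout table.
# The field names are illustrative and are not taken from the hardware.
STANDARD_FIELDS = {
    "opcode": (63, 56),
    "to_spawned_group": (55, 55),
    "dest_reg": (54, 50),
    "write_ctrl": (49, 42),
    "src_b": (41, 28),
    "src_a": (27, 12),
}


def encode_standard(**fields):
    """Pack named fields into a 64-bit word; reserved bits 11-0 stay zero."""
    word = 0
    for name, value in fields.items():
        hi, lo = STANDARD_FIELDS[name]
        width = hi - lo + 1
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << lo
    return word


def decode_standard(word):
    """Split a 64-bit word back into its named fields."""
    return {
        name: (word >> lo) & ((1 << (hi - lo + 1)) - 1)
        for name, (hi, lo) in STANDARD_FIELDS.items()
    }


# Example: AddFloat16 (0x10) into register 3 with all write-control bits set.
word = encode_standard(opcode=0x10, dest_reg=3, write_ctrl=0xFF, src_b=2, src_a=1)
print(f"{word:#018x}")
print(decode_standard(word))
```

Keeping the bit positions in one table means the encoder and decoder cannot drift apart if a field is later widened into the reserved bits.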
Branch Layout
Bits | Purpose |
---|---|
63-56 | Opcode 0x80 |
55-52 | Flag component controlling the branch |
51-42 | Reserved (must be zero) |
41-32 | Target program counter (10 bits) |
31-0 | Reserved |
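As a hedged illustration of the branch layout, the sketch below packs and unpacks a branch word. The function names and the `flag_select` parameter name are hypothetical. The meaning of the four flag-select bits is not defined in this section, so the value is passed through unchanged.

```python
BRANCH_OPCODE = 0x80


def encode_branch(flag_select, target_pc):
    """Build a 64-bit branch word; all reserved bits are left at zero."""
    if not 0 <= flag_select < (1 << 4):
        raise ValueError("flag_select must fit in 4 bits (55-52)")
    if not 0 <= target_pc < (1 << 10):
        raise ValueError("target_pc must fit in 10 bits (41-32)")
    return (BRANCH_OPCODE << 56) | (flag_select << 52) | (target_pc << 32)


def decode_branch(word):
    """Return (flag_select, target_pc) from a branch word."""
    if (word >> 56) & 0xFF != BRANCH_OPCODE:
        raise ValueError("not a branch instruction")
    return (word >> 52) & 0xF, (word >> 32) & 0x3FF


branch_word = encode_branch(flag_select=0b0001, target_pc=0x3FF)
print(f"{branch_word:#018x}")
```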
Writeback Modes
Each component of the result can update condition flags and destination registers independently. The write mask encodes four two-bit fields with these meanings:
- 00: write if the corresponding flag bit is true
- 01: write if the flag bit is false
- 10: never write
- 11: always write
The flag control field selects whether result sign, zero status or a combination updates the flags. This mechanism avoids explicit branches when only some pixels require modification.
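The per-component write decision can be sketched as follows. This is an emulator-style model under two assumptions that this section does not confirm: the X component uses the lowest two bits of the mask, and the flags are tested before the instruction updates them.

```python
WRITE_IF_FLAG_TRUE = 0b00
WRITE_IF_FLAG_FALSE = 0b01
WRITE_NEVER = 0b10
WRITE_ALWAYS = 0b11


def apply_write_mask(dest, result, flags, mask):
    """Return the new destination vector after masked writeback.

    dest, result: four-component vectors (X, Y, Z, W).
    flags: four booleans, one per component.
    mask: four two-bit modes; assumption: X sits in bits 1-0, W in bits 7-6.
    """
    out = list(dest)
    for i in range(4):
        mode = (mask >> (2 * i)) & 0b11
        if mode == WRITE_ALWAYS:
            write = True
        elif mode == WRITE_NEVER:
            write = False
        elif mode == WRITE_IF_FLAG_TRUE:
            write = flags[i]
        else:  # WRITE_IF_FLAG_FALSE
            write = not flags[i]
        if write:
            out[i] = result[i]
    return out


# X: write if true, Y: write if false, Z: never, W: always.
mask = (WRITE_ALWAYS << 6) | (WRITE_NEVER << 4) | (WRITE_IF_FLAG_FALSE << 2) | WRITE_IF_FLAG_TRUE
updated = apply_write_mask([0, 0, 0, 0], [1, 2, 3, 4], [True, True, False, False], mask)
print(updated)
```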
Opcode Reference
The following sections describe every instruction. Opcodes use the standard format unless stated otherwise. Bit positions refer to the tables above.
Control Flow
Branch (opcode 0x80)
Performs a conditional jump to another instruction address. Threads that do not satisfy the flag condition are masked until a following Sync re-enables them. Because there is no branch prediction, a taken jump costs a single-cycle pipeline bubble.
Bits | Meaning |
---|---|
63-56 | 0x80 |
55-52 | Flag component controlling the branch |
51-42 | Reserved (must be zero) |
41-32 | Target program counter (10 bits) |
31-0 | Reserved |
Typical uses include implementing loops or skipping over optional work.
Sync (opcode 0x01)
Clears the thread mask so all lanes of the group become active again. Often inserted after a conditional branch.
Bits | Meaning |
---|---|
63-56 | 0x01 |
55-0 | Standard layout fields |
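The interaction between Branch and Sync can be modelled with a per-lane active mask. The sketch below is a simplified model, not the hardware algorithm: it assumes four lanes per group, ignores the program counter, and uses hypothetical helper names.

```python
def branch_mask(active, condition):
    """Lane mask after a Branch: lanes whose flag condition is false are masked."""
    return [a and c for a, c in zip(active, condition)]


def sync_mask(active):
    """Lane mask after a Sync: every lane of the group is active again."""
    return [True] * len(active)


active = [True, True, True, True]
flag_x = [True, False, True, False]    # hypothetical per-lane flag values
after_branch = branch_mask(active, flag_x)   # lanes 1 and 3 are masked
after_sync = sync_mask(after_branch)         # all four lanes run again
print(after_branch, after_sync)
```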
Exit (opcode 0x00)
Terminates the current thread. When all threads in a group exit, the hardware recycles the group.
Bits | Meaning |
---|---|
63-56 | 0x00 |
55-0 | Standard layout fields |
Work Scheduling
ThreadGroupAllocate (opcode 0x04)
Reserves a global register bank for a new group. The instruction stalls if no banks are free. Programs typically follow this with register initialisation.
Bits | Meaning |
---|---|
63-56 | 0x04 |
55-0 | Standard layout fields |
ThreadGroupLaunch (opcode 0x05)
Adds a prepared group to the scheduler queue. The B register encodes the entry address and starting position. Write masks select which lanes of the new group start alive.
Bits | Meaning |
---|---|
63-56 | 0x05 |
55-0 | Standard layout fields |
LoadPosition (opcode 0x07)
Loads the current screen position and tile offset into the destination register and updates flags for active lanes. Useful when spawning child work to know where it should operate.
Bits | Meaning |
---|---|
63-56 | 0x07 |
55-0 | Standard layout fields |
Flag Operations
AndFlags (opcode 0x08)
Computes the logical AND of all flag components and writes the result back to every component. It has no register operands and is typically used before branching.
Bits | Meaning |
---|---|
63-56 | 0x08 |
55-0 | Standard layout fields |
OrFlags (opcode 0x09)
Computes the logical OR of all flag components and writes the result back to every component.
Bits | Meaning |
---|---|
63-56 | 0x09 |
55-0 | Standard layout fields |
PairAndFlags (opcode 0x0A)
Replaces X and Y with their AND and Z and W with their AND. Useful when two comparisons must both be true.
Bits | Meaning |
---|---|
63-56 | 0x0A |
55-0 | Standard layout fields |
PairOrFlags (opcode 0x0B)
Performs OR on (X,Y) and (Z,W) pairs. It allows combining related conditions without extra branches.
Bits | Meaning |
---|---|
63-56 | 0x0B |
55-0 | Standard layout fields |
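The four reductions described so far (AndFlags, OrFlags, PairAndFlags, PairOrFlags) can be expressed on a single thread's flag vector. This is a minimal sketch with flags modelled as Python booleans in (X, Y, Z, W) order. The function names are illustrative.

```python
def and_flags(flags):
    """AND of all four components, written back to every component."""
    result = all(flags)
    return [result] * 4


def or_flags(flags):
    """OR of all four components, written back to every component."""
    result = any(flags)
    return [result] * 4


def pair_and_flags(flags):
    """X and Y become X AND Y; Z and W become Z AND W."""
    x, y, z, w = flags
    return [x and y, x and y, z and w, z and w]


def pair_or_flags(flags):
    """X and Y become X OR Y; Z and W become Z OR W."""
    x, y, z, w = flags
    return [x or y, x or y, z or w, z or w]


flags = [True, True, False, True]
print(and_flags(flags), or_flags(flags))
print(pair_and_flags(flags), pair_or_flags(flags))
```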
BroadcastFlags (opcode 0x0C)
Copies the X component flag from each thread to all components so threads can quickly test group-wide activity.
Bits | Meaning |
---|---|
63-56 | 0x0C |
55-0 | Standard layout fields |
Floating Point Arithmetic
AddFloat16 (opcode 0x10)
Component-wise addition of two FP16 vectors.
Bits | Meaning |
---|---|
63-56 | 0x10 |
55-0 | Standard layout fields |
MultiplyFloat16 (opcode 0x11)
Component-wise multiplication of two FP16 vectors.
Bits | Meaning |
---|---|
63-56 | 0x11 |
55-0 | Standard layout fields |
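For emulator authors, component-wise FP16 addition and multiplication can be approximated in Python by rounding each result through the `struct` module's half-precision format. This sketch assumes the VPU follows IEEE 754 binary16 with round-to-nearest. The hardware's rounding, denormal and overflow behaviour are not specified in this section and may differ.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 binary16 value.

    struct raises OverflowError for results beyond the FP16 range;
    real hardware may saturate or produce infinity instead (unknown here).
    """
    return struct.unpack("<e", struct.pack("<e", value))[0]


def add_float16(a, b):
    """Component-wise sum of two four-component FP16 vectors."""
    return [to_fp16(x + y) for x, y in zip(a, b)]


def multiply_float16(a, b):
    """Component-wise product of two four-component FP16 vectors."""
    return [to_fp16(x * y) for x, y in zip(a, b)]


a = [1.0, 2.0, 0.5, 2048.0]
b = [1.0, 1.0, 1.0, 0.5]
print(add_float16(a, b))
print(multiply_float16(a, b))
```

The last component shows the limited precision: near 2048 the spacing between FP16 values is 2, so adding 0.5 leaves the value unchanged.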
DivideFloat16 (opcode 0x15)
Divides the X component of A by B while multiplying the remaining components. Useful for perspective correction when interpolating.
Bits | Meaning |
---|---|
63-56 | 0x15 |
55-0 | Standard layout fields |
ConvertToFloat16 (opcode 0x16)
Converts signed 16‑bit integers in A to FP16, clamping out-of-range values. Handy for unpacking colour data.
Bits | Meaning |
---|---|
63-56 | 0x16 |
55-0 | Standard layout fields |
Integer Arithmetic
ConvertToInt16 (opcode 0x20)
Converts FP16 components to signed integers by truncation.
Bits | Meaning |
---|---|
63-56 | 0x20 |
55-0 | Standard layout fields |
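Truncation rounds toward zero, which differs from flooring for negative inputs. The sketch below illustrates this. Clamping out-of-range values to the signed 16-bit range is an assumption made for the example, since the overflow behaviour is not stated here.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767


def convert_to_int16(vec):
    """Truncate each component toward zero.

    Assumption: out-of-range values are clamped to the int16 range.
    """
    return [max(INT16_MIN, min(INT16_MAX, math.trunc(v))) for v in vec]


print(convert_to_int16([1.9, -1.9, 0.5, 40000.0]))
```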
AddInt16 (opcode 0x21)
Adds two signed 16‑bit integer vectors.
Bits | Meaning |
---|---|
63-56 | 0x21 |
55-0 | Standard layout fields |
AddInt32 (opcode 0x22)
Adds two 32‑bit integers formed from the XY and ZW pairs. Used for counters or addresses that exceed 16 bits.
Bits | Meaning |
---|---|
63-56 | 0x22 |
55-0 | Standard layout fields |
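The pairing of 16-bit components into 32-bit values can be sketched as below. Two details are assumptions, not statements from this reference: the first component of each pair (X or Z) holds the low 16 bits, and the sum wraps modulo 2^32.

```python
def pack32(lo, hi):
    """Combine two 16-bit components into one 32-bit value (assumed lo/hi order)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def add_int32(a, b):
    """Add the XY and ZW pairs of a and b as two independent 32-bit integers."""
    out = []
    for i in (0, 2):
        total = (pack32(a[i], a[i + 1]) + pack32(b[i], b[i + 1])) & 0xFFFFFFFF
        out += [total & 0xFFFF, total >> 16]
    return out


# 0x0000FFFF + 1 carries from X into Y; the ZW pair is simply 1 + 2.
print(add_int32([0xFFFF, 0x0000, 1, 0], [1, 0, 2, 0]))
```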
MultiplyInt16 (opcode 0x2C)
Multiplies pairs of 16‑bit integers and keeps the low halves. Commonly used when squaring numbers such as texture coordinates.
Bits | Meaning |
---|---|
63-56 | 0x2C |
55-0 | Standard layout fields |
MultiplyHiInt16 (opcode 0x2E)
Produces full 32‑bit products from X and Z components. Allows higher precision results when needed.
Bits | Meaning |
---|---|
63-56 | 0x2E |
55-0 | Standard layout fields |
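The difference between the two signed multiplies can be illustrated as follows. This is a sketch under stated assumptions: MultiplyInt16 is modelled as four component-wise products truncated to 16 bits, and MultiplyHiInt16 is modelled as writing each 32-bit product into a component pair (low half first), a placement that this reference does not spell out.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def multiply_int16(a, b):
    """Component-wise signed product, keeping only the low 16 bits."""
    return [(to_signed16(x) * to_signed16(y)) & 0xFFFF for x, y in zip(a, b)]


def multiply_hi_int16(a, b):
    """Full 32-bit signed products of the X and Z components.

    Assumption: X*X lands in (X, Y) and Z*Z in (Z, W), low half first.
    """
    out = []
    for i in (0, 2):
        product = (to_signed16(a[i]) * to_signed16(b[i])) & 0xFFFFFFFF
        out += [product & 0xFFFF, product >> 16]
    return out


v = [300, 2, 3, 4]
print(multiply_int16(v, v))      # 300 * 300 = 90000 overflows 16 bits
print(multiply_hi_int16(v, v))   # the full product survives across two components
```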
MultiplyInt16u (opcode 0x28)
Unsigned variant of MultiplyInt16.
Bits | Meaning |
---|---|
63-56 | 0x28 |
55-0 | Standard layout fields |
MultiplyHiInt16u (opcode 0x2A)
Unsigned version of MultiplyHiInt16.
Bits | Meaning |
---|---|
63-56 | 0x2A |
55-0 | Standard layout fields |
And (opcode 0x30)
Bitwise AND of two integer vectors. Often masks texture coordinates.
Bits | Meaning |
---|---|
63-56 | 0x30 |
55-0 | Standard layout fields |
Or (opcode 0x31)
Bitwise OR of two vectors. Can combine flags or build masks.
Bits | Meaning |
---|---|
63-56 | 0x31 |
55-0 | Standard layout fields |
ExclusiveOr (opcode 0x32)
Bitwise XOR, sometimes used for toggling masks or computing parity.
Bits | Meaning |
---|---|
63-56 | 0x32 |
55-0 | Standard layout fields |
TextureWrapInt16 (opcode 0x33)
Converts FP16 texture coordinates to integers and masks them with B. Helps implement tiled textures.
Bits | Meaning |
---|---|
63-56 | 0x33 |
55-0 | Standard layout fields |
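Masking with a power-of-two-minus-one value wraps a coordinate into a repeating tile. The sketch below shows the idea, assuming the float-to-integer step truncates like ConvertToInt16. The function name and the 256-texel tile size are illustrative only.

```python
def texture_wrap_int16(coords, mask):
    """Convert each coordinate to an integer, then AND it with the mask component."""
    return [int(c) & (m & 0xFFFF) for c, m in zip(coords, mask)]


# A 256-texel-wide tile: mask 0xFF keeps coordinates in the range 0..255.
print(texture_wrap_int16([260.7, 5.2, 511.0, 0.0], [0xFF] * 4))
```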
Tile Memory Operations
ReadLocalTileN / WriteLocalTileN (opcodes 0x40–0x53)
Read or write tile buffer N at each thread's local position. Locks can optionally be acquired or released. Typically used for per-pixel processing without external memory traffic.
Bits | Meaning |
---|---|
63-56 | 0x40–0x53 |
55-0 | Standard layout fields |
ReadTileN / WriteTileN (opcodes 0x44–0x57)
Access tile buffer N at an arbitrary address supplied in A.x. The lower two address bits are replaced with the thread ID so each lane touches a separate element.
Bits | Meaning |
---|---|
63-56 | 0x44–0x57 |
55-0 | Standard layout fields |
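The address substitution described above can be written as a one-line bit operation. This is a minimal sketch with a hypothetical helper name. Two address bits imply thread IDs 0 to 3.

```python
def lane_address(base_address, thread_id):
    """Replace the lower two bits of the address from A.x with the thread ID."""
    if not 0 <= thread_id < 4:
        raise ValueError("thread_id must fit in two bits")
    return (base_address & ~0b11) | thread_id


# All four lanes share the upper address bits and differ in the lowest two.
print([hex(lane_address(0x1237, t)) for t in range(4)])
```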
ReadLockTileN / WriteUnlockTileN (opcodes 0x4C–0x5F)
Acquire or release locks when accessing tile buffers. Locks are tracked per position and help coordinate groups updating neighbouring pixels.
Bits | Meaning |
---|---|
63-56 | 0x4C–0x5F |
55-0 | Standard layout fields |
External Memory Operations
IssueExternalLoad (opcode 0x70)
Begins an asynchronous read from external memory. The address comes from B.xy and the data becomes available when FinishExternalLoad is executed. Useful for fetching textures or geometry.
Bits | Meaning |
---|---|
63-56 | 0x70 |
55-0 | Standard layout fields |
IssueExternalLoadRgb (opcode 0x71)
Loads an RGB1555 value and converts it to FP16 on the fly.
Bits | Meaning |
---|---|
63-56 | 0x71 |
55-0 | Standard layout fields |
IssueExternalLoadLut0 / IssueExternalLoadLut1 (opcodes 0x72–0x73)
Perform palette lookups while reading from memory. Good for indexed colour formats.
Bits | Meaning |
---|---|
63-56 | 0x72–0x73 |
55-0 | Standard layout fields |
FinishExternalLoad (opcode 0x74)
Waits for a pending external load or fetch to complete. If the data is not ready the thread stalls.
Bits | Meaning |
---|---|
63-56 | 0x74 |
55-0 | Standard layout fields |
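The issue/finish split can be modelled as a two-step protocol in which the data is only observed at the finish step. The class below is a toy emulator sketch, not a description of the memory interface: it assumes one load in flight at a time, that B.x carries the low 16 address bits and B.y the high 16, and it represents memory as a plain dictionary.

```python
class ExternalLoadUnit:
    """Toy model of IssueExternalLoad followed by FinishExternalLoad."""

    def __init__(self, memory):
        self.memory = memory      # hypothetical backing store: address -> value
        self.pending = None       # address of the load in flight, if any

    def issue_load(self, b_x, b_y):
        """Start a load; assumption: B.x is the low half of the address."""
        self.pending = ((b_y & 0xFFFF) << 16) | (b_x & 0xFFFF)

    def finish_load(self):
        """Return the loaded value; hardware would stall here until it arrives."""
        if self.pending is None:
            raise RuntimeError("no external load in flight")
        value = self.memory[self.pending]
        self.pending = None
        return value


unit = ExternalLoadUnit({0x00012000: 42})
unit.issue_load(b_x=0x2000, b_y=0x0001)
# Independent arithmetic could run here while the load is outstanding.
loaded = unit.finish_load()
print(loaded)
```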
IssueExternalFetchSwap / Add32 / Or / Xor / And (opcodes 0x78–0x7C)
Start atomic read–modify–write operations in external memory. They behave like loads followed by an update. Used for counters or synchronisation across groups.
Bits | Meaning |
---|---|
63-56 | 0x78–0x7C |
55-0 | Standard layout fields |
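The "load followed by an update" behaviour can be sketched as a function that returns the old value and stores the combined one. The operation keys mirror the instruction names above. The 32-bit operand width for every operation and the dictionary-based memory are assumptions made for this example.

```python
# Each entry maps (old value, operand) to the value written back to memory.
FETCH_OPS = {
    "swap": lambda old, operand: operand,
    "add32": lambda old, operand: (old + operand) & 0xFFFFFFFF,
    "or": lambda old, operand: old | operand,
    "xor": lambda old, operand: old ^ operand,
    "and": lambda old, operand: old & operand,
}


def external_fetch(memory, address, op, operand):
    """Apply a read-modify-write at address and return the value read."""
    old = memory[address]
    memory[address] = FETCH_OPS[op](old, operand) & 0xFFFFFFFF
    return old


counter = {0x100: 5}
previous = external_fetch(counter, 0x100, "add32", 3)
print(previous, counter[0x100])
```

Returning the old value is what makes these operations usable as counters: each group that increments the location receives a distinct ticket.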
ExternalWrite (opcode 0x7F)
Writes a register to external memory at B.xy. All four components are stored unconditionally.
Bits | Meaning |
---|---|
63-56 | 0x7F |
55-0 | Standard layout fields |
Miscellaneous Notes
- Branch instructions cause a single-cycle bubble when taken because there is no prediction hardware.
- Tile buffer locks serialise access, so overuse may reduce parallelism.
- Atomic external operations bypass the cache, so repeated use can saturate memory bandwidth.