VPU Architecture

The Virtual Processing Unit (VPU) is a programmable vector processor designed for graphics and image manipulation. This document describes the hardware at a conceptual level so that software authors can understand how programs are executed. All numeric values refer to the actual hardware configuration used by the emulator.

Overview

A single VPU core uses a wide SMT design to hide latency. Up to 96 hardware threads share one execution pipeline. Threads are grouped into thread groups of four, and all four threads in a group execute in lockstep and share a program counter. The VPU can schedule up to 20 active groups at a time while maintaining state for another 28 queued groups, giving 48 groups in total.

Each VPU system contains one or more cores connected to a common memory bus through an L2 cache and DMA engine. The global memory interface supports asynchronous reads, atomic updates and block transfers via DMA. Each core also includes several kinds of local memory and numerous register files. The following sections describe each component in detail.

Execution Pipeline

The pipeline of a core is organised as fetch, decode, operand collection, execute and writeback stages. Instructions are 64 bits wide and hold the opcode plus configuration for input swizzles, masking and writeback conditions. On each cycle the scheduler selects a thread group to issue. Because at most 20 groups can be active, other groups wait in a FIFO until a slot becomes free.

Threads have individual condition flags used for branching and masking. Divergent branches mask inactive threads until they reconverge with a Sync instruction. If all threads in a group exit, the group is returned to the free pool.

Register Files

Three register spaces are visible to shader programs:

  • Local registers – each thread owns 16 vector registers (64 bits each) used for temporary values.
  • Shared registers – each group shares 16 registers that all threads can access.
  • Global registers – four banks exist, with 32 registers per bank. When work is spawned, the launching program selects which bank is visible to the spawned group.

Registers hold 64‑bit vectors which may be interpreted as four FP16 values, four signed or unsigned 16‑bit integers, or two 32‑bit integers. The instruction set operates on these formats and can mix them within reasonable limits.
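
As an illustration, one register value can be represented with a C union so the same 64 bits can be viewed in any of the supported formats. This is a minimal sketch; the type and field names are not taken from the actual emulator source.

/* Three views of one 64-bit register value. */
#include <stdint.h>

typedef uint16_t fp16_t;   /* assumed raw FP16 bit pattern */

typedef union {
    fp16_t   f16[4];  /* four FP16 lanes            */
    int16_t  s16[4];  /* four signed 16-bit lanes   */
    uint16_t u16[4];  /* four unsigned 16-bit lanes */
    int32_t  s32[2];  /* two 32-bit integer lanes   */
    uint64_t raw;     /* whole 64-bit vector        */
} vreg_t;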

Program Memory

Each core provides 1024 instruction slots of program memory. Program counters wrap within this range. Instruction words are fetched sequentially unless modified by a branch operation. Because program memory is per core, the same program executes identically on each thread group launched on that core.

Tile Buffers

Four dedicated tile buffers exist for intermediate pixel data. Each buffer holds 1024 entries of 64 bits. Addresses within a tile buffer use 10 bits. Threads may access the buffer at their own screen position or at an arbitrary offset. Loads and stores can optionally lock a position to ensure only one group updates a tile at a time. Locks are maintained per position and must be released explicitly when finished.
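
A minimal model of one tile buffer and its per-position locks is sketched below. The structure and function names are assumptions made for illustration; only the sizes and the lock-on-read, release-on-write behaviour come from the description above.

/* Illustrative model of one tile buffer with per-position locks. */
#include <stdbool.h>
#include <stdint.h>

#define TILE_ENTRIES 1024   /* 10-bit addresses */

typedef struct {
    uint64_t data[TILE_ENTRIES];
    bool     locked[TILE_ENTRIES];  /* one lock per position  */
    int      owner[TILE_ENTRIES];   /* group holding the lock */
} tile_buffer_t;

/* Locking read: fails (returns false) if another group holds the lock. */
static bool tile_read_lock(tile_buffer_t *tb, uint16_t addr,
                           int group, uint64_t *out) {
    addr &= TILE_ENTRIES - 1;            /* addresses wrap within 10 bits */
    if (tb->locked[addr] && tb->owner[addr] != group)
        return false;                    /* caller must retry */
    tb->locked[addr] = true;
    tb->owner[addr]  = group;
    *out = tb->data[addr];
    return true;
}

/* Releasing write: stores the value and frees the position. */
static void tile_write_unlock(tile_buffer_t *tb, uint16_t addr,
                              int group, uint64_t value) {
    addr &= TILE_ENTRIES - 1;
    tb->data[addr] = value;
    if (tb->owner[addr] == group)
        tb->locked[addr] = false;
}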

The combination of local registers and tile memory allows complex image operations such as blending, filtering and tiling without constant trips to external memory.

Spawner and Scheduling

Programs launch new work through the spawner. To begin, a group must allocate one of the 20 active slots. The program specifies the starting screen position, program entry point and global register bank. Once scheduled, the group executes until it either completes or spawns further work. The spawner maintains start and end coordinates for a rectangular region and automatically advances through positions as groups complete.

Groups waiting for an active slot are queued in a FIFO. The scheduler uses a round‑robin policy across the available groups so that long‑running threads do not starve new work. On each cycle a different group can issue instructions, giving the appearance of 96 fully parallel threads even though only a single pipeline is present.
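
The following sketch shows one plausible form of that round-robin selection. The slot bookkeeping is an assumption; the text specifies only that issue rotates across the active groups.

/* Round-robin pick of the next active group to issue from. */
#define ACTIVE_SLOTS 20

typedef struct { int live; /* group has runnable threads */ } group_slot_t;

/* Search starts just after the last slot used, so no group starves. */
static int pick_next_group(const group_slot_t slots[ACTIVE_SLOTS], int last) {
    for (int i = 1; i <= ACTIVE_SLOTS; i++) {
        int cand = (last + i) % ACTIVE_SLOTS;
        if (slots[cand].live)
            return cand;      /* issue one instruction for this group */
    }
    return -1;                /* nothing runnable this cycle */
}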

External Memory Interface

External memory access uses a request queue of 16 entries per core. Loads and atomic fetch operations issue requests that later complete via a dedicated finish instruction. The memory path passes through a shared L2 cache inside the VPU complex. Cache lines are 32 bytes and organised as two-way set associative. Misses are serviced using a simple round-robin policy across banks, with a typical fill latency of 32 cycles.
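
A hedged sketch of the request queue follows. The split between issuing a load and completing it later matches the description above; the entry fields and names are illustrative assumptions.

/* Sketch of the 16-entry external memory request queue. */
#include <stdbool.h>
#include <stdint.h>

#define MEMQ_DEPTH 16

typedef struct {
    bool     valid;     /* entry in use                   */
    bool     done;      /* data has returned from memory  */
    uint32_t address;   /* 32-bit byte address            */
    uint64_t data;      /* returned vector                */
    int      dest_reg;  /* register written on completion */
} mem_request_t;

typedef struct {
    mem_request_t entries[MEMQ_DEPTH];
} mem_queue_t;

/* Issue: claim a free entry, or stall (return -1) if the queue is full. */
static int memq_issue(mem_queue_t *q, uint32_t addr, int dest_reg) {
    for (int i = 0; i < MEMQ_DEPTH; i++) {
        if (!q->entries[i].valid) {
            q->entries[i] = (mem_request_t){ true, false, addr, 0, dest_reg };
            return i;   /* a later finish instruction waits on this entry */
        }
    }
    return -1;
}

A finish instruction would then wait on the corresponding entry's done flag and copy the returned data into the destination register.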

The complex also contains a DMA engine for block transfers between external memory and tile buffers. DMA operations specify a burst size, count, stride and optional data conversion such as RGB1555 to FP16. DMA requests arbitrate for the memory bus with normal cache traffic but operate independently from the cores.

Synchronisation and Locks

Tile buffer locks are primarily used by programs that update neighbouring pixels from multiple groups. A lock is associated with each tile position. Threads may acquire a lock when reading so that no other group writes the same position until released. Likewise, writes can release a lock once the update is complete. Locks only apply to tile memory and are not used for external memory accesses.

Branch divergence is handled through thread masks. When a Branch instruction is taken by some threads and not others, the threads that skip the branch are masked. Instructions still issue every cycle but masked threads do not update registers or flags. A Sync instruction clears the mask so all threads resume execution together. Properly managing divergence is key to achieving good utilisation of the pipeline.
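
The sketch below models this masking for a four-thread group, assuming the mask is one bit per thread; the actual encoding is not specified in the text.

/* Illustrative divergence handling for one four-thread group. */
#include <stdint.h>

typedef struct {
    uint8_t  exec_mask;   /* bit per thread: 1 = thread writes results */
    uint32_t pc;          /* shared program counter                    */
} group_state_t;

/* Branch: threads that skip the branch are masked off while the taken
   path executes. Instructions still issue every cycle, but masked
   threads do not update registers or flags. */
static void take_branch(group_state_t *g, uint8_t taken_mask, uint32_t target) {
    g->exec_mask &= taken_mask;
    g->pc = target;
}

/* Sync: restore the mask so all live threads resume together. */
static void sync_threads(group_state_t *g, uint8_t live_mask) {
    g->exec_mask = live_mask;
}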

Condition Flags and Writeback

Each thread maintains four condition flag bits labelled X, Y, Z and W. Most ALU operations update these flags according to each result component's sign, its zero status, or a combination of the two. Writeback from an instruction is separately configurable for each component. This allows, for example, conditional writes based on a previous comparison.
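
For example, a flag update in sign mode followed by a conditional writeback could be modelled as below. The mode selector and bit layout are assumptions; only the four X/Y/Z/W bits and per-component writeback come from the text.

#include <stdint.h>

typedef uint8_t flags_t;   /* bits 0..3 = components X, Y, Z, W */

/* Flag update in an assumed "sign" mode: set a component's flag
   when its result is negative. */
static flags_t update_flags_sign(const int16_t result[4]) {
    flags_t f = 0;
    for (int c = 0; c < 4; c++)
        if (result[c] < 0)
            f |= (flags_t)(1u << c);
    return f;
}

/* Conditional writeback: only components whose flag is set are
   written; the rest of the destination vector is left unchanged. */
static void writeback_masked(int16_t dest[4], const int16_t result[4],
                             flags_t f) {
    for (int c = 0; c < 4; c++)
        if (f & (1u << c))
            dest[c] = result[c];
}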

Core to Complex Connection

Multiple cores may be instantiated within one VPU complex, although the default configuration contains just one. The complex arbitrates access to external memory between cores and the DMA engine. Because the shared cache sits between cores and memory, data loaded by one core may be visible to another provided caching policies permit it. Atomic operations on external memory are guaranteed coherent only within a complex, not across multiple complexes.

Example Program Flow

  1. Host software writes the desired shader program into a core's program memory.
  2. Global registers are initialised with constants such as texture pointers or transformation parameters.
  3. The host uses the spawner to launch work over a region of screen coordinates. Each launched group starts at the program entry point with a selected global register bank.
  4. The core schedules the new groups among any already running. As groups execute they may load from or store to tile buffers, issue external memory reads or writes, and spawn additional groups.
  5. Finished groups release their global register bank for reuse. When all groups have completed and the spawner queue is empty, the host can read back results from tile buffers or external memory.

Summary

The VPU combines a simple vector instruction set with a highly multi‑threaded execution model. By understanding the arrangement of thread groups, registers, tile memory and the external memory system, developers can craft efficient programs that take advantage of the 96‑way pipeline. The following documents cover instruction encoding, individual opcodes and usage of the emulator that models this hardware.

Pipeline Stages in Detail

  1. Fetch – The core reads the next instruction for the currently scheduled thread group from program memory. The fetch stage runs speculatively for masked threads so the program counter always advances in lockstep.
  2. Decode – Opcode and operand selectors are decoded. The control logic determines register indexes, swizzle selectors and immediate modifiers.
  3. Operand Collection – Source registers are read, shuffle operations apply and zeros or sign inversions are inserted as requested. For tile buffer reads or external memory operations, this stage initiates the request but the instruction continues to execute while the memory operation completes later.
  4. Execute – The arithmetic logic unit performs the requested operation. Integer and floating point operations share the same datapath so only one functional unit exists per stage. Flag results are computed simultaneously.
  5. Writeback – Results are written to the destination register if the per component conditions evaluate true. Flags are also updated here. Because each component has an independent condition, some parts of a vector may be updated while others remain unchanged.

The entire pipeline takes five cycles from fetch to writeback. As long as at least five groups are active, one instruction completes every cycle.

Memory Map

The VPU addresses external memory using 32‑bit byte addresses. Cache lines are aligned to 32‑byte boundaries. The top bit of an address selects whether data is interpreted directly or via one of two 256‑entry decompression lookup tables. Writes must target uncached regions only. When DMA is used, tile memory addresses are formed by concatenating the core ID, tile buffer ID and tile offset as follows:

+-----------+------------+-------------+-----+
| Core Bits | Buffer (2) | Offset (10) | 000 |
+-----------+------------+-------------+-----+

The lower three bits are always zero because vectors are aligned to eight bytes.
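
The packing can be expressed directly from the diagram. The width of the core ID field is configuration dependent, so the shift used for it below is an assumption.

/* Pack a DMA tile address from the fields shown above. */
#include <stdint.h>

static uint32_t dma_tile_address(uint32_t core, uint32_t buffer,
                                 uint32_t offset) {
    /* | core | buffer (2 bits) | offset (10 bits) | 000 | */
    return (core << 15) |
           ((buffer & 0x3u)   << 13) |
           ((offset & 0x3FFu) << 3);   /* low 3 bits zero: 8-byte alignment */
}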

Performance Considerations

Thread groups advance in a round-robin fashion. Each group issues one instruction approximately every 20 cycles when all slots are full. Programs that stall on memory or locks should therefore keep enough groups active to cover latency. Using divergent branches excessively reduces utilisation because masked threads still occupy a slot in the schedule. Ideally, shader code organises branches so that neighbouring pixels follow the same path as often as possible.

Example Usage Scenario

Consider a 2D tile‑based renderer where each tile represents a screen region. The host allocates an array of global register banks filled with texture pointers and constant colours. Work is spawned over the visible tiles. Each thread group loads tile data from external memory, blends it with existing pixels in tile buffers while respecting locks to avoid race conditions, and finally writes the completed tile back to memory using the DMA engine. With 96 hardware threads the renderer can process dozens of tiles concurrently, hiding latency of memory transfers.

Register Usage and Addressing

Local and shared registers are addressed using 5‑bit indexes. Instructions specify a source and destination index along with per component masking. For global registers, the high two bits select the bank while the lower five bits select the register within that bank. When a thread group is spawned it inherits the bank selection from the launching group, allowing programs to stage constants for many groups ahead of time.
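
In code form, the global selector decode is a pair of shifts and masks; the 7-bit selector width follows from two bank bits plus five register bits.

/* Decode a 7-bit global register selector. */
#include <stdint.h>

static void decode_global_reg(uint8_t sel, uint8_t *bank, uint8_t *reg) {
    *bank = (sel >> 5) & 0x3;   /* high two bits: 4 banks   */
    *reg  = sel & 0x1F;         /* low five bits: 32 per bank */
}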

Because each thread has its own local register space, total register file usage can grow to hundreds of entries across all threads. Programs that rely heavily on local registers should ensure that the working set fits in the hardware to avoid spilling through external memory.

Thread Group State

The hardware tracks several pieces of metadata for each group:

  • Program counter – the next instruction to execute.
  • Screen position – the X/Y coordinate within the current tile or render target.
  • Liveness mask – indicates which threads are still active after branches or exits.
  • Condition flags – four bits per thread used for conditional execution.
  • Spawn pointer – location within the spawner grid when launching work.

Maintaining this state allows the scheduler to pause and resume groups transparently. When a group completes or becomes idle waiting for memory, another queued group can take its place in the active set without software intervention.
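
One possible C layout for this metadata is sketched below; field widths beyond the counts stated in the text are assumptions.

/* Per-group state tracked by the scheduler. */
#include <stdint.h>

typedef struct {
    uint16_t pc;            /* program counter (1024-entry program memory)  */
    uint16_t x, y;          /* screen position                              */
    uint8_t  live_mask;     /* low four bits: one liveness bit per thread   */
    uint8_t  flags[4];      /* X/Y/Z/W condition bits, one set per thread   */
    uint16_t spawn_index;   /* position within the spawner grid             */
} group_metadata_t;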

Spawner Operation in Detail

The spawner is configured with a rectangular region defined by start and end coordinates. Each cycle that the spawner is not busy, it allocates a thread group and fills in its starting position. The program counter and global register bank come from parameters provided by the host. As groups exit, new positions are pulled from the region until all coordinates have been covered.
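
A sketch of that position walk is shown below, assuming row-major order; the text only states that positions advance until the region is covered.

/* Spawner region walk: hand out one position per allocated group. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t x0, y0, x1, y1;   /* inclusive start/end coordinates */
    uint16_t x, y;             /* next position to hand out       */
    bool     done;
} spawner_t;

static bool spawner_next(spawner_t *s, uint16_t *out_x, uint16_t *out_y) {
    if (s->done) return false;
    *out_x = s->x;
    *out_y = s->y;
    if (++s->x > s->x1) {      /* advance, wrapping to the next row */
        s->x = s->x0;
        if (++s->y > s->y1)
            s->done = true;
    }
    return true;
}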

A typical usage pattern is to spawn a group per 8×8 tile of screen space. Because groups execute independently, large frame buffers can be processed in parallel. The host may choose to spawn additional layers or effects by launching secondary passes once the primary work completes.

DMA Operation Example

To perform a block transfer from external memory into a tile buffer, software programs the DMA engine with the desired direction, burst size, stride and conversion. For example, copying a 64×64 pixel texture with RGB1555 conversion would use 64 bursts of 128 bytes each, one burst per row of 64 16-bit pixels. The engine reads eight bytes at a time from memory, converts them to FP16 vectors and writes them into the target buffer. DMA can operate while cores continue executing other instructions. Once the transfer completes the cores can read the populated tile buffer without additional latency.
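
The per-pixel unpacking step of that conversion might look as follows. The channel order and the final FP16 bit encoding are assumptions; plain floats stand in for the FP16 lanes the hardware would actually write.

/* Unpack one RGB1555 pixel into normalised channel values. */
#include <stdint.h>

typedef struct { float r, g, b, a; } pixel_t;

static pixel_t unpack_rgb1555(uint16_t p) {
    pixel_t out;
    out.b = (float)( p        & 0x1F) / 31.0f;  /* 5-bit blue  */
    out.g = (float)((p >> 5)  & 0x1F) / 31.0f;  /* 5-bit green */
    out.r = (float)((p >> 10) & 0x1F) / 31.0f;  /* 5-bit red   */
    out.a = (float)((p >> 15) & 0x1);           /* 1-bit alpha */
    return out;
}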

Additional Notes

The VPU hardware used by the emulator is intentionally simplified compared to modern GPUs. There is no hardware texture cache, branch prediction or out‑of‑order execution. All synchronisation is explicit, giving the programmer fine grained control over memory ordering and pipeline stalls. Despite the simplicity, careful use of thread groups and tile memory enables complex graphical effects such as particle systems, procedural textures and deferred shading.

Cache Design

Each core relies on a shared L2 cache within the complex to reduce latency of external memory. The cache is split into multiple banks so that concurrent requests from different cores or from the DMA engine can proceed in parallel. Each bank holds two sets with two ways per set. Replacement uses a simple round‑robin policy. Dirty lines are written back lazily when evicted. Because the VPU is primarily intended for streaming workloads, the cache is relatively small compared to modern CPU caches but sufficient to cover sequential texture or buffer accesses.
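
The address breakdown implied by this organisation is sketched below. The bank count is not given in the text, so four banks are assumed for illustration.

/* L2 address split: 32-byte lines, banks interleaved on line address,
   two sets per bank. */
#include <stdint.h>

#define LINE_BITS 5   /* 32-byte lines   */
#define BANK_BITS 2   /* assumed 4 banks */
#define SET_BITS  1   /* 2 sets per bank */

typedef struct { uint32_t bank, set, tag; } cache_index_t;

static cache_index_t l2_index(uint32_t addr) {
    cache_index_t ix;
    uint32_t line = addr >> LINE_BITS;          /* drop the byte offset    */
    ix.bank = line & ((1u << BANK_BITS) - 1);   /* interleave across banks */
    ix.set  = (line >> BANK_BITS) & ((1u << SET_BITS) - 1);
    ix.tag  = line >> (BANK_BITS + SET_BITS);
    return ix;
}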

Atomic Operations

The instruction set includes several atomic fetch and update operations. These allow shader programs to implement counters, mutexes or reductions in external memory without race conditions. Atomic requests are issued through the same queue as normal loads but are forwarded directly to memory, bypassing the cache. When the result returns it is written into a register using the same mechanism as a normal load completion.
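
Functionally, an atomic fetch-and-add behaves like the sketch below, with external memory stood in by a plain array; the hardware performs the read-modify-write indivisibly and returns the old value through the normal load-completion path.

/* Atomic fetch-and-add as seen by software; names are illustrative. */
#include <stdint.h>

static uint64_t ext_mem[1u << 20];   /* stand-in for external memory */

static uint64_t atomic_fetch_add(uint32_t addr, uint64_t operand) {
    uint32_t idx = (addr >> 3) & ((1u << 20) - 1);  /* 8-byte vectors     */
    uint64_t old = ext_mem[idx];                    /* bypasses the cache */
    ext_mem[idx] = old + operand;                   /* done indivisibly   */
    return old;                                     /* returned like a load */
}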

Instruction Prefetch and Branching

Instruction fetches are pipelined so that the next instruction word is requested while the current instruction executes. Branch targets are calculated during the decode stage. Because there is no branch prediction, any taken branch incurs a pipeline bubble while the new instruction is fetched. Keeping branches well structured minimises this cost. Using the condition flag writeback and mask features, many small decisions can be encoded without using explicit branches at all.

Typical Use Cases

  • Sprite Rendering: Each thread group processes a quad on screen, sampling textures, applying colour modulation and writing to a tile buffer before DMA to the final frame.
  • Image Filtering: Groups read pixels from a source tile, apply convolution kernels stored in global registers and write the result back. Locks ensure that overlapping regions are updated in order.
  • Physics or Particle Systems: Vector arithmetic instructions update positions and velocities. External writes store results back to system memory where the main CPU can consume them.
  • Compute Workloads: Because the VPU supports integer math and atomic operations, it can also handle tasks such as histogram generation or prefix sums within a graphics pipeline.

Limitations

While powerful for parallel pixel processing, the VPU lacks features such as hardware texture sampling, floating point rounding modes other than truncation, and memory coherence across multiple complexes. Programs must manually manage caching behaviour, memory fences and data formats. Nevertheless the architecture maps well to predictable workloads where control flow and memory access patterns are known in advance.

Pseudocode Example

Below is an illustrative sequence that shows how a simple program might read from external memory, process a value and write it back. Real programs are more complex but the structure is similar.

for each thread group in spawner region {
    load value from external address
    convert to float
    add constant from global register
    clamp result
    store back to external memory
}

The load and store steps may take many cycles due to memory latency. By keeping several thread groups active the VPU hides this delay.

Pipeline Timing Diagram

Cycle | Group 0  | Group 1  | Group 2  | Group 3  | Group 4
------+----------+----------+----------+----------+----------
  0   | Fetch    |          |          |          |
  1   | Decode   | Fetch    |          |          |
  2   | Operands | Decode   | Fetch    |          |
  3   | Execute  | Operands | Decode   | Fetch    |
  4   | Write    | Execute  | Operands | Decode   | Fetch
  5   | ... pipeline repeats ...

Each column represents one thread group issuing instructions as the pipeline fills. After the first few cycles the pipeline produces one result per cycle.

Further Reading

This document provides a broad overview of the hardware. The companion documents in this folder describe the exact instruction encoding, individual opcode behaviour and how to run the high level emulator that models the system. Developers interested in writing programs should start with the instruction set reference and experiment using the emulator to observe how programs interact with tile memory and external memory.

Glossary

  • Thread – One of the 96 execution contexts within a core.
  • Thread Group – Set of four threads that share a program counter.
  • Tile Buffer – On-chip memory array for temporary pixel data.
  • Global Register Bank – Set of 32 registers shared between groups.
  • Condition Flag – Per-thread bits used for conditional execution.
  • DMA – Direct Memory Access engine for block transfers.
  • Spawner – Hardware unit that issues new thread groups.
  • L2 Cache – Shared cache that buffers external memory accesses.

Component Summary

The table below summarises key numeric parameters of the VPU hardware.

Feature                       | Value
------------------------------+-------------------
Active thread groups per core | 20
Total thread groups per core  | 48
Threads per group             | 4
Local registers per thread    | 16
Shared registers per group    | 16
Global register banks         | 4
Global registers per bank     | 32
Tile buffers per core         | 4
Entries per tile buffer       | 1024
Program memory per core       | 1024 instructions
Memory queue depth            | 16 requests
Cache line size               | 32 bytes

Historical Perspective

The VPU draws inspiration from early programmable shading hardware that predated unified shader cores. By using a very small instruction set and a large number of hardware threads, it achieves good throughput on workloads with predictable memory access patterns. Unlike modern GPUs it does not provide texturing units or fixed function interpolation. All data fetches and arithmetic must therefore be expressed using the vector instruction set.

Much of the architecture was motivated by simplicity for FPGA implementation. Features such as branch divergence masking and explicit cache management keep hardware costs low while still enabling sophisticated effects through software techniques. The emulator models these behaviours closely so that programs written against it behave the same on a real VPU implementation.

Comparison to Contemporary Designs

Compared to modern desktop GPUs the VPU lacks many convenience features, but it also exposes a very deterministic execution model. There is no out‑of‑order execution, no speculative threading and no hidden caches beyond the explicit L2. As a result, performance is highly predictable from the instruction sequence and memory access pattern alone. This makes the architecture suitable for experiments in graphics or compute where deterministic results are important.

Testing and Emulation

The provided high level emulator mirrors the behaviour of the hardware to a high degree. It models the thread scheduler, memory latency and DMA semantics. Programs developed against the emulator will therefore translate directly to real hardware. For verification it can dump internal state such as register banks and tile contents after each instruction, helping developers understand how their code interacts with the architecture.

When running large test suites it is common to seed the spawner with a known pattern of work, execute until completion and then compare the tile buffer contents against expected images. Because the architecture is deterministic, mismatches point to either a logic error in the program or a divergence from specification in the emulator itself.

Future Extensions

The reference design intentionally omits features that might complicate a small FPGA implementation. However nothing in the architecture prevents adding more cores, larger caches or additional instruction formats in future revisions. A vector texture fetch unit or floating point rounding control could be introduced while preserving the basic thread group and tile memory model.

Operational Tips

  • Initialise all global registers before spawning work to avoid reading undefined values.
  • When mixing integer and floating point operations remember that condition flags are set based on the selected arithmetic mode.
  • Use tile buffer locks sparingly; they introduce serialisation points that can reduce parallelism.
  • Monitor cache utilisation when issuing many external loads. Adjust the stride of DMA operations to align with cache line boundaries where possible.

Instruction Latency

For most ALU instructions the result is available five cycles after issue due to the length of the pipeline. Memory operations such as loads insert additional delay depending on cache hits and DMA activity. A taken branch stalls the pipeline for one cycle while the new instruction word is fetched. By estimating these latencies developers can schedule independent operations from different thread groups to keep the pipeline busy.

Instruction Type           | Typical Delay
---------------------------+------------------------------------
ALU (integer or float)     | 5 cycles
Tile buffer read           | 5 cycles to issue, plus lock wait
External load (cache hit)  | 35 cycles
External load (cache miss) | 35 cycles + fill latency
DMA transfer               | Overlaps with execution; length varies

Acknowledgements

This documentation was compiled from analysis of the emulator and hardware design goals. It is intended to serve as a starting point for anyone experimenting with the VPU architecture. Contributions and corrections are welcome as the implementation evolves.

Conclusion

Understanding the VPU requires a mental model of the thread scheduler, register layout and memory hierarchy. While the design is intentionally minimal, mastering its nuances enables sophisticated graphics pipelines on relatively small hardware. This document will continue to expand as the architecture grows and new features are introduced.