VPU Emulator Overview
The VPU emulator provides a software model of the hardware. It simulates all core features including the thread scheduler, tile memory, register files and the external memory interface. The emulator is useful for developing and debugging shader programs without access to physical hardware.
Building
The emulator is compiled as part of the main project build. Standard development environments can compile it for native execution on a workstation. No special hardware is required. The build produces a command line tool that loads programs, runs them and optionally dumps execution traces.
Basic Usage
- Create an instance of the emulator and reset it to a known state.
- Write the shader program into program memory.
- Initialise global registers with constants, pointers and configuration data.
- Use the spawner interface to launch work over the desired range of screen coordinates.
- Step the emulator until all thread groups have completed.
- Read back tile memory or external memory to examine results.
The stepping process advances the simulation one cycle at a time. Each cycle updates the internal cores, processes memory requests and performs DMA transfers. The host application may choose to run until no active thread groups remain or to single-step for debugging purposes.
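When a shader never terminates (for example because of a lock that is never released), an unbounded stepping loop hangs the host. The sketch below wraps the loop in a cycle budget. It assumes only that the emulator type exposes `step()` and `active_threads()`, the names used in the embedding example later in this document. `run_until_idle`, `RunResult` and `CountdownEmulator` are hypothetical names introduced for illustration and are not part of the emulator.

```cpp
#include <cstdint>

// Outcome of a bounded run: cycles stepped, and whether the emulator
// went idle before the budget ran out.
struct RunResult {
    std::uint64_t cycles;
    bool completed;
};

// Steps any emulator-like object until no thread groups remain active
// or max_cycles is reached. Emu only needs step() and active_threads().
template <typename Emu>
RunResult run_until_idle(Emu& emu, std::uint64_t max_cycles) {
    std::uint64_t cycles = 0;
    while (emu.active_threads() && cycles < max_cycles) {
        emu.step();
        ++cycles;
    }
    return RunResult{cycles, !emu.active_threads()};
}

// Hypothetical stand-in used only to exercise the helper: it stays
// active for a fixed number of steps.
struct CountdownEmulator {
    int remaining;
    bool active_threads() const { return remaining > 0; }
    void step() { if (remaining > 0) --remaining; }
};
```

A caller would treat `completed == false` as a likely hang and dump state for inspection instead of waiting indefinitely.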
Program Loading
Programs are supplied as arrays of 64‑bit instruction words. These words are written into program memory starting at address zero. The emulator does not enforce any particular format for loading programs, so host applications are free to assemble instructions in whatever manner is convenient. After loading, the entry point for spawned thread groups is specified when calling the spawner.
Spawner Interface
The emulator exposes a spawner that mirrors the behaviour of the hardware. When work is launched, the spawner records the rectangular region of screen coordinates along with the program entry point and target global register bank. On each step, the spawner allocates new thread groups until the region is fully covered. This makes it possible to emulate large workloads without manually creating every thread group from the host.
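The incremental allocation described above can be illustrated with a small model. This is a sketch under stated assumptions, not the emulator's actual spawner: the block dimensions per thread group, the row-major allocation order and the `free_slots` parameter are all hypothetical, as are the `Spawner` and `ThreadGroup` types. Region bounds are inclusive, matching the `width - 1, height - 1` arguments in the embedding example.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

// One allocated thread group: the pixel block it covers plus launch state.
struct ThreadGroup {
    int x0, y0, x1, y1;        // inclusive pixel bounds
    std::uint32_t entry_pc;
    int global_bank;
};

class Spawner {
public:
    // group_w and group_h are assumed block sizes, not hardware values.
    Spawner(int group_w, int group_h) : gw_(group_w), gh_(group_h) {
        if (group_w <= 0 || group_h <= 0)
            throw std::invalid_argument("group size must be positive");
    }

    // Record a pending region; nothing is allocated until step().
    void spawn_work(int x0, int y0, int x1, int y1,
                    std::uint32_t entry_pc, int global_bank) {
        x0_ = x0; y0_ = y0; x1_ = x1; y1_ = y1;
        pc_ = entry_pc; bank_ = global_bank;
        next_x_ = x0; next_y_ = y0;
        pending_ = (x0 <= x1 && y0 <= y1);
    }

    // Allocate at most free_slots groups this cycle, in row-major order.
    std::vector<ThreadGroup> step(int free_slots) {
        std::vector<ThreadGroup> out;
        while (pending_ && free_slots-- > 0) {
            out.push_back({next_x_, next_y_,
                           std::min(next_x_ + gw_ - 1, x1_),
                           std::min(next_y_ + gh_ - 1, y1_),
                           pc_, bank_});
            next_x_ += gw_;
            if (next_x_ > x1_) { next_x_ = x0_; next_y_ += gh_; }
            if (next_y_ > y1_) pending_ = false;   // region fully covered
        }
        return out;
    }

    bool pending() const { return pending_; }

private:
    int gw_, gh_;
    int x0_ = 0, y0_ = 0, x1_ = -1, y1_ = -1;
    int next_x_ = 0, next_y_ = 0;
    std::uint32_t pc_ = 0;
    int bank_ = 0;
    bool pending_ = false;
};
```

With 4x4 blocks, a 10x6 region yields six groups; blocks on the right and bottom edges are clipped to the region.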
External Memory Simulation
An emulated SDRAM model backs the external memory interface. Memory requests from cores are queued, passed through an L2 cache and eventually satisfied from SDRAM. The emulator tracks cache hits and misses, simulates bus contention and honours the same latency rules as the real hardware. DMA operations are likewise modelled cycle by cycle, allowing programs to overlap transfers with computation.
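To make the hit and miss accounting concrete, here is a minimal direct-mapped cache model. The document does not state the organisation, line size or capacity of the emulated L2 cache, so the direct-mapped layout and every parameter here are assumptions, and `DirectMappedCache` is a hypothetical name. Latency and bus contention are not modelled in this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

class DirectMappedCache {
public:
    DirectMappedCache(std::size_t num_lines, std::size_t line_bytes)
        : line_bytes_(line_bytes), tags_(num_lines, 0), valid_(num_lines, false) {
        if (num_lines == 0 || line_bytes == 0)
            throw std::invalid_argument("cache geometry must be non-zero");
    }

    // Returns true on a hit. On a miss the line is filled, evicting
    // whatever previously occupied the same slot.
    bool access(std::uint64_t addr) {
        const std::uint64_t line = addr / line_bytes_;
        const std::size_t index = static_cast<std::size_t>(line % tags_.size());
        if (valid_[index] && tags_[index] == line) {
            ++hits_;
            return true;
        }
        valid_[index] = true;
        tags_[index] = line;    // the full line number serves as the tag
        ++misses_;
        return false;
    }

    std::uint64_t hits() const { return hits_; }
    std::uint64_t misses() const { return misses_; }

private:
    std::size_t line_bytes_;
    std::vector<std::uint64_t> tags_;
    std::vector<bool> valid_;
    std::uint64_t hits_ = 0;
    std::uint64_t misses_ = 0;
};
```

Two addresses whose line numbers differ by a multiple of the line count map to the same slot and evict each other, which is the kind of eviction corner case a test program can provoke deliberately.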
Debugging Features
For development the emulator can dump a variety of internal state:
- Current program counter and liveness mask for each thread group.
- Contents of local, shared and global registers.
- Tile buffer values and lock status.
- Pending memory requests and cache contents.
These facilities make it easier to diagnose issues in shader programs. Because the real hardware would not expose this information directly, the emulator is an invaluable tool for early stage development.
Example Workflow
- Assemble a shader program using the instruction set reference.
- Load the program into the emulator and configure initial registers.
- Spawn work over a region corresponding to the output image.
- Step the emulator in a loop until no threads remain active.
- Dump tile buffer contents to an image file or analyse results in memory.
- Adjust the program as needed and repeat.
This approach can be automated with scripting to run unit tests or performance experiments. Because the emulator is deterministic, identical inputs will always produce identical results, aiding reproducibility.
Performance Considerations
The emulator runs much more slowly than hardware because every cycle is simulated in software. However, it is fast enough to develop small programs and verify their correctness. For large workloads it may be desirable to reduce the output resolution or the number of spawned thread groups during iterative testing.
Extensibility
The emulator shares the same modular design as the hardware. Additional cores can be instantiated to experiment with multi-core scheduling, or new instruction implementations can be added for testing future extensions. Since all behaviour is deterministic and well-defined, contributors can modify the emulator without affecting existing programs as long as they adhere to the architectural rules.
Limitations
While the emulator strives to match hardware behaviour, it does not run inside a cycle-accurate hardware simulator. Timing of individual instructions is approximate and may differ slightly from a real implementation. The emulator also assumes perfect memory coherency within a complex, does not emulate errors, and models bus contention only as part of normal memory latency.
Conclusion
Using the emulator, developers can create and debug VPU programs entirely in software. By following the same sequence of operations that real hardware would perform (loading programs, spawning work, stepping execution and reading results), one can iterate quickly on shader logic before deploying to physical devices.
Integration with the Host Environment
The emulator can be embedded into larger applications or run as a standalone executable. When embedded, host code can directly access the internal memory structures to implement custom I/O or visualisation. The standalone tool reads program binaries and data files from disk, executes the workload and writes results back out for analysis.
Configuration options allow developers to adjust the number of cores, enable or disable traces and choose whether memory contents are initialised to zero or random values. These options make it easier to reproduce hardware behaviour during automated testing.
Tracing and Logging
A built-in tracing mechanism can record the instruction stream with cycle counts. When enabled, every instruction issued by the emulator is logged along with the state of relevant registers. This produces large output but is invaluable when diagnosing subtle issues in complex programs. Developers can enable tracing for specific thread groups or for all groups depending on their needs.
Unit Testing
Because the emulator runs deterministically, it can be integrated into unit test suites. A typical test loads a small program, feeds it known input data and verifies the expected output after a fixed number of cycles. This approach helps catch regressions when modifying the emulator or when optimising shader programs. Tests can also verify that corner cases such as branch divergence or cache eviction behave correctly.
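A fixed-cycle test of the kind described above might be structured as follows. The helper relies only on `step()` and `read_tile_buffer()`, the names used in the embedding example in this document; the element type of the tile buffer is an assumption. `output_matches_after` and `ToyEmulator` are hypothetical, and the toy stand-in (which adds one to every element per cycle) exists only so the sketch can run without the real emulator.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Steps the emulator for a fixed number of cycles, then compares
// tile buffer 0 against the expected contents.
template <typename Emu>
bool output_matches_after(Emu& emu, std::uint64_t cycles,
                          const std::vector<std::uint32_t>& expected) {
    for (std::uint64_t i = 0; i < cycles; ++i) {
        emu.step();
    }
    return emu.read_tile_buffer(0) == expected;
}

// Hypothetical stand-in for the emulator: each cycle increments
// every tile buffer element by one.
struct ToyEmulator {
    std::vector<std::uint32_t> tile;

    explicit ToyEmulator(std::vector<std::uint32_t> input)
        : tile(std::move(input)) {}

    void step() {
        for (auto& value : tile) value += 1;
    }

    std::vector<std::uint32_t> read_tile_buffer(int /*index*/) const {
        return tile;
    }
};
```

Running the same check on two freshly constructed instances with identical input is a cheap way to assert determinism alongside correctness.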
Performance Profiling
While the emulator is not optimised for raw speed, it can collect statistics about program behaviour. Built-in counters track cache hits, misses, number of DMA transfers and cycles spent waiting on locks. These metrics help developers tune their programs and understand where time is spent. Profiling data can be dumped to a file after execution or printed to the console for quick inspection.
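The counters listed above could be gathered in a plain struct and formatted for the console or a file. The field names, the `key=value` report layout and the functions `hit_rate` and `format_report` are hypothetical; the document does not specify how the emulator actually exposes or prints its statistics.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Counters named after the statistics the text mentions.
struct ProfileCounters {
    std::uint64_t cache_hits = 0;
    std::uint64_t cache_misses = 0;
    std::uint64_t dma_transfers = 0;
    std::uint64_t lock_wait_cycles = 0;
};

// Fraction of cache accesses that hit; 0.0 when there were no accesses.
double hit_rate(const ProfileCounters& c) {
    const std::uint64_t total = c.cache_hits + c.cache_misses;
    if (total == 0) return 0.0;
    return static_cast<double>(c.cache_hits) / static_cast<double>(total);
}

// One counter per line, suitable for printing or writing to a file.
std::string format_report(const ProfileCounters& c) {
    std::ostringstream out;
    out << "cache_hits=" << c.cache_hits << '\n'
        << "cache_misses=" << c.cache_misses << '\n'
        << "dma_transfers=" << c.dma_transfers << '\n'
        << "lock_wait_cycles=" << c.lock_wait_cycles << '\n';
    return out.str();
}
```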
Common Pitfalls
- Omitting the finish instruction after issuing external loads results in stale data.
- Spawning too few thread groups leads to underutilisation and exposes memory latency.
- Neglecting to release tile buffer locks can deadlock the emulator just like real hardware.
- Overwriting the same global registers from multiple groups can cause race conditions if not coordinated properly.
Compatibility Notes
The emulator targets the current revision of the VPU architecture. If future hardware revisions introduce new instructions or behaviour, the emulator will be updated accordingly. Older program binaries should continue to function as long as they follow the documented encoding rules.
Further Reading
See the architecture and instruction set documentation for detailed information about how instructions operate and how thread groups interact with the memory system. Understanding those concepts will make it much easier to interpret the emulator's behaviour and debug complex programs.
Emulating Multiple Cores
The emulator can instantiate more than one core to mirror systems that contain a full complex of processing units. Each core operates independently but shares the L2 cache and external memory through the complex object. When running in multi-core mode, the spawner can target specific cores or distribute work evenly across them. This allows experimentation with parallel workloads and helps evaluate how memory bandwidth is shared.
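One simple way to distribute work evenly is to give each core a contiguous band of rows and spawn each band separately. The sketch below computes such bands; the row-band policy and the names `RowRange` and `split_rows` are assumptions for illustration, and the emulator's own distribution strategy may differ.

```cpp
#include <vector>

// Inclusive row range assigned to one core; empty when first > last.
struct RowRange {
    int first;
    int last;
};

// Splits rows 0..height-1 into num_cores contiguous bands whose sizes
// differ by at most one. Earlier cores receive the extra rows.
std::vector<RowRange> split_rows(int height, int num_cores) {
    std::vector<RowRange> ranges;
    if (height < 0 || num_cores <= 0) return ranges;
    const int base = height / num_cores;
    const int extra = height % num_cores;
    int start = 0;
    for (int core = 0; core < num_cores; ++core) {
        const int count = base + (core < extra ? 1 : 0);
        ranges.push_back({start, start + count - 1});
        start += count;
    }
    return ranges;
}
```

Each non-empty range can then be passed as the vertical bounds of a separate spawn request targeting the corresponding core.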
Program Binary Format
Although programs can be loaded directly from arrays, many users store them as binary files for convenience. A simple format places the instruction count as a 32-bit little-endian integer followed by the instruction words in execution order. The emulator's command line wrapper can load such files automatically. Tooling scripts may generate the binary from assembly-like text or from a higher level compiler.
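The format described above (a 32-bit little-endian instruction count followed by the instruction words) can be read and written with a few lines of code. The text does not state the byte order of the 64-bit words themselves, so little-endian is assumed here, as is the strict rule that the file must contain exactly the announced number of words. The function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Reads `size` bytes at `p` as a little-endian unsigned integer (size <= 8).
std::uint64_t read_le(const std::uint8_t* p, std::size_t size) {
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < size; ++i) {
        v |= static_cast<std::uint64_t>(p[i]) << (8 * i);
    }
    return v;
}

// Appends the low `size` bytes of `v` in little-endian order.
void append_le(std::vector<std::uint8_t>& out, std::uint64_t v, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i) {
        out.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
    }
}

// Count header followed by the words in execution order.
std::vector<std::uint8_t> serialize_program(const std::vector<std::uint64_t>& words) {
    if (words.size() > 0xFFFFFFFFu) throw std::length_error("too many instructions");
    std::vector<std::uint8_t> out;
    append_le(out, words.size(), 4);
    for (std::uint64_t w : words) append_le(out, w, 8);
    return out;
}

// Parses the format; rejects truncated files and trailing bytes.
std::vector<std::uint64_t> parse_program(const std::vector<std::uint8_t>& bytes) {
    if (bytes.size() < 4) throw std::runtime_error("missing instruction count");
    const std::uint64_t count = read_le(bytes.data(), 4);
    const std::size_t payload = bytes.size() - 4;
    if (payload % 8 != 0 || payload / 8 != count) {
        throw std::runtime_error("instruction count does not match file size");
    }
    std::vector<std::uint64_t> words;
    words.reserve(payload / 8);
    for (std::size_t i = 0; i < payload / 8; ++i) {
        words.push_back(read_le(bytes.data() + 4 + 8 * i, 8));
    }
    return words;
}
```

Assembling bytes explicitly rather than copying structs keeps the result independent of the host's native byte order.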
Data Input and Output
Input textures or buffers are usually stored in the external SDRAM model. Before starting a run, host code writes these values into the memory image. Upon completion, the emulator can dump regions of memory or tile buffers back to disk. Standard image formats such as PPM are easy to generate from the raw data and help visualise rendering results.
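As an illustration of the PPM route, the following sketch encodes a buffer of 8-bit RGB pixels as a binary (P6) PPM image. How tile buffer or memory contents map to RGB triples is not specified in this document, so the row-major, three-bytes-per-pixel layout is an assumption, and `encode_ppm` is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Encodes row-major 8-bit RGB pixels (3 bytes per pixel) as binary PPM.
std::string encode_ppm(int width, int height,
                       const std::vector<std::uint8_t>& rgb) {
    if (width <= 0 || height <= 0) {
        throw std::invalid_argument("image dimensions must be positive");
    }
    const std::size_t expected =
        static_cast<std::size_t>(width) * static_cast<std::size_t>(height) * 3;
    if (rgb.size() != expected) {
        throw std::invalid_argument("pixel buffer size does not match dimensions");
    }
    // Header: magic number, dimensions, maximum channel value.
    std::string out = "P6\n" + std::to_string(width) + " " +
                      std::to_string(height) + "\n255\n";
    // Raw pixel bytes follow the header directly.
    out.append(reinterpret_cast<const char*>(rgb.data()), rgb.size());
    return out;
}
```

To save the image, write the returned string to a `std::ofstream` opened with `std::ios::binary` so that pixel bytes are not altered by newline translation.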
Debugging Tips
- Step the emulator one cycle at a time when first bringing up a new program to observe state changes.
- Enable tracing selectively for a small number of thread groups to reduce log size.
- Use the liveness mask display to verify that branches are taken as expected.
- Compare outputs against a reference software implementation when possible.
Conclusion
With these capabilities the emulator forms a complete development environment for VPU programs. It offers repeatable results, extensive debugging support and enough configurability to mimic a range of hardware setups. Mastery of the emulator paves the way for efficient deployment on actual VPU hardware once available.
Command Line Options
When used via the standalone tool, the following options are available:
- --program <file>: Load a binary program file.
- --image <file>: Preload external memory from an image or data dump.
- --trace: Enable instruction trace logging.
- --steps <count>: Execute a fixed number of cycles and then stop.
- --dump <file>: Write tile buffer or memory contents to a file after execution.
These options can be combined to create automated scripts that build, run and verify shader programs as part of a continuous integration workflow.
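For scripts or wrappers that need to interpret the same flags, a parser might look like the sketch below. Only the five option names come from this document; the `Options` struct, the `parse_options` function and the error handling are hypothetical and do not describe how the standalone tool is implemented.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

struct Options {
    std::string program;       // --program <file>
    std::string image;         // --image <file>
    std::string dump;          // --dump <file>
    bool trace = false;        // --trace
    bool has_steps = false;    // true when --steps was given
    std::uint64_t steps = 0;   // --steps <count>
};

// Parses arguments (excluding the program name). Throws
// std::invalid_argument on unknown flags or missing values.
Options parse_options(const std::vector<std::string>& args) {
    Options opt;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& flag = args[i];
        // Consumes and returns the argument that follows the current flag.
        auto value = [&]() -> const std::string& {
            if (i + 1 >= args.size()) {
                throw std::invalid_argument("missing value for " + flag);
            }
            return args[++i];
        };
        if (flag == "--program") {
            opt.program = value();
        } else if (flag == "--image") {
            opt.image = value();
        } else if (flag == "--dump") {
            opt.dump = value();
        } else if (flag == "--steps") {
            opt.steps = std::stoull(value());
            opt.has_steps = true;
        } else if (flag == "--trace") {
            opt.trace = true;
        } else {
            throw std::invalid_argument("unknown option " + flag);
        }
    }
    return opt;
}
```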
Example Session
$ vpu_emulator --program blur.vpu --image input.ppm --trace --dump output.ppm
loading program blur.vpu (256 instructions)
loading input.ppm into external memory
spawning work across 320x200 pixels
executing ...
dumping results to output.ppm
Running the emulator in this manner lets developers quickly iterate on programs and view the resulting images. Trace output can be enabled to inspect the exact instruction sequence for any problematic pixels.
Advanced Features
Some applications integrate the emulator into larger simulation frameworks. Because the emulator exposes the same APIs as real hardware, it can be driven by a cycle-accurate system simulator or even connected to custom peripherals. Advanced users may implement callbacks that trigger when specific memory locations are written, or patch instructions on the fly to emulate self-modifying code.
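The write-triggered callbacks mentioned above could be layered over a memory model along these lines. This is a sketch only: `WatchedMemory`, its word-addressed 32-bit layout and its `watch`, `write` and `read` methods are hypothetical and are not part of the emulator's API.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

class WatchedMemory {
public:
    using Callback = std::function<void(std::uint64_t addr, std::uint32_t value)>;

    explicit WatchedMemory(std::size_t words) : data_(words, 0) {}

    // Register a callback that fires on every write to `addr`.
    void watch(std::uint64_t addr, Callback cb) {
        watches_[addr].push_back(std::move(cb));
    }

    // Store the value, then notify watchers of that address.
    // Throws std::out_of_range for addresses beyond the memory size.
    void write(std::uint64_t addr, std::uint32_t value) {
        data_.at(static_cast<std::size_t>(addr)) = value;
        const auto it = watches_.find(addr);
        if (it != watches_.end()) {
            for (const auto& cb : it->second) cb(addr, value);
        }
    }

    std::uint32_t read(std::uint64_t addr) const {
        return data_.at(static_cast<std::size_t>(addr));
    }

private:
    std::vector<std::uint32_t> data_;
    std::map<std::uint64_t, std::vector<Callback>> watches_;
};
```

Because the callback runs after the store, a watcher that reads the address back observes the new value, which suits logging or a memory-mapped peripheral model.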
An embedding example in C++ might look like:
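VpuEmulator emu;
// Write the instruction words into program memory.
emu.load_program(my_program);
// Initialise a global register before launching work.
emu.write_global(0, 0, initial_data);
// Launch work over the whole image, starting at entry_pc.
emu.spawn_work(0, 0, width - 1, height - 1, entry_pc, 0);
// Step one cycle at a time until no thread groups remain active.
while (emu.active_threads()) {
    emu.step();
}
// Read back tile buffer 0 for inspection.
auto result = emu.read_tile_buffer(0);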
This code snippet sets up the emulator, runs until completion and retrieves the final tile buffer for inspection.