Summary
Dragon is a hobby project to design and run a "fantasy console" that never existed. The actual project is a combination of design documents, SystemVerilog HDL source code, system software, and driver and utility programs.
This document gives a high-level overview of the system architecture, design decisions, and major components. There will be accompanying articles providing more detail on key components.
High-Level Goals
Dragon aims to be a console with "power approximately equal to the Nintendo 64 or PlayStation 1", while also leaning into concepts that would not become mainstream until some years later. Our goals are:
- Enjoy making the thing.
- Focus on the GPU and 'game console' aspects; we're not looking to design our own CPU as well.
- It must run both in Verilator simulation and on real FPGA hardware hooked up to a TV or monitor.
- Pull off something that looks like it could have existed in 1997.
- Be able to execute user-defined shaders instead of just a fixed-function graphics pipeline.
- Powerful enough that as a kid in the late nineties we would have said "Wow, that looks awesome", assuming we can make or pay for decent art assets.
- After the system works and has at least one functional game showing off its capabilities, do the work to transform this to a mostly-there ASIC design, purely to understand roughly what the footprint and layout of this system on a real IC would be like. Also, produce a sweet poster showing all the gates/metal/etc. from the GDS files 😄
FPGA Target Hardware
The hardware design is targeted to synthesize and run on real FPGAs. Our preliminary target is the Xilinx 7-Series Artix-7 200T part, housed on the Digilent Nexys Video board. We may eventually target other FPGAs, but this will likely influence overall performance, as we are heavily dependent on decent-size block-RAM-based caches; a smaller board may technically meet the logic requirements of our design but have too few or too slow resources to run effectively.
Initially, we were targeting execution on the ULX3S, which is a fantastic device that we both like working with due to its completely open-source toolchain and low price tag. After more analysis of the bandwidth and logic required for our goals, we found that the ECP5 85F is simply too limiting for this use case, which is why we moved to the Xilinx device. We'd be very happy if Lattice (or some new player) put out a similarly accessible FPGA with embedded memories and logic capacity on par with higher-end Xilinx devices, and would probably switch over.
System Overview
- RISC-V CPU - The main CPU for running software on the Dragon system. It is a single-core 32-bit RISC-V core, of which there are many good implementations already out there in the wild.
- HDMI Display Controller - Responsible for reading framebuffer from memory and displaying over HDMI to a screen.
- Unified Memory - This is our system memory which is utilized by most components.
- IO - General IO used to communicate with the Developer PC.
- CXB Switching Fabric - A custom switching fabric that serves all read/write requests between system components.
- Dragon GPU - The heart of our project, the Dragon GPU is composed of several functional pieces that enable tile memory operations, texture fetch, rasterization, and both graphics and compute shaders.
Notably missing is any specialized sound device (we intend to develop a basic mixer, but it's unclear at this point if that will be in software or some assisting hardware).
RISC-V CPU
We've gone with RISC-V since multiple high-quality open-source implementations exist, and toolchains are available for multiple host platforms, meaning we don't need to write our own toolchain just to produce software. Also, the simple single-core RISC-V setup allows us to quickly add support for a CPU instruction cache and the like without worrying about cache invalidation/inconsistency across cores.
Currently, we use the picorv32 core in our design and have not had any issues with it so far.
Unified Memory
This is our system memory, global memory, whatever you'd like to call it. On the Digilent board, this is multiple gigabytes of DDR3, which is way larger and faster than anything a console of that era would have supported. Unlike many consoles of the day, Dragon does not have the benefit of multiple memories each dedicated to a component, so the CPU, GPU, IO, etc. all share this common resource. DDR3 is also extremely complicated to interact with, so we will likely use someone else's DDR3 controller for interfacing with the memory chips themselves.
Since our project began its life on the ULX3S (until we realized we needed more and faster resources), we also have an existing SDRAM controller which works just fine.
HDMI Display Controller
Currently, we only support a single output resolution and framerate, 640x480 at 60 FPS, which we call 480p60. This mode is selected because the HDMI standard guarantees that all conformant displays will support this resolution. The HDMI controller is easy to interact with; there are a couple of MMIO registers that determine where the current framebuffer for drawing is located in system memory ("Unified Memory").
Because we are actually initially targeting 320x240 at 30 FPS (240p30), the operation is like this: we have one cache of data for the line we're currently putting out to HDMI. That line is displayed for two lines of actual output, since 240p is half of 480p. Whenever we begin displaying content for one line, we start filling a second cache with the next visible line of the framebuffer. These two caches ping-pong back and forth. That's it!
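This line-doubling schedule can be sketched in a few lines of Python (an illustration of the scheme described above, not the actual HDL; the cache-fill timing is simplified, and the names are ours):

```python
# Model of the 240p-over-480p line doubling with two ping-ponging line
# caches. Illustrative only; the real controller works in hardware.
FB_HEIGHT = 240          # visible framebuffer lines (240p)
OUTPUT_LINES = 480       # lines actually sent over HDMI (480p)

def line_schedule():
    """Yield (output_line, fb_line_shown, cache_index) for one frame."""
    for out in range(OUTPUT_LINES):
        shown = out // 2          # each framebuffer line is displayed twice
        assert shown < FB_HEIGHT
        cache = shown % 2         # the two caches ping-pong back and forth
        yield out, shown, cache
```

While one cache drives the two output lines for framebuffer line N, the other cache is free to fill with line N+1.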
IO Block
This is a connection between the system and our developer PCs. It's how we can send data to the system for programming RAM, read the state of MMIO registers via Python scripts on our PCs, etc. We only support UART right now since it's incredibly simple (and we still made mistakes integrating that), but in the future, we'd like to support Digilent's proprietary high-speed IO protocol, or even better, actual basic ethernet support, which has plenty of bandwidth and is very convenient to interact with from the PC side.
CXB
Nope, we aren't using AXI! We have developed a variant of wormhole switching called "Cheshire Bus" (CXB). CXB is how all reads and writes on the system reach their destination. Initially, we had a graph of simple switches for interfacing system components together, but due to the complexity of choosing the right topology to avoid deadlocks, we are moving to a simpler torus configuration for switches.
The design of CXB is well-documented in this PDF, though it mentions the older 2-input/2-output switching node, which we are moving beyond.
Dragon GPU
An incomplete diagram of the GPU, sufficient for the purposes of this article:
Command Buffer
The GPU is centered around the execution of a single queue (FIFO) of "commands". Each command is either
- A 32-bit value written to an internal register, or
- A "trigger" to begin work in a functional unit of the GPU.
The GPU represents a large portion of the addressable memory on the system. Almost all of the address bits representing the GPU are used to encode internal register indexing. This was a design decision to simplify pushing data to the GPU command buffer. Instead of needing to write e.g. 64 bits of data containing a 32-bit value plus an index for which register to write to, the user only needs to issue a 32-bit write where part of the address encodes the index of the GPU register to write to.
All of these writes push data into the GPU command queue for eventual processing. The GPU will execute commands as quickly as possible, though there are exceptions (see Synchronization Primitives below).
Execution Engine (EE)
The Execution Engine is responsible for reading the front of the command buffer FIFO and doing one of three things:
- Performing the internal register write,
- Triggering a functional unit to start doing work, such as drawing triangles, storing a tile buffer, etc., or
- Waiting on previous operations to signal completion.
There are additionally a few MMIO registers that allow software to control the state of the EE. They can pause the command buffer execution, reset its state, etc.
GPU Synchronization Primitives (SIG Bits)
There are multiple functional units in the GPU, and any number of them could be executing concurrently. During the execution of the command buffer, before starting a "Draw Triangles" type command, we would want to ensure that e.g. a previous "Load Shader Program" command has been completed. For this purpose, we've implemented a fencing mechanism called SIG bits.
When triggering a functional unit to do work by writing to an MMIO register named EXEC_*, the value written to this register is a set of "signal bits" which are remembered by that functional unit. The command will begin execution and then the command buffer proceeds onwards without waiting. If a write in the command queue is to the special SIG_WAIT register, then the command queue will halt processing until the bits specified for the wait are written by previous operations. Once the waited-upon bits have been written by all the functional units completing their work, the bits are also cleared. There were originally several registers for managing SIG_* concepts, but after discussion, we realized the use cases are limited enough for the above to work. The concept of signal bits enables, for instance, the DMA of one tile buffer to Unified Memory while we continue rendering into other tile buffers.
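The fence semantics fit in a few lines of Python (a toy model: the class and method names are invented, only the set/wait/clear behavior follows the description above):

```python
class SigBitsModel:
    """Toy model of SIG-bit fencing: EXEC_* triggers promise bits,
    completing units raise them, and SIG_WAIT succeeds only once all
    waited-on bits are up, clearing them as it does."""
    def __init__(self):
        self.signalled = 0

    def unit_complete(self, sig_mask: int):
        # A functional unit finishes and writes its remembered signal bits.
        self.signalled |= sig_mask

    def sig_wait(self, sig_mask: int) -> bool:
        # Returns True (and clears the bits) once all waited-on bits are
        # set; the real command queue would stall until this succeeds.
        if (self.signalled & sig_mask) == sig_mask:
            self.signalled &= ~sig_mask
            return True
        return False
```

Because the bits are cleared on a successful wait, the same bit positions can be reused fence after fence.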
Tile-Based Rendering
One of the key considerations for the Dragon GPU system architecture is the single shared RAM ("Unified Memory") resource and limited cache resources in the embedded memory of the FPGA. Random reads and writes from the GPU are unlikely to be efficient, both in terms of frequent accesses and turnaround time to change banks in RAM, etc.
For these reasons, we've designed the GPU to operate as a tile-based renderer. We render sections of the screen into embedded memory called "tiles". In a basic render, we clear a tile, render many polygons into the tile color buffer, and then do one write of that tile color data to unified memory. Each conceptual tile is composed of 1-4 Tile Buffers, depending on what the programmer wants. The tile buffers are called TB0 through TB3. As an example, we might use TB0 as a color buffer which the user eventually will see on the screen. TB1 may be used to represent a depth buffer to do proper hidden surface removal. In the case of a depth buffer in TB1, we do not necessarily need to load or store this tile's data to unified memory at all. It can simply be cleared and ignored after rendering. Vulkan (and other modern APIs which make tile-based rendering first-class) has this same concept, called "transient attachments".
Tiles are explicitly loaded and unloaded through a dedicated functional unit, triggered by an EXEC_-type command in the command buffer.
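Put together, a frame on a tile-based renderer follows a loop like this sketch (the callbacks stand in for the triggered functional units; tile size matches the 16x16 tile buffers described later):

```python
TILE = 16  # tile dimensions match the 16x16 tile buffers

def render_frame(fb_w, fb_h, draw_tile, store_tile):
    """Sketch of one frame: per tile, clear, draw, then one store.
    draw_tile and store_tile stand in for real functional units."""
    for ty in range(0, fb_h, TILE):
        for tx in range(0, fb_w, TILE):
            color = [[0.0] * 4 for _ in range(TILE * TILE)]  # clear TB0
            depth = [1.0] * (TILE * TILE)                    # clear TB1
            draw_tile(tx, ty, color, depth)  # render many polygons
            store_tile(tx, ty, color)        # one DMA to Unified Memory
            # depth is never stored: a "transient attachment"
```

The point is the memory traffic pattern: many small reads and writes stay inside the tile, and unified memory sees one large sequential store per tile.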
VPU
The star of the GPU is the Vector Processing Unit (VPU), which performs compute and shading computations. A single VPU runs a single program but has 8 VPT threads, which concurrently (but not in parallel) execute the same program, each on its own work item. There are 4 VPUs in a GPU in our current design. While there is a long pipeline to begin execution of a work item, and each instruction takes significantly longer to retire compared to an earlier SIMD model, our use cases are not latency sensitive and the extra bandwidth is more important. VPT execution is pipelined so that each VPT executes some logic, then the next VPT executes its logic, and so on.
The VPU is fed from a VPU Work Queue with "Work Items". These could be unshaded fragments coming from the rasterizer, or a more direct "DMA from Unified Memory -> VPU" path which is how we implement generic compute on the device.
All VPUs in the GPU must be operating on the same program and work uniformly. They serve as a scaling mechanism for work being performed. This is also why there is only one command buffer FIFO. It is not possible, for instance, for a DMA of TB0 in VPU "A" to be going on while normal rendering is happening into TB0 of VPU "B". Do note, however, that VPU work items can potentially be configured to start the VPU program at different entry points, so different work items could be performing different work internally.
Note: Originally the VPU had 4 VPC cores which each ran the same instruction in parallel. The switch to the VPT SMT+SIMD model introduced more latency, but meant we could achieve greater pipelining and utilization, reaching high clock speeds with lower overall DSP usage.
Here is a diagram of what the logical hierarchy of Screen Framebuffer/VPU/Tile looks like:
Some observations to hopefully make the hierarchy here clear:
- There are 4 VPUs * 8 VPTs = 32 threads of concurrent work at any moment.
- On a single VPU, only one VPT is ever running on any given cycle.
VPU Tile Buffers
Each VPU has 4 Tile Buffers ("TB"s) which can be read and written by a VPT executing a program. Each tile buffer is 16x16, and each entry holds an f16x4 element.
The Tile Buffers can also be completely written to Unified Memory or loaded from Unified Memory, known as Store and Load operations. A VPU program has no instruction to read general System memory, and an executing VPT has no way to communicate with other VPTs, even on the same VPU. Related, the VPT is scheduled for a particular "pixel" in the tile buffer and cannot load/store to any other part of the tile buffers but the one it is assigned. So the only information a VPT can read is from whatever is in the Tile Buffers at its assigned location, data from look-up tables embedded in the current VPU program, and via texture fetch. VPTs may only store data within "their pixel" of the tile buffers.
Tile Buffer load/store happens through a DMA engine which can optionally transform the native f16x4 data format (see below) of the VPT registers and tile buffers into an ARGB1555 format, which mimics that of the PS1.
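A sketch of that Store-path conversion (the exact bit layout, with A in bit 15 and then 5 bits each of R, G, B, is an assumption for illustration; the real DMA engine may pack differently):

```python
def f16x4_to_argb1555(r, g, b, a):
    """Pack a [0,1]-range f16x4 color into a 16-bit ARGB1555 word.
    Bit layout assumed: A:1 (bit 15), R:5, G:5, B:5."""
    def q5(x):
        # Clamp to [0,1] and quantize to 5 bits.
        return max(0, min(31, round(x * 31)))
    abit = 1 if a >= 0.5 else 0
    return (abit << 15) | (q5(r) << 10) | (q5(g) << 5) | q5(b)
```

Halving the footprint to 16 bits per pixel on store is what makes the PS1-style framebuffer cheap to keep in unified memory.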
VPU Texture Fetch Support
VPU programs can trigger a texture fetch from one of a set of configured textures for a given (u,v) coordinate in that texture. All texture fetches go through a small texture cache. The fetch has no filtering today. This is a somewhat deliberate choice, since we like this look from the PS1, but with a large enough texture cache we could likely scale up to the needs of bilinear filtering (naively ~4x memory bandwidth) if we do the interpolation in the VPU program itself.
VPU Data Format (f16x4)
All VPT registers and tile buffer data are natively stored and operated on in a format called f16x4, which means a four-wide vector of 16-bit floats. The 16-bit floats are very similar to IEEE-754, but we do not worry about denormals, infinity, etc. to simplify the logic.
Each float16 is represented as 1 sign bit, 5 bits to express exponent, and 10 bits of mantissa (SeeeeeMMMMMMMMMM), interpreted as (-1)^S * (1 + M/1024) * 2^(e - 15):
- If e = 0, then this indicates a value of 0.
- The smallest-magnitude non-zero representable number is 2^-14 ≈ 6.1 * 10^-5 (e = 1, M = 0).
- The largest-magnitude representable number is (2 - 2^-10) * 2^16 = 131008 (e = 31, M = 1023).
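A reference decoder makes the semantics concrete (a model only, assuming the IEEE-style bias of 15; the RTL may differ in details):

```python
def f16_decode(bits: int) -> float:
    """Decode the simplified 16-bit float: 1 sign bit, 5 exponent bits,
    10 mantissa bits; no denormals/infinity/NaN. Bias 15 is assumed."""
    s = (bits >> 15) & 0x1
    e = (bits >> 10) & 0x1F
    m = bits & 0x3FF
    if e == 0:
        return 0.0  # no denormals: a zero exponent simply means zero
    return (-1.0) ** s * (1.0 + m / 1024.0) * 2.0 ** (e - 15)
```

Dropping denormals, infinities, and NaN keeps the hardware path to a simple shift-and-add normalize.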
VPU Programming Capabilities
VPU programs do not yet have a high-level language, and the list of operations is very small, but each operation has a lot of optional flags for manipulating and masking inputs/outputs, making the ISA very expressive.
Each of the 8 VPTs within a VPU, from a programmer's perspective, has 16 registers it may reference for reads/writes (r0-r15).
There are also 16 "global registers" (g0-g15) whose values may be written only by the GPU execution engine. These are useful so that the same program may run the same logic many times with e.g. different transformation matrices or other information updated between batches of compute.
Instruction inputs may come from one of three locations:
- The 16 "local" registers, r0-r15
- The 16 "global" read-only registers, g0-g15
- Input B can additionally choose from 32 constants, c0-c31, built into the VPU itself (for commonly used constants)
VPTs may only perform writes to local registers or to the tile buffer for the particular location they have been assigned.
Some highlights of the ISA until a more in-depth document is produced:
- Every instruction is uniformly encoded in 64 bits.
- The (x,y,z,w) components of each of the A and B inputs may be independently and arbitrarily "shuffled" before the instruction executes any logic; this is similar to how shuffling (aka "swizzling") works in GLSL, HLSL, etc.
- Each component of the instruction result may be conditionally written to the result register.
- It is also possible to invert the sign of all components of A and/or B inputs.
- There is currently no support for integer operations.
These parameters make for a very expressive set of instructions. For instance, though there is only an add and no sub instruction, you can subtract by setting the invert-sign bit for input B and then adding. Similarly, there is a mul instruction, but no dot product; with 3 instructions a dot product can be achieved. With some careful work, a full 4x4 matrix-vector multiply can be done in 12 instructions.
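To make the 3-instruction dot product concrete, here is a small Python emulation of the shuffle-plus-add trick (the swizzle semantics follow the ISA description above; the helper functions are our own, not part of any real toolchain):

```python
IDX = {"x": 0, "y": 1, "z": 2, "w": 3}

def shuf(v, pattern):
    """GLSL-style swizzle, e.g. shuf(v, "xxyy") -> [v.x, v.x, v.y, v.y]."""
    return [v[IDX[c]] for c in pattern]

def dot4(a, b):
    """4-component dot product in 3 vector instructions: one mul and
    two shuffled adds that reduce across the lanes."""
    t = [x * y for x, y in zip(a, b)]                              # 1: mul
    t = [p + q for p, q in zip(shuf(t, "xxyy"), shuf(t, "zzww"))]  # 2: add
    t = [p + q for p, q in zip(shuf(t, "xxyy"), shuf(t, "zzww"))]  # 3: add
    return t[0]  # after the reduction, every lane holds the full dot
```

After the second add, all four lanes contain the same scalar, which is convenient for feeding straight into a following vector instruction.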
Some critical functions like sqrt, inverse, and sin are based on approximations from lookup tables.
A sketch of normalizing a 4-vector a and storing the result into b:

```
b = mul(a, a)
b = add(b.xxyy, b.zzww)
b = add(b.xxyy, b.zzww)
b = sqrt(b)
b = inv(b)
b = mul(a, b)
```
Naively performing a 4x4 matrix-vector multiplication would involve a load for each row of the matrix and performing dot products between the input vector and each row, totaling at least 16 instructions. By pre-loading the global registers with matrix data before invoking the shader on many inputs, we can avoid the loads. Moreover, the additions would normally take 8 instructions; however, because a naive implementation performs redundant additions, it's possible to interleave operations for multiple rows together, reducing the instruction count to 10, down from 12:
```
# 4x4 matrix * 4x1 vector multiplication
# Input is r0, Output is r0, matrix is g[0-3]
r4 = r0 * g0
r5 = r0 * g1
r4.xy__ = r4.xy__ + r4.zw__
r4.__zw = r5.__xy + r5.__zw
r5 = r0 * g2
r6 = r0 * g3
r5.xy__ = r5.xy__ + r5.zw__
r5.__zw = r6.__xy + r6.__zw
r0.xy__ = r4.xz__ + r4.yw__
r0.__zw = r5.__xz + r5.__yw
```
Finally, note that programs have a variety of arguments, but these are all specified by register writes. The DMA engine or rasterizer pushes work to the VPU Work Queue by actually pushing local register writes. If, for instance, the rasterizer is configured to place screen position, color, and triangle UV into r0, r1, and r2 (VPU program parameters are always written to the first registers), then the rasterizer will push "writes" for these registers with the appropriate values, the last one having a flag set which signifies this work item is done, then loop around to the next work item. As the VPU executes the writes into the next unscheduled VPT and sees this flag, the work item will begin executing the program.
VPU by the Numbers
Some statistics on the VPU design and throughput estimates
- We expect to achieve 200 MHz clock rates on VPU execution.
- Each instruction in a VPT takes 24 cycles; however, there are 8 concurrently executing VPTs, so the aggregate instruction throughput per VPU is one instruction every 3 cycles. At 200 MHz, this comes out to ~66.6 million instructions per second per VPU.
- With 4 VPUs operating in parallel, total instruction throughput is ~266 million instructions per second per GPU.
- Each math-related instruction generally operates on 4 floats, giving on the order of ~1 GFLOP/s of theoretical arithmetic throughput. Unified memory bandwidth is nowhere near this number which does highlight that maxing out the GPU and RAM will entail operating within tile memory as much as possible.
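The arithmetic behind these estimates, spelled out:

```python
# Back-of-the-envelope throughput numbers from the figures above.
CLOCK_HZ = 200e6          # target VPU clock
VPT_INSTR_CYCLES = 24     # cycles for one VPT to retire one instruction
VPTS_PER_VPU = 8
VPUS_PER_GPU = 4
FLOATS_PER_INSTR = 4      # f16x4 lanes per math instruction

cycles_per_instr = VPT_INSTR_CYCLES / VPTS_PER_VPU   # 3 cycles/instruction
vpu_ips = CLOCK_HZ / cycles_per_instr                # ~66.7M instr/s per VPU
gpu_ips = vpu_ips * VPUS_PER_GPU                     # ~266.7M instr/s per GPU
gpu_flops = gpu_ips * FLOATS_PER_INSTR               # ~1.07 GFLOP/s per GPU
```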
Rasterization Approach, Handling Vertex Attributes
To draw triangles ("rasterization") we make use of the edge-equation approach. This is a very elegant approach that avoids sorting and other complex logic and is also easy to parallelize.
It's very tempting to give a sketch of how this algorithm works and its benefits, especially for hardware, but honestly you should just read this fantastic write-up, or another great write-up by ryg.
We take each 16x16 tile and subdivide it in a way that each VPU, numbered 0-3, gets a nearby part of that tile:
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
Dividing the tile's 2D space in this way so that 0,1,2, and 3 all receive a spatially similar part of the tile helps ensure that nearby pixels make similar texture fetches, thus improving the texture cache hit rate.
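The 2x2 checkerboard above is just the low bit of each coordinate (a sketch of the mapping; the real hardware wiring may differ):

```python
def vpu_for_pixel(x: int, y: int) -> int:
    """Map a pixel within a tile to one of the 4 VPUs in the 2x2
    checkerboard pattern shown above."""
    return (y & 1) * 2 + (x & 1)

def checkerboard(w=16, h=16):
    """Regenerate the assignment grid for a whole 16x16 tile."""
    return [[vpu_for_pixel(x, y) for x in range(w)] for y in range(h)]
```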
The output of the rasterization process is a set of pixels and un-normalized barycentric weights. After normalization, each of the triangle attributes which are configured for interpolation are scaled by the barycentric (aka uv) coordinates and entered into the VPU work queue.
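A sketch of that last step, normalizing the edge weights and interpolating whichever attributes are configured (illustrative only; on Dragon this math is fixed-function hardware feeding the VPU work queue):

```python
def interpolate_attrs(w, verts):
    """w: un-normalized edge-equation weights (w0, w1, w2) for one pixel.
    verts: one attribute vector per triangle vertex.
    Returns the perspective-free interpolated attribute vector."""
    total = sum(w)
    bary = [wi / total for wi in w]  # normalize to barycentric coordinates
    return [sum(b * v[i] for b, v in zip(bary, verts))
            for i in range(len(verts[0]))]
```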
GPU Performance Counters
We are professional software developers, not professional hardware designers, so we suspect there will be many times that we need to peer underneath and see which components of the system are taking too long to execute and then correct either hardware decisions or just improve software. If ever there is another software developer making something for the system, they will also want to see this information 😄
There is a performance counters system within the GPU which helps to understand how many times various things happen and the amount of time taken by operations. The GPU has a small dedicated memory called the Performance Counter Area (PCA) which is effectively just an array of 32-bit values. Within the command buffer, a "write" to a performance counter MMIO register will do two things: 1) the lower bits of the value written determine at what index the current value of the selected performance counter should be written in the PCA, and 2) the most-significant bit determines if the counter should be reset to zero.
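From the command-stream side, a capture might be encoded like this (the exact field widths are assumptions; only the low-bits-index / most-significant-bit-reset split comes from the design):

```python
RESET_BIT = 1 << 31  # MSB of the written value: reset the counter too

def perf_capture_word(pca_index: int, reset: bool = False) -> int:
    """Value to write to a performance-counter MMIO register: the lower
    bits select which PCA slot receives the counter's current value, and
    the MSB optionally resets the counter to zero."""
    assert 0 <= pca_index < RESET_BIT
    return (RESET_BIT if reset else 0) | pca_index
```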
There are a variety of useful performance counters. To illustrate, here are some of the available performance counters that are updated by hardware continuously and able to be captured:
- PERF_GPU_CYCLES - The number of cycles elapsed on the GPU. This serves as a general 'timestamp' mechanism which can be used to measure time when combined with the fixed GPU clock speed.
- PERF_VPU_CYCLES_IDLE - The number of cycles that the VPU was idle, performing no work.
- PERF_VPU_CYCLES_STALL - The number of cycles that the VPU/VPTs were stalled waiting on a memory request to complete.
- PERF_VPU_CYCLES_TOTAL - The number of cycles that the VPU was executing in total. This is not necessarily the same as PERF_GPU_CYCLES.
- PERF_VPU_FRAGMENTS_SHADED - The total number of entries that completed processing through the VPU (reached the end of the program).
- PERF_RASTERIZER_FRAGMENTS_ENQUEUED - The total number of fragments enqueued from the rasterizer to the VPU.
- PERF_RASTERIZER_CYCLES_ENQUEUED - The total number of cycles where at least one fragment was enqueued to the VPU.
- PERF_RASTERIZER_CYCLES_DISCARD - The total number of cycles where the rasterizer was operating but did not produce at least one fragment into the fragment queue.
- PERF_RASTERIZER_CYCLES_TOTAL - The total number of cycles where the rasterizer was executing in total. This is not necessarily the same as PERF_GPU_CYCLES.
- PERF_GPU_CMDBUF_COMMANDS_TOTAL - The number of completed commands.
- PERF_GPU_CMDBUF_CYCLES_WAITING - The total number of cycles where the command buffer was stalled waiting for an operation to complete before proceeding.
After performance counters are written to the PCA, the software can read this area back and interpret the results, maybe show them on-screen, or stream them over to the developer's PC.
JTAG, Debugging, and ddb
We do not currently utilize any type of JTAG for testing or communication. The IO block is directly connected to CXB in such a way that we can issue reads and writes from the developer's PC, meaning we can write framebuffers, CPU program code, etc. all from a Python driver program. This is also how we do some of our debugging today, reading MMIO registers. It's not a complete solution like chipscope or similar embedded logic analyzers, but it does work quite well for our use cases.
We have plans to develop ddb, a kind of debugger application running on the developer's computer, which can...
- Read and write CXB from the host computer (this part is already done)
- Register and service general streams. For instance, ddb can register stream0 as the stdout of the system so that all printf(..)-style output can be read from the developer's PC. stdin/stderr would similarly be possible. Currently, this is planned to be a simple ring buffer per stream.
  - With the handling of streams, we can also bridge two Dragon consoles over the internet for multiplayer possibilities. Think DragonA -> HostA -> (internet) -> HostB -> DragonB
- "Resources", where Dragon software can request a host resource that the PC has registered and then read it. This enables Dragon software to request file data at a convenient path like '/game/assets/song.wav' and then read that resource as if it had access to a filesystem.
ddb use cases are plentiful and exciting. The main limits here are the IO latency and bandwidth characteristics. At the moment, we only operate over UART using a simple command request/reply scheme, which is both high latency and low bandwidth. With general ethernet, the story is much, much better, though the work needed to get that up and running is a little vague right now. Likely we can find an existing ethernet implementation to get this bootstrapped.
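The per-stream ring buffer could be as simple as this toy model (sizes and API invented for illustration; overrun handling is deliberately omitted):

```python
class StreamRing:
    """Toy per-stream ring buffer for ddb: the console appends (e.g.
    printf output on stream0) and the host drains. No overrun handling."""
    def __init__(self, size: int = 4096):
        self.buf = bytearray(size)
        self.head = 0  # producer (console) position
        self.tail = 0  # consumer (host) position

    def write(self, data: bytes):
        for byte in data:
            self.buf[self.head % len(self.buf)] = byte
            self.head += 1

    def drain(self) -> bytes:
        """Return everything written since the last drain."""
        out = bytes(self.buf[i % len(self.buf)]
                    for i in range(self.tail, self.head))
        self.tail = self.head
        return out
```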