DirectX11 Interop CUDA Mandelbrot Fractal
I updated my DirectCompute Mandelbrot fractal demo so that it can render with either CUDA or DirectCompute. The program is a simple Mandelbrot renderer, but you can switch between the CUDA and DirectCompute renderers at runtime.
The NVIDIA CUDA 3.2 SDK includes a DirectX11 interop sample, “Simple D3D11 Texture”, which I used as a reference. For the 3.2 SDK, I think this is the only CUDA sample using D3D11. The SDK sample uses an ID3D11Texture3D as a CUDA resource, while my Mandelbrot fractal program uses an ID3D11Buffer as a CUDA resource.
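This is not the demo’s actual code, but roughly what the buffer interop looks like with the CUDA 3.2 runtime API; the kernel, device, and buffer names here are made-up placeholders:

```cpp
#include <d3d11.h>
#include <cuda_runtime.h>
#include <cuda_d3d11_interop.h>

// Placeholder kernel: writes something per pixel into the mapped buffer.
__global__ void fillPixels(uchar4* pixels, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;          // guard for partial edge blocks
    pixels[y * width + x] = make_uchar4(x & 255, y & 255, 0, 255);
}

cudaGraphicsResource* g_cudaBuffer = NULL;

void registerBuffer(ID3D11Device* device, ID3D11Buffer* buffer)
{
    // CUDA 3.2-era setup: tell CUDA which D3D11 device it will share with.
    cudaD3D11SetDirect3DDevice(device);

    // Register the D3D11 buffer so CUDA can map it each frame.
    cudaGraphicsD3D11RegisterResource(&g_cudaBuffer, buffer,
                                      cudaGraphicsRegisterFlagsNone);
}

void renderWithCuda(int width, int height)
{
    // Map the buffer and get a raw device pointer the kernel can write to.
    cudaGraphicsMapResources(1, &g_cudaBuffer, 0);

    uchar4* pixels = NULL;
    size_t numBytes = 0;
    cudaGraphicsResourceGetMappedPointer((void**)&pixels, &numBytes, g_cudaBuffer);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    fillPixels<<<grid, block>>>(pixels, width, height);

    // Unmap so D3D11 can use the buffer again when drawing the frame.
    cudaGraphicsUnmapResources(1, &g_cudaBuffer, 0);
}
```

The same flow works for textures (as in the SDK sample), except the mapped resource is accessed as a cudaArray via cudaGraphicsSubResourceGetMappedArray instead of a raw device pointer.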
I mainly used the CUDA 3.2 SDK programming manual to learn CUDA, but I also read the NVIDIA “CUDA by Example” book on Safari/InformIT. The book is very easy to follow and I read it in one sitting; it teaches the basics of CUDA in simple language, and I thought it was a good first book.
My old DirectCompute Mandelbrot fractal viewer has a bug (oh no!) when the screen dimensions are not evenly divisible by the thread group dimensions; I fixed this in my new CUDA interop demo.
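The usual fix, sketched below with a hypothetical kernel (not the demo’s actual code), is to round the grid size up with an integer ceiling division and have the out-of-range threads exit early:

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: the point here is the bounds check, not the shading.
__global__ void shadePixels(uchar4* pixels, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // The rounded-up grid launches extra threads along the right/bottom edges
    // whenever width or height isn't a multiple of the block size; they bail out.
    if (x >= width || y >= height)
        return;

    pixels[y * width + x] = make_uchar4(0, 0, 0, 255);
}

void launchShadePixels(uchar4* pixels, int width, int height)
{
    dim3 block(16, 16);   // example thread group size
    // Integer ceiling division: partially covered edge tiles still get a block.
    dim3 grid((width  + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    shadePixels<<<grid, block>>>(pixels, width, height);
}
```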
I took some timings to compare the CUDA DirectX11 interop and DirectCompute performance. Timings were taken on my GeForce GTX 460, driver 266.58, Vista 64-bit SP2, CUDA SDK 3.2. Neither the CUDA nor the DirectCompute implementation has been optimized. With better documentation, better tools, and finer control over the program, I think the CUDA version has a better chance of being optimized well.
| Iterations | DirectCompute | CUDA (D3D11 interop) |
|---|---|---|
| 8 | 0.588 ms/frame (1700 fps) | 0.749 ms/frame (1335 fps) |
| 256 | 1.439 ms/frame (695 fps) | 1.639 ms/frame (610 fps) |
| 624 | 3.185 ms/frame (314 fps) | 2.907 ms/frame (344 fps) |
| 1024 | 4.878 ms/frame (205 fps) | 4.274 ms/frame (234 fps) |
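The post doesn’t spell out how these numbers were captured; one simple way to get ms/frame figures like the ones above, regardless of whether CUDA or DirectCompute rendered the frame, is to average wall-clock frame times. A sketch, with an arbitrary 1000-frame averaging window:

```cpp
#include <windows.h>
#include <cstdio>

// Simple wall-clock frame timer: call initTimer() once, onFrameEnd() every frame.
LARGE_INTEGER g_freq, g_prev;
double g_accumMs = 0.0;
int    g_frames  = 0;

void initTimer()
{
    QueryPerformanceFrequency(&g_freq);
    QueryPerformanceCounter(&g_prev);
}

void onFrameEnd()
{
    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    g_accumMs += 1000.0 * (now.QuadPart - g_prev.QuadPart) / g_freq.QuadPart;
    g_prev = now;

    if (++g_frames == 1000)   // report an average every 1000 frames
    {
        double msPerFrame = g_accumMs / g_frames;
        printf("%.3f ms/frame (%.0f fps)\n", msPerFrame, 1000.0 / msPerFrame);
        g_accumMs = 0.0;
        g_frames  = 0;
    }
}
```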
The higher the iteration count, the more work the compute shader (or CUDA kernel) has to do per pixel. When the iteration count is low, DirectCompute is faster; when the iteration count gets higher, CUDA becomes faster than DirectCompute. I’m assuming this means that with my current code the CUDA kernel execution itself is faster, but that there is a fixed cost for the DirectX11 interop that makes CUDA initially slower.
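To make the “more work per pixel” point concrete, here is a minimal escape-time kernel sketch (made-up parameter names, not the demo’s actual shading code); maxIterations is the value varied in the table, and it bounds the inner loop each thread runs:

```cpp
#include <cuda_runtime.h>

__global__ void mandelbrot(uchar4* pixels, int width, int height,
                           float xMin, float yMin, float xScale, float yScale,
                           int maxIterations)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height)
        return;

    // Map the pixel to a point c in the complex plane.
    float cx = xMin + px * xScale;
    float cy = yMin + py * yScale;

    // Iterate z = z^2 + c until it escapes or hits the iteration cap;
    // raising maxIterations directly lengthens this loop for interior pixels.
    float zx = 0.0f, zy = 0.0f;
    int i = 0;
    while (i < maxIterations && zx * zx + zy * zy < 4.0f)
    {
        float t = zx * zx - zy * zy + cx;
        zy = 2.0f * zx * zy + cy;
        zx = t;
        ++i;
    }

    // Simple grayscale shading from the iteration count.
    unsigned char shade = (unsigned char)(255.0f * i / maxIterations);
    pixels[py * width + px] = make_uchar4(shade, shade, shade, 255);
}
```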
On Friday, I watched the NVIDIA GDC 2011 presentation “GPU Radiosity: Porting the Enlighten Runtime to CUDA”, and around 28:23 the speaker mentions that “switching between D3D and CUDA is expensive (it’s a power cycle!)”. I’m guessing this switching cost is what makes CUDA initially slower in my Mandelbrot fractal program. I watched the entire Enlighten presentation and it was very interesting. The streamed talk had some technical info, but if you are curious about the Enlighten tech, there is a DICE & Geomerics SIGGRAPH 2010 presentation that contains even more detailed technical information about Enlighten: http://advances.realtimerendering.com/s2010/index.html.
Source Code & Binary