Yakiimo3D

Mostly DirectX 11 Programming

DirectX11 Interop CUDA Mandelbrot Fractal

Introduction

I updated my DirectCompute Mandelbrot fractal demo so that it can render using either CUDA or DirectCompute. The program is a simple Mandelbrot renderer, but you can switch between the CUDA and DirectCompute renderers at runtime.

Relevant Links

http://developer.download.nvidia.com/compute/cuda/sdk/website/Graphics_Interop.html#simpleD3D11Texture
The NVIDIA CUDA 3.2 SDK includes a DirectX11 interop sample “Simple D3D11 Texture” which I used as reference. For the 3.2 SDK, I think this is the only CUDA sample using D3D11. The SDK sample uses an ID3D11Texture3D as a CUDA resource while my Mandelbrot fractal program uses an ID3D11Buffer as a CUDA resource.

http://developer.nvidia.com/object/cuda-by-example.html
I mainly used the CUDA 3.2 SDK programming manual to learn CUDA, but I also read the NVIDIA “CUDA by Example” book on Safari Informit. The book is very easy to follow and I read it in one sitting. It teaches the basics of CUDA in simple language and I thought it was a good first book.

http://www.yakiimo3d.com/2010/02/02/directcompute-mandelbrot-fractal-viewer/
My old DirectCompute Mandelbrot fractal viewer. The program has a bug (oh no!) when the screen dimensions are not divisible by the thread group dimensions, which I fixed in my new CUDA interop demo.
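
One common way to handle screen dimensions that are not a multiple of the thread group size is to round the group count up and let the out-of-range threads exit early. The C++ sketch below simulates that idea on the CPU. It is not the demo's actual fix, and `groupCount` and `simulateDispatch` are hypothetical names made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of thread groups needed to cover `size` pixels with groups of
// `groupSize` threads, rounded up so a partial group covers the remainder.
int groupCount(int size, int groupSize) {
    assert(size >= 0 && groupSize > 0);
    return (size + groupSize - 1) / groupSize;
}

// CPU stand-in for a 2D dispatch (hypothetical helper): visits every thread
// of every group and counts how many times each pixel is written.
std::vector<int> simulateDispatch(int width, int height, int groupSize) {
    std::vector<int> writes(static_cast<std::size_t>(width) * height, 0);
    const int groupsX = groupCount(width, groupSize);
    const int groupsY = groupCount(height, groupSize);
    for (int gy = 0; gy < groupsY; ++gy) {
        for (int gx = 0; gx < groupsX; ++gx) {
            for (int ty = 0; ty < groupSize; ++ty) {
                for (int tx = 0; tx < groupSize; ++tx) {
                    const int x = gx * groupSize + tx;
                    const int y = gy * groupSize + ty;
                    // Bounds guard: threads of a partial group that fall
                    // outside the image do nothing.
                    if (x >= width || y >= height) continue;
                    ++writes[static_cast<std::size_t>(y) * width + x];
                }
            }
        }
    }
    return writes;
}
```

With this scheme every pixel is written exactly once even when, say, a 100x50 image is covered by 16x16 groups. Truncating the division instead would leave the right and bottom edges unrendered.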

Demo Notes

Took some timings to compare CUDA DirectX11 interop and DirectCompute performance. Timings were taken on my GeForce GTX 460, Driver 266.58, Vista 64-bit SP2, CUDA SDK 3.2. Neither the CUDA nor the DirectCompute implementation has been optimized. Since CUDA has better documentation, better tools and finer control over the program, I think the CUDA version has a better chance of being optimized well.

Num Iterations    DirectCompute                CUDA
8                 0.588ms/frame (1700fps)      0.749ms/frame (1335fps)
256               1.439ms/frame (695fps)       1.639ms/frame (610fps)
624               3.185ms/frame (314fps)       2.907ms/frame (344fps)
1024              4.878ms/frame (205fps)       4.274ms/frame (234fps)
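
The frame rates in the table are just 1000 divided by the frame time in milliseconds. The small C++ sketch below redoes that arithmetic and computes the per-row gap between the two renderers. The numbers are copied from the table, while the `Timing` struct and the function names are hypothetical helpers added for illustration.

```cpp
#include <cassert>

// One row of the timing table (frame times in milliseconds).
struct Timing {
    int iterations;
    double directComputeMs;
    double cudaMs;
};

// Frame times copied from the table above.
const Timing kTimings[] = {
    {8,    0.588, 0.749},
    {256,  1.439, 1.639},
    {624,  3.185, 2.907},
    {1024, 4.878, 4.274},
};

// Converts a frame time in milliseconds to frames per second.
double framesPerSecond(double msPerFrame) {
    assert(msPerFrame > 0.0);
    return 1000.0 / msPerFrame;
}

// Positive when CUDA is slower than DirectCompute for this row,
// negative when CUDA is faster.
double cudaMinusDirectComputeMs(const Timing& t) {
    return t.cudaMs - t.directComputeMs;
}
```

The gap is positive for the 8 and 256 iteration rows and negative for the 624 and 1024 iteration rows, so the crossover lies somewhere between 256 and 624 iterations.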

The higher the iteration count, the more work the compute shader or CUDA kernel has to do. At low iteration counts (8 and 256), DirectCompute is faster. At higher iteration counts (624 and 1024), CUDA is faster than DirectCompute. I’m assuming this means that with my current code, the CUDA kernel execution is faster, but that there is a fixed cost for DirectX11 interop that makes CUDA initially slower.
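
To see why the iteration count drives the per-pixel cost, here is a minimal C++ sketch of the standard Mandelbrot escape-time loop. It is a generic CPU version, not the demo's HLSL or CUDA code; the function name `escapeIterations` and the use of double precision are assumptions made for illustration.

```cpp
#include <cassert>

// Iterates z = z*z + c starting from z = 0 and returns how many iterations
// ran before |z| exceeded 2, capped at maxIterations.
int escapeIterations(double cRe, double cIm, int maxIterations) {
    assert(maxIterations >= 0);
    double zRe = 0.0;
    double zIm = 0.0;
    int i = 0;
    // |z| <= 2 is tested as |z|^2 <= 4 to avoid a square root.
    while (i < maxIterations && zRe * zRe + zIm * zIm <= 4.0) {
        const double nextRe = zRe * zRe - zIm * zIm + cRe;
        zIm = 2.0 * zRe * zIm + cIm;
        zRe = nextRe;
        ++i;
    }
    return i;
}
```

Points inside the set never escape, so they always run the full `maxIterations` loop. Raising the iteration limit therefore increases the work for every such pixel on screen.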

http://nvidia.fullviewmedia.com/gdc2011/agenda.html
On Friday, I watched the NVIDIA GDC2011 “GPU Radiosity: Porting the Enlighten runtime to CUDA” presentation, and around 28:23, the speaker mentions that “Switching between D3D and CUDA is expensive (it’s a power cycle!)”. I’m guessing this power mode switch cost is what makes CUDA initially slower in my Mandelbrot fractal program. I watched the entire Enlighten presentation and it was very interesting. The stream had some technical info, but if you are curious about Enlighten tech, there is a DICE & Geomerics SIGGRAPH 2010 presentation that contains even more detailed technical information about Enlighten (http://advances.realtimerendering.com/s2010/index.html).

Demo

Source Code & Binary
http://yakiimo3d.codeplex.com/releases/view/62087

CEDEC Digital Library

http://cedil.cesa.or.jp/
Looks like the CEDEC Digital Library is now online. CEDEC is Japan’s biggest game developer conference and the CEDEC Digital Library houses presentation slide material from CEDEC 2006 through last year’s CEDEC 2010. Not sure if it’s all the presentations, but there’s a whole lot of material available for download. A quick, free user registration is required to download anything. Most presentations at CEDEC are given in Japanese, so most of the material is in Japanese.

https://members.cesa.or.jp/cedil/session/detail/316
The CEDEC Digital Library “CEDEC 2010: LostPlanet2 DirectX11 Features” page. I wrote about this session before (http://www.yakiimo3d.com/2010/07/11/cedec-2010-lostplanet2-directx11-features/), and happily, I ended up being able to go see it at last year’s CEDEC. Capcom discussed some of their DX11 tessellation and compute shader usage in Lost Planet 2 and now the Japanese slides are available for everyone to download. The tessellation portion of the presentation included a comparison of tessellation schemes as well as a discussion of dealing with tessellation and displacement mapping artifacts such as cracking. The compute shader portion covered a wave particles compute shader implementation and a soft body compute shader implementation.

Parallel Nsight 1.51

http://parallelnsight.nvidia.com/
NVIDIA Parallel Nsight website. http://www.nvidia.com/object/parallel-nsight-requirements.html lists the requirements for different Parallel Nsight functionality.

http://images.anandtech.com/doci/3924/ParallelNsight.png
From AnandTech’s article on Parallel Nsight 1.5 & CUDA Toolkit 3.2. Nice simple table of the different functionality available depending on hardware configuration. For a single GPU system like my own, the D3D Graphics Inspector and Analyzer functionality are available.

Since NVIDIA PerfHUD is not available for DirectX11, I ended up downloading & installing Parallel Nsight on my Vista 64-bit system. I briefly tried it out over the weekend on sample apps and, as the AnandTech table shows, I’m able to use the D3D Graphics Inspector and Analyzer functionality from within Visual Studio 2008. The Graphics Inspector shows a HUD overlay of performance and debug information over running DX11 applications, reminiscent of PerfHUD (no info over the OpenGL CUDA app I tried). You can perform a resumable capture from the HUD menu to further inspect the application. The Parallel Nsight Analyzer can do trace analysis for System, Tools Extension, CUDA, OpenCL, DirectX, OpenGL and Cg (I tried out DirectX and CUDA). Separate from Parallel Nsight, I also noticed that the CUDA toolkit includes a Compute Visual Profiler for CUDA and OpenCL programs. I only did a quick check, but it looks like Parallel Nsight could be a pretty useful tool for single GPU systems and I am glad I downloaded it.

NVIDIA GTX 460


A picture of my new graphics card.

http://www.geeks3d.com/20100712/nvidia-geforce-gtx-460-specifications-and-reviews-available/
Geeks3D’s article on the GTX 460.

http://developer.nvidia.com/object/nsight.html
I recently learned via Twitter (http://twitter.com/#!/NIV_Anteru/status/30620618055483392) that the Pro edition of NVIDIA’s Parallel Nsight is now free. This news is one reason why I bought a new GTX 460. In order to fully use Parallel Nsight on a single machine though, I need separate graphics cards for display and for running CUDA. I was hoping I could use my HD5750 somehow, but using multiple display drivers is not even supported in Vista (http://www.microsoft.com/whdc/device/display/multimonVista.mspx), so I don’t think I’m going to get anywhere trying on my Vista system. I’m probably going to buy another CUDA-capable graphics card.

http://3dmark.com/3dm11/595018
Ran 3DMark 11 on my new card. As expected, since it came out later, I get a better score than my old HD5750 (http://www.yakiimo3d.com/2010/12/12/3dmark-11/).

Felt a little sad taking out my Radeon HD5750 (http://www.yakiimo3d.com/2009/12/13/my-radeon-hd5750/), which has served me well. The HD5750 cost me around 16,000 yen (around $191 with the current strong yen) and the new GTX 460 cost me around 13,000 yen (around $155). Since I don’t play many PC games, I usually buy inexpensive mid-range graphics cards, so financially, I don’t feel wasteful upgrading after a year. With my new graphics card, I get increased performance, the ability to run CUDA samples, the ability to run double-precision float programs, and the ability to try out Parallel Nsight once I get a 2nd graphics card.

(2011/02/13 update)
http://www.yakiimo3d.com/2011/02/13/parallel-nsight-1-51-2/
I ended up trying Parallel Nsight. Seems useful on single GPU systems.

3DMark 11

http://www.3dmark.com/
3DMark’s official webpage.

http://www.3dmark.com/wp-content/uploads/2010/12/3DMark11_Whitepaper.pdf
3DMark 11’s white paper, which I learned about from Twitter (http://twitter.com/#!/repi/status/12653354148700161). The whitepaper contains details about the multithreading, tessellation (Phong tessellation is used), lighting and post-effect techniques used in 3DMark 11. It has interesting information such as how they do bloom and lens reflection computations in the frequency domain, using the compute shader to calculate the FFT and iFFT. The pdf is linked from 3DMark 11’s support page http://www.3dmark.com/3dmark11/support/.

http://www.geeks3d.com/20101207/review-3dmark11-gaming-benchmark-directx11-d3d11-dx11-tessellation/
http://www.geeks3d.com/20101207/tested-3dmark11-dx11-battle-gtx-580-vs-gtx-480-vs-hd-6870-vs-hd-5870-vs-gtx-460-vs-hd-5770/
Geeks3D’s 3DMark 11 article and benchmarks.

http://www.4gamer.net/games/110/G011050/20101207055/
4gamer is the official 3DMark 11 mirror for Japan. Good announcement article on 3DMark 11 (in Japanese).

Over the weekend, I finally got around to running the free, basic edition of 3DMark 11 on my HD5750 w/ Core 2 Quad Q6600. My non-top-of-the-line machine struggled and the framerate was choppy, but 3DMark 11 looked great. The following is my 3DMark 11 score: http://3dmark.com/3dm11/130419