Have wanted to move Yakiimo3D source code and binaries to an open-source hosting community for a while now. After initially leaning towards creating a Google Code project, I ended up choosing CodePlex. Since Yakiimo3D is going to be mostly DirectX 11 programming samples, a Microsoft-run CodePlex open-source hosting with a MS-user community makes the most sense, so I think it was a good decision. For source code control on CodePlex, I went with Mercurial. Have never used Mercurial or Git, so it should be a good learning experience for me.
Following in the footsteps of other people on the Internet, I’ve written a GPGPU language Mandelbrot fractal viewer. Along with a DirectCompute implementation, I’ve also written a reference unoptimized cpu version and am really excited that the DirectCompute implementation is more than 50x times faster! (With the caveat that my cpu implementation can probably be optimized much more.)
Other People’s GPGPU language Mandelbrot Fractal Implementations
During the past year, a bunch of people have released GPGPU language Mandelbrot fractal implementations. It seems to be a popular first program for GPGPU language learning.
1) DX11 DirectCompute Mandelbrot and Julia viewer
http://users.skynet.be/fquake/
I learned about this implementation from a Beyond3D forum post last year (http://forum.beyond3d.com/showthread.php?t=55330). The author has continued releasing DirectCompute demos after the release of the above Mandelbrot viewer demo. His latest DirectCompute fluid simulation demo blows my mind and looks amazing. Source code appears included with the demos.
2) A realtime Mandelbrot zoomer in SSE assembly and CUDA
http://warp.povusers.org/Mandelbrot/
The above page provides a really good introduction to what the Mandelbrot fractal is. I had previously never written a Mandelbrot fractal program and the above page helped me a lot in my understanding and implementations.
Video of My Mandelbrot Fractal Viewer
The demo video starts off with the DirectCompute implementation running and toggles to my non-optimized cpu implementation midway through. As you can see by the VSync-disabled FPS at the top left-hand side of the screen, the DirectCompute implementation is much much faster. The ghosting in the video is probably due to the fact that I don’t know how to use my free video editing software properly, but I’m too tired to try to fix it. Yay, fixed the ghosting problem in my video. Turns out I just needed to increase the CamStudio video options compressor quality setting to 100%.
My Mandelbrot Fractal Viewer Demo
As I wrote in the introduction, the DirectCompute version is over 50x faster! I have been reading about how fast GPGPU languages can be for the right algorithms, but experiencing huge performance gains first-hand really hits home. The caveat is, of course, that both cpu and gpu implementations are unoptimized, and the cpu implementation has more potential to be sped up (but not +50x, I don’t think). Currently, my cpu implementation is single-threaded with no SIMD-use. The cpu in my machine is an Intel Core2Quad Q6600 2.4ghz (not overclocked) and my gpu is a ATI Radeon HD5750. In the demo, I don’t use doubles, but use floats in my calculations because my gpu doesn’t support doubles. Unfortunately, the insufficient precisions limits how closely you can zoom into the fractal.
To start off my DirectX 11 studies, I decided to write a cpu implementation of bitonic sort. Bitonic sort is an algorithm that is supposedly well-suited for parallel processing, and sample implementations exist both for DirectX 11 Direct Compute and Nvidia Cuda.
The binary and source code to my bitonic sort demo is provided near the end of this article.
Video of My Bitonic Sort Demo
For people without DirectX11 hardware, the above is a video of my cpu bitonic sort demo. Nothing too exciting. A StructuredBuffer of random unsigned longs are first sorted in increasing order and then the same data is sorted in decreasing order using a loop iterative cpu implementation of bitonic sort.
Good Bitonic Sort Websites
There are already a bunch of great resources on the internet for learning about bitonic sort. The below sites were especially helpful for me.
1. http://www.t-pot.com/program/90_BitonicSort/index.html
t-pot is a famous Japanese graphics programming website. The above article discusses an interesting pixel shader implementation of bitonic sort. Has good pictures for the explanation of the bitonic sort algorithm. No English, it’s all Japanese.
3. http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm
My demo is based on the explanation provided in the above website. Both recursive and loop iterative implementations of bitonic sort are carefully explained. The bit manipulations for sorting in the DirectX 11 compute shader sample made a lot more sense to me after reading the above page’s explanations.
Bitonic Sort Reference Code
// loop from Merge( 2 ) to Merge( nCount ) for(int nMergeSize=2; nMergeSize<=nCount; nMergeSize=nMergeSize*2) { // Merge( nCount ) requires log2( nCount ) merges. Merge( nCount/2 ) -> Merge( 2 ) for(int nMergeSubSize=nMergeSize>>1; nMergeSubSize>0; nMergeSubSize=nMergeSubSize>>1) { // compare and swap elements for(int nElem=0; nElem<nCount;++nElem ) { int nSwapElem = nElem^nMergeSubSize; // check to make sure to only swap once if( nSwapElem > nElem ) { // sort in increasing order if(( nElem & nMergeSize )== 0 && m_data[ nElem ]>m_data[ nSwapElem ]){
Swap( nElem, nSwapElem ); } // sort in descending order if(( nElem & nMergeSize )!= 0 && m_data[ nElem ]<m_data[ nSwapElem ]){
Swap( nElem, nSwapElem ); } } } } }
The above is source code for the iterative loop version of bitonic sort taken directly from my demo. The 3 nested loops represent the entire bitonic sort algorithm, and the entire dataset is sorted in increasing order after the loop finishes.
Bitonic Sort Complexity
The Big O Complexity
The below is a hand-wavy explanation of bitonic sort’s algorithmic complexity which is:
The Recurrence Relationship
The recurrence relationship for bitonic sort is:
The recurrence relationship expanded is:
Appying fundamental properties of log e.g. the relationship is further simplified:
Applying the formula for arithmetic series, the equation becomes:
Arriving at the Big O Complexity
The above recurrence relationship’s complexity corresponds to the outer 2 loops in the reference code. In addition, there is an inner loop comparing all data elements so the final algorithmic complexity is represented by:
Bitonic Sort Algorithm
For good explanations of the bitonic sort algorithm with helpful pictures, please refer to the “Good Bitonic Sort Websites” links provided above.
The Parallelism of Bitonic Sort
The compared and swapped elements in the inner loop of the reference bitonic sort code do not overlap. The independent data nature of the inner loop allows it to be run in parallel across thousands of threads. If you look at the Nvidia Cuda and DirectX SDK bitonic sort samples, this inner loop is parallelized using the many threads of the gpu. An actual GPGPU bitonic sort implementations is complicated by the fact that the elements to be compared and swapped might not reside in the same limited shared memory space of the gpu thread group. In DirectX10-level compute shader 4_x, the maximum number of threads in a single thread group (only threads within the same thread group can access the same shared memory) is 768. For DirectX11-level compute shader 5_0, the maximum number of threads per thread group is 1024. The DirectX SDK bitonic sort sample handles this shared memory problem by transposing the data array to force each compared and swapped element to reside in the same shared memory.
Bitonic Sort Demo
Here’s my cpu bitonic sort demo using DirectX 11. Both a recursive and a loop iterative version of bitonic sort have been implemented. My code is based on the excellent explanations provided in http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm. In the demo, bitonic sort is performed on a DirectX 11 StructuredBuffer of random numbers. The data is first sorted in increasing order. Then without re-randomizing, the data is sorted in decreasing order. I thought it was nice that the pixel shader could read and draw the StructuredBuffer directly without transferring to an intermediate texture. The video shown at the top of the page is a video of my demo.
Well, that’s it for my first Yakiimo3D DirectX 11 article. I’ve experimented with a compute shader implementation of bitonic sort, and will probably post a demo here eventually. Thanks for reading!
The above is my very first youtube video post! It’s a video of a CPU implementation of bitonic sort. In the demo, a StructuredBuffer of random data is first sorted in increasing order. Then without re-randomizing, the same data is sorted in decreasing order. DirectX 11 was used for rendering. Something I found interesting is that the pixel shader could load and draw the StructuredBuffer without any intermediate transfers to a texture. Not an exciting demo of DirectX 11, but I have to start somewhere.
The card quickly got unboxed and inserted into my machine. Since buying it, I’ve used my free time not spent playing games (Demon’s Souls, Ratchet & Clank: A Crack in Time, and Modern Warefare 2 are all so good…), looking over and studying the DirectX SDK samples for DirectX11.
I plan on posting a DirectX11 CPU bitonic sort sample followed by a compute shader bitonic sort sample soon. I know the DirectX SDK contains a bitonic sort compute shader sample, but it was a good topic for me to begin my studies. I’ll try to cover the bitonic sort algorithm more in depth than the DirectX SDK.