Yakiimo3D

Mostly DirectX 11 Programming

Yakiimo3D project on CodePlex

http://yakiimo3d.codeplex.com/

Have wanted to move Yakiimo3D source code and binaries to an open-source hosting community for a while now. After initially leaning towards creating a Google Code project, I ended up choosing CodePlex. Since Yakiimo3D is going to be mostly DirectX 11 programming samples, a Microsoft-run CodePlex open-source hosting with a MS-user community makes the most sense, so I think it was a good decision. For source code control on CodePlex, I went with Mercurial. Have never used Mercurial or Git, so it should be a good learning experience for me.

DirectCompute Mandelbrot Fractal Viewer

Introduction

Following in the footsteps of other people on the Internet, I’ve written a GPGPU language Mandelbrot fractal viewer. Along with a DirectCompute implementation, I’ve also written a reference unoptimized cpu version and am really excited that the DirectCompute implementation is more than 50x times faster! (With the caveat that my cpu implementation can probably be optimized much more.)

Other People’s GPGPU language Mandelbrot Fractal Implementations

During the past year, a bunch of people have released GPGPU language Mandelbrot fractal implementations. It seems to be a popular first program for GPGPU language learning.

1) DX11 DirectCompute Mandelbrot and Julia viewer

http://users.skynet.be/fquake/
I learned about this implementation from a Beyond3D forum post last year (http://forum.beyond3d.com/showthread.php?t=55330). The author has continued releasing DirectCompute demos after the release of the above Mandelbrot viewer demo. His latest DirectCompute fluid simulation demo blows my mind and looks amazing. Source code appears included with the demos.

2) A realtime Mandelbrot zoomer in SSE assembly and CUDA

http://users.softlab.ece.ntua.gr/~ttsiod/mandelSSE.html
I learned about this implementation from a VizWorld article (http://www.vizworld.com/2009/12/realtime-mandelbrot-zoomer-sse-assembly-cuda/). The author provides a multi-threaded, hand-optimized SIMD cpu implementation along with a Nvidia CUDA implementation. GPL-ed source code is provided.

3) MandelCPU vs MandelGPU

http://davibu.interfree.it/opencl/mandelgpu/mandelGPU.html
The author appears to be one of the main programmer behind the exciting developments in LuxRender’s OpenCL support (http://www.luxrender.net/wiki/index.php?title=Luxrender_and_OpenCL). This demos compares cpu and OpenCL implementations of a Mandelbrot fractal viewer. The author’s page http://davibu.interfree.it/index.html has other more advanced OpenCL demos with source code.

Explanation of the Mandelbrot Fractal

http://warp.povusers.org/Mandelbrot/
The above page provides a really good introduction to what the Mandelbrot fractal is. I had previously never written a Mandelbrot fractal program and the above page helped me a lot in my understanding and implementations.

Video of My Mandelbrot Fractal Viewer

The demo video starts off with the DirectCompute implementation running and toggles to my non-optimized cpu implementation midway through. As you can see by the VSync-disabled FPS at the top left-hand side of the screen, the DirectCompute implementation is much much faster. The ghosting in the video is probably due to the fact that I don’t know how to use my free video editing software properly, but I’m too tired to try to fix it. Yay, fixed the ghosting problem in my video. Turns out I just needed to increase the CamStudio video options compressor quality setting to 100%.

My Mandelbrot Fractal Viewer Demo

As I wrote in the introduction, the DirectCompute version is over 50x faster! I have been reading about how fast GPGPU languages can be for the right algorithms, but experiencing huge performance gains first-hand really hits home. The caveat is, of course, that both cpu and gpu implementations are unoptimized, and the cpu implementation has more potential to be sped up (but not +50x, I don’t think). Currently, my cpu implementation is single-threaded with no SIMD-use. The cpu in my machine is an Intel Core2Quad Q6600 2.4ghz (not overclocked) and my gpu is a ATI Radeon HD5750. In the demo, I don’t use doubles, but use floats in my calculations because my gpu doesn’t support doubles. Unfortunately, the insufficient precisions limits how closely you can zoom into the fractal.

Source Code & Binary
http://yakiimo3d.codeplex.com/releases/view/41053

CPU Implementation of Bitonic Sort

Introduction

To start off my DirectX 11 studies, I decided to write a cpu implementation of bitonic sort. Bitonic sort is an algorithm that is supposedly well-suited for parallel processing, and sample implementations exist both for DirectX 11 Direct Compute and Nvidia Cuda.

1) Nvidia Cuda SDK sample
http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html#bitonic

2) DirectX SDK DX11 Direct Compute sample
http://msdn.microsoft.com/en-us/library/ee416561%28VS.85%29.aspx

The binary and source code to my bitonic sort demo is provided near the end of this article.

Video of My Bitonic Sort Demo

For people without DirectX11 hardware, the above is a video of my cpu bitonic sort demo. Nothing too exciting. A StructuredBuffer of random unsigned longs are first sorted in increasing order and then the same data is sorted in decreasing order using a loop iterative cpu implementation of bitonic sort.

Good Bitonic Sort Websites

There are already a bunch of great resources on the internet for learning about bitonic sort. The below sites were especially helpful for me.

1. http://www.t-pot.com/program/90_BitonicSort/index.html
t-pot is a famous Japanese graphics programming website. The above article discusses an interesting pixel shader implementation of bitonic sort. Has good pictures for the explanation of the bitonic sort algorithm. No English, it’s all Japanese.

2. http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/bitonicen.htm
A good website that explains the bitonic sort algorithm very clearly.

3. http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm
My demo is based on the explanation provided in the above website. Both recursive and loop iterative implementations of bitonic sort are carefully explained. The bit manipulations for sorting in the DirectX 11 compute shader sample made a lot more sense to me after reading the above page’s explanations.

Bitonic Sort Reference Code

// loop from Merge( 2 ) to Merge( nCount )
for( int nMergeSize=2; nMergeSize<=nCount; nMergeSize=nMergeSize*2 )
{
    // Merge( nCount ) requires log2( nCount ) merges. Merge( nCount/2 ) -> Merge( 2 )
    for( int nMergeSubSize=nMergeSize>>1; nMergeSubSize>0; nMergeSubSize=nMergeSubSize>>1 )
    {
        // compare and swap elements
        for( int nElem=0; nElem<nCount; ++nElem )
        {
            int nSwapElem = nElem^nMergeSubSize;
            // check to make sure to only swap once
            if( nSwapElem > nElem )
            {
                // sort in increasing order
                if ( ( nElem & nMergeSize ) == 0 && m_data[ nElem ]>m_data[ nSwapElem ] ) {
                    Swap( nElem, nSwapElem );
                }
                // sort in descending order
                if ( ( nElem & nMergeSize ) != 0 && m_data[ nElem ]<m_data[ nSwapElem ] ) {
                    Swap( nElem, nSwapElem );
                }
            }
        }
    }
}

The above is source code for the iterative loop version of bitonic sort taken directly from my demo. The 3 nested loops represent the entire bitonic sort algorithm, and the entire dataset is sorted in increasing order after the loop finishes.

Bitonic Sort Complexity

The Big O Complexity

The below is a hand-wavy explanation of bitonic sort’s algorithmic complexity which is:
O(n \cdot log_{2}(n)^{2})

The Recurrence Relationship
The recurrence relationship for bitonic sort is:
T(n) = log_{2}(n)+T(n/2)

The recurrence relationship expanded is:
T(n) = log_{2}(n)+log_{2}(n/2)+log_{2}(n/4)+log_{2}(n/8)+\ldots+1

Appying fundamental properties of log e.g. log_{2}(n/2)=log_{2}(n)-log_{2}(2)=log_{2}(n)-1 the relationship is further simplified:
T(n) = log_{2}(n)+log_{2}(n)-1+log_{2}(n)-2+log_{2}(n)-3+\ldots+1

Applying the formula for arithmetic series, the equation becomes:
T(n) = \frac{log_{2}(n)\cdot(log_{2}(n)+1)}{2}

Arriving at the Big O Complexity
The above recurrence relationship’s complexity corresponds to the outer 2 loops in the reference code. In addition, there is an inner loop comparing all data elements so the final algorithmic complexity is represented by:
O(n \cdot log_{2}(n)^{2})

Bitonic Sort Algorithm

For good explanations of the bitonic sort algorithm with helpful pictures, please refer to the “Good Bitonic Sort Websites” links provided above.

The Parallelism of Bitonic Sort

The compared and swapped elements in the inner loop of the reference bitonic sort code do not overlap. The independent data nature of the inner loop allows it to be run in parallel across thousands of threads. If you look at the Nvidia Cuda and DirectX SDK bitonic sort samples, this inner loop is parallelized using the many threads of the gpu. An actual GPGPU bitonic sort implementations is complicated by the fact that the elements to be compared and swapped might not reside in the same limited shared memory space of the gpu thread group. In DirectX10-level compute shader 4_x, the maximum number of threads in a single thread group (only threads within the same thread group can access the same shared memory) is 768. For DirectX11-level compute shader 5_0, the maximum number of threads per thread group is 1024. The DirectX SDK bitonic sort sample handles this shared memory problem by transposing the data array to force each compared and swapped element to reside in the same shared memory.

Bitonic Sort Demo

Here’s my cpu bitonic sort demo using DirectX 11. Both a recursive and a loop iterative version of bitonic sort have been implemented. My code is based on the excellent explanations provided in http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm. In the demo, bitonic sort is performed on a DirectX 11 StructuredBuffer of random numbers. The data is first sorted in increasing order. Then without re-randomizing, the data is sorted in decreasing order. I thought it was nice that the pixel shader could read and draw the StructuredBuffer directly without transferring to an intermediate texture. The video shown at the top of the page is a video of my demo.

Source Code & Binary
http://yakiimo3d.codeplex.com/releases/view/41051

Conclusions

Well, that’s it for my first Yakiimo3D DirectX 11 article. I’ve experimented with a compute shader implementation of bitonic sort, and will probably post a demo here eventually. Thanks for reading!

Posted my very first youtube video!



The above is my very first youtube video post! It’s a video of a CPU implementation of bitonic sort. In the demo, a StructuredBuffer of random data is first sorted in increasing order. Then without re-randomizing, the same data is sorted in decreasing order. DirectX 11 was used for rendering. Something I found interesting is that the pixel shader could load and draw the StructuredBuffer without any intermediate transfers to a texture. Not an exciting demo of DirectX 11, but I have to start somewhere.

The demo was inspired by t-pot.com’s pixel shader bitonic sort demo (http://www.t-pot.com/program/90_BitonicSort/index.html). My next post will be an article about the above CPU implementation with full source code.

My Radeon HD5750

I posted pics of my Radeon HD5750 on my hatena diary page when I got it last month.

2009/11/21SAPPHIRE HD 5750 1GB GDDR5 PCIE HDMI/DP video card pics (Hatena Diary)

The card quickly got unboxed and inserted into my machine. Since buying it, I’ve used my free time not spent playing games (Demon’s Souls, Ratchet & Clank: A Crack in Time, and Modern Warefare 2 are all so good…), looking over and studying the DirectX SDK samples for DirectX11.

I plan on posting a DirectX11 CPU bitonic sort sample followed by a compute shader bitonic sort sample soon. I know the DirectX SDK contains a bitonic sort compute shader sample, but it was a good topic for me to begin my studies. I’ll try to cover the bitonic sort algorithm more in depth than the DirectX SDK.