Yakiimo3D

Mostly DirectX 11 Programming

CPU Implementation of Bitonic Sort

Introduction

To start off my DirectX 11 studies, I decided to write a cpu implementation of bitonic sort. Bitonic sort is an algorithm that is supposedly well-suited for parallel processing, and sample implementations exist both for DirectX 11 Direct Compute and Nvidia Cuda.

1) Nvidia Cuda SDK sample
http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html#bitonic

2) DirectX SDK DX11 Direct Compute sample
http://msdn.microsoft.com/en-us/library/ee416561%28VS.85%29.aspx

The binary and source code to my bitonic sort demo is provided near the end of this article.

Video of My Bitonic Sort Demo

For people without DirectX11 hardware, the above is a video of my cpu bitonic sort demo. Nothing too exciting. A StructuredBuffer of random unsigned longs are first sorted in increasing order and then the same data is sorted in decreasing order using a loop iterative cpu implementation of bitonic sort.

Good Bitonic Sort Websites

There are already a bunch of great resources on the internet for learning about bitonic sort. The below sites were especially helpful for me.

1. http://www.t-pot.com/program/90_BitonicSort/index.html
t-pot is a famous Japanese graphics programming website. The above article discusses an interesting pixel shader implementation of bitonic sort. Has good pictures for the explanation of the bitonic sort algorithm. No English, it’s all Japanese.

2. http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/bitonicen.htm
A good website that explains the bitonic sort algorithm very clearly.

3. http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm
My demo is based on the explanation provided in the above website. Both recursive and loop iterative implementations of bitonic sort are carefully explained. The bit manipulations for sorting in the DirectX 11 compute shader sample made a lot more sense to me after reading the above page’s explanations.

Bitonic Sort Reference Code

// loop from Merge( 2 ) to Merge( nCount )
for( int nMergeSize=2; nMergeSize<=nCount; nMergeSize=nMergeSize*2 )
{
    // Merge( nCount ) requires log2( nCount ) merges. Merge( nCount/2 ) -> Merge( 2 )
    for( int nMergeSubSize=nMergeSize>>1; nMergeSubSize>0; nMergeSubSize=nMergeSubSize>>1 )
    {
        // compare and swap elements
        for( int nElem=0; nElem<nCount; ++nElem )
        {
            int nSwapElem = nElem^nMergeSubSize;
            // check to make sure to only swap once
            if( nSwapElem > nElem )
            {
                // sort in increasing order
                if ( ( nElem & nMergeSize ) == 0 && m_data[ nElem ]>m_data[ nSwapElem ] ) {
                    Swap( nElem, nSwapElem );
                }
                // sort in descending order
                if ( ( nElem & nMergeSize ) != 0 && m_data[ nElem ]<m_data[ nSwapElem ] ) {
                    Swap( nElem, nSwapElem );
                }
            }
        }
    }
}

The above is source code for the iterative loop version of bitonic sort taken directly from my demo. The 3 nested loops represent the entire bitonic sort algorithm, and the entire dataset is sorted in increasing order after the loop finishes.

Bitonic Sort Complexity

The Big O Complexity

The below is a hand-wavy explanation of bitonic sort’s algorithmic complexity which is:
O(n \cdot log_{2}(n)^{2})

The Recurrence Relationship
The recurrence relationship for bitonic sort is:
T(n) = log_{2}(n)+T(n/2)

The recurrence relationship expanded is:
T(n) = log_{2}(n)+log_{2}(n/2)+log_{2}(n/4)+log_{2}(n/8)+\ldots+1

Appying fundamental properties of log e.g. log_{2}(n/2)=log_{2}(n)-log_{2}(2)=log_{2}(n)-1 the relationship is further simplified:
T(n) = log_{2}(n)+log_{2}(n)-1+log_{2}(n)-2+log_{2}(n)-3+\ldots+1

Applying the formula for arithmetic series, the equation becomes:
T(n) = \frac{log_{2}(n)\cdot(log_{2}(n)+1)}{2}

Arriving at the Big O Complexity
The above recurrence relationship’s complexity corresponds to the outer 2 loops in the reference code. In addition, there is an inner loop comparing all data elements so the final algorithmic complexity is represented by:
O(n \cdot log_{2}(n)^{2})

Bitonic Sort Algorithm

For good explanations of the bitonic sort algorithm with helpful pictures, please refer to the “Good Bitonic Sort Websites” links provided above.

The Parallelism of Bitonic Sort

The compared and swapped elements in the inner loop of the reference bitonic sort code do not overlap. The independent data nature of the inner loop allows it to be run in parallel across thousands of threads. If you look at the Nvidia Cuda and DirectX SDK bitonic sort samples, this inner loop is parallelized using the many threads of the gpu. An actual GPGPU bitonic sort implementations is complicated by the fact that the elements to be compared and swapped might not reside in the same limited shared memory space of the gpu thread group. In DirectX10-level compute shader 4_x, the maximum number of threads in a single thread group (only threads within the same thread group can access the same shared memory) is 768. For DirectX11-level compute shader 5_0, the maximum number of threads per thread group is 1024. The DirectX SDK bitonic sort sample handles this shared memory problem by transposing the data array to force each compared and swapped element to reside in the same shared memory.

Bitonic Sort Demo

Here’s my cpu bitonic sort demo using DirectX 11. Both a recursive and a loop iterative version of bitonic sort have been implemented. My code is based on the excellent explanations provided in http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm. In the demo, bitonic sort is performed on a DirectX 11 StructuredBuffer of random numbers. The data is first sorted in increasing order. Then without re-randomizing, the data is sorted in decreasing order. I thought it was nice that the pixel shader could read and draw the StructuredBuffer directly without transferring to an intermediate texture. The video shown at the top of the page is a video of my demo.

Source Code & Binary
http://yakiimo3d.codeplex.com/releases/view/41051

Conclusions

Well, that’s it for my first Yakiimo3D DirectX 11 article. I’ve experimented with a compute shader implementation of bitonic sort, and will probably post a demo here eventually. Thanks for reading!

. This entry was posted on Thursday, December 24th, 2009 at 12:53 AM and is filed under Demo, DirectX11. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

Leave a Reply



XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

It may take some time for your comment to appear, it is not necessary to submit it again.