Yakiimo3D

Mostly DirectX 11 Programming

DX11 DirectCompute Buddhabrot & Nebulabrot Renderer

Introduction

I wrote a DX11 DirectCompute implementation of the famous Buddhabrot fractal. The implementation is an extension of my earlier Mandelbrot fractal DX11 DirectCompute program (http://www.yakiimo3d.com/2010/02/02/directcompute-mandelbrot-fractal-viewer/). I also use my Rheinhard tonemapping code (http://www.yakiimo3d.com/2010/03/13/dx11-directcompute-global-operator-photographic-tonemapping/) to bring the HDR Buddhabrot color values into the LDR framebuffer’s [0,1] range. It’s good that I used code from my old demos because I found and fixed bugs in my tonemapping code and also realized I had forgotten to upload the source and binary for my DX11 Mandelbrot demo to CodePlex (doh!).

Regular Buddhabrot renderings result in monotone achromatic images because the same single value is written to each RGB channel. The Nebulabrot is a simple extension to the Buddhabrot, where you plot the Buddhabrot 3 times with a different iteration exit max value, and assign each of the 3 iteration plots to a different RGB channel (In an actual implementation, you can render the 3 iteration exit max value plots in one draw by just using if branches.) Since the implementation is easy and the resulting renders are more interesting, my Buddhabrot implementation supports Nebulabrot renderings as well.

As usual, a CodePlex link for my Buddhabrot program’s source code and binary are provided near the end of the article.

Relevant Links

1) http://www.superliminal.com/fractals/bbrot/bbrot.htm
Web page by the Buddhabrot’s discoverer Melinda Green. Contains good explanations of the Buddhabrot and the Nebulabrot. There is a Buddhabrot implementation links list at the bottom of her page. While searching information about the Buddhabrot, I came across Melinda Green’s stackoverflow.com profile (http://stackoverflow.com/users/181535/melinda-green). I thought it was cool that the discoverer is a programmer and uses sites like stackoverflow.

2) http://local.wasp.uwa.edu.au/~pbourke/fractals/buddhabrot/
Paul Bourke’s web page has great explanations for a wide variety of mathematical and geometric algorithms. I’ve used his webpage for reference on numerous other occasions. His page on the Buddhabrot contains an implementation with source code.

3) http://iquilezles.org/www/articles/budhabrot/budhabrot.htm
Very nice Nebulabrot renderings. This page made me realize that I could use a square or cubic function in order to increase the contrast of my Buddhabrot and make the renderings more pretty.

4) http://brnz.org/hbr/?p=297
A very cool PS3 SPU Buddhabrot implementation with source code. I follow the author on twitter (http://twitter.com/twoscomplement) and his PS3 implementation was one of my first encounters with the Buddhabrot.

5) http://www.steckles.com/buddha/
The base routine of the Buddhabrot algorithm is similar to pathtracing and involves random sampling of the image each frame. This page suggests applying the Metropolis-Hastings algorithm used with pathtracing in order to statistically speed up convergence of the Buddhabrot image. The results seem impressive, and if I feel up to it, I would like to try this technique in the future.

6) http://en.wikipedia.org/wiki/Buddhabrot
The Wikipedia page for the Buddhabrot has good information and links.

Movie of My Buddhabrot Demo

I posted a movie of my Buddhabrot demo to YouTube. The Buddhabrot in the demo is a Nebulabrot with max iteration values of red: 10,000 green: 1,000 blue: 100 (the default values in my program.)

A Screen Capture of My Buddhabrot Demo

1592x1028 Buddhabrot

I let my demo run for around 20 min and then took a screenshot. The frame rate was around 42 fps during the render. My implementation collects 10,000 samples per frame, so the above Nebulabrot was rendered at a sampling rate of 420,000 samples per second. Click on the image to see the image at its full 1592×1028 resolution. The Nebulabrot parameters are the same as the default ones used in the movie. The tonemap middle-grey value is probably different.

Implementation Details

Compute Shader Thread Setup

I call Dispatch with a thread group dimension of <10,10,1> (10×10=100 thread groups in total.) I declare numthread with a (10,10,1) size; this means each thread group will contain 100 threads. 100 thread groups each with 100 threads means that a total of 10,000 threads are executed in parallel during a single compute shader execution. I only do 1 random number sampling per thread, so 10,000 Buddhabrot samplings are performed in one frame.

Random Number Generation

I do random number generation on the CPU-side and just lock & copy the data to a GPU buffer every frame (prob should use multiple buffers to avoid lock, but I don’t do that in this demo.) I tried assigning a multiply-with-carry random number generator to each thread and generating a RNG sequence inside the compute shader, but the results showed repetition. Finding and implementing a GPU parallel RNG algorithm seemed like a lot of work, so I didn’t try investigating further.

Atomic Operations

I used InterlockedAdd to write the Buddhabrot plot to a global memory iteration count buffer. Writing to global memory repeatedly (very rare worst case is iterations max-1 times, so 9999+999+99 for the demo Nebulabrot) during each thread execution (10,000 threads ttl in the demo!) probably resulted in a big performance hit. I’ve read that atomic operations should be minimized if possible, but I actually got slower performance if I didn’t use InterlockedAdd (writes are not synchronized and wrong, but just for testing.) Not exactly sure why this is.

Tone Mapping

I used my earlier implementation of the Rheinhard global tonemapping operator to map the Buddhabrot output buffer to a LDR range. Some Buddhabrot renders on the Internet have no details in the dark areas because the authors didn’t have time to implement a good HDR->LDR mapping algorithm. The nature of Rheinhard tonecurve (less compression in the darker areas) and the fact that I can set the middle-grey value helps my Buddhabrot renders retain detail in both the bright and dark areas of the image.

Contrast Brightening

In order to increase the contrast of my Buddhabrot renders, and make the bright areas really bright, I cube the luminance value of the Buddhabrot output buffer in Yxy color space. Combined with Rheinhard tonemapping, I get images with strong bright areas while still retaining detail in the darker areas.

Buddhabrot Demo

In terms of hardware, I own the same hardware I was using when I wrote my Mandelbrot DirectCompute implementation (more recent ATI Catalyst drivers.) The CPU in my machine is an Intel Core2Quad Q6600 2.4ghz (not overclocked) and my GPU is an ATI Radeon HD5750.

My Buddhabrot DirectCompute implementation turned out 4-6 times faster than my reference CPU implementation (single-thread, non-SIMD.) This is a smaller performance gain compared to my Mandelbrot DirectCompute implementation, which was +50x faster than the CPU implementation (also single-thread, non-SIMD.) I’m still impressed with the DirectCompute performance though, since the Buddhabrot fractal is more computationally complex than the Mandelbrot fractal, and a 4-6x time speedup is a noticeable one when doing renders.

Source Code & Binary
http://yakiimo3d.codeplex.com/releases/view/42716

Conclusions

I really enjoyed writing this Buddhabrot DirectCompute program. Being able to create a beautiful, interesting image like the Buddhabrot is really cool and made me glad I began learning programming 10+ years ago.

. This entry was posted on Monday, March 29th, 2010 at 10:17 AM and is filed under Demo, DirectCompute, DirectX11. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

10 Responses to “DX11 DirectCompute Buddhabrot & Nebulabrot Renderer”

  1. Jonathan Says:

    Processing contrast like you do makes a big difference to the result – nicely done!

    Something I’ve been meaning to test is how much effect there would be in making writes to the iteration count buffer non-atomic – there would be some loss in precision, but how much? Would it be noticeable?

    I look forward to reading through your code and seeing what I can learn, thanks :)

  2. RCL Says:

    Wow, nice blog about DirectCompute… I’m also in GPGPU, but I’m using CUDA and OpenCL (trying to stay away from Microsoft (re)inventions). Would you mind to comment about DirectCompute performance as compared to OpenCL (and probably CUDA, if you happen to own NVidia, too)? I’m curious to try DirectCompute as well, but right now I’m devoting all my time to polishing my current (OpenCL) project. Thanks in advance :)

  3. yakiimo02 Says:

    Hi, thanks for the comments!

    @Jonathan
    When I tried making the writes to the iteration buffer non-atomic, contrary to what I would have expected, I noticed performance actually decreased (pretty big too. maybe 20%). I’m assuming that more so than the cost of atomic operations, for GPGPU, it’s the global memory writes itself that slows down performance. Don’t really see a way to move writes to shared local memory, so didn’t try optimizing. I also didn’t keep rendering long enough to see how non-atomic writes affect the render results of the actual image itself.

    @RCL
    I started GPGPU programming only last November and my first language was DirectCompute, so I can’t really comment on OpenCL and CUDA. In terms of DirectCompute performance, I earlier wrote a DirectCompute Mandelbrot implementation (as you know, Mandelbrot is pure computation and very little data access so very GPGPU friendly) and that was 50x+ faster. I’ve read about other ppl’s OpenCL and Cuda Mandelbrot implementations with similiar numbers, but different hardware and programmer, so can’t really compare. Good luck with your OpenCL project!

  4. DX11 DirectCompute Buddhabrot Demo - 3D Tech News, Pixel Hacking, Data Visualization and 3D Programming - Geeks3D.com Says:

    [...] [source] | [via] Related posts:DX11 DirectCompute Demos: Waves and Fractals [...]

  5. List of Direct3D demos « tse: Says:

    [...] Buddhabrot & Nebulabrot Renderer (yakiimoo2) – Compute Shader 5.0 [...]

  6. Keldor Says:

    Your threadgroup size looks problematic – you should make your thread group size divisible by 32, or better yet, 64, so that it maps to the physical SIMD size of the GPU’s processors. I’d recommend setting numthread to [128,1,1] or [256,1,1] as well as possibly increasing the number of thread groups since in general kernels can’t overlap.

    As for random number generation, here’s the simple algorithm I use in Flam4 – it’s in OpenCL, so you’ll have to make some minor changes, but it should be easy enough.

    constant int shift1[4] = {6, 2, 13, 3};
    constant int shift2[4] = {13, 27, 21, 12};
    constant int shift3[4] = {18, 2, 7, 13};
    constant unsigned int offset[4] = {4294967294, 4294967288, 4294967280, 4294967168};

    unsigned int TausStep(unsigned int z, int s1, int s2, int s3, unsigned int M)
    {
        unsigned int b = (((z << s1) ^ z) >> s2);
        return (((z & M) << s3) ^b);
    }

    unsigned int RAND_INT(local volatile unsigned int* randStates)
    {
        unsigned int index = get_local_id(0);
        unsigned int i2 = get_local_id(0)&~31;
        randStates[index] = TausStep(randStates[index], shift1[index&3], shift2[index&3], shift3[index&3], offset[index&3]);
        return (randStates[index&31+i2]^randStates[(index+1)&31+i2]^randStates[(index+2)&31+i2]^randStates[(index+3)&31+i2]);
    }

    The only thing to remember is that shared memory is not preserved across kernel invocations, so you’ll have to load/store the state buffer each thread. This will also let you have each thread run a certain number of iterations, rather than one sample, which should be a big performance win since the number of iterations a given point requires varies wildly, meaning that with one point per thread, you end up with a significant number of threads finished, but waiting on a few stragglers in their group.

  7. yakiimo02 Says:

    Hi Keldor,

    Thanks for taking the time to write the above advice. Seems like good stuff.

    I’ve read about ideal thread group and thread sizes taking into consideration hardware, but I wasn’t careful when writing my implementation. I’ll follow your advice and try out different numbers.

    Thanks for the sample RNG code. Hmm yes, if I generate RNG in the compute shader, I could keep sampling and plotting until a goal iteration count is met in each thread, potentially eliminating the current large # of finished idle threads. This could result in a huge performance gain, and to be honest I hadn’t realized this.

    I’m interested in the performance impact both your optimization suggestions make, so I’ll definitely try them out soon (hopefully this weekend.)

    GPU implementation of Flam4 seems like cool stuff! Will take a look at it tomorrow. Gotta go to bed now, so I can get up at 8 and go to work tomorrow.

  8. Mike Says:

    Hi and thanks a lot for this detailed blog.
    I m just beginning with directcompute as I would need to display a scrolling text banner with transparency over a video. For the moment I ve succeded in displaying a red circle with transparency in the corners so I can see through the background movie. However I m not really sure about how I could use directcompute to display the scrolling text through the gpu using directcompute .could you give me your opinion? As I need a perfectly smooth banner and also low CPU usage.
    Thanks
    mike

  9. yakiimo02 Says:

    Hi, thanks for reading my blog! I don’t think there is much more than regular alphablended or alphatested rasterization that you can do in order to display your scrolling text bar. Display a scrolling text bar seems to be just a rasterization problem instead of computation problem, so I don’t think DirectCompute is going to help. I don’t have much experience with programming movie playback so maybe I’m misunderstanding the question, but that’s my answer.

  10. Butaman Renderer - smallpt Says:

    [...] I looked at the source code of smallpt a couple of months ago after the OpenCL GPGPU implementation of it, SmallptGPU (http://davibu.interfree.it/opencl/smallptgpu/smallptGPU.html), came out. I was learning DirectCompute and thought it would be good for me to write a DirectCompute version of smallpt, but instead of doing that I ended up writing a DirectCompute Buddhabrot renderer (http://www.yakiimo3d.com/2010/03/29/dx11-directcompute-buddhabrot-nebulabrot/). [...]

Leave a Reply



XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

It may take some time for your comment to appear, it is not necessary to submit it again.