Tuesday, June 11, 2013

Small optimization trick for heavy shaders

As you may know, a depth buffer lets you draw 3D objects in any order and still get a correct image: the closer object obscures the farther one. But draw order still matters for performance. For example, consider a scene that you draw from back to front. You draw an object, it passes the depth test for a particular pixel, and so the pixel shader is executed for that pixel. Next you draw another object. Since the order is back to front, this object is closer, so it also passes the depth test and the pixel shader is executed again for the same pixel. And so on: every overlapping object pays the full pixel shader cost, even though only the closest one stays visible.

There's an obvious solution: sort the objects front to back. Then, on modern hardware, the early depth test rejects obscured pixels before the pixel shader runs. But this sorting has to be done on the CPU, and that can be expensive.
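For reference, the CPU-side sort mentioned above could look roughly like the following. This is only a minimal sketch, not code from the demos: the function name `sortFrontToBack` and the idea of keeping object positions in a `Vector.<Vector3D>` are assumptions made for illustration, and a real engine would sort its own object type and probably avoid allocating a temporary vector in every comparison.

```actionscript
import flash.geom.Vector3D;

// Hypothetical helper: orders object positions front to back (nearest first).
// `positions` and `cameraPosition` are illustrative names, not part of any API.
function sortFrontToBack(positions:Vector.<Vector3D>, cameraPosition:Vector3D):void {
    positions.sort(function(a:Vector3D, b:Vector3D):Number {
        // Squared distance is enough for ordering and avoids a square root.
        // Note: subtract() allocates a new Vector3D on every call.
        var da:Number = a.subtract(cameraPosition).lengthSquared;
        var db:Number = b.subtract(cameraPosition).lengthSquared;
        return da - db; // negative result: a is closer and is drawn first
    });
}
```

The sort has to be redone whenever the camera or the objects move, which is the per-frame CPU cost the trick below tries to avoid.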

I recently found an interesting trick that simulates such sorting:
  • Disable color write. Enable depth write. Set the depth test to Context3DCompareMode.LESS. Render the scene to the back buffer.
  • Enable color write. Disable depth write. Set the depth test to Context3DCompareMode.EQUAL. Render the scene to the back buffer again.
In plain English it looks like this: in the first pass you write only to the depth buffer, which is a very fast operation. You do the usual depth test and update the depth buffer only when the new fragment is closer to the camera. In the second pass you don't update the depth buffer, but you compare the depth of the current fragment with the depth stored in the buffer (after the first pass it contains only the "closest" pixels). So if the current pixel is obscured, its depth doesn't match and the hardware rejects it before the heavy pixel shader runs.

In code:
// Pass 1: depth only
_context3D.setColorMask(false, false, false, false); // disable color write
_context3D.setDepthTest(true, Context3DCompareMode.LESS); // enable depth write, < compare mode
// ... draw all objects here ...

// Pass 2: color only
_context3D.setColorMask(true, true, true, true); // enable color write
_context3D.setDepthTest(false, Context3DCompareMode.EQUAL); // disable depth write, == compare mode
// ... draw all objects again ...
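To show where the draw calls fit, here is a hedged sketch of a whole frame built around the two state changes above. The names `renderFrame` and `drawScene` are hypothetical: `drawScene` is assumed to be a callback that binds the program, buffers and constants and calls `drawTriangles()` once per object. One assumption worth stating: the EQUAL test only works if both passes produce exactly the same depth values, so the vertex transform should be identical in both passes.

```actionscript
import flash.display3D.Context3D;
import flash.display3D.Context3DCompareMode;

// Hypothetical helper, not from the original demos.
// drawScene is assumed to issue one drawTriangles() call per object.
function renderFrame(context3D:Context3D, drawScene:Function):void {
    // Color writes are enabled at this point (default state, or left on by
    // pass 2 of the previous frame). Default arguments clear color to black
    // and depth to 1.0.
    context3D.clear();

    // Pass 1: fill the depth buffer with the closest depth per pixel.
    context3D.setColorMask(false, false, false, false);
    context3D.setDepthTest(true, Context3DCompareMode.LESS);
    drawScene();

    // Pass 2: shade only the fragments whose depth equals the stored one.
    context3D.setColorMask(true, true, true, true);
    context3D.setDepthTest(false, Context3DCompareMode.EQUAL);
    drawScene();

    context3D.present();
}
```

The scene is submitted twice, so the number of draw calls and the vertex work double; the trick pays off only when the pixel shader cost saved in the second pass is larger than that.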


Here are two demos that illustrate the concept.
This demo uses brute force: it makes a draw call for every object without any depth buffer state changes:


This demo uses the trick described above:


P.S.: some interesting notes. On my desktop I first used teapots with Phong lighting. The fps started to drop at approximately 300 teapots (draw calls), and the state changes made no difference at all. That's because the GPU is so fast that the fps drop was caused by the number of draw calls alone. Next, I added dummy instructions to my vertex and pixel shaders. Many instructions, up to the limit. And again the same thing: the fps dropped at 300 teapots. So I could measure a difference only with texture sampling, which was the heaviest operation on the GPU in my tests. In the demos above I used about 60 tex instructions. But all of this applies to a good desktop video card. On laptops and mobile devices things are different...

2 comments:

  1. Does it depend on the hardware?
    It seems that pre-sorting + no depth test is faster...

  2. I think it depends. For example, I have noticed that switching programs on old hardware (an old laptop) is very expensive. And since this trick uses 2 passes, it can be slower than CPU sorting + one pass. I also think that geometry complexity is a huge factor: for example, if you have 1 million triangles, then 2 passes will be too costly.