Showing posts with label OpenGL.

Tuesday, June 13, 2017

fp64 approximations for sin/cos for OpenGL

Minimax approximations for the fp64 functions missing in GLSL, generated with the remez exchange toolbox. For additional details see Double precision approximations for map projections in OpenGL.

//sin approximation, error < 5e-9
double sina_9(double x)
{
    //minimax coefs for sin for 0..pi/2 range
    const double a3 = -1.666665709650470145824129400050267289858e-1LF;
    const double a5 =  8.333017291562218127986291618761571373087e-3LF;
    const double a7 = -1.980661520135080504411629636078917643846e-4LF;
    const double a9 =  2.600054767890361277123254766503271638682e-6LF;

    const double m_2_pi = 0.636619772367581343076LF;
    const double m_pi_2 = 1.57079632679489661923LF;

    double y = abs(x * m_2_pi);
    double q = floor(y);
    int quadrant = int(q);

    double t = (quadrant & 1) != 0 ? 1 - y + q : y - q;
    t *= m_pi_2;

    double t2 = t * t;
    double r = fma(fma(fma(fma(a9, t2, a7), t2, a5), t2, a3), t2*t, t);

    r = x < 0 ? -r : r;

    return (quadrant & 2) != 0 ? -r : r;
}

//sin approximation, error < 2e-11
double sina_11(double x)
{
    //minimax coefs for sin for 0..pi/2 range
    const double a3 = -1.666666660646699151540776973346659104119e-1LF;
    const double a5 =  8.333330495671426021718370503012583606364e-3LF;
    const double a7 = -1.984080403919620610590106573736892971297e-4LF;
    const double a9 =  2.752261885409148183683678902130857814965e-6LF;
    const double ab = -2.384669400943475552559273983214582409441e-8LF;

    const double m_2_pi = 0.636619772367581343076LF;
    const double m_pi_2 = 1.57079632679489661923LF;

    double y = abs(x * m_2_pi);
    double q = floor(y);
    int quadrant = int(q);

    double t = (quadrant & 1) != 0 ? 1 - y + q : y - q;
    t *= m_pi_2;

    double t2 = t * t;
    double r = fma(fma(fma(fma(fma(ab, t2, a9), t2, a7), t2, a5), t2, a3), t2*t, t);

    r = x < 0 ? -r : r;

    return (quadrant & 2) != 0 ? -r : r;
}




Cosine can be obtained by simply offsetting the argument by π/2:

//cos approximation, error < 5e-9
double cosa_9(double x)
{
    //sin(x + PI/2) = cos(x)
    return sina_9(x + DBL_LIT(1.57079632679489661923LF));
}

//cos approximation, error < 2e-11
double cosa_11(double x)
{
    //sin(x + PI/2) = cos(x)
    return sina_11(x + DBL_LIT(1.57079632679489661923LF));
}
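
A minimal usage sketch, assuming the functions above are pasted into a shader with fp64 enabled (the uniform name below is hypothetical):

#version 400
#extension GL_ARB_gpu_shader_fp64 : require

//sina_11/cosa_11 from above are assumed to be pasted here

uniform dvec2 lon_lat;   //hypothetical: longitude/latitude in radians
out vec4 color;

void main()
{
    //evaluate in double precision, then narrow to float for output
    double s = sina_11(lon_lat.y);
    double c = cosa_11(lon_lat.y);
    color = vec4(float(s), float(c), 0.0, 1.0);
}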

Saturday, February 6, 2016

OpenGL rendering performance test #2 - blocks


Previous test - Procedural grass - focused on procedurally generated geometry that didn’t use any vertex data: meshes were generated procedurally using only the gl_VertexID value and a few lookups into shape textures, so only a negligible amount of non-procedural data was fetched per instance.

This test uses geometry generated from per-instance data (i.e. each produced vertex uses some unique data from buffers), but it is still partially procedural. The source stream originally consisted of data extracted from OpenStreetMap - building footprints broken into quads - meaning there’s a constant-size block of data pertaining to one rendered building block (exactly 128 bytes per block). Later the data were replaced with generated content to achieve a higher density for testing purposes, but the amount of data per block stayed the same.
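
For illustration only - one plausible way to feed fixed-size per-block records to the vertex shader is a texture buffer indexed by gl_InstanceID; the layout (eight vec4s = 128 bytes) and the names below are assumptions, not necessarily what the test does:

//in the vertex shader: hypothetical 128-byte per-block record, 8 x vec4 per instance
uniform samplerBuffer block_data;

void fetch_block(out vec4 rec[8])
{
    int base = gl_InstanceID * 8;
    for (int i = 0; i < 8; ++i)
        rec[i] = texelFetch(block_data, base + i);
}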

Another significant difference is that backface culling is enabled here. With grass the blades are essentially 2D and visible from both sides, which means that the number of triangles sent to the GPU equals the number of potentially visible triangles, minus the blades that fall outside the frustum.

In the case of the building block test, approximately half of the triangles are culled away as back-facing triangles of the boxes. The test measures the number of triangles sent in draw calls, but for a direct comparison with the grass test we’ll also use the visible-triangle metric.

Here are the results with backface culling enabled:

bblocks.png

(Note this graph is missing AMD Fury cards; we’d be grateful if anyone who has one could run the test (link at the end of the post) and submit the results to us.)

What immediately catches attention is the large boost on newer AMD architectures upon reaching 5k triangles per instanced draw call. A boost in that region was also visible in the grass test, though it was more gradual.
We have no idea what’s happening there - or rather, why the performance is so low at smaller sizes. Hopefully AMD folks will be able to provide some hints.

As in the grass test, AMD 380/390 chips perform rather poorly on the smallest meshes (fewer than 40 triangles), which also corresponds to their relatively poor geometry shader performance.

To be able to compare peak performance with the grass test we need to turn off backface culling, getting the following results:

bblockscoff.png

With culling off, the GPU has to render all triangles - twice as many as in the previous test. This shows on Nvidia cards, which take roughly a 30% hit.
On AMD the dip is much smaller, indicating there may be another bottleneck.

Grass test results for comparison (note the slightly different instance mesh sizes). The grass test uses much less per-instance data but has a more complex vertex shader, which apparently suits Nvidia cards better:

grass.png


Conclusion: both vendors show the best performance when the geometry in instanced calls is grouped into 5 - 20k triangles. On AMD GCN 1.1+ chips the 5k threshold seems to be critical, with performance almost doubling.

Test sources and binaries


All tests can be found at https://github.com/hrabcak/draw_call_perf

If anyone wants to contribute their benchmark results, the binaries can be downloaded from here: Download Outerra perf test

There are 3 tests: grass, buildings and cubes. The tests can be launched by running their respective bat files. Each test run lasts only 4-5 seconds, but there are many tested combinations so the total time can be up to 15 minutes.

Once each test completes, you will be asked to submit the test results to us. The test results are stored in a CSV file, and include the GPU type, driver version and performance measurements.
We will be updating the graphs to fill the missing pieces.


Additional test results


More cards added based on test result submissions:




@cameni

Sunday, January 31, 2016

OpenGL rendering performance test #1 - Procedural grass


This is the first of several OpenGL rendering performance tests done to determine the best strategy for rendering various types of procedural and non-procedural content.
This particular test concerns the rendering of small procedurally generated meshes and the optimal way to render them to achieve the highest triangle throughput on different hardware.

Test setup:
  • rendering a large number of individual grass blades using various rendering methods
  • a single grass blade in the test is made of 7 vertices and 5 triangles
    grass-strip.png
  • grass is organized in tiles with 256x256 blades, making 327680 effective triangles, rendered using either:
    • a single draw call rendering 327680 grass triangles at once, or
    • an instanced draw call with different instance mesh sizes and corresponding number of instances, rendering the same scene
  • rendering methods:
    • arrays via glDrawArrays / glDrawArraysInstanced, using a triangle strip with 2 extra vertices to transition from one blade to another (except for instanced draw call with a single blade per instance)
    • indexed rendering via glDrawElements / glDrawElementsInstanced, separately as a triangle strip or a triangle list, with index buffer containing indices for 1,2,4 etc up to all 65536 blades at once
    • a geometry shader (GS) producing blades from point input, optionally also generating several blades in one invocation
      GS performance turned out to be dependent on the number of interpolants used by the fragment shader, so we also tested a varying number of these
  • the position of grass blades is computed from the blade coordinate within the grass tile, derived purely from the gl_InstanceID and gl_VertexID values; rendering does not use any vertex attributes, only small textures storing the ground and grass height, looked up using the blade coordinate (see the sketch after this list)
  • we also tested randomizing the order in which the blades are rendered within a tile; it seems to boost performance a bit on older GPUs
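
A minimal sketch of deriving everything from the IDs (names, tile layout and texture contents below are illustrative; the actual test shaders are in the repository linked at the end):

//vertex shader sketch: blade and vertex derived purely from IDs
uniform sampler2D ground_height;   //per-tile ground elevation
uniform sampler2D grass_height;    //per-blade grass height

const int VERTS_PER_BLADE = 7;     //5 triangles as a strip

void main()
{
    //how the blade index splits between gl_InstanceID and gl_VertexID
    //depends on the rendering method (instance mesh size)
    int blade  = gl_InstanceID;
    int vertex = gl_VertexID % VERTS_PER_BLADE;

    //blade coordinate within the 256x256 tile, no vertex attributes used
    ivec2 bc = ivec2(blade & 255, blade >> 8);
    vec2 uv  = (vec2(bc) + 0.5) / 256.0;

    float ground = textureLod(ground_height, uv, 0.0).r;
    float height = textureLod(grass_height, uv, 0.0).r;

    //build the blade strip procedurally from the vertex index (shape omitted)
    float t = float(vertex / 2) / 3.0;      //height along the blade
    vec3 pos = vec3(vec2(bc), ground + height * t);

    gl_Position = vec4(pos, 1.0);           //view/projection omitted
}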

This test generates lots of geometry from little input data, which might be considered a distinctive property of procedural rendering.

Goals of the test are to determine the fastest rendering method for procedural geometry across different graphics cards and architectures, the optimal mesh size per (internal) draw call, the achievable triangle throughput and factors that affect it.

Results

Performance results for the same scene rendered with instances of varying size. The best results overall were obtained with indexed triangle-list rendering, shown in the following graph as triangle throughput at different instance mesh sizes:

 (updated graph with some recent cards)

On Nvidia GPUs and on older AMD chips (1st gen GCN and earlier) it’s good to keep the mesh size above 80 triangles in order not to lose performance significantly. On newer AMD chips the threshold seems to be much higher - above 1k - and past 20k the performance goes down again.
Unfortunately I haven’t got a Fury card here to test whether this holds for the latest parts, but anyone can run the tests and submit the results to us (links at the end).

In any case, a mesh size of around 5k triangles works well across different GPUs. Interestingly, the two vendors start to have issues at opposite ends - on Nvidia, performance drops with small meshes/lots of instances (increasing CPU-side overhead), whereas AMD cards start having problems with larger meshes (but not with array rendering).

Conclusion: with small meshes, always group several instances into one draw call so that the resulting effective mesh size is around 1 - 20k triangles.

Geometry shader performance roughly corresponds to the performance of instanced vertex-shader rendering with small instance mesh sizes, which in all cases lies below the peak performance. This also shows as a problem on newer AMD cards, with disproportionately low geometry shader performance.
Note that one factor can still shift the disadvantage in some cases - the ability to implement culling as a fast discard in the GS, especially in this test where lots of off-screen geometry can be discarded.
mtris-gs.png
GS performance is also affected by the number of interpolants (0 or 4 floats in the graph), but mainly on Nvidia cards.

The following graph shows the performance as a percentage of each card’s theoretical peak Mtris/s (core clock × triangles per clock of the given architecture). However, the resulting numbers for AMD cards seem to be too low.

Perhaps a better comparison is the performance per dollar graph that follows after this one.


Performance per dollar, using the prices as of Dec 2015 from http://www.videocardbenchmark.net/gpu_value.html


Results for individual GPUs


Arrays are generally slower here because of the 2 extra vertices needed to cross between grass blades in triangle strips; they only match indexed rendering when using per-blade instances, where the extra vertices aren’t needed.

“Randomized” versions just reverse the bits of the computed blade index to ensure that blade positions are spread across the tile. This seems to help a bit on older architectures.
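
With GLSL 4.00+ this can be done with bitfieldReverse; a sketch for a 16-bit blade index in a 256x256 tile (the exact shift is an assumption):

//spread consecutive blade indices across the tile by reversing their bits
uint spread_index(uint blade)   //blade in 0..65535
{
    return bitfieldReverse(blade) >> 16;
}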

AMD7970-grass.png

Unexpected performance drop on smaller meshes on newer AMD cards (380, 390X)
AMDR9_380-grass.png

Slower rendering on Nvidia with small meshes is due to sudden CPU-side overhead.
NV750Ti-grass.png
NV960-grass.png
With more powerful Nvidia GPUs it’s also best to aim for meshes larger than 1k, as the performance bump that is minor elsewhere becomes slightly more prominent here:
NV980Ti-grass.png

Older Nvidia GPUs show comparatively worse geometry shader performance than newer ones:
NV780-grass.png



Test sources and binaries


All tests can be found at https://github.com/hrabcak/draw_call_perf

If anyone wants to contribute their benchmark results, the binaries can be downloaded from here: Download Outerra perf test

There are 3 tests: grass, buildings and cubes. The tests can be launched by running their respective bat files. Each test run lasts around 4-5 seconds, but there are many tested combinations so the total time is up to 15 minutes.

Once each test completes, you will be asked to submit the test results to us. The test results are stored in a CSV file, and include the GPU type, driver version and performance measurements.
We will be updating the graphs to fill the missing pieces.

The other two tests will be described in subsequent blog posts.


@cameni

Sunday, May 11, 2014

Double precision approximations for map projections in OpenGL

Written mainly for my own reference, but someone else may find it useful as well.

The problem: the OpenGL specification/extension ARB_gpu_shader_fp64, which brings support for double-precision operations to the GPU, specifically states that double-precision versions of the angle, trigonometry, and exponential functions are not supported. All we get are the basic add/mul operations and sqrt.

When rendering planetary-scale geometry you'll often hit numerical precision issues, and it's a good idea to avoid using 64-bit floats altogether because they come at a price. The performance hit is usually even larger than it has to be, as some vendors intentionally cripple double-precision operations to widen the gap between their consumer and professional GPU lines.

In some cases it's not worth the trouble of trying to solve both the performance and precision problems at once - for example in tools that process real-world data or project maps. These usually need trigonometric functions when converting between map projection types, and higher precision than 32-bit floats can give.
Unless one can/wants to use OpenCL interoperation, here are some tricks for reducing the needed equations to just the basic supported double-precision operations.

fp64 atan2 approximation


In our case we needed to implement the geographic and Mercator projections. Both need the atan (or better, atan2) function, which we'll get through a polynomial approximation.

An article on the Lol Engine blog colorfully explains why it's not a good idea to use a Taylor series for approximating functions over an interval, and compares it with the Remez/minimax method. What's nice is that the good folks also released a tool for computing minimax approximations, called the remez exchange toolbox. Using this tool we can get a good atan approximation with an adjustable polynomial degree / maximum error.

Here's the source code for the atan2 approximation with an error of less than 5⋅10⁻⁹, computed using the lolremez tool. An angular error of 5⋅10⁻⁹ translates to ~3 cm on the Earth's surface, which is very good for our purposes.

// atan2 approximation for doubles for GLSL
// using http://lolengine.net/wiki/doc/maths/remez

double atan2(double y, double x)
{
    const double atan_tbl[] = {
    -3.333333333333333333333333333303396520128e-1LF,
     1.999999117496509842004185053319506031014e-1LF,
    -1.428514132711481940637283859690014415584e-1LF,
     1.110012236849539584126568416131750076191e-1LF,
    -8.993611617787817334566922323958104463948e-2LF,
     7.212338962134411520637759523226823838487e-2LF,
    -5.205055255952184339031830383744136009889e-2LF,
     2.938542391751121307313459297120064977888e-2LF,
    -1.079891788348568421355096111489189625479e-2LF,
     1.858552116405489677124095112269935093498e-3LF
    };

    /* argument reduction: 
       arctan (-x) = -arctan(x); 
       arctan (1/x) = 1/2 * pi - arctan (x), when x > 0
    */

    double ax = abs(x);
    double ay = abs(y);
    double t0 = max(ax, ay);
    double t1 = min(ax, ay);
    
    double a = 1 / t0;
    a *= t1;

    double s = a * a;
    double p = atan_tbl[9];

    p = fma( fma( fma( fma( fma( fma( fma( fma( fma( fma(p, s,
        atan_tbl[8]), s,
        atan_tbl[7]), s, 
        atan_tbl[6]), s,
        atan_tbl[5]), s,
        atan_tbl[4]), s,
        atan_tbl[3]), s,
        atan_tbl[2]), s,
        atan_tbl[1]), s,
        atan_tbl[0]), s*a, a);

    double r = ay > ax ? (1.57079632679489661923LF - p) : p;

    r = x < 0 ?  3.14159265358979323846LF - r : r;
    r = y < 0 ? -r : r;

    return r;
}


Mercator projection

While the geographic projection can do with the above atan2 function to compute the longitude and latitude, Mercator uses a different mapping for the y-axis, to preserve the shapes:

y = ln( tan(π/4 + φ/2) )

Where φ is the latitude angle. Bad luck: neither the ln nor the tan function is available with double arguments. A logarithm approximation doesn't give good precision at a reasonable polynomial degree, but it turns out we actually don't need to approximate either the ln or the tan function.

It follows that:

tan(π/4 + φ/2) = tan φ + sqrt(tan^2 φ + 1),   so   y = ln( tan φ + sqrt(tan^2 φ + 1) )

Given that, we can get the value of tan φ directly from the 3D coordinates (assuming an ECEF coordinate system on a spherical reference):

tan φ = z / sqrt(x^2 + y^2)

Meaning that the inner logarithm expression now consists solely of operations that are supported by the ARB_gpu_shader_fp64 extension.

Finally, we can get rid of the logarithm itself by utilizing the identity:

ln(a) − ln(b) = ln(a/b)

and using the single-precision logarithm function to compute a delta from a reference value computed on the CPU. Typically that means computing sqrt(k^2+1) + |k| (with k = tan φ) for the screen coordinate center, and using the computed logarithm difference directly as the projected screen y coordinate.

Since the resulting value of a/b is very close to 1 in the cases where we need the extra double precision (zoomed in), we can even use the crudest ln approximation around 1, ln(x) ≅ 2*(x−1)/(x+1), evaluated with double arithmetic.
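
Putting the pieces together, a sketch of the shader-side computation (the reference uniforms, names and the spherical ECEF convention are assumptions, not the exact production code):

//reference values precomputed on the CPU in full precision
uniform double ref_inner;   //|tan(phi_ref)| + sqrt(tan(phi_ref)^2 + 1)
uniform double ref_y;       //ln(ref_inner), the Mercator y of the reference latitude

double mercator_y(dvec3 ecef)
{
    //tan(phi) from ECEF coordinates (spherical approximation)
    double k = ecef.z / sqrt(ecef.x*ecef.x + ecef.y*ecef.y);

    //inner expression of the Mercator formula, fp64-supported ops only
    double inner = abs(k) + sqrt(k*k + 1.0LF);

    //ln(inner) = ref_y + ln(inner/ref_inner); the ratio is close to 1 when zoomed in,
    //so the crude approximation ln(x) ~ 2*(x-1)/(x+1) is sufficient
    double x = inner / ref_inner;
    double y = ref_y + 2.0LF * (x - 1.0LF) / (x + 1.0LF);

    return ecef.z < 0.0LF ? -y : y;     //restore the sign of the latitude
}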

@cameni

Thursday, November 15, 2012

OpenGL Notes #2: Texture bind performance

Some time ago we encountered a strange performance slowdown in our object renderer on NVIDIA hardware. I didn't have mesh sorting by material id implemented yet; I knew this could be an issue, but I thought it would cost 1 ms at most. The actual slowdown was way bigger, and I later found out it was indeed caused by glBindMultiTextureEXT. So I ran a couple of tests: first using one big texture (2048x2048) for all meshes, then adding the sorting by material id. You can see the times for the final object pass in Table 1.

TABLE 1: Objects in the test scene had about 400k tris, ~250 meshes and 42 textures;
all textures had the same format (diffuse DXT1, normal 3Dc/ATI2 etc.).
NVIDIA drivers: 306.97, 306.94 (Nsight) and 310.33; AMD driver: 12.10

                          NVIDIA 460 1GB (ms)   AMD 6850 (ms)
  without sorting                15.0                5.0
  one texture                     3.3                4.0
  sort by mesh/material           6.8                4.3

Those numbers are pretty bad for NVIDIA. But the weird thing is that we have no such issue in the terrain renderer, which uses unique textures for every terrain tile - about ~200 meshes/textures per frame - and the performance there is great. There is no difference in the render states; the object renderer follows immediately after the terrain renderer without any render state changes. I tried a lot of things - a simple shader, non-DSA functions for texture binding, rendering without terrain etc. - without luck; every time I started to change the textures per mesh I hit the wall (I used two textures for this test and the time was almost 15 ms).

I moved to NVIDIA Nsight to find out where exactly this slowdown occurs, and found that every time glBindMultiTextureEXT is called, the following draw call (in this case glDrawRangeElements) takes much longer, on both the CPU and the GPU side. With a debug rendering context there were no performance warnings. Full of frustration, I was browsing the glCapsViewer database and comparing texture caps between NVIDIA and AMD. One interesting number drew my attention: MAX_COMBINED_TEXTURE_IMAGE_UNITS is 96 to 192 texture image units on NVIDIA SM4 hardware, but only 32 on AMD (the spec says it should be at least 48 for GL 3.3 - at least 16 per stage, with three stages in GL 3.3). This number is how many textures you can have bound at once via glActiveTexture(GL_TEXTURE0 + unit); a single shader stage can still use only 16-32 textures from this budget, selected by glUniform1i calls.

The idea that came to me was to bind as many textures as possible at once (in my case there were 160 units available on the NV 460 - more than enough for my test scene) and render all the meshes without any texture binding in between. Additionally, per mesh I issue one glUniform1iv call that tells the shader which texture unit to use. This uniform call is a fast one. (Usually I'm using glVertexAttrib to pass mesh-specific parameters to the vertex shader; it would be nice to know which way is faster, but I think the difference is negligible.) The result was more than interesting, see Table 2.
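
On the shader side this boils down to indexing a sampler array with a per-mesh uniform; a sketch assuming GLSL 4.00+, where sampler arrays may be indexed by a dynamically uniform expression (the names and the array size are illustrative):

#version 400

uniform sampler2D diffuse_maps[26];   //one bind group; size = free bindable units
uniform int tex_unit;                 //set per mesh with glUniform1i(v)

in vec2 uv;
out vec4 color;

void main()
{
    color = texture(diffuse_maps[tex_unit], uv);
}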

TABLE 2: The same scene. NVIDIA drivers: 306.97, 306.94 (Nsight) and 310.33; AMD driver: 12.10

                          NVIDIA 460 1GB (ms)   AMD 6850 (ms)
  one texture                     3.3                4.0
  sort by mesh/material           6.8                4.3
  texture bind group              3.36               4.1


I was curious how this would work on AMD, where the limit is 32 image units. I created so-called "texture bind groups" with the max size set to MAX_COMBINED_TEXTURE_IMAGE_UNITS minus the number of units needed for other stuff (in my case that leaves 26 freely bindable units on AMD) and split the meshes into these groups. The result was better than the version with sorting, so I kept this implementation for both vendors.

I would still be interested to learn what causes such a big slowdown on the NVIDIA architecture.

Monday, June 18, 2012

A couple of OpenGL notes

A couple of notes mostly for my own reference, and for the unlucky souls who google through here after encountering similar problems.

A floating-point depth buffer


Some time ago I wrote about the possible use of a reversed-range floating-point depth buffer as an alternative to the logarithmic depth buffer (see also +Thatcher Ulrich's post about logarithmic buffers). A reversed floating-point depth buffer has some advantages over the logarithmic one we are still using today - mainly, there's no need to write depth values in the fragment shader for objects close to the camera to suppress depth bugs caused by insufficient geometry tessellation, where the depth values interpolated across a polygon diverge significantly from the required logarithmic value.

Writing depth values in a fragment shader causes a slight performance drop because of the additional bandwidth the depth values take, and also because it disables various hardware optimizations related to depth buffers. However, we found the performance drop to be quite small in our case, so we didn't even test the floating-point buffer back then.
A disadvantage of a floating-point depth buffer is that it takes 32 bits, and when you also need a stencil buffer it no longer fits into a 32-bit alignment and thus consumes twice as much memory as a 24-bit integer depth + 8-bit stencil buffer (a hypothetical 24-bit floating-point format would have insufficient resolution, inferior to the logarithmic buffer in our case).

Recently we have been doing some tests of our new object pipeline, and decided to test the reversed floating-point buffer to see if the advantages can outdo the disadvantages.

However, a new problem arose that's specific to OpenGL: since OpenGL depth values use the normalized device range of -1 to 1, techniques that rely on the higher precision of floating-point values close to zero cannot normally be used, because the far plane is mapped to -1 and not to 0.

Alright, I thought, I'll sacrifice half of the range and simply output values in the 1..0 range from the vertex shader; the precision will still be sufficient.
However, OpenGL does an implicit conversion from NDC to the 0..1 range to compute the actual values written to the depth buffer. That conversion is defined as 0.5*(f-n)*z + 0.5*(f+n), where n, f are the values given by glDepthRange. With the default values of 0, 1 this means that the z value output from the vertex shader gets remapped by 0.5*z + 0.5.
Since the reversed FP depth buffer relies on better precision near 0, adding 0.5 to values like 1e-12 discards most of their significant bits, rendering the technique unusable.

Calling glDepthRange with parameters -1, 1 would solve the problem (making the conversion 1*z + 0), but the spec explicitly says that "The parameters n and f are clamped to the range [0; 1], as are all arguments of type clampd or clampf".
On Nvidia hardware the NV_depth_buffer_float extension comes to the rescue, as it allows setting non-clamped values, but alas it's not supported on AMD hardware, and that's the end of the journey for now.

Update: more complete and up-to-date info about the use of a reversed floating-point buffer can be found in the post Maximizing Depth Buffer Range and Precision.


Masquerading integer values as floats in outputs of fragment shaders


Occasionally (in our case quite often) there's a need to combine floating-point values with integer values in a single texture. For example, our shader that generates the road mask (to be applied over the terrain geometry in another pass) needs to output the road height as a float, together with the road type and flags as integers. Since the shader instructions uintBitsToFloat and floatBitsToUint are cheap (they are just type casts), this can be easily done.

However, there's a problem: if you use a floating-point render target for this purpose, the exact bit representation of the masqueraded integer values may not be preserved on AMD hardware. The blending unit, even though inactive, can meddle with the floating-point values, altering them in a way that doesn't change their interpretation as floats much, but does change the bits in some cases.

In this case you need to use an integer render target and cast the floating point values to int (or uint) instead.
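
A sketch of the safe variant with an integer render target (the mask layout and names are hypothetical):

#version 330

//fragment shader writing to an integer render target, e.g. GL_RGBA32UI
layout(location = 0) out uvec4 road_mask;

in float road_height;      //hypothetical inputs
flat in uint road_type;
flat in uint road_flags;

void main()
{
    //float bits are stored losslessly in an integer channel; integers stay integers
    road_mask = uvec4(floatBitsToUint(road_height), road_type, road_flags, 0u);
}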