A Detailed Study of the Numerical Accuracy of GPU-Implemented Math Functions
Dan Fay[1], Ali Sazegari[1], and Dan Connors[2]
[1] Apple Computer, Inc.
Introduction

Motivation: Modern programmable GPUs have demonstrated their ability to significantly accelerate important classes of non-graphics applications; however, GPUs' substandard support for floating-point arithmetic can limit their usefulness in some algorithms. Previous studies of GPUs' numerical accuracy[2][3][4] quantified only the "overall" accuracy of different arithmetic and math functions on the GPU by providing an average error and/or an error bound for each operation. Many algorithms also require correct behavior for edge cases on the floating-point number line:
• Subnormal (denormal) numbers – Allow for gradual precision loss when working with very small magnitude numbers.
• Infinities – Provide a way to describe numbers with magnitudes too large to represent as normal numbers.
• Not a Number (NaN) – Used to close the number line for invalid operations, such as 0/0. NaNs can carry a “payload” useful for debugging by having a specific bit pattern in their lower bits.
• +/-0 – A negative zero describes an extremely tiny negative number that is smaller in magnitude than the smallest possible subnormal number.
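The four edge-case categories above can be told apart from the exponent and fraction fields of a single-precision bit pattern. The following is a minimal sketch in Python, not code from the study; the helper names (float32_bits, classify_bits, nan_payload) are hypothetical, and the payload is assumed here to be the fraction bits below the quiet bit.

```python
import struct

def float32_bits(x):
    """Return the IEEE-754 single-precision bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def classify_bits(bits):
    """Classify a 32-bit pattern as zero, subnormal, normal, infinity, or nan."""
    exponent = (bits >> 23) & 0xFF   # 8-bit biased exponent field
    fraction = bits & 0x7FFFFF       # 23-bit fraction field
    if exponent == 0xFF:
        return "infinity" if fraction == 0 else "nan"
    if exponent == 0:
        return "zero" if fraction == 0 else "subnormal"
    return "normal"

def nan_payload(bits):
    """Fraction bits below the quiet bit (bit 22); an assumed payload layout."""
    return bits & 0x3FFFFF
```

Working on bit patterns rather than on floating-point values keeps the sign of zero and the NaN payload visible: 0x80000000 is classified as a zero even though its sign bit is set, and 0x7FC00123 is a NaN whose payload under this layout is 0x123.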
Experimental Methodology

ATi Platform:
• Machine: Core Duo iMac
• Operating System: OS X 10.4.7
• GPU: ATi Radeon x1600

nVIDIA Platform:
• Machine: Core 2 Duo iMac
• Operating System: OS X 10.4.7
• GPU: nVIDIA GeForce 7300GT
Input Data: Test vectors were drawn from Jerome Coonen’s Ph.D. thesis, “Contributions to a Proposed Standard for Binary Floating-Point Arithmetic”[6], and supplemented with additional tests designed to exercise algorithm-specific edge cases.
Test Functions: The basic operator tests examine the accuracy of the fundamental arithmetic and comparison operators, which form the building blocks of any numerical code. The built-in function tests cover the math functions provided by GLSL. Finally, the ported vForce functions are code directly translated from Apple Computer’s vForce math library.
Comparing the Results: The first set of columns reports the percentage of test cases that pass; higher numbers are better. The remaining columns report the percentage of test cases that fail; lower numbers are better. With the exception of the comparison operators, failed test cases are classified by the type of number in one or more of the results: subnormal numbers, infinities, NaNs, +/-0s, and normals. To compare the ATi and nVIDIA GPUs, each cell is color-coded with one of three colors: red if that GPU fails more test cases than the other GPU, blue if it passes more test cases than the other GPU, and yellow if there is a tie.
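The bookkeeping behind these columns can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the study's harness: a test passes only on a bit-exact match, a failure is filed under the category of the expected result, and the names summarize, classify_bits, and cell_color are hypothetical.

```python
from collections import Counter

CATEGORIES = ("pass", "subnormal", "infinity", "nan", "zero", "normal")

def classify_bits(bits):
    """Classify a 32-bit single-precision pattern by its exponent/fraction fields."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:
        return "infinity" if fraction == 0 else "nan"
    if exponent == 0:
        return "zero" if fraction == 0 else "subnormal"
    return "normal"

def summarize(cases):
    """Return the percentage of cases per category.

    cases is a sequence of (expected_bits, actual_bits) pairs.
    Assumption: bit-exact equality is a pass; failures are grouped by
    the class of the expected result.
    """
    cases = list(cases)
    if not cases:
        raise ValueError("no test cases")
    counts = Counter()
    for expected, actual in cases:
        if expected == actual:
            counts["pass"] += 1
        else:
            counts[classify_bits(expected)] += 1
    return {name: 100.0 * counts[name] / len(cases) for name in CATEGORIES}

def cell_color(own_pass_pct, other_pass_pct):
    """Color for a Pass % cell: blue if better, red if worse, yellow on a tie."""
    if own_pass_pct > other_pass_pct:
        return "blue"
    if own_pass_pct < other_pass_pct:
        return "red"
    return "yellow"
```

Because every test case lands in exactly one category, the percentages in a row sum to 100 up to rounding, which is a useful sanity check when reading the tables.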
Goals of this Study:
• Study the numerical accuracy of basic arithmetic functions on the GPU by testing important edge cases.
• Test the built-in mathematical functionality provided by the OpenGL Shading Language, GLSL.
• Investigate the accuracy of an existing high-performance math library, vForce[9], when it is ported to the GPU.
• Investigate the consistency of results between GPU vendors by comparing the results provided by current ATi and nVIDIA GPUs.

References
[1] "IEEE 754: Standard for Binary Floating-Point Arithmetic." Available at http://grouper.ieee.org/groups/754/ .
[2] Karl E. Hillesland and Anselmo Lastra, "GPU Floating-Point Paranoia." In Proc. GP2, August 2004.
[3] "GPUBench Test: Precision." Available at http://graphics.stanford.edu/projects/gpubench/test_precision.html .
[4] Guillaume Da Graca and David Defour, "Implementation of float-float operators on graphics hardware." In Proc. 7th Conference on Real Numbers and Computers, July 2006.
[5] David Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic." Available at http://docs.sun.com/source/8063568/ncg_goldberg.html .
[6] Jerome T. Coonen, "Contributions to a Proposed Standard for Binary Floating-Point Arithmetic." PhD dissertation, Univ. of California, Berkeley, 1984.
[7] John Kessenich, "The OpenGL Shading Language." Available at http://www.opengl.org/registry/specs/ARB/GLSLangSpec.Full.1.20.6.pdf .
[8] Steven Moshier, "Cephes Mathematical Library." Available at http://www.moshier.net/#Cephes .
[9] "Vector Libraries." Available at http://developer.apple.com/hardwaredrivers/ve/vector_libraries.html .

Author affiliation: [2] University of Colorado

Accuracy of Basic Operators

Op    # Tests  Pass %        Subnorm %     Infinity %    NaN %         +/-0 %        Normal %
               ATi    NV     ATi    NV     ATi    NV     ATi    NV     ATi    NV     ATi    NV
+     267      79.8   79.8   16.5   16.5   0.749  0.749  0.00   0.00   0.375  0.375  2.62   2.62
-     267      79.8   79.8   16.5   16.5   0.749  0.749  0.00   0.00   0.375  0.375  2.62   2.62
*     311      60.8   66.2   22.2   19.6   0.00   0.00   5.14   3.86   8.36   8.36   3.54   1.93
/     345      56.2   62.3   11.0   7.25   5.51   5.22   5.22   6.67   11.0   9.86   11.0   8.70

Op    # Tests  Pass %
               ATi    NV
<     400      77.0   100
<=    400      96.0   100
>     400      77.0   100
>=    400      96.0   100
==    400      92.5   100
!=    400      92.5   100
Discussion:
• Neither GPU can correctly handle subnormal numbers using their basic arithmetic operators. Both GPUs flush subnormal numbers to zero.
• Unlike the ATi GPU, the nVIDIA GPU can correctly compare subnormals.
• Neither GPU properly treats -0 as a negative number: both GPUs treat it as a positive number.
• Per the IEEE-754 standard[1], if a NaN is encountered as an input, the function should return the exact same NaN payload. Neither GPU does this.
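The effect of flushing subnormals to zero can be emulated on a CPU. The sketch below is an assumption-laden Python model, not the GPUs' actual datapath: it rounds to single precision with the struct module and flushes both the inputs and the result of a subtraction. The helper names (to_float32, flush_to_zero, sub_ieee, sub_ftz, is_negative) are hypothetical.

```python
import math
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def float32_bits(x):
    """Return the single-precision bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def flush_to_zero(x):
    """Replace a single-precision subnormal with a zero of the same sign."""
    bits = float32_bits(x)
    if (bits >> 23) & 0xFF == 0:      # exponent field 0: zero or subnormal
        return -0.0 if bits >> 31 else 0.0
    return to_float32(x)

def sub_ieee(a, b):
    """Single-precision subtraction with gradual underflow."""
    return to_float32(to_float32(a) - to_float32(b))

def sub_ftz(a, b):
    """Subtraction modeled with flushed inputs and a flushed result."""
    return flush_to_zero(flush_to_zero(a) - flush_to_zero(b))

def is_negative(x):
    """True if the sign bit is set, including for -0.0."""
    return math.copysign(1.0, x) < 0.0

MIN_NORMAL = 2.0 ** -126              # smallest normal single-precision value
a = 1.5 * MIN_NORMAL
b = MIN_NORMAL
```

With gradual underflow, sub_ieee(a, b) is the subnormal 2^-127, so a != b implies a - b != 0. In the flush-to-zero model, sub_ftz(a, b) is 0.0 even though a and b differ, which breaks code that relies on that implication. The sign of zero is likewise invisible to ==, since -0.0 == 0.0 is true, so is_negative inspects the sign bit instead.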
Conclusion
• The nVIDIA GPU produces somewhat better quality results than the ATi GPU.
• Implementing a custom math library for the GPU can produce better results.
Future Work
• Test all of the vForce functions (pow, sinh, cosh, tanh, asinh, acosh, and atanh were not tested).
• Thoroughly test the normal number line, in addition to the edge cases tested here.
• Study the performance of the built-in GLSL functions versus the ported vForce functions.
Accuracy of Built-in GLSL Functions

Fn     # Tests  Pass %        Subnorm %     Infinity %    NaN %         +/-0 %        Normal %
                ATi    NV     ATi    NV     ATi    NV     ATi    NV     ATi    NV     ATi    NV
sqrt   63       22.2   17.5   1.59   0.00   0.00   0.00   50.8   50.8   1.59   1.59   23.8   30.2
sin    821      26.9   33.7   5.60   4.87   1.95   1.95   0.00   0.00   0.244  0.244  65.3   59.2
cos    791      28.4   35.5   1.14   0.00   0.00   0.00   0.00   0.00   0.00   0.00   70.4   64.5
tan    1152     2.69   3.47   1.91   11.3   1.30   0.00   0.00   0.00   0.694  0.694  93.4   84.5
asin   96       0.00   12.5   10.4   10.4   0.00   0.00   59.4   49.0   0.00   2.08   30.2   26.0
acos   96       0.00   25.0   0.00   0.00   0.00   0.00   59.4   49.0   0.00   0.00   40.6   26.0
atan   96       2.08   12.5   10.4   10.4   0.00   0.00   10.4   0.00   2.08   2.08   75.0   75.0
log    37       13.5   59.5   0.00   0.00   5.4    0.00   56.8   18.9   0.00   0.00   24.3   21.6
exp    28       100    100    0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
log2   44       18.2   20.5   0.00   0.00   18.2   18.2   40.1   40.1   0.00   0.00   22.7   20.5
exp2   28       100    100    0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
Discussion:
• log2 and exp2 are also important basic functions because they are used, respectively, to extract and to alter the exponent of a floating-point number.
• Overall, the quality of the results for the nVIDIA GPU system is better than that of the results produced by the ATi GPU system.
• ATi and nVIDIA are fairly consistent in their results for many of the floating-point specials.
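The role of log2 and exp2 in exponent handling can be illustrated on the CPU. This is a minimal Python sketch, assuming a finite, nonzero, normal input; the function names are hypothetical, and math.frexp and math.ldexp serve only as exact references for comparison.

```python
import math

def exponent_via_log2(x):
    """Unbiased binary exponent e with 1 <= |x| / 2**e < 2, computed via log2."""
    return math.floor(math.log2(abs(x)))

def exponent_via_frexp(x):
    """Same exponent from frexp, which returns a mantissa in [0.5, 1)."""
    return math.frexp(x)[1] - 1

def scale_via_exp2(x, k):
    """Alter the exponent of x by k, i.e. multiply by exp2(k)."""
    return x * (2.0 ** k)
```

For example, exponent_via_log2(10.0) is 3 and scale_via_exp2(1.5, 4) is 24.0, matching math.ldexp(1.5, 4). The floor in exponent_via_log2 makes the result sensitive to errors in log2 near powers of two: a log2 that is off by even a small amount just below a power of two yields an exponent that is off by one, which is why the accuracy of these two functions matters for code built on top of them.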
Accuracy of Ported vForce Functions

Fn     # Tests  Pass %        Subnorm %     Infinity %    NaN %         +/-0 %        Normal %
                ATi    NV     ATi    NV     ATi    NV     ATi    NV     ATi    NV     ATi    NV
div    345      53.0   55.1   8.70   9.27   8.70   5.22   14.2   15.9   9.86   9.86   5.51   4.64
sqrt   63       76.2   93.7   0.00   0.00   0.00   0.00   22.2   0.00   1.59   1.59   0.00   4.76
sin    821      30.5   37.8   3.65   5.60   1.95   0.00   0.00   0.00   0.243  0.00   63.7   56.6
cos    791      34.9   40.2   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   65.1   59.8
tan    1152     2.69   3.47   1.91   11.3   1.30   0.00   0.00   0.00   0.694  0.694  93.4   84.5
asin   96       76.0   86.5   0.00   0.00   0.00   0.00   0.00   0.00   2.08   0.00   24.0   13.5
acos   96       88.5   88.5   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   11.5   11.5
atan   96       55.2   53.1   10.4   10.4   0.00   0.00   10.4   0.00   2.08   2.08   21.9   34.4
log    37       67.6   97.3   0.00   0.00   0.00   0.00   21.6   0.00   0.00   0.00   10.8   2.70
exp    28       92.8   100    0.00   0.00   0.00   0.00   7.14   0.00   0.00   0.00   0.00   0.00
log2   44       68.1   81.8   0.00   0.00   0.00   18.2   31.8   0.00   0.00   0.00   0.00   0.00
expm1  52       57.7   100    40.4   0.00   0.00   0.00   0.00   0.00   1.92   0.00   0.00   0.00
log1p  86       51.2   98.8   45.3   0.00   0.00   0.00   1.16   0.00   1.16   0.00   1.16   1.16

Discussion:
• vForce actually names log2 logb, but it is listed as log2 for consistency.
• div and sqrt, which both use Newton-Raphson refinement to converge on an answer, employ the built-in division and square root functionality in GLSL to provide the initial estimate.
• expm1 and log1p are the only two functions listed which use a table lookup.
• sin, cos, and tan have custom test vectors designed to test around nπ/4. This was done to stress the argument reduction used by these functions’ algorithms.
• Limited language support in GLSL for floating-point special constants required using a third texture to store the important constants. Accessing these constants involves doing an additional texture lookup for every data element.
• Overall, the quality of the results for the nVIDIA GPU system is better than that of the results produced by the ATi GPU system.
• Overall, the ported vForce functions provide better results than do the built-in GLSL functions.
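Newton-Raphson refinement from an imprecise initial estimate, as mentioned for div and sqrt above, can be sketched as follows. This Python sketch shows the textbook iterations only; it is not the vForce code, and the function names, the step counts, and the way the initial estimate is supplied are all assumptions made for illustration.

```python
def refine_reciprocal(b, y0, steps=3):
    """Refine an estimate y0 of 1/b with y <- y * (2 - b * y).

    Each step roughly squares the relative error of the estimate.
    """
    y = y0
    for _ in range(steps):
        y = y * (2.0 - b * y)
    return y

def refine_divide(a, b, y0, steps=3):
    """Approximate a / b as a times the refined reciprocal of b."""
    return a * refine_reciprocal(b, y0, steps)

def refine_sqrt(x, s0, steps=4):
    """Refine an estimate s0 of sqrt(x) with s <- (s + x / s) / 2."""
    s = s0
    for _ in range(steps):
        s = 0.5 * (s + x / s)
    return s

# Stand-in for an imprecise built-in: a reciprocal that is off by 0.1%.
rough_reciprocal = (1.0 / 3.0) * (1.0 + 1e-3)
```

Starting from rough_reciprocal, three steps bring the reciprocal of 3 to within rounding error of 1/3. The iterations themselves do not handle the special values: for b equal to infinity and an estimate of 0, the product b * y is a NaN, so a port that must pass edge-case tests needs explicit handling for zeros, infinities, and NaNs around the refinement loop.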