Nvidia Geforce 6 Series Manual
Excerpted from GPU Gems 2. Copyright 2005 by NVIDIA Corporation.

30.3 GPU Features

This section covers both fixed-function features and Shader Model 3.0 support (described in detail later) in GeForce 6 Series GPUs. As we describe the various pieces, we focus on the many new features that are meant to make applications shine (in terms of both visual quality and performance) on GeForce 6 Series GPUs.

30.3.1 Fixed-Function Features

Geometry Instancing

With Shader Model 3.0, the capability for sending multiple batches of geometry with one Direct3D call has been added, greatly reducing driver overhead in these cases. The hardware feature that enables instancing is vertex stream frequency: the ability to read vertex attributes at a frequency less than once every output vertex, or to loop over a subset of vertices multiple times. Instancing is most useful when the same object is drawn multiple times with different positions, for example, when rendering an army of soldiers or a field of grass.

Early Culling/Clipping

GeForce 6 Series GPUs are able to cull nonvisible primitives before shading at a high rate and clip partially visible primitives at full speed. Previous NVIDIA products would cull nonvisible primitives at primitive-setup rates, and clip all partially visible primitives at full speed.

Rasterization

Like previous NVIDIA products, GeForce 6 Series GPUs are capable of rendering the following objects:

● Point sprites
● Aliased and antialiased lines
● Aliased and antialiased triangles

Multisample antialiasing is also supported, allowing accurate antialiased polygon rendering. Multisample antialiasing supports all rasterization primitives. Multisampling is supported in previous NVIDIA products, though the 4× multisample pattern was improved for GeForce 6 Series GPUs.
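To make the vertex stream frequency idea from the Geometry Instancing discussion concrete, here is a minimal CPU-side sketch in Python. It is not the Direct3D API: the names `fetch_index` and `draw_instanced` are hypothetical, and the sketch only assumes that a per-vertex stream loops once per instance while a per-instance stream advances once per instance.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def fetch_index(output_vertex: int, verts_per_instance: int, per_instance: bool) -> int:
    """Return which element of a stream feeds a given output vertex.

    per_instance=False: the stream loops over its vertices once per instance.
    per_instance=True:  the stream advances once per instance, that is, less
                        often than once per output vertex.
    """
    if per_instance:
        return output_vertex // verts_per_instance
    return output_vertex % verts_per_instance


def draw_instanced(mesh: Sequence[Vec3], offsets: Sequence[Vec3]) -> List[Vec3]:
    """Emulate one instanced draw: every mesh vertex is emitted once per offset."""
    n = len(mesh)
    out: List[Vec3] = []
    for v in range(n * len(offsets)):
        p = mesh[fetch_index(v, n, per_instance=False)]     # per-vertex stream
        o = offsets[fetch_index(v, n, per_instance=True)]   # per-instance stream
        out.append((p[0] + o[0], p[1] + o[1], p[2] + o[2]))
    return out


# Illustrative data (not from the text): one triangle drawn at two positions.
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
soldier_positions = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
vertices = draw_instanced(triangle, soldier_positions)
```

The point of the sketch is that the application submits the mesh and the per-instance data once; the index arithmetic that would otherwise require one draw call per object is done by the stream fetch.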
Z-Cull

NVIDIA GPUs since GeForce3 have technology, called z-cull, that allows hidden surface removal at speeds much faster than conventional rendering. The GeForce 6 Series z-cull unit is the third generation of this technology, which has increased efficiency for a wider range of cases. Also, in cases where stencil is not being updated, early stencil reject can be employed to remove rendering early when stencil test (based on equals comparison) fails.

Occlusion Query

Occlusion query is the ability to collect statistics on how many fragments passed or failed the depth test and to report the result back to the host CPU. Occlusion query can be used either while rendering objects or with color and z-write masks turned off, returning depth test status for the objects that would have been rendered, without modifying the contents of the frame buffer. This feature has been available since the GeForce3 was introduced.

Texturing

Like previous GPUs, GeForce 6 Series GPUs support bilinear, trilinear, and anisotropic filtering on 2D and cube-map textures of various formats. Three-dimensional textures support bilinear, trilinear, and quad-linear filtering, with and without mipmapping. Here are the new texturing features on GeForce 6 Series GPUs:

● Support for all texture types (2D, cube map, 3D) with fp16×2, fp16×4, fp32×1, fp32×2, and fp32×4 formats
● Support for all filtering modes on fp16×2 and fp16×4 texture formats
● Extended support for non-power-of-two textures to match support for power-of-two textures, specifically:
  – Mipmapping
  – Wrapping and clamping
  – Cube map and 3D textures

Shadow Buffer Support

NVIDIA GPUs support shadow buffering directly. The application first renders the scene from the light source into a separate z-buffer. Then during the lighting phase, it fetches the shadow buffer as a projective texture and performs z-compares of the shadow buffer data against a value corresponding to the distance from the light.
If the distance passes the test, it's in light; if not, it's in shadow. NVIDIA GPUs have dedicated transistors to perform four z-compares per pixel (on four neighboring z-values) per clock, and to perform bilinear filtering of the pass/fail data. This more advanced variation of percentage-closer filtering saves many shader instructions compared to GPUs that don't have direct shadow buffer support.

High-Dynamic-Range Blending Using fp16 Surfaces, Texture Filtering, and Blending

GeForce 6 Series GPUs allow for fp16×4 (four components, each represented by a 16-bit float) filtered textures in the pixel shaders; they also allow performing all alpha-blending operations on fp16×4 filtered surfaces. This permits intermediate rendered buffers at a much higher precision and range, enabling high-dynamic-range rendering, motion blur, and many other effects. In addition, it is possible to specify a separate blending function for color and alpha values. (The lowest-end member of the GeForce 6 Series family, the GeForce 6200 TC, does not support floating-point blending or floating-point texture filtering because of its lower memory bandwidth, as well as to save area on the chip.)

30.3.2 Shader Model 3.0 Programming Model

Along with the fixed-function features listed previously, the capabilities of the vertex and the fragment processors have been enhanced in GeForce 6 Series GPUs. With Shader Model 3.0, the programming models for vertex and fragment processors are converging: both support fp32 precision, texture lookups, and the same instruction set. Specifically, here are the new features that have been added.

Vertex Processor

● Increased instruction count. The total instruction count is now 512 static instructions and 65,536 dynamic instructions. The static instruction count represents the number of instructions in a program as it is compiled. The dynamic instruction count represents the number of instructions actually executed.
In practice, the dynamic count can be much higher than the static count due to looping and subroutine calls.
● More temporary registers. Up to 32 four-wide temporary registers can be used in a vertex program.
● Support for instancing. This enhancement was described earlier.
● Dynamic flow control. Branching and looping are now part of the shader model. On the GeForce 6 Series vertex engine, branching and looping have minimal overhead of just two cycles. Also, each vertex can take its own branches without being grouped in the way pixel shader branches are. So as branches diverge, the GeForce 6 Series vertex processor still operates efficiently.
● Vertex texturing. Textures can now be fetched in a vertex program, although only nearest-neighbor filtering is supported in hardware. More advanced filters can of course be implemented in the vertex program. Up to four unique textures can be accessed in a vertex program, although each texture can be accessed multiple times. Vertex textures generate latency for fetching data, unlike true constant reads. Therefore, the best way to use vertex textures is to do a texture fetch and follow it with arithmetic operations to hide the latency before using the result of the texture fetch.

Each vertex engine is capable of simultaneously performing a four-wide SIMD MAD (multiply-add) instruction and a scalar special function per clock cycle. Special function instructions include:

● Exponential functions: EXP, EXPP, LIT, LOG, LOGP
● Reciprocal instructions: RCP, RSQ
● Trigonometric functions: SIN, COS

Fragment Processor

● Increased instruction count. The total instruction count is now 65,535 static instructions and 65,535 dynamic instructions. There are limitations on how long the operating system will wait while the shader finishes working, so a long shader program working on a full screen of pixels may time out. This makes it important to carefully consider the shader length and number of fragments rendered in one draw call. In practice, the number of instructions exposed by the driver tends to be smaller, because the number of instructions can expand as code is translated from Direct3D pixel shaders or OpenGL fragment programs to native hardware instructions.
● Multiple render targets.
The fragment processor can output to up to four separate color buffers, along with a depth value. All four separate color buffers must be the same format and size. MRTs can be particularly useful when operating on scalar data, because up to 16 scalar values can be written out in a single pass by the fragment processor. Sample uses of MRTs include particle physics, where positions and velocities are computed simultaneously, and similar GPGPU algorithms. Deferred shading is another technique that computes and stores multiple four-component floating-point values simultaneously: it computes all material properties and stores them in separate textures. So, for example, the surface normal and the diffuse and specular material properties could be written to textures, and the textures could all be used in subsequent passes when lighting the scene with multiple lights. This is illustrated in Figure 30-8.

[Figure 30-8. How MRTs Work. MRTs make it possible for a fragment program to return four four-wide color values plus a depth value.]

● Dynamic flow control (branching). Shader Model 3.0 supports conditional branching and looping, allowing for more flexible shader programs.
● Indexing of attributes. With Shader Model 3.0, an index register can be used to select which attributes to process, allowing for loops to perform the same operation on many different inputs.
● Up to ten full-function attributes. Shader Model 3.0 supports ten full-function attributes/texture coordinates, instead of Shader Model 2.0's eight full-function attributes plus specular color and diffuse color. All ten Shader Model 3.0 attributes are interpolated at full fp32 precision, whereas Shader Model 2.0's diffuse and specular color were interpolated at only 8-bit integer precision.
● Centroid sampling. Shader Model 3.0 allows a per-attribute selection of center sampling, or centroid sampling. Centroid sampling returns a value inside the covered portion of the fragment, instead of at the center, and when used with multisampling, it can remove some artifacts associated with sampling outside the polygon (for example, when calculating diffuse or specular color using texture coordinates, or when using texture atlases).
● Support for fp32 and fp16 internal precision. Fragment programs can support full fp32-precision computations and intermediate storage or partial-precision fp16 computations and intermediate storage.
● 3:1 and 2:2 coissue. Each four-component-wide vector unit is capable of executing two independent instructions in parallel, as shown in Figure 30-9: either one three-wide operation on RGB and a separate operation on alpha, or one two-wide operation on red-green and a separate two-wide operation on blue-alpha. This gives the compiler more opportunity to pack scalar computations into vectors, thereby doing more work in a shorter time.
● Dual issue. Dual issue is similar to coissue, except that the two independent instructions can be executed on different parts of the shader pipeline. This makes the pipeline easier to schedule and, therefore, more efficient. See Figure 30-10.

[Figure 30-9. How Coissue Works. Two separate operations can concurrently execute on different parts of a four-wide register.]

[Figure 30-10. How Dual Issue Works. Independent instructions can be executed on independent units in the computational pipeline.]
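The coissue rule described above can be written down as a small check on write masks. The following Python sketch is only an illustration of the two splits named in the text; the function name `coissue_split` is hypothetical, and the sketch assumes (beyond what the text states) that an instruction writing fewer channels than its half of the split still fits.

```python
from typing import Optional

# The two channel partitions named in the text.
RGB, A = frozenset("rgb"), frozenset("a")
RG, BA = frozenset("rg"), frozenset("ba")


def coissue_split(mask_a: str, mask_b: str) -> Optional[str]:
    """Classify two write masks as a 3:1 split, a 2:2 split, or neither.

    Masks are strings over the channels r, g, b, a (for example "rgb" or "a").
    Assumption: a mask that is a subset of one half of a split is accepted.
    """
    a, b = frozenset(mask_a), frozenset(mask_b)
    if not a or not b or (a & b):
        return None  # empty mask, or both instructions write the same channel
    for first, second, name in ((RGB, A, "3:1"), (RG, BA, "2:2")):
        if (a <= first and b <= second) or (a <= second and b <= first):
            return name
    return None


# A color operation paired with an alpha operation.
color_plus_alpha = coissue_split("rgb", "a")
# Two independent two-component operations.
two_pairs = coissue_split("rg", "ba")
# Channels r,b and g,a do not match either partition.
no_fit = coissue_split("rb", "ga")
```

A shader compiler would apply a test of this kind when deciding whether two independent scalar or short-vector computations can share one four-wide instruction slot.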
Fragment Processor Performance

The GeForce 6 Series fragment processor architecture has the following performance characteristics:

● Each pipeline is capable of performing a four-wide, coissue-able multiply-add (MAD) or four-term dot product (DP4), plus a four-wide, coissue-able and dual-issuable multiply instruction per clock in series, as shown in Figure 30-11. In addition, a multifunction unit that performs complex operations can replace the alpha channel MAD operation. Operations are performed at full speed on both fp32 and fp16 data, although storage and bandwidth limitations can favor fp16 performance sometimes. In practice, it is sometimes possible to execute eight math operations and a texture lookup in a single cycle.
● Dedicated fp16 normalization hardware exists, making it possible to normalize a vector at fp16 precision in parallel with the multiplies and MADs just described.
● An independent reciprocal operation can be performed in parallel with the multiply, MAD, and fp16 normalization described previously.
● Because the GeForce 6800 has 16 fragment-processing pipelines, the overall available performance of the system is given by these values multiplied by 16 and then by the clock rate.
● There is some overhead to flow-control operations, as defined in Table 30-2.

[Figure 30-11. Shader Units and Capabilities in the Fragment Processor]
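The remark that storage and bandwidth can favor fp16 is easy to see on the CPU. The sketch below uses Python's `struct` module, whose `e` format is IEEE 754 half precision; we assume here that this is close enough to the GPU's fp16 format for illustration, which the text does not confirm.

```python
import struct


def roundtrip(value: float, fmt: str) -> float:
    """Store a Python float in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


# Size in bytes of one four-component value, as in the fp16x4 and fp32x4 formats.
bytes_fp16x4 = struct.calcsize("<4e")
bytes_fp32x4 = struct.calcsize("<4f")

# Rounding error when storing 0.1 at each precision.
err_fp16 = abs(roundtrip(0.1, "<e") - 0.1)
err_fp32 = abs(roundtrip(0.1, "<f") - 0.1)

print(f"fp16x4: {bytes_fp16x4} bytes, fp32x4: {bytes_fp32x4} bytes")
print(f"error storing 0.1: fp16 {err_fp16:.2e}, fp32 {err_fp32:.2e}")
```

An fp16×4 value takes half the memory and bandwidth of an fp32×4 value, at the price of a visibly larger rounding error, so fp16 is a reasonable choice for intermediate results that do not need full precision.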
Table 30-2. Overhead Incurred When Executing Flow-Control Operations in Fragment Programs

| Instruction | Cost (Cycles) |
|---|---|
| If/endif | 4 |
| If/else/endif | 6 |
| Call | 2 |
| Ret | 2 |
| Loop/endloop | 4 |

Furthermore, branching in the fragment processor is affected by the level of divergence of the branches. Because the fragment processor operates on hundreds of pixels per instruction, if a branch is taken by some fragments and not others, all fragments execute both branches, but each fragment writes to its registers only on the branch it is supposed to take. For low-frequency and mid-frequency branch changes, this effect is hidden, although it can become a limiter as the branch frequency increases.

30.3.3 Supported Data Storage Formats

Table 30-3 summarizes the data formats supported by the graphics pipeline.

30.4 Performance

The GeForce 6800 Ultra is the flagship product of the GeForce 6 Series family at the time of writing. Its performance is summarized as follows:

● 425 MHz internal graphics clock
● 550 MHz memory clock
● 600 million vertices/second
● 6.4 billion texels/second
● 12.8 billion pixels/second, rendering z/stencil-only (useful for shadow volumes and shadow buffers)
● 6 four-wide fp32 vector MADs per clock cycle in the vertex shader, plus one scalar multifunction operation (a complex math operation, such as a sine or reciprocal square root)
● 16 four-wide fp32 vector MADs per clock cycle in the fragment processor, plus 16 four-wide fp32 multiplies per clock cycle
● 64 pixels per clock cycle early z-cull (reject rate)

As you can see, there's plenty of programmable floating-point horsepower in the vertex and fragment processors that can be exploited for computationally demanding problems.
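As a rough worked example, the per-clock figures in the list above can be turned into peak arithmetic rates by multiplying by the clock. The Python sketch below takes the 425 MHz clock and the MAD and multiply counts from the list; counting a multiply-add as two floating-point operations per component is our assumption, and the scalar multifunction unit is ignored.

```python
CLOCK_HZ = 425_000_000   # internal graphics clock, from the list above
VECTOR_WIDTH = 4         # four-wide fp32 vectors


def peak_flops(mads_per_clock: int, muls_per_clock: int, clock_hz: int = CLOCK_HZ) -> int:
    """Peak floating-point operations per second.

    Assumption: one MAD is 2 operations per component, one multiply is 1.
    """
    ops_per_clock = (2 * mads_per_clock + muls_per_clock) * VECTOR_WIDTH
    return ops_per_clock * clock_hz


# Fragment processor: 16 four-wide MADs plus 16 four-wide multiplies per clock.
fragment_peak = peak_flops(mads_per_clock=16, muls_per_clock=16)
# Vertex shader: 6 four-wide MADs per clock.
vertex_peak = peak_flops(mads_per_clock=6, muls_per_clock=0)

print(f"fragment peak: {fragment_peak / 1e9:.1f} Gflops")
print(f"vertex peak:   {vertex_peak / 1e9:.1f} Gflops")
```

These are upper bounds under the stated counting convention, not measured throughput; real shaders are also limited by texture fetches, register pressure, and memory bandwidth.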
Table 30-3. Data Storage Formats Supported by GeForce 6 Series GPUs

| Format | Description of Data in Memory | Vertex Texture Support | Fragment Texture Support | Render Target Support |
|---|---|---|---|---|
| B8 | One 8-bit fixed-point number | ✗ | ✓ | ✓ |
| A1R5G5B5 | A 1-bit value and three 5-bit unsigned fixed-point numbers | ✗ | ✓ | ✓ |
| A4R4G4B4 | Four 4-bit unsigned fixed-point numbers | ✗ | ✓ | ✗ |
| R5G6B5 | 5-bit, 6-bit, and 5-bit fixed-point numbers | ✗ | ✓ | ✓ |
| A8R8G8B8 | Four 8-bit fixed-point numbers | ✗ | ✓ | ✓ |
| DXT1 | Compressed 4×4 pixels into 8 bytes | ✗ | ✓ | ✗ |
| DXT2,3,4,5 | Compressed 4×4 pixels into 16 bytes | ✗ | ✓ | ✗ |
| G8B8 | Two 8-bit fixed-point numbers | ✗ | ✓ | ✓ |
| B8R8_G8R8 | Compressed as YVYU; two pixels in 32 bits | ✗ | ✓ | ✗ |
| R8B8_R8G8 | Compressed as VYUY; two pixels in 32 bits | ✗ | ✓ | ✗ |
| R6G5B5 | 6-bit, 5-bit, and 5-bit unsigned fixed-point numbers | ✗ | ✓ | ✗ |
| DEPTH24_D8 | A 24-bit unsigned fixed-point number and 8 bits of garbage | ✗ | ✓ | ✓ |
| DEPTH24_D8_FLOAT | A 24-bit unsigned float and 8 bits of garbage | ✗ | ✓ | ✓ |
| DEPTH16 | A 16-bit unsigned fixed-point number | ✗ | ✓ | ✓ |
| DEPTH16_FLOAT | A 16-bit unsigned float | ✗ | ✓ | ✓ |
| X16 | A 16-bit fixed-point number | ✗ | ✓ | ✗ |
| Y16_X16 | Two 16-bit fixed-point numbers | ✗ | ✓ | ✗ |
| R5G5B5A1 | Three unsigned 5-bit fixed-point numbers and a 1-bit value | ✗ | ✓ | ✓ |
| HILO8 | Two unsigned 16-bit values compressed into two 8-bit values | ✗ | ✓ | ✗ |
| HILO_S8 | Two signed 16-bit values compressed into two 8-bit values | ✗ | ✓ | ✗ |
| W16_Z16_Y16_X16 FLOAT | Four fp16 values | ✗ | ✓ | ✓ |
| W32_Z32_Y32_X32 FLOAT | Four fp32 values | ✓ (unfiltered) | ✓ (unfiltered) | ✓ |
| X32_FLOAT | One 32-bit floating-point number | ✓ (unfiltered) | ✓ (unfiltered) | ✓ |
| D1R5G5B5 | 1 bit of garbage and three unsigned 5-bit fixed-point numbers | ✗ | ✓ | ✓ |
| D8R8G8B8 | 8 bits of garbage and three unsigned 8-bit fixed-point numbers | ✗ | ✓ | ✓ |
| Y16_X16 FLOAT | Two 16-bit floating-point numbers | ✗ | ✓ | ✗ |

✓ = Yes, ✗ = No
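To show what a packed fixed-point format such as R5G6B5 from Table 30-3 means in memory, here is a small Python sketch that packs an 8-bit-per-channel color into 16 bits and expands it again. The bit order (red in the most significant bits) and the bit-replication expansion are assumptions made for illustration; the table only gives the field widths.

```python
from typing import Tuple


def pack_r5g6b5(r: int, g: int, b: int) -> int:
    """Pack 8-bit R, G, B into a 16-bit value with 5, 6, and 5 bits.

    Assumed layout: red in bits 15-11, green in bits 10-5, blue in bits 4-0.
    """
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)


def unpack_r5g6b5(pixel: int) -> Tuple[int, int, int]:
    """Expand a packed pixel back to 8 bits per channel by bit replication."""
    r5 = (pixel >> 11) & 0x1F
    g6 = (pixel >> 5) & 0x3F
    b5 = pixel & 0x1F
    return ((r5 << 3) | (r5 >> 2), (g6 << 2) | (g6 >> 4), (b5 << 3) | (b5 >> 2))


white = pack_r5g6b5(255, 255, 255)
red = pack_r5g6b5(255, 0, 0)
restored = unpack_r5g6b5(pack_r5g6b5(200, 100, 50))
print(hex(white), hex(red), restored)
```

The round trip is lossy: the low 3 bits of red and blue and the low 2 bits of green are discarded, which is the trade-off for halving storage relative to A8R8G8B8.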
30.5 Achieving Optimal Performance

While graphics hardware is becoming more and more programmable, there are still some tricks to ensuring that you exploit the hardware fully to get the most performance. This section lists some common techniques that you may find helpful. A more detailed discussion of performance advice is available in the NVIDIA GPU Programming Guide, which is freely available in several languages from the NVIDIA Developer Web site (http://developer.nvidia.com/object/gpu_programming_guide.html).

30.5.1 Use Z-Culling Aggressively

Z-cull avoids work that won't contribute to the final result. It's better to determine early on that a computation doesn't matter and save doing the work. In graphics, this can be done by rendering the z-values for all objects first, before shading. For general-purpose computation, the z-cull unit can be used to select which parts of the computation are still active, culling computational threads that have already resolved. See Section 34.2.3 of Chapter 34, "GPU Flow-Control Idioms," for more details on this idea.

30.5.2 Exploit Texture Math When Loading Data

The texture unit filters data before returning it to the fragment processor, thus reducing the total data needed by the shader. The texture unit's bilinear filtering can frequently be used to reduce the total work done by the shader if it's performing more sophisticated shading.

Often, large filter kernels can be dissected into groups of bilinear footprints, which are scaled and accumulated to build the large kernel. A few caveats apply here, most notably that all filter coefficients must be positive for bilinear footprint assembly to work properly. (See Chapter 20, "Fast Third-Order Texture Filtering," for more information about this technique.)

Similarly, the filtering support given by shadow buffering can be used to offload the work from the processor when performing compares, then filtering the results.
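The bilinear footprint idea in Section 30.5.2 can be sketched in one dimension. Two neighboring taps with non-negative weights w0 and w1 equal a single linearly filtered fetch at offset w1/(w0 + w1), scaled by (w0 + w1). The Python code below is a CPU-side illustration under that assumption; `sample_linear` stands in for the texture unit, and all names and data are hypothetical.

```python
import math
from typing import Sequence


def sample_linear(tex: Sequence[float], x: float) -> float:
    """Linearly filtered fetch, with texel i located at coordinate i."""
    i = max(0, min(math.floor(x), len(tex) - 2))
    f = x - i
    return tex[i] * (1.0 - f) + tex[i + 1] * f


def filter_direct(tex: Sequence[float], start: int, weights: Sequence[float]) -> float:
    """Reference filter: one point fetch per tap."""
    return sum(w * tex[start + k] for k, w in enumerate(weights))


def filter_bilinear_pairs(tex: Sequence[float], start: int, weights: Sequence[float]) -> float:
    """Same filter using one linear fetch per pair of taps."""
    if len(weights) % 2 != 0:
        raise ValueError("this sketch expects an even number of taps")
    total = 0.0
    for k in range(0, len(weights), 2):
        w0, w1 = weights[k], weights[k + 1]
        if w0 < 0 or w1 < 0:
            # With a negative weight the offset would leave the [0, 1] range.
            raise ValueError("footprint assembly needs non-negative weights")
        s = w0 + w1
        if s == 0:
            continue
        total += s * sample_linear(tex, start + k + w1 / s)
    return total


# Illustrative data: a 4-tap kernel evaluated with 2 filtered fetches.
texture = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0]
kernel = [0.125, 0.375, 0.375, 0.125]
reference = filter_direct(texture, 1, kernel)
assembled = filter_bilinear_pairs(texture, 1, kernel)
```

Here four taps collapse into two fetches; in two dimensions the same trick lets one bilinear fetch stand in for four taps, which is where most of the savings come from.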
30.5.3 Use Branching in Fragment Programs Judiciously

Because the fragment processor is a SIMD machine operating on many fragments at a time, if some fragments in a given group take one branch and other fragments in that group take another branch, the fragment processor needs to take both branches. Also, there is a six-cycle overhead for if-else-endif control structures. These two effects can reduce the performance of branching programs if not considered carefully. Branching can be very beneficial, as long as the work avoided outweighs the cost of branching.
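The trade-off described above can be estimated with a deliberately simple cost model. The Python sketch below uses the six-cycle if-else-endif overhead from the text and assumes that branch costs simply add up in cycles, which ignores scheduling and latency hiding; the function names and cycle counts are hypothetical.

```python
from typing import Sequence

IF_ELSE_ENDIF_OVERHEAD = 6  # cycles, as stated in the text


def branching_cost(takes_then: Sequence[bool], then_cycles: int, else_cycles: int) -> int:
    """Cycles for one SIMD group of fragments running an if/else/endif.

    A side of the branch is executed if at least one fragment in the group
    needs it, so a divergent group pays for both sides.
    """
    cost = IF_ELSE_ENDIF_OVERHEAD
    if any(takes_then):
        cost += then_cycles
    if not all(takes_then):
        cost += else_cycles
    return cost


def flattened_cost(then_cycles: int, else_cycles: int) -> int:
    """Cycles when both sides are always computed and the result is selected."""
    return then_cycles + else_cycles


# Example: an expensive lit path (40 cycles) versus a cheap shadowed path (2 cycles).
all_shadowed = branching_cost([False, False, False, False], 40, 2)
all_lit = branching_cost([True, True, True, True], 40, 2)
divergent = branching_cost([True, False, True, True], 40, 2)
no_branch = flattened_cost(40, 2)
print(all_shadowed, all_lit, divergent, no_branch)
```

Under this model the branch pays off only for groups that coherently skip the expensive side; coherent groups on the expensive side and divergent groups both cost slightly more than the flattened version because of the overhead.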