
Nvidia Geforce 6 Series Manual

30.3 GPU Features
This section covers both fixed-function features and Shader Model 3.0 support (described
in detail later) in GeForce 6 Series GPUs. As we describe the various pieces, we
focus on the many new features that are meant to make applications shine (in terms of
both visual quality and performance) on GeForce 6 Series GPUs.
30.3.1 Fixed-Function Features
Geometry Instancing
With Shader Model 3.0, the capability for sending multiple batches of geometry with
one Direct3D call has been added, greatly reducing driver overhead in these cases. The
hardware feature that enables instancing is vertex stream frequency—the ability to read
vertex attributes at a frequency less than once every output vertex, or to loop over a
subset of vertices multiple times. Instancing is most useful when the same object is
drawn multiple times with different positions, for example, when rendering an army of
soldiers or a field of grass.
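The stream-frequency idea can be sketched in plain Python, outside any graphics API. This is only a model of which attribute is read for which output vertex, not Direct3D code; the names `expand_instances`, `mesh_vertices`, and `instance_offsets` are hypothetical.

```python
def expand_instances(mesh_vertices, instance_offsets):
    """Model of instancing: replay the same vertices once per instance.

    mesh_vertices: list of (x, y, z) tuples, read once per output vertex.
    instance_offsets: list of (dx, dy, dz) tuples, read once per instance,
    that is, at a lower frequency than the output vertices.
    """
    out = []
    for dx, dy, dz in instance_offsets:   # advances once per instance
        for x, y, z in mesh_vertices:     # looped over again for every instance
            out.append((x + dx, y + dy, z + dz))
    return out


# Hypothetical data: one triangle drawn at three positions.
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
offsets = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 0.0, 5.0)]
soldiers = expand_instances(triangle, offsets)
```

In this model the application supplies three vertices and three offsets and gets nine output vertices from a single call.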
Early Culling/Clipping
GeForce 6 Series GPUs are able to cull nonvisible primitives before shading at a high
rate and clip partially visible primitives at full speed. Previous NVIDIA products would
cull nonvisible primitives at primitive-setup rates, and clip all partially visible primitives
at full speed.
Rasterization
Like previous NVIDIA products, GeForce 6 Series GPUs are capable of rendering the
following objects:
●Point sprites
●Aliased and antialiased lines
●Aliased and antialiased triangles
Multisample antialiasing is also supported, allowing accurate antialiased polygon
rendering. Multisample antialiasing supports all rasterization primitives. Multisampling is
supported in previous NVIDIA products, though the 4× multisample pattern was
improved for GeForce 6 Series GPUs.

430_gems2_ch30_new.qxp 1/31/2005 6:57 PM Page 481
Excerpted from GPU Gems 2
Copyright 2005 by NVIDIA Corporation

Z-Cull
NVIDIA GPUs since GeForce3 have technology, called z-cull, that allows hidden-surface
removal at speeds much faster than conventional rendering. The GeForce 6 Series
z-cull unit is the third generation of this technology, which has increased efficiency for
a wider range of cases. Also, in cases where stencil is not being updated, early stencil
reject can be employed to skip rendering work early when the stencil test (based on an
equals comparison) fails.
Occlusion Query
Occlusion query is the ability to collect statistics on how many fragments passed or
failed the depth test and to report the result back to the host CPU. Occlusion query
can be used either while rendering objects or with color and z-write masks turned off,
returning depth test status for the objects that would have been rendered, without
modifying the contents of the frame buffer. This feature has been available since the
GeForce3 was introduced.
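As a rough illustration of what an occlusion query reports, the following Python sketch counts fragments that would pass the depth test without touching the depth buffer. It assumes a less-than depth comparison and a buffer cleared to 1.0; the function name and the dictionary-based buffer are hypothetical and do not correspond to any graphics API.

```python
def occlusion_query(depth_buffer, fragments, write_depth=False):
    """Count fragments that pass a less-than depth test.

    depth_buffer: dict mapping (x, y) to the stored depth; missing pixels
    are treated as cleared to 1.0 (an assumption of this sketch).
    fragments: iterable of (x, y, z) tuples.
    write_depth: False models a query issued with z-writes masked off.
    """
    passed = 0
    for x, y, z in fragments:
        if z < depth_buffer.get((x, y), 1.0):
            passed += 1
            if write_depth:
                depth_buffer[(x, y)] = z
    return passed


# Hypothetical scene: test a bounding box against existing depth values.
depth = {(0, 0): 0.5, (1, 0): 0.2}
bounding_box = [(0, 0, 0.3), (1, 0, 0.4), (2, 0, 0.9)]
visible_fragments = occlusion_query(depth, bounding_box)
```

A count of zero would tell the application that the object behind the bounding box can be skipped.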
Texturing
Like previous GPUs, GeForce 6 Series GPUs support bilinear, trilinear, and anisotropic
filtering on 2D and cube-map textures of various formats. Three-dimensional textures
support bilinear, trilinear, and quad-linear filtering, with and without mipmapping.
Here are the new texturing features on GeForce 6 Series GPUs:
●Support for all texture types (2D, cube map, 3D) with fp16×2, fp16×4, fp32×1,
fp32
×2, and fp32×4 formats
●Support for all filtering modes on fp16×2 and fp16×4 texture formats
●Extended support for non-power-of-two textures to match support for power-of-two
textures, specifically:
– Mipmapping
– Wrapping and clamping
– Cube map and 3D textures
Shadow Buffer Support
NVIDIA GPUs support shadow buffering directly. The application first renders the
scene from the light source into a separate z-buffer. Then during the lighting phase, it
fetches the shadow buffer as a projective texture and performs z-compares of the
shadow buffer data against a value corresponding to the distance from the light. If the
distance passes the test, it’s in light; if not, it’s in shadow. NVIDIA GPUs have dedicated
transistors to perform four z-compares per pixel (on four neighboring z-values)
per clock, and to perform bilinear filtering of the pass/fail data. This more advanced
variation of percentage-closer filtering saves many shader instructions compared to
GPUs that don’t have direct shadow buffer support.
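The four compares followed by bilinear filtering of the pass/fail results can be written out as a small Python sketch. It is a software model of the idea, not a description of the hardware; the texel-coordinate convention (texel centers at integer coordinates, non-negative `u` and `v`) and the names used are assumptions.

```python
def shadow_factor(shadow_map, u, v, receiver_depth):
    """Bilinearly filter the results of four depth compares.

    shadow_map: 2D list of light-space depths, indexed shadow_map[row][col].
    u, v: non-negative continuous texel coordinates (column, row).
    receiver_depth: distance of the shaded point from the light.
    Returns 1.0 for fully lit, 0.0 for fully shadowed.
    """
    x0, y0 = int(u), int(v)
    fx, fy = u - x0, v - y0
    x1 = min(x0 + 1, len(shadow_map[0]) - 1)
    y1 = min(y0 + 1, len(shadow_map) - 1)

    def lit(x, y):
        # One z-compare: pass (1.0) if the receiver is not behind the occluder.
        return 1.0 if receiver_depth <= shadow_map[y][x] else 0.0

    # Filter the pass/fail values, not the depths themselves.
    top = lit(x0, y0) * (1 - fx) + lit(x1, y0) * fx
    bottom = lit(x0, y1) * (1 - fx) + lit(x1, y1) * fx
    return top * (1 - fy) + bottom * fy


# Hypothetical 2x2 shadow map: left column holds a near occluder.
shadow_map = [[0.3, 0.9],
              [0.3, 0.9]]
factor = shadow_factor(shadow_map, 0.25, 0.5, 0.5)
```

Filtering the comparison results rather than the depth values is what produces a soft edge between lit and shadowed regions.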
High-Dynamic-Range Blending Using fp16 Surfaces, Texture Filtering, and Blending
GeForce 6 Series GPUs allow for fp16×4 (four components, each represented by a
16-bit float) filtered textures in the pixel shaders; they also allow performing all
alpha-blending operations on fp16×4 filtered surfaces. This permits intermediate rendered
buffers at a much higher precision and range, enabling high-dynamic-range rendering,
motion blur, and many other effects. In addition, it is possible to specify a separate
blending function for color and alpha values. (The lowest-end member of the GeForce
6 Series family, the GeForce 6200 TC, does not support floating-point blending or
floating-point texture filtering because of its lower memory bandwidth, as well as to
save area on the chip.)
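To see why the extra range matters for blending, the following Python sketch compares additive blending into an 8-bit fixed-point value with blending into a 16-bit float. It uses the standard `struct` module's half-precision format to round values to fp16; the helper names are hypothetical, and the 8-bit path assumes the usual clamp to the [0, 1] range.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest 16-bit float and back."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


def to_unorm8(x):
    """Clamp to [0, 1] and quantize to 8 bits, as an 8-bit color buffer would."""
    clamped = min(max(x, 0.0), 1.0)
    return round(clamped * 255) / 255


def additive_blend(dst, src, store):
    """Add source to destination, then store in the buffer's format."""
    return store(dst + src)


# Two light contributions of 0.75 each land on the same pixel.
ldr = additive_blend(to_unorm8(0.75), to_unorm8(0.75), to_unorm8)
hdr = additive_blend(to_fp16(0.75), to_fp16(0.75), to_fp16)
```

Here the 8-bit result saturates at 1.0, while the fp16 result keeps the value 1.5 for a later tone-mapping pass.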
30.3.2 Shader Model 3.0 Programming Model
Along with the fixed-function features listed previously, the capabilities of the vertex
and the fragment processors have been enhanced in GeForce 6 Series GPUs. With
Shader Model 3.0, the programming models for vertex and fragment processors are
converging: both support fp32 precision, texture lookups, and the same instruction set.
Specifically, here are the new features that have been added.
Vertex Processor
●Increased instruction count. The total instruction count is now 512 static instructions
and 65,536 dynamic instructions. The static instruction count represents the number
of instructions in a program as it is compiled. The dynamic instruction count represents
the number of instructions actually executed. In practice, the dynamic count can
be much higher than the static count due to looping and subroutine calls.
●More temporary registers. Up to 32 four-wide temporary registers can be used in a
vertex program.
●Support for instancing.  This enhancement was described earlier.
●Dynamic flow control. Branching and looping are now part of the shader model. On
the GeForce 6 Series vertex engine, branching and looping have minimal overhead of
just two cycles. Also, each vertex can take its own branches without being grouped in
the way pixel shader branches are. So as branches diverge, the GeForce 6 Series vertex
processor still operates efficiently.
●Vertex texturing. Textures can now be fetched in a vertex program, although only
nearest-neighbor filtering is supported in hardware. More advanced filters can of
course be implemented in the vertex program. Up to four unique textures can be
accessed in a vertex program, although each texture can be accessed multiple times.
Vertex textures generate latency for fetching data, unlike true constant reads. Therefore,
the best way to use vertex textures is to do a texture fetch and follow it with
arithmetic operations to hide the latency before using the result of the texture fetch.
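The remark that more advanced filters can be implemented in the vertex program can be illustrated with bilinear filtering built from four nearest-neighbor fetches. The Python sketch below is a model of that arithmetic only; `fetch_nearest` stands in for the hardware fetch, and the clamping behavior and texel-coordinate convention are assumptions.

```python
import math


def fetch_nearest(texture, x, y):
    """Stand-in for a nearest-neighbor fetch at integer texel (x, y), clamped to the edge."""
    height, width = len(texture), len(texture[0])
    xi = min(max(x, 0), width - 1)
    yi = min(max(y, 0), height - 1)
    return texture[yi][xi]


def fetch_bilinear(texture, u, v):
    """Bilinear filter assembled from four nearest-neighbor fetches of one texture."""
    x0, y0 = math.floor(u), math.floor(v)
    fx, fy = u - x0, v - y0
    t00 = fetch_nearest(texture, x0, y0)
    t10 = fetch_nearest(texture, x0 + 1, y0)
    t01 = fetch_nearest(texture, x0, y0 + 1)
    t11 = fetch_nearest(texture, x0 + 1, y0 + 1)
    top = t00 * (1 - fx) + t10 * fx
    bottom = t01 * (1 - fx) + t11 * fx
    return top * (1 - fy) + bottom * fy


# Hypothetical 2x2 height map used for displacement.
height_map = [[0.0, 1.0],
              [2.0, 3.0]]
height = fetch_bilinear(height_map, 0.5, 0.5)
```

All four fetches read the same texture, so this uses only one of the four unique vertex textures.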
Each vertex engine is capable of simultaneously performing a four-wide SIMD MAD
(multiply-add) instruction and a scalar special function per clock cycle. Special function
instructions include:
●Exponential functions: EXP, EXPP, LIT, LOG, LOGP
●Reciprocal instructions: RCP, RSQ
●Trigonometric functions: SIN, COS
Fragment Processor
●Increased instruction count. The total instruction count is now 65,535 static instructions
and 65,535 dynamic instructions. There are limitations on how long the
operating system will wait while the shader finishes working, so a long shader program
working on a full screen of pixels may time out. This makes it important to
carefully consider the shader length and number of fragments rendered in one draw
call. In practice, the number of instructions exposed by the driver tends to be smaller,
because the number of instructions can expand as code is translated from Direct3D
pixel shaders or OpenGL fragment programs to native hardware instructions.
●Multiple render targets. The fragment processor can output to up to four separate
color buffers, along with a depth value. All four separate color buffers must be the
same format and size. MRTs can be particularly useful when operating on scalar data,
because up to 16 scalar values can be written out in a single pass by the fragment
processor. Sample uses of MRTs include particle physics, where positions and velocities
are computed simultaneously, and similar GPGPU algorithms. Deferred shading
is another technique that computes and stores multiple four-component floating-point
values simultaneously: it computes all material properties and stores them in
separate textures. So, for example, the surface normal and the diffuse and specular
material properties could be written to textures, and the textures could all be used in
subsequent passes when lighting the scene with multiple lights. This is illustrated in
Figure 30-8.
●Dynamic flow control (branching). Shader Model 3.0 supports conditional branching
and looping, allowing for more flexible shader programs.
●Indexing of attributes. With Shader Model 3.0, an index register can be used to
select which attributes to process, allowing for loops to perform the same operation
on many different inputs.
●Up to ten full-function attributes. Shader Model 3.0 supports ten full-function
attributes/texture coordinates, instead of Shader Model 2.0’s eight full-function
attributes plus specular color and diffuse color. All ten Shader Model 3.0 attributes are
interpolated at full fp32 precision, whereas Shader Model 2.0’s diffuse and specular
color were interpolated at only 8-bit integer precision.
●Centroid sampling. Shader Model 3.0 allows a per-attribute selection of center sampling
or centroid sampling. Centroid sampling returns a value inside the covered portion
of the fragment, instead of at the center, and when used with multisampling, it
can remove some artifacts associated with sampling outside the polygon (for example,
when calculating diffuse or specular color using texture coordinates, or when using
texture atlases).
●Support for fp32 and fp16 internal precision. Fragment programs can support full
fp32-precision computations and intermediate storage or partial-precision fp16
computations and intermediate storage.
Figure 30-8. How MRTs Work
MRTs make it possible for a fragment program to return four four-wide color values plus a depth value.

●3:1 and 2:2 coissue. Each four-component-wide vector unit is capable of executing
two independent instructions in parallel, as shown in Figure 30-9: either one three-wide
operation on RGB and a separate operation on alpha, or one two-wide operation
on red-green and a separate two-wide operation on blue-alpha. This gives the
compiler more opportunity to pack scalar computations into vectors, thereby doing
more work in a shorter time.
●Dual issue. Dual issue is similar to coissue, except that the two independent instructions
can be executed on different parts of the shader pipeline. This makes the
pipeline easier to schedule and, therefore, more efficient. See Figure 30-10.

Figure 30-9. How Coissue Works
Two separate operations can concurrently execute on different parts of a four-wide register.
Figure 30-10. How Dual Issue Works
Independent instructions can be executed on independent units in the computational pipeline.
Fragment Processor Performance
The GeForce 6 Series fragment processor architecture has the following performance
characteristics:
●Each pipeline is capable of performing a four-wide, coissue-able multiply-add (MAD)
or four-term dot product (DP4), plus a four-wide, coissue-able and dual-issuable
multiply instruction per clock in series, as shown in Figure 30-11. In addition, a
multifunction unit that performs complex operations can replace the alpha channel
MAD operation. Operations are performed at full speed on both fp32 and fp16 data,
although storage and bandwidth limitations can favor fp16 performance sometimes.
In practice, it is sometimes possible to execute eight math operations and a texture
lookup in a single cycle.
●Dedicated fp16 normalization hardware exists, making it possible to normalize a
vector at fp16 precision in parallel with the multiplies and MADs just described.
●An independent reciprocal operation can be performed in parallel with the multiply,
MAD, and fp16 normalization described previously.
●Because the GeForce 6800 has 16 fragment-processing pipelines, the overall available
performance of the system is given by these values multiplied by 16 and then by the
clock rate.
●There is some overhead to flow-control operations, as defined in Table 30-2.
Figure 30-11. Shader Units and Capabilities in the Fragment Processor

Table 30-2. Overhead Incurred When Executing Flow-Control Operations in Fragment Programs
Instruction      Cost (Cycles)
If/endif         4
If/else/endif    6
Call             2
Ret              2
Loop/endloop     4
Furthermore, branching in the fragment processor is affected by the level of divergence
of the branches. Because the fragment processor operates on hundreds of pixels per
instruction, if a branch is taken by some fragments and not others, all fragments execute
both branches, but each fragment writes registers only on the branch it is
supposed to take. For low-frequency and mid-frequency branch changes, this effect is
hidden, although it can become a limiter as the branch frequency increases.
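A simple cost model makes the effect of divergence concrete. The Python sketch below combines the flow-control overheads from Table 30-2 with the rule that a group of fragments executes every branch side that at least one of its fragments takes. The additive cycle model, the function name, and the example instruction costs are simplifying assumptions, not measured hardware behavior.

```python
# Overheads in cycles, copied from Table 30-2.
FLOW_CONTROL_CYCLES = {
    "if/endif": 4,
    "if/else/endif": 6,
    "call": 2,
    "ret": 2,
    "loop/endloop": 4,
}


def if_else_cost(takes_then, then_cycles, else_cycles):
    """Estimate cycles for one group of fragments running an if/else/endif.

    takes_then: list of booleans, one per fragment in the group.
    then_cycles, else_cycles: assumed cost of each side of the branch.
    """
    cost = FLOW_CONTROL_CYCLES["if/else/endif"]
    if any(takes_then):          # at least one fragment needs the "then" side
        cost += then_cycles
    if not all(takes_then):      # at least one fragment needs the "else" side
        cost += else_cycles
    return cost


coherent = if_else_cost([True, True, True, True], 40, 10)
divergent = if_else_cost([True, False, True, True], 40, 10)
```

Under these assumptions, a single fragment that disagrees with its neighbors makes the whole group pay for both sides of the branch.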
30.3.3 Supported Data Storage Formats
Table 30-3 summarizes the data formats supported by the graphics pipeline.
30.4 Performance
The GeForce 6800 Ultra is the flagship product of the GeForce 6 Series family at the
time of writing. Its performance is summarized as follows:
●425 MHz internal graphics clock
●550 MHz memory clock
●600 million vertices/second
●6.4 billion texels/second
●12.8 billion pixels/second, rendering z/stencil-only (useful for shadow volumes and
shadow buffers)
●6 four-wide fp32 vector MADs per clock cycle in the vertex shader, plus one scalar
multifunction operation (a complex math operation, such as a sine or reciprocal square root)
●16 four-wide fp32 vector MADs per clock cycle in the fragment processor, plus 16
four-wide fp32 multiplies per clock cycle
●64 pixels per clock cycle early z-cull (reject rate)
As you can see, there’s plenty of programmable floating-point horsepower in the vertex
and fragment processors that can be exploited for computationally demanding problems.
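The figures in the list above can be turned into a rough peak arithmetic rate. The Python sketch below does so under two labeled assumptions: a multiply-add is counted as two floating-point operations and a multiply as one, and the multifunction, normalization, and reciprocal units are left out. The result is an illustrative upper bound derived from the listed numbers, not a published specification.

```python
CLOCK_HZ = 425e6            # internal graphics clock from the list above
LANES = 4                   # four-wide vector operations

FLOPS_PER_MAD = 2           # assumption: one multiply plus one add
FLOPS_PER_MUL = 1

# Fragment processor: 16 four-wide MADs plus 16 four-wide multiplies per clock.
FRAGMENT_PIPES = 16
fragment_flops_per_clock = FRAGMENT_PIPES * LANES * (FLOPS_PER_MAD + FLOPS_PER_MUL)
fragment_gflops = fragment_flops_per_clock * CLOCK_HZ / 1e9

# Vertex shader: 6 four-wide MADs per clock (scalar multifunction op ignored).
VERTEX_UNITS = 6
vertex_flops_per_clock = VERTEX_UNITS * LANES * FLOPS_PER_MAD
vertex_gflops = vertex_flops_per_clock * CLOCK_HZ / 1e9

print(f"fragment: {fragment_gflops:.1f} GFLOPS, vertex: {vertex_gflops:.1f} GFLOPS")
```

Real workloads fall below such a bound because of texture fetches, flow-control overhead, and memory bandwidth.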
Table 30-3. Data Storage Formats Supported by GeForce 6 Series GPUs
Format                 Description of Data in Memory                                   Vertex Texture  Fragment Texture  Render Target
                                                                                       Support         Support           Support
B8                     One 8-bit fixed-point number                                    ✗               ✓                 ✓
A1R5G5B5               A 1-bit value and three 5-bit unsigned fixed-point numbers      ✗               ✓                 ✓
A4R4G4B4               Four 4-bit unsigned fixed-point numbers                         ✗               ✓                 ✗
R5G6B5                 5-bit, 6-bit, and 5-bit fixed-point numbers                     ✗               ✓                 ✓
A8R8G8B8               Four 8-bit fixed-point numbers                                  ✗               ✓                 ✓
DXT1                   Compressed 4×4 pixels into 8 bytes                              ✗               ✓                 ✗
DXT2,3,4,5             Compressed 4×4 pixels into 16 bytes                             ✗               ✓                 ✗
G8B8                   Two 8-bit fixed-point numbers                                   ✗               ✓                 ✓
B8R8_G8R8              Compressed as YVYU; two pixels in 32 bits                       ✗               ✓                 ✗
R8B8_R8G8              Compressed as VYUY; two pixels in 32 bits                       ✗               ✓                 ✗
R6G5B5                 6-bit, 5-bit, and 5-bit unsigned fixed-point numbers            ✗               ✓                 ✗
DEPTH24_D8             A 24-bit unsigned fixed-point number and 8 bits of garbage      ✗               ✓                 ✓
DEPTH24_D8_FLOAT       A 24-bit unsigned float and 8 bits of garbage                   ✗               ✓                 ✓
DEPTH16                A 16-bit unsigned fixed-point number                            ✗               ✓                 ✓
DEPTH16_FLOAT          A 16-bit unsigned float                                         ✗               ✓                 ✓
X16                    A 16-bit fixed-point number                                     ✗               ✓                 ✗
Y16_X16                Two 16-bit fixed-point numbers                                  ✗               ✓                 ✗
R5G5B5A1               Three unsigned 5-bit fixed-point numbers and a 1-bit value      ✗               ✓                 ✓
HILO8                  Two unsigned 16-bit values compressed into two 8-bit values     ✗               ✓                 ✗
HILO_S8                Two signed 16-bit values compressed into two 8-bit values       ✗               ✓                 ✗
W16_Z16_Y16_X16 FLOAT  Four fp16 values                                                ✗               ✓                 ✓
W32_Z32_Y32_X32 FLOAT  Four fp32 values                                                ✓ (unfiltered)  ✓ (unfiltered)    ✓
X32_FLOAT              One 32-bit floating-point number                                ✓ (unfiltered)  ✓ (unfiltered)    ✓
D1R5G5B5               1 bit of garbage and three unsigned 5-bit fixed-point numbers   ✗               ✓                 ✓
D8R8G8B8               8 bits of garbage and three unsigned 8-bit fixed-point numbers  ✗               ✓                 ✓
Y16_X16 FLOAT          Two 16-bit floating-point numbers                               ✗               ✓                 ✗
✓ = Yes  ✗ = No

30.5 Achieving Optimal Performance
While graphics hardware is becoming more and more programmable, there are still
some tricks to ensuring that you exploit the hardware fully to get the most performance.
This section lists some common techniques that you may find helpful. A more
detailed discussion of performance advice is available in the NVIDIA GPU Programming
Guide, which is freely available in several languages from the NVIDIA Developer
Web site (http://developer.nvidia.com/object/gpu_programming_guide.html).
30.5.1 Use Z-Culling Aggressively
Z-cull avoids work that won’t contribute to the final result. It’s better to determine early
on that a computation doesn’t matter and skip the work. In graphics, this can be
done by rendering the z-values for all objects first, before shading. For general-purpose
computation, the z-cull unit can be used to select which parts of the computation are
still active, culling computational threads that have already resolved. See Section 34.2.3
of Chapter 34, “GPU Flow-Control Idioms,” for more details on this idea.
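The benefit of rendering z-values first can be modeled by counting how many fragments reach the shader with and without a depth-only pre-pass. The Python sketch below assumes depth values in [0, 1], a buffer cleared to 1.0, and a less-than-or-equal depth test; the function name and the data layout are hypothetical.

```python
def shaded_fragment_count(fragments, depth_prepass):
    """Count fragments that pass the depth test and therefore get shaded.

    fragments: list of (pixel, z) tuples in submission order.
    depth_prepass: if True, lay down the nearest z per pixel before shading.
    """
    depth = {}
    if depth_prepass:
        # Depth-only pass: no shading work, just the nearest z per pixel.
        for pixel, z in fragments:
            depth[pixel] = min(z, depth.get(pixel, 1.0))
    shaded = 0
    for pixel, z in fragments:
        if z <= depth.get(pixel, 1.0):
            shaded += 1
            depth[pixel] = z
    return shaded


# Worst case for a single pixel: three surfaces submitted back to front.
back_to_front = [("p", 0.9), ("p", 0.5), ("p", 0.1)]
without_prepass = shaded_fragment_count(back_to_front, depth_prepass=False)
with_prepass = shaded_fragment_count(back_to_front, depth_prepass=True)
```

In this worst case the pre-pass cuts the shading work for the pixel from three fragments to one, at the cost of an extra z-only pass.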
30.5.2 Exploit Texture Math When Loading Data
The texture unit filters data before returning it to the fragment processor, thus reducing the
total data needed by the shader. The texture unit’s bilinear filtering can frequently be used
to reduce the total work done by the shader if it’s performing more sophisticated shading.
Often, large filter kernels can be dissected into groups of bilinear footprints, which are
scaled and accumulated to build the large kernel. A few caveats apply here, most notably
that all filter coefficients must be positive for bilinear footprint assembly to work
properly. (See Chapter 20, “Fast Third-Order Texture Filtering,” for more information
about this technique.)
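The footprint trick is easiest to see in one dimension, where two weighted taps collapse into a single linearly filtered fetch. The Python sketch below is a 1D simplification of the bilinear case; `lerp_fetch` models the texture unit's filtering, and all names are hypothetical.

```python
def lerp_fetch(texels, x):
    """Model of a linearly filtered 1D fetch at continuous coordinate x >= 0."""
    i = int(x)
    f = x - i
    j = min(i + 1, len(texels) - 1)
    return texels[i] * (1 - f) + texels[j] * f


def two_tap(texels, i, w0, w1):
    """Compute w0 * texels[i] + w1 * texels[i + 1] with one filtered fetch.

    Requires w0 >= 0, w1 >= 0, and w0 + w1 > 0, so that the fetch offset
    w1 / (w0 + w1) falls between the two texels.
    """
    total = w0 + w1
    return total * lerp_fetch(texels, i + w1 / total)


texels = [2.0, 4.0, 8.0]
filtered = two_tap(texels, 0, 0.25, 0.75)
direct = 0.25 * texels[0] + 0.75 * texels[1]
```

If one of the weights were negative, the offset would fall outside the interval between the two texels, which is why the technique needs positive coefficients.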
Similarly, the filtering support given by shadow buffering can be used to offload work
from the fragment processor: the hardware performs the compares and then filters the results.
30.5.3 Use Branching in Fragment Programs Judiciously
Because the fragment processor is a SIMD machine operating on many fragments at a
time, if some fragments in a given group take one branch and other fragments in that
group take another branch, the fragment processor needs to take both branches. Also,
there is a six-cycle overhead for if-else-endif control structures. These two effects can
reduce the performance of branching programs if not considered carefully. Branching
can be very beneficial, as long as the work avoided outweighs the cost of branching.