The GeForce 6 Series GPU Architecture
Emmett Kilgariff, NVIDIA Corporation
Randima Fernando, NVIDIA Corporation
Chapter 30
Excerpted from GPU Gems 2. Copyright 2005 by NVIDIA Corporation.

The previous chapter described how GPU architecture has changed as a result of computational and communications trends in microprocessing. This chapter describes the architecture of the GeForce 6 Series GPUs from NVIDIA, which owe their formidable computational power to their ability to take advantage of these trends. Most notably, we focus on the GeForce 6800 (NVIDIA's flagship GPU at the time of writing, shown in Figure 30-1), which delivers hundreds of gigaflops of single-precision floating-point computation, as compared to approximately 12 gigaflops for current high-end CPUs. In this chapter, and throughout the book, references to GeForce 6 Series GPUs should be read to include the latest Quadro FX GPUs supporting Shader Model 3.0, which provide a superset of the functionality offered by the GeForce 6 Series. We start with a general overview of where the GPU fits into the overall computer system, and then we describe the architecture along with details of specific features and performance characteristics.

30.1 How the GPU Fits into the Overall Computer System

The CPU in a modern computer system communicates with the GPU through a graphics connector such as a PCI Express or AGP slot on the motherboard. Because the graphics connector is responsible for transferring all command, texture, and vertex data from the CPU to the GPU, the bus technology has evolved alongside GPUs over the past few years. The original AGP slot ran at 66 MHz and was 32 bits wide, giving a transfer rate of 264 MB/sec. AGP 2×, 4×, and 8× followed, each doubling the available
bandwidth, until finally the PCI Express standard was introduced in 2004, with a maximum theoretical bandwidth of 4 GB/sec simultaneously available to and from the GPU. (Your mileage may vary; currently available motherboard chipsets fall somewhat below this limit, at around 3.2 GB/sec or less.)

It is important to note the vast differences between the GPU's memory interface bandwidth and bandwidth in other parts of the system, as shown in Table 30-1.

Table 30-1. Available Memory Bandwidth in Different Parts of the Computer System

Component                 Bandwidth
GPU Memory Interface      35 GB/sec
PCI Express Bus (×16)     8 GB/sec
CPU Memory Interface      6.4 GB/sec (800 MHz Front-Side Bus)

Figure 30-1. The GeForce 6800 Microprocessor
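The AGP figures quoted above follow directly from the clock rate and bus width. As a minimal sketch (the function name is illustrative, not part of any API):

```python
# Peak AGP transfer rates from the numbers in the text:
# 66 MHz clock x 32-bit (4-byte) bus = 264 MB/sec for the original slot,
# with each later generation (2x, 4x, 8x) doubling that rate.

def agp_bandwidth_mb_per_sec(multiplier: int) -> int:
    """Peak bandwidth of an AGP slot for a given speed multiplier (1, 2, 4, 8)."""
    clock_mhz = 66
    bus_bytes = 32 // 8  # 32-bit bus = 4 bytes per transfer
    return clock_mhz * bus_bytes * multiplier  # MB/sec

for mult in (1, 2, 4, 8):
    print(f"AGP {mult}x: {agp_bandwidth_mb_per_sec(mult)} MB/sec")
```

Note that even AGP 8× (about 2.1 GB/sec) falls well short of the 4 GB/sec theoretical figure for PCI Express.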
Table 30-1 reiterates some of the points made in the preceding chapter: there is a vast amount of bandwidth available internally on the GPU. Algorithms that run on the GPU can therefore take advantage of this bandwidth to achieve dramatic performance improvements.

30.2 Overall System Architecture

The next two subsections go into detail about the architecture of the GeForce 6 Series GPUs. Section 30.2.1 describes the architecture in terms of its graphics capabilities. Section 30.2.2 describes the architecture with respect to the general computational capabilities that it provides. See Figure 30-2 for an illustration of the system architecture.

30.2.1 Functional Block Diagram for Graphics Operations

Figure 30-3 illustrates the major blocks in the GeForce 6 Series architecture. In this section, we take a trip through the graphics pipeline, starting with input arriving from the CPU and finishing with pixels being drawn to the frame buffer.

Figure 30-2. The Overall System Architecture of a PC
First, commands, textures, and vertex data are received from the host CPU through shared buffers in system memory or local frame-buffer memory. A command stream is written by the CPU, which initializes and modifies state, sends rendering commands, and references the texture and vertex data. Commands are parsed, and a vertex fetch unit is used to read the vertices referenced by the rendering commands. The commands, vertices, and state changes flow downstream, where they are used by subsequent pipeline stages.

The vertex processors (sometimes called "vertex shaders"), shown in Figure 30-4, allow a program to be applied to each vertex in the object, performing transformations, skinning, and any other per-vertex operation the user specifies. For the first time, a

Figure 30-3. A Block Diagram of the GeForce 6 Series Architecture
GPU, the GeForce 6 Series, allows vertex programs to fetch texture data. All operations are done in 32-bit floating-point (fp32) precision per component. The GeForce 6 Series architecture supports scalable vertex-processing horsepower, allowing the same architecture to service multiple price/performance points. In other words, high-end models may have six vertex units, while low-end models may have two.

Because vertex processors can perform texture accesses, the vertex engines are connected to the texture cache, which is shared with the fragment processors. In addition, there is a vertex cache that stores vertex data both before and after the vertex processor, reducing fetch and computation requirements. This means that if a vertex index occurs twice in a draw call (for example, in a triangle strip), the entire vertex program doesn't have to be rerun for the second instance of the vertex; the cached result is used instead.

Vertices are then grouped into primitives, which are points, lines, or triangles. The Cull/Clip/Setup blocks perform per-primitive operations, removing primitives that aren't visible at all, clipping primitives that intersect the view frustum, and performing edge and plane equation setup on the data in preparation for rasterization.

Figure 30-4. The GeForce 6 Series Vertex Processor
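The effect of the post-transform vertex cache can be sketched as follows. This is an illustrative model only (real hardware uses a small fixed-size cache, not an unbounded table, and `draw_indexed` is a made-up name):

```python
# Hypothetical sketch: a post-transform vertex cache memoizes vertex-program
# results by vertex index, so a vertex shared between triangles in an indexed
# draw call is shaded only once.

def draw_indexed(indices, vertices, vertex_program):
    cache = {}        # vertex index -> shaded result
    shaded_count = 0  # how many times the vertex program actually ran
    output = []
    for i in indices:
        if i not in cache:
            cache[i] = vertex_program(vertices[i])
            shaded_count += 1
        output.append(cache[i])
    return output, shaded_count

# A quad drawn as two triangles references 6 indices but only 4 unique vertices:
verts = [(0, 0), (1, 0), (1, 1), (0, 1)]
out, runs = draw_indexed([0, 1, 2, 0, 2, 3], verts, lambda v: (v[0] * 2, v[1] * 2))
print(runs)  # 4
```

A real cache is a small FIFO, so reuse only pays off when repeated indices occur close together, which is exactly what triangle strips and well-ordered index buffers provide.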
The rasterization block calculates which pixels (or samples, if multisampling is enabled) are covered by each primitive, and it uses the z-cull block to quickly discard pixels (or samples) that are occluded by objects with a nearer depth value. Think of a fragment as a "candidate pixel": that is, it will pass through the fragment processor and several tests, and if it gets through all of them, it will end up carrying depth and color information to a pixel on the frame buffer (or render target).

Figure 30-5 illustrates the fragment processor (sometimes called a "pixel shader") and texel pipeline. The texture and fragment-processing units operate in concert to apply a shader program to each fragment independently. The GeForce 6 Series architecture supports a scalable amount of fragment-processing horsepower. Another popular way to say this is that GPUs in the GeForce 6 Series can have a varying number of fragment pipelines (or "pixel pipelines"). As with the vertex processor, texture data is cached on-chip to reduce bandwidth requirements and improve performance.

The texture and fragment-processing unit operates on squares of four pixels (called quads) at a time, allowing for direct computation of derivatives for calculating texture level of detail. Furthermore, the fragment processor works on groups of hundreds of pixels at a time in single-instruction, multiple-data (SIMD) fashion (with each fragment-processor engine working on one fragment concurrently), hiding the latency of texture fetch behind the computational work of the fragment processor.

Figure 30-5. The GeForce 6 Series Fragment Processor and Texel Pipeline
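The quad arrangement matters because texture level of detail is chosen from how fast texture coordinates change between neighboring fragments. A hedged sketch of the standard calculation (the function and its layout are illustrative; the exact hardware formula is not specified in the text):

```python
# Why fragments are shaded in 2x2 quads: finite differences between
# neighboring fragments' texture coordinates approximate the derivatives
# that determine the texture level of detail (mip level).

import math

def mip_level(quad_uv, texture_size):
    """quad_uv: 2x2 grid of (u, v) coordinates in [0,1]; returns the quad's LOD."""
    (u00, v00), (u10, v10) = quad_uv[0]   # top row: x neighbors
    (u01, v01), _ = quad_uv[1]            # bottom row: y neighbor
    # Finite differences across the quad, scaled to texel units
    du_dx = (u10 - u00) * texture_size
    dv_dx = (v10 - v00) * texture_size
    du_dy = (u01 - u00) * texture_size
    dv_dy = (v01 - v00) * texture_size
    # Footprint = max rate of change in x or y; LOD = log2(footprint)
    rho = max(math.hypot(du_dx, dv_dx), math.hypot(du_dy, dv_dy))
    return max(0.0, math.log2(rho)) if rho > 0 else 0.0

# Texture coordinates stepping 4 texels per pixel select mip level 2:
quad = [[(0.0, 0.0), (4 / 256, 0.0)], [(0.0, 4 / 256), (4 / 256, 4 / 256)]]
print(mip_level(quad, 256))  # 2.0
```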
The fragment processor uses the texture unit to fetch data from memory, optionally filtering the data before returning it to the fragment processor. The texture unit supports many source data formats (see Section 30.3.3, "Supported Data Storage Formats"). Data can be filtered using bilinear, trilinear, or anisotropic filtering. All data is returned to the fragment processor in fp32 or fp16 format. A texture can be viewed as a 2D or 3D array of data that can be read by the texture unit at arbitrary locations and filtered to reconstruct a continuous function. The GeForce 6 Series supports filtering of fp16 textures in hardware.

The fragment processor has two fp32 shader units per pipeline, and fragments are routed through both shader units and the branch processor before recirculating through the entire pipeline to execute the next series of instructions. This rerouting happens once for each core clock cycle. Furthermore, the first fp32 shader unit can be used for perspective correction of texture coordinates when needed (by dividing by w), or for general-purpose multiply operations. In general, it is possible to perform eight or more math operations in the pixel shader during each clock cycle, or four math operations if a texture fetch occurs in the first shader unit.

On the final pass through the pixel shader pipeline, the fog unit can be used to blend fog in fixed-point precision with no performance penalty. Fog blending happens often in conventional graphics applications and uses the following function:

out = FogColor * fogFraction + SrcColor * (1 - fogFraction)

This function can be made fast and small using fixed-precision math, but in general IEEE floating point, it requires two full multiply-adds to do effectively. Because fixed point is efficient and sufficient for fog, it exists in a separate small unit at the end of the shader.
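The fog function above is a linear interpolation, and the fixed-point savings come from algebraically refactoring it into a single multiply-add per channel, since fog*f + src*(1 - f) = src + f*(fog - src). A hedged sketch of one possible fixed-point formulation (illustrative only; the actual hardware arithmetic is not documented here):

```python
# Fixed-point fog blend: one multiply-add per 8-bit channel, using an
# 8-bit fog fraction in 0..255 instead of a float in 0..1.

def fog_blend_fixed(src, fog_color, fraction_8bit):
    """Blend 8-bit color channels toward fog_color by fraction_8bit / 256."""
    return tuple(
        s + ((fraction_8bit * (f - s)) >> 8)  # src + f*(fog - src), fixed point
        for s, f in zip(src, fog_color)
    )

# Full fog (fraction 255) pulls black almost entirely to the fog color:
print(fog_blend_fixed((0, 0, 0), (255, 255, 255), 255))  # (254, 254, 254)
```

The off-by-one at full fraction (254 rather than 255) is a classic artifact of 8-bit fixed-point scaling; real implementations typically bias or round to hide it.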
This is a good example of the trade-offs in providing flexible programmable hardware while still offering maximum performance for legacy applications.

Fragments leave the fragment-processing unit in the order that they are rasterized and are sent to the z-compare and blend units, which perform depth testing (z comparison and update), stencil operations, alpha blending, and the final color write to the target surface (an off-screen render target or the frame buffer).

The memory system is partitioned into up to four independent memory partitions, each with its own dynamic random-access memories (DRAMs). GPUs use standard DRAM modules rather than custom RAM technologies to take advantage of market economies and thereby reduce cost. Having smaller, independent memory partitions allows the memory subsystem to operate efficiently regardless of whether large or small blocks of data are transferred. All rendered surfaces are stored in the DRAMs, while textures and input data can be stored in the DRAMs or in system memory. The four
independent memory partitions give the GPU a wide (256 bits), flexible memory subsystem, allowing for streaming of relatively small (32-byte) memory accesses at near the 35 GB/sec physical limit.

30.2.2 Functional Block Diagram for Non-Graphics Operations

As graphics hardware becomes more and more programmable, applications unrelated to the standard polygon pipeline (as described in the preceding section) are starting to present themselves as candidates for execution on GPUs. Figure 30-6 shows a simplified view of the GeForce 6 Series architecture when used as a graphics pipeline. It contains a programmable vertex engine, a programmable fragment engine, a texture load/filter engine, and a depth-compare/blending data write engine.

In this alternative view, a GPU can be seen as a large amount of programmable floating-point horsepower and memory bandwidth that can be exploited for compute-intensive applications completely unrelated to computer graphics. Figure 30-7 shows another way to view the GeForce 6 Series architecture. When used for non-graphics applications, it can be viewed as two programmable blocks that run serially: the vertex processor and the fragment processor, both with support for fp32 operands and intermediate values. Both use the texture unit as a random-access data fetch unit and access data at a phenomenal 35 GB/sec (550 MHz DDR memory clock × 256 bits per clock cycle × 2 transfers per clock cycle). In addition, both the vertex and the fragment processor are highly computationally capable. (Performance details follow in Section 30.4.)

Figure 30-6. The GeForce 6 Series Architecture Viewed as a Graphics Pipeline
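The 35 GB/sec figure follows directly from the parenthetical above. A minimal check of the arithmetic (function name is illustrative):

```python
# Peak memory bandwidth from the figures in the text:
# 550 MHz DDR clock x 2 transfers per clock x 256 bits (32 bytes) per transfer.

def memory_bandwidth_gb_per_sec(clock_mhz, bus_bits, transfers_per_clock):
    bytes_per_transfer = bus_bits / 8
    return clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9

print(memory_bandwidth_gb_per_sec(550, 256, 2))  # 35.2
```

Note this is the same per-transfer width (32 bytes) as the "relatively small memory accesses" mentioned above, which is why small streamed accesses can still approach the physical limit.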
The vertex processor operates on data, passing it directly to the fragment processor or using the rasterizer to expand the data into interpolated values. At this point, each triangle (or point) from the vertex processor has become one or more fragments.

Before a fragment reaches the fragment processor, the z-cull unit compares the fragment's depth with the values that already exist in the depth buffer. If the fragment's depth is greater, the fragment will not be visible, so there is no point shading it, and the fragment processor isn't even executed. (This optimization happens only if it's clear that the fragment processor isn't going to modify the fragment's depth.) Thinking in a general-purpose sense, this early culling feature makes it possible to quickly decide to skip work on specific fragments based on a scalar test. Chapter 34 of this book, "GPU Flow-Control Idioms," explains how to take advantage of this feature to efficiently predicate work for general-purpose computations.

After the fragment processor runs on a potential pixel (still a "fragment" because it has not yet reached the frame buffer), the fragment must pass a number of tests in order to move farther down the pipeline. (There may also be more than one fragment that comes out of the fragment processor if multiple render targets [MRTs] are being used. Up to four MRTs can be used to write out large amounts of data, up to 16 scalar floating-point values at a time, for example, plus depth.)

First, the scissor test rejects the fragment if it lies outside a specified subrectangle of the frame buffer. Although the popular graphics APIs define scissoring at this location in the pipeline, it is more efficient to perform the scissor test in the rasterizer. Scissoring in x and y actually happens in the rasterizer, before fragment processing, and z scissoring happens

Figure 30-7.
The GeForce 6 Series Architecture for Non-Graphics Applications
during z-cull. This avoids all fragment-processor work on scissored (rejected) pixels. Scissoring is rarely useful for general-purpose computation, because general-purpose programmers typically draw rectangles sized to the computation in the first place.

Next, the fragment's depth is compared with the depth in the frame buffer. If the depth test passes, the fragment moves on in the pipeline. Optionally, the depth value in the frame buffer can be replaced at this stage.

After this, the fragment can optionally test and modify what is known as the stencil buffer, which stores an integer value per pixel. The stencil buffer was originally intended to allow programmers to mask off certain pixels (for example, to restrict drawing to a cockpit's windshield), but it has found other uses as a way to count values by incrementing or decrementing the existing value. This feature is used for stencil shadow volumes, for example.

If the fragment passes the depth and stencil tests, it can then optionally modify the contents of the frame buffer using the blend function. A blend function can be described as

out = src * srcOp + dst * dstOp

where src is the fragment color flowing down the pipeline; dst is the color value in the frame buffer; and srcOp and dstOp can be specified to be constants, source color components, or destination color components. Full blend functionality is supported for all pixel formats up to fp16 × 4. However, fp32 frame buffers don't support blending; only updating the buffer is allowed.

Finally, a feature called occlusion query makes it possible to quickly determine whether any of the fragments that would be rendered in a particular computation would cause results to be written to the frame buffer. (Recall that fragments that do not pass the z-test don't have any effect on the values in the frame buffer.)
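The blend equation above can be sketched for the most familiar configuration, standard alpha blending, where srcOp is the source alpha and dstOp is one minus the source alpha. This is an illustrative per-channel model only, not the hardware path:

```python
# The blend equation out = src * srcOp + dst * dstOp, instantiated for
# conventional alpha blending: srcOp = srcAlpha, dstOp = (1 - srcAlpha).

def alpha_blend(src_rgb, src_alpha, dst_rgb):
    """out = src * srcAlpha + dst * (1 - srcAlpha), per channel."""
    return tuple(s * src_alpha + d * (1.0 - src_alpha)
                 for s, d in zip(src_rgb, dst_rgb))

# 50% opaque white over black gives mid-gray:
print(alpha_blend((1.0, 1.0, 1.0), 0.5, (0.0, 0.0, 0.0)))  # (0.5, 0.5, 0.5)
```

Other srcOp/dstOp choices from the same equation give additive blending (both set to one), modulation, and so on.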
Traditionally, the occlusion query test is used to allow graphics applications to avoid making draw calls for occluded objects, but it is useful for GPGPU applications as well. For instance, if the depth test is used to determine which outputs need to be updated in a sparse array, updating depth can be used to indicate when a given output has converged and no further work is needed. In this case, occlusion query can be used to tell when all output calculations are done. See Chapter 34 of this book, "GPU Flow-Control Idioms," for further information about this idea.
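The GPGPU idiom just described can be sketched as a plain loop. This is a CPU-side model of the idea only; the names are made up, and on real hardware the "depth write", "z-cull", and "occlusion query" steps are the fixed-function mechanisms described earlier in this section:

```python
# Convergence loop using depth as a "done" mask: depth writes mark outputs
# that have converged, z-cull skips them on later passes, and an occlusion
# query counts how many fragments still did work in a pass.

def iterate_until_converged(values, update, converged, max_passes=100):
    done = [False] * len(values)      # models the depth buffer's "done" mask
    for _ in range(max_passes):
        survivors = 0                 # what the occlusion query would report
        for i, v in enumerate(values):
            if done[i]:
                continue              # z-cull: converged outputs are skipped
            values[i] = update(v)
            if converged(values[i]):
                done[i] = True        # depth write marks this output converged
            else:
                survivors += 1
        if survivors == 0:            # occlusion query returned zero: finished
            break
    return values

# Repeated halving until every element drops below a threshold:
result = iterate_until_converged([100.0, 8.0], lambda v: v / 2, lambda v: v < 1.0)
print(result)
```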