The GeForce 6 Series GPU Architecture
Emmett Kilgariff, NVIDIA Corporation
Randima Fernando, NVIDIA Corporation
Chapter 30
Excerpted from GPU Gems 2. Copyright 2005 by NVIDIA Corporation.

The previous chapter described how GPU architecture has changed as a result of computational and communications trends in microprocessing. This chapter describes the architecture of the GeForce 6 Series GPUs from NVIDIA, which owe their formidable computational power to their ability to take advantage of these trends. Most notably, we focus on the GeForce 6800 (NVIDIA's flagship GPU at the time of writing, shown in Figure 30-1), which delivers hundreds of gigaflops of single-precision floating-point computation, as compared to approximately 12 gigaflops for current high-end CPUs. In this chapter, and throughout the book, references to GeForce 6 Series GPUs should be read to include the latest Quadro FX GPUs supporting Shader Model 3.0, which provide a superset of the functionality offered by the GeForce 6 Series. We start with a general overview of where the GPU fits into the overall computer system, and then we describe the architecture along with details of specific features and performance characteristics.

30.1 How the GPU Fits into the Overall Computer System

The CPU in a modern computer system communicates with the GPU through a graphics connector such as a PCI Express or AGP slot on the motherboard. Because the graphics connector is responsible for transferring all command, texture, and vertex data from the CPU to the GPU, the bus technology has evolved alongside GPUs over the past few years. The original AGP slot ran at 66 MHz and was 32 bits wide, giving a transfer rate of 264 MB/sec. AGP 2×, 4×, and 8× followed, each doubling the available
bandwidth, until finally the PCI Express standard was introduced in 2004, with a maximum theoretical bandwidth of 4 GB/sec simultaneously available to and from the GPU. (Your mileage may vary; currently available motherboard chipsets fall somewhat below this limit, at around 3.2 GB/sec or less.)

It is important to note the vast differences between the GPU's memory interface bandwidth and bandwidth in other parts of the system, as shown in Table 30-1.

Table 30-1. Available Memory Bandwidth in Different Parts of the Computer System

Component                 Bandwidth
GPU Memory Interface      35 GB/sec
PCI Express Bus (×16)     8 GB/sec
CPU Memory Interface      6.4 GB/sec (800 MHz Front-Side Bus)

Figure 30-1. The GeForce 6800 Microprocessor
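The AGP figures quoted above follow directly from the clock rate and bus width. As a minimal sketch (the function name is illustrative, not part of any API):

```python
# Peak AGP transfer rates from the numbers in the text:
# 66 MHz clock x 32-bit (4-byte) bus = 264 MB/sec for the original slot,
# with each later generation (2x, 4x, 8x) doubling that rate.

def agp_bandwidth_mb_per_sec(multiplier: int) -> int:
    """Peak bandwidth of an AGP slot for a given speed multiplier (1, 2, 4, 8)."""
    clock_mhz = 66
    bus_bytes = 32 // 8  # 32-bit bus = 4 bytes per transfer
    return clock_mhz * bus_bytes * multiplier  # MB/sec

for mult in (1, 2, 4, 8):
    print(f"AGP {mult}x: {agp_bandwidth_mb_per_sec(mult)} MB/sec")
```

Note that even AGP 8× (about 2.1 GB/sec) falls well short of the 4 GB/sec theoretical figure for PCI Express.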
Table 30-1 reiterates some of the points made in the preceding chapter: there is a vast amount of bandwidth available internally on the GPU. Algorithms that run on the GPU can therefore take advantage of this bandwidth to achieve dramatic performance improvements.

30.2 Overall System Architecture

The next two subsections go into detail about the architecture of the GeForce 6 Series GPUs. Section 30.2.1 describes the architecture in terms of its graphics capabilities. Section 30.2.2 describes the architecture with respect to the general computational capabilities that it provides. See Figure 30-2 for an illustration of the system architecture.

30.2.1 Functional Block Diagram for Graphics Operations

Figure 30-3 illustrates the major blocks in the GeForce 6 Series architecture. In this section, we take a trip through the graphics pipeline, starting with input arriving from the CPU and finishing with pixels being drawn to the frame buffer.

Figure 30-2. The Overall System Architecture of a PC
First, commands, textures, and vertex data are received from the host CPU through shared buffers in system memory or local frame-buffer memory. A command stream is written by the CPU, which initializes and modifies state, sends rendering commands, and references the texture and vertex data. Commands are parsed, and a vertex fetch unit is used to read the vertices referenced by the rendering commands. The commands, vertices, and state changes flow downstream, where they are used by subsequent pipeline stages.

The vertex processors (sometimes called "vertex shaders"), shown in Figure 30-4, allow a program to be applied to each vertex in the object, performing transformations, skinning, and any other per-vertex operation the user specifies. For the first time, a

Figure 30-3. A Block Diagram of the GeForce 6 Series Architecture
GPU, the GeForce 6 Series, allows vertex programs to fetch texture data. All operations are done in 32-bit floating-point (fp32) precision per component. The GeForce 6 Series architecture supports scalable vertex-processing horsepower, allowing the same architecture to service multiple price/performance points. In other words, high-end models may have six vertex units, while low-end models may have two.

Because vertex processors can perform texture accesses, the vertex engines are connected to the texture cache, which is shared with the fragment processors. In addition, there is a vertex cache that stores vertex data both before and after the vertex processor, reducing fetch and computation requirements. This means that if a vertex index occurs twice in a draw call (for example, in a triangle strip), the entire vertex program doesn't have to be rerun for the second instance of the vertex; the cached result is used instead.

Vertices are then grouped into primitives, which are points, lines, or triangles. The Cull/Clip/Setup blocks perform per-primitive operations, removing primitives that aren't visible at all, clipping primitives that intersect the view frustum, and performing edge and plane equation setup on the data in preparation for rasterization.

Figure 30-4. The GeForce 6 Series Vertex Processor
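The effect of the post-transform vertex cache can be sketched as follows. This is an illustrative model only (real hardware uses a small fixed-size cache, not an unbounded table, and `draw_indexed` is a made-up name):

```python
# Hypothetical sketch: a post-transform vertex cache memoizes vertex-program
# results by vertex index, so a vertex shared between triangles in an indexed
# draw call is shaded only once.

def draw_indexed(indices, vertices, vertex_program):
    cache = {}        # vertex index -> shaded result
    shaded_count = 0  # how many times the vertex program actually ran
    output = []
    for i in indices:
        if i not in cache:
            cache[i] = vertex_program(vertices[i])
            shaded_count += 1
        output.append(cache[i])
    return output, shaded_count

# A quad drawn as two triangles references 6 indices but only 4 unique vertices:
verts = [(0, 0), (1, 0), (1, 1), (0, 1)]
out, runs = draw_indexed([0, 1, 2, 0, 2, 3], verts, lambda v: (v[0] * 2, v[1] * 2))
print(runs)  # 4
```

A real cache is a small FIFO, so reuse only pays off when repeated indices occur close together, which is exactly what triangle strips and well-ordered index buffers provide.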
The rasterization block calculates which pixels (or samples, if multisampling is enabled) are covered by each primitive, and it uses the z-cull block to quickly discard pixels (or samples) that are occluded by objects with a nearer depth value. Think of a fragment as a "candidate pixel": that is, it will pass through the fragment processor and several tests, and if it gets through all of them, it will end up carrying depth and color information to a pixel on the frame buffer (or render target).

Figure 30-5 illustrates the fragment processor (sometimes called a "pixel shader") and texel pipeline. The texture and fragment-processing units operate in concert to apply a shader program to each fragment independently. The GeForce 6 Series architecture supports a scalable amount of fragment-processing horsepower. Another popular way to say this is that GPUs in the GeForce 6 Series can have a varying number of fragment pipelines (or "pixel pipelines"). As with the vertex processor, texture data is cached on-chip to reduce bandwidth requirements and improve performance.

The texture and fragment-processing unit operates on squares of four pixels (called quads) at a time, allowing for direct computation of derivatives for calculating texture level of detail. Furthermore, the fragment processor works on groups of hundreds of pixels at a time in single-instruction, multiple-data (SIMD) fashion (with each fragment-processor engine working on one fragment concurrently), hiding the latency of texture fetch behind the computational work of the fragment processor.

Figure 30-5. The GeForce 6 Series Fragment Processor and Texel Pipeline
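The quad arrangement matters because texture level of detail is chosen from how fast texture coordinates change between neighboring fragments. A hedged sketch of the standard calculation (the function and its layout are illustrative; the exact hardware formula is not specified in the text):

```python
# Why fragments are shaded in 2x2 quads: finite differences between
# neighboring fragments' texture coordinates approximate the derivatives
# that determine the texture level of detail (mip level).

import math

def mip_level(quad_uv, texture_size):
    """quad_uv: 2x2 grid of (u, v) coordinates in [0,1]; returns the quad's LOD."""
    (u00, v00), (u10, v10) = quad_uv[0]   # top row: x neighbors
    (u01, v01), _ = quad_uv[1]            # bottom row: y neighbor
    # Finite differences across the quad, scaled to texel units
    du_dx = (u10 - u00) * texture_size
    dv_dx = (v10 - v00) * texture_size
    du_dy = (u01 - u00) * texture_size
    dv_dy = (v01 - v00) * texture_size
    # Footprint = max rate of change in x or y; LOD = log2(footprint)
    rho = max(math.hypot(du_dx, dv_dx), math.hypot(du_dy, dv_dy))
    return max(0.0, math.log2(rho)) if rho > 0 else 0.0

# Texture coordinates stepping 4 texels per pixel select mip level 2:
quad = [[(0.0, 0.0), (4 / 256, 0.0)], [(0.0, 4 / 256), (4 / 256, 4 / 256)]]
print(mip_level(quad, 256))  # 2.0
```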
The fragment processor uses the texture unit to fetch data from memory, optionally filtering the data before returning it to the fragment processor. The texture unit supports many source data formats (see Section 30.3.3, "Supported Data Storage Formats"). Data can be filtered using bilinear, trilinear, or anisotropic filtering. All data is returned to the fragment processor in fp32 or fp16 format. A texture can be viewed as a 2D or 3D array of data that can be read by the texture unit at arbitrary locations and filtered to reconstruct a continuous function. The GeForce 6 Series supports filtering of fp16 textures in hardware.

The fragment processor has two fp32 shader units per pipeline, and fragments are routed through both shader units and the branch processor before recirculating through the entire pipeline to execute the next series of instructions. This rerouting happens once for each core clock cycle. Furthermore, the first fp32 shader unit can be used for perspective correction of texture coordinates when needed (by dividing by w), or for general-purpose multiply operations. In general, it is possible to perform eight or more math operations in the pixel shader during each clock cycle, or four math operations if a texture fetch occurs in the first shader unit.

On the final pass through the pixel shader pipeline, the fog unit can be used to blend fog in fixed-point precision with no performance penalty. Fog blending happens often in conventional graphics applications and uses the following function:

out = FogColor * fogFraction + SrcColor * (1 - fogFraction)

This function can be made fast and small using fixed-precision math, but in general IEEE floating point, it requires two full multiply-adds to do effectively. Because fixed point is efficient and sufficient for fog, it exists in a separate small unit at the end of the shader.
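The fog function above is a linear interpolation, and the fixed-point savings come from algebraically refactoring it into a single multiply-add per channel, since fog*f + src*(1 - f) = src + f*(fog - src). A hedged sketch of one possible fixed-point formulation (illustrative only; the actual hardware arithmetic is not documented here):

```python
# Fixed-point fog blend: one multiply-add per 8-bit channel, using an
# 8-bit fog fraction in 0..255 instead of a float in 0..1.

def fog_blend_fixed(src, fog_color, fraction_8bit):
    """Blend 8-bit color channels toward fog_color by fraction_8bit / 256."""
    return tuple(
        s + ((fraction_8bit * (f - s)) >> 8)  # src + f*(fog - src), fixed point
        for s, f in zip(src, fog_color)
    )

# Full fog (fraction 255) pulls black almost entirely to the fog color:
print(fog_blend_fixed((0, 0, 0), (255, 255, 255), 255))  # (254, 254, 254)
```

The off-by-one at full fraction (254 rather than 255) is a classic artifact of 8-bit fixed-point scaling; real implementations typically bias or round to hide it.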
This is a good example of the trade-offs in providing flexible programmable hardware while still offering maximum performance for legacy applications.

Fragments leave the fragment-processing unit in the order that they are rasterized and are sent to the z-compare and blend units, which perform depth testing (z comparison and update), stencil operations, alpha blending, and the final color write to the target surface (an off-screen render target or the frame buffer).

The memory system is partitioned into up to four independent memory partitions, each with its own dynamic random-access memories (DRAMs). GPUs use standard DRAM modules rather than custom RAM technologies to take advantage of market economies and thereby reduce cost. Having smaller, independent memory partitions allows the memory subsystem to operate efficiently regardless of whether large or small blocks of data are transferred. All rendered surfaces are stored in the DRAMs, while textures and input data can be stored in the DRAMs or in system memory. The four
independent memory partitions give the GPU a wide (256 bits), flexible memory subsystem, allowing for streaming of relatively small (32-byte) memory accesses at near the 35 GB/sec physical limit.

30.2.2 Functional Block Diagram for Non-Graphics Operations

As graphics hardware becomes more and more programmable, applications unrelated to the standard polygon pipeline (as described in the preceding section) are starting to present themselves as candidates for execution on GPUs. Figure 30-6 shows a simplified view of the GeForce 6 Series architecture when used as a graphics pipeline. It contains a programmable vertex engine, a programmable fragment engine, a texture load/filter engine, and a depth-compare/blending data write engine.

In this alternative view, a GPU can be seen as a large amount of programmable floating-point horsepower and memory bandwidth that can be exploited for compute-intensive applications completely unrelated to computer graphics. Figure 30-7 shows another way to view the GeForce 6 Series architecture. When used for non-graphics applications, it can be viewed as two programmable blocks that run serially: the vertex processor and the fragment processor, both with support for fp32 operands and intermediate values. Both use the texture unit as a random-access data fetch unit and access data at a phenomenal 35 GB/sec (550 MHz DDR memory clock × 256 bits per clock cycle × 2 transfers per clock cycle). In addition, both the vertex and the fragment processor are highly computationally capable. (Performance details follow in Section 30.4.)

Figure 30-6. The GeForce 6 Series Architecture Viewed as a Graphics Pipeline
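The 35 GB/sec figure follows directly from the parenthetical above. A minimal check of the arithmetic (function name is illustrative):

```python
# Peak memory bandwidth from the figures in the text:
# 550 MHz DDR clock x 2 transfers per clock x 256 bits (32 bytes) per transfer.

def memory_bandwidth_gb_per_sec(clock_mhz, bus_bits, transfers_per_clock):
    bytes_per_transfer = bus_bits / 8
    return clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9

print(memory_bandwidth_gb_per_sec(550, 256, 2))  # 35.2
```

Note this is the same per-transfer width (32 bytes) as the "relatively small memory accesses" mentioned above, which is why small streamed accesses can still approach the physical limit.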
The vertex processor operates on data, passing it directly to the fragment processor or using the rasterizer to expand the data into interpolated values. At this point, each triangle (or point) from the vertex processor has become one or more fragments.

Before a fragment reaches the fragment processor, the z-cull unit compares the fragment's depth with the values that already exist in the depth buffer. If the fragment's depth is greater, the fragment will not be visible, so there is no point shading it, and the fragment processor isn't even executed. (This optimization happens only if it's clear that the fragment processor isn't going to modify the fragment's depth.) Thinking in a general-purpose sense, this early culling feature makes it possible to quickly decide to skip work on specific fragments based on a scalar test. Chapter 34 of this book, "GPU Flow-Control Idioms," explains how to take advantage of this feature to efficiently predicate work for general-purpose computations.

After the fragment processor runs on a potential pixel (still a "fragment" because it has not yet reached the frame buffer), the fragment must pass a number of tests in order to move farther down the pipeline. (There may also be more than one fragment that comes out of the fragment processor if multiple render targets [MRTs] are being used. Up to four MRTs can be used to write out large amounts of data, up to 16 scalar floating-point values at a time, for example, plus depth.)

First, the scissor test rejects the fragment if it lies outside a specified subrectangle of the frame buffer. Although the popular graphics APIs define scissoring at this location in the pipeline, it is more efficient to perform the scissor test in the rasterizer. Scissoring in x and y actually happens in the rasterizer, before fragment processing, and z scissoring happens

Figure 30-7.
The GeForce 6 Series Architecture for Non-Graphics Applications
during z-cull. This avoids all fragment-processor work on scissored (rejected) pixels. Scissoring is rarely useful for general-purpose computation, because general-purpose programmers typically draw rectangles sized to the computation in the first place.

Next, the fragment's depth is compared with the depth in the frame buffer. If the depth test passes, the fragment moves on in the pipeline. Optionally, the depth value in the frame buffer can be replaced at this stage.

After this, the fragment can optionally test and modify what is known as the stencil buffer, which stores an integer value per pixel. The stencil buffer was originally intended to allow programmers to mask off certain pixels (for example, to restrict drawing to a cockpit's windshield), but it has found other uses as a way to count values by incrementing or decrementing the existing value. This feature is used for stencil shadow volumes, for example.

If the fragment passes the depth and stencil tests, it can then optionally modify the contents of the frame buffer using the blend function. A blend function can be described as

out = src * srcOp + dst * dstOp

where src is the fragment color flowing down the pipeline; dst is the color value in the frame buffer; and srcOp and dstOp can be specified to be constants, source color components, or destination color components. Full blend functionality is supported for all pixel formats up to fp16 × 4. However, fp32 frame buffers don't support blending; only updating the buffer is allowed.

Finally, a feature called occlusion query makes it possible to quickly determine whether any of the fragments that would be rendered in a particular computation would cause results to be written to the frame buffer. (Recall that fragments that do not pass the z-test don't have any effect on the values in the frame buffer.)
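The blend equation above can be sketched for the most familiar configuration, standard alpha blending, where srcOp is the source alpha and dstOp is one minus the source alpha. This is an illustrative per-channel model only, not the hardware path:

```python
# The blend equation out = src * srcOp + dst * dstOp, instantiated for
# conventional alpha blending: srcOp = srcAlpha, dstOp = (1 - srcAlpha).

def alpha_blend(src_rgb, src_alpha, dst_rgb):
    """out = src * srcAlpha + dst * (1 - srcAlpha), per channel."""
    return tuple(s * src_alpha + d * (1.0 - src_alpha)
                 for s, d in zip(src_rgb, dst_rgb))

# 50% opaque white over black gives mid-gray:
print(alpha_blend((1.0, 1.0, 1.0), 0.5, (0.0, 0.0, 0.0)))  # (0.5, 0.5, 0.5)
```

Other srcOp/dstOp choices from the same equation give additive blending (both set to one), modulation, and so on.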
Traditionally, the occlusion query test is used to allow graphics applications to avoid making draw calls for occluded objects, but it is useful for GPGPU applications as well. For instance, if the depth test is used to determine which outputs need to be updated in a sparse array, updating depth can be used to indicate when a given output has converged and no further work is needed. In this case, occlusion query can be used to tell when all output calculations are done. See Chapter 34 of this book, "GPU Flow-Control Idioms," for further information about this idea.
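The GPGPU idiom just described can be sketched as a plain loop. This is a CPU-side model of the idea only; the names are made up, and on real hardware the "depth write", "z-cull", and "occlusion query" steps are the fixed-function mechanisms described earlier in this section:

```python
# Convergence loop using depth as a "done" mask: depth writes mark outputs
# that have converged, z-cull skips them on later passes, and an occlusion
# query counts how many fragments still did work in a pass.

def iterate_until_converged(values, update, converged, max_passes=100):
    done = [False] * len(values)      # models the depth buffer's "done" mask
    for _ in range(max_passes):
        survivors = 0                 # what the occlusion query would report
        for i, v in enumerate(values):
            if done[i]:
                continue              # z-cull: converged outputs are skipped
            values[i] = update(v)
            if converged(values[i]):
                done[i] = True        # depth write marks this output converged
            else:
                survivors += 1
        if survivors == 0:            # occlusion query returned zero: finished
            break
    return values

# Repeated halving until every element drops below a threshold:
result = iterate_until_converged([100.0, 8.0], lambda v: v / 2, lambda v: v < 1.0)
print(result)
```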