The CUDA Toolkit targets a class of applications whose control part runs as a process on a general-purpose computer (Linux, Windows), and which use one or more NVIDIA GPUs as coprocessors for accelerating SIMD parallel jobs. Such jobs are "self-contained" in the sense that a batch of GPU threads can execute and complete them without any intervention by the "host" process, so they can make full use of the parallel graphics hardware.

The CUDA Toolkit supports the dispatching of GPU jobs by the host process in the form of remote procedure calls. The GPU code is implemented as a collection of functions in a language that is essentially C, with annotations that distinguish these functions from the host code and further annotations that distinguish the different types of data memory that exist on the GPU. Such functions may have parameters, and they can be "called" using a syntax that is very similar to a regular C function call, slightly extended so that it can specify the matrix of GPU threads that must execute the "called" function. During its lifetime, the host process may dispatch many parallel GPU tasks. See Figure 1.
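The programming model described above can be illustrated with a minimal sketch. It is not taken from this document: the names `scale`, `scale_and_add`, and `run_scale_and_add`, the problem size, and the block size of 256 are all hypothetical choices made for illustration. It assumes a CUDA-capable GPU and compilation with `nvcc`. The sketch shows an annotation that marks a GPU function (`__global__`), an annotation that selects a GPU memory type (`__constant__`), and the extended call syntax that specifies the matrix of threads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Memory-type annotation: this variable lives in GPU constant memory.
// (Hypothetical name, chosen for this sketch.)
__constant__ float scale;

// Function annotation: __global__ marks code that runs on the GPU
// but is "called" from the host process.
__global__ void scale_and_add(const float *a, const float *b, float *out, int n)
{
    // Each GPU thread computes one element, selected by its position
    // in the matrix of threads.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = scale * a[i] + b[i];
}

// Host-side driver: returns true if the GPU job ran and produced the
// expected values.
bool run_scale_and_add()
{
    const int n = 1000;
    const size_t bytes = n * sizeof(float);
    static float host_a[n], host_b[n], host_out[n];
    for (int i = 0; i < n; ++i) {
        host_a[i] = (float)i;
        host_b[i] = (float)(2 * i);
    }
    const float host_scale = 2.0f;

    // Allocate GPU memory and copy the inputs to it.
    float *dev_a = nullptr, *dev_b = nullptr, *dev_out = nullptr;
    bool ok = cudaMalloc((void **)&dev_a, bytes) == cudaSuccess
           && cudaMalloc((void **)&dev_b, bytes) == cudaSuccess
           && cudaMalloc((void **)&dev_out, bytes) == cudaSuccess
           && cudaMemcpy(dev_a, host_a, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dev_b, host_b, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpyToSymbol(scale, &host_scale, sizeof(float)) == cudaSuccess;

    if (ok) {
        // Extended call syntax: <<<blocks, threads per block>>> specifies
        // the matrix of GPU threads that execute the function.
        const int threads_per_block = 256;
        const int blocks = (n + threads_per_block - 1) / threads_per_block;
        scale_and_add<<<blocks, threads_per_block>>>(dev_a, dev_b, dev_out, n);

        // The copy back waits for the GPU job to complete.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(host_out, dev_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_out);

    // Verify the result on the host.
    for (int i = 0; ok && i < n; ++i)
        if (host_out[i] != host_scale * host_a[i] + host_b[i])
            ok = false;
    return ok;
}

int main()
{
    bool ok = run_scale_and_add();
    printf("%s\n", ok ? "result verified" : "GPU run failed");
    return ok ? 0 : 1;
}
```

Between the launch and the copy back, the GPU threads run to completion without involvement from the host, which is the "self-contained" property described above. A host process could repeat the launch many times with different data during its lifetime.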