CUDA Runtime和L2 Cache简析

冬至子 2023-05-19 1578

电子说

1.3w人已加入

描述

CUDA Runtime

运行时在cudart库中实现，该库通过cudart静态地链接到应用程序。

所有入口都有cuda的前缀。

正如在异构编程中提到的，CUDA编程模型假设一个由主机和设备组成的系统，每个设备都有自己的独立内存。

Initialization

运行时没有显式的初始化函数。 它在第一次调用运行时函数 (更确切地说，是参考手册中错误处理和版本管理部分的函数以外的任何函数) 时初始化 。

运行时为系统中的每个设备创建一个CUDA Context 。该上下文是该设备的primary context， 在该设备上需要活动上下文的第一个运行时函数时初始化 。 它在应用程序的所有主机线程之间共享 。 作为创建上下文的一部分，如果需要的话，设备代码将被实时编译并加载到设备内存中 。这一切都是透明的。如果需要，例如，为了驱动API的互操作性，可以从驱动API访问设备的主上下文。

当主机线程调用cudaDeviceReset()时，这将销毁主机线程当前操作的设备的 primary context (即在device Selection中定义的当前设备)。当前拥有该设备的任何主机线程的下一个运行时函数调用将为该设备创建一个新的 primary context。

注意:CUDA接口使用全局状态，该状态在主机程序启动时初始化，在主机程序终止时销毁。CUDA运行时和驱动程序无法检测此状态是否无效，因此在程序启动或main后终止期间使用任何这些接口(隐式或显式)将导致未定义的行为。

Device Memory

正如在异构编程中提到的，CUDA编程模型假设一个由主机和设备组成的系统，每个设备都有自己的独立内存。内核在设备内存之外运行，因此运行时提供了分配、释放和复制设备内存的函数，以及在主机内存和设备内存之间传输数据。

设备内存可以分配作为linear memory 或 CUDA arrays。

CUDA arrays 是为 texture fetching 优化的不透明内存布局。
Linear memory 在单一的统一地址空间中分配，这意味着分别分配的实体可以通过指针相互引用，例如，在二叉树或链表中。地址空间的大小取决于主机系统(CPU)和使用的GPU的计算能力.

Graphics Interoperability 介绍了运行时提供的与两个主要图形API，OpenGL和 Direct3D互操作的各种功能。

Texture and Surface Memory 提供了纹理和表面存储器空间，提供了访问设备内存的另一种方式;它们还公开了GPU纹理硬件的一个子集。

Call Stack 提到了用于管理CUDA c++调用栈的运行时函数。

Error Checking 描述如何正确检查运行时生成的错误。

Multi-Device System 展示了编程模型如何扩展到具有多个设备连接到同一主机的系统。

Asynchronous Concurrent Execution 描述了用于在系统的各个级别上支持异步并发执行的概念和API。

Page-Locked Host Memory 引入了页锁定主机内存，它需要在内核执行与主机和设备内存之间的数据传输重叠。

Shared Memory演示了如何使用线程层次结构中引入的共享内存来最大化性能。

CUDA

Linear memory 通常使用 cudaMalloc() 分配，使用cudaFree()释放，主机内存和设备内存之间的数据传输通常使用cudaMemcpy()完成。在kernel的vector加法代码示例中，需要将vector从主机内存复制到设备内存:

// Device code

__global__

void

VecAdd

(

float

* A,

float

* B,

float

* C,

int

N)

{

int

i = blockDim.x * blockIdx.x + threadIdx.x;

if

(i < N)

C[i] = A[i] + B[i];

}

// Host code

int

main

()

{

int

N = ...;

size_t

size = N *

sizeof

(

float

);

// Allocate input vectors h_A and h_B in host memory

float

* h_A = (

float

*)

malloc

(size);

float

* h_B = (

float

*)

malloc

(size);

float

* h_C = (

float

*)

malloc

(size);

// Initialize input vectors

...

// Allocate vectors in device memory

float

* d_A;

cudaMalloc(&d_A, size);

float

* d_B;

cudaMalloc(&d_B, size);

float

* d_C;

cudaMalloc(&d_C, size);

// Copy vectors from host memory to device memory

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

// Invoke kernel

int

threadsPerBlock =

256

;

int

blocksPerGrid =

(N + threadsPerBlock -

1

) / threadsPerBlock;

VecAdd<<>>(d_A, d_B, d_C, N);

// Copy result from device memory to host memory

// h_C contains the result in host memory

cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

// Free device memory

cudaFree(d_A);

cudaFree(d_B);

cudaFree(d_C);

// Free host memory

...

}

Linear memory 也可以通过cudaMallocPitch()和cudaMalloc3D()来分配。这些函数被推荐用于2D或3D数组的分配，因为它确保分配被适当填充，以满足设备内存访问中描述的对齐要求，因此在访问行地址或在2D数组和设备内存的其他区域之间执行复制时(使用cudaMemcpy2D()和cudaMemcpy3D()函数)确保最佳性能。返回的pitch(或stride)必须用于访问数组元素。

下面的代码示例分配了一个 width x height 的二维浮点值数组，并展示了如何在设备代码中循环遍历数组元素:

// Host code

int

width =

64

, height =

64

;

float

* devPtr;

size_t

pitch;

cudaMallocPitch(&devPtr, &pitch,

width *

sizeof

(

float

), height);

MyKernel<<<

100

,

512

>>>(devPtr, pitch, width, height);

// Device code

__global__

void

MyKernel

(

float

* devPtr,

size_t

pitch,

int

width,

int

height)

{

for

(

int

r =

0

; r < height; ++r) {

float

* row = (

float

*)((

char

*)devPtr + r * pitch);

for

(

int

c =

0

; c < width; ++c) {

float

element = row[c];

}

下面的代码示例分配了一个 width x height x depth 的浮点值3D数组，并展示了如何在设备代码中循环遍历数组元素:

// Host code

int

width =

64

, height =

64

, depth =

64

;

cudaExtent extent = make_cudaExtent(width *

sizeof

(

float

),

height, depth);

cudaPitchedPtr devPitchedPtr;

cudaMalloc3D(&devPitchedPtr, extent);

MyKernel<<<

100

,

512

>>>(devPitchedPtr, width, height, depth);

// Device code

__global__

void

MyKernel

(cudaPitchedPtr devPitchedPtr,

int

width,

int

height,

int

depth)

{

char

* devPtr = devPitchedPtr.ptr;

size_t

pitch = devPitchedPtr.pitch;

size_t

slicePitch = pitch * height;

for

(

int

z =

0

; z < depth; ++z) {

char

* slice = devPtr + z * slicePitch;

for

(

int

y =

0

; y < height; ++y) {

float

* row = (

float

*)(slice + y * pitch);

for

(

int

x =

0

; x < width; ++x) {

float

element = row[x];

}

下面的代码示例演示了通过运行时API访问全局变量的各种方法:

__constant__

float

constData[

256

];

float

data[

256

];

cudaMemcpyToSymbol(constData, data,

sizeof

(data));

cudaMemcpyFromSymbol(data, constData,

sizeof

(data));

__device__

float

devData;

float

value =

3.14f

;

cudaMemcpyToSymbol(devData, &value,

sizeof

(

float

));

__device__

float

* devPointer;

float

* ptr;

cudaMalloc(&ptr,

256

*

sizeof

(

float

));

cudaMemcpyToSymbol(devPointer, &ptr,

sizeof

(ptr));

cudaGetSymbolAddress()用于检索指向分配给在全局内存空间中声明的变量的内存的地址。所分配内存的大小通过cudaGetSymbolSize()获得。

Device Memory L2 Access Management

当CUDA内核重复访问全局内存中的数据区域时，可以认为这种数据访问是 persisting 。

另一方面，如果数据只被访问一次，则可以将这种数据访问视为 streaming 。

从CUDA 11.0开始，具有8.0及以上计算能力的设备能够影响L2缓存中的数据持久性，从而可能提供更高的带宽和更低的全局内存访问延迟。

L2 cache Set-Aside for Persisting Accesses

L2缓存的一部分可以被预留出来，用于持久化对全局内存的数据访问 。持久化访问优先使用L2缓存的预留部分，而正常的或流的全局内存访问只能在持久化访问未使用时使用L2的这部分。

用于持久化访问的L2缓存预留大小可以在限制范围内进行调整:

cudaGetDeviceProperties(&prop, device_id);

size_t

size = min(

int

(prop.l2CacheSize *

0.75

), prop.persistingL2CacheMaxSize);

cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size);

/* set-aside 3/4 of L2 cache for persisting accesses or the max allowed*/

当GPU配置为MIG (Multi-Instance GPU)模式时，L2缓存预留功能不可用。

当使用多进程服务(MPS)时，L2缓存预留大小不能通过cudaDeviceSetLimit来改变。相反，只能在启动MPS服务器时通过环境变量CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT指定预留大小。

L2 Policy for Persisting Accesses

访问策略窗口指定全局内存的连续区域和L2缓存中的持久性属性，以便在该区域内进行访问。

下面的代码示例展示了如何使用CUDA流设置L2持久化访问窗口。

CUDA Stream Example

cudaStreamAttrValue stream_attribute;

// Stream level attributes data structure

stream_attribute.accessPolicyWindow.base_ptr =

reinterpret_cast

<

void

*>(ptr);

// Global Memory data pointer

stream_attribute.accessPolicyWindow.num_bytes = num_bytes;

// Number of bytes for persistence access.

// (Must be less than cudaDeviceProp::accessPolicyMaxWindowSize)

stream_attribute.accessPolicyWindow.hitRatio =

0.6

;

// Hint for cache hit ratio

stream_attribute.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;

// Type of access property on cache hit

stream_attribute.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;

// Type of access property on cache miss.

//Set the attributes to a CUDA stream of type cudaStream_t

cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);

当内核随后在CUDA stream 中执行时，全局内存范围 [ptr..ptr+num_bytes] 内的内存访问比访问其他全局内存位置更有可能持久存在L2缓存中。

CUDA GraphKernelNode Example

cudaKernelNodeAttrValue node_attribute;

// Kernel level attributes data structure

node_attribute.accessPolicyWindow.base_ptr =

reinterpret_cast

<

void

*>(ptr);

// Global Memory data pointer

node_attribute.accessPolicyWindow.num_bytes = num_bytes;

// Number of bytes for persistence access.

// (Must be less than cudaDeviceProp::accessPolicyMaxWindowSize)

node_attribute.accessPolicyWindow.hitRatio =

0.6

;

// Hint for cache hit ratio

node_attribute.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;

// Type of access property on cache hit

node_attribute.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;

// Type of access property on cache miss.

//Set the attributes to a CUDA Graph Kernel node of type cudaGraphNode_t

cudaGraphKernelNodeSetAttribute(node, cudaKernelNodeAttributeAccessPolicyWindow, &node_attribute);

可以使用hitRatio参数指定接收hitProp属性的访问的比例。在上面的两个示例中，全局内存区域中60%的内存访问[ptr..ptr+num_bytes]具有持久化属性，40%的内存访问具有流属性。哪些特定的内存访问被分类为持久化(hitProp)是随机的，概率近似于hitRatio;概率分布取决于硬件架构和内存大小。

例如，如果L2预留缓存大小为16KB，而accessPolicyWindow中的num_bytes为32KB:

当命中率为0.5时，硬件将随机选择32KB窗口中的16KB指定为持久化并缓存到预留的L2缓存区。
当hitRatio为1.0时，硬件将尝试将整个32KB窗口缓存到预留的L2缓存区。由于预留区域比窗口小，缓存行将被删除，以将最近使用的16KB数据保存在L2缓存的预留部分。

因此，可以使用hitRatio来避免缓存线的抖动，并从总体上减少移动到L2缓存和移出的数据量。

hitRatio值低于1.0可用于手动控制与并发CUDA流不同的accessPolicyWindows可以在L2中缓存的数据量。例如，设L2预留缓存大小为16KB;在两个不同的CUDA流中的两个并发内核，每个都具有16KB的accessPolicyWindow，并且都具有1.0的hitRatio值，在竞争共享的L2资源时，可能会驱逐彼此的缓存线。但是，如果两个accessPolicyWindows的hitRatio值都是0.5，它们就不太可能驱逐自己的或彼此的持久化缓存行。

L2 Access Properties

为不同的全局内存数据访问定义了三种类型的访问属性:

cudaAccessPropertyStreaming:带有streaming属性的内存访问不太可能持久存在L2缓存中，因为这些访问会优先被删除。
cudaAccessPropertyPersisting:具有persisting属性的内存访问更有可能保存在L2缓存中，因为这些访问优先保存在L2缓存的预留部分。
cudaAccessPropertyNormal: 这个访问属性强制重置之前应用的持久化访问属性到正常状态。来自以前CUDA内核的具有持久化属性的内存访问可能会在预期使用之后很长时间内保留在L2缓存中。这种使用后持久化减少了不使用持久化属性的后续内核可用的L2缓存量。使用cudaAccessPropertyNormal属性重置访问属性窗口将删除先前访问的持久(优先保留)状态，就像先前访问没有访问属性一样。

L2 Persistence Example

下面的例子展示了如何为持久访问预留L2缓存，通过CUDA流在CUDA内核中使用预留的L2缓存，然后重置L2缓存。

cudaStream_t stream;

cudaStreamCreate(&stream);

// Create CUDA stream

cudaDeviceProp prop;

// CUDA device properties variable

cudaGetDeviceProperties( &prop, device_id);

// Query GPU properties

size_t

size = min(

int

(prop.l2CacheSize *

0.75

) , prop.persistingL2CacheMaxSize );

cudaDeviceSetLimit( cudaLimitPersistingL2CacheSize, size);

// set-aside 3/4 of L2 cache for persisting accesses or the max allowed

size_t

window_size = min(prop.accessPolicyMaxWindowSize, num_bytes);

// Select minimum of user defined num_bytes and max window size.

cudaStreamAttrValue stream_attribute;

// Stream level attributes data structure

stream_attribute.accessPolicyWindow.base_ptr =

reinterpret_cast

<

void

*>(data1);

// Global Memory data pointer

stream_attribute.accessPolicyWindow.num_bytes = window_size;

// Number of bytes for persistence access

stream_attribute.accessPolicyWindow.hitRatio =

0.6

;

// Hint for cache hit ratio

stream_attribute.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;

// Persistence Property

stream_attribute.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;

// Type of access property on cache miss

cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);

// Set the attributes to a CUDA Stream

for

(

int

i =

0

; i <

10

; i++) {

cuda_kernelA<<

0

,stream>>>(data1);

// This data1 is used by a kernel multiple times

}

// [data1 + num_bytes) benefits from L2 persistence

cuda_kernelB<<,block_size,<>

0

,stream>>>(data1);

// A different kernel in the same stream can also benefit

// from the persistence of data1

stream_attribute.accessPolicyWindow.num_bytes =

0

;

// Setting the window size to 0 disable it

cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);

// Overwrite the access policy attribute to a CUDA Stream

cudaCtxResetPersistingL2Cache();

// Remove any persistent lines in L2

cuda_kernelC<<,block_size,<>

0

,stream>>>(data2);

// data2 can now benefit from full L2 in normal mode

`Reset L2 Access to Normal`

来自上一个CUDA内核的持久化L2缓存线可能在它被使用后很长一段时间内持久化L2。因此，对于流或正常内存访问来说，L2缓存的正常优先级重置为正常是很重要的。有三种方法可以将持久化访问重置为正常状态。

使用访问属性cudaAccessPropertyNormal设置先前的持久化内存区域。

通过调用cudaCtxResetPersistingL2Cache()将所有持久化L2缓存线重置为正常。

最终未碰触的线路会自动重置为正常。由于自动复位发生所需的时间长度不确定，因此强烈不鼓励依赖自动复位。

`Manage Utilization of L2 set-aside cache`

在不同的CUDA流中并发执行的多个CUDA内核可能会为它们的流分配不同的访问策略窗口。然而， L2预留缓存部分在所有这些并发CUDA内核之间共享。因此，这个预留缓存部分的净利用率是所有并发内核单独使用的总和。当持久化访问的量超过预留的L2缓存容量时，将内存访问指定为持久化访问的好处就会减少。

为了管理预留的L2缓存部分的利用率，应用程序必须考虑以下因素:

L2预留缓存的大小。

可以并发执行的CUDA内核。

可并发执行的所有CUDA内核的访问策略窗口。

需要在何时以及如何重置L2，以允许normal或streaming访问以同等优先级利用之前设置的L2缓存。

`Query L2 cache Properties`

与L2缓存相关的属性是cudaDeviceProp结构的一部分，可以使用CUDA运行时API cudaGetDeviceProperties查询.

CUDA设备属性包括:

l2CacheSize: GPU上可用的L2缓存量。

persistingL2CacheMaxSize:可为持久内存访问预留的L2缓存的最大数量。

accessPolicyMaxWindowSize:访问策略窗口的最大大小。

`Control L2 Cache Set-Aside Size for Persisting Memory Access`

使用CUDA运行时API cudaDeviceGetLimit查询用于持久化内存访问的L2预留缓存大小，并使用CUDA运行时API cudaDeviceSetLimit作为cudaLimit进行设置。该限制的最大值为cudaDeviceProp::persistingL2CacheMaxSize。

enum

cudaLimit {

/* other fields not shown */

cudaLimitPersistingL2CacheSize

};

打开APP阅读更多精彩内容