Thursday, June 10, 2010

CUDA Quick Tips, Reference, and Cheat Sheets

Here are some quick tips and references I strung together while learning CUDA.

A. Size of a Grid:
  • gridDim.x (1Dimensional)
  • gridDim.x and gridDim.y (2Dimensional; both are the same for an N x N Grid)

B. Size of a Block:
  • blockDim.x (1Dimensional)
  • blockDim.x and blockDim.y (2Dimensional; both are the same for an N x N Block)

C. Thread Local Index within its block (assuming a 1Dimensional Block):
  • threadIdx.x

D. Block Index within the Grid:
  • blockIdx.x (1Dimensional)
  • blockIdx.x (2Dimensional) --> Current Column Index (Length) of the Block in an N x N Grid
  • blockIdx.y (2Dimensional) --> Current Row Index (Height) of the Block in an N x N Grid

E. Thread Global Index across the entire grid (assuming a 1 Dimensional Grid):
  • (blockDim.x * blockIdx.x) + threadIdx.x
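
Here is a minimal sketch of formula E inside a kernel. The kernel name `writeGlobalIndex` and the host helper `globalIndexDemo` are made up for illustration, and the bounds check is my own addition for the case where the grid holds more threads than there are elements.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread stores its own global index at that position in `out`.
__global__ void writeGlobalIndex(int *out, int n)
{
    int i = (blockDim.x * blockIdx.x) + threadIdx.x;  // formula E
    if (i < n)  // the last block may have threads past the end
        out[i] = i;
}

// Hypothetical host helper: launches the kernel and returns true if
// every element ended up holding its own index. Error handling is minimal.
bool globalIndexDemo(int n, int threadsPerBlock)
{
    int *dOut = nullptr;
    if (cudaMalloc(&dOut, n * sizeof(int)) != cudaSuccess) return false;

    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    writeGlobalIndex<<<blocks, threadsPerBlock>>>(dOut, n);

    std::vector<int> h(n, -1);
    cudaError_t err = cudaMemcpy(h.data(), dOut, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h[i] != i) return false;
    return true;
}
```

Calling `globalIndexDemo(1000, 256)` launches 4 blocks (1024 threads), so the last 24 threads fall past the end and are skipped by the `if (i < n)` check.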

F. Thread Global Index across the entire grid (assuming a 2Dimensional Grid of 2Dimensional Blocks):

F-1. Obtain current column index (assuming you have an N x N Block):
  • (blockIdx.x * blockDim.x) + threadIdx.x
F-2. Obtain current row index (assuming you have an N x N Block):
  • (blockIdx.y * blockDim.x) + threadIdx.y
Since you have an N x N Block, the Length and Height are the same (blockDim.x == blockDim.y). For a non-square Block, use blockDim.y for the row index.

Quick Example

N = 1024. You have to process N x N elements (1024 x 1024). You could decompose the grid like so: set the blockSize to 16 along each dimension. Then gridSize = N / blockSize --> gridSize = 1024 / 16 = 64 along each dimension. Maybe not the most efficient way, but since it's only an example it will do!

So your grid is composed of 4096 Blocks (64 x 64), and each Block is composed of 256 threads (16 x 16).

Total Blocks * Total Threads per Block = 4096 * 256 = 1,048,576 = N * N = 1024 * 1024.
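
The same arithmetic can be written down with `dim3`, as a rough sketch. The struct `LaunchConfig` and the helpers `makeSquareConfig` and `totalThreads` are names I made up for illustration, and the code assumes N divides evenly by the block edge.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper type: pairs a grid size with a block size.
struct LaunchConfig {
    dim3 grid;
    dim3 block;
};

// Builds a square decomposition for an n x n problem.
// Assumes n is an exact multiple of blockEdge.
LaunchConfig makeSquareConfig(unsigned n, unsigned blockEdge)
{
    LaunchConfig cfg;
    cfg.block = dim3(blockEdge, blockEdge);          // e.g. 16 x 16 = 256 threads
    cfg.grid  = dim3(n / blockEdge, n / blockEdge);  // e.g. 64 x 64 = 4096 blocks
    return cfg;
}

// Total number of threads a configuration would launch.
unsigned long long totalThreads(const LaunchConfig &cfg)
{
    unsigned long long blocks =
        (unsigned long long)cfg.grid.x * cfg.grid.y * cfg.grid.z;
    unsigned long long threads =
        (unsigned long long)cfg.block.x * cfg.block.y * cfg.block.z;
    return blocks * threads;
}
```

With `makeSquareConfig(1024, 16)` you get a 64 x 64 grid of 16 x 16 blocks, and `totalThreads` comes out to 1,048,576. You would then launch with something like `kernel<<<cfg.grid, cfg.block>>>(...)`.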

To process each element serially, you would probably have a nested for loop:
for (each col)
    for (each row)
        process element

To access each element for processing in CUDA (assuming you are storing results in a 1D array):

  • Index = (Global Row * Number of Elements) + Global Column
  • Global Row = (blockIdx.y * blockDim.x + threadIdx.y)
  • Global Column = (blockIdx.x * blockDim.x + threadIdx.x)
  • Number of Elements = N = number of elements Length wise, i.e. per row (1024 in my example)
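
Putting the pieces together, here is a sketch of a kernel where each thread handles one element of the N x N array. The names `processElements` and `processDemo` are made up, and storing `row - col` is just a stand-in for "process element" so that a swapped row/column would be easy to spot.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per element of an n x n array stored row by row in 1D.
__global__ void processElements(int *out, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // global column
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // global row
    // blockDim.y equals blockDim.x for a square block
    if (row < n && col < n)
        out[row * n + col] = row - col;  // placeholder for "process element"
}

// Hypothetical host helper: runs the kernel and checks every element
// against the same nested loop you would write serially.
bool processDemo(int n, int blockEdge)
{
    size_t bytes = (size_t)n * n * sizeof(int);
    int *dOut = nullptr;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) return false;

    dim3 block(blockEdge, blockEdge);
    dim3 grid((n + blockEdge - 1) / blockEdge,
              (n + blockEdge - 1) / blockEdge);  // round up
    processElements<<<grid, block>>>(dOut, n);

    std::vector<int> h((size_t)n * n, 0);
    cudaError_t err = cudaMemcpy(h.data(), dOut, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int row = 0; row < n; ++row)
        for (int col = 0; col < n; ++col)
            if (h[(size_t)row * n + col] != row - col) return false;
    return true;
}
```

For my example you would call `processDemo(1024, 16)`, which launches the 64 x 64 grid of 16 x 16 blocks described above.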

More quick tips in the future ...

