What are the four GPU memory types?
Global (slow, accessible everywhere), Local (fast, shared within a work-group), Private (fastest, per work-item), Constant (fast, read-only)
Why do GPUs need multiple memory types?
Each type trades capacity and scope for speed, so a high-throughput kernel can keep frequently used data close to the compute units and leave bulk data in larger, slower memory
What is global memory?
Accessible by all work items, slower but largest capacity, not always cached
What is local memory?
Shared within work group, much faster than global, used for work group communication
What is private memory?
Per-work-item, fastest access, implemented as registers
What is constant memory?
Read-only during kernel execution, usually cached and so faster than global, for data that doesn’t change
What is memory coalescing?
Adjacent work-items access adjacent memory addresses, so the hardware can combine their accesses into fewer, wider memory transactions
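A rough way to see why coalescing matters is to count how many memory transactions a group of adjacent work-items would need. The sketch below is a host-side C model, not GPU code. The 128-byte segment size, the 4-byte float, and the function name segments_touched are assumptions for illustration; real hardware differs by vendor and generation.

```c
#include <stddef.h>

/* Assumed size of one memory transaction in bytes (hypothetical;
   actual values depend on the device). */
#define SEGMENT_BYTES 128

/* Count the distinct SEGMENT_BYTES-sized segments touched when n_items
   adjacent work-items each read one float at index (id * stride).
   Addresses never decrease as id grows, so counting segment changes
   is enough to count distinct segments. */
size_t segments_touched(size_t n_items, size_t stride)
{
    size_t count = 0;
    size_t last_segment = (size_t)-1;   /* sentinel: no segment seen yet */

    for (size_t id = 0; id < n_items; ++id) {
        size_t byte_addr = id * stride * sizeof(float);
        size_t segment = byte_addr / SEGMENT_BYTES;
        if (segment != last_segment) {
            ++count;
            last_segment = segment;
        }
    }
    return count;
}
```

Under this model, 32 work-items reading consecutive floats (stride 1) touch a single segment, while a stride of 32 floats makes every work-item touch its own segment.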
What is the GPU memory hierarchy?
From fastest and smallest to slowest and largest: Private (registers) → Local → Global → Host memory
What is cache coherency in GPUs?
Ensuring consistent views when multiple work items access same location
What is false sharing in GPUs?
Work items accessing different data in same cache line, causing unnecessary invalidation
What is register overflow (register spilling)?
When a kernel needs more registers than are available per work-item, the compiler spills the excess private variables to slower off-chip memory
What is static local memory allocation?
Declared inside the kernel with a compile-time size: __local float temp[128];
What is dynamic local memory allocation?
Passed as kernel argument: __kernel void func(__local float *temp)
How to allocate dynamic local memory?
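clSetKernelArg(kernel, arg_index, size_in_bytes, NULL); the NULL value tells the runtime to allocate local memory of that size rather than copy data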
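The dynamic allocation described above can be sketched as follows. This is a minimal illustration, not a complete host program: the kernel name sum_groups, the helper local_arg_bytes, and the argument index 2 are hypothetical, and the kernel source is embedded as a string, as host code commonly does before passing it to clCreateProgramWithSource. The clSetKernelArg call itself appears only in a comment because it needs the OpenCL headers and a built kernel object.

```c
#include <stddef.h>

/* OpenCL C kernel source (illustrative). The __local pointer argument
   receives memory whose size the host chooses at run time. */
static const char *sum_groups_src =
    "__kernel void sum_groups(__global const float *in,\n"
    "                         __global float *out,\n"
    "                         __local float *temp)   /* dynamic local memory */\n"
    "{\n"
    "    /* A static alternative would declare: __local float temp[128]; */\n"
    "    size_t lid = get_local_id(0);\n"
    "    temp[lid] = in[get_global_id(0)];\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    /* Tree reduction; assumes a power-of-two work-group size. */\n"
    "    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {\n"
    "        if (lid < s)\n"
    "            temp[lid] += temp[lid + s];\n"
    "        barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    }\n"
    "    if (lid == 0)\n"
    "        out[get_group_id(0)] = temp[0];\n"
    "}\n";

/* Bytes to request for the __local argument: one float per work-item. */
size_t local_arg_bytes(size_t work_group_size)
{
    return work_group_size * sizeof(float);
}

/* Host-side call, shown as a comment (requires <CL/cl.h>); index 2
   matches the position of `temp` in the kernel signature above:
       clSetKernelArg(kernel, 2, local_arg_bytes(wg_size), NULL);
*/
```

Sizing the buffer from the work-group size at launch time is the main reason to prefer the dynamic form; the static form suits cases where the size is known when the kernel is compiled.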
What is the global memory access pattern?
Use __global qualifier, slowest but most flexible
What is the local memory access pattern?
Use __local qualifier, fastest for work-group communication
Which variables are private by default?
Variables declared inside a kernel without an address space qualifier, and non-pointer kernel arguments, are private by default
What is constant memory for?
Read-only data that benefits from caching: __constant qualifier
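The qualifiers above can be seen together in one small kernel. This is a hedged sketch: the kernel name scale and its arguments are made up for illustration, and the source is embedded as a C string the way host code usually carries it.

```c
/* OpenCL C kernel source (illustrative names). */
static const char *scale_src =
    "__kernel void scale(__global float *data,        /* device-wide buffer   */\n"
    "                    __constant float *factor)    /* read-only input      */\n"
    "{\n"
    "    size_t gid = get_global_id(0);   /* private: no qualifier needed */\n"
    "    float x = data[gid];             /* private copy of one element  */\n"
    "    data[gid] = x * factor[0];       /* write back to global memory  */\n"
    "}\n";
```

Each work-item reads one element from global memory into a private variable, multiplies it by a value read from constant memory, and writes the result back.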
What is unified memory?
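CPU/GPU memory appears as a single address space (shared virtual memory in OpenCL 2.0+; in CUDA, unified virtual addressing since 4.0 and managed memory with automatic migration since 6.0)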
What is the advantage of unified memory?
Simplifies programming, automatic data movement
What is the disadvantage of unified memory?
Less control over when and where data moves, so on-demand migration can be slower than explicit, well-placed transfers