What is the scalar product?
a · b = Σ(a_i × b_i) for vectors a and b
What is the serial scalar product algorithm?
Initialize sum = 0, loop adding a[i] * b[i]
What is parallel reduction?
Combine elements via a binary tree: pair up elements, combine each pair, and repeat until one value remains
What is the kernel for parallel reduction?
Each work item computes a partial sum over its slice of the input; the work group then combines the partial sums with a binary tree reduction in local memory
What is barrier synchronization?
barrier(CLK_LOCAL_MEM_FENCE) - every work item in the work group waits until all items in the group have reached the barrier
Why are barriers needed?
Ensure all work items finish writing to local memory before any work item reads the shared data
What is the problem with global barriers?
GPUs cannot synchronize between work groups - no global barrier
What is the solution for cross-work-group reduction?
Use multiple kernel launches, each completing before next starts
What is lockstep execution?
SIMD cores execute the same instruction across all threads in a group simultaneously
What is divergence?
When threads in same warp execute different code paths, causing serialization
What causes divergence?
if/else branches where different threads take different paths
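A sketch of the idea in C: the branchy version would diverge on a GPU when even and odd lanes take different paths, while a select-style rewrite can compile to predicated instructions with no divergence (whether it actually does is compiler-dependent; the function names are illustrative):

```c
/* Divergent version: within a warp, even and odd lanes take different
 * branches, so on a GPU the two paths run serially. */
int divergent(int lane) {
    if (lane % 2 == 0)
        return lane * 2;
    else
        return lane + 1;
}

/* Select-style version: a conditional select can map to predicated
 * instructions, keeping all lanes in lockstep. Same results. */
int branch_free(int lane) {
    return (lane % 2 == 0) ? lane * 2 : lane + 1;
}
```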
What is the performance impact of divergence?
The different paths execute serially, one after another, so parallelism is lost within the divergent warp
What is subgroup size?
Number of threads executing in lockstep (32 for Nvidia warps, 64 for AMD wavefronts)
Why exploit subgroups in reduction?
Once the remaining data fits within a single subgroup, lockstep execution makes explicit synchronization unnecessary
What is the subgroup reduction optimization?
Skip barriers for final reduction steps within subgroup
What is the reduction pattern with subgroups?
Normal reduction with barriers until subgroup size, then lockstep reduction
What is the advantage of subgroup-aware code?
Reduced synchronization overhead in final reduction steps
What is the trade-off with divergence?
Some algorithms accept divergence in exchange for simpler code when the clarity benefit outweighs the serialization cost
What is the key insight about GPU synchronization?
Local synchronization works, global requires multiple kernels
What makes GPU synchronization different from CPU?
Work groups are independent, no global coordination primitives