17-GPU-Synchronisation Flashcards

(20 cards)

1
Q

What is the scalar product?

A

a · b = Σ(a_i × b_i) for vectors a and b

2
Q

What is the serial scalar product algorithm?

A

Initialize sum = 0, loop adding a[i] * b[i]

3
Q

What is parallel reduction?

A

Use a binary tree to combine elements: pair, combine, repeat, halving the number of active elements each step until one value remains

4
Q

What is the kernel for parallel reduction?

A

Each work item does partial sum, then binary tree reduction in local memory

5
Q

What is barrier synchronization?

A

barrier(CLK_LOCAL_MEM_FENCE) - all work items in the work group wait until every item has reached the barrier

6
Q

Why are barriers needed?

A

Ensure all work items complete writing before reading shared data

7
Q

What is the problem with global barriers?

A

GPUs cannot synchronize between work groups within a single kernel - there is no global barrier

8
Q

What is the solution for cross-work-group reduction?

A

Use multiple kernel launches, each completing before next starts

9
Q

What is lockstep execution?

A

SIMD cores execute same instruction across all threads simultaneously

10
Q

What is divergence?

A

When threads in same warp execute different code paths, causing serialization

11
Q

What causes divergence?

A

if/else branches where different threads take different paths

12
Q

What is the performance impact of divergence?

A

Code executes serially for different paths, losing parallelism

13
Q

What is subgroup size?

A

Number of threads executing in lockstep (32 for Nvidia warps, 64 for AMD wavefronts)

14
Q

Why exploit subgroups in reduction?

A

Once reduced to subgroup size, no need for explicit synchronization

15
Q

What is the subgroup reduction optimization?

A

Skip barriers for final reduction steps within subgroup

16
Q

What is the reduction pattern with subgroups?

A

Normal reduction with barriers until subgroup size, then lockstep reduction

17
Q

What is the advantage of subgroup-aware code?

A

Reduced synchronization overhead in final reduction steps

18
Q

What is the trade-off with divergence?

A

Some algorithms accept divergence for cleaner code if benefit outweighs cost

19
Q

What is the key insight about GPU synchronization?

A

Local synchronization works, global requires multiple kernels

20
Q

What makes GPU synchronization different from CPU?

A

Work groups are independent, no global coordination primitives