cudaDeviceSynchronize()
is called from the host.
It blocks the host until all previously launched kernels have finished.
cudaStreamSynchronize()
is called from the host.
It blocks the host until all kernels queued in the given stream have finished.
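A minimal sketch of host-side synchronization on a single stream. The kernel add_one and the helper run_in_stream are hypothetical names made up for this illustration; only the cuda* calls are part of the CUDA runtime API.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to every element.
__global__ void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Queues a memset, a kernel and a copy in one stream, then waits for that stream.
// Returns element 0 after the kernel has run, or -1 if a CUDA call fails.
int run_in_stream(int n)
{
    int *d = nullptr;
    int first = -1;
    cudaStream_t s;

    if (cudaStreamCreate(&s) != cudaSuccess) return -1;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) {
        cudaStreamDestroy(s);
        return -1;
    }

    cudaMemsetAsync(d, 0, n * sizeof(int), s);          // queued in stream s
    add_one<<<(n + 255) / 256, 256, 0, s>>>(d, n);      // queued in stream s
    cudaMemcpyAsync(&first, d, sizeof(int),
                    cudaMemcpyDeviceToHost, s);         // queued in stream s

    // The host waits here, but only for the work queued in stream s.
    // cudaDeviceSynchronize() would wait for all streams on the device.
    cudaError_t err = cudaStreamSynchronize(s);

    cudaFree(d);
    cudaStreamDestroy(s);
    return err == cudaSuccess ? first : -1;
}
```

Calling run_in_stream(1000) from main should give 1 on a working GPU, since every element starts at 0 and is incremented once.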
__syncthreads()
is called inside a kernel.
Each thread waits until all threads in its block have reached the same point!
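A small sketch of why the barrier is needed, assuming a single block of 256 threads that reverses an array through shared memory. The names reverse_block, reverse_on_gpu and the constant N are invented for this example.

```cuda
#include <cuda_runtime.h>

#define N 256   // hypothetical block size, also the array length

// Reverses N integers using one block of N threads.
__global__ void reverse_block(int *data)
{
    __shared__ int tile[N];
    int t = threadIdx.x;

    tile[t] = data[t];     // each thread writes one element to shared memory
    __syncthreads();       // wait until every thread in the block has written
    data[t] = tile[N - 1 - t];   // now safe to read another thread's element
}

// Reverses a host array of N ints in place. Returns 0 on success, 1 on failure.
int reverse_on_gpu(int *host)
{
    int *d = nullptr;
    if (cudaMalloc(&d, N * sizeof(int)) != cudaSuccess) return 1;

    cudaMemcpy(d, host, N * sizeof(int), cudaMemcpyHostToDevice);
    reverse_block<<<1, N>>>(d);
    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(host, d, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

Without the barrier, a thread could read tile[N - 1 - t] before the thread responsible for that element has written it. Note that the barrier only covers threads within one block, not the whole grid.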
cudaEventCreate
initialize an event variable
cudaEventRecord
place the event as a marker in a stream's queue
cudaEventSynchronize
wait on the host until the event has been reached,
i.e. all work queued before it has finished
cudaEventElapsedTime
get the time difference
between two recorded events, in milliseconds
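The four event calls are typically combined to time a kernel. The following is a sketch only; busy_kernel and time_kernel_ms are hypothetical names, and the measured time depends entirely on the GPU used.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel that does a little arithmetic per element.
__global__ void busy_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

// Returns the kernel time in milliseconds, or -1.0f if a CUDA call fails.
float time_kernel_ms(int n)
{
    float *d = nullptr;
    float ms = -1.0f;
    cudaEvent_t start, stop;

    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    cudaEventCreate(&start);            // initialize the event variables
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);          // marker before the kernel (default stream)
    busy_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop, 0);           // marker after the kernel

    cudaEventSynchronize(stop);         // host waits until 'stop' has been reached
    if (cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) ms = -1.0f;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}
```

Because the events are recorded in the device's queue rather than on the host clock, the result excludes most host-side overhead.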
Coalescing memory accesses
Always access global memory "in order"
If consecutive threads access consecutive global memory addresses,
the hardware can combine them into fewer memory transactions and performance will be improved!
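To make the access pattern concrete, here is a sketch of two copy kernels that produce the same result but touch memory in a different order. All names (copy_coalesced, copy_strided, count_copy_errors) are hypothetical, and the stride is assumed to share no common factor with n so that every element is still copied exactly once. How large the speed difference is depends on the hardware, so no timing is claimed here.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Thread i touches element i: neighbouring threads touch neighbouring addresses.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Thread i touches element (i * stride) mod n: neighbouring threads are far apart.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (int)(((long long)i * stride) % n);
        out[j] = in[j];
    }
}

// Runs both kernels and counts wrong elements (0 expected); -1 on a CUDA error.
int count_copy_errors(int n, int stride)
{
    std::vector<float> h(n), r(n);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&out, n * sizeof(float)) != cudaSuccess) { cudaFree(in); return -1; }
    cudaMemcpy(in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + 255) / 256;
    int errors = 0;
    for (int pass = 0; pass < 2 && errors >= 0; ++pass) {
        cudaMemset(out, 0, n * sizeof(float));
        if (pass == 0) copy_coalesced<<<blocks, 256>>>(in, out, n);
        else           copy_strided<<<blocks, 256>>>(in, out, n, stride);
        if (cudaMemcpy(r.data(), out, n * sizeof(float),
                       cudaMemcpyDeviceToHost) != cudaSuccess) { errors = -1; break; }
        for (int i = 0; i < n; ++i)
            if (r[i] != h[i]) ++errors;
    }

    cudaFree(in);
    cudaFree(out);
    return errors;
}
```

Both kernels copy the same data; timing each with events is one way to observe the effect of the access order on a given GPU.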
CUDA can be coupled closer to OpenGL