Get rid of busy waiting during asynchronous cuda stream executions -
i looking way how rid of busy waiting in host thread in fallowing code (do not copy code, shows idea of problem, has many basic bugs):
cudastream_t steams[s_n]; (int = 0; < s_n; i++) { cudastreamcreate(streams[i]); } int sid = 0; (int d = 0; d < data_size; d+=data_step) { while (true) { if (cudastreamquery(streams[sid])) == cudasuccess) { //busy waiting !!!! cudamemcpyassync(d_data, h_data + d, data_step, cudamemcpyhosttodevice, streams[sid]); kernel<<<griddim, blockdim, smsize streams[sid]>>>(d_data, data_step); break; } sid = ++sid % s_n; }
}
is there way idle host thread , wait somehow stream finish, , prepare , run stream?
edit: added while(true) code, emphasize busy waiting. execute streams, , check of them finished run new one. cudastreamsynchronize
waits particular stream finish, want wait of streams first finished job.
edit2: got rid of busy-waiting in fallowing way:
cudastream_t steams[s_n]; (int = 0; < s_n; i++) { cudastreamcreate(streams[i]); } int sid = 0; (int d = 0; d < data_size; d+=data_step) { cudamemcpyassync(d_data, h_data + d, data_step, cudamemcpyhosttodevice, streams[sid]); kernel<<<griddim, blockdim, smsize streams[sid]>>>(d_data, data_step); sid = ++sid % s_n; } (int = 0; < s_n; i++) { cudastreamsynchronize(streams[i]); cudastreamdestroy(streams[i]); }
but appears little bit slower version busy-waiting on host thread. think because, statically distribute jobs on streams, when 1 stream finishes work idle till each of stream finishes work. previous version dynamically distributed work first idle stream, more efficient, there busy-waiting on host thread.
the real answer use cudathreadsynchronize wait all previous launches complete, cudastreamsynchronize wait launches in stream complete, , cudaeventsynchronize wait event on stream recorded.
however, need understand how streams , sychronization work before able use them in code.
what happens if not use streams @ all? consider following code:
kernel <<< griddim, blockdim >>> (d_data, data_step); host_func1(); cudathreadsynchronize(); host_func2();
the kernel launched , host moves on execute host_func1 , kernel concurrently. then, host , device synchronized, ie host waits kernel finish before moving on host_func2().
now, if have 2 different kernels?
kernel1 <<<griddim, blockdim >>> (d_data + d1, data_step); kernel2 <<<griddim, blockdim >>> (d_data + d2, data_step);
kernel1 launched asychronously! host moves on, , kernel2 launched before kernel1 finishes! however, kernel2 not execute until after kernel1 finishes, because have both been launched on stream 0 (the default stream). consider following alternative:
kernel1 <<<griddim, blockdim>>> (d_data + d1, data_step); cudathreadsynchronize(); kernel2 <<<griddim, blockdim>>> (d_data + d2, data_step);
there absolutely no need because device synchronizes kernels launched on same stream.
so, think functionality looking exists... because kernel always waits previous launches in same stream finish before starting (even though host passes by). is, if want wait any previous launch finish, don't use streams. code work fine:
for (int d = 0; d < data_size; d+=data_step) { cudamemcpyasync(d_data, h_data + d, data_step, cudamemcpyhosttodevice, 0); kernel<<<griddim, blockdim, smsize, 0>>>(d_data, data_step); }
now, on streams. can use streams manage concurrent device execution.
think of stream queue. can put different memcpy calls , kernel launches different queues. then, kernels in stream 1 , launches in stream 2 asynchronous! may executed @ same time, or in order. if want sure 1 memcpy/kernel being executed on device @ time, don't use streams. similarly, if want kernels executed in specific order, don't use streams.
that said, keep in mind put stream 1, executed in order, don't bother synchronizing. synchronization synchronizing host , device calls, not 2 different device calls. so, if want execute several of kernels @ same time because use different device memory , have no effect on each other, use streams. like...
cudastream_t steams[s_n]; (int = 0; < s_n; i++) { cudastreamcreate(streams[i]); } int sid = 0; (int d = 0; d < data_size; d+=data_step) { cudamemcpyasync(d_data, h_data + d, data_step, cudamemcpyhosttodevice, streams[sid]); kernel<<<griddim, blockdim, smsize streams[sid]>>>(d_data, data_step); sid = ++sid % s_n; }
no explicit device synchronization necessary.
Comments
Post a Comment