Great article. I have a question. Although the dataloader becomes asynchronous, that asynchrony is only at the CPU level; usually the data still needs to be copied to the GPU. If you look at the plot_timings function, it is essentially like this:
for batch in itertools.islice(dataloader, 10):
    PROCESS the batch
In plot_timings, the dataloader prepares N batches in parallel, so whenever the main process asks for data, a batch is most likely already ready to yield. The main PROCESS then sleeps for a fixed time to simulate the model's time cost, which means the main PROCESS is still sequential.
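In other words, a minimal runnable version of that loop as I understand it (the sleep is my stand-in for the model's cost; 10 and 0.1 are arbitrary numbers):

from itertools import islice
import time

def simulate(dataloader, n_batches=10, model_time=0.1):
    # the workers fill batches in the background, but this loop
    # still consumes them strictly one after another
    for batch in islice(dataloader, n_batches):
        time.sleep(model_time)  # stand-in for the model's time cost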
In a real application, that main PROCESS usually involves something like:
batch = batch.to("cuda")
prediction = model(batch)
My question is this: thanks to the dataloader's parallelism, the batch on the right-hand side is most likely ready, but the remaining steps, the copy to the GPU and model(batch), still look sequential: the GPU copies the data, then the model runs on that data, and so on. It would be truly efficient if the batch-to-GPU upload were also asynchronous, i.e. if uploading the next batch overlapped with the model's computation on the current one. Is there a way to do that?
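For concreteness, this is the kind of overlap I have in mind; a sketch assuming PyTorch's pin_memory/non_blocking machinery plus a side CUDA stream (gpu_prefetch, dataset, and model are my own names, not from the article, and I assume each batch is a single tensor):

import torch
from torch.utils.data import DataLoader

def gpu_prefetch(loader, device="cuda"):
    # copy batch k+1 on a side stream while the caller's model
    # runs on batch k on the default stream
    stream = torch.cuda.Stream()
    it = iter(loader)
    nxt = next(it, None)
    if nxt is not None:
        with torch.cuda.stream(stream):
            nxt = nxt.to(device, non_blocking=True)
    while nxt is not None:
        torch.cuda.current_stream().wait_stream(stream)  # copy of batch k is done
        batch = nxt
        batch.record_stream(torch.cuda.current_stream())  # keep the allocator from reusing the buffer too early
        nxt = next(it, None)
        if nxt is not None:
            with torch.cuda.stream(stream):  # start uploading batch k+1
                nxt = nxt.to(device, non_blocking=True)
        yield batch

# pin_memory=True is what makes the non_blocking copies truly asynchronous
loader = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)
for batch in gpu_prefetch(loader):
    prediction = model(batch)  # overlaps with the next batch's upload

Is something like this the right approach, or is there a simpler built-in way?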