.. _sec_auto_para:
Automatic Parallelism
=====================
Deep learning frameworks (e.g., MXNet and PyTorch) automatically
construct computational graphs at the backend. Using a computational
graph, the system is aware of all the dependencies, and can selectively
execute multiple non-interdependent tasks in parallel to improve speed.
For instance, :numref:`fig_asyncgraph` in :numref:`sec_async`
initializes two variables independently. Consequently the system can
choose to execute them in parallel.
Typically, a single operator will use all the computational resources on
all CPUs or on a single GPU. For example, the ``dot`` operator will use
all cores (and threads) on all CPUs, even if there are multiple CPU
processors on a single machine. The same applies to a single GPU. Hence
parallelization is not quite so useful for single-device computers. With
multiple devices things matter more. While parallelization is typically
most relevant between multiple GPUs, adding the local CPU will increase
performance slightly. For example, see
:cite:t:`Hadjis.Zhang.Mitliagkas.ea.2016`, which focuses on training
computer vision models combining a GPU and a CPU. With the convenience
of an automatically parallelizing framework we can accomplish the same
goal in a few lines of Python code. More broadly, our discussion of
automatic parallel computation focuses on parallel computation using
both CPUs and GPUs, as well as the parallelization of computation and
communication.
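As a quick, illustrative check (not part of the experiments below),
PyTorch lets us query, and if desired restrict, the number of threads it
uses for intra-op parallelism, i.e., within a single operator:

.. code:: python

    import torch

    # Number of threads used *within* a single operator (intra-op parallelism).
    print(torch.get_num_threads())
    # For experiments one could restrict it, e.g., torch.set_num_threads(4)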
Note that we need at least two GPUs to run the experiments in this
section.
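Before running the benchmarks it is worth confirming that two devices
are actually visible. A minimal sanity check (an illustrative sketch,
not part of the original code) might look as follows:

.. code:: python

    import torch

    # The experiments in this section assume at least two CUDA devices.
    assert torch.cuda.device_count() >= 2, 'at least two GPUs are required'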
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import torch
from d2l import torch as d2l
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
from mxnet import np, npx
from d2l import mxnet as d2l
npx.set_np()
Parallel Computation on GPUs
----------------------------
Let’s start by defining a reference workload to test: the ``run``
function below performs 50 matrix-matrix multiplications on the device
of our choice using data allocated into two variables: ``x_gpu1`` and
``x_gpu2``.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
devices = d2l.try_all_gpus()
def run(x):
return [x.mm(x) for _ in range(50)]
x_gpu1 = torch.rand(size=(4000, 4000), device=devices[0])
x_gpu2 = torch.rand(size=(4000, 4000), device=devices[1])
Now we apply the function to the data. To ensure that caching does not
play a role in the results we warm up the devices by performing a single
pass on each of them prior to measuring. ``torch.cuda.synchronize()``
waits for all kernels in all streams on a CUDA device to complete. It
takes in a ``device`` argument, the device for which we need to
synchronize. It uses the current device, given by ``current_device()``,
if the device argument is ``None`` (default).
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
run(x_gpu1)
run(x_gpu2) # Warm-up all devices
torch.cuda.synchronize(devices[0])
torch.cuda.synchronize(devices[1])
with d2l.Benchmark('GPU1 time'):
run(x_gpu1)
torch.cuda.synchronize(devices[0])
with d2l.Benchmark('GPU2 time'):
run(x_gpu2)
torch.cuda.synchronize(devices[1])
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
GPU1 time: 0.4660 sec
GPU2 time: 0.4510 sec
If we remove the ``synchronize`` statement between both tasks the system
is free to parallelize computation on both devices automatically.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
with d2l.Benchmark('GPU1 & GPU2'):
run(x_gpu1)
run(x_gpu2)
torch.cuda.synchronize()
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
GPU1 & GPU2: 0.4659 sec
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
devices = d2l.try_all_gpus()
def run(x):
return [x.dot(x) for _ in range(50)]
x_gpu1 = np.random.uniform(size=(4000, 4000), ctx=devices[0])
x_gpu2 = np.random.uniform(size=(4000, 4000), ctx=devices[1])
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
[22:23:34] ../src/storage/storage.cc:196: Using Pooled (Naive) StorageManager for GPU
[22:23:34] ../src/storage/storage.cc:196: Using Pooled (Naive) StorageManager for GPU
Now we apply the function to the data. To ensure that caching does not
play a role in the results we warm up the devices by performing a single
pass on each of them prior to measuring.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
run(x_gpu1) # Warm-up both devices
run(x_gpu2)
npx.waitall()
with d2l.Benchmark('GPU1 time'):
run(x_gpu1)
npx.waitall()
with d2l.Benchmark('GPU2 time'):
run(x_gpu2)
npx.waitall()
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
GPU1 time: 0.4465 sec
GPU2 time: 0.4519 sec
If we remove the ``waitall`` statement between both tasks the system is
free to parallelize computation on both devices automatically.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
with d2l.Benchmark('GPU1 & GPU2'):
run(x_gpu1)
run(x_gpu2)
npx.waitall()
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
GPU1 & GPU2: 0.4535 sec
In the above case the total execution time is less than the sum of its
parts, since the deep learning framework automatically schedules
computation on both GPU devices without the need for sophisticated code
on behalf of the user.
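For a finer-grained view of where the time goes, CUDA events can time
the work queued on each device's stream directly. The snippet below is
an illustrative sketch using standard PyTorch CUDA event APIs, not part
of the benchmarks above.

.. code:: python

    # Time the kernels on GPU1 with CUDA events (measured on the device itself).
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.cuda.device(devices[0]):
        start.record()
        run(x_gpu1)
        end.record()
    torch.cuda.synchronize(devices[0])
    print(f'GPU1 kernel time: {start.elapsed_time(end) / 1000:.4f} sec')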
Parallel Computation and Communication
--------------------------------------
In many cases we need to move data between different devices, say
between the CPU and GPU, or between different GPUs. For instance, this
occurs when we want to perform distributed optimization where we need to
aggregate the gradients over multiple accelerator cards. Let’s simulate
this by computing on the GPU and then copying the results back to the
CPU.
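As a toy illustration of such aggregation (a sketch with hypothetical
tensors, not the book's training code), summing two gradient-like
tensors that live on different GPUs amounts to a cross-device copy
(communication) followed by an addition (computation):

.. code:: python

    # Illustrative sketch: "aggregate" two tensors living on different GPUs by
    # moving one of them across the bus and summing on a single device.
    g0 = torch.rand(size=(4000, 4000), device=devices[0])
    g1 = torch.rand(size=(4000, 4000), device=devices[1])
    g = g0 + g1.to(devices[0])  # copy from GPU2 to GPU1, then add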
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def copy_to_cpu(x, non_blocking=False):
return [y.to('cpu', non_blocking=non_blocking) for y in x]
with d2l.Benchmark('Run on GPU1'):
y = run(x_gpu1)
torch.cuda.synchronize()
with d2l.Benchmark('Copy to CPU'):
y_cpu = copy_to_cpu(y)
torch.cuda.synchronize()
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
Run on GPU1: 0.4656 sec
Copy to CPU: 2.3125 sec
This is somewhat inefficient. Note that we could already start copying
parts of ``y`` to the CPU while the remainder of the list is still being
computed. This situation occurs, e.g., when we compute the (backprop)
gradient on a minibatch. The gradients of some of the parameters will be
available earlier than those of others. Hence it works to our advantage
to start using PCI-Express bus bandwidth while the GPU is still running.
In PyTorch, several functions such as ``to()`` and ``copy_()`` admit an
explicit ``non_blocking`` argument, which lets the caller bypass
synchronization when it is unnecessary. Setting ``non_blocking=True``
allows us to simulate this scenario.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
with d2l.Benchmark('Run on GPU1 and copy to CPU'):
y = run(x_gpu1)
y_cpu = copy_to_cpu(y, True)
torch.cuda.synchronize()
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
Run on GPU1 and copy to CPU: 1.6907 sec
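One caveat worth keeping in mind (a CUDA-level consideration, not
something measured above): device-to-host transfers generally overlap
best with computation when the destination is pinned (page-locked) host
memory. The sketch below, using ``copy_()`` with pre-allocated pinned
buffers, illustrates one way to arrange this; it is an illustrative
example rather than the book's implementation.

.. code:: python

    y = run(x_gpu1)
    # Pre-allocate pinned (page-locked) CPU buffers for the results.
    y_pinned = [torch.empty(z.shape, dtype=z.dtype, pin_memory=True) for z in y]
    for src, dst in zip(y, y_pinned):
        dst.copy_(src, non_blocking=True)  # queued transfer, returns immediately
    torch.cuda.synchronize()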
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def copy_to_cpu(x):
return [y.copyto(npx.cpu()) for y in x]
with d2l.Benchmark('Run on GPU1'):
y = run(x_gpu1)
npx.waitall()
with d2l.Benchmark('Copy to CPU'):
y_cpu = copy_to_cpu(y)
npx.waitall()
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
Run on GPU1: 0.4788 sec
[22:23:37] ../src/storage/storage.cc:196: Using Pooled (Naive) StorageManager for CPU
Copy to CPU: 2.4304 sec
This is somewhat inefficient. Note that we could already start copying
parts of ``y`` to the CPU while the remainder of the list is still being
computed. This situation occurs, e.g., when we compute the gradient on a
minibatch. The gradients of some of the parameters will be available
earlier than those of others. Hence it works to our advantage to start
using PCI-Express bus bandwidth while the GPU is still running. Removing
``waitall`` between both parts allows us to simulate this scenario.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
with d2l.Benchmark('Run on GPU1 and copy to CPU'):
y = run(x_gpu1)
y_cpu = copy_to_cpu(y)
npx.waitall()
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
Run on GPU1 and copy to CPU: 0.4530 sec
The total time required for both operations is (as expected) less than
the sum of their parts. Note that this task is different from parallel
computation as it uses a different resource: the bus between the CPU and
GPUs. In fact, we could compute on both devices and communicate, all at
the same time. As noted above, there is a dependency between computation
and communication: ``y[i]`` must be computed before it can be copied to
the CPU. Fortunately, the system can copy ``y[i-1]`` while computing
``y[i]`` to reduce the total running time.
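To make this overlap explicit, one could interleave computation and
communication by hand, issuing the copy for each result as soon as its
kernel has been queued. The following is a sketch of the idea (assuming
the same ``x_gpu1`` as above), not a replacement for the framework's
automatic scheduling:

.. code:: python

    with d2l.Benchmark('Interleaved compute and copy'):
        y_cpu = []
        for _ in range(50):
            z = x_gpu1.mm(x_gpu1)                         # asynchronous kernel launch
            y_cpu.append(z.to('cpu', non_blocking=True))  # queue the transfer right away
        torch.cuda.synchronize()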
We conclude with an illustration of the computational graph and its
dependencies for a simple two-layer MLP when training on a CPU and two
GPUs, as depicted in :numref:`fig_twogpu`. It would be quite painful
to schedule the parallel program resulting from this manually. This is
where it is advantageous to have a graph-based computing backend for
optimization.
.. _fig_twogpu:
.. figure:: ../img/twogpu.svg
The computational graph and its dependencies of a two-layer MLP on a
CPU and two GPUs.
Summary
-------
- Modern systems have a variety of devices, such as multiple GPUs and
CPUs. They can be used in parallel, asynchronously.
- Modern systems also have a variety of resources for communication,
such as PCI Express, storage (typically solid-state drives or via
networks), and network bandwidth. They can be used in parallel for
peak efficiency.
- The backend can improve performance through automatic parallel
computation and communication.
Exercises
---------
1. The ``run`` function defined in this section performs 50
   matrix-matrix multiplications with no dependencies between them.
   Design an experiment to see if the deep learning framework will
   automatically execute them in parallel.
2. When the workload of an individual operator is sufficiently small,
parallelization can help even on a single CPU or GPU. Design an
experiment to verify this.
3. Design an experiment that uses parallel computation on CPUs, GPUs,
and communication between both devices.
4. Use a debugger such as NVIDIA’s Nsight to verify that your code is
   efficient.
5. Design computation tasks that include more complex data
   dependencies, and run experiments to see whether you can obtain the
   correct results while improving performance.