19.3. Asynchronous Random Search

As we have seen in Section 19.2, we might have to wait hours or even days before random search returns a good hyperparameter configuration, because of the expensive evaluation of hyperparameter configurations. In practice, we often have access to a pool of resources, such as multiple GPUs on the same machine or multiple machines each with a single GPU. This raises the question: how do we efficiently distribute random search?

In general, we distinguish between synchronous and asynchronous parallel hyperparameter optimization (see Fig. 19.3.1). In the synchronous setting, we wait for all concurrently running trials to finish before we start the next batch. Consider configuration spaces that contain hyperparameters such as the number of filters or the number of layers of a deep neural network. Hyperparameter configurations that contain a larger number of layers or filters will naturally take more time to finish, and all other trials in the same batch will have to wait at synchronization points (grey area in Fig. 19.3.1) before we can continue the optimization process.

In the asynchronous setting, we immediately schedule a new trial as soon as resources become available. This exploits our resources optimally, since we avoid any synchronization overhead. For random search, each new hyperparameter configuration is chosen independently of all others, and in particular without exploiting observations from any prior evaluation. This means we can trivially parallelize random search asynchronously. This is not straightforward with more sophisticated methods that make decisions based on previous observations (see Section 19.5). While we need access to more resources than in the sequential setting, asynchronous random search exhibits a linear speed-up, in that a certain performance is reached \(K\) times faster if \(K\) trials can be run in parallel.
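
To build some intuition for this speed-up, here is a minimal toy simulation (independent of Syne Tune, with made-up trial runtimes) that compares the wall-clock time of batch-synchronous scheduling with greedy asynchronous scheduling for the same set of trials.

import numpy as np

# Toy runtimes (seconds) for 20 hypothetical trials; the numbers are made up.
rng = np.random.default_rng(0)
runtimes = rng.uniform(60, 300, size=20)
num_workers = 4

# Synchronous scheduling: trials run in batches of `num_workers`, and each
# batch lasts as long as its slowest trial (the straggler).
batches = np.array_split(runtimes, len(runtimes) // num_workers)
sync_time = sum(batch.max() for batch in batches)

# Asynchronous scheduling: the next trial starts on whichever worker becomes
# free first, so nobody waits at batch boundaries.
finish_times = np.zeros(num_workers)
for runtime in runtimes:
    worker = finish_times.argmin()
    finish_times[worker] += runtime
async_time = finish_times.max()

print(f"synchronous: {sync_time:.0f}s, asynchronous: {async_time:.0f}s")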


Fig. 19.3.1 Distributing the hyperparameter optimization process either synchronously or asynchronously. Compared to the sequential setting, we can reduce the overall wall-clock time while keeping the total compute constant. Synchronous scheduling might lead to idling workers in the case of stragglers.

In this notebook, we will look at asynchronous random search, where trials are executed in multiple Python processes on the same machine. Distributed job scheduling and execution is difficult to implement from scratch. We will use Syne Tune (Salinas et al., 2022), which provides us with a simple interface for asynchronous HPO. Syne Tune is designed to be run with different execution back-ends, and the interested reader is invited to study its simple APIs in order to learn more about distributed HPO.

import logging
from d2l import torch as d2l

logging.basicConfig(level=logging.INFO)
from syne_tune import StoppingCriterion, Tuner
from syne_tune.backend.python_backend import PythonBackend
from syne_tune.config_space import loguniform, randint
from syne_tune.experiments import load_experiment
from syne_tune.optimizer.baselines import RandomSearch
INFO:root:SageMakerBackend is not imported since dependencies are missing. You can install them with
   pip install 'syne-tune[extra]'
AWS dependencies are not imported since dependencies are missing. You can install them with
   pip install 'syne-tune[aws]'
or (for everything)
   pip install 'syne-tune[extra]'
INFO:root:Ray Tune schedulers and searchers are not imported since dependencies are missing. You can install them with
   pip install 'syne-tune[raytune]'
or (for everything)
   pip install 'syne-tune[extra]'

19.3.1. Objective Function

First, we have to define a new objective function such that it reports performance back to Syne Tune via the report callback.

def hpo_objective_lenet_synetune(learning_rate, batch_size, max_epochs):
    # Dependencies must be imported inside the function (see note below)
    from syne_tune import Reporter
    from d2l import torch as d2l

    model = d2l.LeNet(lr=learning_rate, num_classes=10)
    trainer = d2l.HPOTrainer(max_epochs=1, num_gpus=1)
    data = d2l.FashionMNIST(batch_size=batch_size)
    model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
    report = Reporter()
    for epoch in range(1, max_epochs + 1):
        if epoch == 1:
            # Initialize the state of Trainer
            trainer.fit(model=model, data=data)
        else:
            trainer.fit_epoch()
        # Report the validation error back to Syne Tune after every epoch
        validation_error = trainer.validation_error().cpu().detach().numpy()
        report(epoch=epoch, validation_error=float(validation_error))

Note that the PythonBackend of Syne Tune requires dependencies to be imported inside the function definition.
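
Before handing this function to Syne Tune, it can be useful to smoke-test it locally. This is an optional sanity check rather than part of the tuning workflow; it assumes a GPU is available (the function requests num_gpus=1), and the Reporter simply writes the reported metrics to standard output when called outside of a tuning job.

# Optional sanity check: evaluate one configuration directly in this process.
# This trains LeNet for a single epoch and prints one reported metric line.
hpo_objective_lenet_synetune(learning_rate=0.1, batch_size=128, max_epochs=1)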

19.3.2. Asynchronous Scheduler

First, we define the number of workers that evaluate trials concurrently. We also need to specify how long we want to run random search, by defining an upper limit on the total wall-clock time.

n_workers = 2  # Needs to be <= the number of available GPUs

max_wallclock_time = 12 * 60  # 12 minutes

Next, we state which metric we want to optimize and whether we want to minimize or maximize this metric. Note that metric needs to correspond to the argument name passed to the report callback.

mode = "min"
metric = "validation_error"

We use the configuration space from our previous example. In Syne Tune, this dictionary can also be used to pass constant attributes to the training script. We make use of this feature in order to pass max_epochs. Moreover, we specify the first configuration to be evaluated in initial_config.

config_space = {
    "learning_rate": loguniform(1e-2, 1),
    "batch_size": randint(32, 256),
    "max_epochs": 10,
}
initial_config = {
    "learning_rate": 0.1,
    "batch_size": 128,
}
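
To get a feel for what random search will propose from this space, we can draw a few configurations ourselves. The helper below is purely illustrative and assumes that Syne Tune domains such as loguniform and randint expose a sample() method, while plain values such as max_epochs are passed through as constants.

def draw_random_config(config_space):
    config = {}
    for name, domain in config_space.items():
        # Hyperparameter domains are sampled; constant attributes are passed through
        config[name] = domain.sample() if hasattr(domain, "sample") else domain
    return config

for _ in range(3):
    print(draw_random_config(config_space))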

Next, we need to specify the back-end for job execution. Here we consider distributing trials on a local machine, where parallel jobs are executed as sub-processes. However, for large-scale HPO, we could also run this on a cluster or in a cloud environment, where each trial consumes a full instance.

trial_backend = PythonBackend(
    tune_function=hpo_objective_lenet_synetune,
    config_space=config_space,
)
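
If the training code lives in a standalone script rather than a Python function, Syne Tune also provides a script-based local back-end. The sketch below is illustrative only: it assumes a hypothetical script train_lenet.py that reads learning_rate, batch_size, and max_epochs from its command line and reports validation_error via the same report callback.

from syne_tune.backend import LocalBackend

# Hypothetical alternative: each trial runs `python train_lenet.py --learning_rate ...`,
# where train_lenet.py parses its arguments and calls syne_tune.Reporter itself.
script_backend = LocalBackend(entry_point="train_lenet.py")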

We can now create the scheduler for asynchronous random search, which is similar in behavior to our BasicScheduler from Section 19.2.

scheduler = RandomSearch(
    config_space,
    metric=metric,
    mode=mode,
    points_to_evaluate=[initial_config],
)
INFO:syne_tune.optimizer.schedulers.fifo:max_resource_level = 10, as inferred from config_space
INFO:syne_tune.optimizer.schedulers.fifo:Master random_seed = 4033665588

Syne Tune also features a Tuner, where the main experiment loop and bookkeeping are centralized, and where interactions between the scheduler and the back-end are mediated.

stop_criterion = StoppingCriterion(max_wallclock_time=max_wallclock_time)

tuner = Tuner(
    trial_backend=trial_backend,
    scheduler=scheduler,
    stop_criterion=stop_criterion,
    n_workers=n_workers,
    print_update_interval=int(max_wallclock_time * 0.6),
)

Let us run our distributed HPO experiment. According to our stopping criterion, it will run for about 12 minutes.

tuner.run()
INFO:syne_tune.tuner:results of trials will be saved on /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691
INFO:root:Detected 8 GPUs
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-0/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-0/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1 --batch_size 128 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/tune_function --tune_function_hash 53504c42ecb95363b73ac1f849a8a245 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/0/checkpoints
INFO:syne_tune.tuner:(trial 0) - scheduled config {'learning_rate': 0.1, 'batch_size': 128, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-0/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-0/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.31642002803324326 --batch_size 52 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/tune_function --tune_function_hash 53504c42ecb95363b73ac1f849a8a245 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/1/checkpoints
INFO:syne_tune.tuner:(trial 1) - scheduled config {'learning_rate': 0.31642002803324326, 'batch_size': 52, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 0 completed.
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-0/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-0/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.045813161553582046 --batch_size 71 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/tune_function --tune_function_hash 53504c42ecb95363b73ac1f849a8a245 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/2/checkpoints
INFO:syne_tune.tuner:(trial 2) - scheduled config {'learning_rate': 0.045813161553582046, 'batch_size': 71, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 1 completed.
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-0/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-0/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.11375402103945391 --batch_size 244 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/tune_function --tune_function_hash 53504c42ecb95363b73ac1f849a8a245 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/3/checkpoints
INFO:syne_tune.tuner:(trial 3) - scheduled config {'learning_rate': 0.11375402103945391, 'batch_size': 244, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 2 completed.
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-0/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-0/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.5211657199736571 --batch_size 47 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/tune_function --tune_function_hash 53504c42ecb95363b73ac1f849a8a245 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/4/checkpoints
INFO:syne_tune.tuner:(trial 4) - scheduled config {'learning_rate': 0.5211657199736571, 'batch_size': 47, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 3 completed.
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-0/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-0/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.05259930532982774 --batch_size 181 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/tune_function --tune_function_hash 53504c42ecb95363b73ac1f849a8a245 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/5/checkpoints
INFO:syne_tune.tuner:(trial 5) - scheduled config {'learning_rate': 0.05259930532982774, 'batch_size': 181, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 5 completed.
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-0/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-0/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.09086002421630578 --batch_size 48 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/tune_function --tune_function_hash 53504c42ecb95363b73ac1f849a8a245 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/6/checkpoints
INFO:syne_tune.tuner:(trial 6) - scheduled config {'learning_rate': 0.09086002421630578, 'batch_size': 48, 'max_epochs': 10}
INFO:syne_tune.tuner:tuning status (last metric is reported)
 trial_id     status  iter  learning_rate  batch_size  max_epochs  epoch  validation_error  worker-time
        0  Completed    10       0.100000         128          10     10          0.258109   108.366785
        1  Completed    10       0.316420          52          10     10          0.146223   179.660365
        2  Completed    10       0.045813          71          10     10          0.311251   143.567631
        3  Completed    10       0.113754         244          10     10          0.336094    90.168444
        4 InProgress     8       0.521166          47          10      8          0.150257   156.696658
        5  Completed    10       0.052599         181          10     10          0.399893    91.044401
        6 InProgress     2       0.090860          48          10      2          0.453050    36.693606
2 trials running, 5 finished (5 until the end), 436.55s wallclock-time

INFO:syne_tune.tuner:Trial trial_id 4 completed.
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-0/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-0/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.03542833641356924 --batch_size 94 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/tune_function --tune_function_hash 53504c42ecb95363b73ac1f849a8a245 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/7/checkpoints
INFO:syne_tune.tuner:(trial 7) - scheduled config {'learning_rate': 0.03542833641356924, 'batch_size': 94, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 6 completed.
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-0/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-0/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.5941192130206245 --batch_size 149 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/tune_function --tune_function_hash 53504c42ecb95363b73ac1f849a8a245 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/8/checkpoints
INFO:syne_tune.tuner:(trial 8) - scheduled config {'learning_rate': 0.5941192130206245, 'batch_size': 149, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 7 completed.
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-0/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-0/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.013696247675312455 --batch_size 135 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/tune_function --tune_function_hash 53504c42ecb95363b73ac1f849a8a245 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/9/checkpoints
INFO:syne_tune.tuner:(trial 9) - scheduled config {'learning_rate': 0.013696247675312455, 'batch_size': 135, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 8 completed.
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-0/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-0/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.11837221527625114 --batch_size 75 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/tune_function --tune_function_hash 53504c42ecb95363b73ac1f849a8a245 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/10/checkpoints
INFO:syne_tune.tuner:(trial 10) - scheduled config {'learning_rate': 0.11837221527625114, 'batch_size': 75, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 9 completed.
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-0/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-0/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.18877290342981604 --batch_size 187 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/tune_function --tune_function_hash 53504c42ecb95363b73ac1f849a8a245 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691/11/checkpoints
INFO:syne_tune.tuner:(trial 11) - scheduled config {'learning_rate': 0.18877290342981604, 'batch_size': 187, 'max_epochs': 10}
INFO:syne_tune.stopping_criterion:reaching max wallclock time (720), stopping there.
INFO:syne_tune.tuner:Stopping trials that may still be running.
INFO:syne_tune.tuner:Tuning finished, results of trials can be found on /home/d2l-worker/syne-tune/python-entrypoint-2023-02-10-04-56-21-691
--------------------
Resource summary (last result is reported):
 trial_id     status  iter  learning_rate  batch_size  max_epochs  epoch  validation_error  worker-time
        0  Completed    10       0.100000         128          10   10.0          0.258109   108.366785
        1  Completed    10       0.316420          52          10   10.0          0.146223   179.660365
        2  Completed    10       0.045813          71          10   10.0          0.311251   143.567631
        3  Completed    10       0.113754         244          10   10.0          0.336094    90.168444
        4  Completed    10       0.521166          47          10   10.0          0.146092   190.111242
        5  Completed    10       0.052599         181          10   10.0          0.399893    91.044401
        6  Completed    10       0.090860          48          10   10.0          0.197369   172.148435
        7  Completed    10       0.035428          94          10   10.0          0.414369   112.588123
        8  Completed    10       0.594119         149          10   10.0          0.177609    99.182505
        9  Completed    10       0.013696         135          10   10.0          0.901235   107.753385
       10 InProgress     2       0.118372          75          10    2.0          0.465970    32.484881
       11 InProgress     0       0.188773         187          10      -                 -            -
2 trials running, 10 finished (10 until the end), 722.92s wallclock-time

validation_error: best 0.1377706527709961 for trial-id 4
--------------------

The logs of all evaluated hyperparameter configurations are stored for further analysis. At any time during the tuning job, we can easily get the results obtained so far and plot the incumbent trajectory.

d2l.set_figsize()
tuning_experiment = load_experiment(tuner.name)
tuning_experiment.plot()
WARNING:matplotlib.legend:No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
(Plot: incumbent trajectory of the best validation error against wall-clock time.)
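
Beyond plotting, we can query the collected results directly. The sketch below works on the results DataFrame and assumes the columns shown in the summary table above (trial_id, learning_rate, batch_size, validation_error); it looks up the configuration with the lowest validation error observed so far.

results = tuning_experiment.results

# Row with the lowest validation error observed so far
best_row = results.loc[results["validation_error"].idxmin()]
print(
    f"best validation_error {best_row['validation_error']:.4f} "
    f"(trial_id {int(best_row['trial_id'])}, "
    f"learning_rate {best_row['learning_rate']:.4f}, "
    f"batch_size {int(best_row['batch_size'])})"
)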

19.3.3. Visualize the Asynchronous Optimization Process

Below we visualize how the learning curves of every trial (each color in the plot represents a trial) evolve during the asynchronous optimization process. At any point in time, there are as many trials running concurrently as we have workers. Once a trial finishes, we immediately start the next trial, without waiting for the other trials to finish. Idle time of workers is reduced to a minimum with asynchronous scheduling.

d2l.set_figsize([6, 2.5])
results = tuning_experiment.results

# Plot each trial's learning curve against wall-clock time
for trial_id in results.trial_id.unique():
    df = results[results["trial_id"] == trial_id]
    d2l.plt.plot(
        df["st_tuner_time"],
        df["validation_error"],
        marker="o"
    )

d2l.plt.xlabel("wall-clock time")
d2l.plt.ylabel("objective function")
Text(0, 0.5, 'objective function')
(Plot: per-trial learning curves of validation error against wall-clock time; each color is one trial.)
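
The incumbent trajectory mentioned earlier can also be computed by hand from the same DataFrame: sort all observations by wall-clock time and take a running minimum of the validation error. The column names below are the ones already used in the plot above.

# Best validation error observed up to each point in wall-clock time
incumbent = (
    results.sort_values("st_tuner_time")
    .assign(best_so_far=lambda df: df["validation_error"].cummin())
)

d2l.plt.plot(incumbent["st_tuner_time"], incumbent["best_so_far"])
d2l.plt.xlabel("wall-clock time")
d2l.plt.ylabel("best validation error so far")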

19.3.4. Summary

We can substantially reduce the waiting time for random search by distributing trials across parallel resources. In general, we distinguish between synchronous scheduling and asynchronous scheduling. Synchronous scheduling means that we sample a new batch of hyperparameter configurations once the previous batch has finished. If we have stragglers, that is, trials that take more time to finish than other trials, our workers need to wait at synchronization points. Asynchronous scheduling evaluates a new hyperparameter configuration as soon as resources become available, and hence ensures that all workers are busy at any point in time. While random search is easy to distribute asynchronously and does not require any change to the actual algorithm, other methods require some additional modifications.

19.3.5. Exercises

  1. Consider the DropoutMLP model implemented in Section 5.6, and used in Exercise 1 of Section 19.2.

    1. Implement an objective function hpo_objective_dropoutmlp_synetune to be used with Syne Tune. Make sure that your function reports the validation error after every epoch.

    2. Using the setup of Exercise 1 in Section 19.2, compare random search to Bayesian optimization. If you use SageMaker, feel free to use Syne Tune’s benchmarking facilities in order to run experiments in parallel. Hint: Bayesian optimization is provided as syne_tune.optimizer.baselines.BayesianOptimization.

    3. For this exercise, you need to run on an instance with at least 4 CPU cores. For one of the methods used above (random search, Bayesian optimization), run experiments with n_workers=1, n_workers=2, n_workers=4, and compare results (incumbent trajectories). At least for random search, you should observe linear scaling with respect to the number of workers. Hint: For robust results, you may have to average over several repetitions each.

  2. Advanced. The goal of this exercise is to implement a new scheduler in Syne Tune.

    1. Create a virtual environment containing both the d2lbook and syne-tune sources.

    2. Implement the LocalSearcher from Exercise 2 in Section 19.2 as a new searcher in Syne Tune. Hint: Read this tutorial. Alternatively, you may follow this example.

    3. Compare your new LocalSearcher with RandomSearch on the DropoutMLP benchmark.
