DNN Troubleshooting
OpenAI Talk
Same talk @ IBM
Basic
  - Initial test set + a single metric to improve
  - Target performance
      - Human-level performance, published results, previous baselines, etc.
Intuition
  - Results can be sensitive to small changes in hyperparameters and dataset makeup.

                         Tune hyperparameters
                                  |
  Start simple -> Implement & Debug -> Evaluate -> ?
                                  |
                          Improve model & data
  - Start simple: simplest model & data possible (e.g., LeNet on a subset of the data)
  - Implement & debug: once the model runs, overfit a single batch & reproduce a known result (see the sketch after this list)
  - Evaluate: apply the bias-variance decomposition
      - Test error = irreducible error + bias + variance + val overfit
  - Tune: coarse-to-fine random search
  - Improve model/data
      - Make the model bigger if it underfits → reduces bias
      - Add data or regularize if it overfits → reduces variance
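A minimal PyTorch sketch of the overfit-a-single-batch check mentioned above; the model, data shapes, and step count are placeholders, not part of the talk. If the loss on one fixed batch won't drive toward zero, the training pipeline has a bug.

```python
import torch
import torch.nn as nn

# Placeholder model and a single fixed batch (shapes are illustrative).
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x = torch.randn(32, 1, 28, 28)
y = torch.randint(0, 10, (32,))

opt = torch.optim.Adam(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()  # expects raw logits, not softmaxed outputs

# Train on the SAME batch repeatedly; the loss should approach zero.
for step in range(500):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    if step % 100 == 0:
        print(f"step {step}: loss = {loss.item():.4f}")
```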
 
Details
Start simple
Architecture
  
    
| Arch      | Start here                                  | Consider this afterwards       |
|-----------|---------------------------------------------|--------------------------------|
| Images    | LeNet-like                                  | ResNet                         |
| Sequences | LSTM with one hidden layer / temporal convs | Attention model / WaveNet-like |
| Others    | MLP with one hidden layer                   | Problem-dependent              |
  
Defaults
  - Optimizer: Adam with learning rate 3e-4
  - Activations: ReLU (FC and conv models), tanh (LSTMs)
  - Regularization: none
  - Data normalization (e.g., batch norm): none
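A sketch of these defaults wired together; the layer sizes are arbitrary stand-ins.

```python
import torch.nn as nn
import torch.optim as optim

# Plain MLP with ReLU activations, no regularization, no normalization layers.
model = nn.Sequential(
    nn.Linear(784, 128),  # sizes are arbitrary placeholders
    nn.ReLU(),
    nn.Linear(128, 10),
)
optimizer = optim.Adam(model.parameters(), lr=3e-4)  # the default LR above
```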
Data
Normalize scale of input data
  - Subtract the mean and divide by the standard deviation
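For instance (a NumPy sketch with made-up data; compute the statistics on the training set and reuse them on val/test):

```python
import numpy as np

x_train = np.random.rand(10_000, 784).astype(np.float32)  # placeholder data

mean = x_train.mean(axis=0)
std = x_train.std(axis=0) + 1e-8  # epsilon avoids division by zero

x_train_norm = (x_train - mean) / std  # per-feature zero mean, unit variance
```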
Simplify the problem itself
  - Small training set (~10,000 examples)
  - Fixed number of classes, objects, image size
  - Simpler synthetic dataset
Implement
Most common DL bugs
  - Didn't try to overfit a single batch first
  - Forgot to .zero_grad() (in PyTorch) before .backward()
  - Incorrect shapes for your tensors
      - Can fail silently! E.g., accidental broadcasting: x.shape = (None,), y.shape = (None, 1), (x+y).shape = (None, None) (see the sketch after this list)
  - Pre-processing inputs incorrectly
      - E.g., forgetting to normalize, or too much pre-processing
  - Incorrect input to your loss function
      - E.g., softmaxed outputs passed to a loss that expects raw logits
  - Forgot to set up train mode for the net correctly
      - E.g., toggling train/eval, controlling batch norm dependencies
  - Numerical instability: inf/NaN
      - Often stems from an exp, log, or div operation
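The broadcasting pitfall above, made concrete in PyTorch (the tensor contents are throwaway):

```python
import torch

x = torch.zeros(8)     # shape (8,)
y = torch.zeros(8, 1)  # shape (8, 1)

# Broadcasting silently produces an (8, 8) result instead of raising an error.
print((x + y).shape)             # torch.Size([8, 8]) -- wrong, but no crash
print((x + y.squeeze(1)).shape)  # torch.Size([8])    -- shapes matched explicitly
```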
 
Get the model running
  - Problem: shape mismatch, casting issue (e.g., float64 vs float32)
      - Solution: step through model creation and inference in a debugger
  - Problem: out of memory
      - Solution: scale back memory-intensive operations one by one
  - Other problems
      - Solution: standard debugging toolkit (Stack Overflow + interactive debugger)
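One way to do that stepping with Python's built-in debugger (the model and input are placeholders); at the pdb prompt, `p x.shape` and `p x.dtype` inspect tensors, `n` steps to the next line, `s` steps into the forward call:

```python
import pdb

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
x = torch.randn(32, 784)

pdb.set_trace()  # drops into the debugger just before inference
out = model(x)
```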
 
Error analysis
  - Error goes up
      - Flipped the sign of the loss function / gradient
      - Learning rate too high
      - Softmax taken over the wrong dimension
  - Error explodes
      - Numerical issue: check all exp, log, and div operations
      - Learning rate too high
  - Error oscillates
      - Data or labels corrupted (e.g., zeroed, incorrectly shuffled, or preprocessed incorrectly)
      - Learning rate too high
  - Error plateaus
      - Learning rate too low
      - Gradients not flowing through the whole model (see the check after this list)
      - Too much regularization
      - Incorrect input to the loss function (e.g., softmax instead of logits)
      - Data or labels corrupted
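A quick way to test the gradients-not-flowing case: after one backward pass, report parameters whose gradients never arrive. `report_dead_grads` is a made-up helper name, not from the talk.

```python
import torch
import torch.nn as nn

def report_dead_grads(model: nn.Module) -> None:
    """Print parameters with missing or exactly-zero gradients after backward()."""
    for name, p in model.named_parameters():
        if p.grad is None:
            print(f"{name}: no gradient at all")
        elif p.grad.abs().max() == 0:
            print(f"{name}: gradient is exactly zero")

# Usage sketch: one forward/backward pass on a toy model, then inspect.
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1))
model(torch.randn(2, 4)).sum().backward()
report_dead_grads(model)
```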
 
Evaluation
Apply the bias-variance decomposition
  - Test error = irreducible error + bias + variance + val overfit
  
    
| Error source     | Value | Analysis                           |
|------------------|-------|------------------------------------|
| Goal performance | 1%    |                                    |
| Train error      | 20%   | Train - Goal = 19% → underfitting  |
| Validation error | 27%   | Val - Train = 7% → overfitting     |
| Test error       | 28%   | Test - Val = 1% → val overfitting  |
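The same arithmetic as a snippet, using the table's numbers:

```python
goal, train, val, test = 0.01, 0.20, 0.27, 0.28

underfitting = train - goal  # 0.19 -> bias dominates; make the model bigger
overfitting  = val - train   # 0.07 -> variance; add data or regularize
val_overfit  = test - val    # 0.01 -> negligible here
```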
  
Choose hyperparameters
  - Check the initial loss
  - Coarse grid sampling (random search is better than grid search)
  - Overfit a small sample and train for ~1-5 epochs
  - Find an LR that makes the loss go down, e.g.:

        acc = 0.412  lr: 1.405e-4  reg: 4.234e-4  (epoch 1 / 100) ✓
        acc = 0.212  lr: 2.025e-3  reg: 2.793e-5  (epoch 2 / 100)
        acc = 0.612  lr: 3.045e-4  reg: 3.79e-4   (epoch 3 / 100) ✓
        acc = 0.112  lr: 6.435e-1  reg: 5.79e-3   (epoch 4 / 100)
        acc = 0.425  lr: 3.235e-4  reg: 7.79e-1   (epoch 5 / 100) ✓
        =>
        lr  : ~1e-4
        reg : ~1e-4 to 1e-1

  - Refine the grid and train longer (see the sampling sketch after this list)
      - Look at loss and accuracy curves
      - Repeat recursively
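A sketch of that coarse-to-fine random search, sampling log-uniformly (the ranges are illustrative; the actual training call is elided):

```python
import numpy as np

rng = np.random.default_rng(0)

# Coarse stage: sample hyperparameters log-uniformly over wide ranges.
trials = []
for _ in range(10):
    lr = 10 ** rng.uniform(-6, -2)
    reg = 10 ** rng.uniform(-5, -1)
    trials.append((lr, reg))  # train ~1-5 epochs with each, record accuracy

# Fine stage: narrow the ranges around the winners (here, the ranges the
# log above suggests: lr near 1e-4, reg between 1e-4 and 1e-1) and train longer.
```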
 
Improve model/data