My course.fast.ai Notes

Fri 01 November 2019, by Seppe "Macuyiko" vanden Broucke

Like many others, I’m a huge fan of Jeremy Howard’s fast.ai courses. Even after working in the machine learning field for years, I still find the material packed with interesting tidbits, tips, and did-you-knows which you can’t easily find anywhere else.

Because of this, I’ve rewatched the courses several times. Last time around, I decided to take a somewhat more long-term structured approach and make notes of things which strike me as noteworthy.

Lots of other students and viewers have done so already and have made transcripts available, but I wanted to make my own notes, as I find that the other notes:

- are either too long, with way too much detail;
- are way too short, without enough detail;
- describe coding and theory elements which I already know;
- don’t describe some interesting things which I find very valuable to refer to later on and remember.

I especially wanted to include the points which I’d like to refer to or mention in my own teaching.

Deep Learning 2017 Course: Part 1

This course immediately adopts the concept of transfer learning
- Starting with VGG (ImageNet 2014)
- Tip: always look at the data it was trained on, as this affects the settings in which you can use it for transfer learning
- “Last of powerful simple architectures”
Baseball analogy to explain top-down (rather than bottom-up) learning: “many students lose motivation or drop out along the way. In contrast, areas like sports or music are often taught in a top-down way in which a child can enjoy playing baseball, even if they don’t know many of the formal rules. Children playing baseball have a general sense of the ‘whole game’, and learn the details later, over time”
Understanding LogLoss with Excel
- A small epsilon is applied to probabilities of exactly 0 and 1
- But you can also clip these manually for overconfident models
- Clipped predictions will often rank higher when scored on LogLoss
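To make the clipping point concrete, here is a minimal sketch in plain Python. The function name `log_loss`, the toy predictions and the clip value of 0.05 are my own illustrative assumptions, not something taken from the course. It shows that one confidently wrong prediction dominates the score, and that manual clipping bounds the damage.

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary log loss; predictions are clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # the epsilon avoids log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 1, 0, 1]
y_pred = [1.0, 1.0, 0.0, 0.0]  # overconfident model, last prediction is wrong

unclipped = log_loss(y_true, y_pred)          # only the tiny default epsilon
clipped = log_loss(y_true, y_pred, eps=0.05)  # manual clipping to [0.05, 0.95]
print(unclipped, clipped)
```

The clipped predictions pay a small price on the three correct answers but far less on the wrong one, so the mean loss comes out lower.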
Idea is introduced of changing learning rate after a couple of epochs: this will be expanded much more later on
Keep visualizing, also when evaluating
- Look at (most) correct labels, (most) incorrect labels at random, most uncertain labels
- My Take: generally in ML, data scientists tend to forget about this, even with other models
Probabilities that come out of a deep learning network are not statistical probabilities (e.g. “how often right or wrong” interpretation)
Visualization of a CNN
- Find examples of pieces of images that strongly activate each filter
- Lower layers: very similar to Gabor filters
- Experiment with the layers to fine tune
- Don’t reinitialize weights
Initialization: start off with activations on the same scale, and provide an output close to the desired output
- Xavier (Glorot), Kaiming initialization: https://towardsdatascience.com/weight-initialization-in-neural-networks-a-journey-from-the-basics-to-kaiming-954fb9b47c79
- Most good toolkits do this for you
Local minima
- Generally not an issue in DL: it is not really possible to get to a point where no direction is better, because we have lots of parameters
- Saddle points are another issue
- My Take: It’s striking how many times I’ve gotten this question as well
- What if you don’t know the derivative, or the derivative can’t be calculated (like with ReLU)?
  - All modern neural network libraries do symbolic (or automatic) differentiation for you
  - ReLU doesn’t have a derivative at every point; mathematicians care, but we don’t
    - Doesn’t really matter, as long as it is differentiable at the majority of points
  - So all non-linear activation functions work well
  - ReLU is very common but alternatives exist: ELU, LeakyReLU
  - Also see https://openreview.net/forum?id=H1gJ2RVFPH for a cool new result
Most simple way to transfer learn: use VGG’s 1000 outputs in a linear model
bcolz library to save Numpy arrays
Recommends rmsprop instead of SGD
Matrix multiplications one after another are together still just a linear model
- So that’s why we add in a non-linear step on the activations
There’s a difference between fine-tuning and adding a layer
Why softmax: e^x matches nicely with logloss, and it also exaggerates the largest output, which matches one-hot encoded targets
- Not good for multilabel
On convolutions:
- Increasing number of filters after each maxpool
- Convolution is position invariant
- Solved by further filters down the road
- 3x3 filters are better: use smaller filters, maxpooling, more filters and more layers
- Large images (at the time) still unsolved: attention might help
- CNN utilizes a lot of compute, dense uses a lot of memory
- Maxpool usage is controversial, increasing stride is better
  - Useful for 1D convolutions as well
- CNN can hence also be used for any type of ordered data (RNN alternative)
- SxSxC tensor per filter, then sum
- Capsule networks briefly mentioned
On overfitting:
- First of all build a model that overfits (since then we know we have enough model capacity and know that we can train it)
- Then gradually use a number of strategies to reduce the overfitting
- Add more data
- Dropout: now also on the CNN layers, but not too much in beginning
- Use data augmentation: always, but what kind and how much can vary
- Add regularization: L2 or L1
- Reduce architecture complexity
- Throwing away model information is different from throwing away input information
- BatchNorm: if inputs not scaled then all weights have very different gradients, same for weights in between
  - So also normalize activations, but doesn’t work by just normalizing
  - Normalizes, but also multiplies and adds to change the scale and mean, and includes these two parameters in SGD
  - Allows to use 10x higher learning rate and less overfitting without removing information
  - Start with a batchnorm layer to get automatic normalization (trick)
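As a rough illustration of the BatchNorm bullets above, here is a sketch for a single feature across one mini-batch. The name `batchnorm_forward` and the values are hypothetical, and `gamma` and `beta` stand for the two learnable "multiply and add" parameters. Running statistics and the backward pass are deliberately left out.

```python
import math

def batchnorm_forward(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over the batch, then rescale and shift it."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

activations = [10.0, 200.0, 3000.0, 40.0]  # badly scaled activations
out = batchnorm_forward(activations, gamma=2.0, beta=0.5)

out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((v - out_mean) ** 2 for v in out) / len(out))
print(out_mean, out_std)
```

Whatever the scale of the incoming activations, the output mean and standard deviation are set directly by `beta` and `gamma`, which is what lets SGD control them instead of fighting the normalization.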
Getting the learning rate right is critical: related to exploding gradients and vanishing gradients
- Vast majority of space in loss function are saddle points: loses time
- Chain rule: derivative of f(g(x)) = product of the derivatives, f'(g(x)) * g'(x)
On optimizers, from SGD:
- Momentum: keep track of average movement
- Dynamic LR: adagrad — different LR for different params: keep track of gradients
- rmsprop: similar, but better continuous update — doesn’t explode (exponentially weighted moving average)
  - Allows to keep the learning rate smaller and smaller
  - Start small, then up, then decreasing: learning rate annealing
- Adam: rmsprop and momentum combined
- Still very iterative changes
- Early stopping is pretty difficult to use correctly, especially since loss is jittery and can have plateaus
- Jumping too far at the start of training is easy
  - There are often reasonably good answers that are easy to find, e.g. always predict 0, so look at the predictions!
  - Once accuracy is better than random, we can start to increase the learning rate
- Versus gradient descent (whole dataset)
- Versus online gradient descent (minibatch of 1)
On pseudo labeling:
- What do we do with instances for which we don’t have the labels?
- Can still help us to learn about the structure of the data
- Related to semi supervised learning
- Predict outputs for unlabeled set from a trained model: pseudo-labels (pretend they’re correct)
- And then train again (and repeat)
On knowledge distillation:
- Knowledge distillation is a model compression method in which a small model is trained to mimic a pre-trained, larger model (or ensemble of models). This training setting is sometimes referred to as “teacher-student”, where the large model is the teacher and the small model is the student
- In distillation, knowledge is transferred from the teacher model to the student by minimizing a loss function in which the target is the distribution of class probabilities predicted by the teacher model
- This distribution is the output of a softmax function on the teacher model’s logits. However, in many cases, this probability distribution has the correct class at a very high probability, with all other class probabilities very close to 0. As such, it doesn’t provide much information beyond the ground truth labels already provided in the dataset
- To tackle this issue, Hinton et al., 2015 introduced the concept of “softmax temperature”
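A small sketch of what softmax temperature does, in plain Python. The teacher logits are made up for illustration and the temperature of 4 is an arbitrary choice. Dividing the logits by a temperature above 1 before the softmax spreads probability mass over the non-target classes, which gives the student more to learn from.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits / temperature (max subtracted for stability)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [8.0, 2.0, 1.0]  # hypothetical teacher output

hard = softmax(teacher_logits, temperature=1.0)  # nearly one-hot
soft = softmax(teacher_logits, temperature=4.0)  # softened targets
print(hard)
print(soft)
```

At temperature 1 almost all mass sits on the first class; at temperature 4 the relative ordering of the other classes becomes visible in the targets.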
Collaborative filtering: embeddings as latent factors
- Don’t forget to add bias
- Concatenation and a dense layer seem to work better
Embeddings and CNNs:
- t-SNE can still be misleading, just like PCA
- GloVe and word2vec are constructed using linear models to reason in high dimensional space
- You should always (try to) use pre-trained word embeddings
- You can try different sizes of CNN filters and concat them together before connecting
Functional Keras API generally better
- Intermediate outputs using Model()
RNNs
- Keep memory about distant past
- Long term dependency handling
- Variable length sequences
- Stateful representation
- Attentional model: what should I look at next (very large images)
- An unrolled RNN collapses into a single loop with shared weights
- First make embeddings and then pass to RNN
- Predict n from n-1
- Initialize the hidden-to-hidden weight matrix with the identity, so that initially the hidden state does not change at all
  - A simple way to initialize RNNs
- Predict 2 to n from 1 to n-1
  - Initial (1) input with zeroes
  - return_sequences = True: return output after every input step
- Long-term dependencies
  - Cannot initialize to zeroes after every sequence: we want to keep the hidden state over the sequences
    - stateful = true: don’t reset the hidden activation after every sequence
      - Training stateful models is hard
      - Because the hidden state weights get applied a lot
      - So scaling issues can lead to exploding activations
    - The stateful RNN keeps the hidden state across batches, but doesn’t backprop between batches
      - Time steps still matter
  - Thought impossible to train until LSTM
    - Neural network inside the loop that decides how much state to keep
    - … and BatchNorm
    - inside of the loop: layer normalization
  - See also http://philipperemy.github.io/keras-stateful-lstm/
- RNN feeding into an RNN
  - Add dropout to RNN on U and W
- GRU as a simpler LSTM
  - Also makes sure activations don’t explode
  - Might work better than LSTM depending on setting
Combining RNNs and CNNs
- Attention
- Caption generation
ResNet as a new development
- Adds identity of input to output in each ResNet block
- Works as a way to model residuals: comparable to boosting!
- GlobalAveragePooling introduced
- Inception: multiple filter sizes and concatenate, small improvement
Data leakage: fisheries competition example
- Redundant metadata
- Bounding boxes
  - Human help can help
  - Multi output can help, even if we don’t use it: more info
  - Better to give compatible tasks at once
- “Don’t just eff around with architectures and hyperparameters”
- Easy to set sizes with pretrained network, contrary to what some people say
  - FCN with GlobalAveragePooling
  - Can be used to make heatmaps
  - Again, draw pictures of activations, weights, gradients
- Ensembling by reinitializing the network a couple of times

Deep Learning 2017 Course: Part 2

From Theano to TensorFlow:
- API getting easier
- Productionization story
- ML Toolkit Tensor Forest
- Pytorch mentioned in context of RNN
- Define through run (dynamic computation)
Artistic style transfer:
- Loss function as MSE
  - Better to use intermediate convolution activations of a CNN
    - VGG, Resnet also possible but harder
    - loss(activ of orig, activ of modified) -> tune modified
- Average Pooling in Generative models
  - Better than Maxpooling
- No need to use SGD
  - Deterministic optimization based on line search (fmin_l_bfgs_b)
  - Can still give issues at saddle points
    - Conjugate direction: most downhill and most different
  - Optimize arbitrary Keras function
- Gram matrix for style
- Starting point of optimizer matters!
- Many more tuning possibilities, eg blur chaining, different activation layers weighted
  - likemo.net
  - pix2pix
- Alternative: train a CNN to generate a new image from the original content image, rather than optimizing a random input image
  - Play with padding and valid convolutions, otherwise weird “checkerboard” artifacts
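For the Gram matrix used as the style representation above, here is a minimal sketch. It assumes each channel of a convolutional activation has already been flattened into a plain list, and the numbers are made up. It illustrates why the Gram matrix captures style rather than content: it records how channels co-activate, but not where.

```python
def gram_matrix(channels):
    """channels: one flat list of activations per channel (all the same length)."""
    n = len(channels[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in channels]
            for ci in channels]

# Two hypothetical channels over four spatial positions
features = [[1.0, 0.0, 2.0, 1.0],
            [0.0, 1.0, 1.0, 3.0]]

# Same activations with the spatial positions shuffled identically in each channel
order = (2, 0, 3, 1)
shuffled = [[channel[i] for i in order] for channel in features]

print(gram_matrix(features))
print(gram_matrix(shuffled))
```

Both calls print the same matrix, because moving features around in the image does not change the Gram matrix.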
Super resolution:
- Only content loss
- Deconvolutions
- Start off with a large convolution (many large filters) to increase the receptive field of later ones
- Removing activation of last layer per resnet block
- Loss in lambda and then fake targets
- Checkerboard artifacts
  - Caused by stride 2 with filter size 3; size 4 is better
  - Or not use deconv’s but upsampling and normal convolution
Image preprocessing: resize by chopping instead of borders
Broadcasting: nice historical link to APL and J
Multi-modal models; that is, models which can combine multiple types of data in joint feature space
- Cotraining
- Resnet does not name the merge layers
  - Just before last bottleneck layer is a good place for transfer learning
  - Or not… requires experimentation
- Average Pooling
- Cosine distance
- “In high dimensional space, everything sits at an edge of one of the dimensions”
- Approximate nearest neighbors: locality sensitive hashing (LSH) LSHForest
  - Spilltrees is an alternative
GANs:
- Most people see GAN as an alternative to CNN based generation
  - But even better to stick it on top of it
- Back and forth training approach
  - Hard to interpret loss functions :(
  - Mode collapse: G and D reach a stalemate :(
- DCGAN tries to solve this problem
  - CNN, use strided convs, no maxpooling
  - No fully connected layers
  - Use batchnorm
  - Upsampling even better for G with conv’s
  - Still not fantastic
- Wasserstein GAN
  - Loss curves finally mean something
  - Train D… then train G… back and forth
  - Use MSE and constrain the weights in small area
  - Care about the difference between two loss functions: Jensen-Shannon distance
    - That loss function is hideous, non smooth, not differentiable
    - When using cross entropy
    - Other distances (KL) could be used as well
  - Still not easy when data set has many categories
- BEGAN (new result), CycleGAN, …
Some more tuning insights:
- ELU seems to work well
- Linear combination convolution of the channels 1x1
Noisy labels
Mentions cyclical learning rates by Leslie Smith
- And fairly automated LR picking approach (one cycle)
- Will be expanded later on but is a key tip
3d volume cancer detection
- “Anti-fan of clustering: unsupervised techniques are looking for a problem to solve”
- “K-means sucks (kernel k means is a bit better)”
- Mean shift clustering in 5 dimensions is nice though
Memory networks and chatbots:
- “For sentence embeddings, they just add the embeddings up”
- Kind of messy
- “Chatbots state of art is terrible”
- Recurrent entity network + attentional networks are better
On attention models:
- Dropout in RNNs is a bit different from normal dropout: best to dropout the same things in every time step
- Bidirectional RNN
  - Look forwards and backwards
  - Doubles the number of features by concatenating
- Matrix -> RNN (Encoder: sequence to vector) -> vector -> RNN (decoder) -> output sequences
  - Copy vector state and make copies (RepeatVector)
  - Still not perfect for longer words since we try to encode in a single vector
- BiLSTM + attention
  - Used a lot in NLP, “replaces everything before in linguistics” according to Chris Manning
- Attentional model
  - Weighted sum over decoded state where the weights represent how important each input element is to get the output
  - Weights are trained using SGD: single hidden layer
    - Jointly trained!
    - No reason to not make it more complex but allows for easy visualisations
  - Also applied on sound waves for speech recognition
  - Teacher forcing: pass the encoder hidden state and the correct answer for the previous time step
    - Sounds like cheating, but makes training easier early on
- Embedding -> Bidir -> rnn -> rnn -> attention (x + teacher forced) -> timedistributed
  - Attention is a custom keras layer
Reinforcement learning: “almost no real practical examples”
- Evolutionary models turn out to be better, even…
Class imbalance: one way is to simply adjust your mini batches before doing under/over-sampling
Transfer learning: why does vgg still work so well vs inception and resnet
- Redundant intermediate representations
Find best output given list of predictions (“top1k problem”)
- Doesn’t work well with sequences
- Find best pathway through predictions
- Step by step adding up log of predictions (Viterbi)
  - NP-complete :(
- Beam search is better (pick the top few so far), or A*
  - Used by majority of practical approaches for decoding
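A minimal sketch of beam search over summed log-probabilities. The `toy_model` function is a hypothetical stand-in for a real sequence model: it returns next-token probabilities for a given prefix. With a beam width of 1 this reduces to greedy decoding, which makes the difference easy to see.

```python
import math

def beam_search(next_probs, beam_width=2, length=2):
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [((), 0.0)]  # (sequence, summed log-probability)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for token, p in next_probs(seq).items():
                candidates.append((seq + (token,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune to the top few so far
    return beams[0]

def toy_model(prefix):
    # Hypothetical probabilities: "a" looks best first, but "b" leads to a better path
    if prefix == ():
        return {"a": 0.6, "b": 0.4}
    if prefix == ("a",):
        return {"x": 0.5, "y": 0.5}
    return {"x": 0.9, "y": 0.1}

greedy_seq, greedy_score = beam_search(toy_model, beam_width=1)
beam_seq, beam_score = beam_search(toy_model, beam_width=2)
print(greedy_seq, math.exp(greedy_score))
print(beam_seq, math.exp(beam_score))
```

Greedy decoding commits to "a" and ends with probability 0.3, while a beam of two keeps "b" alive and finds the path with probability 0.36.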
Neural machine translation of rare words with subword units
- What do I do with a word I haven’t seen before? Or I don’t want a large vocab?
- BPE (byte-pair encoding): tokens that are not necessarily the same as words, e.g. [er], [am]
Densely Connected Convolutional Networks
- Replace addition with concatenation in ResNet blocks
- Works well with limited amounts of data!
- Compression and bottlenecking
- “The One Hundred Layers Tiramisu”
  - Good for image segmentation
  - U-net model, but uses DenseNet block
    - Skip connections
    - Also nice
  - Data augmentation helps
  - Original did 1x1 conv and maxpooling
    - Stride 2 seems to work better
    - Also, deconv2d works better than upsampling + CNN here
  - works even better on video frames than models that incorporate time component
Time series
- Dynamic mortality risk predictions using RNN
- Rossmann sales: entity embeddings of categorical variables
  - My Take: I really like this case
  - Each categorical one hot embedding, concat, two dense layers and an output layer
  - Continuous variables get pushed directly in the network
  - Embeddings can be used in other models as well
  - Draw some projections
  - Google trends used as data source
  - vtreat package
  - Supervised embeddings vs unsupervised like autoencoders — is the latter more generalized?
    - “Bah! I don’t like unsupervised models”
    - “Always better to define a loss function — any loss function”
  - How long until previous/next holiday
    - Still good human feature engineering necessary!
  - Nice that Pandas has a lot of time series stuff
  - Dimensionality not so high, so embedding length set manually per categorical level
    - Or min(50, levels//2)
  - sklearn_pandas package
  - Removed all rows where store was closed
    - What happens before and after?
    - Sales for stores that are open on sunday are higher -> should be a feature
  - log and division with max log -> so we can use sigmoid
    - Doesn’t seem to matter much here
    - For sigmoid it’s really hard to get to the maximum!
    - So max log * 1.1 would have been better
  - Did not normalize continuous? and other weird changes
    - Turns out scaling and a good init and single dense works well
  - “xgboost is fantastic”
  - feature importance plots are important
    - “Credit scoring with 3000 vars, only 9 mattered”
- This had actually already been done: “Artificial Neural Networks Applied to Taxi Destination Prediction”
  - Yoshua Bengio
  - Simple embedding dimension
  - GPS points: just first and final points
  - Softmax, centroid with mean-shift clustering to come up with 3000 destinations

Deep Learning 2018 Course: Part 1

Starts off with ResNet model for cats vs dogs and introduction of fast.ai library
- Most top researchers are using PyTorch. Note: returns log of predictions
Visualizing again important: look at correct and incorrect ones
- “Why do you want to look at your data?”
  - Take advantage of things the model is doing well and fix what it is not
  - Learn about the data set itself
Nice example of image classifiers on mouse movement to predict fraud
Universal approximation theorem (UAT): a single hidden layer can approximate any function, but may need exponentially many units to do so
- But multiple layers helps!
LR finder (cyclical learning rates)
- LR is the hyperparameter that matters
- Regardless of optimizer, even if they do adaptive learning rate, annealing still important
And so is data augmentation!
SGD with restart
- Cosine annealing with jump back up
  - Gradually increase length of cycle as well
- Helps to avoid spiky minima
  - Change every mini batch
  - Alternative for restarts with init/ensemble people did before
  - Snapshot ensemble: save weights at every bottom point before restart
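The schedule described above (cosine annealing, a jump back up, and cycles that get longer) can be sketched as a small function. The function name, the default values and the idea of indexing by mini-batch iteration are my own assumptions for illustration, not the fast.ai implementation.

```python
import math

def sgdr_lr(iteration, lr_max=0.1, lr_min=0.0, cycle_len=100, cycle_mult=2):
    """Learning rate for a mini-batch iteration under cosine annealing with restarts."""
    start, length = 0, cycle_len
    while iteration >= start + length:  # find the cycle this iteration falls in
        start += length
        length *= cycle_mult            # each cycle is longer than the previous one
    progress = (iteration - start) / length
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

for it in (0, 50, 99, 100, 299, 300):
    print(it, round(sgdr_lr(it), 5))
```

The rate starts at `lr_max`, decays towards `lr_min` over 100 iterations, jumps back up at iteration 100, then takes 200 iterations for the next decay, and so on. The low points just before each restart are where a snapshot ensemble would save weights.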
Different learning rates when transfer learning per part of the model
- Don’t need to unfreeze only a part any more
Test time augmentation: average 1+4 preds
- But maybe sliding window might be enough
First train on smaller images and then on larger images
Unbalanced data not a big problem
- Best way is just to make copies of the rare class or adjust minibatches
Multilabel classification: don’t use softmax, obviously
Dropout still important
Year and dayofweek -> categorical
- Cardinality: number of levels
- Binning can be helpful
- Unknown category (for year): actually pretty meaningful
- Neural nets really like ~N(0,1)
- Cardinality versus length of vector
  - Rule of thumb here: min(50, (card+1) // 2)
- Advantage versus onehot? Would work, but makes that the concept of e.g. Sunday can only be associated with one floating number
  - Linear behavior
  - Concepts in higher dimensional space allows to capture more richer semantics
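The embedding-size rule of thumb from the list above, written out as a tiny helper. The helper name and the example cardinalities are illustrative assumptions.

```python
def embedding_size(cardinality, max_size=50):
    """Half the number of levels (rounded up), capped at max_size."""
    return min(max_size, (cardinality + 1) // 2)

# Hypothetical categorical variables and their number of levels
cardinalities = {"dayofweek": 7, "year": 4, "store": 1115}
sizes = {name: embedding_size(card) for name, card in cardinalities.items()}
print(sizes)
```

A low-cardinality variable such as day of week gets a handful of dimensions instead of the single number per level a one-hot encoding would give it, while very high-cardinality variables are capped.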
import dill as pickle
NLP
- Lots of new stuff
- Language modeling: predict next word/char
  - Use it as a pretrained model
  - Fine tuning really powerful here as well
  - char-rnn a bit similar, but word level easier
- Spacy shows up: “super good tokenizer”
- Stemming or lemmatization not 100% required here
- Bag of words no longer useful
- Length X BatchSize step over BPTT (backprop through time)
  - Shuffle BPTT length
- First step is embedding matrix
- Adam defaults need to be changed for NLP
- AWD LSTM language model: uses dropout (all over the place)
  - Need to experiment tho
  - Clip the gradients (gradient clipping)
- word2vec or glove would also be okay in this case, but
  - Pretrained doesn’t really add that much
  - Pretrained language models are nicer
Collaborative filtering
- dask
- PyTorch underscore trailing means do it in place
- Bias is important, still
- PyTorch now has broadcasting
- Put ratings through a sigmoid function: sigmoid output (0..1) * 4 + 1 gives a 1..5 range. Not state of the art, but helps
- Not strictly a matrix factorization (harder to deal with sparse matrices -> predict 0)
- Here too cycle opt can help
- Then a neural net again (linear layer has bias)
  - Concat both embeddings, linear, relu and linear again and sigmoid
    - And dropout
    - Dimensionality for users and movies doesn’t need to be the same any more
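A sketch of the simpler dot-product model with bias terms and the sigmoid range trick from the list above. The embedding vectors and biases are made-up numbers. In a real model these would be learned, and the function name `predict_rating` is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_rating(user_vec, movie_vec, user_bias, movie_bias, lo=1.0, hi=5.0):
    """Dot product of the embeddings plus biases, squashed into [lo, hi]."""
    raw = sum(u * m for u, m in zip(user_vec, movie_vec)) + user_bias + movie_bias
    return lo + (hi - lo) * sigmoid(raw)

neutral = predict_rating([0.0, 0.0], [1.0, 1.0], 0.0, 0.0)
enthusiastic = predict_rating([2.0, 1.0], [1.5, 2.0], 0.5, 0.5)
print(neutral, enthusiastic)
```

The sigmoid means the model never has to learn on its own that ratings stay between 1 and 5. Since it only approaches the endpoints asymptotically, the highest rating is never quite reached with this exact range.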
SGD
- Finite differentiation
  - Actually fine, computers are discrete anyway
  - But problem in high dimensional space: slow!
    - https://explained.ai/matrix-calculus/
- Analytic differentiation
  - Chain rule most important f(g(x)) -> relu(lin(x)) …
    - And that’s just backpropagation — really
- Take a hint that’s a good direction -> momentum (linear interpolation)
  - alpha * grad(t) + (1 - alpha) * step(t-1)
  - can be tuned (esp with well behaved loss) but almost no one does
- Adam: much faster, but final answers are not quite as good as SGD + momentum, especially in NLP
  - Solved two weeks ago (at the time of the lecture): weight decay had a nasty bug, AdamW fixes this
  - Incorporates adaptive (dynamic) learning rate
- Weight decay… more params than data points
  - Regularization and dropout
  - Weight decay = l2, add square of weights to loss
    - Penalize params when high unless gradient varies a lot
    - So actually needs to be handled separately from loss
    - “decoupling weight decay” from the momentum term (Sebastian Ruder)
- No rmsprop any more
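The momentum idea from this list, where the step is a linear interpolation between the current gradient and the previous step, can be sketched on a one-parameter problem. The function name, the quadratic loss and the hyperparameter values are my own choices for illustration.

```python
def sgd_momentum(grad_fn, w, lr=0.1, alpha=0.1, steps=300):
    """Gradient descent where each step is an exponentially weighted average of gradients."""
    step = 0.0
    for _ in range(steps):
        g = grad_fn(w)
        step = alpha * g + (1 - alpha) * step  # keep most of the previous direction
        w -= lr * step
    return w

# Minimize (w - 3)^2, whose gradient is 2 * (w - 3)
w_final = sgd_momentum(lambda w: 2 * (w - 3.0), w=0.0)
print(w_final)
```

With `alpha=0.1`, 90% of each step comes from the previous direction, which corresponds to the usual momentum value of 0.9.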
PyTorch will get rid of tensor -> var concept
- Use NumPy, except when you need derivatives or need to run on the GPU; that’s fine
Categorical embeddings
- Best way to turn unsupervised to supervised is to invent some labels
  - Unsupervised = “fake labels”
- Trains better with a quick matrix multiply instead of deep
- Does the “task” in the supervised matter
  - Hard question, e.g. predict augmented image from non-augmented, predict context word, predict next word
  - Though fake task often works well
    - “Ultimate fake task is autoencoder”
- “Look, don’t touch — no assumptions doesn’t mean not looking” (don’t delete closed store days)
RNN
- Merge add, or concat, both are possible
- Character model again now
  - Sometimes might want to combine both (word/char byte-pair encoding (BPE))
- Hyperbolic tan still used here
BatchNorm
- Normalize every layer, not just input
- Prevent activations vanishing or exploding
- Would be simple, but SGD is bloody-minded
  - Just trying to subtract the mean and divide by the std won’t do it
  - SGD would just try to undo it again next time
- Where do you put the batchnorm?
  - Original paper got it wrong: no ablation study, after relu is better
    - Though not easy to say
- Regularization effect but more about convergence
- No reason not to use it, there were cases where it was hard to use correctly (like RNN)
- Data normalization not required but still a good idea
Singular conv2d layer at the start
- Because we just want more filters except for the three channels, e.g. 10 5x5 filters
- Most modern archs use this
- E.g. 32 11x11
- Stride 1 padding 2: output will have the same dimension
ResNet: y = x + f(x), so f(x) = y - x -> fit a function for the difference between the previous layer and the output
- “Residuals net”: trying to fit the residual
- So far ignored in non-image domains
  - Transformer model does it as well: skip connections, skipping over layer and identity add
    - Works well
Adaptiveaveragepooling -> heatmaps

Deep Learning 2018 Course: Part 2

Regularization recap: reducing architecture complexity should be last thing but often what people do first
Bounding boxes
- Start out by classifying the largest object in each image
- Next, we try to find the bounding box… simply regression output
  - Easy but crappy if there are multiple objects
- And then combine both with a custom head
  - Again, seems to do a bit better, seems counterintuitive
    - Shared goals/computation are always better
- And then multilabel classification
  - Instead of 4+c we could have 16x(4+c)
    - Assuming we have a reasonable loss function, that would work -> YOLO network style
    - We could also make a conv2d with stride 2: 4x4x(4+c) -> SSD
    - Both are used, but yolo v3 used the ssd way
      - Anchor boxes
      - One more class for background
  - 4x4 grid cells: anchor boxes, prior boxes, default boxes
    - Get out class
    - Convert to bounding boxes
  - Possible to make anchor boxes of different sizes
defaultdict is handy
“Width by height, rows by columns” :)
matplotlib has an object oriented api, but almost no one uses it
- subplots is super handy, returns fig, ax; use ax.
pdb.set_trace() to set a breakpoint, %debug to trace an error
- h(elp) s(tep into)/n(ext)/c(ontinue) l(ist context) p(rint) varname / u(p callstack context) or d(own)
l1loss generally better than mse: absolute values, less penalization of large errors
Transfer learning for NLP
Slanted triangular learning rate: again, all about the learning rate
GANs getting closer to being proven useful
- But still cutting edge
- Many architectures slow as molasses and take a lot of memory
- Still using Wasserstein GAN, take care of checkerboard artifacts
- “If you train a model with synthetic data, the neural net will become fantastically good at recognizing the specific problems of your synthetic data and that’ll end up what it’s learning from”
- Darknet
- Adaptiveavgpooling instead of averagepooling
  - Set size of output instead of window
- LeakyReLU
- SeLU not so great
- Cyclegan
Nice writeup on ethics: https://medium.com/@hiromi_suenaga/deep-learning-2-part-2-lesson-13-43454b21a5d0
L-BFGS also uses second-derivative information (approximated from the gradients)
Image segmentation with UNET

Deep Learning 2019 Course: Part 1

Cricket vs baseball with just 30 images
Normalizing is important
Still ResNet works very well
Image sizes still tricky: square and fixed size
Use onecycle
Even works on mouse movement pictures and sound with the same setup
Learning rate and epochs most important things to tune
UNET still for segmentation
Language model and classifier for nlp
Zeiler and Fergus paper
Sigmoids: A sigmoid actually asymptotes at whatever the maximum is so actually it’s slightly better to make your Y range go from a little bit less than the minimum to a little bit more than the maximum
Reflection is nearly always better for data augmentation
Spectral normalization to make GANs work nowadays
- GANs hate momentum when you’re training them
- Wasserstein GAN now older but still useful
- “Can we get rid of GANs entirely?” Obviously, the thing we really want to do is come up with a better loss function
Perceptual Losses for Real-Time Style Transfer and Super-Resolution: “Justin Johnson et al. created this thing they call perceptual losses. It’s a nice paper, but I hate this term because there’s nothing particularly perceptual about them. I would call them ‘feature losses’”
- VGG model on this generated image, but let’s take something in the middle. Let’s take the activations of some layer in the middle
- We then take the target (i.e. the actual y value) and we put it through the same pre-trained VGG network, and we pull out the activations of the same layer. Then we do a mean square error comparison

Deep Learning 2019 Course: Part 2

“Overfit, reduce overfit, no step 3”
- Overfit: doesn’t mean train loss < validation loss!
- Means validation error gets worse!
More data, augment, generalizable architecture, regularization, reduce complexity: in that order
fire as handy CLI library
If we repeatedly multiply by matrices initialized from a normal distribution (mean=0; std=1), even just 28 times, the values become NaN, which means they have grown too big
- What if we multiply the weights by 0.01? Then the problem is that the numbers become zeros
- The answer is to divide by the square root of the column size of the matrix
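A rough sketch of the three initialization cases above in plain Python: a random vector is pushed through 28 purely linear layers (no activation function) and the spread of the result is measured. The layer width of 64, the seed and the function name are arbitrary assumptions. At this small size and in 64-bit floats the sketch shows the explode/vanish trend rather than literal NaNs.

```python
import math
import random

def final_std(weight_scale, n=64, layers=28, seed=0):
    """Std of activations after `layers` random n x n matrix multiplications."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(layers):
        w = [[rng.gauss(0, 1) * weight_scale for _ in range(n)] for _ in range(n)]
        x = [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
    mean = sum(x) / n
    return math.sqrt(sum((v - mean) ** 2 for v in x) / n)

std_plain = final_std(1.0)                 # std=1 weights: explodes
std_small = final_std(0.01)                # weights * 0.01: vanishes
std_scaled = final_std(1 / math.sqrt(64))  # divided by sqrt(n): stays in range
print(std_plain, std_small, std_scaled)
```

Each unscaled layer multiplies the spread by roughly sqrt(n), so dividing the weights by sqrt(n) is what keeps the activations at about the same scale from layer to layer.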
Training loop looks like this: calculate predictions, calculate loss, backward pass, subtract learning rate times gradients from the weights, zero gradients
- “Why we need to zero our gradients in PyTorch? If we don’t zero the gradients in every loop it is going to add the new gradients to the old ones. Then why can’t PyTorch just zero the grads automatically? This is because sometimes we want to use multiple different modules and if PyTorch would automatically zero the gradients we couldn’t do this”
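The five steps can be sketched without any framework for a one-variable linear model. This is a plain-Python stand-in, not PyTorch code: the gradient buffers `grad_w` and `grad_b` are hand-written and accumulate with `+=` only to mimic how gradients add up until they are zeroed. The data and hyperparameters are made up.

```python
def train(xs, ys, lr=0.05, epochs=200):
    w, b = 0.0, 0.0
    grad_w, grad_b = 0.0, 0.0  # gradient buffers that accumulate until zeroed
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            pred = w * x + b              # 1. calculate predictions
            loss = (pred - y) ** 2        # 2. calculate loss (kept for clarity)
            grad_w += 2 * (pred - y) * x  # 3. backward pass: d(loss)/dw, accumulated
            grad_b += 2 * (pred - y)      #    and d(loss)/db
            w -= lr * grad_w              # 4. subtract learning rate * gradients
            b -= lr * grad_b
            grad_w, grad_b = 0.0, 0.0     # 5. zero gradients
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1
w, b = train(xs, ys)
print(w, b)
```

Leaving out step 5 would make every update use the sum of all gradients seen so far instead of the current one.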
Variance measures how much data points vary, i.e. how far they are from the mean on average
- Positive and negative so either square or abs
  - Square needs to be undone: sd
  - abs: mean absolute deviation
- The standard deviation is more sensitive to outliers and that is why we more often use absolute deviation
  - Why square: just because it makes some math easier :)
  - Replacing squares with abs values often works better!
- Variance: E[(X - E[X])^2] = E[X^2] - E[X]^2
- Covariance: this number tells how well points line up
  - Higher: how much these things vary in the same way
  - E[XY] - E[X]E[Y]
- These numbers can be at any range so we calculate correlation which is always between -1 and 1
  - Good question: why scaled with standard deviation? Not variance or mean?
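The shortcut formulas in this list are easy to check numerically. Below is a small sketch with made-up data, using population statistics (dividing by n) and my own function names.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance(x):
    """Mean of the squares minus the square of the mean."""
    return mean([a * a for a in x]) - mean(x) ** 2

def covariance(x, y):
    """Mean of the products minus the product of the means."""
    return mean([a * b for a, b in zip(x, y)]) - mean(x) * mean(y)

def correlation(x, y):
    """Covariance rescaled by both standard deviations, always in [-1, 1]."""
    return covariance(x, y) / math.sqrt(variance(x) * variance(y))

def mean_abs_deviation(x):
    m = mean(x)
    return mean([abs(a - m) for a in x])

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(variance(x), mean_abs_deviation(x), covariance(x, y), correlation(x, y))
```

Because `y` is exactly twice `x`, the points line up perfectly and the correlation is 1, while the covariance itself depends on the scale of the data.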
Softmax: e to the power divided by sum
- Different outputs can end up giving the same softmax!
- Softmax likes to pick one thing and make it high
- Terrible idea unless you do single-label multiclass classification
  - Otherwise use binomial
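A short sketch of the two softmax observations above, with made-up activations. It compares softmax with independent per-output sigmoids, which is how I read "binomial" here (an assumption on my part).

```python
import math

def softmax(logits):
    m = max(logits)  # subtracting the max does not change the result
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Different outputs, same softmax: only the differences between logits matter
a = softmax([1.0, 2.0, 3.0])
b = softmax([101.0, 102.0, 103.0])

# All activations are weak, but softmax still picks a winner
weak = [-4.0, -3.0, -2.0]
forced = softmax(weak)
independent = [sigmoid(z) for z in weak]
print(a, b)
print(forced, independent)
```

Softmax reports about two thirds probability for the last class even though none of the raw activations suggest the class is present, whereas the independent sigmoids all stay low.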
“We want to see how the mean and standard deviation of the activations change”
- “These values increases exponentially and collapse suddenly many times at the start. Then later the values stay better at some range which means that model starts to train. If these values just goes up and down model is not learning anything”
- “When we look at the first ten loops we notice that the standard deviation drops further we go from the first layer. And to recall we wanted this to be close to one”
- Kaiming initialization solves this
“Another thing we want to assure is that the activations aren’t really small in later layers”
- “This can be seen using histogram of sd’s: the idea of the colorful dimension is to express with colors the mean and standard deviation of activations for each batch during training. Vertical axis represents a group (bin) of activation values. Each column in the horizontal axis is a batch. The colours represent how many activations for that batch have a value in that bin”
- “These plots show us right away a problem. There is a yellow line going along the bottom of each plot, which means there are a lot of values near zero, and that is something we don’t want”
- “The plots above show what percentage of activations are nearly zero. This tells us more about the yellow line we saw on previous plots: how many activations are we wasting?”
- LeakyReLU
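A rough sketch of what one column of such a plot measures, assuming random pre-activations for a single batch. The helpers `frac_near_zero` and `leaky_relu`, the slope of 0.1, and the bin settings are my own choices, not the course's exact code.

```python
import numpy as np

rng = np.random.default_rng(0)
pre_acts = rng.standard_normal((64, 128))   # hypothetical pre-activations for one batch

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, slope=0.1):
    # Negative inputs are scaled down instead of being set to zero.
    return np.where(x > 0, x, slope * x)

def frac_near_zero(acts, eps=1e-3):
    # Share of activations that are (almost) zero, i.e. "wasted".
    return float((np.abs(acts) < eps).mean())

# One column of the histogram plot: counts of this batch's activations per bin.
counts, bin_edges = np.histogram(relu(pre_acts), bins=40, range=(0.0, 4.0))

print("ReLU near zero:     ", frac_near_zero(relu(pre_acts)))
print("LeakyReLU near zero:", frac_near_zero(leaky_relu(pre_acts)))
print("fullest bin index:  ", int(counts.argmax()))   # the bin touching zero
```

With plain ReLU about half of the activations sit in the lowest bin, which is the yellow line; LeakyReLU keeps those values small but non-zero.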
We have reached our limit and to go further we need to use normalization. BatchNorm is probably the most common normalization method
- If we use BatchNorm we don’t need bias because there is already a bias in BatchNorm.
- BatchNorm works well in most of the cases but it cannot be applied to online learning tasks
- Another problem is RNNs: we can’t use BatchNorm with RNNs, nor with small batches.
- LayerNorm is just like BatchNorm except that instead of averaging over dimensions (0,2,3) we average over (1,2,3), and it doesn’t use a running average. It is not nearly as good as BatchNorm, but for RNNs it is what we want to use because we can’t use BatchNorm.
- The problem with LayerNorm is that it combines all channels into one. InstanceNorm is a better version of LayerNorm where channels aren’t combined together.
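The difference between the three boils down to which axes the mean and variance are taken over. A minimal NumPy sketch, assuming activations of shape (batch, channels, height, width) and leaving out the learnable scale/shift and BatchNorm's running averages; the helper `normalize` is my own.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical activations with shape (batch, channels, height, width).
x = rng.standard_normal((8, 3, 4, 4)) * 5.0 + 2.0

def normalize(x, axes, eps=1e-5):
    # Subtract the mean and divide by the std computed over the given axes.
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

batch_norm = normalize(x, (0, 2, 3))      # one mean/variance per channel
layer_norm = normalize(x, (1, 2, 3))      # one mean/variance per sample
instance_norm = normalize(x, (2, 3))      # one mean/variance per sample and channel

print(batch_norm.mean(axis=(0, 2, 3)))    # ~0 for every channel
print(layer_norm.mean(axis=(1, 2, 3)))    # ~0 for every sample
print(instance_norm.std(axis=(2, 3)))     # ~1 for every (sample, channel) pair
```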
Layerwise Sequential Unit Variance (LSUV)
- As we have seen getting a variance of one through a model is quite hard because even little things can spoil this. The high-level idea behind LSUV is to let the computer make all these decisions
- First, we start by creating a normal learner. Then we take one mini-batch out of this learner. We take all our layers and then create a hook that lets us see the mean and standard deviation of each layer’s output. Because we didn’t use any initialization, the mean and standard deviation aren’t what we hope for. Normally at this point we would test a bunch of initialization methods to see what works best. This time, though, we use the mini-batch we took previously and iteratively try to solve the problem
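The iterative idea can be sketched without any learner or hooks. This is a simplified NumPy version under my own assumptions: plain linear + ReLU layers without bias, deliberately unscaled weights, and only the standard deviation being corrected (the course version also adjusts the mean). The names `layer_stds` and `lsuv` are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.standard_normal((256, 100))    # the single mini-batch that is reused
# Deliberately poor initialization: unit-variance weights, no scaling at all.
weights = [rng.standard_normal((100, 100)) for _ in range(5)]

def layer_stds(x, weights):
    """Forward pass through linear + ReLU layers, recording each output std."""
    stds = []
    for w in weights:
        x = np.maximum(x @ w, 0.0)
        stds.append(float(x.std()))
    return stds

def lsuv(x, weights, tol=1e-3, max_iter=10):
    """Rescale each layer in turn until its output std on this batch is close to 1."""
    for i in range(len(weights)):
        for _ in range(max_iter):
            std = np.maximum(x @ weights[i], 0.0).std()
            if abs(std - 1.0) < tol:
                break
            weights[i] = weights[i] / std
        x = np.maximum(x @ weights[i], 0.0)   # feed the fixed layer's output onward

before = layer_stds(batch, weights)
lsuv(batch, weights)
after = layer_stds(batch, weights)
print("before:", np.round(before, 1))
print("after: ", np.round(after, 3))
```

In this bias-free ReLU setup a single rescale per layer is already enough; the loop is there because with biases or other activations the std does not scale exactly with the weights.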
Mixup / Label smoothing:
- The idea is to combine two images by taking some amount of one image and some amount of another. We also do this for the labels. So, for example, we might take 30% of a plane image and 70% of a dog image, and then the label for that combination will be 30% plane and 70% dog
- In softmax, there is one number a lot higher than the others. This is not good for mixup. Label smoothing is where we don’t use one-hot encoding but something like 0.9-hot encoding. It means that instead of trying to give one for some class, the model tries to give 0.9 for that class and spreads the remaining 0.1 over the other classes
  - This is a simple but very effective technique for noisy data. You actually want to use this almost always unless you are certain that there is only one right label
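Both tricks are only a few lines when written out. A minimal sketch with made-up 2x2 "images" and three hypothetical classes; `mixup` and `smooth_labels` are my own helper names. Note that label smoothing comes in more than one convention: here the true class gets 1 - eps and the rest is shared among the other classes, matching the 0.9 / 0.1 description above.

```python
import numpy as np

n_classes = 3                          # hypothetical classes: plane, dog, cat
plane_img = np.full((2, 2), 1.0)       # stand-ins for real images
dog_img = np.full((2, 2), 0.0)
plane_lbl = np.eye(n_classes)[0]       # one-hot [1, 0, 0]
dog_lbl = np.eye(n_classes)[1]         # one-hot [0, 1, 0]

def mixup(x1, y1, x2, y2, lam):
    # Blend the inputs and the labels with the same weight lam.
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def smooth_labels(one_hot, eps=0.1):
    # True class gets 1 - eps; eps is shared equally among the other classes.
    n = one_hot.shape[-1]
    return one_hot * (1 - eps) + (1 - one_hot) * eps / (n - 1)

mixed_img, mixed_lbl = mixup(plane_img, plane_lbl, dog_img, dog_lbl, lam=0.3)
smoothed = smooth_labels(plane_lbl)

print(mixed_lbl)    # 30% plane, 70% dog
print(smoothed)     # 0.9 for plane, the remaining 0.1 split over dog and cat
```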
Mixed Precision Training: instead of using 32-bit floats we use 16-bit floats. On GPUs with fast 16-bit support this can speed up training by about 3x
“Big companies try to brag about how big the batches are that they can train at once. For us, normal people, increasing the learning rate is something we want. That way we can speed up training and generalize better”
Discriminative LR and param groups: discriminative LR means using different learning rates for different layers.
ULMFiT is transfer learning applied to AWD-LSTM

Machine Learning Course

What’s the correct range of r^2? Negative infinity to 1 (a model can be arbitrarily worse than just predicting the mean)
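A quick check of that range, using the usual definition r^2 = 1 - SS_res / SS_tot and made-up numbers; the helper `r2` is my own.

```python
import numpy as np

def r2(y_true, y_pred):
    ss_res = ((y_true - y_pred) ** 2).sum()          # error of the model
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # error of predicting the mean
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2(y, y)                                   # the upper bound
mean_only = r2(y, np.full_like(y, y.mean()))         # always predicting the mean
terrible = r2(y, np.array([10.0, -5.0, 8.0, 0.0]))   # worse than the mean

print(perfect, mean_only, terrible)
```

The worse the predictions, the more negative the value gets, with no lower limit.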
Ensembling: construct multiple models which are better than nothing and where the errors are, as much as possible, not correlated with each other
%prun: Jupyter magic to profile a statement and see where the time goes
Calibrate your validation set, especially on Kaggle: check variance, also check whether a classifier could distinguish between train and validation
Check data leakage, collinearity
“The overall effect of the max_features is the same: it’s going to mean that each individual tree is probably going to be less accurate but the trees are going to be more varied. In particular, here this can be critical because imagine that you got one feature that is just super predictive. It’s so predictive that every random subsample you look at always starts out by splitting on that same feature, then the trees are going to be very similar in the sense they all have the same initial split. But there may be some other interesting initial splits because they create different interactions of variables. So if half the time that feature isn’t even available at the top of the tree, at least half the trees are going to have a different initial split. It definitely can give us more variation and therefore it can help us to create more generalized trees that have less correlation with each other even though the individual trees probably won’t be as predictive”
“It’s pretty much always possible to create a simple logistic regression which is as good as pretty much any random forest if you know ahead of time exactly what variables you need, exactly how they interact, exactly how they need to be transformed. In this case, for example, we could have created a new field which was equal to sale year minus year made and we could have fed that to a model and got that interaction for us”
- “Our coefficients are telling you “in your totally wrong model, this is how important those things are” which is basically meaningless. Whereas the random forest feature importance is telling you: in this extremely high parameter, highly flexible functional form, with few if any statistical assumptions, this is your feature importance. So I would be very cautious”
- “The problem is that when you look at a univariate relationship like this, there is a whole lot of collinearity going on — a whole lot of interactions that are being lost”
- “So again, as data scientists, one of the things you are going to keep seeing is that at the companies that you join, people will come to you with these kinds of univariate charts where they’ll say “oh my gosh, our sales in Chicago have disappeared. They got really bad.” or “people aren’t clicking on this ad anymore” and they will show you a chart that looks like this and ask what happened. Most of the time, you’ll find the answer to the question “what happened” is that there is something else going on”
- “So the PDP approach is where we actually say let’s try and remove all of these externalities. So if something is sold on the same day to the same person of the same kind of vehicle, then actually how does year made impact the price. This basically says, for example, if I am deciding what to buy at an auction, then this is saying to me that getting a more recent vehicle on average really does give you more money which is not what the naive univariate plot said”
What is the benefit of using cross-validation over a standard validation set: you can use all of the data. You don’t have to put anything aside. And you get a little benefit as well in that you’ve now got five models that you could ensemble together, each one used 80% of the data. So sometimes that ensemble can be helpful
Here is the problem with random forests when it comes to extrapolation
- So in this case, what I wanted to do was to first of all figure out what’s the difference between our validation set and our training set
- I’ve gone back and I’ve got my whole data frame with the training and validation all together and I’ve created a new column called is_valid which I’ve set to one and then for all of the stuff in the training set, I set it to zero. So I’ve got a new column which is just is this in the validation set or not and then I’m going to use that as my dependent variable and build a random forest. This is a random forest not to predict price but predict is this in the validation set or not. So if your variable were not time dependent, then it shouldn’t be possible to figure out if something is in the validation set or not
- I still want them in my random forest if they are important. But if they are not important, then taking them out if there are some other non-time-dependent variables that work just as well, that would be better. Because now I am going to have a model that generalizes over time better
The variance of the predictions of the trees: normally the prediction is just the average over the trees, but the variance across the trees is also interesting to show!
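A sketch of the idea with simulated numbers instead of a real forest. The array `tree_preds` is made up; with a fitted scikit-learn forest `m`, I would expect to build it by stacking `t.predict(X)` for each `t` in `m.estimators_`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-tree predictions, shape (n_trees, n_rows).
# The 50 trees agree on row 0 and disagree strongly on row 1.
tree_preds = np.stack([
    10.0 + 0.1 * rng.standard_normal(50),
    10.0 + 3.0 * rng.standard_normal(50),
], axis=1)

prediction = tree_preds.mean(axis=0)   # what the forest normally returns
spread = tree_preds.std(axis=0)        # how much the trees disagree, per row

print(prediction)
print(spread)
```

Both rows get almost the same prediction, but the spread shows that the forest is far less sure about the second one.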
A matrix with one column is not the same thing as a vector
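Why this matters in practice: in NumPy the two broadcast differently, so mixing them silently produces the wrong shape. A minimal illustration with made-up values:

```python
import numpy as np

col = np.array([[1.0], [2.0], [3.0]])   # matrix with one column, shape (3, 1)
vec = np.array([1.0, 2.0, 3.0])         # vector, shape (3,)

diff = col - vec                        # broadcasts to shape (3, 3), not (3,)
fixed = col.squeeze(axis=1) - vec       # drop the extra axis first: shape (3,)

print(col.shape, vec.shape)
print(diff.shape)
print(fixed)
```

This is the classic way to end up with a wrong (but error-free) loss when predictions have shape (n, 1) and targets have shape (n,).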
L2 tries to make things zero… That’s kind of true, but if you’ve got two things that are highly correlated, then L2 regularization will move them both down together. It won’t make one of them zero and one of them nonzero. So L1 regularization actually has the property that it will try to make as many things zero as possible, whereas L2 regularization tends to make everything smaller
Random forest similarity: computes similarity between instances based on how out-of-bag instances are classified. If two out-of-bag cases end up in the same tree leaf, the proximity between them is incremented