Deep Learning, lesson 1
Sigmoid
Usually used in the last layer, for example because it squashes the output into (0, 1) and so gives a strong confidence value. Useful for the optimisation.
Loss function
Cross entropy and NLL (Negative Log Likelihood); NLL is the one used for regression here.
Quiz:
- Why MSE, if sigma is constant? Write the density of the Gaussian distribution, take the negative log, then simplify and remove the terms that only depend on the constant variance (should do this on paper, looks pretty important for intuition).
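A sketch of the quiz derivation, to be checked on paper. The notation is my own assumption, not from the slides: targets y_i, network prediction f_\theta(x_i) used as the Gaussian mean, N training points, and a fixed \sigma.

```latex
\begin{aligned}
p(y_i \mid x_i; \theta) &= \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{\bigl(y_i - f_\theta(x_i)\bigr)^2}{2\sigma^2}\right) \\
-\sum_{i=1}^{N} \log p(y_i \mid x_i; \theta) &= \frac{N}{2}\log\bigl(2\pi\sigma^2\bigr)
  + \frac{1}{2\sigma^2}\sum_{i=1}^{N} \bigl(y_i - f_\theta(x_i)\bigr)^2
\end{aligned}
```

With \sigma constant, the first term does not depend on \theta and 1/(2\sigma^2) is just a positive scale, so minimising the NLL gives the same \theta as minimising the sum of squared errors, i.e. the MSE.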
Classification problems
Softmax is used in the last layer. But if an input belongs to none of the classes, should we give it a random class? We don’t really want to force the output, but we need to know where the network is not confident. There is a paper on how ReLU networks fail to express non-confidence; look at the slides.
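A minimal sketch of why the softmax output is "forced": the probabilities always sum to 1, so taking the argmax always returns a class, even when the network is unsure. The logits below are made-up values for illustration, not from the lecture.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: one confident case, one where the network is unsure.
confident = softmax([8.0, 0.5, -1.0])
unsure = softmax([0.1, 0.0, -0.1])

print(confident)  # mass concentrated on class 0
print(unsure)     # close to uniform, yet argmax still returns a class
```

Looking at the whole probability vector (or its entropy) rather than only the argmax is one simple way to see where the network is not confident.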
Generalization
Link to double descent?? Grokking
Regularization
Weight decay: make our parameters smaller with L2 regularisation?
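A minimal sketch of the link between L2 regularisation and weight decay, assuming plain SGD. The function name, learning rate and decay coefficient are hypothetical choices for illustration.

```python
def sgd_step_with_l2(weights, grads, lr=0.1, weight_decay=0.01):
    """One SGD step on loss + (weight_decay / 2) * sum(w**2)."""
    # The L2 penalty adds weight_decay * w to each gradient,
    # which pulls every parameter towards zero.
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

# With a zero data gradient, only the decay term acts on the weights.
w = [1.0, -2.0]
for _ in range(100):
    w = sgd_step_with_l2(w, grads=[0.0, 0.0])
print(w)  # both entries have shrunk in magnitude
```

Each step multiplies the weights by (1 - lr * weight_decay), which is where the name "decay" comes from.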
Normalisation??? Why? Because the activation function is more responsive when its inputs are between 0 and 1, or something of the kind.
Optimization for deep learning
Large scale optimisation paper to read.
Identifiable model? The data can rule out every set of parameters but one? Which means only one minimum. But in a NN we can change the order of the neurons and still get exactly the same outputs, thus most models are non-identifiable. Most models have many local minima, but they are pretty similar.
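A small sketch of the neuron-reordering argument, using a one-hidden-layer network with made-up parameters. Permuting the hidden neurons gives a different point in parameter space with the same output.

```python
import math

def mlp(x, W1, b1, w2):
    """One hidden layer with tanh, scalar output. W1 is a list of rows."""
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(v * h for v, h in zip(w2, hidden))

# Hypothetical parameters for 2 inputs and 3 hidden neurons.
W1 = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]]
b1 = [0.1, -0.2, 0.0]
w2 = [1.0, -2.0, 0.5]

# Reorder the hidden neurons: permute rows of W1, b1 and entries of w2 together.
perm = [2, 0, 1]
W1_p = [W1[i] for i in perm]
b1_p = [b1[i] for i in perm]
w2_p = [w2[i] for i in perm]

x = [0.3, -0.4]
out_original = mlp(x, W1, b1, w2)
out_permuted = mlp(x, W1_p, b1_p, w2_p)
print(out_original, out_permuted)  # equal up to floating-point rounding
```

So every minimum comes with permuted copies of itself that have the same loss.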
Identifying and attacking the saddle point problem
Back propagation
Follow through the example using the chain rule, partial derivatives and all.
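This is not the lecture's example, just a minimal sketch of the same idea: a tiny hypothetical network with one weight per layer, gradients written out with the chain rule, then checked against finite differences.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_backward(x, y, w1, w2):
    """Tiny network: h = sigmoid(w1 * x), y_hat = w2 * h, loss = (y_hat - y)^2."""
    # Forward pass, keeping the intermediate values.
    z = w1 * x
    h = sigmoid(z)
    y_hat = w2 * h
    loss = (y_hat - y) ** 2

    # Backward pass: chain rule from the loss to each parameter.
    dloss_dyhat = 2.0 * (y_hat - y)
    dloss_dw2 = dloss_dyhat * h          # d y_hat / d w2 = h
    dloss_dh = dloss_dyhat * w2          # d y_hat / d h = w2
    dloss_dz = dloss_dh * h * (1.0 - h)  # sigmoid'(z) = h * (1 - h)
    dloss_dw1 = dloss_dz * x             # d z / d w1 = x
    return loss, dloss_dw1, dloss_dw2

x, y, w1, w2 = 1.5, 0.5, 0.8, -1.2
loss, g1, g2 = forward_backward(x, y, w1, w2)

# Finite-difference check of the analytic gradients.
eps = 1e-6
num_g1 = (forward_backward(x, y, w1 + eps, w2)[0]
          - forward_backward(x, y, w1 - eps, w2)[0]) / (2 * eps)
num_g2 = (forward_backward(x, y, w1, w2 + eps)[0]
          - forward_backward(x, y, w1, w2 - eps)[0]) / (2 * eps)
print(g1, num_g1)
print(g2, num_g2)
```

The backward pass reuses the values stored during the forward pass (h in particular), which is the whole point of back propagation.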
Weight Initializer
Xavier
Uniform with a particular distribution?? Add this to @TODO ANKI. Goal: same activation variance and same gradient variance across all layers.
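For the Anki card, a sketch of the uniform variant as I understand it. Weights are drawn from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)), which gives a variance of 2 / (fan_in + fan_out). The layer sizes and helper name below are arbitrary.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in weight matrix from U(-a, a), a = sqrt(6 / (fan_in + fan_out))."""
    a = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)  # fixed seed, only so the sketch is reproducible
fan_in, fan_out = 256, 128
W = xavier_uniform(fan_in, fan_out, rng)

flat = [w for row in W for w in row]
second_moment = sum(w * w for w in flat) / len(flat)  # estimates the variance, since the mean is 0
target_var = 2.0 / (fan_in + fan_out)                 # variance of U(-a, a) is a^2 / 3
print(second_moment, target_var)
```

The bound depends on both fan_in and fan_out because it is a compromise between keeping the activation variance (forward pass) and the gradient variance (backward pass) constant.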
Batch normalisation
Normalises the distribution of the input features of each layer. Prevents exploding gradients and is more resilient to scaling, but why?
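A partial answer to the scaling question, as a sketch of the training-time forward pass for a single feature. The gamma, beta and eps names follow the usual convention and the values are defaults I picked. Rescaling the inputs changes the mean and standard deviation by the same factor, so the normalised output barely moves.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n  # biased variance over the batch
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]

# Rescaling the inputs by 1000 gives (almost) the same normalised output.
activations = [0.2, 1.5, -0.7, 3.1]
out_small = batch_norm(activations)
out_large = batch_norm([1000.0 * v for v in activations])
print(out_small)
print(out_large)
```

This only illustrates the resilience to scaling; it does not by itself explain the effect on exploding gradients.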
Data Augmentation
Interpolate two inputs together. Use the result as an input to the network, and modify the cross entropy to work with it (mixup, CutMix). This is real data augmentation: it creates inputs outside of the distribution (they cannot exist!) by artificial modification.
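A minimal sketch of mixup on toy vectors: the inputs and the one-hot labels are interpolated with the same weight, and the cross entropy is taken against the resulting soft label. The alpha value, the inputs and the "network output" are all made up.

```python
import math
import random

def mixup(x1, y1, x2, y2, lam):
    """Interpolate two inputs and their one-hot labels with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y

def soft_cross_entropy(probs, soft_target):
    """Cross entropy against a soft (non one-hot) target distribution."""
    return -sum(t * math.log(p) for t, p in zip(soft_target, probs) if t > 0.0)

rng = random.Random(0)
lam = rng.betavariate(0.4, 0.4)  # Beta(alpha, alpha); alpha = 0.4 is an arbitrary choice here

x_mix, y_mix = mixup([1.0, 0.0, 2.0], [1.0, 0.0], [0.0, 4.0, 2.0], [0.0, 1.0], lam)
probs = [0.7, 0.3]  # hypothetical network output for x_mix
loss = soft_cross_entropy(probs, y_mix)
print(lam, x_mix, y_mix, loss)
```

With one-hot labels this loss is the same as lam times the cross entropy for the first label plus (1 - lam) times the cross entropy for the second.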
Dropout
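Related to bagging and ensemble methods over the various models. You can consider all the variations of the network (obtained by dropping units) and train them with shared weights. Then, rather than selecting the best, the full network is used at test time as an approximate average of them.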
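A sketch of the "inverted dropout" variant, one common way to implement this (I am assuming it, the lecture may have used another form). Each forward pass samples a different sub-network, and the rescaling keeps the expected activation unchanged so the full network can be used as-is at test time.

```python
import random

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop, rescale the rest."""
    if not training:
        return list(activations)  # the full network is used at test time
    keep = 1.0 - p_drop
    # Dividing by keep preserves the expected value of each activation.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
h = [1.0] * 10000
dropped = dropout(h, p_drop=0.5, rng=rng)
print(sum(dropped) / len(dropped))               # close to 1.0 on average
print(dropout(h, 0.5, rng, training=False)[:3])  # unchanged at test time
```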