Decoding “ALEXNET” Architecture

In the ever-changing ecosystem of convolutional neural networks (CNNs), I recently read an interesting paper on the AlexNet architecture, and I decided to unpack what I learned from it. AlexNet is the name of a convolutional neural network designed by Alex Krizhevsky and published with Ilya Sutskever and Krizhevsky’s PhD advisor, Geoffrey Hinton.

AlexNet competed in the ImageNet Large Scale Visual Recognition Challenge in 2012. The Network achieved a top-5 error of 15.3%.
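To make the metric concrete: a prediction counts as a top-5 error when the true label is not among the five classes the network scores highest. Here is a minimal sketch of that calculation; the function name and the toy scores are made up for illustration and do not come from the paper.

```python
def top5_error(scores_per_image, true_labels):
    """Fraction of images whose true label is not among the 5 highest-scoring classes."""
    misses = 0
    for scores, label in zip(scores_per_image, true_labels):
        # Class indices sorted by score, highest first; keep the first five.
        top5 = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:5]
        if label not in top5:
            misses += 1
    return misses / len(true_labels)


# Toy data (invented): 2 images, 8 classes each.
scores = [
    [0.05, 0.30, 0.10, 0.20, 0.15, 0.08, 0.07, 0.05],
    [0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.01, 0.01],
]
labels = [3, 7]  # the first label is inside its top 5, the second is not
print(top5_error(scores, labels))
```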

Key Points of this Architecture ->

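1. Non-Saturating Non-linearity: The architecture uses the ReLU activation function. According to the paper, deep convolutional neural networks with ReLUs (a non-saturating non-linearity) train several times faster than equivalent networks with tanh units (a saturating non-linearity). In the paper's CIFAR-10 experiment, a four-layer network with ReLUs reached 25% training error six times faster than the same network with tanh.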

2. Training on Multiple GPUs: The network was trained on two GTX 580 GPUs with 3 GB of memory each, using a cross-GPU parallelization scheme. The scheme puts half of the kernels (neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers, which keeps training fast. Compared with a network that has half as many kernels in each convolutional layer and is trained on one GPU, this scheme reduces the top-1 and top-5 error rates by 1.7% and 1.2% respectively.

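3. Local Response Normalization: ReLUs do not need input normalization to keep them from saturating, but the paper reports that a local normalization scheme still helps generalization. The scheme implements a form of the neurobiological concept called “lateral inhibition”: each activation is divided by a term that grows with the squared activations of neighbouring kernel maps at the same spatial position, so strongly activated neurons dampen their neighbours. This scheme reduces top-1 and top-5 error rates by 1.4% and 1.2% respectively.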

4. Overlapping Pooling: Pooling layers in CNNs summarize the outputs of neighbouring groups of neurons in the same kernel map. Pooling is overlapping when the stride between pooling windows is smaller than the window size. The paper observes that models trained with overlapping pooling are slightly harder to overfit. This scheme reduces top-1 and top-5 error rates by 0.4% and 0.3% respectively compared with non-overlapping pooling.

5. Reducing Overfitting by Dropout: The paper uses a popular technique called “Dropout”. It consists of setting to zero the output of each hidden neuron with probability 0.5. The neurons which are “dropped out” in this way do not contribute to the forward pass and do not participate in backpropagation. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of any particular other neuron. Each neuron is therefore forced to learn more robust features that are useful in conjunction with many different random subsets of other neurons.

Dropout is used in the first two fully-connected layers. Without dropout, the network exhibits substantial overfitting.
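For key point 1, it can help to see what "saturating" means numerically. The sketch below compares the gradient of ReLU with the gradient of tanh at a few inputs; the helper names are my own, and this only illustrates the saturation effect, not the training-speed measurement from the paper.

```python
import math


def relu_grad(x):
    """Derivative of max(0, x): stays at 1 for every positive input."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh(x): 1 - tanh(x)^2, which shrinks towards 0 as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


for x in (0.5, 2.0, 5.0):
    print(f"x={x}: relu'={relu_grad(x)}, tanh'={tanh_grad(x):.6f}")
```

For large positive inputs the tanh gradient becomes tiny, so weight updates slow down, while the ReLU gradient does not shrink.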
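For key point 3, here is a minimal sketch of local response normalization at a single spatial position. Each activation is divided by (k + alpha * sum of squared activations over n adjacent kernel maps) raised to the power beta. The constants k = 2, n = 5, alpha = 1e-4 and beta = 0.75 are the values reported in the paper, not in this article, and the function name is hypothetical.

```python
def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize activations across kernel maps at one spatial position.

    a: list with one activation per kernel map (after ReLU).
    """
    num_maps = len(a)
    half = n // 2
    out = []
    for i in range(num_maps):
        # Neighbouring kernel maps, clipped at the first and last map.
        lo = max(0, i - half)
        hi = min(num_maps - 1, i + half)
        sum_sq = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * sum_sq) ** beta)
    return out


# A strongly activated neighbour lowers the normalized output of map 0.
quiet = local_response_norm([1.0, 0.0, 0.0])
loud = local_response_norm([1.0, 100.0, 0.0])
print(quiet[0], loud[0])
```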
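For key point 4, a one-dimensional toy example shows the difference between overlapping and non-overlapping pooling. The paper uses a window of size 3 with stride 2; the 1-D simplification, the function name, and the sample row are my own assumptions for illustration.

```python
def max_pool_1d(values, size, stride):
    """Max over each window of `size` elements, moving `stride` elements at a time."""
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, stride)
    ]


row = [1, 5, 2, 3, 9, 1, 4]

# Overlapping: window 3, stride 2, so adjacent windows share one element.
overlapping = max_pool_1d(row, size=3, stride=2)

# Non-overlapping: window 2, stride 2, so every element is in at most one window.
non_overlapping = max_pool_1d(row, size=2, stride=2)

print(overlapping, non_overlapping)
```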
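The dropout idea in key point 5 can be sketched in a few lines. During training each output is zeroed with probability 0.5; at test time the paper keeps all neurons and multiplies their outputs by 0.5 instead. The function below is a simplified illustration on a plain list, with a hypothetical name and signature, not the code used in the paper.

```python
import random


def dropout(activations, rng, p_drop=0.5, training=True):
    """Dropout as described in the paper, applied to a list of activations."""
    if training:
        # Each neuron is dropped independently with probability p_drop.
        return [0.0 if rng.random() < p_drop else a for a in activations]
    # Test time: keep every neuron, scale outputs by the keep probability.
    return [a * (1.0 - p_drop) for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
hidden = [1.0, 2.0, 3.0, 4.0]
print(dropout(hidden, rng))                  # some entries zeroed
print(dropout(hidden, rng, training=False))  # all entries halved
```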

AlexNet architecture (from the original paper)

Overall ALEXNET Architecture:

The network consists of 8 layers with weights: the first 5 are convolutional layers and the remaining 3 are fully-connected layers. The kernels of the 2nd, 4th and 5th convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU. The kernels of the 3rd convolutional layer are connected to all the kernel maps in the 2nd layer. The neurons in the fully-connected layers are connected to all the neurons in the previous layer. Response normalization layers follow the 1st and 2nd convolutional layers. Max-pooling layers follow both response normalization layers as well as the 5th convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.
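The connectivity rules above determine how deep each kernel is: a layer that only sees kernel maps on its own GPU gets half of the previous layer's maps as input, while a layer that sees both GPUs gets all of them. Here is a small sketch of that bookkeeping, assuming the per-layer kernel counts 96, 256, 384, 384 and 256 from the paper and an RGB input; the constant and function names are hypothetical.

```python
KERNELS = [96, 256, 384, 384, 256]  # kernels in convolutional layers 1..5
SAME_GPU_ONLY = {2, 4, 5}           # layers that only see maps on their own GPU
NUM_GPUS = 2


def kernel_depth(layer, input_channels=3):
    """Number of input channels seen by one kernel of the given conv layer (1-based)."""
    if layer == 1:
        return input_channels  # the first layer sees the RGB image
    previous_maps = KERNELS[layer - 2]
    if layer in SAME_GPU_ONLY:
        return previous_maps // NUM_GPUS  # only the half that lives on the same GPU
    return previous_maps  # layer 3 sees the maps from both GPUs


depths = [kernel_depth(layer) for layer in range(1, 6)]
print(depths)  # [3, 48, 256, 192, 192]
```

These depths are the third dimension of the kernel sizes quoted for each layer, for example 5 x 5 x 48 for the 2nd layer.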

The 1st convolutional layer filters the 224 x 224 x 3 input image with 96 kernels of size 11 x 11 x 3 with a stride of 4 pixels. The 2nd convolutional layer takes the output of the 1st convolutional layer and filters it with 256 kernels of size 5 x 5 x 48. The 3rd, 4th and 5th convolutional layers are connected to one another without intervening pooling or normalization layers. The 3rd convolutional layer has 384 kernels of size 3 x 3 x 256. The 4th convolutional layer has 384 kernels of size 3 x 3 x 192, and the 5th convolutional layer has 256 kernels of size 3 x 3 x 192. The fully-connected layers have 4096 neurons each.

The model is trained using stochastic gradient descent with a batch size of 128, momentum of 0.9 and weight decay of 0.0005.
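To show how momentum and weight decay enter the training step, here is a sketch of the update rule from the paper for a single scalar weight: the velocity keeps 0.9 of its previous value, then subtracts the weight-decay term and the gradient term, and the weight moves by the new velocity. The learning rate of 0.01 is the paper's initial value and is not stated in this article, so treat it as an assumption; the function name is hypothetical.

```python
def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and weight decay for a single weight.

    w: current weight, v: current velocity, grad: batch-averaged gradient of the loss.
    lr=0.01 is an assumed value (the paper's initial learning rate).
    """
    v_new = momentum * v - weight_decay * lr * w - lr * grad
    w_new = w + v_new
    return w_new, v_new


w, v = 1.0, 0.0
w, v = sgd_step(w, v, grad=0.5)
print(w, v)
```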

The implementation of this architecture can be found here.

Note: I have implemented this architecture using Keras, but with non-overlapping pooling instead of overlapping pooling.