The vanishing gradient problem in backpropagation of an Artificial Neural Network
- In backpropagation, we use the ReLU (Rectified Linear Unit) activation function more than the Sigmoid function, but why?
- Before I explain, I hope you know that the derivative of the sigmoid function ranges between 0 and 0.25 (see the sketch right below).
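Here is a minimal NumPy sketch (my own illustration, not tied to any framework) that checks this: the derivative is sigmoid(x) × (1 - sigmoid(x)), and its maximum is 0.25, reached at x = 0.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))

x = np.linspace(-10, 10, 1000)
print(sigmoid_derivative(x).max())  # ~0.25, reached around x = 0
```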
So in a multi-hidden-layer neural network, when we use the sigmoid activation and adjust the weight of a neuron (formula: Wnew = Wold - learning rate × derivative of the cost function), we use the chain rule of derivatives, because one neuron's weight update depends on the neurons after it. Say we chain the derivatives through 3 neurons and each derivative is 0.15. Multiplying those 3 values with the learning rate (let's assume LR is 0.3) we get 0.3 × 0.15 × 0.15 × 0.15 ≈ 0.0010!
And that is just for 3 neurons. Imagine there are 1000 of them: the product becomes almost 0, and subtracting 0 from the old weight doesn't change anything! Hence the gradient (the derivative of the cost function) has effectively vanished.
This is called the vanishing gradient problem.
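To see how fast this chain-rule product shrinks with depth, here is a rough sketch using the same assumed numbers from the example above (a 0.15 per-layer derivative and a 0.3 learning rate):

```python
# How the chain-rule product shrinks as the network gets deeper.
learning_rate = 0.3
per_layer_derivative = 0.15  # assumed value from the example above

for depth in (3, 10, 100):
    gradient = per_layer_derivative ** depth
    update = learning_rate * gradient
    print(f"depth={depth:>3}: weight update ~ {update:.3e}")

# depth=  3: weight update ~ 1.012e-03
# depth= 10: weight update ~ 1.730e-09
# depth=100: weight update ~ 1.2e-83  (effectively zero)
```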
So, in a multi-layer Artificial Neural Network, use the "ReLU" activation function instead; it doesn't have this problem! It does have some other issues, like "dead neurons", which I'll talk about in a future post.
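For comparison, here is a small NumPy sketch (again, just my own illustration) of ReLU and its derivative. For positive inputs the derivative is exactly 1, so the chain-rule product doesn't shrink the way it does with sigmoid; the zero derivative for negative inputs is where the "dead neuron" issue comes from.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    # ReLU'(x) is 1 for positive inputs and 0 otherwise, so for active
    # neurons the chain-rule product stays at 1 instead of shrinking.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.5, 3.0])
print(relu(x))             # [0.  0.  0.5 3. ]
print(relu_derivative(x))  # [0. 0. 1. 1.]
```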
Happy Learning
#deeplearning #artificialintelligence #machinelearning