
Activation Functions (Neural Networks)

Activation functions let a neural network learn complicated, non-linear mappings between its inputs and the response variable. They introduce non-linearity into the network: each one converts the input signal of a node in an artificial neural network (ANN) into an output signal, and that output is then used as an input to the next layer in the stack.
Specifically, in an ANN we take the sum of the products of the inputs (X) and their corresponding weights (W), add a bias, apply an activation function f(x) to the result to get the output of that layer, and feed it as input to the next layer.
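As a rough illustration of the computation described above, here is a minimal sketch of a single neuron in plain Python. The function name `neuron_output` and the input, weight, and bias values are hypothetical, chosen only for this example.

```python
import math


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)


# Hypothetical values, used only for illustration.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

# The weighted sum is 0.2 - 0.3 - 0.4 + 0.1 = -0.4, which tanh then squashes.
out = neuron_output(inputs, weights, bias, math.tanh)
print(out)
```

Any of the activation functions discussed in this post could be passed in place of `math.tanh`.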
Figure: a neuron. “Input times weights, add bias, and activate.”
In Keras, we can use a different activation function for each layer. That means we have to decide which activation function to use in the hidden layers and which in the output layer.
Activations can be used either through an Activation layer or through the activation argument supported by all forward layers:
from keras import backend as K  # backend module, as exposed in Keras 2.x
from keras.layers import Activation, Dense
from keras.models import Sequential

model = Sequential()
model.add(Dense(64))
model.add(Activation('tanh'))
This is equivalent to:
model.add(Dense(64, activation='tanh'))
You can also pass an element-wise TensorFlow/Theano/CNTK function as an activation:
model.add(Dense(64, activation=K.tanh))

Step Function

Activation function: A = “activated” if Y > threshold, otherwise “not activated”.
Pros
  • Simple to understand
Cons
  • It can't handle multiple classes.
  • It can't give a graded output such as 20% or 30% activation; the output is strictly on or off.
Conclusion
Use a different activation function in the hidden layers; the step function is only an option for the final layer, where a yes/no decision is wanted.
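The thresholding rule above can be sketched in a few lines of Python. The function name `step` and the default threshold of 0 are assumptions made for this illustration.

```python
def step(y, threshold=0.0):
    """Return 1 ("activated") when y exceeds the threshold, otherwise 0."""
    return 1 if y > threshold else 0


outputs = [step(y) for y in (-2.0, 0.0, 0.3, 5.0)]
print(outputs)

# Both 0.3 and 5.0 map to the same value, so the output says nothing
# about how strongly the neuron was driven.
```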

Linear Function

A straight-line function where the activation is proportional to the input (which is the weighted sum from the neuron).
Pros
  • It gives a range of activations, so it is not binary activation.
  • We can connect a few neurons together and, if more than one fires, take the max (or softmax) and decide based on that.
Cons
  • For this function, the derivative is a constant. That means the gradient has no relationship with X.
  • Because the gradient is constant, gradient descent always moves along that same constant gradient.
  • If there is an error in prediction, the change made by backpropagation is constant and does not depend on the change in input, delta(x).
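To make the constant-gradient point concrete, the sketch below estimates the derivative of a linear activation numerically at several inputs. The slope `c=2.0` and the helper `numerical_gradient` are hypothetical choices for this example, not part of any library.

```python
def linear(x, c=2.0):
    """Linear activation: output is proportional to the input."""
    return c * x


def numerical_gradient(f, x, h=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


# The estimated gradient is (approximately) the same slope c at every input.
grads = [numerical_gradient(linear, x) for x in (-10.0, 0.0, 3.0, 100.0)]
print(grads)
```

Whatever the input, the estimate stays close to the slope, which is why the gradient carries no information about X.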


Sigmoid Function

It is an activation function of the form f(x) = 1 / (1 + exp(-x)). Its range is between 0 and 1, and it is an S-shaped curve.
Pros
  • It is nonlinear in nature. Combinations of this function are also nonlinear!
  • It gives an analog activation, unlike the step function.
  • It has a smooth gradient too.
  • It’s good for a classifier.
  • The output of the activation function is always in the range (0, 1), compared to (-inf, inf) for the linear function. So our activations are bounded in a range. Nice, it won’t blow up the activations then.
Cons
  • Towards either end of the sigmoid function, the Y values respond very little to changes in X.
  • This gives rise to the problem of “vanishing gradients”.
  • Its output isn’t zero-centered (0 < output < 1). This makes the gradient updates go too far in different directions and makes optimization harder.
  • Sigmoids saturate and kill gradients.
  • The network refuses to learn further or becomes drastically slow (depending on the use case, and until the gradient computation hits floating-point limits).
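The saturation described in the list above can be seen by evaluating the sigmoid and its derivative, f'(x) = f(x)(1 - f(x)), at a few points. This is a minimal sketch in plain Python; the two-branch form of `sigmoid` is an implementation choice made here to avoid overflow for very negative inputs.

```python
import math


def sigmoid(x):
    """f(x) = 1 / (1 + exp(-x)), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_grad(x):
    """Derivative of the sigmoid: f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at x = 0 and shrinks quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.2e}")
```

The gradient is at most 0.25, so multiplying several of these together during backpropagation through many layers shrinks the signal further.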

Tanh (Hyperbolic Tangent function)

A better version of sigmoid in many cases, because of its range.
Its mathematical formula is f(x) = (1 - exp(-2x)) / (1 + exp(-2x)). Its output is zero-centered because its range is between -1 and 1, i.e. -1 < output < 1. This makes optimization easier, so in practice it is usually preferred over the sigmoid function. But it still suffers from the vanishing gradient problem.
Deciding between sigmoid and tanh will depend on your requirement for gradient strength.
Pros
  • The gradient is stronger for tanh than for sigmoid (its derivative is steeper).
Cons
  • Tanh also has the vanishing gradient problem.
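As a quick check of both points, the sketch below compares the derivatives of tanh and sigmoid at the origin and far from it. The helper names are hypothetical, and `tanh_from_formula` uses the exponential form directly, so it is only meant for moderate inputs (it would overflow for very negative x).

```python
import math


def tanh_from_formula(x):
    """f(x) = (1 - exp(-2x)) / (1 + exp(-2x)); fine for moderate x only."""
    e = math.exp(-2.0 * x)
    return (1.0 - e) / (1.0 + e)


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


# At the origin tanh has a steeper slope than sigmoid ...
print(tanh_grad(0.0), sigmoid_grad(0.0))
# ... but far from the origin both gradients become very small.
print(tanh_grad(5.0), sigmoid_grad(5.0))
```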

ReLU (Rectified Linear Unit)

It has become very popular in the past couple of years. One widely cited experiment (Krizhevsky et al., 2012) reported that a network using ReLU converged about 6 times faster than an equivalent network using tanh. It’s just R(x) = max(0, x), i.e. if x < 0, R(x) = 0, and if x >= 0, R(x) = x.

Pros

  • It mitigates the vanishing gradient problem: for positive inputs the gradient is exactly 1, so it does not shrink as it passes through the activation.
  • ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations.
Cons
  • One of its limitations is that it should only be used within the hidden layers of a neural network model.
  • Some gradients can be fragile during training and can die. A weight update can leave a neuron in a state where it never activates on any data point again. Put simply, ReLU can result in dead neurons.
  • In other words, for activations in the region x < 0 of ReLU, the gradient is 0, so the weights are not adjusted during descent. Neurons that go into that state stop responding to variations in error or input (simply because the gradient is 0, nothing changes). This is called the dying ReLU problem.
  • The range of ReLU is [0, inf). This means it can blow up the activation.
There are variations of ReLU that mitigate the dying ReLU problem by replacing the horizontal line with a non-horizontal component. For example, y = 0.01x for x < 0 makes it a slightly inclined line rather than a horizontal one. This is Leaky ReLU. There are other variations too. The main idea is to keep the gradient non-zero so that the neuron can eventually recover during training.
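The difference between the two can be sketched in plain Python. The function names and the default slope `alpha=0.01` follow the y = 0.01x example above; treating the gradient at exactly x = 0 as 0 for ReLU and 1 for Leaky ReLU is a convention assumed here.

```python
def relu(x):
    """R(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, alpha=0.01):
    """Like ReLU, but with a small slope alpha for negative inputs."""
    return x if x >= 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Gradient of Leaky ReLU: 1 for x >= 0, alpha otherwise (never zero)."""
    return 1.0 if x >= 0 else alpha


# For a negative input, ReLU passes back no gradient at all,
# while Leaky ReLU still passes back a small one.
print(relu(-3.0), relu_grad(-3.0))
print(leaky_relu(-3.0), leaky_relu_grad(-3.0))
```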
