Gradient Descent, all you need to know!!

Gradient Descent explained with a Linear Regression example.

Kishan Modasiya
7 min read · Nov 23, 2022
Image from Unsplash

What is it?

Gradient Descent is an optimization algorithm capable of finding optimal solutions to a wide range of problems.

The main function of Gradient Descent is to minimize a cost function.

This definition is complicated to understand, right?! Okay, let’s make it simple and understand it with an example.

Suppose we have data with 2 variables, x and y, where x is the independent variable and y is the dependent variable. Our aim is to predict y for a given x. Suppose the data looks something like what is shown below. Here we are considering the Linear Regression problem, so we want to find the line that lets us predict y for any x.

Linear data
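
Just to make this concrete, here is a tiny sketch of what such noisy linear data could look like in code (assuming NumPy; the slope 2.5, the intercept 7.0 and the noise level are made-up numbers):

```python
import numpy as np

# Hypothetical noisy linear data: y depends roughly linearly on x.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)            # independent variable
y = 2.5 * x + 7.0 + rng.normal(0, 2, 100)   # dependent variable = 2.5x + 7 + noise
```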

If you don’t know what Linear Regression is, check it out here:

As you all know, the equation of a line is y = mx + c. Here, m is the slope of the line and c is the y-intercept. For Linear Regression, the hypothesis function is the same as this equation, as shown below.

hθ(x) = θ₀ + θ₁x
Hypothesis function for Linear Regression

Here θ₀ and θ₁ are called weights. We have to find these weights to get the best-fit line for the data.

But the question is: how do we get this line so that it fits the data? First we have to find the error, meaning the difference between the actual output y and the predicted output h(x), which is calculated by our hypothesis function with some random weights. For that we need a cost function which calculates this error. For Linear Regression, the cost function is defined as follows:

J(θ₀, θ₁) = (1/2m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
Cost function for Linear Regression

If you look closely, you’ll find that this cost function simply takes the difference between the actual and predicted values, squares it, and then averages it over all the data. So basically it measures how well our line fits the data. If the value of the cost function is close to 0, the fit is good; but if it is a large value, we have to change the weights of the hypothesis function and try again.
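
Here is a minimal Python sketch of that calculation (assuming NumPy; the toy x and y values are made up, and the 1/2m factor follows the convention shown above):

```python
import numpy as np

def hypothesis(x, theta0, theta1):
    # h(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

def cost(x, y, theta0, theta1):
    # Average of squared differences between predicted and actual values
    # (the extra 1/2 is a convention that simplifies the derivative later).
    m = len(x)
    errors = hypothesis(x, theta0, theta1) - y
    return np.sum(errors ** 2) / (2 * m)

# Toy data that roughly follows y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
print(cost(x, y, 0.0, 0.0))  # bad weights -> large cost (~20)
print(cost(x, y, 1.0, 2.0))  # weights near the true line -> cost close to 0
```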

So do we have to try out different weights randomly to get the best fit line to the data?? 🤔🤔

No no, relax. That’s when Gradient Descent comes to help.

Basically, our aim is to make the value of the cost function as low as possible.

Here, the cost function is a function of θ₀ and θ₁. If you know some differentiation, then you can interpret it easily. Let’s understand that part first.

For example, suppose we have a variable X and a function f(X) of it. Now we want the value of X at which f(X) has its minimum value.

As we all know, the derivative of a function at a point gives the value of the slope at that point. So at whichever point this derivative becomes 0, the slope is 0, which means the tangent line there is horizontal.

Look at the image shown below:

Minimizing a function with differentiation

Here you can see that at the value x′, the slope becomes horizontal. It means that at that point the value of f(X) is minimum; you can see that clearly in the image. At that point, the derivative of f(X) is 0. So what we are doing here is finding a minimum of the graph. (It can be a local minimum or the global minimum.)
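
For a quick, made-up example: take f(X) = (X − 3)². Its derivative is f′(X) = 2(X − 3), and setting 2(X − 3) = 0 gives X = 3. That is exactly the point x′ in the picture: the slope there is 0, and f(X) takes its minimum value, 0.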

Now let’s get back to Gradient Descent.

We have θ₀ and θ₁ as parameters and a cost function, which is a function of θ₀ and θ₁. We want to minimize the cost function so that we can get the best-fit line for our model. As explained in the example above, we have to take the derivative of the cost function, and the point at which it becomes 0 gives us the values of θ₀ and θ₁ for the best-fit line. In other words, we want to find a minimum.

So we have 2 parameters and a cost function. If we plot them on a graph, it will be 3D, as shown below. Look at the image and you’ll get a better understanding.

Gradient Descent for 2 parameters

Here we have θ₀, θ₁ and J(θ₀, θ₁) in the graph. Now take random values of θ₀ and θ₁ and calculate the cost function for them. Plot that value on the graph; suppose it is the point A shown there. We want to move to the minimum, point A′. You can see that at point A′ the slope becomes 0, and for those θ₀, θ₁ the cost function has its lowest value, which means we have the best-fit line for our data.

Now the question is: how do we reach that point?

The procedure is: first, calculate the slope at the current point. After that, step down in that direction. Then repeat these steps until the slope becomes 0.

For Example, suppose you are lost in the mountains in a dense fog, you can only feel the slope of the ground below your feet. A good strategy to get to the bottom quickly is to go downhill in the direction of the steepest slope.

Let’s see the equation of Gradient Descent, then it’ll be easy for you to understand.

θⱼ := θⱼ − α · ∂J(θ₀, θ₁)/∂θⱼ   (for j = 0 and j = 1)
Equation for Gradient Descent

I think that, other than α (alpha), you already know all the parameters. This α is called the learning rate.

The learning rate (also referred to as step size, or simply alpha) is the size of the steps taken to reach the minimum. It is typically a small value, and it is evaluated and adjusted based on the behavior of the cost function.

So now, if you look at the equation, you can understand how it works. For each θ, it first calculates the slope, then multiplies this slope by the learning rate so it becomes a small value, then subtracts this value from the original θ and replaces θ with the result. This process repeats until a minimum is found, i.e., until it converges. Basically, it moves θ toward a local or the global minimum.
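
To make the update rule concrete, here is a minimal NumPy sketch of Gradient Descent for Linear Regression. It is only a sketch: the toy data, the learning rate alpha and the number of iterations are my own choices.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, n_iters=5000):
    m = len(x)
    theta0, theta1 = 0.0, 0.0                  # start from some initial weights
    for _ in range(n_iters):
        error = (theta0 + theta1 * x) - y      # h(x) - y for every example
        grad0 = np.sum(error) / m              # slope of J with respect to theta0
        grad1 = np.sum(error * x) / m          # slope of J with respect to theta1
        theta0 -= alpha * grad0                # theta := theta - alpha * slope
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0                              # data on the exact line y = 2x + 1
print(gradient_descent(x, y))                  # converges toward (1.0, 2.0)
```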

If α is small, then the algorithm will go through many iterations and take a lot of time.
If α is too high, you might jump across the valley; this can make the algorithm diverge and fail to find the best solution.

Gradient Descent with a small and a large learning rate
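
You can see both behaviours with a tiny, made-up example: Gradient Descent on f(x) = x², whose slope at any point is 2x.

```python
def descend(alpha, steps=20, x=5.0):
    # Repeatedly step against the slope of f(x) = x**2, which is 2*x.
    for _ in range(steps):
        x = x - alpha * (2 * x)
    return x

print(descend(alpha=0.01))  # too small: still far from the minimum after 20 steps
print(descend(alpha=0.1))   # reasonable: close to the minimum at x = 0
print(descend(alpha=1.1))   # too large: x overshoots and keeps growing (diverges)
```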

Types of Gradient Descent

There are 3 types of Gradient Descent (a short sketch contrasting them follows the list):

  1. Batch Gradient Descent
    It calculates the error for every example in the training set and updates the model parameters only after evaluating all of them.
    It’s computationally efficient and produces a stable error gradient and stable convergence, but it requires the entire training set in memory.
  2. Stochastic Gradient Descent
    It updates the parameters according to the gradient of the error with respect to a single training example at a time.
    It is faster than Batch Gradient Descent, and the frequent updates give a detailed picture of the rate of improvement, but they are also computationally more expensive.
  3. Mini-batch Gradient Descent
    It separates the training set into small batches and performs an update for each of these batches.
    It balances Batch & Stochastic Gradient Descent because it uses a combination of both.
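
Here is a rough sketch (my own example, assuming NumPy) showing that the three variants differ only in how much data they use for each parameter update:

```python
import numpy as np

def batch_step(theta, X, y, alpha):
    # Batch GD: gradient averaged over the ENTIRE training set, one update.
    grad = X.T @ (X @ theta - y) / len(y)
    return theta - alpha * grad

def stochastic_step(theta, X, y, alpha, rng):
    # Stochastic GD: one update from a SINGLE randomly chosen example.
    i = rng.integers(len(y))
    xi, yi = X[i:i+1], y[i:i+1]
    grad = xi.T @ (xi @ theta - yi)
    return theta - alpha * grad

def minibatch_step(theta, X, y, alpha, rng, batch_size=32):
    # Mini-batch GD: one update from a small random batch of examples.
    idx = rng.choice(len(y), size=min(batch_size, len(y)), replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ theta - yb) / len(yb)
    return theta - alpha * grad

# Toy usage with mini-batch updates on made-up linear data.
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(0, 10, 100)]    # column of 1s for theta0
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, 100)
theta = np.zeros(2)
for _ in range(1000):
    theta = minibatch_step(theta, X, y, alpha=0.01, rng=rng)
print(theta)  # moves toward [1.0, 2.0]
```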

Challenges with Gradient Descent

  • Local Minima and Plateau
    Not all cost functions look like a regular, smooth bowl. On the left side, the algorithm might get stuck at a local minimum. On the right side, it’ll take a long time to cross the plateau and reach the global minimum.
Local Minima and Plateau problem
  • Vanishing gradient
    This occurs when the gradient is too small. As we move backwards during backpropagation, the gradient keeps getting smaller, causing the earlier layers of the network to learn more slowly than the later layers. When this happens, the weight updates become so small that they are insignificant.
  • Exploding gradient
    This happens when the gradient is too large, creating an unstable model. In this case, the model weights will grow too large, and they will eventually be represented as NaN. One solution to this issue is to leverage a dimensionality reduction technique, which can help to minimize complexity within the model.

That’s it for Gradient Descent!

Thanks for reading it! If you like it, then give it a clap and share it.

Follow for more on Medium, I’ll share more Machine Learning stuff soon. Here’s my Twitter, follow, and connect with me there and feel free to DM.
