Linear regression and Adaptive Linear Neurons (Adalines) are closely related to each other. In fact, the Adaline algorithm is identical to linear regression except for a threshold function $\phi(\cdot)_T$ that converts the continuous output into a categorical class label:

$$\phi(z)_T = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$

where $z$ is the net input, which is computed as the sum of the input features x multiplied by the model weights w:

$$z = w_0 x_0 + w_1 x_1 + \dots + w_m x_m = \sum_{j=0}^{m} w_j x_j = \mathbf{w}^T \mathbf{x}$$

(Note that $w_0$ refers to the bias unit so that $x_0 = 1$.)

In the case of linear regression and Adaline, the activation function $\phi(\cdot)_A$ is simply the identity function so that $\phi(z)_A = z$.
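The relationship between the net input, the identity activation, and the threshold function can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the example weights and inputs, and the choice of class labels 1 and 0 with a cutoff at $z = 0$ are assumptions made for the example.

```python
def net_input(w, x):
    """Return z = w_0*x_0 + ... + w_m*x_m; x is expected to include x_0 = 1."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def activation(z):
    """Identity activation, shared by linear regression and Adaline."""
    return z


def threshold(z):
    """Adaline-only step: turn the continuous output into a class label.

    Assumption for this sketch: labels are 1 and 0, with the cutoff at z = 0.
    """
    return 1 if z >= 0 else 0


w = [0.5, -1.0, 2.0]  # hypothetical weights; w[0] is the bias unit
x = [1.0, 3.0, 1.0]   # x[0] = 1 pairs with the bias unit

z = net_input(w, x)           # 0.5*1 - 1.0*3 + 2.0*1 = -0.5
continuous = activation(z)    # what linear regression would output
label = threshold(continuous) # what Adaline would output
```

Linear regression stops at `continuous`, whereas Adaline applies `threshold` on top of it. The weights are learned from the continuous output in both cases.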

Figure: Linear regression vs. Adaline

Now, in order to learn the optimal model weights w, we need to define a cost function that we can optimize. Here, our cost function $J(\cdot)$ is the sum of squared errors (SSE), which we multiply by $\frac{1}{2}$ to make the derivation easier:

$$J(\mathbf{w}) = \frac{1}{2} \sum_i \big(y^{(i)} - \phi(z)_A^{(i)}\big)^2$$

where $y^{(i)}$ is the class label or target value of the $i$th training point $\mathbf{x}^{(i)}$.

(Note that the SSE cost function is both convex and differentiable.)
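As a rough sketch, the halved SSE cost can be computed directly from its definition. The function name `sse_cost` and the tiny dataset (generated from the hypothetical line $y = 1 + 2x$) are assumptions made for illustration only.

```python
def sse_cost(w, X, y):
    """Return J(w) = 1/2 * sum_i (y_i - z_i)^2, where z_i = w . x_i.

    The identity activation means the prediction equals the net input.
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        y_hat = sum(w_j * x_j for w_j, x_j in zip(w, x_i))
        total += (y_i - y_hat) ** 2
    return 0.5 * total


# Hypothetical toy data: the first column is x_0 = 1 (bias input).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]  # generated from y = 1 + 2x

cost_at_zero = sse_cost([0.0, 0.0], X, y)  # 0.5 * (1 + 9 + 25) = 17.5
cost_at_best = sse_cost([1.0, 2.0], X, y)  # perfect fit, so the cost is 0.0
```

The cost shrinks to zero at the weights that reproduce the data exactly, which is the minimum that gradient descent moves toward.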

In simple words, we can summarize the gradient descent learning as follows:

  1. Initialize the weights to 0 or small random numbers.
  2. For k epochs (passes over the training set)
    1. For each training sample $\mathbf{x}^{(i)}$
      • Compute the predicted output value $\hat{y}^{(i)}$
      • Compare $\hat{y}^{(i)}$ to the actual output $y^{(i)}$ and compute the “weight update” value
      • Add this value to the accumulated “weight update” value
    2. Update the weight coefficients by the accumulated “weight update” values

Which we can translate into a more mathematical notation:

  1. Initialize the weights to 0 or small random numbers.
  2. For k epochs
    1. For each training sample $\mathbf{x}^{(i)}$
      • $\hat{y}^{(i)} = \phi(z)_A^{(i)}$
      • $\Delta \mathbf{w}^{(i)} = \eta \big(y^{(i)} - \hat{y}^{(i)}\big) \mathbf{x}^{(i)}$ (where η is the learning rate);
      • $\Delta \mathbf{w} := \Delta \mathbf{w} + \Delta \mathbf{w}^{(i)}$
    2. $\mathbf{w} := \mathbf{w} + \Delta \mathbf{w}$
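The procedure above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: the function name `fit_gradient_descent`, the default learning rate and epoch count, and the toy data (generated from the hypothetical line $y = 1 + 2x$) are all choices made for the example.

```python
def fit_gradient_descent(X, y, eta=0.1, epochs=300):
    """Batch gradient descent for linear regression / Adaline (sketch).

    X: list of samples, each including x_0 = 1 for the bias unit.
    y: list of targets.
    eta, epochs: assumed defaults that happen to suit the toy data below.
    """
    w = [0.0] * len(X[0])  # step 1: initialize the weights to 0
    for _ in range(epochs):  # step 2: k passes over the training set
        delta_w = [0.0] * len(w)  # accumulated "weight update" for this epoch
        for x_i, y_i in zip(X, y):
            # Predicted output: identity activation of the net input.
            y_hat = sum(w_j * x_j for w_j, x_j in zip(w, x_i))
            error = y_i - y_hat
            # Add this sample's contribution eta * (y - y_hat) * x.
            for j, x_ij in enumerate(x_i):
                delta_w[j] += eta * error * x_ij
        # Apply the accumulated update once per epoch.
        w = [w_j + d_j for w_j, d_j in zip(w, delta_w)]
    return w


# Hypothetical toy data: the first column is x_0 = 1 (bias input).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]  # generated from y = 1 + 2x

w = fit_gradient_descent(X, y)  # approaches [1.0, 2.0]
```

Note that the weights change only after a full pass over the training set. Updating them inside the inner loop instead would turn this into stochastic gradient descent. If the learning rate is too large for a given dataset, the updates overshoot and the weights diverge.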

Performing this global weight update

$\mathbf{w} := \mathbf{w} + \Delta \mathbf{w}$,

can be understood as “updating the model weights by taking a step in the opposite direction of the cost gradient, scaled by the learning rate η”:

$$\Delta \mathbf{w} = -\eta \nabla J(\mathbf{w})$$

where the partial derivative with respect to each $w_j$ can be written as

$$\frac{\partial J}{\partial w_j} = -\sum_i \big(y^{(i)} - \phi(z)_A^{(i)}\big) x_j^{(i)}$$

To summarize: in order to use gradient descent to learn the model coefficients, we simply update the weights w by taking a step in the opposite direction of the gradient for each pass over the training set. That’s basically it. But how do we get to the equation

$$\frac{\partial J}{\partial w_j} = -\sum_i \big(y^{(i)} - \phi(z)_A^{(i)}\big) x_j^{(i)} \; ?$$

Let’s walk through the derivation step by step.

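$$\begin{aligned}
\frac{\partial J}{\partial w_j} &= \frac{\partial}{\partial w_j} \frac{1}{2} \sum_i \big(y^{(i)} - \phi(z)_A^{(i)}\big)^2 \\
&= \frac{1}{2} \sum_i \frac{\partial}{\partial w_j} \big(y^{(i)} - \phi(z)_A^{(i)}\big)^2 \\
&= \frac{1}{2} \sum_i 2 \big(y^{(i)} - \phi(z)_A^{(i)}\big) \frac{\partial}{\partial w_j} \big(y^{(i)} - \phi(z)_A^{(i)}\big) \\
&= \sum_i \big(y^{(i)} - \phi(z)_A^{(i)}\big) \frac{\partial}{\partial w_j} \Big(y^{(i)} - \sum_k w_k x_k^{(i)}\Big) \\
&= \sum_i \big(y^{(i)} - \phi(z)_A^{(i)}\big) \big(-x_j^{(i)}\big) \\
&= -\sum_i \big(y^{(i)} - \phi(z)_A^{(i)}\big) x_j^{(i)}
\end{aligned}$$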
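One way to sanity-check the derived gradient is to compare it against a finite-difference approximation of the cost function. The sketch below does this in plain Python. The function names, the toy data, the evaluation point `w`, and the step size `eps` are all assumptions made for illustration.

```python
def cost(w, X, y):
    """Halved SSE: J(w) = 1/2 * sum_i (y_i - w . x_i)^2."""
    return 0.5 * sum(
        (y_i - sum(w_j * x_j for w_j, x_j in zip(w, x_i))) ** 2
        for x_i, y_i in zip(X, y)
    )


def analytic_gradient(w, X, y):
    """dJ/dw_j = -sum_i (y_i - w . x_i) * x_ij, as in the derivation."""
    grad = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        error = y_i - sum(w_j * x_j for w_j, x_j in zip(w, x_i))
        for j, x_ij in enumerate(x_i):
            grad[j] -= error * x_ij
    return grad


def numerical_gradient(w, X, y, eps=1e-6):
    """Central-difference approximation of each partial derivative."""
    grad = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        grad.append((cost(w_plus, X, y) - cost(w_minus, X, y)) / (2 * eps))
    return grad


# Hypothetical toy data: the first column is x_0 = 1 (bias input).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
w = [0.5, -1.0]  # arbitrary point at which to compare the two gradients

g_analytic = analytic_gradient(w, X, y)    # [-10.5, -16.5]
g_numerical = numerical_gradient(w, X, y)  # should agree closely
```

Because the cost is quadratic in the weights, the central difference matches the analytic gradient up to floating-point rounding, so any sizable disagreement would point to a sign or indexing mistake in the derivation or the code.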