
Loss as a function of model weights
Well, remember when we said that we can also think of our loss value as a function of the model parameters? Consider this. Our loss value tells us how far our model's predictions are from the actual values. This same loss can therefore be redefined as a function of our model's weights (θ). Recall that these weights are what actually produce our model's predictions at each training iteration. Thinking about this intuitively, we want to be able to change our model weights with respect to the loss, so as to reduce our prediction errors as much as possible.
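To make this concrete, here is a minimal sketch (using plain NumPy, a made-up toy dataset, and a simple linear model, all chosen purely for illustration) of how, once the training data is fixed, the loss becomes a function of the weights alone:

```python
import numpy as np

# A made-up toy training set: four examples generated by y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

def predict(theta, x):
    """A simple linear model: y_hat = f(x; theta) = slope * x + intercept."""
    return theta[0] * x + theta[1]

def J(theta):
    """Mean squared error over the whole training set, viewed as a function of theta."""
    y_hat = predict(theta, x)
    return np.mean((y_hat - y) ** 2)

# Same data, different weights -> different loss values
print(J(np.array([0.0, 0.0])))  # poor weights: loss of 21.0
print(J(np.array([2.0, 1.0])))  # the ideal weights: loss of 0.0
```

Notice that the data never changes between the two calls; only the weights do, which is exactly what lets us treat the loss as a function of θ.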
More mathematically put, we want to iteratively update our model's weights so as to minimize the loss function, ideally converging to the best possible weights. These will be the best weights in the sense that they best capture the features that are predictive of our output classes. This process is known as loss optimization, and can be illustrated mathematically as follows:
θ* = arg min_θ (1/n) Σᵢ₌₁ⁿ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾), where ŷ⁽ⁱ⁾ = f(x⁽ⁱ⁾; θ) is our model's prediction for the i-th training example
Note that we represent our ideal model weights (θ*) as the minimum of our loss function over the entire training set. In other words, for all the feature inputs and labeled outputs we show our model, we want it to find the place in its weight space where the overall difference between the actual (y) and predicted (ŷ) values is the smallest. The weight space we refer to here is simply the set of all the different possible combinations of weights that our model may take on. For the sake of having a simplified representation of our loss function, we denote it as J(θ). We can now iteratively solve for the minimum of our loss function, J(θ), descending the loss surface until we converge to the global minimum. This process is what we call gradient descent:
θ ← θ − η ∇θ J(θ)

Here, ∇θ J(θ) is the gradient of our loss with respect to the weights, and η is the learning rate, which controls the size of each step we take down the loss surface.
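As a rough illustration of this update rule (reusing the same made-up toy dataset and mean squared error loss from the earlier sketch, with the gradient worked out analytically rather than by any particular library), the following loop repeatedly steps the weights along the negative gradient:

```python
import numpy as np

# Same made-up toy data as before: targets follow y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

def grad_J(theta):
    """Gradient of the mean squared error J(theta) with respect to theta."""
    error = theta[0] * x + theta[1] - y           # y_hat - y
    return np.array([2 * np.mean(error * x),      # dJ / d slope
                     2 * np.mean(error)])         # dJ / d intercept

theta = np.array([0.0, 0.0])   # arbitrary initial weights
eta = 0.05                     # learning rate: the size of each downhill step

for _ in range(2000):
    theta = theta - eta * grad_J(theta)   # the gradient descent update

print(theta)   # settles very close to the ideal weights [2.0, 1.0]
```

The learning rate of 0.05 here is an arbitrary choice for this toy problem: too large a value overshoots the minimum and diverges, while too small a value makes convergence needlessly slow.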