Derive the analytic solution to the optimization problem for linear regression with squared error. To keep things simple, you can omit the bias b from the problem (we can do this in principled fashion by adding one column to X consisting of all ones).

- Write out the optimization problem in matrix and vector notation (treat all the data as a single matrix, and all the target values as a single vector).
- Compute the gradient of the loss with respect to w.
- Find the analytic solution by setting the gradient equal to zero and solving the matrix equation.
- When might this be better than using stochastic gradient descent? When might this method break?
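(For reference, the first three parts work out as follows, assuming the common $\tfrac{1}{2}$ scaling of the loss; any positive constant gives the same minimizer.)

```latex
\min_{w}\; L(w) = \tfrac{1}{2}\,\lVert Xw - y \rVert_2^2,
\qquad X \in \mathbb{R}^{n \times d},\; y \in \mathbb{R}^{n}

\nabla_w L(w) = X^{\top}(Xw - y)

\nabla_w L(w) = 0
\;\Longrightarrow\; X^{\top}X\,w = X^{\top}y
\;\Longrightarrow\; w^{*} = (X^{\top}X)^{-1}X^{\top}y
```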

I have understood the first three parts of this question. Can anyone help me by explaining the fourth part?

When might this be better than using stochastic gradient descent? When might this method break?

Next time, maybe you can ask your question under the chapter’s discussion thread to get more attention from others.

I guess the function must be convex to guarantee a global optimum.
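Indeed, for linear regression with squared error the Hessian confirms convexity:

```latex
L(w) = \tfrac{1}{2}\lVert Xw - y\rVert_2^2
\quad\Longrightarrow\quad
\nabla_w^2 L(w) = X^{\top}X \succeq 0,
```

since $v^{\top}X^{\top}Xv = \lVert Xv\rVert_2^2 \ge 0$ for all $v$, so any stationary point is a global minimum.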

Hi @Harvinder_singh, please post your question in the corresponding discussion thread (such as this one for linear regression) in the future, which will make it easier for us to reference.

As for your question: when the features are linearly independent, the “analytic solution” works fine. For high-dimensional x, if some of the features are highly correlated, then $(X^T X)^{−1}$ may not exist, so we cannot rely on the “analytic solution”. That’s where SGD comes in and solves the problem: SGD doesn’t care about correlations among the features.
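A small sketch of this failure mode (the data and variable names here are made up for illustration): duplicate one feature column so that the columns of X are linearly dependent. Then $(X^T X)^{−1}$ does not exist and the analytic formula fails, while a least-squares routine still returns a minimizer.

```python
import numpy as np

# Hypothetical data: the second feature is an exact copy of the first,
# so the columns of X are linearly dependent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(100, 1))
X = np.hstack([x1, x1])                       # duplicated column
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=100)

gram = X.T @ X                                # X^T X is singular here
print(np.linalg.matrix_rank(gram))            # 1 (not 2): not invertible

# The analytic solution w = (X^T X)^{-1} X^T y cannot be computed directly:
try:
    w = np.linalg.solve(gram, X.T @ y)
except np.linalg.LinAlgError:
    print("X^T X is singular; the analytic formula breaks down")

# A least-squares solver (SVD-based) still returns the minimum-norm minimizer:
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_lstsq)
```

Iterative methods such as SGD likewise keep working here, since they never invert $X^T X$.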

@StevenJokes Thanks for your suggestion. And you are correct: to obtain a global optimum the function must be convex; otherwise, using SGD we may only find a local optimum.

Hello @goldpiggy, I will post my questions in the related chapter discussion in the future. Thanks for your suggestion.

As you said, if the features are linearly independent, the analytic solution works fine, but if the features are highly correlated, then the $(X^T X)^{−1}$ matrix may not exist. Does this mean that in this situation we can’t use either option, since with highly correlated features SGD may not work well either?

If I misunderstood something , please correct me.

Hey @Harvinder_singh, sorry, I only answered half of your question. The downside of stochastic gradient descent is that SGD can be too noisy, since it updates the weights at every single data point; if some data point has an extraordinarily large gradient, that update can be harmful to the weights. That’s why we often choose mini-batch gradient descent instead. We will discuss this more in http://d2l.ai/chapter_optimization/index.html.
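A minimal sketch of that trade-off (data, names, and hyperparameters are illustrative): with `batch_size=1` each update comes from a single point, so the gradient is noisy; averaging over a mini-batch damps the noise while both variants converge on this simple problem.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=n)

def sgd(batch_size, lr=0.05, epochs=30):
    """Mini-batch SGD on squared error; batch_size=1 is 'pure' SGD."""
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # Gradient of (1/2)||X_b w - y_b||^2, averaged over the batch
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

w1 = sgd(batch_size=1)     # per-point updates: noisier trajectory
w32 = sgd(batch_size=32)   # averaged updates: smoother trajectory
print(w1, w32)             # both end up near true_w
```

On badly scaled data with outliers, the `batch_size=1` trajectory is the one that suffers most from occasional huge per-point gradients, which is the motivation for mini-batching above.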
