Derive the analytic solution to the optimization problem for linear regression with squared error. To keep things simple, you can omit the bias b from the problem (we can do this in a principled fashion by adding one column to X consisting of all ones).
- Write out the optimization problem in matrix and vector notation (treat all the data as a single matrix, and all the target values as a single vector).
- Compute the gradient of the loss with respect to w.
- Find the analytic solution by setting the gradient equal to zero and solving the matrix equation.
- When might this be better than using stochastic gradient descent? When might this method break?
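For reference, the first three parts can be sketched as follows. This assumes the loss is the sum of squared errors with a factor of 1/2 and no 1/n averaging (averaging only rescales the gradient and does not change the minimizer), that $X$ is the $n \times d$ data matrix and $y$ the length-$n$ target vector, and that $X^T X$ is invertible:

$$L(w) = \frac{1}{2} \|Xw - y\|^2$$

$$\nabla_w L(w) = X^T (Xw - y)$$

$$X^T (Xw - y) = 0 \;\Longrightarrow\; X^T X w = X^T y \;\Longrightarrow\; w^* = (X^T X)^{-1} X^T y$$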
I have understood the first three parts of this question. Can anyone help me by explaining the fourth part?
When might this be better than using stochastic gradient descent? When might this method break?
            
Next time maybe you can ask below the chapter's question to get more attention from others.
I guess the function must be convex to ensure a global optimum.
            
Hi @Harvinder_singh, please post your question in the corresponding chapter's discussion (such as the one for linear regression) in the future, which will be easier for us to reference.
As for your question, when the features are linearly independent, the "analytic solution" will work fine. For high-dimensional x, if some of the features are highly correlated (in the extreme case, exactly linearly dependent), then $(X^T X)^{-1}$ may not exist or may be numerically unstable to compute, so we cannot rely on the "analytic solution". That's where SGD comes in and solves the problem. SGD doesn't care about correlations between features.
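To make this concrete, here is a minimal NumPy sketch, not taken from the book. The synthetic data, the helper name `normal_equation`, and the choice of duplicating a column are all assumptions made for illustration. It solves the normal equations when the features are independent, then shows that appending an exact copy of a feature makes $X^T X$ rank-deficient, so the inverse no longer exists.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: 100 examples, 3 linearly independent features.
n, d = 100, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -3.4, 1.5])
y = X @ true_w + 0.01 * rng.normal(size=n)


def normal_equation(X, y):
    """Solve X^T X w = X^T y for w (closed-form least squares)."""
    return np.linalg.solve(X.T @ X, X.T @ y)


# With independent features the closed form recovers the weights.
w_hat = normal_equation(X, y)

# Append an exact copy of the first feature: the columns are now dependent.
X_dup = np.hstack([X, X[:, :1]])

# rank(X^T X) equals rank(X), so a rank of 3 for a 4-column matrix means
# the 4x4 matrix X^T X is singular and has no inverse.
rank_dup = np.linalg.matrix_rank(X_dup)

# np.linalg.lstsq still returns an answer (the minimum-norm solution),
# but the weight on the duplicated feature is no longer uniquely determined.
w_min_norm, *_ = np.linalg.lstsq(X_dup, y, rcond=None)
```

In the duplicated case the predictions are still fine, but the original weight is shared between the two identical columns, which is one way to see that the solution is not unique.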
            
@StevenJokes Thanks for your suggestion. And you are correct: to obtain a global optimum the function must be convex; otherwise SGD may only find a local optimum.
            
Hello @goldpiggy, I will post my question in the related chapter discussion in the future. Thanks for your suggestion.
As you said, if the features are linearly independent, SGD works fine. If the features are highly correlated, then the matrix $(X^T X)^{-1}$ may not exist. Does this mean that in this situation we can't use either option, since with highly correlated features SGD may not work well either?
If I misunderstood something, please correct me.
            
Hey @Harvinder_singh, sorry, I only answered half of your question. The downside of stochastic gradient descent is that it can be too noisy, since it requires a weight update at every data point. If some data points have an extraordinarily large gradient, that might be harmful for the weights. That's why we choose to do mini-batch gradient descent. We will discuss more in http://d2l.ai/chapter_optimization/index.html.
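As a rough illustration of the mini-batch idea, here is a minimal NumPy sketch. The function name `minibatch_sgd`, the hyperparameters, and the synthetic data are assumptions for this example and are not the book's implementation. Each update averages the gradient over a small batch, so a single unusual data point has less influence on the weights; setting `batch_size=1` recovers the per-example SGD described above.

```python
import numpy as np


def minibatch_sgd(X, y, batch_size=10, lr=0.03, num_epochs=20, seed=0):
    """Mini-batch SGD for linear regression with squared loss (no bias)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_epochs):
        perm = rng.permutation(n)  # reshuffle the examples every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            # Gradient of 1/2 * squared error, averaged over the batch.
            grad = X[idx].T @ residual / len(idx)
            w -= lr * grad
    return w


# Hypothetical synthetic data for a quick check.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
true_w = np.array([2.0, -3.4])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w_sgd = minibatch_sgd(X, y)
```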