CHAPTER 06.03: LINEAR REGRESSION: Background: Part 1 of 2
In this segment, we're going to talk about linear regression, and we're going to look at the background for how the linear regression formula is developed. We're not going to derive the formula here, we have done that elsewhere. The linear regression problem is this: we're given n data points, (x1, y1), (x2, y2), and so on up to (xn, yn), and we want to best fit the straight line y = a0 + a1 x to the data. Here y is the dependent variable, and x is what's called the explanatory variable. So somebody's giving us n data points, let's suppose scattered like this, and they want us to best fit this straight line through them.
Let's suppose this data point here is (xi, yi), so I'm just taking i as a general point which is given to us. The value of y predicted by the straight line at this point is simply a0 + a1 xi, because this straight line is nothing but a0 + a1 x. What you want to do in linear regression, or in any kind of regression model, is minimize the error between what is observed and what is predicted. That error, Ei, is what people call a residual. The residual is the difference between the observed value, yi, which is the value given to you, and the predicted value, a0 + a1 xi, so Ei = yi - (a0 + a1 xi), or if I expand it, Ei = yi - a0 - a1 xi. That is the residual associated with each point.
Now, we cannot just say, hey, maybe I can change this straight line so that I minimize the error at this one point here.
I can always make a straight line go through that one data point, and then the residual associated with that particular point will be 0, but that will change the residuals at the other points. So it looks like I should not be looking at one specific point to make the residual small. I should be looking at all the data points, all the residuals, at the same time.
So people could say, hey, why don't you use this as the criterion: take the residuals which you are getting at all the data points, add all of them up, and try to minimize that sum. You try to minimize the sum of the residuals, and it looks like a decent criterion to use. Now, let's see if there is a problem with that kind of criterion, and I'm going to do that through an example.
Let's suppose somebody gives you data like this: at x = 2, y is 4; at x = 3, y is 6; at x = 2, y is 6; and at x = 3, y is 8. So somebody's giving you these four data points, (2, 4), (3, 6), (2, 6), and (3, 8), and says, hey, go ahead and use this criterion of summing all the residuals and minimizing that. If I show the plot, I have (2, 4) here, then (2, 6), which is the third data point, right here, then (3, 6) here, and (3, 8) right here. Those are the four data points which are given to me, and what I want to do is minimize the sum of the residuals.
Someone might say, hey, why don't you go ahead and draw this line through (2, 4) and (3, 8)? By using my knowledge of geometry, the equation of that line is y = 4x - 4, because the slope between those two points is (8 - 4)/(3 - 2) = 4, and the line has to give y = 4 at x = 2. Now, what I want to see is whether that straight line minimizes the sum of the residuals. I'm just taking this straight line on faith, so please don't think that I have already found the straight line which minimizes this sum. I'm just taking one of the straight lines which I could draw, and I could draw infinitely many such lines.
So let me write down the table again, with x, y, what I predict, and then the residual. The values are (2, 4), (3, 6), (2, 6), (3, 8). I get the predicted values by simply substituting the value of x into the formula 4x - 4. At x = 2, y predicted is 4(2) - 4 = 4. At x = 3, it is 4(3) - 4 = 12 - 4 = 8. For the point (2, 6), the predicted value is again 4, and for (3, 8) it is again 8. So the residual, observed minus predicted, is 0 for the first point, -2 for the second, +2 for the third, and 0 for the fourth, and if I sum the residuals from i equal to 1 to 4, what do I get? I get 0. So that seems to be a good straight line to use, because the sum of the residuals which I'm getting is 0.
But how about this: if I choose the line y = 6, that is also a straight line. It goes through the two points (2, 6) and (3, 6), but it does not go through (2, 4) or (3, 8). In that case also, as homework, you're going to find out that for y = 6, the summation of the residuals Ei, from i = 1 to 4, is equal to 0.
So what you are finding is that the summation of the residuals is not a good criterion. You are getting a sum of 0, but you are not getting a unique line that you can point to and say, hey, this is the line which best fits the data. Now, somebody might say, hey, why is that important? It's important because if you use minimizing the sum of the residuals as the criterion to find your straight line, people are going to get different lines. One person is going to get y = 6, another is going to get y = 4x - 4, and there are infinitely many other lines for which the sum of the residuals turns out to be 0. So what that means is that the summation of the residuals by itself is not a good criterion.
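The homework check for y = 6, and the comparison with y = 4x - 4, can be sketched in a few lines of Python. This is only an illustrative sketch and not part of the lecture: the function names residuals, sum_of_residuals, and line_through_mean are made up for this example. The last part relies on the algebra that the sum of the residuals equals n times (mean of y - a0 - a1 times mean of x), so any line passing through the mean point of the data, here (2.5, 6), should also give a sum of 0; the three slopes tried are arbitrary choices.

```python
# The four data points used in this segment.
xs = [2.0, 3.0, 2.0, 3.0]
ys = [4.0, 6.0, 6.0, 8.0]


def residuals(a0, a1, xs, ys):
    """Return E_i = y_i - (a0 + a1 * x_i) for every data point."""
    return [y - (a0 + a1 * x) for x, y in zip(xs, ys)]


def sum_of_residuals(a0, a1, xs, ys):
    """Return the sum of all residuals for the line y = a0 + a1 * x."""
    return sum(residuals(a0, a1, xs, ys))


# Line 1: y = 4x - 4, that is a0 = -4, a1 = 4.
res_a = residuals(-4.0, 4.0, xs, ys)   # [0.0, -2.0, 2.0, 0.0]
sum_a = sum(res_a)                     # 0.0

# Line 2: y = 6, that is a0 = 6, a1 = 0.
res_b = residuals(6.0, 0.0, xs, ys)    # [-2.0, 0.0, 0.0, 2.0]
sum_b = sum(res_b)                     # 0.0

# Mean point of the data: (2.5, 6.0).
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)


def line_through_mean(a1):
    """Return (a0, a1) for the line with slope a1 through the mean point."""
    return y_mean - a1 * x_mean, a1


# Arbitrary slopes, chosen only for illustration.
other_sums = [
    sum_of_residuals(*line_through_mean(slope), xs, ys)
    for slope in (-1.0, 0.5, 2.0)
]

print("y = 4x - 4:", res_a, "sum =", sum_a)
print("y = 6:     ", res_b, "sum =", sum_b)
print("other lines through the mean point:", other_sums)
```

Every line tried gives a residual sum of 0 even though the individual residuals differ, which is the non-uniqueness problem described above.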