CHAPTER 06.05: ADEQUACY OF REGRESSION MODELS: Check Three: Coefficient of Determination: Part 1 of 2
In this segment, we're going to continue to talk about the different checks we have to do for regression models. In this example, we're going to find the coefficient of determination. So we're going to look at how we calculate the coefficient of determination for a regression model, and what it is based on. So this r squared is the coefficient of determination, and let's go ahead and see how it is calculated based on what we know about the regression model. It is basically calculated with this numerator, which is St minus Sr. So what is St? St is based on the difference between what you have observed and the mean value. Before you draw the straight line, the best estimate which you can make based on the y values which are given to you is the mean of the observed values. So what we are doing is taking the difference between each observed value and the mean value, squaring each of them, and then adding all of them together; that gives us the value of St. Now, how do we calculate the value of Sr, which is the sum of the squares of the residuals? Again we take the observed values, but now we subtract the predicted values at those particular points, based on the straight-line regression model which we have, square the differences, and add them up. So St is the amount of squared deviation which you have without considering the regression variable, x, while Sr is the amount of squared deviation which you have around the straight line once you have incorporated the independent variable x.
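Written out as formulas, the two sums described above can be sketched as follows. The symbols for the fitted line (a predicted value written as y-hat, with intercept a_0 and slope a_1) are assumed notation; the segment itself only speaks of "predicted values".

```latex
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
\qquad
S_t = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2
\qquad
S_r = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\quad \hat{y}_i = a_0 + a_1 x_i
```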
So St is the amount of variation which you have before regression, and Sr is the amount of variation which you have after regression. What we are doing is subtracting the two to see how much of that variation has been taken care of by the regression, and then we are dividing by St so as to get a relative difference between the model which we had before regression and after regression. So it's basically normalizing it by that particular number.
So let's go ahead and see what this means. Again, we have r squared is equal to St minus Sr, divided by St, and what does this St mean? Let's suppose somebody gives us n data points, which you are seeing here: you've got point 1, point 2, point 3, and so on through (xi, yi) up to (xn, yn). You start with the observed value which you have right here, and this is, let's suppose, y-bar here. So you have y-bar there, and this is your yi, and you have the difference between the observed value and the value of the mean, and then you are squaring it, so that is this area right here. So basically St is simply taking the magnitudes of all these areas and adding them up, because that's the amount of variation which you have before you draw the straight line. Now, what kind of variation do we have after we have drawn the straight line? Now the difference is between the observed and the predicted values. So again, we have the observed value minus the predicted value here, and we are again squaring those numbers and adding them up.
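To make the calculation concrete, here is a minimal sketch in Python. The function names and the five data points are made up for illustration and do not come from the segment, and the straight line is assumed to be the ordinary least-squares fit.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = a0 + a1*x; returns (a0, a1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    return a0, a1


def coefficient_of_determination(xs, ys):
    """Return (St, Sr, r^2) for a straight-line fit to the data."""
    a0, a1 = fit_line(xs, ys)
    y_bar = sum(ys) / len(ys)
    # St: squared deviations of the observed values from their mean
    st = sum((y - y_bar) ** 2 for y in ys)
    # Sr: squared deviations of the observed values from the fitted line
    sr = sum((y - (a0 + a1 * x)) ** 2 for x, y in zip(xs, ys))
    return st, sr, (st - sr) / st


# Illustrative data only (not from the lecture)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
st, sr, r2 = coefficient_of_determination(xs, ys)
print(f"St = {st:.4f}, Sr = {sr:.4f}, r^2 = {r2:.4f}")
```

Because these points lie close to a straight line, Sr comes out much smaller than St, and r squared is close to 1.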
So if you look at a particular point here, let's suppose this is our data point right here, then this is the vertical distance between the observed value and the predicted value, and we are finding the area of this square. Similarly, we're going to find the areas of the squares at all the data points which are given to us, we're going to add them all up, and that will give us the value of Sr. Again, we are doing this in order to calculate our coefficient of determination, which is simply St minus Sr, divided by St.
So let's go ahead and see, because we are normalizing the difference between the variation which we had before regression and after regression by St, what are the limits of r squared? The limits of r squared are 0 and 1, and you can very well see that r squared will be equal to 0 when St is equal to Sr, and r squared will be equal to 1 when Sr is equal to 0. So let's go ahead and look at this r squared equal to 1 business first, where Sr is equal to 0. Sr will be equal to 0 when the data points which are given to you are all falling exactly on the straight line. So let's suppose we're doing y versus x, and somebody gives us data points, and we find out that, hey, the straight line regression curve goes through all the data points; in that case Sr will be equal to 0, and you'll get r squared equal to 1, so that'll be the upper limit of r squared which you're going to get. The lower limit of r squared is 0, where the variation which you are getting before drawing the straight line and after drawing the straight line does not change. One of the circumstances for that case is if you have only one data point.
So if you have only one data point, you can very well see that drawing a straight line through that data point is not going to explain the data any further, so in that case, you will have r squared equal to 0. The more important case is where you have y versus x data given to you, let's suppose it is given to you like this, and you find out that the regression curve which best fits those data points turns out to be y equal to y-bar. So if the regression curve turns out to be the same as the average value of the y values, then in that case also r squared will be equal to 0. Why? St is based on the difference between the observed values and the average value, and Sr is based on the difference between the observed values and the straight line. If the straight line itself turns out to be the average value line, the constant average value line, then St and Sr are exactly the same number. In fact, you can prove, and I'm not showing you the proof, that if the straight line regression for given data is a constant line, then that constant line has to be y-bar, and that's something which you can do as homework. Again, I will repeat it: if the straight line regression curve turns out to be a constant line, that constant line has to be the same as the average value of the y values.
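The two limits can be checked numerically with a small sketch. The helper name and both data sets are hypothetical, chosen only so that one set lies exactly on a line and the other has a least-squares slope of zero; the fit is assumed to be ordinary least squares.

```python
def r_squared(xs, ys):
    """Return (a0, a1, r^2) for the least-squares line y = a0 + a1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    st = sum((y - y_bar) ** 2 for y in ys)
    sr = sum((y - (a0 + a1 * x)) ** 2 for x, y in zip(xs, ys))
    return a0, a1, (st - sr) / st


# Upper limit: every point lies exactly on y = 1 + 2x, so Sr = 0
a0_hi, a1_hi, r2_hi = r_squared([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
print(a0_hi, a1_hi, r2_hi)

# Lower limit: the best-fit slope is zero, so the line is y = y-bar
# and Sr equals St
a0_lo, a1_lo, r2_lo = r_squared([1.0, 2.0, 3.0], [2.0, 5.0, 2.0])
print(a0_lo, a1_lo, r2_lo)
```

In the second data set the fitted intercept equals the mean of the y values, which is 3, in line with the statement that a constant regression line has to be y-bar. The single-point case is not run here: with one data point St is zero, so the formula as written would divide by zero.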