CHAPTER 06.05: ADEQUACY OF REGRESSION MODELS: Check Three: Coefficient of Determination: Part 2 of 2
So let's go ahead and see how we're going to calculate the value of r squared based on the numbers which are given to us in the example. So again, what we are doing is taking the values of T and alpha, and finding the difference between the observed value and the mean value of alpha. In fact, the mean value of alpha, alpha-bar, will be nothing but the summation of alpha-i, i is equal to 1 to 6, divided by 6. So basically you are going to add all six values of alpha and divide by 6, and the number which you're going to get is 4.175 micro-inch per inch per degree Fahrenheit. Then you are going to calculate the differences between the observed values and the mean value, and you're going to square all of these. All six values will be squared, you will add them up, and that's what turns out to be the value of St. Now, how are we going to calculate the value of Sr? Again, this is the data which is given to you, of alpha versus temperature. And these, I shouldn't say observed, these are the predicted values, which you're going to be able to find based on the regression model which you have found.
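The two sums just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the four-point data set are hypothetical, and they are not the alpha-versus-temperature values from the example.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def fit_line(x, y):
    """Least-squares straight line y = a0 + a1*x; returns (a0, a1)."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    a1 = s_xy / s_xx
    a0 = y_bar - a1 * x_bar
    return a0, a1


def total_sum_of_squares(y):
    """St: squared differences between observed values and their mean."""
    y_bar = mean(y)
    return sum((yi - y_bar) ** 2 for yi in y)


def residual_sum_of_squares(x, y, a0, a1):
    """Sr: squared differences between observed and predicted values."""
    return sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, for illustration only (not the lecture's data).
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 4.0, 6.0]

a0, a1 = fit_line(x, y)
St = total_sum_of_squares(y)
Sr = residual_sum_of_squares(x, y, a0, a1)
print("St =", St, "Sr =", Sr)
```

For a least-squares straight line, Sr can never be larger than St, which is why the two are compared against each other in the coefficient of determination.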
So these are the predicted values at the different values of Ti, these six values right here, and now what you're going to do is subtract the predicted value from the observed value. For example, at Ti equal to -340, this is the observed value, this is the predicted value, and the difference is this number here. That is the residual between the observed value and the value predicted by the regression model. You're going to calculate those six residuals at the six data points which are given to you, square each one of them, and add the squares, and you are going to get the value of Sr to be 0.25283.
So now we have the values of St and Sr, so we go back to the definition of the coefficient of determination: r squared is St minus Sr, divided by St. St is 10.783, Sr is 0.25283, so we subtract Sr from St, divide by St, and the value of r squared we get is 0.97655. What does this mean? Let's suppose we approximate 0.97655 as 98 percent, just to keep it to two significant digits. What that means is that 98 percent of the original uncertainty in the value of alpha can be explained by the straight-line model which we have found. So that's what the value of r squared means. As we said, the value of r squared is known to be between 0 and 1, and this is very close to 1. Many times, just because they've gotten a very large value of r squared, people will assume that the model is adequate, which is far from true.
Which brings us to this particular slide, which says that we should use extreme caution in the use of the coefficient of determination, because many things can influence the value of r squared; they can either decrease it or increase it. For example, take the spread of the regressor variable. If I am given y versus x data, and there's a lot of spread in the x variable, that is, the data is spread out, you're going to get a larger value of r squared. And the other way around also: if the spread of the regressor variable is small, that is, your data range for x is very small, then you're going to find that it artificially gives you a small value of the coefficient of determination.
We also have the regression slope. If y is equal to a0 plus a1 x, and a1, which is the slope of this regression model, turns out to be large, so that the line is almost running parallel to the y-axis, you're going to get very large values of r squared there also. That's because we're actually measuring the vertical distance between the observed values and the predicted values, and once you have large regression slopes, the vertical distances become small relative to the total variation in y, and hence it artificially gives you a high value of r squared.
Also, a large r squared does not measure the appropriateness of the linear model. As we saw in the example which we are doing right now, we're getting an r squared value of 98 percent, but that does not necessarily mean that the linear model is appropriate. In fact, when we go to check number 4, we'll be able to show that the linear model which we have been using for these six data points, or 22 data points, is not appropriate. Also, a large r squared does not imply that a regression model will predict accurately.
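The r squared value quoted in this segment, and the cautions about spread and slope, can be checked with a short Python sketch. The St and Sr numbers are the ones from the example; everything else (the function name, the alternating scatter pattern, the x ranges and the slopes) is a hypothetical setup chosen only to make the effect visible.

```python
def r_squared(x, y):
    """Fit y = a0 + a1*x by least squares and return (St - Sr) / St."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    a1 = s_xy / s_xx
    a0 = y_bar - a1 * x_bar
    St = sum((yi - y_bar) ** 2 for yi in y)
    Sr = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))
    return (St - Sr) / St


# Values of St and Sr as stated in the example.
r2_lecture = (10.783 - 0.25283) / 10.783
print("example r^2 =", round(r2_lecture, 5))

# Hypothetical scatter added to a straight line (illustration only).
scatter = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

# Same slope and same scatter; only the spread of x changes.
x_narrow = [0.2 * k for k in range(6)]   # x from 0 to 1
x_wide = [2.0 * k for k in range(6)]     # x from 0 to 10
r2_narrow = r_squared(x_narrow, [2.0 * xi + e for xi, e in zip(x_narrow, scatter)])
r2_wide = r_squared(x_wide, [2.0 * xi + e for xi, e in zip(x_wide, scatter)])
print("narrow x range:", r2_narrow, " wide x range:", r2_wide)

# Same x and same scatter; only the slope of the underlying line changes.
x_same = [float(k) for k in range(6)]
r2_shallow = r_squared(x_same, [0.2 * xi + e for xi, e in zip(x_same, scatter)])
r2_steep = r_squared(x_same, [2.0 * xi + e for xi, e in zip(x_same, scatter)])
print("shallow slope:", r2_shallow, " steep slope:", r2_steep)
```

In this setup the scatter about the line is identical in every case, so any change in r squared comes only from the spread of x or from the slope, not from the quality of the fit.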
So that's another thing which you have to understand: just because you're getting a large r squared, you cannot assume that once you choose a certain value of x, the regression model is going to give you the value of y accurately. So please pay a lot of attention to not using the value of r squared by itself as a criterion to figure out whether a particular model is adequate, based on the things which I have pointed out here. And that's the end of this segment.