CHAPTER 01

CHAPTER 01.05: FLOATING POINT REPRESENTATION: Background: Part 2 of 3

In this segment, we're going to continue our discussion of floating point representation. We are looking at the background of floating point representation, and for this background we are using base-10 numbers, not binary numbers, which we will talk about later. So let's look at the second part. Suppose we have a fixed register, with three places for the integer part and two places for the fractional part. Then the numbers I can represent run from 000.00 all the way up to 999.99. What we saw is that with this kind of representation, you are able to control the amount of true error, or round-off error in this case, to 0.01. However, the relative true errors are large for small numbers and small for large numbers. What we want to see is whether there is a mechanism by which I can have relative true errors, or relative round-off errors, of the same order whether I have small numbers or large numbers.
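
The fixed-register behavior described above can be illustrated with a minimal sketch. The sample values, the function names, and the choice of rounding half up are all assumptions made for illustration; they are not taken from the lecture.

```python
from decimal import Decimal, ROUND_HALF_UP

LARGEST = Decimal("999.99")  # three integer places, two fractional places


def fixed_register(value: str) -> Decimal:
    """Store a non-negative base-10 number in the 000.00 to 999.99 register.

    Rounding half up is an assumption of this sketch.
    """
    x = Decimal(value)
    if x < 0:
        raise ValueError(f"{value} is negative; this register has no sign place")
    stored = x.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    if stored > LARGEST:
        raise ValueError(f"{value} does not fit in the fixed register")
    return stored


def relative_true_error_percent(exact: Decimal, stored: Decimal) -> Decimal:
    """Relative true error in percent: (exact - stored) / exact * 100."""
    return (exact - stored) / exact * 100


# Hypothetical sample values: the true error is 0.0049 for each of them,
# but the relative true error shrinks as the number grows.
for text in ["0.0249", "2.4949", "249.4949"]:
    exact = Decimal(text)
    stored = fixed_register(text)
    print(text, stored, exact - stored, relative_true_error_percent(exact, stored))
```

With these sample values the true error is the same in every row, while the relative true error drops from roughly 20 percent for the smallest number to roughly 0.002 percent for the largest.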

So in that case, I'd like to introduce you to scientific notation. In scientific notation, which is the floating point notation, if I had a number like 256.78, I would write it as 2.5678 times 10 to the power 2. That's what floating point is, because you have floated the decimal point all the way to right after the 2. So the way the floating point representation, or the scientific notation, is written is that you move the decimal point so that there is a single nonzero digit right before it. In this case that digit is 2, so the number is 2.5678 times 10 to the power 2; because I moved the decimal point two places to the left, I get 10 to the power 2. If I had a number like 0.003678, then I would have to move the decimal point all the way to right after the 3, because I need a single nonzero digit before the decimal point, and this becomes 3.678 times 10 to the power -3.

So that's the way you write scientific notation, and what we're going to do is use the same five places which we had for the fixed register and see how that helps us. But before I do that, I just want to give you the general form of scientific notation: we have the sign, then the mantissa, and then 10 to the power of the exponent. The sign is the sign of the number, the mantissa is, for example, 3.678, and the exponent is the power of 10, which is -3 in this case. So that's how you write down your scientific notation.
Sometimes the mantissa is also called the significand, and the two terms are used interchangeably in the literature. So how does writing the number in scientific notation relate to the fixed register case, where we had three places for the integer part and two places for the fractional part? This is how I'll treat it. I'll take the same five places which I had, so I'm not increasing the number of places. I'm going to use four places for the mantissa, and I'm going to use the fifth place for the exponent. What does this do for me? Now I can represent my numbers from 0.000 times 10 to the power 0, which is the smallest number I can represent, up to 9.999 times 10 to the power 9, which is the largest. Although, you might say, hey, you said that the digit before the decimal point should be nonzero, so the smallest mantissa would be 1.000. Surely, based on that, I'll be able to represent numbers only from 1.000 times 10 to the power 0 up to 9.999 times 10 to the power 9. But let's go ahead and see what the advantage of doing this is.
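
The sign, mantissa, and exponent form described above can be sketched in a few lines. The function name `to_scientific` is hypothetical, and the sketch works on decimal strings so that the base-10 digits are kept exactly.

```python
from decimal import Decimal


def to_scientific(value: str):
    """Split a nonzero base-10 number into (sign, mantissa, exponent).

    The mantissa has exactly one nonzero digit before the decimal point,
    so that value == sign * mantissa * 10**exponent.
    """
    x = Decimal(value)
    if x == 0:
        raise ValueError("zero has no leading nonzero digit")
    sign = -1 if x < 0 else 1
    exponent = x.adjusted()              # power of 10 of the leading digit
    mantissa = abs(x).scaleb(-exponent)  # float the decimal point
    return sign, mantissa, exponent


print(to_scientific("256.78"))    # (1, Decimal('2.5678'), 2)
print(to_scientific("0.003678"))  # (1, Decimal('3.678'), -3)
```

Moving the decimal point two places to the left gives the exponent 2 for 256.78, and moving it three places to the right gives the exponent -3 for 0.003678, matching the two examples in the text.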

So you've already seen one advantage: for the same number of places which have been assigned to you, which is five in this case, you are able to represent numbers from 0 to 9.999 times 10 to the power 9, which is approximately 10 to the power 10. In the previous case, with the fixed register, you were only able to represent numbers from 0 to 999.99, which is approximately 10 to the power 3. So the advantage of the floating point representation is that you are able to represent a larger range of numbers, which is important in engineering and scientific calculations.

Now let's go back and see what happens to the true errors and the relative true errors through an example. Suppose I had a number like 256.786, and I have those five places to store the number. The first four are used for the mantissa and the last one for the exponent, so the number will be represented as 2.568 times 10 to the power 2. So what is the true error in this case? The true error is the exact number minus the approximation, or the representation, and that gives me -0.014 as the true error. Now what is the relative true error? The absolute relative true error is the magnitude of the true error, which is 0.014, divided by the exact value, times 100. In this case, it turns out to be 0.00545 percent. Now let's suppose I take a large number like 256786000. In fact, this number could not be represented at all when I had the fixed register.
Now if I have a large number like this, it is going to be represented as 2.568 times 10 to the power 8, because I put the decimal point after the first digit and count 1, 2, 3, 4, 5, 6, 7, 8 places, so it will be 10 to the power 8. Now what is the true error in this case? The true error is the exact value, which is 256786000, minus the approximate value, which is 2.568 times 10 to the power 8, and the true error I get is -14000. You can see that this is a much larger true error than I got in the previous case, where I had a number like 256.786. So the true errors are becoming larger for larger numbers. But if you look at the relative true error in this case, it's -14000 divided by 256786000, multiplied by 100, and the magnitude of this value turns out to be 0.00545 percent. So what you are finding out is that even for large numbers, the relative true error is of the same order as I had for the smaller number. The reason it comes out to be exactly the same is the example I chose, which has the same digits. It would have been different for another number of similar size, but it would be of the same order; that's what I'm trying to drive at.

So for example, suppose you had some other number like 256750000. It would be represented as 2.568 times 10 to the power 8, and in this case your absolute relative true error would be 0.0194 percent. So it is again of the same order as you have been getting previously. If you had a number like 1.0005, it would be represented as 1.001 times 10 to the power 0. In this case, the absolute relative true error will be 0.049975 percent.
That's the amount of relative true error you are going to get for this particular number. So what you are finding out is that your relative true errors are of the same order when you are using the floating point representation, where you are using scientific notation to represent your numbers. And as you can see, this last number is approximately equal to 0.05 percent.
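
The worked examples above can be checked with a minimal sketch of the five-place floating register (four places for the mantissa, one for the exponent). The function names are hypothetical, and rounding half up is an assumption inferred from 256.786 being stored as 2.568 times 10 to the power 2.

```python
from decimal import Decimal, ROUND_HALF_UP


def floating_register(value: str) -> Decimal:
    """Store a positive base-10 number with a 4-digit mantissa (d.ddd)
    and a 1-digit exponent (0 to 9). Rounding half up is assumed."""
    x = Decimal(value)
    if x <= 0:
        raise ValueError("this sketch handles positive numbers only")
    exponent = x.adjusted()
    mantissa = x.scaleb(-exponent).quantize(
        Decimal("0.001"), rounding=ROUND_HALF_UP
    )
    if mantissa >= 10:  # e.g. 9.9996 rounds up to 10.000
        mantissa = Decimal("1.000")
        exponent += 1
    if not 0 <= exponent <= 9:
        raise ValueError(f"{value} needs an exponent outside 0 to 9")
    return mantissa.scaleb(exponent)


def abs_relative_true_error_percent(exact: Decimal, stored: Decimal) -> Decimal:
    """Absolute relative true error in percent."""
    return abs((exact - stored) / exact) * 100


# The four numbers used in the lecture examples.
for text in ["256.786", "256786000", "256750000", "1.0005"]:
    exact = Decimal(text)
    stored = floating_register(text)
    true_error = exact - stored
    rel = abs_relative_true_error_percent(exact, stored)
    print(f"{text}: stored {stored}, true error {true_error}, "
          f"relative {rel:.6f} %")
```

The true errors printed by this sketch differ by orders of magnitude (-0.014 versus -14000), while the relative true errors all stay below about 0.05 percent, which is the point of the comparison.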