CHAPTER 01

CHAPTER 01.05: FLOATING POINT REPRESENTATION: Background : Part 1 of 3

In this segment, we're going to talk about the background for floating point representation. We'll discuss it in base-10 first, because you are already familiar with decimal notation, so that we can later understand what floating point representation in binary is. What I would like you to do is go back a little bit in time, to when cash registers had maybe three places for the dollars and two places for the cents. This is what we call a fixed register: you have so many places for the integer part of a number, and so many places for the fractional part, which is the cents in this example. In this case, the smallest purchase you could ring up is 000.00, because that's the smallest number you can put in this register. I'm intentionally not treating the decimal point as a place, because it is not part of the register itself; I just want you to focus on the fact that we have five places to work with, where we use the first three for the integer part and the last two for the fractional part. What is the largest number? In this case, it is 999.99. So you can very well see that you could use this register to store any number between 0 and 999.99, to two decimal places.
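The five-place register described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `store`, `register_digits`, `INTEGER_PLACES` and `FRACTION_PLACES` are hypothetical, values are passed as strings so that Python's `decimal` module keeps them exactly in base-10, and values with extra decimal places are rounded half up to the nearest cent.

```python
from decimal import Decimal, ROUND_HALF_UP

# Assumed layout from the segment: three places for dollars, two for cents.
INTEGER_PLACES = 3
FRACTION_PLACES = 2

STEP = Decimal(1).scaleb(-FRACTION_PLACES)       # spacing between stored values: 0.01
SMALLEST = Decimal("0.00")                       # smallest value the register holds
LARGEST = Decimal(10) ** INTEGER_PLACES - STEP   # largest value: 999.99


def store(text):
    """Round a decimal string to the register's two places (assumption: half up)."""
    stored = Decimal(text).quantize(STEP, rounding=ROUND_HALF_UP)
    if not (SMALLEST <= stored <= LARGEST):
        raise ValueError(f"{text} does not fit in the register")
    return stored


def register_digits(stored):
    """Return the five digits held in the register; the point itself is not stored."""
    return f"{int(stored / STEP):0{INTEGER_PLACES + FRACTION_PLACES}d}"


print(SMALLEST, LARGEST)                  # 0.00 999.99
print(register_digits(store("256.79")))   # 25679
```

The decimal point never appears in `register_digits`; where it sits is fixed by the layout, which is what makes this a fixed register.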

Now, suppose you had a number with more than two decimal places, say 256.789. If you are rounding off the number, not chopping it, it gets represented as 256.79. What is the true error in this case? The actual number is 256.789 and you are representing it as 256.79, so the true error is 256.789 - 256.79 = -0.001, and the absolute true error is 0.001. You can very well see that the absolute true error is never going to be more than 0.01, because the numbers you can represent in the register increase or decrease in steps of 0.01. So if I take any number between 0 and 999.99, the absolute true error I get from representing it in this register is less than 0.01. In this case it is 0.001, but it will always be less than 0.01.

But let's go ahead and see what the absolute relative true error is. Suppose I take a number like 256.786. That number gets represented as 256.79 in my fixed register. How much is the true error? The exact number is 256.786 and it is represented as 256.79, so the true error is 256.786 - 256.79 = -0.004. How much is the absolute relative true error? It is the absolute value of the true error divided by the exact value, 256.786, times 100, which turns out to be 0.0016 percent. If I take another number, say 3.546, it also has more than two decimal places, so it is going to be approximately represented as 3.55.
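The error calculations above for 256.789 and 256.786 can be reproduced with a minimal sketch. The function names `true_error` and `abs_relative_true_error_percent` are hypothetical, and the sketch assumes round-half-up to two decimal places, with true error defined as exact value minus represented value, as in the segment.

```python
from decimal import Decimal, ROUND_HALF_UP

STEP = Decimal("0.01")  # the register keeps two decimal places


def represent(exact):
    """Value as stored in the fixed register (assumption: rounded half up)."""
    return exact.quantize(STEP, rounding=ROUND_HALF_UP)


def true_error(exact, stored):
    """True error = exact value - represented value."""
    return exact - stored


def abs_relative_true_error_percent(exact, stored):
    """Absolute relative true error = |true error / exact value| * 100."""
    return abs(true_error(exact, stored) / exact) * 100


first = Decimal("256.789")
print(true_error(first, represent(first)))       # -0.001

second = Decimal("256.786")
stored = represent(second)                       # 256.79
print(true_error(second, stored))                # -0.004
print(f"{abs_relative_true_error_percent(second, stored):.4f} %")  # 0.0016 %
```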

Now if I calculate my true error, it is the exact value minus the represented value: 3.546 - 3.55 = -0.004. Coincidentally, this is the same true error as we had in the previous case for the larger number, 256.786. As I said, the magnitude of the true error in this register will always be less than 0.01. But what is the relative true error? In this case, the absolute relative true error is the magnitude of the true error, 0.004, divided by the exact value, 3.546, times 100, and it turns out to be 0.11 percent. So what you are finding out is that, although your true errors are in the same range for a large number and for a small number when you have a fixed register, the relative errors are not of the same order: 0.11 percent for 3.546 against 0.0016 percent for 256.786. For small numbers, you are possibly going to get larger relative true errors, or relative round-off errors, and for large numbers you are going to get smaller ones. Floating point representation, which you will see in the next segment, takes care of that: you are able to represent numbers where your true errors might be large for large numbers and small for small numbers, but your relative true errors stay of the same order. That's the advantage of using floating point representation, and we'll see more of its advantages in the next part. And that's the end of this segment.
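The comparison made in this segment between a large and a small number can be checked numerically. This is a rough sketch, not part of the original lecture: the helper `errors` is a hypothetical name, and it assumes the same two-place register with round-half-up used in the examples.

```python
from decimal import Decimal, ROUND_HALF_UP

STEP = Decimal("0.01")  # two decimal places, as in the fixed register


def errors(exact_text):
    """Return (stored value, true error, absolute relative true error in percent)."""
    exact = Decimal(exact_text)
    stored = exact.quantize(STEP, rounding=ROUND_HALF_UP)  # assumption: half up
    true_err = exact - stored
    rel_pct = abs(true_err / exact) * 100
    return stored, true_err, rel_pct


# Same true error for both numbers, very different relative errors.
for text in ("256.786", "3.546"):
    stored, true_err, rel_pct = errors(text)
    print(f"{text:>8} -> {stored!s:>7}  true error {true_err}  relative {rel_pct:.4f} %")
```

Both rows report a true error of -0.004, while the relative error of the smaller number is larger by roughly the ratio of the two exact values, which is the behavior floating point representation is designed to even out.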