Background on Floating Point Representation Using Base 10

In this segment, we'll talk about the background of floating-point representation. We're going to work in base 10 only in this segment, because we want you to get familiar with the ideas in base 10 so that the jump to base 2 becomes a little easier. As far as terminology is concerned, we'll cover what we mean by fixed-point format, what we mean by floating-point format, and what the pros and cons of the two are so far as numerical methods are concerned.

So, if you look at the number 256.78 and somebody asks what kind of format it is, you will say it's the decimal format. We're going to call it the fixed-point format, and that's the terminology we're going to use. If somebody takes this number and writes it as 2.5678 times 10 to the power 2, and asks what kind of format that is, you will immediately say it is the scientific format, but we're going to call it the floating-point format. The reason we use different terminology is that the fixed-point format is valid for any base, not necessarily base 10, and the floating-point format is also valid for any base, not just 10. So, we're going to talk about the fixed-point format and the floating-point format separately, and as we go through the pros and cons you'll see how the two are connected. Let's look at the fixed-point format first.

Let's take an example of the fixed-point format. An old-time cash register used to have three places for the integer part and two places for the fractional part: the integer part was for dollars, and the fractional part was for cents. If I had to write a number like 256.78 in that cash register, I'd simply write 2 5 6 in the integer places and 7 8 in the fractional places. So, I'm using five places in this cash register to write this number. Now suppose somebody asks, what is the smallest number you can represent in this cash register? It's 000.00. And the largest number you can represent? That's 999.99. So 000.00 is the lower limit and 999.99 is the upper limit of what you can represent. Keep in mind that, just to keep our discussion simple, we're only talking about positive numbers here. Next, let's see what kind of errors can be caused by using the fixed-point format.

Let's talk about errors in the fixed-point format. We'll keep the same format as before, three places for the integer part and two places for the fractional part, and see what kind of errors we get. Suppose I have a number like 256.786. It is going to get represented as 256.78. We're assuming chopping here, meaning that we do not round the last represented digit up or down. So the 8 does not become 9 just because the next digit is 6; it stays as it is no matter what digits come after it.

Having said that, let's look at the true error in this case. It is the exact value minus the represented value, which is the approximate value, and it turns out to be 0.006. The absolute relative true error is the absolute value of the true error divided by the exact value, and here it turns out to be a small number, about 0.000023366.

If you repeat the same process for a number like 3.546, it gets represented as 003.54, the true error is again 0.006, and the absolute relative true error is 0.0016920. Let's take a different number, like 0.016. It gets represented as 000.01. Keep in mind we're still using chopping, so the true error is 0.006, and the absolute relative true error turns out to be a large number, 0.375.

This gives you an idea that across different numbers there seems to be some kind of upper limit on how large the true error can be. We got exactly the same true error each time, which was intentional, but in general you get true errors of a similar order. The relative true error, however, is small for a large number, becomes larger for a smaller number, and becomes very large for an even smaller number.
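To check these three examples yourself, here is a minimal sketch in Python. It is not part of the lecture: the helper names chop_fixed and true_errors are made up for illustration, and it assumes positive numbers and chopping, as in the cash-register format with two fractional places.

```python
from decimal import Decimal, ROUND_DOWN

def chop_fixed(x: str, frac_places: int = 2) -> Decimal:
    """Chop a positive base-10 number to `frac_places` fractional digits."""
    quantum = Decimal(1).scaleb(-frac_places)  # 0.01 when frac_places == 2
    return Decimal(x).quantize(quantum, rounding=ROUND_DOWN)

def true_errors(x: str, stored: Decimal):
    """Return (true error, absolute relative true error) for exact value x."""
    exact = Decimal(x)
    true_error = exact - stored              # exact value minus approximate value
    rel_error = abs(true_error) / exact      # assumes exact != 0
    return true_error, rel_error

# The three numbers used in the lecture.
for s in ["256.786", "3.546", "0.016"]:
    stored = chop_fixed(s)
    true_error, rel_error = true_errors(s, stored)
    print(f"{s}: stored {stored}, true error {true_error}, "
          f"relative true error {float(rel_error):.5g}")
```

Using Decimal rather than ordinary binary floats keeps the arithmetic in base 10, so the chopped digits match the hand calculation.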

So, what you are finding out is that with the fixed-point format you have control over how much true error there is going to be in the representation, but not a whole lot of control over the relative true error. In fact, with our fixed-point format the absolute true error is always going to be less than 0.01. Take any number from 0 to 999.99, and you will find that the true error is always less than 0.01, which is nothing but 10 to the power -2. This 2 is not by chance: the true error will always be less than 10 to the power -p, where p is the number of places you have for the fractional part. So, let's now talk about errors in the floating-point format.
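The fixed-point bound can be written down in one line. This is only a sketch, assuming a positive number x stored by chopping to p fractional places; the symbols \hat{x} for the stored value and E_t for the true error are notation introduced here, not from the lecture.

```latex
\hat{x} = \frac{\lfloor x \cdot 10^{p} \rfloor}{10^{p}},
\qquad
E_t = x - \hat{x},
\qquad
0 \le E_t < 10^{-p},
\qquad
\text{so for } p = 2:\ E_t < 10^{-2} = 0.01 .
```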

We had five places available in the fixed-point format. To make a fair comparison, we're still going to use five places in total in the scientific format, or the floating-point format: four places for the mantissa and one place for the exponent. If that is the case, the smallest number you can represent has 1 0 0 0 in the mantissa and 0 in the exponent. Why is the first digit a 1 and not a 0? Because, if you remember your scientific format rules, the digit before the decimal point has to be non-zero, and the smallest non-zero digit is 1. The biggest number you can have is 9.999 in the mantissa with an exponent of 9. So the smallest number you can have is 1.000 times 10 to the power 0, and the largest is 9.999 times 10 to the power 9. Right there, you're seeing some advantage in using the floating-point format: for the same amount of space, five places, you are now able to represent numbers all the way from 1 to about 10 billion, as opposed to the fixed-point format, where you could only represent numbers from 0 to about a thousand. But what do we gain and what do we lose by doing this? Let's take some numbers, just as we did for understanding the errors in the fixed-point format, and see what happens.

Let's talk about the floating-point format. In the fixed-point format, we took five places as an example: three places for the integer part and two places for the fractional part. To make a fair comparison between the fixed-point and the floating-point format, we're still going to keep the number of places at five. So, for the scientific format, let's use four places for the mantissa and one place for the exponent. One could just as well have used three places for the mantissa and two places for the exponent, but for this example I'm going to choose four places for the mantissa and one place for the exponent; the total number of places is, of course, five. Now let's look at the smallest number I can represent. I will have 1 0 0 0 in the mantissa and 0 in the exponent, which corresponds to 1.000 times 10 to the power 0, which is nothing but 1. Now, a person might ask, why is the first digit not a 0? Because in the scientific format, which we're calling the floating-point format, the digit before the decimal point in base 10 has to be non-zero. What is the smallest non-zero digit? It's 1, and that's why it has to be a 1 there. Let's look at the largest number you can represent.

The largest number you can represent will have 9 9 9 9 in the mantissa and 9 in the exponent, so the largest number is 9.999 times 10 to the power 9. So, you are using the same number of places as you used for the fixed-point format, and the numbers you can represent now run from 1 to about 10 billion. With five places in the fixed-point format, three for the integer part and two for the fractional part, we were only able to represent numbers from 0 to about a thousand.
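The two ranges can be put side by side in a few lines of Python. This is only an illustrative sketch: the function name float_value and the variable names are assumptions made here, and the mantissa is passed as its four digits with the decimal point understood after the first one.

```python
from decimal import Decimal

# Fixed-point format: three integer places, two fractional places.
fixed_min = Decimal("000.00")
fixed_max = Decimal("999.99")

def float_value(mantissa_digits: str, exponent: int) -> Decimal:
    """Value of d.ddd times 10^exponent, given the mantissa digits 'dddd'."""
    mantissa = Decimal(mantissa_digits[0] + "." + mantissa_digits[1:])
    return mantissa.scaleb(exponent)  # shift the decimal point by `exponent`

# Floating-point format: four mantissa places, one exponent place.
float_min = float_value("1000", 0)  # 1.000 times 10^0
float_max = float_value("9999", 9)  # 9.999 times 10^9

print(f"fixed-point:    {fixed_min:f} to {fixed_max:f}")
print(f"floating-point: {float_min:f} to {float_max:f}")
```

Both formats use five places; only the way the places are split between digits and exponent changes the range.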

So, you're already seeing differences between the fixed-point representation and the floating-point representation. One of the pros is that the floating-point representation allows you to represent a much larger range of numbers. But what are we giving up? In the fixed-point format we saw that we had some control over the true error but no control over the relative true error. Is the reverse the case for the floating-point format? Let's go and see.

So, let's take a number like 256.78. How will it be represented in the floating-point format? In the format I chose, I'll get 2 5 6 7 in the mantissa and 2 in the exponent, because 2 is the exponent, so the number is represented as 2.567 times 10 to the power 2, which is 256.7. The true error in this case is 256.78 minus 256.7, which gives me 0.08. What is the absolute relative true error? It is the true error divided by the exact value, and I get 0.00031155.

If you repeat the same process for a number like 576329.78, which is represented as 5.763 times 10 to the power 5, you will get a true error of 29.78 and a relative true error of 0.000051672. Do it for another number, say 576399.99. In this case the true error is 99.99, and the relative true error is about 0.00017347. Let's take another number here; you again get its true error and its relative true error in the same way.
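The floating-point examples can be reproduced with a short Python sketch. The function name chop_float is hypothetical, and the sketch assumes positive numbers, chopping, four mantissa places, and an exponent that fits in one place (0 to 9), as in the format chosen in this segment.

```python
from decimal import Decimal, ROUND_DOWN

def chop_float(x: str, mantissa_places: int = 4):
    """Chop a positive number to d.ddd times 10^e; return (stored value, e)."""
    exact = Decimal(x)
    exponent = exact.adjusted()           # power of 10 of the leading digit
    if not 0 <= exponent <= 9:
        raise ValueError("exponent does not fit in one place")
    mantissa = exact.scaleb(-exponent)    # now 1 <= mantissa < 10
    quantum = Decimal(1).scaleb(1 - mantissa_places)  # 0.001 for four places
    mantissa = mantissa.quantize(quantum, rounding=ROUND_DOWN)  # chopping
    return mantissa.scaleb(exponent), exponent

P = 4  # number of mantissa places
for s in ["256.78", "576329.78", "576399.99"]:
    exact = Decimal(s)
    stored, exponent = chop_float(s, P)
    true_error = exact - stored
    rel_error = abs(true_error) / exact
    # The relative true error should stay below 10^(1 - P) = 0.001.
    assert rel_error < Decimal(10) ** (1 - P)
    print(f"{s}: stored {stored:f} (exponent {exponent}), "
          f"true error {true_error}, relative true error {float(rel_error):.5g}")
```

The true errors differ by orders of magnitude from one number to the next, while every relative true error stays under the same limit.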

So, as you look through all these examples (you can take some more if you would like to, or at least work through the ones I mentioned), what you're finding out is that there is not a whole lot of control over the true error itself: it is very small for one number, becomes big for others, and is very small again for yet another. But if you look at the relative true error, there does seem to be some control over it in the floating-point format. In fact, for this particular case the relative true error will always be less than 0.001. Take any real number between the smallest and the largest numbers we can represent, 1 and 9.999 times 10 to the power 9, and you will find that the relative true error in this representation is always less than 0.001. What does that correspond to? That's 10 to the power -3, and that's 10 to the power (1 - 4). Why did I write it like this? Because the relative true error is always going to be less than 10 to the power (1 - p), where p is the number of places you have for the mantissa. In this case you have four places for the mantissa, so p is 4.

So what you're finding out here is that with the fixed-point format you had control over the upper limit of the absolute true error, but with the floating-point format you don't have control over the true error; you do have control over the upper limit of the relative true error. That's the difference between the two so far as numerical methods are concerned.
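Here is a short sketch of why the limit is 10 to the power (1 - p). It assumes a positive number x written as m times 10 to the power e with 1 ≤ m < 10, and a mantissa chopped to p places; the symbols m, \hat{m}, and \epsilon_t are notation introduced here, not from the lecture.

```latex
x = m \times 10^{e}, \quad 1 \le m < 10,
\qquad
\hat{m} = \frac{\lfloor m \cdot 10^{p-1} \rfloor}{10^{p-1}}
\;\Rightarrow\; 0 \le m - \hat{m} < 10^{1-p},
\\[4pt]
|\epsilon_t| = \frac{(m - \hat{m}) \times 10^{e}}{m \times 10^{e}}
             = \frac{m - \hat{m}}{m}
             < \frac{10^{1-p}}{m}
             \le 10^{1-p},
\qquad
\text{so for } p = 4:\ |\epsilon_t| < 10^{-3} = 0.001 .
```

The exponent cancels in the ratio, which is why the limit on the relative true error does not depend on how large or small the number is.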

So, what I'm hoping is that, having figured this out for base 10, you will be able to figure it out for base 2. The process will be similar, so you don't have to rethink the concepts. The only thing you'll have to do is gear your brain towards thinking in terms of base 2. And that is the end of this segment.