Background on Floating Point Representation Using Base 10
|
In this segment, we'll talk about the background of floating-point
representation. We're going to talk about base 10 format only in this
particular segment, because we want you to get familiar with base 10 so that
when we jump to base 2, it becomes a little bit easier as far as the
terminology is concerned: what do we mean by fixed-point format, what do we
mean by floating-point format, and what are the pros and cons of the two as
far as numerical methods are concerned. So, if you look at this number
which you can see here, 256.78, and somebody asks what kind of format it
is, you will say it's the decimal format. We're also going to call it the
fixed-point format, and that's the terminology we're going to use. If somebody
takes this number and writes it like this, and asks what kind of format that
is, you will immediately say it is the scientific format, but we're going to
call it the floating-point format. The reason we use different terminology
here is that the fixed-point format is valid for any base, not just base 10,
and the floating-point format is also valid for any base, not just 10. So,
we're going to talk about the fixed-point format separately, we're going to
talk about the floating-point format separately, and as we talk about the pros
and cons of each, you'll see how the two are connected. So, let's look at the
fixed-point format, and let's take an example. If you
look at an old-time cash register, it used to have three places for the
integer part and two places for the fractional part: the integer part was for
dollars, and the fractional part was for cents. If I had a number like 256.78
and had to write it in that cash register, I'd simply write 2 5 6 here and
7 8 here. So, I have five places in this cash register which I'm using to
write this number. Now suppose somebody asks, what is the smallest number
which you can represent in this cash register? It'll be 000.00. And what is
the largest number which you can represent? It will be 999.99. These are the
lower and upper limits of what you can represent. Keep in mind that, just to
keep our discussion simple, we are not considering negative numbers here.
Next, let's see what kind of errors can be caused by using the fixed-point
format. Let's talk about errors in fixed-point
format. We're still going to use the same example format, three places for
the integer part and two places for the fractional part, and see what kind of
errors we get. Suppose I have a number like 256.786. This number is going to
get represented as 256.78. We're assuming that we're using chopping, meaning
that we are not rounding the last represented digit up or down. So this 8,
for example, does not become 9 because the next digit is 6; it stays what it
is no matter what the digits which come after it are. Having said that, let's
look at what the true error in this case will be. It'll be the exact value
minus the represented value, which is the approximate value, and it turns out
to be 0.006. The absolute relative true error will be the absolute value of
the true error divided by the exact value, and this turns out to be a small
number, about 0.000023. If
you repeat the same process for a number like 3.546, you'll find that it
gets represented as 003.54, and if you calculate the true error, it will turn
out to be 0.006. The absolute relative true error will turn out to be
0.0016920. Let's take a different number, like 0.016. It will get represented
as 0 0 0 here and 0 1 here. Keep in mind we're still using chopping. In this
case, the true error will be 0.006, and the absolute relative true error will
turn out to be a large number, 0.375. This gives you an idea that, for
different numbers, there seems to be some kind of upper limit on how large
the true error will be. We get the same order of magnitude for the true error
each time (in fact exactly the same value, but that was intentional), while
the relative true error is small for a large number, becomes larger for a
smaller number, and becomes very large for an even smaller number. So, what
you are finding out is
that when you have the fixed-point format, you have control over how much
true error there is going to be in the representation, but not a whole lot of
control over the relative true error which you're going to face. In fact, the
absolute true error in our fixed-point format is always going to be less than
0.01. Take any number from 0 to 999.99, and you will find that the true error
is always less than 0.01, which is nothing but 10 to the power -2. And this 2
is not by chance: the true error will always be less than 10 to the power
minus p, where p is the number of places you have for the fractional part.
So, let's now talk about
errors in floating-point format. We'll take the same example: we had five
places available in the fixed-point format. In order to make a fair
comparison, we're still going to use five places total in the scientific
format, or the floating-point format: four places for the mantissa and one
place for the exponent. If that is the case, the smallest number which you
can represent would be 1 0 0 0 in the mantissa with a 0 in the exponent. Why
is this a 1 and not a 0? Because if you remember your scientific format
rules, the digit before the decimal point has to be non-zero, and the
smallest non-zero digit is one. Now, the biggest number which you can have
will be 9.999 in the mantissa, and the exponent will be 9. So, you can very
well see that the smallest number which you can have is 1.000 times 10 to the
power 0, and the largest number which you can have is 9.999 times 10 to the
power 9.
Right there, you're seeing some advantage in using the floating-point format:
for the same amount of space, five places, you are now able to represent
numbers all the way from one to about 10 billion, as opposed to the
fixed-point format, where you were able to represent numbers only from zero
to about a thousand. But what do we gain and what do we lose by doing this?
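To make the range comparison concrete, here is a minimal Python sketch. It is not part of the original lecture: the function names `fixed_point_range` and `floating_point_range` are made up for illustration, and the sketch assumes non-negative numbers and a non-negative exponent, as in the discussion above.

```python
from decimal import Decimal

def fixed_point_range(int_places: int, frac_places: int):
    """Smallest and largest non-negative values in a base-10 fixed-point layout."""
    smallest = Decimal(0)  # 000.00
    # All nines: 10^int_places minus one unit in the last fractional place.
    largest = Decimal(10) ** int_places - Decimal(10) ** -frac_places
    return smallest, largest

def floating_point_range(mantissa_places: int, exponent_places: int):
    """Smallest and largest positive values with a normalized mantissa (d.ddd)."""
    smallest = Decimal(1)  # mantissa 1.000, exponent 0
    # Largest mantissa is all nines, e.g. 9.999 for four places.
    max_mantissa = Decimal(10) - Decimal(10) ** (1 - mantissa_places)
    max_exponent = 10 ** exponent_places - 1  # 9 for one exponent place
    largest = max_mantissa * Decimal(10) ** max_exponent
    return smallest, largest

fixed_lo, fixed_hi = fixed_point_range(3, 2)
float_lo, float_hi = floating_point_range(4, 1)
print("fixed-point range:   ", fixed_lo, "to", fixed_hi)
print("floating-point range:", float_lo, "to", float_hi)
```

Both layouts use five places; only the way the places are split differs.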
Let's take some numbers, just like we did for understanding the errors in the
fixed-point format, and see what happens. Let's talk about the floating-point
format. In the fixed-point format, we took five places as an example: three
places for the integer part and two places for the fractional part. In order
to make a fair comparison between the fixed-point and the floating-point
format, we're still going to keep the number of places at five. So, for the
scientific format, let's use four places for the mantissa and one place for
the exponent. One could have said, let me use only three places for the
mantissa and two places for the exponent, but just as an example I'm going to
choose four places for the mantissa and one place for the exponent; the total
number of places is, of course, five. Now let's look at the smallest number
which I can represent here. I will have 1 0 0 0 here and a 0 here for the
exponent, and that will correspond to 1.000 times 10 to the power 0, which is
nothing but 1. Now, a person might say, why is this not a 0? Because
if you recall your scientific format, which we're calling the floating-point
format, the first digit before the decimal point in base 10 has to be
non-zero. What is the smallest non-zero digit? It's 1, and that's why it has
to be a 1 there. Let's look at the largest number which you can represent. It
will have 9 here, 9 here, 9 here, 9 here, and 9 here. So, the largest number
will be 9.999 times 10 to the power 9. So, you use the same number of places
as you used for the fixed-point format, and you're finding that the numbers
which you can represent now run from 1 to about 10 billion. When we were
using the five places for the fixed-point format, with three places for the
integer part and two places for the fractional part, we were only able to
represent numbers from zero to about a thousand. So, you're already seeing
differences between the fixed-point representation and the floating-point
representation: the floating-point representation allows you to represent a
larger range of numbers. But what are we giving up?
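Before answering that, it may help to reproduce the fixed-point numbers from earlier. The following is a minimal sketch, not part of the original lecture: the helper name `chop_fixed` is hypothetical, and it uses Python's `decimal` module so that the base 10 digits are handled exactly instead of going through binary floating point.

```python
from decimal import Decimal, ROUND_DOWN

def chop_fixed(value: str, frac_places: int = 2) -> Decimal:
    """Chop (truncate, never round up) a non-negative base-10 number."""
    quantum = Decimal(10) ** -frac_places  # 0.01 when frac_places is 2
    return Decimal(value).quantize(quantum, rounding=ROUND_DOWN)

fixed_results = []
for text in ["256.786", "3.546", "0.016"]:
    exact = Decimal(text)
    stored = chop_fixed(text)
    true_error = exact - stored                # exact minus represented value
    relative_error = abs(true_error) / exact   # absolute relative true error
    fixed_results.append((exact, stored, true_error, relative_error))
    print(f"{text}: stored {stored}, true error {true_error}, "
          f"relative true error {relative_error:.7f}")
```

The true error is the same for all three inputs, while the relative true error grows as the number gets smaller.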
In the fixed-point format, we saw that we had some control over the true
error but no control over the relative true error. Is the reverse the case
for the floating-point format? Let's go and see. Let's take a number like
256.78. How will it be represented in the floating-point format? If I write
it in the format which I chose, I'll get a 2 here, 5 here, 6 here, 7 here,
and a 2 here, because 2 is the exponent. So the number is stored as 2.567
times 10 to the power 2, which is 256.7. What is the true error in this case?
It will be 256.78 minus what it is represented by, 256.7, and this gives me a
true error of 0.08. Then what is the absolute relative true error? It is the
true error divided by the exact value, and for this number I get 0.00031155.
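The representation step can be sketched in Python as follows. This is an illustration under stated assumptions, not the lecturer's own code: the name `chop_float` is hypothetical, the input is assumed to be a positive number, and chopping is used as in the fixed-point examples.

```python
from decimal import Decimal, ROUND_DOWN

def chop_float(value: str, mantissa_places: int = 4):
    """Split a positive base-10 number into a chopped mantissa d.ddd and an exponent."""
    x = Decimal(value)
    exponent = x.adjusted()  # power of 10 of the leading digit, 2 for 256.78
    quantum = Decimal(10) ** (1 - mantissa_places)  # 0.001 for four places
    # Shift the decimal point behind the leading digit, then chop extra digits.
    mantissa = x.scaleb(-exponent).quantize(quantum, rounding=ROUND_DOWN)
    return mantissa, exponent

exact = Decimal("256.78")
mantissa, exponent = chop_float("256.78")
stored = mantissa.scaleb(exponent)           # value the five places stand for
true_error = exact - stored
relative_error = abs(true_error) / exact
print("mantissa", mantissa, "exponent", exponent, "stored", stored)
print("true error", true_error, f"relative true error {relative_error:.8f}")
```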
If you repeat the same process for a number like 576329.78 and see what kind
of true error you get, you will get a true error of 29.78 and a relative true
error of 0.000051672. Go ahead and do it for another number; let's suppose
the number is 576399.99. In this case the true error will be equal to 99.99,
and the relative true error will be about 0.00017347.
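These floating-point examples can be run side by side with a short script. As before, this is only a sketch: `store_float` is a made-up helper name, it assumes positive inputs, and it chops rather than rounds.

```python
from decimal import Decimal, ROUND_DOWN

def store_float(value: str, mantissa_places: int = 4) -> Decimal:
    """Value actually stored when only mantissa_places digits are kept (chopping)."""
    x = Decimal(value)
    exponent = x.adjusted()  # power of 10 of the leading digit
    quantum = Decimal(10) ** (1 - mantissa_places)
    mantissa = x.scaleb(-exponent).quantize(quantum, rounding=ROUND_DOWN)
    return mantissa.scaleb(exponent)

float_results = []
for text in ["256.78", "576329.78", "576399.99"]:
    exact = Decimal(text)
    true_error = exact - store_float(text)
    relative_error = abs(true_error) / exact
    float_results.append((true_error, relative_error))
    print(f"{text}: true error {true_error}, "
          f"relative true error {relative_error:.9f}")
```

For these inputs the true error ranges from 0.08 to 99.99, while every relative true error stays below 0.001.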
Let's take another number here. In this case you'll get this true error, and
the relative true error would be this. As you look through all these examples
(you can take some more if you'd like, or at least do the ones which I
mentioned here), what you're finding out is that there seems to be not a
whole lot of control over the true error itself: it is very small here, it
becomes big here, big here, and it is again very small here. So, there's not
as much control over the true error. But if you look at the relative true
error, you're finding out that there seems to be some control over the
relative true error when we're talking about the floating-point format. And
actually, what
you find out is that, for this particular case, the relative true error
will always be less than 0.001. Take any real number between the smallest and
largest numbers we can represent, and you'll find that the relative true
error in this representation is always less than 0.001. And what does that
correspond to? Well, that's 10 to the power minus 3, and that's 10 to the
power (1 minus 4). Why did I write it like this? Because the relative true
error is always going to be less than 10 to the power 1 minus p, where p is
the number of places which you have for the mantissa. In this case you have
four places for the mantissa, so it'll be 10 to the power 1 minus 4. So what
you're finding out here is that when you had the fixed-point format, you had
control over the upper limit of the absolute true error, but when you are
using the floating-point format, you don't have control over the true error,
but you do have control over the upper limit of the relative true error.
That's the difference between the two as far as numerical methods are
concerned. So, what I'm hoping is that you
have figured this out for base 10, and that you will be able to figure it
out for base 2 now. The process will be similar, and since the process will
be similar, you won't have to relearn the concepts. The only thing you'll
have to do is gear your brain towards thinking in terms of base two. And that
is the end of this segment. |