CHAPTER 01.05: FLOATING POINT REPRESENTATION: Floating Point Example

 

So what we have is nine bits for the hypothetical floating point word. The first bit is for the sign of the number, the second bit is for the sign of the exponent, the next three bits are for the magnitude of the exponent, and the last four bits are for the magnitude of the mantissa. What we want to do is take the number eleven point eight base ten and write it in this floating point format, following that convention.

 

In order to do that, the first thing we have to see is how to write eleven in base two and how to write zero point eight in base two. Eleven in base 10 is 1011 in base 2. You can do this as homework because the previous video covers that already. Zero point eight base ten will be 0.11001, and it keeps on going, in base two, and you can also do this as homework as it was covered in the previous video. So if we want to see eleven point eight base ten written as a base two number, it is 1011, which is the equivalent of eleven, then the radix point, and then the equivalent of point eight in base ten, which is the 11001 and so on that we just showed. That gives 1011.11001... base two.

Once we have that, we have to take this radix point and move it to just after the first digit, because we only want one non-zero digit before the radix point. So this is 1.01111001... base two times two to the power three. The reason it is two to the power three is that the radix point was moved to the left by three places.

We are going to do this in two stages. First we want to see how many bits of the mantissa we can take, since there are only four bits for the mantissa. We can only use the first four bits after the radix point, 0111, so we have 1.0111 base two times two to the power three, and we are going to forget about the remaining bits because they cannot be represented with only four bits for the mantissa. Now the three in two to the power three needs to be written in base two, so three in base ten is given as 11 base two.
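The conversion and normalization steps described above can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names `int_to_binary` and `frac_to_binary` are made up for this example, and `Fraction` is used so that 0.8 is handled exactly rather than as a machine float.

```python
from fractions import Fraction

def int_to_binary(n):
    """Binary digits of a non-negative integer, by repeated division by 2."""
    if n == 0:
        return "0"
    digits = ""
    while n > 0:
        digits = str(n % 2) + digits  # the remainder is the next bit (right to left)
        n //= 2
    return digits

def frac_to_binary(frac, num_bits):
    """First num_bits binary digits of a fraction in [0, 1), by repeated multiplication by 2."""
    digits = ""
    for _ in range(num_bits):
        frac *= 2
        bit = int(frac)               # the integer part is the next bit (left to right)
        digits += str(bit)
        frac -= bit
    return digits

integer_bits = int_to_binary(11)                    # bits before the radix point
fraction_bits = frac_to_binary(Fraction(8, 10), 5)  # first five bits after the radix point

# Moving the radix point to just after the leading 1 shifts it
# len(integer_bits) - 1 places to the left; that shift is the exponent.
exponent = len(integer_bits) - 1

# Everything after the leading 1, chopped to the four available mantissa bits.
mantissa_bits = (integer_bits[1:] + fraction_bits)[:4]

print(integer_bits + "." + fraction_bits + "...")
print("1." + mantissa_bits, "x 2^" + str(exponent))
```

Chopping is used here because the lecture simply drops the bits that do not fit; a rounding scheme would be a different design choice.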

 

But again we want to make a small change: we want to say that this is multiplied by two to the power 011 base two. The reason we write 011 rather than 11 base two is that we have three bits available for the magnitude of the exponent. So let's repeat this. Eleven point eight base ten is now written as 1.0111 base two times two to the power 011 base two.

Now we have to assign it to the nine bits of this hypothetical floating point word, and we want to see how to go about doing that. Here are the nine bits. The first is for the sign of the number, the second is for the sign of the exponent, the next three are for the magnitude of the exponent, and the last four are for the magnitude of the mantissa. So we are going to start filling in these places with zeros and ones. The sign of the number is positive, so we put zero here. The sign of the exponent is positive, so we put zero there. The magnitude of the exponent is 011, and then the four bits for the magnitude of the mantissa are 0111. We don't store the leading one because it is already assumed: to have a non-zero digit before the radix point you need a non-zero digit, and the only non-zero digit in binary is one, so we don't need to represent it. It's there, but we don't need to represent it because it will always be one. So this is the representation of the number eleven point eight base ten in this hypothetical nine-bit floating point representation.
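Filling in the nine bits can be written as a small encoder. This is a minimal sketch under stated assumptions: the function name `encode_9bit` is hypothetical, the use of 1 for a negative sign is an assumption (the lecture only states that positive is stored as 0), and extra mantissa bits are chopped rather than rounded, as in the worked example.

```python
from fractions import Fraction

def encode_9bit(value):
    """Encode a number as: sign | exponent sign | 3 exponent bits | 4 mantissa bits."""
    x = Fraction(value)
    if x == 0:
        raise ValueError("zero has no normalized form in this sketch")

    sign_bit = "1" if x < 0 else "0"      # assumption: 1 marks a negative number
    x = abs(x)

    # Normalize to 1.xxxx times 2^exponent by moving the radix point.
    exponent = 0
    while x >= 2:
        x /= 2
        exponent += 1
    while x < 1:
        x *= 2
        exponent -= 1
    if abs(exponent) > 7:
        raise ValueError("exponent does not fit in three bits")

    exp_sign_bit = "1" if exponent < 0 else "0"  # assumption: 1 marks a negative exponent
    exp_bits = format(abs(exponent), "03b")      # pad to three bits, e.g. 3 -> 011

    # Drop the implied leading 1 and chop the fraction to four bits.
    mantissa_bits = format(int((x - 1) * 16), "04b")

    return sign_bit + exp_sign_bit + exp_bits + mantissa_bits

word = encode_9bit("11.8")
print(word)
```

Passing the value as a string lets `Fraction` read 11.8 exactly, so the bits produced are not affected by how the host machine itself stores 11.8.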

 

Now if somebody says, here is the representation, how would you write this number in base two? You say okay, first I write plus because the sign of the number is positive. Then I write one, then the radix point, then the four bits of the mantissa, 0111, base two, times two to the power, then the sign of the exponent, which is positive, so plus, and then the three bits of the magnitude of the exponent, 011 base two. So go and see what this is equivalent to in base ten, and you will see that it is not equal to eleven point eight in base ten. The difference between the stored number and the original number tells you the round-off error caused by using this hypothetical nine-bit word for our floating point representation. And that is the end of this segment.
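Reading the word back and measuring the round-off error can be sketched as follows. The function name `decode_9bit` is hypothetical, and treating a 1 in either sign position as negative is an assumption; the lecture only uses the positive case.

```python
def decode_9bit(word):
    """Decode a 9-bit string: sign | exponent sign | 3 exponent bits | 4 mantissa bits."""
    sign = -1 if word[0] == "1" else 1          # assumption: 1 means negative
    exp_sign = -1 if word[1] == "1" else 1      # assumption: 1 means negative exponent
    exponent = exp_sign * int(word[2:5], 2)     # three exponent-magnitude bits
    mantissa = 1 + int(word[5:9], 2) / 16       # implied leading 1 plus four fraction bits
    return sign * mantissa * 2.0 ** exponent

stored = decode_9bit("000110111")   # the word built for eleven point eight
round_off_error = 11.8 - stored     # true value minus the value actually stored

print(stored)
print(round_off_error)
```

Here the mantissa is 1.0111 base two, which is 1.4375, and multiplying by two to the power three gives 11.5, so the round-off error is about 0.3.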