donderdag 12 februari 2015

Objective.


The objective of this blog is to learn how to encode arbitrarily large integers using a minimal number of bits: minimizing storage and transmission resource requirements.
Please contribute where you can.


The concept that I propose.


An encoding is a sequence of one ore more fields.
The last field is the termination field, and it may be preceded by continuation fields.
The first field is one character long.
Next fields can have a greater field length.

A character is a sequence of bits.
All characters in an encoding have the same number of bits: character length Cl, with Cl>1.
A sequence of Cl bits can encode its binary interpretation of 2^Cl different values.

A field is a sequence of characters.
The first field, being 1 character long, can encode 2^Cl integer values.
Half of those, 2^(Cl-1), is reserved for the termination field value encoding of 0..2^(Cl-1)-1, and the other half will encode the length of the next field.

The encoding process will be illustrated by a counter that increments from 0 to unlimited with step size 1:
If the integer value to be encoded I reaches 2^(Cl-1), the first field becomes a continuation field and the encoder adds a field of 1 character with all bits valued at 0.
The most significant bit (first bit) of the first field is now 1 and the other bit(s) will be 0.
This tells the decoder that the integer value to be decoded is larger than or equal to 2^(Cl-1) and that the difference is encoded in at least one more field.

The next 2^Cl integer values can be encoded in the second character without changing the first field.
If I reaches 2^(Cl-1)+2^Cl, the second field becomes a sequence of 2 characters of 0 valued bits, and the first field changes its last bit to 1, incrementing its next field length indication from 1 to 2.

The next number of integers that can be encoded in the 2 character second field is either 2^(Cl*2) or 2^(Cl*2-1), depending on Cl being larger then or equal to 2.

If Cl=2, then the first field has reached its maximum length indication for the second field, and a third field is required to encode larger values of I:
0: 00
1: 01
2: 10, 00 {first field overflows from termination field to continuation field, implying 2 previous values}
3: 10, 01
4: 10, 10
5: 10, 11
6: 11, 00.00 {second field grows from 1 character to 2, implying 6=2+4 previous values}
7: 11, 00.01
8: 11, 00.10
9: 11, 00.11
10: 11, 01.00
11: 11, 01.01
12: 11, 01.10
13: 11, 01.11
14: 11, 10.00, 00 {second field from termination to continuation, implying 14=2+4+8 previous values}
15: 11, 10.00, 01
16: 11, 10.00, 10
17: 11, 10.00, 11
18: 11, 10.01, 00.00 {third field grows from 1 character to 2, implying 18=2+4+8+4 previous values}
etc.

If the character size is larger then 2, the field overflows occur at larger values of I.
The following python code shows the resulting bit streams for arbitrarily large character sizes larger then 1 bit:

class IETF_(): #integer encoding termination field
    def __init__(self, ChrLen=2):
        self.ChrLen=ChrLen #character size

    def enc(self, I): #encode integer value in self
        #start by encoding integer value 0:
        Fprev=None #closest continuation field linked to me
        MaxLen=1 #current (is initial) max number of characters
        Len=1 #current actual number of characters
        Val=0 #current nr of integer values being used
        #next encode larger integer values in steps of increasing dmax'es
        while I: #more to encode
            if Len<MaxLen: Dmax=2**(Len*self.ChrLen) #msb is 'free'
            else: Dmax=2**(Len*self.ChrLen-1) #msb indicates field type
            if I<Dmax: #it fits in my current shape
                Val=I
                I=0
            else: #i need to reshape myself
                if Fprev and Fprev.incr(): #i can grow
                    Len=Len+1
                else: #I can not grow, so i spawn a new continuation field,
                      #and reset myself
                    Fprev=IECF_(self.ChrLen, Fprev, MaxLen)
                        #new continuation field
                    Len=1 #number of characters
                    Val=0 #nr integer values being used
                    MaxLen=Dmax #new max to self.Len is current Dmax
                I=I-Dmax
        #done encoding, return encoding fields sequence image
        #in this illustration we use ascii characters 0 and 1 to represent bits
        MyPersonalImg=(Len*self.ChrLen*'0' +bin(Val)[2:])[-self.ChrLen*Len:]
        if Fprev: return Fprev.img()+ MyPersonalImg
        else: return MyPersonalImg

class IECF_(): #integer encoding continuation field
    def __init__(self, ChrLen, Fprev=None, Len=0):
        self.ChrLen=ChrLen #character size
        self.Fprev=Fprev #previous integer encoding (continuation) field
        self.Len=Len #my (fixed) nr of char's
        self.Val=0 #first length indication possible,
                   #next field has 1 character to start with
        if Len: self.MaxVal=2**(self.Len*self.ChrLen-1)-1
                #largest length indication possible
        else: self.MaxVal=0
             
    def incr(self): #increase length indication of next field if possible
        if self.Val<self.MaxVal: #next field may grow
            self.Val=self.Val+1 #i note its growths
            return True #grant request to grow
        else: return False #deny request to grow in current shape

    def img(self): #image of me and my previous fields
        MyPersonalImg=(self.Len*self.ChrLen*'0' \
                       +bin(self.Val+self.MaxVal+1)[2:]) \
                       [-self.ChrLen*self.Len:]
        if self.Fprev: return self.Fprev.img()+ MyPersonalImg
        else: return MyPersonalImg

Encoder=IETF_()
for I in range(20):
    print(I, Encoder.enc(I))

The decoder works like a counter that increments from 0 in steps larger then 1.
It interprets the encoded fields sequentially.

The first field is one character long, and can be a continuation field (first bit 1) or a termination field (first bit 0).

If it is a termination field the decoder is done, and yields the binary interpretation of a fixed size integer encoding.

As a continuation field it represents 2^(Cl-1) terminating value increases that are added to the counter.
The rest of its binary interpretation represents the length of the next field base 1.
If that rest is 0, the next field is 1 character long, if it is 1, next charlen 2, etc.

So the second field length is between 1 and MaxLen=2^(Cl-1) characters.
If the length of a field is smaller then its MaxLen, it is definitely a termination field.
With length MaxLen it can be a continuation field (first bit 1) or a termination field (first bit 0).

If it is a termination field, its aggregated value is calculated and added to the counter value established in the first field interpretation.
The aggregated value of a termination field is the sum of the individual binary interpretations of the field from its first 1 character shape to its current size.
The decoder is now done interpreting and yields the new counter value.

As a continuation field it represents the aggregated maximal terminating value and that is added to the counter.
The binary interpretation of the second and following bits in the field represents the length of the next field base 1.

The following python code illustrates the decoding process:

 class IDF_(): #integer decoding field
    def __init__(self, ChrLen=2):
        self.ChrLen=ChrLen #nr. of bits per character

    def dec(self, Sequ): #decode bit image sequence of fields
        Flen=1 #first field length is 1 character
        MaxFlen=1 #limit to growth
        Vi=0 #intermediate value (counter) of final integer
        #
        while Flen: #there is a next field in the encoding
            NrBits= Flen*self.ChrLen #number of bits in field
            Fimg=Sequ[:NrBits] #next field image
            Sequ=Sequ[NrBits:] #strip imf from sequence
            Bval=int(Fimg, base=2) #binary interpretation
            if Flen==MaxFlen and Bval>=2**(NrBits-1): #it's a continuation field
                FvA=0 #aggregated values of field in all shapes
                for L in range(1, MaxFlen): #all lengths from 1 upto MaxFlen
                    FvA=FvA+2**(L*self.ChrLen)
                FvA=FvA+2**(NrBits-1) #final shape adds half its binary max val
                Vi=Vi+FvA #add aggregated field value
                Flen=Bval-2**(NrBits-1)+1 #next field nr of chars
                MaxFlen=2**(NrBits-1) #next max len
            else:#it's the termination field
                FvA=0 #aggregated values of field in shapes
                      #from len=1 to current len
                for L in range(1, Flen): #all lengths from 1 upto current length
                    FvA=FvA+2**(L*self.ChrLen)
                Vi=Vi+FvA+Bval #add aggregated values plus binary field value
                Flen=0 #stop eating fields from Sequ
        return Vi, Sequ

Encoder=IETF_()
Decoder=IDF_()
for I in range(20):
    Enc=Encoder.enc(I)
    print(I, Enc, Decoder.dec(Enc))

Please contribute.

 
Once again: please contribute where you can.
How can we further reduce the number of bits needed in the encoding without any prior knowledge of I?



Geen opmerkingen:

Een reactie posten