Optimal arbitrary value integer encoding.

donderdag 9 april 2015

The 8 bit prototype.

For practical implementations of the algorithm discussed in the previous posting, the following python code may be used as a prototype.

class SBE8_(): #optimal stop bit encoding of arbitrary int values at char size 8
    def enc(self, I):
        L=1 #nr chars in encoding
        Vo=1<<7 #overfow value of one character
        while I>=Vo:
            I-=Vo
            L+=1
            Vo=Vo<<7 #overflow value of L characters
        Enc=bytearray()
        Sbit=128 #stop bit termination value
        Vo=1<<7
        while L:
            I,Vc=divmod(I,Vo)
            Enc.insert(0,Sbit+Vc)
            Sbit=0 #continue value of stop bit
            L-=1
        return Enc

    def dec(self, Enc):
        I=Enc.pop(0)
        Sbit=I&128
        I=I&127
        if Sbit: #done decoding
            return I
        L=0 #nr characters processed minus 1
        while not Sbit:
            Vc=Enc.pop(0)
            Sbit=Vc&128
            Vc=Vc&127
            I=(I<<7)+Vc #add character payload value
            L+=1
        while L:
            I+=(1<<L*7) #add character overflow value
            L-=1
        return I

Suggestions for a better algorithm and improved implementations are most welcome.

woensdag 8 april 2015

The new proposal

The new algorithm is a variation of the stop bit encodings that are currently being used.

Stop bit encoding reserves one bit position in every character to indicate whether the character being examined is the last one within the encoding (termination value) or is being followed by another character (continuation value).

The remaining bits are part of the actual integer encoding, the payload bits.

Any position within the character can be reserved for the stop bit.

The most significant bit is a popular choice that will be adopted in the proposal.

The (continuation, termination) values can either be (0,1) or (1,0).

In this proposal 0 is used as the continuation value and 1 as the termination indicator.

The stop bit encodings give the following results for character size 2:

0 : T0

1 : T1

2 : C1T0

3 : C1T1

4 : C1C0T0

5 : C1C0T1

6 : C1C1T0

7 : C1C1T1

8 : C1C0C0T0

9 : C1C0C0T1

etc.

The proposal is to change that into:

0 : T0

1 : T1

2 : C0T0

3 : C0T1

4 : C1T0

5 : C1T1

6 : C0C0T0

7 : C0C0T1

8 : C0C1T0

9 : C0C1T1

10 : C1C0T0

11 : C1C0T1

12 : C1C1T0

13 : C1C1T1

14 : C0C0C0T0

15 : C0C0C0T1

etc.

The following code illustrates the algorithm for various character sizes:

class SBE_(): #optimal stop bit encoding of arbitrary integer values
    def __init__(self, ChrLen=2, Cbit='0', Tbit='1'):
        self.Clp=ChrLen-1 #nr. payload bits per char
        self.Cbit=Cbit
        self.Tbit=Tbit

    def enc(self, I):
        L=1 #nr chars in encoding
        Vo=1<<(self.Clp) #overfow value of one character
        while I>=Vo:
            I=I-Vo
            L=L+1
            Vo=Vo<<self.Clp #overflow value of L characters
        Enc=''
        Sbit=self.Tbit
        Vo=1<<self.Clp
        while L:
            I,Vc=divmod(I,Vo)
            Enc=Sbit+(self.Clp*'0'+bin(Vc)[2:])[-self.Clp:]+Enc
            Sbit=self.Cbit
            L=L-1
        return Enc

    def dec(self, Enc):
        Enc,Sbit,I=self.rdchr(Enc)
        if Sbit==self.Tbit:
            return I
        L=1 #nr characters processed
        while Sbit==self.Cbit:
            Enc,Sbit,Vc=self.rdchr(Enc)
            I=(I<<self.Clp)+Vc
            L=L+1
        L=L-1
        while L:
            I=I+(1<<L*self.Clp) #add overflow value
            L=L-1
        return I

    def rdchr(self, Enc):
        Sbit=Enc[0]
        Cval=int('0b'+Enc[1:1+self.Clp],2)
        Enc=Enc[self.Clp+1:]
        return Enc, Sbit, Cval

In the next posting I will present a prototype that may be used in the development of an 8 bit character practical implementation.

Inputs from comp.lang.python.

I asked for help on comp.lang.python with the implementation of the 8 bit codec that I introduced in the previous posting.

The question triggered a very helpful discussion that made me change the proposed encoding algorithm.

Part of the discussion covered the question why I would want to use such an obscure algorithm.

There are many algorithms available, each with their own merits.

The answer is that I am looking for the algorithm that requires the minimal number of bits to encode arbitrary integer values.

None of the currently available algorithms does that.

Dave Angel showed me that the algorithm I proposed didn't either.

So I went back to the drawing board with the objective to find something better then the MIDI algorithm he used to show me.

This time I tested the new proposal more thoroughly myself.

It beats the MIDI encoding for every character size and it beats the the first proposal for character sizes larger then 2.

At character size 2 the two proposals do compete with each other, depending on the integer value.

For now I consider the new proposal to be my best option.

I will present it in the next posting and I hope that it may be improved with your help.

maandag 16 februari 2015

Character size.

The natural character size is 2 bits, and that will eventually become the standard.

Until then, the current standard of 8 bits will dominate, so that will be our choice in the first practical implementations of the codecs.

The following python code gives a reference model that can be used to verify more practical implementations:

class IETF8_(): #integer encoding termination field with character size 8
    def __init__(self):
        self.ChrLen=8 #character size

    def enc(self, I): #encode integer value in self
        #start by encoding integer value 0:
        Fprev=None #closest continuation field linked to me
        MaxLen=1 #max number of characters
        Len=1 #current (is initial) number of characters
        Val=0 #current nr of integer values being used
        #next encode larger integer values in steps of increasing dmax'es
        while I: #more to encode
            if Len<MaxLen: Dmax=2**(Len*self.ChrLen) #msb is 'free'
            else: Dmax=2**(Len*self.ChrLen-1) #msb indicates field type
            if I<Dmax: #it fits in my current shape
                Val=I
                I=0
            else: #i need to reshape myself
                if Fprev and Fprev.incr(): #i can grow
                    Len=Len+1
                else: #I can not grow, so i spawn a new continuation field
                      #and reset myself
                    Fprev=IECF8_(Fprev, MaxLen) #new continuation field
                    Len=1 #number of characters
                    Val=0 #nr integer values being used
                    MaxLen=Dmax #new max to self.Len is current Dmax
                I=I-Dmax
        #done encoding, return encoding fields sequence image
        #in this implementation we use a python bytearray object
        #to represent the bits
        IEtext=(Len*self.ChrLen*'0' +bin(Val)[2:])[-self.ChrLen*Len:]
        #ascii character bit encoding
        MyPersonalImg=bytearray() #compact encoding
        NrBytes=len(IEtext)//8
        for ByteNr in range(NrBytes):
            ByteImg=IEtext[ByteNr*8:ByteNr*8+8]
            MyPersonalImg.append(int(ByteImg,2))
        if Fprev: return Fprev.img()+ MyPersonalImg
        else: return MyPersonalImg

class IECF8_(): #integer encoding continuation field with character size 8
    def __init__(self, Fprev=None, Len=0):
        self.ChrLen=8 #character size
        self.Fprev=Fprev #previous integer encoding (continuation) field
        self.Len=Len #my (fixed) nr of char's
        self.Val=0 #first length indication possible,
                   #next field has 1 character to start with
        if Len: self.MaxVal=2**(self.Len*self.ChrLen-1)-1
                #largest length indication possible
        else: self.MaxVal=0

    def incr(self): #increase length indication of next field if possible
        if self.Val<self.MaxVal: #next field may grow
            self.Val=self.Val+1 #i note its growths
            return True #grant request to grow
        else: return False #deny request to grow in current shape

    def img(self): #image of me and my previous fields
        IEtext=(self.Len*self.ChrLen*'0' \
                +bin(self.Val+self.MaxVal+1)[2:])[-self.ChrLen*self.Len:]
        #ascii character bit encoding
        MyPersonalImg=bytearray() #compact encoding
        NrBytes=len(IEtext)//8
        for ByteNr in range(NrBytes):
            ByteImg=IEtext[ByteNr*8:ByteNr*8+8]
            MyPersonalImg.append(int(ByteImg,2))
        if self.Fprev: return self.Fprev.img()+ MyPersonalImg
        else: return MyPersonalImg

class IDF8_(): #integer decoding field with charactersize 8
    def dec(self, Sequ): #decode byte image sequence of fields
        Flen=1 #first field length is 1 character
        MaxFlen=1 #limit to growth
        Vi=0 #intermediate value (counter) of final integer
        #
        while Flen: #there is a next field in the encoding
            Border=2**(Flen*8-1) #2**(number of bits in field - 1)
            Fimg=Sequ[:Flen] #next field image
            Sequ=Sequ[Flen:] #strip image from sequence
            Bval=0 #intial value of binary field interpretation
            for ByteNr in range(Flen):
                Bval=Bval<<8 #previous field value shift
                ByteVal=Fimg[ByteNr] #current byte valuation
                Bval=Bval+ByteVal #final new field binary value interpretation
            if Flen==MaxFlen and Bval>=Border: #it's a continuation field
                FvA=0 #aggregated values of field in all shapes
                for L in range(1, MaxFlen): #all lengths from 1 upto MaxFlen
                    FvA=FvA+2**(L*8)
                FvA=FvA+Border #final shape adds half its binary max value
                Vi=Vi+FvA #add aggregated field value
                Flen=Bval-Border+1 #next field nr of chars
                MaxFlen=Border #next max len
            else:#it's the termination field
                FvA=0 #aggregated values of field in shapes ranging
                      #from len=1 to current len
                for L in range(1, Flen):
                    FvA=FvA+2**(L*8)
                Vi=Vi+FvA+Bval #add aggregated values plus binary field value
                Flen=0 #stop eating fields from Sequ
        return Vi, Sequ

Encoder=IETF8_()
Decoder=IDF8_()
for I in range(20):
    Enc=Encoder.enc(I*111)
    print(I*111, Enc, Decoder.dec(Enc))

Please contribute.

I invite the global programmers community to contribute practical implementations in an iterative cooperative development process that will eventually minimize processor time requirements of the codecs.

I am personally primarily interested in a python implementation, but contributions in different environments can obviously be just as helpful.

donderdag 12 februari 2015

Objective.

The objective of this blog is to learn how to encode arbitrarily large integers using a minimal number of bits: minimizing storage and transmission resource requirements.

Please contribute where you can.

The concept that I propose.

An encoding is a sequence of one ore more fields.

The last field is the termination field, and it may be preceded by continuation fields.

The first field is one character long.

Next fields can have a greater field length.

A character is a sequence of bits.

All characters in an encoding have the same number of bits: character length Cl, with Cl>1.

A sequence of Cl bits can encode its binary interpretation of 2^Cl different values.

A field is a sequence of characters.

The first field, being 1 character long, can encode 2^Cl integer values.

Half of those, 2^(Cl-1), is reserved for the termination field value encoding of 0..2^(Cl-1)-1, and the other half will encode the length of the next field.

The encoding process will be illustrated by a counter that increments from 0 to unlimited with step size 1:

If the integer value to be encoded I reaches 2^(Cl-1), the first field becomes a continuation field and the encoder adds a field of 1 character with all bits valued at 0.

The most significant bit (first bit) of the first field is now 1 and the other bit(s) will be 0.

This tells the decoder that the integer value to be decoded is larger than or equal to 2^(Cl-1) and that the difference is encoded in at least one more field.

The next 2^Cl integer values can be encoded in the second character without changing the first field.

If I reaches 2^(Cl-1)+2^Cl, the second field becomes a sequence of 2 characters of 0 valued bits, and the first field changes its last bit to 1, incrementing its next field length indication from 1 to 2.

The next number of integers that can be encoded in the 2 character second field is either 2^(Cl*2) or 2^(Cl*2-1), depending on Cl being larger then or equal to 2.

If Cl=2, then the first field has reached its maximum length indication for the second field, and a third field is required to encode larger values of I:

0: 00

1: 01

2: 10, 00 {first field overflows from termination field to continuation field, implying 2 previous values}

3: 10, 01

4: 10, 10

5: 10, 11

6: 11, 00.00 {second field grows from 1 character to 2, implying 6=2+4 previous values}

7: 11, 00.01

8: 11, 00.10

9: 11, 00.11

10: 11, 01.00

11: 11, 01.01

12: 11, 01.10

13: 11, 01.11

14: 11, 10.00, 00 {second field from termination to continuation, implying 14=2+4+8 previous values}

15: 11, 10.00, 01

16: 11, 10.00, 10

17: 11, 10.00, 11

18: 11, 10.01, 00.00 {third field grows from 1 character to 2, implying 18=2+4+8+4 previous values}

etc.

If the character size is larger then 2, the field overflows occur at larger values of I.

The following python code shows the resulting bit streams for arbitrarily large character sizes larger then 1 bit:

class IETF_(): #integer encoding termination field
    def __init__(self, ChrLen=2):
        self.ChrLen=ChrLen #character size

    def enc(self, I): #encode integer value in self
        #start by encoding integer value 0:
        Fprev=None #closest continuation field linked to me
        MaxLen=1 #current (is initial) max number of characters
        Len=1 #current actual number of characters
        Val=0 #current nr of integer values being used
        #next encode larger integer values in steps of increasing dmax'es
        while I: #more to encode
            if Len<MaxLen: Dmax=2**(Len*self.ChrLen) #msb is 'free'
            else: Dmax=2**(Len*self.ChrLen-1) #msb indicates field type
            if I<Dmax: #it fits in my current shape
                Val=I
                I=0
            else: #i need to reshape myself
                if Fprev and Fprev.incr(): #i can grow
                    Len=Len+1
                else: #I can not grow, so i spawn a new continuation field,
                      #and reset myself
                    Fprev=IECF_(self.ChrLen, Fprev, MaxLen)
                        #new continuation field
                    Len=1 #number of characters
                    Val=0 #nr integer values being used
                    MaxLen=Dmax #new max to self.Len is current Dmax
                I=I-Dmax
        #done encoding, return encoding fields sequence image
        #in this illustration we use ascii characters 0 and 1 to represent bits
        MyPersonalImg=(Len*self.ChrLen*'0' +bin(Val)[2:])[-self.ChrLen*Len:]
        if Fprev: return Fprev.img()+ MyPersonalImg
        else: return MyPersonalImg

class IECF_(): #integer encoding continuation field
    def __init__(self, ChrLen, Fprev=None, Len=0):
        self.ChrLen=ChrLen #character size
        self.Fprev=Fprev #previous integer encoding (continuation) field
        self.Len=Len #my (fixed) nr of char's
        self.Val=0 #first length indication possible,
                   #next field has 1 character to start with
        if Len: self.MaxVal=2**(self.Len*self.ChrLen-1)-1
                #largest length indication possible
        else: self.MaxVal=0

    def incr(self): #increase length indication of next field if possible
        if self.Val<self.MaxVal: #next field may grow
            self.Val=self.Val+1 #i note its growths
            return True #grant request to grow
        else: return False #deny request to grow in current shape

    def img(self): #image of me and my previous fields
        MyPersonalImg=(self.Len*self.ChrLen*'0' \
                       +bin(self.Val+self.MaxVal+1)[2:]) \
                       [-self.ChrLen*self.Len:]
        if self.Fprev: return self.Fprev.img()+ MyPersonalImg
        else: return MyPersonalImg

Encoder=IETF_()
for I in range(20):
    print(I, Encoder.enc(I))

The decoder works like a counter that increments from 0 in steps larger then 1.

It interprets the encoded fields sequentially.

The first field is one character long, and can be a continuation field (first bit 1) or a termination field (first bit 0).

If it is a termination field the decoder is done, and yields the binary interpretation of a fixed size integer encoding.

As a continuation field it represents 2^(Cl-1) terminating value increases that are added to the counter.

The rest of its binary interpretation represents the length of the next field base 1.

If that rest is 0, the next field is 1 character long, if it is 1, next charlen 2, etc.

So the second field length is between 1 and MaxLen=2^(Cl-1) characters.

If the length of a field is smaller then its MaxLen, it is definitely a termination field.

With length MaxLen it can be a continuation field (first bit 1) or a termination field (first bit 0).

If it is a termination field, its aggregated value is calculated and added to the counter value established in the first field interpretation.

The aggregated value of a termination field is the sum of the individual binary interpretations of the field from its first 1 character shape to its current size.

The decoder is now done interpreting and yields the new counter value.

As a continuation field it represents the aggregated maximal terminating value and that is added to the counter.

The binary interpretation of the second and following bits in the field represents the length of the next field base 1.

The following python code illustrates the decoding process:

class IDF_(): #integer decoding field
    def __init__(self, ChrLen=2):
        self.ChrLen=ChrLen #nr. of bits per character

    def dec(self, Sequ): #decode bit image sequence of fields
        Flen=1 #first field length is 1 character
        MaxFlen=1 #limit to growth
        Vi=0 #intermediate value (counter) of final integer
        #
        while Flen: #there is a next field in the encoding
            NrBits= Flen*self.ChrLen #number of bits in field
            Fimg=Sequ[:NrBits] #next field image
            Sequ=Sequ[NrBits:] #strip imf from sequence
            Bval=int(Fimg, base=2) #binary interpretation
            if Flen==MaxFlen and Bval>=2**(NrBits-1): #it's a continuation field
                FvA=0 #aggregated values of field in all shapes
                for L in range(1, MaxFlen): #all lengths from 1 upto MaxFlen
                    FvA=FvA+2**(L*self.ChrLen)
                FvA=FvA+2**(NrBits-1) #final shape adds half its binary max val
                Vi=Vi+FvA #add aggregated field value
                Flen=Bval-2**(NrBits-1)+1 #next field nr of chars
                MaxFlen=2**(NrBits-1) #next max len
            else:#it's the termination field
                FvA=0 #aggregated values of field in shapes
                      #from len=1 to current len
                for L in range(1, Flen): #all lengths from 1 upto current length
                    FvA=FvA+2**(L*self.ChrLen)
                Vi=Vi+FvA+Bval #add aggregated values plus binary field value
                Flen=0 #stop eating fields from Sequ
        return Vi, Sequ

Encoder=IETF_()
Decoder=IDF_()
for I in range(20):
    Enc=Encoder.enc(I)
    print(I, Enc, Decoder.dec(Enc))

Please contribute.

Once again: please contribute where you can.

How can we further reduce the number of bits needed in the encoding without any prior knowledge of I?