Objective.
The objective of this blog is to learn
how to encode arbitrarily large integers using a minimal number of
bits: minimizing storage and transmission resource requirements.
Please contribute where you can.
The concept that I propose.
An encoding is a sequence of one ore
more fields.
The last field is the termination
field, and it may be preceded by continuation fields.
The first field is one character long.
Next fields can have a greater field
length.
A character is a sequence of bits.
All characters in an encoding have the
same number of bits: character length Cl, with Cl>1.
A sequence of Cl bits can encode its
binary interpretation of 2^Cl different values.
A field is a sequence of characters.
The first field, being 1 character
long, can encode 2^Cl integer values.
Half of those, 2^(Cl-1), is reserved
for the termination field value encoding of 0..2^(Cl-1)-1, and the
other half will encode the length of the next field.
The encoding process will be
illustrated by a counter that increments from 0 to unlimited with
step size 1:
If the integer value to be encoded I
reaches 2^(Cl-1), the first field becomes a continuation field and
the encoder adds a field of 1 character with all bits valued at 0.
The most significant bit (first bit) of
the first field is now 1 and the other bit(s) will be 0.
This tells the decoder that the integer
value to be decoded is larger than or equal to 2^(Cl-1) and that the
difference is encoded in at least one more field.
The next 2^Cl integer values can be
encoded in the second character without changing the first field.
If I reaches 2^(Cl-1)+2^Cl, the second
field becomes a sequence of 2 characters of 0 valued bits, and the
first field changes its last bit to 1, incrementing its next field
length indication from 1 to 2.
The next number of integers that can be
encoded in the 2 character second field is either 2^(Cl*2) or
2^(Cl*2-1), depending on Cl being larger then or equal to 2.
If Cl=2, then the first field has
reached its maximum length indication for the second field, and a
third field is required to encode larger values of I:
0: 00
1: 01
2: 10, 00 {first field overflows from
termination field to continuation field, implying 2 previous values}
3: 10, 01
4: 10, 10
5: 10, 11
6: 11, 00.00 {second field grows from
1 character to 2, implying 6=2+4 previous values}
7: 11, 00.01
8: 11, 00.10
9: 11, 00.11
10: 11, 01.00
11: 11, 01.01
12: 11, 01.10
13: 11, 01.11
14: 11, 10.00, 00 {second field from
termination to continuation, implying 14=2+4+8 previous values}
15: 11, 10.00, 01
16: 11, 10.00, 10
17: 11, 10.00, 11
18: 11, 10.01, 00.00 {third field grows
from 1 character to 2, implying 18=2+4+8+4 previous values}
etc.
If the character size is larger then 2,
the field overflows occur at larger values of I.
The following python code shows the
resulting bit streams for arbitrarily large character sizes larger
then 1 bit:
class IETF_(): #integer encoding termination field
def __init__(self, ChrLen=2):
self.ChrLen=ChrLen #character size
def enc(self, I): #encode integer value in self
#start by encoding integer value 0:
Fprev=None #closest continuation field linked to me
MaxLen=1 #current (is initial) max number of characters
Len=1 #current actual number of characters
Val=0 #current nr of integer values being used
#next encode larger integer values in steps of increasing dmax'es
while I: #more to encode
if Len<MaxLen: Dmax=2**(Len*self.ChrLen) #msb is 'free'
else: Dmax=2**(Len*self.ChrLen-1) #msb indicates field type
if I<Dmax: #it fits in my current shape
Val=I
I=0
else: #i need to reshape myself
if Fprev and Fprev.incr(): #i can grow
Len=Len+1
else: #I can not grow, so i spawn a new continuation field,
#and reset myself
Fprev=IECF_(self.ChrLen, Fprev, MaxLen)
#new continuation field
Len=1 #number of characters
Val=0 #nr integer values being used
MaxLen=Dmax #new max to self.Len is current Dmax
I=I-Dmax
#done encoding, return encoding fields sequence image
#in this illustration we use ascii characters 0 and 1 to represent bits
MyPersonalImg=(Len*self.ChrLen*'0' +bin(Val)[2:])[-self.ChrLen*Len:]
if Fprev: return Fprev.img()+ MyPersonalImg
else: return MyPersonalImg
class IECF_(): #integer encoding continuation field
def __init__(self, ChrLen, Fprev=None, Len=0):
self.ChrLen=ChrLen #character size
self.Fprev=Fprev #previous integer encoding (continuation) field
self.Len=Len #my (fixed) nr of char's
self.Val=0 #first length indication possible,
#next field has 1 character to start with
if Len: self.MaxVal=2**(self.Len*self.ChrLen-1)-1
#largest length indication possible
else: self.MaxVal=0
def incr(self): #increase length indication of next field if possible
if self.Val<self.MaxVal: #next field may grow
self.Val=self.Val+1 #i note its growths
return True #grant request to grow
else: return False #deny request to grow in current shape
def img(self): #image of me and my previous fields
MyPersonalImg=(self.Len*self.ChrLen*'0' \
+bin(self.Val+self.MaxVal+1)[2:]) \
[-self.ChrLen*self.Len:]
if self.Fprev: return self.Fprev.img()+ MyPersonalImg
else: return MyPersonalImg
Encoder=IETF_()
for I in range(20):
print(I, Encoder.enc(I))
The decoder works like a counter that
increments from 0 in steps larger then 1.
It interprets the encoded fields
sequentially.
The first field is one character long,
and can be a continuation field (first bit 1) or a termination field
(first bit 0).
If it is a termination field the
decoder is done, and yields the binary interpretation of a fixed size
integer encoding.
As a continuation field it represents
2^(Cl-1) terminating value increases that are added to the counter.
The rest of its binary interpretation
represents the length of the next field base 1.
If that rest is 0, the next field is 1
character long, if it is 1, next charlen 2, etc.
So the second field length is between 1
and MaxLen=2^(Cl-1) characters.
If the length of a field is smaller
then its MaxLen, it is definitely a termination field.
With length MaxLen it can be a
continuation field (first bit 1) or a termination field (first bit
0).
If it is a termination field, its
aggregated value is calculated and added to the counter value
established in the first field interpretation.
The aggregated value of a termination
field is the sum of the individual binary interpretations of the
field from its first 1 character shape to its current size.
The decoder is now done interpreting
and yields the new counter value.
As a continuation field it represents
the aggregated maximal terminating value and that is added to the
counter.
The binary interpretation of the second
and following bits in the field represents the length of the next
field base 1.
The following python code illustrates
the decoding process:
class IDF_(): #integer decoding field
def __init__(self, ChrLen=2):
self.ChrLen=ChrLen #nr. of bits per character
def dec(self, Sequ): #decode bit image sequence of fields
Flen=1 #first field length is 1 character
MaxFlen=1 #limit to growth
Vi=0 #intermediate value (counter) of final integer
#
while Flen: #there is a next field in the encoding
NrBits= Flen*self.ChrLen #number of bits in field
Fimg=Sequ[:NrBits] #next field image
Sequ=Sequ[NrBits:] #strip imf from sequence
Bval=int(Fimg, base=2) #binary interpretation
if Flen==MaxFlen and Bval>=2**(NrBits-1): #it's a continuation field
FvA=0 #aggregated values of field in all shapes
for L in range(1, MaxFlen): #all lengths from 1 upto MaxFlen
FvA=FvA+2**(L*self.ChrLen)
FvA=FvA+2**(NrBits-1) #final shape adds half its binary max val
Vi=Vi+FvA #add aggregated field value
Flen=Bval-2**(NrBits-1)+1 #next field nr of chars
MaxFlen=2**(NrBits-1) #next max len
else:#it's the termination field
FvA=0 #aggregated values of field in shapes
#from len=1 to current len
for L in range(1, Flen): #all lengths from 1 upto current length
FvA=FvA+2**(L*self.ChrLen)
Vi=Vi+FvA+Bval #add aggregated values plus binary field value
Flen=0 #stop eating fields from Sequ
return Vi, Sequ
Encoder=IETF_()
Decoder=IDF_()
for I in range(20):
Enc=Encoder.enc(I)
print(I, Enc, Decoder.dec(Enc))
Please contribute.
Once again: please contribute where you
can.
How can we further reduce the number of
bits needed in the encoding without any prior knowledge of I?