Integer Compression
Jump to navigation
Jump to search
Traditional Database Management Systems use fixed width integers of 32 or 64 bit that are administered by the database operator.
MemCP however uses a different approach: according to a data analysis pass, we detect the maximum bit width of an integer to be encoded into our database.
This is done by:
- finding the smallest integer in a shard
- finding the highest integer
offset = smallest
bitwidth = ⌈ld (highest - smallest)⌉
- allocate
⌈(num_items * bitwidth) / 64⌉
64-bit integers - the values are stored at the position
item_id * bitwidth
inside that memory blob
The analysis and compression is done in the ColumnStorage interface.
Integer compression can lead to extensive memory savings:
- Binary values use up 1 bit instead of 8 bits or 32 bits per value
- through the offset handling, unix timestamps can be encoded in 23 bits instead of 32 bits
- IDs and other foreign keys can be encoded in
⌈ld highest_ID⌉
bits per item
Besides the memory savings, this compression can lead to massive speedups due in scenarios where memory bandwith or latency is a bottleneck.