Integer Compression

Traditional Database Management Systems use fixed-width integers of 32 or 64 bits, with the width chosen and administered by the database operator.

MemCP, however, takes a different approach: a data analysis pass determines the smallest bit width that can hold every integer to be encoded into the database.

This is done by (a code sketch follows the list):

  • finding the smallest integer in a shard
  • finding the highest integer
  • offset = smallest
  • bitwidth = ⌈ld(highest - smallest + 1)⌉
  • allocating ⌈(num_items * bitwidth) / 64⌉ 64-bit integers
  • storing each value at bit position item_id * bitwidth inside that memory blob
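As a minimal sketch, the following Go program illustrates these steps on a single shard; the function names (packInts, getInt) and the exact packing layout are illustrative assumptions and not taken from the MemCP source:

 // Sketch of offset + bit-width analysis and 64-bit-word bit packing.
 package main
 
 import (
 	"fmt"
 	"math/bits"
 )
 
 // bitWidth returns the number of bits needed to store values in [0, delta].
 func bitWidth(delta uint64) uint {
 	if delta == 0 {
 		return 1 // degenerate case: all values equal, still reserve one bit
 	}
 	return uint(bits.Len64(delta)) // = ⌈ld(delta + 1)⌉
 }
 
 // packInts analyses a shard (assumed non-empty) and packs it into 64-bit words.
 func packInts(values []int64) (offset int64, width uint, blob []uint64) {
 	smallest, highest := values[0], values[0]
 	for _, v := range values {
 		if v < smallest {
 			smallest = v
 		}
 		if v > highest {
 			highest = v
 		}
 	}
 	offset = smallest
 	width = bitWidth(uint64(highest - smallest))
 	blob = make([]uint64, (uint(len(values))*width+63)/64)
 	for i, v := range values {
 		u := uint64(v - offset)
 		pos := uint(i) * width // bit position item_id * bitwidth
 		blob[pos/64] |= u << (pos % 64)
 		if pos%64+width > 64 { // value spans two 64-bit words
 			blob[pos/64+1] |= u >> (64 - pos%64)
 		}
 	}
 	return
 }
 
 // getInt reads the i-th value back out of the packed blob.
 func getInt(blob []uint64, offset int64, width uint, i int) int64 {
 	pos := uint(i) * width
 	u := blob[pos/64] >> (pos % 64)
 	if pos%64+width > 64 {
 		u |= blob[pos/64+1] << (64 - pos%64)
 	}
 	mask := uint64(1)<<width - 1
 	return int64(u&mask) + offset
 }
 
 func main() {
 	vals := []int64{1715900000, 1715900060, 1715900123} // e.g. unix timestamps
 	off, w, blob := packInts(vals)
 	fmt.Println("offset:", off, "bitwidth:", w, "words:", len(blob))
 	fmt.Println("value 2:", getInt(blob, off, w, 2))
 }

With the three example timestamps above, the delta between highest and smallest is 123, so each value needs only 7 bits instead of 64.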

The analysis and compression are done behind the ColumnStorage interface.
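As a hypothetical outline (the method names below are assumptions and may differ from the actual ColumnStorage definition in the MemCP repository), such an interface could look like this in Go:

 package storage
 
 // ColumnStorage abstracts a compressed column shard. The concrete
 // representation chosen after the analysis pass (e.g. bit-packed integers)
 // is hidden behind this interface.
 type ColumnStorage interface {
 	// GetValue returns the decoded value at the given row index.
 	GetValue(index uint) int64
 	// Prepare scans the shard (min/max analysis) before building.
 	Prepare(values []int64)
 	// Build packs the analysed values into the compressed representation.
 	Build(values []int64)
 }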

Integer compression can lead to extensive memory savings:

  • Binary values use 1 bit instead of 8 or 32 bits per value
  • Through the offset handling, unix timestamps can be encoded in 23 bits instead of 32 bits (see the example below)
  • IDs and other foreign keys can be encoded in ⌈ld(highest_ID + 1)⌉ bits per item
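
For example, assuming one million timestamps that all fall within a window of 2^23 seconds (about 97 days), offset handling brings them down to 10^6 × 23 bits ≈ 2.9 MB, compared to 4 MB as 32-bit values or 8 MB as 64-bit values.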

Besides the memory savings, this compression can lead to massive speedups in scenarios where memory bandwidth or latency is a bottleneck.

See also

https://github.com/lemire/FastPFor