Integer Compression

From MemCP
Jump to navigation Jump to search

Traditional Database Management Systems use fixed width integers of 32 or 64 bit that are administered by the database operator.

MemCP however uses a different approach: according to a data analysis pass, we detect the maximum bit width of an integer to be encoded into our database.

This is done by:

  • finding the smallest integer in a shard
  • finding the highest integer
  • offset = smallest
  • bitwidth = ⌈ld (highest - smallest)⌉
  • allocate ⌈(num_items * bitwidth) / 64⌉ 64-bit integers
  • the values are stored at the position item_id * bitwidth inside that memory blob

The analysis and compression is done in the ColumnStorage interface.

Integer compression can lead to extensive memory savings:

  • Binary values use up 1 bit instead of 8 bits or 32 bits per value
  • through the offset handling, unix timestamps can be encoded in 23 bits instead of 32 bits
  • IDs and other foreign keys can be encoded in ⌈ld highest_ID⌉ bits per item

Besides the memory savings, this compression can lead to massive speedups due in scenarios where memory bandwith or latency is a bottleneck.

How NULL values are encoded

Usually, databases store NULL values in form of bitmasks. In this case, each value eats up 1 bit for the possibility to become NULL. I will prove that we can do better.

Uncompressed column stores of integers are a continuous column of i.e. 64 bit integers. The problem with 64 bit integers is that every possible value has a meaning. You cannot just take one value (i.e. MAXINT-1) and declare it to be NULL. Malicious users could smuggle that value in and turn a value NULL potentilly breaking the software, in worst case introducing a security risk. So the only chance to decode NULL in 64 bit integers is to introduce 65 bits.

In bit-compressed integer storages however, we exactly know the value range of our stored values. Whenever we know what the highest value is that will be stored into our storage, we can declare the next highest number to be NULL. This is how we do the detection:

diff --git a/storage/storage-int.go b/storage/storage-int.go
index 7a9dff0..4fed6d3 100644
--- a/storage/storage-int.go
+++ b/storage/storage-int.go
@@ -24,6 +24,8 @@ type StorageInt struct {
        chunk []uint64
        bitsize uint8
        hasNegative bool
+       hasNull bool
+       null uint64 // which value is null
 }
@@ -82,7 +92,15 @@
 func (s *StorageInt) scan(i uint, value scm.Scmer) {
        // storage is so simple, dont need scan
+       if value == nil {
+               s.hasNull = true
+               return
+       }
        v := toInt(value)
+       if v >= int64(s.null) {
+               // mark 1+highest value as null
+               s.null = uint64(v) + 1
+       }
        if v < 0 {
                s.hasNegative = true
                v = -v
@@ -93,20 +111,32 @@
 func (s *StorageInt) init(i uint) {
-       // allocate
+       if s.hasNull {
+               // need an extra bit because of null??
+               l := uint8(bits.Len64(uint64(s.null)))
+               if l > s.bitsize {
+                       s.bitsize = l
+               }
+       }
        if s.hasNegative {
                s.bitsize = s.bitsize + 1
        }

The last patch is to ensure that if we have e.g. booleans with 0 and 1 (1 bit), having NULL will expand the storage to 2 bits. Otherwise, we would have an encodign problem.

The read operation on the storage is fairly easy:

 func (s *StorageInt) getValue(i uint) scm.Scmer {
-       if (s.hasNegative) {
-               return scm.Number(s.getValueInt(i))
-       } else {
-               return scm.Number(s.getValueUInt(i))
+       if (s.hasNegative) { // with sign expansion
+               v := s.getValueInt(i)
+               if s.hasNull && uint64(v) == s.null {
+                       return nil
+               }
+               return scm.Number(v)
+       } else { // without sign expansion
+               v := s.getValueUInt(i)
+               if s.hasNull && v == s.null {
+                       return nil
+               }
+               return scm.Number(v)
        }
 }

So we found a way to encode NULL inside a bit compressed integer storage as a single value.

I measured the RAM usage of our test workload. Before the NULL implementation, some columns of our biggest table had to use StorageSCMER because of the NULL values. This implementation took 23 MB of RAM.

With the new NULL implementation, these columns have been able to be encoded in StorageInt which made the workload occupy only 16 MiB of RAM. That’s a 40% savings in overall memory!

See also