Sequence Compression

From MemCP
Jump to navigation Jump to search

One of the most interesting compression techniques on columnar storages is sequence compression. Sequence Compression in In-Memory Database yields 99% memory savings and a total of 13%

A sequence is a column of numbers where each distance between two neighbouring numbers is equal. Example:

  • 1 2 3 4 5 6 7 8 9
  • 3 3 3 3 3 3 3 3 3 3
  • 10 20 30 40
  • 8 7 6 5

A sequence can be described by its starting point, its stride as well as its length:

  • (1,1,9)
  • (3,0,10)
  • (10,10,4)
  • (8,-1,4)

The longer the sequence, the higher the compression ratio. For a typical SQL engine workload with AUTO_INCREMENT IDs, this means that you can store 150,000 IDs by just using three numbers.

Whenever the sequence is interrupted, a new sequence has to be described. This means the column

1 2 3 4 6 7 8

will be encoded as

(1,1,4)(6,1,3)

which is nearly the same storage-wise as the uncompressed column.

So the compression ratio is as bigger as the the column is regular. Since we use bit compression for the stride, we can use a heuristic that less than 30% of the values are allowed to trigger a sequence-restart.

Tweaking the format for fast random access

To efficiently read a sequence compressed column, we have to alter the format a little bit. Instead of recording (start, stride, count), we will record (recordId, start, stride).

The value count is automatically derived from the difference between its successor recordId. But as you see, we won’t need it anyways. recordId is a unsigned integer counting seemlessly from 0 to array_size-1. Whenever we want to find a specific value, we can use the following algorithm:

  • Given i is the recordId we want to read out
  • Do a binary search through the array of sequences such that we get the lowest x with sequence[x].recordId <= i
  • Calculate the result by sequence[x].start + (i - sequence[x].recordId) * sequence[x].stride

Evaluation

In our example with OppelBI data (mostly online shop statistics), we were able to compress some integer storages by 99% which yields an overall saving of 13% of storage (our 55MB MySQL data is compressed to 15MB rather than 17MB)

For the TPC-H benchmark, we achieved a saving of 7MB for the 1.1GB dataset (scale factor 1) which is only 0,8%. We attribute this to the randomness of TPC-H’s data. Real world ERP or shop data is much more regular and less string-heavvy than TPC-H’s benchmark data.

Optimized for AI workloads

Especially AI workloads like this table:

AIInferenceMatrix(matrixId integer, column integer, row integer, value double)

can be sequence-compressed very efficiently if you keep the row and column values sorted. This also yields a matrix’s value column memory layout that exactly matches the internal representation of the matrix that can be directly passed to TensorFlow or similar AI libraries.

This implies that AI workloads no longer have to be stored as stupid BLOBs but rather can get a full relational storage which means easier access for partial data.

Conclusion

Sequence Compression is a powerful technique to further compress data in RAM. It reduces cache misses and thus fastens up random access to „boring“ data.