In-Memory Compression, Columnar Compression Techniques


When arranging data from relational tables in a columnar fashion, a lot of similar data is placed directly next to each other. This opens up a lot of optimization potential with regard to compression (reducing the size) and processing speed (saving time because fewer memory chunks have to be loaded).

The following columnar compression techniques are implemented in MemCP:

  • Integer Compression with a bit-size approach (see the sketch after this list)
  • Sequence Compression for sequences of integers like IDs (1, 2, 3, 4, 5 ...)
  • Dictionary Compression for short strings
  • BLOB Zipping and Deduplication for strings longer than 1KiB
  • Float Storage for scientific data
  • Sparse Storage for columns with lots of NULL values
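
To make the bit-size approach concrete, here is a minimal, self-contained Go sketch of bit-packing an integer column: the column is scanned for its largest value, and every value is then stored with just enough bits to represent that maximum. This is only an illustration of the idea; the type and function names are invented for this example and do not reflect MemCP's actual storage code.

package main

import (
    "fmt"
    "math/bits"
)

// BitPackedColumn stores unsigned integers with a fixed number of bits per value.
// Illustrative sketch of bit-size compression, not MemCP's real storage code.
type BitPackedColumn struct {
    bitsPerValue uint
    words        []uint64
    length       int
}

// Pack analyses the value range and packs every value with just enough bits.
func Pack(values []uint64) *BitPackedColumn {
    var max uint64
    for _, v := range values {
        if v > max {
            max = v
        }
    }
    width := uint(bits.Len64(max)) // bits needed for the largest value
    if width == 0 {
        width = 1
    }
    col := &BitPackedColumn{bitsPerValue: width, length: len(values)}
    col.words = make([]uint64, (uint(len(values))*width+63)/64)
    for i, v := range values {
        bitPos := uint(i) * width
        col.words[bitPos/64] |= v << (bitPos % 64)
        if bitPos%64+width > 64 { // value spills into the next word
            col.words[bitPos/64+1] |= v >> (64 - bitPos%64)
        }
    }
    return col
}

// Get decodes the i-th value again.
func (c *BitPackedColumn) Get(i int) uint64 {
    bitPos := uint(i) * c.bitsPerValue
    v := c.words[bitPos/64] >> (bitPos % 64)
    if bitPos%64+c.bitsPerValue > 64 {
        v |= c.words[bitPos/64+1] << (64 - bitPos%64)
    }
    return v & (1<<c.bitsPerValue - 1)
}

func main() {
    col := Pack([]uint64{3, 7, 1, 6, 2}) // largest value is 7, so 3 bits per value suffice
    fmt.Println(col.bitsPerValue, col.Get(0), col.Get(3))
}

Instead of 8 bytes per integer, this example needs only 3 bits per value plus a small header, which is where the memory savings of the bit-size approach come from.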


Also, MemCP is able to compress indexes.

How In-Memory Compression Affects Performance

Modern computers are fast, so fast that in memory-heavy applications there is a huge gap between computing speed on the one hand and memory bandwidth and latency on the other.

It’s All About Cache

Cache levels in modern computers refer to the small, fast memory tiers that sit between the CPU and main memory (RAM). A typical CPU has three levels of cache: L1, L2, and L3. The L1 cache is the fastest but also the smallest, and it holds the most recently accessed data. The L2 and L3 caches are larger and slower than L1, but they can still be accessed much more quickly than main memory.

Data locality is an important concept in programming: it means keeping frequently accessed data in the fastest cache level available. Because the caches closest to the CPU are so much faster than main memory, data that stays in cache can be read far more quickly, which lets programs run faster and more efficiently.

Programmers therefore have to consider data locality when writing performance-critical code. By laying data out so that it makes good use of the cache levels in modern computers, they can ensure that their programs run as quickly and efficiently as possible.
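
To make the effect of data locality tangible, here is a small, self-contained Go program (independent of MemCP) that sums the same 128 MiB array twice: once sequentially and once with a large stride, so that almost every strided access touches a different cache line and page. On typical hardware the sequential pass is several times faster, even though both loops perform exactly the same number of additions.

package main

import (
    "fmt"
    "time"
)

func main() {
    const n = 1 << 24 // 16M int64 values, about 128 MiB, far larger than any L3 cache
    data := make([]int64, n)
    for i := range data {
        data[i] = int64(i)
    }

    // Sequential access: consecutive values share cache lines, hardware prefetching works well.
    start := time.Now()
    var sum1 int64
    for i := 0; i < n; i++ {
        sum1 += data[i]
    }
    fmt.Println("sequential:", time.Since(start), sum1)

    // Strided access: each read lands on a different cache line (and page), so most reads miss the cache.
    start = time.Now()
    var sum2 int64
    const stride = 4096 / 8 // jump one page worth of int64 values at a time
    for off := 0; off < stride; off++ {
        for i := off; i < n; i += stride {
            sum2 += data[i]
        }
    }
    fmt.Println("strided:   ", time.Since(start), sum2)
}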

Our Test Setup

To test how effective table scans are with our in-memory database, we use the following test setup:

  • 10k records are loaded into our test table
  • This takes about 3 MiB of RAM
  • We do not use indexes; we do a full table scan to find the data
  • We scan for a single value and print it out
  • The CPU and RAM are the following:
carli@launix-MS-7C51:~/projekte/memcp/server-node-golang$ cat /proc/cpuinfo
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 113
model name	: AMD Ryzen 9 3900X 12-Core Processor
stepping	: 0
microcode	: 0x8701013
cpu MHz		: 3791.116
cache size	: 512 KB
physical id	: 0
siblings	: 24
core id		: 14
cpu cores	: 12
apicid		: 29
initial apicid	: 29
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
bugs		: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 7600.05
TLB size	: 3072 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
carli@launix-MS-7C51:~/projekte/memcp/server-node-golang$ cat /proc/meminfo 
MemTotal:       32819940 kB
Hugepagesize:       2048 kB
Hugetlb:               0 kB

In our test setup, we first loaded the table into delta storage and did some measurements. Then we compressed the delta storage into main storage and repeated the measurements. This is what we came up with:

Performance in Delta Storage (row-based)

> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "3.928691ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "3.920911ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "4.212188ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "5.900366ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "7.190586ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "7.197987ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "7.210748ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "7.164205ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "7.161525ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "7.324479ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "7.3429ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "7.33897ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "7.311109ms"

Performance in Main Storage (compressed)

> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "3.89169ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "3.899201ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "3.876729ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "5.599179ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "6.890489ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "6.774777ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "6.89611ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "6.770937ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "6.783257ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "6.786928ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "6.798617ms"
> (scan "PLZ" (lambda (Ort) (equal? Ort "Neugersdorf")) (lambda (PLZ Ort) (print PLZ " - " Ort)))
02727 - Neugersdorf
==> "6.780997ms"

Interpretation

The first queries were noticeably faster than later ones. We attribute this to the garbage collector: at the beginning, most of the RAM is free, so no time is spent on cleanup. The more we repeat our measurements, the more cleanup work accumulates, until the timings stabilize at a certain level.

As you can see, we get an overall 7% performance boost for compressed columns, in addition to the roughly 10-fold memory savings described in another article.
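
For reference, averaging the timings listed above gives roughly 6.4 ms for the delta (row-based) scans and roughly 6.0 ms for the compressed main-storage scans, i.e. an improvement of about 6 to 7 percent, which matches the stated figure.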

When the CPU has to traverse less RAM to complete the task, this benefits us, even if it has to do a bit of extra work to decompress the data.